Practical Machine Learning: From Raw CSV to Predictions in Python

Introduction

This guide walks through the complete process of training a machine learning model on your own dataset, from loading and cleaning the data to saving the model and generating predictions. You’ll learn how to:

  • Properly preprocess real-world CSV data
  • Train and evaluate a robust model
  • Save and reuse models for predictions

Prerequisites

  • Python 3.8+
  • Essential libraries: pip install pandas scikit-learn matplotlib
  • Any CSV dataset (e.g., sales records, housing prices, customer data)

Step 1: Loading Your Dataset

import pandas as pd

# Load data with proper error handling
try:
    df = pd.read_csv('your_data.csv', encoding='utf-8')
    print("Data loaded successfully. First 5 rows:")
    print(df.head())
except FileNotFoundError:
    print("Error: File not found. Check your file path.")
    raise  # stop here: the later steps need df to exist
except Exception as e:
    print(f"An error occurred: {e}")
    raise

Key considerations:

  • Always verify file paths and encoding
  • Inspect data structure with df.info()
  • Mark custom missing-value tokens (for example '?' or 'unknown') as missing at load time with the na_values parameter of pd.read_csv
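
The considerations above can be sketched roughly as follows. This is a minimal illustration, not a definitive recipe: the in-memory CSV stands in for your_data.csv, and the column names and the '?' and 'unknown' tokens are made up for the example.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
raw_csv = io.StringIO(
    "id,price,rooms,city\n"
    "1,250000,3,Austin\n"
    "2,unknown,2,Denver\n"
    "3,310000,?,Austin\n"
)

# Treat the custom tokens '?' and 'unknown' as missing values while loading.
df_demo = pd.read_csv(raw_csv, na_values=["?", "unknown"])

# info() prints column names, non-null counts, and dtypes.
df_demo.info()

# Count missing values per column.
missing_per_column = df_demo.isna().sum()
print(missing_per_column)
```

Without na_values, the 'price' and 'rooms' columns would be read as text because of the stray tokens, which is the kind of problem df.info() makes visible.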

Step 2: Data Cleaning and Preparation

# Remove irrelevant columns
df = df.drop(columns=['id', 'useless_column'])

# Handle missing data
df = df.dropna()  # or use df.fillna() for imputation

# Convert categorical data
df = pd.get_dummies(df, columns=['category_column'])

print(f"Cleaned data shape: {df.shape}")
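
If dropping rows would discard too much data, the df.fillna() alternative mentioned in the comment could look something like this. It is a sketch under assumptions: the column names are hypothetical, and median and most-frequent-value imputation are just two common choices, not the only ones.

```python
import pandas as pd

# Hypothetical data with gaps in one numeric and one categorical column.
df_demo = pd.DataFrame({
    "income": [52000.0, None, 61000.0, 48000.0],
    "segment": ["retail", "online", None, "retail"],
})

# Numeric column: fill with the median, which is less sensitive to outliers than the mean.
income_median = df_demo["income"].median()
df_demo["income"] = df_demo["income"].fillna(income_median)

# Categorical column: fill with the most frequent value.
segment_mode = df_demo["segment"].mode()[0]
df_demo["segment"] = df_demo["segment"].fillna(segment_mode)

print(df_demo)
```

In a real project it is usually safer to compute the fill values on the training split only and reuse them for the test set and for new data, so that no information from held-out rows leaks into training.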

Data quality checks:

  1. Verify no null values remain
  2. Ensure numeric columns are properly typed
  3. Check for outliers using df.describe()
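
The three checks above might be scripted along these lines. The data frame and its column names are hypothetical, and the 1.5 × IQR rule is one common outlier heuristic added here for illustration; it is not part of the original checklist.

```python
import pandas as pd

# Hypothetical cleaned data; 'price' contains one suspicious value.
df_demo = pd.DataFrame({
    "price": [200.0, 210.0, 195.0, 205.0, 5000.0],
    "rooms": [3, 2, 3, 4, 3],
})

# 1. Verify no null values remain.
total_nulls = int(df_demo.isna().sum().sum())
print(f"Remaining nulls: {total_nulls}")

# 2. Ensure numeric columns are properly typed.
non_numeric = df_demo.select_dtypes(exclude="number").columns.tolist()
print(f"Non-numeric columns: {non_numeric}")

# 3. Check for outliers: summary statistics, then a simple 1.5 * IQR rule.
print(df_demo.describe())
q1 = df_demo["price"].quantile(0.25)
q3 = df_demo["price"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df_demo["price"] < q1 - 1.5 * iqr) | (df_demo["price"] > q3 + 1.5 * iqr)
outliers = df_demo[is_outlier]
print(outliers)
```

Whether a flagged row is an error or a genuine extreme value is a judgment call that depends on the dataset, so the sketch only reports outliers and does not remove them.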

Step 3: Feature Selection and Train-Test Split

from sklearn.model_selection import train_test_split

# Separate features and target
X = df.drop('target_column', axis=1)
y = df['target_column']

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

Step 4: Model Training

from sklearn.ensemble import RandomForestRegressor

# Initialize and train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("Model training complete")

Alternative models:

  • For classification: RandomForestClassifier
  • For linear relationships: LinearRegression
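
To make the alternatives concrete, here is a small sketch that fits both of them on synthetic data. The datasets come from scikit-learn's make_classification and make_regression helpers and are stand-ins only; the variable names are assumptions for this example.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Classification: synthetic data with a categorical (0/1) target.
X_cls, y_cls = make_classification(n_samples=300, n_features=6, random_state=42)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_cls, y_cls, test_size=0.2, random_state=42
)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(Xc_train, yc_train)
cls_accuracy = clf.score(Xc_test, yc_test)  # mean accuracy for classifiers
print(f"Classifier accuracy: {cls_accuracy:.2f}")

# Regression with a linear relationship: synthetic data with little noise.
X_reg, y_reg = make_regression(n_samples=300, n_features=6, noise=1.0, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)
lin = LinearRegression()
lin.fit(Xr_train, yr_train)
lin_r2 = lin.score(Xr_test, yr_test)  # R² for regressors
print(f"LinearRegression R²: {lin_r2:.2f}")
```

Note that score() means different things for the two model types: accuracy for the classifier, R² for the regressor. A classifier would also need classification metrics in Step 5 rather than RMSE.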

Step 5: Model Evaluation

from sklearn.metrics import mean_squared_error, r2_score

# Generate predictions
y_pred = model.predict(X_test)

# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5  # square root of MSE; avoids the squared=False argument, which recent scikit-learn releases removed
r2 = r2_score(y_test, y_pred)

print("Model Performance:")
print(f"- RMSE: {rmse:.2f}")
print(f"- R² Score: {r2:.2f}")

Interpretation:

  • RMSE: Lower is better (in target variable units)
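  • R²: 1 is perfect, 0 matches a baseline that always predicts the mean of the target, and negative values are worse than that baseline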
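
A small worked example may help with reading these two numbers. The values below are invented for illustration; the formulas are the standard definitions of RMSE and R².

```python
import numpy as np

# Hypothetical true values and predictions, in target units (e.g. dollars).
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 210.0, 240.0])

# RMSE: square root of the mean squared error.
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R² = 1 - SS_res / SS_tot, where SS_tot measures spread around the mean.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Baseline: always predict the mean of the target; its R² is 0 by construction.
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = 1 - np.sum((y_true - baseline) ** 2) / ss_tot

print(f"RMSE: {rmse:.2f}, R²: {r2:.3f}, baseline R²: {r2_baseline:.1f}")
```

Every prediction here is off by exactly 10, so the RMSE is 10 in target units, and R² works out to 1 − 400/12500 = 0.968.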

Step 6: Making New Predictions

# Prepare new data (must match training features)
new_sample = pd.DataFrame({
    'feature1': [value1],
    'feature2': [value2],
    # ... other features
})

# Generate prediction
prediction = model.predict(new_sample)
print(f"Predicted value: {prediction[0]:.2f}")
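
Because Step 2 one-hot encodes categorical columns, a new sample often ends up with fewer columns than the training data. One possible way to line them up is to reindex against the training columns, as sketched below with hypothetical column names and values.

```python
import pandas as pd

# Hypothetical training frame after pd.get_dummies (column names are illustrative).
train_raw = pd.DataFrame({
    "size": [50, 80, 120],
    "city": ["Austin", "Denver", "Boston"],
})
X_train_demo = pd.get_dummies(train_raw, columns=["city"], dtype=int)

# A new raw sample only contains one category, so get_dummies yields fewer columns.
new_raw = pd.DataFrame({"size": [95], "city": ["Denver"]})
new_encoded = pd.get_dummies(new_raw, columns=["city"], dtype=int)

# Align to the training columns: add missing dummies as 0, drop extras, fix the order.
new_aligned = new_encoded.reindex(columns=X_train_demo.columns, fill_value=0)

print(list(new_aligned.columns))
print(new_aligned.iloc[0].tolist())
```

The aligned frame can then be passed to model.predict(). A category that never appeared during training is silently dropped by this approach, so it is worth logging such cases.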

Step 7: Saving Your Model

import joblib

# Save model
joblib.dump(model, 'trained_model.pkl')

# Later... load model
loaded_model = joblib.load('trained_model.pkl')
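
It can be useful to save the expected feature names alongside the model so that prediction code can check its inputs later. The following is a sketch of that idea, assuming a synthetic dataset and made-up feature names; the bundle layout is an illustration, not a joblib convention.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the training data from the earlier steps.
X_demo, y_demo = make_regression(n_samples=100, n_features=4, random_state=42)
model_demo = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# Bundle the model with the metadata needed to use it correctly later.
bundle = {
    "model": model_demo,
    "feature_names": ["f0", "f1", "f2", "f3"],  # hypothetical column names
}

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "trained_model.pkl")
    joblib.dump(bundle, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool(
    (restored["model"].predict(X_demo) == model_demo.predict(X_demo)).all()
)
print(f"Round trip OK: {same_predictions}")
```

Files written by joblib are pickle-based, so only load them from sources you trust, and load them with the same scikit-learn version that was used to save them.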

Production Considerations

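  1. Version Control: Track model versions along with the training data and library versions used to build them
  2. Monitoring: Set up alerts for drops in prediction quality on recent data
  3. Retraining: Schedule periodic retraining as new data accumulates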
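
As one possible shape for the monitoring and retraining points, the sketch below compares the RMSE on recent production errors with the RMSE measured at training time. The function name needs_retraining, the 1.25 tolerance, and the numbers are all hypothetical choices for this example.

```python
import math


def needs_retraining(errors, baseline_rmse, tolerance=1.25):
    """Return (current_rmse, flag); flag is True if RMSE drifted past the threshold.

    errors: recent prediction errors (actual - predicted) collected in production.
    baseline_rmse: RMSE measured on the test set at training time.
    tolerance: hypothetical multiplier; 1.25 means "alert at 25% worse than baseline".
    """
    current_rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return current_rmse, current_rmse > baseline_rmse * tolerance


# Hypothetical numbers: the test-set RMSE was 10, recent errors are larger.
recent_errors = [12.0, -15.0, 9.0, -18.0]
current_rmse, alert = needs_retraining(recent_errors, baseline_rmse=10.0)
print(f"Current RMSE: {current_rmse:.2f}, retrain: {alert}")
```

This only works when true outcomes eventually become available for past predictions. Where they do not, monitoring usually falls back on tracking shifts in the input data instead.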

Common Applications

Industry        Use Case Example
E-commerce      Customer lifetime value
Finance         Credit risk assessment
Healthcare      Readmission prediction
Manufacturing   Equipment failure warning

Next Steps

  • Experiment with feature engineering
  • Try different model architectures
  • Deploy as a web service with Flask/FastAPI

Troubleshooting

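  • ValueError about mismatched features or shapes: Verify that new data has the same columns, in the same order, as the training data
  • Poor performance: Try better features, hyperparameter tuning, or more data; feature scaling helps distance-based and regularized linear models but has little effect on random forests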
  • Memory issues: Reduce dataset size or use Dask
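
For the memory issue, one way to reduce dataset size before reaching for Dask is to load only the columns you need and use smaller numeric types. The sketch below assumes a made-up CSV layout; usecols and dtype are standard pd.read_csv parameters.

```python
import io

import pandas as pd

# Hypothetical CSV content; a real file would be passed by path instead.
csv_text = "id,price,rooms,notes\n" + "\n".join(
    f"{i},{100.0 + i},{i % 5},free text {i}" for i in range(1000)
)

# Default load: every column, default 64-bit numeric types.
df_full = pd.read_csv(io.StringIO(csv_text))

# Leaner load: only the needed columns, with smaller numeric types.
df_lean = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["price", "rooms"],
    dtype={"price": "float32", "rooms": "int8"},
)

# deep=True includes the memory held by string values.
full_bytes = int(df_full.memory_usage(deep=True).sum())
lean_bytes = int(df_lean.memory_usage(deep=True).sum())
print(f"Full: {full_bytes} bytes, lean: {lean_bytes} bytes")
```

Smaller types trade precision and range for memory, so check that the values still fit: int8, for instance, only covers -128 to 127.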