Practical Machine Learning: From Raw CSV to Predictions in Python
Introduction
This guide walks through the complete process of training a machine learning model on your dataset – from loading and cleaning data to making production-ready predictions. You’ll learn how to:
- Properly preprocess real-world CSV data
- Train and evaluate a robust model
- Save and reuse models for predictions
Prerequisites
- Python 3.8+
- Essential libraries: `pip install pandas scikit-learn matplotlib`
- Any CSV dataset (e.g., sales records, housing prices, customer data)
Step 1: Loading Your Dataset
```python
import pandas as pd

# Load data with proper error handling
try:
    df = pd.read_csv('your_data.csv', encoding='utf-8')
    print("Data loaded successfully. First 5 rows:")
    print(df.head())
except FileNotFoundError:
    print("Error: File not found. Check your file path.")
except Exception as e:
    print(f"An error occurred: {e}")
```
Key considerations:
- Always verify file paths and encoding
- Inspect the data structure with `df.info()`
- Handle missing values during loading with the `na_values` parameter (see the sketch below)
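For example, a minimal sketch of flagging placeholder strings as missing at load time (the placeholder list is an assumption; adjust it to whatever your data uses):

```python
# Treat common placeholder strings as NaN while loading
df = pd.read_csv(
    'your_data.csv',
    na_values=['NA', 'N/A', 'missing', '?'],  # assumed placeholders
)
print(df.isna().sum())  # missing values per column
```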
Step 2: Data Cleaning and Preparation
```python
# Remove irrelevant columns
df = df.drop(columns=['id', 'useless_column'])

# Handle missing data
df = df.dropna()  # or use df.fillna() for imputation

# Convert categorical data
df = pd.get_dummies(df, columns=['category_column'])

print(f"Cleaned data shape: {df.shape}")
```
Data quality checks:
- Verify no null values remain
- Ensure numeric columns are properly typed
- Check for outliers using `df.describe()` (see the sketch below)
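A minimal sketch of these checks in code, assuming `df` is the cleaned DataFrame from above:

```python
# Confirm no nulls survived the cleaning step
assert df.isna().sum().sum() == 0, "Null values remain"

# Confirm column dtypes look right (numeric columns should not be object)
print(df.dtypes)

# Summary statistics; extreme min/max values hint at outliers
print(df.describe())
```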
Step 3: Feature Selection and Train-Test Split
```python
from sklearn.model_selection import train_test_split

# Separate features and target
X = df.drop('target_column', axis=1)
y = df['target_column']

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
```
Step 4: Model Training
```python
from sklearn.ensemble import RandomForestRegressor

# Initialize and train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Model training complete")
```
Alternative models:
- For classification: `RandomForestClassifier`
- For linear relationships: `LinearRegression` (both shown below)
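All scikit-learn estimators share the same `fit`/`predict` API, so swapping models is a one-line change:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

# Classification target (discrete labels)
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Roughly linear feature-target relationship
lin = LinearRegression()
```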
Step 5: Model Evaluation
```python
from sklearn.metrics import mean_squared_error, r2_score

# Generate predictions
y_pred = model.predict(X_test)

# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5  # the squared=False shortcut was removed in newer scikit-learn
r2 = r2_score(y_test, y_pred)

print("Model Performance:")
print(f"- RMSE: {rmse:.2f}")
print(f"- R² Score: {r2:.2f}")
```
Interpretation:
- RMSE: lower is better, measured in the units of the target variable
- R²: 1.0 is a perfect fit; 0 means no better than always predicting the mean (negative values are worse than that baseline)
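If you trained a classifier instead (e.g. the `RandomForestClassifier` from Step 4's alternatives), regression metrics don't apply; a sketch assuming `clf` was fit on the same split:

```python
from sklearn.metrics import accuracy_score, classification_report

# Classification metrics; assumes clf is a fitted classifier
y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred))  # precision/recall/F1 per class
```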
Step 6: Making New Predictions
```python
# Prepare new data (must match training features)
new_sample = pd.DataFrame({
    'feature1': [value1],
    'feature2': [value2],
    # ... other features
})

# Generate prediction
prediction = model.predict(new_sample)
print(f"Predicted value: {prediction[0]:.2f}")
```
Step 7: Saving Your Model
```python
import joblib

# Save model
joblib.dump(model, 'trained_model.pkl')

# Later... load model
loaded_model = joblib.load('trained_model.pkl')
```
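Pickled models are not guaranteed to load across scikit-learn versions, so it helps to record the version used for training. One simple convention (the bundle layout is just an assumption, not a joblib requirement):

```python
import sklearn

# Bundle the model with the library version that produced it
joblib.dump(
    {'model': model, 'sklearn_version': sklearn.__version__},
    'trained_model.pkl',
)

bundle = joblib.load('trained_model.pkl')
loaded_model = bundle['model']
print(f"Trained with scikit-learn {bundle['sklearn_version']}")
```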
Production Considerations
- Version Control: Track model versions
- Monitoring: Set up performance alerts
- Retraining: Schedule periodic updates
Common Applications
| Industry | Use Case Example |
|---|---|
| E-commerce | Customer lifetime value |
| Finance | Credit risk assessment |
| Healthcare | Readmission prediction |
| Manufacturing | Equipment failure warning |
Next Steps
- Experiment with feature engineering
- Try different model architectures
- Deploy as a web service with Flask/FastAPI (see the sketch below)
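For that last step, a minimal FastAPI sketch (the endpoint path and field names are hypothetical; it assumes `trained_model.pkl` from Step 7 holds the bare model):

```python
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load('trained_model.pkl')  # model saved in Step 7

class Sample(BaseModel):
    feature1: float  # hypothetical feature names
    feature2: float

@app.post('/predict')
def predict(sample: Sample):
    # Build a one-row DataFrame so column names match the training features
    row = pd.DataFrame([sample.model_dump()])
    return {'prediction': float(model.predict(row)[0])}
```

Run it with `uvicorn main:app --reload` (assuming the file is saved as `main.py`).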
Troubleshooting
- `ValueError` (shape mismatch): verify the feature count and column order match the training data
- Poor performance: try feature scaling, feature engineering, or more data
- Memory issues: reduce the dataset size or use Dask