Overfitting vs Underfitting in Machine Learning: Explained with Code
Introduction
In the world of machine learning, one of the biggest challenges is creating models that not only perform well on training data but also generalize effectively to new, unseen data. Two common pitfalls that hinder model performance are overfitting and underfitting. Understanding these concepts isn’t just theoretical knowledge—it’s essential for building robust, production-ready machine learning solutions.
In this comprehensive guide, we’ll explore:
- What overfitting and underfitting actually mean
- How to visually identify these issues in your models
- Practical code examples to detect and fix both problems
- Real-world strategies to achieve the right model complexity
What Are Overfitting and Underfitting?
Before diving into code, let’s establish a clear understanding of these concepts:
Overfitting
Overfitting occurs when a model learns the training data too well, including its noise and outliers. An overfit model essentially “memorizes” the training data rather than learning the underlying patterns.
Signs of overfitting:
- Excellent performance on training data
- Poor performance on validation/test data
- Large gap between training and validation metrics
Underfitting
Underfitting happens when a model is too simple to capture the underlying patterns in the data. An underfit model fails to learn even the basic relationships present.
Signs of underfitting:
- Poor performance on training data
- Similarly poor performance on validation/test data
- Little difference between training and validation metrics, but both are poor
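A quick way to spot either pattern in practice is to compare a model's training score with its score on held-out data. Below is a minimal, self-contained sketch (my own illustration, not part of the worked example later in this post; the dataset, the decision tree, and the 0.2 gap threshold are all arbitrary choices made for demonstration):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Illustrative setup: an unconstrained tree on a small, noisy dataset
X, y = make_regression(n_samples=200, n_features=10, noise=20.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
train_r2 = model.score(X_tr, y_tr)  # R² on data the model has already seen
val_r2 = model.score(X_val, y_val)  # R² on held-out data

print(f"Train R²: {train_r2:.3f}, Validation R²: {val_r2:.3f}")
if train_r2 - val_r2 > 0.2:  # arbitrary threshold, for illustration only
    print("Large train/validation gap -> likely overfitting")
elif train_r2 < 0.5 and val_r2 < 0.5:
    print("Both scores are poor -> likely underfitting")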
Overfitting vs Underfitting: A Comparison
| Aspect | Overfitting | Underfitting |
|---|---|---|
| Training Accuracy | High | Low |
| Test Accuracy | Low | Low |
| Model Complexity | Too complex | Too simple |
| Bias | Low | High |
| Variance | High | Low |
| Fix Strategies | Simplify, regularize, early stopping | Add complexity, feature engineering |
Visualizing the Bias-Variance Tradeoff
The concepts of overfitting and underfitting relate directly to the bias-variance tradeoff. Let’s visualize what this looks like in practice:
- High Bias (Underfitting): The model is too simple and makes strong assumptions about the data structure
- High Variance (Overfitting): The model is too complex and sensitive to small fluctuations in the training data
- Balanced Model: Captures the true underlying patterns without being distracted by noise
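One rough but instructive way to put numbers on this tradeoff is to refit the same model on many freshly sampled training sets and measure how its predictions behave at fixed test points: squared bias is how far the average prediction sits from the true function, variance is how much individual fits scatter around that average. The sketch below is my own illustration using the same sine-plus-noise setup as the code in the next section; the sample size, noise level, and polynomial degrees are arbitrary choices:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X_grid = np.linspace(0, 1, 50).reshape(-1, 1)  # fixed evaluation points
true_f = np.sin(2 * np.pi * X_grid).ravel()    # noiseless target function

def bias_variance(degree, n_repeats=200, n_samples=40, noise=0.2):
    """Crude empirical bias/variance estimate over repeated noisy training sets."""
    preds = np.empty((n_repeats, len(X_grid)))
    for i in range(n_repeats):
        X = rng.uniform(0, 1, size=(n_samples, 1))
        y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, noise, n_samples)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds[i] = model.fit(X, y).predict(X_grid)
    bias_sq = np.mean((preds.mean(axis=0) - true_f) ** 2)  # squared bias, averaged over the grid
    variance = np.mean(preds.var(axis=0))                  # variance, averaged over the grid
    return bias_sq, variance

for degree in (1, 15):
    bias_sq, variance = bias_variance(degree)
    print(f"degree={degree:2d}  bias²≈{bias_sq:.3f}  variance≈{variance:.3f}")

A degree-1 fit should show high bias and low variance, while a degree-15 fit should show the opposite.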
Practical Code: Detecting Overfitting and Underfitting
Let’s implement a practical example using Python, scikit-learn, and matplotlib to identify these issues:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
# Generate synthetic data with some noise
np.random.seed(42)
X = np.sort(np.random.rand(100, 1), axis=0)
y = np.sin(2 * np.pi * X).ravel() + np.random.normal(0, 0.1, X.shape[0])
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Function to fit polynomial regression models of different degrees
def fit_polynomial_regression(degree):
    # Fit a polynomial regression model of the given degree
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)

    # Generate predictions for plotting
    X_plot = np.linspace(0, 1, 100).reshape(-1, 1)
    y_plot = model.predict(X_plot)

    # Calculate training and testing errors
    train_error = mean_squared_error(y_train, model.predict(X_train))
    test_error = mean_squared_error(y_test, model.predict(X_test))

    return X_plot, y_plot, train_error, test_error
# Models with different complexity levels
degrees = [1, 3, 15]
models = []
plt.figure(figsize=(16, 4))
for i, degree in enumerate(degrees):
    X_plot, y_plot, train_error, test_error = fit_polynomial_regression(degree)
    models.append((degree, X_plot, y_plot, train_error, test_error))

    # Create subplot
    plt.subplot(1, 3, i + 1)
    plt.scatter(X_train, y_train, color='blue', alpha=0.5, label='Training data')
    plt.scatter(X_test, y_test, color='green', alpha=0.5, label='Testing data')
    plt.plot(X_plot, y_plot, color='red', label=f'Polynomial (degree={degree})')
    plt.ylim(-1.5, 1.5)
    plt.title(f'Degree {degree}\nTrain MSE: {train_error:.4f}, Test MSE: {test_error:.4f}')
    plt.legend()

plt.tight_layout()
plt.savefig('polynomial_fitting.png', dpi=300)
plt.show()
# Plot learning curves
degrees = list(range(1, 20))
train_errors = []
test_errors = []
for degree in degrees:
    _, _, train_error, test_error = fit_polynomial_regression(degree)
    train_errors.append(train_error)
    test_errors.append(test_error)
plt.figure(figsize=(10, 6))
plt.plot(degrees, train_errors, 'o-', color='blue', label='Training error')
plt.plot(degrees, test_errors, 'o-', color='green', label='Testing error')
plt.xlabel('Polynomial Degree')
plt.ylabel('Mean Squared Error')
plt.title('Learning Curves: Error vs. Model Complexity')
plt.legend()
plt.grid(True)
plt.savefig('learning_curves.png', dpi=300)
plt.show()
This code demonstrates three key scenarios:
- Underfitting (Degree 1): A linear model that’s too simple to capture the sinusoidal pattern
- Good Fit (Degree 3): A model with appropriate complexity that captures the true pattern
- Overfitting (Degree 15): A complex model that fits training data perfectly but fails on test data
The learning curves plot shows how both training and testing errors change with model complexity—providing a clear visualization of the optimal complexity level.
Strategies to Combat Overfitting
1. Collect More Data
More data helps models learn true patterns rather than noise. If obtaining more data isn’t feasible, consider data augmentation techniques.
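For tabular regression problems like the ones in this post, one simple (and admittedly crude) augmentation technique is to append copies of the training rows with small Gaussian jitter added to the features. The helper below is hypothetical and written for illustration only; the noise scale would need tuning for your data, and it reuses X_train and y_train from the example above:

import numpy as np

def jitter_augment(X, y, n_copies=2, noise_scale=0.01, random_state=42):
    """Hypothetical helper: append noisy copies of the training rows."""
    rng = np.random.default_rng(random_state)
    X_parts, y_parts = [X], [y]
    for _ in range(n_copies):
        X_parts.append(X + rng.normal(0, noise_scale, size=X.shape))
        y_parts.append(y)  # targets are left unchanged
    return np.vstack(X_parts), np.concatenate(y_parts)

X_train_aug, y_train_aug = jitter_augment(X_train, y_train)
print(f"Training rows: {X_train.shape[0]} -> {X_train_aug.shape[0]}")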
2. Feature Selection and Dimensionality Reduction
Reducing unnecessary features can prevent models from learning noise:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
# Example of feature selection (shown generically; our synthetic X has only one
# feature, so this pattern pays off mainly on datasets with many candidate features)
selector = SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=42))
selector.fit(X_train, y_train)
# Get selected features
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)
print(f"Original features: {X_train.shape[1]}")
print(f"Selected features: {X_train_selected.shape[1]}")
3. Regularization
Adding penalty terms to prevent large coefficients helps control model complexity:
from sklearn.linear_model import Ridge, Lasso
# Ridge Regression (L2 regularization)
ridge_model = Ridge(alpha=1.0) # alpha controls regularization strength
ridge_model.fit(X_train, y_train)
ridge_train_score = ridge_model.score(X_train, y_train)
ridge_test_score = ridge_model.score(X_test, y_test)
# Lasso Regression (L1 regularization)
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train, y_train)
lasso_train_score = lasso_model.score(X_train, y_train)
lasso_test_score = lasso_model.score(X_test, y_test)
print(f"Ridge - Train R²: {ridge_train_score:.4f}, Test R²: {ridge_test_score:.4f}")
print(f"Lasso - Train R²: {lasso_train_score:.4f}, Test R²: {lasso_test_score:.4f}")
4. Early Stopping
Stop training when validation metrics begin to worsen:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
# Split training data to create a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)
# Train with early stopping
gb_model = GradientBoostingRegressor(
    n_estimators=1000,    # Maximum number of estimators
    learning_rate=0.1,
    subsample=0.8,
    random_state=42,
    validation_fraction=0.2,
    n_iter_no_change=10,  # Stop if no improvement after 10 iterations
    tol=1e-4
)
gb_model.fit(X_train, y_train)

# The model automatically used early stopping
print(f"Optimal number of estimators: {gb_model.n_estimators_}")
print(f"R² on the held-out validation split: {gb_model.score(X_val, y_val):.4f}")
5. Dropout and Batch Normalization (for Neural Networks)
For deep learning models, these techniques help prevent overfitting:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping
# Define model with dropout and batch normalization
model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    BatchNormalization(),
    Dropout(0.3),  # Drop 30% of neurons during training
    Dense(64, activation='relu'),
    BatchNormalization(),
    Dropout(0.3),
    Dense(1)
])
# Compile model
model.compile(optimizer='adam', loss='mse')
# Early stopping callback
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)
# Train with validation data
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=200,
    callbacks=[early_stopping],
    batch_size=32,
    verbose=0
)
# Plot training history
plt.figure(figsize=(10, 6))
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Training and Validation Loss')
plt.legend()
plt.grid(True)
plt.savefig('training_history.png', dpi=300)
plt.show()
Strategies to Combat Underfitting
1. Increase Model Complexity
Add more features or use more complex models:
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
# Random Forest with more trees and depth
rf_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=None,  # Allow full depth
    min_samples_split=2,
    random_state=42
)
rf_model.fit(X_train, y_train)
rf_train_score = rf_model.score(X_train, y_train)
rf_test_score = rf_model.score(X_test, y_test)
# Support vector regression (SVR) with a nonlinear RBF kernel
svm_model = SVR(kernel='rbf', C=100, gamma='auto')
svm_model.fit(X_train, y_train)
svm_train_score = svm_model.score(X_train, y_train)
svm_test_score = svm_model.score(X_test, y_test)
print(f"Random Forest - Train R²: {rf_train_score:.4f}, Test R²: {rf_test_score:.4f}")
print(f"SVM - Train R²: {svm_train_score:.4f}, Test R²: {svm_test_score:.4f}")
2. Feature Engineering
Create new features that better capture the underlying patterns:
from sklearn.preprocessing import PolynomialFeatures
# Create polynomial features
poly = PolynomialFeatures(degree=3, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
# Train linear model on polynomial features
linear_model = LinearRegression()
linear_model.fit(X_train_poly, y_train)
linear_train_score = linear_model.score(X_train_poly, y_train)
linear_test_score = linear_model.score(X_test_poly, y_test)
print(f"Polynomial Features - Train R²: {linear_train_score:.4f}, Test R²: {linear_test_score:.4f}")
3. Reduce Regularization
If your model is underfit, try reducing regularization strength:
# Ridge with lower regularization strength
ridge_weak = Ridge(alpha=0.01) # Lower alpha means less regularization
ridge_weak.fit(X_train, y_train)
ridge_weak_train_score = ridge_weak.score(X_train, y_train)
ridge_weak_test_score = ridge_weak.score(X_test, y_test)
print(f"Weak Ridge - Train R²: {ridge_weak_train_score:.4f}, Test R²: {ridge_weak_test_score:.4f}")
Finding the Sweet Spot: Cross-Validation
Cross-validation is key to finding the optimal model complexity:
from sklearn.model_selection import GridSearchCV
# Parameters to try
param_grid = {
    'polynomialfeatures__degree': [1, 2, 3, 4, 5],
    'ridge__alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
}
# Create pipeline
pipeline = make_pipeline(
    PolynomialFeatures(include_bias=False),
    Ridge()
)
# Grid search with cross-validation
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,  # 5-fold cross-validation
    scoring='neg_mean_squared_error',
    n_jobs=-1  # Use all available cores
)
grid_search.fit(X_train, y_train)
# Print best parameters
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {-grid_search.best_score_:.4f}")
# Evaluate on test set
best_model = grid_search.best_estimator_
test_score = best_model.score(X_test, y_test)
print(f"Test R²: {test_score:.4f}")
# Plot best model predictions
X_plot = np.linspace(0, 1, 100).reshape(-1, 1)
y_plot = best_model.predict(X_plot)
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, color='blue', alpha=0.5, label='Training data')
plt.scatter(X_test, y_test, color='green', alpha=0.5, label='Testing data')
plt.plot(X_plot, y_plot, color='red', label='Best model')
plt.title(f'Best Model: {grid_search.best_params_}')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.grid(True)
plt.savefig('best_model.png', dpi=300)
plt.show()
Practical Example: Recognizing Overfitting in a Real-World Dataset
Let’s apply these concepts to a real-world dataset:
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import learning_curve
# Load California housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)
# Function to plot learning curves
def plot_learning_curve(estimator, title, X, y, cv=5, n_jobs=-1):
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='neg_mean_squared_error'
    )
    train_scores_mean = -np.mean(train_scores, axis=1)
    test_scores_mean = -np.mean(test_scores, axis=1)

    plt.figure(figsize=(10, 6))
    plt.title(title)
    plt.xlabel('Training Examples')
    plt.ylabel('Mean Squared Error')
    plt.plot(train_sizes, train_scores_mean, 'o-', color='blue', label='Training error')
    plt.plot(train_sizes, test_scores_mean, 'o-', color='green', label='Cross-validation error')
    plt.legend(loc='best')
    plt.grid(True)
# Compare different model complexities
models = {
    'Underfitting (Linear)': LinearRegression(),
    'Good Fit (Random Forest)': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42),
    'Overfitting (Random Forest)': RandomForestRegressor(n_estimators=100, max_depth=None, min_samples_leaf=1, random_state=42)
}
for name, model in models.items():
    plot_learning_curve(model, name, X_train, y_train)
    plt.savefig(f'{name.replace(" ", "_").replace("(", "").replace(")", "")}.png', dpi=300)
    plt.show()

    # Fit and evaluate model
    model.fit(X_train, y_train)
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print(f"{name} - Train R²: {train_score:.4f}, Test R²: {test_score:.4f}")
Practical Tips for Production-Ready Models
- Always Monitor the Gap: Keep a close eye on the difference between training and validation performance (a short code sketch of this follows the list).
- Use Learning Curves: Plotting learning curves helps visualize whether your model is overfitting or underfitting.
- K-Fold Cross-Validation: For smaller datasets, use k-fold cross-validation to get more reliable estimates of model performance.
- Ensemble Methods: Consider using ensemble methods like stacking or blending, which often balance bias and variance effectively.
- Regular Evaluation: Continuously monitor model performance on new data to detect concept drift.
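As an example of the first tip, scikit-learn's cross_validate can return training scores alongside validation scores, which makes the gap easy to track in a single call. This is a minimal sketch: Ridge here is just a stand-in estimator, and the 0.1 gap threshold is an arbitrary rule of thumb rather than a standard value:

from sklearn.model_selection import cross_validate
from sklearn.linear_model import Ridge

cv_results = cross_validate(
    Ridge(alpha=1.0), X_train, y_train,
    cv=5,
    scoring='r2',
    return_train_score=True  # also report the score on each fold's training portion
)
train_mean = cv_results['train_score'].mean()
val_mean = cv_results['test_score'].mean()

print(f"Mean train R²: {train_mean:.3f}, mean CV R²: {val_mean:.3f}, gap: {train_mean - val_mean:.3f}")
if train_mean - val_mean > 0.1:  # arbitrary threshold, for illustration only
    print("The gap is widening -> consider regularizing or simplifying the model")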
Conclusion
Finding the perfect balance between overfitting and underfitting is more art than science. It requires understanding your data, choosing appropriate models, and applying the right techniques to control model complexity.
By implementing the practical strategies outlined in this guide, you’ll be better equipped to build machine learning models that generalize well to new data—the ultimate goal of any production-ready solution.
Remember the key signs:
- Underfitting: Poor performance on both training and test data
- Overfitting: Excellent training performance but poor test performance
- Just Right: Good performance on both with a minimal gap
With these tools and techniques, you’re now ready to tackle the bias-variance tradeoff with confidence and build better machine learning models.