Overfitting vs Underfitting in Machine Learning: Explained with Code

Introduction

In the world of machine learning, one of the biggest challenges is creating models that not only perform well on training data but also generalize effectively to new, unseen data. Two common pitfalls that hinder model performance are overfitting and underfitting. Understanding these concepts isn’t just theoretical knowledge—it’s essential for building robust, production-ready machine learning solutions.

In this comprehensive guide, we’ll explore:

  • What overfitting and underfitting actually mean
  • How to visually identify these issues in your models
  • Practical code examples to detect and fix both problems
  • Real-world strategies to achieve the right model complexity

What Are Overfitting and Underfitting?

Before diving into code, let’s establish a clear understanding of these concepts:

Overfitting

Overfitting occurs when a model learns the training data too well, including its noise and outliers. An overfit model essentially “memorizes” the training data rather than learning the underlying patterns.

Signs of overfitting:

  • Excellent performance on training data
  • Poor performance on validation/test data
  • Large gap between training and validation metrics

Underfitting

Underfitting happens when a model is too simple to capture the underlying patterns in the data. An underfit model fails to learn even the basic relationships present.

Signs of underfitting:

  • Poor performance on training data
  • Similarly poor performance on validation/test data
  • Little difference between training and validation metrics, but both are poor
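
A quick way to check for both regimes is to compare training and validation scores side by side. The sketch below is a rough heuristic, not a rule: the gap and score thresholds are arbitrary assumptions chosen purely for illustration.

from sklearn.model_selection import train_test_split

def diagnose_fit(model, X, y, gap_threshold=0.1, low_score=0.6, random_state=0):
    """Rough heuristic: compare train vs. validation R² to flag the fitting regime.
    The thresholds are illustrative assumptions, not universal rules."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.3, random_state=random_state
    )
    model.fit(X_tr, y_tr)
    train_score = model.score(X_tr, y_tr)
    val_score = model.score(X_val, y_val)
    if train_score < low_score and val_score < low_score:
        verdict = "likely underfitting"       # both scores poor
    elif train_score - val_score > gap_threshold:
        verdict = "likely overfitting"        # large train/validation gap
    else:
        verdict = "reasonable fit"
    return train_score, val_score, verdict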

Overfitting vs Underfitting: A Comparison

Aspect            | Overfitting                           | Underfitting
Training Accuracy | High                                  | Low
Test Accuracy     | Low                                   | Low
Model Complexity  | Too complex                           | Too simple
Bias              | Low                                   | High
Variance          | High                                  | Low
Fix Strategies    | Simplify, regularize, early stopping  | Add complexity, feature engineering

Visualizing the Bias-Variance Tradeoff

The concepts of overfitting and underfitting relate directly to the bias-variance tradeoff. Let’s visualize what this looks like in practice:

  • High Bias (Underfitting): The model is too simple and makes strong assumptions about the data structure
  • High Variance (Overfitting): The model is too complex and sensitive to small fluctuations in the training data
  • Balanced Model: Captures the true underlying patterns without being distracted by noise
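
To make this concrete, here is a small, self-contained simulation (separate from the examples that follow; the seed, noise level, and degrees are chosen purely for illustration). It refits polynomial models of different degrees on many independently drawn noisy training sets and estimates the squared bias and the variance of their predictions at fixed test points:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def true_fn(x):
    return np.sin(2 * np.pi * x)

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 50).reshape(-1, 1)    # fixed evaluation points

for degree in (1, 3, 15):                        # underfit, balanced, overfit
    preds = []
    for _ in range(200):                         # many independent training sets
        x_tr = rng.uniform(0, 1, (60, 1))
        y_tr = true_fn(x_tr).ravel() + rng.normal(0, 0.1, 60)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds.append(model.fit(x_tr, y_tr).predict(x_test))
    preds = np.array(preds)                      # shape: (runs, n_test_points)
    bias_sq = np.mean((preds.mean(axis=0) - true_fn(x_test).ravel()) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree={degree:2d}  bias^2={bias_sq:.4f}  variance={variance:.4f}")

The simple model should show high squared bias and low variance, while the high-degree model should show the opposite, which is exactly the tradeoff described above.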

Practical Code: Detecting Overfitting and Underfitting

Let’s implement a practical example using Python, scikit-learn, and matplotlib to identify these issues:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline

# Generate synthetic data with some noise
np.random.seed(42)
X = np.sort(np.random.rand(100, 1), axis=0)
y = np.sin(2 * np.pi * X).ravel() + np.random.normal(0, 0.1, X.shape[0])

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Function to fit polynomial regression models of different degrees
def fit_polynomial_regression(degree):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    
    # Generate predictions for plotting
    X_plot = np.linspace(0, 1, 100).reshape(-1, 1)
    y_plot = model.predict(X_plot)
    
    # Calculate training and testing errors
    train_error = mean_squared_error(y_train, model.predict(X_train))
    test_error = mean_squared_error(y_test, model.predict(X_test))
    
    return X_plot, y_plot, train_error, test_error

# Models with different complexity levels
degrees = [1, 3, 15]
models = []

plt.figure(figsize=(16, 4))

for i, degree in enumerate(degrees):
    X_plot, y_plot, train_error, test_error = fit_polynomial_regression(degree)
    models.append((degree, X_plot, y_plot, train_error, test_error))
    
    # Create subplot
    plt.subplot(1, 3, i+1)
    plt.scatter(X_train, y_train, color='blue', alpha=0.5, label='Training data')
    plt.scatter(X_test, y_test, color='green', alpha=0.5, label='Testing data')
    plt.plot(X_plot, y_plot, color='red', label=f'Polynomial (degree={degree})')
    plt.ylim(-1.5, 1.5)
    plt.title(f'Degree {degree}\nTrain MSE: {train_error:.4f}, Test MSE: {test_error:.4f}')
    plt.legend()

plt.tight_layout()
plt.savefig('polynomial_fitting.png', dpi=300)
plt.show()

# Plot a validation curve: error vs. polynomial degree (model complexity)
degrees = list(range(1, 20))
train_errors = []
test_errors = []

for degree in degrees:
    _, _, train_error, test_error = fit_polynomial_regression(degree)
    train_errors.append(train_error)
    test_errors.append(test_error)

plt.figure(figsize=(10, 6))
plt.plot(degrees, train_errors, 'o-', color='blue', label='Training error')
plt.plot(degrees, test_errors, 'o-', color='green', label='Testing error')
plt.xlabel('Polynomial Degree')
plt.ylabel('Mean Squared Error')
plt.title('Validation Curve: Error vs. Model Complexity')
plt.legend()
plt.grid(True)
plt.savefig('validation_curve.png', dpi=300)
plt.show()

This code demonstrates three key scenarios:

  1. Underfitting (Degree 1): A linear model that’s too simple to capture the sinusoidal pattern
  2. Good Fit (Degree 3): A model with appropriate complexity that captures the true pattern
  3. Overfitting (Degree 15): An overly complex model that chases the training noise, achieving a very low training error but a noticeably higher test error

The validation-curve plot shows how both training and testing errors change with model complexity, providing a clear visualization of the optimal complexity level. (True learning curves, which plot error against training-set size, appear in the real-world example later in this guide.)

Strategies to Combat Overfitting

1. Collect More Data

More data helps models learn true patterns rather than noise. If obtaining more data isn’t feasible, consider data augmentation techniques.
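
One simple augmentation idea for continuous tabular features is jittering: appending copies of the training rows with small Gaussian noise added. The helper below is a minimal sketch with an arbitrary noise scale (1% of each feature's standard deviation); treat the function and its defaults as assumptions to tune, not a recipe.

import numpy as np

def augment_with_jitter(X, y, n_copies=2, noise_scale=0.01, random_state=42):
    """Append jittered copies of X; targets are left unchanged. Illustrative only."""
    rng = np.random.default_rng(random_state)
    feature_std = X.std(axis=0, keepdims=True)   # per-feature spread
    X_parts, y_parts = [X], [y]
    for _ in range(n_copies):
        noise = rng.normal(0.0, noise_scale * feature_std, size=X.shape)
        X_parts.append(X + noise)
        y_parts.append(y)
    return np.vstack(X_parts), np.concatenate(y_parts)

# Example: triple the size of the synthetic training set defined earlier
X_train_aug, y_train_aug = augment_with_jitter(X_train, y_train)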

2. Feature Selection and Dimensionality Reduction

Reducing unnecessary features can prevent models from learning noise:

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor

# Example of feature selection (note: the synthetic X above has only one
# feature, so this pattern pays off on datasets with many candidate features)
selector = SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=42))
selector.fit(X_train, y_train)

# Get selected features
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

print(f"Original features: {X_train.shape[1]}")
print(f"Selected features: {X_train_selected.shape[1]}")

3. Regularization

Adding penalty terms to prevent large coefficients helps control model complexity:

from sklearn.linear_model import Ridge, Lasso

# Ridge Regression (L2 regularization)
ridge_model = Ridge(alpha=1.0)  # alpha controls regularization strength
ridge_model.fit(X_train, y_train)
ridge_train_score = ridge_model.score(X_train, y_train)
ridge_test_score = ridge_model.score(X_test, y_test)

# Lasso Regression (L1 regularization)
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train, y_train)
lasso_train_score = lasso_model.score(X_train, y_train)
lasso_test_score = lasso_model.score(X_test, y_test)

print(f"Ridge - Train R²: {ridge_train_score:.4f}, Test R²: {ridge_test_score:.4f}")
print(f"Lasso - Train R²: {lasso_train_score:.4f}, Test R²: {lasso_test_score:.4f}")

4. Early Stopping

Stop training when validation metrics begin to worsen:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Split training data to create an explicit validation set (reused by the
# neural-network example below; GradientBoostingRegressor performs its own
# internal split via validation_fraction)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

# Train with early stopping
gb_model = GradientBoostingRegressor(
    n_estimators=1000,  # Maximum number of estimators
    learning_rate=0.1,
    subsample=0.8,
    random_state=42,
    validation_fraction=0.2,
    n_iter_no_change=10,  # Stop if no improvement after 10 iterations
    tol=1e-4
)

gb_model.fit(X_train, y_train)

# Early stopping kicked in once the internal validation score stopped improving
print(f"Number of boosting stages actually used: {gb_model.n_estimators_}")
print(f"Validation R²: {gb_model.score(X_val, y_val):.4f}")

5. Dropout and Batch Normalization (for Neural Networks)

For deep learning models, these techniques help prevent overfitting:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping

# Define model with dropout and batch normalization
model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    BatchNormalization(),
    Dropout(0.3),  # Drop 30% of neurons during training
    Dense(64, activation='relu'),
    BatchNormalization(),
    Dropout(0.3),
    Dense(1)
])

# Compile model
model.compile(optimizer='adam', loss='mse')

# Early stopping callback
early_stopping = EarlyStopping(
    monitor='val_loss', 
    patience=10, 
    restore_best_weights=True
)

# Train with validation data
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=200,
    callbacks=[early_stopping],
    batch_size=32,
    verbose=0
)

# Plot training history
plt.figure(figsize=(10, 6))
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Training and Validation Loss')
plt.legend()
plt.grid(True)
plt.savefig('training_history.png', dpi=300)
plt.show()

Strategies to Combat Underfitting

1. Increase Model Complexity

Add more features or use more complex models:

from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# Random Forest with more trees and depth
rf_model = RandomForestRegressor(
    n_estimators=100, 
    max_depth=None,  # Allow full depth
    min_samples_split=2,
    random_state=42
)
rf_model.fit(X_train, y_train)
rf_train_score = rf_model.score(X_train, y_train)
rf_test_score = rf_model.score(X_test, y_test)

# Support Vector Machine with nonlinear kernel
svm_model = SVR(kernel='rbf', C=100, gamma='auto')
svm_model.fit(X_train, y_train)
svm_train_score = svm_model.score(X_train, y_train)
svm_test_score = svm_model.score(X_test, y_test)

print(f"Random Forest - Train R²: {rf_train_score:.4f}, Test R²: {rf_test_score:.4f}")
print(f"SVM - Train R²: {svm_train_score:.4f}, Test R²: {svm_test_score:.4f}")

2. Feature Engineering

Create new features that better capture the underlying patterns:

from sklearn.preprocessing import PolynomialFeatures

# Create polynomial features
poly = PolynomialFeatures(degree=3, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Train linear model on polynomial features
linear_model = LinearRegression()
linear_model.fit(X_train_poly, y_train)
linear_train_score = linear_model.score(X_train_poly, y_train)
linear_test_score = linear_model.score(X_test_poly, y_test)

print(f"Polynomial Features - Train R²: {linear_train_score:.4f}, Test R²: {linear_test_score:.4f}")

3. Reduce Regularization

If your model is underfit, try reducing regularization strength:

# Ridge with lower regularization strength
ridge_weak = Ridge(alpha=0.01)  # Lower alpha means less regularization
ridge_weak.fit(X_train, y_train)
ridge_weak_train_score = ridge_weak.score(X_train, y_train)
ridge_weak_test_score = ridge_weak.score(X_test, y_test)

print(f"Weak Ridge - Train R²: {ridge_weak_train_score:.4f}, Test R²: {ridge_weak_test_score:.4f}")

Finding the Sweet Spot: Cross-Validation

Cross-validation is key to finding the optimal model complexity:

from sklearn.model_selection import GridSearchCV

# Parameters to try
param_grid = {
    'polynomialfeatures__degree': [1, 2, 3, 4, 5],
    'ridge__alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
}

# Create pipeline
pipeline = make_pipeline(
    PolynomialFeatures(include_bias=False),
    Ridge()
)

# Grid search with cross-validation
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,  # 5-fold cross-validation
    scoring='neg_mean_squared_error',
    n_jobs=-1  # Use all available cores
)

grid_search.fit(X_train, y_train)

# Print best parameters
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {-grid_search.best_score_:.4f}")

# Evaluate on test set
best_model = grid_search.best_estimator_
test_score = best_model.score(X_test, y_test)
print(f"Test R²: {test_score:.4f}")

# Plot best model predictions
X_plot = np.linspace(0, 1, 100).reshape(-1, 1)
y_plot = best_model.predict(X_plot)

plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, color='blue', alpha=0.5, label='Training data')
plt.scatter(X_test, y_test, color='green', alpha=0.5, label='Testing data')
plt.plot(X_plot, y_plot, color='red', label='Best model')
plt.title(f'Best Model: {grid_search.best_params_}')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.grid(True)
plt.savefig('best_model.png', dpi=300)
plt.show()

Practical Example: Recognizing Overfitting in a Real-World Dataset

Let’s apply these concepts to a real-world dataset:

from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import learning_curve

# Load California housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split first, then scale, so the scaler is fit only on training data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Function to plot learning curves
def plot_learning_curve(estimator, title, X, y, cv=5, n_jobs=-1):
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='neg_mean_squared_error'
    )
    
    train_scores_mean = -np.mean(train_scores, axis=1)
    test_scores_mean = -np.mean(test_scores, axis=1)
    
    plt.figure(figsize=(10, 6))
    plt.title(title)
    plt.xlabel('Training Examples')
    plt.ylabel('Mean Squared Error')
    plt.plot(train_sizes, train_scores_mean, 'o-', color='blue', label='Training error')
    plt.plot(train_sizes, test_scores_mean, 'o-', color='green', label='Cross-validation error')
    plt.legend(loc='best')
    plt.grid(True)
    
# Compare different model complexities
models = {
    'Underfitting (Linear)': LinearRegression(),
    'Good Fit (Random Forest)': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42),
    'Overfitting (Random Forest)': RandomForestRegressor(n_estimators=100, max_depth=None, min_samples_leaf=1, random_state=42)
}

for name, model in models.items():
    plot_learning_curve(model, name, X_train, y_train)
    plt.savefig(f'{name.replace(" ", "_").replace("(", "").replace(")", "")}.png', dpi=300)
    plt.show()
    
    # Fit and evaluate model
    model.fit(X_train, y_train)
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print(f"{name} - Train R²: {train_score:.4f}, Test R²: {test_score:.4f}")

Practical Tips for Production-Ready Models

  1. Always Monitor the Gap: Keep a close eye on the difference between training and validation performance.
  2. Use Learning Curves: Plotting learning curves helps visualize whether your model is overfitting or underfitting.
  3. K-Fold Cross-Validation: For smaller datasets, use k-fold cross-validation to get more reliable estimates of model performance (see the sketch after this list).
  4. Ensemble Methods: Consider using ensemble methods like stacking or blending, which often balance bias and variance effectively.
  5. Regular Evaluation: Continuously monitor model performance on new data to detect concept drift.
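
Tying tips 1 and 3 together, the sketch below uses cross_validate with return_train_score=True so that every fold reports both a training and a validation score, making the gap easy to monitor. The Ridge model and fold count are placeholders; substitute your own estimator and settings.

from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate

cv_results = cross_validate(
    Ridge(alpha=1.0),                  # placeholder estimator
    X_train, y_train,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring='r2',
    return_train_score=True            # needed to see the train/validation gap
)

train_mean = cv_results['train_score'].mean()
val_mean = cv_results['test_score'].mean()
print(f"Mean train R²: {train_mean:.4f}")
print(f"Mean validation R²: {val_mean:.4f}")
print(f"Gap (train - validation): {train_mean - val_mean:.4f}")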

Conclusion

Finding the perfect balance between overfitting and underfitting is more art than science. It requires understanding your data, choosing appropriate models, and applying the right techniques to control model complexity.

By implementing the practical strategies outlined in this guide, you’ll be better equipped to build machine learning models that generalize well to new data—the ultimate goal of any production-ready solution.

Remember the key signs:

  • Underfitting: Poor performance on both training and test data
  • Overfitting: Excellent training performance but poor test performance
  • Just Right: Good performance on both with a minimal gap

With these tools and techniques, you’re now ready to tackle the bias-variance tradeoff with confidence and build better machine learning models.