The Ultimate Guide to Training Deep Learning Models with Google Colab’s Free T4 GPU

Are you interested in deep learning but don’t have access to an expensive GPU? You’re not alone! This comprehensive guide will show you exactly how to leverage Google Colab’s free T4 GPU to train your PyTorch and TensorFlow models without spending a dime.

Why Use Google Colab’s Free GPU?

  • Completely Free: Access to NVIDIA T4 GPUs at zero cost
  • No Hardware Setup: Skip the hassle of configuring drivers and CUDA
  • Pre-installed Libraries: Many ML frameworks come pre-installed
  • Easy Sharing: Collaborate with others through Google Drive integration
  • Up to 12-Hour Runtime: Sessions can last up to roughly 12 hours, enough for many training tasks

Part 1: Setting Up Your Environment

Checking Your GPU

First, enable a GPU runtime if you haven’t already: in the Colab menu, go to Runtime > Change runtime type and select a GPU hardware accelerator (the free tier provides a T4). Then verify you have GPU access:

# Check if GPU is available
!nvidia-smi

# For PyTorch users
import torch
print("GPU Available:", torch.cuda.is_available())
print("GPU Name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None")

# For TensorFlow users
import tensorflow as tf
print("TensorFlow GPU Available:", tf.config.list_physical_devices('GPU'))

Quick Installation Shortcuts

If the libraries aren’t already installed:

For PyTorch:

!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

For TensorFlow:

!pip install tensorflow

Restart Runtime Tip: After installing packages, you often need to restart the runtime before the new versions take effect. You can do this from the Runtime menu or programmatically:

import os
os.kill(os.getpid(), 9)  # Kill the kernel process; Colab restarts it automatically

Part 2: Dataset Upload Tricks

Method 1: Google Drive Integration (Best for Large Datasets)

from google.colab import drive
drive.mount('/content/drive')

# Access your dataset
dataset_path = '/content/drive/MyDrive/your_dataset_folder'

Method 2: Direct Upload (Best for Small Files)

from google.colab import files
uploaded = files.upload()

# Process uploaded file
import io
import pandas as pd
df = pd.read_csv(io.BytesIO(uploaded['your_file.csv']))

Method 3: Download from URL (Best for Public Datasets)

!wget https://raw.githubusercontent.com/username/repo/master/dataset.csv
# Or for compressed files
!wget https://example.com/dataset.zip
!unzip dataset.zip

Method 4: Use Built-in Datasets (Fastest)

# PyTorch
from torchvision.datasets import MNIST
train_dataset = MNIST(root='./data', train=True, download=True)

# TensorFlow
import tensorflow_datasets as tfds
mnist = tfds.load('mnist', split='train', as_supervised=True)

Part 3: Training Models with PyTorch

Basic PyTorch Training Loop with GPU Acceleration

import os
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR10
from torchvision import transforms

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Define transforms
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# Load dataset
train_dataset = CIFAR10(root='./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=2)

# Define model
class SimpleConvNet(nn.Module):
    def __init__(self):
        super(SimpleConvNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

model = SimpleConvNet().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Training loop
num_epochs = 5
save_path = './model_checkpoints'
os.makedirs(save_path, exist_ok=True)

for epoch in range(num_epochs):
    running_loss = 0.0
    for i, (inputs, labels) in enumerate(train_loader):
        # Move data to device
        inputs, labels = inputs.to(device), labels.to(device)
        
        # Zero the parameter gradients
        optimizer.zero_grad()
        
        # Forward + backward + optimize
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        # Print statistics
        running_loss += loss.item()
        if i % 200 == 199:
            print(f'[Epoch {epoch + 1}, Batch {i + 1}] loss: {running_loss / 200:.3f}')
            running_loss = 0.0
    
    # Save checkpoint after each epoch
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': running_loss,
    }, f'{save_path}/model_epoch_{epoch}.pth')
    
print('Finished Training')

Loading Checkpoints to Resume Training

checkpoint = torch.load(f'{save_path}/model_epoch_2.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']

# Continue training from this point
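
A minimal sketch of what “continue training from this point” can look like, reusing the num_epochs, train_loader, criterion, optimizer, and device defined above:

# Resume from the epoch after the one stored in the checkpoint
start_epoch = checkpoint['epoch'] + 1

for epoch in range(start_epoch, num_epochs):
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
    # Keep saving checkpoints as before so you can resume again if needed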

Part 4: Training Models with TensorFlow

Basic TensorFlow/Keras Training

import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt
import os

# Check for GPU
physical_devices = tf.config.list_physical_devices('GPU')
print("Num GPUs Available: ", len(physical_devices))
if physical_devices:
    # Allocate GPU memory as needed instead of grabbing it all up front
    tf.config.experimental.set_memory_growth(physical_devices[0], True)

# Load and preprocess the CIFAR10 dataset
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

# Create the model
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10)
])

# Compile the model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

# Create checkpoint callback
checkpoint_path = "training_checkpoints/cp-{epoch:04d}.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)
os.makedirs(checkpoint_dir, exist_ok=True)

# Create a callback that saves the model's weights every epoch
cp_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path,
    verbose=1,
    save_weights_only=True,
    save_freq='epoch')

# Train the model with checkpoint saving
history = model.fit(train_images, train_labels, epochs=10, 
                    validation_data=(test_images, test_labels),
                    callbacks=[cp_callback])

# Save the entire model
model.save('saved_model/my_model')

Resuming Training from Checkpoint

# Load the model weights
model.load_weights('training_checkpoints/cp-0005.ckpt')

# Continue training for 5 more epochs; `epochs` is the index of the final epoch,
# so initial_epoch=5 with epochs=10 trains epochs 6 through 10
history = model.fit(train_images, train_labels, epochs=10,
                    validation_data=(test_images, test_labels),
                    initial_epoch=5)

Part 5: Common Errors and Their Fixes

Error 1: CUDA Out of Memory

Error Message:

RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 15.90 GiB total capacity; 14.73 GiB already allocated; 964.88 MiB free; 14.83 GiB reserved in total by PyTorch)

Fix:

# Reduce batch size
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)  # Smaller batch size

# Or use gradient accumulation
accumulation_steps = 4  # Accumulate gradients over 4 batches
optimizer.zero_grad()
for i, (inputs, labels) in enumerate(train_loader):
    outputs = model(inputs.to(device))
    loss = criterion(outputs, labels.to(device))
    loss = loss / accumulation_steps  # Normalize loss
    loss.backward()
    if (i + 1) % accumulation_steps == 0:  # Update weights every few batches
        optimizer.step()
        optimizer.zero_grad()
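
If memory is still tight, another trick that sometimes helps is deleting references to large tensors you no longer need and clearing PyTorch’s cached allocator memory (this frees cached blocks, not memory held by live tensors):

import gc
import torch

del outputs, loss          # Drop references to tensors from the last iteration
gc.collect()               # Let Python reclaim the objects
torch.cuda.empty_cache()   # Release cached GPU memory back to the driver

# Inspect current GPU memory usage
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")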

Error 2: Runtime Disconnection

Problem: Colab disconnects after periods of inactivity or long training runs.

Fix:

// Run this in the browser console (F12) to keep your session alive
function ClickConnect(){
    console.log("Working");
    document.querySelector("colab-toolbar-button#connect").click() 
}
setInterval(ClickConnect, 60000)

Error 3: Package Version Conflicts

Error Message:

ImportError: cannot import name 'softmax' from 'tensorflow.python.ops.nn_ops'

Fix:

# Uninstall problematic versions
!pip uninstall -y tensorflow tensorflow-gpu

# Install specific compatible versions
!pip install tensorflow==2.12.0

Error 4: Dataset Loading Issues

Error Message:

FileNotFoundError: [Errno 2] No such file or directory: '/content/data/train'

Fix: Check your paths and use absolute paths when possible:

import os
print("Current working directory:", os.getcwd())
!ls -la  # List all files in current directory

# Create directory if it doesn't exist
os.makedirs('/content/data/train', exist_ok=True)

Part 6: Advanced Performance Tips

1. Use Mixed Precision Training for Speed

PyTorch Implementation:

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for epoch in range(num_epochs):
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        
        optimizer.zero_grad()
        
        # Use autocast for mixed precision
        with autocast():
            outputs = model(inputs)
            loss = criterion(outputs, labels)
        
        # Scale gradients and optimize
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

TensorFlow Implementation:

# Enable mixed precision
from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy('mixed_float16')
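
With the 'mixed_float16' policy active, the Keras mixed-precision guide recommends keeping the model’s outputs in float32 for numerical stability. One way to do that with the Sequential model from Part 4 is to give the final layer an explicit dtype (a small optional tweak):

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    # Keep the final logits in float32 even though the rest of the model runs in float16
    layers.Dense(10, dtype='float32')
])

Loss scaling is handled automatically by compile/fit under this policy, so no extra code should be needed for that part.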

2. Optimize Data Loading

# For PyTorch: Use num_workers and pin_memory for faster data loading
train_loader = DataLoader(
    train_dataset, 
    batch_size=64,
    shuffle=True,
    num_workers=2,  # Parallelize data loading
    pin_memory=True  # Speed up CPU to GPU transfers
)

# For TensorFlow: use caching and prefetching on a tf.data.Dataset (e.g. one from tfds.load)
train_ds = train_ds.cache().prefetch(tf.data.AUTOTUNE)

3. Monitor and Visualize Training (TensorBoard Integration)

# PyTorch with TensorBoard
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter('runs/experiment_1')

# Log metrics during training
for epoch in range(num_epochs):
    running_loss = 0.0
    for i, (inputs, labels) in enumerate(train_loader):
        # Training code here...
        running_loss += loss.item()
        
    # Log average loss for the epoch    
    writer.add_scalar('training loss', running_loss / len(train_loader), epoch)
    
# Add the model graph once after the loop (it doesn't change between epochs)
writer.add_graph(model, inputs.to(device))
writer.close()  # Flush pending events so they show up in TensorBoard

# TensorFlow with TensorBoard
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir="./logs",
    histogram_freq=1,
    write_graph=True
)

model.fit(train_images, train_labels, epochs=10,
          validation_data=(test_images, test_labels),
          callbacks=[tensorboard_callback])
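
To actually view these dashboards inside Colab, load the TensorBoard notebook extension in a cell and point it at your log directory:

# Display TensorBoard inline in the notebook (point --logdir at 'runs' for the
# PyTorch SummaryWriter above, or at './logs' for the TensorFlow callback)
%load_ext tensorboard
%tensorboard --logdir runs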

4. Prevent Disconnections

# Save frequently (every 100 batches)
if i % 100 == 0:
    torch.save({
        'batch': i,
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, f'{save_path}/checkpoint_e{epoch}_b{i}.pth')

Another effective method is to use a “keepalive” script:

from IPython.display import display, Javascript
import time

def keep_alive(delay_sec=60):
    """Execute Javascript to keep the Colab runtime alive."""
    display(Javascript('''
        function click(){
            document.querySelector('#top-toolbar > colab-connect-button').click();
        }
        setInterval(click, ''' + str(delay_sec*1000) + ''');
    '''))
    
keep_alive()

Part 7: Working with Custom Datasets

Creating a Custom Dataset in PyTorch

import os
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class CustomImageDataset(Dataset):
    def __init__(self, annotations_file, img_dir, transform=None):
        self.img_labels = pd.read_csv(annotations_file)
        self.img_dir = img_dir
        self.transform = transform

    def __len__(self):
        return len(self.img_labels)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        image = Image.open(img_path).convert('RGB')
        label = self.img_labels.iloc[idx, 1]
        
        if self.transform:
            image = self.transform(image)
            
        return image, label

# Using the custom dataset
train_dataset = CustomImageDataset(
    annotations_file='/content/drive/MyDrive/annotations.csv',
    img_dir='/content/drive/MyDrive/images',
    transform=transform
)
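
From there, the custom dataset plugs into a DataLoader exactly like the built-in datasets (this sketch assumes the transform from Part 3 and numeric class labels in the CSV):

from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=2)

# Sanity-check one batch before training
images, labels = next(iter(train_loader))
print(images.shape)  # e.g. torch.Size([64, 3, 32, 32]), depending on your transform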

Creating a Data Generator in TensorFlow

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Create a data generator with augmentation
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest'
)

# Load images from directory
train_generator = train_datagen.flow_from_directory(
    '/content/drive/MyDrive/dataset/train',
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary'
)

# Train using the generator
model.fit(
    train_generator,
    steps_per_epoch=train_generator.samples // 32,
    epochs=10
)

Troubleshooting Guide: When Things Go Wrong

1. Colab Disconnects Frequently

Causes:

  • Inactivity
  • Long-running cells
  • Browser tab closed
  • Limited resources

Solutions:

  • Use the keep-alive script shown above
  • Break training into smaller epochs
  • Save checkpoints more frequently, ideally to Google Drive (a resume sketch follows this list)
  • Run important cells at the beginning of your session
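
A hedged sketch of recovering after a disconnect: reconnect, mount Drive, find the most recent checkpoint, and restore it, reusing the model, optimizer, and device from Part 3. The folder name here is hypothetical; adjust it to wherever you save checkpoints (saving to Drive rather than /content means they survive the disconnect):

import glob
import os
import torch

checkpoint_dir = '/content/drive/MyDrive/model_checkpoints'  # hypothetical Drive folder
checkpoints = glob.glob(os.path.join(checkpoint_dir, 'model_epoch_*.pth'))

if checkpoints:
    # Pick the newest checkpoint and restore model/optimizer state
    latest = max(checkpoints, key=os.path.getmtime)
    checkpoint = torch.load(latest, map_location=device)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    start_epoch = checkpoint['epoch'] + 1
    print(f"Resuming from {latest} at epoch {start_epoch}")
else:
    start_epoch = 0
    print("No checkpoint found, starting from scratch")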

2. Training Is Too Slow

Causes:

  • Large dataset
  • Complex model
  • Inefficient data loading
  • CPU operations

Solutions:

  • Use a smaller subset of data for prototyping
  • Apply data caching and prefetching
  • Ensure operations are running on GPU:
# Check where your tensors are
print(f"Input tensor device: {inputs.device}")

# Profile your code to find bottlenecks
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    model(inputs)
print(prof.key_averages().table(sort_by="cuda_time_total"))

3. Model Accuracy Is Poor

Potential issues:

  • Learning rate too high/low
  • Overfitting
  • Underfitting
  • Data quality issues

Solutions:

  • Implement learning rate scheduling (example below)
  • Add regularization (dropout, weight decay; sketched after the scheduler example)
  • Increase model capacity if underfitting, or reduce it if overfitting
  • Check for data imbalance

# Learning rate scheduling in PyTorch
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=3, verbose=True
)

# After validation in your training loop
scheduler.step(val_loss)
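
Weight decay and dropout, the regularization options mentioned above, are also quick to add in PyTorch (a sketch, not a tuned configuration):

import torch.nn as nn
import torch.optim as optim

# Weight decay (L2 regularization) is just an optimizer argument
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-4)

# Dropout is a layer you insert between fully connected layers, e.g. declare
# self.dropout = nn.Dropout(p=0.5) in SimpleConvNet.__init__ and apply
# x = self.dropout(x) after the fc1 activation in forward()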

Conclusion: Taking Your Skills Further

You’ve now learned how to effectively use Google Colab’s free T4 GPU resources to train deep learning models. This knowledge allows you to:

  1. Experiment with complex models without expensive hardware
  2. Prototype ideas quickly and efficiently
  3. Share your work with collaborators instantly
  4. Gradually scale up to more powerful resources when needed

For your next steps, consider:

  • Exploring pre-trained models for transfer learning
  • Experimenting with different architectures
  • Participating in Kaggle competitions using these techniques
  • Collaborating with others by sharing your notebooks

Remember that Google Colab’s free tier has limitations, including:

  • 12-hour maximum runtime
  • Potential disconnections
  • Limited storage
  • No guaranteed GPU availability

But with the techniques in this guide, you can maximize your productivity within these constraints!

Happy training, and don’t forget to save your work frequently!