Training Issues

This guide covers training-specific issues and solutions.

Loss Issues

Loss Not Decreasing

Issue: Loss plateaus or increases

Solutions:

1. Check Learning Rate (a quick sweep sketch follows this list):

# Too high: loss explodes
# Too low: loss decreases very slowly
config = {'learning_rate': 1e-4}  # Try different values

2. Verify Data Normalization:

    pipeline = SummarySetPipeline(normalize=True)
    

3. Check Model Capacity:

    # Model may be too small
    model = MLPModel(input_dim=15, hidden_dims=[128, 64, 32])
    

4. Check Data Quality:

    # Verify targets are reasonable
    print(y_train.min(), y_train.max(), y_train.mean())
    
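A fast way to act on point 1 is a coarse learning-rate sweep. The sketch below assumes a hypothetical train_model(config) helper that trains with the given config and returns the best validation loss; substitute whatever training entry point you actually use.

# Sweep learning rates across several orders of magnitude
results = {}
for lr in [1e-2, 1e-3, 1e-4, 1e-5]:
    results[lr] = train_model({'learning_rate': lr})  # hypothetical helper

best_lr = min(results, key=results.get)
print(f"Best learning rate: {best_lr} (val loss: {results[best_lr]:.4f})")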

Loss Becomes NaN

Issue: Loss becomes NaN during training

Solutions:

1. Check for NaN in Data:

import numpy as np
print(np.isnan(X_train).any())
print(np.isnan(y_train).any())

2. Reduce Learning Rate:

    config = {'learning_rate': 1e-4}  # Lower learning rate
    

3. Increase Gradient Clipping (see the sketch after this list):

    config = {'gradient_clip': 2.0}  # Stronger clipping
    

4. Check Model Initialization:

    # Verify weights are initialized correctly
    for param in model.parameters():
        print(param.data.mean(), param.data.std())
    
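If you drive training with a raw PyTorch loop rather than the config dict, gradient clipping looks like the minimal sketch below (the mapping of the 'gradient_clip' key to a max norm is an assumption). Note that a smaller max norm clips more aggressively.

import torch
import torch.nn as nn

model = nn.Linear(15, 1)  # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

x_batch, y_batch = torch.randn(32, 15), torch.randn(32, 1)

optimizer.zero_grad()
loss = criterion(model(x_batch), y_batch)
loss.backward()
# Rescale gradients so their global L2 norm is at most max_norm
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0)
optimizer.step()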

Loss Oscillates

Issue: Loss oscillates wildly

Solutions:

1. Reduce Learning Rate:

config = {'learning_rate': 5e-4}  # Lower learning rate

2. Increase Batch Size:

    config = {'batch_size': 64}  # Larger batches = more stable
    

3. Use a Learning Rate Schedule (see the sketch after this list):

    # Cosine annealing helps stabilize training
    config = {'scheduler_T0': 50}
    
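Assuming 'scheduler_T0' maps to the T_0 argument of PyTorch's cosine annealing with warm restarts, a minimal standalone sketch of the schedule looks like this:

import torch
import torch.nn as nn

model = nn.Linear(15, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# T_0 = number of epochs before the first warm restart
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=50)

for epoch in range(200):
    # ... one epoch of training goes here ...
    scheduler.step()  # anneal the learning rate once per epoch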

Convergence Issues

Overfitting

Issue: Training loss decreases but validation loss increases

Solutions:

1. Add Dropout (a plain-PyTorch sketch of points 1 and 2 follows this list):

model = MLPModel(input_dim=15, hidden_dims=[64, 32], dropout=0.3)

2. Increase Weight Decay:

    config = {'weight_decay': 0.1}  # Stronger regularization
    

3. Early Stopping:

    config = {'early_stopping_patience': 20}
    

4. Reduce Model Capacity:

    model = MLPModel(input_dim=15, hidden_dims=[32, 16])  # Smaller model
    
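In plain PyTorch terms, dropout is a layer inside the model and weight decay is an optimizer argument; the sketch below shows both knobs directly (assuming the config's 'weight_decay' key maps to the optimizer argument of the same name):

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(15, 64), nn.ReLU(),
    nn.Dropout(p=0.3),   # randomly zeroes 30% of activations during training
    nn.Linear(64, 32), nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(32, 1),
)
# AdamW applies weight decay as decoupled L2-style regularization
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)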

Underfitting

Issue: Both training and validation loss are high

Solutions:

1. Increase Model Capacity:

model = MLPModel(input_dim=15, hidden_dims=[128, 64, 32])  # Larger model

2. Reduce Regularization:

    config = {'weight_decay': 0.001}  # Less regularization
model = MLPModel(input_dim=15, hidden_dims=[64, 32], dropout=0.1)  # Less dropout
    

3. Train Longer:

    config = {'epochs': 500}  # More epochs
    

4. Check Feature Engineering (see the sketch after this list):

    # May need better features
    pipeline = SummarySetPipeline(include_arrhenius=True)
    
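On point 4: an Arrhenius-style feature typically means exposing the inverse absolute temperature, because Arrhenius kinetics k = A * exp(-Ea / (R * T)) become linear in 1/T after taking logs. A rough sketch with a hypothetical temperature column:

import numpy as np

temperature_c = np.array([25.0, 45.0, 60.0])  # hypothetical readings in Celsius
temperature_k = temperature_c + 273.15        # convert to Kelvin
inv_temperature = 1.0 / temperature_k         # log(k) is linear in 1/T under Arrhenius kinetics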

Performance Issues

Slow Training

Issue: Training is very slow

Solutions:

1. Use a GPU:

import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

2. Enable Mixed Precision (see the sketch after this list):

    config = {'use_amp': True}  # Faster GPU training
    

3. Increase Batch Size:

    config = {'batch_size': 64}  # Larger batches = faster
    

4. Use a Faster Model:

# LightGBM typically trains much faster than neural networks on tabular data
    model = LGBMModel()
    
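If you write the loop yourself, mixed precision combines autocast with a gradient scaler; presumably the 'use_amp' flag wires up the same machinery. A minimal sketch:

import torch
import torch.nn as nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = nn.Linear(15, 1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler(enabled=(device == 'cuda'))

x = torch.randn(32, 15, device=device)
y = torch.randn(32, 1, device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device, enabled=(device == 'cuda')):
    loss = criterion(model(x), y)  # forward pass runs in float16 where safe
scaler.scale(loss).backward()      # scale the loss to avoid float16 gradient underflow
scaler.step(optimizer)
scaler.update()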

Out of Memory

Error: CUDA out of memory

Solutions:

1. Reduce Batch Size:

config = {'batch_size': 16}  # Smaller batches

2. Use Gradient Accumulation (see the sketch after this list):

    # Accumulate gradients over multiple batches
    

3. Enable Mixed Precision:

    config = {'use_amp': True}  # Saves memory
    

4. Use the Adjoint Method (for ODEs):

    model = NeuralODEModel(use_adjoint=True)  # Memory efficient
    
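Gradient accumulation (point 2) runs several small batches before each optimizer step, mimicking a large batch without its memory footprint. A minimal standalone sketch:

import torch
import torch.nn as nn

model = nn.Linear(15, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
accum_steps = 4  # effective batch size = batch_size * accum_steps

optimizer.zero_grad()
for step in range(100):
    x, y = torch.randn(16, 15), torch.randn(16, 1)
    loss = criterion(model(x), y) / accum_steps  # average the loss over accumulated batches
    loss.backward()                              # gradients add up across backward calls
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()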

Early Stopping Issues

Stops Too Early

Issue: Training stops before convergence

Solutions:

1. Increase Patience:

config = {'early_stopping_patience': 50}  # More patience

2. Check the Validation Set (see the sketch after this list):

    # Ensure validation set is representative
    

3. Monitor Training Loss:

# If the validation set is small or noisy, early stopping on training loss may be more reliable
    
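A quick representativeness check for point 2 is to compare summary statistics of the training and validation targets; large gaps suggest a skewed split. This assumes arrays named y_train and y_val (the latter is hypothetical here):

import numpy as np

for name, y in [('train', y_train), ('val', y_val)]:
    print(f"{name}: mean={np.mean(y):.3f}, std={np.std(y):.3f}, "
          f"min={np.min(y):.3f}, max={np.max(y):.3f}")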

Never Stops

Issue: Early stopping never triggers, even when the model has stopped improving

Solutions:

1. Check Validation Metrics:

# Verify validation metrics are being computed correctly

2. Reduce Patience:

    config = {'early_stopping_patience': 10}  # Less patience
    

3. Check for Bugs (see the sketch after this list):

    # Verify early stopping logic is correct
    
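For reference, correct early-stopping logic reduces to a patience counter that resets on improvement; if training never stops, check that the best loss actually updates and that the counter resets. A minimal runnable sketch:

import random

best_loss = float('inf')
patience, counter = 10, 0

for epoch in range(1000):
    val_loss = random.random()  # stand-in for the real validation loss
    if val_loss < best_loss:
        best_loss = val_loss    # improvement: record it and reset the counter
        counter = 0
    else:
        counter += 1            # no improvement this epoch
        if counter >= patience:
            print(f"Early stopping at epoch {epoch}")
            break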

Debugging Tips

Monitor Training

# Use TensorBoard to monitor training
tensorboard --logdir artifacts/runs

# Check:
# - Loss curves
# - Learning rate schedule
# - Gradient norms

Check Gradients

# Monitor gradient norms (run after loss.backward() and before optimizer.step())
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: {param.grad.norm().item()}")

Validate Data

# Check data before training
print(f"X shape: {X_train.shape}")
print(f"y shape: {y_train.shape}")
print(f"X range: [{X_train.min():.2f}, {X_train.max():.2f}]")
print(f"y range: [{y_train.min():.2f}, {y_train.max():.2f}]")

Best Practices

  1. Start Small: Begin with a small model and simple data
  2. Monitor Closely: Watch training curves in TensorBoard
  3. Validate Early: Check validation metrics frequently
  4. Save Checkpoints: Save model checkpoints regularly (see the sketch below)
  5. Experiment Systematically: Change one thing at a time
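A minimal checkpointing sketch for point 4 (the file path and dictionary keys are illustrative, not a fixed convention of this project):

import torch
import torch.nn as nn

model = nn.Linear(15, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Save enough state to resume training later
torch.save({
    'epoch': 100,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
}, 'checkpoint.pt')

# Restore
ckpt = torch.load('checkpoint.pt')
model.load_state_dict(ckpt['model_state_dict'])
optimizer.load_state_dict(ckpt['optimizer_state_dict'])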

Next Steps