# Training Issues

This guide covers training-specific issues and solutions.
## Loss Issues
### Loss Not Decreasing

**Issue**: Loss plateaus or increases.

**Solutions**:

1. **Check the learning rate**:

    ```python
    # Too high: loss explodes
    # Too low: loss decreases very slowly
    config = {'learning_rate': 1e-4}  # try different values
    ```

2. **Verify data normalization**: standardize inputs, e.g. to zero mean and unit variance.
3. **Check model capacity**: a model that is too small cannot fit the data.
4. **Check data quality**: look for mislabeled targets, duplicates, or constant features.
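As a minimal sketch of the normalization check, the snippet below standardizes a hypothetical training matrix (the `X_train` here is synthetic; substitute your own data) and verifies the per-feature statistics:

```python
import numpy as np

# Hypothetical un-normalized training matrix; substitute your own X_train.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(1000, 8))

# Standardize: zero mean, unit variance per feature.
# Compute the statistics on the training split only, then reuse them for validation/test.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_norm = (X_train - mean) / (std + 1e-8)

print("max |mean| after normalization:", np.abs(X_norm.mean(axis=0)).max())
print("max |std - 1| after normalization:", np.abs(X_norm.std(axis=0) - 1.0).max())
```

The small `1e-8` offset guards against division by zero for constant features.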
### Loss Becomes NaN

**Issue**: Loss becomes NaN during training.

**Solutions**:

1. **Check for NaN in the data**: scan inputs and targets for NaN/Inf before training.
2. **Reduce the learning rate**: NaN losses often follow an exploding update step.
3. **Increase gradient clipping**: clip gradient norms to a fixed threshold.
4. **Check model initialization**: poorly scaled initial weights can overflow activations.
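A sketch of the first and third checks, assuming a PyTorch model (the tiny `nn.Linear` and the `max_norm=1.0` threshold are placeholder choices, not tuned values):

```python
import numpy as np
import torch
import torch.nn as nn

# 1. Scan the inputs for NaN/Inf before training starts.
X = np.random.default_rng(0).normal(size=(64, 4)).astype(np.float32)
assert np.isfinite(X).all(), "non-finite values in X"

# 2. Clip gradient norms each step; max_norm=1.0 is a common starting point.
model = nn.Linear(4, 1)
loss = model(torch.from_numpy(X)).pow(2).mean()
loss.backward()
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"gradient norm before clipping: {total_norm:.4f}")
```

`clip_grad_norm_` rescales the gradients in place and returns the total norm measured *before* clipping, which is useful to log.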
### Loss Oscillates

**Issue**: Loss oscillates wildly between steps.

**Solutions**:

1. **Reduce the learning rate**: large steps overshoot the minimum and bounce.
2. **Increase the batch size**: larger batches give lower-variance gradient estimates.
3. **Use a learning rate schedule**: decay the learning rate as training progresses.
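A minimal schedule sketch using PyTorch's built-in `StepLR` (the model, step size, and decay factor below are illustrative placeholders):

```python
import torch

model = torch.nn.Linear(4, 1)  # placeholder model
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate every 10 epochs.
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.5)

for epoch in range(30):
    opt.step()    # ...training steps for this epoch would go here...
    sched.step()  # decay once per epoch

final_lr = opt.param_groups[0]["lr"]
print(f"final lr: {final_lr:.6f}")  # 1e-3 * 0.5**3 = 0.000125
```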
## Convergence Issues
### Overfitting

**Issue**: Training loss decreases but validation loss increases.

**Solutions**:

1. **Add dropout**: randomly zero activations during training.
2. **Increase weight decay**: penalize large weights (L2 regularization).
3. **Use early stopping**: stop when validation loss stops improving.
4. **Reduce model capacity**: use fewer or narrower layers.
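A sketch combining the first two fixes in PyTorch (the layer sizes, `p=0.5`, and `weight_decay=1e-2` are common starting points, not recommendations from this library):

```python
import torch
import torch.nn as nn

# Dropout between layers plus weight decay in the optimizer.
model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(64, 1),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Dropout must be disabled when computing validation metrics:
x = torch.ones(8, 16)
model.eval()
with torch.no_grad():
    y1, y2 = model(x), model(x)
print("eval outputs repeatable:", torch.equal(y1, y2))  # True
```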
### Underfitting

**Issue**: Both training and validation loss are high.

**Solutions**:

1. **Increase model capacity**: add layers or widen existing ones.
2. **Reduce regularization**: lower dropout and weight decay.
3. **Train longer**: run more epochs, possibly with a slower learning rate decay.
4. **Check feature engineering**: make sure the inputs actually carry signal for the target.
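When increasing capacity, it helps to compare parameter counts so the change is deliberate. A sketch with two illustrative architectures (the sizes are arbitrary):

```python
import torch.nn as nn

def n_params(m: nn.Module) -> int:
    # Total number of trainable and non-trainable parameters.
    return sum(p.numel() for p in m.parameters())

small = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
# Wider and deeper variant for underfitting models.
large = nn.Sequential(
    nn.Linear(16, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),
)
print(f"small: {n_params(small):,} params, large: {n_params(large):,} params")
```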
## Performance Issues
### Slow Training

**Issue**: Training is very slow.

**Solutions**:

1. **Use a GPU**: move the model and batches to CUDA if available.
2. **Enable mixed precision**: float16/bfloat16 compute is faster on modern GPUs.
3. **Increase the batch size**: better hardware utilization per step.
4. **Use a faster model**: a smaller architecture may reach acceptable accuracy sooner.
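A sketch of the first two fixes, assuming PyTorch; it falls back to CPU (with autocast disabled) so it also runs on machines without a GPU:

```python
import torch

# Move model and data to the GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(32, 1).to(device)
x = torch.randn(64, 32, device=device)

# Mixed precision: autocast runs eligible ops in half precision on the GPU.
with torch.autocast(device_type=device, enabled=(device == "cuda")):
    loss = model(x).pow(2).mean()
print(f"device={device}, loss={loss.item():.4f}")
```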
### Out of Memory

**Error**: CUDA out of memory.

**Solutions**:

1. **Reduce the batch size**: the most direct fix.
2. **Use gradient accumulation**: keep the effective batch size while using smaller micro-batches.
3. **Enable mixed precision**: half-precision activations roughly halve activation memory.
4. **Use the adjoint method (for ODE models)**: recomputes states in the backward pass instead of storing them, trading compute for memory.
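Gradient accumulation can be sketched as follows (the model, micro-batch size of 4, and `accum_steps=4` are illustrative): gradients from several small backward passes are summed before a single optimizer step, so the update matches one large batch.

```python
import torch

model = torch.nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4  # effective batch = micro-batch size * accum_steps
w_before = model.weight.detach().clone()

opt.zero_grad()
for _ in range(accum_steps):
    x = torch.randn(4, 8)                        # small micro-batch fits in memory
    loss = model(x).pow(2).mean() / accum_steps  # scale so accumulated grads average
    loss.backward()                              # grads add up across micro-batches
opt.step()                                       # one update per accumulation cycle
opt.zero_grad()
```

Dividing each micro-batch loss by `accum_steps` makes the accumulated gradient the *mean* over the effective batch rather than the sum.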
## Early Stopping Issues
### Stops Too Early

**Issue**: Training stops before convergence.

**Solutions**:

1. **Increase patience**: allow more epochs without improvement before stopping.
2. **Check the validation set**: a tiny or unrepresentative split gives noisy metrics.
3. **Monitor the training loss**: confirm the model is still improving when stopping triggers.
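To make the patience knob concrete, here is a minimal patience-based stopper (an illustrative sketch, not this library's API):

```python
class EarlyStopping:
    """Stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience: int = 20, min_delta: float = 0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True means stop

stopper = EarlyStopping(patience=3)
history = [1.0, 0.9, 0.91, 0.92, 0.93]  # one improvement, then three flat epochs
stops = [stopper.step(v) for v in history]
print(stops)  # [False, False, False, False, True]
```

Raising `patience` delays the stop; raising `min_delta` makes small, noisy improvements not count.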
### Never Stops

**Issue**: Training never stops even though the monitored metric is no longer improving.

**Solutions**:

1. **Check validation metrics**: confirm the monitored metric is actually being computed and logged.
2. **Reduce patience**: a smaller patience ends stagnant runs sooner.
3. **Check for bugs**: e.g. monitoring the wrong metric, or treating a metric that should be maximized as one to minimize.
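One common bug worth ruling out: validating without calling `model.eval()`. With dropout still active, the validation loss jitters from run to run, which can keep resetting the "best loss" and prevent early stopping from ever firing. A sketch with a placeholder model:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(8, 8), torch.nn.Dropout(p=0.5), torch.nn.Linear(8, 1)
)
x = torch.randn(16, 8)

model.train()                    # bug: dropout still active during validation
with torch.no_grad():
    noisy_a, noisy_b = model(x), model(x)

model.eval()                     # fix: deterministic validation outputs
with torch.no_grad():
    clean_a, clean_b = model(x), model(x)

print("train mode repeatable:", torch.equal(noisy_a, noisy_b))  # False
print("eval mode repeatable: ", torch.equal(clean_a, clean_b))  # True
```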
## Debugging Tips
### Monitor Training

```bash
# Use TensorBoard to monitor training
tensorboard --logdir artifacts/runs
```

Watch for:

- Loss curves
- Learning rate schedule
- Gradient norms
### Check Gradients

```python
# Monitor gradient norms (run after loss.backward())
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: {param.grad.norm().item()}")
```
### Validate Data

```python
# Check shapes and value ranges before training
print(f"X shape: {X_train.shape}")
print(f"y shape: {y_train.shape}")
print(f"X range: [{X_train.min():.2f}, {X_train.max():.2f}]")
print(f"y range: [{y_train.min():.2f}, {y_train.max():.2f}]")
```
## Best Practices

- **Start Small**: Begin with a small model and simple data
- **Monitor Closely**: Watch training curves in TensorBoard
- **Validate Early**: Check validation metrics frequently
- **Save Checkpoints**: Save model checkpoints regularly
- **Experiment Systematically**: Change one thing at a time
## Next Steps
- Common Issues - Other common problems
- Performance - Performance optimization
- Training Guide - Training documentation