Common Issues
This guide covers common issues and their solutions.
Installation Issues
Import Errors
Error: ModuleNotFoundError: No module named 'src'
Solution:
# Ensure you're in the project root
cd battery-ml
# Install in development mode
pip install -e .
# Or add to PYTHONPATH
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
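To confirm that either fix took effect, it can help to ask Python directly whether it can locate the package. The following is a minimal sketch using only the standard library; the helper name `can_import` is illustrative and not part of the project.

```python
import importlib.util
import sys


def can_import(module_name: str) -> bool:
    """Return True if Python can locate `module_name` on the current sys.path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for malformed or partially imported names
        return False


if __name__ == "__main__":
    if can_import("src"):
        print("'src' is importable")
    else:
        print("'src' not found; sys.path entries searched:")
        for entry in sys.path:
            print("  ", entry)
```

If the project root is missing from the printed entries, the editable install or the PYTHONPATH export did not apply to the interpreter you are running.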
CUDA Not Available
Error: CUDA not available or torch.cuda.is_available() == False
Solutions:
1. Verify the CUDA installation.
2. Reinstall PyTorch with CUDA support.
3. Check that the PyTorch CUDA version matches your installed CUDA version.
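The checks above can be gathered into one small report. This is a sketch rather than a project utility: `cuda_report` is a hypothetical name, and the PyTorch import is guarded so the function also runs where PyTorch is not installed.

```python
def cuda_report() -> dict:
    """Collect the facts needed to diagnose a missing CUDA device."""
    report = {"torch_installed": False}
    try:
        import torch
    except ImportError:
        return report
    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # None here typically means a CPU-only PyTorch build
    report["built_with_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

A `built_with_cuda` value of `None` suggests the installed wheel has no CUDA support, in which case reinstalling PyTorch is the likely fix.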
Data Loading Issues
File Not Found
Error: FileNotFoundError: Performance summary not found
Solutions:
1. Verify the data path.
2. Check the experiment ID (should be 1-5).
3. Verify that the file naming convention matches the expected format.
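When a path does not resolve, listing the nearest directory that does exist often reveals a misspelled folder or an unexpected file name. A minimal sketch, assuming nothing about the project layout; `nearest_existing` is a hypothetical helper.

```python
from pathlib import Path


def nearest_existing(path_str: str):
    """Return (directory, entry names) for the closest existing ancestor.

    If the path itself exists, the entry list is empty.
    """
    path = Path(path_str)
    if path.exists():
        return path, []
    parent = path.parent
    # Walk upwards until a directory that exists is found
    while not parent.exists() and parent != parent.parent:
        parent = parent.parent
    if not parent.is_dir():
        return parent, []
    return parent, sorted(p.name for p in parent.iterdir())
```

Comparing the returned names with the file the loader asked for shows whether the experiment ID or the naming convention is the mismatch.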
Missing Columns
Error: KeyError: 'column_name'
Solutions:
1. Check the CSV file structure.
2. Some experiments may use different column names.
3. Check that unit normalization is applied correctly.
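Because column names can differ between experiments, a quick header check can point to the column that was probably intended. The sketch below uses only the standard library; `check_columns` and its parameters are illustrative, and the list of required names is something you supply.

```python
import csv
import difflib


def check_columns(csv_path: str, required: list[str]) -> dict[str, list[str]]:
    """Map each missing required column to similarly named columns in the file."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle), [])
    header = [name.strip() for name in header]
    return {
        name: difflib.get_close_matches(name, header, n=3, cutoff=0.6)
        for name in required
        if name not in header
    }
```

An empty result means every required column is present; otherwise each key is a missing name and each value lists likely candidates, such as the same quantity recorded in a different unit.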
Pipeline Issues
Cache Errors
Error: PickleError or cache corruption
Solutions:
1. Clear the cache:
import shutil
from pathlib import Path
cache_dir = Path("artifacts/cache")
# Path.rmdir() fails on a non-empty directory, so remove the whole tree
shutil.rmtree(cache_dir, ignore_errors=True)
2. Disable caching temporarily.
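As a lighter alternative to clearing the whole cache, the sketch below removes only the entries that fail to load. It assumes the cache holds pickle files matching `*.pkl`, which may not match the project's actual layout, and `remove_corrupt_pickles` is a hypothetical helper. Only run it on cache files you created yourself, since unpickling executes code stored in the file.

```python
import pickle
from pathlib import Path


def remove_corrupt_pickles(cache_dir: str, pattern: str = "*.pkl") -> list[str]:
    """Delete cache files that cannot be unpickled and return their names."""
    removed = []
    for path in sorted(Path(cache_dir).glob(pattern)):
        try:
            with path.open("rb") as handle:
                pickle.load(handle)
        except Exception:
            # Truncated or corrupt files raise UnpicklingError, EOFError, and others
            path.unlink()
            removed.append(path.name)
    return removed
```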
Feature Dimension Mismatch
Error: RuntimeError: Expected input size ... but got ...
Solutions:
1. Check the feature dimensions.
2. Ensure the pipeline is fitted before transforming test data.
3. Verify that pipeline parameters match between fit and transform.
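The fit-then-transform contract can be illustrated with a toy transformer that records the feature count it saw during fit and rejects anything else. `CheckedCenterer` is a stand-in written for this example, not one of the project's pipeline classes.

```python
class CheckedCenterer:
    """Toy transformer that remembers the feature count seen during fit."""

    def __init__(self):
        self.means = None

    def fit(self, rows):
        n_features = len(rows[0])
        self.means = [sum(r[i] for r in rows) / len(rows) for i in range(n_features)]
        return self

    def transform(self, rows):
        if self.means is None:
            raise RuntimeError("call fit() on training data before transform()")
        for row in rows:
            if len(row) != len(self.means):
                raise ValueError(
                    f"expected {len(self.means)} features, got {len(row)}"
                )
        # Subtract the training means from every row
        return [[v - m for v, m in zip(row, self.means)] for row in rows]
```

Failing early with both numbers in the message is usually easier to debug than a shape error raised deep inside the model.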
Model Issues
Out of Memory
Error: RuntimeError: CUDA out of memory
Solutions:
1. Reduce the batch size.
2. Use gradient accumulation.
3. Enable mixed precision.
4. Fall back to the CPU if GPU memory is insufficient.
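Gradient accumulation keeps the effective batch size while holding only a small batch in memory at a time. The loop below is a sketch of that idea, not the project's trainer: it relies only on the usual PyTorch-style interface (`loss.backward()`, `optimizer.step()`, `optimizer.zero_grad()`), and the function name and parameters are assumptions.

```python
def train_epoch_accumulated(model, batches, optimizer, loss_fn, accumulation_steps=4):
    """Run one epoch, stepping the optimizer once every `accumulation_steps` batches.

    Returns the number of optimizer steps taken.
    """
    steps = 0
    pending = 0
    optimizer.zero_grad()
    for inputs, targets in batches:
        # Scale so the summed gradients match one large-batch average
        loss = loss_fn(model(inputs), targets) / accumulation_steps
        loss.backward()
        pending += 1
        if pending == accumulation_steps:
            optimizer.step()
            optimizer.zero_grad()
            steps += 1
            pending = 0
    if pending:
        # Flush gradients left over from a final partial group
        optimizer.step()
        optimizer.zero_grad()
        steps += 1
    return steps
```

With a batch size of 8 and `accumulation_steps=4`, each update approximates a batch of 32 at roughly the memory cost of 8.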
NaN Losses
Error: Loss becomes NaN during training
Solutions:
1. Check the data for NaN/inf values.
2. Reduce the learning rate.
3. Tighten gradient clipping (use a lower maximum gradient norm).
4. Check the model initialization.
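A single non-finite input value is enough to turn the loss into NaN, so locating such values is a sensible first step. The sketch below works on plain nested sequences, and `find_non_finite` is an illustrative name. For PyTorch tensors, `torch.isfinite(x).all()` performs the equivalent check, and `torch.nn.utils.clip_grad_norm_` applies the clipping mentioned above.

```python
import math


def find_non_finite(rows):
    """Return (row_index, column_index) for every NaN or infinite value."""
    return [
        (i, j)
        for i, row in enumerate(rows)
        for j, value in enumerate(row)
        if not math.isfinite(value)
    ]
```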
Model Not Learning
Issue: Loss doesn't decrease
Solutions:
1. Check the learning rate (it may be too high or too low).
2. Verify that the data is normalized.
3. Check the model capacity (it may be too small).
4. Verify data quality and labels.
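One way to verify normalization is to look at the per-feature mean and standard deviation: standardized features should sit near 0 and 1 respectively. A minimal sketch on plain nested sequences; `feature_stats` is a hypothetical helper.

```python
import statistics


def feature_stats(rows):
    """Return a (mean, population standard deviation) pair per feature column."""
    columns = list(zip(*rows))
    return [(statistics.fmean(col), statistics.pstdev(col)) for col in columns]
```

Features whose scale differs from the others by orders of magnitude are good candidates for a missing or misapplied normalization step.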
Training Issues
Early Stopping Too Early
Issue: Training stops too early
Solutions:
1. Increase the patience.
2. Check whether the validation loss is actually improving.
3. Verify that the validation set is representative.
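To make the role of patience concrete, here is a generic early-stopping rule. It is a sketch of the common technique, not the project's implementation; the class name and the `min_delta` parameter are assumptions.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            # Improvement: remember it and reset the counter
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

With a noisy validation loss, a small patience can trigger on ordinary fluctuation, which is why raising it is the first thing to try.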
Slow Training
Issue: Training is very slow
Solutions:
1. Use a GPU if one is available.
2. Enable mixed precision.
3. Increase the batch size (if memory allows).
4. Use a faster model (e.g., LightGBM for baselines).
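Device selection is commonly written as a one-line fallback. The sketch below wraps it in a hypothetical `pick_device` helper and guards the import so it also runs where PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

If this returns "cpu" on a machine with a GPU, the CUDA Not Available section is the place to continue.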
Tracking Issues
TensorBoard Not Showing Data
Issue: TensorBoard shows no data
Solutions:
1. Check the log directory.
2. Verify that event files exist.
3. Refresh or restart TensorBoard.
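TensorBoard reads event files whose names start with `events.out.tfevents.`, so searching for them shows whether anything was written to the directory you pass to `--logdir`. A minimal sketch; `find_event_files` is an illustrative name and the log directory is whatever your run was configured with.

```python
from pathlib import Path


def find_event_files(log_dir: str) -> list[Path]:
    """Return TensorBoard event files under `log_dir`, searching subdirectories."""
    root = Path(log_dir)
    if not root.is_dir():
        return []
    return sorted(root.rglob("events.out.tfevents.*"))
```

An empty list means either the writer never flushed or TensorBoard is pointed at a different directory than the one the run logged to.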
MLflow Connection Issues
Error: MlflowException: Unable to connect
Solutions:
1. Check the MLflow tracking URI.
2. For remote MLflow, check network connectivity.
3. Verify that the MLflow server is running (if using a remote server).
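The URI and connectivity checks can be separated from MLflow itself. The sketch below parses a tracking URI and tries a plain TCP connection; both function names are hypothetical, and a successful connection only shows that something is listening, not that it is a healthy MLflow server.

```python
import socket
from urllib.parse import urlparse


def remote_endpoint(tracking_uri: str):
    """Return (host, port) for an http(s) tracking URI, or None for local stores."""
    parsed = urlparse(tracking_uri)
    if parsed.scheme not in ("http", "https"):
        return None
    default_port = 443 if parsed.scheme == "https" else 80
    return parsed.hostname, parsed.port or default_port


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Try a TCP connection to see whether anything is listening."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Pass in the value of the `MLFLOW_TRACKING_URI` environment variable or the result of `mlflow.get_tracking_uri()`. A `None` result means the URI points at a local store, so network connectivity is not the problem.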
Getting Help
If you encounter issues not covered here:
- Check Data Issues for data-specific problems
- Check Training Issues for training problems
- Check FAQ for frequently asked questions
- Open an issue on GitHub with:
  - Error message
  - Steps to reproduce
  - Environment details (Python version, OS, etc.)