Common Issues¶

This guide covers common issues and their solutions.

Installation Issues¶

Import Errors¶

Error: ModuleNotFoundError: No module named 'src'

Solution:

# Ensure you're in the project root
cd battery-ml

# Install in development mode
pip install -e .

# Or add to PYTHONPATH
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
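
To confirm that the fix took effect, a quick check like the one below can help. It is a minimal sketch using only the standard library; the helper name `is_importable` is illustrative and not part of the project.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate the top-level module on sys.path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # 'src' is the package named in the error message above.
    print("src importable:", is_importable("src"))
```

If this prints `False` after installing in development mode, the interpreter running the script is probably not the one that `pip` installed into.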

CUDA Not Available¶

Error: CUDA not available, or torch.cuda.is_available() returns False

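Solutions:

1. Verify that the NVIDIA driver is installed and the GPU is visible:

nvidia-smi

2. Reinstall PyTorch with a CUDA build (this example uses the CUDA 11.8 wheels):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

3. Check that the CUDA version PyTorch was built with (torch.version.cuda) is supported by your installed driver.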
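
The version check in the last step can be done from Python. The sketch below gathers the relevant values in one place; the function name `describe_torch_cuda` is hypothetical, and the import is guarded so the script also runs where PyTorch is not installed.

```python
def describe_torch_cuda() -> dict:
    """Collect the PyTorch/CUDA facts that matter when CUDA is reported unavailable."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # None here means a CPU-only build of PyTorch is installed.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_cuda().items():
        print(f"{key}: {value}")
```

Compare `built_with_cuda` against the CUDA version shown in the header of the `nvidia-smi` output, which is the highest version the driver supports.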

Data Loading Issues¶

File Not Found¶

Error: FileNotFoundError: Performance summary not found

Solutions:

1. Verify the data path:

from pathlib import Path
base_path = Path("Raw Data")
print(base_path.exists())  # Should be True

2. Check the experiment ID (should be 1-5).
3. Verify that the file names match the expected naming convention.
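
When checking the naming convention, it can help to list what is actually on disk. The following is a minimal sketch; `list_data_files` is an illustrative helper, and the default `*.csv` pattern is an assumption about the data format rather than something the project guarantees.

```python
from pathlib import Path


def list_data_files(base_path: str, pattern: str = "*.csv") -> list[str]:
    """Return paths (relative to base_path) of matching files, searched recursively."""
    base = Path(base_path)
    if not base.is_dir():
        # Show the absolute path so a wrong working directory is easy to spot.
        raise FileNotFoundError(f"Data directory not found: {base.resolve()}")
    return sorted(str(p.relative_to(base)) for p in base.rglob(pattern) if p.is_file())


if __name__ == "__main__":
    try:
        for name in list_data_files("Raw Data"):
            print(name)
    except FileNotFoundError as err:
        print(err)
```

Because `Path("Raw Data")` is relative, it resolves against the current working directory; the absolute path in the error message shows where Python actually looked.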

Missing Columns¶

Error: KeyError: 'column_name'

Solutions:

1. Check the CSV file structure:

import pandas as pd
df = pd.read_csv("path/to/file.csv")
print(df.columns.tolist())

2. Compare column names across experiments; some experiments may use different names.
3. Check that unit normalization is applied correctly.
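
To compare several files against the columns your code expects, a header-only check is often enough. This sketch reads just the first row with the standard library `csv` module, so it stays fast on large files; `missing_columns` is an illustrative helper, and the column names in the usage note are placeholders.

```python
import csv


def missing_columns(csv_path: str, required: list[str]) -> list[str]:
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])  # [] if the file is empty
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]
```

For example, `missing_columns("path/to/file.csv", ["cycle", "capacity"])` returns the subset of those two hypothetical names that the file lacks, or an empty list if both are present.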

Pipeline Issues¶

Cache Errors¶

Error: PickleError or cache corruption

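Solutions:

1. Clear the cache:

import shutil
from pathlib import Path
cache_dir = Path("artifacts/cache")
# Path.rmdir() only removes empty directories, so use shutil.rmtree instead
shutil.rmtree(cache_dir, ignore_errors=True)  # Remove cache directory and its contents

2. Disable caching temporarily:

pipeline = ICAPeaksPipeline(use_cache=False)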
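
If only some cache entries are corrupt, deleting the whole directory may be more than is needed. The sketch below shows one way to drop individual unreadable files. It assumes the cache consists of pickle files, which the error message suggests but which is not confirmed here; `load_cache_or_none` is a hypothetical helper, not part of the project.

```python
import pickle
from pathlib import Path


def load_cache_or_none(cache_file):
    """Return the unpickled object, or None if the file is missing or unreadable."""
    try:
        with open(cache_file, "rb") as handle:
            return pickle.load(handle)
    except FileNotFoundError:
        return None
    except (pickle.UnpicklingError, EOFError, AttributeError, ImportError) as err:
        # Corrupt or stale entry: delete it so the next run recomputes it.
        print(f"Discarding unreadable cache file {cache_file}: {err}")
        Path(cache_file).unlink(missing_ok=True)
        return None
```

A `None` result then signals the caller to recompute the features. Only unpickle cache files that your own runs produced, since loading an untrusted pickle can execute arbitrary code.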

Feature Dimension Mismatch¶

Error: RuntimeError: Expected input size ... but got ...

Solutions:

1. Check the feature dimensions (the two values should match):

print(sample.feature_dim)
print(model.input_dim)

2. Ensure the pipeline is fitted before transforming test data.
3. Verify that pipeline parameters match between fit and transform.

Model Issues¶

Out of Memory¶

Error: RuntimeError: CUDA out of memory

Solutions:

1. Reduce the batch size:

config = {'batch_size': 16}  # Instead of 32

2. Use gradient accumulation to keep the effective batch size while processing smaller batches:

# Accumulate gradients over multiple batches

3. Enable mixed precision:
```
config = {'use_amp': True}
```
4. Fall back to the CPU if GPU memory is insufficient.
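
Gradient accumulation is only named above, so here is a minimal sketch of the idea. It is not the project's trainer: the function name and signature are hypothetical, and it relies only on the standard PyTorch interfaces (`model.train()`, `loss.backward()`, `optimizer.step()`, `optimizer.zero_grad()`) of whatever objects are passed in.

```python
def train_one_epoch_accumulated(model, batches, loss_fn, optimizer, accum_steps=2):
    """Run one epoch, stepping the optimizer once every `accum_steps` batches.

    Returns the number of optimizer steps taken.
    """
    model.train()
    optimizer.zero_grad()
    num_steps = 0
    pending = 0  # batches whose gradients have not been applied yet
    for inputs, targets in batches:
        # Scale the loss so the summed gradients average over the group.
        loss = loss_fn(model(inputs), targets) / accum_steps
        loss.backward()  # gradients add up across calls until zero_grad()
        pending += 1
        if pending == accum_steps:
            optimizer.step()
            optimizer.zero_grad()
            num_steps += 1
            pending = 0
    if pending:
        # Apply leftover gradients from a final, incomplete group.
        # They were scaled by 1/accum_steps, so this update is smaller.
        optimizer.step()
        optimizer.zero_grad()
        num_steps += 1
    return num_steps
```

With a batch size of 16 and `accum_steps=2`, each optimizer step sees gradients from 32 samples while only 16 are held in GPU memory at a time.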

NaN Losses¶

Error: Loss becomes NaN during training

Solutions:

1. Check the data for NaN/inf:

import numpy as np
print(np.isnan(X).any())
print(np.isinf(X).any())

2. Reduce the learning rate:

config = {'learning_rate': 1e-4}  # Instead of 1e-3

3. Apply gradient clipping (a lower threshold clips more aggressively):
```
config = {'gradient_clip': 2.0}
```
4. Check model initialization.
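
When the NaN/inf check reports `True`, the next question is usually which features are affected. The sketch below narrows it down by column. It assumes `X` is a 2-D numeric array of shape (n_samples, n_features); `non_finite_columns` is an illustrative name.

```python
import numpy as np


def non_finite_columns(X) -> list[int]:
    """Return indices of feature columns that contain NaN or +/-inf."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2:
        raise ValueError("expected a 2-D array of shape (n_samples, n_features)")
    # A column is bad if any of its entries is not finite.
    bad = ~np.isfinite(X).all(axis=0)
    return np.flatnonzero(bad).tolist()
```

An empty list means the inputs are clean, in which case the learning rate, clipping, or initialization steps are the more likely places to look.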

Model Not Learning¶

Issue: Loss doesn't decrease

Solutions:

1. Check the learning rate (may be too high or too low).
2. Verify that the data is normalized:

pipeline = SummarySetPipeline(normalize=True)

3. Check model capacity (may be too small).
4. Verify data quality and labels.

Training Issues¶

Early Stopping Too Early¶

Issue: Training stops too early

Solutions:

1. Increase patience:

config = {'early_stopping_patience': 50}  # Instead of 20

2. Check whether the validation loss is actually still improving when training stops.
3. Verify that the validation set is representative.
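
To make the effect of the patience setting concrete, here is a minimal sketch of patience-based early stopping. It illustrates the general technique and is not the project's implementation; the class name and the `min_delta` parameter are assumptions.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience: int = 20, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta  # smallest decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0  # any improvement resets the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

A noisy validation loss can go many epochs without a new best even while the trend is downward, which is why a larger patience (or a smoother, more representative validation set) can prevent premature stops.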

Slow Training¶

Issue: Training is very slow

Solutions:

1. Use a GPU if available:

import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'

2. Enable mixed precision:
```
config = {'use_amp': True}
```
3. Increase the batch size (if memory allows).
4. Use a faster model (e.g., LightGBM for baselines).

Tracking Issues¶

TensorBoard Not Showing Data¶

Issue: TensorBoard shows no data

Solutions:

1. Check that TensorBoard points at the correct log directory:

tensorboard --logdir artifacts/runs

2. Verify that event files exist:
```
ls artifacts/runs/*/tensorboard/
```
3. Refresh TensorBoard or restart it.
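
The file check can also be done from Python, which is handy on systems without `ls`. TensorBoard event files have names beginning with `events.out.tfevents.`, so the sketch below searches for that prefix; `find_event_files` is an illustrative helper.

```python
from pathlib import Path


def find_event_files(logdir: str) -> list[Path]:
    """Return TensorBoard event files found anywhere under logdir."""
    return sorted(Path(logdir).rglob("events.out.tfevents.*"))


if __name__ == "__main__":
    files = find_event_files("artifacts/runs")
    if not files:
        print("No event files found; nothing has been logged to this directory.")
    for path in files:
        print(f"{path} ({path.stat().st_size} bytes)")
```

An empty result points to a logging or path problem rather than a TensorBoard display problem.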

MLflow Connection Issues¶

Error: MLflowException: Unable to connect

Solutions:

1. Check the MLflow tracking URI:

tracker = MLflowTracker(tracking_uri="file:./artifacts/mlruns")

2. For a remote MLflow server, check network connectivity.
3. Verify that the MLflow server is running (if using a remote server).
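
For the remote case, a plain TCP connection test separates network problems from MLflow configuration problems. This is a sketch using only the standard library. It checks that something accepts connections at the host and port, not that it is an MLflow server, and the function name is hypothetical.

```python
import socket
from urllib.parse import urlparse


def tracking_server_reachable(tracking_uri: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to an http(s) tracking URI succeeds.

    Other URIs (plain paths, file:, and so on) are not checked and return True.
    """
    parsed = urlparse(tracking_uri)
    if parsed.scheme not in ("http", "https"):
        return True
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

For example, `tracking_server_reachable("http://localhost:5000")` tests a server on port 5000; substitute the host and port from your own tracking URI.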

Getting Help¶

If you encounter issues not covered here:

Check Data Issues for data-specific problems
Check Training Issues for training problems
Check FAQ for frequently asked questions
Open an issue on GitHub with the error message, steps to reproduce, and environment details (Python version, OS, etc.)
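
The environment details can be collected with a short script and pasted into the issue. The sketch below uses only the standard library; the default package list is an assumption about which dependencies are relevant, so adjust it as needed.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("torch", "numpy", "pandas")) -> str:
    """Build a plain-text summary of the Python, OS, and package versions."""
    lines = [
        f"Python: {sys.version.split()[0]}",
        f"OS: {platform.platform()}",
    ]
    for name in packages:
        try:
            lines.append(f"{name}: {metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"{name}: not installed")
    return "\n".join(lines)


if __name__ == "__main__":
    print(environment_report())
```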