Common Issues

This guide covers common issues and their solutions.

Installation Issues

Import Errors

Error: ModuleNotFoundError: No module named 'src'

Solution:

# Ensure you're in the project root
cd battery-ml

# Install in development mode
pip install -e .

# Or add to PYTHONPATH
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
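
To confirm that the fix took effect, the import can be checked from Python. This is a minimal sketch using only the standard library; the helper name `is_importable` is made up for illustration, and `src` is the module name from the error message above.

```python
import importlib.util
import sys


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate the top-level module on sys.path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    name = "src"  # module name taken from the error message
    print(f"{name!r} importable: {is_importable(name)}")
    if not is_importable(name):
        # Printing sys.path shows the directories Python actually searched.
        print("\n".join(sys.path))
```

If the module is still not found, the printed search path should show whether the project root is missing from it.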

CUDA Not Available

Error: CUDA is not available, or torch.cuda.is_available() returns False

Solutions:

  1. Verify the NVIDIA driver and CUDA installation:

    nvidia-smi

  2. Reinstall PyTorch with CUDA support (the cu118 index below is the CUDA 11.8 build; pick the index that matches your setup):

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

  3. Check that the PyTorch CUDA version matches your installed CUDA version
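
The version comparison in the last step can be done from Python. The sketch below is only a diagnostic aid: `cuda_report` is a hypothetical helper, and it returns early if PyTorch is not installed at all.

```python
def cuda_report() -> dict:
    """Collect the facts needed to compare PyTorch's CUDA build with the system."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    available = torch.cuda.is_available()
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was compiled against; None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": available,
        "device_name": torch.cuda.get_device_name(0) if available else None,
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` is `None`, a CPU-only build of PyTorch is installed and reinstalling with a CUDA index is the likely fix.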

Data Loading Issues

File Not Found

Error: FileNotFoundError: Performance summary not found

Solutions:

  1. Verify the data path:

    from pathlib import Path
    base_path = Path("Raw Data")
    print(base_path.exists())  # Should be True

  2. Check the experiment ID (should be 1-5)

  3. Verify that the file names match the expected naming convention
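
The first two checks can be combined into one pre-flight function. This is a sketch under stated assumptions: `check_data_inputs` is a hypothetical helper, not part of the project, and the only facts it relies on are the data directory and the 1-5 ID range given above.

```python
from pathlib import Path

VALID_EXPERIMENT_IDS = range(1, 6)  # IDs 1-5, as stated above


def check_data_inputs(base_path: str, experiment_id: int) -> list[str]:
    """Return a list of human-readable problems; an empty list means no problem found."""
    problems = []
    root = Path(base_path)
    if not root.is_dir():
        problems.append(f"Data directory not found: {root.resolve()}")
    if experiment_id not in VALID_EXPERIMENT_IDS:
        problems.append(f"Experiment ID {experiment_id} is outside 1-5")
    return problems


if __name__ == "__main__":
    for problem in check_data_inputs("Raw Data", experiment_id=3):
        print(problem)
```

Printing the resolved path helps spot the common case where the script runs from a different working directory than expected.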

Missing Columns

Error: KeyError: 'column_name'

Solutions:

  1. Check the CSV file structure:

    import pandas as pd
    df = pd.read_csv("path/to/file.csv")
    print(df.columns.tolist())

  2. Check whether this experiment uses different column names (they can vary between experiments)

  3. Check that unit normalization is applied correctly
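
To find out which columns are missing before loading a full file, the header row alone is enough. The sketch below uses the standard-library `csv` module instead of pandas so that only the first line is read; `missing_columns` and the column names in the usage example are illustrative assumptions.

```python
import csv


def missing_columns(csv_path: str, required: list[str]) -> list[str]:
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    # Exact match, like a DataFrame column lookup: stray whitespace or a
    # different capitalization counts as missing.
    present = set(header)
    return [name for name in required if name not in present]


if __name__ == "__main__":
    # "cycle" and "capacity" are placeholder names, not the project's schema.
    print(missing_columns("path/to/file.csv", ["cycle", "capacity"]))
```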

Pipeline Issues

Cache Errors

Error: PickleError or cache corruption

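Solutions:

  1. Clear the cache:

    import shutil
    from pathlib import Path
    cache_dir = Path("artifacts/cache")
    # rmtree also deletes the cached files; Path.rmdir() only removes empty directories
    shutil.rmtree(cache_dir, ignore_errors=True)

  2. Disable caching temporarily:

    pipeline = ICAPeaksPipeline(use_cache=False)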

Feature Dimension Mismatch

Error: RuntimeError: Expected input size ... but got ...

Solutions:

  1. Check the feature dimensions:

    print(sample.feature_dim)
    print(model.input_dim)

  2. Ensure the pipeline is fitted before transforming test data

  3. Verify that the pipeline parameters match between fit and transform
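
The reason fitting must come first is that the feature layout is fixed at fit time and reused for every later transform. The toy class below illustrates that idea only; `ToyPipeline` is a hypothetical stand-in and does not reflect the project's pipeline API.

```python
class ToyPipeline:
    """Hypothetical stand-in for a feature pipeline; not the project's API."""

    def __init__(self):
        self.feature_names = None  # learned in fit()

    def fit(self, samples: list[dict]) -> "ToyPipeline":
        # Fix the feature order once, from the training data only.
        self.feature_names = sorted({key for sample in samples for key in sample})
        return self

    def transform(self, samples: list[dict]) -> list[list[float]]:
        if self.feature_names is None:
            raise RuntimeError("Pipeline must be fitted before transform()")
        # Features absent from a sample are filled with 0.0 so that every
        # row has the width learned during fit().
        return [
            [float(sample.get(name, 0.0)) for name in self.feature_names]
            for sample in samples
        ]


if __name__ == "__main__":
    pipeline = ToyPipeline().fit([{"a": 1, "b": 2}])
    print(pipeline.transform([{"a": 3}]))
```

Because the test sample is transformed with the layout learned from the training data, its row has the same width as the training rows even though one feature is absent.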

Model Issues

Out of Memory

Error: RuntimeError: CUDA out of memory

Solutions:

  1. Reduce the batch size:

    config = {'batch_size': 16}  # Instead of 32

  2. Use gradient accumulation (accumulate gradients over several small batches before each optimizer step)

  3. Enable mixed precision:

    config = {'use_amp': True}

  4. Fall back to the CPU if GPU memory is still insufficient
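
Gradient accumulation can be sketched as follows. The function assumes PyTorch-style objects (a callable model with `train()`, a loss with `backward()`, an optimizer with `step()` and `zero_grad()`); the name `train_epoch_accumulated` and its parameters are illustrative and may not match how the project's trainer is organized.

```python
def train_epoch_accumulated(model, loader, optimizer, loss_fn, accum_steps=4):
    """Run one epoch, stepping the optimizer once every `accum_steps` batches."""
    model.train()
    optimizer.zero_grad()
    pending = 0  # batches whose gradients have not been applied yet
    for inputs, targets in loader:
        # Scale the loss so the accumulated gradient is an average over the group.
        loss = loss_fn(model(inputs), targets) / accum_steps
        loss.backward()  # gradients add up across backward() calls
        pending += 1
        if pending == accum_steps:
            optimizer.step()
            optimizer.zero_grad()
            pending = 0
    if pending:  # apply the gradients of a final, incomplete group
        optimizer.step()
        optimizer.zero_grad()
```

With a batch size of 16 and `accum_steps=2`, each optimizer step sees gradients from 32 samples while only 16 are held in GPU memory at a time. Note that the final incomplete group is scaled by the same `accum_steps`, so its update is slightly smaller.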

NaN Losses

Error: Loss becomes NaN during training

Solutions:

  1. Check the data for NaN/inf values:

    import numpy as np
    print(np.isnan(X).any())
    print(np.isinf(X).any())

  2. Reduce the learning rate:

    config = {'learning_rate': 1e-4}  # Instead of 1e-3

  3. Adjust gradient clipping (assuming gradient_clip is the maximum gradient norm, a lower value clips more aggressively):

    config = {'gradient_clip': 2.0}

  4. Check the model initialization
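
When the data check reports NaN or inf values, it helps to know which feature columns are affected. The following sketch assumes `X` is a 2-D array of shape (samples, features); `nonfinite_columns` is a hypothetical helper.

```python
import numpy as np


def nonfinite_columns(X) -> dict[int, int]:
    """Map column index to its count of NaN/inf entries, for affected columns only."""
    X = np.asarray(X, dtype=float)
    bad_counts = (~np.isfinite(X)).sum(axis=0)
    return {int(col): int(count) for col, count in enumerate(bad_counts) if count > 0}


if __name__ == "__main__":
    X = np.array([[1.0, np.nan], [np.inf, 2.0], [3.0, 4.0]])
    print(nonfinite_columns(X))
```

An empty result means every value is finite, in which case the learning rate and clipping settings are the more likely cause.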

Model Not Learning

Issue: Loss doesn't decrease

Solutions:

  1. Check the learning rate (it may be too high or too low)

  2. Verify that the data is normalized:

    pipeline = SummarySetPipeline(normalize=True)

  3. Check the model capacity (it may be too small)

  4. Verify data quality and labels
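
One way to verify normalization is to inspect the per-feature statistics of the transformed data. The sketch below assumes standard scaling (zero mean, unit variance per feature), which may not be what `normalize=True` does in this project; `feature_scale_summary` is a hypothetical helper.

```python
import numpy as np


def feature_scale_summary(X) -> dict[str, float]:
    """Summarize per-feature mean and standard deviation of a 2-D feature array."""
    X = np.asarray(X, dtype=float)
    means = X.mean(axis=0)
    stds = X.std(axis=0)
    return {
        "max_abs_mean": float(np.abs(means).max()),  # near 0 if standardized
        "min_std": float(stds.min()),                # near 1 if standardized
        "max_std": float(stds.max()),                # near 1 if standardized
    }


if __name__ == "__main__":
    print(feature_scale_summary([[-1.0, -10.0], [1.0, 10.0]]))
```

A large gap between `min_std` and `max_std` suggests that some features are on very different scales and normalization was not applied.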

Training Issues

Early Stopping Too Early

Issue: Training stops too early

Solutions:

  1. Increase the patience:

    config = {'early_stopping_patience': 50}  # Instead of 20

  2. Check whether the validation loss is actually improving

  3. Verify that the validation set is representative
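
To see why a low patience can stop training prematurely, here is a minimal sketch of a typical patience counter. `EarlyStopper` and `min_delta` are assumptions for illustration; the project's own early-stopping logic may differ.

```python
class EarlyStopper:
    """Hypothetical patience counter; the project's implementation may differ."""

    def __init__(self, patience: int, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta  # smallest decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


if __name__ == "__main__":
    stopper = EarlyStopper(patience=2)
    for epoch, loss in enumerate([1.0, 0.9, 0.95, 0.91, 0.8], start=1):
        print(epoch, loss, stopper.should_stop(loss))
```

In this run the counter signals a stop at epoch 4, even though epoch 5 would have reached a new best loss. A noisy validation curve with a small patience behaves the same way.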

Slow Training

Issue: Training is very slow

Solutions:

  1. Use the GPU if available:

    import torch
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

  2. Enable mixed precision:

    config = {'use_amp': True}

  3. Increase the batch size (if memory allows)

  4. Use a faster model (e.g., LightGBM for baselines)

Tracking Issues

TensorBoard Not Showing Data

Issue: TensorBoard shows no data

Solutions:

  1. Check that TensorBoard points at the right log directory:

    tensorboard --logdir artifacts/runs

  2. Verify that event files exist:

    ls artifacts/runs/*/tensorboard/

  3. Refresh or restart TensorBoard
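
The file check can also be done from Python, which works the same way on every platform. TensorBoard event files have names starting with `events.out.tfevents.`; the helper name `find_event_files` is an assumption for this sketch.

```python
from pathlib import Path


def find_event_files(logdir: str) -> list[Path]:
    """Return non-empty TensorBoard event files found anywhere under logdir."""
    return sorted(
        path
        for path in Path(logdir).rglob("events.out.tfevents.*")
        if path.is_file() and path.stat().st_size > 0
    )


if __name__ == "__main__":
    for path in find_event_files("artifacts/runs"):
        print(path)
```

If nothing is printed, either no run has written logs yet or `--logdir` points at the wrong directory.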

MLflow Connection Issues

Error: MLflowException: Unable to connect

Solutions:

  1. Check the MLflow tracking URI:

    tracker = MLflowTracker(tracking_uri="file:./artifacts/mlruns")

  2. For a remote MLflow server, check network connectivity

  3. Verify that the MLflow server is running (if using a remote server)
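
A quick way to separate URI problems from network problems is to parse the tracking URI and, for HTTP(S) servers, attempt a plain TCP connection. This sketch uses only the standard library and does not call MLflow itself; `check_tracking_uri` is a hypothetical helper, and a successful TCP connection only shows that something is listening, not that it is an MLflow server.

```python
import socket
from urllib.parse import urlparse


def check_tracking_uri(uri: str, timeout: float = 3.0) -> str:
    """Describe a tracking URI and probe HTTP(S) servers with a TCP connection."""
    parsed = urlparse(uri)
    if parsed.scheme in ("http", "https"):
        port = parsed.port or (443 if parsed.scheme == "https" else 80)
        try:
            with socket.create_connection((parsed.hostname, port), timeout=timeout):
                return f"remote: {parsed.hostname}:{port} is reachable"
        except OSError as exc:
            return f"remote: cannot reach {parsed.hostname}:{port} ({exc})"
    if parsed.scheme in ("", "file"):
        return f"local store: {parsed.path}"
    # Other schemes (for example database URIs) are not probed by this sketch.
    return f"not checked: scheme {parsed.scheme!r}"


if __name__ == "__main__":
    print(check_tracking_uri("file:./artifacts/mlruns"))
```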

Getting Help

If you encounter issues not covered here:

  1. Check Data Issues for data-specific problems
  2. Check Training Issues for training problems
  3. Check FAQ for frequently asked questions
  4. Open an issue on GitHub with:
    - Error message
    - Steps to reproduce
    - Environment details (Python version, OS, etc.)
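
The environment details for an issue report can be collected with a short script. This is a standard-library sketch; `environment_summary` is a made-up name, and package versions (for example PyTorch) would need to be added by hand.

```python
import platform
import sys


def environment_summary() -> dict[str, str]:
    """Collect basic environment details to paste into an issue report."""
    return {
        "python": sys.version.split()[0],
        "os": platform.platform(),
        "machine": platform.machine(),
    }


if __name__ == "__main__":
    for key, value in environment_summary().items():
        print(f"{key}: {value}")
```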