Experiment Tracking

BatteryML supports multiple experiment tracking backends for logging metrics, hyperparameters, and artifacts.

Available Trackers

Tracker         Description           Best For
LocalTracker    JSON + TensorBoard    Local development
MLflowTracker   MLflow server         Team collaboration
DualTracker     Both local + MLflow   Recommended default

The DualTracker logs to both local files and MLflow simultaneously:

from src.tracking.dual_tracker import DualTracker

tracker = DualTracker(
    local_base_dir="artifacts/runs",
    use_tensorboard=True,
    mlflow_tracking_uri="file:./artifacts/mlruns",
    mlflow_experiment_name="battery_degradation"
)

Benefits

  • Local files: Fast access, no server needed
  • TensorBoard: Real-time training visualization
  • MLflow: Experiment comparison and management
  • Redundancy: Data saved in multiple formats

Basic Usage

Starting a Run

run_id = tracker.start_run(
    run_name="experiment_001",
    config={
        'model': 'mlp',
        'learning_rate': 1e-3,
        'batch_size': 32,
        # ... other hyperparameters
    }
)

Logging Metrics

# Single metric
tracker.log_metric("train_loss", 0.05, step=epoch)

# Multiple metrics
tracker.log_metrics({
    'train_rmse': 0.05,
    'val_rmse': 0.04,
    'train_mae': 0.03,
    'val_mae': 0.025
}, step=epoch)
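
A minimal sketch of where these calls typically sit in a training loop (train_one_epoch and evaluate are hypothetical helpers, not part of the BatteryML API):

for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, train_loader)  # hypothetical helper
    val_loss = evaluate(model, val_loader)             # hypothetical helper

    # Log both metrics against the same step so the curves align in TensorBoard
    tracker.log_metrics({
        'train_loss': train_loss,
        'val_loss': val_loss,
    }, step=epoch)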

Logging Artifacts

# Save model checkpoint
tracker.log_artifact("checkpoints/best.pt", "model.pt")

# Save plots
tracker.log_artifact("plots/attention.png", "attention_heatmap.png")

Ending a Run

tracker.end_run()
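
If training can raise, a try/finally keeps the run from being left open (a sketch against the tracker API shown above; run_training is a hypothetical stand-in for your training code):

run_id = tracker.start_run(run_name="experiment_001", config=config)
try:
    run_training()  # hypothetical: any code that logs metrics/artifacts
finally:
    tracker.end_run()  # close the run even if training fails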

Local Tracker

For local development without MLflow:

from src.tracking.local import LocalTracker

tracker = LocalTracker(
    base_dir="artifacts/runs",
    use_tensorboard=True
)

Output Structure

artifacts/runs/
└── {run_id}/
    ├── config.json          # Hyperparameters
    ├── metrics.json         # Metrics history
    ├── artifacts/           # Saved files
    └── tensorboard/         # TensorBoard logs
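
Because the layout is plain files, past runs can be enumerated without any tracking server. A sketch, assuming the structure above and the config keys used earlier on this page:

import json
from pathlib import Path

for run_dir in sorted(Path("artifacts/runs").iterdir()):
    config_path = run_dir / "config.json"
    if config_path.exists():
        config = json.loads(config_path.read_text())
        print(run_dir.name, config.get('model'))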

MLflow Tracker

For team collaboration and experiment management:

from src.tracking.mlflow_tracker import MLflowTracker

tracker = MLflowTracker(
    tracking_uri="file:./artifacts/mlruns",  # or remote URI
    experiment_name="battery_degradation"
)

Remote MLflow Server

tracker = MLflowTracker(
    tracking_uri="http://mlflow-server:5000",
    experiment_name="battery_degradation"
)
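
To switch between local and remote tracking without code changes, one option is to read MLflow's standard MLFLOW_TRACKING_URI environment variable and fall back to file storage (a sketch, not part of the BatteryML API):

import os

tracker = MLflowTracker(
    tracking_uri=os.environ.get("MLFLOW_TRACKING_URI", "file:./artifacts/mlruns"),
    experiment_name="battery_degradation"
)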

Viewing Results

TensorBoard

View training curves in real-time:

tensorboard --logdir artifacts/runs

Open a browser to http://localhost:6006.

Features:

  • Loss curves
  • Learning rate schedule
  • Metric plots
  • Histograms (if logged)

MLflow UI

Compare experiments:

mlflow ui --backend-store-uri file:./artifacts/mlruns

Open a browser to http://localhost:5000.

Features:

  • Experiment comparison
  • Parameter search
  • Metric visualization
  • Artifact browsing
  • Model registry

Integration with Trainer

The Trainer automatically logs metrics:

trainer = Trainer(model, config, tracker)
history = trainer.fit(train_samples, val_samples)
# Metrics logged automatically each epoch
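
Putting the pieces together, a complete tracked run reduces to a few lines (a sketch; model, config, and the sample lists are assumed to be prepared as elsewhere in these docs):

tracker.start_run(run_name="mlp_baseline", config=config)
trainer = Trainer(model, config, tracker)
history = trainer.fit(train_samples, val_samples)  # per-epoch metrics logged by the Trainer
tracker.log_artifact("checkpoints/best.pt", "model.pt")
tracker.end_run()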

Best Practices

1. Consistent Naming

Use descriptive run names:

run_name = f"{model_name}_{pipeline_name}_{split_strategy}_{timestamp}"
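
The timestamp can come from the standard library; a minimal sketch, where the other name components are assumed to be plain strings:

from datetime import datetime

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
run_name = f"{model_name}_{pipeline_name}_{split_strategy}_{timestamp}"
# e.g. "mlp_summary_set_random_20240101_120000"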

2. Log Everything

Log all hyperparameters and metrics:

config = {
    'model': 'mlp',
    'model.hidden_dims': [64, 32],
    'pipeline': 'summary_set',
    'pipeline.include_arrhenius': True,
    'training.learning_rate': 1e-3,
    'training.batch_size': 32,
    # ... everything
}

3. Version Control

Tag important runs:

tracker.set_tag("status", "baseline")
tracker.set_tag("dataset_version", "v1.0")

4. Artifact Management

Save important artifacts:

# Model checkpoints
torch.save(model.state_dict(), "checkpoint.pt")
tracker.log_artifact("checkpoint.pt", "best_model.pt")

# Plots
fig.savefig("plot.png")
tracker.log_artifact("plot.png", "feature_importance.png")

5. Experiment Organization

Use MLflow experiments to organize runs:

# Create experiment
tracker.create_experiment("temperature_generalization")

# Set experiment
tracker.set_experiment("temperature_generalization")

Querying Results

Local Tracker

import json

# Load metrics for a given run_id (as returned by start_run)
with open(f"artifacts/runs/{run_id}/metrics.json") as f:
    metrics = json.load(f)
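
The same files support quick cross-run comparison. A sketch, assuming metrics.json maps each metric name to a list of [step, value] pairs — adjust the indexing to whatever schema LocalTracker actually writes:

import json
from pathlib import Path

results = {}
for run_dir in Path("artifacts/runs").iterdir():
    metrics_path = run_dir / "metrics.json"
    if metrics_path.exists():
        metrics = json.loads(metrics_path.read_text())
        if "val_rmse" in metrics:
            results[run_dir.name] = metrics["val_rmse"][-1][1]  # last logged value

best = min(results, key=results.get)
print(f"Best run: {best} (val_rmse={results[best]:.4f})")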

MLflow

import mlflow

# Search runs
runs = mlflow.search_runs(
    experiment_ids=[experiment_id],
    filter_string="metrics.val_rmse < 0.05"
)

# Get best run
best_run = runs.loc[runs['metrics.val_rmse'].idxmin()]
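
From there, the best run's artifacts can be pulled down with MLflow's artifact API (mlflow.artifacts.download_artifacts is available in recent MLflow versions; "model.pt" is the artifact name used when it was logged):

local_path = mlflow.artifacts.download_artifacts(
    run_id=best_run['run_id'],
    artifact_path="model.pt",
    dst_path="downloaded"
)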

Troubleshooting

TensorBoard Not Showing Data

  • Check path: Ensure --logdir points to correct directory
  • Refresh: TensorBoard may need a manual refresh to pick up new events
  • Check logs: Verify files exist in directory

MLflow Connection Issues

  • Check URI: Verify tracking URI is correct
  • Network: Check network connectivity for remote servers
  • Permissions: Ensure write permissions for file-based storage

Missing Metrics

  • Check logging: Ensure log_metrics is called
  • Step numbers: Verify step numbers increase monotonically
  • Run active: Ensure run is started before logging

Next Steps