# Architecture Overview
BatteryML is designed with modularity, extensibility, and reproducibility in mind. This document provides a high-level overview of the system architecture.
## System Architecture

```mermaid
graph TB
    subgraph DataLayer[Data Layer]
        RawData[Raw CSV Files]
        DataLoader[Data Loaders]
        Splits[Split Strategies]
    end
    subgraph PipelineLayer[Pipeline Layer]
        Pipelines[Feature Pipelines]
        Cache[Hash-Based Cache]
        Sample[Sample Objects]
    end
    subgraph ModelLayer[Model Layer]
        Models[ML Models]
        Registry[Model Registry]
    end
    subgraph TrainingLayer[Training Layer]
        Trainer[Trainer]
        Metrics[Metrics]
        Callbacks[Callbacks]
    end
    subgraph TrackingLayer[Tracking Layer]
        LocalTracker[Local Tracker]
        MLflowTracker[MLflow Tracker]
        DualTracker[Dual Tracker]
    end

    RawData --> DataLoader
    DataLoader --> Splits
    Splits --> Pipelines
    Pipelines --> Cache
    Cache --> Sample
    Sample --> Models
    Models --> Registry
    Models --> Trainer
    Trainer --> Metrics
    Trainer --> Callbacks
    Trainer --> LocalTracker
    Trainer --> MLflowTracker
    LocalTracker --> DualTracker
    MLflowTracker --> DualTracker
```
## Component Overview

### Data Layer

- **Data Loaders**: Load CSV files from experiments
- **Unit Conversion**: Normalize units (mAh → Ah, °C → K)
- **Split Strategies**: Temperature holdout, LOCO, temporal splits
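The unit conversions above are simple affine transforms. A minimal sketch (function names are illustrative, not BatteryML's actual API):

```python
def mah_to_ah(capacity_mah: float) -> float:
    """Convert capacity from milliamp-hours to amp-hours."""
    return capacity_mah / 1000.0


def celsius_to_kelvin(temp_c: float) -> float:
    """Convert temperature from degrees Celsius to Kelvin."""
    return temp_c + 273.15
```

Normalizing units at load time means every downstream component can assume SI-consistent values.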
### Pipeline Layer

- **Feature Extraction**: Transform raw data to features
- **Caching**: Cache expensive computations (e.g., ICA)
- **Sample Schema**: Universal data format

### Model Layer

- **Model Zoo**: LightGBM, MLP, LSTM, Neural ODE
- **Registry Pattern**: Extensible model registration
- **Base Interface**: Consistent model API
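A consistent model API typically means every model implements the same small contract. A sketch of what such a base interface might look like, with a trivial implementation to show the shape; the class and method names here are assumptions, not BatteryML's actual definitions:

```python
from abc import ABC, abstractmethod

import numpy as np


class BaseModel(ABC):
    """Hypothetical base interface: every model exposes fit/predict."""

    @abstractmethod
    def fit(self, X: np.ndarray, y: np.ndarray) -> "BaseModel":
        ...

    @abstractmethod
    def predict(self, X: np.ndarray) -> np.ndarray:
        ...


class MeanBaseline(BaseModel):
    """Trivial model used only to illustrate the contract."""

    def fit(self, X: np.ndarray, y: np.ndarray) -> "BaseModel":
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        return np.full(len(X), self.mean_)
```

Because LightGBM, MLP, LSTM, and Neural ODE models all sit behind one interface, the trainer and tracker never need model-specific branches.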
### Training Layer

- **Trainer**: Training loop with AMP, early stopping
- **Metrics**: RMSE, MAE, MAPE, R²
- **Callbacks**: Checkpointing, scheduling
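The four metrics listed above are standard regression metrics. A self-contained sketch of how they are computed (the real metrics module may differ):

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Compute RMSE, MAE, MAPE (%), and R² for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred

    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    mape = float(np.mean(np.abs(err / y_true)) * 100.0)  # assumes y_true != 0

    ss_res = float(np.sum(err ** 2))                      # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot

    return {"rmse": rmse, "mae": mae, "mape": mape, "r2": r2}
```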
### Tracking Layer

- **Local**: JSON + TensorBoard
- **MLflow**: Experiment management
- **Dual**: Combined tracking
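The dual tracker is essentially a composite that fans each logging call out to both backends. A minimal sketch, assuming a simple `log_metric` interface (the real tracker API may differ):

```python
class DualTracker:
    """Composite tracker: forwards every call to all wrapped backends.

    Illustrative only; BatteryML's actual tracker interface is assumed.
    """

    def __init__(self, *trackers):
        self.trackers = trackers

    def log_metric(self, name: str, value: float, step: int = 0) -> None:
        for tracker in self.trackers:
            tracker.log_metric(name, value, step)


class ListTracker:
    """Stand-in backend that records metrics in memory, for demonstration."""

    def __init__(self):
        self.records = []

    def log_metric(self, name: str, value: float, step: int = 0) -> None:
        self.records.append((name, value, step))
```

With this shape, swapping local-only, MLflow-only, or combined tracking is just a matter of which tracker object the trainer receives.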
## Design Principles

- **Modularity**: Components are independent and composable
- **Extensibility**: Easy to add new pipelines/models
- **Reproducibility**: Hash-based caching, config management
- **Type Safety**: Pydantic validation, type hints
- **Documentation**: Comprehensive docstrings and guides
## Data Flow

```mermaid
sequenceDiagram
    participant User
    participant DataLoader
    participant Pipeline
    participant Cache
    participant Model
    participant Trainer
    participant Tracker

    User->>DataLoader: Load CSV files
    DataLoader->>Pipeline: Raw DataFrame
    Pipeline->>Cache: Check cache
    Cache-->>Pipeline: Cached or compute
    Pipeline->>User: Sample objects
    User->>Model: Initialize model
    User->>Trainer: Create trainer
    Trainer->>Model: Forward pass
    Model-->>Trainer: Predictions
    Trainer->>Tracker: Log metrics
    Tracker-->>User: Results
```
## Key Design Patterns

### Registry Pattern

The Registry Pattern decouples configuration from implementation, allowing components to be discovered and instantiated by name at runtime.
```mermaid
sequenceDiagram
    participant Config as Experiment Config
    participant Reg as Component Registry
    participant Base as Base Class
    participant Conc as Concrete Implementation

    Conc->>Reg: @Registry.register("name")
    Config->>Reg: Registry.get("name", params)
    Reg->>Conc: Instantiate(**params)
    Conc-->>Reg: Component Instance
    Reg-->>Config: Instance
```
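The interaction in the diagram can be sketched in a few lines; the `Registry` class and the `"mlp"` component below are illustrative stand-ins, not BatteryML's actual names:

```python
class Registry:
    """Maps string names to component classes for config-driven instantiation."""

    _components: dict = {}

    @classmethod
    def register(cls, name: str):
        """Decorator: register a concrete implementation under a name."""
        def decorator(component_cls: type) -> type:
            cls._components[name] = component_cls
            return component_cls
        return decorator

    @classmethod
    def get(cls, name: str, **params):
        """Look up a registered class by name and instantiate it with params."""
        return cls._components[name](**params)


@Registry.register("mlp")
class MLPModel:
    """Hypothetical concrete implementation registered under "mlp"."""

    def __init__(self, hidden_size: int = 64):
        self.hidden_size = hidden_size
```

An experiment config can then refer to components purely by name, e.g. `model = Registry.get("mlp", hidden_size=128)`, without importing the concrete class.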
### Sample Schema

```mermaid
graph TB
    RawData[Raw Data] --> Pipeline[Pipeline]
    Pipeline --> Sample[Sample Object]
    Sample --> Model[Model]
    Sample --> Split[Split Strategy]
    Sample --> Tracker[Tracker]
```
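A universal sample schema gives models, split strategies, and trackers one shared record type. A dataclass-based sketch of the idea (the real schema is built on Pydantic per the design principles, and its actual field names are not shown here, so everything below is an assumption):

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sample:
    """Hypothetical universal record produced by pipelines.

    Field names are illustrative, not BatteryML's actual schema.
    """

    cell_id: str
    cycle_number: int
    temperature_k: float          # normalized to Kelvin by the data layer
    capacity_ah: float            # normalized to Ah by the data layer
    features: dict = field(default_factory=dict)

    def __post_init__(self):
        # Basic validation, mirroring what a Pydantic model would enforce.
        if self.temperature_k <= 0:
            raise ValueError("temperature must be in Kelvin (> 0)")
```

Because every consumer sees the same immutable record, pipelines can evolve internally without breaking models or splits.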
### Caching Strategy

```mermaid
graph TB
    Request[Pipeline Request] --> Check{Check Cache}
    Check -->|Hit| Load[Load from Cache]
    Check -->|Miss| Compute[Compute Features]
    Compute --> Save[Save to Cache]
    Load --> Return[Return Samples]
    Save --> Return
```
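Hash-based caching keys each request on a deterministic digest of the pipeline configuration, so identical configs hit the cache and changed configs recompute. A self-contained sketch of the hit/miss flow above (function names and the on-disk format are assumptions):

```python
import hashlib
import json
import pickle
from pathlib import Path


def cache_key(config: dict) -> str:
    """Derive a deterministic key by hashing the sorted config."""
    blob = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


def cached_compute(config: dict, compute, cache_dir: Path):
    """Return cached features on a hit; compute and store them on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(config)}.pkl"
    if path.exists():                        # cache hit
        return pickle.loads(path.read_bytes())
    result = compute(config)                 # cache miss: compute features
    path.write_bytes(pickle.dumps(result))   # persist for the next run
    return result
```

Sorting keys before hashing ensures that two configs with the same contents but different insertion order map to the same cache entry, which is what makes the scheme reproducible.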
## Extension Points

The architecture provides several extension points:

- **Pipelines**: Add new feature extraction methods
- **Models**: Add new model architectures
- **Splits**: Add new data splitting strategies
- **Trackers**: Add new tracking backends
- **Metrics**: Add new evaluation metrics
## Next Steps

- Data Flow - Detailed data flow documentation
- Design Patterns - Design pattern deep dive
- Pipeline System - Pipeline architecture
- Model System - Model architecture