Pipelines

Pipelines transform raw data (DataFrames, arrays) into Sample objects that models can consume. This guide covers all available pipelines and how to use them.

Overview

Pipelines follow a fit/transform pattern similar to scikit-learn:

pipeline = SomePipeline(param1=value1)
train_samples = pipeline.fit_transform({'df': train_df})
test_samples = pipeline.transform({'df': test_df})  # Uses fitted scalers

SummarySetPipeline

Extracts features from Performance Summary CSV files.

Features

  • Cumulative Charge Throughput: Total charge capacity in Ah
  • Cumulative Discharge Throughput: Total discharge capacity in Ah
  • 0.1s Resistance: Fast resistance measurement (Ohms)
  • 10s Resistance: Slow resistance measurement (Ohms)
  • Temperature (K): Temperature in Kelvin
  • Arrhenius Factor: exp(-Ea/RT) for temperature effects
  • Inverse Temperature: 1000/T for linearization
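
The two temperature-derived features follow directly from the Kelvin temperature column. A minimal sketch of how they could be computed, assuming an illustrative column name and the default Ea of 50000 J/mol (not necessarily the pipeline's internal implementation):

import numpy as np
import pandas as pd

R = 8.314  # universal gas constant, J/(mol*K)

def add_temperature_features(df: pd.DataFrame, Ea: float = 50000.0) -> pd.DataFrame:
    """Derive Arrhenius factor and inverse temperature from a Kelvin temperature column."""
    out = df.copy()
    T = out['Temperature (K)']                       # assumed column name
    out['Arrhenius Factor'] = np.exp(-Ea / (R * T))  # exp(-Ea/RT)
    out['Inverse Temperature'] = 1000.0 / T          # 1000/T for linearization
    return out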

Usage

from src.pipelines.summary_set import SummarySetPipeline

pipeline = SummarySetPipeline(
    include_arrhenius=True,
    arrhenius_Ea=50000.0,  # J/mol
    normalize=True
)

samples = pipeline.fit_transform({'df': df})

Parameters

  • include_arrhenius (bool): Include Arrhenius temperature features
  • arrhenius_Ea (float): Activation energy in J/mol (default: 50000.0)
  • normalize (bool): Apply StandardScaler normalization

When to Use

  • Fast baseline experiments
  • When summary statistics are sufficient
  • For initial model development

ICAPeaksPipeline

Extracts dQ/dV (Incremental Capacity Analysis) peak features from voltage curves.

Features

For each detected peak:

  • Peak Voltage: Position of the peak (V)
  • Peak Height: Magnitude of dQ/dV at the peak
  • Peak Width: Full-width at half-maximum (FWHM)
  • Peak Area: Integrated area under the peak

Additional features:

  • Total Area: Total integrated area of the dQ/dV curve
  • Number of Peaks: Count of detected peaks
  • Voltage at Max dQ/dV: Voltage at the maximum dQ/dV value
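
These features are derived from a smoothed dQ/dV curve. A minimal sketch of the kind of computation involved, using SciPy's savgol_filter, find_peaks, and peak_widths; the function name and details here are illustrative, not the pipeline's internals, and the curve is assumed to already be restricted to voltage_range and resampled on a monotonic voltage grid:

import numpy as np
from scipy.signal import savgol_filter, find_peaks, peak_widths

def ica_peak_features(voltage: np.ndarray, capacity: np.ndarray,
                      sg_window: int = 51, sg_order: int = 3, num_peaks: int = 3):
    """Compute dQ/dV and extract per-peak voltage, height, width, and area."""
    # Differentiate capacity with respect to voltage, then smooth the noisy derivative
    dqdv = np.gradient(capacity, voltage)
    dqdv = savgol_filter(dqdv, window_length=sg_window, polyorder=sg_order)

    # Keep the num_peaks most prominent peaks
    idx, props = find_peaks(dqdv, prominence=0.0)
    idx = np.sort(idx[np.argsort(props['prominences'])[::-1][:num_peaks]])

    # FWHM bounds (in sample indices) for each kept peak
    _, _, left, right = peak_widths(dqdv, idx, rel_height=0.5)
    features = []
    for i, l, r in zip(idx, left.astype(int), right.astype(int)):
        r = min(r, len(voltage) - 1)
        features.extend([
            voltage[i],                                 # Peak Voltage (V)
            dqdv[i],                                    # Peak Height
            voltage[r] - voltage[l],                    # Peak Width, FWHM (V)
            np.trapz(dqdv[l:r + 1], voltage[l:r + 1]),  # Peak Area (over FWHM window)
        ])
    return np.array(features)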

Usage

from src.pipelines.ica_peaks import ICAPeaksPipeline

pipeline = ICAPeaksPipeline(
    sg_window=51,      # Savitzky-Golay window (must be odd)
    sg_order=3,        # Polynomial order
    num_peaks=3,       # Number of peaks to extract
    voltage_range=(3.0, 4.2),
    normalize=True,
    use_cache=True     # Cache expensive computations
)

samples = pipeline.fit_transform({
    'curves': voltage_curves,
    'targets': capacity_targets
})

Parameters

  • sg_window (int): Savitzky-Golay smoothing window (must be odd, default: 51)
  • sg_order (int): Polynomial order for smoothing (default: 3)
  • num_peaks (int): Number of peaks to extract features for (default: 3)
  • voltage_range (tuple): Voltage range for analysis (default: (3.0, 4.2))
  • resample_points (int): Points to resample curves to (default: 500)
  • normalize (bool): Apply StandardScaler (default: True)
  • use_cache (bool): Cache computed features (default: True)

ICA Theory

ICA features are highly diagnostic for degradation mechanisms:

  • Peak Shifts: Indicate Loss of Lithium Inventory (LLI)
  • Peak Height Changes: Indicate Loss of Active Material (LAM)
  • Peak Width Changes: Indicate kinetic degradation / impedance rise

See ICA Analysis Theory for more details.

Caching

ICA computation is expensive. The pipeline automatically caches results:

# First run: computes and caches
samples1 = pipeline.fit_transform({'curves': curves, 'targets': targets})

# Second run: loads from cache (much faster)
samples2 = pipeline.fit_transform({'curves': curves, 'targets': targets})

Cache is invalidated if pipeline parameters change.
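
A common way to implement this invalidation is to key the cache on a hash of the pipeline's parameters, so any change in configuration maps to a different cache entry. A minimal sketch of that idea (the pipeline's actual cache layout may differ):

import hashlib
import json

def cache_key(params: dict) -> str:
    """Derive a stable cache key from pipeline parameters.

    Sorting keys makes the JSON representation deterministic, so identical
    parameters always hash to the same key and any change yields a new entry.
    """
    blob = json.dumps(params, sort_keys=True, default=str)
    return hashlib.sha256(blob.encode('utf-8')).hexdigest()

key = cache_key({'sg_window': 51, 'sg_order': 3, 'num_peaks': 3,
                 'voltage_range': (3.0, 4.2)})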

When to Use

  • Degradation mechanism analysis
  • When voltage curve data is available
  • For interpretable features (SHAP analysis)

LatentODESequencePipeline

Creates time-series sequences with explicit time vectors for Neural ODE models.

Features

  • Sequential Features: Time-series of summary statistics
  • Time Vector: Explicit time values for ODE integration
  • Variable Length: Supports variable-length sequences with masking

Usage

from src.pipelines.latent_ode_seq import LatentODESequencePipeline

pipeline = LatentODESequencePipeline(
    time_unit="days",      # or "throughput_Ah"
    max_seq_len=50,        # Maximum sequence length
    normalize=True
)

# One sample per cell (entire degradation trajectory)
samples = pipeline.fit_transform({'df': df})

Parameters

  • time_unit (str): Time unit, either "days" or "throughput_Ah" (default: "days")
  • max_seq_len (int): Maximum sequence length (default: 50)
  • normalize (bool): Apply StandardScaler (default: True)

Output Format

Each sample represents one cell's degradation trajectory:

sample.x.shape  # (seq_len, feature_dim)
sample.t.shape  # (seq_len,) - time vector
sample.mask.shape  # (seq_len,) - boolean mask for valid steps
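
A minimal sketch of how a variable-length trajectory could be padded and masked to max_seq_len before becoming such a sample (the helper name is illustrative, not the pipeline's internal API):

import numpy as np

def pad_trajectory(x: np.ndarray, t: np.ndarray, max_seq_len: int = 50):
    """Pad (seq_len, feature_dim) features and (seq_len,) times, returning a mask.

    Steps beyond the true sequence length are zero-padded and masked out so the
    ODE solver and loss can ignore them.
    """
    seq_len, feature_dim = x.shape
    n = min(seq_len, max_seq_len)

    x_pad = np.zeros((max_seq_len, feature_dim), dtype=x.dtype)
    t_pad = np.zeros(max_seq_len, dtype=t.dtype)
    mask = np.zeros(max_seq_len, dtype=bool)

    x_pad[:n] = x[:n]
    t_pad[:n] = t[:n]
    mask[:n] = True
    return x_pad, t_pad, mask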

When to Use

  • Neural ODE models
  • Continuous-time degradation modeling
  • When temporal dynamics are important

Creating Custom Pipelines

See Custom Pipeline Guide for step-by-step instructions.

Pipeline Interface

All pipelines must inherit from BasePipeline:

from typing import Any, Dict, List

from src.pipelines.base import BasePipeline
from src.pipelines.sample import Sample
from src.pipelines.registry import PipelineRegistry

@PipelineRegistry.register("my_pipeline")
class MyPipeline(BasePipeline):
    def fit(self, data: Dict[str, Any]) -> 'BasePipeline':
        # Fit scalers, compute statistics, etc.
        return self

    def transform(self, data: Dict[str, Any]) -> List[Sample]:
        # Transform to Sample objects
        samples = []
        # ... create samples ...
        return samples

    def get_feature_names(self) -> List[str]:
        # Return feature names for interpretability
        return ['feature1', 'feature2', ...]

Pipeline Registry

List available pipelines:

from src.pipelines.registry import PipelineRegistry

available = PipelineRegistry.list_available()
print(available)  # ['summary_set', 'ica_peaks', 'latent_ode_seq']

Get pipeline by name:

pipeline = PipelineRegistry.get("summary_set", include_arrhenius=True)

Best Practices

  1. Always fit on training data first: Use fit_transform on training, transform on test
  2. Use caching for expensive pipelines: Enable use_cache=True for ICA pipelines
  3. Normalize features: Most models benefit from normalized features
  4. Check feature names: Use get_feature_names() for interpretability
  5. Validate samples: Check that sample.feature_dim and sample.seq_len match expectations (see the sketch below)
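
For the last point, a minimal validation sketch, reusing the samples and pipeline objects from the earlier examples (adjust the assertions to your model's expectations):

from typing import List, Optional

from src.pipelines.sample import Sample

def validate_samples(samples: List[Sample], expected_feature_dim: int,
                     max_seq_len: Optional[int] = None) -> None:
    """Sanity-check pipeline output before training."""
    assert len(samples) > 0, "Pipeline produced no samples"
    for i, s in enumerate(samples):
        assert s.feature_dim == expected_feature_dim, (
            f"Sample {i}: feature_dim {s.feature_dim} != {expected_feature_dim}")
        if max_seq_len is not None:
            assert s.seq_len <= max_seq_len, (
                f"Sample {i}: seq_len {s.seq_len} > max_seq_len {max_seq_len}")

validate_samples(samples, expected_feature_dim=len(pipeline.get_feature_names()))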

Next Steps