Data Flow¶
This document describes the complete data flow from raw CSV files to model predictions.
High-Level Flow¶
flowchart TD
Start[Raw CSV Files] --> Load[Data Loader]
Load --> Normalize[Unit Normalization]
Normalize --> Split[Data Splitting]
Split --> Pipeline[Feature Pipeline]
Pipeline --> Cache{Cache Check}
Cache -->|Hit| LoadCache[Load from Cache]
Cache -->|Miss| Compute[Compute Features]
Compute --> SaveCache[Save to Cache]
LoadCache --> Sample[Sample Objects]
SaveCache --> Sample
Sample --> Model[Model Training]
Model --> Predict[Predictions]
Predict --> Metrics[Evaluation Metrics]
Metrics --> Track[Experiment Tracking]
Detailed Data Flow¶
1. Data Loading¶
sequenceDiagram
participant User
participant ExperimentPaths
participant SummaryDataLoader
participant UnitConverter
participant DataFrame
User->>ExperimentPaths: Get paths for experiment
ExperimentPaths-->>User: File paths
User->>SummaryDataLoader: Load CSV
SummaryDataLoader->>DataFrame: Read CSV
SummaryDataLoader->>UnitConverter: Normalize units
UnitConverter-->>SummaryDataLoader: Normalized DataFrame
SummaryDataLoader-->>User: DataFrame with metadata
2. Feature Extraction¶
sequenceDiagram
participant User
participant Pipeline
participant Cache
participant FeatureExtractor
participant Sample
User->>Pipeline: fit_transform(df)
Pipeline->>Cache: Check cache key
Cache-->>Pipeline: Cache miss
Pipeline->>FeatureExtractor: Extract features
FeatureExtractor-->>Pipeline: Feature arrays
Pipeline->>Sample: Create Sample objects
Sample-->>User: List of Samples
Pipeline->>Cache: Save to cache
3. Model Training¶
sequenceDiagram
participant User
participant Trainer
participant DataLoader
participant Model
participant Optimizer
participant Tracker
User->>Trainer: fit(train_samples, val_samples)
Trainer->>DataLoader: Create DataLoader
loop Each Epoch
DataLoader->>Model: Batch of Samples
Model->>Model: Forward pass
Model-->>Trainer: Predictions
Trainer->>Optimizer: Backward pass
Optimizer->>Model: Update weights
Trainer->>Tracker: Log metrics
end
Trainer-->>User: Training history
Sample Object Flow¶
graph TB
RawData[Raw Data] --> Pipeline[Pipeline]
Pipeline --> Sample1[Sample: meta, x, y]
Sample1 --> Split[Split Strategy]
Split --> TrainSamples[Train Samples]
Split --> ValSamples[Val Samples]
TrainSamples --> Model[Model]
ValSamples --> Model
Model --> Predictions[Predictions]
Cache Flow¶
flowchart TD
Request[Pipeline Request] --> Hash[Hash Parameters]
Hash --> Key[Cache Key]
Key --> Check{File Exists?}
Check -->|Yes| Load[Load Pickle]
Check -->|No| Compute[Compute Features]
Load --> Validate{Valid?}
Validate -->|Yes| Return[Return Cached]
Validate -->|No| Compute
Compute --> Save[Save Pickle]
Save --> Return
Model Inference Flow¶
sequenceDiagram
participant User
participant Sample
participant Model
participant Forward[Forward Pass]
participant Output
User->>Sample: Get sample
Sample->>Model: x, t (optional)
Model->>Forward: forward(x, t)
Forward->>Output: Predictions
Output-->>User: y_pred
Batch Processing¶
graph TB
Samples[List of Samples] --> Batch[Create Batches]
Batch --> Collate[Collate Function]
Collate --> Tensor[Batch Tensor]
Tensor --> Model[Model Forward]
Model --> Loss[Compute Loss]
Loss --> Backward[Backward Pass]
Backward --> Update[Update Weights]
Tracking Flow¶
sequenceDiagram
participant Trainer
participant DualTracker
participant LocalTracker
participant MLflowTracker
participant TensorBoard
participant MLflowServer
Trainer->>DualTracker: Log metrics
DualTracker->>LocalTracker: Log to local
DualTracker->>MLflowTracker: Log to MLflow
LocalTracker->>TensorBoard: Write logs
MLflowTracker->>MLflowServer: HTTP request
Error Handling Flow¶
flowchart TD
Operation[Operation] --> Try{Try}
Try -->|Success| Return[Return Result]
Try -->|Error| Catch[Catch Exception]
Catch --> Log[Log Error]
Log --> Handle{Handle?}
Handle -->|Yes| Retry[Retry or Fallback]
Handle -->|No| Raise[Raise Exception]
Retry --> Operation
Next Steps¶
- Design Patterns - Design pattern details
- Pipeline System - Pipeline internals
- Model System - Model internals