Adding a New Split Strategy¶
This guide shows how to add a new data splitting strategy.
Split Function Template¶
"""Data split strategies."""
from typing import List, Tuple
from ..pipelines.sample import Sample
def your_split_strategy(samples: List[Sample],
param1: str = "default",
param2: int = 10) -> Tuple[List[Sample], List[Sample]]:
"""Your split strategy description.
Brief description of what the split does and when to use it.
Args:
samples: List of Sample objects
param1: Description of param1
param2: Description of param2
Returns:
Tuple of (train_samples, test_samples)
Example:
>>> train, test = your_split_strategy(samples, param1="value")
"""
train = []
test = []
# Implement your split logic
for sample in samples:
# Your splitting criteria
if some_condition(sample, param1, param2):
train.append(sample)
else:
test.append(sample)
return train, test
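To make the template concrete, here is a minimal finished strategy that holds out samples by a metadata value. This is only an illustrative sketch that would sit alongside the template in splits.py (reusing its imports); the "source" key and the default "site_b" value are assumptions for the example, not part of the codebase.

def split_by_held_out_source(samples: List[Sample],
                             test_source: str = "site_b") -> Tuple[List[Sample], List[Sample]]:
    """Hold out every sample whose 'source' metadata equals test_source."""
    train = []
    test = []
    for sample in samples:
        # Samples from the held-out source go to test; everything else trains.
        if sample.meta.get("source") == test_source:
            test.append(sample)
        else:
            train.append(sample)
    return train, test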
Steps to Add a Split¶
1. Add Function to splits.py¶
Add your function to src/data/splits.py:
def your_split_strategy(samples: List[Sample], **kwargs) -> Tuple[List[Sample], List[Sample]]:
    """Your split strategy."""
    # Implementation
    pass
2. Handle Metadata¶
Ensure samples have required metadata:
def your_split_strategy(samples: List[Sample],
                        split_key: str) -> Tuple[List[Sample], List[Sample]]:
    """Split based on metadata key."""
    train = []
    test = []
    for sample in samples:
        # Access metadata
        value = sample.meta.get(split_key)
        if value is None:
            raise ValueError(f"Sample missing required metadata: {split_key}")
        # Split logic
        if some_condition(value):
            train.append(sample)
        else:
            test.append(sample)
    return train, test
3. Add Tests¶
Create tests/test_splits.py or add to the existing test module:
import pytest

from src.data.splits import your_split_strategy
from src.pipelines.sample import Sample


def test_your_split_strategy():
    """Test your split strategy."""
    # Create test samples
    samples = [
        Sample(meta={'key': 'value1'}, x=None, y=None),
        Sample(meta={'key': 'value2'}, x=None, y=None),
    ]
    # Test split
    train, test = your_split_strategy(samples, param1="value")
    assert len(train) > 0
    assert len(test) > 0
    assert len(train) + len(test) == len(samples)
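Edge cases deserve their own test. The sketch below would live in the same test module (so the imports above apply); it assumes the split_key signature from step 2, where a missing metadata key raises ValueError and an empty input simply yields two empty lists.

def test_your_split_strategy_edge_cases():
    """Empty input yields empty splits; missing metadata fails fast."""
    # Empty input: nothing to assign to either side
    train, test = your_split_strategy([], split_key='key')
    assert train == [] and test == []

    # Missing metadata: the strategy should raise a clear error
    bad = [Sample(meta={}, x=None, y=None)]
    with pytest.raises(ValueError):
        your_split_strategy(bad, split_key='key')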
4. Add Configuration (Optional)¶
If using Hydra, create configs/split/your_split.yaml:
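The guide does not fix a config schema, so the snippet below is only one plausible layout, using Hydra's standard _target_ convention; the config group name split and the parameter names are assumptions to adapt to your project.

# configs/split/your_split.yaml
_target_: src.data.splits.your_split_strategy
param1: "value"
param2: 10

With a layout like this, hydra.utils.instantiate(cfg.split, samples=samples) resolves _target_ and calls the strategy with the configured parameters.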
5. Update Documentation¶
- Add to Splits Guide
- Add example usage
- Document parameters
Common Split Patterns¶
By Metadata Value¶
from typing import Any  # in addition to List, Tuple imported in the module header


def split_by_metadata(samples: List[Sample],
                      key: str,
                      train_values: List[Any]) -> Tuple[List[Sample], List[Sample]]:
    """Split by metadata value."""
    train = [s for s in samples if s.meta.get(key) in train_values]
    test = [s for s in samples if s.meta.get(key) not in train_values]
    return train, test
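For example, to train on two acquisition sites and test on everything else (the key and site names are purely illustrative):

train, test = split_by_metadata(samples, key="site", train_values=["site_a", "site_b"])

Note that samples missing the key entirely end up in the test set, because meta.get(key) returns None, which is not in train_values; raise instead if that should be an error.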
By Percentage¶
def split_by_percentage(samples: List[Sample],
                        train_fraction: float = 0.8) -> Tuple[List[Sample], List[Sample]]:
    """Split by percentage."""
    n_train = int(len(samples) * train_fraction)
    train = samples[:n_train]
    test = samples[n_train:]
    return train, test
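This version keeps the first n_train samples in their current order, which can leak structure if the list is sorted by time, class, or site. If a randomized split is wanted, a seeded variant is a small extension; the function below is a sketch, not an existing helper in splits.py.

import random


def split_by_percentage_shuffled(samples: List[Sample],
                                 train_fraction: float = 0.8,
                                 seed: int = 42) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle with a fixed seed for reproducibility, then split by percentage."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]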
By Index Range¶
def split_by_index(samples: List[Sample],
                   train_indices: List[int]) -> Tuple[List[Sample], List[Sample]]:
    """Split by sample indices."""
    train = [samples[i] for i in train_indices]
    test = [samples[i] for i in range(len(samples)) if i not in train_indices]
    return train, test
Best Practices¶
- Validate Inputs: Check that samples carry the metadata the split needs (see the sketch after this list)
- Handle Edge Cases: Empty splits, single sample, etc.
- Document Clearly: Explain split logic and use cases
- Return Consistently: Always return a (train, test) tuple
- Add Tests: Test various scenarios
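As a sketch of the first two practices, a small reusable guard can centralise the checks. The helper name require_meta is illustrative, not an existing function in the codebase, and treating an empty input as an error is just one possible policy.

def require_meta(samples: List[Sample], key: str) -> None:
    """Fail fast if the input cannot support a metadata-based split."""
    if not samples:
        # One possible policy; returning two empty splits is also reasonable.
        raise ValueError("Cannot split an empty sample list")
    missing = [i for i, s in enumerate(samples) if s.meta.get(key) is None]
    if missing:
        raise ValueError(
            f"{len(missing)} sample(s) missing required metadata '{key}' "
            f"(first offending index: {missing[0]})"
        )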
Checklist¶
- Split function added to splits.py
- Tests written and passing
- Documentation updated
- Handles edge cases
- Code follows style guidelines
Next Steps¶
- Splits Guide - Split usage guide
- Testing - Testing guidelines
- Code Structure - Codebase organization