EvoAug2 API Reference
This page provides comprehensive API documentation for the EvoAug2 package.
Package Overview
The EvoAug2 package consists of several core modules:
`evoaug.augment`: Core augmentation classes for genomic sequences
`evoaug.evoaug`: Main training utilities and RobustLoader
`evoaug_utils.model_zoo`: Pre-built model architectures
`evoaug_utils.utils`: Utility functions for data handling and evaluation
Core Modules
Augmentation Classes
The evoaug.augment module provides the core augmentation classes:
Base Augmentation:
Sequence Mutations:
- class evoaug.augment.RandomMutation(mut_frac=0.05)[source]
Bases: AugmentBase
Randomly mutates nucleotides in sequences according to a mutation fraction.
This augmentation randomly selects positions in each sequence and replaces the nucleotides with random DNA, effectively introducing point mutations while maintaining the original sequence length L.
- Parameters:
mut_frac (float, optional) – Probability of mutation for each nucleotide. Defaults to 0.05.
Notes
The actual number of mutations is calculated as: round(mut_frac / 0.75 * L)
The division by 0.75 accounts for silent mutations: a uniformly random replacement matches the original nucleotide one time in four, so only about 75% of the redrawn positions actually change
Random DNA is generated using uniform nucleotide distribution
Each sequence in the batch receives a different set of random mutations
- __call__(x)[source]
Randomly introduce mutations to a set of one-hot DNA sequences.
- Parameters:
x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).
- Returns:
Sequences with randomly mutated DNA, maintaining shape (N, A, L).
- Return type:
torch.Tensor
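As a worked example of the mutation-count formula in the RandomMutation notes, the sketch below computes how many positions are redrawn and how many of those are expected to differ from the original. The helper name `num_mutations` is hypothetical and is not part of the EvoAug2 API.

```python
def num_mutations(mut_frac: float, seq_len: int) -> int:
    """Number of positions to redraw, following round(mut_frac / 0.75 * L)."""
    return round(mut_frac / 0.75 * seq_len)


# A uniformly random nucleotide equals the original one time in four,
# so only about 75% of the redrawn positions end up different.
redrawn = num_mutations(0.05, 200)
expected_changes = 0.75 * redrawn

print(redrawn)           # 13
print(expected_changes)  # 9.75, close to mut_frac * L = 10
```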
- class evoaug.augment.RandomDeletion(delete_min=0, delete_max=20)[source]
Bases: AugmentBase
Randomly deletes contiguous stretches of nucleotides from sequences.
This augmentation randomly selects deletion lengths and positions for each sequence in a batch, then pads the deleted regions with random DNA to maintain the original sequence length L.
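- Parameters:
delete_min (int, optional) – Minimum deletion length. Defaults to 0.
delete_max (int, optional) – Maximum deletion length. Defaults to 20.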
Notes
Deletion positions are constrained to ensure the deletion window fits within the sequence boundaries
Random DNA padding is added equally to both ends of the deletion to maintain sequence length L
Each sequence in the batch receives a different random deletion
- __call__(x)[source]
Randomly delete segments in a set of one-hot DNA sequences.
- Parameters:
x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).
- Returns:
Sequences with randomly deleted segments, padded with random DNA to maintain shape (N, A, L).
- Return type:
torch.Tensor
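The delete-then-pad idea described for RandomDeletion can be sketched for a single sequence as follows. This is an illustration under stated assumptions, not the library's implementation: `random_dna` and `delete_and_pad` are hypothetical helpers, the deletion start and length are passed in explicitly instead of being sampled, and the random padding is assumed to be split between the two ends of the shortened sequence.

```python
import torch
import torch.nn.functional as F


def random_dna(alphabet_size: int, length: int) -> torch.Tensor:
    """Uniform random one-hot DNA with shape (A, length)."""
    idx = torch.randint(0, alphabet_size, (length,))
    return F.one_hot(idx, num_classes=alphabet_size).T.float()


def delete_and_pad(seq: torch.Tensor, start: int, length: int) -> torch.Tensor:
    """Remove seq[:, start:start+length] and pad back to the original length."""
    alphabet_size = seq.shape[0]
    kept = torch.cat([seq[:, :start], seq[:, start + length:]], dim=1)
    left = length // 2          # padding placed before the kept sequence
    right = length - left       # padding placed after the kept sequence
    return torch.cat(
        [random_dna(alphabet_size, left), kept, random_dna(alphabet_size, right)],
        dim=1,
    )


seq = random_dna(4, 50)
out = delete_and_pad(seq, start=10, length=6)
print(out.shape)  # torch.Size([4, 50])
```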
- class evoaug.augment.RandomInsertion(insert_min=0, insert_max=20)[source]
Bases: AugmentBase
Randomly inserts contiguous stretches of random DNA into sequences.
This augmentation randomly selects insertion lengths and positions for each sequence in a batch, then trims the resulting sequences equally from both ends to maintain the original sequence length L.
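- Parameters:
insert_min (int, optional) – Minimum insertion length. Defaults to 0.
insert_max (int, optional) – Maximum insertion length. Defaults to 20.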
Notes
Insertion positions are randomly selected across the sequence length
Random DNA is generated using uniform nucleotide distribution
After insertion, sequences are trimmed equally from both ends to maintain sequence length L
Each sequence in the batch receives a different random insertion
- __call__(x)[source]
Randomly insert segments of random DNA into DNA sequences.
- Parameters:
x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).
- Returns:
Sequences with randomly inserted segments of random DNA, trimmed to maintain shape (N, A, L).
- Return type:
torch.Tensor
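The insert-then-trim behaviour described for RandomInsertion can be sketched for one sequence as below. The helpers `random_dna` and `insert_and_trim` are hypothetical names used only for illustration; the position and length are fixed arguments here rather than random draws, and the split of an odd trim between the two ends is an assumption.

```python
import torch
import torch.nn.functional as F


def random_dna(alphabet_size: int, length: int) -> torch.Tensor:
    """Uniform random one-hot DNA with shape (A, length)."""
    idx = torch.randint(0, alphabet_size, (length,))
    return F.one_hot(idx, num_classes=alphabet_size).T.float()


def insert_and_trim(seq: torch.Tensor, pos: int, insert_len: int) -> torch.Tensor:
    """Insert random DNA at pos, then trim both ends back to the original length."""
    alphabet_size, seq_len = seq.shape
    insert = random_dna(alphabet_size, insert_len)
    longer = torch.cat([seq[:, :pos], insert, seq[:, pos:]], dim=1)
    left = insert_len // 2      # columns trimmed from the left end
    return longer[:, left:left + seq_len]


seq = random_dna(4, 50)
out = insert_and_trim(seq, pos=20, insert_len=6)
print(out.shape)  # torch.Size([4, 50])
```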
- class evoaug.augment.RandomTranslocation(shift_min=0, shift_max=20)[source]
Bases: AugmentBase
Randomly shifts sequences using circular roll transformations.
This augmentation applies random positive or negative shifts to each sequence in a batch, effectively cutting the sequence and reordering the pieces while maintaining the original sequence length L.
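- Parameters:
shift_min (int, optional) – Minimum shift size. Defaults to 0.
shift_max (int, optional) – Maximum shift size. Defaults to 20.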
Notes
Shifts are randomly chosen between shift_min and shift_max
Approximately half of the shifts are made negative to create both left and right circular shifts
Uses torch.roll for efficient implementation
Each sequence in the batch receives a different random shift
- __call__(x)[source]
Randomly shift sequences in a batch using circular roll.
- Parameters:
x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).
- Returns:
Sequences with random circular shifts applied, maintaining shape (N, A, L).
- Return type:
torch.Tensor
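A rough sketch of per-sequence circular shifting with torch.roll, in the spirit of the RandomTranslocation notes. The function `random_roll` is a hypothetical stand-in rather than the library's code, and treating shift_max as an inclusive bound is an assumption.

```python
import torch


def random_roll(x: torch.Tensor, shift_min: int = 0, shift_max: int = 20) -> torch.Tensor:
    """Circularly shift each (A, L) sequence of an (N, A, L) batch independently."""
    n = x.shape[0]
    # Shift sizes; shift_max is treated as inclusive here (assumption).
    sizes = torch.randint(shift_min, shift_max + 1, (n,))
    # Turn roughly half of the shifts into negative (leftward) rolls.
    signs = torch.randint(0, 2, (n,)) * 2 - 1
    shifts = (sizes * signs).tolist()
    return torch.stack([torch.roll(seq, s, dims=-1) for seq, s in zip(x, shifts)])


# Deterministic illustration of a single roll by +2 positions.
demo = torch.arange(8).reshape(1, 1, 8)
rolled = torch.roll(demo, 2, dims=-1).flatten().tolist()
print(rolled)  # [6, 7, 0, 1, 2, 3, 4, 5]

batch = torch.rand(6, 4, 30)
shifted = random_roll(batch)
print(shifted.shape)  # torch.Size([6, 4, 30])
```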
Sequence Transformations:
- class evoaug.augment.RandomRC(rc_prob=0.5)[source]
Bases: AugmentBase
Randomly applies reverse-complement transformations to sequences.
This augmentation randomly selects sequences in a batch and applies a reverse-complement transformation with a specified probability. The transformation reverses both the sequence order and nucleotide identity while maintaining the original sequence length L.
- Parameters:
rc_prob (float, optional) – Probability to apply a reverse-complement transformation. Defaults to 0.5.
Notes
Each sequence is independently selected for transformation
Uses torch.flip with dims=[1,2] to reverse both sequence and nucleotide dimensions
Maintains original sequence length L
Useful for learning strand-invariant representations
- __init__(rc_prob=0.5)[source]
Create random reverse-complement augmentation object.
- Parameters:
rc_prob (float) – Probability to apply reverse-complement transformation.
- __call__(x)[source]
Randomly transform sequences with reverse-complement transformations.
- Parameters:
x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).
- Returns:
Sequences with random reverse-complements applied, maintaining shape (N, A, L).
- Return type:
torch.Tensor
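The torch.flip trick mentioned in the RandomRC notes can be checked on a tiny batch. The sketch assumes the one-hot channels are ordered A, C, G, T, so that reversing the channel axis swaps A with T and C with G; `one_hot`, `decode` and `random_rc` are hypothetical helpers, not EvoAug2 functions.

```python
import torch
import torch.nn.functional as F

ALPHABET = "ACGT"  # assumed channel order; the flip trick relies on it


def one_hot(seq: str) -> torch.Tensor:
    """Encode a DNA string as a one-hot tensor with shape (A, L)."""
    idx = torch.tensor([ALPHABET.index(base) for base in seq])
    return F.one_hot(idx, num_classes=4).T.float()


def decode(x: torch.Tensor) -> str:
    """Decode an (A, L) one-hot tensor back into a DNA string."""
    return "".join(ALPHABET[i] for i in x.argmax(dim=0).tolist())


def random_rc(x: torch.Tensor, rc_prob: float = 0.5) -> torch.Tensor:
    """Reverse-complement each sequence of an (N, A, L) batch with probability rc_prob."""
    mask = torch.rand(x.shape[0]) < rc_prob
    x_aug = x.clone()
    # Flipping dim 1 complements the bases, flipping dim 2 reverses the order.
    x_aug[mask] = torch.flip(x[mask], dims=[1, 2])
    return x_aug


batch = torch.stack([one_hot("AACG"), one_hot("TTTT")])
flipped = random_rc(batch, rc_prob=1.0)
print(decode(flipped[0]))  # CGTT
print(decode(flipped[1]))  # AAAA
```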
- class evoaug.augment.RandomNoise(noise_mean=0.0, noise_std=0.2)[source]
Bases: AugmentBase
Randomly adds Gaussian noise to sequences.
This augmentation adds random Gaussian noise to each sequence in a batch, effectively introducing small perturbations to the one-hot encodings while maintaining the original sequence length L.
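- Parameters:
noise_mean (float, optional) – Mean of the Gaussian noise. Defaults to 0.0.
noise_std (float, optional) – Standard deviation of the Gaussian noise. Defaults to 0.2.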
Notes
Noise is sampled from a normal distribution with specified mean and standard deviation
Noise is added element-wise to the input tensor
Useful for improving model robustness to small perturbations
Each sequence in the batch receives different random noise
- __call__(x)[source]
Randomly add Gaussian noise to a set of one-hot DNA sequences.
- Parameters:
x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).
- Returns:
Sequences with random noise added, maintaining shape (N, A, L).
- Return type:
torch.Tensor
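Element-wise Gaussian noise, as described for RandomNoise, amounts to one call to torch.normal. The function name `add_gaussian_noise` is hypothetical and the snippet is only a sketch of the idea, not the library's implementation.

```python
import torch


def add_gaussian_noise(x: torch.Tensor, noise_mean: float = 0.0,
                       noise_std: float = 0.2, generator=None) -> torch.Tensor:
    """Add independent Gaussian noise to every element of an (N, A, L) batch."""
    noise = torch.normal(noise_mean, noise_std, size=x.shape, generator=generator)
    return x + noise


gen = torch.Generator().manual_seed(0)
clean = torch.zeros(8, 4, 100)
noisy = add_gaussian_noise(clean, generator=gen)
print(noisy.shape)  # torch.Size([8, 4, 100])
```

Note that the result is no longer strictly one-hot, which is the intended effect of this augmentation.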
Training Utilities
The evoaug.evoaug module provides training utilities:
RobustLoader:
- class evoaug.evoaug.RobustLoader(base_dataset: Dataset, augment_list: List[AugmentBase] = [], max_augs_per_seq: int = 0, hard_aug: bool = True, batch_size: int = 32, shuffle: bool = True, num_workers: int = 4, **kwargs)[source]
Bases: DataLoader
EvoAug2 DataLoader that inherits from PyTorch DataLoader.
This class provides a DataLoader with built-in EvoAug augmentations that can be used inside a PyTorch Lightning LightningDataModule or directly in a vanilla PyTorch training loop.
- Parameters:
base_dataset (torch.utils.data.Dataset) – The underlying dataset that provides (sequence, target) pairs.
augment_list (List[AugmentBase], optional) – List of augmentations to apply. Defaults to empty list.
max_augs_per_seq (int, optional) – Maximum augmentations per sequence. Defaults to 0.
hard_aug (bool, optional) – Whether to always apply exactly max_augs_per_seq augmentations to each sequence. Defaults to True.
batch_size (int, optional) – Batch size for the DataLoader. Defaults to 32.
shuffle (bool, optional) – Whether to shuffle the data. Defaults to True.
num_workers (int, optional) – Number of worker processes. Defaults to 4.
**kwargs – Additional arguments passed to DataLoader.
Notes
The RobustLoader automatically creates an AugmentedGenomicDataset wrapper
Augmentations can be enabled/disabled at runtime using enable_augmentations() and disable_augmentations() methods
All augmentations preserve sequence length L for consistent batch shapes
- __init__(base_dataset: Dataset, augment_list: List[AugmentBase] = [], max_augs_per_seq: int = 0, hard_aug: bool = True, batch_size: int = 32, shuffle: bool = True, num_workers: int = 4, **kwargs)[source]
- enable_augmentations()[source]
Enable augmentations for training.
Notes
This method enables augmentations on the underlying dataset, allowing them to be applied during training.
- disable_augmentations()[source]
Disable augmentations for finetuning/validation.
Notes
This method disables augmentations on the underlying dataset, useful for validation, testing, or finetuning on original data.
- set_augmentations(augment_list: List[AugmentBase], max_augs_per_seq: int = 0, hard_aug: bool = True)[source]
Update the augmentation settings.
- Parameters:
augment_list (List[AugmentBase]) – New list of augmentations to apply.
max_augs_per_seq (int, optional) – New maximum augmentations per sequence. Defaults to 0.
hard_aug (bool, optional) – New hard augmentation setting. Defaults to True.
Notes
This method allows dynamic updating of augmentation parameters without recreating the entire DataLoader.
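To make the roles of max_augs_per_seq and hard_aug concrete, here is a toy selection routine in plain Python. It is a sketch under assumptions, not RobustLoader's implementation: `apply_augmentations` is a hypothetical function, the augmentations are simple string callables, and the behaviour for hard_aug=False (drawing a count between 1 and the maximum) is assumed rather than taken from this reference.

```python
import random


def apply_augmentations(seq, augment_list, max_augs_per_seq, hard_aug=True, rng=random):
    """Apply a subset of augment_list to seq and report how many were used."""
    if not augment_list or max_augs_per_seq <= 0:
        return seq, 0  # nothing to apply
    limit = min(max_augs_per_seq, len(augment_list))
    # hard_aug=True: always use the maximum; otherwise draw 1..limit (assumption).
    num_augs = limit if hard_aug else rng.randint(1, limit)
    for augment in rng.sample(augment_list, num_augs):
        seq = augment(seq)
    return seq, num_augs


def reverse(s):
    return s[::-1]


toy_augs = [str.lower, reverse]
out, used = apply_augmentations("ACGT", toy_augs, max_augs_per_seq=2, hard_aug=True)
print(out, used)  # tgca 2
```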
Training Functions:
Model Architectures
The evoaug_utils.model_zoo module provides pre-built architectures:
DeepSTARR Models:
- class evoaug_utils.model_zoo.DeepSTARR(output_dim, d=256, conv1_filters=None, learn_conv1_filters=True, conv2_filters=None, learn_conv2_filters=True, conv3_filters=None, learn_conv3_filters=True, conv4_filters=None, learn_conv4_filters=True)[source]
Bases: Module
DeepSTARR model from de Almeida et al., 2022.
This is the original DeepSTARR model architecture as described in the paper. See https://www.nature.com/articles/s41588-022-01048-5 for details.
- Parameters:
output_dim (int) – Number of output classes for prediction.
d (int, optional) – Number of first-layer convolutional filters. Defaults to 256.
conv1_filters (torch.Tensor, optional) – Initial filters for the first convolutional layer. If None, random filters are initialized.
learn_conv1_filters (bool, optional) – Whether to learn the first convolutional filters. Defaults to True.
conv2_filters (torch.Tensor, optional) – Initial filters for the second convolutional layer. If None, random filters are initialized.
learn_conv2_filters (bool, optional) – Whether to learn the second convolutional filters. Defaults to True.
conv3_filters (torch.Tensor, optional) – Initial filters for the third convolutional layer. If None, random filters are initialized.
learn_conv3_filters (bool, optional) – Whether to learn the third convolutional filters. Defaults to True.
conv4_filters (torch.Tensor, optional) – Initial filters for the fourth convolutional layer. If None, random filters are initialized.
learn_conv4_filters (bool, optional) – Whether to learn the fourth convolutional filters. Defaults to True.
Notes
The original DeepSTARR model uses 256 first-layer convolutional filters
Supports transfer learning by initializing with pre-trained filters
Uses batch normalization and max pooling throughout
Final layers use LazyLinear for automatic input size inference
- __init__(output_dim, d=256, conv1_filters=None, learn_conv1_filters=True, conv2_filters=None, learn_conv2_filters=True, conv3_filters=None, learn_conv3_filters=True, conv4_filters=None, learn_conv4_filters=True)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- get_which_conv_layers_transferred()[source]
Get list of convolutional layers that were initialized with pre-trained filters.
- Returns:
List of layer indices (1-4) that were initialized with pre-trained filters.
- Return type:
list
Notes
This method is useful for understanding which layers were transferred from a pre-trained model during initialization.
- forward(x)[source]
Forward pass through the DeepSTARR model.
- Parameters:
x (torch.Tensor) – Input tensor with shape (batch_size, 4, sequence_length).
- Returns:
Output predictions with shape (batch_size, output_dim).
- Return type:
torch.Tensor
Notes
The forward pass applies:
1. Four sequential 1D convolutions with batch normalization and max pooling
2. Flattening of convolutional features
3. Two fully connected layers with batch normalization and dropout
4. Final output layer for predictions
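The note that the final layers use LazyLinear can be illustrated with a much smaller network. `TinyCNN` below is a hypothetical miniature, not the DeepSTARR architecture; it only shows how nn.LazyLinear infers its input size from the first batch, so the sequence length never has to be passed to the constructor.

```python
import torch
from torch import nn


class TinyCNN(nn.Module):
    """Hypothetical miniature network: one conv block followed by a lazy linear head."""

    def __init__(self, output_dim: int, d: int = 8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(4, d, kernel_size=7, padding=3),  # keeps length L
            nn.BatchNorm1d(d),
            nn.ReLU(),
            nn.MaxPool1d(2),                            # halves the length
            nn.Flatten(),
        )
        # in_features is left unspecified and inferred on the first forward pass.
        self.head = nn.LazyLinear(output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))


model = TinyCNN(output_dim=2)
x = torch.randn(16, 4, 249)       # (batch_size, 4, sequence_length)
y = model(x)
print(y.shape)                    # torch.Size([16, 2])
print(model.head.in_features)     # 992, i.e. 8 filters * 124 positions
```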
- class evoaug_utils.model_zoo.DeepSTARRModel(model, learning_rate=0.001, weight_decay=1e-06)[source]
Bases: LightningModule
PyTorch Lightning module for DeepSTARR training.
This class wraps the DeepSTARR model in a PyTorch Lightning module, providing training, validation, and testing functionality with automatic logging and checkpointing.
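- Parameters:
model (torch.nn.Module) – The DeepSTARR model to wrap.
learning_rate (float, optional) – Learning rate for the optimizer. Defaults to 0.001.
weight_decay (float, optional) – Weight decay (L2 regularization). Defaults to 1e-6.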
Notes
Uses MSE loss for regression tasks
Adam optimizer with ReduceLROnPlateau scheduler
Automatic logging of training, validation, and test losses
- forward(x)[source]
Forward pass through the model.
- Parameters:
x (torch.Tensor) – Input tensor.
- Returns:
Model predictions.
- Return type:
- training_step(batch, batch_idx)[source]
Training step for a single batch.
- Parameters:
- Returns:
Training loss for the batch.
- Return type:
- validation_step(batch, batch_idx)[source]
Validation step for a single batch.
- Parameters:
- Returns:
Validation loss for the batch.
- Return type:
- test_step(batch, batch_idx)[source]
Test step for a single batch.
- Parameters:
- Returns:
Test loss for the batch.
- Return type:
- configure_optimizers()[source]
Configure optimizer and learning rate scheduler.
- Returns:
Dictionary with optimizer and scheduler configuration.
- Return type:
dict
Notes
Uses Adam optimizer with ReduceLROnPlateau scheduler. The scheduler monitors validation loss and reduces learning rate when no improvement is seen for 5 epochs.
Basset Models:
- class evoaug_utils.model_zoo.Basset(output_dim, d=300, conv1_filters=None, learn_conv1_filters=True, conv2_filters=None, learn_conv2_filters=True, conv3_filters=None, learn_conv3_filters=True)[source]
Bases: Module
Basset model from Kelley et al., 2016.
This is the Basset model architecture as described in the original paper. See https://genome.cshlp.org/content/early/2016/05/03/gr.200535.115.abstract and https://github.com/davek44/Basset/blob/master/data/models/pretrained_params.txt
- Parameters:
output_dim (int) – Number of output classes for prediction.
d (int, optional) – Number of first-layer convolutional filters. Defaults to 300.
conv1_filters (torch.Tensor, optional) – Initial filters for the first convolutional layer. If None, random filters are initialized.
learn_conv1_filters (bool, optional) – Whether to learn the first convolutional filters. Defaults to True.
conv2_filters (torch.Tensor, optional) – Initial filters for the second convolutional layer. If None, random filters are initialized.
learn_conv2_filters (bool, optional) – Whether to learn the second convolutional filters. Defaults to True.
conv3_filters (torch.Tensor, optional) – Initial filters for the third convolutional layer. If None, random filters are initialized.
learn_conv3_filters (bool, optional) – Whether to learn the third convolutional filters. Defaults to True.
Notes
The original Basset model uses 300 first-layer convolutional filters
Supports transfer learning by initializing with pre-trained filters
Uses batch normalization and max pooling throughout
Final layers use LazyLinear for automatic input size inference
Output uses sigmoid activation for binary classification
- __init__(output_dim, d=300, conv1_filters=None, learn_conv1_filters=True, conv2_filters=None, learn_conv2_filters=True, conv3_filters=None, learn_conv3_filters=True)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- get_which_conv_layers_transferred()[source]
Get list of convolutional layers that were initialized with pre-trained filters.
- Returns:
List of layer indices (1-3) that were initialized with pre-trained filters.
- Return type:
list
Notes
This method is useful for understanding which layers were transferred from a pre-trained model during initialization.
- forward(x)[source]
Forward pass through the Basset model.
- Parameters:
x (torch.Tensor) – Input tensor with shape (batch_size, 4, sequence_length).
- Returns:
Output predictions with shape (batch_size, output_dim).
- Return type:
torch.Tensor
Notes
The forward pass applies:
1. Three sequential 1D convolutions with batch normalization and max pooling
2. Flattening of convolutional features
3. Two fully connected layers with batch normalization and dropout
4. Final output layer with sigmoid activation
Other Models:
- class evoaug_utils.model_zoo.CNN(output_dim)[source]
Bases: Module
Generic CNN model for genomic sequence classification.
This is a flexible CNN architecture that can be used for various genomic sequence classification tasks.
- Parameters:
output_dim (int) – Number of output classes for prediction.
Notes
Uses three convolutional layers with batch normalization and max pooling
Includes dropout for regularization
Final layers use LazyLinear for automatic input size inference
Output uses sigmoid activation for binary classification
- __init__(output_dim)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(x)[source]
Forward pass through the CNN model.
- Parameters:
x (torch.Tensor) – Input tensor with shape (batch_size, 4, sequence_length).
- Returns:
Output predictions with shape (batch_size, output_dim).
- Return type:
torch.Tensor
Notes
The forward pass applies:
1. Three sequential 1D convolutions with batch normalization and max pooling
2. Dropout after each convolutional layer
3. Flattening of convolutional features
4. Fully connected layer with batch normalization and dropout
5. Final output layer with sigmoid activation
Utility Functions
The evoaug_utils.utils module provides helper functions:
Data Handling:
- class evoaug_utils.utils.H5Dataset(filepath, batch_size=128, lower_case=False, transpose=False, downsample=None)[source]
Enhanced Dataset class for H5 data files with DataModule-like functionality.
This class combines the functionality of a PyTorch Dataset with the convenience methods of a Lightning DataModule, making it easy to integrate with EvoAug2 augmentations without nesting datamodules.
- Parameters:
filepath (str) – Path to the H5 file.
batch_size (int, optional) – Batch size for dataloaders. Defaults to 128.
lower_case (bool, optional) – Whether to use lowercase keys (‘x’, ‘y’) instead of uppercase (‘X’, ‘Y’). Defaults to False.
transpose (bool, optional) – Whether to transpose the data dimensions. Defaults to False.
downsample (int, optional) – Number of samples to use (for debugging). If None, uses all data. Defaults to None.
- evoaug_utils.utils.filepath
Path to the H5 file.
- Type:
- evoaug_utils.utils.batch_size
Batch size for dataloaders.
- Type:
- evoaug_utils.utils.lower_case
Whether to use lowercase keys.
- Type:
- evoaug_utils.utils.transpose
Whether data is transposed.
- Type:
- evoaug_utils.utils.downsample
Number of samples to use.
- Type:
- evoaug_utils.utils.x_key
Key prefix for input data.
- Type:
- evoaug_utils.utils.y_key
Key prefix for target data.
- Type:
- evoaug_utils.utils.x_train
Training input data.
- Type:
- evoaug_utils.utils.y_train
Training target data.
- Type:
- evoaug_utils.utils.x_valid
Validation input data.
- Type:
- evoaug_utils.utils.y_valid
Validation target data.
- Type:
- evoaug_utils.utils.x_test
Test input data.
- Type:
- evoaug_utils.utils.y_test
Test target data.
- Type:
- evoaug_utils.utils.A
Alphabet size (number of nucleotides).
- Type:
- evoaug_utils.utils.L
Sequence length.
- Type:
- evoaug_utils.utils.num_classes
Number of output classes.
- Type:
- class evoaug_utils.utils.H5DataModule(data_path, batch_size=128, stage=None, lower_case=False, transpose=False, downsample=None)[source]
Bases: LightningDataModule
PyTorch Lightning DataModule for H5 data files.
This class provides a standardized way to load and manage H5 datasets for training, validation, and testing in PyTorch Lightning workflows.
- Parameters:
data_path (str) – Path to the H5 data file.
batch_size (int, optional) – Batch size for dataloaders. Defaults to 128.
stage (str, optional) – Lightning stage (‘fit’, ‘test’, or None). Defaults to None.
lower_case (bool, optional) – Whether to use lowercase keys (‘x’, ‘y’) instead of uppercase (‘X’, ‘Y’). Defaults to False.
transpose (bool, optional) – Whether to transpose the data dimensions. Defaults to False.
downsample (int, optional) – Number of samples to use (for debugging). If None, uses all data. Defaults to None.
- x_train
Training input data.
- Type:
- y_train
Training target data.
- Type:
- x_valid
Validation input data.
- Type:
- y_valid
Validation target data.
- Type:
- x_test
Test input data.
- Type:
- y_test
Test target data.
- Type:
- __init__(data_path, batch_size=128, stage=None, lower_case=False, transpose=False, downsample=None)[source]
- prepare_data_per_node
If True, each LOCAL_RANK=0 will call prepare data. Otherwise only NODE_RANK=0, LOCAL_RANK=0 will prepare data.
- allow_zero_length_dataloader_with_multiple_devices
If True, dataloader with zero length within local rank is allowed. Default value is False.
- setup(stage=None)[source]
Set up the data module for the specified stage.
- Parameters:
stage (str, optional) – Lightning stage (‘fit’, ‘test’, or None). Defaults to None.
Notes
Loads training and validation data for ‘fit’ stage
Loads test data for ‘test’ stage
Sets shape attributes (A, L, num_classes)
- train_dataloader()[source]
Get training dataloader.
- Returns:
Training dataloader with shuffling enabled.
- Return type:
torch.utils.data.DataLoader
- val_dataloader()[source]
Get validation dataloader.
- Returns:
Validation dataloader with shuffling disabled.
- Return type:
torch.utils.data.DataLoader
Evaluation:
- evoaug_utils.utils.evaluate_model(y_test, pred, task, verbose=True)[source]
Evaluate model performance for classification or regression tasks.
- Parameters:
- Returns:
For regression: (mse, pearson_r, spearman_r)
For classification: (auroc, aupr)
- Return type:
Notes
Regression metrics: MSE, Pearson correlation, Spearman correlation
Classification metrics: AUROC, Average Precision
- evoaug_utils.utils.get_predictions(model, x, batch_size=100, accelerator='gpu', devices=1)[source]
Get predictions from a PyTorch model.
- Parameters:
model (torch.nn.Module) – The trained model to use for predictions.
x (array-like) – Input data for prediction.
batch_size (int, optional) – Batch size for prediction. Defaults to 100.
accelerator (str, optional) – Hardware accelerator to use. Defaults to ‘gpu’.
devices (int, optional) – Number of devices to use. Defaults to 1.
- Returns:
Model predictions.
- Return type:
numpy.ndarray
Notes
Model is set to evaluation mode during prediction
Predictions are made in batches to manage memory
Output is converted to numpy array
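The three points in the get_predictions notes (evaluation mode, batching, conversion to NumPy) can be sketched in plain PyTorch as follows. `predict_in_batches` is a hypothetical helper with a simplified `device` argument; it does not reproduce the accelerator and devices handling of the real function.

```python
import torch
from torch import nn


def predict_in_batches(model: nn.Module, x, batch_size: int = 100, device: str = "cpu"):
    """Run the model over x in batches and return the predictions as a NumPy array."""
    model = model.to(device).eval()          # evaluation mode: no dropout, fixed batch norm
    x = torch.as_tensor(x, dtype=torch.float32)
    outputs = []
    with torch.no_grad():                    # no gradients needed for inference
        for start in range(0, len(x), batch_size):
            batch = x[start:start + batch_size].to(device)
            outputs.append(model(batch).cpu())
    return torch.cat(outputs).numpy()


toy_model = nn.Sequential(nn.Flatten(), nn.Linear(40, 3))
inputs = torch.rand(250, 4, 10)              # (N, A, L)
preds = predict_in_batches(toy_model, inputs)
print(preds.shape)                           # (250, 3)
```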
- evoaug_utils.utils.calculate_auroc(y_true, y_score)[source]
Calculate Area Under ROC Curve for each class.
- Parameters:
y_true (array-like) – True binary labels.
y_score (array-like) – Predicted probabilities or scores.
- Returns:
AUROC values for each class.
- Return type:
- evoaug_utils.utils.calculate_aupr(y_true, y_score)[source]
Calculate Average Precision for each class.
- Parameters:
y_true (array-like) – True binary labels.
y_score (array-like) – Predicted probabilities or scores.
- Returns:
Average precision values for each class.
- Return type:
- evoaug_utils.utils.calculate_mse(y_true, y_score)[source]
Calculate Mean Squared Error for each class.
- Parameters:
y_true (array-like) – True target values.
y_score (array-like) – Predicted values.
- Returns:
MSE values for each class.
- Return type:
- evoaug_utils.utils.calculate_pearsonr(y_true, y_score)[source]
Calculate Pearson correlation coefficient for each class.
- Parameters:
y_true (array-like) – True target values.
y_score (array-like) – Predicted values.
- Returns:
Pearson correlation values for each class.
- Return type:
- evoaug_utils.utils.calculate_spearmanr(y_true, y_score)[source]
Calculate Spearman correlation coefficient for each class.
- Parameters:
y_true (array-like) – True target values.
y_score (array-like) – Predicted values.
- Returns:
Spearman correlation values for each class.
- Return type:
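"For each class" in the metric helpers above means that the metric is computed column by column over arrays of shape (num_samples, num_classes). The NumPy sketch below illustrates this for MSE and Pearson correlation; `per_class_mse` and `per_class_pearsonr` are hypothetical names, and the library's own functions may rely on other packages.

```python
import numpy as np


def per_class_mse(y_true, y_score):
    """Mean squared error of each column (class or task)."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    return ((y_true - y_score) ** 2).mean(axis=0)


def per_class_pearsonr(y_true, y_score):
    """Pearson correlation of each column (class or task)."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    return np.array([
        np.corrcoef(y_true[:, k], y_score[:, k])[0, 1]
        for k in range(y_true.shape[1])
    ])


y_true = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])
y_score = np.array([[1.0, 0.0], [2.0, -2.0], [3.0, -4.0]])
mse = per_class_mse(y_true, y_score)
pearson = per_class_pearsonr(y_true, y_score)
print(mse.shape, pearson.shape)  # (2,) (2,)
```

In this toy data the first column is shifted by a constant (perfect positive correlation) and the second is sign-flipped (perfect negative correlation).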
Training Support:
- evoaug_utils.utils.configure_optimizer(model, lr=0.001, weight_decay=1e-06, decay_factor=0.1, patience=5, monitor='val_loss')[source]
Configure optimizer and learning rate scheduler for PyTorch models.
- Parameters:
model (torch.nn.Module) – The model to configure optimizer for.
lr (float, optional) – Learning rate. Defaults to 0.001.
weight_decay (float, optional) – Weight decay (L2 regularization). Defaults to 1e-6.
decay_factor (float, optional) – Factor by which to reduce learning rate. Defaults to 0.1.
patience (int, optional) – Number of epochs with no improvement before reducing LR. Defaults to 5.
monitor (str, optional) – Metric to monitor for LR reduction. Defaults to ‘val_loss’.
- Returns:
Dictionary with optimizer and scheduler configuration.
- Return type:
dict
Notes
Uses Adam optimizer with ReduceLROnPlateau scheduler.
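A configuration of this kind can be sketched with plain PyTorch classes. The function below is a hypothetical stand-in, not the library's code, and the nested dictionary layout follows the convention PyTorch Lightning accepts from configure_optimizers; the exact structure returned by EvoAug2 is not shown in this reference, so treat it as an assumption.

```python
import torch
from torch import nn


def configure_optimizer_sketch(model: nn.Module, lr: float = 0.001,
                               weight_decay: float = 1e-6, decay_factor: float = 0.1,
                               patience: int = 5, monitor: str = "val_loss") -> dict:
    """Adam plus ReduceLROnPlateau, packed in a Lightning-style dictionary."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=decay_factor, patience=patience
    )
    return {
        "optimizer": optimizer,
        "lr_scheduler": {"scheduler": scheduler, "monitor": monitor},
    }


config = configure_optimizer_sketch(nn.Linear(4, 1))
print(config["lr_scheduler"]["monitor"])           # val_loss
print(config["optimizer"].param_groups[0]["lr"])   # 0.001
```

With these defaults, the learning rate is multiplied by decay_factor once the monitored metric has failed to improve for more than patience consecutive epochs.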
- evoaug_utils.utils.get_fmaps(robust_model, x)[source]
Get first layer feature maps from a model.
- Parameters:
robust_model (torch.nn.Module) – The model to extract feature maps from.
x (torch.Tensor) – Input data to generate feature maps for.
- Returns:
Feature maps from the first layer (activation1).
- Return type:
Notes
Requires the model to have a layer named ‘activation1’
Model is moved to CPU for feature extraction
Feature maps are transposed for visualization
Usage Examples
Basic Augmentation:
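import torch
import torch.nn.functional as F
from evoaug.augment import RandomMutation, RandomDeletion
# Create augmentations
mutation = RandomMutation(mut_frac=0.05)
deletion = RandomDeletion(delete_min=0, delete_max=20)
# Build a random one-hot batch with shape (N, A, L) = (1, 4, 200)
indices = torch.randint(0, 4, (1, 200))
sequence = F.one_hot(indices, num_classes=4).permute(0, 2, 1).float()
# Apply augmentations; the shape stays (1, 4, 200)
augmented = mutation(sequence)
augmented = deletion(augmented)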
Training with RobustLoader:
from evoaug.evoaug import RobustLoader
# Create loader (dataset is any torch Dataset yielding (sequence, target) pairs)
loader = RobustLoader(
    base_dataset=dataset,
    augment_list=[mutation, deletion],
    max_augs_per_seq=2,
    hard_aug=True,
    batch_size=32
)
# Training loop (model and criterion are defined elsewhere)
for batch_seqs, batch_labels in loader:
    # Augmentations applied automatically
    outputs = model(batch_seqs)
    loss = criterion(outputs, batch_labels)
    # ... rest of training
Model Training:
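import pytorch_lightning as pl  # assumes the pytorch_lightning package is installed
from evoaug_utils.model_zoo import DeepSTARR, DeepSTARRModel
# Create model
model = DeepSTARRModel(DeepSTARR(2))
# Create a trainer (max_epochs is an illustrative value)
trainer = pl.Trainer(max_epochs=100)
# Train with augmentations (data_module is a LightningDataModule defined elsewhere)
trainer.fit(model, datamodule=data_module)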
Evaluation:
from evoaug_utils import utils
# Get predictions
predictions = utils.get_predictions(model, test_data)
# Evaluate
results = utils.evaluate_model(true_labels, predictions, task='regression')
Configuration
Augmentation Parameters:
Each augmentation class accepts specific parameters that control the augmentation behavior:
RandomMutation: mut_frac - fraction of positions to mutate
RandomDeletion: delete_min, delete_max - deletion range
RandomInsertion: insert_min, insert_max - insertion range
RandomTranslocation: shift_min, shift_max - shift range
RandomRC: rc_prob - reverse-complement probability
RandomNoise: noise_mean, noise_std - noise parameters
Training Parameters:
max_augs_per_seq: Maximum augmentations per sequence
hard_aug: Whether to always apply exactly N augmentations
batch_size: Training batch size
shuffle: Whether to shuffle data
Model Parameters:
output_dim: Number of output classes (the sequence length is not passed explicitly, since the final layers use LazyLinear to infer the input size)
learning_rate: Training learning rate
weight_decay: L2 regularization
Error Handling
Common Errors and Solutions:
Shape Mismatch Errors:
- Ensure input tensors have shape (N, A, L), that is [batch, channels, length]
- Check that augmentation parameters are within valid ranges
Memory Errors:
- Reduce batch size
- Use gradient accumulation
- Enable mixed precision training
Data Type Errors:
- Ensure input tensors are torch.float32
- Check label tensors are appropriate type (long for classification)
Debugging Tips:
# Check tensor shapes and types
print(f"Input shape: {sequence.shape}")
print(f"Input dtype: {sequence.dtype}")
# Verify augmentation parameters
print(f"Mutation fraction: {mutation.mut_frac}")
print(f"Deletion range: {deletion.delete_min}-{deletion.delete_max}")
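The shape and dtype checks above can be wrapped in a small helper that reports problems before a batch reaches an augmentation. `check_batch` is a hypothetical function written for illustration, assuming the (N, A, L) layout and float32 inputs used throughout this reference.

```python
import torch


def check_batch(x: torch.Tensor, alphabet_size: int = 4) -> list:
    """Return a list of problems; an empty list means the batch looks fine."""
    problems = []
    if x.ndim != 3:
        problems.append(f"expected 3 dimensions (N, A, L), got {x.ndim}")
    elif x.shape[1] != alphabet_size:
        problems.append(
            f"expected {alphabet_size} channels on dim 1, got shape {tuple(x.shape)}"
        )
    if x.dtype != torch.float32:
        problems.append(f"expected torch.float32, got {x.dtype}")
    return problems


print(check_batch(torch.zeros(1, 4, 200)))   # []
print(check_batch(torch.zeros(1, 200, 4)))   # one message about the channel dimension
```

A batch stored as (N, L, A) can usually be fixed with `x.permute(0, 2, 1)` before it is passed to an augmentation.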
Performance Considerations
Optimization Tips:
Use GPU Acceleration:
- Move tensors to GPU: sequence = sequence.cuda()
- Use mixed precision training when available
Batch Processing:
- Use appropriate batch sizes for your hardware
- Consider gradient accumulation for large effective batch sizes
Augmentation Efficiency:
- Limit number of augmentations per sequence