EvoAug2 Core API
This page documents the core EvoAug2 modules for augmentations and training.
Augmentation Module
The evoaug.augment module provides evolution-inspired data augmentation techniques:
Library of data augmentations for genomic sequence data.
This module provides evolution-inspired data augmentation techniques for genomic sequences, ensuring that all augmentations preserve the input sequence length L.
To contribute a custom augmentation, use the following syntax:
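import torch
from evoaug.augment import AugmentBase

class CustomAugmentation(AugmentBase):
    def __init__(self, param1, param2):
        self.param1 = param1
        self.param2 = param2

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        # Perform the augmentation here; the output must keep shape (N, A, L)
        x_aug = x  # placeholder: replace with the actual transformation
        return x_aug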
- class evoaug.augment.AugmentBase
Bases: object
Base class for EvoAug augmentations for genomic sequences.
All augmentation classes should inherit from this base class and implement the __call__() method to ensure a consistent interface.
- __call__(x)
Return an augmented version of x.
- Parameters:
x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L), where N is the batch size, A is the number of nucleotides (4 for DNA), and L is the sequence length.
- Returns:
Batch of one-hot sequences with random augmentation applied. Output shape must be (N, A, L) to maintain sequence length consistency.
- Return type:
torch.Tensor
- Raises:
NotImplementedError – If the augmentation class does not implement this method.
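As an illustration of the interface described above, here is a minimal sketch of a custom augmentation. `RandomMask` and its `mask_frac` parameter are hypothetical and not part of EvoAug2, and the local `AugmentBase` is only a stand-in so that the snippet runs without the package installed; in real use you would subclass `evoaug.augment.AugmentBase` instead.

```python
import torch


class AugmentBase:
    """Local stand-in for evoaug.augment.AugmentBase (assumption, for a self-contained demo)."""

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError


class RandomMask(AugmentBase):
    """Hypothetical augmentation: zero out a random fraction of positions."""

    def __init__(self, mask_frac: float = 0.1):
        self.mask_frac = mask_frac

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        N, A, L = x.shape
        # One uniform draw per (sequence, position); True means "keep this position".
        keep = torch.rand(N, 1, L, device=x.device) >= self.mask_frac
        # Broadcasting over the A dimension masks every channel at a dropped position.
        return x * keep.to(x.dtype)


# Build a random one-hot batch with shape (N, A, L) = (8, 4, 50).
x = torch.nn.functional.one_hot(torch.randint(0, 4, (8, 50)), num_classes=4)
x = x.permute(0, 2, 1).float()
x_aug = RandomMask(mask_frac=0.2)(x)
```

The output keeps the shape (N, A, L), which is the one requirement the base class places on every augmentation.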
- class evoaug.augment.RandomDeletion(delete_min=0, delete_max=20)
Bases: AugmentBase
Randomly deletes contiguous stretches of nucleotides from sequences.
This augmentation randomly selects deletion lengths and positions for each sequence in a batch, then pads the deleted regions with random DNA to maintain the original sequence length L.
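- Parameters:
delete_min (int, optional) – Minimum deletion length. Defaults to 0.
delete_max (int, optional) – Maximum deletion length. Defaults to 20.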
Notes
Deletion positions are constrained to ensure the deletion window fits within the sequence boundaries
Random DNA padding is added equally to both ends of the deletion to maintain sequence length L
Each sequence in the batch receives a different random deletion
- __call__(x)
Randomly delete segments in a set of one-hot DNA sequences.
- Parameters:
x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).
- Returns:
Sequences with randomly deleted segments, padded with random DNA to maintain shape (N, A, L).
- Return type:
torch.Tensor
- class evoaug.augment.RandomInsertion(insert_min=0, insert_max=20)
Bases: AugmentBase
Randomly inserts contiguous stretches of random DNA into sequences.
This augmentation randomly selects insertion lengths and positions for each sequence in a batch, then trims the resulting sequences equally from both ends to maintain the original sequence length L.
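- Parameters:
insert_min (int, optional) – Minimum insertion length. Defaults to 0.
insert_max (int, optional) – Maximum insertion length. Defaults to 20.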
Notes
Insertion positions are randomly selected across the sequence length
Random DNA is generated using uniform nucleotide distribution
After insertion, sequences are trimmed equally from both ends to maintain sequence length L
Each sequence in the batch receives a different random insertion
- __call__(x)
Randomly insert segments of random DNA into DNA sequences.
- Parameters:
x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).
- Returns:
Sequences with randomly inserted segments of random DNA, trimmed to maintain shape (N, A, L).
- Return type:
torch.Tensor
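The insert-then-trim behaviour described for RandomInsertion can be sketched for a single sequence as follows. This is an illustrative reading of the notes, not the library's code: the helper name `insert_and_trim` is hypothetical, and giving the extra nucleotide to the right-hand trim when the insertion length is odd is an assumption.

```python
import torch


def insert_and_trim(seq: torch.Tensor, insert_len: int, position: int) -> torch.Tensor:
    """Insert random DNA into one (A, L) one-hot sequence, then trim back to length L."""
    A, L = seq.shape
    # Uniform random nucleotides, one-hot encoded as (A, insert_len).
    idx = torch.randint(0, A, (insert_len,), device=seq.device)
    insertion = torch.eye(A, dtype=seq.dtype, device=seq.device)[idx].T
    # Splice the insertion in at `position`; length is now L + insert_len.
    longer = torch.cat([seq[:, :position], insertion, seq[:, position:]], dim=1)
    # Trim (roughly) equally from both ends; for odd lengths the right end loses one more.
    left = insert_len // 2
    return longer[:, left:left + L]


seq = torch.eye(4)[torch.randint(0, 4, (30,))].T  # (A, L) = (4, 30)
out = insert_and_trim(seq, insert_len=6, position=10)
```

With `insert_len=6` and `position=10`, three columns are trimmed from each end, so the original sequence appears shifted by three positions on either side of the inserted block.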
- class evoaug.augment.RandomTranslocation(shift_min=0, shift_max=20)
Bases: AugmentBase
Randomly shifts sequences using circular roll transformations.
This augmentation applies random positive or negative shifts to each sequence in a batch, effectively cutting the sequence and reordering the pieces while maintaining the original sequence length L.
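- Parameters:
shift_min (int, optional) – Minimum shift size. Defaults to 0.
shift_max (int, optional) – Maximum shift size. Defaults to 20.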
Notes
Shifts are randomly chosen between shift_min and shift_max
Approximately half of the shifts are made negative to create both left and right circular shifts
Uses torch.roll for efficient implementation
Each sequence in the batch receives a different random shift
- __call__(x)
Randomly shift sequences in a batch using circular roll.
- Parameters:
x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).
- Returns:
Sequences with random circular shifts applied, maintaining shape (N, A, L).
- Return type:
torch.Tensor
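A minimal sketch of the per-sequence circular shift described above, using `torch.roll`. The function name `random_roll` is hypothetical, and the way signs are drawn (an independent coin flip per sequence) is an assumption about how "approximately half" of the shifts become negative.

```python
import torch


def random_roll(x: torch.Tensor, shift_min: int = 0, shift_max: int = 20) -> torch.Tensor:
    """Apply an independent random circular shift to each (A, L) sequence in the batch."""
    N, A, L = x.shape
    # One shift magnitude per sequence, with shift_max included.
    shifts = torch.randint(shift_min, shift_max + 1, (N,))
    # Random sign per sequence: +1 or -1 with equal probability.
    signs = torch.randint(0, 2, (N,)) * 2 - 1
    shifts = shifts * signs
    # Roll each sequence along its length dimension by its own shift.
    rolled = [torch.roll(seq, shifts=int(s), dims=-1) for seq, s in zip(x, shifts)]
    return torch.stack(rolled)


x = torch.eye(4)[torch.randint(0, 4, (5, 40))].permute(0, 2, 1)  # (N, A, L) = (5, 4, 40)
x_shifted = random_roll(x, shift_min=0, shift_max=10)
```

Because a roll only reorders positions, each sequence keeps its nucleotide composition and its length L.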
- class evoaug.augment.RandomInversion(invert_min=0, invert_max=20)
Bases: AugmentBase
Randomly inverts contiguous stretches of nucleotides in sequences.
This augmentation randomly selects inversion lengths and positions for each sequence in a batch, then applies a reverse-complement transformation to the selected region while maintaining the original sequence length L.
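- Parameters:
invert_min (int, optional) – Minimum inversion length. Defaults to 0.
invert_max (int, optional) – Maximum inversion length. Defaults to 20.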
Notes
Inversion positions are constrained to ensure the inversion window fits within the sequence boundaries
Applies reverse-complement transformation (flip both sequence and nucleotide dimensions)
Each sequence in the batch receives a different random inversion
- __call__(x)
Randomly invert segments of DNA sequences.
- Parameters:
x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).
- Returns:
Sequences with randomly inverted segments, maintaining shape (N, A, L).
- Return type:
torch.Tensor
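The windowed reverse-complement can be sketched as below. The helper names are hypothetical, and the sketch assumes the channel order A, C, G, T, so that flipping the nucleotide dimension maps A to T and C to G. It also assumes L is at least `invert_max`, so that every window fits inside the sequence.

```python
import torch


def invert_window(seq: torch.Tensor, start: int, length: int) -> torch.Tensor:
    """Reverse-complement the window [start, start + length) of one (A, L) sequence."""
    out = seq.clone()
    if length == 0:
        return out
    window = seq[:, start:start + length]
    # Flipping dim 1 reverses the order; flipping dim 0 complements (A<->T, C<->G).
    out[:, start:start + length] = torch.flip(window, dims=[0, 1])
    return out


def random_inversion(x: torch.Tensor, invert_min: int = 0, invert_max: int = 20) -> torch.Tensor:
    """Invert an independently chosen window in each sequence of an (N, A, L) batch."""
    N, A, L = x.shape
    lengths = torch.randint(invert_min, invert_max + 1, (N,))
    # Start positions are limited so that even the longest window stays in bounds.
    starts = torch.randint(0, L - invert_max + 1, (N,))
    return torch.stack(
        [invert_window(s, int(i), int(k)) for s, i, k in zip(x, starts, lengths)]
    )


seq = torch.eye(4)[torch.tensor([0, 0, 1, 2])].T   # "AACG" as (A, L)
inverted = invert_window(seq, start=1, length=3)   # "ACG" becomes its reverse complement "CGT"
```

In this example the sequence AACG becomes ACGT: the first nucleotide is untouched and the window ACG is replaced by CGT.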
- class evoaug.augment.RandomMutation(mut_frac=0.05)
Bases: AugmentBase
Randomly mutates nucleotides in sequences according to a mutation fraction.
This augmentation randomly selects positions in each sequence and replaces the nucleotides with random DNA, effectively introducing point mutations while maintaining the original sequence length L.
- Parameters:
mut_frac (float, optional) – Probability of mutation for each nucleotide. Defaults to 0.05.
Notes
The number of positions resampled is calculated as: round(mut_frac / 0.75 * L)
The division by 0.75 compensates for silent mutations: a uniformly drawn replacement matches the original nucleotide with probability 0.25, so only about 75% of resampled positions actually change
Random DNA is generated using uniform nucleotide distribution
Each sequence in the batch receives a different set of random mutations
- __call__(x)
Randomly introduce mutations to a set of one-hot DNA sequences.
- Parameters:
x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).
- Returns:
Sequences with randomly mutated DNA, maintaining shape (N, A, L).
- Return type:
torch.Tensor
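To make the 0.75 correction concrete, here is a small worked example of the formula in the notes. The helper name `num_resampled_positions` and the choice L = 200 are illustrative only; the parameter is named `mut_frac` as in the class signature.

```python
def num_resampled_positions(mut_frac: float, L: int) -> int:
    """Number of positions replaced with random DNA, per the formula in the notes."""
    return round(mut_frac / 0.75 * L)


L = 200
n = num_resampled_positions(0.05, L)   # round(13.33...) = 13 positions are resampled
# A uniform replacement differs from the original nucleotide 3 times out of 4,
# so the expected number of real changes is 0.75 * n.
expected_changes = 0.75 * n            # 9.75, close to the target 0.05 * 200 = 10
```

Without the correction, only 10 positions would be resampled and the expected number of real changes would drop to 7.5, below the requested 5% of the sequence.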
- class evoaug.augment.RandomRC(rc_prob=0.5)
Bases: AugmentBase
Randomly applies reverse-complement transformations to sequences.
This augmentation randomly selects sequences in a batch and applies a reverse-complement transformation with a specified probability. The transformation reverses both the sequence order and nucleotide identity while maintaining the original sequence length L.
- Parameters:
rc_prob (float, optional) – Probability to apply a reverse-complement transformation. Defaults to 0.5.
Notes
Each sequence is independently selected for transformation
Uses torch.flip with dims=[1,2] to reverse both sequence and nucleotide dimensions
Maintains original sequence length L
Useful for learning strand-invariant representations
- __init__(rc_prob=0.5)
Create random reverse-complement augmentation object.
- Parameters:
rc_prob (float) – Probability to apply reverse-complement transformation.
- __call__(x)
Randomly transform sequences with reverse-complement transformations.
- Parameters:
x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).
- Returns:
Sequences with random reverse-complements applied, maintaining shape (N, A, L).
- Return type:
torch.Tensor
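The `torch.flip` call mentioned in the notes can be illustrated with the sketch below. The function name `random_rc` is hypothetical, and the complement step relies on the assumption that channels are ordered A, C, G, T, so that reversing the nucleotide dimension swaps A with T and C with G.

```python
import torch


def random_rc(x: torch.Tensor, rc_prob: float = 0.5) -> torch.Tensor:
    """Reverse-complement each sequence of an (N, A, L) batch with probability rc_prob."""
    out = x.clone()
    # Independent selection per sequence.
    selected = torch.rand(x.shape[0]) < rc_prob
    if selected.any():
        # dims=[1, 2] flips the nucleotide axis (complement) and the length axis (reverse).
        out[selected] = torch.flip(x[selected], dims=[1, 2])
    return out


x = torch.eye(4)[torch.tensor([0, 0, 1, 2])].T.unsqueeze(0)  # "AACG" as (1, A, L)
x_rc = random_rc(x, rc_prob=1.0)                              # always flipped: "CGTT"
```

Applying the transformation twice with `rc_prob=1.0` returns the original batch, which is a quick sanity check that the flip really is a reverse complement.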
- class evoaug.augment.RandomNoise(noise_mean=0.0, noise_std=0.2)
Bases: AugmentBase
Randomly adds Gaussian noise to sequences.
This augmentation adds random Gaussian noise to each sequence in a batch, effectively introducing small perturbations to the one-hot encodings while maintaining the original sequence length L.
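- Parameters:
noise_mean (float, optional) – Mean of the Gaussian noise. Defaults to 0.0.
noise_std (float, optional) – Standard deviation of the Gaussian noise. Defaults to 0.2.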
Notes
Noise is sampled from a normal distribution with specified mean and standard deviation
Noise is added element-wise to the input tensor
Useful for improving model robustness to small perturbations
Each sequence in the batch receives different random noise
- __call__(x)
Randomly add Gaussian noise to a set of one-hot DNA sequences.
- Parameters:
x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).
- Returns:
Sequences with random noise added, maintaining shape (N, A, L).
- Return type:
torch.Tensor
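Element-wise Gaussian noise as described above can be sketched in a few lines. The function name `add_gaussian_noise` is hypothetical; the sketch assumes a floating-point input tensor.

```python
import torch


def add_gaussian_noise(x: torch.Tensor, noise_mean: float = 0.0, noise_std: float = 0.2) -> torch.Tensor:
    """Add independent N(noise_mean, noise_std^2) noise to every element of x."""
    # randn_like draws standard-normal values with the same shape, dtype, and device as x.
    noise = torch.randn_like(x) * noise_std + noise_mean
    return x + noise


x = torch.eye(4)[torch.randint(0, 4, (8, 50))].permute(0, 2, 1)  # (N, A, L) = (8, 4, 50)
x_noisy = add_gaussian_noise(x, noise_mean=0.0, noise_std=0.2)
```

Note that the result is no longer strictly one-hot: every entry, including the zeros, is perturbed.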
Training Module
The evoaug.evoaug module provides training utilities and the RobustLoader:
EvoAug2: PyTorch DataLoader implementation of EvoAug functionality.
This module provides the same augmentation capabilities as RobustModel but as a standalone PyTorch DataLoader that can be used with any model.
The RobustLoader inherits from DataLoader and can be used directly in PyTorch Lightning DataModules or vanilla PyTorch training loops.
Classes
- AugmentedGenomicDataset
Dataset wrapper that applies EvoAug augmentations on-the-fly.
- RobustLoader
DataLoader with built-in EvoAug augmentations.
- class evoaug.evoaug.AugmentedGenomicDataset(base_dataset: Dataset, augment_list: List[AugmentBase] = [], max_augs_per_seq: int = 0, hard_aug: bool = True, apply_augmentations: bool = True)
Bases: Dataset
PyTorch Dataset that applies EvoAug-style augmentations to genomic sequences.
This dataset wraps an existing dataset and applies augmentations on-the-fly during training, while optionally disabling them for validation/finetuning.
- Parameters:
base_dataset (torch.utils.data.Dataset) – The underlying dataset that provides (sequence, target) pairs.
augment_list (List[AugmentBase], optional) – List of data augmentations to apply. Defaults to empty list.
max_augs_per_seq (int, optional) – Maximum number of augmentations to apply per sequence. Defaults to 0.
hard_aug (bool, optional) – If True, always apply exactly max_augs_per_seq augmentations. If False, randomly sample 1 to max_augs_per_seq augmentations. Defaults to True.
apply_augmentations (bool, optional) – Whether to apply augmentations. Can be toggled for finetuning. Defaults to True.
Notes
The dataset automatically detects the maximum insertion length from augmentations
Augmentations can be enabled/disabled at runtime using enable_augmentations() and disable_augmentations() methods
Each sequence receives a different random combination of augmentations
- __init__(base_dataset: Dataset, augment_list: List[AugmentBase] = [], max_augs_per_seq: int = 0, hard_aug: bool = True, apply_augmentations: bool = True)
- __len__()
Return the number of samples in the dataset.
- Returns:
Number of samples in the base dataset.
- Return type:
int
- __getitem__(idx)
Get a single sample from the dataset.
- Parameters:
idx (int) – Index of the sample to retrieve.
- Returns:
(augmented_sequence, target) if the base dataset provides a target; otherwise augmented_sequence alone.
- Return type:
tuple or torch.Tensor
Notes
Sequences are augmented on-the-fly if augmentations are enabled
Augmentations preserve the original sequence length L
Each call may produce different augmentations due to randomness
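The effect of `hard_aug` and `max_augs_per_seq` on how many augmentations a sequence receives can be sketched as follows. This is an illustration of the documented behaviour, not the library's code: the function name `sample_augmentations` is hypothetical, and sampling without replacement and capping the count at the length of `augment_list` are both assumptions.

```python
import random


def sample_augmentations(augment_list, max_augs_per_seq, hard_aug=True, rng=random):
    """Pick the augmentations to apply to one sequence."""
    if not augment_list or max_augs_per_seq <= 0:
        return []  # nothing to apply
    # Assumption: never request more augmentations than are available.
    cap = min(max_augs_per_seq, len(augment_list))
    # hard_aug=True: always use the maximum; otherwise draw a count from 1 to the maximum.
    num_augs = cap if hard_aug else rng.randint(1, cap)
    # Assumption: each augmentation is used at most once per sequence.
    return rng.sample(augment_list, num_augs)


augs = ["mutation", "deletion", "insertion"]  # placeholders for AugmentBase instances
rng = random.Random(0)
hard_choice = sample_augmentations(augs, max_augs_per_seq=2, hard_aug=True, rng=rng)
soft_choice = sample_augmentations(augs, max_augs_per_seq=2, hard_aug=False, rng=rng)
```

With `hard_aug=True` the result always contains two augmentations here, while with `hard_aug=False` it contains one or two.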
- class evoaug.evoaug.RobustLoader(base_dataset: Dataset, augment_list: List[AugmentBase] = [], max_augs_per_seq: int = 0, hard_aug: bool = True, batch_size: int = 32, shuffle: bool = True, num_workers: int = 4, **kwargs)
Bases: DataLoader
EvoAug2 DataLoader that inherits from the PyTorch DataLoader.
This class provides a DataLoader with built-in EvoAug augmentations that can be used in a PyTorch Lightning DataModule or directly in a vanilla PyTorch training loop.
- Parameters:
base_dataset (torch.utils.data.Dataset) – The underlying dataset that provides (sequence, target) pairs.
augment_list (List[AugmentBase], optional) – List of augmentations to apply. Defaults to empty list.
max_augs_per_seq (int, optional) – Maximum augmentations per sequence. Defaults to 0.
hard_aug (bool, optional) – If True, always apply exactly max_augs_per_seq augmentations. If False, randomly sample 1 to max_augs_per_seq augmentations. Defaults to True.
batch_size (int, optional) – Batch size for the DataLoader. Defaults to 32.
shuffle (bool, optional) – Whether to shuffle the data. Defaults to True.
num_workers (int, optional) – Number of worker processes. Defaults to 4.
**kwargs – Additional arguments passed to DataLoader.
Notes
The RobustLoader automatically creates an AugmentedGenomicDataset wrapper
Augmentations can be enabled/disabled at runtime using enable_augmentations() and disable_augmentations() methods
All augmentations preserve sequence length L for consistent batch shapes
- __init__(base_dataset: Dataset, augment_list: List[AugmentBase] = [], max_augs_per_seq: int = 0, hard_aug: bool = True, batch_size: int = 32, shuffle: bool = True, num_workers: int = 4, **kwargs)
- enable_augmentations()
Enable augmentations for training.
Notes
This method enables augmentations on the underlying dataset, allowing them to be applied during training.
- disable_augmentations()
Disable augmentations for finetuning/validation.
Notes
This method disables augmentations on the underlying dataset, useful for validation, testing, or finetuning on original data.
- set_augmentations(augment_list: List[AugmentBase], max_augs_per_seq: int = 0, hard_aug: bool = True)
Update the augmentation settings.
- Parameters:
augment_list (List[AugmentBase]) – New list of augmentations to apply.
max_augs_per_seq (int, optional) – New maximum augmentations per sequence. Defaults to 0.
hard_aug (bool, optional) – New hard augmentation setting: if True, always apply exactly max_augs_per_seq augmentations. Defaults to True.
Notes
This method allows dynamic updating of augmentation parameters without recreating the entire DataLoader.
Usage Examples
Basic Augmentation:
from evoaug.augment import RandomMutation, RandomDeletion
from evoaug.evoaug import RobustLoader

# Create augmentations
mutation = RandomMutation(mut_frac=0.05)
deletion = RandomDeletion(delete_min=0, delete_max=20)

# Create RobustLoader; `dataset` is any torch.utils.data.Dataset of (sequence, target) pairs
loader = RobustLoader(
    base_dataset=dataset,
    augment_list=[mutation, deletion],
    max_augs_per_seq=2,
    hard_aug=True,
    batch_size=32
)
Training Loop:
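# model, criterion, and optimizer are assumed to be defined elsewhere
for batch_seqs, batch_labels in loader:
    optimizer.zero_grad()  # clear gradients left over from the previous step
    # Augmentations are applied automatically by the loader
    outputs = model(batch_seqs)
    loss = criterion(outputs, batch_labels)
    loss.backward()
    optimizer.step()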