EvoAug2 Core API

This page documents the core EvoAug2 modules for augmentations and training.

Augmentation Module

The evoaug.augment module provides evolution-inspired data augmentation techniques:

Library of data augmentations for genomic sequence data.

This module provides evolution-inspired data augmentation techniques for genomic sequences, ensuring that all augmentations preserve the input sequence length L.

To contribute a custom augmentation, use the following syntax:

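import torch
from evoaug.augment import AugmentBase

class CustomAugmentation(AugmentBase):
    def __init__(self, param1, param2):
        self.param1 = param1
        self.param2 = param2

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        # Perform the augmentation here; the output must keep shape (N, A, L)
        x_aug = x.clone()  # placeholder: replace with the actual transformation
        return x_aug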

class evoaug.augment.AugmentBase

Bases: object

Base class for EvoAug augmentations for genomic sequences.

All augmentation classes should inherit from this base class and implement the __call__() method to ensure consistent interface.

__call__(x)

Return an augmented version of x.

Parameters: x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L), where N is the batch size, A is the number of nucleotides (4 for DNA), and L is the sequence length.
Returns: Batch of one-hot sequences with random augmentation applied. Output shape must be (N, A, L) to maintain sequence length consistency.
Return type: torch.Tensor
Raises: NotImplementedError – If the augmentation class does not implement this method.

class evoaug.augment.RandomDeletion(delete_min=0, delete_max=20)

Bases: AugmentBase

Randomly deletes contiguous stretches of nucleotides from sequences.

This augmentation randomly selects deletion lengths and positions for each sequence in a batch, then pads the deleted regions with random DNA to maintain the original sequence length L.

Parameters:

delete_min (int, optional) – Minimum size for random deletion. Defaults to 0.
delete_max (int, optional) – Maximum size for random deletion. Defaults to 20.

Notes

Deletion positions are constrained to ensure the deletion window fits within the sequence boundaries
Random DNA padding is added equally to both ends of the deletion to maintain sequence length L
Each sequence in the batch receives a different random deletion

__init__(delete_min=0, delete_max=20)

__call__(x)

Randomly delete segments in a set of one-hot DNA sequences.

Parameters: x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).
Returns: Sequences with randomly deleted segments, padded with random DNA to maintain shape (N, A, L).
Return type: torch.Tensor

class evoaug.augment.RandomInsertion(insert_min=0, insert_max=20)

Bases: AugmentBase

Randomly inserts contiguous stretches of random DNA into sequences.

This augmentation randomly selects insertion lengths and positions for each sequence in a batch, then trims the resulting sequences equally from both ends to maintain the original sequence length L.

Parameters:

insert_min (int, optional) – Minimum size for random insertion. Defaults to 0.
insert_max (int, optional) – Maximum size for random insertion. Defaults to 20.

Notes

Insertion positions are randomly selected across the sequence length
Random DNA is generated using uniform nucleotide distribution
After insertion, sequences are trimmed equally from both ends to maintain sequence length L
Each sequence in the batch receives a different random insertion

__init__(insert_min=0, insert_max=20)

__call__(x)

Randomly insert segments of random DNA into DNA sequences.

Parameters: x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).
Returns: Sequences with randomly inserted segments of random DNA, trimmed to maintain shape (N, A, L).
Return type: torch.Tensor
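
The insert-then-trim behaviour described for RandomInsertion can be sketched for a single (A, L) sequence as follows. This is a minimal illustration rather than the library's implementation: the helper names random_dna and insert_and_trim are hypothetical, and how an odd insertion length is split between the two ends is an assumption.

```python
import torch
import torch.nn.functional as F


def random_dna(length: int, alphabet: int = 4) -> torch.Tensor:
    """Return uniformly random one-hot DNA with shape (alphabet, length)."""
    idx = torch.randint(0, alphabet, (length,))
    return F.one_hot(idx, num_classes=alphabet).T.float()


def insert_and_trim(seq: torch.Tensor, insert_len: int, index: int) -> torch.Tensor:
    """Insert random DNA at `index`, then trim both ends back to length L."""
    A, L = seq.shape
    longer = torch.cat(
        [seq[:, :index], random_dna(insert_len, A), seq[:, index:]], dim=-1
    )
    # Assumption: when insert_len is odd, the extra position is trimmed from the right.
    left = insert_len // 2
    return longer[:, left:left + L]


seq = random_dna(50)                                # one sequence, shape (4, 50)
aug = insert_and_trim(seq, insert_len=7, index=20)  # still shape (4, 50)
```

Because the trim removes positions from both flanks, the nucleotides near the sequence ends are the ones lost when an insertion is applied.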

class evoaug.augment.RandomTranslocation(shift_min=0, shift_max=20)

Bases: AugmentBase

Randomly shifts sequences using circular roll transformations.

This augmentation applies random positive or negative shifts to each sequence in a batch, effectively cutting the sequence and reordering the pieces while maintaining the original sequence length L.

Parameters:

shift_min (int, optional) – Minimum size for random shift. Defaults to 0.
shift_max (int, optional) – Maximum size for random shift. Defaults to 20.

Notes

Shifts are randomly chosen between shift_min and shift_max
Approximately half of the shifts are made negative to create both left and right circular shifts
Uses torch.roll for efficient implementation
Each sequence in the batch receives a different random shift

__init__(shift_min=0, shift_max=20)

__call__(x)

Randomly shift sequences in a batch using circular roll.

Parameters: x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).
Returns: Sequences with random circular shifts applied, maintaining shape (N, A, L).
Return type: torch.Tensor
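
A per-sequence circular shift with torch.roll, as described in the notes for RandomTranslocation, might look like the sketch below. The function name random_roll is hypothetical, and treating shift_max as inclusive is an assumption, not something the documentation states.

```python
import torch


def random_roll(x: torch.Tensor, shift_min: int = 0, shift_max: int = 20) -> torch.Tensor:
    """Circularly shift each sequence of a (N, A, L) batch by its own random offset."""
    n = x.shape[0]
    # One shift magnitude per sequence (assumption: shift_max is inclusive).
    shifts = torch.randint(shift_min, shift_max + 1, (n,))
    # Random signs of -1 or +1, so roughly half the shifts go in each direction.
    signs = torch.randint(0, 2, (n,)) * 2 - 1
    shifts = shifts * signs
    rolled = [torch.roll(seq, shifts=int(s), dims=-1) for seq, s in zip(x, shifts)]
    return torch.stack(rolled)


# Random one-hot batch with shape (8, 4, 30)
x = torch.eye(4)[torch.randint(0, 4, (8, 30))].permute(0, 2, 1)
x_aug = random_roll(x, shift_min=0, shift_max=20)
```

Rolling along the last dimension moves positions that fall off one end back onto the other, so no nucleotides are lost and the length L is unchanged.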

class evoaug.augment.RandomInversion(invert_min=0, invert_max=20)

Bases: AugmentBase

Randomly inverts contiguous stretches of nucleotides in sequences.

This augmentation randomly selects inversion lengths and positions for each sequence in a batch, then applies a reverse-complement transformation to the selected region while maintaining the original sequence length L.

Parameters:

invert_min (int, optional) – Minimum size for random inversion. Defaults to 0.
invert_max (int, optional) – Maximum size for random inversion. Defaults to 20.

Notes

Inversion positions are constrained to ensure the inversion window fits within the sequence boundaries
Applies reverse-complement transformation (flip both sequence and nucleotide dimensions)
Each sequence in the batch receives a different random inversion

__init__(invert_min=0, invert_max=20)

__call__(x)

Randomly invert segments of DNA sequences.

Parameters: x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).
Returns: Sequences with randomly inverted segments, maintaining shape (N, A, L).
Return type: torch.Tensor
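
The windowed reverse-complement described for RandomInversion can be illustrated with the sketch below. It assumes the channel order A, C, G, T, so that flipping the channel axis swaps A with T and C with G. The names invert_window and random_inversion are hypothetical, and treating invert_max as inclusive is an assumption.

```python
import torch


def invert_window(seq: torch.Tensor, start: int, length: int) -> torch.Tensor:
    """Reverse-complement seq[:, start:start+length] of a one-hot (A, L) sequence."""
    out = seq.clone()
    window = seq[:, start:start + length]
    # Flipping dim 1 reverses the order; flipping dim 0 complements (A<->T, C<->G).
    out[:, start:start + length] = torch.flip(window, dims=[0, 1])
    return out


def random_inversion(x: torch.Tensor, invert_min: int = 0, invert_max: int = 20) -> torch.Tensor:
    """Apply an independent random inversion to each sequence of a (N, A, L) batch."""
    L = x.shape[-1]
    out = []
    for seq in x:
        length = int(torch.randint(invert_min, invert_max + 1, (1,)))
        # Constrain the start so the window fits inside the sequence.
        start = int(torch.randint(0, L - length + 1, (1,)))
        out.append(invert_window(seq, start, length))
    return torch.stack(out)


# "ACGTAC" as one-hot with channel order A, C, G, T
seq = torch.eye(4)[torch.tensor([0, 1, 2, 3, 0, 1])].T
inverted = invert_window(seq, start=1, length=3)  # "CGT" becomes "ACG"
```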

class evoaug.augment.RandomMutation(mut_frac=0.05)

Bases: AugmentBase

Randomly mutates nucleotides in sequences according to a mutation fraction.

This augmentation randomly selects positions in each sequence and replaces the nucleotides with random DNA, effectively introducing point mutations while maintaining the original sequence length L.

Parameters: mut_frac (float, optional) – Probability of mutation for each nucleotide. Defaults to 0.05.

Notes

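The actual number of mutations is calculated as: round(mut_frac / 0.75 * L)
The division by 0.75 accounts for silent mutations: a uniformly random replacement matches the original nucleotide with probability 0.25, so only about 75% of the selected positions actually change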
Random DNA is generated using uniform nucleotide distribution
Each sequence in the batch receives a different set of random mutations

__init__(mut_frac=0.05)

__call__(x)

Randomly introduce mutations to a set of one-hot DNA sequences.

Parameters: x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).
Returns: Sequences with randomly mutated DNA, maintaining shape (N, A, L).
Return type: torch.Tensor
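
The mutation-count formula from the notes can be checked with a small standard-library sketch. It works on plain strings rather than one-hot tensors purely for readability. The helper names num_mutations and mutate are hypothetical and this is not the library's implementation.

```python
import random


def num_mutations(mut_frac: float, seq_len: int) -> int:
    """Number of positions to overwrite, following round(mut_frac / 0.75 * L)."""
    return round(mut_frac / 0.75 * seq_len)


def mutate(seq: str, mut_frac: float, rng: random.Random) -> str:
    """Overwrite randomly chosen positions with uniformly random nucleotides."""
    chars = list(seq)
    positions = rng.sample(range(len(chars)), num_mutations(mut_frac, len(chars)))
    for p in positions:
        # The new base equals the old one a quarter of the time (a silent mutation).
        chars[p] = rng.choice("ACGT")
    return "".join(chars)


rng = random.Random(0)
original = "ACGT" * 50          # L = 200
mutated = mutate(original, mut_frac=0.05, rng=rng)
n_selected = num_mutations(0.05, 200)  # 13 positions are overwritten
```

With L = 200 and mut_frac = 0.05, 13 positions are overwritten, and on average three quarters of them (about 10, i.e. roughly 5% of the sequence) end up different from the original.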

class evoaug.augment.RandomRC(rc_prob=0.5)

Bases: AugmentBase

Randomly applies reverse-complement transformations to sequences.

This augmentation randomly selects sequences in a batch and applies a reverse-complement transformation with a specified probability. The transformation reverses both the sequence order and nucleotide identity while maintaining the original sequence length L.

Parameters: rc_prob (float, optional) – Probability to apply a reverse-complement transformation. Defaults to 0.5.

Notes

Each sequence is independently selected for transformation
Uses torch.flip with dims=[1,2] to reverse both sequence and nucleotide dimensions
Maintains original sequence length L
Useful for learning strand-invariant representations

__init__(rc_prob=0.5)

Create random reverse-complement augmentation object.

Parameters: rc_prob (float) – Probability to apply reverse-complement transformation.

__call__(x)

Randomly transform sequences with reverse-complement transformations.

Parameters: x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).
Returns: Sequences with random reverse-complements applied, maintaining shape (N, A, L).
Return type: torch.Tensor
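
The torch.flip approach mentioned in the notes for RandomRC can be sketched as below. It assumes the channel order A, C, G, T, under which reversing the channel axis is equivalent to complementing. The function name random_rc is hypothetical.

```python
import torch


def random_rc(x: torch.Tensor, rc_prob: float = 0.5) -> torch.Tensor:
    """Reverse-complement each sequence of a (N, A, L) batch with probability rc_prob."""
    x_aug = x.clone()
    # Independent coin flip per sequence.
    mask = torch.rand(x.shape[0]) < rc_prob
    # dims=[1, 2]: flip the channel axis (complement) and the position axis (reverse).
    x_aug[mask] = torch.flip(x_aug[mask], dims=[1, 2])
    return x_aug


# "AACG" as a one-hot batch of size 1 with channel order A, C, G, T
x = torch.eye(4)[torch.tensor([[0, 0, 1, 2]])].permute(0, 2, 1)
x_rc = random_rc(x, rc_prob=1.0)  # always flipped: "AACG" becomes "CGTT"
```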

class evoaug.augment.RandomNoise(noise_mean=0.0, noise_std=0.2)

Bases: AugmentBase

Randomly adds Gaussian noise to sequences.

This augmentation adds random Gaussian noise to each sequence in a batch, effectively introducing small perturbations to the one-hot encodings while maintaining the original sequence length L.

Parameters:

noise_mean (float, optional) – Mean of the Gaussian noise. Defaults to 0.0.
noise_std (float, optional) – Standard deviation of the Gaussian noise. Defaults to 0.2.

Notes

Noise is sampled from a normal distribution with specified mean and standard deviation
Noise is added element-wise to the input tensor
Useful for improving model robustness to small perturbations
Each sequence in the batch receives different random noise

__init__(noise_mean=0.0, noise_std=0.2)

__call__(x)

Randomly add Gaussian noise to a set of one-hot DNA sequences.

Parameters: x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).
Returns: Sequences with random noise added, maintaining shape (N, A, L).
Return type: torch.Tensor
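
Element-wise Gaussian noise of the kind described for RandomNoise can be sketched in one line. The function name add_gaussian_noise is hypothetical, and the library may sample the noise differently.

```python
import torch


def add_gaussian_noise(
    x: torch.Tensor, noise_mean: float = 0.0, noise_std: float = 0.2
) -> torch.Tensor:
    """Add independent N(noise_mean, noise_std^2) noise to every element of x."""
    return x + noise_mean + noise_std * torch.randn_like(x)


# Random one-hot batch with shape (8, 4, 30), as float so noise can be added
x = torch.eye(4)[torch.randint(0, 4, (8, 30))].permute(0, 2, 1)
x_noisy = add_gaussian_noise(x)
```

The result keeps shape (N, A, L) but is no longer strictly one-hot, since every entry becomes a continuous value.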

Training Module

The evoaug.evoaug module provides training utilities and the RobustLoader:

EvoAug2: PyTorch DataLoader implementation of EvoAug functionality.

This module provides the same augmentation capabilities as RobustModel but as a standalone PyTorch DataLoader that can be used with any model.

The RobustLoader inherits from DataLoader and can be used directly in PyTorch Lightning DataModules or vanilla PyTorch training loops.

Classes

AugmentedGenomicDataset: Dataset wrapper that applies EvoAug augmentations on-the-fly.
RobustLoader: DataLoader with built-in EvoAug augmentations.

class evoaug.evoaug.AugmentedGenomicDataset(base_dataset: Dataset, augment_list: List[AugmentBase] = [], max_augs_per_seq: int = 0, hard_aug: bool = True, apply_augmentations: bool = True)

Bases: Dataset

PyTorch Dataset that applies EvoAug-style augmentations to genomic sequences.

This dataset wraps an existing dataset and applies augmentations on-the-fly during training, while optionally disabling them for validation/finetuning.

Parameters:

base_dataset (torch.utils.data.Dataset) – The underlying dataset that provides (sequence, target) pairs.
augment_list (List[AugmentBase], optional) – List of data augmentations to apply. Defaults to empty list.
max_augs_per_seq (int, optional) – Maximum number of augmentations to apply per sequence. Defaults to 0.
hard_aug (bool, optional) – If True, always apply exactly max_augs_per_seq augmentations. If False, randomly sample 1 to max_augs_per_seq augmentations. Defaults to True.
apply_augmentations (bool, optional) – Whether to apply augmentations. Can be toggled for finetuning. Defaults to True.

Notes

The dataset automatically detects the maximum insertion length from augmentations
Augmentations can be enabled/disabled at runtime using enable_augmentations() and disable_augmentations() methods
Each sequence receives a different random combination of augmentations

__init__(base_dataset: Dataset, augment_list: List[AugmentBase] = [], max_augs_per_seq: int = 0, hard_aug: bool = True, apply_augmentations: bool = True)

__len__()

Return the number of samples in the dataset.

Returns: Number of samples in the base dataset.
Return type: int

__getitem__(idx)

Get a single sample from the dataset.

Parameters: idx (int) – Index of the sample to retrieve.
Returns: (augmented_sequence, target) if the base dataset provides a target; otherwise augmented_sequence alone.
Return type: torch.Tensor or tuple

Notes

Sequences are augmented on-the-fly if augmentations are enabled
Augmentations preserve the original sequence length L
Each call may produce different augmentations due to randomness

enable_augmentations()

Enable augmentations for training.

Notes

This method allows augmentations to be applied during training while keeping them disabled for validation/finetuning.

disable_augmentations()

Disable augmentations for finetuning/validation.

Notes

This method prevents augmentations from being applied, useful for validation, testing, or finetuning on original data.
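
To make the on-the-fly wrapping and the enable/disable toggle concrete, here is a simplified, self-contained sketch of a dataset wrapper in the same spirit. The class name ToggleAugmentDataset is hypothetical and it is not the AugmentedGenomicDataset implementation. In particular, applying each augmentation to a single sequence by temporarily adding a batch dimension is an assumption.

```python
import random

import torch
from torch.utils.data import Dataset, TensorDataset


class ToggleAugmentDataset(Dataset):
    """Wrap a (sequence, target) dataset and augment sequences on access."""

    def __init__(self, base, augment_list, max_augs_per_seq, hard_aug=True):
        self.base = base
        self.augment_list = list(augment_list)
        self.max_augs_per_seq = min(max_augs_per_seq, len(self.augment_list))
        self.hard_aug = hard_aug
        self.apply_augmentations = True

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        x, y = self.base[idx]
        if self.apply_augmentations and self.max_augs_per_seq > 0:
            # hard_aug: always the maximum; otherwise between 1 and the maximum.
            k = self.max_augs_per_seq
            if not self.hard_aug:
                k = random.randint(1, k)
            for aug in random.sample(self.augment_list, k):
                # Augmentations expect (N, A, L), so add and remove a batch axis.
                x = aug(x.unsqueeze(0)).squeeze(0)
        return x, y

    def enable_augmentations(self):
        self.apply_augmentations = True

    def disable_augmentations(self):
        self.apply_augmentations = False


seqs = torch.eye(4)[torch.randint(0, 4, (5, 12))].permute(0, 2, 1)  # (5, 4, 12)
labels = torch.arange(5).float()
reverse_complement = lambda batch: torch.flip(batch, dims=[1, 2])  # stand-in augmentation

ds = ToggleAugmentDataset(TensorDataset(seqs, labels), [reverse_complement], max_augs_per_seq=1)
x_train, _ = ds[0]           # augmented
ds.disable_augmentations()
x_val, _ = ds[0]             # original sequence, untouched
```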

class evoaug.evoaug.RobustLoader(base_dataset: Dataset, augment_list: List[AugmentBase] = [], max_augs_per_seq: int = 0, hard_aug: bool = True, batch_size: int = 32, shuffle: bool = True, num_workers: int = 4, **kwargs)

Bases: DataLoader

EvoAug2 DataLoader that inherits from PyTorch DataLoader.

This class provides a DataLoader with built-in EvoAug augmentations that can be used inside a PyTorch Lightning DataModule (pl.LightningDataModule) or directly in a vanilla PyTorch training loop.

Parameters:

base_dataset (torch.utils.data.Dataset) – The underlying dataset that provides (sequence, target) pairs.
augment_list (List[AugmentBase], optional) – List of augmentations to apply. Defaults to empty list.
max_augs_per_seq (int, optional) – Maximum augmentations per sequence. Defaults to 0.
hard_aug (bool, optional) – Whether to use hard augmentation count. Defaults to True.
batch_size (int, optional) – Batch size for the DataLoader. Defaults to 32.
shuffle (bool, optional) – Whether to shuffle the data. Defaults to True.
num_workers (int, optional) – Number of worker processes. Defaults to 4.
**kwargs – Additional arguments passed to DataLoader.

Notes

The RobustLoader automatically creates an AugmentedGenomicDataset wrapper
Augmentations can be enabled/disabled at runtime using enable_augmentations() and disable_augmentations() methods
All augmentations preserve sequence length L for consistent batch shapes

__init__(base_dataset: Dataset, augment_list: List[AugmentBase] = [], max_augs_per_seq: int = 0, hard_aug: bool = True, batch_size: int = 32, shuffle: bool = True, num_workers: int = 4, **kwargs)

enable_augmentations()

Enable augmentations for training.

Notes

This method enables augmentations on the underlying dataset, allowing them to be applied during training.

disable_augmentations()

Disable augmentations for finetuning/validation.

Notes

This method disables augmentations on the underlying dataset, useful for validation, testing, or finetuning on original data.

set_augmentations(augment_list: List[AugmentBase], max_augs_per_seq: int = 0, hard_aug: bool = True)

Update the augmentation settings.

Parameters:

augment_list (List[AugmentBase]) – New list of augmentations to apply.
max_augs_per_seq (int, optional) – New maximum augmentations per sequence. Defaults to 0.
hard_aug (bool, optional) – New hard augmentation setting. Defaults to True.

Notes

This method allows dynamic updating of augmentation parameters without recreating the entire DataLoader.

Usage Examples

Basic Augmentation:

from evoaug.augment import RandomMutation, RandomDeletion
from evoaug.evoaug import RobustLoader

# Create augmentations
mutation = RandomMutation(mut_frac=0.05)
deletion = RandomDeletion(delete_min=0, delete_max=20)

# Create RobustLoader (dataset is any torch Dataset that yields (sequence, target) pairs)
loader = RobustLoader(
    base_dataset=dataset,
    augment_list=[mutation, deletion],
    max_augs_per_seq=2,
    hard_aug=True,
    batch_size=32
)

Training Loop:

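# model, criterion, and optimizer are assumed to be defined beforehand
for batch_seqs, batch_labels in loader:
    # Augmentations are applied automatically as each batch is loaded
    optimizer.zero_grad()  # clear gradients left over from the previous step
    outputs = model(batch_seqs)
    loss = criterion(outputs, batch_labels)
    loss.backward()
    optimizer.step()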