EvoAug2 API Reference

This page provides comprehensive API documentation for the EvoAug2 package.

Package Overview

The EvoAug2 package consists of several core modules:

  • `evoaug.augment`: Core augmentation classes for genomic sequences

  • `evoaug.evoaug`: Main training utilities and RobustLoader

  • `evoaug_utils.model_zoo`: Pre-built model architectures

  • `evoaug_utils.utils`: Utility functions for data handling and evaluation

Core Modules

Augmentation Classes

The evoaug.augment module provides the core augmentation classes:

Base Augmentation:

Sequence Mutations:

class evoaug.augment.RandomMutation(mut_frac=0.05)[source]

Bases: AugmentBase

Randomly mutates nucleotides in sequences according to a mutation fraction.

This augmentation randomly selects positions in each sequence and replaces the nucleotides with random DNA, effectively introducing point mutations while maintaining the original sequence length L.

Parameters:

mut_frac (float, optional) – Probability of mutation for each nucleotide. Defaults to 0.05.

Notes

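  • The actual number of mutations is calculated as: round(mut_frac / 0.75 * L)

  • The division by 0.75 accounts for silent mutations: a uniformly random replacement matches the original nucleotide one time in four, so only three quarters of the selected positions actually change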

  • Random DNA is generated using uniform nucleotide distribution

  • Each sequence in the batch receives a different set of random mutations
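As a worked example of the count formula in the notes above, the following sketch computes how many positions are overwritten and checks by simulation that roughly mut_frac of the positions end up changed. It operates on plain strings rather than one-hot tensors, the helper names (`num_mutations`, `mutate`) are illustrative and not part of the package, and sampling positions without replacement is an assumption.

```python
import random

def num_mutations(mut_frac: float, seq_len: int) -> int:
    """Positions to overwrite so that about mut_frac of the sequence changes.

    A uniformly random replacement base equals the original base with
    probability 1/4, so only 3/4 of the overwritten positions change.
    """
    return round(mut_frac / 0.75 * seq_len)

def mutate(seq: str, mut_frac: float, rng: random.Random) -> str:
    """Overwrite randomly chosen positions with uniformly random bases."""
    bases = list(seq)
    # Assumption: positions are drawn without replacement.
    for pos in rng.sample(range(len(seq)), num_mutations(mut_frac, len(seq))):
        bases[pos] = rng.choice("ACGT")
    return "".join(bases)

rng = random.Random(0)
original = "".join(rng.choice("ACGT") for _ in range(20_000))
mutated = mutate(original, 0.05, rng)
changed_frac = sum(a != b for a, b in zip(original, mutated)) / len(original)

print(num_mutations(0.05, 200))   # round(13.33...) = 13
print(round(changed_frac, 3))     # close to 0.05
```

For L = 200 and mut_frac = 0.05, 13 positions are overwritten and about 10 of them (0.05 * 200) are expected to differ from the original.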

__init__(mut_frac=0.05)[source]

__call__(x)[source]

Randomly introduce mutations to a set of one-hot DNA sequences.

Parameters:

x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).

Returns:

Sequences with randomly mutated DNA, maintaining shape (N, A, L).

Return type:

torch.Tensor

class evoaug.augment.RandomDeletion(delete_min=0, delete_max=20)[source]

Bases: AugmentBase

Randomly deletes contiguous stretches of nucleotides from sequences.

This augmentation randomly selects deletion lengths and positions for each sequence in a batch, then pads the deleted regions with random DNA to maintain the original sequence length L.

Parameters:
  • delete_min (int, optional) – Minimum size for random deletion. Defaults to 0.

  • delete_max (int, optional) – Maximum size for random deletion. Defaults to 20.

Notes

  • Deletion positions are constrained to ensure the deletion window fits within the sequence boundaries

  • Random DNA padding is added equally to both ends of the deletion to maintain sequence length L

  • Each sequence in the batch receives a different random deletion
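The delete-then-pad behaviour described above can be sketched on a plain string. This is a minimal illustration, not the package's tensor implementation: the function name `random_deletion`, the inclusive length bounds, and the reading of "both ends" as the two ends of the shortened sequence are all assumptions.

```python
import random

def random_deletion(seq: str, delete_min: int, delete_max: int,
                    rng: random.Random) -> str:
    """Delete one random stretch, then pad with random DNA back to len(seq)."""
    seq_len = len(seq)
    delete_len = rng.randint(delete_min, delete_max)   # inclusive bounds (assumed)
    start = rng.randint(0, seq_len - delete_len)       # window stays inside seq
    shortened = seq[:start] + seq[start + delete_len:]
    pad_left = delete_len // 2                         # split the padding evenly
    pad_right = delete_len - pad_left
    left = "".join(rng.choice("ACGT") for _ in range(pad_left))
    right = "".join(rng.choice("ACGT") for _ in range(pad_right))
    return left + shortened + right

rng = random.Random(1)
original = "ACGT" * 50                                  # L = 200
deleted = random_deletion(original, 0, 20, rng)
print(len(original), len(deleted))                      # both lengths are 200
```

Because the padding length always equals the deletion length, the output length matches the input regardless of which deletion size is drawn.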

__init__(delete_min=0, delete_max=20)[source]

__call__(x)[source]

Randomly delete segments in a set of one-hot DNA sequences.

Parameters:

x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).

Returns:

Sequences with randomly deleted segments, padded with random DNA to maintain shape (N, A, L).

Return type:

torch.Tensor

class evoaug.augment.RandomInsertion(insert_min=0, insert_max=20)[source]

Bases: AugmentBase

Randomly inserts contiguous stretches of random DNA into sequences.

This augmentation randomly selects insertion lengths and positions for each sequence in a batch, then trims the resulting sequences equally from both ends to maintain the original sequence length L.

Parameters:
  • insert_min (int, optional) – Minimum size for random insertion. Defaults to 0.

  • insert_max (int, optional) – Maximum size for random insertion. Defaults to 20.

Notes

  • Insertion positions are randomly selected across the sequence length

  • Random DNA is generated using uniform nucleotide distribution

  • After insertion, sequences are trimmed equally from both ends to maintain sequence length L

  • Each sequence in the batch receives a different random insertion
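The insert-then-trim behaviour in the notes above can be sketched on a plain string. The name `random_insertion`, the inclusive bounds, and the way an odd trim length is split between the two ends are assumptions made for illustration; the package itself works on one-hot tensors.

```python
import random

def random_insertion(seq: str, insert_min: int, insert_max: int,
                     rng: random.Random) -> str:
    """Insert random DNA at a random position, then trim back to len(seq)."""
    seq_len = len(seq)
    insert_len = rng.randint(insert_min, insert_max)   # inclusive bounds (assumed)
    pos = rng.randint(0, seq_len)                      # any gap, including the ends
    insert = "".join(rng.choice("ACGT") for _ in range(insert_len))
    longer = seq[:pos] + insert + seq[pos:]
    trim_left = insert_len // 2                        # remainder is trimmed on the right
    return longer[trim_left:trim_left + seq_len]

rng = random.Random(2)
original = "ACGT" * 25                                  # L = 100
inserted = random_insertion(original, 0, 20, rng)
print(len(original), len(inserted))                     # both lengths are 100
```

Trimming from both ends means that bases near the sequence boundaries can be lost, while the centre of the sequence is shifted by at most half the insertion length.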

__init__(insert_min=0, insert_max=20)[source]

__call__(x)[source]

Randomly insert segments of random DNA into DNA sequences.

Parameters:

x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).

Returns:

Sequences with randomly inserted segments of random DNA, trimmed to maintain shape (N, A, L).

Return type:

torch.Tensor

class evoaug.augment.RandomTranslocation(shift_min=0, shift_max=20)[source]

Bases: AugmentBase

Randomly shifts sequences using circular roll transformations.

This augmentation applies random positive or negative shifts to each sequence in a batch, effectively cutting the sequence and reordering the pieces while maintaining the original sequence length L.

Parameters:
  • shift_min (int, optional) – Minimum size for random shift. Defaults to 0.

  • shift_max (int, optional) – Maximum size for random shift. Defaults to 20.

Notes

  • Shifts are randomly chosen between shift_min and shift_max

  • Approximately half of the shifts are made negative to create both left and right circular shifts

  • Uses torch.roll for efficient implementation

  • Each sequence in the batch receives a different random shift
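A circular roll can be illustrated on a plain string. The sketch below mirrors the direction convention of torch.roll (a positive shift moves elements toward higher indices); the helper names `roll` and `random_translocation` and the coin flip used to pick the sign are illustrative assumptions.

```python
import random

def roll(seq: str, shift: int) -> str:
    """Circularly shift seq; positive shifts move bases to the right."""
    if not seq:
        return seq
    shift %= len(seq)
    return seq[-shift:] + seq[:-shift] if shift else seq

def random_translocation(seq: str, shift_min: int, shift_max: int,
                         rng: random.Random) -> str:
    """Roll by a random amount, negating the shift about half of the time."""
    shift = rng.randint(shift_min, shift_max)
    if rng.random() < 0.5:
        shift = -shift
    return roll(seq, shift)

print(roll("AACGT", 2))    # GTAAC
print(roll("AACGT", -2))   # CGTAA

shifted = random_translocation("AACGT" * 10, 0, 20, random.Random(3))
```

A roll only reorders the sequence, so the base composition and the length are unchanged.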

__init__(shift_min=0, shift_max=20)[source]

__call__(x)[source]

Randomly shift sequences in a batch using circular roll.

Parameters:

x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).

Returns:

Sequences with random circular shifts applied, maintaining shape (N, A, L).

Return type:

torch.Tensor

Sequence Transformations:

class evoaug.augment.RandomRC(rc_prob=0.5)[source]

Bases: AugmentBase

Randomly applies reverse-complement transformations to sequences.

This augmentation randomly selects sequences in a batch and applies a reverse-complement transformation with a specified probability. The transformation reverses both the sequence order and nucleotide identity while maintaining the original sequence length L.

Parameters:

rc_prob (float, optional) – Probability to apply a reverse-complement transformation. Defaults to 0.5.

Notes

  • Each sequence is independently selected for transformation

  • Uses torch.flip with dims=[1,2] to reverse both sequence and nucleotide dimensions

  • Maintains original sequence length L

  • Useful for learning strand-invariant representations
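To see why flipping both the nucleotide and position axes yields a reverse complement, consider a single one-hot sequence stored as nested lists. The sketch assumes the channel order A, C, G, T, which this page does not state; with that order, reversing the channel axis swaps A with T and C with G.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
ALPHABET = "ACGT"  # assumed channel order; the package may order rows differently

def one_hot(seq: str) -> list:
    """Encode a DNA string as A rows by L columns."""
    return [[1 if base == letter else 0 for base in seq] for letter in ALPHABET]

def decode(x: list) -> str:
    """Map each column back to the letter of its hot row."""
    return "".join(
        ALPHABET[max(range(len(ALPHABET)), key=lambda a: x[a][i])]
        for i in range(len(x[0]))
    )

def flip_both_axes(x: list) -> list:
    """Reverse the channel axis and the position axis of one sequence."""
    return [row[::-1] for row in x[::-1]]

def reverse_complement(seq: str) -> str:
    """Reference implementation on the string itself."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

seq = "AACGT"
print(decode(flip_both_axes(one_hot(seq))))   # ACGTT
print(reverse_complement(seq))                # ACGTT
```

If the channels were stored in an order that is not symmetric under complementation (for example A, G, C, T), a plain flip of the channel axis would not give the complement.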

__init__(rc_prob=0.5)[source]

Create random reverse-complement augmentation object.

Parameters:

rc_prob (float) – Probability to apply reverse-complement transformation.

__call__(x)[source]

Randomly transform sequences with reverse-complement transformations.

Parameters:

x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).

Returns:

Sequences with random reverse-complements applied, maintaining shape (N, A, L).

Return type:

torch.Tensor

class evoaug.augment.RandomNoise(noise_mean=0.0, noise_std=0.2)[source]

Bases: AugmentBase

Randomly adds Gaussian noise to sequences.

This augmentation adds random Gaussian noise to each sequence in a batch, effectively introducing small perturbations to the one-hot encodings while maintaining the original sequence length L.

Parameters:
  • noise_mean (float, optional) – Mean of the Gaussian noise. Defaults to 0.0.

  • noise_std (float, optional) – Standard deviation of the Gaussian noise. Defaults to 0.2.

Notes

  • Noise is sampled from a normal distribution with specified mean and standard deviation

  • Noise is added element-wise to the input tensor

  • Useful for improving model robustness to small perturbations

  • Each sequence in the batch receives different random noise

__init__(noise_mean=0.0, noise_std=0.2)[source]

__call__(x)[source]

Randomly add Gaussian noise to a set of one-hot DNA sequences.

Parameters:

x (torch.Tensor) – Batch of one-hot sequences with shape (N, A, L).

Returns:

Sequences with random noise added, maintaining shape (N, A, L).

Return type:

torch.Tensor

Training Utilities

The evoaug.evoaug module provides training utilities:

RobustLoader:

class evoaug.evoaug.RobustLoader(base_dataset: Dataset, augment_list: List[AugmentBase] = [], max_augs_per_seq: int = 0, hard_aug: bool = True, batch_size: int = 32, shuffle: bool = True, num_workers: int = 4, **kwargs)[source]

Bases: DataLoader

EvoAug2 DataLoader that inherits from PyTorch DataLoader.

This class provides a DataLoader with built-in EvoAug augmentations that can be used inside a Lightning DataModule or directly in a plain PyTorch training loop.

Parameters:
  • base_dataset (torch.utils.data.Dataset) – The underlying dataset that provides (sequence, target) pairs.

  • augment_list (List[AugmentBase], optional) – List of augmentations to apply. Defaults to empty list.

  • max_augs_per_seq (int, optional) – Maximum augmentations per sequence. Defaults to 0.

  • hard_aug (bool, optional) – If True, always apply exactly max_augs_per_seq augmentations to each sequence; if False, fewer may be applied. Defaults to True.

  • batch_size (int, optional) – Batch size for the DataLoader. Defaults to 32.

  • shuffle (bool, optional) – Whether to shuffle the data. Defaults to True.

  • num_workers (int, optional) – Number of worker processes. Defaults to 4.

  • **kwargs – Additional arguments passed to DataLoader.

Notes

  • The RobustLoader automatically creates an AugmentedGenomicDataset wrapper

  • Augmentations can be enabled/disabled at runtime using enable_augmentations() and disable_augmentations() methods

  • All augmentations preserve sequence length L for consistent batch shapes
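How max_augs_per_seq and hard_aug might interact can be sketched in plain Python. This is an illustration of the documented parameters, not the loader's actual code: the function name `choose_augmentations`, drawing the count uniformly from 1 to the maximum when hard_aug is False, and sampling augmentations without repeats are all assumptions.

```python
import random

def choose_augmentations(augment_names, max_augs_per_seq, hard_aug, rng):
    """Pick which augmentations to apply to one sequence."""
    limit = min(max_augs_per_seq, len(augment_names))
    if limit <= 0:
        return []                                    # nothing to apply
    # hard_aug=True: always use the maximum; otherwise draw a smaller count.
    count = limit if hard_aug else rng.randint(1, limit)
    return rng.sample(augment_names, count)          # no augmentation repeats

rng = random.Random(0)
names = ["mutation", "deletion", "insertion", "rc"]
hard_choice = choose_augmentations(names, 2, True, rng)
soft_choice = choose_augmentations(names, 2, False, rng)
print(hard_choice, soft_choice)
```

Under these assumptions the default max_augs_per_seq=0 selects no augmentations, so a positive value is needed for the augment_list to have any effect.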

__init__(base_dataset: Dataset, augment_list: List[AugmentBase] = [], max_augs_per_seq: int = 0, hard_aug: bool = True, batch_size: int = 32, shuffle: bool = True, num_workers: int = 4, **kwargs)[source]

enable_augmentations()[source]

Enable augmentations for training.

Notes

This method enables augmentations on the underlying dataset, allowing them to be applied during training.

disable_augmentations()[source]

Disable augmentations for finetuning/validation.

Notes

This method disables augmentations on the underlying dataset, useful for validation, testing, or finetuning on original data.

set_augmentations(augment_list: List[AugmentBase], max_augs_per_seq: int = 0, hard_aug: bool = True)[source]

Update the augmentation settings.

Parameters:
  • augment_list (List[AugmentBase]) – New list of augmentations to apply.

  • max_augs_per_seq (int, optional) – New maximum augmentations per sequence. Defaults to 0.

  • hard_aug (bool, optional) – New hard augmentation setting. Defaults to True.

Notes

This method allows dynamic updating of augmentation parameters without recreating the entire DataLoader.

Training Functions:

Model Architectures

The evoaug_utils.model_zoo module provides pre-built architectures:

DeepSTARR Models:

class evoaug_utils.model_zoo.DeepSTARR(output_dim, d=256, conv1_filters=None, learn_conv1_filters=True, conv2_filters=None, learn_conv2_filters=True, conv3_filters=None, learn_conv3_filters=True, conv4_filters=None, learn_conv4_filters=True)[source]

Bases: Module

DeepSTARR model from de Almeida et al., 2022.

This is the original DeepSTARR model architecture as described in the paper. See https://www.nature.com/articles/s41588-022-01048-5 for details.

Parameters:
  • output_dim (int) – Number of output classes for prediction.

  • d (int, optional) – Number of first-layer convolutional filters. Defaults to 256.

  • conv1_filters (torch.Tensor, optional) – Initial filters for the first convolutional layer. If None, random filters are initialized.

  • learn_conv1_filters (bool, optional) – Whether to learn the first convolutional filters. Defaults to True.

  • conv2_filters (torch.Tensor, optional) – Initial filters for the second convolutional layer. If None, random filters are initialized.

  • learn_conv2_filters (bool, optional) – Whether to learn the second convolutional filters. Defaults to True.

  • conv3_filters (torch.Tensor, optional) – Initial filters for the third convolutional layer. If None, random filters are initialized.

  • learn_conv3_filters (bool, optional) – Whether to learn the third convolutional filters. Defaults to True.

  • conv4_filters (torch.Tensor, optional) – Initial filters for the fourth convolutional layer. If None, random filters are initialized.

  • learn_conv4_filters (bool, optional) – Whether to learn the fourth convolutional filters. Defaults to True.

Notes

  • The original DeepSTARR model uses 256 first-layer convolutional filters

  • Supports transfer learning by initializing with pre-trained filters

  • Uses batch normalization and max pooling throughout

  • Final layers use LazyLinear for automatic input size inference

__init__(output_dim, d=256, conv1_filters=None, learn_conv1_filters=True, conv2_filters=None, learn_conv2_filters=True, conv3_filters=None, learn_conv3_filters=True, conv4_filters=None, learn_conv4_filters=True)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

get_which_conv_layers_transferred()[source]

Get list of convolutional layers that were initialized with pre-trained filters.

Returns:

List of layer indices (1-4) that were initialized with pre-trained filters.

Return type:

list

Notes

This method is useful for understanding which layers were transferred from a pre-trained model during initialization.

forward(x)[source]

Forward pass through the DeepSTARR model.

Parameters:

x (torch.Tensor) – Input tensor with shape (batch_size, 4, sequence_length).

Returns:

Output predictions with shape (batch_size, output_dim).

Return type:

torch.Tensor

Notes

The forward pass applies:

  1. Four sequential 1D convolutions with batch normalization and max pooling

  2. Flattening of convolutional features

  3. Two fully connected layers with batch normalization and dropout

  4. Final output layer for predictions

class evoaug_utils.model_zoo.DeepSTARRModel(model, learning_rate=0.001, weight_decay=1e-06)[source]

Bases: LightningModule

PyTorch Lightning module for DeepSTARR training.

This class wraps the DeepSTARR model in a PyTorch Lightning module, providing training, validation, and testing functionality with automatic logging and checkpointing.

Parameters:
  • model (DeepSTARR) – The DeepSTARR model instance.

  • learning_rate (float, optional) – Learning rate for training. Defaults to 0.001.

  • weight_decay (float, optional) – Weight decay (L2 regularization). Defaults to 1e-6.

Notes

  • Uses MSE loss for regression tasks

  • Adam optimizer with ReduceLROnPlateau scheduler

  • Automatic logging of training, validation, and test losses

__init__(model, learning_rate=0.001, weight_decay=1e-06)[source]

forward(x)[source]

Forward pass through the model.

Parameters:

x (torch.Tensor) – Input tensor.

Returns:

Model predictions.

Return type:

torch.Tensor

training_step(batch, batch_idx)[source]

Training step for a single batch.

Parameters:
  • batch (tuple) – Tuple of (x, y) where x is input and y is target.

  • batch_idx (int) – Index of the current batch.

Returns:

Training loss for the batch.

Return type:

torch.Tensor

validation_step(batch, batch_idx)[source]

Validation step for a single batch.

Parameters:
  • batch (tuple) – Tuple of (x, y) where x is input and y is target.

  • batch_idx (int) – Index of the current batch.

Returns:

Validation loss for the batch.

Return type:

torch.Tensor

test_step(batch, batch_idx)[source]

Test step for a single batch.

Parameters:
  • batch (tuple) – Tuple of (x, y) where x is input and y is target.

  • batch_idx (int) – Index of the current batch.

Returns:

Test loss for the batch.

Return type:

torch.Tensor

configure_optimizers()[source]

Configure optimizer and learning rate scheduler.

Returns:

Dictionary with optimizer and scheduler configuration.

Return type:

dict

Notes

Uses Adam optimizer with ReduceLROnPlateau scheduler. The scheduler monitors validation loss and reduces learning rate when no improvement is seen for 5 epochs.

Basset Models:

class evoaug_utils.model_zoo.Basset(output_dim, d=300, conv1_filters=None, learn_conv1_filters=True, conv2_filters=None, learn_conv2_filters=True, conv3_filters=None, learn_conv3_filters=True)[source]

Bases: Module

Basset model from Kelley et al., 2016.

This is the Basset model architecture as described in the original paper. See https://genome.cshlp.org/content/early/2016/05/03/gr.200535.115.abstract and https://github.com/davek44/Basset/blob/master/data/models/pretrained_params.txt

Parameters:
  • output_dim (int) – Number of output classes for prediction.

  • d (int, optional) – Number of first-layer convolutional filters. Defaults to 300.

  • conv1_filters (torch.Tensor, optional) – Initial filters for the first convolutional layer. If None, random filters are initialized.

  • learn_conv1_filters (bool, optional) – Whether to learn the first convolutional filters. Defaults to True.

  • conv2_filters (torch.Tensor, optional) – Initial filters for the second convolutional layer. If None, random filters are initialized.

  • learn_conv2_filters (bool, optional) – Whether to learn the second convolutional filters. Defaults to True.

  • conv3_filters (torch.Tensor, optional) – Initial filters for the third convolutional layer. If None, random filters are initialized.

  • learn_conv3_filters (bool, optional) – Whether to learn the third convolutional filters. Defaults to True.

Notes

  • The original Basset model uses 300 first-layer convolutional filters

  • Supports transfer learning by initializing with pre-trained filters

  • Uses batch normalization and max pooling throughout

  • Final layers use LazyLinear for automatic input size inference

  • Output uses sigmoid activation for binary classification

__init__(output_dim, d=300, conv1_filters=None, learn_conv1_filters=True, conv2_filters=None, learn_conv2_filters=True, conv3_filters=None, learn_conv3_filters=True)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

get_which_conv_layers_transferred()[source]

Get list of convolutional layers that were initialized with pre-trained filters.

Returns:

List of layer indices (1-3) that were initialized with pre-trained filters.

Return type:

list

Notes

This method is useful for understanding which layers were transferred from a pre-trained model during initialization.

forward(x)[source]

Forward pass through the Basset model.

Parameters:

x (torch.Tensor) – Input tensor with shape (batch_size, 4, sequence_length).

Returns:

Output predictions with shape (batch_size, output_dim).

Return type:

torch.Tensor

Notes

The forward pass applies:

  1. Three sequential 1D convolutions with batch normalization and max pooling

  2. Flattening of convolutional features

  3. Two fully connected layers with batch normalization and dropout

  4. Final output layer with sigmoid activation

Other Models:

class evoaug_utils.model_zoo.CNN(output_dim)[source]

Bases: Module

Generic CNN model for genomic sequence classification.

This is a flexible CNN architecture that can be used for various genomic sequence classification tasks.

Parameters:

output_dim (int) – Number of output classes for prediction.

Notes

  • Uses three convolutional layers with batch normalization and max pooling

  • Includes dropout for regularization

  • Final layers use LazyLinear for automatic input size inference

  • Output uses sigmoid activation for binary classification

__init__(output_dim)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x)[source]

Forward pass through the CNN model.

Parameters:

x (torch.Tensor) – Input tensor with shape (batch_size, 4, sequence_length).

Returns:

Output predictions with shape (batch_size, output_dim).

Return type:

torch.Tensor

Notes

The forward pass applies:

  1. Three sequential 1D convolutions with batch normalization and max pooling

  2. Dropout after each convolutional layer

  3. Flattening of convolutional features

  4. Fully connected layer with batch normalization and dropout

  5. Final output layer with sigmoid activation

Utility Functions

The evoaug_utils.utils module provides helper functions:

Data Handling:

class evoaug_utils.utils.H5Dataset(filepath, batch_size=128, lower_case=False, transpose=False, downsample=None)[source]

Enhanced Dataset class for H5 data files with DataModule-like functionality.

This class combines the functionality of a PyTorch Dataset with the convenience methods of a Lightning DataModule, making it easy to integrate with EvoAug2 augmentations without nesting datamodules.

Parameters:
  • filepath (str) – Path to the H5 file.

  • batch_size (int, optional) – Batch size for dataloaders. Defaults to 128.

  • lower_case (bool, optional) – Whether to use lowercase keys (‘x’, ‘y’) instead of uppercase (‘X’, ‘Y’). Defaults to False.

  • transpose (bool, optional) – Whether to transpose the data dimensions. Defaults to False.

  • downsample (int, optional) – Number of samples to use (for debugging). If None, uses all data. Defaults to None.

filepath

Path to the H5 file.

Type:

str

batch_size

Batch size for dataloaders.

Type:

int

lower_case

Whether to use lowercase keys.

Type:

bool

transpose

Whether data is transposed.

Type:

bool

downsample

Number of samples to use.

Type:

int

x_key

Key prefix for input data.

Type:

str

y_key

Key prefix for target data.

Type:

str

x_train

Training input data.

Type:

torch.Tensor

y_train

Training target data.

Type:

torch.Tensor

x_valid

Validation input data.

Type:

torch.Tensor

y_valid

Validation target data.

Type:

torch.Tensor

x_test

Test input data.

Type:

torch.Tensor

y_test

Test target data.

Type:

torch.Tensor

A

Alphabet size (number of nucleotides).

Type:

int

L

Sequence length.

Type:

int

num_classes

Number of output classes.

Type:

int

class evoaug_utils.utils.H5DataModule(data_path, batch_size=128, stage=None, lower_case=False, transpose=False, downsample=None)[source]

Bases: LightningDataModule

PyTorch Lightning DataModule for H5 data files.

This class provides a standardized way to load and manage H5 datasets for training, validation, and testing in PyTorch Lightning workflows.

Parameters:
  • data_path (str) – Path to the H5 data file.

  • batch_size (int, optional) – Batch size for dataloaders. Defaults to 128.

  • stage (str, optional) – Lightning stage (‘fit’, ‘test’, or None). Defaults to None.

  • lower_case (bool, optional) – Whether to use lowercase keys (‘x’, ‘y’) instead of uppercase (‘X’, ‘Y’). Defaults to False.

  • transpose (bool, optional) – Whether to transpose the data dimensions. Defaults to False.

  • downsample (int, optional) – Number of samples to use (for debugging). If None, uses all data. Defaults to None.

data_path

Path to the H5 data file.

Type:

str

batch_size

Batch size for dataloaders.

Type:

int

x

Key prefix for input data (‘x’ or ‘X’).

Type:

str

y

Key prefix for target data (‘y’ or ‘Y’).

Type:

str

transpose

Whether data is transposed.

Type:

bool

downsample

Number of samples to use.

Type:

int

x_train

Training input data.

Type:

torch.Tensor

y_train

Training target data.

Type:

torch.Tensor

x_valid

Validation input data.

Type:

torch.Tensor

y_valid

Validation target data.

Type:

torch.Tensor

x_test

Test input data.

Type:

torch.Tensor

y_test

Test target data.

Type:

torch.Tensor

A

Alphabet size (number of nucleotides).

Type:

int

L

Sequence length.

Type:

int

num_classes

Number of output classes.

Type:

int

__init__(data_path, batch_size=128, stage=None, lower_case=False, transpose=False, downsample=None)[source]

prepare_data_per_node

If True, each LOCAL_RANK=0 will call prepare data. Otherwise only NODE_RANK=0, LOCAL_RANK=0 will prepare data.

allow_zero_length_dataloader_with_multiple_devices

If True, dataloader with zero length within local rank is allowed. Default value is False.

setup(stage=None)[source]

Set up the data module for the specified stage.

Parameters:

stage (str, optional) – Lightning stage (‘fit’, ‘test’, or None). Defaults to None.

Notes

  • Loads training and validation data for ‘fit’ stage

  • Loads test data for ‘test’ stage

  • Sets shape attributes (A, L, num_classes)

train_dataloader()[source]

Get training dataloader.

Returns:

Training dataloader with shuffling enabled.

Return type:

torch.utils.data.DataLoader

val_dataloader()[source]

Get validation dataloader.

Returns:

Validation dataloader with shuffling disabled.

Return type:

torch.utils.data.DataLoader

test_dataloader()[source]

Get test dataloader.

Returns:

Test dataloader with shuffling disabled.

Return type:

torch.utils.data.DataLoader

Evaluation:

evoaug_utils.utils.evaluate_model(y_test, pred, task, verbose=True)[source]

Evaluate model performance for classification or regression tasks.

Parameters:
  • y_test (array-like) – True target values.

  • pred (array-like) – Predicted values from the model.

  • task (str) – Task type: ‘regression’ or ‘classification’.

  • verbose (bool, optional) – Whether to print evaluation results. Defaults to True.

Returns:

  • For regression: (mse, pearson_r, spearman_r)

  • For classification: (auroc, aupr)

Return type:

tuple

Notes

  • Regression metrics: MSE, Pearson correlation, Spearman correlation

  • Classification metrics: AUROC, Average Precision
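The regression metrics listed above are reported per output column. The sketch below shows that per-column computation for MSE and Pearson correlation in plain Python; the helper names are illustrative, and the package may rely on NumPy, SciPy, or scikit-learn instead.

```python
from math import sqrt

def column(matrix, j):
    """Extract column j from a list of rows."""
    return [row[j] for row in matrix]

def mse(y_true, y_pred):
    """Mean squared error for one output column."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson_r(y_true, y_pred):
    """Pearson correlation for one output column."""
    n = len(y_true)
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / sqrt(var_t * var_p)

# Two tasks (columns), four samples (rows).
y_true = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]]
y_pred = [[1.5, 0.0], [2.5, 1.0], [3.5, 0.0], [4.5, 1.0]]

per_task_mse = [mse(column(y_true, j), column(y_pred, j)) for j in range(2)]
per_task_r = [pearson_r(column(y_true, j), column(y_pred, j)) for j in range(2)]
print(per_task_mse)   # [0.25, 0.0]
```

The first task shows why both metrics are reported: a constant offset of 0.5 leaves the correlation at 1 while the MSE is 0.25.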

evoaug_utils.utils.get_predictions(model, x, batch_size=100, accelerator='gpu', devices=1)[source]

Get predictions from a PyTorch model.

Parameters:
  • model (torch.nn.Module) – The trained model to use for predictions.

  • x (array-like) – Input data for prediction.

  • batch_size (int, optional) – Batch size for prediction. Defaults to 100.

  • accelerator (str, optional) – Hardware accelerator to use. Defaults to ‘gpu’.

  • devices (int, optional) – Number of devices to use. Defaults to 1.

Returns:

Model predictions.

Return type:

numpy.ndarray

Notes

  • Model is set to evaluation mode during prediction

  • Predictions are made in batches to manage memory

  • Output is converted to numpy array

evoaug_utils.utils.calculate_auroc(y_true, y_score)[source]

Calculate Area Under ROC Curve for each class.

Parameters:
  • y_true (array-like) – True binary labels.

  • y_score (array-like) – Predicted probabilities or scores.

Returns:

AUROC values for each class.

Return type:

numpy.ndarray
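For a single class, AUROC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; the function name `auroc` is illustrative, and the pairwise loop is quadratic, so it is meant for understanding rather than for large datasets.

```python
def auroc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))   # 0.75
```

Three of the four positive/negative pairs are ordered correctly here, giving 0.75; the value is undefined when a class has no positives or no negatives.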

evoaug_utils.utils.calculate_aupr(y_true, y_score)[source]

Calculate Average Precision for each class.

Parameters:
  • y_true (array-like) – True binary labels.

  • y_score (array-like) – Predicted probabilities or scores.

Returns:

Average precision values for each class.

Return type:

numpy.ndarray

evoaug_utils.utils.calculate_mse(y_true, y_score)[source]

Calculate Mean Squared Error for each class.

Parameters:
  • y_true (array-like) – True target values.

  • y_score (array-like) – Predicted values.

Returns:

MSE values for each class.

Return type:

numpy.ndarray

evoaug_utils.utils.calculate_pearsonr(y_true, y_score)[source]

Calculate Pearson correlation coefficient for each class.

Parameters:
  • y_true (array-like) – True target values.

  • y_score (array-like) – Predicted values.

Returns:

Pearson correlation values for each class.

Return type:

numpy.ndarray

evoaug_utils.utils.calculate_spearmanr(y_true, y_score)[source]

Calculate Spearman correlation coefficient for each class.

Parameters:
  • y_true (array-like) – True target values.

  • y_score (array-like) – Predicted values.

Returns:

Spearman correlation values for each class.

Return type:

numpy.ndarray

Training Support:

evoaug_utils.utils.configure_optimizer(model, lr=0.001, weight_decay=1e-06, decay_factor=0.1, patience=5, monitor='val_loss')[source]

Configure optimizer and learning rate scheduler for PyTorch models.

Parameters:
  • model (torch.nn.Module) – The model to configure optimizer for.

  • lr (float, optional) – Learning rate. Defaults to 0.001.

  • weight_decay (float, optional) – Weight decay (L2 regularization). Defaults to 1e-6.

  • decay_factor (float, optional) – Factor by which to reduce learning rate. Defaults to 0.1.

  • patience (int, optional) – Number of epochs with no improvement before reducing LR. Defaults to 5.

  • monitor (str, optional) – Metric to monitor for LR reduction. Defaults to ‘val_loss’.

Returns:

Dictionary with optimizer and scheduler configuration.

Return type:

dict

Notes

Uses Adam optimizer with ReduceLROnPlateau scheduler.
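The effect of decay_factor and patience can be illustrated with a simplified, framework-free simulation of a reduce-on-plateau rule. It follows the PyTorch convention that the learning rate is cut once the number of epochs without improvement exceeds patience, but it ignores the scheduler's threshold, cooldown, and minimum learning rate options; the function name `plateau_schedule` is illustrative.

```python
def plateau_schedule(val_losses, lr=0.001, decay_factor=0.1, patience=5):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best, bad_epochs = loss, 0      # improvement resets the counter
        else:
            bad_epochs += 1
        if bad_epochs > patience:           # patience exhausted: decay the rate
            lr *= decay_factor
            bad_epochs = 0
        history.append(lr)
    return history

# Loss improves twice, then stalls for seven epochs.
losses = [1.0, 0.9] + [0.95] * 7
schedule = plateau_schedule(losses)
print(schedule)
```

With the defaults, the rate stays at 0.001 through five stalled epochs and drops by a factor of ten on the sixth.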

evoaug_utils.utils.get_fmaps(robust_model, x)[source]

Get first layer feature maps from a model.

Parameters:
  • robust_model (torch.nn.Module) – The model to extract feature maps from.

  • x (torch.Tensor) – Input data to generate feature maps for.

Returns:

Feature maps from the first layer (activation1).

Return type:

numpy.ndarray

Notes

  • Requires the model to have a layer named ‘activation1’

  • Model is moved to CPU for feature extraction

  • Feature maps are transposed for visualization

evoaug_utils.utils.make_directory(directory)[source]

Create a directory if it doesn’t exist.

Parameters:

directory (str) – Path to the directory to create.

Notes

Creates parent directories as needed using pathlib.
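A minimal sketch of the behaviour described above, using only the standard library. The function name `ensure_directory` and its return value are illustrative assumptions; the package's own make_directory may differ in detail.

```python
from pathlib import Path
import tempfile

def ensure_directory(directory: str) -> Path:
    """Create directory (and any missing parents); do nothing if it exists."""
    path = Path(directory)
    path.mkdir(parents=True, exist_ok=True)
    return path

with tempfile.TemporaryDirectory() as tmp:
    target = ensure_directory(f"{tmp}/results/run1")   # creates both levels
    ensure_directory(str(target))                      # second call is a no-op
    created = target.is_dir()

print(created)   # True
```

Passing exist_ok=True is what makes repeated calls safe, which is convenient when several training runs share an output folder.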

Usage Examples

Basic Augmentation:

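import torch
from evoaug.augment import RandomMutation, RandomDeletion

# Create augmentations
mutation = RandomMutation(mut_frac=0.05)
deletion = RandomDeletion(delete_min=0, delete_max=20)

# Build a random one-hot batch with shape (N, A, L) = (1, 4, 200)
indices = torch.randint(0, 4, (1, 200))
sequence = torch.nn.functional.one_hot(indices, num_classes=4).permute(0, 2, 1).float()

# Apply augmentations one after another
augmented = mutation(sequence)
augmented = deletion(augmented)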

Training with RobustLoader:

from evoaug.evoaug import RobustLoader

# Create loader
loader = RobustLoader(
    base_dataset=dataset,
    augment_list=[mutation, deletion],
    max_augs_per_seq=2,
    hard_aug=True,
    batch_size=32
)

# Training loop
for batch_seqs, batch_labels in loader:
    # Augmentations applied automatically
    outputs = model(batch_seqs)
    loss = criterion(outputs, batch_labels)
    # ... rest of training

Model Training:

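from evoaug_utils.model_zoo import DeepSTARR, DeepSTARRModel

# Create model with two outputs
model = DeepSTARRModel(DeepSTARR(output_dim=2))

# Train with augmentations; `trainer` (a Lightning Trainer) and
# `data_module` (a LightningDataModule) are assumed to be created beforehand
trainer.fit(model, datamodule=data_module)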

Evaluation:

from evoaug_utils import utils

# Get predictions
predictions = utils.get_predictions(model, test_data)

# Evaluate
results = utils.evaluate_model(true_labels, predictions, task='regression')

Configuration

Augmentation Parameters:

Each augmentation class accepts specific parameters that control the augmentation behavior:

  • RandomMutation: mut_frac - fraction of positions to mutate

  • RandomDeletion: delete_min, delete_max - deletion range

  • RandomInsertion: insert_min, insert_max - insertion range

  • RandomTranslocation: shift_min, shift_max - shift range

  • RandomRC: rc_prob - reverse-complement probability

  • RandomNoise: noise_mean, noise_std - noise parameters

Training Parameters:

  • max_augs_per_seq: Maximum augmentations per sequence

  • hard_aug: Whether to always apply exactly N augmentations

  • batch_size: Training batch size

  • shuffle: Whether to shuffle data

Model Parameters:

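  • output_dim: Number of output classes (the sequence length does not need to be specified because the final layers use LazyLinear)

  • d: Number of first-layer convolutional filters (DeepSTARR and Basset)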

  • learning_rate: Training learning rate

  • weight_decay: L2 regularization

Error Handling

Common Errors and Solutions:

  1. Shape Mismatch Errors:

     • Ensure input tensors have shape (N, A, L), i.e. [batch, channels, length]

     • Check that augmentation parameters are within valid ranges

  2. Memory Errors:

     • Reduce batch size

     • Use gradient accumulation

     • Enable mixed precision training

  3. Data Type Errors:

     • Ensure input tensors are torch.float32

     • Check that label tensors have the appropriate type (long for classification)

Debugging Tips:

# Check tensor shapes and types
print(f"Input shape: {sequence.shape}")
print(f"Input dtype: {sequence.dtype}")

# Verify augmentation parameters
print(f"Mutation fraction: {mutation.mut_frac}")
print(f"Deletion range: {deletion.delete_min}-{deletion.delete_max}")

Performance Considerations

Optimization Tips:

  1. Use GPU Acceleration:

     • Move tensors to GPU: sequence = sequence.cuda()

     • Use mixed precision training when available

  2. Batch Processing:

     • Use appropriate batch sizes for your hardware

     • Consider gradient accumulation for large effective batch sizes

  3. Augmentation Efficiency:

     • Limit number of augmentations per sequence