immuneML.ml_methods.generative_models package
Submodules
immuneML.ml_methods.generative_models.BackgroundSequences module
immuneML.ml_methods.generative_models.ExperimentalImport module
- class immuneML.ml_methods.generative_models.ExperimentalImport.ExperimentalImport(dataset: SequenceDataset, original_input_file: Path = None)
Bases: GenerativeModel
Imports existing experimental data so that annotations and simulations can be run on top of it. This model should be used only for LIgO simulation and not with the TrainGenModel instruction.
YAML specification:
definitions:
    ml_methods:
        generative_model:
            type: ExperimentalImport
            import_format: AIRR
            tmp_import_path: ./tmp/
            import_params:
                path: path/to/files/
                region_type: IMGT_CDR3 # what part of the sequence to import
                column_mapping: # column mapping AIRR: immuneML
                    junction: sequence
                    junction_aa: sequence_aa
                    locus: chain
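The column_mapping in the specification above maps AIRR column names to immuneML field names. A minimal sketch of that renaming step on plain dictionaries follows. The rename_columns helper and the example row are hypothetical and only illustrate the idea; this is not the immuneML import code.

```python
# Hypothetical helper: illustrates what an AIRR -> immuneML column mapping does.
# It is not part of immuneML.

# Mapping copied from the YAML specification (AIRR name -> immuneML name).
COLUMN_MAPPING = {
    "junction": "sequence",
    "junction_aa": "sequence_aa",
    "locus": "chain",
}


def rename_columns(row: dict, mapping: dict) -> dict:
    """Return a copy of `row` with keys renamed according to `mapping`.

    Keys that are not in the mapping are kept unchanged.
    """
    return {mapping.get(key, key): value for key, value in row.items()}


# Made-up example row in AIRR naming.
airr_row = {
    "junction": "TGTGCCAGCAGTTTC",
    "junction_aa": "CASSF",
    "locus": "TRB",
    "v_call": "TRBV7-9",
}
immuneml_row = rename_columns(airr_row, COLUMN_MAPPING)
print(immuneml_row)
```

Columns that are absent from the mapping (here v_call) pass through unchanged in this sketch; how the real importer treats unmapped columns is not stated above.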
- compute_p_gen(sequence: dict, sequence_type: SequenceType) -> float
- compute_p_gens(sequences, sequence_type: SequenceType) -> ndarray
- generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)
- generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool)
immuneML.ml_methods.generative_models.GenerativeModel module
- class immuneML.ml_methods.generative_models.GenerativeModel.GenerativeModel(locus, name: str = None, region_type: RegionType = None)
Bases: object
Generative models are algorithms which can be trained to learn patterns in existing datasets, and then be used to generate new synthetic datasets.
These methods can be used in the TrainGenModel instruction, and previously trained models can be used to generate data using the ApplyGenModel instruction.
- DOCS_TITLE = 'Generative models'
- OUTPUT_COLUMNS = []
- abstract compute_p_gen(sequence: dict, sequence_type: SequenceType) -> float
- abstract compute_p_gens(sequences, sequence_type: SequenceType) -> ndarray
- abstract generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)
- abstract generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool) -> Dataset
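To illustrate the shape of such an abstract interface, here is a minimal sketch using Python's abc module. ToyGenerativeModel and UniformModel are hypothetical names, and the signatures are deliberately simplified (no path, sequence_type, or Dataset objects), so this is not the immuneML class itself. Unlike the listing above, the batch method is given a default implementation here to keep the example short.

```python
from abc import ABC, abstractmethod
import random


class ToyGenerativeModel(ABC):
    """Hypothetical, simplified stand-in for a generative model interface."""

    @abstractmethod
    def generate_sequences(self, count: int, seed: int) -> list:
        """Return `count` generated sequences, reproducibly for a given seed."""

    @abstractmethod
    def compute_p_gen(self, sequence: str) -> float:
        """Return the generation probability of a single sequence."""

    def compute_p_gens(self, sequences: list) -> list:
        # Batch version built on top of the single-sequence method.
        return [self.compute_p_gen(seq) for seq in sequences]


class UniformModel(ToyGenerativeModel):
    """Toy model: fixed-length sequences with uniformly drawn letters."""

    def __init__(self, alphabet: str, length: int):
        self.alphabet = alphabet
        self.length = length

    def generate_sequences(self, count: int, seed: int) -> list:
        rng = random.Random(seed)  # local RNG, so the seed fully determines the output
        return ["".join(rng.choice(self.alphabet) for _ in range(self.length))
                for _ in range(count)]

    def compute_p_gen(self, sequence: str) -> float:
        # Sequences of the wrong length or with unknown letters cannot be generated.
        if len(sequence) != self.length or any(c not in self.alphabet for c in sequence):
            return 0.0
        return (1 / len(self.alphabet)) ** self.length


model = UniformModel(alphabet="ACGT", length=3)
sequences = model.generate_sequences(count=5, seed=1)
print(sequences, model.compute_p_gens(sequences))
```

Because the abstract methods are declared with abstractmethod, instantiating ToyGenerativeModel directly raises a TypeError; only subclasses that implement every abstract method can be created.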
immuneML.ml_methods.generative_models.InternalOlgaModel module
- class immuneML.ml_methods.generative_models.InternalOlgaModel.InternalOlgaModel(sequence_gen_model: olga.sequence_generation.SequenceGenerationVDJ | olga.sequence_generation.SequenceGenerationVJ = None, v_gene_mapping: list = None, j_gene_mapping: list = None, genomic_data: olga.load_model.GenomicData = None, olga_gen_model: olga.load_model.GenerativeModelVDJ | olga.load_model.GenerativeModelVJ = None)
Bases: object
- genomic_data: GenomicData = None
- j_gene_mapping: list = None
- olga_gen_model: GenerativeModelVDJ | GenerativeModelVJ = None
- sequence_gen_model: SequenceGenerationVDJ | SequenceGenerationVJ = None
- v_gene_mapping: list = None
immuneML.ml_methods.generative_models.OLGA module
- class immuneML.ml_methods.generative_models.OLGA.OLGA(model_path: Path = None, default_model_name: str = None, locus: Chain = None, region_type: RegionType = RegionType.IMGT_JUNCTION, _olga_model: InternalOlgaModel = None)
Bases: GenerativeModel
This is a wrapper for the OLGA package described by Sethna et al. 2019 (available on PyPI and on GitHub: https://github.com/statbiophys/OLGA). This model should be used only for LIgO simulation and is not yet supported for use with the TrainGenModel instruction.
Reference:
Zachary Sethna, Yuval Elhanati, Curtis G Callan, Jr, Aleksandra M Walczak, Thierry Mora, OLGA: fast computation of generation probabilities of B- and T-cell receptor amino acid sequences and motifs, Bioinformatics, Volume 35, Issue 17, 1 September 2019, Pages 2974–2981, https://doi.org/10.1093/bioinformatics/btz035
Note:
OLGA generates sequences that correspond to the IMGT junction, and they are used for matching as such. See https://github.com/statbiophys/OLGA for more details.
Gene names are as provided in OLGA (either in default models or in the user-specified model files). For simulation, one should use gene names in the same format.
Note
While this is a generative model, in the current version of immuneML it cannot be used in combination with the TrainGenModel or ApplyGenModel instructions. If you want to use OLGA for sequence simulation, see Dataset simulation with LIgO.
Specification arguments:
model_path (str): if a default model is not used, this parameter should point to a folder containing the four OLGA/IGOR format model files (such a model could also be inferred from experimental data)
default_model_name (str): if a custom model is not used, one of the OLGA default models can be specified here; the value should be the same as the one passed on the command line in OLGA, e.g., humanTRB, humanIGH
YAML specification:
definitions:
    ml_methods:
        generative_model:
            type: OLGA
            model_path: None
            default_model_name: humanTRB
- DEFAULT_MODEL_FOLDER_MAP = {'humanIGH': 'human_B_heavy', 'humanIGK': 'human_B_kappa', 'humanIGL': 'human_B_lambda', 'humanTRA': 'human_T_alpha', 'humanTRB': 'human_T_beta', 'mouseTRA': 'mouse_T_alpha', 'mouseTRB': 'mouse_T_beta'}
- MODEL_FILENAMES = {'j_gene_anchor': 'J_gene_CDR3_anchors.csv', 'marginals': 'model_marginals.txt', 'params': 'model_params.txt', 'v_gene_anchor': 'V_gene_CDR3_anchors.csv'}
- OUTPUT_COLUMNS = ['sequence', 'sequence_aa', 'v_call', 'j_call', 'region_type', 'frame_type', 'p_gen', 'from_default_model', 'duplicate_count', 'locus']
- compute_p_gen(sequence: dict, sequence_type: SequenceType, sequence_field: str = None) -> float
- compute_p_gens(sequences: BNPDataClass, sequence_type: SequenceType, sequence_field: str = None) -> list
- default_model_name: str = None
- generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)
- generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool) -> Path
- property is_vdj
- load_internal_model(model_path: Path = None) -> InternalOlgaModel
- locus: Chain = None
- model_path: Path = None
- region_type: RegionType = 'junction'
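When a custom model_path is used, the folder is expected to hold the four files named in MODEL_FILENAMES. The following sketch checks a folder for those files before it is handed to the model. The find_missing_model_files helper is hypothetical and not part of immuneML or OLGA; only the file names are taken from the listing above.

```python
from pathlib import Path
import tempfile

# File names copied from MODEL_FILENAMES in the class listing.
MODEL_FILENAMES = {
    "j_gene_anchor": "J_gene_CDR3_anchors.csv",
    "marginals": "model_marginals.txt",
    "params": "model_params.txt",
    "v_gene_anchor": "V_gene_CDR3_anchors.csv",
}


def find_missing_model_files(model_path: Path) -> list:
    """Hypothetical helper: list expected model files that are absent from `model_path`."""
    return sorted(name for name in MODEL_FILENAMES.values()
                  if not (model_path / name).is_file())


# Demonstration on a temporary folder that holds only two of the four files.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "model_params.txt").write_text("")
    (folder / "model_marginals.txt").write_text("")
    missing = find_missing_model_files(folder)

print(missing)
```

An empty result means all four files are present; the check says nothing about whether their contents are valid.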
immuneML.ml_methods.generative_models.PWM module
- class immuneML.ml_methods.generative_models.PWM.PWM(locus, sequence_type: str, region_type: str, name: str = None)
Bases: GenerativeModel
This is a baseline implementation of a positional weight matrix. It is estimated from a set of sequences for each of the different lengths that appear in the dataset.
Specification arguments:
locus (str): which chain is generated (for now, it is only assigned to the generated sequences)
sequence_type (str): amino_acid or nucleotide
region_type (str): which region type to use (e.g., IMGT_CDR3), this is only assigned to the generated sequences
YAML specification:
definitions:
    ml_methods:
        my_pwm:
            PWM:
                locus: beta
                sequence_type: amino_acid
                region_type: IMGT_CDR3
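The idea of estimating one positional weight matrix per sequence length can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the immuneML implementation: the function names are hypothetical, no pseudocounts are used, and drawing the length from the empirical length distribution is an assumption that the description above does not spell out.

```python
import random
from collections import Counter, defaultdict


def fit_pwms(sequences: list) -> tuple:
    """Estimate one position weight matrix per sequence length.

    Returns (length_probs, pwms), where pwms[length][position] maps each
    letter to its relative frequency at that position.
    """
    by_length = defaultdict(list)
    for seq in sequences:
        by_length[len(seq)].append(seq)

    length_probs = {length: len(group) / len(sequences)
                    for length, group in by_length.items()}
    pwms = {}
    for length, group in by_length.items():
        pwms[length] = []
        for position in range(length):
            counts = Counter(seq[position] for seq in group)
            pwms[length].append({letter: n / len(group) for letter, n in counts.items()})
    return length_probs, pwms


def sample_sequence(length_probs: dict, pwms: dict, rng: random.Random) -> str:
    """Draw a length, then draw each position independently from its column."""
    length = rng.choices(list(length_probs), weights=list(length_probs.values()))[0]
    return "".join(
        rng.choices(list(column), weights=list(column.values()))[0]
        for column in pwms[length]
    )


def p_gen(sequence: str, length_probs: dict, pwms: dict) -> float:
    """Probability of a sequence under the fitted model (0 for unseen lengths or letters)."""
    if len(sequence) not in pwms:
        return 0.0
    prob = length_probs[len(sequence)]
    for letter, column in zip(sequence, pwms[len(sequence)]):
        prob *= column.get(letter, 0.0)
    return prob


# Made-up training sequences: three of length 5 and one of length 3.
training = ["CASSF", "CASSL", "CATSF", "CAF"]
length_probs, pwms = fit_pwms(training)
generated = [sample_sequence(length_probs, pwms, random.Random(seed)) for seed in range(3)]
print(generated, p_gen("CASSF", length_probs, pwms))
```

Because positions are treated as independent, the model can produce sequences that never occurred in the training data, which is one reason a PWM is usually regarded as a baseline.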
- compute_p_gen(sequence: dict, sequence_type: SequenceType) -> float
- compute_p_gens(sequences, sequence_type: SequenceType) -> ndarray
- generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)
- generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool)
immuneML.ml_methods.generative_models.SimpleLSTM module
- class immuneML.ml_methods.generative_models.SimpleLSTM.SimpleLSTM(locus: str, sequence_type: str, hidden_size: int, learning_rate: float, num_epochs: int, batch_size: int, num_layers: int, embed_size: int, temperature, device: str, name=None, region_type: str = RegionType.IMGT_CDR3)
Bases: GenerativeModel
This is a simple generative model for receptor sequences based on LSTM.
Similar models have been proposed in:
Akbar, R. et al. (2022). In silico proof of principle of machine learning-based antibody design at unconstrained scale. mAbs, 14(1), 2031482. https://doi.org/10.1080/19420862.2022.2031482
Saka, K. et al. (2021). Antibody design using LSTM based deep generative model from phage display library for affinity maturation. Scientific Reports, 11(1), Article 1. https://doi.org/10.1038/s41598-021-85274-7
Specification arguments:
sequence_type (str): whether the model should work on amino_acid or nucleotide level
hidden_size (int): how many LSTM cells should exist per layer
num_layers (int): how many hidden LSTM layers should there be
num_epochs (int): for how many epochs to train the model
learning_rate (float): what learning rate to use for optimization
batch_size (int): how many examples (sequences) to use for training for one batch
embed_size (int): the dimension of the sequence embedding
temperature (float): a higher temperature leads to faster yet more unstable learning
YAML specification:
definitions:
    ml_methods:
        my_simple_lstm:
            SimpleLSTM:
                sequence_type: amino_acid
                hidden_size: 50
                num_layers: 1
                num_epochs: 5000
                learning_rate: 0.001
                batch_size: 100
                embed_size: 100
- ITER_TO_REPORT = 100
- compute_p_gen(sequence: dict, sequence_type: SequenceType) -> float
- compute_p_gens(sequences, sequence_type: SequenceType) -> ndarray
- generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)
- generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool)
immuneML.ml_methods.generative_models.SimpleVAE module
- class immuneML.ml_methods.generative_models.SimpleVAE.PyTorchSequenceDataset(data)
Bases: Dataset
- class immuneML.ml_methods.generative_models.SimpleVAE.SimpleVAE(locus, beta, latent_dim, linear_nodes_count, num_epochs, batch_size, j_gene_embed_dim, pretrains, v_gene_embed_dim, cdr3_embed_dim, warmup_epochs, patience, iter_count_prob_estimation, device, vocab=None, max_cdr3_len=None, unique_v_genes=None, unique_j_genes=None, name: str = None)
Bases: GenerativeModel
SimpleVAE is a sequence-level generative model based on a variational autoencoder. This type of model was proposed by Davidsen et al. 2019, and this implementation is inspired by their original implementation available at https://github.com/matsengrp/vampire.
References:
Davidsen, K., Olson, B. J., DeWitt, W. S., III, Feng, J., Harkins, E., Bradley, P., & Matsen, F. A., IV. (2019). Deep generative models for T cell receptor protein sequences. eLife, 8, e46935. https://doi.org/10.7554/eLife.46935
Specification arguments:
locus (str): which locus the sequences come from, e.g., TRB
beta (float): VAE hyperparameter that balances the reconstruction loss and the latent dimension regularization
latent_dim (int): latent dimension of the VAE
linear_nodes_count (int): in linear layers, how many nodes to use
num_epochs (int): how many epochs to use for training
batch_size (int): how many examples to consider at the same time
j_gene_embed_dim (int): dimension of J gene embedding
v_gene_embed_dim (int): dimension of V gene embedding
cdr3_embed_dim (int): dimension of the cdr3 embedding
pretrains (int): how many times to attempt pretraining to initialize the weights and use warm-up for the beta hyperparameter before the main training process
warmup_epochs (int): how many epochs to use for training where beta hyperparameter is linearly increased from 0 up to its max value; this is in addition to num_epochs set above
patience (int): number of epochs to wait before the training is stopped when the loss is not improving
iter_count_prob_estimation (int): how many iterations to use to estimate the log probability of the generated sequence (the more iterations, the better the estimated log probability)
vocab (list): which letters (amino acids) are allowed - this is automatically filled for new models (no need to set)
max_cdr3_len (int): what is the maximum cdr3 length - this is automatically filled for new models (no need to set)
unique_v_genes (list): list of allowed V genes (this will be automatically filled from the dataset if not provided here manually)
unique_j_genes (list): list of allowed J genes (this will be automatically filled from the dataset if not provided here manually)
device (str): name of the device where to train the model (e.g., cpu)
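The interplay of beta and warmup_epochs described in the arguments above can be sketched as a simple schedule. This is a minimal illustration, assuming a standard beta-weighted VAE objective with a KL-divergence regularizer and a 0-based epoch counter that reaches the maximum exactly at epoch warmup_epochs; the function names are hypothetical, and the exact schedule used by SimpleVAE may differ.

```python
def beta_at_epoch(epoch: int, warmup_epochs: int, beta_max: float) -> float:
    """Linearly increase beta from 0 to beta_max over the warm-up epochs.

    `epoch` is 0-based. After warm-up, beta stays at beta_max.
    """
    if warmup_epochs <= 0:
        return beta_max
    return beta_max * min(1.0, epoch / warmup_epochs)


def vae_loss(reconstruction_loss: float, kl_divergence: float, beta: float) -> float:
    """Beta-weighted objective: reconstruction term plus beta times the KL term (assumed form)."""
    return reconstruction_loss + beta * kl_divergence


# Values of beta at selected epochs, using warmup_epochs=20 and beta=0.75.
schedule = [beta_at_epoch(e, warmup_epochs=20, beta_max=0.75) for e in (0, 10, 20, 30)]
print(schedule)
```

Starting with a small beta lets the model first learn to reconstruct sequences before the regularization term is applied at full strength.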
YAML specification:
definitions:
    ml_methods:
        my_vae:
            SimpleVAE:
                locus: beta
                beta: 0.75
                latent_dim: 20
                linear_nodes_count: 75
                num_epochs: 5000
                batch_size: 10000
                j_gene_embed_dim: 13
                v_gene_embed_dim: 30
                cdr3_embed_dim: 21
                pretrains: 10
                warmup_epochs: 20
                patience: 20
                device: cpu
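The patience argument corresponds to a common early-stopping rule, sketched below on a list of per-epoch losses. The stopping_epoch function is hypothetical, and it is an assumption that an epoch counts as an improvement only if it reaches a new lowest loss; SimpleVAE may track improvement differently.

```python
def stopping_epoch(losses: list, patience: int) -> int:
    """Return the 0-based epoch at which training would stop.

    Training stops once `patience` consecutive epochs pass without a new best
    (lowest) loss. If that never happens, the last epoch index is returned.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(losses) - 1


# Made-up loss curve: improves, stalls for two epochs, improves once more, then stalls.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.69, 0.70, 0.70, 0.70]
print(stopping_epoch(losses, patience=2))
```

With patience set to 2 the sketch stops during the first stall and never reaches the later improvement, while a patience of 3 would let training continue past it.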
- compute_p_gen(sequence: dict, sequence_type: SequenceType) -> float
- compute_p_gens(sequences, sequence_type: SequenceType) -> ndarray
- generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)
- generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool)
immuneML.ml_methods.generative_models.SoNNia module
- class immuneML.ml_methods.generative_models.SoNNia.SoNNia(locus=None, batch_size: int = None, epochs: int = None, deep: bool = False, name: str = None, default_model_name: str = None, n_gen_seqs: int = None, include_joint_genes: bool = True, custom_model_path: str = None, region_type: RegionType = RegionType.IMGT_CDR3)
Bases: GenerativeModel
SoNNia models the selection process of T and B cell receptor repertoires. It is based on the SoNNia Python package. It supports SequenceDataset as input, but not RepertoireDataset.
Original publication: Isacchini, G., Walczak, A. M., Mora, T., & Nourmohammad, A. (2021). Deep generative selection models of T and B cell receptor repertoires with soNNia. Proceedings of the National Academy of Sciences, 118(14), e2023141118. https://doi.org/10.1073/pnas.2023141118
Specification arguments:
locus (str)
batch_size (int)
epochs (int)
deep (bool)
include_joint_genes (bool)
n_gen_seqs (int)
custom_model_path (str)
default_model_name (str)
YAML specification:
definitions:
    ml_methods:
        my_sonnia_model:
            SoNNia:
                ...
- compute_p_gen(sequence: dict, sequence_type: SequenceType) -> float
- compute_p_gens(sequences, sequence_type: SequenceType) -> ndarray
- generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)
- generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool)