immuneML.ml_methods.generative_models package

Submodules

immuneML.ml_methods.generative_models.BackgroundSequences module

immuneML.ml_methods.generative_models.ExperimentalImport module

class immuneML.ml_methods.generative_models.ExperimentalImport.ExperimentalImport(dataset: SequenceDataset, original_input_file: Path = None)[source]

Bases: GenerativeModel

Allows importing existing experimental data and running annotations and simulations on top of it. This model should be used only for LIgO simulation and not with the TrainGenModel instruction.

YAML specification:

definitions:
    ml_methods:
        generative_model:
            type: ExperimentalImport
            import_format: AIRR
            tmp_import_path: ./tmp/
            import_params:
                path: path/to/files/
                region_type: IMGT_CDR3 # what part of the sequence to import
                column_mapping: # column mapping AIRR: immuneML
                    junction: sequence
                    junction_aa: sequence_aa
                    locus: chain
classmethod build_object(**kwargs)[source]
can_compute_p_gens() bool[source]
can_generate_from_skewed_gene_models() bool[source]
compute_p_gen(sequence: dict, sequence_type: SequenceType) float[source]
compute_p_gens(sequences, sequence_type: SequenceType) ndarray[source]
generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]
generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool)[source]
is_same(model) bool[source]
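
In a YAML workflow the model is configured as shown above; for a quick programmatic check, the documented build_object classmethod can be used instead. A minimal sketch, assuming build_object accepts the same keys as the YAML specification (the paths are placeholders):

from immuneML.ml_methods.generative_models.ExperimentalImport import ExperimentalImport

# hypothetical programmatic construction mirroring the YAML specification above;
# passing these keys directly to build_object is an assumption
model = ExperimentalImport.build_object(
    import_format="AIRR",
    tmp_import_path="./tmp/",
    import_params={
        "path": "path/to/files/",        # placeholder path to the AIRR files
        "region_type": "IMGT_CDR3",      # what part of the sequence to import
        "column_mapping": {              # column mapping AIRR: immuneML
            "junction": "sequence",
            "junction_aa": "sequence_aa",
            "locus": "chain",
        },
    },
)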

immuneML.ml_methods.generative_models.GenerativeModel module

class immuneML.ml_methods.generative_models.GenerativeModel.GenerativeModel(locus, name: str = None, region_type: RegionType = None)[source]

Bases: object

Generative models are algorithms which can be trained to learn patterns in existing datasets, and then be used to generate new synthetic datasets.

These methods can be used in the TrainGenModel instruction, and previously trained models can be used to generate data using the ApplyGenModel instruction.

DOCS_TITLE = 'Generative models'
OUTPUT_COLUMNS = []
abstract can_compute_p_gens() bool[source]
abstract can_generate_from_skewed_gene_models() bool[source]
abstract compute_p_gen(sequence: dict, sequence_type: SequenceType) float[source]
abstract compute_p_gens(sequences, sequence_type: SequenceType) ndarray[source]
abstract fit(data, path: Path = None)[source]
abstract generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]
abstract generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool) Dataset[source]
abstract is_same(model) bool[source]
abstract classmethod load_model(path: Path)[source]
abstract save_model(path: Path) Path[source]
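
The abstract methods above define the contract that every generative model in immuneML fulfils. Below is a minimal sketch of a hypothetical subclass with placeholder bodies, only to illustrate which methods a new model has to override (the class is not part of immuneML, and the SequenceType import path is assumed):

from pathlib import Path

import numpy as np

from immuneML.environment.SequenceType import SequenceType  # import path assumed
from immuneML.ml_methods.generative_models.GenerativeModel import GenerativeModel


class MyToyGenModel(GenerativeModel):
    """Hypothetical example subclass; the bodies are placeholders, not a working model."""

    def fit(self, data, path: Path = None):
        pass  # learn model parameters from `data`

    def generate_sequences(self, count: int, seed: int, path: Path,
                           sequence_type: SequenceType, compute_p_gen: bool):
        pass  # build and return a Dataset with `count` generated sequences

    def can_compute_p_gens(self) -> bool:
        return False  # this toy model cannot compute generation probabilities

    def compute_p_gen(self, sequence: dict, sequence_type: SequenceType) -> float:
        raise NotImplementedError

    def compute_p_gens(self, sequences, sequence_type: SequenceType) -> np.ndarray:
        raise NotImplementedError

    def can_generate_from_skewed_gene_models(self) -> bool:
        return False

    def generate_from_skewed_gene_models(self, v_genes: list, j_genes: list, seed: int,
                                         path: Path, sequence_type: SequenceType,
                                         batch_size: int, compute_p_gen: bool):
        raise NotImplementedError

    def is_same(self, model) -> bool:
        return type(model) is type(self)

    def save_model(self, path: Path) -> Path:
        return path  # persist the model parameters under `path` and return the location

    @classmethod
    def load_model(cls, path: Path):
        return cls(locus=None)  # restore a previously saved model from `path`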

immuneML.ml_methods.generative_models.InternalOlgaModel module

class immuneML.ml_methods.generative_models.InternalOlgaModel.InternalOlgaModel(sequence_gen_model: olga.sequence_generation.SequenceGenerationVDJ | olga.sequence_generation.SequenceGenerationVJ = None, v_gene_mapping: list = None, j_gene_mapping: list = None, genomic_data: olga.load_model.GenomicData = None, olga_gen_model: olga.load_model.GenerativeModelVDJ | olga.load_model.GenerativeModelVJ = None)[source]

Bases: object

genomic_data: GenomicData = None
j_gene_mapping: list = None
olga_gen_model: GenerativeModelVDJ | GenerativeModelVJ = None
sequence_gen_model: SequenceGenerationVDJ | SequenceGenerationVJ = None
v_gene_mapping: list = None

immuneML.ml_methods.generative_models.OLGA module

class immuneML.ml_methods.generative_models.OLGA.OLGA(model_path: Path = None, default_model_name: str = None, locus: Chain = None, region_type: RegionType = RegionType.IMGT_JUNCTION, _olga_model: InternalOlgaModel = None)[source]

Bases: GenerativeModel

This is a wrapper for the OLGA package described by Sethna et al. 2019 (OLGA package on PyPI or GitHub: https://github.com/statbiophys/OLGA). This model should be used only for LIgO simulation and is not yet supported for use with the TrainGenModel instruction.

Reference:

Zachary Sethna, Yuval Elhanati, Curtis G Callan, Jr, Aleksandra M Walczak, Thierry Mora, OLGA: fast computation of generation probabilities of B- and T-cell receptor amino acid sequences and motifs, Bioinformatics, Volume 35, Issue 17, 1 September 2019, Pages 2974–2981, https://doi.org/10.1093/bioinformatics/btz035

Note:

  • OLGA generates sequences that correspond to the IMGT junction, and they are used for matching as such. See https://github.com/statbiophys/OLGA for more details.

  • Gene names are as provided in OLGA (either in default models or in the user-specified model files). For simulation, one should use gene names in the same format.

Note

While this is a generative model, in the current version of immuneML it cannot be used in combination with the TrainGenModel or ApplyGenModel instructions. If you want to use OLGA for sequence simulation, see Dataset simulation with LIgO.


Specification arguments:

  • model_path (str): if not using a default model, this parameter should point to a folder where the four OLGA/IGoR format files are stored (these could also be inferred from experimental data)

  • default_model_name (str): if not using a custom model, one of the OLGA default models can be specified here; the value should be the same as would be passed to the OLGA command line, e.g., humanTRB, humanIGH

YAML specification:

definitions:
    ml_methods:
        generative_model:
            type: OLGA
            model_path: None
            default_model_name: humanTRB
DEFAULT_MODEL_FOLDER_MAP = {'humanIGH': 'human_B_heavy', 'humanIGK': 'human_B_kappa', 'humanIGL': 'human_B_lambda', 'humanTRA': 'human_T_alpha', 'humanTRB': 'human_T_beta', 'mouseTRA': 'mouse_T_alpha', 'mouseTRB': 'mouse_T_beta'}
MODEL_FILENAMES = {'j_gene_anchor': 'J_gene_CDR3_anchors.csv', 'marginals': 'model_marginals.txt', 'params': 'model_params.txt', 'v_gene_anchor': 'V_gene_CDR3_anchors.csv'}
OUTPUT_COLUMNS = ['sequence', 'sequence_aa', 'v_call', 'j_call', 'region_type', 'frame_type', 'p_gen', 'from_default_model', 'duplicate_count', 'locus']
classmethod build_object(**kwargs)[source]
can_compute_p_gens() bool[source]
can_generate_from_skewed_gene_models() bool[source]
compute_p_gen(sequence: dict, sequence_type: SequenceType, sequence_field: str = None) float[source]
compute_p_gens(sequences: BNPDataClass, sequence_type: SequenceType, sequence_field: str = None) list[source]
default_model_name: str = None
fit(data, path: Path = None)[source]
generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]
generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool) Path[source]
is_same(model) bool[source]
property is_vdj
load_internal_model(model_path: Path = None) InternalOlgaModel[source]
classmethod load_model(path: Path)[source]
locus: Chain = None
model_path: Path = None
region_type: RegionType = 'junction'
save_model(path: Path) Path[source]
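
Although OLGA is intended for LIgO simulation rather than the TrainGenModel/ApplyGenModel instructions, the documented methods can be exercised directly for small checks. A minimal sketch, assuming build_object accepts the same keys as the YAML specification (default_model_name, model_path) and that SequenceType is importable as shown:

from pathlib import Path

from immuneML.ml_methods.generative_models.OLGA import OLGA
from immuneML.environment.SequenceType import SequenceType  # import path assumed

# build from one of the default models listed in DEFAULT_MODEL_FOLDER_MAP; passing the
# YAML keys (default_model_name, model_path) directly to build_object is an assumption
olga_model = OLGA.build_object(default_model_name="humanTRB", model_path=None)

# generate 100 amino acid sequences; per the signature above, generate_sequences
# writes the result under `path` and returns a Path
output_path = olga_model.generate_sequences(count=100, seed=1,
                                            path=Path("./olga_output/"),
                                            sequence_type=SequenceType.AMINO_ACID,
                                            compute_p_gen=True)

# OLGA supports generation probability computation; the dictionary keys below are an
# assumption based on the OUTPUT_COLUMNS listed above
if olga_model.can_compute_p_gens():
    p_gen = olga_model.compute_p_gen({"sequence_aa": "CASSLGTDTQYF",
                                      "v_call": "TRBV7-9", "j_call": "TRBJ2-3"},
                                     SequenceType.AMINO_ACID)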

immuneML.ml_methods.generative_models.PWM module

class immuneML.ml_methods.generative_models.PWM.PWM(locus, sequence_type: str, region_type: str, name: str = None)[source]

Bases: GenerativeModel

This is a baseline implementation of a positional weight matrix (PWM). A separate PWM is estimated from the set of sequences for each sequence length that appears in the dataset.

Specification arguments:

  • locus (str): which chain is generated (for now, it is only assigned to the generated sequences)

  • sequence_type (str): amino_acid or nucleotide

  • region_type (str): which region type to use (e.g., IMGT_CDR3); this is only assigned to the generated sequences

YAML specification:

definitions:
    ml_methods:
        my_pwm:
            PWM:
                locus: beta
                sequence_type: amino_acid
                region_type: IMGT_CDR3
can_compute_p_gens() bool[source]
can_generate_from_skewed_gene_models() bool[source]
compute_p_gen(sequence: dict, sequence_type: SequenceType) float[source]
compute_p_gens(sequences, sequence_type: SequenceType) ndarray[source]
fit(data: SequenceDataset, path: Path = None)[source]
generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]
generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool)[source]
is_same(model) bool[source]
classmethod load_model(path: Path)[source]
save_model(path: Path) Path[source]
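
A minimal sketch of programmatic use, fitting the PWM on an existing SequenceDataset and sampling new sequences. The dataset itself is a placeholder prepared elsewhere, and the SequenceType import path is an assumption:

from pathlib import Path

from immuneML.ml_methods.generative_models.PWM import PWM
from immuneML.environment.SequenceType import SequenceType  # import path assumed

# constructor arguments follow the documented signature: locus, sequence_type, region_type, name
pwm = PWM(locus="beta", sequence_type="amino_acid", region_type="IMGT_CDR3", name="my_pwm")

dataset = ...  # placeholder: a SequenceDataset prepared elsewhere (e.g., through data import)

# one weight matrix is estimated per sequence length present in the dataset
pwm.fit(data=dataset, path=Path("./pwm_model/"))

# sample new sequences from the estimated matrices
generated = pwm.generate_sequences(count=500, seed=42, path=Path("./pwm_generated/"),
                                   sequence_type=SequenceType.AMINO_ACID,
                                   compute_p_gen=False)

pwm.save_model(Path("./pwm_model/"))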

immuneML.ml_methods.generative_models.SimpleLSTM module

class immuneML.ml_methods.generative_models.SimpleLSTM.SimpleLSTM(locus: str, sequence_type: str, hidden_size: int, learning_rate: float, num_epochs: int, batch_size: int, num_layers: int, embed_size: int, temperature, device: str, name=None, region_type: str = RegionType.IMGT_CDR3)[source]

Bases: GenerativeModel

This is a simple generative model for receptor sequences based on LSTM.

Similar models have been proposed in:

Akbar, R. et al. (2022). In silico proof of principle of machine learning-based antibody design at unconstrained scale. mAbs, 14(1), 2031482. https://doi.org/10.1080/19420862.2022.2031482

Saka, K. et al. (2021). Antibody design using LSTM based deep generative model from phage display library for affinity maturation. Scientific Reports, 11(1), Article 1. https://doi.org/10.1038/s41598-021-85274-7

Specification arguments:

  • sequence_type (str): whether the model should work at the amino_acid or nucleotide level

  • hidden_size (int): how many LSTM cells should exist per layer

  • num_layers (int): how many hidden LSTM layers there should be

  • num_epochs (int): for how many epochs to train the model

  • learning_rate (float): what learning rate to use for optimization

  • batch_size (int): how many examples (sequences) to use in one training batch

  • embed_size (int): the dimension of the sequence embedding

  • temperature (float): a higher temperature leads to faster yet more unstable learning

YAML specification:

definitions:
    ml_methods:
        my_simple_lstm:
            SimpleLSTM:
                sequence_type: amino_acid
                hidden_size: 50
                num_layers: 1
                num_epochs: 5000
                learning_rate: 0.001
                batch_size: 100
                embed_size: 100
ITER_TO_REPORT = 100
can_compute_p_gens() bool[source]
can_generate_from_skewed_gene_models() bool[source]
compute_p_gen(sequence: dict, sequence_type: SequenceType) float[source]
compute_p_gens(sequences, sequence_type: SequenceType) ndarray[source]
fit(data, path: Path = None)[source]
generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]
generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool)[source]
is_same(model) bool[source]
classmethod load_model(path: Path)[source]
make_new_model(state_dict_file: Path = None)[source]
save_model(path: Path) Path[source]
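
A minimal sketch instantiating the model with the documented constructor arguments, training it, and generating new sequences. The dataset is a placeholder; locus, temperature and device are required by the constructor but not shown in the YAML example above, so their values here are assumptions, as is the SequenceType import path:

from pathlib import Path

from immuneML.ml_methods.generative_models.SimpleLSTM import SimpleLSTM
from immuneML.environment.SequenceType import SequenceType  # import path assumed

# hyperparameter values mirror the YAML example above; locus, temperature and device
# are assumed example values
lstm = SimpleLSTM(locus="beta", sequence_type="amino_acid", hidden_size=50,
                  learning_rate=0.001, num_epochs=5000, batch_size=100,
                  num_layers=1, embed_size=100, temperature=1.0, device="cpu",
                  name="my_simple_lstm")

dataset = ...  # placeholder: a SequenceDataset prepared elsewhere

lstm.fit(data=dataset, path=Path("./lstm_model/"))

generated = lstm.generate_sequences(count=1000, seed=1, path=Path("./lstm_generated/"),
                                    sequence_type=SequenceType.AMINO_ACID,
                                    compute_p_gen=False)

# models can be persisted and restored with the documented save_model/load_model pair
lstm.save_model(Path("./lstm_model/"))
reloaded = SimpleLSTM.load_model(Path("./lstm_model/"))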

immuneML.ml_methods.generative_models.SimpleVAE module

class immuneML.ml_methods.generative_models.SimpleVAE.PyTorchSequenceDataset(data)[source]

Bases: Dataset

get_j_genes()[source]
get_v_genes()[source]
class immuneML.ml_methods.generative_models.SimpleVAE.SimpleVAE(locus, beta, latent_dim, linear_nodes_count, num_epochs, batch_size, j_gene_embed_dim, pretrains, v_gene_embed_dim, cdr3_embed_dim, warmup_epochs, patience, iter_count_prob_estimation, device, vocab=None, max_cdr3_len=None, unique_v_genes=None, unique_j_genes=None, name: str = None)[source]

Bases: GenerativeModel

SimpleVAE is a sequence-level generative model based on a variational autoencoder. This type of model was proposed by Davidsen et al. 2019, and this implementation is inspired by their original implementation available at https://github.com/matsengrp/vampire.

References:

Davidsen, K., Olson, B. J., DeWitt, W. S., III, Feng, J., Harkins, E., Bradley, P., & Matsen, F. A., IV. (2019). Deep generative models for T cell receptor protein sequences. eLife, 8, e46935. https://doi.org/10.7554/eLife.46935

Specification arguments:

  • locus (str): which locus the sequences come from, e.g., TRB

  • beta (float): VAE hyperparameter that balances the reconstruction loss and the latent-dimension regularization

  • latent_dim (int): latent dimension of the VAE

  • linear_nodes_count (int): in linear layers, how many nodes to use

  • num_epochs (int): how many epochs to use for training

  • batch_size (int): how many examples to consider at the same time

  • j_gene_embed_dim (int): dimension of J gene embedding

  • v_gene_embed_dim (int): dimension of V gene embedding

  • cdr3_embed_dim (int): dimension of the cdr3 embedding

  • pretrains (int): how many times to attempt pretraining to initialize the weights and use warm-up for the beta hyperparameter before the main training process

  • warmup_epochs (int): how many epochs to use for training where beta hyperparameter is linearly increased from 0 up to its max value; this is in addition to num_epochs set above

  • patience (int): number of epochs to wait before the training is stopped when the loss is not improving

  • iter_count_prob_estimation (int): how many iterations to use to estimate the log probability of the generated sequence (the more iterations, the better the estimated log probability)

  • vocab (list): which letters (amino acids) are allowed - this is automatically filled for new models (no need to set)

  • max_cdr3_len (int): what is the maximum cdr3 length - this is automatically filled for new models (no need to set)

  • unique_v_genes (list): list of allowed V genes (this will be automatically filled from the dataset if not provided here manually)

  • unique_j_genes (list): list of allowed J genes (this will be automatically filled from the dataset if not provided here manually)

  • device (str): name of the device where to train the model (e.g., cpu)

YAML specification:

definitions:
    ml_methods:
        my_vae:
            SimpleVAE:
                locus: beta
                beta: 0.75
                latent_dim: 20
                linear_nodes_count: 75
                num_epochs: 5000
                batch_size: 10000
                j_gene_embed_dim: 13
                v_gene_embed_dim: 30
                cdr3_embed_dim: 21
                pretrains: 10
                warmup_epochs: 20
                patience: 20
                device: cpu
can_compute_p_gens() bool[source]
can_generate_from_skewed_gene_models() bool[source]
compute_p_gen(sequence: dict, sequence_type: SequenceType) float[source]
compute_p_gens(sequences, sequence_type: SequenceType) ndarray[source]
encode_dataset(dataset, batch_size=None, shuffle=True)[source]
fit(data, path: Path = None)[source]
generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]
generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool)[source]
is_same(model) bool[source]
classmethod load_model(path: Path)[source]
make_new_model(initial_values_path: Path = None)[source]
save_model(path: Path) Path[source]
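
A minimal sketch constructing the VAE with the values from the YAML example above plus the remaining required constructor arguments; the iter_count_prob_estimation value, the placeholder dataset and the SequenceType import path are assumptions:

from pathlib import Path

from immuneML.ml_methods.generative_models.SimpleVAE import SimpleVAE
from immuneML.environment.SequenceType import SequenceType  # import path assumed

# values mirror the YAML example above; iter_count_prob_estimation is an assumed example value
vae = SimpleVAE(locus="beta", beta=0.75, latent_dim=20, linear_nodes_count=75,
                num_epochs=5000, batch_size=10000, j_gene_embed_dim=13,
                v_gene_embed_dim=30, cdr3_embed_dim=21, pretrains=10,
                warmup_epochs=20, patience=20, iter_count_prob_estimation=100,
                device="cpu", name="my_vae")

dataset = ...  # placeholder: a SequenceDataset with CDR3 sequences and V/J gene calls

# vocab, max_cdr3_len, unique_v_genes and unique_j_genes are filled automatically
# for new models (per the specification arguments above)
vae.fit(data=dataset, path=Path("./vae_model/"))

generated = vae.generate_sequences(count=1000, seed=1, path=Path("./vae_generated/"),
                                   sequence_type=SequenceType.AMINO_ACID,
                                   compute_p_gen=False)

vae.save_model(Path("./vae_model/"))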

immuneML.ml_methods.generative_models.SoNNia module

class immuneML.ml_methods.generative_models.SoNNia.SoNNia(locus=None, batch_size: int = None, epochs: int = None, deep: bool = False, name: str = None, default_model_name: str = None, n_gen_seqs: int = None, include_joint_genes: bool = True, custom_model_path: str = None, region_type: RegionType = RegionType.IMGT_CDR3)[source]

Bases: GenerativeModel

SoNNia models the selection process of T and B cell receptor repertoires. It is based on the SoNNia Python package. It supports SequenceDataset as input, but not RepertoireDataset.

Original publication: Isacchini, G., Walczak, A. M., Mora, T., & Nourmohammad, A. (2021). Deep generative selection models of T and B cell receptor repertoires with soNNia. Proceedings of the National Academy of Sciences, 118(14), e2023141118. https://doi.org/10.1073/pnas.2023141118

Specification arguments:

  • locus (str)

  • batch_size (int)

  • epochs (int)

  • deep (bool)

  • include_joint_genes (bool)

  • n_gen_seqs (int)

  • custom_model_path (str)

  • default_model_name (str)

YAML specification:

definitions:
    ml_methods:
        my_sonnia_model:
            SoNNia:
                ...
can_compute_p_gens() bool[source]
can_generate_from_skewed_gene_models() bool[source]
compute_p_gen(sequence: dict, sequence_type: SequenceType) float[source]
compute_p_gens(sequences, sequence_type: SequenceType) ndarray[source]
fit(dataset: Dataset, path: Path = None)[source]
generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]
generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool)[source]
is_same(model) bool[source]
classmethod load_model(path: Path)[source]
save_model(path: Path) Path[source]
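
A minimal sketch using the documented constructor, where all arguments have defaults; the chosen values, the default_model_name, the placeholder dataset and the SequenceType import path are assumptions. SoNNia supports SequenceDataset input only:

from pathlib import Path

from immuneML.ml_methods.generative_models.SoNNia import SoNNia
from immuneML.environment.SequenceType import SequenceType  # import path assumed

# all constructor arguments have defaults; the values below are illustrative assumptions
sonnia = SoNNia(locus="TRB", batch_size=1000, epochs=30, deep=False,
                default_model_name="humanTRB", n_gen_seqs=10000,
                include_joint_genes=True, name="my_sonnia_model")

dataset = ...  # placeholder: a SequenceDataset (RepertoireDataset is not supported)

sonnia.fit(dataset=dataset, path=Path("./sonnia_model/"))

generated = sonnia.generate_sequences(count=1000, seed=1, path=Path("./sonnia_generated/"),
                                      sequence_type=SequenceType.AMINO_ACID,
                                      compute_p_gen=False)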

Module contents