immuneML.ml_methods.generative_models package¶
Submodules¶
immuneML.ml_methods.generative_models.BackgroundSequences module¶
immuneML.ml_methods.generative_models.ExperimentalImport module¶
- class immuneML.ml_methods.generative_models.ExperimentalImport.ExperimentalImport(dataset: SequenceDataset, original_input_file: Path = None)[source]¶
Bases:
GenerativeModel
Imports existing experimental data so that annotations and simulations can be run on top of it. This model should be used only for LIgO simulation, not with the TrainGenModel instruction.
YAML specification:
definitions:
    ml_methods:
        generative_model:
            type: ExperimentalImport
            import_format: AIRR
            tmp_import_path: ./tmp/
            import_params:
                path: path/to/files/
                region_type: IMGT_CDR3 # what part of the sequence to import
                column_mapping: # column mapping AIRR: immuneML
                    junction: sequence
                    junction_aa: sequence_aa
                    locus: chain
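The column_mapping entry in the specification above pairs AIRR column names with immuneML field names. As a rough illustration of what such a mapping amounts to, the sketch below renames the keys of a single record. The helper `apply_column_mapping` and the example record are hypothetical and are not part of immuneML; the actual import logic may differ.

```python
# Hypothetical illustration; apply_column_mapping is not an immuneML function.
column_mapping = {"junction": "sequence", "junction_aa": "sequence_aa", "locus": "chain"}


def apply_column_mapping(row: dict, mapping: dict) -> dict:
    """Rename the keys of one AIRR record; keys without a mapping are kept unchanged."""
    return {mapping.get(key, key): value for key, value in row.items()}


# Made-up AIRR record used only for demonstration.
airr_row = {"junction": "TGTGCCAGCAGTTTC", "junction_aa": "CASSF", "locus": "TRB", "v_call": "TRBV7-9"}
renamed = apply_column_mapping(airr_row, column_mapping)
```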
- compute_p_gen(sequence: dict, sequence_type: SequenceType) float [source]¶
- compute_p_gens(sequences, sequence_type: SequenceType) ndarray [source]¶
- generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]¶
- generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool)[source]¶
immuneML.ml_methods.generative_models.GenerativeModel module¶
- class immuneML.ml_methods.generative_models.GenerativeModel.GenerativeModel(locus: Chain, name: str = None, region_type: RegionType = None, seed=None)[source]¶
Bases:
object
Generative models are algorithms which can be trained to learn patterns in existing datasets, and then be used to generate new synthetic datasets.
These methods can be used in the TrainGenModel instruction, and previously trained models can be used to generate data using the ApplyGenModel instruction.
- DOCS_TITLE = 'Generative models'¶
- OUTPUT_COLUMNS = []¶
- abstract compute_p_gen(sequence: dict, sequence_type: SequenceType) float [source]¶
- abstract compute_p_gens(sequences, sequence_type: SequenceType) ndarray [source]¶
- abstract generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]¶
- abstract generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool) Dataset [source]¶
immuneML.ml_methods.generative_models.InternalOlgaModel module¶
- class immuneML.ml_methods.generative_models.InternalOlgaModel.InternalOlgaModel(sequence_gen_model: olga.sequence_generation.SequenceGenerationVDJ | olga.sequence_generation.SequenceGenerationVJ = None, v_gene_mapping: list = None, j_gene_mapping: list = None, genomic_data: olga.load_model.GenomicData = None, olga_gen_model: olga.load_model.GenerativeModelVDJ | olga.load_model.GenerativeModelVJ = None)[source]¶
Bases:
object
- genomic_data: olga.load_model.GenomicData = None¶
- j_gene_mapping: list = None¶
- olga_gen_model: olga.load_model.GenerativeModelVDJ | olga.load_model.GenerativeModelVJ = None¶
- sequence_gen_model: olga.sequence_generation.SequenceGenerationVDJ | olga.sequence_generation.SequenceGenerationVJ = None¶
- v_gene_mapping: list = None¶
immuneML.ml_methods.generative_models.KLEvaluator module¶
- immuneML.ml_methods.generative_models.KLEvaluator.KL(sequences, model_1, model_2)[source]¶
Computes the KL divergence between two models (model_1 and model_2) for a given set of sequences.
- Parameters:
sequences – list of sequences
model_1 – model 1
model_2 – model 2
- Returns:
KL divergence value
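The source does not state how the divergence is computed internally. One common approach, sketched below under that caveat, is a Monte Carlo estimate: if the sequences were sampled from model_1, the average of log p1(s) - log p2(s) over the sample approximates KL(model_1 || model_2). The function `kl_estimate` and the toy probability tables are assumptions made for illustration, not the immuneML implementation.

```python
import math


def kl_estimate(sequences, p_gen_1, p_gen_2):
    """Monte Carlo estimate of KL(model_1 || model_2) in nats.

    Assumes `sequences` were sampled from model_1 and that both callables
    return a strictly positive probability for every sequence.
    """
    log_ratios = [math.log(p_gen_1(s)) - math.log(p_gen_2(s)) for s in sequences]
    return sum(log_ratios) / len(log_ratios)


# Toy models over two made-up sequences, represented as probability tables.
model_1 = {"CASSF": 0.5, "CASSL": 0.5}
model_2 = {"CASSF": 0.25, "CASSL": 0.75}
sample = ["CASSF", "CASSL", "CASSF", "CASSL"]
kl = kl_estimate(sample, model_1.get, model_2.get)
```

With real generative models, the two callables would wrap whatever method returns a generation probability for a sequence.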
- class immuneML.ml_methods.generative_models.KLEvaluator.KLEvaluator(true_sequences, simulated_sequences, estimator, n_sequences)[source]¶
Bases:
object
immuneML.ml_methods.generative_models.MultinomialKmerModel module¶
- class immuneML.ml_methods.generative_models.MultinomialKmerModel.EmpiricalLengthDistribution(lengths_frequencies: ndarray)[source]¶
Bases:
object
- class immuneML.ml_methods.generative_models.MultinomialKmerModel.KmerDistribution(*args, **kwargs)[source]¶
Bases:
Protocol
- class immuneML.ml_methods.generative_models.MultinomialKmerModel.KmerModel(kmer_probs: EncodedLookup)[source]¶
Bases:
object
- class immuneML.ml_methods.generative_models.MultinomialKmerModel.MultinomialKmerModel(kmer_probs: EncodedLookup, sequence_length: int)[source]¶
Bases:
object
- class immuneML.ml_methods.generative_models.MultinomialKmerModel.Poisson(mu: float)[source]¶
Bases:
object
- mu: float¶
- class immuneML.ml_methods.generative_models.MultinomialKmerModel.SmoothedLengthDistribution(empirical_distribution, smooth_distribution, p_smooth)[source]¶
Bases:
object
- immuneML.ml_methods.generative_models.MultinomialKmerModel.estimate_kmer_model(kmers: EncodedRaggedArray, prior_count=1) MultinomialKmerModel [source]¶
- immuneML.ml_methods.generative_models.MultinomialKmerModel.estimate_length_distribution(lengths: ndarray) EmpiricalLengthDistribution [source]¶
- immuneML.ml_methods.generative_models.MultinomialKmerModel.estimate_smoothed_length_distribution(lengths: ndarray, prior_count=1) SmoothedLengthDistribution [source]¶
immuneML.ml_methods.generative_models.OLGA module¶
- class immuneML.ml_methods.generative_models.OLGA.OLGA(model_path: Path = None, default_model_name: str = None, locus: Chain = None, region_type: RegionType = RegionType.IMGT_JUNCTION, _olga_model: InternalOlgaModel = None)[source]¶
Bases:
GenerativeModel
This is a wrapper for the OLGA package as described by Sethna et al. 2019 (OLGA package on PyPI or GitHub: https://github.com/statbiophys/OLGA ). This model should be used only for LIgO simulation and is not yet supported for use with the TrainGenModel instruction.
Reference:
Zachary Sethna, Yuval Elhanati, Curtis G Callan, Jr, Aleksandra M Walczak, Thierry Mora, OLGA: fast computation of generation probabilities of B- and T-cell receptor amino acid sequences and motifs, Bioinformatics, Volume 35, Issue 17, 1 September 2019, Pages 2974–2981, https://doi.org/10.1093/bioinformatics/btz035
Note:
OLGA generates sequences that correspond to the IMGT junction, and they are matched as such. See https://github.com/statbiophys/OLGA for more details.
Gene names are as provided in OLGA (either in default models or in the user-specified model files). For simulation, one should use gene names in the same format.
Note
While this is a generative model, in the current version of immuneML it cannot be used in combination with the TrainGenModel or ApplyGenModel instructions. If you want to use OLGA for sequence simulation, see Dataset simulation with LIgO.
Specification arguments:
model_path (str): if not using a default model, the path to a folder that contains the four OLGA/IGOR format files (such a model could also be inferred from experimental data)
default_model_name (str): if not using a custom model, the name of one of the OLGA default models; the value should be the same as the one passed on the OLGA command line, e.g., humanTRB or humanIGH
YAML specification:
definitions:
    ml_methods:
        generative_model:
            type: OLGA
            model_path: None
            default_model_name: humanTRB
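A custom model_path is expected to point to a folder with four model files. The sketch below checks such a folder before it is used in a specification. The file names are copied from the MODEL_FILENAMES attribute of this class; the helper `missing_model_files` is hypothetical and not part of immuneML or OLGA.

```python
from pathlib import Path
import tempfile

# File names copied from the MODEL_FILENAMES attribute of the OLGA class.
MODEL_FILENAMES = {
    "marginals": "model_marginals.txt",
    "params": "model_params.txt",
    "v_gene_anchor": "V_gene_CDR3_anchors.csv",
    "j_gene_anchor": "J_gene_CDR3_anchors.csv",
}


def missing_model_files(model_path: Path) -> list:
    """Return the expected file names that are absent from model_path (hypothetical helper)."""
    return sorted(name for name in MODEL_FILENAMES.values() if not (model_path / name).is_file())


# Demonstration on a temporary folder that holds only two of the four files.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "model_params.txt").touch()
    (folder / "model_marginals.txt").touch()
    missing = missing_model_files(folder)
```

An empty list would indicate that all four files are present.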
- DEFAULT_MODEL_FOLDER_MAP = {'humanIGH': 'human_B_heavy', 'humanIGK': 'human_B_kappa', 'humanIGL': 'human_B_lambda', 'humanTRA': 'human_T_alpha', 'humanTRB': 'human_T_beta', 'mouseTRA': 'mouse_T_alpha', 'mouseTRB': 'mouse_T_beta'}¶
- MODEL_FILENAMES = {'j_gene_anchor': 'J_gene_CDR3_anchors.csv', 'marginals': 'model_marginals.txt', 'params': 'model_params.txt', 'v_gene_anchor': 'V_gene_CDR3_anchors.csv'}¶
- OUTPUT_COLUMNS = ['sequence', 'sequence_aa', 'v_call', 'j_call', 'region_type', 'frame_type', 'p_gen', 'from_default_model', 'duplicate_count', 'locus']¶
- compute_p_gen(sequence: dict, sequence_type: SequenceType, sequence_field: str = None) float [source]¶
- compute_p_gens(sequences: BNPDataClass, sequence_type: SequenceType, sequence_field: str = None) list [source]¶
- default_model_name: str = None¶
- generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]¶
- generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool) Path [source]¶
- property is_vdj¶
- load_internal_model(model_path: Path = None) InternalOlgaModel [source]¶
- model_path: Path = None¶
- region_type: RegionType = 'junction'¶
immuneML.ml_methods.generative_models.PWM module¶
- class immuneML.ml_methods.generative_models.PWM.PWM(locus, sequence_type: str, region_type: str, name: str = None)[source]¶
Bases:
GenerativeModel
This is a baseline implementation of a position weight matrix (PWM). A separate matrix is estimated for each sequence length that appears in the dataset.
Specification arguments:
locus (str): which chain is generated (for now, it is only assigned to the generated sequences)
sequence_type (str): amino_acid or nucleotide
region_type (str): which region type to use (e.g., IMGT_CDR3); this is only assigned to the generated sequences
YAML specification:
definitions:
    ml_methods:
        my_pwm:
            PWM:
                locus: beta
                sequence_type: amino_acid
                region_type: IMGT_CDR3
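To make the idea of one matrix per sequence length concrete, here is a minimal NumPy sketch. It is not the immuneML implementation: the function names, the amino acid alphabet ordering, and the pseudocount (added so that unseen residues keep a non-zero probability) are all assumptions made for illustration.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def fit_pwms(sequences, pseudocount=1.0):
    """Estimate one (length x alphabet) probability matrix per sequence length."""
    index = {aa: i for i, aa in enumerate(ALPHABET)}
    counts = {}
    for seq in sequences:
        # Start each new length from a matrix filled with the pseudocount.
        matrix = counts.setdefault(len(seq), np.full((len(seq), len(ALPHABET)), pseudocount))
        for pos, aa in enumerate(seq):
            matrix[pos, index[aa]] += 1
    # Normalize every row (position) so that it sums to one.
    return {length: m / m.sum(axis=1, keepdims=True) for length, m in counts.items()}


def sample(pwm, rng):
    """Draw one sequence by sampling each position independently."""
    return "".join(ALPHABET[rng.choice(len(ALPHABET), p=row)] for row in pwm)


pwms = fit_pwms(["CASSF", "CASSL", "CASF"])  # made-up training sequences
rng = np.random.default_rng(0)
generated = sample(pwms[5], rng)
```

Because positions are sampled independently, a model of this kind cannot capture dependencies between positions, which is consistent with its role as a baseline.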
- compute_p_gen(sequence: dict, sequence_type: SequenceType) float [source]¶
- compute_p_gens(sequences, sequence_type: SequenceType) ndarray [source]¶
- fit(data: SequenceDataset, path: Path = None)[source]¶
- generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]¶
- generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool)[source]¶
immuneML.ml_methods.generative_models.SequenceTransitionDistribution module¶
- class immuneML.ml_methods.generative_models.SequenceTransitionDistribution.EncodedLookup(lookup: ndarray, encoding: Encoding)[source]¶
Bases:
NDArrayOperatorsMixin
- property alphabet_size¶
- property encoding¶
- class immuneML.ml_methods.generative_models.SequenceTransitionDistribution.SequenceTransitionDistribution(transition_matrix: immuneML.ml_methods.generative_models.SequenceTransitionDistribution.EncodedLookup, initial_distribution: immuneML.ml_methods.generative_models.SequenceTransitionDistribution.EncodedLookup, end_probs: immuneML.ml_methods.generative_models.SequenceTransitionDistribution.EncodedLookup)[source]¶
Bases:
object
- end_probs: EncodedLookup¶
- initial_distribution: EncodedLookup¶
- transition_matrix: EncodedLookup¶
immuneML.ml_methods.generative_models.SimpleLSTM module¶
- class immuneML.ml_methods.generative_models.SimpleLSTM.SimpleLSTM(locus: str, sequence_type: str, hidden_size: int, learning_rate: float, num_epochs: int, batch_size: int, num_layers: int, embed_size: int, temperature, device: str, name=None, region_type: str = 'IMGT_CDR3', prime_str: str = 'C', window_size: int = 64, seed: int = None, iter_to_report: int = 1)[source]¶
Bases:
GenerativeModel
This is a simple generative model for receptor sequences based on LSTM.
Similar models have been proposed in:
Akbar, R. et al. (2022). In silico proof of principle of machine learning-based antibody design at unconstrained scale. mAbs, 14(1), 2031482. https://doi.org/10.1080/19420862.2022.2031482
Saka, K. et al. (2021). Antibody design using LSTM based deep generative model from phage display library for affinity maturation. Scientific Reports, 11(1), Article 1. https://doi.org/10.1038/s41598-021-85274-7
Specification arguments:
sequence_type (str): whether the model should work on amino_acid or nucleotide level
hidden_size (int): number of LSTM cells per layer
num_layers (int): number of hidden LSTM layers
num_epochs (int): for how many epochs to train the model
learning_rate (float): what learning rate to use for optimization
batch_size (int): how many examples (sequences) to use for training for one batch
embed_size (int): the dimension of the sequence embedding
temperature (float): scaling applied to the output distribution when sampling; a higher temperature leads to more diverse but less reliable generated sequences
prime_str (str): the initial sequence to start generating from
seed (int): random seed for the model or None
iter_to_report (int): number of epochs between training progress reports
YAML specification:
definitions:
    ml_methods:
        my_simple_lstm:
            SimpleLSTM:
                sequence_type: amino_acid
                hidden_size: 50
                num_layers: 1
                num_epochs: 5000
                learning_rate: 0.001
                batch_size: 100
                embed_size: 100
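In LSTM-based sequence samplers, a temperature typically divides the output logits before the next character is sampled. The sketch below illustrates that convention with NumPy. It assumes SimpleLSTM follows this convention, which the source does not state; the function `sample_with_temperature` and the example logits are hypothetical.

```python
import numpy as np


def sample_with_temperature(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature) and return it with the probabilities."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract the maximum for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs)), probs


rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate characters
_, sharp = sample_with_temperature(logits, 0.5, rng)  # low temperature
_, flat = sample_with_temperature(logits, 2.0, rng)   # high temperature
```

A low temperature concentrates probability on the highest-scoring character, while a high temperature flattens the distribution and makes rarer characters more likely.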
- compute_p_gen(sequence: dict, sequence_type: SequenceType) float [source]¶
- compute_p_gens(sequences, sequence_type: SequenceType) ndarray [source]¶
- generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]¶
- generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool, max_failed_trials: int = 1000)[source]¶
immuneML.ml_methods.generative_models.SimpleVAE module¶
immuneML.ml_methods.generative_models.SoNNia module¶
- class immuneML.ml_methods.generative_models.SoNNia.SoNNia(locus=None, batch_size: int = None, epochs: int = None, deep: bool = False, name: str = None, default_model_name: str = None, n_gen_seqs: int = None, include_joint_genes: bool = True, custom_model_path: str = None, seed: int = None)[source]¶
Bases:
GenerativeModel
SoNNia models the selection process of T and B cell receptor repertoires. It is based on the SoNNia Python package. It supports SequenceDataset as input, but not RepertoireDataset.
Original publication: Isacchini, G., Walczak, A. M., Mora, T., & Nourmohammad, A. (2021). Deep generative selection models of T and B cell receptor repertoires with soNNia. Proceedings of the National Academy of Sciences, 118(14), e2023141118. https://doi.org/10.1073/pnas.2023141118
Specification arguments:
locus (str): The locus of the receptor chain.
batch_size (int): number of sequences to use in each batch
epochs (int): number of epochs to train the model
deep (bool): whether to use a deep model
include_joint_genes (bool)
n_gen_seqs (int)
custom_model_path (str): path for the custom OLGA model if used
default_model_name (str): name of the default OLGA model if used
seed (int): random seed for the model or None
YAML specification:
definitions:
    ml_methods:
        my_sonnia_model:
            SoNNia:
                batch_size: 1e4
                epochs: 5
                default_model_name: humanTRB
                deep: False
                include_joint_genes: True
                n_gen_seqs: 100
- compute_p_gen(sequence: dict, sequence_type: SequenceType) float [source]¶
- compute_p_gens(sequences, sequence_type: SequenceType) ndarray [source]¶
- generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]¶
- generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool)[source]¶