immuneML.ml_methods.generative_models package

Submodules

immuneML.ml_methods.generative_models.BackgroundSequences module

immuneML.ml_methods.generative_models.ExperimentalImport module

class immuneML.ml_methods.generative_models.ExperimentalImport.ExperimentalImport(dataset: SequenceDataset, original_input_file: Path = None)[source]

Bases: GenerativeModel

Allows importing existing experimental data and running annotations and simulations on top of it. This model should be used only for LIgO simulation and not with the TrainGenModel instruction.

YAML specification:

definitions:
    ml_methods:
        generative_model:
            type: ExperimentalImport
            import_format: AIRR
            tmp_import_path: ./tmp/
            import_params:
                path: path/to/files/
                region_type: IMGT_CDR3 # what part of the sequence to import
                column_mapping: # column mapping AIRR: immuneML
                    junction: sequence
                    junction_aa: sequence_aa
                    locus: chain
classmethod build_object(**kwargs)[source]
can_compute_p_gens() bool[source]
can_generate_from_skewed_gene_models() bool[source]
compute_p_gen(sequence: dict, sequence_type: SequenceType) float[source]
compute_p_gens(sequences, sequence_type: SequenceType) ndarray[source]
generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]
generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool)[source]
is_same(model) bool[source]
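
In a YAML workflow the model is configured as shown above; for a quick programmatic check, the documented build_object classmethod can be used instead. A minimal sketch, assuming build_object accepts the same keys as the YAML specification (the paths are placeholders):

from immuneML.ml_methods.generative_models.ExperimentalImport import ExperimentalImport

# hypothetical programmatic construction mirroring the YAML specification above;
# passing these keys directly to build_object is an assumption
model = ExperimentalImport.build_object(
    import_format="AIRR",
    tmp_import_path="./tmp/",
    import_params={
        "path": "path/to/files/",        # placeholder path to the AIRR files
        "region_type": "IMGT_CDR3",      # what part of the sequence to import
        "column_mapping": {              # column mapping AIRR: immuneML
            "junction": "sequence",
            "junction_aa": "sequence_aa",
            "locus": "chain",
        },
    },
)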

immuneML.ml_methods.generative_models.GenerativeModel module

class immuneML.ml_methods.generative_models.GenerativeModel.GenerativeModel(locus, name: str = None, region_type: RegionType = None)[source]

Bases: object

Generative models are algorithms which can be trained to learn patterns in existing datasets, and then be used to generate new synthetic datasets.

These methods can be used in the TrainGenModel instruction, and previously trained models can be used to generate data using the ApplyGenModel instruction.

DOCS_TITLE = 'Generative models'
OUTPUT_COLUMNS = []
abstract can_compute_p_gens() bool[source]
abstract can_generate_from_skewed_gene_models() bool[source]
abstract compute_p_gen(sequence: dict, sequence_type: SequenceType) float[source]
abstract compute_p_gens(sequences, sequence_type: SequenceType) ndarray[source]
abstract fit(data, path: Path = None)[source]
abstract generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]
abstract generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool) Dataset[source]
abstract is_same(model) bool[source]
abstract classmethod load_model(path: Path)[source]
abstract save_model(path: Path) Path[source]
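
The abstract methods above define the contract that every generative model in immuneML fulfils. Below is a minimal sketch of a hypothetical subclass with placeholder bodies, only to illustrate which methods a new model has to override (the class is not part of immuneML, and the SequenceType import path is assumed):

from pathlib import Path

import numpy as np

from immuneML.environment.SequenceType import SequenceType  # import path assumed
from immuneML.ml_methods.generative_models.GenerativeModel import GenerativeModel


class MyToyGenModel(GenerativeModel):
    """Hypothetical example subclass; the bodies are placeholders, not a working model."""

    def fit(self, data, path: Path = None):
        pass  # learn model parameters from `data`

    def generate_sequences(self, count: int, seed: int, path: Path,
                           sequence_type: SequenceType, compute_p_gen: bool):
        pass  # build and return a Dataset with `count` generated sequences

    def can_compute_p_gens(self) -> bool:
        return False  # this toy model cannot compute generation probabilities

    def compute_p_gen(self, sequence: dict, sequence_type: SequenceType) -> float:
        raise NotImplementedError

    def compute_p_gens(self, sequences, sequence_type: SequenceType) -> np.ndarray:
        raise NotImplementedError

    def can_generate_from_skewed_gene_models(self) -> bool:
        return False

    def generate_from_skewed_gene_models(self, v_genes: list, j_genes: list, seed: int,
                                         path: Path, sequence_type: SequenceType,
                                         batch_size: int, compute_p_gen: bool):
        raise NotImplementedError

    def is_same(self, model) -> bool:
        return type(model) is type(self)

    def save_model(self, path: Path) -> Path:
        return path  # persist the model parameters under `path` and return the location

    @classmethod
    def load_model(cls, path: Path):
        return cls(locus=None)  # restore a previously saved model from `path`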

immuneML.ml_methods.generative_models.InternalOlgaModel module

class immuneML.ml_methods.generative_models.InternalOlgaModel.InternalOlgaModel(sequence_gen_model: olga.sequence_generation.SequenceGenerationVDJ | olga.sequence_generation.SequenceGenerationVJ = None, v_gene_mapping: list = None, j_gene_mapping: list = None, genomic_data: olga.load_model.GenomicData = None, olga_gen_model: olga.load_model.GenerativeModelVDJ | olga.load_model.GenerativeModelVJ = None)[source]

Bases: object

genomic_data: GenomicData = None
j_gene_mapping: list = None
olga_gen_model: GenerativeModelVDJ | GenerativeModelVJ = None
sequence_gen_model: SequenceGenerationVDJ | SequenceGenerationVJ = None
v_gene_mapping: list = None

immuneML.ml_methods.generative_models.OLGA module

class immuneML.ml_methods.generative_models.OLGA.OLGA(model_path: Path = None, default_model_name: str = None, locus: Chain = None, region_type: RegionType = RegionType.IMGT_JUNCTION, _olga_model: InternalOlgaModel = None)[source]

Bases: GenerativeModel

This is a wrapper for the OLGA package described by Sethna et al. 2019 (OLGA package on PyPI or GitHub: https://github.com/statbiophys/OLGA). This model should be used only for LIgO simulation and is not yet supported for use with the TrainGenModel instruction.

Reference:

Zachary Sethna, Yuval Elhanati, Curtis G Callan, Jr, Aleksandra M Walczak, Thierry Mora, OLGA: fast computation of generation probabilities of B- and T-cell receptor amino acid sequences and motifs, Bioinformatics, Volume 35, Issue 17, 1 September 2019, Pages 2974–2981, https://doi.org/10.1093/bioinformatics/btz035

Note:

  • OLGA generates sequences that correspond to the IMGT junction, and they are used for matching as such. See https://github.com/statbiophys/OLGA for more details.

  • Gene names are as provided in OLGA (either in default models or in the user-specified model files). For simulation, one should use gene names in the same format.

Note

While this is a generative model, in the current version of immuneML it cannot be used in combination with the TrainGenModel or ApplyGenModel instructions. If you want to use OLGA for sequence simulation, see Dataset simulation with LIgO.


Specification arguments:

  • model_path (str): if not using a default model, this parameter should point to a folder where the four OLGA/IGoR format files are stored (these could also be inferred from experimental data)

  • default_model_name (str): if not using a custom model, one of the OLGA default models can be specified here; the value should be the same as would be passed to the OLGA command line, e.g., humanTRB, humanIGH

YAML specification:

definitions:
    ml_methods:
        generative_model:
            type: OLGA
            model_path: None
            default_model_name: humanTRB
DEFAULT_MODEL_FOLDER_MAP = {'humanIGH': 'human_B_heavy', 'humanIGK': 'human_B_kappa', 'humanIGL': 'human_B_lambda', 'humanTRA': 'human_T_alpha', 'humanTRB': 'human_T_beta', 'mouseTRA': 'mouse_T_alpha', 'mouseTRB': 'mouse_T_beta'}
MODEL_FILENAMES = {'j_gene_anchor': 'J_gene_CDR3_anchors.csv', 'marginals': 'model_marginals.txt', 'params': 'model_params.txt', 'v_gene_anchor': 'V_gene_CDR3_anchors.csv'}
OUTPUT_COLUMNS = ['sequence', 'sequence_aa', 'v_call', 'j_call', 'region_type', 'frame_type', 'p_gen', 'from_default_model', 'duplicate_count', 'locus']
classmethod build_object(**kwargs)[source]
can_compute_p_gens() bool[source]
can_generate_from_skewed_gene_models() bool[source]
compute_p_gen(sequence: dict, sequence_type: SequenceType, sequence_field: str = None) float[source]
compute_p_gens(sequences: BNPDataClass, sequence_type: SequenceType, sequence_field: str = None) list[source]
default_model_name: str = None
fit(data, path: Path = None)[source]
generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]
generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool) Path[source]
is_same(model) bool[source]
property is_vdj
load_internal_model(model_path: Path = None) InternalOlgaModel[source]
classmethod load_model(path: Path)[source]
locus: Chain = None
model_path: Path = None
region_type: RegionType = 'junction'
save_model(path: Path) Path[source]
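
Although OLGA is intended for LIgO simulation rather than the TrainGenModel/ApplyGenModel instructions, the documented methods can be exercised directly for small checks. A minimal sketch, assuming build_object accepts the same keys as the YAML specification (default_model_name, model_path) and that SequenceType is importable as shown:

from pathlib import Path

from immuneML.ml_methods.generative_models.OLGA import OLGA
from immuneML.environment.SequenceType import SequenceType  # import path assumed

# build from one of the default models listed in DEFAULT_MODEL_FOLDER_MAP; passing the
# YAML keys (default_model_name, model_path) directly to build_object is an assumption
olga_model = OLGA.build_object(default_model_name="humanTRB", model_path=None)

# generate 100 amino acid sequences; per the signature above, generate_sequences
# writes the result under `path` and returns a Path
output_path = olga_model.generate_sequences(count=100, seed=1,
                                            path=Path("./olga_output/"),
                                            sequence_type=SequenceType.AMINO_ACID,
                                            compute_p_gen=True)

# OLGA supports generation probability computation; the dictionary keys below are an
# assumption based on the OUTPUT_COLUMNS listed above
if olga_model.can_compute_p_gens():
    p_gen = olga_model.compute_p_gen({"sequence_aa": "CASSLGTDTQYF",
                                      "v_call": "TRBV7-9", "j_call": "TRBJ2-3"},
                                     SequenceType.AMINO_ACID)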

immuneML.ml_methods.generative_models.PWM module

class immuneML.ml_methods.generative_models.PWM.PWM(locus, sequence_type: str, region_type: str, name: str = None)[source]

Bases: GenerativeModel

This is a baseline implementation of a positional weight matrix (PWM). A separate PWM is estimated from the set of sequences for each sequence length that appears in the dataset.

Specification arguments:

  • locus (str): which chain is generated (for now, it is only assigned to the generated sequences)

  • sequence_type (str): amino_acid or nucleotide

  • region_type (str): which region type to use (e.g., IMGT_CDR3); this is only assigned to the generated sequences

YAML specification:

definitions:
    ml_methods:
        my_pwm:
            PWM:
                locus: beta
                sequence_type: amino_acid
                region_type: IMGT_CDR3
can_compute_p_gens() bool[source]
can_generate_from_skewed_gene_models() bool[source]
compute_p_gen(sequence: dict, sequence_type: SequenceType) float[source]
compute_p_gens(sequences, sequence_type: SequenceType) ndarray[source]
fit(data: SequenceDataset, path: Path = None)[source]
generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]
generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool)[source]
is_same(model) bool[source]
classmethod load_model(path: Path)[source]
save_model(path: Path) Path[source]
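
A minimal sketch of programmatic use, fitting the PWM on an existing SequenceDataset and sampling new sequences. The dataset itself is a placeholder prepared elsewhere, and the SequenceType import path is an assumption:

from pathlib import Path

from immuneML.ml_methods.generative_models.PWM import PWM
from immuneML.environment.SequenceType import SequenceType  # import path assumed

# constructor arguments follow the documented signature: locus, sequence_type, region_type, name
pwm = PWM(locus="beta", sequence_type="amino_acid", region_type="IMGT_CDR3", name="my_pwm")

dataset = ...  # placeholder: a SequenceDataset prepared elsewhere (e.g., through data import)

# one weight matrix is estimated per sequence length present in the dataset
pwm.fit(data=dataset, path=Path("./pwm_model/"))

# sample new sequences from the estimated matrices
generated = pwm.generate_sequences(count=500, seed=42, path=Path("./pwm_generated/"),
                                   sequence_type=SequenceType.AMINO_ACID,
                                   compute_p_gen=False)

pwm.save_model(Path("./pwm_model/"))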

immuneML.ml_methods.generative_models.SimpleLSTM module

class immuneML.ml_methods.generative_models.SimpleLSTM.SimpleLSTM(locus: str, sequence_type: str, hidden_size: int, learning_rate: float, num_epochs: int, batch_size: int, num_layers: int, embed_size: int, temperature, device: str, name=None, region_type: str = RegionType.IMGT_CDR3)[source]

Bases: GenerativeModel

This is a simple generative model for receptor sequences based on LSTM.

Similar models have been proposed in:

Akbar, R. et al. (2022). In silico proof of principle of machine learning-based antibody design at unconstrained scale. mAbs, 14(1), 2031482. https://doi.org/10.1080/19420862.2022.2031482

Saka, K. et al. (2021). Antibody design using LSTM based deep generative model from phage display library for affinity maturation. Scientific Reports, 11(1), Article 1. https://doi.org/10.1038/s41598-021-85274-7

Specification arguments:

  • sequence_type (str): whether the model should work at the amino_acid or nucleotide level

  • hidden_size (int): how many LSTM cells should exist per layer

  • num_layers (int): how many hidden LSTM layers there should be

  • num_epochs (int): for how many epochs to train the model

  • learning_rate (float): what learning rate to use for optimization

  • batch_size (int): how many examples (sequences) to use in one training batch

  • embed_size (int): the dimension of the sequence embedding

  • temperature (float): a higher temperature leads to faster yet more unstable learning

YAML specification:

definitions:
    ml_methods:
        my_simple_lstm:
            SimpleLSTM:
                sequence_type: amino_acid
                hidden_size: 50
                num_layers: 1
                num_epochs: 5000
                learning_rate: 0.001
                batch_size: 100
                embed_size: 100
ITER_TO_REPORT = 100
can_compute_p_gens() bool[source]
can_generate_from_skewed_gene_models() bool[source]
compute_p_gen(sequence: dict, sequence_type: SequenceType) float[source]
compute_p_gens(sequences, sequence_type: SequenceType) ndarray[source]
fit(data, path: Path = None)[source]
generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]
generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool)[source]
is_same(model) bool[source]
classmethod load_model(path: Path)[source]
make_new_model(state_dict_file: Path = None)[source]
save_model(path: Path) Path[source]
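
A minimal sketch instantiating the model with the documented constructor arguments, training it, and generating new sequences. The dataset is a placeholder; locus, temperature and device are required by the constructor but not shown in the YAML example above, so their values here are assumptions, as is the SequenceType import path:

from pathlib import Path

from immuneML.ml_methods.generative_models.SimpleLSTM import SimpleLSTM
from immuneML.environment.SequenceType import SequenceType  # import path assumed

# hyperparameter values mirror the YAML example above; locus, temperature and device
# are assumed example values
lstm = SimpleLSTM(locus="beta", sequence_type="amino_acid", hidden_size=50,
                  learning_rate=0.001, num_epochs=5000, batch_size=100,
                  num_layers=1, embed_size=100, temperature=1.0, device="cpu",
                  name="my_simple_lstm")

dataset = ...  # placeholder: a SequenceDataset prepared elsewhere

lstm.fit(data=dataset, path=Path("./lstm_model/"))

generated = lstm.generate_sequences(count=1000, seed=1, path=Path("./lstm_generated/"),
                                    sequence_type=SequenceType.AMINO_ACID,
                                    compute_p_gen=False)

# models can be persisted and restored with the documented save_model/load_model pair
lstm.save_model(Path("./lstm_model/"))
reloaded = SimpleLSTM.load_model(Path("./lstm_model/"))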

immuneML.ml_methods.generative_models.SimpleVAE module

class immuneML.ml_methods.generative_models.SimpleVAE.PyTorchSequenceDataset(data)[source]

Bases: Dataset

get_j_genes()[source]
get_v_genes()[source]
class immuneML.ml_methods.generative_models.SimpleVAE.SimpleVAE(locus, beta, latent_dim, linear_nodes_count, num_epochs, batch_size, j_gene_embed_dim, pretrains, v_gene_embed_dim, cdr3_embed_dim, warmup_epochs, patience, iter_count_prob_estimation, device, vocab=None, max_cdr3_len=None, unique_v_genes=None, unique_j_genes=None, name: str = None)[source]

Bases: GenerativeModel

SimpleVAE is a sequence-level generative model based on a variational autoencoder. This type of model was proposed by Davidsen et al. 2019, and this implementation is inspired by their original implementation available at https://github.com/matsengrp/vampire.

References:

Davidsen, K., Olson, B. J., DeWitt, W. S., III, Feng, J., Harkins, E., Bradley, P., & Matsen, F. A., IV. (2019). Deep generative models for T cell receptor protein sequences. eLife, 8, e46935. https://doi.org/10.7554/eLife.46935

Specification arguments:

  • locus (str): which locus the sequences come from, e.g., TRB

  • beta (float): VAE hyperparameter that balances the reconstruction loss and the latent-dimension regularization

  • latent_dim (int): latent dimension of the VAE

  • linear_nodes_count (int): in linear layers, how many nodes to use

  • num_epochs (int): how many epochs to use for training

  • batch_size (int): how many examples to consider at the same time

  • j_gene_embed_dim (int): dimension of J gene embedding

  • v_gene_embed_dim (int): dimension of V gene embedding

  • cdr3_embed_dim (int): dimension of the cdr3 embedding

  • pretrains (int): how many times to attempt pretraining to initialize the weights and use warm-up for the beta hyperparameter before the main training process

  • warmup_epochs (int): how many epochs to use for training where beta hyperparameter is linearly increased from 0 up to its max value; this is in addition to num_epochs set above

  • patience (int): number of epochs to wait before the training is stopped when the loss is not improving

  • iter_count_prob_estimation (int): how many iterations to use to estimate the log probability of the generated sequence (the more iterations, the better the estimated log probability)

  • vocab (list): which letters (amino acids) are allowed - this is automatically filled for new models (no need to set)

  • max_cdr3_len (int): what is the maximum cdr3 length - this is automatically filled for new models (no need to set)

  • unique_v_genes (list): list of allowed V genes (this will be automatically filled from the dataset if not provided here manually)

  • unique_j_genes (list): list of allowed J genes (this will be automatically filled from the dataset if not provided here manually)

  • device (str): name of the device where to train the model (e.g., cpu)

YAML specification:

definitions:
    ml_methods:
        my_vae:
            SimpleVAE:
                locus: beta
                beta: 0.75
                latent_dim: 20
                linear_nodes_count: 75
                num_epochs: 5000
                batch_size: 10000
                j_gene_embed_dim: 13
                v_gene_embed_dim: 30
                cdr3_embed_dim: 21
                pretrains: 10
                warmup_epochs: 20
                patience: 20
                device: cpu
can_compute_p_gens() bool[source]
can_generate_from_skewed_gene_models() bool[source]
compute_p_gen(sequence: dict, sequence_type: SequenceType) float[source]
compute_p_gens(sequences, sequence_type: SequenceType) ndarray[source]
encode_dataset(dataset, batch_size=None, shuffle=True)[source]
fit(data, path: Path = None)[source]
generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]
generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool)[source]
is_same(model) bool[source]
classmethod load_model(path: Path)[source]
make_new_model(initial_values_path: Path = None)[source]
save_model(path: Path) Path[source]
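
A minimal sketch constructing the VAE with the values from the YAML example above plus the remaining required constructor arguments; the iter_count_prob_estimation value, the placeholder dataset and the SequenceType import path are assumptions:

from pathlib import Path

from immuneML.ml_methods.generative_models.SimpleVAE import SimpleVAE
from immuneML.environment.SequenceType import SequenceType  # import path assumed

# values mirror the YAML example above; iter_count_prob_estimation is an assumed example value
vae = SimpleVAE(locus="beta", beta=0.75, latent_dim=20, linear_nodes_count=75,
                num_epochs=5000, batch_size=10000, j_gene_embed_dim=13,
                v_gene_embed_dim=30, cdr3_embed_dim=21, pretrains=10,
                warmup_epochs=20, patience=20, iter_count_prob_estimation=100,
                device="cpu", name="my_vae")

dataset = ...  # placeholder: a SequenceDataset with CDR3 sequences and V/J gene calls

# vocab, max_cdr3_len, unique_v_genes and unique_j_genes are filled automatically
# for new models (per the specification arguments above)
vae.fit(data=dataset, path=Path("./vae_model/"))

generated = vae.generate_sequences(count=1000, seed=1, path=Path("./vae_generated/"),
                                   sequence_type=SequenceType.AMINO_ACID,
                                   compute_p_gen=False)

vae.save_model(Path("./vae_model/"))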

immuneML.ml_methods.generative_models.SoNNia module

class immuneML.ml_methods.generative_models.SoNNia.SoNNia(locus=None, batch_size: int = None, epochs: int = None, deep: bool = False, name: str = None, default_model_name: str = None, n_gen_seqs: int = None, include_joint_genes: bool = True, custom_model_path: str = None, region_type: RegionType = RegionType.IMGT_CDR3)[source]

Bases: GenerativeModel

SoNNia models the selection process of T and B cell receptor repertoires. It is based on the SoNNia Python package. It supports SequenceDataset as input, but not RepertoireDataset.

Original publication: Isacchini, G., Walczak, A. M., Mora, T., & Nourmohammad, A. (2021). Deep generative selection models of T and B cell receptor repertoires with soNNia. Proceedings of the National Academy of Sciences, 118(14), e2023141118. https://doi.org/10.1073/pnas.2023141118

Specification arguments:

  • locus (str)

  • batch_size (int)

  • epochs (int)

  • deep (bool)

  • include_joint_genes (bool)

  • n_gen_seqs (int)

  • custom_model_path (str)

  • default_model_name (str)

YAML specification:

definitions:
    ml_methods:
        my_sonnia_model:
            SoNNia:
                ...
can_compute_p_gens() bool[source]
can_generate_from_skewed_gene_models() bool[source]
compute_p_gen(sequence: dict, sequence_type: SequenceType) float[source]
compute_p_gens(sequences, sequence_type: SequenceType) ndarray[source]
fit(dataset: Dataset, path: Path = None)[source]
generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]
generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool)[source]
is_same(model) bool[source]
classmethod load_model(path: Path)[source]
save_model(path: Path) Path[source]
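
A minimal sketch using the documented constructor, where all arguments have defaults; the chosen values, the default_model_name, the placeholder dataset and the SequenceType import path are assumptions. SoNNia supports SequenceDataset input only:

from pathlib import Path

from immuneML.ml_methods.generative_models.SoNNia import SoNNia
from immuneML.environment.SequenceType import SequenceType  # import path assumed

# all constructor arguments have defaults; the values below are illustrative assumptions
sonnia = SoNNia(locus="TRB", batch_size=1000, epochs=30, deep=False,
                default_model_name="humanTRB", n_gen_seqs=10000,
                include_joint_genes=True, name="my_sonnia_model")

dataset = ...  # placeholder: a SequenceDataset (RepertoireDataset is not supported)

sonnia.fit(dataset=dataset, path=Path("./sonnia_model/"))

generated = sonnia.generate_sequences(count=1000, seed=1, path=Path("./sonnia_generated/"),
                                      sequence_type=SequenceType.AMINO_ACID,
                                      compute_p_gen=False)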

Module contents