immuneML.ml_methods.generative_models package

Submodules

immuneML.ml_methods.generative_models.BackgroundSequences module

immuneML.ml_methods.generative_models.ExperimentalImport module

class immuneML.ml_methods.generative_models.ExperimentalImport.ExperimentalImport(dataset: SequenceDataset, original_input_file: Path = None)[source]

Bases: GenerativeModel

Imports existing experimental data so that annotations and simulations can be run on top of it. This model should be used only for LIgO simulation, not with the TrainGenModel instruction.

YAML specification:

definitions:
    ml_methods:
        generative_model:
            type: ExperimentalImport
            import_format: AIRR
            tmp_import_path: ./tmp/
            import_params:
                path: path/to/files/
                region_type: IMGT_CDR3 # what part of the sequence to import
                column_mapping: # column mapping AIRR: immuneML
                    junction: sequence
                    junction_aa: sequence_aa
                    locus: chain
classmethod build_object(**kwargs)[source]
can_compute_p_gens() bool[source]
can_generate_from_skewed_gene_models() bool[source]
compute_p_gen(sequence: dict, sequence_type: SequenceType) float[source]
compute_p_gens(sequences, sequence_type: SequenceType) ndarray[source]
generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]
generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool)[source]
is_same(model) bool[source]

immuneML.ml_methods.generative_models.GenerativeModel module

class immuneML.ml_methods.generative_models.GenerativeModel.GenerativeModel(locus: Chain, name: str = None, region_type: RegionType = None, seed=None)[source]

Bases: object

Generative models are algorithms which can be trained to learn patterns in existing datasets, and then be used to generate new synthetic datasets.

These methods can be used in the TrainGenModel instruction, and previously trained models can be used to generate data using the ApplyGenModel instruction.

DOCS_TITLE = 'Generative models'
OUTPUT_COLUMNS = []
abstract can_compute_p_gens() bool[source]
abstract can_generate_from_skewed_gene_models() bool[source]
abstract compute_p_gen(sequence: dict, sequence_type: SequenceType) float[source]
abstract compute_p_gens(sequences, sequence_type: SequenceType) ndarray[source]
abstract fit(data, path: Path = None)[source]
abstract generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]
abstract generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool) Dataset[source]
abstract is_same(model) bool[source]
abstract classmethod load_model(path: Path)[source]
abstract save_model(path: Path) Path[source]

immuneML.ml_methods.generative_models.InternalOlgaModel module

class immuneML.ml_methods.generative_models.InternalOlgaModel.InternalOlgaModel(sequence_gen_model: olga.sequence_generation.SequenceGenerationVDJ | olga.sequence_generation.SequenceGenerationVJ = None, v_gene_mapping: list = None, j_gene_mapping: list = None, genomic_data: olga.load_model.GenomicData = None, olga_gen_model: olga.load_model.GenerativeModelVDJ | olga.load_model.GenerativeModelVJ = None)[source]

Bases: object

genomic_data: olga.load_model.GenomicData = None
j_gene_mapping: list = None
olga_gen_model: olga.load_model.GenerativeModelVDJ | olga.load_model.GenerativeModelVJ = None
sequence_gen_model: olga.sequence_generation.SequenceGenerationVDJ | olga.sequence_generation.SequenceGenerationVJ = None
v_gene_mapping: list = None

immuneML.ml_methods.generative_models.KLEvaluator module

immuneML.ml_methods.generative_models.KLEvaluator.KL(sequences, model_1, model_2)[source]

Computes the KL divergence between two models (model_1 and model_2) for a given set of sequences.

Parameters:
  • sequences – list of sequences

  • model_1 – model 1

  • model_2 – model 2

Returns:

KL divergence value
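
As an illustration of what a function with this signature might compute, here is a minimal sketch of a Monte Carlo KL estimate. It assumes that both models expose a `log_prob` method and that `sequences` were drawn from `model_1`; neither assumption is stated on this page. `ToyModel` and `kl_estimate` are hypothetical names, not part of immuneML.

```python
import numpy as np


class ToyModel:
    """Hypothetical stand-in that exposes log_prob; not an immuneML class."""

    def __init__(self, probs: dict):
        self.probs = probs

    def log_prob(self, sequences):
        # Log-probability of each sequence under a fixed lookup table.
        return np.array([np.log(self.probs[s]) for s in sequences])


def kl_estimate(sequences, model_1, model_2) -> float:
    """Estimate KL(model_1 || model_2) as the mean log-ratio over the sequences.

    Assumes the sequences are samples from model_1 (an assumption of this sketch).
    """
    log_ratio = model_1.log_prob(sequences) - model_2.log_prob(sequences)
    return float(np.mean(log_ratio))


p = ToyModel({"CASS": 0.5, "CAST": 0.5})
q = ToyModel({"CASS": 0.25, "CAST": 0.75})
sample = ["CASS", "CAST", "CASS", "CAST"]
kl_value = kl_estimate(sample, p, q)
```

With identical models the estimate is exactly zero, which is a quick sanity check for any implementation of this kind.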

class immuneML.ml_methods.generative_models.KLEvaluator.KLEvaluator(true_sequences, simulated_sequences, estimator, n_sequences)[source]

Bases: object

get_plot(indices, kmers, scores, weights)[source]
get_worst_simulated_sequences(n=20)[source]
get_worst_true_sequences(n=20)[source]
original_plot()[source]
simulated_kl()[source]
simulated_kl_weights()[source]
simulated_plot()[source]
true_kl()[source]
true_kl_weights()[source]
immuneML.ml_methods.generative_models.KLEvaluator.evaluate_similarities(true_sequences, simulated_sequences, estimator)[source]
immuneML.ml_methods.generative_models.KLEvaluator.get_kl_weights(model_1, model_2, sequences)[source]

immuneML.ml_methods.generative_models.MultinomialKmerModel module

class immuneML.ml_methods.generative_models.MultinomialKmerModel.EmpiricalLengthDistribution(lengths_frequencies: ndarray)[source]

Bases: object

log_prob(lengths: ndarray) ndarray[source]
sample(count: int) ndarray[source]
class immuneML.ml_methods.generative_models.MultinomialKmerModel.KmerDistribution(*args, **kwargs)[source]

Bases: Protocol

log_prob(kmers) ndarray[source]
sample(count: int) EncodedRaggedArray[source]
class immuneML.ml_methods.generative_models.MultinomialKmerModel.KmerModel(kmer_probs: EncodedLookup)[source]

Bases: object

log_prob(kmers: EncodedRaggedArray) ndarray[source]
sample(count: int) EncodedRaggedArray[source]
class immuneML.ml_methods.generative_models.MultinomialKmerModel.MultinomialKmerModel(kmer_probs: EncodedLookup, sequence_length: int)[source]

Bases: object

log_prob(kmers: EncodedRaggedArray) ndarray[source]
sample(count: int) EncodedRaggedArray[source]
class immuneML.ml_methods.generative_models.MultinomialKmerModel.Poisson(mu: float)[source]

Bases: object

log_prob(x)[source]
mu: float
sample(n_samples)[source]
class immuneML.ml_methods.generative_models.MultinomialKmerModel.SmoothedLengthDistribution(empirical_distribution, smooth_distribution, p_smooth)[source]

Bases: object

log_prob(lengths: ndarray) ndarray[source]
sample(count: int) ndarray[source]
immuneML.ml_methods.generative_models.MultinomialKmerModel.estimate_kmer_model(kmers: EncodedRaggedArray, prior_count=1) MultinomialKmerModel[source]
immuneML.ml_methods.generative_models.MultinomialKmerModel.estimate_length_distribution(lengths: ndarray) EmpiricalLengthDistribution[source]
immuneML.ml_methods.generative_models.MultinomialKmerModel.estimate_smoothed_length_distribution(lengths: ndarray, prior_count=1) SmoothedLengthDistribution[source]
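
The `prior_count` argument above suggests additive (pseudocount) smoothing. The sketch below shows that general technique with plain Python strings and dictionaries; it is not the immuneML implementation, which operates on `EncodedRaggedArray` and `EncodedLookup` objects. The names `estimate_kmer_probs` and `kmer_log_prob`, the `k` and `alphabet` parameters, and the nucleotide alphabet are assumptions made for illustration.

```python
from collections import Counter
from itertools import product
import math


def estimate_kmer_probs(sequences, k=3, alphabet="ACGT", prior_count=1):
    """Estimate k-mer probabilities with additive smoothing (hypothetical helper)."""
    # Count every overlapping k-mer in every sequence.
    counts = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    all_kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Each possible k-mer receives prior_count pseudo-observations.
    total = sum(counts[kmer] for kmer in all_kmers) + prior_count * len(all_kmers)
    return {kmer: (counts[kmer] + prior_count) / total for kmer in all_kmers}


def kmer_log_prob(kmers, kmer_probs):
    """Log-probability of a bag of k-mers, treating them as independent draws."""
    return sum(math.log(kmer_probs[kmer]) for kmer in kmers)


probs = estimate_kmer_probs(["ACGT", "ACGA"], k=2)
score = kmer_log_prob(["AC", "CG"], probs)
```

Because every k-mer gets a nonzero pseudocount, unseen k-mers keep a small positive probability and `kmer_log_prob` never takes the logarithm of zero.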

immuneML.ml_methods.generative_models.OLGA module

class immuneML.ml_methods.generative_models.OLGA.OLGA(model_path: Path = None, default_model_name: str = None, locus: Chain = None, region_type: RegionType = RegionType.IMGT_JUNCTION, _olga_model: InternalOlgaModel = None)[source]

Bases: GenerativeModel

This is a wrapper for the OLGA package described by Sethna et al. (2019); OLGA is available on PyPI and on GitHub: https://github.com/statbiophys/OLGA. This model should be used only for LIgO simulation and is not yet supported with the TrainGenModel instruction.

Reference:

Zachary Sethna, Yuval Elhanati, Curtis G Callan, Jr, Aleksandra M Walczak, Thierry Mora, OLGA: fast computation of generation probabilities of B- and T-cell receptor amino acid sequences and motifs, Bioinformatics, Volume 35, Issue 17, 1 September 2019, Pages 2974–2981, https://doi.org/10.1093/bioinformatics/btz035

Note:

  • OLGA generates sequences that correspond to the IMGT junction, and they are matched as such. See https://github.com/statbiophys/OLGA for more details.

  • Gene names are as provided in OLGA (either in default models or in the user-specified model files). For simulation, one should use gene names in the same format.

Note

While this is a generative model, in the current version of immuneML it cannot be used in combination with the TrainGenModel or ApplyGenModel instructions. If you want to use OLGA for sequence simulation, see Dataset simulation with LIgO.

Specification arguments:

  • model_path (str): if a default model is not used, this parameter should point to the folder containing the four OLGA/IGoR-format files (the model could also be inferred from experimental data)

  • default_model_name (str): if no custom model is used, one of the OLGA default models can be specified here; the value should match what would be passed to OLGA on the command line, e.g., humanTRB, humanIGH

YAML specification:

definitions:
    ml_methods:
        generative_model:
            type: OLGA
            model_path: None
            default_model_name: humanTRB
DEFAULT_MODEL_FOLDER_MAP = {'humanIGH': 'human_B_heavy', 'humanIGK': 'human_B_kappa', 'humanIGL': 'human_B_lambda', 'humanTRA': 'human_T_alpha', 'humanTRB': 'human_T_beta', 'mouseTRA': 'mouse_T_alpha', 'mouseTRB': 'mouse_T_beta'}
MODEL_FILENAMES = {'j_gene_anchor': 'J_gene_CDR3_anchors.csv', 'marginals': 'model_marginals.txt', 'params': 'model_params.txt', 'v_gene_anchor': 'V_gene_CDR3_anchors.csv'}
OUTPUT_COLUMNS = ['sequence', 'sequence_aa', 'v_call', 'j_call', 'region_type', 'frame_type', 'p_gen', 'from_default_model', 'duplicate_count', 'locus']
classmethod build_object(**kwargs)[source]
can_compute_p_gens() bool[source]
can_generate_from_skewed_gene_models() bool[source]
compute_p_gen(sequence: dict, sequence_type: SequenceType, sequence_field: str = None) float[source]
compute_p_gens(sequences: BNPDataClass, sequence_type: SequenceType, sequence_field: str = None) list[source]
default_model_name: str = None
fit(data, path: Path = None)[source]
generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]
generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool) Path[source]
is_same(model) bool[source]
property is_vdj
load_internal_model(model_path: Path = None) InternalOlgaModel[source]
classmethod load_model(path: Path)[source]
locus: Chain = None
model_path: Path = None
region_type: RegionType = 'junction'
save_model(path: Path) Path[source]
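
To show how the two class constants above fit together, the following sketch resolves the four model files for a default model name. The constants are copied from this page; the helper `resolve_model_files` and the `default_models` base folder are hypothetical, since the location of the default models on disk is not given here.

```python
from pathlib import Path

# Copied from the OLGA class attributes listed above.
DEFAULT_MODEL_FOLDER_MAP = {
    "humanIGH": "human_B_heavy", "humanIGK": "human_B_kappa",
    "humanIGL": "human_B_lambda", "humanTRA": "human_T_alpha",
    "humanTRB": "human_T_beta", "mouseTRA": "mouse_T_alpha",
    "mouseTRB": "mouse_T_beta",
}
MODEL_FILENAMES = {
    "j_gene_anchor": "J_gene_CDR3_anchors.csv",
    "marginals": "model_marginals.txt",
    "params": "model_params.txt",
    "v_gene_anchor": "V_gene_CDR3_anchors.csv",
}


def resolve_model_files(default_model_name: str, base_dir: Path) -> dict:
    """Map each file role to its path for a default model (hypothetical helper)."""
    if default_model_name not in DEFAULT_MODEL_FOLDER_MAP:
        valid = ", ".join(sorted(DEFAULT_MODEL_FOLDER_MAP))
        raise ValueError(f"Unknown model '{default_model_name}'; expected one of: {valid}")
    folder = base_dir / DEFAULT_MODEL_FOLDER_MAP[default_model_name]
    return {role: folder / filename for role, filename in MODEL_FILENAMES.items()}


# "default_models" is a placeholder base folder, not a documented location.
files = resolve_model_files("humanTRB", Path("default_models"))
```

A name with a stray space, such as `human IGH`, is rejected by the lookup, so the value must match a key of `DEFAULT_MODEL_FOLDER_MAP` exactly.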

immuneML.ml_methods.generative_models.PWM module

class immuneML.ml_methods.generative_models.PWM.PWM(locus, sequence_type: str, region_type: str, name: str = None)[source]

Bases: GenerativeModel

This is a baseline implementation of a positional weight matrix. It is estimated from a set of sequences for each of the different lengths that appear in the dataset.

Specification arguments:

  • locus (str): which chain is generated (for now, it is only assigned to the generated sequences)

  • sequence_type (str): amino_acid or nucleotide

  • region_type (str): which region type to use (e.g., IMGT_CDR3); this is only assigned to the generated sequences

YAML specification:

definitions:
    ml_methods:
        my_pwm:
            PWM:
                locus: beta
                sequence_type: amino_acid
                region_type: IMGT_CDR3
can_compute_p_gens() bool[source]
can_generate_from_skewed_gene_models() bool[source]
compute_p_gen(sequence: dict, sequence_type: SequenceType) float[source]
compute_p_gens(sequences, sequence_type: SequenceType) ndarray[source]
fit(data: SequenceDataset, path: Path = None)[source]
generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]
generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool)[source]
is_same(model) bool[source]
classmethod load_model(path: Path)[source]
save_model(path: Path) Path[source]
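
The idea described for PWM, one positional weight matrix per sequence length, can be sketched as follows. This is a simplified illustration using NumPy and plain strings, not the immuneML implementation; `fit_pwms`, `sample_sequence`, the `pseudocount` parameter, and the amino acid ordering are assumptions of the sketch.

```python
from collections import defaultdict

import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids (ordering is arbitrary)


def fit_pwms(sequences, pseudocount=0.0):
    """Return {length: matrix of shape (length, 20)} with per-position frequencies."""
    by_length = defaultdict(list)
    for seq in sequences:
        by_length[len(seq)].append(seq)

    index = {aa: i for i, aa in enumerate(ALPHABET)}
    pwms = {}
    for length, seqs in by_length.items():
        counts = np.full((length, len(ALPHABET)), pseudocount, dtype=float)
        for seq in seqs:
            for pos, aa in enumerate(seq):
                counts[pos, index[aa]] += 1
        # Normalize each position (row) into a probability distribution.
        pwms[length] = counts / counts.sum(axis=1, keepdims=True)
    return pwms


def sample_sequence(pwm, rng):
    """Draw one residue per position, independently, from the matrix rows."""
    return "".join(ALPHABET[rng.choice(len(ALPHABET), p=row)] for row in pwm)


pwms = fit_pwms(["CASSL", "CASSF", "CAT"])
rng = np.random.default_rng(0)
generated = sample_sequence(pwms[5], rng)
```

Keeping a separate matrix per length avoids aligning sequences of different lengths, at the cost of fewer training examples per matrix.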

immuneML.ml_methods.generative_models.SequenceTransitionDistribution module

class immuneML.ml_methods.generative_models.SequenceTransitionDistribution.EncodedLookup(lookup: ndarray, encoding: Encoding)[source]

Bases: NDArrayOperatorsMixin

property alphabet_size
property encoding
raw()[source]
class immuneML.ml_methods.generative_models.SequenceTransitionDistribution.SequenceTransitionDistribution(transition_matrix: immuneML.ml_methods.generative_models.SequenceTransitionDistribution.EncodedLookup, initial_distribution: immuneML.ml_methods.generative_models.SequenceTransitionDistribution.EncodedLookup, end_probs: immuneML.ml_methods.generative_models.SequenceTransitionDistribution.EncodedLookup)[source]

Bases: object

end_probs: EncodedLookup
classmethod from_probabilities(*args, **kwargs)[source]
initial_distribution: EncodedLookup
classmethod load(filename)[source]
log_prob(sequence)[source]
sample(n_samples)[source]
save(filename)[source]
transition_matrix: EncodedLookup
immuneML.ml_methods.generative_models.SequenceTransitionDistribution.estimate_transition_model(sequences, weights=None)[source]
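
The three fields of SequenceTransitionDistribution (an initial distribution, a transition matrix, and end probabilities) resemble a first-order Markov chain with a stopping rule. The sketch below shows one way such a model could be estimated and scored. How immuneML actually combines the three components is not described on this page, so the parametrization, the `pseudocount` argument, the nucleotide alphabet, and the function names are all assumptions. Sequences are assumed to be non-empty.

```python
import numpy as np

ALPHABET = "ACGT"
INDEX = {c: i for i, c in enumerate(ALPHABET)}


def fit_toy_transition_model(sequences, pseudocount=1.0):
    """Estimate (initial, transitions, end_probs) from non-empty sequences."""
    n = len(ALPHABET)
    initial = np.full(n, pseudocount)
    transitions = np.full((n, n), pseudocount)
    ends = np.full(n, pseudocount)
    visits = np.full(n, 2 * pseudocount)  # keeps end probabilities strictly in (0, 1)
    for seq in sequences:
        idx = [INDEX[c] for c in seq]
        initial[idx[0]] += 1
        for a, b in zip(idx, idx[1:]):
            transitions[a, b] += 1
        for a in idx:
            visits[a] += 1
        ends[idx[-1]] += 1
    return (
        initial / initial.sum(),
        transitions / transitions.sum(axis=1, keepdims=True),
        ends / visits,
    )


def toy_log_prob(seq, initial, transitions, end_probs):
    """log P(seq): start symbol, then (continue, transition) steps, then stop."""
    idx = [INDEX[c] for c in seq]
    lp = np.log(initial[idx[0]])
    for a, b in zip(idx, idx[1:]):
        lp += np.log(1.0 - end_probs[a]) + np.log(transitions[a, b])
    return float(lp + np.log(end_probs[idx[-1]]))


initial, transitions, end_probs = fit_toy_transition_model(["ACG", "ACG", "ACT"])
seen = toy_log_prob("ACG", initial, transitions, end_probs)
unseen = toy_log_prob("GCA", initial, transitions, end_probs)
```

In this toy parametrization a sequence that resembles the training data (`ACG`) scores higher than its reverse (`GCA`), which never appears in the training set.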

immuneML.ml_methods.generative_models.SimpleLSTM module

class immuneML.ml_methods.generative_models.SimpleLSTM.SimpleLSTM(locus: str, sequence_type: str, hidden_size: int, learning_rate: float, num_epochs: int, batch_size: int, num_layers: int, embed_size: int, temperature, device: str, name=None, region_type: str = 'IMGT_CDR3', prime_str: str = 'C', window_size: int = 64, seed: int = None, iter_to_report: int = 1)[source]

Bases: GenerativeModel

This is a simple generative model for receptor sequences based on LSTM.

Similar models have been proposed in:

Akbar, R. et al. (2022). In silico proof of principle of machine learning-based antibody design at unconstrained scale. mAbs, 14(1), 2031482. https://doi.org/10.1080/19420862.2022.2031482

Saka, K. et al. (2021). Antibody design using LSTM based deep generative model from phage display library for affinity maturation. Scientific Reports, 11(1), Article 1. https://doi.org/10.1038/s41598-021-85274-7

Specification arguments:

  • sequence_type (str): whether the model should work at the amino_acid or nucleotide level

  • hidden_size (int): how many LSTM cells should exist per layer

  • num_layers (int): how many hidden LSTM layers to use

  • num_epochs (int): for how many epochs to train the model

  • learning_rate (float): what learning rate to use for optimization

  • batch_size (int): how many examples (sequences) to use per training batch

  • embed_size (int): the dimension of the sequence embedding

  • temperature (float): a higher temperature leads to faster yet more unstable learning

  • prime_str (str): the initial sequence to start generating from

  • seed (int): random seed for the model or None

  • iter_to_report (int): number of epochs between training progress reports

YAML specification:

definitions:
    ml_methods:
        my_simple_lstm:
            SimpleLSTM:
                sequence_type: amino_acid
                hidden_size: 50
                num_layers: 1
                num_epochs: 5000
                learning_rate: 0.001
                batch_size: 100
                embed_size: 100
can_compute_p_gens() bool[source]
can_generate_from_skewed_gene_models() bool[source]
compute_p_gen(sequence: dict, sequence_type: SequenceType) float[source]
compute_p_gens(sequences, sequence_type: SequenceType) ndarray[source]
fit(data, path: Path = None)[source]
generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]
generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool, max_failed_trials: int = 1000)[source]
is_same(model) bool[source]
classmethod load_model(path: Path)[source]
make_new_model(state_dict_file: Path = None)[source]
save_model(path: Path) Path[source]

immuneML.ml_methods.generative_models.SimpleVAE module

immuneML.ml_methods.generative_models.SoNNia module

class immuneML.ml_methods.generative_models.SoNNia.SoNNia(locus=None, batch_size: int = None, epochs: int = None, deep: bool = False, name: str = None, default_model_name: str = None, n_gen_seqs: int = None, include_joint_genes: bool = True, custom_model_path: str = None, seed: int = None)[source]

Bases: GenerativeModel

SoNNia models the selection process of T and B cell receptor repertoires. It is based on the SoNNia Python package. It supports SequenceDataset as input, but not RepertoireDataset.

Original publication: Isacchini, G., Walczak, A. M., Mora, T., & Nourmohammad, A. (2021). Deep generative selection models of T and B cell receptor repertoires with soNNia. Proceedings of the National Academy of Sciences, 118(14), e2023141118. https://doi.org/10.1073/pnas.2023141118

Specification arguments:

  • locus (str): The locus of the receptor chain.

  • batch_size (int): number of sequences to use in each batch

  • epochs (int): number of epochs to train the model

  • deep (bool): whether to use a deep model

  • include_joint_genes (bool)

  • n_gen_seqs (int)

  • custom_model_path (str): path for the custom OLGA model if used

  • default_model_name (str): name of the default OLGA model if used

  • seed (int): random seed for the model or None

YAML specification:

definitions:
    ml_methods:
        my_sonnia_model:
            SoNNia:
                batch_size: 10000
                epochs: 5
                default_model_name: humanTRB
                deep: False
                include_joint_genes: True
                n_gen_seqs: 100
can_compute_p_gens() bool[source]
can_generate_from_skewed_gene_models() bool[source]
compute_p_gen(sequence: dict, sequence_type: SequenceType) float[source]
compute_p_gens(sequences, sequence_type: SequenceType) ndarray[source]
fit(dataset: Dataset, path: Path = None)[source]
generate_from_skewed_gene_models(v_genes: list, j_genes: list, seed: int, path: Path, sequence_type: SequenceType, batch_size: int, compute_p_gen: bool)[source]
generate_sequences(count: int, seed: int, path: Path, sequence_type: SequenceType, compute_p_gen: bool)[source]
is_same(model) bool[source]
classmethod load_model(path: Path)[source]
save_model(path: Path) Path[source]

Module contents