immuneML.simulation.dataset_generation package

Submodules

immuneML.simulation.dataset_generation.RandomDatasetGenerator module

class immuneML.simulation.dataset_generation.RandomDatasetGenerator.RandomDatasetGenerator

Bases: object

static generate_receptor_dataset(receptor_count: int, chain_1_length_probabilities: dict, chain_2_length_probabilities: dict, labels: dict, path: Path, name='receptor_dataset')

Creates receptor_count receptors. The length of the sequence in each chain is sampled independently for each sequence from the corresponding distribution (chain_1_length_probabilities or chain_2_length_probabilities). Labels are also assigned to receptors at random, according to the distributions given in labels. Labels are multi-class, so each receptor gets exactly one class from each label; negative classes must therefore be included in the specification as well. Here, chain 1 and chain 2 refer to the alpha and beta chains of a T-cell receptor.

An example of input parameters is given below:

    receptor_count: 100 # generate 100 TRABReceptors
    chain_1_length_probabilities:
        14: 0.8 # 80% of all generated sequences for all receptors (for chain 1) will have length 14
        15: 0.2 # 20% of all generated sequences across all receptors (for chain 1) will have length 15
    chain_2_length_probabilities:
        14: 0.8 # 80% of all generated sequences for all receptors (for chain 2) will have length 14
        15: 0.2 # 20% of all generated sequences across all receptors (for chain 2) will have length 15
    labels:
        epitope1: # label name
            True: 0.5 # 50% of the receptors will have class True
            False: 0.5 # 50% of the receptors will have class False
        epitope2: # next label with classes that will be assigned to receptors independently of the previous label or other parameters
            1: 0.3 # 30% of the generated receptors will have class 1
            0: 0.7 # 70% of the generated receptors will have class 0
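The sampling described above can be illustrated with a minimal standard-library sketch. This is not immuneML's implementation: the helper names sample_from and sketch_receptors, the seed argument, and the dictionary layout of the result are hypothetical and exist only to show that each chain length and each label class is drawn independently.

```python
import random


def sample_from(probabilities: dict, rng: random.Random):
    """Draw one key from a {value: probability} dictionary."""
    values = list(probabilities.keys())
    weights = list(probabilities.values())
    return rng.choices(values, weights=weights, k=1)[0]


def sketch_receptors(receptor_count, chain_1_length_probabilities,
                     chain_2_length_probabilities, labels, seed=0):
    """Hypothetical stand-in that only samples lengths and label classes."""
    rng = random.Random(seed)
    receptors = []
    for _ in range(receptor_count):
        receptors.append({
            # each chain length is drawn from its own distribution
            "chain_1_length": sample_from(chain_1_length_probabilities, rng),
            "chain_2_length": sample_from(chain_2_length_probabilities, rng),
            # one class per label, drawn independently of the other labels
            "labels": {name: sample_from(classes, rng)
                       for name, classes in labels.items()},
        })
    return receptors


receptors = sketch_receptors(
    receptor_count=100,
    chain_1_length_probabilities={14: 0.8, 15: 0.2},
    chain_2_length_probabilities={14: 0.8, 15: 0.2},
    labels={"epitope1": {True: 0.5, False: 0.5},
            "epitope2": {1: 0.3, 0: 0.7}},
)
```

Every sketched receptor carries exactly one class for epitope1 and one for epitope2, which is why the negative classes (False, 0) have to appear in the specification.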

static generate_repertoire_dataset(repertoire_count: int, sequence_count_probabilities: dict, sequence_length_probabilities: dict, labels: dict, path: Path, name='repertoire_dataset') → RepertoireDataset

Creates repertoire_count repertoires, where the number of sequences per repertoire is sampled from the probability distribution given in sequence_count_probabilities. The length of each sequence is sampled independently from the sequence_length_probabilities distribution. Labels are also assigned to repertoires at random, according to the distributions given in labels. Labels are multi-class, so each repertoire gets exactly one class from each label; negative classes must therefore be included in the specification as well.

An example of input parameters is given below:

    repertoire_count: 100 # generate 100 repertoires
    sequence_count_probabilities:
        100: 0.5 # half of the generated repertoires will have 100 sequences
        200: 0.5 # the other half of the generated repertoires will have 200 sequences
    sequence_length_probabilities:
        14: 0.8 # 80% of all generated sequences for all repertoires will have length 14
        15: 0.2 # 20% of all generated sequences across all repertoires will have length 15
    labels:
        cmv: # label name
            True: 0.5 # 50% of the repertoires will have class True
            False: 0.5 # 50% of the repertoires will have class False
        coeliac: # next label with classes that will be assigned to repertoires independently of the previous label or any other parameter
            1: 0.3 # 30% of the generated repertoires will have class 1
            0: 0.7 # 70% of the generated repertoires will have class 0
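To make the two sampling levels concrete (first a sequence count per repertoire, then a length per sequence), here is a rough standard-library sketch. It is an illustration under assumptions, not the library's code: sketch_repertoire_lengths and its seed argument are hypothetical names, and each repertoire is represented simply as a list of sequence lengths.

```python
import random
from collections import Counter


def sketch_repertoire_lengths(repertoire_count, sequence_count_probabilities,
                              sequence_length_probabilities, seed=0):
    """Hypothetical helper: returns one list of sequence lengths per repertoire."""
    rng = random.Random(seed)
    counts = list(sequence_count_probabilities)
    count_weights = list(sequence_count_probabilities.values())
    lengths = list(sequence_length_probabilities)
    length_weights = list(sequence_length_probabilities.values())

    repertoires = []
    for _ in range(repertoire_count):
        # level 1: how many sequences this repertoire contains
        n = rng.choices(counts, weights=count_weights, k=1)[0]
        # level 2: one independently sampled length per sequence
        repertoires.append(rng.choices(lengths, weights=length_weights, k=n))
    return repertoires


repertoires = sketch_repertoire_lengths(
    repertoire_count=100,
    sequence_count_probabilities={100: 0.5, 200: 0.5},
    sequence_length_probabilities={14: 0.8, 15: 0.2},
)
# tally how many repertoires ended up with each size
size_tally = Counter(len(r) for r in repertoires)
```

With independent draws like these, the split between repertoires of 100 and 200 sequences is close to, but not necessarily exactly, half and half; whether immuneML guarantees exact proportions is not stated in this reference.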

static generate_sequence_dataset(sequence_count: int, length_probabilities: dict, labels: dict, path: Path, region_type: str = 'IMGT_CDR3', name='sequence_dataset')

Creates sequence_count receptor sequences (single chain), where the length of each sequence is sampled independently from the length_probabilities distribution. Labels are also assigned to sequences at random, according to the distributions given in labels. Labels are multi-class, so each sequence gets exactly one class from each label; negative classes must therefore be included in the specification as well.

An example of input parameters is given below:

    sequence_count: 100 # generate 100 TRB ReceptorSequences
    length_probabilities:
        14: 0.8 # 80% of all generated sequences will have length 14
        15: 0.2 # 20% of all generated sequences will have length 15
    labels:
        epitope1: # label name
            True: 0.5 # 50% of the sequences will have class True
            False: 0.5 # 50% of the sequences will have class False
        epitope2: # next label with classes that will be assigned to sequences independently of the previous label or other parameters
            1: 0.3 # 30% of the generated sequences will have class 1
            0: 0.7 # 70% of the generated sequences will have class 0
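Because every label needs all of its classes listed, a quick sanity check on the dictionaries before calling the generator may be useful. The sketch below assumes that each distribution is expected to sum to 1; that requirement and the helper name check_distribution are assumptions made for illustration and are not taken from the immuneML API.

```python
import math


def check_distribution(name: str, probabilities: dict) -> None:
    """Hypothetical validator: raise ValueError if this is not a probability distribution."""
    if not probabilities:
        raise ValueError(f"{name}: distribution is empty")
    if any(p < 0 for p in probabilities.values()):
        raise ValueError(f"{name}: probabilities must be non-negative")
    total = sum(probabilities.values())
    # tolerate floating-point rounding when summing
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"{name}: probabilities sum to {total}, expected 1")


# Python equivalents of the YAML parameters shown above
length_probabilities = {14: 0.8, 15: 0.2}
labels = {
    "epitope1": {True: 0.5, False: 0.5},
    "epitope2": {1: 0.3, 0: 0.7},
}

check_distribution("length_probabilities", length_probabilities)
for label_name, classes in labels.items():
    check_distribution(f"labels.{label_name}", classes)
```

A label given as {True: 0.5} with no False entry would fail this check, which mirrors the requirement that negative classes be specified explicitly. Dictionaries that pass could then be supplied as the length_probabilities and labels arguments of generate_sequence_dataset, together with a Path for the output location.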

Module contents