immuneML.simulation package
Subpackages
- immuneML.simulation.dataset_generation package
- immuneML.simulation.implants package
  - Submodules
  - immuneML.simulation.implants.ImplantAnnotation module
  - immuneML.simulation.implants.LigoPWM module
  - immuneML.simulation.implants.Motif module
  - immuneML.simulation.implants.MotifInstance module
  - immuneML.simulation.implants.SeedMotif module
    - SeedMotif
      - SeedMotif.all_possible_instances
      - SeedMotif.alphabet_weights
      - SeedMotif.get_all_possible_instances()
      - SeedMotif.get_alphabet()
      - SeedMotif.get_max_length()
      - SeedMotif.hamming_distance_probabilities
      - SeedMotif.instantiate_motif()
      - SeedMotif.max_gap
      - SeedMotif.min_gap
      - SeedMotif.position_weights
      - SeedMotif.seed
      - SeedMotif.set_default_weights()
  - immuneML.simulation.implants.Signal module
  - Module contents
- immuneML.simulation.simulation_strategy package
- immuneML.simulation.util package
  - Submodules
  - immuneML.simulation.util.bnp_util module
  - immuneML.simulation.util.igor_helper module
  - immuneML.simulation.util.util module
    - annotate_sequences()
    - assign_duplicate_counts()
    - build_imgt_positions()
    - check_iteration_progress()
    - check_sequence_count()
    - choose_implant_position()
    - construct_sequence_metadata_object()
    - filter_out_illegal_sequences()
    - filter_sequences_by_length()
    - get_allowed_positions()
    - get_bnp_data()
    - get_max_seq_length()
    - get_min_seq_length()
    - get_no_signal_sequences()
    - get_region_type()
    - get_sequence_per_signal_count()
    - get_signal_sequence_count()
    - get_signal_sequences()
    - make_annotated_dataclass()
    - make_bnp_annotated_sequences()
    - make_repertoire_from_sequences()
    - make_sequence_paths()
    - make_signal_metadata()
    - match_genes()
    - match_motif()
    - match_motif_group()
    - match_motif_regexes()
    - needs_seqs_with_signal()
    - prepare_data_for_airr_seq_set()
    - prepare_data_for_repertoire_obj()
    - update_seqs_with_signal()
    - update_seqs_without_signal()
    - write_bnp_data()
  - Module contents
Submodules
immuneML.simulation.LigoSimState module
- class immuneML.simulation.LigoSimState.LigoSimState(signals: list, simulation: immuneML.simulation.SimConfig.SimConfig, paths: dict = None, name: str = None, target_p_gen_histogram: Dict[str, numpy.ndarray] = <factory>, p_gen_bins: Dict[str, Any] = <factory>, resulting_dataset: immuneML.data_model.datasets.Dataset.Dataset = None, result_path: pathlib.Path = None)
Bases: object
- formats = None
- name: str = None
- p_gen_bins: Dict[str, Any]
- paths: dict = None
- result_path: Path = None
- signals: list
- target_p_gen_histogram: Dict[str, ndarray]
immuneML.simulation.SimConfig module
- class immuneML.simulation.SimConfig.SimConfig(sim_items: List[SimConfigItem] = None, identifier: str = None, is_repertoire: bool = None, paired: bool | List[List[str]] = None, sequence_type: SequenceType = None, simulation_strategy: SimulationStrategy = None, p_gen_bin_count: int = None, keep_p_gen_dist: bool = None, remove_seqs_with_signals: bool = None, species: str = None, implanting_scaling_factor: int = None)
Bases: object
The simulation config defines all parameters of the simulation. It can contain one or more simulation config items, which define groups of repertoires or receptors that have the same simulation parameters, such as signals, generative model, clonal frequencies, and noise parameters.
Specification arguments:
sim_items (dict): the SimConfigItems, keyed by name, that define the individual units of simulation
is_repertoire (bool): whether the simulation is on a repertoire (person) or sequence/receptor level
paired: if the simulation should output paired data, this parameter should contain a list of lists, where each inner list names the pair of sim_items that should be combined; if paired data is not needed, it should be False
sequence_type (str): either amino_acid or nucleotide
simulation_strategy (str): either RejectionSampling or Implanting, see the tutorials for more information on choosing one of these
keep_p_gen_dist (bool): if possible, whether to keep the distribution of generation probabilities of the sequences the same as provided by the model without any signals
p_gen_bin_count (int): if keep_p_gen_dist is true, how many bins to use to approximate the generation probability distribution
remove_seqs_with_signals (bool): if true, it explicitly controls the proportions of signals in sequences and removes any accidental occurrences
species (str): species that the sequences come from; used to select correct genes to export full length sequences; default is ‘human’
implanting_scaling_factor (int): determines in how many receptors to implant the signal in each iteration; this is computed as number_of_receptors_needed_for_signal * implanting_scaling_factor. It is useful when the Implanting simulation strategy is combined with importance sampling: receptors with implanted signals can have a very low generation probability, so importance sampling may rarely keep them. This parameter is only used when keep_p_gen_dist is set to True
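The formula given for implanting_scaling_factor can be written out as a small sketch. The function name and the example numbers are hypothetical and are not part of immuneML; only the multiplication itself comes from the description above.

```python
def receptors_to_implant(receptors_needed_for_signal: int,
                         implanting_scaling_factor: int) -> int:
    """Number of receptors that get the signal implanted in one iteration.

    Hypothetical helper: it only restates the documented formula
    number_of_receptors_needed_for_signal * implanting_scaling_factor.
    """
    return receptors_needed_for_signal * implanting_scaling_factor


# Hypothetical numbers: 25 receptors with the signal are still needed.
implanted = receptors_to_implant(25, implanting_scaling_factor=10)
print(implanted)
```

A larger factor implants the signal into more candidate receptors per iteration, which leaves more of them available after importance sampling discards receptors with a poorly matching generation probability.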
YAML specification:
definitions:
  simulations:
    sim1:
      is_repertoire: false
      paired: false
      sequence_type: amino_acid
      simulation_strategy: RejectionSampling
      sim_items:
        sim_item1: # group of sequences with same simulation params
          generative_model:
            chain: beta
            default_model_name: humanTRB
            model_path: null
            type: OLGA
          number_of_examples: 100
          seed: 1002
          signals:
            signal1: 1
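As a rough illustration of the constraints listed under the specification arguments, the sketch below checks a plain dictionary that mirrors the sim1 entry of the YAML specification. The function check_sim_config and its messages are hypothetical; immuneML performs its own parsing and validation, which may differ.

```python
ALLOWED_SEQUENCE_TYPES = {"amino_acid", "nucleotide"}
ALLOWED_STRATEGIES = {"RejectionSampling", "Implanting"}


def check_sim_config(config: dict) -> list:
    """Return a list of problems found in a SimConfig-like dict (empty if none).

    Hypothetical helper, not the immuneML parser.
    """
    problems = []
    if config.get("sequence_type") not in ALLOWED_SEQUENCE_TYPES:
        problems.append("sequence_type must be amino_acid or nucleotide")
    if config.get("simulation_strategy") not in ALLOWED_STRATEGIES:
        problems.append("simulation_strategy must be RejectionSampling or Implanting")

    sim_items = config.get("sim_items") or {}
    if not sim_items:
        problems.append("at least one sim_item is required")

    # paired is either False or a list of [sim_item_name, sim_item_name] pairs
    paired = config.get("paired", False)
    if paired is not False:
        valid = isinstance(paired, list) and all(
            isinstance(pair, list)
            and len(pair) == 2
            and all(name in sim_items for name in pair)
            for pair in paired
        )
        if not valid:
            problems.append("paired must be False or a list of sim_item name pairs")
    return problems


sim1 = {
    "is_repertoire": False,
    "paired": False,
    "sequence_type": "amino_acid",
    "simulation_strategy": "RejectionSampling",
    "sim_items": {
        "sim_item1": {
            "generative_model": {"chain": "beta", "default_model_name": "humanTRB",
                                 "model_path": None, "type": "OLGA"},
            "number_of_examples": 100,
            "seed": 1002,
            "signals": {"signal1": 1},
        }
    },
}

print(check_sim_config(sim1))
```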
- identifier: str = None
- implanting_scaling_factor: int = None
- is_repertoire: bool = None
- keep_p_gen_dist: bool = None
- p_gen_bin_count: int = None
- paired: bool | List[List[str]] = None
- remove_seqs_with_signals: bool = None
- sequence_type: SequenceType = None
- sim_items: List[SimConfigItem] = None
- simulation_strategy: SimulationStrategy = None
- species: str = None
immuneML.simulation.SimConfigItem module
- class immuneML.simulation.SimConfigItem.SimConfigItem(signal_proportions: Dict[Signal | SignalPair, float], name: str = '', is_noise: bool = False, seed: int = None, generative_model: GenerativeModel = None, number_of_examples: int = 0, receptors_in_repertoire_count: int = 0, false_positive_prob_in_receptors: float = 0.0, false_negative_prob_in_receptors: float = 0.0, immune_events: dict = <factory>, default_clonal_frequency: dict = None, sequence_len_limits: dict = None)
Bases: object
When performing a simulation, one or more simulation config items can be specified. Config items define groups of repertoires or receptors that have the same simulation parameters, such as signals, generative model, clonal frequencies, and noise parameters.
Specification arguments:
signals (dict): signals for the simulation item and the proportion of sequences in the repertoire that will have the given signal. For receptor-level simulation, the proportion will always be 1.
is_noise (bool): indicates whether the implanting should be regarded as noise; if it is True, the signals will be implanted as specified, but the repertoire/receptor in question will have negative class.
generative_model: parameters of the generative model, including its type, path to the model; currently supported models are OLGA and ExperimentalImport
seed (int): starting random seed for the generative model (it should differ across simulation items, or it can be set to null when not used)
false_positive_prob_in_receptors (float): when performing repertoire-level simulation, the proportion of sequences that should be false positives
false_negative_prob_in_receptors (float): when performing repertoire-level simulation, the proportion of sequences that should be false negatives
immune_events (dict): a set of key-value pairs that will be added to the metadata (same values for all data generated in one simulation sim_item) and can be later used as labels
default_clonal_frequency (dict): clonal frequency in Ligo is simulated with scipy’s zeta distribution, using the parameters provided under default_clonal_frequency. These parameters are used to assign count values to sequences that do not contain any signals, if the simulation requires counts. If clonal frequency should not be used, this parameter can be None.
    default_clonal_frequency:
      a: 2 # shape parameter of the distribution
      loc: 0 # 0 by default but can be used to shift the distribution
sequence_len_limits (dict): allows for filtering the generated sequences by length, needs to have parameters min and max specified; if not used, min/max should be -1
    sequence_len_limits:
      min: 4 # keep sequences of length 4 and longer
      max: -1 # no limit on the max length of the sequences
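The sequence_len_limits and default_clonal_frequency arguments can be illustrated with a minimal, standard-library-only sketch. Both function names are hypothetical. The sampler is a stand-in for scipy's zeta distribution: it truncates the support at max_k (an assumption made here to keep the example dependency-free), so it only approximates the distribution that immuneML draws from.

```python
import random


def filter_by_length(sequences, min_len=-1, max_len=-1):
    """Keep sequences whose length lies within the limits; -1 disables a bound."""
    return [
        seq for seq in sequences
        if (min_len == -1 or len(seq) >= min_len)
        and (max_len == -1 or len(seq) <= max_len)
    ]


def sample_zeta_counts(n, a, loc=0, max_k=10_000, seed=None):
    """Draw n counts with P(k) proportional to k**-a for k = 1..max_k, shifted by loc.

    Truncated approximation of a zeta distribution, not scipy's implementation.
    """
    rng = random.Random(seed)
    support = range(1, max_k + 1)
    weights = [k ** -a for k in support]
    return [k + loc for k in rng.choices(support, weights=weights, k=n)]


# Hypothetical example sequences.
sequences = ["CAS", "CASSLGF", "CASSPGQGAYEQYF"]
kept = filter_by_length(sequences, min_len=4, max_len=-1)
counts = sample_zeta_counts(len(kept), a=2, loc=0, seed=1)
print(kept)
print(dict(zip(kept, counts)))
```

With a: 2 most sequences receive a count of 1 and a few receive much larger counts, which is the heavy-tailed shape the zeta distribution is used for here.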
YAML specification:
definitions:
  simulations: # definitions of simulations should be under key simulations in the definitions part of the specification
    # one simulation with multiple implanting objects, a part of definition section
    my_simulation:
      sim_item1:
        number_of_examples: 10
        seed: null # don't use seed
        receptors_in_repertoire_count: 100
        generative_model:
          chain: beta
          default_model_name: humanTRB
          model_path: null
          type: OLGA
        signals:
          my_signal: 0.25
          my_signal2: 0.01
          my_signal__my_signal2: 0.02 # my_signal and my_signal2 will co-occur in 2% of the receptors in all 10 repertoires
      sim_item2:
        number_of_examples: 5
        receptors_in_repertoire_count: 150
        seed: 10
        generative_model:
          chain: beta
          default_model_name: humanTRB
          model_path: null
          type: OLGA
        signals:
          my_signal: 0.75
        default_clonal_frequency:
          a: 2
        sequence_len_limits:
          min: 3
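To make the signal proportions of sim_item1 concrete, the sketch below converts them into receptor counts for a single repertoire. The function signal_counts is hypothetical. It assumes that each entry, including the my_signal__my_signal2 pair, describes a separate group of receptors and that counts are rounded to the nearest integer; immuneML may handle both points differently.

```python
def signal_counts(receptors_in_repertoire_count: int, signals: dict) -> dict:
    """Receptors per signal entry in one repertoire, plus the signal-free remainder.

    Hypothetical helper: treats entries as disjoint groups and rounds to integers.
    """
    counts = {
        name: round(receptors_in_repertoire_count * proportion)
        for name, proportion in signals.items()
    }
    counts["no_signal"] = receptors_in_repertoire_count - sum(counts.values())
    return counts


sim_item1_signals = {
    "my_signal": 0.25,
    "my_signal2": 0.01,
    "my_signal__my_signal2": 0.02,  # double underscore joins co-occurring signals
}

counts = signal_counts(100, sim_item1_signals)
print(counts)

# Split a pair key back into the names of its signals.
print("my_signal__my_signal2".split("__"))
```

Under these assumptions, each of the 10 repertoires in sim_item1 contains 25 receptors with my_signal, 1 with my_signal2, 2 with both, and 72 without any signal.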
- default_clonal_frequency: dict = None
- false_negative_prob_in_receptors: float = 0.0
- false_positive_prob_in_receptors: float = 0.0
- generative_model: GenerativeModel = None
- immune_events: dict
- is_noise: bool = False
- name: str = ''
- number_of_examples: int = 0
- receptors_in_repertoire_count: int = 0
- seed: int = None
- sequence_len_limits: dict = None
- signal_proportions: Dict[Signal | SignalPair, float]