immuneML.workflows.instructions.train_gen_model package

Submodules

immuneML.workflows.instructions.train_gen_model.TrainGenModelInstruction module

class immuneML.workflows.instructions.train_gen_model.TrainGenModelInstruction.TrainGenModelInstruction(dataset: Dataset = None, methods: List[GenerativeModel] = None, number_of_processes: int = 1, gen_examples_count: int = 100, result_path: Path = None, name: str = None, reports: list = None, export_generated_dataset: bool = True, export_combined_dataset: bool = False, training_percentage: float = None, split_strategy: SplitType = SplitType.RANDOM, split_config: ManualSplitConfig = None)[source]

Bases: Instruction

The TrainGenModel instruction implements training of generative AIRR models at the receptor level. Models that can be trained for sequence generation are listed in the Generative Models section.

This instruction takes as input a dataset that will be used to train the model, the model itself, and the number of sequences to generate to illustrate the applicability of the fitted model. It can also produce reports of the fitted model and of the original and generated sequences.

To use a generative model previously trained with immuneML, see the ApplyGenModel instruction.

Specification arguments:

  • dataset: dataset to use for fitting the generative model; it has to be defined under definitions/datasets

  • methods: which methods to fit (defined previously under definitions/ml_methods; a sketch of the definitions section is shown after this list); for backwards compatibility, ‘method’ with a single method can also be used, but this option will be removed in a future version.

  • number_of_processes (int): how many processes to use for fitting the model

  • gen_examples_count (int): how many examples (sequences, repertoires) to generate from the fitted model

  • reports (list): list of report ids (defined under definitions/reports) to apply after fitting a generative model and generating gen_examples_count examples; these can be data reports (run on the generated examples) or ML reports (run on the fitted model)

  • split_strategy (str): strategy to use for splitting the dataset into training and test datasets; valid options are RANDOM and MANUAL (in which case the train and test sets are fixed); default is RANDOM

  • training_percentage (float): percentage of the dataset to use for training the generative model if the split_strategy parameter is RANDOM. If set to 1, the full dataset is used for training and the test dataset is the same as the training dataset. The default value is 0.7. When export_combined_dataset is set to True, the assignment of sequences to the train, test, and generated subsets is shown in the dataset_split column.

  • split_config (dict): if split_strategy is set to MANUAL, this parameter specifies the ids of the examples that should be in the train and test sets; the paths to csv files with the ids for the train and test data should be provided under the keys ‘train_metadata_path’ and ‘test_metadata_path’
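
The dataset, methods, and reports referenced above must be declared under definitions in the same specification file. The following is a minimal, illustrative sketch of such a definitions section; the AIRR import format, the PWM model choice, and the SequenceLengthDistribution report are assumptions made purely for illustration — see the Datasets, Generative Models, and Reports documentation for the actual options and their parameters.

definitions:
    datasets:
        d1: # the dataset referenced by the instruction
            format: AIRR # illustrative import format
            params:
                path: path/to/data/
    ml_methods:
        model1: PWM # illustrative generative model; see Generative Models for the available models and their parameters
    reports:
        data_rep1: SequenceLengthDistribution # illustrative data report to run on the generated examples
        # ml_rep2 would be defined here as an ML report compatible with the chosen model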

YAML specification:

instructions:
    my_train_gen_model_inst: # user-defined instruction name
        type: TrainGenModel
        dataset: d1 # defined previously under definitions/datasets
        methods: [model1] # defined previously under definitions/ml_methods
        gen_examples_count: 100
        number_of_processes: 4
        training_percentage: 0.7
        split_strategy: RANDOM # optional, default is RANDOM
        export_generated_dataset: True
        export_combined_dataset: False
        reports: [data_rep1, ml_rep2]

    my_train_gen_model_with_manual_split: # another instruction example
        type: TrainGenModel
        dataset: d1 # defined previously under definitions/datasets
        methods: [m1, m2]
        gen_examples_count: 100
        split_strategy: MANUAL
        split_config:
            train_metadata_path: path/to/train_metadata.csv # path to csv file with ids of examples in train set
            test_metadata_path: path/to/test_metadata.csv # path to csv file with ids of examples in test set
        export_generated_dataset: True
        export_combined_dataset: False
        reports: [data_rep1, ml_rep2]
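
For the MANUAL split, the files given under train_metadata_path and test_metadata_path are csv files listing the ids of the examples assigned to the train and test sets, respectively. A hypothetical sketch of such a file is shown below; the example_id column name and the listed ids are assumptions made only for illustration — the ids must match the identifiers of the examples in the imported dataset.

# hypothetical content of path/to/train_metadata.csv
example_id
sequence_1
sequence_2
sequence_3
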
MAX_ELEMENT_COUNT_TO_SHOW = 10
merge_datasets(datasets: List[SequenceDataset], result_path: Path) → SequenceDataset[source]
run(result_path: Path) → TrainGenModelState[source]
class immuneML.workflows.instructions.train_gen_model.TrainGenModelInstruction.TrainGenModelState(result_path: pathlib.Path = None, name: str = None, gen_examples_count: int = None, model_paths: Dict[str, pathlib.Path] = <factory>, generated_dataset: immuneML.data_model.datasets.Dataset.Dataset = None, exported_datasets: Dict[str, pathlib.Path] = <factory>, report_results: Dict[str, List[immuneML.reports.ReportResult.ReportResult]] = <factory>, combined_dataset: immuneML.data_model.datasets.Dataset.Dataset = None, train_dataset: immuneML.data_model.datasets.Dataset.Dataset = None, test_dataset: immuneML.data_model.datasets.Dataset.Dataset = None, training_percentage: float = None, split_strategy: immuneML.hyperparameter_optimization.config.SplitType.SplitType = None, split_config: immuneML.hyperparameter_optimization.config.ManualSplitConfig.ManualSplitConfig = None)[source]

Bases: object

combined_dataset: Dataset = None
exported_datasets: Dict[str, Path]
gen_examples_count: int = None
generated_dataset: Dataset = None
model_paths: Dict[str, Path]
name: str = None
report_results: Dict[str, List[ReportResult]]
result_path: Path = None
split_config: ManualSplitConfig = None
split_strategy: SplitType = None
test_dataset: Dataset = None
train_dataset: Dataset = None
training_percentage: float = None

Module contents