immuneML.workflows.instructions package

Submodules

immuneML.workflows.instructions.Instruction module

class immuneML.workflows.instructions.Instruction.Instruction[source]

Bases: object

abstract run(result_path: pathlib.Path)[source]
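To illustrate the contract this abstract base class defines, here is a minimal sketch of a hypothetical subclass (DummyInstruction and its body are illustrative only, not part of immuneML):

from pathlib import Path

from immuneML.workflows.instructions.Instruction import Instruction

class DummyInstruction(Instruction):
    # hypothetical subclass showing the Instruction contract
    def run(self, result_path: Path):
        # each instruction receives a directory in which to store its results
        result_path.mkdir(parents=True, exist_ok=True)
        # ... perform the instruction's work here ...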

immuneML.workflows.instructions.MLProcess module

class immuneML.workflows.instructions.MLProcess.MLProcess(train_dataset: immuneML.data_model.dataset.Dataset.Dataset, test_dataset: immuneML.data_model.dataset.Dataset.Dataset, label: str, metrics: set, optimization_metric: immuneML.environment.Metric.Metric, path: pathlib.Path, ml_reports: Optional[List[immuneML.reports.ml_reports.MLReport.MLReport]] = None, encoding_reports: Optional[list] = None, data_reports: Optional[list] = None, number_of_processes: int = 2, label_config: Optional[immuneML.environment.LabelConfiguration.LabelConfiguration] = None, report_context: Optional[dict] = None, hp_setting: Optional[immuneML.hyperparameter_optimization.HPSetting.HPSetting] = None, store_encoded_data: Optional[bool] = None)[source]

Bases: object

Class that implements the machine learning process:
  1. encodes the training dataset

  2. encodes the test dataset (using parameters learnt in step 1, if there are any such parameters)

  3. trains the ML method on the encoded training dataset

  4. assesses the method's performance on the encoded test dataset

It performs the task for a given label configuration and a given list of metrics (used only in the assessment step).
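A rough sketch of how this class could be driven, based only on the constructor signature above; the dataset objects are assumed to be built elsewhere, and the Metric members used here (ACCURACY, BALANCED_ACCURACY) are inferred from the metric names appearing in the YAML examples below:

from pathlib import Path

from immuneML.environment.Metric import Metric
from immuneML.workflows.instructions.MLProcess import MLProcess

# train_dataset and test_dataset are assumed to be pre-built immuneML
# Dataset objects (construction not shown)
process = MLProcess(train_dataset=train_dataset, test_dataset=test_dataset,
                    label="disease", metrics={Metric.ACCURACY},
                    optimization_metric=Metric.BALANCED_ACCURACY,
                    path=Path("ml_process_results/"))

# runs steps 1-4 for one CV split and returns an HPItem with the results
hp_item = process.run(split_index=0)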

run(split_index: int) → immuneML.hyperparameter_optimization.states.HPItem.HPItem[source]

immuneML.workflows.instructions.SimulationInstruction module

class immuneML.workflows.instructions.SimulationInstruction.SimulationInstruction(signals: list, simulation: immuneML.simulation.Simulation.Simulation, dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, name: Optional[str] = None, exporters: Optional[List[immuneML.IO.dataset_export.DataExporter.DataExporter]] = None)[source]

Bases: immuneML.workflows.instructions.Instruction.Instruction

A simulation is an instruction that implants synthetic signals into a given dataset according to the given parameters. The result is a new dataset containing the modified sequences, annotated with metadata labels according to the implanted signals.

Parameters
  • dataset (RepertoireDataset) – original dataset which will be used as a basis for implanting signals from the simulation

  • simulation (Simulation) – definition of how to perform the simulation.

  • export_formats – in which formats to export the dataset after simulation. Valid formats are class names of any non-abstract class inheriting DataExporter. Important note: Pickle files might not be compatible between different immuneML (sub)versions.

YAML specification:

my_simulation_instruction: # user-defined name of the instruction
    type: Simulation # which instruction to execute
    dataset: my_dataset # which dataset to use for implanting the signals
    simulation: my_simulation # how to implant the signals - definition of the simulation
    export_formats: [AIRR] # in which formats to export the dataset
export_dataset()[source]
static get_documentation()[source]
run(result_path: pathlib.Path)[source]

immuneML.workflows.instructions.TrainMLModelInstruction module

class immuneML.workflows.instructions.TrainMLModelInstruction.TrainMLModelInstruction(dataset, hp_strategy: immuneML.hyperparameter_optimization.strategy.HPOptimizationStrategy.HPOptimizationStrategy, hp_settings: list, assessment: immuneML.hyperparameter_optimization.config.SplitConfig.SplitConfig, selection: immuneML.hyperparameter_optimization.config.SplitConfig.SplitConfig, metrics: set, optimization_metric: immuneML.environment.Metric.Metric, label_configuration: immuneML.environment.LabelConfiguration.LabelConfiguration, path: Optional[pathlib.Path] = None, context: Optional[dict] = None, number_of_processes: int = 1, reports: Optional[dict] = None, name: Optional[str] = None, refit_optimal_model: bool = False, store_encoded_data: Optional[bool] = None)[source]

Bases: immuneML.workflows.instructions.Instruction.Instruction

Class implementing hyperparameter optimization and model training and assessment through nested cross-validation (CV). The process is defined by two loops:

  • the outer loop over defined splits of the dataset, for performance assessment

  • the inner loop over the defined hyperparameter space, using cross-validation or a train & validation split to choose the best hyperparameters

The optimal model chosen by the inner loop is then retrained on the whole training dataset in the outer loop.

Note: If you are interested in plotting the performance of all combinations of encodings and ML methods on the test set, consider running the MLSettingsPerformance report as a hyperparameter report in the assessment loop.
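The structure of the two loops can be pictured with the following schematic (plain Python, a simplified sketch and not immuneML's actual implementation):

# schematic of nested CV: the outer loop assesses, the inner loop selects hyperparameters
def nested_cv(outer_splits, make_inner_splits, hp_settings, evaluate):
    # outer_splits: list of (train, test) pairs
    # make_inner_splits(train): returns a list of (train_part, validation) pairs
    # evaluate(setting, train, test): fits `setting` on train, returns a score on test
    results = []
    for train, test in outer_splits:  # outer loop: assessment
        inner = make_inner_splits(train)
        # inner loop: pick the setting with the best mean validation score
        best = max(hp_settings,
                   key=lambda s: sum(evaluate(s, tr, val) for tr, val in inner) / len(inner))
        # retrain the best setting on the whole training set, assess it on the test set
        results.append((best, evaluate(best, train, test)))
    return results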

Parameters
  • dataset (Dataset) – dataset to use for training and assessing the classifier

  • hp_strategy (HPOptimizationStrategy) – how to search different hyperparameters; common options include grid search, random search. Valid values are objects of any class inheriting HPOptimizationStrategy.

  • hp_settings (list) – a list of combinations of preprocessing_sequence, encoding and ml_method. preprocessing_sequence is optional, while encoding and ml_method are mandatory. These three options (and their parameters) can be optimized over, choosing the highest performing combination.

  • assessment (SplitConfig) – description of the outer loop (for assessment) of nested cross-validation. It describes how to split the data, how many splits to make, what percentage to use for training and what reports to execute on those splits. See SplitConfig.

  • selection (SplitConfig) – description of the inner loop (for selection) of nested cross-validation. The same as assessment argument, just to be executed in the inner loop. See SplitConfig.

  • metrics (list) – a list of metrics to compute for all splits and settings created during the nested cross-validation. These metrics will be computed only for reporting purposes. For choosing the optimal setting, optimization_metric will be used.

  • optimization_metric (Metric) – a metric to use for optimization and assessment in the nested cross-validation.

  • label_configuration (LabelConfiguration) – a list of labels for which to train the classifiers. The goal of the nested CV is to find the setting which will have the best performance in predicting the given label. Performance and optimal settings will be reported for each label separately. If a label is binary, instead of specifying only its name, one should explicitly set the name of the positive class as well under the parameter positive_class. If the positive class is not set, one of the classes of the label will be assumed to be positive.

  • number_of_processes (int) – how many processes should be created at once to speed up the analysis. For personal machines, 4 or 8 is usually a good choice.

  • reports (list) – a list of report names to be executed after the nested CV has finished, to show the overall performance or some statistic; the reports specified here have to be TrainMLModelReport reports.

  • refit_optimal_model (bool) – whether the final combination of preprocessing, encoding and ML model should be refitted on the full dataset, thus providing the final model to be exported from the instruction; alternatively, the trained combination from one of the assessment folds will be used.

  • store_encoded_data (bool) – whether the encoded datasets should be stored; can be True or False. Setting this argument to True might increase disk usage significantly.

YAML specification:

my_nested_cv_instruction: # user-defined name of the instruction
    type: TrainMLModel # which instruction should be executed
    settings: # a list of combinations of preprocessing, encoding and ml_method to optimize over
        - preprocessing: seq1 # preprocessing is optional
          encoding: e1 # mandatory field
          ml_method: simpleLR # mandatory field
        - preprocessing: seq1 # the second combination
          encoding: e2
          ml_method: simpleLR
    assessment: # outer loop of nested CV
        split_strategy: random # perform Monte Carlo CV (randomly split the data into train and test)
        split_count: 1 # how many train/test datasets to generate
        training_percentage: 0.7 # what percentage of the original data should be used for the training set
        reports: # reports to execute on training/test datasets, encoded datasets and trained ML methods
            data_splits: # list of reports to execute on training/test datasets (before they are encoded)
                - rep1
            encoding: # list of reports to execute on encoded training/test datasets
                - rep2
            models: # list of reports to execute on trained ML methods for each assessment CV split
                - rep3
    selection: # inner loop of nested CV
        split_strategy: k_fold # perform k-fold CV
        split_count: 5 # how many folds to create; together, these two parameters mean: do 5-fold CV
        reports:
            data_splits: # list of reports to execute on training/test datasets (in the inner loop, so these are actually training and validation datasets)
                - rep1
            models: # list of reports to execute on trained ML methods for each selection CV split
                - rep2
            encoding: # list of reports to execute on encoded training/test datasets (again, it is training/validation here)
                - rep3
    labels: # list of labels to optimize the classifier for, as given in the metadata for the dataset
        - celiac:
            positive_class: + # for binary classification, the positive_class parameter should be set
        - T1D # this is not a binary label, so there is no need to specify a positive class
    dataset: d1 # which dataset to use for the nested CV
    strategy: GridSearch # how to choose which combinations from settings to test (GridSearch means: test all)
    metrics: # list of metrics to compute for all settings, but these do not influence the choice of optimal model
        - accuracy
        - auc
    reports: # list of reports to execute when nested CV is finished to show overall performance
        - rep4
    number_of_processes: 4 # number of parallel processes to create (could speed up the computation)
    optimization_metric: balanced_accuracy # the metric to use for choosing the optimal model and during training
    refit_optimal_model: False # use the already trained model, do not refit it on the full dataset
    store_encoded_data: True # store encoded datasets in pickle format
static get_documentation()[source]
print_performances(state: immuneML.hyperparameter_optimization.states.TrainMLModelState.TrainMLModelState)[source]
run(result_path: pathlib.Path)[source]

immuneML.workflows.instructions.quickstart module

class immuneML.workflows.instructions.quickstart.Quickstart[source]

Bases: object

build_path(path: Optional[str] = None)[source]
create_specfication(path: pathlib.Path)[source]
run(result_path: str)[source]
immuneML.workflows.instructions.quickstart.main()[source]
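Based only on the signatures above, the quickstart could also be invoked programmatically, roughly as follows (the result directory name is a placeholder):

from immuneML.workflows.instructions.quickstart import Quickstart

# builds a specification and runs the bundled quickstart analysis
Quickstart().run(result_path="quickstart_results/")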

Module contents