Instruction parameters

The different workflows that can be executed by immuneML are called instructions. Different instructions may require different analysis components (defined under definitions).

This page documents all instructions and their parameters in detail. Tutorials for general usage of most instructions can be found under Tutorials.

Please use the menu on the right side of this page to navigate to the documentation for the instructions of interest, or jump to one of the following sections:

Machine learning: TrainMLModel, MLApplication, TrainGenModel, ApplyGenModel

Data simulation: LigoSim, FeasibilitySummary

Data analysis, exploration and manipulation: ExploratoryAnalysis, Clustering, DatasetExport, Subsampling

TrainMLModel

Class implementing hyperparameter optimization and model training and assessment through nested cross-validation (CV). The process is defined by two loops:

  • the outer loop over defined splits of the dataset for performance assessment

  • the inner loop over the defined hyperparameter space, using cross-validation or a train & validation split to choose the best hyperparameters.

The optimal model chosen by the inner loop is then retrained on the whole training dataset in the outer loop.

Note

If you are interested in plotting the performance of all combinations of encodings and ML methods on the test set, consider running the MLSettingsPerformance report as a hyperparameter report in the assessment loop.
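
A minimal sketch of how this could look in the assessment part of the instruction, assuming a report named my_settings_performance of type MLSettingsPerformance has been defined under definitions/reports (the report name here is illustrative):

assessment: # outer loop of nested CV
    split_strategy: random
    split_count: 1
    training_percentage: 0.7
    reports:
        hyperparameter: # report category for hyperparameter reports such as MLSettingsPerformance
            - my_settings_performance # defined under definitions/reports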

Specification arguments:

  • dataset: dataset to use for training and assessing the classifier

  • strategy: how to search the hyperparameter space defined by the settings. Valid values are: GridSearch (i.e., test all combinations).

  • settings (list): a list of combinations of preprocessing_sequence, encoding and ml_method. preprocessing_sequence is optional, while encoding and ml_method are mandatory. These three options (and their parameters) can be optimized over, choosing the highest performing combination. A sketch of example component definitions is shown after the YAML specification below.

  • assessment: description of the outer loop (for assessment) of nested cross-validation. It describes how to split the data, how many splits to make, what percentage to use for training and what reports to execute on those splits. See SplitConfig.

  • selection: description of the inner loop (for selection) of nested cross-validation. The same as assessment argument, just to be executed in the inner loop. See SplitConfig.

  • metrics (list): a list of metrics (accuracy, balanced_accuracy, confusion_matrix, f1_micro, f1_macro, f1_weighted, precision, recall, auc, log_loss) to compute for all splits and settings created during the nested cross-validation. These metrics will be computed only for reporting purposes. For choosing the optimal setting, optimization_metric will be used.

  • optimization_metric: a metric to use for optimization (one of accuracy, balanced_accuracy, confusion_matrix, f1_micro, f1_macro, f1_weighted, precision, recall, auc, log_loss) and assessment in the nested cross-validation.

  • example_weighting: which example weighting strategy to use. Example weighting can be used to up-weight or down-weight the importance of each example in the dataset. These weights will be applied when computing (optimization) metrics, and are used by some encoders and ML methods.

  • labels (list): a list of labels for which to train the classifiers. The goal of the nested CV is to find the setting with the best performance in predicting the given label (e.g., whether a subject has experienced an immune event or not). Performance and optimal settings will be reported for each label separately. If a label is binary, instead of specifying only its name, one should also explicitly set the name of the positive class under the parameter positive_class. If the positive class is not set, one of the label classes will be assumed to be positive.

  • number_of_processes (int): how many processes should be created at once to speed up the analysis. For personal machines, 4 or 8 is usually a good choice.

  • reports (list): a list of report names to be executed after the nested CV has finished to show the overall performance or some statistic; the reports that can be provided here are Train ML model reports.

  • refit_optimal_model (bool): whether the final combination of preprocessing, encoding and ML model should be refitted on the full dataset, thus providing the final model to be exported from the instruction; otherwise, the combination trained on one of the assessment folds will be used

  • export_all_ml_settings (bool): if set to True, all trained models in the assessment split are exported as .zip files. If False, only the optimal model is exported. By default, export_all_ml_settings is False.

  • sequence_type (str): whether to perform the analysis on amino acid or nucleotide sequences

  • region_type (str): which part of the sequence to analyze, e.g., IMGT_CDR3

YAML specification:

instructions:
    my_nested_cv_instruction: # user-defined name of the instruction
        type: TrainMLModel # which instruction should be executed
        settings: # a list of combinations of preprocessing, encoding and ml_method to optimize over
            - preprocessing: seq1 # preprocessing is optional
              encoding: e1 # mandatory field
              ml_method: simpleLR # mandatory field
            - preprocessing: seq1 # the second combination
              encoding: e2
              ml_method: simpleLR
        assessment: # outer loop of nested CV
            split_strategy: random # perform Monte Carlo CV (randomly split the data into train and test)
            split_count: 1 # how many train/test datasets to generate
            training_percentage: 0.7 # what percentage of the original data should be used for the training set
            reports: # reports to execute on training/test datasets, encoded datasets and trained ML methods
                data_splits: # list of reports to execute on training/test datasets (before they are encoded)
                    - rep1
                encoding: # list of reports to execute on encoded training/test datasets
                    - rep2
                models: # list of reports to execute on trained ML methods for each assessment CV split
                    - rep3
        selection: # inner loop of nested CV
            split_strategy: k_fold # perform k-fold CV
            split_count: 5 # how many folds to create; together with split_strategy, these two parameters mean: do 5-fold CV
            reports:
                data_splits: # list of reports to execute on training/test datasets (in the inner loop, so these are actually training and validation datasets)
                    - rep1
                models: # list of reports to execute on trained ML methods for each selection CV split
                    - rep2
                encoding: # list of reports to execute on encoded training/test datasets (again, it is training/validation here)
                    - rep3
        labels: # list of labels to optimize the classifier for, as given in the metadata for the dataset
            - celiac:
                positive_class: + # if it's binary classification, positive class parameter should be set
            - T1D # this is not a binary label, so there is no need to specify the positive class
        dataset: d1 # which dataset to use for the nested CV
        strategy: GridSearch # how to choose the combinations which to test from settings (GridSearch means test all)
        metrics: # list of metrics to compute for all settings, but these do not influence the choice of optimal model
            - accuracy
            - auc
        reports: # list of reports to execute when nested CV is finished to show overall performance
            - rep4
        number_of_processes: 4 # number of parallel processes to create (could speed up the computation)
        optimization_metric: balanced_accuracy # the metric to use for choosing the optimal model and during training
        refit_optimal_model: False # use trained model, do not refit on the full dataset
        export_all_ml_settings: False # only export the optimal setting
        region_type: IMGT_CDR3
        sequence_type: AMINO_ACID
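
For reference, the components referenced in settings above (seq1, e1, e2, simpleLR) are defined under definitions. A minimal sketch, assuming a k-mer frequency encoding, logistic regression and a clone-count filter (all parameter values here are illustrative, not defaults):

definitions:
    preprocessing_sequences:
        seq1:
            - my_filter:
                ClonesPerRepertoireFilter:
                    lower_limit: 100 # illustrative: keep repertoires with at least 100 clones
    encodings:
        e1:
            KmerFrequency:
                k: 3 # illustrative k-mer length
        e2: KmerFrequency # the same encoder with default parameters
    ml_methods:
        simpleLR: LogisticRegression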

MLApplication

Instruction which enables using trained ML models and encoders on new datasets which do not necessarily have labeled data. If the same label that the ML setting was trained for is provided, performance metrics can be computed.

The predictions are stored in predictions.csv in the result path, in the following format:

example_id   cmv_predicted_class   cmv_1_proba   cmv_0_proba
e1           1                     0.8           0.2
e2           0                     0.2           0.8
e3           1                     0.78          0.22

If the same label that the ML setting was trained for is present in the provided dataset, the ‘true’ label value will be added to the predictions table as well:

example_id   cmv_predicted_class   cmv_1_proba   cmv_0_proba   cmv_true_class
e1           1                     0.8           0.2           1
e2           0                     0.2           0.8           0
e3           1                     0.78          0.22          0

Specification arguments:

  • dataset: dataset for which examples need to be classified

  • config_path: path to the zip file exported from the TrainMLModel instruction (which includes the trained ML model, encoder, preprocessing etc.)

  • number_of_processes (int): how many processes should be created at once to speed up the analysis. For personal machines, 4 or 8 is usually a good choice.

  • metrics (list): a list of metrics (accuracy, balanced_accuracy, confusion_matrix, f1_micro, f1_macro, f1_weighted, precision, recall, auc, log_loss) to compute between the true and predicted classes. These metrics will only be computed when the same label with the same classes is provided for the dataset as the original label the ML setting was trained for.

YAML specification:

instructions:
    instruction_name:
        type: MLApplication
        dataset: d1
        config_path: ./config.zip
        metrics:
        - accuracy
        - precision
        - recall
        number_of_processes: 4

ExploratoryAnalysis

Allows exploratory analysis of different datasets using encodings and reports.

Analysis is defined by a dictionary of ExploratoryAnalysisUnit objects that encapsulate a dataset, an encoding [optional] and a report to be executed on the [encoded] dataset. Each analysis specified under analyses is completely independent of all others.

Specification arguments:

  • analyses (dict): a dictionary of analyses to perform. The keys are the names of different analyses, and the values for each of the analyses are:

    • dataset: dataset on which to perform the exploratory analysis

    • preprocessing_sequence: which preprocessings to use on the dataset; this item is optional and does not have to be specified.

    • example_weighting: which example weighting strategy to use before encoding the data; this item is optional and does not have to be specified.

    • encoding: how to encode the dataset before running the report; this item is optional and does not have to be specified.

    • labels: if encoding is specified, the relevant labels should be specified here.

    • dim_reduction: which dimensionality reduction method to apply to the encoded dataset; this item is optional (a sketch of a possible definition follows the YAML specification below).

    • report: which report to run on the dataset. Reports specified here may be of the category Data reports or Encoding reports, depending on whether ‘encoding’ was specified.

  • number_of_processes (int): how many processes should be created at once to speed up the analysis. For personal machines, 4 or 8 is usually a good choice.

YAML specification:

instructions:
    my_expl_analysis_instruction: # user-defined instruction name
        type: ExploratoryAnalysis # which instruction to execute
        analyses: # analyses to perform
            my_first_analysis: # user-defined name of the analysis
                dataset: d1 # dataset to use in the first analysis
                preprocessing_sequence: p1 # preprocessing sequence to use in the first analysis
                report: r1 # which report to generate using the dataset d1
            my_second_analysis: # user-defined name of another analysis
                dataset: d1 # dataset to use in the second analysis - can be the same or different from other analyses
                encoding: e1 # encoding to apply on the specified dataset (d1)
                report: r2 # which report to generate in the second analysis
                labels: # labels present in the dataset d1 which will be included in the encoded data on which report r2 will be run
                    - celiac # name of the first label as present in the column of dataset's metadata file
                    - CMV # name of the second label as present in the column of dataset's metadata file
            my_third_analysis: # user-defined name of another analysis
                dataset: d1 # dataset to use in the third analysis - can be the same or different from other analyses
                encoding: e1 # encoding to apply on the specified dataset (d1)
                dim_reduction: umap # or None; which dimensionality reduction method to apply to encoded d1
                report: r3 # which report to generate in the third analysis
        number_of_processes: 4 # number of parallel processes to create (could speed up the computation)
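
The dim_reduction value (umap above) refers to a dimensionality reduction method defined under definitions/ml_methods. A minimal sketch of such a definition, assuming a UMAP method with an n_components parameter is available in the installed immuneML version:

definitions:
    ml_methods:
        umap:
            UMAP: # assumed dimensionality reduction method name; see the ML methods documentation
                n_components: 2 # assumed parameter: target number of dimensions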

LigoSim

The LIgO simulation instruction creates a synthetic dataset from scratch based on a generative model and a set of signals provided by the user.

Specification arguments:

  • simulation (str): the name of a simulation object (containing a list of SimConfigItem) specified under the definitions key; it defines how to combine signals with simulated data (a sketch of a possible simulation definition is shown after the YAML specification below)

  • sequence_batch_size (int): how many sequences to generate at once using the generative model before checking for signals and filtering

  • max_iterations (int): how many iterations are allowed when creating sequences

  • export_p_gens (bool): whether to compute generation probabilities (if supported by the generative model) for sequences and include them as part of output

  • number_of_processes (int): determines how many simulation items can be simulated in parallel

YAML specification:

instructions:
    my_simulation_instruction: # user-defined name of the instruction
        type: LigoSim # which instruction to execute
        simulation: sim1
        sequence_batch_size: 1000
        max_iterations: 1000
        export_p_gens: False
        number_of_processes: 4
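
The simulation object referenced above (sim1) is defined under definitions. A minimal sketch, assuming OLGA with the humanTRB default model as the generative model and rejection sampling as the simulation strategy (exact keys may differ across immuneML/LIgO versions; see the simulation documentation under definitions):

definitions:
    motifs:
        motif1:
            seed: AS # simple amino acid seed the signal is based on
    signals:
        signal1:
            motifs: [motif1]
    simulations:
        sim1:
            is_repertoire: false # simulate a sequence dataset rather than repertoires
            paired: false
            sequence_type: amino_acid
            simulation_strategy: RejectionSampling
            sim_items:
                sim_item1: # user-defined name of the simulation item
                    generative_model:
                        type: OLGA
                        default_model_name: humanTRB
                    number_of_examples: 100
                    signals:
                        signal1: 0.5 # half of the simulated sequences contain signal1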

FeasibilitySummary

FeasibilitySummary instruction creates a small synthetic dataset and reports summary metrics to show whether the simulation with the given parameters is feasible. The input parameters to this analysis are the name of the simulation (the same simulation can later be used with the LigoSim instruction if the feasibility analysis looks acceptable) and the number of sequences to simulate for estimating feasibility.

The feasibility analysis is performed for each generative model separately as these could differ in the analyses that will be reported.

Specification arguments:

  • simulation (str): the name of a simulation object (containing a list of SimConfigItem) specified under the definitions key; it defines how to combine signals with simulated data

  • sequence_count (int): how many sequences to generate to estimate feasibility (default value: 100 000)

  • number_of_processes (int): for the parts of the analysis that are possible to parallelize, how many processes to use

YAML specification:

instructions:
    my_feasibility_summary: # user-defined name of the instruction
        type: FeasibilitySummary # which instruction to execute
        simulation: sim1
        sequence_count: 10000

TrainGenModel

TrainGenModel instruction implements training generative AIRR models on the receptor level. Models that can be trained for sequence generation are listed under the Generative Models section.

This instruction takes as input a dataset that will be used to train the model, the model itself, and the number of sequences to generate to illustrate the applicability of the model. It can also produce reports of the fitted model and reports of the original and generated sequences.

To use a generative model previously trained with immuneML, see the ApplyGenModel instruction.

Specification arguments:

  • dataset: dataset to use for fitting the generative model; it has to be defined under definitions/datasets

  • method: which model to fit (defined previously under definitions/ml_methods)

  • number_of_processes (int): how many processes to use for fitting the model

  • gen_examples_count (int): how many examples (sequences, repertoires) to generate from the fitted model

  • reports (list): list of report ids (defined under definitions/reports) to apply after fitting a generative model and generating gen_examples_count examples; these can be data reports (to be run on generated examples) or ML reports (to be run on the fitted model)

YAML specification:

instructions:
    my_train_gen_model_inst: # user-defined instruction name
        type: TrainGenModel
        dataset: d1 # defined previously under definitions/datasets
        method: model1 # defined previously under definitions/ml_methods
        gen_examples_count: 100
        number_of_processes: 4
        training_percentage: 0.7 # percentage of the dataset used for training the model (the rest can be used for comparing original and generated data)
        export_generated_dataset: True # export the generated dataset
        export_combined_dataset: False # do not export the combined (original + generated) dataset
        reports: [data_rep1, ml_rep2]
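
For context, d1 and model1 above are defined under definitions. A minimal sketch, assuming an AIRR-format sequence dataset and a SimpleVAE generative model with default parameters (the path is illustrative):

definitions:
    datasets:
        d1:
            format: AIRR
            params:
                path: path/to/airr_data/ # illustrative path to AIRR-formatted files
                is_repertoire: false # import a sequence-level dataset
    ml_methods:
        model1: SimpleVAE # a generative model with default parameters; see the Generative Models section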

ApplyGenModel

ApplyGenModel instruction implements applying generative AIRR models on the sequence level.

This instruction takes as input a model trained in the TrainGenModel instruction, which will be used for generating data, and the number of sequences to be generated. It can also produce reports of the applied model and reports of the generated sequences.

Specification arguments:

  • gen_examples_count (int): how many examples (sequences, repertoires) to generate from the applied model

  • reports (list): list of report ids (defined under definitions/reports) to apply after generating gen_examples_count examples; these can be data reports (to be run on generated examples) or ML reports (to be run on the fitted model)

  • ml_config_path (str): path to the trained model in zip format (as provided by the TrainGenModel instruction)

YAML specification:

instructions:
    my_apply_gen_model_inst: # user-defined instruction name
        type: ApplyGenModel
        gen_examples_count: 100
        ml_config_path: ./config.zip
        reports: [data_rep1, ml_rep2]

Clustering

The Clustering instruction fits clustering methods to the provided encoded dataset and compares combinations of clustering method (with its hyperparameters) and encoding across a predefined set of metrics. The dataset is split into discovery and validation datasets, and the clustering results are reported on both. Finally, it provides the option to include a set of reports to visualize the results.

For more details on choosing the clustering algorithm and its hyperparameters, see the paper: Ullmann, T., Hennig, C., & Boulesteix, A.-L. (2022). Validation of cluster analysis results on validation data: A systematic framework. WIREs Data Mining and Knowledge Discovery, 12(3), e1444. https://doi.org/10.1002/widm.1444

Specification arguments:

  • dataset (str): name of the dataset to be clustered

  • metrics (list): a list of metrics to use for comparison of clustering algorithms and encodings (it can include internal evaluation metrics, used when no labels are provided, or external evaluation metrics, which compare the clusters against a list of predefined labels)

  • labels (list): an optional list of labels to use for external evaluation of clustering

  • split_config (SplitConfig): how to perform the splitting of the original dataset into discovery and validation data; for this parameter, specify: split_strategy (leave_one_out_stratification, manual, random), training_percentage if split_strategy is random, and manual_config or leave_one_out_config for the corresponding split strategy; all three options are illustrated here:

    split_config:
        split_strategy: manual
        manual_config:
            discovery_data: file_with_ids_of_examples_for_discovery_data.csv
            validation_data: file_with_ids_of_examples_for_validation_data.csv
    
    split_config:
        split_strategy: random
        training_percentage: 0.5
    
    split_config:
        split_strategy: leave_one_out_stratification
        leave_one_out_config:
            parameter: subject_id # any name of the parameter for split, must be present in the metadata
            min_count: 1 # defines the minimum number of examples that can be present in the validation dataset
    
  • clustering_settings (list): a list where each element represents a ClusteringSetting: a combination of an encoding, an optional dimensionality reduction algorithm, and a clustering algorithm that will be evaluated (a sketch of possible method definitions follows the YAML specification below)

  • reports (list): a list of reports to be run on the clustering results or the encoded data

  • number_of_processes (int): how many processes to use for parallelization

  • sequence_type (str): whether to do analysis on the amino_acid or nucleotide level; this value is used only if nothing is specified on the encoder level

  • region_type (str): which part of the receptor sequence to analyze (e.g., IMGT_CDR3); this value is used only if nothing is specified on the encoder level

YAML specification:

instructions:
    my_clustering_instruction:
        type: Clustering
        dataset: d1
        metrics: [adjusted_rand_score, adjusted_mutual_info_score]
        labels: [epitope, v_call]
        sequence_type: amino_acid
        region_type: imgt_cdr3
        split_config:
            split_strategy: manual
            manual_config:
                discovery_data: file_with_ids_of_examples_for_discovery_data.csv
                validation_data: file_with_ids_of_examples_for_validation_data.csv
        clustering_settings:
            - encoding: e1
              dim_reduction: pca
              method: k_means1
            - encoding: e2
              method: dbscan
        reports: [rep1, rep2]
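
The clustering and dimensionality reduction methods referenced above (k_means1, dbscan, pca) are defined under definitions/ml_methods. A minimal sketch, assuming scikit-learn-style KMeans, DBSCAN and PCA wrappers are available (parameter values are illustrative):

definitions:
    ml_methods:
        k_means1:
            KMeans:
                n_clusters: 5 # illustrative number of clusters
        dbscan: DBSCAN # use default parameters
        pca:
            PCA:
                n_components: 2 # illustrative target dimensionality for the dim_reduction step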

DatasetExport

DatasetExport instruction takes a list of datasets as input, optionally applies preprocessing steps, and outputs the data in the specified formats.

Specification arguments:

  • datasets (list): a list of datasets to export in all given formats

  • preprocessing_sequence (str): which preprocessing sequence to use on the dataset(s), this item is optional and does not have to be specified. When specified, the same preprocessing sequence will be applied to all datasets.

  • export_formats (list): a list of formats in which to export the datasets. Valid formats are class names of any non-abstract class inheriting DataExporter.

  • number_of_processes (int): how many processes to use during repertoire export (not used for sequence datasets)

YAML specification:

instructions:
    my_dataset_export_instruction: # user-defined instruction name
        type: DatasetExport # which instruction to execute
        datasets: # list of datasets to export
            - my_generated_dataset
            - my_dataset_from_adaptive
        preprocessing_sequence: my_preprocessing_sequence
        number_of_processes: 4
        export_formats: # list of formats to export the datasets to
            - AIRR
            - ImmuneML

Subsampling

Subsampling is an instruction that subsamples a given dataset and creates multiple smaller datasets according to the parameters provided.

Specification arguments:

  • dataset (str): original dataset which will be used as a basis for subsampling

  • subsampled_dataset_sizes (list): a list of dataset sizes (number of examples) each subsampled dataset should have

  • dataset_export_formats (list): in which formats to export the subsampled datasets. Valid values are: AIRR.

YAML specification:

instructions:
    my_subsampling_instruction: # user-defined name of the instruction
        type: Subsampling # which instruction to execute
        dataset: my_dataset # original dataset to be subsampled, with e.g., 300 examples
        subsampled_dataset_sizes: # how large the subsampled datasets should be, one dataset will be created for each list item
            - 200 # one subsampled dataset with 200 examples (200 repertoires if my_dataset was repertoire dataset)
            - 100 # the other subsampled dataset will have 100 examples
        dataset_export_formats: # in which formats to export the subsampled datasets
            - AIRR