Instructions¶
The different workflows that can be executed by immuneML are called instructions.
Different instructions may require different analysis components (defined under definitions).
This page documents all instructions and their parameters in detail. Tutorials for general usage of most instructions can be found under Tutorials.
TrainMLModel¶
Class implementing hyperparameter optimization, training and assessment of the model through nested cross-validation (CV). The process is defined by two loops:
the outer loop over the defined splits of the dataset, used for performance assessment
the inner loop over the defined hyperparameter space, using cross-validation or a train & validation split to choose the best hyperparameters.
The optimal model chosen by the inner loop is then retrained on the whole training dataset in the outer loop.
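The two loops can be illustrated with a minimal, library-free sketch. It is not immuneML's implementation: the toy threshold "model", the interleaved k-fold helper and all function names are assumptions made only for this example, and a single threshold stands in for a full combination of preprocessing, encoding and ML method.

```python
from statistics import mean


def k_fold(indices, k):
    """Yield (train, held_out) index lists for k interleaved folds."""
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [j for n, fold in enumerate(folds) if n != i for j in fold]
        yield train, held_out


def fit(xs, ys, threshold):
    """Toy stand-in for a setting: the threshold is the only hyperparameter."""
    return lambda x: int(x > threshold)


def accuracy(model, xs, ys):
    return mean(int(model(x) == y) for x, y in zip(xs, ys))


def subset(values, indices):
    return [values[i] for i in indices]


def nested_cv(xs, ys, settings, outer_k=2, inner_k=3):
    results = []
    # outer loop: splits used for performance assessment
    for train_idx, test_idx in k_fold(list(range(len(xs))), outer_k):

        def selection_score(setting):
            # inner loop: cross-validate one setting on the training data only
            scores = []
            for fit_idx, val_idx in k_fold(train_idx, inner_k):
                model = fit(subset(xs, fit_idx), subset(ys, fit_idx), setting)
                scores.append(accuracy(model, subset(xs, val_idx), subset(ys, val_idx)))
            return mean(scores)

        best = max(settings, key=selection_score)
        # retrain the selected setting on the whole training set, then assess on the test set
        final_model = fit(subset(xs, train_idx), subset(ys, train_idx), best)
        test_score = accuracy(final_model, subset(xs, test_idx), subset(ys, test_idx))
        results.append((best, test_score))
    return results


xs = [i / 20 for i in range(20)]
ys = [int(x > 0.5) for x in xs]
results = nested_cv(xs, ys, settings=[0.2, 0.5, 0.8])
print(results)  # one (selected setting, test accuracy) pair per outer split
```

The test data of each outer split is never seen by the inner loop, which is what keeps the assessment score independent of the hyperparameter choice.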
Note
If you are interested in plotting the performance of all combinations of encodings and ML methods on the test set, consider running the MLSettingsPerformance report as hyperparameter report in the assessment loop.
Specification arguments:
dataset: dataset to use for training and assessing the classifier
strategy: how to search over the different hyperparameter settings. Common approaches in general include grid search and random search; the only valid value here is GridSearch.
settings (list): a list of combinations of preprocessing_sequence, encoding and ml_method. preprocessing_sequence is optional, while encoding and ml_method are mandatory. These three options (and their parameters) can be optimized over, choosing the highest performing combination.
assessment: description of the outer loop (for assessment) of nested cross-validation. It describes how to split the data, how many splits to make, what percentage to use for training and what reports to execute on those splits. See SplitConfig.
selection: description of the inner loop (for selection) of nested cross-validation. The same as assessment argument, just to be executed in the inner loop. See SplitConfig.
metrics (list): a list of metrics (accuracy, balanced_accuracy, confusion_matrix, f1_micro, f1_macro, f1_weighted, precision, recall, auc, log_loss) to compute for all splits and settings created during the nested cross-validation. These metrics will be computed only for reporting purposes. For choosing the optimal setting, optimization_metric will be used.
optimization_metric: a metric to use for optimization (one of accuracy, balanced_accuracy, confusion_matrix, f1_micro, f1_macro, f1_weighted, precision, recall, auc, log_loss) and assessment in the nested cross-validation.
example_weighting: which example weighting strategy to use. Example weighting can be used to up-weight or down-weight the importance of each example in the dataset. These weights will be applied when computing (optimization) metrics, and are used by some encoders and ML methods.
labels (list): a list of labels for which to train the classifiers. The goal of the nested CV is to find the setting which will have best performance in predicting the given label (e.g., if a subject has experienced an immune event or not). Performance and optimal settings will be reported for each label separately. If a label is binary, instead of specifying only its name, one should explicitly set the name of the positive class as well under parameter positive_class. If positive class is not set, one of the label classes will be assumed to be positive.
number_of_processes (int): how many processes should be created at once to speed up the analysis. For personal machines, 4 or 8 is usually a good choice.
reports (list): a list of report names to be executed after the nested CV has finished to show the overall performance or some statistic; the reports that can be provided here are Train ML model reports.
refit_optimal_model (bool): whether the final combination of preprocessing, encoding and ML model should be refitted on the full dataset, providing the final model to be exported from the instruction; if False, the trained combination from one of the assessment folds will be used instead
export_all_models (bool): if set to True, all trained models in the assessment split are exported as .zip files. If False, only the optimal model is exported. By default, export_all_models is False.
sequence_type (str): whether to perform the analysis on amino acid or nucleotide sequences
region_type (str): which part of the sequence to analyze, e.g., IMGT_CDR3
YAML specification:
instructions:
  my_nested_cv_instruction: # user-defined name of the instruction
    type: TrainMLModel # which instruction should be executed
    settings: # a list of combinations of preprocessing, encoding and ml_method to optimize over
      - preprocessing: seq1 # preprocessing is optional
        encoding: e1 # mandatory field
        ml_method: simpleLR # mandatory field
      - preprocessing: seq1 # the second combination
        encoding: e2
        ml_method: simpleLR
    assessment: # outer loop of nested CV
      split_strategy: random # perform Monte Carlo CV (randomly split the data into train and test)
      split_count: 1 # how many train/test datasets to generate
      training_percentage: 0.7 # what percentage of the original data should be used for the training set
      reports: # reports to execute on training/test datasets, encoded datasets and trained ML methods
        data_splits: # list of reports to execute on training/test datasets (before they are encoded)
          - rep1
        encoding: # list of reports to execute on encoded training/test datasets
          - rep2
        models: # list of reports to execute on trained ML methods for each assessment CV split
          - rep3
    selection: # inner loop of nested CV
      split_strategy: k_fold # perform k-fold CV
      split_count: 5 # how many folds to create: here these two parameters mean: do 5-fold CV
      reports:
        data_splits: # list of reports to execute on training/test datasets (in the inner loop, so these are actually training and validation datasets)
          - rep1
        models: # list of reports to execute on trained ML methods for each selection CV split
          - rep2
        encoding: # list of reports to execute on encoded training/test datasets (again, it is training/validation here)
          - rep3
    labels: # list of labels to optimize the classifier for, as given in the metadata for the dataset
      - celiac:
          positive_class: + # if it's binary classification, positive class parameter should be set
      - T1D # this is not binary label, so no need to specify positive class
    dataset: d1 # which dataset to use for the nested CV
    strategy: GridSearch # how to choose the combinations which to test from settings (GridSearch means test all)
    metrics: # list of metrics to compute for all settings, but these do not influence the choice of optimal model
      - accuracy
      - auc
    reports: # list of reports to execute when nested CV is finished to show overall performance
      - rep4
    number_of_processes: 4 # number of parallel processes to create (could speed up the computation)
    optimization_metric: balanced_accuracy # the metric to use for choosing the optimal model and during training
    refit_optimal_model: False # use trained model, do not refit on the full dataset
    export_all_ml_settings: False # only export the optimal setting
    region_type: IMGT_CDR3
    sequence_type: AMINO_ACID
MLApplication¶
Instruction which enables using trained ML models and encoders on new datasets which do not necessarily have labeled data. When the same label is provided as the ML setting was trained for, performance metrics can be computed.
The predictions are stored in the predictions.csv in the result path in the following format:
| example_id | cmv_predicted_class | cmv_1_proba | cmv_0_proba |
|---|---|---|---|
| e1 | 1 | 0.8 | 0.2 |
| e2 | 0 | 0.2 | 0.8 |
| e3 | 1 | 0.78 | 0.22 |
If the same label that the ML setting was trained for is present in the provided dataset, the ‘true’ label value will be added to the predictions table in addition:
| example_id | cmv_predicted_class | cmv_1_proba | cmv_0_proba | cmv_true_class |
|---|---|---|---|---|
| e1 | 1 | 0.8 | 0.2 | 1 |
| e2 | 0 | 0.2 | 0.8 | 0 |
| e3 | 1 | 0.78 | 0.22 | 0 |
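When the true class column is present, the predictions file can also be checked by hand. The sketch below is only an illustration: it assumes the file is comma-separated with the column names shown in the table above, reuses the three example rows instead of opening a real predictions.csv, and the function name accuracy_from_predictions is hypothetical rather than part of immuneML.

```python
import csv
import io

# Rows copied from the example table; a real run would open predictions.csv instead.
predictions_csv = """example_id,cmv_predicted_class,cmv_1_proba,cmv_0_proba,cmv_true_class
e1,1,0.8,0.2,1
e2,0,0.2,0.8,0
e3,1,0.78,0.22,0
"""


def accuracy_from_predictions(handle, label="cmv"):
    """Fraction of rows where the predicted class equals the true class."""
    rows = list(csv.DictReader(handle))
    correct = sum(
        row[f"{label}_predicted_class"] == row[f"{label}_true_class"] for row in rows
    )
    return correct / len(rows)


acc = accuracy_from_predictions(io.StringIO(predictions_csv))
print(f"accuracy: {acc:.2f}")
```

For the three example rows, e1 and e2 are predicted correctly and e3 is not, so the accuracy is two out of three.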
Specification arguments:
dataset: dataset for which examples need to be classified
config_path: path to the zip file exported from the TrainMLModel instruction (which includes the trained ML model, encoder, preprocessing etc.)
number_of_processes (int): how many processes should be created at once to speed up the analysis. For personal machines, 4 or 8 is usually a good choice.
metrics (list): a list of metrics (accuracy, balanced_accuracy, confusion_matrix, f1_micro, f1_macro, f1_weighted, precision, recall, auc, log_loss) to compute between the true and predicted classes. These metrics will only be computed when the same label with the same classes is provided for the dataset as the original label the ML setting was trained for.
YAML specification:
instructions:
  instruction_name:
    type: MLApplication
    dataset: d1
    config_path: ./config.zip
    metrics:
      - accuracy
      - precision
      - recall
    number_of_processes: 4
ExploratoryAnalysis¶
Allows exploratory analysis of different datasets using encodings and reports.
Analysis is defined by a dictionary of ExploratoryAnalysisUnit objects that encapsulate a dataset, an encoding [optional] and a report to be executed on the [encoded] dataset. Each analysis specified under analyses is completely independent from all others.
Specification arguments:
analyses (dict): a dictionary of analyses to perform. The keys are the names of different analyses, and the values for each of the analyses are:
dataset: dataset on which to perform the exploratory analysis
preprocessing_sequence: which preprocessings to use on the dataset, this item is optional and does not have to be specified.
example_weighting: which example weighting strategy to use before encoding the data, this item is optional and does not have to be specified.
encoding: how to encode the dataset before running the report, this item is optional and does not have to be specified.
labels: if encoding is specified, the relevant labels should be specified here.
dim_reduction: which dimensionality reduction to apply to the encoded dataset, this item is optional and does not have to be specified.
report: which report to run on the dataset. Reports specified here may be of the category Data reports or Encoding reports, depending on whether ‘encoding’ was specified.
number_of_processes (int): how many processes should be created at once to speed up the analysis. For personal machines, 4 or 8 is usually a good choice.
YAML specification:
instructions:
  my_expl_analysis_instruction: # user-defined instruction name
    type: ExploratoryAnalysis # which instruction to execute
    analyses: # analyses to perform
      my_first_analysis: # user-defined name of the analysis
        dataset: d1 # dataset to use in the first analysis
        preprocessing_sequence: p1 # preprocessing sequence to use in the first analysis
        report: r1 # which report to generate using the dataset d1
      my_second_analysis: # user-defined name of another analysis
        dataset: d1 # dataset to use in the second analysis - can be the same or different from other analyses
        encoding: e1 # encoding to apply on the specified dataset (d1)
        report: r2 # which report to generate in the second analysis
        labels: # labels present in the dataset d1 which will be included in the encoded data on which report r2 will be run
          - celiac # name of the first label as present in the column of dataset's metadata file
          - CMV # name of the second label as present in the column of dataset's metadata file
      my_third_analysis: # user-defined name of another analysis
        dataset: d1 # dataset to use in the third analysis - can be the same or different from other analyses
        encoding: e1 # encoding to apply on the specified dataset (d1)
        dim_reduction: umap # or None; which dimensionality reduction method to apply to encoded d1
        report: r3 # which report to generate in the third analysis
    number_of_processes: 4 # number of parallel processes to create (could speed up the computation)
LigoSim¶
LIgO simulation instruction creates a synthetic dataset from scratch based on the generative model and a set of signals provided by the user.
Specification arguments:
simulation (str): the name of a simulation object, specified under definitions, that contains a list of SimConfigItem objects; it defines how to combine signals with simulated data
sequence_batch_size (int): how many sequences to generate at once using the generative model before checking for signals and filtering
max_iterations (int): how many iterations are allowed when creating sequences
export_p_gens (bool): whether to compute generation probabilities (if supported by the generative model) for sequences and include them as part of output
number_of_processes (int): determines how many simulation items can be simulated in parallel
YAML specification:
instructions:
  my_simulation_instruction: # user-defined name of the instruction
    type: LigoSim # which instruction to execute
    simulation: sim1
    sequence_batch_size: 1000
    max_iterations: 1000
    export_p_gens: False
    number_of_processes: 4
FeasibilitySummary¶
FeasibilitySummary instruction creates a small synthetic dataset and reports summary metrics to show if the simulation with the given parameters is feasible. The input parameters to this analysis are the name of the simulation (the same that can be used with LigoSim instruction later if feasibility analysis looks acceptable), and the number of sequences to simulate for estimating the feasibility.
The feasibility analysis is performed for each generative model separately as these could differ in the analyses that will be reported.
Specification arguments:
simulation (str): the name of a simulation object, specified under definitions, that contains a list of SimConfigItem objects; it defines how to combine signals with simulated data
sequence_count (int): how many sequences to generate to estimate feasibility (default value: 100 000)
number_of_processes (int): for the parts of the analysis that are possible to parallelize, how many processes to use
YAML specification:
instructions:
  my_feasibility_summary: # user-defined name of the instruction
    type: FeasibilitySummary # which instruction to execute
    simulation: sim1
    sequence_count: 10000
TrainGenModel¶
TrainGenModel instruction implements training generative AIRR models on receptor level. Models that can be trained for sequence generation are listed under Generative Models section.
This instruction takes a dataset as input which will be used to train a model, the model itself, and the number of sequences to generate to illustrate the applicability of the model. It can also produce reports of the fitted model and reports of original and generated sequences.
To use the generative model previously trained with immuneML, see ApplyGenModel instruction.
Specification arguments:
dataset: dataset to use for fitting the generative model; it has to be defined under definitions/datasets
method: which model to fit (defined previously under definitions/ml_methods)
number_of_processes (int): how many processes to use for fitting the model
gen_examples_count (int): how many examples (sequences, repertoires) to generate from the fitted model
reports (list): list of report ids (defined under definitions/reports) to apply after fitting a generative model and generating gen_examples_count examples; these can be data reports (to be run on generated examples) or ML reports (to be run on the fitted model)
YAML specification:
instructions:
  my_train_gen_model_inst: # user-defined instruction name
    type: TrainGenModel
    dataset: d1 # defined previously under definitions/datasets
    model: model1 # defined previously under definitions/ml_methods
    gen_examples_count: 100
    number_of_processes: 4
    training_percentage: 0.7
    export_generated_dataset: True
    export_combined_dataset: False
    reports: [data_rep1, ml_rep2]
ApplyGenModel¶
ApplyGenModel instruction implements applying generative AIRR models on the sequence level.
This instruction takes as input a trained model (trained in the TrainGenModel instruction) which will be used for generating data and the number of sequences to be generated. It can also produce reports of the applied model and reports of generated sequences.
Specification arguments:
gen_examples_count (int): how many examples (sequences, repertoires) to generate from the applied model
reports (list): list of report ids (defined under definitions/reports) to apply after generating gen_examples_count examples; these can be data reports (to be run on generated examples) or ML reports (to be run on the fitted model)
ml_config_path (str): path to the trained model in zip format (as provided by TrainGenModel instruction)
YAML specification:
instructions:
  my_apply_gen_model_inst: # user-defined instruction name
    type: ApplyGenModel
    gen_examples_count: 100
    ml_config_path: ./config.zip
    reports: [data_rep1, ml_rep2]
Clustering¶
Clustering instruction fits clustering methods to the provided encoded dataset and compares the combinations of clustering method with its hyperparameters, and encodings across a pre-defined set of metrics. The dataset is split into discovery and validation datasets and the clustering results are reported on both. Finally, it provides options to include a set of reports to visualize the results.
For more details on choosing the clustering algorithm and its hyperparameters, see the paper: Ullmann, T., Hennig, C., & Boulesteix, A.-L. (2022). Validation of cluster analysis results on validation data: A systematic framework. WIREs Data Mining and Knowledge Discovery, 12(3), e1444. https://doi.org/10.1002/widm.1444
Specification arguments:
dataset (str): name of the dataset to be clustered
metrics (list): a list of metrics to use for comparison of clustering algorithms and encodings (it can include metrics for either internal evaluation if no labels are provided or metrics for external evaluation so that the clusters can be compared against a list of predefined labels)
labels (list): an optional list of labels to use for external evaluation of clustering
split_config (SplitConfig): how to split the original dataset into discovery and validation data; for this parameter, specify split_strategy (leave_one_out_stratification, manual or random), training_percentage if split_strategy is random, and manual_config or leave_one_out_config for the corresponding split strategy; all three options are illustrated here:
split_config:
  split_strategy: manual
  manual_config:
    discovery_data: file_with_ids_of_examples_for_discovery_data.csv
    validation_data: file_with_ids_of_examples_for_validation_data.csv

split_config:
  split_strategy: random
  training_percentage: 0.5

split_config:
  split_strategy: leave_one_out_stratification
  leave_one_out_config:
    parameter: subject_id # any name of the parameter for split, must be present in the metadata
    min_count: 1 # defines the minimum number of examples that can be present in the validation dataset.
clustering_settings (list): a list where each element represents a ClusteringSetting: a combination of encoding, optional dimensionality reduction algorithm, and the clustering algorithm that will be evaluated
reports (list): a list of reports to be run on the clustering results or the encoded data
number_of_processes (int): how many processes to use for parallelization
sequence_type (str): whether to do analysis on the amino_acid or nucleotide level; this value is used only if nothing is specified on the encoder level
region_type (str): which part of the receptor sequence to analyze (e.g., IMGT_CDR3); this value is used only if nothing is specified on the encoder level
YAML specification:
instructions:
  my_clustering_instruction:
    type: Clustering
    dataset: d1
    metrics: [adjusted_rand_score, adjusted_mutual_info_score]
    labels: [epitope, v_call]
    sequence_type: amino_acid
    region_type: imgt_cdr3
    split_config:
      split_strategy: manual
      manual_config:
        discovery_data: file_with_ids_of_examples_for_discovery_data.csv
        validation_data: file_with_ids_of_examples_for_validation_data.csv
    clustering_settings:
      - encoding: e1
        dim_reduction: pca
        method: k_means1
      - encoding: e2
        method: dbscan
    reports: [rep1, rep2]
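For the random split strategy, the division into discovery and validation data can be pictured as a shuffled cut of the example ids. This is a conceptual sketch only, not immuneML's code: the function name, the rounding rule and the fixed seed are all assumptions made for illustration.

```python
import random


def discovery_validation_split(example_ids, training_percentage, seed=0):
    """Shuffle the ids and cut them into (discovery, validation) parts."""
    ids = list(example_ids)
    random.Random(seed).shuffle(ids)  # seeded so that the split is reproducible
    cut = round(len(ids) * training_percentage)  # assumed rounding rule
    return ids[:cut], ids[cut:]


example_ids = [f"e{i}" for i in range(10)]
discovery, validation = discovery_validation_split(example_ids, training_percentage=0.5)
print(len(discovery), len(validation))
```

With training_percentage set to 0.5, as in the random split_config example, half of the examples end up in each part and no example appears in both.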
DatasetExport¶
DatasetExport instruction takes a list of datasets as input, optionally applies preprocessing steps, and outputs the data in specified formats.
Specification arguments:
datasets (list): a list of datasets to export in all given formats
preprocessing_sequence (str): which preprocessing sequence to use on the dataset(s), this item is optional and does not have to be specified. When specified, the same preprocessing sequence will be applied to all datasets.
formats (list): a list of formats in which to export the datasets. Valid formats are class names of any non-abstract class inheriting DataExporter.
number_of_processes (int): how many processes to use during repertoire export (not used for sequence datasets)
YAML specification:
instructions:
  my_dataset_export_instruction: # user-defined instruction name
    type: DatasetExport # which instruction to execute
    datasets: # list of datasets to export
      - my_generated_dataset
      - my_dataset_from_adaptive
    preprocessing_sequence: my_preprocessing_sequence
    number_of_processes: 4
    export_formats: # list of formats to export the datasets to
      - AIRR
      - ImmuneML
Subsampling¶
Subsampling is an instruction that subsamples a given dataset and creates multiple smaller datasets according to the parameters provided.
Specification arguments:
dataset (str): original dataset which will be used as a basis for subsampling
subsampled_dataset_sizes (list): a list of dataset sizes (number of examples) each subsampled dataset should have
dataset_export_formats (list): in which formats to export the subsampled datasets. Valid values are: AIRR.
YAML specification:
instructions:
  my_subsampling_instruction: # user-defined name of the instruction
    type: Subsampling # which instruction to execute
    dataset: my_dataset # original dataset to be subsampled, with e.g., 300 examples
    subsampled_dataset_sizes: # how large the subsampled datasets should be, one dataset will be created for each list item
      - 200 # one subsampled dataset with 200 examples (200 repertoires if my_dataset was repertoire dataset)
      - 100 # the other subsampled dataset will have 100 examples
    dataset_export_formats: # in which formats to export the subsampled datasets
      - ImmuneML
      - AIRR
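Conceptually, each entry in subsampled_dataset_sizes yields one smaller dataset drawn from the original one. The sketch below assumes that each subsample is drawn independently and without replacement from the full dataset; the documentation above does not state this, and the function name and seed are hypothetical.

```python
import random


def subsample(example_ids, sizes, seed=0):
    """Return one list of sampled ids per requested size (assumed: without replacement)."""
    rng = random.Random(seed)  # seeded so that the subsamples are reproducible
    return [rng.sample(example_ids, size) for size in sizes]


my_dataset = [f"example_{i}" for i in range(300)]  # stand-in for a dataset with 300 examples
subsampled = subsample(my_dataset, sizes=[200, 100])
print([len(s) for s in subsampled])
```

This mirrors the YAML example: a 300-example dataset produces one subsampled dataset with 200 examples and another with 100.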