Instructions#
The different workflows that can be executed by immuneML are called instructions. Different instructions may require different analysis components (defined under definitions). This page documents all instructions and their parameters in detail. Tutorials for general usage of most instructions can be found under Tutorials.
Please use the menu on the right side of this page to navigate to the documentation for the instructions of interest, or jump to one of the following sections:
Machine learning:
Data simulation:
Data analysis, exploration and manipulation:
TrainMLModel#
MLApplication#
This instruction enables using trained ML models and encoders on new datasets, which do not necessarily have labeled data. When the dataset contains the same label that the ML setting was trained for, performance metrics can be computed.
The predictions are stored in the predictions.csv in the result path in the following format:
example_id | cmv_predicted_class | cmv_1_proba | cmv_0_proba
---|---|---|---
e1 | 1 | 0.8 | 0.2
e2 | 0 | 0.2 | 0.8
e3 | 1 | 0.78 | 0.22
If the same label that the ML setting was trained for is present in the provided dataset, the ‘true’ label value will also be added to the predictions table:
example_id | cmv_predicted_class | cmv_1_proba | cmv_0_proba | cmv_true_class
---|---|---|---|---
e1 | 1 | 0.8 | 0.2 | 1
e2 | 0 | 0.2 | 0.8 | 0
e3 | 1 | 0.78 | 0.22 | 0
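The metrics listed below are computed by comparing the predicted and true classes in this table. As a minimal illustration (using the table contents above as a stand-in for a real predictions.csv), accuracy could be computed like this:

```python
import csv
import io

# Stand-in for the predictions.csv content shown in the table above.
predictions_csv = """example_id,cmv_predicted_class,cmv_1_proba,cmv_0_proba,cmv_true_class
e1,1,0.8,0.2,1
e2,0,0.2,0.8,0
e3,1,0.78,0.22,0
"""

rows = list(csv.DictReader(io.StringIO(predictions_csv)))

# Accuracy: fraction of examples where the predicted class matches the true class.
correct = sum(r["cmv_predicted_class"] == r["cmv_true_class"] for r in rows)
accuracy = correct / len(rows)
print(accuracy)  # e1 and e2 are correct, e3 is not: 2/3
```

In the example above, immuneML would report this accuracy (along with any other requested metrics) rather than requiring manual computation.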
Specification arguments:
dataset: dataset for which examples need to be classified
config_path: path to the zip file exported by the TrainMLModel instruction (which includes the trained ML model, encoder, preprocessing, etc.)
number_of_processes (int): how many processes should be created at once to speed up the analysis. For personal machines, 4 or 8 is usually a good choice.
metrics (list): a list of metrics (accuracy, balanced_accuracy, confusion_matrix, f1_micro, f1_macro, f1_weighted, precision, recall, auc, log_loss) to compute between the true and predicted classes. These metrics will only be computed when the same label with the same classes is provided for the dataset as the original label the ML setting was trained for.
YAML specification:
instructions:
instruction_name:
type: MLApplication
dataset: d1
config_path: ./config.zip
metrics:
- accuracy
- precision
- recall
number_of_processes: 4
ExploratoryAnalysis#
Allows exploratory analysis of different datasets using encodings and reports.
Analysis is defined by a dictionary of ExploratoryAnalysisUnit objects that encapsulate a dataset, an encoding [optional] and a report to be executed on the [encoded] dataset. Each analysis specified under analyses is completely independent from all others.
Specification arguments:
analyses (dict): a dictionary of analyses to perform. The keys are the names of different analyses, and the values for each of the analyses are:
dataset: dataset on which to perform the exploratory analysis
preprocessing_sequence: which preprocessings to use on the dataset, this item is optional and does not have to be specified.
example_weighting: which example weighting strategy to use before encoding the data, this item is optional and does not have to be specified.
encoding: how to encode the dataset before running the report, this item is optional and does not have to be specified.
labels: if encoding is specified, the relevant labels should be specified here.
dim_reduction: which dimensionality reduction to apply; this is an experimental feature
report: which report to run on the dataset. Reports specified here may be of the category Data reports or Encoding reports, depending on whether ‘encoding’ was specified.
number_of_processes (int): how many processes should be created at once to speed up the analysis. For personal machines, 4 or 8 is usually a good choice.
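Conceptually, each ExploratoryAnalysisUnit runs its optional steps only when they are specified, then applies the report to the result. A minimal sketch of this control flow (toy stand-ins, not the immuneML API):

```python
# Hypothetical sketch of one ExploratoryAnalysisUnit: optional steps
# (preprocessing, encoding, dimensionality reduction) are applied in order
# only when specified, and the report runs on the final result.

def run_analysis_unit(dataset, report, preprocessing=None, encoding=None, dim_reduction=None):
    data = dataset
    for step in (preprocessing, encoding, dim_reduction):
        if step is not None:  # unspecified optional steps are simply skipped
            data = step(data)
    return report(data)

# Toy stand-ins for the components referenced in a YAML specification.
dataset = [1, 2, 3]
double = lambda d: [x * 2 for x in d]  # plays the role of an encoding
summary_report = lambda d: {"n_examples": len(d), "values": d}

result = run_analysis_unit(dataset, summary_report, encoding=double)
print(result)
```

Each analysis under `analyses` is an independent run of such a unit, which is why they can be computed in parallel across `number_of_processes`.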
YAML specification:
instructions:
my_expl_analysis_instruction: # user-defined instruction name
type: ExploratoryAnalysis # which instruction to execute
analyses: # analyses to perform
my_first_analysis: # user-defined name of the analysis
dataset: d1 # dataset to use in the first analysis
preprocessing_sequence: p1 # preprocessing sequence to use in the first analysis
report: r1 # which report to generate using the dataset d1
my_second_analysis: # user-defined name of another analysis
dataset: d1 # dataset to use in the second analysis - can be the same or different from other analyses
encoding: e1 # encoding to apply on the specified dataset (d1)
report: r2 # which report to generate in the second analysis
labels: # labels present in the dataset d1 which will be included in the encoded data on which report r2 will be run
- celiac # name of the first label as present in the column of dataset's metadata file
- CMV # name of the second label as present in the column of dataset's metadata file
my_third_analysis: # user-defined name of another analysis
dataset: d1 # dataset to use in the second analysis - can be the same or different from other analyses
encoding: e1 # encoding to apply on the specified dataset (d1)
dim_reduction: umap # or None; which dimensionality reduction method to apply to encoded d1
report: r3 # which report to generate in the third analysis
number_of_processes: 4 # number of parallel processes to create (could speed up the computation)
LigoSim#
The LigoSim (LIgO simulation) instruction creates a synthetic dataset from scratch based on a generative model and a set of signals provided by the user.
Specification arguments:
simulation (str): the name of a simulation object, specified under the definitions key, containing a list of SimConfigItem objects; defines how to combine signals with simulated data
sequence_batch_size (int): how many sequences to generate at once using the generative model before checking for signals and filtering
max_iterations (int): how many iterations are allowed when creating sequences
export_p_gens (bool): whether to compute generation probabilities (if supported by the generative model) for sequences and include them as part of output
number_of_processes (int): determines how many simulation items can be simulated in parallel
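The interplay of sequence_batch_size and max_iterations can be pictured as a batch-and-filter loop: generate a batch, keep sequences matching a signal, and repeat until enough are collected or the iteration budget runs out. A toy sketch of this idea (the generator and signal check are illustrative stand-ins, not LIgO's actual models):

```python
import random

random.seed(0)

def generate_batch(batch_size):
    # Stand-in for a generative sequence model (hypothetical; LIgO uses
    # real AIRR generative models here).
    return ["".join(random.choices("ACDEFGHIK", k=12)) for _ in range(batch_size)]

def simulate(signal_motif, target_count, sequence_batch_size, max_iterations):
    """Generate sequences in batches, keep those containing the signal motif,
    and stop once enough are collected or max_iterations is exhausted."""
    kept = []
    for _ in range(max_iterations):
        batch = generate_batch(sequence_batch_size)
        kept += [seq for seq in batch if signal_motif in seq]
        if len(kept) >= target_count:
            break
    return kept[:target_count]

sequences = simulate(signal_motif="AC", target_count=5,
                     sequence_batch_size=1000, max_iterations=10)
print(len(sequences))
```

If the signal is very rare under the generative model, max_iterations may be exhausted before target_count is reached, which is exactly what the FeasibilitySummary instruction below helps anticipate.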
YAML specification:
instructions:
my_simulation_instruction: # user-defined name of the instruction
    type: LigoSim # which instruction to execute
simulation: sim1
sequence_batch_size: 1000
max_iterations: 1000
export_p_gens: False
number_of_processes: 4
FeasibilitySummary#
FeasibilitySummary instruction creates a small synthetic dataset and reports summary metrics to show if the simulation with the given parameters is feasible. The input parameters to this analysis are the name of the simulation (the same that can be used with LigoSim instruction later if feasibility analysis looks acceptable), and the number of sequences to simulate for estimating the feasibility.
The feasibility analysis is performed for each generative model separately, as the reported analyses may differ between models.
Specification arguments:
simulation (str): the name of a simulation object, specified under the definitions key, containing a list of SimConfigItem objects; defines how to combine signals with simulated data
sequence_count (int): how many sequences to generate to estimate feasibility (default value: 100 000)
number_of_processes (int): for the parts of the analysis that are possible to parallelize, how many processes to use
YAML specification:
instructions:
my_feasibility_summary: # user-defined name of the instruction
type: FeasibilitySummary # which instruction to execute
simulation: sim1
sequence_count: 10000
TrainGenModel#
Note
This is an experimental feature
The TrainGenModel instruction implements training generative AIRR models at the receptor level. Models that can be trained for sequence generation are listed under the Generative Models section.
This instruction takes as input a dataset on which to train the model, the model itself, and the number of sequences to generate to illustrate the applicability of the model. It can also produce reports of the fitted model and reports of the original and generated sequences.
To use a generative model previously trained with immuneML, see the ApplyGenModel instruction.
Specification arguments:
dataset: dataset to use for fitting the generative model; it has to be defined under definitions/datasets
model: which model to fit (defined previously under definitions/ml_methods)
number_of_processes (int): how many processes to use for fitting the model
gen_examples_count (int): how many examples (sequences, repertoires) to generate from the fitted model
reports (list): list of report ids (defined under definitions/reports) to apply after fitting a generative model and generating gen_examples_count examples; these can be data reports (to be run on generated examples), ML reports (to be run on the fitted model)
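To make the fit-then-generate workflow concrete, here is a deliberately simplified toy generative model (a per-position character-frequency model; this is not one of immuneML's actual generative models) that is "fitted" on training sequences and then samples gen_examples_count new sequences:

```python
import random
from collections import Counter

random.seed(42)

# Toy illustration only: fit a per-position character-frequency model on
# training sequences, then sample gen_examples_count new sequences from it.
train_sequences = ["CASSLG", "CASSPG", "CASRLG", "CASSLG"]
length = len(train_sequences[0])

# "Fitting": count which characters occur at each position.
position_counts = [Counter(seq[i] for seq in train_sequences) for i in range(length)]

def generate(n):
    out = []
    for _ in range(n):
        chars = [random.choices(list(c.keys()), weights=list(c.values()))[0]
                 for c in position_counts]
        out.append("".join(chars))
    return out

gen_examples_count = 100
generated = generate(gen_examples_count)
print(generated[:3])
```

The generated sequences would then be passed to any data reports listed under `reports`, alongside ML reports run on the fitted model itself.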
YAML specification:
instructions:
my_train_gen_model_inst: # user-defined instruction name
type: TrainGenModel
dataset: d1 # defined previously under definitions/datasets
model: model1 # defined previously under definitions/ml_methods
gen_examples_count: 100
number_of_processes: 4
reports: [data_rep1, ml_rep2]
ApplyGenModel#
Note
This is an experimental feature
The ApplyGenModel instruction implements applying generative AIRR models at the sequence level.
This instruction takes as input a model trained with the TrainGenModel instruction, which will be used for generating data, and the number of sequences to generate. It can also produce reports of the applied model and reports of the generated sequences.
Specification arguments:
gen_examples_count (int): how many examples (sequences, repertoires) to generate from the applied model
reports (list): list of report ids (defined under definitions/reports) to apply after generating gen_examples_count examples; these can be data reports (to be run on generated examples), ML reports (to be run on the fitted model)
ml_config_path (str): path to the trained model in zip format (as exported by the TrainGenModel instruction)
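The trained model arrives as a zip archive, which the instruction unpacks before use. A minimal sketch of handling such an archive with the standard library (the file names inside the zip are made up for illustration and do not reflect immuneML's actual export layout):

```python
import os
import tempfile
import zipfile

tmp = tempfile.mkdtemp()
config_path = os.path.join(tmp, "config.zip")

# Create a stand-in archive, similar in spirit to a TrainGenModel export.
with zipfile.ZipFile(config_path, "w") as zf:
    zf.writestr("model_overview.yaml", "type: toy_model\n")
    zf.writestr("model_weights.bin", b"\x00\x01")

# Inspect and read the archive, as an applying instruction conceptually would.
with zipfile.ZipFile(config_path) as zf:
    members = sorted(zf.namelist())
    overview = zf.read("model_overview.yaml").decode()

print(members)
print(overview)
```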
YAML specification:
instructions:
my_apply_gen_model_inst: # user-defined instruction name
type: ApplyGenModel
gen_examples_count: 100
ml_config_path: ./config.zip
reports: [data_rep1, ml_rep2]
Clustering#
Note
This is an experimental feature
The Clustering instruction fits clustering methods to the provided encoded dataset and compares combinations of clustering methods (with their hyperparameters) and encodings across a predefined set of metrics. It also provides the option to run a set of reports to visualize the results.
Specification arguments:
dataset (str): name of the dataset to be clustered
metrics (list): a list of metrics to use for comparison of clustering algorithms and encodings (it can include metrics for either internal evaluation if no labels are provided or metrics for external evaluation so that the clusters can be compared against a list of predefined labels)
labels (list): an optional list of labels to use for external evaluation of clustering
clustering_settings (list): a list of combinations of encoding, optional dimensionality reduction algorithm, and the clustering algorithm that will be evaluated
reports (list): a list of reports to be run on the clustering results or the encoded data
number_of_processes (int): how many processes to use for parallelization
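The distinction between internal and external evaluation is that external metrics need known labels to compare clusters against. As a simple illustration of external evaluation, the sketch below computes cluster purity (a deliberately simple stand-in for the metrics named above, such as adjusted_rand_score; data is made up):

```python
from collections import Counter

# Hypothetical cluster assignments and known labels for eight examples.
cluster_assignments = [0, 0, 0, 1, 1, 2, 2, 2]
true_labels = ["A", "A", "B", "B", "B", "C", "C", "A"]

def purity(clusters, labels):
    """For each cluster, count its most common true label; purity is the
    fraction of examples covered by those majority labels."""
    total = 0
    for c in set(clusters):
        members = [lab for cl, lab in zip(clusters, labels) if cl == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(labels)

print(purity(cluster_assignments, true_labels))
```

Internal metrics, by contrast, score cluster cohesion and separation from the encoded data alone, which is why they are the fallback when no labels are provided.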
YAML specification:
instructions:
my_clustering_instruction:
type: Clustering
dataset: d1
metrics: [adjusted_rand_score, adjusted_mutual_info_score]
labels: [epitope, v_call]
clustering_settings:
- encoding: e1
dim_reduction: pca
method: k_means1
- encoding: e2
method: dbscan
reports: [rep1, rep2]
DatasetExport#
DatasetExport instruction takes a list of datasets as input, optionally applies preprocessing steps, and outputs the data in specified formats.
Specification arguments:
datasets (list): a list of datasets to export in all given formats
preprocessing_sequence (str): which preprocessing sequence to use on the dataset(s), this item is optional and does not have to be specified. When specified, the same preprocessing sequence will be applied to all datasets.
export_formats (list): a list of formats in which to export the datasets. Valid formats are class names of any non-abstract class inheriting DataExporter.
number_of_processes (int): how many processes to use during repertoire export (not used for sequence datasets)
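The "class names of exporter classes" convention can be pictured as a simple dispatch: each format name maps to an exporter class, and the same dataset is written out once per requested format. The class and method names below are illustrative stand-ins, not the immuneML API:

```python
# Sketch of the export pattern: one exporter class per format, the same
# dataset written out once per requested format (names are hypothetical).

class AIRRExporter:
    @staticmethod
    def export(dataset):
        return f"AIRR file with {len(dataset)} examples"

class ImmuneMLExporter:
    @staticmethod
    def export(dataset):
        return f"ImmuneML file with {len(dataset)} examples"

exporters = {"AIRR": AIRRExporter, "ImmuneML": ImmuneMLExporter}
dataset = ["r1", "r2", "r3"]
export_formats = ["AIRR", "ImmuneML"]

outputs = {fmt: exporters[fmt].export(dataset) for fmt in export_formats}
print(outputs)
```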
YAML specification:
instructions:
my_dataset_export_instruction: # user-defined instruction name
type: DatasetExport # which instruction to execute
datasets: # list of datasets to export
- my_generated_dataset
- my_dataset_from_adaptive
preprocessing_sequence: my_preprocessing_sequence
number_of_processes: 4
export_formats: # list of formats to export the datasets to
- AIRR
- ImmuneML
Subsampling#
Subsampling is an instruction that subsamples a given dataset and creates multiple smaller datasets according to the parameters provided.
Specification arguments:
dataset (str): original dataset which will be used as a basis for subsampling
subsampled_dataset_sizes (list): a list of dataset sizes (number of examples) each subsampled dataset should have
dataset_export_formats (list): in which formats to export the subsampled datasets. Valid values are: ImmuneML, AIRR.
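The core of the instruction can be sketched as drawing, without replacement, one subset of examples per requested size from the original dataset (the example names below are made up):

```python
import random

random.seed(1)

# Hypothetical original dataset of 300 repertoire identifiers.
original_dataset = [f"repertoire_{i}" for i in range(300)]
subsampled_dataset_sizes = [200, 100]

# One subsampled dataset per requested size, drawn without replacement.
subsampled_datasets = [random.sample(original_dataset, size)
                       for size in subsampled_dataset_sizes]

print([len(d) for d in subsampled_datasets])
```

Each resulting dataset would then be exported in every format listed under dataset_export_formats.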
YAML specification:
instructions:
my_subsampling_instruction: # user-defined name of the instruction
type: Subsampling # which instruction to execute
dataset: my_dataset # original dataset to be subsampled, with e.g., 300 examples
subsampled_dataset_sizes: # how large the subsampled datasets should be, one dataset will be created for each list item
- 200 # one subsampled dataset with 200 examples (200 repertoires if my_dataset was repertoire dataset)
- 100 # the other subsampled dataset will have 100 examples
dataset_export_formats: # in which formats to export the subsampled datasets
- ImmuneML
- AIRR