Instructions#

The different workflows that can be executed by immuneML are called instructions. Different instructions may require different analysis components (defined under definitions).
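
For orientation, the following sketch shows how instructions fit into a full YAML specification: analysis components are declared once under definitions and referenced by their user-defined names from instructions. All names, paths and parameter values below are placeholders, and the exact import, report and instruction parameters depend on the components chosen (see the corresponding documentation sections):

definitions: # analysis components, referenced by name from the instructions below
    datasets:
        d1: # user-defined dataset name
            format: AIRR
            params:
                path: path/to/data/ # placeholder: folder with AIRR-formatted files
                metadata_file: path/to/metadata.csv # placeholder: metadata file of a repertoire dataset
    reports:
        r1: SequenceLengthDistribution # user-defined report name
instructions:
    my_analysis: # user-defined instruction name
        type: ExploratoryAnalysis # which instruction to execute (documented below)
        analyses:
            a1:
                dataset: d1 # reference to the dataset defined above
                report: r1 # reference to the report defined above
output:
    format: HTML # optional; format of the overall results summary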

This page documents all instructions and their parameters in detail. Tutorials for general usage of most instructions can be found under Tutorials.

Please use the menu on the right side of this page to navigate to the documentation for the instructions of interest, or jump to one of the following sections:

Machine learning: TrainMLModel, MLApplication, TrainGenModel, ApplyGenModel, Clustering

Data simulation: LigoSim, FeasibilitySummary

Data analysis, exploration and manipulation: ExploratoryAnalysis, DatasetExport, Subsampling

TrainMLModel#

MLApplication#

Instruction that applies trained ML models and encoders to new datasets, which do not necessarily have labeled data. When the dataset contains the same label that the ML setting was trained for, performance metrics can be computed.

The predictions are stored in predictions.csv in the result path, in the following format:

example_id | cmv_predicted_class | cmv_1_proba | cmv_0_proba
---------- | ------------------- | ----------- | -----------
e1         | 1                   | 0.8         | 0.2
e2         | 0                   | 0.2         | 0.8
e3         | 1                   | 0.78        | 0.22

If the same label that the ML setting was trained for is present in the provided dataset, the true label values will be added to the predictions table as well:

example_id | cmv_predicted_class | cmv_1_proba | cmv_0_proba | cmv_true_class
---------- | ------------------- | ----------- | ----------- | --------------
e1         | 1                   | 0.8         | 0.2         | 1
e2         | 0                   | 0.2         | 0.8         | 0
e3         | 1                   | 0.78        | 0.22        | 0

Specification arguments:

  • dataset: dataset for which examples need to be classified

  • config_path: path to the zip file exported by the TrainMLModel instruction (which includes the trained ML model, encoder, preprocessing etc.)

  • number_of_processes (int): how many processes should be created at once to speed up the analysis. For personal machines, 4 or 8 is usually a good choice.

  • metrics (list): a list of metrics (accuracy, balanced_accuracy, confusion_matrix, f1_micro, f1_macro, f1_weighted, precision, recall, auc, log_loss) to compute between the true and predicted classes. These metrics are only computed when the dataset contains the same label, with the same classes, that the ML setting was originally trained for.

YAML specification:

instructions:
    instruction_name:
        type: MLApplication
        dataset: d1
        config_path: ./config.zip
        metrics:
        - accuracy
        - precision
        - recall
        number_of_processes: 4

ExploratoryAnalysis#

Allows exploratory analysis of different datasets using encodings and reports.

Analysis is defined by a dictionary of ExploratoryAnalysisUnit objects that encapsulate a dataset, an optional encoding, and a report to be executed on the (optionally encoded) dataset. Each analysis specified under analyses is completely independent of all others.

Specification arguments:

  • analyses (dict): a dictionary of analyses to perform. The keys are the names of different analyses, and the values for each of the analyses are:

    • dataset: dataset on which to perform the exploratory analysis

    • preprocessing_sequence: which preprocessings to use on the dataset; this item is optional and does not have to be specified.

    • example_weighting: which example weighting strategy to use before encoding the data; this item is optional and does not have to be specified.

    • encoding: how to encode the dataset before running the report; this item is optional and does not have to be specified.

    • labels: if encoding is specified, the relevant labels should be specified here.

    • dim_reduction: which dimensionality reduction to apply; this is an experimental feature

    • report: which report to run on the dataset. Reports specified here may be of the category Data reports or Encoding reports, depending on whether ‘encoding’ was specified.

  • number_of_processes (int): how many processes should be created at once to speed up the analysis. For personal machines, 4 or 8 is usually a good choice.

YAML specification:

instructions:
    my_expl_analysis_instruction: # user-defined instruction name
        type: ExploratoryAnalysis # which instruction to execute
        analyses: # analyses to perform
            my_first_analysis: # user-defined name of the analysis
                dataset: d1 # dataset to use in the first analysis
                preprocessing_sequence: p1 # preprocessing sequence to use in the first analysis
                report: r1 # which report to generate using the dataset d1
            my_second_analysis: # user-defined name of another analysis
                dataset: d1 # dataset to use in the second analysis - can be the same or different from other analyses
                encoding: e1 # encoding to apply on the specified dataset (d1)
                report: r2 # which report to generate in the second analysis
                labels: # labels present in the dataset d1 which will be included in the encoded data on which report r2 will be run
                    - celiac # name of the first label as present in the column of dataset's metadata file
                    - CMV # name of the second label as present in the column of dataset's metadata file
            my_third_analysis: # user-defined name of another analysis
                dataset: d1 # dataset to use in the second analysis - can be the same or different from other analyses
                encoding: e1 # encoding to apply on the specified dataset (d1)
                dim_reduction: umap # or None; which dimensionality reduction method to apply to encoded d1
                report: r3 # which report to generate in the third analysis
        number_of_processes: 4 # number of parallel processes to create (could speed up the computation)

LigoSim#

The LIgO simulation instruction creates a synthetic dataset from scratch based on a generative model and a set of signals provided by the user.

Specification arguments:

  • simulation (str): the name of a simulation object (specified under the definitions key) containing a list of SimConfigItem objects; defines how to combine signals with simulated data

  • sequence_batch_size (int): how many sequences to generate at once using the generative model before checking for signals and filtering

  • max_iterations (int): how many iterations are allowed when creating sequences

  • export_p_gens (bool): whether to compute generation probabilities (if supported by the generative model) for sequences and include them as part of output

  • number_of_processes (int): determines how many simulation items can be simulated in parallel

YAML specification:

instructions:
    my_simulation_instruction: # user-defined name of the instruction
        type: LigoSim # which instruction to execute
        simulation: sim1
        sequence_batch_size: 1000
        max_iterations: 1000
        export_p_gens: False
        number_of_processes: 4

FeasibilitySummary#

FeasibilitySummary instruction creates a small synthetic dataset and reports summary metrics to show whether the simulation with the given parameters is feasible. The input parameters to this analysis are the name of the simulation (the same simulation that can later be used with the LigoSim instruction if the feasibility analysis looks acceptable) and the number of sequences to simulate for estimating the feasibility.

The feasibility analysis is performed for each generative model separately, as the reported analyses can differ between models.

Specification arguments:

  • simulation (str): the name of a simulation object (specified under the definitions key) containing a list of SimConfigItem objects; defines how to combine signals with simulated data

  • sequence_count (int): how many sequences to generate to estimate feasibility (default value: 100 000)

  • number_of_processes (int): for the parts of the analysis that are possible to parallelize, how many processes to use

YAML specification:

instructions:
    my_feasibility_summary: # user-defined name of the instruction
        type: FeasibilitySummary # which instruction to execute
        simulation: sim1
        sequence_count: 10000

TrainGenModel#

Note

This is an experimental feature

TrainGenModel instruction implements training of generative AIRR models at the receptor level. Models that can be trained for sequence generation are listed under the Generative Models section.

This instruction takes as input a dataset that will be used to train the model, the model itself, and the number of sequences to generate to illustrate the applicability of the model. It can also produce reports of the fitted model and reports of the original and generated sequences.

To use a generative model previously trained with immuneML, see the ApplyGenModel instruction.

Specification arguments:

  • dataset: dataset to use for fitting the generative model; it has to be defined under definitions/datasets

  • method: which model to fit (defined previously under definitions/ml_methods)

  • number_of_processes (int): how many processes to use for fitting the model

  • gen_examples_count (int): how many examples (sequences, repertoires) to generate from the fitted model

  • reports (list): list of report ids (defined under definitions/reports) to apply after fitting a generative model and generating gen_examples_count examples; these can be data reports (to be run on the generated examples) or ML reports (to be run on the fitted model)

YAML specification:

instructions:
    my_train_gen_model_inst: # user-defined instruction name
        type: TrainGenModel
        dataset: d1 # defined previously under definitions/datasets
        method: model1 # defined previously under definitions/ml_methods
        gen_examples_count: 100
        number_of_processes: 4
        reports: [data_rep1, ml_rep2]

ApplyGenModel#

Note

This is an experimental feature

ApplyGenModel instruction implements applying generative AIRR models at the sequence level.

This instruction takes as input a model previously trained with the TrainGenModel instruction, which will be used to generate data, and the number of sequences to generate. It can also produce reports of the applied model and reports of the generated sequences.

Specification arguments:

  • gen_examples_count (int): how many examples (sequences, repertoires) to generate from the applied model

  • reports (list): list of report ids (defined under definitions/reports) to apply after generating gen_examples_count examples; these can be data reports (to be run on the generated examples) or ML reports (to be run on the fitted model)

  • ml_config_path (str): path to the trained model in zip format (as provided by the TrainGenModel instruction)

YAML specification:

instructions:
    my_apply_gen_model_inst: # user-defined instruction name
        type: ApplyGenModel
        gen_examples_count: 100
        ml_config_path: ./config.zip
        reports: [data_rep1, ml_rep2]

Clustering#

Note

This is an experimental feature

The Clustering instruction fits clustering methods to the provided encoded dataset and compares combinations of encoding, clustering method and hyperparameters across a predefined set of metrics. Optionally, a set of reports can be included to visualize the results.

Specification arguments:

  • dataset (str): name of the dataset to be clustered

  • metrics (list): a list of metrics to use for comparing clustering algorithms and encodings; it can include internal evaluation metrics (used when no labels are provided) or external evaluation metrics (so that the clusters can be compared against a list of predefined labels)

  • labels (list): an optional list of labels to use for external evaluation of clustering

  • clustering_settings (list): a list of combinations of encoding, optional dimensionality reduction algorithm, and the clustering algorithm that will be evaluated

  • reports (list): a list of reports to be run on the clustering results or the encoded data

  • number_of_processes (int): how many processes to use for parallelization

YAML specification:

instructions:
    my_clustering_instruction:
        type: Clustering
        dataset: d1
        metrics: [adjusted_rand_score, adjusted_mutual_info_score]
        labels: [epitope, v_call]
        clustering_settings:
            - encoding: e1
              dim_reduction: pca
              method: k_means1
            - encoding: e2
              method: dbscan
        reports: [rep1, rep2]
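
If no labels are available, the labels key can be omitted and internal evaluation metrics listed instead. The following is a minimal sketch of such a setting; the metric name silhouette_score is assumed to follow scikit-learn naming and should be checked against the list of supported clustering metrics before use:

instructions:
    my_unlabeled_clustering: # user-defined instruction name
        type: Clustering
        dataset: d1
        metrics: [silhouette_score] # assumed internal-evaluation metric name (scikit-learn style)
        clustering_settings:
            - encoding: e1
              method: k_means1
        reports: [rep1]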

DatasetExport#

DatasetExport instruction takes a list of datasets as input, optionally applies preprocessing steps, and outputs the data in specified formats.

Specification arguments:

  • datasets (list): a list of datasets to export in all given formats

  • preprocessing_sequence (str): which preprocessing sequence to use on the dataset(s); this item is optional and does not have to be specified. When specified, the same preprocessing sequence will be applied to all datasets.

  • export_formats (list): a list of formats in which to export the datasets. Valid formats are class names of any non-abstract class inheriting DataExporter.

  • number_of_processes (int): how many processes to use during repertoire export (not used for sequence datasets)

YAML specification:

instructions:
    my_dataset_export_instruction: # user-defined instruction name
        type: DatasetExport # which instruction to execute
        datasets: # list of datasets to export
            - my_generated_dataset
            - my_dataset_from_adaptive
        preprocessing_sequence: my_preprocessing_sequence
        number_of_processes: 4
        export_formats: # list of formats to export the datasets to
            - AIRR
            - ImmuneML

Subsampling#

Subsampling is an instruction that subsamples a given dataset and creates multiple smaller datasets according to the parameters provided.

Specification arguments:

  • dataset (str): original dataset which will be used as a basis for subsampling

  • subsampled_dataset_sizes (list): a list of dataset sizes (number of examples) each subsampled dataset should have

  • dataset_export_formats (list): in which formats to export the subsampled datasets. Valid values are: ImmuneML, AIRR.

YAML specification:

instructions:
    my_subsampling_instruction: # user-defined name of the instruction
        type: Subsampling # which instruction to execute
        dataset: my_dataset # original dataset to be subsampled, with e.g., 300 examples
        subsampled_dataset_sizes: # how large the subsampled datasets should be, one dataset will be created for each list item
            - 200 # one subsampled dataset with 200 examples (200 repertoires if my_dataset was repertoire dataset)
            - 100 # the other subsampled dataset will have 100 examples
        dataset_export_formats: # in which formats to export the subsampled datasets
            - ImmuneML
            - AIRR