How to train ML models in Galaxy

The Galaxy tool 'Train machine learning models' runs hyperparameter optimization over several ML settings, where each setting combines an ML model and its parameters, an encoding and, optionally, preprocessing steps. Nested cross-validation is used to identify the optimal combination of ML settings.

This is a YAML-based Galaxy tool. If you prefer a button-based interface that assumes less ML knowledge, see the tutorials for training ML models for receptor and repertoire classification using the easy Galaxy interfaces.

An example Galaxy history showing how to use this tool can be found here.

Creating the YAML specification

This Galaxy tool takes as input an immuneML dataset from the Galaxy history, optional additional files, and a YAML specification file.

To train ML models in immuneML, the TrainMLModel instruction should be used. At least one ML method and one encoding must be specified, and preprocessings can optionally be included in the hyperparameter optimization. Reports may be specified to export plots and statistics that give more insight into the dataset or the process of training ML models. Constructing a YAML specification for training ML models is described in more detail in the tutorial How to train and assess a receptor or repertoire-level ML classifier. Note that the ML methods DeepRC and TCRdistClassifier are not available on Galaxy.

When writing an analysis specification for Galaxy, it can be assumed that all selected files are present in the current working directory. A path to an additional file thus consists only of the filename. Note that in Galaxy, it is only possible to train ML models for one label at a time.
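
Because all selected files end up in the working directory, any local path has to be reduced to its bare filename before it is written into the specification. The following is a minimal Python sketch of that reduction; the helper name galaxy_path and the example path are made up for illustration and are not part of immuneML or Galaxy.

```python
from pathlib import PurePosixPath


def galaxy_path(local_path):
    """Return only the filename part of a path (hypothetical helper)."""
    return PurePosixPath(local_path).name


# The path below is an invented example of a local additional file.
print(galaxy_path("/home/user/project/data/reference_sequences.tsv"))
# prints: reference_sequences.tsv
```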

A complete YAML specification for training ML models is shown here:

definitions:
  datasets:
    dataset: # user-defined dataset name
      format: ImmuneML # the default format used by the 'Create dataset' Galaxy tool is ImmuneML
      params:
        path: dataset.iml_dataset # specify the dataset file name; the default name used by
                                  # the 'Create dataset' Galaxy tool is dataset.iml_dataset

  encodings:
    my_3mer_encoding: # user-defined encoding name
      KmerFrequency:
        k: 3
    my_5mer_encoding:
      KmerFrequency:
        k: 5

  ml_methods:
    my_logistic_regression:
      LogisticRegression:
        C:
        - 0.01
        - 0.1
        - 1
        - 10
        - 100
        show_warnings: false # disabling scikit-learn warnings is recommended for Galaxy users
      model_selection_cv: true     # use scikit-learn's 5-fold cross-validation to search
      model_selection_n_folds: 5   # for the optimal value of hyperparameter C

  reports:
    my_benchmark: MLSettingsPerformance
    my_coefficients:
      Coefficients:
        coefs_to_plot:
        - N_LARGEST
        n_largest:
        - 25

instructions:
  my_training_instruction: # user-defined instruction name
    type: TrainMLModel

    dataset: dataset # select the dataset defined above
    labels:          # only one label can be specified here
    - signal_disease

    settings:        # which combinations of ML settings to run
    - encoding: my_3mer_encoding
      ml_method: my_logistic_regression
    - encoding: my_5mer_encoding
      ml_method: my_logistic_regression

    assessment: # parameters in the assessment (outer) cross-validation loop
      reports:
        models:
        - my_coefficients  # run the 'coefficients' report on all the models
      split_count: 3
      split_strategy: random
      training_percentage: 0.7
    selection:  # parameters in the selection (inner) cross-validation loop
      split_count: 1
      split_strategy: random
      training_percentage: 0.7

    reports: # train ML model reports to run
    - my_benchmark

    optimization_metric: balanced_accuracy # the metric to optimize during nested cross-validation
    metrics: # other metrics to compute
    - accuracy
    - auc
    strategy: GridSearch # strategy for hyperparameter optimization; GridSearch is currently the only available option
    refit_optimal_model: true # whether to retrain the model on the whole dataset after optimizing hyperparameters
    number_of_processes: 4 # processes for parallelization
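
To make the nested cross-validation in this specification more concrete, the Python sketch below enumerates the combinations evaluated in the selection (inner) loop and illustrates what a random split with a training_percentage of 0.7 amounts to. It is a rough illustration and not immuneML's implementation: the helper names, the seeds, the rounding rule and the dataset size of 100 examples are assumptions made for this sketch.

```python
import random
from itertools import product

# Values copied from the YAML specification.
settings = [
    ("my_3mer_encoding", "my_logistic_regression"),
    ("my_5mer_encoding", "my_logistic_regression"),
]
assessment_split_count = 3  # outer loop
selection_split_count = 1   # inner loop
training_percentage = 0.7


def random_split(n_examples, training_percentage, seed):
    """Shuffle example positions and cut them into two parts (hypothetical helper)."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_examples * training_percentage)
    return indices[:n_train], indices[n_train:]


def inner_loop_runs(settings, n_outer, n_inner):
    """List every (outer split, inner split, encoding, ML method) combination."""
    return [
        (outer, inner, encoding, ml_method)
        for outer, inner, (encoding, ml_method) in product(
            range(1, n_outer + 1), range(1, n_inner + 1), settings
        )
    ]


runs = inner_loop_runs(settings, assessment_split_count, selection_split_count)
print(len(runs))  # 3 outer splits x 1 inner split x 2 settings = 6

# Outer split: 70% of an assumed 100 examples for training, 30% for testing.
outer_train, outer_test = random_split(100, training_percentage, seed=0)
# Inner split: the outer training part is split 70/30 again
# (the returned values are positions within outer_train).
inner_train, inner_validation = random_split(len(outer_train), training_percentage, seed=1)
print(len(outer_train), len(outer_test), len(inner_train), len(inner_validation))
# prints: 70 30 49 21
```

Under these assumptions, each ML setting is compared on 49 training and 21 validation examples within every outer split, and the outer test part of 30 examples stays untouched during selection.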

Tool output

This Galaxy tool will produce the following history elements:

  • Summary: ML model training: an HTML page that allows you to browse through all results, including prediction accuracies on the various data splits and report results.

  • Archive: ML model training: a .zip file containing the complete output folder as it was produced by immuneML. This folder contains the output of the TrainMLModel instruction including all trained models and their predictions, and report results. Furthermore, the folder contains the complete YAML specification file for the immuneML run, the HTML output and a log file.

  • optimal_ml_settings.zip: a .zip file containing the raw files for the optimal trained ML settings (ML model, encoding, and optionally preprocessing steps). This .zip file can subsequently be used as an input when applying previously trained ML models to a new dataset. Currently, this can only be done locally using the command-line interface.
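
Before using optimal_ml_settings.zip in a local immuneML run, it can be useful to check what the downloaded archive contains. The sketch below relies only on Python's standard zipfile module; the helper name list_archive is hypothetical, and no assumption is made about the file names inside the archive, since these are not listed here.

```python
import zipfile
from pathlib import Path


def list_archive(zip_path):
    """Return the sorted names of all entries in a .zip archive (hypothetical helper)."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(archive.namelist())


if __name__ == "__main__":
    # Assumes the archive was downloaded from the Galaxy history
    # into the current working directory.
    path = Path("optimal_ml_settings.zip")
    if path.exists():
        for name in list_archive(path):
            print(name)
    else:
        print(f"{path} not found; download it from the Galaxy history first")
```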