How to run any AIRR ML analysis in Galaxy

To run an arbitrary YAML-based immuneML analysis in Galaxy, use the tool Run immuneML with YAML specification. For creating datasets, simulating synthetic data, implanting synthetic immune signals or training ML models, the analysis-specific Galaxy tools are generally recommended over this tool. Those tools can export the relevant output files as Galaxy history elements.

However, when you want to run the ExploratoryAnalysis instruction, or other analyses that do not have a corresponding Galaxy tool, this generic tool can be used.

An example Galaxy history showing how to use this tool can be found here.

Creating the YAML specification

This Galaxy tool takes as input an immuneML dataset from the Galaxy history, optional additional files, and a YAML specification file. For details on writing the YAML specification, see How to specify an analysis with YAML.

When writing an analysis specification for Galaxy, it can be assumed that all files selected under ‘Additional files’ are present in the current working directory. A path to an additional file thus consists only of the filename.

The following YAML specification shows an example of how to run the ExploratoryAnalysis instruction inside Galaxy:

definitions:
  datasets:
    dataset: # user-defined dataset name
      format: ImmuneML # the default format used by the 'Create dataset' Galaxy tool is ImmuneML
      params:
        path: dataset.iml_dataset # specify the dataset name, the default name used by
                                  # the 'Create dataset' Galaxy tool is dataset.iml_dataset
  encodings:
    my_sequence_matches:
      MatchedSequences:
        reference:
          params:
            path: reference_sequences.tsv # this file must be selected from the Galaxy history as an 'additional file'
          format: AIRR

  reports:
    my_seq_lengths: SequenceLengthDistribution # reports without parameters
    my_matches: Matches

instructions:
  my_instruction: # user-defined instruction name
    type: ExploratoryAnalysis
    analyses:
      my_analysis_1: # user-defined analysis name
        dataset: dataset
        report: my_seq_lengths
      my_analysis_2:
        dataset: dataset
        encoding: my_sequence_matches
        report: my_matches

All files referenced in the YAML can be found in the example Galaxy history.
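
Because additional files are placed in the working directory, every path in a specification written for Galaxy should be a bare filename. The following is a minimal sketch of how this could be checked before uploading a specification. It is not part of immuneML or Galaxy: the function name find_non_local_paths and the example dictionary are hypothetical, and the sketch assumes that file locations always appear under a key named path, as they do in the example above.

```python
from pathlib import PurePosixPath


def find_non_local_paths(spec, key="path", trail=()):
    """Return (location, value) pairs for each `path` entry that is not a bare filename.

    `spec` is the specification as nested dicts/lists, e.g. as loaded from YAML.
    Assumption: file locations are always stored under a key called `path`.
    """
    problems = []
    if isinstance(spec, dict):
        for name, value in spec.items():
            here = trail + (str(name),)
            if name == key and isinstance(value, str):
                # A bare filename has no directory component, so its final
                # path component equals the whole string.
                if PurePosixPath(value).name != value:
                    problems.append(("/".join(here), value))
            else:
                problems.extend(find_non_local_paths(value, key, here))
    elif isinstance(spec, list):
        for index, item in enumerate(spec):
            problems.extend(find_non_local_paths(item, key, trail + (str(index),)))
    return problems


# Hypothetical specification fragment: the reference file path wrongly
# includes a directory, the dataset path is a bare filename.
example_spec = {
    "definitions": {
        "datasets": {
            "dataset": {
                "format": "ImmuneML",
                "params": {"path": "dataset.iml_dataset"},
            }
        },
        "encodings": {
            "my_sequence_matches": {
                "MatchedSequences": {
                    "reference": {
                        "params": {"path": "data/reference_sequences.tsv"},
                        "format": "AIRR",
                    }
                }
            }
        },
    }
}

for location, value in find_non_local_paths(example_spec):
    print(f"{location}: '{value}' should be a bare filename")
```

If PyYAML is installed, a specification file could be turned into such a nested dictionary with yaml.safe_load before being passed to the function.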

Tool output

This Galaxy tool will produce the following history elements:

  • Summary: immuneML analysis: an HTML page that allows you to browse through all results.

  • ImmuneML Analysis Archive: a .zip file containing the complete output folder as it was produced by immuneML. This folder contains the output of the instruction that was used, including all raw data files. Furthermore, the folder contains the complete YAML specification file for the immuneML run, the HTML output and a log file.
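
After downloading the archive from the Galaxy history, its contents can be inspected without unpacking it. The sketch below uses only the Python standard library and groups the archived files by file extension, which makes it easy to locate the YAML specification and the HTML output. The function name group_archive_members and the archive filename in the usage line are hypothetical, and the sketch makes no assumption about how immuneML names the files inside the archive.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def group_archive_members(archive):
    """Group the files in a .zip archive by lower-cased file extension.

    `archive` may be a path to the .zip file or a binary file-like object.
    Files without an extension are grouped under the empty string.
    """
    groups = defaultdict(list)
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            groups[PurePosixPath(name).suffix.lower()].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical filename: use the name of the archive you downloaded.
    groups = group_archive_members("immuneML_analysis_archive.zip")
    for suffix in (".yaml", ".html"):
        print(suffix, groups.get(suffix, []))
```

To unpack the complete output folder instead, zipfile.ZipFile(...).extractall() can be called with a target directory.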