How to perform an exploratory data analysis

To explore preprocessing, encodings, and/or reports without running a machine learning algorithm, use the ExploratoryAnalysis instruction. The components in the definitions section are defined in the same way as for all other instructions (see How to specify an analysis with YAML; for importing a dataset, see How to import data into immuneML).

The instruction consists of a list of analyses to be performed. Each analysis should contain at least a dataset and a report. Optionally, the analysis may also contain an encoding along with labels if applicable. In the example below, two analyses are done:

  • my_analysis_1 runs report my_seq_lengths directly on dataset my_dataset

  • my_analysis_2 first encodes my_dataset using my_regex_matches before running report my_matches.

    exploratory_analysis.yaml
    definitions:
      datasets:
        # imported datasets
        my_dataset: # user-defined dataset name
          format: AIRR
          params:
            metadata_file: path/to/metadata.csv
            path: path/to/data/
    
      encodings:
        my_regex_matches:
          MatchedRegex:
            motif_filepath: path/to/regex_file.tsv
    
      reports:
        my_seq_lengths: SequenceLengthDistribution # reports without parameters
        my_matches: Matches
    
    instructions:
      my_instruction: # user-defined instruction name
        type: ExploratoryAnalysis
        analyses:
          my_analysis_1: # user-defined analysis name
            dataset: my_dataset
            report: my_seq_lengths
          my_analysis_2:
            dataset: my_dataset
            encoding: my_regex_matches
            report: my_matches
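
The rule that each analysis needs at least a dataset and a report, and that every referenced name must exist under definitions, can be checked before a run. The sketch below is a hypothetical pre-flight helper, not part of immuneML: `check_analyses` and its message format are assumptions made for illustration, and the specification is written as a plain Python dictionary mirroring the YAML above rather than parsed from a file.

```python
# Hypothetical pre-flight check; this helper is NOT part of immuneML.
# The dictionary mirrors the structure of exploratory_analysis.yaml.

# Analysis key -> (section under "definitions", whether the key is mandatory)
REFERENCES = {
    "dataset": ("datasets", True),
    "report": ("reports", True),
    "encoding": ("encodings", False),  # encodings are optional
}


def check_analyses(spec: dict) -> list:
    """Return human-readable problems found in ExploratoryAnalysis instructions."""
    definitions = spec.get("definitions", {})
    problems = []
    for instr_name, instr in spec.get("instructions", {}).items():
        if instr.get("type") != "ExploratoryAnalysis":
            continue  # other instruction types are ignored by this sketch
        for name, analysis in instr.get("analyses", {}).items():
            for key, (section, required) in REFERENCES.items():
                ref = analysis.get(key)
                if ref is None:
                    if required:
                        problems.append(f"{instr_name}/{name}: missing '{key}'")
                elif ref not in definitions.get(section, {}):
                    problems.append(
                        f"{instr_name}/{name}: {key} '{ref}' not found in definitions/{section}"
                    )
    return problems


spec = {
    "definitions": {
        "datasets": {"my_dataset": {"format": "AIRR"}},
        "encodings": {"my_regex_matches": {"MatchedRegex": {}}},
        "reports": {
            "my_seq_lengths": "SequenceLengthDistribution",
            "my_matches": "Matches",
        },
    },
    "instructions": {
        "my_instruction": {
            "type": "ExploratoryAnalysis",
            "analyses": {
                "my_analysis_1": {"dataset": "my_dataset", "report": "my_seq_lengths"},
                "my_analysis_2": {
                    "dataset": "my_dataset",
                    "encoding": "my_regex_matches",
                    "report": "my_matches",
                },
            },
        }
    },
}

print(check_analyses(spec))
```

An empty list means every analysis names a dataset and a report and all references resolve. This only checks the cross-references; whether the parameters of each component are valid is still decided by immuneML itself.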
    

The file regex_file.tsv (referenced by motif_filepath above) must be a tab-separated file. For example, it may contain the following:

    id    TRB_regex
    1     ACG
    2     EDNA
    3     DFWG
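
A file with this layout can be produced with Python's standard csv module. The following is a minimal sketch: the helper names `write_regex_file` and `read_regex_file` are hypothetical, the column names and rows are copied from the example above, and the file is written to a temporary directory purely for demonstration.

```python
import csv
import tempfile
from pathlib import Path

# Column names taken from the example table above.
HEADER = ["id", "TRB_regex"]


def write_regex_file(path, rows):
    """Write (id, regex) pairs as a tab-separated file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(HEADER)
        writer.writerows(rows)


def read_regex_file(path):
    """Read the file back as a list of dictionaries keyed by column name."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Demonstration in a temporary directory; in practice the path would be
# the one given as motif_filepath in the YAML specification.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "regex_file.tsv"
    write_regex_file(path, [("1", "ACG"), ("2", "EDNA"), ("3", "DFWG")])
    text = path.read_text()
    records = read_regex_file(path)

print(text)
```

Reading the file back with `csv.DictReader` is a quick way to confirm that the columns are separated by tabs rather than spaces, which is an easy mistake to make when the file is edited by hand.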