YAML specification

The YAML specification defines which analysis should be performed by immuneML. It is defined under three main keywords:

  • definitions - describing the settings of datasets, encodings, ml_methods, preprocessing_sequences, reports, simulations and other components,

  • instructions - describing the parameters of the analysis that will be performed and which of the analysis components (defined under definitions) will be used for it,

  • output - describing how to format the results of the analysis (currently, only HTML output is supported).

The purpose of this page is to list all the YAML specification options. If you are not familiar with the YAML specification and want to get started, see How to specify an analysis with YAML.

The overall structure of the YAML specification is the following:

definitions: # mandatory keyword
  datasets: # mandatory keyword
    my_dataset_1: # user-defined name of the dataset
      ... # see below for the specification of the dataset
  encodings: # optional keyword - present if encodings are used
    my_encoding_1: # user-defined name of the encoding
      ... # see below for the specification of different encodings
  ml_methods: # optional keyword - present if ML methods are used
    my_ml_method_1: # user-defined name of the ML method
      ml_method_class_name: # see below for the specification of different ML methods
        ... # parameters of the method if any (if none are specified, default values are used)
      # the parameters model_selection_cv and model_selection_n_folds can be specified for any ML method used and define if there will be
      # an internal cross-validation for the given method (if used with TrainMLModel instruction, this will result in the third nested CV, but only over method parameters)
      model_selection_cv: False # whether to use cross-validation and random search to estimate the optimal parameters for one split to train/test (True/False)
      model_selection_n_folds: -1 # number of folds if cross-validation is used for model selection and optimal parameter estimation
  preprocessing_sequences: # optional keyword - present if preprocessing sequences are used
    my_preprocessing: # user-defined name of the preprocessing sequence
      ... # see below for the specification of different preprocessing
  reports: # optional keyword - present if reports are used
    my_report_1:
      ... # see below for the specification of different reports
instructions: # mandatory keyword - at least one instruction has to be specified
  my_instruction_1: # user-defined name of the instruction
    ... # see below for the specification of different instructions
output: # how to present the result after running (the only valid option now)
  format: HTML

A diagram of the different dataset types, preprocessing steps, encodings, ML methods and reports, and how they can be combined in different analyses is shown below. The solid lines represent components that should be used together, and the dashed lines indicate optional combinations.

Analysis paths

Definitions

Datasets

AIRR

Imports data in AIRR format into a Repertoire-, Sequence- or ReceptorDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets or ReceptorDatasets should be used when predicting values for unpaired (single-chain) and paired immune receptors respectively, like antigen specificity.

The AIRR .tsv format is explained here: https://docs.airr-community.org/en/stable/datarep/format.html and the AIRR rearrangement schema can be found here: https://docs.airr-community.org/en/stable/datarep/rearrangements.html

When importing a ReceptorDataset, the AIRR field cell_id is used to determine the chain pairs.

Arguments:

path (str): For RepertoireDatasets, this is the path to a directory with AIRR files to import. For Sequence- or ReceptorDatasets, this may either be the path to the file to import, or the path to a folder containing one or more files with .tsv, .csv or .txt extensions. By default, path is set to the current working directory.

is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset or ReceptorDataset. By default, is_repertoire is set to True.

metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. Only the AIRR files included under the column ‘filename’ are imported into the RepertoireDataset. For setting SequenceDataset metadata, metadata_file is ignored, see metadata_column_mapping instead.

paired (bool): Required for Sequence- or ReceptorDatasets. This parameter determines whether to import a SequenceDataset (paired = False) or a ReceptorDataset (paired = True). In a ReceptorDataset, two sequences with chain types specified by receptor_chains are paired together based on the identifier given in the AIRR column named ‘cell_id’.

receptor_chains (str): Required for ReceptorDatasets. Determines which pair of chains to import for each Receptor. Valid values are TRA_TRB, TRG_TRD, IGH_IGL, IGH_IGK. If receptor_chains is not provided, the chain pair is automatically detected (only one chain pair type allowed per repertoire).

import_productive (bool): Whether productive sequences (with value ‘T’ in column productive) should be included in the imported sequences. By default, import_productive is True.

import_with_stop_codon (bool): Whether sequences with stop codons (with value ‘T’ in column stop_codon) should be included in the imported sequences. This only applies if column stop_codon is present. By default, import_with_stop_codon is False.

import_out_of_frame (bool): Whether out of frame sequences (with value ‘F’ in column vj_in_frame) should be included in the imported sequences. This only applies if column vj_in_frame is present. By default, import_out_of_frame is False.

import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.

import_empty_nt_sequences (bool): Whether to import sequences that have an empty nucleotide sequence field. By default, import_empty_nt_sequences is set to True.

import_empty_aa_sequences (bool): Whether to import sequences that have an empty amino acid sequence field. For analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.

region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as AIRR uses the IMGT junction. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.
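
As an illustration of the region_type parameter, the following sketch imports the junction sequences unchanged instead of trimming them to the IMGT CDR3. The dataset name and the paths are placeholders chosen for this example; all other parameters keep their defaults.

```yaml
my_airr_junction_dataset: # hypothetical user-defined dataset name
    format: AIRR
    params:
        path: path/to/files/ # placeholder directory with AIRR files
        metadata_file: path/to/metadata.csv # placeholder metadata file
        # any value other than IMGT_CDR3 imports the sequences as they are,
        # so the first and last amino acids of the junction are kept
        region_type: IMGT_JUNCTION
```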

column_mapping (dict): A mapping from AIRR column names to immuneML’s internal data representation. A custom column mapping can be specified here if necessary (for example, to add additional data fields if they are present in the AIRR file, or to use alternative column names). Valid immuneML fields that can be specified here are [‘sequence_aas’, ‘sequences’, ‘v_genes’, ‘j_genes’, ‘v_subgroups’, ‘j_subgroups’, ‘v_alleles’, ‘j_alleles’, ‘chains’, ‘counts’, ‘frame_types’, ‘sequence_identifiers’, ‘cell_ids’]. For AIRR, this is by default set to:

junction: sequences
junction_aa: sequence_aas
v_call: v_alleles
j_call: j_alleles
locus: chains
duplicate_count: counts
sequence_id: sequence_identifiers

column_mapping_synonyms (dict): This is a column mapping that can be used if a column could have alternative names. The formatting is the same as column_mapping. If some columns specified in column_mapping are not found in the file, the columns specified in column_mapping_synonyms are instead attempted to be loaded. For AIRR format, there is no default column_mapping_synonyms.

metadata_column_mapping (dict): Specifies metadata for Sequence- and ReceptorDatasets. This should specify a mapping similar to column_mapping where keys are AIRR column names and values are the names that are internally used in immuneML as metadata fields. These metadata fields can be used as prediction labels for Sequence- and ReceptorDatasets. This parameter can also be used to specify sequence-level metadata columns for RepertoireDatasets, which can be used by reports. To set prediction label metadata for RepertoireDatasets, see metadata_file instead. For AIRR format, there is no default metadata_column_mapping.

separator (str): Column separator. For AIRR, this is by default “\t”.

YAML specification:

my_airr_dataset:
    format: AIRR
    params:
        path: path/to/files/
        is_repertoire: True # whether to import a RepertoireDataset
        metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
        metadata_column_mapping: # metadata column mapping AIRR: immuneML for Sequence- or ReceptorDataset
            airr_column_name1: metadata_label1
            airr_column_name2: metadata_label2
        import_productive: True # whether to include productive sequences in the dataset
        import_with_stop_codon: False # whether to include sequences with stop codon in the dataset
        import_out_of_frame: False # whether to include out of frame sequences in the dataset
        import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
        import_empty_nt_sequences: True # keep sequences even if the `sequences` column is empty (provided that other fields are as specified here)
        import_empty_aa_sequences: False # remove all sequences with empty `sequence_aas` column
        # Optional fields with AIRR-specific defaults, only change when different behavior is required:
        separator: "\t" # column separator
        region_type: IMGT_CDR3 # what part of the sequence to import
        column_mapping: # column mapping AIRR: immuneML
            junction: sequences
            junction_aa: sequence_aas
            v_call: v_alleles
            j_call: j_alleles
            locus: chains
            duplicate_count: counts
            sequence_id: sequence_identifiers
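
The specification above imports a RepertoireDataset. As a complementary sketch, a paired ReceptorDataset could be imported from a single AIRR file by combining the is_repertoire, paired and receptor_chains parameters described earlier. The dataset name, file path and metadata column names are placeholders chosen for this example.

```yaml
my_airr_receptor_dataset: # hypothetical user-defined dataset name
    format: AIRR
    params:
        path: path/to/receptors.tsv # placeholder path to one AIRR file
        is_repertoire: False # import individual receptors, not repertoires
        paired: True # True gives a ReceptorDataset, False a SequenceDataset
        receptor_chains: TRA_TRB # chains are paired using the AIRR cell_id column
        metadata_column_mapping: # placeholder column that can serve as a prediction label
            airr_column_name1: metadata_label1
```

Note that metadata_file is left out here, since it is ignored for Sequence- and ReceptorDatasets.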

Generic

Imports data from any tabular file into a Repertoire-, Sequence- or ReceptorDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets or ReceptorDatasets should be used when predicting values for unpaired (single-chain) and paired immune receptors respectively, like antigen specificity.

This importer works similarly to other importers, but has no predefined default values for which fields are imported, and can therefore be tailored to import data from various different tabular files with headers.

For ReceptorDatasets: this importer assumes the two receptor sequences appear on different lines in the file, and can be paired together by a common sequence identifier. If you instead want to import a ReceptorDataset from a tabular file that contains both receptor chains on one line, see SingleLineReceptor import.

Arguments:

path (str): For RepertoireDatasets, this is the path to a directory with files to import. For Sequence- or ReceptorDatasets, this may either be the path to the file to import, or the path to a folder containing one or more files with .tsv, .csv or .txt extensions. By default, path is set to the current working directory.

is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset or ReceptorDataset. By default, is_repertoire is set to True.

metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. For setting Sequence- or ReceptorDataset metadata, metadata_file is ignored, see metadata_column_mapping instead.

paired (bool): Required for Sequence- or ReceptorDatasets. This parameter determines whether to import a SequenceDataset (paired = False) or a ReceptorDataset (paired = True). In a ReceptorDataset, two sequences with chain types specified by receptor_chains are paired together based on a common identifier. This identifier should be mapped to the immuneML field ‘sequence_identifiers’ using the column_mapping.

receptor_chains (str): Required for ReceptorDatasets. Determines which pair of chains to import for each Receptor. Valid values are TRA_TRB, TRG_TRD, IGH_IGL, IGH_IGK.

import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.

import_empty_nt_sequences (bool): Whether to import sequences that have an empty nucleotide sequence field. By default, import_empty_nt_sequences is set to True.

import_empty_aa_sequences (bool): Whether to import sequences that have an empty amino acid sequence field. For analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.

region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means immuneML assumes the IMGT junction (including leading C and trailing Y/F amino acids) is used in the input file, and the first and last amino acids will be removed from the sequences to retrieve the IMGT CDR3 sequence. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.

column_mapping (dict): Required for all datasets. A mapping where the keys are the column names in the input file, and the values correspond to the names used in immuneML’s internal data representation. Valid immuneML fields that can be specified here are [‘sequence_aas’, ‘sequences’, ‘v_genes’, ‘j_genes’, ‘v_subgroups’, ‘j_subgroups’, ‘v_alleles’, ‘j_alleles’, ‘chains’, ‘counts’, ‘frame_types’, ‘sequence_identifiers’, ‘cell_ids’]. At least sequences (nucleotide) or sequence_aas (amino acids) must be specified, but all other fields are optional. A column mapping can look for example like this:

file_column_amino_acids: sequence_aas
file_column_v_genes: v_genes
file_column_j_genes: j_genes
file_column_frequencies: counts

column_mapping_synonyms (dict): This is a column mapping that can be used if a column could have alternative names. The formatting is the same as column_mapping. If some columns specified in column_mapping are not found in the file, the columns specified in column_mapping_synonyms are instead attempted to be loaded. For Generic import, there is no default column_mapping_synonyms.
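
A minimal sketch of how column_mapping and column_mapping_synonyms might be combined is given below. The dataset name, paths and column names (in particular alternative_amino_acid_column) are hypothetical and only illustrate the formatting: the synonym entry would be tried when file_column_amino_acids is not found in the input file.

```yaml
my_generic_dataset_with_synonyms: # hypothetical user-defined dataset name
    format: Generic
    params:
        path: path/to/files/ # placeholder directory
        metadata_file: path/to/metadata.csv # placeholder metadata file
        separator: "\t"
        column_mapping: # primary column names expected in the file
            file_column_amino_acids: sequence_aas
        column_mapping_synonyms: # same formatting, used when a primary column is missing
            alternative_amino_acid_column: sequence_aas # hypothetical alternative column name
```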

metadata_column_mapping (dict): Specifies metadata for Sequence- and ReceptorDatasets. This should specify a mapping similar to column_mapping where keys are file column names and values are the names that are internally used in immuneML as metadata fields. These metadata fields can be used as prediction labels for Sequence- and ReceptorDatasets. This parameter can also be used to specify sequence-level metadata columns for RepertoireDatasets, which can be used by reports. To set prediction label metadata for RepertoireDatasets, see metadata_file instead. There is no default metadata_column_mapping. A metadata column mapping can look for example like this:

file_column_antigen_specificity: antigen_specificity

columns_to_load (list): Optional; specifies which columns to load from the input file. This may be useful if the input files contain many unused columns. If no value is specified, all columns are loaded.

separator (str): Required parameter. Column separator, for example “\t” or “,”. The default value is “\t”.

YAML specification:

my_generic_dataset:
    format: Generic
    params:
        path: path/to/files/
        is_repertoire: True # whether to import a RepertoireDataset
        metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
        paired: False # whether to import SequenceDataset (False) or ReceptorDataset (True) when is_repertoire = False
        receptor_chains: TRA_TRB # what chain pair to import for a ReceptorDataset
        separator: "\t" # column separator
        import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
        import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
        import_empty_aa_sequences: False # filter out sequences if they don't have sequence_aas set
        region_type: IMGT_CDR3 # what part of the sequence to import
        column_mapping: # column mapping file: immuneML
            file_column_amino_acids: sequence_aas
            file_column_v_genes: v_genes
            file_column_j_genes: j_genes
            file_column_frequencies: counts
        metadata_column_mapping: # metadata column mapping file: immuneML
            file_column_antigen_specificity: antigen_specificity
        columns_to_load:  # which subset of columns to load from the file
            - file_column_amino_acids
            - file_column_v_genes
            - file_column_j_genes
            - file_column_frequencies
            - file_column_antigen_specificity
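
For paired data, the Generic importer expects each chain on its own line, linked by a shared identifier mapped to sequence_identifiers. The following sketch shows what such a ReceptorDataset import could look like. The dataset name, path and file column names are placeholders, and mapping a chain column to chains is an assumption made for this illustration.

```yaml
my_generic_receptor_dataset: # hypothetical user-defined dataset name
    format: Generic
    params:
        path: path/to/receptors.csv # placeholder path to one tabular file
        is_repertoire: False
        paired: True # import a ReceptorDataset
        receptor_chains: TRA_TRB
        separator: ","
        column_mapping:
            file_column_pair_id: sequence_identifiers # identifier shared by the two chains of a receptor
            file_column_chain: chains # assumed column holding the chain type of each line
            file_column_amino_acids: sequence_aas
        metadata_column_mapping:
            file_column_antigen_specificity: antigen_specificity
```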

IGoR

Imports data generated by IGoR simulations into a Repertoire-, or SequenceDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets should be used when predicting values for unpaired (single-chain) immune receptors, like antigen specificity.

Note that you should run IGoR with the --CDR3 option specified, because this importer reads the generated CDR3 files. Sequences with missing anchors are not imported, meaning only sequences with value ‘1’ in the anchors_found column are imported. Nucleotide sequences are automatically translated to amino acid sequences.

Reference: Quentin Marcou, Thierry Mora, Aleksandra M. Walczak ‘High-throughput immune repertoire analysis with IGoR’. Nature Communications, (2018) doi.org/10.1038/s41467-018-02832-w.

Arguments:

path (str): For RepertoireDatasets, this is the path to a directory with IGoR files to import. For SequenceDatasets, this may either be the path to the file to import, or the path to a folder containing one or more files with .tsv, .csv or .txt extensions. By default, path is set to the current working directory.

is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset. By default, is_repertoire is set to True.

metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. Only the IGoR files included under the column ‘filename’ are imported into the RepertoireDataset. For setting SequenceDataset metadata, metadata_file is ignored, see metadata_column_mapping instead.

import_with_stop_codon (bool): Whether sequences with stop codons should be included in the imported sequences. By default, import_with_stop_codon is False.

import_out_of_frame (bool): Whether out of frame sequences (with value ‘0’ in column is_inframe) should be included in the imported sequences. By default, import_out_of_frame is False.

import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.

import_empty_nt_sequences (bool): Whether to import sequences that have an empty nucleotide sequence field. By default, import_empty_nt_sequences is set to True.

region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as IGoR uses the IMGT junction. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.

column_mapping (dict): A mapping from IGoR column names to immuneML’s internal data representation. A custom column mapping can be specified here if necessary (for example, to add additional data fields if they are present in the IGoR file, or to use alternative column names). Valid immuneML fields that can be specified here are [‘sequence_aas’, ‘sequences’, ‘v_genes’, ‘j_genes’, ‘v_subgroups’, ‘j_subgroups’, ‘v_alleles’, ‘j_alleles’, ‘chains’, ‘counts’, ‘frame_types’, ‘sequence_identifiers’, ‘cell_ids’]. For IGoR, this is by default set to:

nt_CDR3: sequences
seq_index: sequence_identifiers

column_mapping_synonyms (dict): This is a column mapping that can be used if a column could have alternative names. The formatting is the same as column_mapping. If some columns specified in column_mapping are not found in the file, the columns specified in column_mapping_synonyms are instead attempted to be loaded. For IGoR format, there is no default column_mapping_synonyms.

metadata_column_mapping (dict): Specifies metadata for Sequence- and ReceptorDatasets. This should specify a mapping similar to column_mapping where keys are IGoR column names and values are the names that are internally used in immuneML as metadata fields. These metadata fields can be used as prediction labels for Sequence- and ReceptorDatasets. This parameter can also be used to specify sequence-level metadata columns for RepertoireDatasets, which can be used by reports. To set prediction label metadata for RepertoireDatasets, see metadata_file instead. For IGoR format, there is no default metadata_column_mapping.

separator (str): Column separator, for IGoR this is by default “,”.

YAML specification:

my_igor_dataset:
    format: IGoR
    params:
        path: path/to/files/
        is_repertoire: True # whether to import a RepertoireDataset (True) or a SequenceDataset (False)
        metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
        metadata_column_mapping: # metadata column mapping IGoR: immuneML for SequenceDataset
            igor_column_name1: metadata_label1
            igor_column_name2: metadata_label2
        import_with_stop_codon: False # whether to include sequences with stop codon in the dataset
        import_out_of_frame: False # whether to include out of frame sequences in the dataset
        import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
        import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
        # Optional fields with IGoR-specific defaults, only change when different behavior is required:
        separator: "," # column separator
        region_type: IMGT_CDR3 # what part of the sequence to import
        column_mapping: # column mapping IGoR: immuneML
            nt_CDR3: sequences
            seq_index: sequence_identifiers
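
As a complementary sketch, a SequenceDataset could be imported from a single IGoR CDR3 file by setting is_repertoire to False. The dataset name and file path are placeholders chosen for this example; the separator, region_type and column_mapping are left at their IGoR-specific defaults.

```yaml
my_igor_sequence_dataset: # hypothetical user-defined dataset name
    format: IGoR
    params:
        path: path/to/igor_cdr3_file.csv # placeholder path to one IGoR CDR3 file
        is_repertoire: False # import a SequenceDataset instead of a RepertoireDataset
        import_with_stop_codon: False # skip sequences containing stop codons
        import_out_of_frame: False # skip sequences with value '0' in column is_inframe
```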

IReceptor

Imports AIRR datasets retrieved through the iReceptor Gateway into a Repertoire-, Sequence- or ReceptorDataset. The differences between this importer and the AIRR importer are:

  • This importer takes in a list of .zip files, which must contain one or more AIRR tsv files, and for each AIRR file, a corresponding metadata json file must be present.

  • This importer does not require a metadata csv file for RepertoireDataset import; it is generated automatically from the metadata json files.

RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets or ReceptorDatasets should be used when predicting values for unpaired (single-chain) and paired immune receptors respectively, like antigen specificity.

AIRR rearrangement schema can be found here: https://docs.airr-community.org/en/stable/datarep/rearrangements.html

When importing a ReceptorDataset, the AIRR field cell_id is used to determine the chain pairs.

Arguments:

path (str): This is the path to a directory with .zip files retrieved from the iReceptor Gateway. These .zip files should include AIRR files (with .tsv extension) and corresponding metadata.json files with matching names (e.g., for my_dataset.tsv the corresponding metadata file is called my_dataset-metadata.json). The zip files must use the .zip extension.

is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset or ReceptorDataset. By default, is_repertoire is set to True.

paired (bool): Required for Sequence- or ReceptorDatasets. This parameter determines whether to import a SequenceDataset (paired = False) or a ReceptorDataset (paired = True). In a ReceptorDataset, two sequences with chain types specified by receptor_chains are paired together based on the identifier given in the AIRR column named ‘cell_id’.

receptor_chains (str): Required for ReceptorDatasets. Determines which pair of chains to import for each Receptor. Valid values are TRA_TRB, TRG_TRD, IGH_IGL, IGH_IGK. If receptor_chains is not provided, the chain pair is automatically detected (only one chain pair type allowed per repertoire).

import_productive (bool): Whether productive sequences (with value ‘T’ in column productive) should be included in the imported sequences. By default, import_productive is True.

import_with_stop_codon (bool): Whether sequences with stop codons (with value ‘T’ in column stop_codon) should be included in the imported sequences. This only applies if column stop_codon is present. By default, import_with_stop_codon is False.

import_out_of_frame (bool): Whether out of frame sequences (with value ‘F’ in column vj_in_frame) should be included in the imported sequences. This only applies if column vj_in_frame is present. By default, import_out_of_frame is False.

import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.

import_empty_nt_sequences (bool): Whether to import sequences that have an empty nucleotide sequence field. By default, import_empty_nt_sequences is set to True.

import_empty_aa_sequences (bool): Whether to import sequences that have an empty amino acid sequence field. For analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.

region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as AIRR uses the IMGT junction. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.

column_mapping (dict): A mapping from AIRR column names to immuneML’s internal data representation. A custom column mapping can be specified here if necessary (for example, to add additional data fields if they are present in the AIRR file, or to use alternative column names). Valid immuneML fields that can be specified here are [‘sequence_aas’, ‘sequences’, ‘v_genes’, ‘j_genes’, ‘v_subgroups’, ‘j_subgroups’, ‘v_alleles’, ‘j_alleles’, ‘chains’, ‘counts’, ‘frame_types’, ‘sequence_identifiers’, ‘cell_ids’]. For AIRR, this is by default set to:

junction: sequences
junction_aa: sequence_aas
v_call: v_alleles
j_call: j_alleles
locus: chains
duplicate_count: counts
sequence_id: sequence_identifiers

column_mapping_synonyms (dict): This is a column mapping that can be used if a column could have alternative names. The formatting is the same as column_mapping. If some columns specified in column_mapping are not found in the file, the columns specified in column_mapping_synonyms are instead attempted to be loaded. For AIRR format, there is no default column_mapping_synonyms.

metadata_column_mapping (dict): Specifies metadata for Sequence- and ReceptorDatasets. This should specify a mapping similar to column_mapping where keys are AIRR column names and values are the names that are internally used in immuneML as metadata fields. These metadata fields can be used as prediction labels for Sequence- and ReceptorDatasets. For AIRR format, there is no default metadata_column_mapping. When importing a RepertoireDataset, the metadata is automatically extracted from the metadata json files.

separator (str): Column separator. For AIRR, this is by default “\t”.

YAML specification:

my_airr_dataset:
    format: IReceptor
    params:
        path: path/to/zipfiles/
        is_repertoire: True # whether to import a RepertoireDataset
        metadata_column_mapping: # metadata column mapping AIRR: immuneML for Sequence- or ReceptorDataset
            airr_column_name1: metadata_label1
            airr_column_name2: metadata_label2
        import_productive: True # whether to include productive sequences in the dataset
        import_with_stop_codon: False # whether to include sequences with stop codon in the dataset
        import_out_of_frame: False # whether to include out of frame sequences in the dataset
        import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
        import_empty_nt_sequences: True # keep sequences even if the `sequences` column is empty (provided that other fields are as specified here)
        import_empty_aa_sequences: False # remove all sequences with empty `sequence_aas` column
        # Optional fields with AIRR-specific defaults, only change when different behavior is required:
        separator: "\t" # column separator
        region_type: IMGT_CDR3 # what part of the sequence to import
        column_mapping: # column mapping AIRR: immuneML
            junction: sequences
            junction_aa: sequence_aas
            v_call: v_alleles
            j_call: j_alleles
            locus: chains
            duplicate_count: counts
            sequence_id: sequence_identifiers

ImmuneML

Imports the dataset from the files previously exported by immuneML. It closely resembles AIRR format but relies on binary representations and is optimized for faster read-in at runtime.

ImmuneMLImport can import any kind of dataset (RepertoireDataset, SequenceDataset, ReceptorDataset).

This format includes:

  1. a dataset file in yaml format with iml_dataset extension with parameters:

    • name,

    • identifier,

    • metadata_file (for repertoire datasets),

    • metadata_fields (for repertoire datasets),

    • repertoire_ids (for repertoire datasets),

    • element_ids (for receptor and sequence datasets),

    • labels,

  2. a csv metadata file (only for repertoire datasets, should be in the same folder as the iml_dataset file),

  3. data files for different types of data. For repertoire datasets, data files include one binary numpy file per repertoire with sequences and associated information and one metadata yaml file per repertoire with details such as repertoire identifier, disease status, subject id and other similar available information. For sequence and receptor datasets, sequences or receptors respectively, are stored in batches in binary numpy files.

Arguments:

path (str): The path to the previously created dataset file. This file should have an ‘.iml_dataset’ extension. If the path has not been specified, immuneML attempts to load the dataset from a specified metadata file (only for RepertoireDatasets).

metadata_file (str): An optional metadata file for a RepertoireDataset. If specified, the RepertoireDataset metadata will be updated to the newly specified metadata without otherwise changing the Repertoire objects.

YAML specification:

my_dataset:
    format: ImmuneML
    params:
        path: path/to/dataset.iml_dataset
        metadata_file: path/to/metadata.csv

ImmunoSEQRearrangement

Imports data from Adaptive Biotechnologies immunoSEQ Analyzer rearrangement-level .tsv files into a Repertoire- or SequenceDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets should be used when predicting values for unpaired (single-chain) immune receptors, like antigen specificity.

The format of the files imported by this importer is described here: https://www.adaptivebiotech.com/wp-content/uploads/2019/07/MRK-00342_immunoSEQ_TechNote_DataExport_WEB_REV.pdf. Alternatively, to import sample-level .tsv files, see ImmunoSEQSample import.

The only difference between these two importers is which columns they load from the .tsv files.

Arguments:

path (str): For RepertoireDatasets, this is the path to a directory with files to import. For Sequence- or ReceptorDatasets this path may either be the path to the file to import, or the path to the folder containing one or multiple files with .tsv, .csv or .txt extensions. By default path is set to the current working directory.

is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset. By default, is_repertoire is set to True.

metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. Only the files included under the column ‘filename’ are imported into the RepertoireDataset. For setting SequenceDataset metadata, metadata_file is ignored, see metadata_column_mapping instead.
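
As a minimal sketch of the metadata file described above, the following Python snippet writes a csv file with the required filename and subject_id columns plus one additional label column. The file names, subject identifiers and the CMV label are made-up placeholders; only the two required column names come from the description above.

```python
import csv
import io

# Hypothetical repertoire files and labels, for illustration only.
rows = [
    {"filename": "rep1.tsv", "subject_id": "subject_1", "CMV": "True"},
    {"filename": "rep2.tsv", "subject_id": "subject_2", "CMV": "False"},
]

def write_metadata(handle, rows):
    """Write repertoire metadata as csv with a header row."""
    writer = csv.DictWriter(handle, fieldnames=["filename", "subject_id", "CMV"])
    writer.writeheader()
    writer.writerows(rows)

# An in-memory buffer stands in for path/to/metadata.csv.
buffer = io.StringIO()
write_metadata(buffer, rows)
metadata_csv = buffer.getvalue()
```

Any extra column (here CMV) could then be referenced as a label in an instruction.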

import_productive (bool): Whether productive sequences (with value ‘In’ in column frame_type) should be included in the imported sequences. By default, import_productive is True.

import_with_stop_codon (bool): Whether sequences with stop codons (with value ‘Stop’ in column frame_type) should be included in the imported sequences. By default, import_with_stop_codon is False.

import_out_of_frame (bool): Whether out of frame sequences (with value ‘Out’ in column frame_type) should be included in the imported sequences. By default, import_out_of_frame is False.
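
The three flags above can be read as a filter on the frame_type column. The following Python sketch illustrates that reading; the function name keep_sequence is hypothetical and this is not immuneML's own implementation.

```python
def keep_sequence(frame_type, import_productive=True,
                  import_with_stop_codon=False, import_out_of_frame=False):
    """Return True if a row with this frame_type value should be imported."""
    allowed = set()
    if import_productive:
        allowed.add("In")
    if import_with_stop_codon:
        allowed.add("Stop")
    if import_out_of_frame:
        allowed.add("Out")
    return frame_type in allowed

# With the default flags, only productive ('In') rows are kept.
frame_types = ["In", "Out", "Stop", "In"]
kept = [ft for ft in frame_types if keep_sequence(ft)]
```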

import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.

import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.

import_empty_aa_sequences (bool): imports sequences which have an empty amino acid sequence field; can be True or False; for analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.

region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as immunoSEQ files use the IMGT junction. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.
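
To illustrate the trimming described for IMGT_CDR3, here is a small Python sketch. The function names are hypothetical, the example junction is made up, and the sketch only covers amino acid sequences.

```python
def junction_to_cdr3(junction_aa):
    """Drop the first and last amino acid of an IMGT junction."""
    return junction_aa[1:-1]

def import_region(sequence_aa, region_type="IMGT_CDR3"):
    """Trim only for IMGT_CDR3; any other value keeps the sequence as it is."""
    if region_type == "IMGT_CDR3":
        return junction_to_cdr3(sequence_aa)
    return sequence_aa

cdr3 = import_region("CASSLGTDTQYF")                       # trimmed
junction = import_region("CASSLGTDTQYF", "IMGT_JUNCTION")  # unchanged
```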

column_mapping (dict): A mapping from immunoSEQ column names to immuneML’s internal data representation. For immunoSEQ rearrangement-level files, this is by default set to the values shown below in YAML format. A custom column mapping can be specified here if necessary (for example: adding additional data fields if they are present in the file, or using alternative column names). Valid immuneML fields that can be specified here are [‘sequence_aas’, ‘sequences’, ‘v_genes’, ‘j_genes’, ‘v_subgroups’, ‘j_subgroups’, ‘v_alleles’, ‘j_alleles’, ‘chains’, ‘counts’, ‘frame_types’, ‘sequence_identifiers’, ‘cell_ids’].

rearrangement: sequences
amino_acid: sequence_aas
v_gene: v_genes
j_gene: j_genes
frame_type: frame_types
v_family: v_subgroups
j_family: j_subgroups
v_allele: v_alleles
j_allele: j_alleles
templates: counts
locus: chains

column_mapping_synonyms (dict): This is a column mapping that can be used if a column could have alternative names. The formatting is the same as column_mapping. If some columns specified in column_mapping are not found in the file, immuneML attempts to load the columns specified in column_mapping_synonyms instead. For immunoSEQ rearrangement-level files, this is by default set to:

v_resolved: v_alleles
j_resolved: j_alleles
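
One way to picture the synonym fallback is the Python sketch below. It assumes that a synonym column is used only when no column from column_mapping already provides the same immuneML field; that rule and the function name resolve_column_mapping are assumptions made for illustration, not a description of immuneML's internals.

```python
def resolve_column_mapping(header, column_mapping, column_mapping_synonyms):
    """Map file columns to immuneML fields, falling back to synonyms."""
    # Keep the primary mapping for every column that exists in the file.
    resolved = {col: field for col, field in column_mapping.items() if col in header}
    covered_fields = set(resolved.values())
    # Use a synonym only for fields that are still missing.
    for col, field in column_mapping_synonyms.items():
        if col in header and field not in covered_fields:
            resolved[col] = field
            covered_fields.add(field)
    return resolved

# Hypothetical file header that uses v_resolved / j_resolved.
header = ["rearrangement", "amino_acid", "v_resolved", "j_resolved", "templates"]
column_mapping = {"rearrangement": "sequences", "amino_acid": "sequence_aas",
                  "v_allele": "v_alleles", "j_allele": "j_alleles",
                  "templates": "counts"}
synonyms = {"v_resolved": "v_alleles", "j_resolved": "j_alleles"}

resolved = resolve_column_mapping(header, column_mapping, synonyms)
```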

columns_to_load (list): Specifies which subset of columns must be loaded from the file. By default, this is: [rearrangement, v_family, v_gene, v_allele, j_family, j_gene, j_allele, amino_acid, templates, frame_type, locus]

metadata_column_mapping (dict): Specifies metadata for Sequence- and ReceptorDatasets. This should specify a mapping similar to column_mapping where keys are immunoSEQ column names and values are the names that are internally used in immuneML as metadata fields. These metadata fields can be used as prediction labels for Sequence- and ReceptorDatasets. This parameter can also be used to specify sequence-level metadata columns for RepertoireDatasets, which can be used by reports. To set prediction label metadata for RepertoireDatasets, see metadata_file instead. For immunoSEQ rearrangement .tsv files, there is no default metadata_column_mapping.

separator (str): Column separator, for ImmunoSEQ files this is by default “\t”.

YAML specification:

my_immunoseq_dataset:
    format: ImmunoSEQRearrangement
    params:
        path: path/to/files/
        is_repertoire: True # whether to import a RepertoireDataset (True) or a SequenceDataset (False)
        metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
        metadata_column_mapping: # metadata column mapping ImmunoSEQ: immuneML for SequenceDataset
            immunoseq_column_name1: metadata_label1
            immunoseq_column_name2: metadata_label2
        import_productive: True # whether to include productive sequences in the dataset
        import_with_stop_codon: False # whether to include sequences with stop codon in the dataset
        import_out_of_frame: False # whether to include out of frame sequences in the dataset
        import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
        import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
        import_empty_aa_sequences: False # filter out sequences if they don't have sequence_aa set
        # Optional fields with ImmunoSEQ rearrangement-specific defaults, only change when different behavior is required:
        separator: "\t" # column separator
        columns_to_load: # subset of columns to load
        - rearrangement
        - v_family
        - v_gene
        - v_allele
        - j_family
        - j_gene
        - j_allele
        - amino_acid
        - templates
        - frame_type
        - locus
        region_type: IMGT_CDR3 # what part of the sequence to import
        column_mapping: # column mapping immunoSEQ: immuneML
            rearrangement: sequences
            amino_acid: sequence_aas
            v_gene: v_genes
            j_gene: j_genes
            frame_type: frame_types
            v_family: v_subgroups
            j_family: j_subgroups
            v_allele: v_alleles
            j_allele: j_alleles
            templates: counts
            locus: chains

ImmunoSEQSample

Imports data from Adaptive Biotechnologies immunoSEQ Analyzer sample-level .tsv files into a Repertoire- or SequenceDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets should be used when predicting values for unpaired (single-chain) immune receptors, like antigen specificity.

The format of the files imported by this importer is described in section 3.4.13 of https://clients.adaptivebiotech.com/assets/downloads/immunoSEQ_AnalyzerManual.pdf. Alternatively, to import rearrangement-level .tsv files, see ImmunoSEQRearrangement import. The only difference between these two importers is which columns they load from the .tsv files.

Arguments:

path (str): For RepertoireDatasets, this is the path to a directory with files to import. For Sequence- or ReceptorDatasets this path may either be the path to the file to import, or the path to the folder containing one or multiple files with .tsv, .csv or .txt extensions. By default path is set to the current working directory.

is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset. By default, is_repertoire is set to True.

metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. Only the files included under the column ‘filename’ are imported into the RepertoireDataset. For setting SequenceDataset metadata, metadata_file is ignored, see metadata_column_mapping instead.

import_productive (bool): Whether productive sequences (with value ‘In’ in column frame_type) should be included in the imported sequences. By default, import_productive is True.

import_with_stop_codon (bool): Whether sequences with stop codons (with value ‘Stop’ in column frame_type) should be included in the imported sequences. By default, import_with_stop_codon is False.

import_out_of_frame (bool): Whether out of frame sequences (with value ‘Out’ in column frame_type) should be included in the imported sequences. By default, import_out_of_frame is False.

import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.

import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.

import_empty_aa_sequences (bool): imports sequences which have an empty amino acid sequence field; can be True or False; for analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.

region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as immunoSEQ files use the IMGT junction. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.

column_mapping (dict): A mapping from immunoSEQ column names to immuneML’s internal data representation. For immunoSEQ sample-level files, this is by default set to the values shown below in YAML format. A custom column mapping can be specified here if necessary (for example: adding additional data fields if they are present in the file, or using alternative column names). Valid immuneML fields that can be specified here are [‘sequence_aas’, ‘sequences’, ‘v_genes’, ‘j_genes’, ‘v_subgroups’, ‘j_subgroups’, ‘v_alleles’, ‘j_alleles’, ‘chains’, ‘counts’, ‘frame_types’, ‘sequence_identifiers’, ‘cell_ids’].

nucleotide: sequences
aminoAcid: sequence_aas
vGeneName: v_genes
jGeneName: j_genes
sequenceStatus: frame_types
vFamilyName: v_subgroups
jFamilyName: j_subgroups
vGeneAllele: v_alleles
jGeneAllele: j_alleles
count (templates/reads): counts

column_mapping_synonyms (dict): This is a column mapping that can be used if a column could have alternative names. The formatting is the same as column_mapping. If some columns specified in column_mapping are not found in the file, immuneML attempts to load the columns specified in column_mapping_synonyms instead. For immunoSEQ sample .tsv files, there is no default column_mapping_synonyms.

columns_to_load (list): Specifies which subset of columns from the original file must be loaded. By default, this is: [nucleotide, aminoAcid, count (templates/reads), vFamilyName, vGeneName, vGeneAllele, jFamilyName, jGeneName, jGeneAllele, sequenceStatus].

metadata_column_mapping (dict): Specifies metadata for Sequence- and ReceptorDatasets. This should specify a mapping similar to column_mapping where keys are immunoSEQ column names and values are the names that are internally used in immuneML as metadata fields. These metadata fields can be used as prediction labels for Sequence- and ReceptorDatasets. This parameter can also be used to specify sequence-level metadata columns for RepertoireDatasets, which can be used by reports. To set prediction label metadata for RepertoireDatasets, see metadata_file instead. For immunoSEQ sample .tsv files, there is no default metadata_column_mapping.
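
Conceptually, metadata_column_mapping renames selected file columns to metadata field names. The Python sketch below shows that idea on a single row; the column name epitope_column, the label name epitope and the function extract_metadata are hypothetical.

```python
def extract_metadata(row, metadata_column_mapping):
    """Rename selected file columns to immuneML metadata field names."""
    return {label: row[column] for column, label in metadata_column_mapping.items()}

# Hypothetical row of a sample-level file with one extra annotation column.
row = {"aminoAcid": "CASSLGTDTQYF", "epitope_column": "binding"}
metadata = extract_metadata(row, {"epitope_column": "epitope"})
```

In this sketch, the resulting field epitope is the name under which the value would be available as a prediction label.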

separator (str): Column separator, for ImmunoSEQ files this is by default “\t”.

YAML specification:

my_immunoseq_dataset:
    format: ImmunoSEQSample
    params:
        path: path/to/files/
        is_repertoire: True # whether to import a RepertoireDataset (True) or a SequenceDataset (False)
        metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
        metadata_column_mapping: # metadata column mapping ImmunoSEQ: immuneML for SequenceDataset
            immunoseq_column_name1: metadata_label1
            immunoseq_column_name2: metadata_label2
        import_productive: True # whether to include productive sequences in the dataset
        import_with_stop_codon: False # whether to include sequences with stop codon in the dataset
        import_out_of_frame: False # whether to include out of frame sequences in the dataset
        import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
        import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
        import_empty_aa_sequences: False # filter out sequences if they don't have sequence_aa set
        # Optional fields with ImmunoSEQ sample-specific defaults, only change when different behavior is required:
        separator: "\t" # column separator
        columns_to_load: # subset of columns to load
        - nucleotide
        - aminoAcid
        - count (templates/reads)
        - vFamilyName
        - vGeneName
        - vGeneAllele
        - jFamilyName
        - jGeneName
        - jGeneAllele
        - sequenceStatus
        region_type: IMGT_CDR3 # what part of the sequence to import
        column_mapping: # column mapping immunoSEQ: immuneML
            nucleotide: sequences
            aminoAcid: sequence_aas
            vGeneName: v_genes
            jGeneName: j_genes
            sequenceStatus: frame_types
            vFamilyName: v_subgroups
            jFamilyName: j_subgroups
            vGeneAllele: v_alleles
            jGeneAllele: j_alleles
            count (templates/reads): counts

MiXCR

Imports data in MiXCR format into a Repertoire- or SequenceDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets should be used when predicting values for unpaired (single-chain) immune receptors, like antigen specificity.

Arguments:

path (str): For RepertoireDatasets, this is the path to a directory with MiXCR files to import. For Sequence- or ReceptorDatasets this path may either be the path to the file to import, or the path to the folder containing one or multiple files with .tsv, .csv or .txt extensions. By default path is set to the current working directory.

is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset. By default, is_repertoire is set to True.

metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. Only the MiXCR files included under the column ‘filename’ are imported into the RepertoireDataset. For setting SequenceDataset metadata, metadata_file is ignored, see metadata_column_mapping instead.

import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence, such as ‘_’, are removed). By default import_illegal_characters is False.

import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.

import_empty_aa_sequences (bool): imports sequences which have an empty amino acid sequence field; can be True or False; for analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.

region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as MiXCR uses IMGT junction as CDR3. Instead of the CDR3 sequence, other region types can be specified here as well. Valid values are IMGT_CDR3, IMGT_CDR1, IMGT_CDR2, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4.

column_mapping (dict): A mapping from MiXCR column names to immuneML’s internal data representation. The columns that specify the sequences to import are handled by the region_type parameter. A custom column mapping can be specified here if necessary (for example: adding additional data fields if they are present in the MiXCR file, or using alternative column names). Valid immuneML fields that can be specified here are [‘sequence_aas’, ‘sequences’, ‘v_genes’, ‘j_genes’, ‘v_subgroups’, ‘j_subgroups’, ‘v_alleles’, ‘j_alleles’, ‘chains’, ‘counts’, ‘frame_types’, ‘sequence_identifiers’, ‘cell_ids’]. For MiXCR, this is by default set to:

cloneCount: counts
allVHitsWithScore: v_alleles
allJHitsWithScore: j_alleles
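
MiXCR's allVHitsWithScore and allJHitsWithScore columns typically hold a comma-separated list of hits, each followed by its score in parentheses. The Python sketch below extracts the gene and allele name from the first hit. The example value is made up, the function name top_hit is hypothetical, and taking the first hit is an assumption for illustration rather than a statement of how immuneML parses these columns.

```python
import re

def top_hit(all_hits_with_score):
    """Return (gene, allele) for the first hit in a MiXCR hits string."""
    first = all_hits_with_score.split(",")[0].strip()
    allele = re.sub(r"\(.*\)$", "", first)  # drop the "(score)" suffix
    gene = allele.split("*")[0]             # drop the "*allele" suffix
    return gene, allele

gene, allele = top_hit("TRBV12-3*00(1388.5),TRBV12-4*00(1377.1)")
```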

column_mapping_synonyms (dict): This is a column mapping that can be used if a column could have alternative names. The formatting is the same as column_mapping. If some columns specified in column_mapping are not found in the file, immuneML attempts to load the columns specified in column_mapping_synonyms instead. For MiXCR format, there is no default column_mapping_synonyms.

columns_to_load (list): Specifies which subset of columns must be loaded from the MiXCR file. By default, this is: [cloneCount, allVHitsWithScore, allJHitsWithScore, aaSeqCDR3, nSeqCDR3]

metadata_column_mapping (dict): Specifies metadata for Sequence- and ReceptorDatasets. This should specify a mapping similar to column_mapping where keys are MiXCR column names and values are the names that are internally used in immuneML as metadata fields. These metadata fields can be used as prediction labels for Sequence- and ReceptorDatasets. This parameter can also be used to specify sequence-level metadata columns for RepertoireDatasets, which can be used by reports. To set prediction label metadata for RepertoireDatasets, see metadata_file instead. For MiXCR format, there is no default metadata_column_mapping.

separator (str): Column separator, for MiXCR this is by default “\t”.

YAML specification:

my_mixcr_dataset:
    format: MiXCR
    params:
        path: path/to/files/
        is_repertoire: True # whether to import a RepertoireDataset (True) or a SequenceDataset (False)
        metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
        metadata_column_mapping: # metadata column mapping MiXCR: immuneML for SequenceDataset
            mixcrColumnName1: metadata_label1
            mixcrColumnName2: metadata_label2
        region_type: IMGT_CDR3 # what part of the sequence to import
        import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
        import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
        import_empty_aa_sequences: False # filter out sequences if they don't have sequence_aa set
        # Optional fields with MiXCR-specific defaults, only change when different behavior is required:
        separator: "\t" # column separator
        columns_to_load: # subset of columns to load, sequence columns are handled by region_type parameter
        - cloneCount
        - allVHitsWithScore
        - allJHitsWithScore
        - aaSeqCDR3
        - nSeqCDR3
        column_mapping: # column mapping MiXCR: immuneML
            cloneCount: counts
            allVHitsWithScore: v_alleles
            allJHitsWithScore: j_alleles

OLGA

Imports data generated by OLGA simulations into a Repertoire- or SequenceDataset. Assumes that the columns in each file correspond to: nucleotide sequences, amino acid sequences, v genes, j genes.

Reference: Sethna, Zachary et al. ‘OLGA: fast computation of generation probabilities of B- and T-cell receptor amino acid sequences and motifs’. Bioinformatics, (2019) doi.org/10.1093/bioinformatics/btz035.

Arguments:

path (str): For RepertoireDatasets, this is the path to a directory with OLGA files to import. For Sequence- or ReceptorDatasets this path may either be the path to the file to import, or the path to the folder containing one or multiple files with .tsv, .csv or .txt extensions. By default path is set to the current working directory.

is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset. By default, is_repertoire is set to True.

metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. Only the OLGA files included under the column ‘filename’ are imported into the RepertoireDataset. SequenceDataset metadata is currently not supported.

import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.

import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.

import_empty_aa_sequences (bool): imports sequences which have an empty amino acid sequence field; can be True or False; for analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.

region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as OLGA uses the IMGT junction. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.

separator (str): Column separator, for OLGA this is by default “\t”.

column_mapping (dict): Defines which columns to import from the OLGA format: keys are the column numbers and values are the immuneML field names that the columns are mapped to.
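
Because the mapping is keyed by column number, an OLGA file can be read without a header row. The following Python sketch shows one way to apply such a mapping to a tab-separated line. The function name read_olga and the example record are made up, and the sketch is not immuneML's own import code.

```python
import csv
import io

# Column numbers mapped to immuneML field names.
column_mapping = {0: "sequences", 1: "sequence_aas", 2: "v_genes", 3: "j_genes"}

def read_olga(handle, column_mapping, separator="\t"):
    """Yield one dict per line of a headerless, separator-delimited file."""
    for fields in csv.reader(handle, delimiter=separator):
        yield {name: fields[index] for index, name in column_mapping.items()}

# One made-up line standing in for an OLGA output file.
example = io.StringIO("TGTGCCAGCAGT\tCASS\tTRBV5-1\tTRBJ2-7\n")
records = list(read_olga(example, column_mapping))
```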

YAML specification:

my_olga_dataset:
    format: OLGA
    params:
        path: path/to/files/
        is_repertoire: True # whether to import a RepertoireDataset (True) or a SequenceDataset (False)
        metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
        import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
        import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
        import_empty_aa_sequences: False # filter out sequences if they don't have sequence_aa set
        # Optional fields with OLGA-specific defaults, only change when different behavior is required:
        separator: "\t" # column separator
        region_type: IMGT_CDR3 # what part of the sequence to import
        columns_to_load: [0, 1, 2, 3]
        column_mapping:
            0: sequences
            1: sequence_aas
            2: v_genes
            3: j_genes

RandomReceptorDataset

Returns a ReceptorDataset consisting of randomly generated sequences, which can be used for benchmarking purposes. The sequences consist of uniformly chosen amino acids or nucleotides.

Arguments:

receptor_count (int): The number of receptors the ReceptorDataset should contain.

chain_1_length_probabilities (dict): A mapping where the keys correspond to different sequence lengths for chain 1, and the values are the probabilities for choosing each sequence length. For example, to create a random ReceptorDataset where 40% of the sequences for chain 1 would be of length 10, and 60% of the sequences would have length 12, this mapping would need to be specified:

10: 0.4
12: 0.6

chain_2_length_probabilities (dict): Same as chain_1_length_probabilities, but for chain 2.

labels (dict): A mapping that specifies randomly chosen labels to be assigned to the receptors. One or multiple labels can be specified here. The keys of this mapping are the labels, and the values consist of another mapping between label classes and their probabilities. For example, to create a random ReceptorDataset with the label cmv_epitope where 70% of the receptors have class binding and the remaining 30% have class not_binding, the following mapping should be specified:

cmv_epitope:
    binding: 0.7
    not_binding: 0.3
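
The way these probability mappings could drive the generation of random receptors is sketched below in Python. The helper names, the fixed seed and the use of the 20 standard amino acids as alphabet are assumptions made for illustration; immuneML's actual generator may differ.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def draw(probabilities, rng):
    """Draw one key from a {value: probability} mapping."""
    values = list(probabilities)
    weights = [probabilities[value] for value in values]
    return rng.choices(values, weights=weights, k=1)[0]

def random_sequence(length_probabilities, rng):
    """Build a sequence of uniformly chosen amino acids with a drawn length."""
    length = draw(length_probabilities, rng)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))

def random_receptors(receptor_count, chain_1_length_probabilities,
                     chain_2_length_probabilities, labels, seed=0):
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return [
        {
            "chain_1": random_sequence(chain_1_length_probabilities, rng),
            "chain_2": random_sequence(chain_2_length_probabilities, rng),
            # each label is drawn independently of the others
            **{name: draw(classes, rng) for name, classes in labels.items()},
        }
        for _ in range(receptor_count)
    ]

receptors = random_receptors(
    receptor_count=5,
    chain_1_length_probabilities={10: 0.4, 12: 0.6},
    chain_2_length_probabilities={10: 0.4, 12: 0.6},
    labels={"cmv_epitope": {"binding": 0.7, "not_binding": 0.3}},
)
```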

YAML specification:

my_random_dataset:
    format: RandomReceptorDataset
    params:
        receptor_count: 100 # number of random receptors to generate
        chain_1_length_probabilities:
            14: 0.8 # 80% of all generated sequences for all receptors (for chain 1) will have length 14
            15: 0.2 # 20% of all generated sequences across all receptors (for chain 1) will have length 15
        chain_2_length_probabilities:
            14: 0.8 # 80% of all generated sequences for all receptors (for chain 2) will have length 14
            15: 0.2 # 20% of all generated sequences across all receptors (for chain 2) will have length 15
        labels:
            epitope1: # label name
                True: 0.5 # 50% of the receptors will have class True
                False: 0.5 # 50% of the receptors will have class False
            epitope2: # next label with classes that will be assigned to receptors independently of the previous label or other parameters
                1: 0.3 # 30% of the generated receptors will have class 1
                0: 0.7 # 70% of the generated receptors will have class 0

RandomRepertoireDataset

Returns a RepertoireDataset consisting of randomly generated sequences, which can be used for benchmarking purposes. The sequences consist of uniformly chosen amino acids or nucleotides.

Arguments:

repertoire_count (int): The number of repertoires the RepertoireDataset should contain.

sequence_count_probabilities (dict): A mapping where the keys are the number of sequences per repertoire, and the values are the probabilities that any of the repertoires would have that number of sequences. For example, to create a random RepertoireDataset where 40% of the repertoires would have 1000 sequences, and the other 60% would have 1100 sequences, this mapping would need to be specified:

1000: 0.4
1100: 0.6

sequence_length_probabilities (dict): A mapping where the keys correspond to different sequence lengths, and the values are the probabilities for choosing each sequence length. For example, to create a random RepertoireDataset where 40% of the sequences would be of length 10, and 60% of the sequences would have length 12, this mapping would need to be specified:

10: 0.4
12: 0.6

labels (dict): A mapping that specifies randomly chosen labels to be assigned to the Repertoires. One or multiple labels can be specified here. The keys of this mapping are the labels, and the values consist of another mapping between label classes and their probabilities. For example, to create a random RepertoireDataset with the label CMV where 70% of the Repertoires have class cmv_positive and the remaining 30% have class cmv_negative, the following mapping should be specified:

CMV:
    cmv_positive: 0.7
    cmv_negative: 0.3
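
Before running an analysis, it can be useful to sanity-check such mappings. The Python sketch below verifies that each mapping sums to 1 and computes the expected number of sequences per repertoire for the example values above. Whether immuneML itself enforces the sum is not stated here, so the check and the helper names are assumptions.

```python
import math

def check_probabilities(name, probabilities):
    """Raise ValueError unless the values are non-negative and sum to 1."""
    if any(p < 0 for p in probabilities.values()):
        raise ValueError(f"{name}: probabilities must be non-negative")
    total = sum(probabilities.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"{name}: probabilities sum to {total}, expected 1")

def expected_value(probabilities):
    """Probability-weighted mean of the numeric keys."""
    return sum(value * p for value, p in probabilities.items())

sequence_count_probabilities = {1000: 0.4, 1100: 0.6}
labels = {"CMV": {"cmv_positive": 0.7, "cmv_negative": 0.3}}

check_probabilities("sequence_count_probabilities", sequence_count_probabilities)
for label, classes in labels.items():
    check_probabilities(label, classes)

# 1000 * 0.4 + 1100 * 0.6, i.e. about 1060 sequences per repertoire on average
mean_sequences = expected_value(sequence_count_probabilities)
```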

YAML specification:

my_random_dataset:
    format: RandomRepertoireDataset
    params:
        repertoire_count: 100 # number of random repertoires to generate
        sequence_count_probabilities:
            10: 0.5 # probability that any of the repertoires would have 10 receptor sequences
            20: 0.5
        sequence_length_probabilities:
            10: 0.5 # probability that any of the receptor sequences would be 10 amino acids in length
            12: 0.5
        labels: # randomly assigned labels (only useful for simple benchmarking)
            cmv:
                True: 0.5 # probability of value True for label cmv to be assigned to any repertoire
                False: 0.5
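How the probability mappings translate into generated data can be illustrated with a minimal Python sketch. It is not immuneML's implementation: the helper names sample_key and random_repertoire are hypothetical, and the sketch assumes amino acid sequences drawn uniformly from the 20 standard amino acids.

```python
import random

# Hypothetical helper names; this is not immuneML's internal API.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sample_key(probabilities, rng):
    """Pick one key of a {value: probability} mapping according to its probability."""
    keys = list(probabilities)
    weights = [probabilities[k] for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


def random_repertoire(sequence_count_probabilities, sequence_length_probabilities, rng):
    """Generate one repertoire as a list of uniformly random amino acid sequences."""
    n_sequences = sample_key(sequence_count_probabilities, rng)
    repertoire = []
    for _ in range(n_sequences):
        length = sample_key(sequence_length_probabilities, rng)
        repertoire.append("".join(rng.choice(AMINO_ACIDS) for _ in range(length)))
    return repertoire


rng = random.Random(0)
repertoires = [
    random_repertoire({10: 0.5, 20: 0.5}, {10: 0.5, 12: 0.5}, rng)
    for _ in range(100)  # corresponds to repertoire_count
]
print(len(repertoires), {len(r) for r in repertoires})
```

The repertoire size is drawn once per repertoire, while the sequence length is drawn once per sequence, which mirrors how the two mappings are described above.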

RandomSequenceDataset

Returns a SequenceDataset consisting of randomly generated sequences, which can be used for benchmarking purposes. The sequences consist of uniformly chosen amino acids or nucleotides.

Arguments:

sequence_count (int): The number of sequences the SequenceDataset should contain.

length_probabilities (dict): A mapping where the keys correspond to different sequence lengths and the values are the probabilities for choosing each sequence length. For example, to create a random SequenceDataset where 40% of the sequences would be of length 10, and 60% of the sequences would have length 12, this mapping would need to be specified:

10: 0.4
12: 0.6

labels (dict): A mapping that specifies randomly chosen labels to be assigned to the sequences. One or multiple labels can be specified here. The keys of this mapping are the labels, and the values consist of another mapping between label classes and their probabilities. For example, to create a random SequenceDataset with the label cmv_epitope where 70% of the sequences have class binding and the remaining 30% have class not_binding, the following mapping should be specified:

cmv_epitope:
    binding: 0.7
    not_binding: 0.3

YAML specification:

my_random_dataset:
    format: RandomSequenceDataset
    params:
        sequence_count: 100 # number of random sequences to generate
        length_probabilities:
            14: 0.8 # 80% of all generated sequences will have length 14
            15: 0.2 # 20% of all generated sequences will have length 15
        labels:
            epitope1: # label name
                True: 0.5 # 50% of the sequences will have class True
                False: 0.5 # 50% of the sequences will have class False
            epitope2: # next label with classes that will be assigned to sequences independently of the previous label or other parameters
                1: 0.3 # 30% of the generated sequences will have class 1
                0: 0.7 # 70% of the generated sequences will have class 0
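The statement that each label is assigned independently of the other labels can be sketched in Python as follows. This is an illustration under assumptions, not immuneML's code; the function name sample_labels is hypothetical.

```python
import random


def sample_labels(labels, rng):
    """Draw one class per label; each label is sampled independently of the others."""
    assigned = {}
    for label_name, class_probabilities in labels.items():
        classes = list(class_probabilities)
        weights = list(class_probabilities.values())
        assigned[label_name] = rng.choices(classes, weights=weights, k=1)[0]
    return assigned


labels = {
    "epitope1": {True: 0.5, False: 0.5},
    "epitope2": {1: 0.3, 0: 0.7},
}
rng = random.Random(42)
# One label dictionary per generated sequence (sequence_count = 100).
sequence_labels = [sample_labels(labels, rng) for _ in range(100)]
print(sequence_labels[0])
```

Because every label is drawn separately, a sequence with class True for epitope1 is no more or less likely to have class 1 for epitope2.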

SingleLineReceptor

Imports data from a tabular file (where each line contains a pair of immune receptor sequences) into a ReceptorDataset. If you instead want to import a ReceptorDataset from a tabular file that contains one receptor sequence per line, see Generic import.

Arguments:

path (str): Required parameter. This is the path to a directory with files to import.

receptor_chains (str): Required parameter. Determines which pair of chains to import for each Receptor. Valid values for receptor_chains are: TRA_TRB, TRG_TRD, IGH_IGL, IGH_IGK.

import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.

import_empty_aa_sequences (bool): imports sequences which have an empty amino acid sequence field; can be True or False; for analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.

region_type (str): Which part of the sequence to import. When IMGT_CDR3 is specified, immuneML assumes the IMGT junction (including leading C and trailing Y/F amino acids) is used in the input file, and the first and last amino acids will be removed from the sequences to retrieve the IMGT CDR3 sequence. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.

column_mapping (dict): A mapping where the keys are the column names in the input file, and the values must be mapped to the following fields: <chain>_amino_acid_sequence, <chain>_nucleotide_sequence, <chain>_v_gene, <chain>_j_gene, identifier, epitope. The possible names that can be filled in for <chain> are: ALPHA, BETA, GAMMA, DELTA, HEAVY, LIGHT, KAPPA. Any column name other than the sequence, v/j genes and identifier will be set as a metadata field on the Receptors, and can subsequently be used as a label in immuneML instructions. For TCR alpha-beta receptor import, a column mapping could for example look like this:

cdr3_a_aa: alpha_amino_acid_sequence
cdr3_b_aa: beta_amino_acid_sequence
cdr3_a_nucseq: alpha_nucleotide_sequence
cdr3_b_nucseq: beta_nucleotide_sequence
v_a_gene: alpha_v_gene
v_b_gene: beta_v_gene
j_a_gene: alpha_j_gene
j_b_gene: beta_j_gene
clone_id: identifier
epitope: epitope # metadata field

column_mapping_synonyms (dict): This is a column mapping that can be used if a column could have alternative names. The formatting is the same as column_mapping. If some columns specified in column_mapping are not found in the file, the columns specified in column_mapping_synonyms are instead attempted to be loaded.

columns_to_load (list): Optional; specifies which columns to load from the input file. This may be useful if the input files contain many unused columns. If no value is specified, all columns are loaded.

separator (str): Required parameter. Column separator, for example “\t” or “,”.

organism (str): The organism that the receptors came from. This will be set as a parameter in the ReceptorDataset object.

YAML specification:

my_receptor_dataset:
    format: SingleLineReceptor
    params:
        path: path/to/files/
        receptor_chains: TRA_TRB # what chain pair to import
        separator: "\t" # column separator
        import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
        import_empty_aa_sequences: False # filter out sequences if they don't have sequence_aa set
        region_type: IMGT_CDR3 # what part of the sequence to import
        columns_to_load: # which subset of columns to load from the file
        - subject
        - epitope
        - count
        - v_a_gene
        - j_a_gene
        - cdr3_a_aa
        - v_b_gene
        - j_b_gene
        - cdr3_b_aa
        - clone_id
        column_mapping: # column mapping file: immuneML
            cdr3_a_aa: alpha_amino_acid_sequence
            cdr3_b_aa: beta_amino_acid_sequence
            cdr3_a_nucseq: alpha_nucleotide_sequence
            cdr3_b_nucseq: beta_nucleotide_sequence
            v_a_gene: alpha_v_gene
            v_b_gene: beta_v_gene
            j_a_gene: alpha_j_gene
            j_b_gene: beta_j_gene
            clone_id: identifier
            epitope: epitope
        organism: mouse
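The effect of column_mapping together with region_type: IMGT_CDR3 can be sketched in Python. This is a simplified illustration, not the importer itself: the input rows are made up, import_rows and junction_to_cdr3 are hypothetical names, and unmapped columns are simply dropped here rather than kept as metadata.

```python
import csv
import io

# Hypothetical tab-separated input with IMGT junction sequences (made-up data).
raw = (
    "clone_id\tcdr3_a_aa\tcdr3_b_aa\tepitope\n"
    "c1\tCAVSDGGSQGNLIF\tCASSLGQAYEQYF\tGILGFVFTL\n"
)

column_mapping = {
    "cdr3_a_aa": "alpha_amino_acid_sequence",
    "cdr3_b_aa": "beta_amino_acid_sequence",
    "clone_id": "identifier",
    "epitope": "epitope",
}


def junction_to_cdr3(junction):
    """Drop the first and last amino acid of an IMGT junction to obtain the IMGT CDR3."""
    return junction[1:-1]


def import_rows(text, separator, column_mapping, region_type):
    rows = []
    for record in csv.DictReader(io.StringIO(text), delimiter=separator):
        # Rename only the mapped columns; unmapped columns are ignored in this sketch.
        row = {new: record[old] for old, new in column_mapping.items() if old in record}
        if region_type == "IMGT_CDR3":
            for field in row:
                if field.endswith("_amino_acid_sequence"):
                    row[field] = junction_to_cdr3(row[field])
        rows.append(row)
    return rows


receptors = import_rows(raw, "\t", column_mapping, "IMGT_CDR3")
print(receptors[0]["alpha_amino_acid_sequence"])  # AVSDGGSQGNLI
```

If the input file already contains CDR3 sequences without the leading C and trailing F/W, trimming them again would remove real residues, which is why region_type should match the file contents.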

TenxGenomics

Imports data from the 10x Genomics Cell Ranger analysis pipeline into a Repertoire-, Sequence- or ReceptorDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets or ReceptorDatasets should be used when predicting values for unpaired (single-chain) and paired immune receptors respectively, like antigen specificity.

The files that should be used as input are named ‘Clonotype consensus annotations (CSV)’, as described here: https://support.10xgenomics.com/single-cell-vdj/software/pipelines/latest/output/annotation#consensus

Note: by default the 10xGenomics field ‘umis’ is used to define the immuneML field counts. If you want to use the 10x Genomics field reads instead, this can be changed in the column_mapping (set reads: counts). Furthermore, the 10xGenomics field clonotype_id is used for the immuneML field cell_id.

Arguments:

path (str): For RepertoireDatasets, this is the path to a directory with 10xGenomics files to import. For Sequence- or ReceptorDatasets this path may either be the path to the file to import, or the path to a folder containing one or more files with .tsv, .csv or .txt extensions. By default path is set to the current working directory.

is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset or ReceptorDataset. By default, is_repertoire is set to True.

metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. For setting Sequence- or ReceptorDataset metadata, metadata_file is ignored, see metadata_column_mapping instead.

paired (bool): Required for Sequence- or ReceptorDatasets. This parameter determines whether to import a SequenceDataset (paired = False) or a ReceptorDataset (paired = True). In a ReceptorDataset, two sequences with chain types specified by receptor_chains are paired together based on the identifier given in the 10xGenomics column named ‘clonotype_id’.

receptor_chains (str): Required for ReceptorDatasets. Determines which pair of chains to import for each Receptor. Valid values are TRA_TRB, TRG_TRD, IGH_IGL, IGH_IGK. If receptor_chains is not provided, the chain pair is automatically detected (only one chain pair type allowed per repertoire).

import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.

import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.

import_empty_aa_sequences (bool): imports sequences which have an empty amino acid sequence field; can be True or False; for analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.

region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as 10xGenomics uses IMGT junction as CDR3. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.

column_mapping (dict): A mapping from 10xGenomics column names to immuneML’s internal data representation. A custom column mapping can be specified here if necessary (for example, to add additional data fields if they are present in the 10xGenomics file, or to use alternative column names). Valid immuneML fields that can be specified here are [‘sequence_aas’, ‘sequences’, ‘v_genes’, ‘j_genes’, ‘v_subgroups’, ‘j_subgroups’, ‘v_alleles’, ‘j_alleles’, ‘chains’, ‘counts’, ‘frame_types’, ‘sequence_identifiers’, ‘cell_ids’]. For 10xGenomics, this is by default set to:

cdr3: sequence_aas
cdr3_nt: sequences
v_gene: v_genes
j_gene: j_genes
umis: counts
chain: chains
clonotype_id: cell_ids
consensus_id: sequence_identifiers

column_mapping_synonyms (dict): This is a column mapping that can be used if a column could have alternative names. The formatting is the same as column_mapping. If some columns specified in column_mapping are not found in the file, the columns specified in column_mapping_synonyms are instead attempted to be loaded. For 10xGenomics format, there is no default column_mapping_synonyms.

metadata_column_mapping (dict): Specifies metadata for Sequence- and ReceptorDatasets. This should specify a mapping similar to column_mapping where keys are 10xGenomics column names and values are the names that are internally used in immuneML as metadata fields. These metadata fields can be used as prediction labels for Sequence- and ReceptorDatasets. This parameter can also be used to specify sequence-level metadata columns for RepertoireDatasets, which can be used by reports. To set prediction label metadata for RepertoireDatasets, see metadata_file instead. For 10xGenomics format, there is no default metadata_column_mapping.

separator (str): Column separator, for 10xGenomics this is by default “,”.

YAML specification:

my_10x_dataset:
    format: 10xGenomics
    params:
        path: path/to/files/
        is_repertoire: True # whether to import a RepertoireDataset
        metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
        paired: False # whether to import SequenceDataset (False) or ReceptorDataset (True) when is_repertoire = False
        receptor_chains: TRA_TRB # what chain pair to import for a ReceptorDataset
        metadata_column_mapping: # metadata column mapping 10xGenomics: immuneML for SequenceDataset
            tenx_column_name1: metadata_label1
            tenx_column_name2: metadata_label2
        import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
        import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
        import_empty_aa_sequences: False # filter out sequences if they don't have sequence_aa set
        # Optional fields with 10xGenomics-specific defaults, only change when different behavior is required:
        separator: "," # column separator
        region_type: IMGT_CDR3 # what part of the sequence to import
        column_mapping: # column mapping 10xGenomics: immuneML
            cdr3: sequence_aas
            cdr3_nt: sequences
            v_gene: v_genes
            j_gene: j_genes
            umis: counts
            chain: chains
            clonotype_id: cell_ids
            consensus_id: sequence_identifiers
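The pairing of chains into receptors based on clonotype_id (mapped to cell_ids) can be pictured with the following Python sketch. It rests on assumptions: the rows are made up, pair_chains is a hypothetical name, and clonotypes that lack one of the two requested chains are simply skipped here, which may differ from what immuneML does.

```python
from collections import defaultdict

# Hypothetical rows, already renamed with the default column_mapping (made-up data).
rows = [
    {"cell_ids": "clonotype1", "chains": "TRA", "sequence_aas": "AVSDGGSQGNLI", "counts": 4},
    {"cell_ids": "clonotype1", "chains": "TRB", "sequence_aas": "ASSLGQAYEQY", "counts": 9},
    {"cell_ids": "clonotype2", "chains": "TRB", "sequence_aas": "ASSIRSSYEQY", "counts": 2},
]


def pair_chains(rows, receptor_chains="TRA_TRB"):
    """Group rows by cell_ids and keep the groups that contain both requested chains."""
    first, second = receptor_chains.split("_")
    by_cell = defaultdict(dict)
    for row in rows:
        by_cell[row["cell_ids"]][row["chains"]] = row
    return {
        cell_id: (chains[first], chains[second])
        for cell_id, chains in by_cell.items()
        if first in chains and second in chains
    }


receptors = pair_chains(rows)
print(sorted(receptors))  # ['clonotype1']
```

With paired: False the same rows would instead be treated as three independent sequences.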

VDJdb

Imports data in VDJdb format into a Repertoire-, Sequence- or ReceptorDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets or ReceptorDatasets should be used when predicting values for unpaired (single-chain) and paired immune receptors respectively, like antigen specificity.

Arguments:

path (str): For RepertoireDatasets, this is the path to a directory with VDJdb files to import. For Sequence- or ReceptorDatasets this path may either be the path to the file to import, or the path to a folder containing one or more files with .tsv, .csv or .txt extensions. By default path is set to the current working directory.

is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset or ReceptorDataset. By default, is_repertoire is set to True.

metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. For setting Sequence- or ReceptorDataset metadata, metadata_file is ignored, see metadata_column_mapping instead.

paired (bool): Required for Sequence- or ReceptorDatasets. This parameter determines whether to import a SequenceDataset (paired = False) or a ReceptorDataset (paired = True). In a ReceptorDataset, two sequences with chain types specified by receptor_chains are paired together based on the identifier given in the VDJdb column named ‘complex.id’.

receptor_chains (str): Required for ReceptorDatasets. Determines which pair of chains to import for each Receptor. Valid values are TRA_TRB, TRG_TRD, IGH_IGL, IGH_IGK. If receptor_chains is not provided, the chain pair is automatically detected (only one chain pair type allowed per repertoire).

import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.

import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.

import_empty_aa_sequences (bool): imports sequences which have an empty amino acid sequence field; can be True or False; for analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.

region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as VDJdb uses IMGT junction as CDR3. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.

column_mapping (dict): A mapping from VDJdb column names to immuneML’s internal data representation. A custom column mapping can be specified here if necessary (for example, to add additional data fields if they are present in the VDJdb file, or to use alternative column names). Valid immuneML fields that can be specified here are [‘sequence_aas’, ‘sequences’, ‘v_genes’, ‘j_genes’, ‘v_subgroups’, ‘j_subgroups’, ‘v_alleles’, ‘j_alleles’, ‘chains’, ‘counts’, ‘frame_types’, ‘sequence_identifiers’, ‘cell_ids’]. For VDJdb, this is by default set to:

V: v_alleles
J: j_alleles
CDR3: sequence_aas
complex.id: sequence_identifiers
Gene: chains

column_mapping_synonyms (dict): This is a column mapping that can be used if a column could have alternative names. The formatting is the same as column_mapping. If some columns specified in column_mapping are not found in the file, the columns specified in column_mapping_synonyms are instead attempted to be loaded. For VDJdb format, there is no default column_mapping_synonyms.

metadata_column_mapping (dict): Specifies metadata for Sequence- and ReceptorDatasets. This should specify a mapping similar to column_mapping where keys are VDJdb column names and values are the names that are internally used in immuneML as metadata fields. This means that epitope, epitope_gene and epitope_species can be used as prediction labels for Sequence- and ReceptorDatasets. This parameter can also be used to specify sequence-level metadata columns for RepertoireDatasets, which can be used by reports. To set prediction label metadata for RepertoireDatasets, see metadata_file instead. For VDJdb format, this parameter is by default set to:

Epitope: epitope
Epitope gene: epitope_gene
Epitope species: epitope_species

separator (str): Column separator, for VDJdb this is by default “\t”.

YAML specification:

my_vdjdb_dataset:
    format: VDJdb
    params:
        path: path/to/files/
        is_repertoire: True # whether to import a RepertoireDataset
        metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
        paired: False # whether to import SequenceDataset (False) or ReceptorDataset (True) when is_repertoire = False
        receptor_chains: TRA_TRB # what chain pair to import for a ReceptorDataset
        import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
        import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
        import_empty_aa_sequences: False # filter out sequences if they don't have sequence_aa set
        # Optional fields with VDJdb-specific defaults, only change when different behavior is required:
        separator: "\t" # column separator
        region_type: IMGT_CDR3 # what part of the sequence to import
        column_mapping: # column mapping VDJdb: immuneML
            V: v_genes
            J: j_genes
            CDR3: sequence_aas
            complex.id: sequence_identifiers
            Gene: chains
        metadata_column_mapping: # metadata column mapping VDJdb: immuneML
            Epitope: epitope
            Epitope gene: epitope_gene
            Epitope species: epitope_species
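The interplay of import_illegal_characters, import_empty_nt_sequences and import_empty_aa_sequences can be sketched as a single filter in Python. This is an illustration only: keep_sequence and its sequence_type argument are hypothetical names, and the nucleotide alphabet is assumed to be A, C, G and T.

```python
AMINO_ACID_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY*")  # amino acids plus stop codon '*'
NUCLEOTIDE_ALPHABET = set("ACGT")  # assumption for this sketch


def keep_sequence(aa, nt, sequence_type="amino_acid",
                  import_illegal_characters=False,
                  import_empty_nt_sequences=True,
                  import_empty_aa_sequences=False):
    """Return True if a row passes the three import filters described above."""
    if not aa and not import_empty_aa_sequences:
        return False
    if not nt and not import_empty_nt_sequences:
        return False
    if not import_illegal_characters:
        # Only the sequence type of interest is checked for illegal characters.
        if sequence_type == "amino_acid":
            return set(aa) <= AMINO_ACID_ALPHABET
        return set(nt) <= NUCLEOTIDE_ALPHABET
    return True


print(keep_sequence("ASSLGQAYEQY", ""))   # True: an empty nt sequence is allowed by default
print(keep_sequence("", "GCCAGC"))        # False: an empty aa sequence is filtered by default
print(keep_sequence("ASSXGQ", "GCNAGC"))  # False: 'X' is not in the amino acid alphabet
```

In nucleotide mode the same row with a clean nucleotide sequence would be kept even though its amino acid sequence contains 'X', because only the sequence type of interest is checked.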

Simulation

Motif

Class describing motifs, where each motif is defined by a seed and a way of creating specific instances of the motif (instantiation_strategy).

When instantiation_strategy is set, specific motif instances will be produced by calling the instantiate_motif(seed) method of instantiation_strategy.

Arguments:

seed (str): An amino acid sequence that represents the basic motif seed. All implanted motifs correspond to the seed, or a modified version thereof, as specified in its instantiation strategy. If this argument is set, seed_chain1 and seed_chain2 arguments are not used.

instantiation (MotifInstantiationStrategy): Which strategy to use for implanting the seed. Valid values are: GappedKmer. In the YAML specification, this can be given as a string with one of these values, in which case the default parameters will be used. Alternatively, instantiation can be specified with parameters, as in the example YAML specification below. For the detailed list of parameters, see the specific instantiation strategies below.

seed_chain1 (str): in case when representing motifs for paired chain data, it is possible to define a motif seed per chain; if this parameter is set, the generated motif instances will include a motif instance for both chains; for more details on how it works see seed argument above. Used only if the seed argument is not set.

seed_chain2 (str): used for paired chain data, for the other receptor chain; for more details on how it works see seed argument. This argument is used only if the seed argument is not set.

name_chain1: name of the first chain if paired receptor data are simulated. The value should be an instance of Chain. This argument is used only if the seed argument is not set.

name_chain2: name of the second chain if paired receptor data are simulated. The value should be an instance of Chain. This argument is used only if the seed argument is not set.

YAML specification:

motifs:
    # examples for single chain receptor data
    my_simple_motif: # this will be the identifier of the motif
        seed: AAA
        instantiation: GappedKmer
    my_gapped_motif:
        seed: AA/A
        instantiation:
            GappedKmer:
                min_gap: 1
                max_gap: 2
    # examples for paired chain receptor data
    my_paired_motif:
        seed_chain1: AAA # seed for chain1 or chain2 can optionally include gap, same as for single chain receptor data
        name_chain1: ALPHA # alpha chain of TCR
        seed_chain2: CCC
        name_chain2: BETA # beta chain of TCR
        instantiation: GappedKmer # same as for single chain receptor data

GappedKmer

Creates a motif instance from a given seed and additional optional parameters. Currently, at most a single gap can be specified in the sequence.

Arguments:

min_gap (int): The minimum gap length, in case the original seed contains a gap.

max_gap (int): The maximum gap length, in case the original seed contains a gap.

hamming_distance_probabilities (dict): The probability of modifying the given seed with each number of modifications. The keys represent the number of modifications (hamming distance) between the original seed and the implanted motif, and the values represent the probabilities for the respective number of modifications. For example {0: 0.7, 1: 0.3} means that 30% of the time one position will be modified, and the remaining 70% of the time the motif will remain unmodified with respect to the seed. The values of hamming_distance_probabilities must sum to 1.

position_weights (dict): A dictionary containing the relative probabilities of choosing each position for hamming distance modification. The keys represent the position in the seed, where counting starts at 0. If the index of a gap is specified in position_weights, it will be removed. The values represent the relative probabilities for modifying each position when it gets selected for modification. For example {0: 0.6, 1: 0, 2: 0.4} means that when a sequence is selected for a modification (as specified in hamming_distance_probabilities), then 60% of the time the amino acid at index 0 is modified, and the remaining 40% of the time the amino acid at index 2. If the values of position_weights do not sum to 1, the remainder will be redistributed over all positions, including those not specified.

alphabet_weights (dict): A dictionary describing the relative probabilities of choosing each amino acid for hamming distance modification. The keys represent the amino acids and the values the relative probabilities for choosing this amino acid. If the values of alphabet_weights do not sum to 1, the remainder will be redistributed over all possible amino acids, including those not specified.

YAML specification:

GappedKmer:
    min_gap: 1
    max_gap: 2
    hamming_distance_probabilities:
        0: 0.7
        1: 0.3
    position_weights: # note that index 2, the position of the gap, is excluded from position_weights
        0: 1
        1: 0
        3: 0
    alphabet_weights:
        A: 0.2
        C: 0.2
        D: 0.4
        E: 0.2
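How these parameters could combine into a single motif instance is sketched below in Python. This is a simplified illustration rather than immuneML's implementation. It assumes that the gap length is drawn uniformly between min_gap and max_gap, that each position is modified at most once, and that a replacement letter always differs from the original one. The names complete_weights, weighted_pick and instantiate are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def complete_weights(weights, keys):
    """Spread the probability mass missing from `weights` evenly over all `keys`."""
    remainder = max(0.0, 1.0 - sum(weights.get(k, 0.0) for k in keys))
    return {k: weights.get(k, 0.0) + remainder / len(keys) for k in keys}


def weighted_pick(weights, rng):
    keys = list(weights)
    return rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]


def instantiate(seed, params, rng):
    """Return (motif_instance, gap_length) for one seed such as 'AA/A'."""
    letters = list(seed)
    gap_index = seed.find("/")  # -1 when the seed has no gap
    gap_length = rng.randint(params["min_gap"], params["max_gap"]) if gap_index >= 0 else 0

    # Positions that may be modified: every seed index except the gap itself.
    positions = [i for i in range(len(seed)) if i != gap_index]
    position_weights = complete_weights(params.get("position_weights", {}), positions)
    alphabet_weights = complete_weights(params.get("alphabet_weights", {}), list(AMINO_ACIDS))

    distance = weighted_pick(params["hamming_distance_probabilities"], rng)
    for _ in range(distance):
        candidates = {p: w for p, w in position_weights.items() if w > 0}
        if not candidates:
            break  # no modifiable position left
        position = weighted_pick(candidates, rng)
        position_weights[position] = 0.0  # each position is modified at most once
        # Assumption: the new letter differs from the old one, so the distance is exact.
        choices = {a: w for a, w in alphabet_weights.items() if a != letters[position] and w > 0}
        if choices:
            letters[position] = weighted_pick(choices, rng)
    return "".join(letters), gap_length


params = {
    "min_gap": 1, "max_gap": 2,
    "hamming_distance_probabilities": {0: 0.7, 1: 0.3},
    "position_weights": {0: 1, 1: 0, 3: 0},
    "alphabet_weights": {"A": 0.2, "C": 0.2, "D": 0.4, "E": 0.2},
}
rng = random.Random(0)
print(instantiate("AA/A", params, rng))
```

With the weights from the YAML example, only index 0 can ever change, and when it does it becomes C, D or E, since A is the original letter and all other amino acids have weight 0.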

Signal

This class represents the signal that will be implanted during a Simulation. A signal is represented by a list of motifs, and an implanting strategy.

A signal is associated with a metadata label, which is assigned to a receptor or repertoire, for example antigen-specific or disease-associated (receptor), or diseased (repertoire).

Arguments:

motifs (list): A list of the motifs associated with this signal.

implanting (SignalImplantingStrategy): The strategy that is used to decide in which sequences the motifs should be implanted, and how.

Valid values are: HealthySequence, FullSequence, Receptor

YAML specification:

signals:
    my_signal:
        motifs:
            - my_simple_motif
            - my_gapped_motif
        implanting: HealthySequence
        sequence_position_weights:
            109: 0.5
            110: 0.5

HealthySequence

This class represents a SignalImplantingStrategy where signals will be implanted in ‘healthy sequences’, meaning sequences in which no signal has been implanted previously. This ensures that there is only one signal per receptor sequence.

If, for the given number of sequences in the repertoire and the repertoire implanting rate, the total number of sequences for implanting turns out to be less than 1 (e.g., for 12000 sequences and a repertoire implanting rate of 0.00005, the signal would have to be implanted in 0.6 sequences), the signal will not be implanted in that repertoire. In that case, a warning is raised that includes the repertoire identifier, the repertoire implanting rate and the number of sequences in the repertoire.

Arguments:

implanting: name of the implanting strategy, here HealthySequence

sequence_position_weights (dict): A dictionary describing the relative weights for implanting a signal at each given IMGT position in the receptor sequence. If sequence_position_weights are not set, then SequenceImplantingStrategy will make all of the positions equally likely for each receptor sequence.

implanting_computation (str): defines how to determine the number of sequences in a repertoire in which to implant the signal. It relies on repertoire_implanting_rate, and is useful when the product of the repertoire implanting rate and the number of sequences is not an integer. If implanting_computation is set to ‘round’, the number of sequences for implanting in a repertoire will be the product of the repertoire implanting rate and the number of sequences in the repertoire, rounded to the nearest integer (e.g., if the product is 1.2, the signal will be implanted in one sequence only). If implanting_computation is set to ‘Poisson’, the number of sequences for implanting will be sampled from a Poisson distribution whose lambda parameter is the repertoire implanting rate multiplied by the number of sequences in the repertoire.
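The two options for implanting_computation can be compared with a short Python sketch. It is a sketch under assumptions, not immuneML's code: sequences_to_implant and sample_poisson are hypothetical names, and the Poisson sampler uses Knuth's multiplication method, which is only practical for small lambda values.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw from a Poisson distribution with Knuth's multiplication method (small lam only)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sequences_to_implant(sequence_count, repertoire_implanting_rate, implanting_computation, rng):
    """Number of sequences in one repertoire that receive the signal."""
    expected = sequence_count * repertoire_implanting_rate
    if implanting_computation == "round":
        return round(expected)  # note: Python rounds exact .5 ties to the even integer
    if implanting_computation == "Poisson":
        return sample_poisson(expected, rng)
    raise ValueError(f"unknown implanting_computation: {implanting_computation}")


rng = random.Random(7)
print(sequences_to_implant(120, 0.01, "round", rng))    # 1, since 120 * 0.01 = 1.2
print(sequences_to_implant(120, 0.01, "Poisson", rng))  # varies from call to call
```

With ‘round’ every repertoire of the same size receives the same number of implanted sequences, whereas ‘Poisson’ lets the number vary between repertoires while keeping the expected value at the product of rate and size.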

YAML specification:

motifs:
    my_motif:
        ...

signals:
    my_signal:
        motifs:
            - my_motif
            - ...
        implanting: HealthySequence
        implanting_computation: Poisson
        sequence_position_weights:
            109: 1
            110: 2
            111: 5
            112: 1

FullSequence

This class represents a SignalImplantingStrategy where signals will be implanted in the repertoire by replacing a proportion of the sequences, given by repertoire_implanting_rate, with sequences generated from the motifs of the signal. Motifs here cannot include gaps; the motif instances are the full sequences and will be a part of the repertoire.

Note: when providing the sequence to be implanted, check if the import setting regarding the sequence type (CDR3 vs IMGT junction) matches the sequence to be implanted. For example, if importing would convert junction sequences to CDR3, but the sequence specified here for implanting would be the junction, the results of the simulation could be inconsistent.

Arguments: this signal implanting strategy has no arguments.

YAML specification:

motifs:
    my_motif: # cannot include gaps
        ...

signals:
    my_signal:
        motifs:
            - my_motif
        implanting: FullSequence

Receptor

This class represents a SignalImplantingStrategy where signals will be implanted in both chains of immune receptors. This class should be used only when simulating paired chain data.

Arguments:

implanting: name of the implanting strategy, here Receptor

sequence_position_weights (dict): A dictionary describing the relative weights for implanting a signal at each given IMGT position in the receptor sequence. If sequence_position_weights are not set, then SequenceImplantingStrategy will make all of the positions equally likely for each receptor sequence.

YAML specification:

motifs:
    my_motif:
        ...

signals:
    my_signal:
        motifs:
            - my_motif
            - ...
        implanting: Receptor
        sequence_position_weights:
            109: 1
            110: 2
            111: 5
            112: 1
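Because sequence_position_weights are relative weights, they can be read as probabilities after normalization. The following Python sketch illustrates this reading; the function name normalize is hypothetical and the sketch does not reproduce how immuneML maps IMGT positions onto a concrete sequence.

```python
import random


def normalize(weights):
    """Turn relative weights into probabilities that sum to 1."""
    total = sum(weights.values())
    return {position: weight / total for position, weight in weights.items()}


sequence_position_weights = {109: 1, 110: 2, 111: 5, 112: 1}
probabilities = normalize(sequence_position_weights)
print(probabilities[111])  # 5 / 9, about 0.556

# Draw one IMGT position according to the normalized weights.
rng = random.Random(1)
positions = list(probabilities)
start = rng.choices(positions, weights=[probabilities[p] for p in positions], k=1)[0]
print(start)
```

In this example, IMGT position 111 is chosen five times as often as position 109 or 112.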

Implanting

When performing a Simulation, one or more implantings can be specified. An implanting represents a set of signals which are implanted in a RepertoireDataset with given rates.

Multiple implantings may be specified in one simulation. In this case, each implanting will only affect its own partition of the dataset, so each repertoire can only receive implanted signals from one implanting. This way, implantings can be used to ensure signals do not overlap (one implanting per signal), or to ensure signals always occur together (multiple signals per implanting).

Arguments:

signals (list): The list of Signal objects to be implanted in a subset of the repertoires in a RepertoireDataset. When multiple signals are specified, this means that all of these signals are implanted in the same repertoires in a RepertoireDataset, although they may not be implanted in the same sequences within those repertoires (this depends on the SignalImplantingStrategy).

dataset_implanting_rate (float): The proportion of repertoires in the RepertoireDataset in which the signals should be implanted. When specifying multiple implantings, the sum of all dataset_implanting_rates should not exceed 1.

repertoire_implanting_rate (float): The proportion of sequences in a Repertoire where a motif associated with one of the signals should be implanted.

is_noise (bool): indicates whether the implanting should be regarded as noise; if it is True, the signals will be implanted as specified, but the repertoire/receptor in question will have negative class.

YAML specification:

simulations: # definitions of simulations should be under key simulations in the definitions part of the specification
    # one simulation with multiple implanting objects, a part of definition section
    my_simulation:
        my_implanting_1:
            signals:
                - my_signal
            dataset_implanting_rate: 0.5
            repertoire_implanting_rate: 0.25
        my_implanting_2:
            signals:
                - my_signal
            dataset_implanting_rate: 0.2
            repertoire_implanting_rate: 0.75

    # a simulation where the signal is present in the negative class as well (e.g. due to wrong labels, or by chance)
    noisy_simulation:
        positive_class_implanting:
            signals:
                - my_signal
            dataset_implanting_rate: 0.5
            repertoire_implanting_rate: 0.1 # 10% of the repertoire includes the signal in the positive class
        negative_class_implanting:
            signals:
                - my_signal
            is_noise: True # means that signal will be implanted, but the label will have negative class
            dataset_implanting_rate: 0.5
            repertoire_implanting_rate: 0.01 # 1% of negative class repertoires has the signal

    # in case of defining implanting for paired chain immune receptor data the simulation with implanting objects would be:
    my_receptor_simulation:
        my_receptor_implanting_1: # repertoire_implanting_rate is omitted in this case, as it is not applicable
            signals:
                - my_receptor_signal
            dataset_implanting_rate: 0.4 # 40% of the receptors will have signal my_receptor_signal implanted and 60% will not
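
The partitioning described above (each implanting affects only its own share of the dataset, and the rates must not sum to more than 1) can be sketched as follows. This is an illustration under assumptions: partition_repertoires is a hypothetical helper, and assigning consecutive index ranges with rounding down is a simplification that may differ from how immuneML selects repertoires.

```python
import math


# Hypothetical helper (not part of immuneML): give every implanting its own,
# non-overlapping block of repertoire indices.
def partition_repertoires(n_repertoires, implanting_rates):
    if sum(implanting_rates.values()) > 1 + 1e-9:
        raise ValueError("dataset_implanting_rates must not sum to more than 1")
    partitions, start = {}, 0
    for name, rate in implanting_rates.items():
        # Round down; the small epsilon guards against float noise such as 20.000000000000004.
        size = math.floor(n_repertoires * rate + 1e-9)
        partitions[name] = range(start, start + size)
        start += size
    untouched = range(start, n_repertoires)  # repertoires that receive no signal
    return partitions, untouched


partitions, untouched = partition_repertoires(
    100, {"my_implanting_1": 0.5, "my_implanting_2": 0.2}
)
```

With 100 repertoires and the rates of my_simulation, 50 repertoires fall to the first implanting, 20 to the second, and 30 remain without any implanted signal.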

Encodings

AtchleyKmer

Represents a repertoire through Atchley factors and relative abundance of k-mers. Should be used in combination with the AtchleyKmerMILClassifier.

For more details, see the original publication: Ostmeyer J, Christley S, Toby IT, Cowell LG. Biophysicochemical motifs in T cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocytes and adjacent healthy tissue. Cancer Res. Published online January 1, 2019:canres.2292.2018. doi:10.1158/0008-5472.CAN-18-2292.

Note that sequences in the repertoire with length shorter than skip_first_n_aa + skip_last_n_aa + k will not be encoded.

Arguments:

k (int): k-mer length

skip_first_n_aa (int): number of amino acids to remove from the beginning of the receptor sequence

skip_last_n_aa (int): number of amino acids to remove from the end of the receptor sequence

abundance: how to compute abundance term for k-mers; valid values are RELATIVE_ABUNDANCE, TCRB_RELATIVE_ABUNDANCE.

normalize_all_features (bool): when normalizing features to have 0 mean and unit variance, this parameter indicates if the abundance feature should be included in the normalization

YAML specification:

my_encoder:
    AtchleyKmer:
        k: 4
        skip_first_n_aa: 3
        skip_last_n_aa: 3
        abundance: RELATIVE_ABUNDANCE
        normalize_all_features: False
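
The length rule from the note above can be illustrated with a short sketch: a sequence is trimmed at both ends and then split into k-mers, and sequences that are too short yield nothing. The function trimmed_kmers is a hypothetical helper for illustration; the Atchley factor lookup and the abundance term are not shown.

```python
# Hypothetical helper (not part of immuneML): trim the sequence and list its k-mers.
def trimmed_kmers(sequence, k, skip_first_n_aa, skip_last_n_aa):
    # Sequences shorter than skip_first_n_aa + skip_last_n_aa + k are not encoded.
    if len(sequence) < skip_first_n_aa + skip_last_n_aa + k:
        return []
    core = sequence[skip_first_n_aa:len(sequence) - skip_last_n_aa]
    return [core[i:i + k] for i in range(len(core) - k + 1)]


# Parameters from the YAML example above: k=4, skipping 3 amino acids on each side.
kmers = trimmed_kmers("CASSLGQAYEQYF", k=4, skip_first_n_aa=3, skip_last_n_aa=3)
too_short = trimmed_kmers("CASSF", k=4, skip_first_n_aa=3, skip_last_n_aa=3)
```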

CompAIRRDistance

Encodes a given RepertoireDataset as a distance matrix, using the Morisita-Horn distance metric. Internally, CompAIRR is used for fast calculation of overlap between repertoires. This creates a pairwise distance matrix between each of the repertoires. The distance is calculated based on the number of matching receptor chain sequences between the repertoires. This matching may be defined to permit 1 or 2 mismatching amino acid/nucleotide positions and 1 indel in the sequence. Furthermore, matching may or may not include V and J gene information, and sequence frequencies may be included or ignored.

When mismatches (differences and indels) are allowed, the Morisita-Horn similarity may exceed 1. In this case, the Morisita-Horn distance (= 1 - similarity) is set to 0 to avoid negative distance scores.

Arguments:

compairr_path (Path): optional path to the CompAIRR executable. If not given, it is assumed that CompAIRR has been installed such that it can be called directly on the command line with the command ‘compairr’, or that it is located at /usr/local/bin/compairr.

keep_compairr_input (bool): whether to keep the input file that was passed to CompAIRR. This may take a lot of storage space if the input dataset is large. By default the input file is not kept.

differences (int): Number of differences allowed between the sequences of two immune receptor chains; this may be between 0 and 2. By default, differences is 0.

indels (bool): Whether to allow an indel. This is only possible if differences is 1. By default, indels is False.

ignore_counts (bool): Whether to ignore the frequencies of the immune receptor chains. If False, frequencies will be included, meaning the ‘counts’ values for the receptors available in two repertoires are multiplied. If True, only the number of unique overlapping immune receptors (‘clones’) is considered. By default, ignore_counts is False.

ignore_genes (bool): Whether to ignore V and J gene information. If False, the V and J genes between two receptor chains have to match. If True, gene information is ignored. By default, ignore_genes is False.

threads (int): The number of threads to use for parallelization. Default is 8.

YAML specification:

my_distance_encoder:
    CompAIRRDistance:
        compairr_path: optional/path/to/compairr
        differences: 0
        indels: False
        ignore_counts: False
        ignore_genes: False
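
For orientation, the Morisita-Horn distance for exact matches can be sketched in a few lines. This is a simplified illustration using the standard Morisita-Horn formula on plain dictionaries; it does not call CompAIRR, does not handle mismatches or gene matching, and morisita_horn_distance is a hypothetical name. Both repertoires are assumed to be non-empty.

```python
# Hypothetical helper (not part of immuneML or CompAIRR): Morisita-Horn distance
# between two repertoires given as {sequence: count} dictionaries.
def morisita_horn_distance(counts_a, counts_b):
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    # Product of counts for sequences found in both repertoires.
    shared = sum(count * counts_b[seq] for seq, count in counts_a.items() if seq in counts_b)
    simpson_a = sum(c * c for c in counts_a.values()) / total_a ** 2
    simpson_b = sum(c * c for c in counts_b.values()) / total_b ** 2
    similarity = 2 * shared / ((simpson_a + simpson_b) * total_a * total_b)
    # Clip at 0 so that a similarity above 1 never yields a negative distance.
    return max(0.0, 1.0 - similarity)


same = morisita_horn_distance({"CASSF": 2, "CASTF": 3}, {"CASSF": 2, "CASTF": 3})
disjoint = morisita_horn_distance({"CASSF": 2}, {"CAWWF": 5})
```

Setting every count to 1 before calling the function corresponds to the idea behind ignore_counts: True, where only the presence of a sequence matters.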

CompAIRRSequenceAbundance

This encoder works similarly to the SequenceAbundanceEncoder, but internally uses CompAIRR to accelerate core computations.

This encoder represents the repertoires as vectors where:

  • the first element corresponds to the number of label-associated clonotypes

  • the second element is the total number of unique clonotypes

To determine what clonotypes (amino acid sequences with or without matching V/J genes) are label-associated, Fisher’s exact test (one-sided) is used.

The encoder also writes out files containing the contingency table used for Fisher’s exact test, the resulting p-values, and the significantly abundant sequences (use RelevantSequenceExporter to export these sequences in AIRR format).

Reference: Emerson, Ryan O. et al. ‘Immunosequencing Identifies Signatures of Cytomegalovirus Exposure History and HLA-Mediated Effects on the T Cell Repertoire’. Nature Genetics 49, no. 5 (May 2017): 659–65. doi.org/10.1038/ng.3822.

Note: to use this encoder, it is necessary to explicitly define the positive class for the label when defining the label in the instruction. With positive class defined, it can then be determined which sequences are indicative of the positive class. See Reproduction of the CMV status predictions study for an example using SequenceAbundanceEncoder.

Arguments:

p_value_threshold (float): The p value threshold to be used by the statistical test.

compairr_path (Path): optional path to the CompAIRR executable. If not given, it is assumed that CompAIRR has been installed such that it can be called directly on the command line with the command ‘compairr’, or that it is located at /usr/local/bin/compairr.

ignore_genes (bool): Whether to ignore V and J gene information. If False, the V and J genes between two receptor chains have to match. If True, gene information is ignored. By default, ignore_genes is False.

sequence_batch_size (int): The number of sequences in a batch when comparing sequences across repertoires, typically hundreds of thousands. This does not affect the results of the encoding, but may affect the speed and memory usage. The default value is 1000000.

threads (int): The number of threads to use for parallelization. This does not affect the results of the encoding, only the speed. The default number of threads is 8.

keep_temporary_files (bool): whether to keep temporary files, including CompAIRR input, output and log files, and the sequence presence matrix. This may take a lot of storage space if the input dataset is large. By default, temporary files are not kept.

YAML specification:

my_sa_encoding:
    CompAIRRSequenceAbundance:
        compairr_path: optional/path/to/compairr
        p_value_threshold: 0.05
        ignore_genes: False
        threads: 8
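
The two-element encoding described above can be sketched with a one-sided Fisher's exact test implemented from the hypergeometric distribution. The sketch is an illustration under assumptions: the function names and the presence dictionary are hypothetical, clonotypes are plain strings, and a strict comparison (p-value below the threshold) is assumed. It does not use CompAIRR.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided p-value for the 2x2 table [[a, b], [c, d]] (enrichment in the first column)."""
    n_total = a + b + c + d
    with_sequence = a + b   # repertoires containing the clonotype
    positives = a + c       # repertoires of the positive class
    tail = sum(
        comb(positives, x) * comb(n_total - positives, with_sequence - x)
        for x in range(a, min(with_sequence, positives) + 1)
    )
    return tail / comb(n_total, with_sequence)


def encode_repertoire(repertoire, presence, n_positive, n_negative, p_value_threshold):
    """presence maps clonotype -> (positive repertoires with it, negative repertoires with it)."""
    associated = {
        clonotype
        for clonotype, (pos, neg) in presence.items()
        if fisher_exact_greater(pos, neg, n_positive - pos, n_negative - neg) < p_value_threshold
    }
    unique = set(repertoire)
    # First element: label-associated clonotypes; second element: all unique clonotypes.
    return [len(unique & associated), len(unique)]


presence = {"CASSF": (5, 0), "CASTF": (3, 3)}
encoded = encode_repertoire(["CASSF", "CASTF", "CAWWF"], presence, 5, 5, 0.05)
```

In this toy dataset only CASSF, which occurs in all five positive and no negative repertoires, passes the threshold.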

DeepRC

DeepRCEncoder should be used in combination with the DeepRC ML method (DeepRC). This encoder writes the data in a RepertoireDataset to .tsv files. For each repertoire, one .tsv file is created containing the amino acid sequences and the counts. Additionally, one metadata .tsv file is created, which describes the subset of repertoires that is encoded by a given instance of the DeepRCEncoder.

Note: for sequences where count is None, the count value will be set to 1.

YAML specification:

my_deeprc_encoder: DeepRC

Distance

Encodes a given RepertoireDataset as a distance matrix, where the pairwise distance between each of the repertoires is calculated. The distance is calculated based on the presence/absence of elements defined under attributes_to_match. Thus, if attributes_to_match contains only ‘sequence_aas’, the distance between two repertoires is minimal if they contain the same set of sequence_aas, and maximal if none of the sequence_aas are shared between the two repertoires.

Arguments:

distance_metric (DistanceMetricType): The metric used to calculate the distance between two repertoires. Valid values are: JACCARD, MORISITA_HORN. The default distance metric is JACCARD (inverse Jaccard).

sequence_batch_size (int): The number of sequences to be processed at once. Increasing this number increases the memory use. The default value is 1000.

attributes_to_match (list): The attributes to consider when determining whether a sequence is present in both repertoires. Only the fields defined under attributes_to_match will be considered, all other fields are ignored. Valid values are sequence_aas, sequences, v_genes, j_genes, v_subgroups, j_subgroups, v_alleles, j_alleles, chains, counts, region_types, frame_types, sequence_identifiers, cell_ids. The default value is [‘sequence_aas’]

YAML specification:

my_distance_encoder:
    Distance:
        distance_metric: JACCARD
        sequence_batch_size: 1000
        attributes_to_match:
            - sequence_aas
            - v_genes
            - j_genes
            - chains
            - region_types
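
As a rough illustration of the JACCARD option, the sketch below treats "inverse Jaccard" as 1 minus the Jaccard similarity, which is an assumption, so that identical repertoires get distance 0 and repertoires without shared elements get distance 1. Each sequence is represented as a tuple of the attributes listed under attributes_to_match; jaccard_distance is a hypothetical helper.

```python
# Hypothetical helper (not part of immuneML): presence/absence distance
# between two repertoires, each given as an iterable of attribute tuples.
def jaccard_distance(repertoire_a, repertoire_b):
    set_a, set_b = set(repertoire_a), set(repertoire_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty repertoires are treated as identical
    return 1 - len(set_a & set_b) / len(union)


# Tuples of (sequence_aas, v_genes, j_genes) as an example of attributes_to_match.
rep_1 = [("CASSF", "TRBV7-3", "TRBJ1-1"), ("CASTF", "TRBV29-1", "TRBJ2-7")]
rep_2 = [("CASSF", "TRBV7-3", "TRBJ1-1"), ("CAWWF", "TRBV29-1", "TRBJ2-7")]
distance = jaccard_distance(rep_1, rep_2)
```

Adding more attributes makes matching stricter: two sequences with the same amino acids but different V genes count as different elements once v_genes is included.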

EvennessProfile

The EvennessProfileEncoder class encodes a repertoire based on the clonal frequency distribution. The evenness for a given repertoire is defined as follows:

\[^{\alpha} \mathrm{E}(\mathrm{f})=\frac{\left(\sum_{\mathrm{i}=1}^{\mathrm{n}} \mathrm{f}_{\mathrm{i}}^{\alpha}\right)^{\frac{1}{1-\alpha}}}{\mathrm{n}}\]

That is, it is the exponential of Renyi entropy at a given alpha divided by the species richness, or number of unique sequences.

Reference: Greiff et al. (2015). A bioinformatic framework for immune repertoire diversity profiling enables detection of immunological status. Genome Medicine, 7(1), 49. doi.org/10.1186/s13073-015-0169-8

Arguments:

min_alpha (float): minimum alpha value to use

max_alpha (float): maximum alpha value to use

dimension (int): dimension of output evenness profile vector, or the number of alpha values to linearly space between min_alpha and max_alpha

YAML specification:

my_evenness_profile:
    EvennessProfile:
        min_alpha: 0
        max_alpha: 10
        dimension: 51
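
The evenness formula above can be evaluated directly for a list of clone counts. The following is a minimal sketch, not the library implementation: the function names are hypothetical, and the case alpha = 1, where the exponent 1/(1 - alpha) is undefined, is handled by its limit, the exponential of the Shannon entropy.

```python
import math


def evenness(frequencies, alpha):
    """Exponential of the Renyi entropy at alpha, divided by the number of clones."""
    n = len(frequencies)
    if math.isclose(alpha, 1.0):
        # Limit for alpha -> 1: exponential of the Shannon entropy.
        diversity = math.exp(-sum(f * math.log(f) for f in frequencies if f > 0))
    else:
        diversity = sum(f ** alpha for f in frequencies if f > 0) ** (1 / (1 - alpha))
    return diversity / n


def evenness_profile(counts, min_alpha, max_alpha, dimension):
    """Evenness at `dimension` linearly spaced alpha values (dimension >= 2)."""
    total = sum(counts)
    frequencies = [count / total for count in counts]
    step = (max_alpha - min_alpha) / (dimension - 1)
    return [evenness(frequencies, min_alpha + i * step) for i in range(dimension)]


uniform = evenness_profile([5, 5, 5, 5], min_alpha=0, max_alpha=10, dimension=51)
skewed = evenness_profile([97, 1, 1, 1], min_alpha=0, max_alpha=10, dimension=51)
```

A perfectly even repertoire has evenness 1 at every alpha, while a repertoire dominated by one clone drops towards 1/n as alpha grows.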

KmerAbundance

This encoder is related to the SequenceAbundanceEncoder, but identifies label-associated subsequences (k-mers) instead of full label-associated sequences.

This encoder represents the repertoires as vectors where:

  • the first element corresponds to the number of label-associated k-mers found in a repertoire

  • the second element is the total number of unique k-mers per repertoire

The label-associated k-mers are determined based on a one-sided Fisher’s exact test.

The encoder also writes out files containing the contingency table used for Fisher’s exact test, the resulting p-values, and the significantly abundant k-mers.

Note: to use this encoder, it is necessary to explicitly define the positive class for the label when defining the label in the instruction. With the positive class defined, it can then be determined which k-mers are indicative of the positive class. See Reproduction of the CMV status predictions study for an example using SequenceAbundanceEncoder.

Arguments:

p_value_threshold (float): The p value threshold to be used by the statistical test.

sequence_encoding (SequenceEncodingType): The type of k-mers that are used. The simplest (default) sequence_encoding is CONTINUOUS_KMER, which uses contiguous subsequences of length k to represent the k-mers. When gapped k-mers are used (GAPPED_KMER, IMGT_GAPPED_KMER), the k-mers may contain gaps with a size between min_gap and max_gap, and the k-mer length is defined as a combination of k_left and k_right. When IMGT k-mers are used (IMGT_CONTINUOUS_KMER, IMGT_GAPPED_KMER), IMGT positional information is taken into account (i.e. the same sequence in a different position is considered to be a different k-mer).

k (int): Length of the k-mer (number of amino acids) when ungapped k-mers are used. The default value for k is 3.

k_left (int): When gapped k-mers are used, k_left indicates the length of the k-mer left of the gap. The default value for k_left is 1.

k_right (int): Same as k_left, but k_right determines the length of the k-mer right of the gap. The default value for k_right is 1.

min_gap (int): Minimum gap size when gapped k-mers are used. The default value for min_gap is 0.

max_gap (int): Maximum gap size when gapped k-mers are used. The default value for max_gap is 0.

YAML specification:

my_sa_encoding:
    KmerAbundance:
        p_value_threshold: 0.05
        sequence_encoding: CONTINUOUS_KMER
        k: 3

KmerFrequency

The KmerFrequencyEncoder class encodes a repertoire, sequence or receptor by frequencies of k-mers it contains. A k-mer is a sequence of letters of length k into which an immune receptor sequence can be decomposed. K-mers can be defined in different ways, as determined by the sequence_encoding.

Arguments:

sequence_encoding (SequenceEncodingType): The type of k-mers that are used. The simplest sequence_encoding is CONTINUOUS_KMER, which uses contiguous subsequences of length k to represent the k-mers. When gapped k-mers are used (GAPPED_KMER, IMGT_GAPPED_KMER), the k-mers may contain gaps with a size between min_gap and max_gap, and the k-mer length is defined as a combination of k_left and k_right. When IMGT k-mers are used (IMGT_CONTINUOUS_KMER, IMGT_GAPPED_KMER), IMGT positional information is taken into account (i.e. the same sequence in a different position is considered to be a different k-mer). When the identity representation is used (IDENTITY), the k-mers just correspond to the original sequences.

normalization_type (NormalizationType): The way in which the k-mer frequencies should be normalized. The default value for normalization_type is l2.

reads (ReadsType): The reads type signifies whether the counts of the sequences in the repertoire will be taken into account. If UNIQUE, only unique sequences (clonotypes) are encoded, and if ALL, the sequence ‘count’ value is taken into account when determining the k-mer frequency. The default value for reads is unique.

k (int): Length of the k-mer (number of amino acids) when ungapped k-mers are used. The default value for k is 3.

k_left (int): When gapped k-mers are used, k_left indicates the length of the k-mer left of the gap. The default value for k_left is 1.

k_right (int): Same as k_left, but k_right determines the length of the k-mer right of the gap. The default value for k_right is 1.

min_gap (int): Minimum gap size when gapped k-mers are used. The default value for min_gap is 0.

max_gap (int): Maximum gap size when gapped k-mers are used. The default value for max_gap is 0.

sequence_type (SequenceType): Whether to work with nucleotide or amino acid sequences. Amino acid sequences are the default. To work with either sequence type, the sequences of the desired type should be included in the datasets, e.g., listed under ‘columns_to_load’ parameter. By default, both types will be included if available. Valid values are: AMINO_ACID and NUCLEOTIDE.

scale_to_unit_variance (bool): whether to scale the design matrix after normalization to have unit variance per feature. Setting this argument to True might improve the subsequent classifier’s performance depending on the type of the classifier. The default value for scale_to_unit_variance is true.

scale_to_zero_mean (bool): whether to scale the design matrix after normalization to have zero mean per feature. Setting this argument to True might improve the subsequent classifier’s performance depending on the type of the classifier. However, if the original design matrix was sparse, setting this argument to True will destroy the sparsity and will increase the memory consumption. The default value for scale_to_zero_mean is false.

YAML specification:

my_continuous_kmer:
    KmerFrequency:
        normalization_type: RELATIVE_FREQUENCY
        reads: UNIQUE
        sequence_encoding: CONTINUOUS_KMER
        sequence_type: NUCLEOTIDE
        k: 3
        scale_to_unit_variance: True
        scale_to_zero_mean: True
my_gapped_kmer:
    KmerFrequency:
        normalization_type: RELATIVE_FREQUENCY
        reads: UNIQUE
        sequence_encoding: GAPPED_KMER
        sequence_type: AMINO_ACID
        k_left: 2
        k_right: 2
        min_gap: 1
        max_gap: 3
        scale_to_unit_variance: True
        scale_to_zero_mean: False
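
The difference between continuous and gapped k-mers can be made concrete with a short sketch. The helpers below are hypothetical and only illustrate the decomposition: writing the gap as dots is an assumption about notation, IMGT positions are ignored, and only relative-frequency normalization is shown.

```python
from collections import Counter


def continuous_kmers(sequence, k):
    """All contiguous subsequences of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def gapped_kmers(sequence, k_left, k_right, min_gap, max_gap):
    """k_left letters, a gap of min_gap..max_gap positions, then k_right letters."""
    kmers = []
    for gap in range(min_gap, max_gap + 1):
        span = k_left + gap + k_right
        for i in range(len(sequence) - span + 1):
            left = sequence[i:i + k_left]
            right = sequence[i + k_left + gap:i + span]
            kmers.append(left + "." * gap + right)  # dots mark the gap (notation is an assumption)
    return kmers


def relative_frequencies(kmers):
    """Share of each k-mer among all k-mers."""
    counts = Counter(kmers)
    total = sum(counts.values())
    return {kmer: count / total for kmer, count in counts.items()}


continuous = continuous_kmers("CASTTY", k=3)
gapped = gapped_kmers("CASTTY", k_left=1, k_right=1, min_gap=1, max_gap=1)
frequencies = relative_frequencies(continuous_kmers("CASCAS", k=3))
```

For a repertoire, the k-mers of all sequences would be pooled before normalization; with reads set to ALL, each sequence's k-mers would additionally be weighted by its count.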

MatchedReceptors

Encodes the dataset based on the matches between a dataset containing unpaired (single chain) data, and a paired reference receptor dataset. For each paired reference receptor, the frequency of either chain in the dataset is counted.

This encoding can be used in combination with the Matches report.

When sum_matches and normalize are set to True, this encoder behaves similarly to the one described in: Yao, Y. et al. ‘T cell receptor repertoire as a potential diagnostic marker for celiac disease’. Clinical Immunology Volume 222 (January 2021): 108621. doi.org/10.1016/j.clim.2020.108621 with the only exception being that this encoder uses paired receptors, while the original publication used single sequences (see also: MatchedSequences encoder).

Arguments:

reference (dict): A dictionary describing the reference dataset file. Import should be specified the same way as regular dataset import. It is only allowed to import a receptor dataset here (i.e., is_repertoire is False and paired is True by default, and these are not allowed to be changed).

max_edit_distances (dict): A dictionary specifying the maximum edit distance between a target sequence (from the repertoire) and the reference sequence. A maximum distance can be specified per chain, for example to allow for less strict matching of TCR alpha and BCR light chains. When only an integer is specified, this distance is applied to all possible chains.

reads (ReadsType): The reads type signifies whether the counts of the sequences in the repertoire will be taken into account. If UNIQUE, only unique sequences (clonotypes) are counted, and if ALL, the sequence ‘count’ value is summed when determining the number of matches. The default value for reads is all.

sum_matches (bool): When sum_matches is False, the resulting encoded data matrix contains multiple columns with the number of matches per reference receptor chain. When sum_matches is true, the columns representing each of the two chains are summed together, meaning that there are only two aggregated sums of matches (one per chain) per repertoire in the encoded data. To use this encoder in combination with the Matches report, sum_matches must be set to False. When sum_matches is set to True, this encoder behaves similarly to the encoder described by Yao, Y. et al. By default, sum_matches is False.

normalize (bool): If True, the chain matches are divided by the total number of unique receptors in the repertoire (when reads = unique) or the total number of reads in the repertoire (when reads = all).

YAML specification:

my_mr_encoding:
    MatchedReceptors:
        reference:
            format: VDJDB
            params:
                path: path/to/file.txt
        max_edit_distances:
            alpha: 1
            beta: 0

MatchedRegex

Encodes the dataset based on the matches between a RepertoireDataset and a collection of regular expressions. For each regular expression, the number of sequences in the RepertoireDataset containing the expression is counted. This can also be used to count how often a subsequence occurs in a RepertoireDataset.

The regular expressions are defined per chain, and it is possible to require a V gene match in addition to the CDR3 sequence containing the regular expression.

This encoding can be used in combination with the Matches report.

Arguments:

match_v_genes (bool): Whether V gene matches are required. If this is True, a match is only counted if the V gene matches the gene specified in the motif input file. By default match_v_genes is False.

reads (ReadsType): The reads type signifies whether the counts of the sequences in the repertoire will be taken into account. If UNIQUE, only unique sequences (clonotypes) are counted, and if ALL, the sequence ‘count’ value is summed when determining the number of matches. The default value for reads is all.

motif_filepath (str): The path to the motif input file. This should be a tab separated file containing a column named ‘id’ and for every chain that should be matched a column containing the regex (<chain>_regex) and a column containing the V gene (<chain>V) if match_v_genes is True. The chains are specified by their three letter code, valid values are: TRA, TRB, TRG, TRD, IGH, IGL, IGK.

In the simplest case, when counting the number of occurrences of a given list of k-mers in TRB sequences, the contents of the motif file could look like this:

id    TRB_regex
1     ACG
2     EDNA
3     DFWG

It is also possible to test whether paired regular expressions occur in the dataset (for example: regular expressions matching both a TRA chain and a TRB chain) by specifying them on the same line. In a more complex case where both paired and unpaired regular expressions are specified, in addition to matching the V genes, the contents of the motif file could look like this:

id    TRA_regex    TRAV      TRB_regex    TRBV
1     AGQ.GSS      TRAV35    S[APL]GQY    TRBV29-1
2                            ASS.R.*      TRBV7-3

YAML specification:

my_mr_encoding:
    MatchedRegex:
        motif_filepath: path/to/file.txt
        match_v_genes: True
        reads: unique
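
The counting rule described above (a sequence matches when it contains the regular expression, optionally restricted to a V gene) can be sketched with Python's re module. This is an illustration under assumptions: count_regex_matches is a hypothetical helper, sequences are given as (CDR3, V gene, count) tuples, and V genes are compared by exact string equality.

```python
import re


# Hypothetical helper (not part of immuneML): count sequences containing a regex.
def count_regex_matches(sequences, regex, v_gene=None, reads="all"):
    pattern = re.compile(regex)
    total = 0
    for cdr3, gene, count in sequences:
        if v_gene is not None and gene != v_gene:
            continue  # a V gene match is required but not met
        if pattern.search(cdr3):  # the expression may occur anywhere in the sequence
            total += count if reads == "all" else 1
    return total


trb_sequences = [
    ("CASSAGQYF", "TRBV29-1", 5),
    ("CASSPGQYF", "TRBV7-3", 2),
    ("CASSKRTF", "TRBV7-3", 1),
]
all_reads = count_regex_matches(trb_sequences, "S[APL]GQY")
unique_reads = count_regex_matches(trb_sequences, "S[APL]GQY", reads="unique")
with_v_gene = count_regex_matches(trb_sequences, "S[APL]GQY", v_gene="TRBV29-1")
```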

MatchedSequences

Encodes the dataset based on the matches between a RepertoireDataset and a reference sequence dataset.

This encoding can be used in combination with the Matches report.

When sum_matches and normalize are set to True, this encoder behaves as described in: Yao, Y. et al. ‘T cell receptor repertoire as a potential diagnostic marker for celiac disease’. Clinical Immunology Volume 222 (January 2021): 108621. doi.org/10.1016/j.clim.2020.108621

Arguments:

reference (dict): A dictionary describing the reference dataset file. Import should be specified the same way as regular dataset import. It is only allowed to import a sequence dataset here (i.e., is_repertoire and paired are False by default, and are not allowed to be set to True).

max_edit_distance (int): The maximum edit distance between a target sequence (from the repertoire) and the reference sequence.

reads (ReadsType): The reads type signifies whether the counts of the sequences in the repertoire will be taken into account. If UNIQUE, only unique sequences (clonotypes) are counted, and if ALL, the sequence ‘count’ value is summed when determining the number of matches. The default value for reads is all.

sum_matches (bool): When sum_matches is False, the resulting encoded data matrix contains multiple columns with the number of matches per reference sequence. When sum_matches is true, all columns are summed together, meaning that there is only one aggregated sum of matches per repertoire in the encoded data. To use this encoder in combination with the Matches report, sum_matches must be set to False. When sum_matches is set to True, this encoder behaves as described by Yao, Y. et al. By default, sum_matches is False.

normalize (bool): If True, the sequence matches are divided by the total number of unique sequences in the repertoire (when reads = unique) or the total number of reads in the repertoire (when reads = all).

YAML specification:

my_ms_encoding:
    MatchedSequences:
        reference:
            format: VDJDB
            params:
                path: path/to/file.txt
        max_edit_distance: 1
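
How max_edit_distance, reads, sum_matches and normalize interact can be sketched as follows. The sketch is illustrative only: encode_matches and edit_distance are hypothetical helpers, the repertoire is a {sequence: count} dictionary, and Levenshtein distance is assumed as the edit distance.

```python
def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                        # deletion
                               current[j - 1] + 1,                     # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def encode_matches(repertoire, references, max_edit_distance=0, reads="all",
                   sum_matches=False, normalize=False):
    # With reads "unique" every clonotype counts once; with "all" its count is used.
    weights = {seq: (count if reads == "all" else 1) for seq, count in repertoire.items()}
    matches = [
        sum(w for seq, w in weights.items() if edit_distance(seq, ref) <= max_edit_distance)
        for ref in references
    ]
    if normalize:
        total = sum(weights.values())
        matches = [m / total for m in matches]
    return [sum(matches)] if sum_matches else matches


repertoire = {"CASSF": 3, "CASTF": 1, "CAWWWF": 6}
references = ["CASSF", "CAWWWF"]
per_reference = encode_matches(repertoire, references, max_edit_distance=1)
aggregated = encode_matches(repertoire, references, max_edit_distance=1,
                            sum_matches=True, normalize=True)
```

With sum_matches left at False the result has one column per reference sequence, which is the shape the Matches report needs; with sum_matches set to True it collapses to a single value per repertoire.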

OneHot

One-hot encoding for repertoires, sequences or receptors. In one-hot encoding, each alphabet character (amino acid or nucleotide) is replaced by a sparse vector with one 1 and the rest zeroes. The position of the 1 represents the alphabet character.

Arguments:

use_positional_info (bool): whether to include features representing the positional information. If True, three additional feature vectors will be added, representing the sequence start, sequence middle and sequence end. The values in these features are scaled between 0 and 1. A graphical representation of the values of these vectors is given below.

  Value of sequence start:         Value of sequence middle:        Value of sequence end:

1 \                              1    /‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾\         1                          /
   \                                 /                   \                                  /
    \                               /                     \                                /
0    \_____________________      0 /                       \      0  _____________________/
  <----sequence length---->        <----sequence length---->         <----sequence length---->

distance_to_seq_middle (int): only applies when use_positional_info is True. This is the distance from the edge of the CDR3 sequence (IMGT positions 105 and 117) to the portion of the sequence that is considered ‘middle’. For example: if distance_to_seq_middle is 6 (default), all IMGT positions in the interval [111, 112) receive positional value 1. When using nucleotide sequences: note that the distance is measured in (amino acid) IMGT positions. If the complete sequence length is smaller than 2 * distance_to_seq_middle, the minimum value of the ‘start’ and ‘end’ vectors will not reach 0, and the maximum value of the ‘middle’ vector will not reach 1. A graphical representation of the positional vectors with a too short sequence is given below:

Value of sequence start         Value of sequence middle        Value of sequence end:
with very short sequence:       with very short sequence:       with very short sequence:

     1 \                               1                                 1    /
        \                                                                    /
         \                                /\                                /
     0                                 0 /  \                            0
       <->                               <-->                               <->

flatten (bool): whether to flatten the final onehot matrix to a 2-dimensional matrix [examples, other_dims_combined]. This must be set to True when using onehot encoding in combination with scikit-learn ML methods (inheriting SklearnMethod), such as LogisticRegression, SVM, SVC, RandomForestClassifier and KNN.

sequence_type: whether to use nucleotide or amino acid sequence for encoding. Valid values are ‘nucleotide’ and ‘amino_acid’.

YAML specification:

one_hot_vanilla:
    OneHot:
        use_positional_info: False
        flatten: False
        sequence_type: amino_acid

one_hot_positional:
    OneHot:
        use_positional_info: True
        distance_to_seq_middle: 3
        flatten: False
        sequence_type: nucleotide
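
A minimal sketch of what one-hot encoding and flattening mean for a single amino acid sequence is given below. The alphabet order and the helper names are assumptions made for illustration; positional features and padding of sequences to a common length are not shown.

```python
# Alphabet order is an assumption made for this illustration.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence, alphabet=AMINO_ACIDS):
    """Return a [sequence length x alphabet size] matrix with a single 1 per row."""
    index = {letter: i for i, letter in enumerate(alphabet)}
    matrix = []
    for letter in sequence:
        row = [0] * len(alphabet)
        row[index[letter]] = 1
        matrix.append(row)
    return matrix


def flatten(matrix):
    """Concatenate all rows into one feature vector (what flatten: True does per example)."""
    return [value for row in matrix for value in row]


matrix = one_hot("CAS")
vector = flatten(matrix)
```

A three-letter sequence gives a 3 x 20 matrix, or a vector of 60 values once flattened, which is the two-dimensional layout per dataset that scikit-learn methods expect.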

SequenceAbundance

This encoder represents the repertoires as vectors where:

  • the first element corresponds to the number of label-associated clonotypes

  • the second element is the total number of unique clonotypes

To determine what clonotypes (with features defined by comparison_attributes) are label-associated, one-sided Fisher’s exact test is used.

The encoder also writes out files containing the contingency table used for Fisher’s exact test, the resulting p-values, and the significantly abundant sequences (use RelevantSequenceExporter to export these sequences in AIRR format).

Reference: Emerson, Ryan O. et al. ‘Immunosequencing Identifies Signatures of Cytomegalovirus Exposure History and HLA-Mediated Effects on the T Cell Repertoire’. Nature Genetics 49, no. 5 (May 2017): 659–65. doi.org/10.1038/ng.3822.

Note: to use this encoder, it is necessary to explicitly define the positive class for the label when defining the label in the instruction. With the positive class defined, it can then be determined which sequences are indicative of the positive class. For a full example of using this encoder, see Reproduction of the CMV status predictions study.

Arguments:

comparison_attributes (list): The attributes to be considered to group receptors into clonotypes. Only the fields specified in comparison_attributes will be considered, all other fields are ignored. Valid values are sequence_aas, sequences, v_genes, j_genes, v_subgroups, j_subgroups, v_alleles, j_alleles, chains, counts, region_types, frame_types, sequence_identifiers, cell_ids.

p_value_threshold (float): The p value threshold to be used by the statistical test.

sequence_batch_size (int): The number of sequences in a batch when comparing sequences across repertoires, typically hundreds of thousands. This does not affect the results of the encoding, only the speed. The default value is 1000000.

repertoire_batch_size (int): How many repertoires will be loaded at once. This does not affect the result of the encoding, only the speed. This value is a trade-off between the number of repertoires that fit in RAM at one time and the loading time from disk.

YAML specification:

my_sa_encoding:
    SequenceAbundance:
        comparison_attributes:
            - sequence_aas
            - v_genes
            - j_genes
            - chains
            - region_types
        p_value_threshold: 0.05
        sequence_batch_size: 100000
        repertoire_batch_size: 32

TCRdist

Encodes the given ReceptorDataset as a distance matrix between all receptors, where the distance is computed using TCRdist from the paper: Dash P, Fiore-Gartland AJ, Hertz T, et al. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature. 2017; 547(7661):89-93. doi:10.1038/nature22383.

For the implementation, the TCRdist3 library was used.

Arguments:

cores (int): number of processes to use for the computation

YAML specification:

my_tcr_dist_enc: # user-defined name
    TCRdist:
        cores: 4

Word2Vec

Word2VecEncoder learns the vector representations of k-mers based on the context (receptor sequence). It works for sequence and repertoire datasets. A similar idea was discussed in: Ostrovsky-Berman, M., Frankel, B., Polak, P. & Yaari, G. Immune2vec: Embedding B/T Cell Receptor Sequences in ℝ^N Using Natural Language Processing. Frontiers in Immunology 12, (2021).

This encoder relies on gensim’s implementation of Word2Vec and KmerHelper for k-mer extraction. Currently it works on amino acid level.

Arguments:

vector_size (int): The size of the vector to be learnt.

model_type (ModelType): The context which will be used to infer the representation of the sequence. If SEQUENCE is used, the context of a k-mer is defined by the sequence it occurs in (e.g. if the sequence is CASTTY and the k-mer is AST, then its context consists of the k-mers CAS, STT, TTY). If KMER_PAIR is used, the context for the k-mer is defined as all the k-mers that are within one edit distance (e.g. for k-mer CAS, the context includes CAA, CAC, CAD etc.). Valid values are SEQUENCE, KMER_PAIR.

k (int): The length of the k-mers used for the encoding.

epochs (int): the number of epochs for which the word2vec model is trained on a given set of sentences (corresponds to the epochs parameter in the gensim package)

window (int): max distance between two k-mers in a sequence (same as window parameter in gensim’s word2vec)
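
The two model_type contexts described above can be illustrated with a short Python sketch. This is not the immuneML or KmerHelper implementation: the function names are hypothetical, "within an edit distance of one" is interpreted here as a single amino acid substitution, and the effect of the window parameter is ignored.

```python
# Illustrative sketch only; function names are hypothetical, not immuneML API.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def split_into_kmers(sequence, k):
    """Return all overlapping k-mers of a sequence, in order of occurrence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def sequence_context(sequence, kmer, k):
    """SEQUENCE mode: the other k-mers found in the same sequence."""
    return [other for other in split_into_kmers(sequence, k) if other != kmer]


def kmer_pair_context(kmer):
    """KMER_PAIR mode (assumption: one substitution away from the k-mer)."""
    neighbours = []
    for position, original in enumerate(kmer):
        for amino_acid in AMINO_ACIDS:
            if amino_acid != original:
                neighbours.append(kmer[:position] + amino_acid + kmer[position + 1:])
    return neighbours


print(sequence_context("CASTTY", "AST", k=3))  # ['CAS', 'STT', 'TTY']
print(len(kmer_pair_context("CAS")))           # 57 (3 positions x 19 substitutions)
```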

YAML specification:

encodings:
    my_w2v:
        Word2Vec:
            vector_size: 16
            k: 3
            model_type: SEQUENCE
            epochs: 100
            window: 8

Reports

Data reports

Data reports show some type of features or statistics about a given dataset.

When running the TrainMLModel instruction, data reports can be specified inside the ‘selection’ or ‘assessment’ specification under the keys ‘reports/data’ (current cross-validation split) or ‘reports/data_splits’ (train/test sub-splits). Example:

my_instruction:
    type: TrainMLModel
    selection:
        reports:
            data:
                - my_data_report
        # other parameters...
    assessment:
        reports:
            data:
                - my_data_report
        # other parameters...
    # other parameters...

Alternatively, when running the ExploratoryAnalysis instruction, data reports can be specified under ‘report’. Example:

my_instruction:
    type: ExploratoryAnalysis
    analyses:
        my_first_analysis:
            report: my_data_report
            # other parameters...
    # other parameters...

AminoAcidFrequencyDistribution

Generates a barplot showing the relative frequency of each amino acid at each position in the sequences of a dataset.

Arguments:

imgt_positions (bool): Whether to use IMGT positional numbering or sequence index numbering. When imgt_positions is True, IMGT positions are used, meaning sequences of unequal length are aligned according to their IMGT positions. By default imgt_positions is True.

relative_frequency (bool): Whether to plot relative frequencies (true) or absolute counts (false) of the positional amino acids. By default, relative_frequency is True.

split_by_label (bool): Whether to split the plots by a label. If set to true, the Dataset must either contain a single label, or alternatively the label of interest can be specified under ‘label’. By default, split_by_label is False.

label (str): if split_by_label is set to True, a label can be specified here.
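
As a rough illustration of what the relative_frequency argument changes, the sketch below counts amino acids per position using plain sequence index numbering (the imgt_positions: False case). It is a simplified assumption of the calculation, not the report's actual code; the function name and the 1-based position numbering are hypothetical.

```python
from collections import Counter, defaultdict


def positional_aa_frequencies(sequences, relative_frequency=True):
    """Count amino acids at each sequence index (assumed 1-based here)."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        for position, amino_acid in enumerate(sequence, start=1):
            counts[position][amino_acid] += 1
    if not relative_frequency:
        # Absolute counts per position
        return {position: dict(counter) for position, counter in counts.items()}
    # Relative frequencies: each position is normalised by its own total count
    return {
        position: {aa: n / sum(counter.values()) for aa, n in counter.items()}
        for position, counter in counts.items()
    }


example = ["CAS", "CAT", "CSS"]
print(positional_aa_frequencies(example)[1])                            # {'C': 1.0}
print(positional_aa_frequencies(example, relative_frequency=False)[3])  # {'S': 2, 'T': 1}
```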

YAML specification:

my_aa_freq_report:
  AminoAcidFrequencyDistribution:
    relative_frequency: False
    split_by_label: True
    label: CMV

CytoscapeNetworkExporter

This report exports the Receptor sequences to .sif format, such that they can directly be imported as a network in Cytoscape, to visualize chain sharing between the different receptors in a dataset (for example, for TCRs: how often one alpha chain is shared with multiple beta chains, and vice versa).

The Receptor sequences can be provided as a ReceptorDataset, or a RepertoireDataset (containing paired sequence information). In the latter case, one .sif file is exported per Repertoire.

YAML specification:

my_cyto_export: CytoscapeNetworkExporter

GLIPH2Exporter

A report which exports the receptor data to GLIPH2 format so that it can be used directly in the GLIPH2 tool. Currently, the report accepts only receptor datasets.

GLIPH2 publication: Huang H, Wang C, Rubelt F, Scriba TJ, Davis MM. Analyzing the Mycobacterium tuberculosis immune response by T-cell receptor clustering with GLIPH2 and genome-wide antigen screening. Nature Biotechnology. Published online April 27, 2020:1-9. doi:10.1038/s41587-020-0505-4

Arguments:

condition (str): name of the parameter present in the receptor metadata in the dataset; condition can be anything which can be processed in GLIPH2, such as tissue type or treatment.

YAML specification:

my_gliph2_exporter: # user-defined name
    GLIPH2Exporter:
        condition: epitope # for instance, epitope parameter is present in receptors' metadata with values such as "MtbLys" for Mycobacterium tuberculosis (as shown in the original paper).

ReceptorDatasetOverview

This report plots the length distribution per chain for a receptor (paired-chain) dataset.

Arguments:

batch_size (int): how many receptors to load at once; 50 000 by default

YAML specification:

reports:
    my_receptor_overview_report: ReceptorDatasetOverview

RecoveredSignificantFeatures

Compares a given collection of groundtruth implanted signals (sequences or k-mers) to the significant label-associated k-mers or sequences according to Fisher’s exact test.

Internally uses the KmerAbundanceEncoder for calculating significant k-mers, and SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder to calculate significant full sequences (depending on whether the argument compairr_path was set).

This report creates two plots:

  • the first plot is a bar chart showing what percentage of the groundtruth implanted signals were found to be significant.

  • the second plot is a bar chart showing what percentage of the k-mers/sequences found to be significant match the groundtruth implanted signals.

To compare k-mers or sequences of differing lengths, the groundtruth sequences or long k-mers are split into k-mers of the given size through a sliding window approach. When comparing full sequences (‘full_sequence’) to groundtruth sequences, a match is only registered if both sequences are of equal length.
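
The sliding window split and the two percentages shown in the plots can be sketched as follows. This is a simplified illustration under the assumption that both quantities are computed on sets of unique k-mers; the function names are hypothetical and the code is not taken from immuneML.

```python
def split_into_kmers(sequence, k):
    """Slide a window of size k over the sequence and collect the k-mers."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def recovery_percentages(groundtruth_sequences, significant_kmers, k):
    """Return (% of groundtruth k-mers found significant,
               % of significant k-mers matching the groundtruth)."""
    groundtruth_kmers = set()
    for sequence in groundtruth_sequences:
        groundtruth_kmers |= split_into_kmers(sequence, k)
    significant_kmers = set(significant_kmers)
    shared = groundtruth_kmers & significant_kmers
    pct_groundtruth_found = 100 * len(shared) / len(groundtruth_kmers) if groundtruth_kmers else 0.0
    pct_significant_true = 100 * len(shared) / len(significant_kmers) if significant_kmers else 0.0
    return pct_groundtruth_found, pct_significant_true


# "CASS" is split into CAS and ASS; only CAS is among the significant k-mers
found, correct = recovery_percentages(["CASS"], {"CAS", "GGG", "TTT", "YYY"}, k=3)
print(found, correct)  # 50.0 25.0
```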

Arguments:

groundtruth_sequences_path (str): Path to a file containing the true implanted (sub)sequences, e.g., full sequences or k-mers. The file should contain one sequence per line, without a header, and without V or J genes.

trim_leading_trailing (bool): Whether to trim the first and last positions from the provided groundtruth sequences, e.g., the leading C and trailing Y/F amino acids. This is necessary for comparing full sequences when the main dataset is imported using settings that also trim the leading and trailing positions (specified by the region_type parameter). By default, trim_leading_trailing is False.

p_values (list): The p value thresholds to be used by Fisher’s exact test. Each p-value specified here will become one panel in the output figure.

k_values (list): Length of the k-mers (number of amino acids) created by the KmerAbundanceEncoder. When using a full sequence encoding (SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder), specify ‘full_sequence’ here. Each value specified under k_values will represent one bar in the output figure.

label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.

compairr_path (str): If ‘full_sequence’ is listed under k_values, the path to the CompAIRR executable may be provided. If the compairr_path is specified, the CompAIRRSequenceAbundanceEncoder will be used to compute the significant sequences. If the path is not specified and ‘full_sequence’ is listed under k-values, SequenceAbundanceEncoder will be used.

YAML specification:

my_recovered_significant_features_report:
    RecoveredSignificantFeatures:
        groundtruth_sequences_path: path/to/groundtruth/sequences.txt
        trim_leading_trailing: False
        p_values:
            - 0.1
            - 0.01
            - 0.001
            - 0.0001
        k_values:
            - 3
            - 4
            - 5
            - full_sequence
        compairr_path: path/to/compairr # can be specified if 'full_sequence' is listed under k_values
        label: # Define a label, and the positive class for that given label
            CMV:
                positive_class: +

RepertoireClonotypeSummary

Shows the number of distinct clonotypes per repertoire in a given dataset as a bar plot.

Arguments:

color_by_label (str): name of the label used to color the plot (e.g., a disease label), or None

YAML specification:

my_clonotype_summary_rep:
  RepertoireClonotypeSummary:
    color_by_label: celiac

SequenceLengthDistribution

Generates a histogram of the lengths of the sequences in a repertoire or sequence dataset.

Arguments:

sequence_type (str): whether to check the length of amino acid or nucleotide sequences; default value is ‘amino_acid’

YAML specification:

my_sld_report:
    SequenceLengthDistribution:
        sequence_type: amino_acid

SequencesWithSignificantKmers

Given a list of reference sequences, this report writes out the subsets of reference sequences containing significant k-mers (as computed by the KmerAbundanceEncoder using Fisher’s exact test).

For each combination of p-value and k-mer size given, a file is written containing all sequences containing a significant k-mer of the given size at the given p-value.
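
The selection step can be pictured with the following sketch, which keeps every reference sequence that contains at least one significant k-mer, once per (p-value, k) combination. The significant k-mers below are made-up placeholders (in the report they come from the KmerAbundanceEncoder), and the function name is hypothetical.

```python
def sequences_with_significant_kmers(reference_sequences, significant_kmers):
    """Keep the reference sequences that contain at least one significant k-mer."""
    return [
        sequence for sequence in reference_sequences
        if any(kmer in sequence for kmer in significant_kmers)
    ]


# Hypothetical significant k-mers per (p-value, k) combination
significant = {
    (0.01, 3): {"CAS", "SSL"},
    (0.001, 3): {"CAS"},
}
references = ["CASSLG", "CATSSL", "GGGGGG"]

# One subset (one output file in the report) per combination
subsets = {
    combination: sequences_with_significant_kmers(references, kmers)
    for combination, kmers in significant.items()
}
print(subsets[(0.01, 3)])   # ['CASSLG', 'CATSSL']
print(subsets[(0.001, 3)])  # ['CASSLG']
```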

Arguments:

reference_sequences_path (str): Path to a file containing the reference sequences. The file should contain one sequence per line, without a header, and without V or J genes.

p_values (list): The p-value thresholds to be used by Fisher’s exact test. One output file is written for each combination of p-value and k-mer size.

k_values (list): Length of the k-mers (number of amino acids) created by the KmerAbundanceEncoder. One output file is written for each combination of p-value and k-mer size.

label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.

YAML specification:

my_sequences_with_significant_kmers:
    SequencesWithSignificantKmers:
        reference_sequences_path: path/to/reference/sequences.txt
        p_values:
            - 0.1
            - 0.01
            - 0.001
            - 0.0001
        k_values:
            - 3
            - 4
            - 5
        label: # Define a label, and the positive class for that given label
            CMV:
                positive_class: +

SignificantFeatures

Plots a boxplot of the number of significant features (label-associated k-mers or sequences) per Repertoire according to Fisher’s exact test, across different classes for the given label.

Internally uses the KmerAbundanceEncoder for calculating significant k-mers, and SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder to calculate significant full sequences (depending on whether the argument compairr_path was set).

Arguments:

p_values (list): The p value thresholds to be used by Fisher’s exact test. Each p-value specified here will become one panel in the output figure.

k_values (list): Length of the k-mers (number of amino acids) created by the KmerAbundanceEncoder. When using a full sequence encoding (SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder), specify ‘full_sequence’ here. Each value specified under k_values will represent one boxplot in the output figure.

label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.

compairr_path (str): If ‘full_sequence’ is listed under k_values, the path to the CompAIRR executable may be provided. If the compairr_path is specified, the CompAIRRSequenceAbundanceEncoder will be used to compute the significant sequences. If the path is not specified and ‘full_sequence’ is listed under k-values, SequenceAbundanceEncoder will be used.

log_scale (bool): Whether to plot the y axis in log10 scale (log_scale = True) or continuous scale (log_scale = False). By default, log_scale is False.

YAML specification:

my_significant_features_report:
    SignificantFeatures:
        p_values:
            - 0.1
            - 0.01
            - 0.001
            - 0.0001
        k_values:
            - 3
            - 4
            - 5
            - full_sequence
        compairr_path: path/to/compairr # can be specified if 'full_sequence' is listed under k_values
        label: # Define a label, and the positive class for that given label
            CMV:
                positive_class: +
        log_scale: False

SignificantKmerPositions

Plots the number of significant k-mers (as computed by the KmerAbundanceEncoder using Fisher’s exact test) observed at each IMGT position of a given list of reference sequences. This report creates a stacked bar chart, where each bar represents an IMGT position, and each segment of the stack represents the observed frequency of one ‘significant’ k-mer at that position.

Arguments:

reference_sequences_path (str): Path to a file containing the reference sequences. The file should contain one sequence per line, without a header, and without V or J genes.

p_values (list): The p value thresholds to be used by Fisher’s exact test. Each p-value specified here will become one panel in the output figure.

k_values (list): Length of the k-mers (number of amino acids) created by the KmerAbundanceEncoder. Each k-mer length will become one panel in the output figure.

label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.

YAML specification:

my_significant_kmer_positions_report:
    SignificantKmerPositions:
        reference_sequences_path: path/to/reference/sequences.txt
        p_values:
            - 0.1
            - 0.01
            - 0.001
            - 0.0001
        k_values:
            - 3
            - 4
            - 5
        label: # Define a label, and the positive class for that given label
            CMV:
                positive_class: +

SimpleDatasetOverview

Generates a simple text-based overview of the properties of any dataset, including the dataset name, size, and metadata labels.

YAML specification:

reports:
    my_overview: SimpleDatasetOverview

Encoding reports

Encoding reports show some type of features or statistics about an encoded dataset, or may in some cases export relevant sequences or tables.

When running the TrainMLModel instruction, encoding reports can be specified inside the ‘selection’ or ‘assessment’ specification under the key ‘reports/encoding’. Example:

my_instruction:
    type: TrainMLModel
    selection:
        reports:
            encoding:
                - my_encoding_report
        # other parameters...
    assessment:
        reports:
            encoding:
                - my_encoding_report
        # other parameters...
    # other parameters...

Alternatively, when running the ExploratoryAnalysis instruction, encoding reports can be specified under ‘report’. Example:

my_instruction:
    type: ExploratoryAnalysis
    analyses:
        my_first_analysis:
            report: my_encoding_report
            # other parameters...
    # other parameters...

DesignMatrixExporter

Exports the design matrix and related information of a given encoded Dataset to csv files. If the encoded data has more than 2 dimensions (such as when using the OneHot encoder with option Flatten=False), the data are then exported to different formats to facilitate their import with external software.

Arguments:

file_format (str): the format and extension of the file to store the design matrix. The supported formats are: npy, csv, hdf5, npy.zip, csv.zip or hdf5.zip. Note: when using hdf5 or hdf5.zip output formats, make sure the ‘hdf5’ dependency is installed.

YAML specification:

my_dme_report:
    DesignMatrixExporter:
        file_format: csv

FeatureComparison

Compares the feature values in a given encoded data matrix across two values for a metadata label. These labels are specified in the metadata file for repertoire datasets, or as metadata columns for sequence and receptor datasets. Can be used in combination with any encoding and dataset type. This report produces a scatterplot, where each point represents one feature, and the values on the x and y axes are the average feature values across two subsets of the data. For example, when the KmerFrequency encoder is used, and the comparison_label represents a disease status (true/false), then the features are the k-mers (AAA, AAC, etc.) and their x and y positions in the scatterplot are determined by their frequency in the subsets of the data where disease=true and disease=false.

Optional metadata labels can be specified to divide the scatterplot into groups based on color, row facets or column facets.

Alternatively, when the feature values are of interest without comparing them between labelled subgroups of the data, please use FeatureValueBarplot or FeatureDistribution instead.
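
To make the scatterplot coordinates concrete, the sketch below averages each feature separately within the two classes of the comparison label. It is only an illustration: the function name is hypothetical, and which class ends up on which axis is an arbitrary choice made here (classes are sorted), not a statement about the report.

```python
from statistics import mean


def class_means(encoded_rows, labels, feature_names):
    """Return the two classes and, per feature, its mean value in each class."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("the comparison label must have exactly two classes")
    points = {}
    for column, feature in enumerate(feature_names):
        points[feature] = tuple(
            mean(row[column] for row, label in zip(encoded_rows, labels) if label == cls)
            for cls in classes
        )
    return classes, points


# Three examples encoded with two k-mer frequency features
rows = [[0.5, 0.25], [0.25, 0.25], [0.125, 0.5]]
disease = [True, True, False]
classes, points = class_means(rows, disease, ["AAA", "AAC"])
print(classes)        # [False, True]
print(points["AAA"])  # (0.125, 0.375)
```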

Arguments:

comparison_label (str): Mandatory label. This label is used to split the encoded data matrix and define the x and y axes of the plot. This label is only allowed to have 2 classes (for example: sick and healthy, binding and non-binding).

color_grouping_label (str): Optional label that is used to color the points in the scatterplot. This cannot be the same as comparison_label.

row_grouping_label (str): Optional label that is used to group scatterplots into different row facets. This cannot be the same as comparison_label.

column_grouping_label (str): Optional label that is used to group scatterplots into different column facets. This cannot be the same as comparison_label.

show_error_bar (bool): Whether to show the error bar (standard deviation) for the points, both in the x and y dimension.

log_scale (bool): Whether to plot the x and y axes in log10 scale (log_scale = True) or continuous scale (log_scale = False). By default, log_scale is False.

keep_fraction (float): The total number of features may be very large and only the features differing significantly across comparison labels may be of interest. When the keep_fraction parameter is set below 1, only the fraction of features that differs the most across comparison labels is kept for plotting (note that the produced .csv file still contains all data). By default, keep_fraction is 1, meaning that all features are plotted.

opacity (float): a value between 0 and 1 setting the opacity for data points making it easier to see if there are overlapping points

YAML specification:

my_comparison_report:
    FeatureComparison: # compare the different classes defined in the label disease
        comparison_label: disease

FeatureDistribution

Plots a boxplot for each feature in the encoded data matrix. Can be used in combination with any encoding and dataset type. Each boxplot represents a feature and shows the distribution of values for that feature. For example, when the KmerFrequency encoder is used, the features are the k-mers (AAA, AAC, etc.) and the feature values are the frequencies per k-mer.

Two modes can be used: in the ‘normal’ mode, there is one regular boxplot for each column of the encoded data matrix; in the ‘sparse’ mode, all zero-valued cells are removed before the data is passed to the boxplots. If mode is set to ‘auto’, it is automatically set to ‘sparse’ when the density of the matrix is below 0.01.
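
The mode selection can be sketched as below, assuming that the density of the matrix means the fraction of non-zero cells. The function names are hypothetical and the code is an illustration, not the report's implementation.

```python
def resolve_mode(matrix, mode="auto", density_threshold=0.01):
    """Pick 'sparse' or 'normal' when mode is 'auto' (density = non-zero fraction)."""
    if mode != "auto":
        return mode
    total = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for value in row if value != 0)
    density = nonzero / total if total else 0.0
    return "sparse" if density < density_threshold else "normal"


def values_for_boxplot(column, mode):
    """In 'sparse' mode, zero cells are dropped before plotting."""
    return [value for value in column if value != 0] if mode == "sparse" else list(column)


mostly_zero = [[0] * 199 + [3]]       # density 1/200 = 0.005
print(resolve_mode(mostly_zero))      # sparse
print(resolve_mode([[1, 0], [0, 1]])) # normal
```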

Optional metadata labels can be specified to divide the boxplots into groups based on color, row facets or column facets. These labels are specified in the metadata file for repertoire datasets, or as metadata columns for sequence and receptor datasets.

Alternatively, when only the mean feature values are of interest (as opposed to showing the complete distribution, as done here), please consider using FeatureValueBarplot instead. When comparing the feature values between two subsets of the data, please use FeatureComparison.

Arguments:

color_grouping_label (str): The label that is used to color each boxplot, at each level of the grouping_label.

row_grouping_label (str): The label that is used to group boxplots into different row facets.

column_grouping_label (str): The label that is used to group boxplots into different column facets.

mode (str): either ‘normal’, ‘sparse’ or ‘auto’ (default)

x_title (str): x-axis label

y_title (str): y-axis label

YAML specification:

my_fdistr_report:
    FeatureDistribution:
        mode: sparse

FeatureValueBarplot

Plots a barplot of the feature values in a given encoded data matrix, averaged across examples. Can be used in combination with any encoding and dataset type. Each bar in the barplot represents the mean value of a given feature, and along the x-axis are the different features. For example, when the KmerFrequency encoder is used, the features are the k-mers (AAA, AAC, etc.) and the feature values are the frequencies per k-mer.

Optional metadata labels can be specified to divide the barplot into groups based on color, row facets or column facets. In this case, the average feature values in each group are plotted. These labels are specified in the metadata file for repertoire datasets, or as metadata columns for sequence and receptor datasets.

Alternatively, when the distribution of feature values is of interest (as opposed to showing only the mean, as done here), please consider using FeatureDistribution instead. When comparing the feature values between two subsets of the data, please use FeatureComparison.

Arguments:

color_grouping_label (str): The label that is used to color each bar, at each level of the grouping_label.

row_grouping_label (str): The label that is used to group bars into different row facets.

column_grouping_label (str): The label that is used to group bars into different column facets.

show_error_bar (bool): Whether to show the error bar (standard deviation) for the bars.

x_title (str): x-axis label

y_title (str): y-axis label

plot_top_n (int): plot n of the largest features on average separately (useful when there are too many features to plot at the same time)

plot_bottom_n (int): plot n of the smallest features on average separately (useful when there are too many features to plot at the same time)

plot_all_features (bool): whether to plot all (might be slow for large number of features)

YAML specification:

my_fvb_report:
    FeatureValueBarplot: # timepoint, disease_status and age_group are metadata labels
        column_grouping_label: timepoint
        row_grouping_label: disease_status
        color_grouping_label: age_group
        plot_all_features: true
        plot_top_n: 10
        plot_bottom_n: 5

Matches

Reports the number of matches that were found when using one of the following encoders: MatchedSequences, MatchedReceptors or MatchedRegex.

Report results are:

  • A table containing all matches, where the rows correspond to the Repertoires, and the columns correspond to the objects to match (regular expressions or receptor sequences).

  • The repertoire sizes (read frequencies and the number of unique sequences per repertoire), for each of the chains. This can be used to calculate the percentage of matched sequences in a repertoire.

  • When using MatchedSequences encoder or MatchedReceptors encoder, tables describing the chains and receptors (ids, chains, V and J genes and sequences).

  • When using MatchedReceptors encoder or using MatchedRegex encoder with chain pairs, tables describing the paired matches (where a match was found in both chains) per repertoire.

YAML specification:

my_match_report: Matches

RelevantSequenceExporter

Exports, in AIRR-compliant format, the sequences that are extracted as label-associated when using the SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder.

Arguments: there are no arguments for this report.

YAML specification:

my_relevant_sequences: RelevantSequenceExporter

ML model reports

ML model reports show some type of features or statistics about one trained ML model.

In the TrainMLModel instruction, ML model reports can be specified inside the ‘selection’ or ‘assessment’ specification under the key ‘reports/models’. Example:

my_instruction:
    type: TrainMLModel
    selection:
        reports:
            models:
                - my_ml_report
        # other parameters...
    assessment:
        reports:
            models:
                - my_ml_report
        # other parameters...
    # other parameters...

Coefficients

A report that plots the coefficients for a given ML method in a barplot. Can be used for LogisticRegression, SVM, SVC, and RandomForestClassifier. In the case of RandomForest, the feature importances will be plotted.

When used in TrainMLModel instruction, the report can be specified under ‘models’, both on the selection and assessment levels.

Which coefficients should be plotted (for example: only nonzero, above a certain threshold, …) can be specified. Multiple options can be specified simultaneously. By default the 25 largest coefficients are plotted. The full set of coefficients will also be exported as a csv file.

Arguments:

coefs_to_plot (list): A list specifying which coefficients should be plotted. Valid values are: ALL, NONZERO, CUTOFF, N_LARGEST.

cutoff (list): If ‘cutoff’ is specified under ‘coefs_to_plot’, the cutoff values can be specified here. The coefficients which have an absolute value equal to or greater than the cutoff will be plotted.

n_largest (list): If ‘n_largest’ is specified under ‘coefs_to_plot’, the values for n can be specified here. These should be integer values. The n largest coefficients are determined based on their absolute values.
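
How the options combine can be sketched as follows: each entry in coefs_to_plot yields one or more selections, with cutoff and n_largest producing one selection per listed value. This is an illustration based on the argument descriptions above; the function name and the names of the returned selections are hypothetical.

```python
def select_coefficients(coefficients, coefs_to_plot, cutoff=(), n_largest=()):
    """coefficients: dict mapping feature name -> coefficient value."""
    selections = {}
    for option in coefs_to_plot:
        option = option.lower()
        if option == "all":
            selections["all"] = dict(coefficients)
        elif option == "nonzero":
            selections["nonzero"] = {f: c for f, c in coefficients.items() if c != 0}
        elif option == "cutoff":
            # Keep coefficients whose absolute value is >= the cutoff
            for value in cutoff:
                selections[f"cutoff_{value}"] = {
                    f: c for f, c in coefficients.items() if abs(c) >= value
                }
        elif option == "n_largest":
            # Rank by absolute value, largest first
            ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)
            for n in n_largest:
                selections[f"n_largest_{n}"] = dict(ranked[:n])
        else:
            raise ValueError(f"unknown option: {option}")
    return selections


coefs = {"AAA": 0.5, "AAC": -0.2, "AAD": 0.0, "AAE": 0.05}
result = select_coefficients(coefs, ["nonzero", "cutoff", "n_largest"], cutoff=[0.1], n_largest=[2])
print(sorted(result["nonzero"]))      # ['AAA', 'AAC', 'AAE']
print(sorted(result["cutoff_0.1"]))   # ['AAA', 'AAC']
print(sorted(result["n_largest_2"]))  # ['AAA', 'AAC']
```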

YAML specification:

my_coef_report:
    Coefficients:
        coefs_to_plot:
            - all
            - nonzero
            - cutoff
            - n_largest
        cutoff:
            - 0.1
            - 0.01
        n_largest:
            - 5
            - 10

ConfounderAnalysis

A report that plots the numbers of false positives and false negatives with respect to each value of the metadata features specified by the user. This allows checking whether a given machine learning model makes more misclassifications for some values of a metadata feature than for the others.

Arguments:

metadata_labels (list): A list of the metadata features to use as a basis for the calculations
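
The quantity being plotted can be illustrated with a small sketch that counts false positives and false negatives separately for each value of one metadata feature. The function name and the positive_class argument are assumptions made for this example; the code is not the report's implementation.

```python
from collections import Counter


def misclassifications_per_value(true_labels, predicted_labels, metadata_values, positive_class=True):
    """Count false positives and false negatives per metadata value."""
    false_positives, false_negatives = Counter(), Counter()
    for truth, prediction, value in zip(true_labels, predicted_labels, metadata_values):
        if prediction == positive_class and truth != positive_class:
            false_positives[value] += 1
        elif prediction != positive_class and truth == positive_class:
            false_negatives[value] += 1
    return false_positives, false_negatives


truth = [True, False, False, True]
predicted = [True, True, True, False]
sex = ["F", "M", "M", "F"]
fp, fn = misclassifications_per_value(truth, predicted, sex)
print(dict(fp))  # {'M': 2}
print(dict(fn))  # {'F': 1}
```

In this toy example all false positives fall in one group, which is the kind of imbalance the report is meant to reveal.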

YAML specification:

my_confounder_report:
    ConfounderAnalysis:
        metadata_labels:
          - age
          - sex

DeepRCMotifDiscovery

This report plots the contributions of (i) input sequences and (ii) kernels to a trained DeepRC model with respect to the test dataset. Contributions are computed using integrated gradients (IG). This report produces two figures:

  • inputs_integrated_gradients: Shows the contributions of the characters within the input sequences (test dataset) that were most important for immune status prediction of the repertoire. IG is only applied to sequences of positive class repertoires.

  • kernel_integrated_gradients: Shows the 1D CNN kernels with the highest contribution over all positions and amino acids.

For both inputs and kernels: Larger characters in the extracted motifs indicate higher contribution, with blue indicating positive contribution and red indicating negative contribution towards the prediction of the immune status. For kernels only: contributions to positional encoding are indicated by < (beginning of sequence), ∧ (center of sequence), and > (end of sequence).

See DeepRCMotifDiscovery for repertoire classification for a usage example.

Reference: Michael Widrich, Bernhard Schäfl, Milena Pavlović, Geir Kjetil Sandve, Sepp Hochreiter, Victor Greiff, Günter Klambauer ‘DeepRC: Immune repertoire classification with attention-based deep massive multiple instance learning’. bioRxiv preprint doi: https://doi.org/10.1101/2020.04.12.03815

Arguments:

n_steps (int): Number of IG steps (more steps -> better path integral -> finer contribution values). 50 is usually good enough.

threshold (float): Only applies to the plotting of kernels. Contributions are normalized to range [0, 1], and only kernels with normalized contributions above threshold are plotted.

YAML specification:

my_deeprc_report:
    DeepRCMotifDiscovery:
        threshold: 0.5
        n_steps: 50

MotifSeedRecovery

This report can be used to show how well implanted motifs (for example, through the Simulation instruction) can be recovered by various machine learning methods using the k-mer encoding. This report creates a boxplot, where the x axis (box grouping) represents the maximum possible overlap between an implanted motif seed and a kmer feature (measured in number of positions), and the y axis shows the coefficient size of the respective kmer feature. If the machine learning method has learned the implanted motif seeds, the coefficient size is expected to be largest for the kmer features with high overlap to the motif seeds.

Note that to use this report, the following criteria must be met:

  • KmerFrequencyEncoder must be used.

  • One of the following classifiers must be used: RandomForestClassifier, LogisticRegression, SVM, SVC

  • For each label, the implanted motif seeds relevant to that label must be specified

To find the overlap score between kmer features and implanted motif seeds, the two sequences are compared in a sliding window approach, and the maximum overlap is calculated.

Overlap scores between kmer features and implanted motifs are calculated differently based on the Hamming distance that was allowed during implanting.

Without hamming distance:
Seed:     AAA  -> score = 3
Feature: xAAAx
          ^^^

Seed:     AAA  -> score = 0
Feature: xAAxx

With hamming distance:
Seed:     AAA  -> score = 3
Feature: xAAAx
          ^^^

Seed:     AAA  -> score = 2
Feature: xAAxx
          ^^

Furthermore, gap positions in the motif seed are ignored:
Seed:     A/AA  -> score = 3
Feature: xAxAAx
          ^/^^

See Recovering simulated immune signals for more details and an example plot.
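
The ungapped scoring examples above can be reproduced with the following sketch. It is a simplification under stated assumptions: only windows where the seed lies fully inside the feature are considered, gapped seeds (containing ‘/’) are not handled, and the function name is hypothetical. It is not the immuneML implementation.

```python
def overlap_score(seed, feature, hamming_distance):
    """Maximum number of matching positions over all sliding-window alignments.

    Without hamming distance, only exact matches of the whole seed count.
    With hamming distance, mismatching positions are tolerated and the
    score is the number of positions that still match.
    """
    best = 0
    for start in range(len(feature) - len(seed) + 1):
        window = feature[start:start + len(seed)]
        matches = sum(s == f for s, f in zip(seed, window))
        if hamming_distance or matches == len(seed):
            best = max(best, matches)
    return best


print(overlap_score("AAA", "xAAAx", hamming_distance=False))  # 3
print(overlap_score("AAA", "xAAxx", hamming_distance=False))  # 0
print(overlap_score("AAA", "xAAAx", hamming_distance=True))   # 3
print(overlap_score("AAA", "xAAxx", hamming_distance=True))   # 2
```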

Arguments:

implanted_motifs_per_label (dict): a nested dictionary that specifies the motif seeds that were implanted in the given dataset. The first level of keys in this dictionary represents the different labels. The inner dictionary can contain the following keys: “seeds”, “hamming_distance” and “gap_sizes”:

  • seeds: a list of motif seeds. The seeds may contain gaps, specified by a ‘/’ symbol.

  • hamming_distance: A boolean value that specifies whether hamming distance was allowed when implanting the motif seeds for a given label. Note that this applies to all seeds for this label.

  • gap_sizes: a list of all the possible gap sizes that were used when implanting a gapped motif seed. When no gapped seeds are used, this value has no effect.

YAML specification:

my_motif_report:
    MotifSeedRecovery:
        implanted_motifs_per_label:
            CD:
                seeds:
                - AA/A
                - AAA
                hamming_distance: False
                gap_sizes:
                - 0
                - 1
                - 2
            T1D:
                seeds:
                - CC/C
                - CCC
                hamming_distance: True
                gap_sizes:
                - 2

ROCCurve

A report that plots the ROC curve for a binary classifier.

YAML specification:

reports:
  my_roc_report: ROCCurve

SequenceAssociationLikelihood

Plots the beta distribution used as a prior for class assignment in ProbabilisticBinaryClassifier. The distribution plotted shows the probability that a sequence is associated with a given class for a label.

Arguments: this report does not take any arguments.

YAML specification:

my_sequence_assoc_report: SequenceAssociationLikelihood

TCRdistMotifDiscovery

The report for discovering motifs in paired immune receptor data of given specificity based on TCRdist3. The receptors are hierarchically clustered based on the tcrdist distance and then motifs are discovered for each cluster. The report outputs logo plots for the motifs along with the raw data used for plotting in csv format.

For the implementation, TCRdist3 library was used (source code available here). More details on the functionality used for this report are available here.

Original publications:

Dash P, Fiore-Gartland AJ, Hertz T, et al. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature. 2017; 547(7661):89-93. doi:10.1038/nature22383

Mayer-Blackwell K, Schattgen S, Cohen-Lavi L, et al. TCR meta-clonotypes for biomarker discovery with tcrdist3: quantification of public, HLA-restricted TCR biomarkers of SARS-CoV-2 infection. bioRxiv. Published online December 26, 2020:2020.12.24.424260. doi:10.1101/2020.12.24.424260

Arguments:

positive_class_name (str): the class value (e.g., epitope) used to select only the receptors that are specific to the given epitope, so that only those sequences are used to infer motifs. The reference receptors required by TCRdist will be the ones from the dataset that have a different epitope or no epitope specified in their metadata. If the labels are available only on the epitope level (e.g., the label is “AVFDRKSDAK” and the classes are True and False), then it should be specified here that only the receptors with value “True” for label “AVFDRKSDAK” should be used. There is no default value for this argument.

cores (int): number of processes to use for the computation of the distance and motifs

min_cluster_size (int): the minimum size of the cluster to discover the motifs for

use_reference_sequences (bool): when showing motifs, this parameter defines if reference sequences should be provided as well as a background

YAML specification:

my_tcr_dist_report: # user-defined name
    TCRdistMotifDiscovery:
        positive_class_name: True # class name, could also be epitope name, depending on how it's defined in the dataset
        cores: 4
        min_cluster_size: 30
        use_reference_sequences: False
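
As a sketch of the other labelling scheme mentioned in the argument description, the example below assumes a hypothetical dataset in which the class values of the label are epitope sequences, so the epitope itself is passed as positive_class_name. The report name and all values are illustrative only.

```yaml
my_tcr_dist_report_by_epitope: # user-defined name (hypothetical)
    TCRdistMotifDiscovery:
        positive_class_name: AVFDRKSDAK # assumes the class values of the label are epitope sequences
        cores: 4
        min_cluster_size: 30
        use_reference_sequences: False
```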

TrainingPerformance

A report that plots the evaluation metrics for the performance of a given machine learning model on the training dataset. The available metrics are accuracy, balanced_accuracy, confusion_matrix, f1_micro, f1_macro, f1_weighted, precision, recall, auc and log_loss (see immuneML.environment.Metric.Metric).

Arguments:

metrics (list): A list of metrics used to evaluate training performance. See immuneML.environment.Metric.Metric for available options.

YAML specification:

my_performance_report:
    TrainingPerformance:
        metrics:
            - accuracy
            - balanced_accuracy
            - confusion_matrix
            - f1_micro
            - f1_macro
            - f1_weighted
            - precision
            - recall
            - auc
            - log_loss

Train ML model reports

Train ML model reports plot general statistics or export data of multiple models simultaneously when running the TrainMLModel instruction.

In the TrainMLModel instruction, train ML model reports can be specified under ‘reports’. Example:

my_instruction:
    type: TrainMLModel
    reports:
        - my_train_ml_model_report
    # other parameters...

CVFeaturePerformance

This report plots the average training vs. test performance with respect to a given encoding parameter, which is explicitly set in the feature argument. It can be used only in combination with the TrainMLModel instruction and can be specified only under ‘reports’.

Arguments:

feature: name of the encoder parameter w.r.t. which the performance across training and test will be shown. Possible values depend on the encoder on which it is used.

is_feature_axis_categorical (bool): if the x-axis of the plot where features are shown should be categorical; alternatively it is automatically determined based on the feature values

YAML specification:

report1:
    CVFeaturePerformance:
        feature: p_value_threshold # parameter value of SequenceAbundance encoder
        is_feature_axis_categorical: True # show x-axis as categorical

DiseaseAssociatedSequenceCVOverlap

DiseaseAssociatedSequenceCVOverlap report makes one heatmap per label showing the overlap of disease-associated sequences (or k-mers) produced by the SequenceAbundanceEncoder, CompAIRRSequenceAbundanceEncoder or KmerAbundanceEncoder between folds of cross-validation (either inner or outer loop of the nested CV). The overlap is computed by the following equation:

\[\mathrm{overlap}(X,Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)} \times 100\]

For details, see Greiff V, Menzel U, Miho E, et al. Systems Analysis Reveals High Genetic and Antigen-Driven Predetermination of Antibody Repertoires throughout B Cell Development. Cell Reports. 2017;19(7):1467-1478. doi:10.1016/j.celrep.2017.04.054.
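
A small worked example of the overlap equation, using made-up set sizes: if one fold yields 10 disease-associated sequences, another yields 20, and 5 sequences appear in both, the overlap is 50 percent.

```latex
\[
|X| = 10,\quad |Y| = 20,\quad |X \cap Y| = 5
\quad\Longrightarrow\quad
\mathrm{overlap}(X,Y) = \frac{5}{\min(10, 20)} \times 100 = \frac{5}{10} \times 100 = 50
\]
```

Because the denominator is the size of the smaller set, the overlap reaches 100 whenever the smaller set is fully contained in the larger one.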

Arguments:

compare_in_selection (bool): whether to compute the overlap over the inner loop of the nested CV - the sequence overlap is shown across CV folds for the model chosen as optimal within that selection

compare_in_assessment (bool): whether to compute the overlap over the optimal models in the outer loop of the nested CV

YAML specification:

reports: # the report is defined with all other reports under definitions/reports
    my_overlap_report: DiseaseAssociatedSequenceCVOverlap # no parameters are specified here
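
The two arguments listed under Arguments can also be set explicitly. The following is a minimal sketch; the report name and the chosen values are illustrative and are not meant to reflect the defaults.

```yaml
reports: # the report is defined with all other reports under definitions/reports
    my_overlap_report_with_params: # user-defined name (hypothetical)
        DiseaseAssociatedSequenceCVOverlap:
            compare_in_selection: True # compute the overlap across folds of the inner (selection) loop
            compare_in_assessment: True # compute the overlap across optimal models of the outer (assessment) loop
```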

MLSettingsPerformance

Report for TrainMLModel instruction: plots the performance for each of the setting combinations as defined under ‘settings’ in the assessment (outer validation) loop.

The performances are grouped by label (horizontal panels), encoding (vertical panels) and ML method (bar color). When multiple data splits are used, the average performance over the data splits is shown with an error bar representing the standard deviation.

This report can be used only with TrainMLModel instruction under ‘reports’.

Arguments:

single_axis_labels (bool): whether to use single axis labels. Note that using single axis labels makes the figure unsuited for rescaling, as the label position is given at a fixed distance from the axis. By default, single_axis_labels is False, resulting in standard plotly axis labels.

x_label_position (float): if single_axis_labels is True, this should be a number specifying the x axis label position relative to the x axis. The default value for x_label_position is -0.1.

y_label_position (float): same as x_label_position, but for the y axis.

YAML specification:

my_hp_report: MLSettingsPerformance
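
When single axis labels are wanted, the arguments above can be passed explicitly. This is a sketch with a hypothetical report name; the label positions are illustrative values and may need adjusting for a given figure size.

```yaml
my_hp_report_single_labels: # user-defined name (hypothetical)
    MLSettingsPerformance:
        single_axis_labels: True
        x_label_position: -0.1 # position of the x axis label relative to the x axis
        y_label_position: -0.1 # illustrative value; same meaning, but for the y axis
```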

ROCCurveSummary

This report plots ROC curves for all trained ML settings ([preprocessing], encoding, ML model) in the outer loop of cross-validation in the TrainMLModel instruction. If there are multiple splits in the outer loop, this report will make one plot per split. This report is defined only for binary classification. If there are multiple labels defined in the instruction, each label has to have two classes to be included in this report.

Arguments: there are no arguments for this report.

YAML specification:


reports:
    my_roc_summary_report: ROCCurveSummary

ReferenceSequenceOverlap

The ReferenceSequenceOverlap report compares a list of disease-associated sequences (or k-mers) produced by the SequenceAbundanceEncoder, CompAIRRSequenceAbundanceEncoder or KmerAbundanceEncoder to a list of reference sequences. It outputs a Venn diagram and a list of sequences found both in the encoder and reference list.

The report compares the sequences by their sequence content and the additional comparison_attributes (such as V or J gene), as specified by the user.

Arguments:

reference_path (str): path to the reference file in csv format which contains one entry per row and has columns that correspond to the attributes listed under comparison_attributes argument

comparison_attributes (list): list of attributes to use for comparison; all of them have to be present in the reference file where they should be the names of the columns

label (str): name of the label for which the reference sequences/k-mers should be compared to the model; if none, it takes the one label from the instruction; if it is none and multiple labels were specified for the instruction, the report will not be generated

YAML specification:

reports: # the report is defined with all other reports under definitions/reports
    my_reference_overlap_report:
        ReferenceSequenceOverlap:
            reference_path: reference_sequences.csv  # example usage with SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder
            comparison_attributes:
                - sequence_aas
                - v_genes
                - j_genes
    my_reference_overlap_report_with_kmers:
        ReferenceSequenceOverlap:
            reference_path: reference_kmers.csv  # example usage with KmerAbundanceEncoder
            comparison_attributes:
                - k-mer
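
If the instruction uses several labels, the label argument described above selects which one the comparison is made for. A sketch follows, in which the report name and the label name CMV are hypothetical; the label is assumed to be one of the labels defined in the instruction.

```yaml
reports:
    my_reference_overlap_report_for_label: # user-defined name (hypothetical)
        ReferenceSequenceOverlap:
            reference_path: reference_sequences.csv
            comparison_attributes:
                - sequence_aas
                - v_genes
                - j_genes
            label: CMV # hypothetical label name
```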

Multi dataset reports

Multi dataset reports are special reports that can be specified when running immuneML with the MultiDatasetBenchmarkTool.

When running the MultiDatasetBenchmarkTool, multi dataset reports can be specified under ‘benchmark_reports’. Example:

my_instruction:
    type: TrainMLModel
    benchmark_reports:
        - my_benchmark_report
    # other parameters...

DiseaseAssociatedSequenceOverlap

DiseaseAssociatedSequenceOverlap report makes a heatmap showing the overlap of disease-associated sequences (or k-mers) produced by the SequenceAbundanceEncoder, CompAIRRSequenceAbundanceEncoder or KmerAbundanceEncoder between multiple datasets of different sizes (different number of repertoires per dataset).

This report can be used only with MultiDatasetBenchmarkTool.

The overlap is computed by the following equation:

\[\mathrm{overlap}(X,Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)} \times 100\]

For details, see Greiff V, Menzel U, Miho E, et al. Systems Analysis Reveals High Genetic and Antigen-Driven Predetermination of Antibody Repertoires throughout B Cell Development. Cell Reports. 2017;19(7):1467-1478. doi:10.1016/j.celrep.2017.04.054.

YAML specification:

reports: # the report is defined with all other reports under definitions/reports
    my_overlap_report: DiseaseAssociatedSequenceOverlap # report has no parameters

PerformanceOverview

PerformanceOverview report creates an ROC plot and a precision-recall plot for optimal trained models on multiple datasets. The labels on the plots are the names of the datasets, so it helps to give the datasets user-friendly names, which must still be a combination of letters, numbers and the underscore sign.

This report can be used only with MultiDatasetBenchmarkTool as it will plot ROC and PR curve for trained models across datasets. Also, it requires the task to be immune repertoire classification and cannot be used for receptor or sequence classification. Furthermore, it uses predictions on the test dataset to assess the performance and plot the curves. If the parameter refit_optimal_model is set to True, all data will be used to fit the optimal model, so there will not be a test dataset which can be used to assess performance and the report will not be generated.

If datasets have the same number of examples, the baseline PR curve will be plotted as described in this publication: Saito T, Rehmsmeier M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS ONE. 2015;10(3):e0118432. doi:10.1371/journal.pone.0118432

If the datasets have different numbers of examples, the baseline PR curve will not be plotted.
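
For orientation, the cited publication describes the baseline of a precision-recall plot as a horizontal line at the fraction of positive examples. The symbols P and N (numbers of positive and negative examples) are notation introduced here, and how exactly immuneML draws the line is not specified in this section.

```latex
\[
\text{baseline precision} = \frac{P}{P + N},
\qquad \text{e.g. } P = 20,\ N = 80 \ \Longrightarrow\ \frac{20}{20 + 80} = 0.2
\]
```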

YAML specification:

reports:
    my_performance_report: PerformanceOverview

ML methods

When choosing which ML method(s) are most suitable for your use-case, please consider the following table. The table describes which of the ML methods can be used for binary classification (two classes per label), and which can be used for multi-class classification. Note that all classifiers can automatically be used for multi-label classification in immuneML. Furthermore, it describes what type of dataset the classifier can be applied to, and whether a third level of nested cross-validation can be used for the selection of model parameters (scikit-learn classifiers).

ML methods properties

Table columns: ML method, binary classification, multi-class classification, sequence dataset, receptor dataset, repertoire dataset, model selection CV

Table rows (ML methods): AtchleyKmerMILClassifier, DeepRC, KNN, LogisticRegression, ProbabilisticBinaryClassifier, RandomForestClassifier, ReceptorCNN, SVM, TCRdistClassifier

AtchleyKmerMILClassifier

A binary repertoire classifier which uses the data encoded by the AtchleyKmer encoder to predict the repertoire label.

The original publication: Ostmeyer J, Christley S, Toby IT, Cowell LG. Biophysicochemical motifs in T cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocytes and adjacent healthy tissue. Cancer Res. Published online January 1, 2019:canres.2292.2018. doi:10.1158/0008-5472.CAN-18-2292 .

Arguments:

iteration_count (int): max number of training iterations

threshold (float): loss threshold at which to stop training if reached

evaluate_at (int): log model performance every ‘evaluate_at’ iterations and store the model every ‘evaluate_at’ iterations if early stopping is used

use_early_stopping (bool): whether to use early stopping

learning_rate (float): learning rate for stochastic gradient descent

random_seed (int): random seed used

zero_abundance_weight_init (bool): whether to use 0 as the initial weight for the abundance term (if not, a random value is sampled from a normal distribution with mean 0 and variance 1 / total_number_of_features)

number_of_threads (int): number of threads to be used for training

initialization_count (int): how many times to repeat the fitting procedure from the beginning before choosing the optimal model (trains the model with multiple random initializations)

pytorch_device_name (str): The name of the pytorch device to use. This name will be passed to torch.device(pytorch_device_name).

YAML specification:

my_kmer_mil_classifier:
    AtchleyKmerMILClassifier:
        iteration_count: 100
        evaluate_at: 15
        use_early_stopping: False
        learning_rate: 0.01
        random_seed: 100
        zero_abundance_weight_init: True
        number_of_threads: 8
        threshold: 0.00001
        initialization_count: 4

DeepRC

This classifier uses the DeepRC method for repertoire classification. The DeepRC ML method should be used in combination with the DeepRC encoder. Also consider using the DeepRCMotifDiscovery report for interpretability.

Notes:

  • DeepRC uses PyTorch functionalities that depend on GPU. Therefore, DeepRC does not work on a CPU.

  • This wrapper around DeepRC currently only supports binary classification.

Reference: Michael Widrich, Bernhard Schäfl, Milena Pavlović, Geir Kjetil Sandve, Sepp Hochreiter, Victor Greiff, Günter Klambauer ‘DeepRC: Immune repertoire classification with attention-based deep massive multiple instance learning’. bioRxiv preprint doi: https://doi.org/10.1101/2020.04.12.038158

Arguments:

validation_part (float): the part of the data that will be used for validation, the rest will be used for training.

add_positional_information (bool): whether positional information should be included in the input features.

kernel_size (int): the size of the 1D-CNN kernels.

n_kernels (int): the number of 1D-CNN kernels in each layer.

n_additional_convs (int): Number of additional 1D-CNN layers after first layer

n_attention_network_layers (int): Number of attention layers to compute keys

n_attention_network_units (int): Number of units in each attention layer

n_output_network_units (int): Number of units in the output layer

consider_seq_counts (bool): whether the input data should be scaled by the receptor sequence counts.

sequence_reduction_fraction (float): Fraction of number of sequences to which to reduce the number of sequences per bag based on attention weights. Has to be in range [0,1].

reduction_mb_size (int): Reduction of sequences per bag is performed using minibatches of reduction_mb_size sequences to compute the attention weights.

n_updates (int): Number of updates to train for

n_torch_threads (int): Number of parallel threads to allow PyTorch

learning_rate (float): Learning rate for adam optimizer

l1_weight_decay (float): l1 weight decay factor. l1 weight penalty will be added to loss, scaled by l1_weight_decay

l2_weight_decay (float): l2 weight decay factor. l2 weight penalty will be added to loss, scaled by l2_weight_decay

sequence_counts_scaling_fn: it can either be log (logarithmic scaling of sequence counts) or None

evaluate_at (int): Evaluate model on training and validation set every evaluate_at updates. This will also check for a new best model for early stopping.

sample_n_sequences (int): Optional random sub-sampling of sample_n_sequences sequences per repertoire. Number of sequences per repertoire might be smaller than sample_n_sequences if repertoire is smaller or random indices have been drawn multiple times. If None, all sequences will be loaded for each repertoire.

training_batch_size (int): Number of repertoires per minibatch during training.

n_workers (int): Number of background processes to use for converting dataset to hdf5 container and training set data loader.

pytorch_device_name (str): The name of the pytorch device to use. This name will be passed to torch.device(self.pytorch_device_name). The default value is cuda:0

YAML specification:

my_deeprc_method:
    DeepRC:
        validation_part: 0.2
        add_positional_information: True
        kernel_size: 9
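
The example above sets only three arguments. A somewhat fuller sketch is shown below; it uses only argument names from the list above, the method name is hypothetical, and apart from the documented default device cuda:0 the added values are illustrative placeholders rather than recommended or default settings.

```yaml
my_deeprc_method_extended: # user-defined name (hypothetical)
    DeepRC:
        validation_part: 0.2
        add_positional_information: True
        kernel_size: 9
        n_kernels: 32 # illustrative value
        n_updates: 1000 # illustrative value
        evaluate_at: 100 # illustrative value: evaluate every 100 updates
        learning_rate: 0.0001 # illustrative value
        training_batch_size: 4 # illustrative value
        pytorch_device_name: "cuda:0" # documented default
```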

KNN

This is a wrapper of scikit-learn’s KNeighborsClassifier class. This ML method creates a distance matrix using the given encoded data. If the encoded data is already a distance matrix (for example, when using the Distance or CompAIRRDistance encoders), please use PrecomputedKNN instead.

Please see the scikit-learn documentation of KNeighborsClassifier for the parameters.

Scikit-learn models can be trained in two modes:

1. Creating a model using a given set of hyperparameters, and relying on the selection and assessment loop in the TrainMLModel instruction to select the optimal model.

2. Passing a range of different hyperparameters to KNN, and using a third layer of nested cross-validation to find the optimal hyperparameters through grid search. In this case, only the KNN model with the optimal hyperparameter settings is further used in the inner selection loop of the TrainMLModel instruction.

By default, mode 1 is used. In order to use mode 2, model_selection_cv and model_selection_n_folds must be set.

Arguments:

KNN (dict): Under this key, hyperparameters can be specified that will be passed to the scikit-learn class. Any scikit-learn hyperparameters can be specified here. In mode 1, a single value must be specified for each of the scikit-learn hyperparameters. In mode 2, a range of hyperparameter values can be specified in a list. Lists and single values can also be mixed in mode 2, in which case the grid search is done only over the lists, while the single-value hyperparameters stay fixed. In addition to the scikit-learn hyperparameters, the parameter show_warnings (True/False) can be specified here. This determines whether scikit-learn warnings, such as convergence warnings, should be printed. By default, show_warnings is True.

model_selection_cv (bool): If any of the hyperparameters under KNN is a list and model_selection_cv is True, a grid search will be done over the given hyperparameters, using the number of folds specified in model_selection_n_folds. By default, model_selection_cv is False.

model_selection_n_folds (int): The number of folds that should be used for the cross validation grid search if model_selection_cv is True.

YAML specification:

my_knn_method:
    KNN:
        # sklearn parameters (same names as in original sklearn class)
        weights: uniform # always use this setting for weights
        n_neighbors: [5, 10, 15] # find the optimal number of neighbors
        # Additional parameter that determines whether to print convergence warnings
        show_warnings: True
    # if any of the parameters under KNN is a list and model_selection_cv is True,
    # a grid search will be done over the given parameters, using the number of folds specified in model_selection_n_folds,
    # and the optimal model will be selected
    model_selection_cv: True
    model_selection_n_folds: 5
# alternative way to define ML method with default values:
my_default_knn: KNN
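
The specification above illustrates mode 2. For comparison, a minimal sketch of mode 1 gives a single value for every hyperparameter, so that no grid search takes place. The method name is hypothetical and the values are illustrative.

```yaml
my_fixed_knn: # user-defined name (hypothetical)
    KNN:
        weights: uniform
        n_neighbors: 5 # a single value instead of a list: no grid search over this hyperparameter
        show_warnings: True
    model_selection_cv: False # documented default, written out here for clarity
```

With this setup, the choice between alternative settings is left to the selection and assessment loops of the TrainMLModel instruction.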

LogisticRegression

This is a wrapper of scikit-learn’s LogisticRegression class. Please see the scikit-learn documentation of LogisticRegression for the parameters.

Note: if you are interested in plotting the coefficients of the logistic regression model, consider running the Coefficients report.

Scikit-learn models can be trained in two modes:

1. Creating a model using a given set of hyperparameters, and relying on the selection and assessment loop in the TrainMLModel instruction to select the optimal model.

2. Passing a range of different hyperparameters to LogisticRegression, and using a third layer of nested cross-validation to find the optimal hyperparameters through grid search. In this case, only the LogisticRegression model with the optimal hyperparameter settings is further used in the inner selection loop of the TrainMLModel instruction.

By default, mode 1 is used. In order to use mode 2, model_selection_cv and model_selection_n_folds must be set.

Arguments:

LogisticRegression (dict): Under this key, hyperparameters can be specified that will be passed to the scikit-learn class. Any scikit-learn hyperparameters can be specified here. In mode 1, a single value must be specified for each of the scikit-learn hyperparameters. In mode 2, a range of hyperparameter values can be specified in a list. Lists and single values can also be mixed in mode 2, in which case the grid search is done only over the lists, while the single-value hyperparameters stay fixed. In addition to the scikit-learn hyperparameters, the parameter show_warnings (True/False) can be specified here. This determines whether scikit-learn warnings, such as convergence warnings, should be printed. By default, show_warnings is True.

model_selection_cv (bool): If any of the hyperparameters under LogisticRegression is a list and model_selection_cv is True, a grid search will be done over the given hyperparameters, using the number of folds specified in model_selection_n_folds. By default, model_selection_cv is False.

model_selection_n_folds (int): The number of folds that should be used for the cross validation grid search if model_selection_cv is True.

YAML specification:

my_logistic_regression: # user-defined method name
    LogisticRegression: # name of the ML method
        # sklearn parameters (same names as in original sklearn class)
        penalty: l1 # always use penalty l1
        C: [0.01, 0.1, 1, 10, 100] # find the optimal value for C
        # Additional parameter that determines whether to print convergence warnings
        show_warnings: True
    # if any of the parameters under LogisticRegression is a list and model_selection_cv is True,
    # a grid search will be done over the given parameters, using the number of folds specified in model_selection_n_folds,
    # and the optimal model will be selected
    model_selection_cv: True
    model_selection_n_folds: 5
# alternative way to define ML method with default values:
my_default_logistic_regression: LogisticRegression

PrecomputedKNN

This is a wrapper of scikit-learn’s KNeighborsClassifier class. This ML method takes a pre-computed distance matrix, as created by the Distance or CompAIRRDistance encoders. If you would like to use a different encoding in combination with KNN, please use KNN instead.

Please see the scikit-learn documentation of KNeighborsClassifier for the parameters.

Scikit-learn models can be trained in two modes:

1. Creating a model using a given set of hyperparameters, and relying on the selection and assessment loop in the TrainMLModel instruction to select the optimal model.

2. Passing a range of different hyperparameters to PrecomputedKNN, and using a third layer of nested cross-validation to find the optimal hyperparameters through grid search. In this case, only the PrecomputedKNN model with the optimal hyperparameter settings is further used in the inner selection loop of the TrainMLModel instruction.

By default, mode 1 is used. In order to use mode 2, model_selection_cv and model_selection_n_folds must be set.

Arguments:

PrecomputedKNN (dict): Under this key, hyperparameters can be specified that will be passed to the scikit-learn class. Any scikit-learn hyperparameters can be specified here. In mode 1, a single value must be specified for each of the scikit-learn hyperparameters. In mode 2, a range of hyperparameter values can be specified in a list. Lists and single values can also be mixed in mode 2, in which case the grid search is done only over the lists, while the single-value hyperparameters stay fixed. In addition to the scikit-learn hyperparameters, the parameter show_warnings (True/False) can be specified here. This determines whether scikit-learn warnings, such as convergence warnings, should be printed. By default, show_warnings is True.

model_selection_cv (bool): If any of the hyperparameters under PrecomputedKNN is a list and model_selection_cv is True, a grid search will be done over the given hyperparameters, using the number of folds specified in model_selection_n_folds. By default, model_selection_cv is False.

model_selection_n_folds (int): The number of folds that should be used for the cross validation grid search if model_selection_cv is True.

YAML specification:

my_knn_method:
    PrecomputedKNN:
        # sklearn parameters (same names as in original sklearn class)
        weights: uniform # always use this setting for weights
        n_neighbors: [5, 10, 15] # find the optimal number of neighbors
        # Additional parameter that determines whether to print convergence warnings
        show_warnings: True
    # if any of the parameters under PrecomputedKNN is a list and model_selection_cv is True,
    # a grid search will be done over the given parameters, using the number of folds specified in model_selection_n_folds,
    # and the optimal model will be selected
    model_selection_cv: True
    model_selection_n_folds: 5
# alternative way to define ML method with default values:
my_default_knn: PrecomputedKNN

ProbabilisticBinaryClassifier

ProbabilisticBinaryClassifier predicts the class assignment in the binary classification case. Each example is encoded by the number of successful trials and the total number of trials. The classifier models this ratio with one beta distribution per class and predicts the class of new examples using the log-posterior odds ratio with a threshold at 0.

ProbabilisticBinaryClassifier is based on the paper (details on the classification can be found in the Online Methods section): Emerson, Ryan O., William S. DeWitt, Marissa Vignali, Jenna Gravley, Joyce K. Hu, Edward J. Osborne, Cindy Desmarais, et al. ‘Immunosequencing Identifies Signatures of Cytomegalovirus Exposure History and HLA-Mediated Effects on the T Cell Repertoire’. Nature Genetics 49, no. 5 (May 2017): 659–65. doi.org/10.1038/ng.3822.
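
The decision rule described above can be written schematically as follows. The notation is an assumption made here for illustration: k is the number of successful trials and n the total number of trials for a new example, and c+ and c- denote the two classes. How the posterior is computed from the fitted beta distributions is described in the cited paper and is not reproduced here.

```latex
\[
\hat{c} =
\begin{cases}
c_{+} & \text{if } \log \dfrac{P(c_{+} \mid k, n)}{P(c_{-} \mid k, n)} > 0, \\[2ex]
c_{-} & \text{otherwise.}
\end{cases}
\]
```

A threshold of 0 on the log odds corresponds to choosing whichever class has the higher posterior probability.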

Arguments:

max_iterations (int): maximum number of iterations while optimizing the parameters of the beta distribution (same for both classes)

update_rate (float): how much the computed gradient should influence the updated value of the parameters of the beta distribution

likelihood_threshold (float): at which threshold to stop the optimization (default -1e-10)

YAML specification:

my_probabilistic_classifier: # user-defined name of the ML method
    ProbabilisticBinaryClassifier: # method name
        max_iterations: 1000
        update_rate: 0.01
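
The specification above leaves likelihood_threshold at its default. A sketch that sets it explicitly to the documented default is shown below; the method name is hypothetical. The value is written with an explicit decimal point because some YAML 1.1 parsers (PyYAML among them) read a form such as -1e-10 as a string rather than a number.

```yaml
my_probabilistic_classifier_with_threshold: # user-defined name (hypothetical)
    ProbabilisticBinaryClassifier:
        max_iterations: 1000
        update_rate: 0.01
        likelihood_threshold: -1.0e-10 # documented default, written with a decimal point
```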

RandomForestClassifier

This is a wrapper of scikit-learn’s RandomForestClassifier class. Please see the scikit-learn documentation of RandomForestClassifier for the parameters.

Note: if you are interested in plotting the coefficients of the random forest classifier model, consider running the Coefficients report.

Scikit-learn models can be trained in two modes:

1. Creating a model using a given set of hyperparameters, and relying on the selection and assessment loop in the TrainMLModel instruction to select the optimal model.

2. Passing a range of different hyperparameters to RandomForestClassifier, and using a third layer of nested cross-validation to find the optimal hyperparameters through grid search. In this case, only the RandomForestClassifier model with the optimal hyperparameter settings is further used in the inner selection loop of the TrainMLModel instruction.

By default, mode 1 is used. In order to use mode 2, model_selection_cv and model_selection_n_folds must be set.

Arguments:

RandomForestClassifier (dict): Under this key, hyperparameters can be specified that will be passed to the scikit-learn class. Any scikit-learn hyperparameters can be specified here. In mode 1, a single value must be specified for each of the scikit-learn hyperparameters. In mode 2, a range of hyperparameter values can be specified in a list. Lists and single values can also be mixed in mode 2, in which case the grid search is done only over the lists, while the single-value hyperparameters stay fixed. In addition to the scikit-learn hyperparameters, the parameter show_warnings (True/False) can be specified here. This determines whether scikit-learn warnings, such as convergence warnings, should be printed. By default, show_warnings is True.

model_selection_cv (bool): If any of the hyperparameters under RandomForestClassifier is a list and model_selection_cv is True, a grid search will be done over the given hyperparameters, using the number of folds specified in model_selection_n_folds. By default, model_selection_cv is False.

model_selection_n_folds (int): The number of folds that should be used for the cross validation grid search if model_selection_cv is True.

YAML specification:

my_random_forest_classifier: # user-defined method name
    RandomForestClassifier: # name of the ML method
        # sklearn parameters (same names as in original sklearn class)
        random_state: 100 # always use this value for random state
        n_estimators: [10, 50, 100] # find the optimal number of trees in the forest
        # Additional parameter that determines whether to print convergence warnings
        show_warnings: True
    # if any of the parameters under RandomForestClassifier is a list and model_selection_cv is True,
    # a grid search will be done over the given parameters, using the number of folds specified in model_selection_n_folds,
    # and the optimal model will be selected
    model_selection_cv: True
    model_selection_n_folds: 5
# alternative way to define ML method with default values:
my_default_random_forest: RandomForestClassifier

ReceptorCNN

A CNN which separately detects motifs using CNN kernels in each chain of paired receptor data, combines the kernel activations into a unique representation of the receptor and uses this representation to predict the antigen binding.

Figure (receptor_cnn_immuneML.png): The architecture of the CNN for paired-chain receptor data

Requires one-hot encoded data as input (as produced by OneHot encoder), where use_positional_info must be set to True.

Notes:

  • ReceptorCNN can only be used with ReceptorDatasets; it does not work with SequenceDatasets.

  • ReceptorCNN can only be used for binary classification, not multi-class classification.

Arguments:

kernel_count (int): number of kernels that will look for motifs for one chain

kernel_size (list): sizes of the kernels = how many amino acids to consider at the same time in the chain sequence, can be a tuple of values; e.g. for value [3, 4] of kernel_size, kernel_count*len(kernel_size) kernels will be created, with kernel_count kernels of size 3 and kernel_count kernels of size 4 per chain

positional_channels (int): how many positional channels were included in the one-hot encoding of the receptor sequences (the OneHot encoder adds 3 positional channels if positional information is enabled)

sequence_type (SequenceType): type of the sequence

device: which device to use for the model (cpu or gpu) - for more details see PyTorch documentation on device parameter

number_of_threads (int): how many threads to use

random_seed (int): number used as a seed for random initialization

learning_rate (float): learning rate scaling the step size for optimization algorithm

iteration_count (int): for how many iterations to train the model

l1_weight_decay (float): weight decay l1 value for the CNN; encourages sparser representations

l2_weight_decay (float): weight decay l2 value for the CNN; shrinks weight coefficients towards zero

batch_size (int): how many receptors to process at once

training_percentage (float): what percentage of data to use for training (the rest will be used for validation); values between 0 and 1

evaluate_at (int): when to evaluate the model, e.g. every 100 iterations

background_probabilities: used for rescaling the kernel values to produce information gain matrix; represents the background probability of each amino acid (without positional information); if not specified, uniform background is assumed

YAML specification:

my_receptor_cnn:
    ReceptorCNN:
        kernel_count: 5
        kernel_size: [3]
        positional_channels: 3
        sequence_type: amino_acid
        device: cpu
        number_of_threads: 16
        random_seed: 100
        learning_rate: 0.01
        iteration_count: 10000
        l1_weight_decay: 0
        l2_weight_decay: 0
        batch_size: 5000
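
As a rough illustration of the kernel_size arithmetic described in the arguments, the sketch below counts the kernels created for kernel_count: 5 and kernel_size: [3, 4]. The helper kernels_per_chain is hypothetical and not part of immuneML, and the factor of two assumes a paired receptor with exactly two chains.

```python
def kernels_per_chain(kernel_count: int, kernel_size: list) -> dict:
    """Map each kernel size to the number of kernels created for one chain."""
    return {size: kernel_count for size in kernel_size}


layout = kernels_per_chain(kernel_count=5, kernel_size=[3, 4])

# kernel_count * len(kernel_size) kernels per chain
total_per_chain = sum(layout.values())

# assumption: a paired receptor consists of exactly two chains
total_for_receptor = 2 * total_per_chain
```

With these values, each chain gets 5 kernels of size 3 and 5 kernels of size 4.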

SVC

This is a wrapper of scikit-learn’s LinearSVC class. Please see the scikit-learn documentation of LinearSVC for the parameters.

Note: if you are interested in plotting the coefficients of the SVC model, consider running the Coefficients report.

Scikit-learn models can be trained in two modes:

1. Creating a model using a given set of hyperparameters, and relying on the selection and assessment loop in the TrainMLModel instruction to select the optimal model.

2. Passing a range of different hyperparameters to SVC, and using a third layer of nested cross-validation to find the optimal hyperparameters through grid search. In this case, only the SVC model with the optimal hyperparameter settings is further used in the inner selection loop of the TrainMLModel instruction.

By default, mode 1 is used. In order to use mode 2, model_selection_cv and model_selection_n_folds must be set.

Arguments:

SVC (dict): Under this key, hyperparameters can be specified that will be passed to the scikit-learn class. Any scikit-learn hyperparameters can be specified here. In mode 1, a single value must be specified for each of the scikit-learn hyperparameters. In mode 2, it is possible to specify a range of different hyperparameter values in a list. It is also allowed to mix lists and single values in mode 2, in which case the grid search will only be done for the lists, while the single-value hyperparameters will be fixed. In addition to the scikit-learn hyperparameters, the parameter show_warnings (True/False) can be specified here. This determines whether scikit-learn warnings, such as convergence warnings, should be printed. By default, show_warnings is True.

model_selection_cv (bool): If any of the hyperparameters under SVC is a list and model_selection_cv is True, a grid search will be done over the given hyperparameters, using the number of folds specified in model_selection_n_folds. By default, model_selection_cv is False.

model_selection_n_folds (int): The number of folds that should be used for the cross validation grid search if model_selection_cv is True.

YAML specification:

my_svc: # user-defined method name
    SVC: # name of the ML method
        # sklearn parameters (same names as in original sklearn class)
        C: [0.01, 0.1, 1, 10, 100] # find the optimal value for C
        # Additional parameter that determines whether to print convergence warnings
        show_warnings: True
    # if any of the parameters under SVC is a list and model_selection_cv is True,
    # a grid search will be done over the given parameters, using the number of folds specified in model_selection_n_folds,
    # and the optimal model will be selected
    model_selection_cv: True
    model_selection_n_folds: 5
# alternative way to define ML method with default values:
my_default_svc: SVC
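
To make mode 2 more concrete, the sketch below shows how a mix of list-valued and single-valued hyperparameters could expand into grid-search candidates: lists are searched, single values stay fixed. The helper expand_grid is hypothetical and only illustrates the idea; it is not how immuneML or scikit-learn implement the search. The fixed penalty value is an illustrative choice.

```python
from itertools import product


def expand_grid(params: dict) -> list:
    """Return one dict per hyperparameter combination."""
    names = list(params)
    # wrap single values in a list so that they contribute exactly one option
    options = [v if isinstance(v, list) else [v] for v in params.values()]
    return [dict(zip(names, combo)) for combo in product(*options)]


# C is searched over five values, penalty stays fixed
candidates = expand_grid({"C": [0.01, 0.1, 1, 10, 100], "penalty": "l2"})
```

Here the grid has five candidates, one per value of C, and each of them would be evaluated with model_selection_n_folds folds.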

SVM

This is a wrapper of scikit-learn’s SVC class. Please see the scikit-learn documentation of SVC for the parameters.

Note: if you are interested in plotting the coefficients of the SVM model, consider running the Coefficients report.

Scikit-learn models can be trained in two modes:

1. Creating a model using a given set of hyperparameters, and relying on the selection and assessment loop in the TrainMLModel instruction to select the optimal model.

2. Passing a range of different hyperparameters to SVM, and using a third layer of nested cross-validation to find the optimal hyperparameters through grid search. In this case, only the SVM model with the optimal hyperparameter settings is further used in the inner selection loop of the TrainMLModel instruction.

By default, mode 1 is used. In order to use mode 2, model_selection_cv and model_selection_n_folds must be set.

Arguments:

SVM (dict): Under this key, hyperparameters can be specified that will be passed to the scikit-learn class. Any scikit-learn hyperparameters can be specified here. In mode 1, a single value must be specified for each of the scikit-learn hyperparameters. In mode 2, it is possible to specify a range of different hyperparameter values in a list. It is also allowed to mix lists and single values in mode 2, in which case the grid search will only be done for the lists, while the single-value hyperparameters will be fixed. In addition to the scikit-learn hyperparameters, the parameter show_warnings (True/False) can be specified here. This determines whether scikit-learn warnings, such as convergence warnings, should be printed. By default, show_warnings is True.

model_selection_cv (bool): If any of the hyperparameters under SVM is a list and model_selection_cv is True, a grid search will be done over the given hyperparameters, using the number of folds specified in model_selection_n_folds. By default, model_selection_cv is False.

model_selection_n_folds (int): The number of folds that should be used for the cross validation grid search if model_selection_cv is True.

YAML specification:

my_svm: # user-defined method name
    SVM: # name of the ML method
        # sklearn parameters (same names as in original sklearn class)
        C: [0.01, 0.1, 1, 10, 100] # find the optimal value for C
        kernel: linear
        # Additional parameter that determines whether to print convergence warnings
        show_warnings: True
    # if any of the parameters under SVM is a list and model_selection_cv is True,
    # a grid search will be done over the given parameters, using the number of folds specified in model_selection_n_folds,
    # and the optimal model will be selected
    model_selection_cv: True
    model_selection_n_folds: 5
# alternative way to define ML method with default values:
my_default_svm: SVM

TCRdistClassifier

Implementation of a nearest neighbors classifier based on TCR distances as presented in Dash P, Fiore-Gartland AJ, Hertz T, et al. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature. 2017; 547(7661):89-93. doi:10.1038/nature22383.

This method is implemented using scikit-learn’s KNeighborsClassifier with k determined at runtime from the training dataset size and weights linearly scaled to decrease with the distance of examples.

Arguments:

percentage (float): percentage of nearest neighbors to consider when determining receptor specificity based on known receptors (between 0 and 1)

show_warnings (bool): whether to show warnings generated by scikit-learn, by default this is True.

YAML specification:

my_tcr_method:
    TCRdistClassifier:
        percentage: 0.1
        show_warnings: True
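
The exact rule used to derive k from the training dataset size is not spelled out here. The sketch below shows one plausible reading, assuming that k is the given percentage of the training examples, rounded down, with at least one neighbor. Both the helper neighbors_from_percentage and this rounding rule are assumptions made for illustration.

```python
def neighbors_from_percentage(percentage: float, training_size: int) -> int:
    """Hypothetical rule: a fraction of the training set, but at least one neighbor."""
    if not 0 <= percentage <= 1:
        raise ValueError("percentage must be between 0 and 1")
    return max(1, int(percentage * training_size))


# with percentage: 0.1 and 250 training receptors
k = neighbors_from_percentage(percentage=0.1, training_size=250)
```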

Preprocessings

ChainRepertoireFilter

Removes all repertoires from the RepertoireDataset object that contain at least one sequence with a chain different from the one given by the keep_chain parameter. Note that this filter removes repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets.

Since the filter removes repertoires from the dataset (examples in machine learning setting), it cannot be used with TrainMLModel instruction. If you want to filter out repertoires including a given chain, see DatasetExport instruction with preprocessing.

Arguments:

keep_chain (SequenceType): Which chain should be kept.

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            ChainRepertoireFilter:
                keep_chain: TRB

ClonesPerRepertoireFilter

Removes all repertoires from the RepertoireDataset, which contain fewer clonotypes than specified by the lower_limit, or more clonotypes than specified by the upper_limit. Note that this filter filters out repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets.

Since the filter removes repertoires from the dataset (examples in machine learning setting), it cannot be used with TrainMLModel instruction. If you want to use this filter, see DatasetExport instruction with preprocessing.

Arguments:

lower_limit (int): The minimal inclusive lower limit for the number of clonotypes allowed in a repertoire.

upper_limit (int): The maximal inclusive upper limit for the number of clonotypes allowed in a repertoire.

When no lower or upper limit is specified, or the value -1 is specified, the limit is ignored.

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            ClonesPerRepertoireFilter:
                lower_limit: 100
                upper_limit: 100000
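
The limit semantics (both limits inclusive, -1 meaning "no limit") can be sketched as follows. The function keep_repertoire and the repertoire names are hypothetical and only illustrate the rule stated above.

```python
def keep_repertoire(clone_count: int, lower_limit: int = -1, upper_limit: int = -1) -> bool:
    """Both limits are inclusive; a limit of -1 is ignored."""
    if lower_limit != -1 and clone_count < lower_limit:
        return False
    if upper_limit != -1 and clone_count > upper_limit:
        return False
    return True


# number of clonotypes per repertoire (made-up values)
clone_counts = {"rep1": 99, "rep2": 100, "rep3": 100000, "rep4": 100001}
kept = [name for name, n in clone_counts.items()
        if keep_repertoire(n, lower_limit=100, upper_limit=100000)]
```

Repertoires with exactly 100 or exactly 100000 clonotypes are kept, since the limits are inclusive.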

CountPerSequenceFilter

Removes all sequences from a Repertoire that have a count below low_count_limit, as well as sequences with no count value if remove_without_count is True. This filter can be applied to Repertoires and RepertoireDatasets.

Arguments:

low_count_limit (int): The inclusive minimal count value in order to retain a given sequence.

remove_without_count (bool): Whether the sequences without a reported count value should be removed.

remove_empty_repertoires (bool): Whether repertoires without sequences should be removed. Only has an effect when remove_without_count is also set to True. If this is true, this preprocessing cannot be used with TrainMLModel instruction, but only with DatasetExport instruction instead.

batch_size (int): number of repertoires that can be loaded at the same time (only affects the speed when applying this filter on a RepertoireDataset)

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            CountPerSequenceFilter:
                remove_without_count: True
                remove_empty_repertoires: True
                low_count_limit: 3
                batch_size: 4
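
A minimal sketch of the filtering rule, assuming sequences are represented as dictionaries and a missing count is represented as None (both are assumptions made for this example, not the immuneML data model):

```python
def filter_by_count(sequences, low_count_limit, remove_without_count):
    """Keep sequences whose count is at least low_count_limit (inclusive)."""
    kept = []
    for sequence in sequences:
        count = sequence.get("count")  # None models a missing count value
        if count is None:
            if not remove_without_count:
                kept.append(sequence)
        elif count >= low_count_limit:
            kept.append(sequence)
    return kept


repertoire = [
    {"sequence_aa": "CASSL", "count": 1},
    {"sequence_aa": "CASSP", "count": 3},
    {"sequence_aa": "CASSQ", "count": None},
]
kept = filter_by_count(repertoire, low_count_limit=3, remove_without_count=True)
```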

DuplicateSequenceFilter

Collapses duplicate nucleotide or amino acid sequences within each repertoire in the given RepertoireDataset. This filter can be applied to Repertoires and RepertoireDatasets.

Sequences are considered duplicates if the following fields are identical:

  • amino acid or nucleotide sequence (whichever is specified)

  • v and j genes (note that the full field including subgroup + gene is used for matching, i.e. V1 and V1-1 are not considered duplicates)

  • chain

  • region type

For all other fields (the non-specified sequence type, custom lists, sequence identifier) only the first occurring value is kept.

Note that this means the count value of a sequence with a given sequence identifier might not be the same as before removing duplicates, unless count_agg = FIRST is used.

Arguments:

filter_sequence_type (SequenceType): Whether the sequences should be collapsed on the nucleotide or amino acid level. Valid values are: [‘AMINO_ACID’, ‘NUCLEOTIDE’].

batch_size (int): number of repertoires that can be loaded at the same time (only affects the speed)

count_agg (CountAggregationFunction): determines how the sequence counts of duplicate sequences are aggregated. Valid values are: [‘SUM’, ‘MAX’, ‘MIN’, ‘MEAN’, ‘FIRST’, ‘LAST’].

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            DuplicateSequenceFilter:
                # required parameters:
                filter_sequence_type: AMINO_ACID
                # optional parameters (if not specified, the values below will be used):
                batch_size: 4
                count_agg: SUM
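
The collapsing rule can be sketched as follows: rows are grouped on the matching fields, counts are aggregated with count_agg, and all other fields take the first occurring value. The field names (sequence_aa, v_gene, j_gene, chain, region_type, sequence_id) and the helper collapse_duplicates are assumptions made for this example, not the immuneML implementation.

```python
from statistics import mean

AGGREGATORS = {
    "SUM": sum, "MAX": max, "MIN": min, "MEAN": mean,
    "FIRST": lambda counts: counts[0],
    "LAST": lambda counts: counts[-1],
}


def collapse_duplicates(rows, sequence_field="sequence_aa", count_agg="SUM"):
    """Collapse rows that agree on sequence, V/J genes, chain and region type."""
    groups = {}
    for row in rows:
        key = (row[sequence_field], row["v_gene"], row["j_gene"],
               row["chain"], row["region_type"])
        groups.setdefault(key, []).append(row)
    collapsed = []
    for duplicates in groups.values():
        merged = dict(duplicates[0])  # other fields: first occurring value is kept
        merged["count"] = AGGREGATORS[count_agg]([d["count"] for d in duplicates])
        collapsed.append(merged)
    return collapsed


def make_row(sequence_id, v_gene, count):
    return {"sequence_id": sequence_id, "sequence_aa": "CASSL", "v_gene": v_gene,
            "j_gene": "J1", "chain": "TRB", "region_type": "IMGT_CDR3", "count": count}


rows = [make_row("a", "V1", 2), make_row("b", "V1", 3), make_row("c", "V1-1", 4)]
collapsed = collapse_duplicates(rows, count_agg="SUM")
```

Rows "a" and "b" are merged, while "c" stays separate because V1 and V1-1 are not considered the same gene. With count_agg set to SUM, the identifier "a" ends up with a count of 5 rather than its original 2, which is the effect described in the note above.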

MetadataRepertoireFilter

Removes repertoires from a RepertoireDataset based on information stored in the metadata_file. Note that this filters out repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets.

Since this filter changes the number of repertoires (examples for the machine learning task), it cannot be used with TrainMLModel instruction. To filter out repertoires, use preprocessing from the DatasetExport instruction that will create a new dataset ready to be used for training machine learning models.

Arguments:

criteria (dict): a nested dictionary that specifies the criteria, based on metadata columns, that a repertoire must meet in order to be kept. See CriteriaMatcher for a more detailed explanation.

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            # Example filter that keeps repertoires with values greater than 1 in the "my_column_name" column of the metadata_file
            MetadataRepertoireFilter:
                criteria:
                    type: GREATER_THAN
                    value:
                        type: COLUMN
                        name: my_column_name
                    threshold: 1
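
How such a criterion might be evaluated against the metadata can be sketched as follows. This is only an illustration of the GREATER_THAN example; the function repertoire_matches is hypothetical, covers this single case, and does not reproduce CriteriaMatcher.

```python
def repertoire_matches(criteria: dict, metadata_row: dict) -> bool:
    """Evaluate a GREATER_THAN criterion on one metadata column."""
    if criteria["type"] != "GREATER_THAN" or criteria["value"]["type"] != "COLUMN":
        raise NotImplementedError("this sketch only covers GREATER_THAN on a COLUMN")
    column = criteria["value"]["name"]
    return float(metadata_row[column]) > criteria["threshold"]


criteria = {
    "type": "GREATER_THAN",
    "value": {"type": "COLUMN", "name": "my_column_name"},
    "threshold": 1,
}

# rows as they might be read from a metadata file (made-up values)
metadata = [
    {"subject_id": "s1", "my_column_name": "0.5"},
    {"subject_id": "s2", "my_column_name": "1"},
    {"subject_id": "s3", "my_column_name": "2.5"},
]
kept = [row["subject_id"] for row in metadata if repertoire_matches(criteria, row)]
```

Only s3 is kept, because the comparison is strict: a value equal to the threshold does not pass.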

ReferenceSequenceAnnotator

Annotates each sequence in each repertoire if it matches any of the reference sequences provided as an input parameter. This preprocessing uses CompAIRR internally. To match CDR3 sequences (and not JUNCTION), CompAIRR v1.10 or later is needed.

Arguments:

reference_sequences (dict): A dictionary describing the reference dataset file. Import should be specified the same way as regular dataset import. It is only allowed to import a receptor dataset here (i.e., is_repertoire is False and paired is True by default, and these are not allowed to be changed).

max_edit_distance (int): The maximum edit distance between a target sequence (from the repertoire) and the reference sequence.

compairr_path (str): optional path to the CompAIRR executable. If not given, it is assumed that CompAIRR has been installed such that it can be called directly on the command line with the command ‘compairr’, or that it is located at /usr/local/bin/compairr.

threads (int): how many threads to be used by CompAIRR for sequence matching

ignore_genes (bool): Whether to ignore V and J gene information. If False, the V and J genes between two receptor chains have to match. If True, gene information is ignored. By default, ignore_genes is False.

output_column_name (str): in case there are multiple annotations, it is possible here to define the name of the column in the output repertoire files for this specific annotation

repertoire_batch_size (int): how many repertoires to process simultaneously; depending on the repertoire size, this parameter might be used to limit the memory usage

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - step1:
            ReferenceSequenceAnnotator:
                reference_sequences:
                    format: VDJDB
                    params:
                        path: path/to/file.csv
                compairr_path: optional/path/to/compairr
                ignore_genes: False
                max_edit_distance: 0
                output_column_name: matched
                threads: 4
                repertoire_batch_size: 5

SequenceLengthFilter

Removes sequences with length out of the predefined range.

Arguments:

sequence_type (SequenceType): Whether the sequences should be filtered on the nucleotide or amino acid level. Valid options are defined by the SequenceType enum.

min_len (int): minimum length of the sequence (sequences shorter than min_len will be removed); to not use min_len, set it to -1

max_len (int): maximum length of the sequence (sequences longer than max_len will be removed); to not use max_len, set it to -1

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            SequenceLengthFilter:
                sequence_type: AMINO_ACID
                min_len: 3 # -> remove all sequences shorter than 3
                max_len: -1 # -> no upper bound on the sequence length

SubjectRepertoireCollector

Merges all the Repertoires in a RepertoireDataset that have the same ‘subject_id’ specified in the metadata. The result is a RepertoireDataset with one Repertoire per subject. This preprocessing cannot be used in combination with TrainMLModel instruction because it can change the number of examples. To combine the repertoires in this way, use this preprocessing with DatasetExport instruction.

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter: SubjectRepertoireCollector
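
The merging step amounts to grouping repertoires by subject_id. A minimal sketch, assuming each repertoire is a dictionary with a subject_id and a list of sequences (a simplification made for this example; collect_by_subject is a hypothetical helper):

```python
from collections import defaultdict


def collect_by_subject(repertoires):
    """Merge the sequences of all repertoires that share a subject_id."""
    merged = defaultdict(list)
    for repertoire in repertoires:
        merged[repertoire["subject_id"]].extend(repertoire["sequences"])
    return dict(merged)


repertoires = [
    {"subject_id": "s1", "sequences": ["CASSL", "CASSP"]},
    {"subject_id": "s2", "sequences": ["CASSQ"]},
    {"subject_id": "s1", "sequences": ["CASSE"]},
]
per_subject = collect_by_subject(repertoires)
```

Three input repertoires become two, which is why the number of examples can change and the preprocessing is not compatible with the TrainMLModel instruction.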

Instructions

DatasetExport

DatasetExport instruction takes a list of datasets as input, optionally applies preprocessing steps, and outputs the data in specified formats.

Arguments:

datasets (list): a list of datasets to export in all given formats

preprocessing_sequence (str): which preprocessing sequence to use on the dataset(s), this item is optional and does not have to be specified. When specified, the same preprocessing sequence will be applied to all datasets.

export_formats (list): a list of formats in which to export the datasets. Valid values are: AIRR, ImmuneML.

number_of_processes (int): how many processes to use during repertoire export (not used for sequence datasets)

YAML specification:

my_dataset_export_instruction: # user-defined instruction name
    type: DatasetExport # which instruction to execute
    datasets: # list of datasets to export
        - my_generated_dataset
        - my_dataset_from_adaptive
    preprocessing_sequence: my_preprocessing_sequence
    number_of_processes: 4
    export_formats: # list of formats to export the datasets to
        - AIRR
        - ImmuneML

ExploratoryAnalysis

Allows exploratory analysis of different datasets using encodings and reports.

Analysis is defined by a dictionary of ExploratoryAnalysisUnit objects that encapsulate a dataset, an encoding [optional] and a report to be executed on the [encoded] dataset. Each analysis specified under analyses is completely independent from all others.

Arguments:

analyses (dict): a dictionary of analyses to perform. The keys are the names of different analyses, and the values for each of the analyses are:

  • dataset: dataset on which to perform the exploratory analysis

  • preprocessing_sequence: which preprocessings to use on the dataset, this item is optional and does not have to be specified.

  • encoding: how to encode the dataset before running the report, this item is optional and does not have to be specified.

  • labels: if encoding is specified, the relevant labels must be specified here.

  • report: which report to run on the dataset. Reports specified here may be of the category Data reports or Encoding reports, depending on whether ‘encoding’ was specified.

number_of_processes (int): how many processes should be created at once to speed up the analysis. For personal machines, 4 or 8 is usually a good choice.

YAML specification:

my_expl_analysis_instruction: # user-defined instruction name
    type: ExploratoryAnalysis # which instruction to execute
    analyses: # analyses to perform
        my_first_analysis: # user-defined name of the analysis
            dataset: d1 # dataset to use in the first analysis
            preprocessing_sequence: p1 # preprocessing sequence to use in the first analysis
            report: r1 # which report to generate using the dataset d1
        my_second_analysis: # user-defined name of another analysis
            dataset: d1 # dataset to use in the second analysis - can be the same or different from other analyses
            encoding: e1 # encoding to apply on the specified dataset (d1)
            report: r2 # which report to generate in the second analysis
            labels: # labels present in the dataset d1 which will be included in the encoded data on which report r2 will be run
                - celiac # name of the first label as present in the column of dataset's metadata file
                - CMV # name of the second label as present in the column of dataset's metadata file
    number_of_processes: 4 # number of parallel processes to create (could speed up the computation)

MLApplication

Instruction which enables using trained ML models and encoders on new datasets which do not necessarily have labeled data. When the same label is provided as the ML setting was trained for, performance metrics can be computed.

The predictions are stored in predictions.csv in the result path in the following format:

example_id    cmv_predicted_class    cmv_1_proba    cmv_0_proba
e1            1                      0.8            0.2
e2            0                      0.2            0.8
e3            1                      0.78           0.22

If the same label that the ML setting was trained for is present in the provided dataset, the ‘true’ label value will additionally be added to the predictions table.

Arguments:

dataset: dataset for which examples need to be classified

config_path: path to the zip file exported from the TrainMLModel instruction (which includes the trained ML model, encoder, preprocessing etc.)

number_of_processes (int): how many processes should be created at once to speed up the analysis. For personal machines, 4 or 8 is usually a good choice.

metrics (list): a list of metrics (accuracy, balanced_accuracy, confusion_matrix, f1_micro, f1_macro, f1_weighted, precision, recall, auc, log_loss) to compute between the true and predicted classes. These metrics will only be computed when the same label with the same classes is provided for the dataset as the original label the ML setting was trained for.

Specification example for the MLApplication instruction:

instruction_name:
    type: MLApplication
    dataset: d1
    config_path: ./config.zip
    metrics:
    - accuracy
    - precision
    - recall
    number_of_processes: 4
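
Once the instruction has run, the predictions file can be post-processed with standard tools. The sketch below parses a table shaped like the example for the cmv label, assuming the file is comma-separated and using an in-memory string as a stand-in for the real predictions.csv.

```python
import csv
import io

# stand-in for the contents of predictions.csv (assumed to be comma-separated)
predictions_csv = """example_id,cmv_predicted_class,cmv_1_proba,cmv_0_proba
e1,1,0.8,0.2
e2,0,0.2,0.8
e3,1,0.78,0.22
"""

rows = list(csv.DictReader(io.StringIO(predictions_csv)))

# examples predicted as positive for the cmv label
positive_ids = [row["example_id"] for row in rows if row["cmv_predicted_class"] == "1"]

# examples whose positive-class probability is at least 0.8
confident_ids = [row["example_id"] for row in rows if float(row["cmv_1_proba"]) >= 0.8]
```

To read an actual file, replace io.StringIO(predictions_csv) with an opened file handle.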

Simulation

A simulation is an instruction that implants synthetic signals into the given dataset according to given parameters. This results in a new dataset that contains the modified sequences and is annotated with metadata labels according to the implanted signals.

Arguments:

dataset: original dataset which will be used as a basis for implanting signals from the simulation

simulation: definition of how to perform the simulation.

export_formats: in which formats to export the dataset after simulation. Valid values are: AIRR, ImmuneML. Important note: Binary files in ImmuneML might not be compatible between different immuneML versions.

YAML specification:

my_simulation_instruction: # user-defined name of the instruction
    type: Simulation # which instruction to execute
    dataset: my_dataset # which dataset to use for implanting the signals
    simulation: my_simulation # how to implant the signals - definition of the simulation
    export_formats: [AIRR] # in which formats to export the dataset

Subsampling

Subsampling is an instruction that subsamples a given dataset and creates multiple smaller datasets according to the parameters provided.

Arguments:

dataset (Dataset): original dataset which will be used as a basis for subsampling

subsampled_dataset_sizes (list): a list of dataset sizes (number of examples) each subsampled dataset should have

dataset_export_formats (list): in which formats to export the subsampled datasets. Valid values are: AIRR, ImmuneML.

YAML specification:

my_subsampling_instruction: # user-defined name of the instruction
    type: Subsampling # which instruction to execute
    dataset: my_dataset # original dataset to be subsampled, with e.g., 300 examples
    subsampled_dataset_sizes: # how large the subsampled datasets should be, one dataset will be created for each list item
        - 200 # one subsampled dataset with 200 examples (200 repertoires if my_dataset was repertoire dataset)
        - 100 # the other subsampled dataset will have 100 examples
    dataset_export_formats: # in which formats to export the subsampled datasets
        - ImmuneML
        - AIRR
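
Conceptually, one subsampled dataset is created per entry of subsampled_dataset_sizes. The sketch below assumes that each subsample is drawn independently and without replacement from the original dataset; this sampling scheme, the seed and the helper subsample are assumptions made for this example.

```python
import random


def subsample(example_ids, subsampled_dataset_sizes, seed=0):
    """Draw one random subset of the original examples per requested size."""
    rng = random.Random(seed)
    return [rng.sample(example_ids, size) for size in subsampled_dataset_sizes]


# an original dataset with 300 examples, subsampled to 200 and 100 examples
example_ids = [f"example_{i}" for i in range(300)]
subsampled = subsample(example_ids, subsampled_dataset_sizes=[200, 100])
```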

TrainMLModel

Class implementing hyperparameter optimization and training and assessing the model through nested cross-validation (CV). The process is defined by two loops:

  • the outer loop over defined splits of the dataset for performance assessment

  • the inner loop over defined hyperparameter space and with cross-validation or train & validation split to choose the best hyperparameters.

The optimal model chosen by the inner loop is then retrained on the whole training dataset in the outer loop.

Note: If you are interested in plotting the performance of all combinations of encodings and ML methods on the test set, consider running the MLSettingsPerformance report as hyperparameter report in the assessment loop.

Arguments:

dataset: dataset to use for training and assessing the classifier

strategy: how to search different hyperparameters; common options include grid search, random search. Valid values are: GridSearch.

settings (list): a list of combinations of preprocessing_sequence, encoding and ml_method. preprocessing_sequence is optional, while encoding and ml_method are mandatory. These three options (and their parameters) can be optimized over, choosing the highest performing combination.

assessment: description of the outer loop (for assessment) of nested cross-validation. It describes how to split the data, how many splits to make, what percentage to use for training and what reports to execute on those splits. See SplitConfig.

selection: description of the inner loop (for selection) of nested cross-validation. The same as assessment argument, just to be executed in the inner loop. See SplitConfig.

metrics (list): a list of metrics (accuracy, balanced_accuracy, confusion_matrix, f1_micro, f1_macro, f1_weighted, precision, recall, auc, log_loss) to compute for all splits and settings created during the nested cross-validation. These metrics will be computed only for reporting purposes. For choosing the optimal setting, optimization_metric will be used.

optimization_metric: a metric to use for optimization (one of accuracy, balanced_accuracy, confusion_matrix, f1_micro, f1_macro, f1_weighted, precision, recall, auc, log_loss) and assessment in the nested cross-validation.

labels (list): a list of labels for which to train the classifiers. The goal of the nested CV is to find the setting which will have best performance in predicting the given label (e.g., if a subject has experienced an immune event or not). Performance and optimal settings will be reported for each label separately. If a label is binary, instead of specifying only its name, one should explicitly set the name of the positive class as well under parameter positive_class. If positive class is not set, one of the label classes will be assumed to be positive.

number_of_processes (int): how many processes should be created at once to speed up the analysis. For personal machines, 4 or 8 is usually a good choice.

reports (list): a list of report names to be executed after the nested CV has finished to show the overall performance or some statistic; the reports that can be provided here are Train ML model reports.

refit_optimal_model (bool): whether the final combination of preprocessing, encoding and ML model should be refitted on the full dataset, thus providing the final model to be exported from the instruction; otherwise, the combination trained in one of the assessment folds will be used

YAML specification:

my_nested_cv_instruction: # user-defined name of the instruction
    type: TrainMLModel # which instruction should be executed
    settings: # a list of combinations of preprocessing, encoding and ml_method to optimize over
        - preprocessing: seq1 # preprocessing is optional
          encoding: e1 # mandatory field
          ml_method: simpleLR # mandatory field
        - preprocessing: seq1 # the second combination
          encoding: e2
          ml_method: simpleLR
    assessment: # outer loop of nested CV
        split_strategy: random # perform Monte Carlo CV (randomly split the data into train and test)
        split_count: 1 # how many train/test datasets to generate
        training_percentage: 0.7 # what percentage of the original data should be used for the training set
        reports: # reports to execute on training/test datasets, encoded datasets and trained ML methods
            data_splits: # list of reports to execute on training/test datasets (before they are encoded)
                - rep1
            encoding: # list of reports to execute on encoded training/test datasets
                - rep2
            models: # list of reports to execute on trained ML methods for each assessment CV split
                - rep3
    selection: # inner loop of nested CV
        split_strategy: k_fold # perform k-fold CV
        split_count: 5 # how many folds to create; together, these two parameters mean: do 5-fold CV
        reports:
            data_splits: # list of reports to execute on training/test datasets (in the inner loop, so these are actually training and validation datasets)
                - rep1
            models: # list of reports to execute on trained ML methods for each selection CV split
                - rep2
            encoding: # list of reports to execute on encoded training/test datasets (again, it is training/validation here)
                - rep3
    labels: # list of labels to optimize the classifier for, as given in the metadata for the dataset
        - celiac:
            positive_class: + # if it's binary classification, positive class parameter should be set
        - T1D # this is not binary label, so no need to specify positive class
    dataset: d1 # which dataset to use for the nested CV
    strategy: GridSearch # how to choose the combinations which to test from settings (GridSearch means test all)
    metrics: # list of metrics to compute for all settings, but these do not influence the choice of optimal model
        - accuracy
        - auc
    reports: # list of reports to execute when nested CV is finished to show overall performance
        - rep4
    number_of_processes: 4 # number of parallel processes to create (could speed up the computation)
    optimization_metric: balanced_accuracy # the metric to use for choosing the optimal model and during training
    refit_optimal_model: False # use trained model, do not refit on the full dataset
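
The two loops can be sketched in a few lines. This is a schematic outline under simplifying assumptions, not the immuneML implementation: fit_and_score is a hypothetical callable that trains one setting on the first set of examples and returns its score on the second, the outer loop uses random splits, and the inner loop uses k-fold CV as in the specification above.

```python
import random
from statistics import mean


def random_split(examples, training_percentage, rng):
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * training_percentage)
    return shuffled[:cut], shuffled[cut:]


def k_fold(examples, k):
    folds = [examples[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield train, validation


def nested_cv(examples, settings, fit_and_score,
              split_count=1, training_percentage=0.7, k=5, seed=0):
    rng = random.Random(seed)
    results = []
    for _ in range(split_count):  # outer (assessment) loop
        train, test = random_split(examples, training_percentage, rng)
        # inner (selection) loop: mean validation score of every setting
        inner = {name: mean([fit_and_score(s, tr, val) for tr, val in k_fold(train, k)])
                 for name, s in settings.items()}
        best = max(inner, key=inner.get)
        # retrain the selected setting on the whole training set, assess on the test set
        results.append((best, fit_and_score(settings[best], train, test)))
    return results


# toy stand-ins: each "setting" is just the score it would achieve
settings = {"e1+simpleLR": 0.6, "e2+simpleLR": 0.8}
results = nested_cv(list(range(100)), settings, lambda setting, train, test: setting)
```

In this toy run the second setting wins the inner loop and is the one assessed on the test set.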

SplitConfig

SplitConfig describes how to split the data for cross-validation. It allows for the following combinations:

  • loocv (leave-one-out cross-validation)

  • k_fold (k-fold cross-validation)

  • stratified_k_fold (stratified k-fold cross-validation that can be used when immuneML is used for single-label classification; see the scikit-learn StratifiedKFold documentation at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html for more details on how this is implemented)

  • random (Monte Carlo cross-validation - randomly splitting the dataset to training and test datasets)

  • manual (train and test dataset are explicitly specified by providing metadata files for the two datasets)

  • leave_one_out_stratification (leave-one-out CV where one refers to a specific parameter, e.g. if subject is known in a receptor dataset, it is possible to have leave-subject-out CV - currently only available for receptor and sequence datasets).

Arguments:

split_strategy: one of the types of cross-validation listed above (LOOCV, K_FOLD, STRATIFIED_K_FOLD, MANUAL, LEAVE_ONE_OUT_STRATIFICATION or RANDOM)

split_count (int): if split_strategy is K_FOLD, this defines how many splits to make (K); if split_strategy is RANDOM, split_count defines how many random splits to make, resulting in split_count training/test dataset pairs; if split_strategy is LOOCV, MANUAL or LEAVE_ONE_OUT_STRATIFICATION, split_count does not need to be specified.

training_percentage: if split_strategy is RANDOM, this defines which portion of the original dataset to use for creating the training dataset; for other values of split_strategy, this parameter is not used.

reports: defines which reports to execute on which datasets or settings. See ReportConfig for more details.

manual_config: if split_strategy is MANUAL, the paths to the metadata files should be given here (fields train_metadata_path and test_metadata_path). For repertoire datasets, examples are matched using the “subject_id” field, so it has to be present in both the original dataset and the metadata files provided here. For receptor and sequence datasets, the “example_id” field needs to be provided in the metadata files and it will be mapped to either ‘sequence_identifiers’ or ‘receptor_identifiers’ in the original dataset. If split_strategy is anything other than MANUAL, this field has no effect and can be omitted.

leave_one_out_config: if split_strategy is LEAVE_ONE_OUT_STRATIFICATION, this config describes which parameter to use for stratification. The result is a list of train/test dataset combinations in which each test set contains only examples with a single value of the specified parameter. leave_one_out_config accepts two inputs: parameter, which is the name of the parameter to use for stratification, and min_count, which defines the minimum number of examples that can be present in the test dataset. This way of generating train and test datasets is only supported for receptor and sequence datasets so far. If split_strategy is anything else, this field has no effect and can be omitted.

YAML specification:

# as a part of a TrainMLModel instruction, defining the outer (assessment) loop of nested cross-validation:
assessment: # outer loop of nested CV
    split_strategy: random # perform Monte Carlo CV (randomly split the data into train and test)
    split_count: 5 # how many train/test datasets to generate
    training_percentage: 0.7 # what percentage of the original data should be used for the training set
    reports: # reports to execute on training/test datasets, encoded datasets and trained ML methods
        data_splits: # list of data reports to execute on training/test datasets (before they are encoded)
            - rep1
        encoding: # list of encoding reports to execute on encoded training/test datasets
            - rep2
        models: # list of ML model reports to execute on the trained classifiers in the assessment loop
            - rep3

# as a part of a TrainMLModel instruction, defining the inner (selection) loop of nested cross-validation:
selection: # inner loop of nested CV
    split_strategy: leave_one_out_stratification
    leave_one_out_config: # perform leave-(subject)-out CV
        parameter: subject # which parameter to use for splitting, must be present in the metadata for each example
        min_count: 1 # what is the minimum number of examples with unique value of the parameter specified above for the analysis to be valid
    reports: # reports to execute on training/test datasets, encoded datasets and trained ML methods
        data_splits: # list of data reports to execute on training/test datasets (before they are encoded)
            - rep1
        encoding: # list of encoding reports to execute on encoded training/test datasets
            - rep2
        models: # list of ML model reports to execute on the trained classifiers in the selection loop
            - rep3
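
The manual and k_fold strategies are not shown in the examples above. A minimal sketch of how they might be specified, based only on the argument descriptions in this section, follows. The metadata file paths are hypothetical placeholders, the value of split_count is arbitrary, and the reports key is left out on the assumption that it is optional.

```yaml
# as a part of a TrainMLModel instruction (illustrative sketch, not a complete instruction)
assessment: # outer loop of nested CV
    split_strategy: manual # train and test datasets are specified explicitly
    manual_config:
        train_metadata_path: path/to/train_metadata.csv # hypothetical path; for repertoire datasets the file needs a subject_id field
        test_metadata_path: path/to/test_metadata.csv # hypothetical path; same requirement as above
selection: # inner loop of nested CV
    split_strategy: k_fold # k-fold cross-validation
    split_count: 5 # K, the number of folds (arbitrary value chosen for illustration)
```

With the manual strategy, split_count and training_percentage are not needed, since the two metadata files fully determine the train and test datasets.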

ReportConfig

A class encapsulating the different report lists which can be executed while performing nested cross-validation (CV) using the TrainMLModel instruction. All arguments are optional.

Arguments:

data: Data reports to be executed on the whole dataset before it is split into training/test or training/validation datasets

data_splits: Data reports to be executed after the data has been split into training and test (assessment CV loop) or training and validation (selection CV loop) datasets before they are encoded

models: ML model reports to be executed on all trained classifiers

encoding: Encoding reports to be executed on each of the encoded training/test datasets or training/validation datasets

YAML specification:

# as a part of a TrainMLModel instruction, defining the outer (assessment) loop of nested cross-validation:
assessment: # outer loop of nested CV
    split_strategy: random # perform Monte Carlo CV (randomly split the data into train and test)
    split_count: 5 # how many train/test datasets to generate
    training_percentage: 0.7 # what percentage of the original data should be used for the training set
    reports: # reports to execute on training/test datasets, encoded datasets and trained ML methods
        data_splits: # list of reports to execute on training/test datasets (before they are preprocessed and encoded)
            - my_data_split_report
        encoding: # list of reports to execute on encoded training/test datasets
            - my_encoding_report

# as a part of a TrainMLModel instruction, defining the inner (selection) loop of nested cross-validation:
selection: # inner loop of nested CV
    split_strategy: random # perform Monte Carlo CV (randomly split the data into train and validation)
    split_count: 5 # how many train/validation datasets to generate
    training_percentage: 0.7 # what percentage of the original data should be used for the training set
    reports: # reports to execute on training/validation datasets, encoded datasets and trained ML methods
        data_splits: # list of reports to execute on training/validation datasets (before they are preprocessed and encoded)
            - my_data_split_report
        encoding: # list of reports to execute on encoded training/validation datasets
            - my_encoding_report
        models:
            - my_ml_model_report
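
The examples above do not use the data argument. Assuming it is listed under reports in the same way as data_splits, encoding and models, a sketch could look as follows. The report name my_data_report is hypothetical and would have to be defined under definitions like the other reports.

```yaml
# as a part of a TrainMLModel instruction (illustrative sketch, assuming data sits alongside the other report lists)
assessment: # outer loop of nested CV
    split_strategy: random
    split_count: 5
    training_percentage: 0.7
    reports:
        data: # list of reports to execute on the whole dataset before it is split
            - my_data_report # hypothetical report name
        models: # list of reports to execute on all trained classifiers
            - my_ml_model_report
```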

Output

HTML

Outputs the results of the analysis as HTML. This is currently the only supported output format.

YAML specification:

definitions:
    ...
instructions:
    ...
output: # the output format
    format: HTML