Definitions#

The different components used inside an immuneML analysis are called definitions. These analysis components are used inside instructions to perform an analysis.

This page documents all possible definitions and their parameters in detail. For general usage examples please check out the Tutorials.


Datasets#

Under the definitions/datasets component, the user can specify how to import a dataset from files. The file format determines which importer should be used, as listed below. See also: How to import data into immuneML.

For testing purposes, it is also possible to generate a random dataset instead of importing from files, using RandomReceptorDataset, RandomSequenceDataset or RandomRepertoireDataset import types. See also: How to generate a dataset with random sequences.

AIRR#

Imports data in AIRR format into a Repertoire-, Sequence- or ReceptorDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets or ReceptorDatasets should be used when predicting values for unpaired (single-chain) and paired immune receptors respectively, like antigen specificity.

The AIRR .tsv format is explained here: https://docs.airr-community.org/en/stable/datarep/format.html

The AIRR rearrangement schema can be found here: https://docs.airr-community.org/en/stable/datarep/rearrangements.html

When importing a ReceptorDataset, the AIRR field cell_id is used to determine the chain pairs.

Specification arguments:

  • path (str): For RepertoireDatasets, this is the path to a directory with AIRR files to import. For Sequence- or ReceptorDatasets, this may be either the path to the file to import or the path to a folder containing one or more files with .tsv, .csv or .txt extensions. By default, path is set to the current working directory.

  • is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset or ReceptorDataset. By default, is_repertoire is set to True.

  • metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. Only the AIRR files included under the column ‘filename’ are imported into the RepertoireDataset. For setting Sequence- or ReceptorDataset metadata, metadata_file is ignored; see metadata_column_mapping instead.

  • paired (bool): Required for Sequence- or ReceptorDatasets. This parameter determines whether to import a SequenceDataset (paired = False) or a ReceptorDataset (paired = True). In a ReceptorDataset, two sequences with chain types specified by receptor_chains are paired together based on the identifier given in the AIRR column named ‘cell_id’.

  • receptor_chains (str): Required for ReceptorDatasets. Determines which pair of chains to import for each Receptor. Valid values are TRA_TRB, TRG_TRD, IGH_IGL, IGH_IGK. If receptor_chains is not provided, the chain pair is automatically detected (only one chain pair type allowed per repertoire).

  • import_productive (bool): Whether productive sequences (with value ‘T’ in column productive) should be included in the imported sequences. By default, import_productive is True.

  • import_unknown_productivity (bool): Whether sequences with unknown productivity (missing value in column productive) should be included in the imported sequences. By default, import_unknown_productivity is True.

  • import_with_stop_codon (bool): Whether sequences with stop codons (with value ‘T’ in column stop_codon) should be included in the imported sequences. This only applies if column stop_codon is present. By default, import_with_stop_codon is False.

  • import_out_of_frame (bool): Whether out of frame sequences (with value ‘F’ in column vj_in_frame) should be included in the imported sequences. This only applies if column vj_in_frame is present. By default, import_out_of_frame is False.

  • import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.

  • import_empty_nt_sequences (bool): Whether to import sequences that have an empty nucleotide sequence field. By default, import_empty_nt_sequences is set to True.

  • import_empty_aa_sequences (bool): Whether to import sequences that have an empty amino acid sequence field. For analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.

  • region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as AIRR uses the IMGT junction. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.

  • column_mapping (dict): A mapping from AIRR column names to immuneML’s internal data representation. A custom column mapping can be specified here if necessary (for example, adding additional data fields if they are present in the AIRR file, or using alternative column names). Valid immuneML fields that can be specified here are [‘sequence_aa’, ‘sequence’, ‘v_call’, ‘j_call’, ‘chain’, ‘duplicate_count’, ‘frame_type’, ‘sequence_id’, ‘cell_id’]. For AIRR, this is by default set to:

    junction: sequence
    junction_aa: sequence_aa
    locus: chain
    
  • column_mapping_synonyms (dict): This is a column mapping that can be used if a column could have alternative names. The formatting is the same as column_mapping. If some columns specified in column_mapping are not found in the file, the columns specified in column_mapping_synonyms are instead attempted to be loaded. For AIRR format, there is no default column_mapping_synonyms.

  • metadata_column_mapping (dict): Specifies metadata for Sequence- and ReceptorDatasets. This should specify a mapping similar to column_mapping where keys are AIRR column names and values are the names that are internally used in immuneML as metadata fields. These metadata fields can be used as prediction labels for Sequence- and ReceptorDatasets. This parameter can also be used to specify sequence-level metadata columns for RepertoireDatasets, which can be used by reports. To set prediction label metadata for RepertoireDatasets, see metadata_file instead. For AIRR format, there is no default metadata_column_mapping.

  • separator (str): Column separator. For AIRR, this is by default “\t”.

YAML specification:

definitions:
    datasets:
        my_airr_dataset:
            format: AIRR
            params:
                path: path/to/files/
                is_repertoire: True # whether to import a RepertoireDataset
                metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
                metadata_column_mapping: # metadata column mapping AIRR: immuneML for Sequence- or ReceptorDataset
                    airr_column_name1: metadata_label1
                    airr_column_name2: metadata_label2
                import_productive: True # whether to include productive sequences in the dataset
                import_with_stop_codon: False # whether to include sequences with stop codon in the dataset
                import_out_of_frame: False # whether to include out of frame sequences in the dataset
                import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
                import_empty_nt_sequences: True # keep sequences even if the `sequence` column is empty (provided that other fields are as specified here)
                import_empty_aa_sequences: False # remove all sequences with empty `sequence_aa` column
                # Optional fields with AIRR-specific defaults, only change when different behavior is required:
                separator: "\t" # column separator
                region_type: IMGT_CDR3 # what part of the sequence to import
                column_mapping: # column mapping AIRR: immuneML
                    junction: sequence
                    junction_aa: sequence_aa
                    locus: chain
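
The specification above imports a RepertoireDataset. As a minimal sketch of the paired case, the variant below imports a ReceptorDataset instead, using only the parameters documented in this section. The dataset name, the file path and the column name antigen are placeholders chosen for illustration, not immuneML defaults.

```yaml
definitions:
    datasets:
        my_airr_receptor_dataset: # hypothetical dataset name
            format: AIRR
            params:
                path: path/to/receptors.tsv # placeholder: a single file, or a folder with .tsv, .csv or .txt files
                is_repertoire: False # import a Sequence- or ReceptorDataset instead of a RepertoireDataset
                paired: True # True imports a ReceptorDataset; chains are paired through the AIRR cell_id column
                receptor_chains: TRA_TRB # one of TRA_TRB, TRG_TRD, IGH_IGL, IGH_IGK
                metadata_column_mapping: # AIRR column name: immuneML metadata field (usable as a prediction label)
                    antigen: antigen # assumed column name, replace with a column present in the file
```

No metadata_file is given here, since it only applies to RepertoireDatasets; labels for receptors come from metadata_column_mapping.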

Generic#

Imports data from any tabular file into a Repertoire-, Sequence- or ReceptorDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets or ReceptorDatasets should be used when predicting values for unpaired (single-chain) and paired immune receptors respectively, like antigen specificity.

This importer works similarly to other importers, but has no predefined default values for which fields are imported, and can therefore be tailored to import data from various different tabular files with headers.

For ReceptorDatasets: this importer assumes the two receptor sequences appear on different lines in the file, and can be paired together by a common sequence identifier. If you instead want to import a ReceptorDataset from a tabular file that contains both receptor chains on one line, see SingleLineReceptor import.

Specification arguments:

  • path (str): For RepertoireDatasets, this is the path to a directory with files to import. For Sequence- or ReceptorDatasets, this may be either the path to the file to import or the path to a folder containing one or more files with .tsv, .csv or .txt extensions. By default, path is set to the current working directory.

  • is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset or ReceptorDataset. By default, is_repertoire is set to True.

  • metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. For setting Sequence- or ReceptorDataset metadata, metadata_file is ignored, see metadata_column_mapping instead.

  • paired (bool): Required for Sequence- or ReceptorDatasets. This parameter determines whether to import a SequenceDataset (paired = False) or a ReceptorDataset (paired = True). In a ReceptorDataset, two sequences with chain types specified by receptor_chains are paired together based on a common identifier. This identifier should be mapped to the immuneML field ‘sequence_identifiers’ using the column_mapping.

  • receptor_chains (str): Required for ReceptorDatasets. Determines which pair of chains to import for each Receptor. Valid values are TRA_TRB, TRG_TRD, IGH_IGL, IGH_IGK.

  • import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.

  • import_empty_nt_sequences (bool): Whether to import sequences that have an empty nucleotide sequence field. By default, import_empty_nt_sequences is set to True.

  • import_empty_aa_sequences (bool): Whether to import sequences that have an empty amino acid sequence field. For analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.

  • region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means immuneML assumes the IMGT junction (including leading C and trailing Y/F amino acids) is used in the input file, and the first and last amino acids will be removed from the sequences to retrieve the IMGT CDR3 sequence. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.

  • column_mapping (dict): Required for all datasets. A mapping where the keys are the column names in the input file, and the values correspond to the names used in immuneML’s internal data representation. Valid immuneML fields that can be specified here are [‘sequence_aa’, ‘sequence’, ‘v_call’, ‘j_call’, ‘chain’, ‘duplicate_count’, ‘frame_type’, ‘sequence_id’, ‘cell_id’]. At least sequence (nucleotide) or sequence_aa (amino acids) must be specified; all other fields are optional. A column mapping can look, for example, like this:

    file_column_amino_acids: sequence_aa
    file_column_v_genes: v_call
    file_column_j_genes: j_call
    file_column_frequencies: duplicate_count
    
  • column_mapping_synonyms (dict): This is a column mapping that can be used if a column could have alternative names. The formatting is the same as column_mapping. If some columns specified in column_mapping are not found in the file, the columns specified in column_mapping_synonyms are instead attempted to be loaded. For Generic import, there is no default column_mapping_synonyms.

  • metadata_column_mapping (dict): Specifies metadata for Sequence- and ReceptorDatasets. This should specify a mapping similar to column_mapping where keys are file column names and values are the names that are internally used in immuneML as metadata fields. These metadata fields can be used as prediction labels for Sequence- and ReceptorDatasets. This parameter can also be used to specify sequence-level metadata columns for RepertoireDatasets, which can be used by reports. To set prediction label metadata for RepertoireDatasets, see metadata_file instead. There is no default metadata_column_mapping. A metadata column mapping can look, for example, like this:

    file_column_antigen_specificity: antigen_specificity
    
  • columns_to_load (list): Optional; specifies which columns to load from the input file. This may be useful if the input files contain many unused columns. If no value is specified, all columns are loaded.

  • separator (str): Required parameter. Column separator, for example “\t” or “,”. The default value is “\t”.

YAML specification:

definitions:
    datasets:
        my_generic_dataset:
            format: Generic
            params:
                path: path/to/files/
                is_repertoire: True # whether to import a RepertoireDataset
                metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
                paired: False # whether to import SequenceDataset (False) or ReceptorDataset (True) when is_repertoire = False
                receptor_chains: TRA_TRB # what chain pair to import for a ReceptorDataset
                separator: "\t" # column separator
                import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
                import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
                import_empty_aa_sequences: False # filter out sequences if they don't have sequence_aa set
                region_type: IMGT_CDR3 # what part of the sequence to import
                column_mapping: # column mapping file: immuneML
                    file_column_amino_acids: sequence_aa
                    file_column_v_genes: v_call
                    file_column_j_genes: j_call
                    file_column_frequencies: duplicate_count
                metadata_column_mapping: # metadata column mapping file: immuneML
                    file_column_antigen_specificity: antigen_specificity
                columns_to_load:  # which subset of columns to load from the file
                    - file_column_amino_acids
                    - file_column_v_genes
                    - file_column_j_genes
                    - file_column_frequencies
                    - file_column_antigen_specificity
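
As a sketch of how column_mapping_synonyms might be combined with column_mapping, the fragment below imports a SequenceDataset from files whose amino acid column may appear under one of two names. The dataset name and all file_column_* and alt_column_* names are hypothetical placeholders, and the comma separator is an assumption about the input files.

```yaml
definitions:
    datasets:
        my_generic_sequence_dataset: # hypothetical dataset name
            format: Generic
            params:
                path: path/to/files/
                is_repertoire: False # import a Sequence- or ReceptorDataset
                paired: False # False imports a SequenceDataset
                separator: "," # assumed: the input files are comma-separated
                column_mapping: # file column name: immuneML field
                    file_column_amino_acids: sequence_aa
                    file_column_v_genes: v_call
                column_mapping_synonyms: # tried for columns from column_mapping that are not found in the file
                    alt_column_amino_acids: sequence_aa
                metadata_column_mapping: # file column name: immuneML metadata field (usable as a prediction label)
                    file_column_antigen_specificity: antigen_specificity
```

Since is_repertoire is False, no metadata_file is needed; the label antigen_specificity is taken from the mapped file column.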

IGoR#

Imports data generated by IGoR simulations into a Repertoire-, or SequenceDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets should be used when predicting values for unpaired (single-chain) immune receptors, like antigen specificity.

Note that IGoR should be run with the --CDR3 option, as this importer reads the generated CDR3 files. Sequences with missing anchors are not imported, meaning only sequences with value ‘1’ in the anchors_found column are imported. Nucleotide sequences are automatically translated to amino acid sequences.

Reference: Quentin Marcou, Thierry Mora, Aleksandra M. Walczak ‘High-throughput immune repertoire analysis with IGoR’. Nature Communications, (2018) doi.org/10.1038/s41467-018-02832-w.

Specification arguments:

  • path (str): For RepertoireDatasets, this is the path to a directory with IGoR files to import. For SequenceDatasets, this may be either the path to the file to import or the path to a folder containing one or more files with .tsv, .csv or .txt extensions. By default, path is set to the current working directory.

  • is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset. By default, is_repertoire is set to True.

  • metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. Only the IGoR files included under the column ‘filename’ are imported into the RepertoireDataset. For setting SequenceDataset metadata, metadata_file is ignored, see metadata_column_mapping instead.

  • import_with_stop_codon (bool): Whether sequences with stop codons should be included in the imported sequences. By default, import_with_stop_codon is False.

  • import_out_of_frame (bool): Whether out of frame sequences (with value ‘0’ in column is_inframe) should be included in the imported sequences. By default, import_out_of_frame is False.

  • import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.

  • import_empty_nt_sequences (bool): Whether to import sequences that have an empty nucleotide sequence field. By default, import_empty_nt_sequences is set to True.

  • region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as IGoR uses the IMGT junction. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.

  • column_mapping (dict): A mapping from IGoR column names to immuneML’s internal data representation. A custom column mapping can be specified here if necessary (for example, adding additional data fields if they are present in the IGoR file, or using alternative column names). Valid immuneML fields that can be specified here are [‘sequence_aa’, ‘sequence’, ‘v_call’, ‘j_call’, ‘chain’, ‘duplicate_count’, ‘frame_type’, ‘sequence_id’, ‘cell_id’]. For IGoR, this is by default set to:

    nt_CDR3: sequences
    seq_index: sequence_identifiers
    
  • column_mapping_synonyms (dict): This is a column mapping that can be used if a column could have alternative names. The formatting is the same as column_mapping. If some columns specified in column_mapping are not found in the file, the columns specified in column_mapping_synonyms are instead attempted to be loaded. For IGoR format, there is no default column_mapping_synonyms.

  • metadata_column_mapping (dict): Specifies metadata for Sequence- and ReceptorDatasets. This should specify a mapping similar to column_mapping where keys are IGoR column names and values are the names that are internally used in immuneML as metadata fields. These metadata fields can be used as prediction labels for Sequence- and ReceptorDatasets. This parameter can also be used to specify sequence-level metadata columns for RepertoireDatasets, which can be used by reports. To set prediction label metadata for RepertoireDatasets, see metadata_file instead. For IGoR format, there is no default metadata_column_mapping.

  • separator (str): Column separator, for IGoR this is by default “,”.

YAML specification:

definitions:
    datasets:
        my_igor_dataset:
            format: IGoR
            params:
                path: path/to/files/
                is_repertoire: True # whether to import a RepertoireDataset (True) or a SequenceDataset (False)
                metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
                metadata_column_mapping: # metadata column mapping IGoR: immuneML for SequenceDataset
                    igor_column_name1: metadata_label1
                    igor_column_name2: metadata_label2
                import_with_stop_codon: False # whether to include sequences with stop codon in the dataset
                import_out_of_frame: False # whether to include out of frame sequences in the dataset
                import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
                import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
                # Optional fields with IGoR-specific defaults, only change when different behavior is required:
                separator: "," # column separator
                region_type: IMGT_CDR3 # what part of the sequence to import
                column_mapping: # column mapping IGoR: immuneML
                    nt_CDR3: sequences
                    seq_index: sequence_identifiers
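
The specification above imports a RepertoireDataset with the default filters. As a minimal sketch, the variant below imports a SequenceDataset from a single IGoR file and keeps out-of-frame sequences and sequences with stop codons, which might be useful when inspecting everything IGoR generated. The dataset name and file path are placeholders, not immuneML defaults.

```yaml
definitions:
    datasets:
        my_igor_sequence_dataset: # hypothetical dataset name
            format: IGoR
            params:
                path: path/to/igor_cdr3_file.csv # placeholder path to a CDR3 file generated by IGoR
                is_repertoire: False # import a SequenceDataset; metadata_file is then ignored
                import_out_of_frame: True # keep sequences with value 0 in the is_inframe column
                import_with_stop_codon: True # keep sequences that contain stop codons
```

The remaining parameters (separator, region_type, column_mapping) are left out so that the IGoR-specific defaults listed above apply.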

IReceptor#

Imports AIRR datasets retrieved through the iReceptor Gateway into a Repertoire-, Sequence- or ReceptorDataset. The differences between this importer and the AIRR importer are:

  • This importer takes in a list of .zip files, which must contain one or more AIRR tsv files, and for each AIRR file, a corresponding metadata json file must be present.

  • This importer does not require a metadata csv file for RepertoireDataset import; it is generated automatically from the metadata json files.

RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets or ReceptorDatasets should be used when predicting values for unpaired (single-chain) and paired immune receptors respectively, like antigen specificity.

The AIRR rearrangement schema can be found here: https://docs.airr-community.org/en/stable/datarep/rearrangements.html

When importing a ReceptorDataset, the AIRR field cell_id is used to determine the chain pairs.

Specification arguments:

  • path (str): This is the path to a directory with .zip files retrieved from the iReceptor Gateway. These .zip files should include AIRR files (with .tsv extension) and corresponding metadata.json files with matching names (e.g., for my_dataset.tsv the corresponding metadata file is called my_dataset-metadata.json). The zip files must use the .zip extension.

  • is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset or ReceptorDataset. By default, is_repertoire is set to True.

  • paired (bool): Required for Sequence- or ReceptorDatasets. This parameter determines whether to import a SequenceDataset (paired = False) or a ReceptorDataset (paired = True). In a ReceptorDataset, two sequences with chain types specified by receptor_chains are paired together based on the identifier given in the AIRR column named ‘cell_id’.

  • receptor_chains (str): Required for ReceptorDatasets. Determines which pair of chains to import for each Receptor. Valid values are TRA_TRB, TRG_TRD, IGH_IGL, IGH_IGK. If receptor_chains is not provided, the chain pair is automatically detected (only one chain pair type allowed per repertoire).

  • import_productive (bool): Whether productive sequences (with value ‘T’ in column productive) should be included in the imported sequences. By default, import_productive is True.

  • import_with_stop_codon (bool): Whether sequences with stop codons (with value ‘T’ in column stop_codon) should be included in the imported sequences. This only applies if column stop_codon is present. By default, import_with_stop_codon is False.

  • import_out_of_frame (bool): Whether out of frame sequences (with value ‘F’ in column vj_in_frame) should be included in the imported sequences. This only applies if column vj_in_frame is present. By default, import_out_of_frame is False.

  • import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.

  • import_empty_nt_sequences (bool): Whether to import sequences that have an empty nucleotide sequence field. By default, import_empty_nt_sequences is set to True.

  • import_empty_aa_sequences (bool): Whether to import sequences that have an empty amino acid sequence field. For analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.

  • region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as AIRR uses the IMGT junction. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.

  • column_mapping (dict): A mapping from AIRR column names to immuneML’s internal data representation. A custom column mapping can be specified here if necessary (for example, adding additional data fields if they are present in the AIRR file, or using alternative column names). Valid immuneML fields that can be specified here are [‘sequence_aa’, ‘sequence’, ‘v_call’, ‘j_call’, ‘chain’, ‘duplicate_count’, ‘frame_type’, ‘sequence_id’, ‘cell_id’]. For AIRR, this is by default set to:

    junction: sequences
    junction_aa: sequence_aas
    v_call: v_alleles
    j_call: j_alleles
    locus: chains
    duplicate_count: counts
    sequence_id: sequence_identifiers
    
  • column_mapping_synonyms (dict): This is a column mapping that can be used if a column could have alternative names. The formatting is the same as column_mapping. If some columns specified in column_mapping are not found in the file, the columns specified in column_mapping_synonyms are instead attempted to be loaded. For AIRR format, there is no default column_mapping_synonyms.

  • metadata_column_mapping (dict): Specifies metadata for Sequence- and ReceptorDatasets. This should specify a mapping similar to column_mapping where keys are AIRR column names and values are the names that are internally used in immuneML as metadata fields. These metadata fields can be used as prediction labels for Sequence- and ReceptorDatasets. For AIRR format, there is no default metadata_column_mapping. When importing a RepertoireDataset, the metadata is automatically extracted from the metadata json files.

  • separator (str): Column separator. For AIRR, this is by default “\t”.

YAML specification:

definitions:
    datasets:
        my_airr_dataset:
            format: IReceptor
            params:
                path: path/to/zipfiles/
                is_repertoire: True # whether to import a RepertoireDataset
                metadata_column_mapping: # metadata column mapping AIRR: immuneML for Sequence- or ReceptorDataset
                    airr_column_name1: metadata_label1
                    airr_column_name2: metadata_label2
                import_productive: True # whether to include productive sequences in the dataset
                import_with_stop_codon: False # whether to include sequences with stop codon in the dataset
                import_out_of_frame: False # whether to include out of frame sequences in the dataset
                import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
                import_empty_nt_sequences: True # keep sequences even if the `sequences` column is empty (provided that other fields are as specified here)
                import_empty_aa_sequences: False # remove all sequences with empty `sequence_aas` column
                # Optional fields with AIRR-specific defaults, only change when different behavior is required:
                separator: "\t" # column separator
                region_type: IMGT_CDR3 # what part of the sequence to import
                column_mapping: # column mapping AIRR: immuneML
                    junction: sequences
                    junction_aa: sequence_aas
                    v_call: v_alleles
                    j_call: j_alleles
                    locus: chains
                    duplicate_count: counts
                    sequence_id: sequence_identifiers
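
The specification above imports a RepertoireDataset. As a minimal sketch, the variant below imports a SequenceDataset from the same kind of iReceptor Gateway download, using only parameters documented in this section. The dataset name and the airr_column_name1 and metadata_label1 names are placeholders.

```yaml
definitions:
    datasets:
        my_ireceptor_sequence_dataset: # hypothetical dataset name
            format: IReceptor
            params:
                path: path/to/zipfiles/ # directory with .zip files retrieved from the iReceptor Gateway
                is_repertoire: False # import a Sequence- or ReceptorDataset
                paired: False # False imports a SequenceDataset, True a ReceptorDataset
                import_productive: True # include productive sequences
                import_with_stop_codon: False # exclude sequences with stop codons
                import_out_of_frame: False # exclude out of frame sequences
                metadata_column_mapping: # AIRR column name: immuneML metadata field
                    airr_column_name1: metadata_label1 # placeholder names
```

For a Sequence- or ReceptorDataset, prediction labels come from metadata_column_mapping; the automatically generated metadata described above applies to RepertoireDataset import.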

ImmuneML#

Imports the dataset from the files previously exported by immuneML. It closely resembles AIRR format but relies on binary representations and is optimized for faster read-in at runtime.

ImmuneMLImport can import any kind of dataset (RepertoireDataset, SequenceDataset, ReceptorDataset).

This format includes:

  1. a dataset file in yaml format with iml_dataset extension with parameters:

    • name,

    • identifier,

    • metadata_file (for repertoire datasets),

    • metadata_fields (for repertoire datasets),

    • repertoire_ids (for repertoire datasets),

    • element_ids (for receptor and sequence datasets),

    • labels

  2. a csv metadata file (only for repertoire datasets, should be in the same folder as the iml_dataset file),

  3. data files for different types of data. For repertoire datasets, data files include one binary numpy file per repertoire with sequences and associated information and one metadata yaml file per repertoire with details such as repertoire identifier, disease status, subject id and other similar available information. For sequence and receptor datasets, sequences or receptors respectively, are stored in batches in binary numpy files.

Specification arguments:

  • path (str): The path to the previously created dataset file. This file should have a ‘.yaml’ extension. If the path has not been specified, immuneML attempts to load the dataset from a specified metadata file (only for RepertoireDatasets).

  • metadata_file (str): An optional metadata file for a RepertoireDataset. If specified, the RepertoireDataset metadata will be updated to the newly specified metadata without otherwise changing the Repertoire objects.

YAML specification:

definitions:
    datasets:
        my_dataset:
            format: ImmuneML
            params:
                path: path/to/dataset.yaml
                metadata_file: path/to/metadata.csv

ImmunoSEQRearrangement#

Imports data from Adaptive Biotechnologies immunoSEQ Analyzer rearrangement-level .tsv files into a Repertoire- or SequenceDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets should be used when predicting values for unpaired (single-chain) immune receptors, like antigen specificity.

The format of the files imported by this importer is described here: https://www.adaptivebiotech.com/wp-content/uploads/2019/07/MRK-00342_immunoSEQ_TechNote_DataExport_WEB_REV.pdf Alternatively, to import sample-level .tsv files, see ImmunoSEQSample import.

The only difference between these two importers is which columns they load from the .tsv files.

Specification arguments:

  • path (str): For RepertoireDatasets, this is the path to a directory with files to import. For Sequence- or ReceptorDatasets this path may either be the path to the file to import, or the path to the folder containing one or more files with .tsv, .csv or .txt extensions. By default path is set to the current working directory.

  • is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset. By default, is_repertoire is set to True.

  • metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. Only the files included under the column ‘filename’ are imported into the RepertoireDataset. For setting SequenceDataset metadata, metadata_file is ignored, see metadata_column_mapping instead.

  • import_productive (bool): Whether productive sequences (with value ‘In’ in column frame_type) should be included in the imported sequences. By default, import_productive is True.

  • import_with_stop_codon (bool): Whether sequences with stop codons (with value ‘Stop’ in column frame_type) should be included in the imported sequences. By default, import_with_stop_codon is False.

  • import_out_of_frame (bool): Whether out of frame sequences (with value ‘Out’ in column frame_type) should be included in the imported sequences. By default, import_out_of_frame is False.

  • import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.

  • import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.

  • import_empty_aa_sequences (bool): imports sequences which have an empty amino acid sequence field; can be True or False; for analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.

  • region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as immunoSEQ files use the IMGT junction. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.

  • column_mapping (dict): A mapping from immunoSEQ column names to immuneML’s internal data representation. For immunoSEQ rearrangement-level files, this is by default set to the values shown below in YAML format. A custom column mapping can be specified here if necessary (for example, to add additional data fields if they are present in the file, or to use alternative column names). Valid immuneML fields that can be specified here are [‘sequence_aa’, ‘sequence’, ‘v_call’, ‘j_call’, ‘chain’, ‘duplicate_count’, ‘frame_type’, ‘sequence_id’, ‘cell_id’].

    rearrangement: sequence
    amino_acid: sequence_aa
    v_resolved: v_call
    j_resolved: j_call
    templates: duplicate_count
    locus: chain
    
  • column_mapping_synonyms (dict): This is a column mapping that can be used if a column could have alternative names. The formatting is the same as column_mapping. If some columns specified in column_mapping are not found in the file, the columns specified in column_mapping_synonyms are instead attempted to be loaded. For immunoSEQ rearrangement-level files, this is by default set to:

    v_resolved: v_alleles
    j_resolved: j_alleles
    
  • columns_to_load (list): Specifies which subset of columns must be loaded from the file. By default, this is: [rearrangement, v_family, v_gene, v_allele, j_family, j_gene, j_allele, amino_acid, templates, frame_type, locus]

  • metadata_column_mapping (dict): Specifies metadata for Sequence- and ReceptorDatasets. This should specify a mapping similar to column_mapping where keys are immunoSEQ column names and values are the names that are internally used in immuneML as metadata fields. These metadata fields can be used as prediction labels for Sequence- and ReceptorDatasets. This parameter can also be used to specify sequence-level metadata columns for RepertoireDatasets, which can be used by reports. To set prediction label metadata for RepertoireDatasets, see metadata_file instead. For immunoSEQ rearrangement .tsv files, there is no default metadata_column_mapping.

  • separator (str): Column separator. For immunoSEQ files, this is “\t” (tab) by default.

YAML specification:

definitions:
    datasets:
        my_immunoseq_dataset:
            format: ImmunoSEQRearrangement
            params:
                path: path/to/files/
                is_repertoire: True # whether to import a RepertoireDataset (True) or a SequenceDataset (False)
                metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
                metadata_column_mapping: # metadata column mapping ImmunoSEQ: immuneML for SequenceDataset
                    immunoseq_column_name1: metadata_label1
                    immunoseq_column_name2: metadata_label2
                import_productive: True # whether to include productive sequences in the dataset
                import_with_stop_codon: False # whether to include sequences with stop codon in the dataset
                import_out_of_frame: False # whether to include out of frame sequences in the dataset
                import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
                import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
                import_empty_aa_sequences: False # filter out sequences if they don't have sequence_aa set
                # Optional fields with ImmunoSEQ rearrangement-specific defaults, only change when different behavior is required:
                separator: "\t" # column separator
                columns_to_load: # subset of columns to load
                - rearrangement
                - v_family
                - v_gene
                - v_resolved
                - j_family
                - j_gene
                - j_resolved
                - amino_acid
                - templates
                - frame_type
                - locus
                region_type: IMGT_CDR3 # what part of the sequence to import
                column_mapping: # column mapping immunoSEQ: immuneML
                    rearrangement: sequence
                    amino_acid: sequence_aa
                    v_resolved: v_call
                    j_resolved: j_call
                    templates: duplicate_count
                    locus: chain
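
The three frame_type flags described above (import_productive, import_with_stop_codon, import_out_of_frame) can be read as a simple row filter. The following is a minimal Python sketch of that logic, not immuneML's implementation: the helper names allowed_frame_types and filter_rows and the example rows are made up for illustration, and it assumes the frame_type column contains exactly the values ‘In’, ‘Stop’ and ‘Out’ as stated in the parameter descriptions.

```python
import csv
import io


# Hypothetical helper (not part of the immuneML API): maps the three import
# flags onto the frame_type values they control.
def allowed_frame_types(import_productive=True,
                        import_with_stop_codon=False,
                        import_out_of_frame=False):
    flags = {
        "In": import_productive,
        "Stop": import_with_stop_codon,
        "Out": import_out_of_frame,
    }
    return {frame_type for frame_type, keep in flags.items() if keep}


def filter_rows(tsv_text, **flags):
    """Return the rows whose frame_type is enabled by the given flags."""
    allowed = allowed_frame_types(**flags)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row["frame_type"] in allowed]


# Made-up rows using rearrangement-level column names.
EXAMPLE_TSV = (
    "rearrangement\tamino_acid\tframe_type\n"
    "TGTGCCAGCAGTTTC\tCASSF\tIn\n"
    "TGTGCCAGCTGATTC\tCAS*F\tStop\n"
    "TGTGCCAGCAGTTT\t\tOut\n"
)

# With the documented defaults only the productive row is kept.
default_rows = filter_rows(EXAMPLE_TSV)
# Enabling the other two flags keeps every row.
all_rows = filter_rows(EXAMPLE_TSV,
                       import_with_stop_codon=True,
                       import_out_of_frame=True)
```

With the default flag values, only rows marked ‘In’ survive, which matches the defaults listed for this importer.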

ImmunoSEQSample#

Imports data from Adaptive Biotechnologies immunoSEQ Analyzer sample-level .tsv files into a Repertoire- or SequenceDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets should be used when predicting values for unpaired (single-chain) immune receptors, like antigen specificity.

The format of the files imported by this importer is described in section 3.4.13 of https://clients.adaptivebiotech.com/assets/downloads/immunoSEQ_AnalyzerManual.pdf Alternatively, to import rearrangement-level .tsv files, see ImmunoSEQRearrangement import. The only difference between these two importers is which columns they load from the .tsv files.

Specification arguments:

  • path (str): For RepertoireDatasets, this is the path to a directory with files to import. For Sequence- or ReceptorDatasets this path may either be the path to the file to import, or the path to the folder containing one or more files with .tsv, .csv or .txt extensions. By default path is set to the current working directory.

  • is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset. By default, is_repertoire is set to True.

  • metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. Only the files included under the column ‘filename’ are imported into the RepertoireDataset. For setting SequenceDataset metadata, metadata_file is ignored, see metadata_column_mapping instead.

  • import_productive (bool): Whether productive sequences (with value ‘In’ in column frame_type) should be included in the imported sequences. By default, import_productive is True.

  • import_with_stop_codon (bool): Whether sequences with stop codons (with value ‘Stop’ in column frame_type) should be included in the imported sequences. By default, import_with_stop_codon is False.

  • import_out_of_frame (bool): Whether out of frame sequences (with value ‘Out’ in column frame_type) should be included in the imported sequences. By default, import_out_of_frame is False.

  • import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.

  • import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.

  • import_empty_aa_sequences (bool): imports sequences which have an empty amino acid sequence field; can be True or False; for analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.

  • region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as immunoSEQ files use the IMGT junction. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.

  • column_mapping (dict): A mapping from immunoSEQ column names to immuneML’s internal data representation. For immunoSEQ sample-level files, this is by default set to the values shown below in YAML format. A custom column mapping can be specified here if necessary (for example, to add additional data fields if they are present in the file, or to use alternative column names). Valid immuneML fields that can be specified here are [‘sequence_aa’, ‘sequence’, ‘v_call’, ‘j_call’, ‘chain’, ‘duplicate_count’, ‘frame_type’, ‘sequence_id’, ‘cell_id’].

    nucleotide: sequence
    aminoAcid: sequence_aa
    vGeneName: v_call
    jGeneName: j_call
    sequenceStatus: frame_type
    vFamilyName: v_family
    jFamilyName: j_family
    vGeneAllele: v_allele
    jGeneAllele: j_allele
    count (templates/reads): duplicate_count
    
  • column_mapping_synonyms (dict): This is a column mapping that can be used if a column could have alternative names. The formatting is the same as column_mapping. If some columns specified in column_mapping are not found in the file, the columns specified in column_mapping_synonyms are instead attempted to be loaded. For immunoSEQ sample .tsv files, there is no default column_mapping_synonyms.

  • columns_to_load (list): Specifies which subset of columns must be loaded from the file. By default, this is: [nucleotide, aminoAcid, count (templates/reads), vFamilyName, vGeneName, vGeneAllele, jFamilyName, jGeneName, jGeneAllele, sequenceStatus]; these are the columns from the original file that will be imported

  • metadata_column_mapping (dict): Specifies metadata for Sequence- and ReceptorDatasets. This should specify a mapping similar to column_mapping where keys are immunoSEQ column names and values are the names that are internally used in immuneML as metadata fields. These metadata fields can be used as prediction labels for Sequence- and ReceptorDatasets. This parameter can also be used to specify sequence-level metadata columns for RepertoireDatasets, which can be used by reports. To set prediction label metadata for RepertoireDatasets, see metadata_file instead. For immunoSEQ sample .tsv files, there is no default metadata_column_mapping.

  • separator (str): Column separator. For immunoSEQ files, this is “\t” (tab) by default.

YAML specification:

definitions:
    datasets:
        my_immunoseq_dataset:
            format: ImmunoSEQSample
            params:
                path: path/to/files/
                is_repertoire: True # whether to import a RepertoireDataset (True) or a SequenceDataset (False)
                metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
                metadata_column_mapping: # metadata column mapping ImmunoSEQ: immuneML for SequenceDataset
                    immunoseq_column_name1: metadata_label1
                    immunoseq_column_name2: metadata_label2
                import_productive: True # whether to include productive sequences in the dataset
                import_with_stop_codon: False # whether to include sequences with stop codon in the dataset
                import_out_of_frame: False # whether to include out of frame sequences in the dataset
                import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
                import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
                import_empty_aa_sequences: False # filter out sequences if they don't have sequence_aa set
                # Optional fields with ImmunoSEQ sample-specific defaults, only change when different behavior is required:
                separator: "\t" # column separator
                columns_to_load: # subset of columns to load
                - nucleotide
                - aminoAcid
                - count (templates/reads)
                - vFamilyName
                - vGeneName
                - vGeneAllele
                - jFamilyName
                - jGeneName
                - jGeneAllele
                - sequenceStatus
                region_type: IMGT_CDR3 # what part of the sequence to import
                column_mapping: # column mapping immunoSEQ: immuneML
                    nucleotide: sequence
                    aminoAcid: sequence_aa
                    vGeneName: v_call
                    jGeneName: j_call
                    sequenceStatus: frame_type
                    vFamilyName: v_family
                    jFamilyName: j_family
                    vGeneAllele: v_allele
                    jGeneAllele: j_allele
                    count (templates/reads): duplicate_count
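
To illustrate how columns_to_load and column_mapping work together, here is a minimal Python sketch that first selects a subset of columns and then renames them to immuneML field names. It is an illustration under stated assumptions rather than immuneML's own code: load_and_rename is a hypothetical helper, the input rows and the extra column some_other_column are invented, and only part of the default mapping is reproduced.

```python
import csv
import io

# Part of the sample-level defaults listed above.
COLUMNS_TO_LOAD = [
    "nucleotide",
    "aminoAcid",
    "count (templates/reads)",
    "vGeneName",
    "jGeneName",
    "sequenceStatus",
]
COLUMN_MAPPING = {
    "nucleotide": "sequence",
    "aminoAcid": "sequence_aa",
    "vGeneName": "v_call",
    "jGeneName": "j_call",
    "sequenceStatus": "frame_type",
    "count (templates/reads)": "duplicate_count",
}


def load_and_rename(tsv_text, columns_to_load, column_mapping, separator="\t"):
    """Hypothetical helper: keep only columns_to_load, renamed via column_mapping."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter=separator)
    records = []
    for row in reader:
        records.append({
            # Columns without a mapping entry keep their original name.
            column_mapping.get(column, column): row[column]
            for column in columns_to_load
            if column in row
        })
    return records


# Invented example file content; some_other_column is not loaded.
EXAMPLE_TSV = (
    "nucleotide\taminoAcid\tcount (templates/reads)\tvGeneName\tjGeneName"
    "\tsequenceStatus\tsome_other_column\n"
    "TGTGCCAGCAGTTTC\tCASSF\t12\tTCRBV05-01\tTCRBJ02-07\tIn\tignored\n"
)

records = load_and_rename(EXAMPLE_TSV, COLUMNS_TO_LOAD, COLUMN_MAPPING)
```

Columns that are absent from columns_to_load never reach the mapping step, which is why adding a new field generally means extending both parameters.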

MiXCR#

Imports data in MiXCR format into a Repertoire- or SequenceDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets should be used when predicting values for unpaired (single-chain) immune receptors, like antigen specificity.

Specification arguments:

  • path (str): For RepertoireDatasets, this is the path to a directory with MiXCR files to import. For Sequence- or ReceptorDatasets this path may either be the path to the file to import, or the path to the folder containing one or more files with .tsv, .csv or .txt extensions. By default path is set to the current working directory.

  • is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset. By default, is_repertoire is set to True.

  • metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. Only the MiXCR files included under the column ‘filename’ are imported into the RepertoireDataset. For setting SequenceDataset metadata, metadata_file is ignored, see metadata_column_mapping instead.

  • import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence, such as ‘_’, are removed). By default import_illegal_characters is False.

  • import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.

  • import_empty_aa_sequences (bool): imports sequences which have an empty amino acid sequence field; can be True or False; for analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.

  • region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as MiXCR uses IMGT junction as CDR3. Alternatively to importing the CDR3 sequence, other region types can be specified here as well. Valid values are IMGT_CDR3, IMGT_CDR1, IMGT_CDR2, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4.

  • column_mapping (dict): A mapping from MiXCR column names to immuneML’s internal data representation. The columns that specify the sequences to import are handled by the region_type parameter. A custom column mapping can be specified here if necessary (for example, to add additional data fields if they are present in the MiXCR file, or to use alternative column names). Valid immuneML fields that can be specified here are [‘sequence_aa’, ‘sequence’, ‘v_call’, ‘j_call’, ‘chain’, ‘duplicate_count’, ‘frame_type’, ‘sequence_id’, ‘cell_id’]. For MiXCR, this is by default set to:

    cloneCount: duplicate_count
    allVHitsWithScore: v_call
    allJHitsWithScore: j_call
    
  • column_mapping_synonyms (dict): This is a column mapping that can be used if a column could have alternative names. The formatting is the same as column_mapping. If some columns specified in column_mapping are not found in the file, the columns specified in column_mapping_synonyms are instead attempted to be loaded. For MiXCR format, there is no default column_mapping_synonyms.

  • columns_to_load (list): Specifies which subset of columns must be loaded from the MiXCR file. By default, this is: [cloneCount, allVHitsWithScore, allJHitsWithScore, aaSeqCDR3, nSeqCDR3]

  • metadata_column_mapping (dict): Specifies metadata for Sequence- and ReceptorDatasets. This should specify a mapping similar to column_mapping where keys are MiXCR column names and values are the names that are internally used in immuneML as metadata fields. These metadata fields can be used as prediction labels for Sequence- and ReceptorDatasets. This parameter can also be used to specify sequence-level metadata columns for RepertoireDatasets, which can be used by reports. To set prediction label metadata for RepertoireDatasets, see metadata_file instead. For MiXCR format, there is no default metadata_column_mapping.

  • separator (str): Column separator. For MiXCR, this is “\t” (tab) by default.

YAML specification:

definitions:
    datasets:
        my_mixcr_dataset:
            format: MiXCR
            params:
                path: path/to/files/
                is_repertoire: True # whether to import a RepertoireDataset (True) or a SequenceDataset (False)
                metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
                metadata_column_mapping: # metadata column mapping MiXCR: immuneML for SequenceDataset
                    mixcrColumnName1: metadata_label1
                    mixcrColumnName2: metadata_label2
                region_type: IMGT_CDR3 # what part of the sequence to import
                import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
                import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
                import_empty_aa_sequences: False # filter out sequences if they don't have sequence_aa set
                # Optional fields with MiXCR-specific defaults, only change when different behavior is required:
                separator: "\t" # column separator
                columns_to_load: # subset of columns to load, sequence columns are handled by region_type parameter
                - cloneCount
                - allVHitsWithScore
                - allJHitsWithScore
                - aaSeqCDR3
                - nSeqCDR3
                column_mapping: # column mapping MiXCR: immuneML
                    cloneCount: duplicate_count
                    allVHitsWithScore: v_call
                    allJHitsWithScore: j_call
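
The effect of region_type: IMGT_CDR3 on a junction sequence, as described above, can be sketched in a few lines of Python. The function name junction_to_cdr3 is hypothetical and the sequences are invented. The amino acid trimming follows the description above (first and last amino acids removed); trimming three nucleotides from each end of the nucleotide sequence is an assumption made here by analogy (one codon per amino acid), not something this page states.

```python
def junction_to_cdr3(junction_aa, junction_nt=""):
    """Hypothetical helper: trim an IMGT junction down to the IMGT CDR3.

    The amino acid sequence loses its first and last residue. The nucleotide
    sequence loses one codon (3 nucleotides) at each end; this second step is
    an assumption for illustration.
    """
    cdr3_aa = junction_aa[1:-1]
    cdr3_nt = junction_nt[3:-3]
    return cdr3_aa, cdr3_nt


# Invented example: junction CASSF encoded by TGT GCC AGC AGT TTC.
cdr3_aa, cdr3_nt = junction_to_cdr3("CASSF", "TGTGCCAGCAGTTTC")
print(cdr3_aa)  # ASS
print(cdr3_nt)  # GCCAGCAGT
```

Sequences that were exported as junctions therefore become two amino acids shorter after import with the default region_type.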

OLGA#

Imports data generated by OLGA simulations into a Repertoire- or SequenceDataset. Assumes that the columns in each file correspond, in order, to: nucleotide sequences, amino acid sequences, V genes, J genes.

Reference: Sethna, Zachary et al. ‘OLGA: fast computation of generation probabilities of B- and T-cell receptor amino acid sequences and motifs’. Bioinformatics, (2019) doi.org/10.1093/bioinformatics/btz035.

Specification arguments:

  • path (str): For RepertoireDatasets, this is the path to a directory with OLGA files to import. For Sequence- or ReceptorDatasets this path may either be the path to the file to import, or the path to the folder containing one or more files with .tsv, .csv or .txt extensions. By default path is set to the current working directory.

  • is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset. By default, is_repertoire is set to True.

  • metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. Only the OLGA files included under the column ‘filename’ are imported into the RepertoireDataset. SequenceDataset metadata is currently not supported.

  • import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.

  • import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.

  • import_empty_aa_sequences (bool): imports sequences which have an empty amino acid sequence field; can be True or False; for analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.

  • region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as OLGA uses the IMGT junction. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.

  • separator (str): Column separator. For OLGA, this is “\t” (tab) by default.

  • column_mapping (dict): Defines which columns to import from the OLGA format: keys are the column numbers and values are the immuneML field names that the columns are mapped to.

YAML specification:

definitions:
    datasets:
        my_olga_dataset:
            format: OLGA
            params:
                path: path/to/files/
                is_repertoire: True # whether to import a RepertoireDataset (True) or a SequenceDataset (False)
                metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
                import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
                import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
                import_empty_aa_sequences: False # filter out sequences if they don't have sequence_aa set
                # Optional fields with OLGA-specific defaults, only change when different behavior is required:
                separator: "\t" # column separator
                region_type: IMGT_CDR3 # what part of the sequence to import
                columns_to_load: [0, 1, 2, 3]
                column_mapping:
                    0: sequence
                    1: sequence_aa
                    2: v_call
                    3: j_call
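
Because OLGA files are addressed by column number rather than by header name, the column_mapping above uses integer keys. The sketch below shows one way such a positional mapping could be applied to a headerless tab-separated file. It is an illustration only: read_olga is a hypothetical function, the rows are invented, and the assumption that the files have no header row is inferred from the numeric column keys rather than stated on this page.

```python
import csv
import io

# Default positional mapping from the YAML specification above.
COLUMN_MAPPING = {0: "sequence", 1: "sequence_aa", 2: "v_call", 3: "j_call"}


def read_olga(text, column_mapping=COLUMN_MAPPING, separator="\t"):
    """Hypothetical helper: name the columns of a headerless file by position."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    return [
        {name: row[index] for index, name in column_mapping.items()}
        for row in reader
        if row  # skip empty lines
    ]


# Invented rows in the column order: nucleotide, amino acid, V gene, J gene.
EXAMPLE_TEXT = (
    "TGTGCCAGCAGTTTC\tCASSF\tTRBV5-1\tTRBJ2-7\n"
    "TGTGCCAGCTTC\tCASF\tTRBV6-5\tTRBJ1-1\n"
)

olga_records = read_olga(EXAMPLE_TEXT)
```

Reordering or dropping entries in the mapping changes which positions are read, so the mapping has to match the column order produced by the OLGA run.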

RandomReceptorDataset#

Returns a ReceptorDataset consisting of randomly generated sequences, which can be used for benchmarking purposes. The sequences consist of uniformly chosen amino acids or nucleotides.

Specification arguments:

  • receptor_count (int): The number of receptors the ReceptorDataset should contain.

  • chain_1_length_probabilities (dict): A mapping where the keys correspond to different sequence lengths for chain 1, and the values are the probabilities for choosing each sequence length. For example, to create a random ReceptorDataset where 40% of the sequences for chain 1 would be of length 10, and 60% of the sequences would have length 12, this mapping would need to be specified:

    10: 0.4
    12: 0.6
    
  • chain_2_length_probabilities (dict): Same as chain_1_length_probabilities, but for chain 2.

  • labels (dict): A mapping that specifies randomly chosen labels to be assigned to the receptors. One or multiple labels can be specified here. The keys of this mapping are the labels, and the values consist of another mapping between label classes and their probabilities. For example, to create a random ReceptorDataset with the label cmv_epitope where 70% of the receptors have class binding and the remaining 30% have class not_binding, the following mapping should be specified:

    cmv_epitope:
        binding: 0.7
        not_binding: 0.3
    

YAML specification:

definitions:
    datasets:
        my_random_dataset:
            format: RandomReceptorDataset
            params:
                receptor_count: 100 # number of random receptors to generate
                chain_1_length_probabilities:
                    14: 0.8 # 80% of all generated sequences for all receptors (for chain 1) will have length 14
                    15: 0.2 # 20% of all generated sequences across all receptors (for chain 1) will have length 15
                chain_2_length_probabilities:
                    14: 0.8 # 80% of all generated sequences for all receptors (for chain 2) will have length 14
                    15: 0.2 # 20% of all generated sequences across all receptors (for chain 2) will have length 15
                labels:
                    epitope1: # label name
                        True: 0.5 # 50% of the receptors will have class True
                        False: 0.5 # 50% of the receptors will have class False
                    epitope2: # next label with classes that will be assigned to receptors independently of the previous label or other parameters
                        1: 0.3 # 30% of the generated receptors will have class 1
                        0: 0.7 # 70% of the generated receptors will have class 0

RandomRepertoireDataset#

Returns a RepertoireDataset consisting of randomly generated sequences, which can be used for benchmarking purposes. The sequences consist of uniformly chosen amino acids or nucleotides.

Specification arguments:

  • repertoire_count (int): The number of repertoires the RepertoireDataset should contain.

  • sequence_count_probabilities (dict): A mapping where the keys are the number of sequences per repertoire, and the values are the probabilities that any of the repertoires would have that number of sequences. For example, to create a random RepertoireDataset where 40% of the repertoires would have 1000 sequences, and the other 60% would have 1100 sequences, this mapping would need to be specified:

    1000: 0.4
    1100: 0.6
    
  • sequence_length_probabilities (dict): A mapping where the keys correspond to different sequence lengths, and the values are the probabilities for choosing each sequence length. For example, to create a random RepertoireDataset where 40% of the sequences would be of length 10, and 60% of the sequences would have length 12, this mapping would need to be specified:

    10: 0.4
    12: 0.6
    
  • labels (dict): A mapping that specifies randomly chosen labels to be assigned to the Repertoires. One or multiple labels can be specified here. The keys of this mapping are the labels, and the values consist of another mapping between label classes and their probabilities. For example, to create a random RepertoireDataset with the label CMV where 70% of the Repertoires have class cmv_positive and the remaining 30% have class cmv_negative, the following mapping should be specified:

    CMV:
        cmv_positive: 0.7
        cmv_negative: 0.3
    

YAML specification:

definitions:
    datasets:
        my_random_dataset:
            format: RandomRepertoireDataset
            params:
                repertoire_count: 100 # number of random repertoires to generate
                sequence_count_probabilities:
                    10: 0.5 # probability that any of the repertoires would have 10 receptor sequences
                    20: 0.5
                sequence_length_probabilities:
                    10: 0.5 # probability that any of the receptor sequences would be 10 amino acids in length
                    12: 0.5
                labels: # randomly assigned labels (only useful for simple benchmarking)
                    cmv:
                        True: 0.5 # probability of value True for label cmv to be assigned to any repertoire
                        False: 0.5
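
How the three probability mappings combine can be sketched in plain Python. This is an illustration only and not immuneML's implementation: the function names, the fixed seed and the 20-letter amino acid alphabet are assumptions made for this example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet for this sketch


def sample_from(probabilities, rng):
    """Draw one key from a {value: probability} mapping."""
    values = list(probabilities.keys())
    weights = list(probabilities.values())
    return rng.choices(values, weights=weights, k=1)[0]


def sample_random_repertoires(repertoire_count, sequence_count_probabilities,
                              sequence_length_probabilities, labels, seed=0):
    """Hypothetical helper mirroring the RandomRepertoireDataset parameters."""
    rng = random.Random(seed)
    repertoires = []
    for _ in range(repertoire_count):
        # number of sequences in this repertoire
        n_sequences = sample_from(sequence_count_probabilities, rng)
        sequences = []
        for _ in range(n_sequences):
            # length of this sequence, then uniformly chosen amino acids
            length = sample_from(sequence_length_probabilities, rng)
            sequences.append("".join(rng.choice(AMINO_ACIDS) for _ in range(length)))
        # each label is drawn independently of the other labels
        metadata = {name: sample_from(classes, rng) for name, classes in labels.items()}
        repertoires.append({"sequences": sequences, "metadata": metadata})
    return repertoires


repertoires = sample_random_repertoires(
    repertoire_count=100,
    sequence_count_probabilities={10: 0.5, 20: 0.5},
    sequence_length_probabilities={10: 0.5, 12: 0.5},
    labels={"cmv": {True: 0.5, False: 0.5}},
)
```

Because labels are drawn independently of the sequences, a classifier trained on such a dataset should not perform better than chance, which is why these datasets are only useful for testing and simple benchmarking.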

RandomSequenceDataset#

Returns a SequenceDataset consisting of randomly generated sequences, which can be used for benchmarking purposes. The sequences consist of uniformly chosen amino acids or nucleotides.

Specification arguments:

  • sequence_count (int): The number of sequences the SequenceDataset should contain.

  • length_probabilities (dict): A mapping where the keys correspond to different sequence lengths and the values are the probabilities for choosing each sequence length. For example, to create a random SequenceDataset where 40% of the sequences would be of length 10, and 60% of the sequences would have length 12, this mapping would need to be specified:

    10: 0.4
    12: 0.6
    
  • labels (dict): A mapping that specifies randomly chosen labels to be assigned to the sequences. One or multiple labels can be specified here. The keys of this mapping are the labels, and the values consist of another mapping between label classes and their probabilities. For example, to create a random SequenceDataset with the label cmv_epitope where 70% of the sequences have class binding and the remaining 30% have class not_binding, the following mapping should be specified:

    cmv_epitope:
        binding: 0.7
        not_binding: 0.3
    

YAML specification:

definitions:
    datasets:
        my_random_dataset:
            format: RandomSequenceDataset
            params:
                sequence_count: 100 # number of random sequences to generate
                length_probabilities:
                    14: 0.8 # 80% of all generated sequences will have length 14
                    15: 0.2 # 20% of all generated sequences will have length 15
                labels:
                    epitope1: # label name
                        True: 0.5 # 50% of the sequences will have class True
                        False: 0.5 # 50% of the sequences will have class False
                    epitope2: # next label with classes that will be assigned to sequences independently of the previous label or other parameters
                        1: 0.3 # 30% of the generated sequences will have class 1
                        0: 0.7 # 70% of the generated sequences will have class 0

TenxGenomics#

Imports data from the 10x Genomics Cell Ranger analysis pipeline into a Repertoire-, Sequence- or ReceptorDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets or ReceptorDatasets should be used when predicting values for unpaired (single-chain) and paired immune receptors respectively, like antigen specificity.

The files that should be used as input are named ‘Clonotype consensus annotations (CSV)’, as described here: https://support.10xgenomics.com/single-cell-vdj/software/pipelines/latest/output/annotation#consensus

Note: by default the 10xGenomics field ‘umis’ is used to define the immuneML field duplicate_count. If you want to use the 10xGenomics field ‘reads’ instead, this can be changed in the column_mapping (set reads: duplicate_count). Furthermore, the 10xGenomics field clonotype_id is used for the immuneML field cell_id.

Specification arguments:

  • path (str): For RepertoireDatasets, this is the path to a directory with 10xGenomics files to import. For Sequence- or ReceptorDatasets this path may either be the path to the file to import, or the path to the folder containing one or multiple files with .tsv, .csv or .txt extensions. By default path is set to the current working directory.

  • is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset or ReceptorDataset. By default, is_repertoire is set to True.

  • metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. For setting Sequence- or ReceptorDataset metadata, metadata_file is ignored, see metadata_column_mapping instead.

  • paired (str): Required for Sequence- or ReceptorDatasets. This parameter determines whether to import a SequenceDataset (paired = False) or a ReceptorDataset (paired = True). In a ReceptorDataset, two sequences with chain types specified by receptor_chains are paired together based on the identifier given in the 10xGenomics column named ‘clonotype_id’.

  • receptor_chains (str): Required for ReceptorDatasets. Determines which pair of chains to import for each Receptor. Valid values are TRA_TRB, TRG_TRD, IGH_IGL, IGH_IGK. If receptor_chains is not provided, the chain pair is automatically detected (only one chain pair type allowed per repertoire).

  • import_productive (bool): Whether productive sequences (with value ‘True’ in column productive) should be included in the imported sequences. By default, import_productive is True.

  • import_unproductive (bool): Whether unproductive sequences (with value ‘False’ in column productive) should be included in the imported sequences. By default, import_unproductive is False.

  • import_unknown_productivity (bool): Whether sequences with unknown productivity (missing or ‘NA’ value in column productive) should be included in the imported sequences. By default, import_unknown_productivity is True.

  • import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.

  • import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.

  • import_empty_aa_sequences (bool): imports sequences which have an empty amino acid sequence field; can be True or False; for analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.

  • region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as 10xGenomics uses IMGT junction as CDR3. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.

  • column_mapping (dict): A mapping from 10xGenomics column names to immuneML’s internal data representation. A custom column mapping can be specified here if necessary (for example, adding additional data fields if they are present in the 10xGenomics file, or using alternative column names). Valid immuneML fields that can be specified here are [‘sequence_aa’, ‘sequence’, ‘v_call’, ‘j_call’, ‘chain’, ‘duplicate_count’, ‘frame_type’, ‘sequence_id’, ‘cell_id’]. For 10xGenomics, this is by default set to:

    cdr3: sequence_aa
    cdr3_nt: sequence
    v_gene: v_call
    j_gene: j_call
    umis: duplicate_count
    clonotype_id: cell_id
    consensus_id: sequence_id
    
  • column_mapping_synonyms (dict): This is a column mapping that can be used if a column could have alternative names. The formatting is the same as column_mapping. If some columns specified in column_mapping are not found in the file, the columns specified in column_mapping_synonyms are instead attempted to be loaded. For 10xGenomics format, there is no default column_mapping_synonyms.

  • metadata_column_mapping (dict): Specifies metadata for Sequence- and ReceptorDatasets. This should specify a mapping similar to column_mapping where keys are 10xGenomics column names and values are the names that are internally used in immuneML as metadata fields. These metadata fields can be used as prediction labels for Sequence- and ReceptorDatasets. This parameter can also be used to specify sequence-level metadata columns for RepertoireDatasets, which can be used by reports. To set prediction label metadata for RepertoireDatasets, see metadata_file instead. For 10xGenomics format, there is no default metadata_column_mapping.

  • separator (str): Column separator, for 10xGenomics this is by default “,”.

YAML specification:

definitions:
    datasets:
        my_10x_dataset:
            format: 10xGenomics
            params:
                path: path/to/files/
                is_repertoire: True # whether to import a RepertoireDataset
                metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
                paired: False # whether to import SequenceDataset (False) or ReceptorDataset (True) when is_repertoire = False
                receptor_chains: TRA_TRB # what chain pair to import for a ReceptorDataset
                metadata_column_mapping: # metadata column mapping 10xGenomics: immuneML for SequenceDataset
                    tenx_column_name1: metadata_label1
                    tenx_column_name2: metadata_label2
                import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
                import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
                import_empty_aa_sequences: False # filter out sequences if they don't have sequence_aa set
                # Optional fields with 10xGenomics-specific defaults, only change when different behavior is required:
                separator: "," # column separator
                region_type: IMGT_CDR3 # what part of the sequence to import
                column_mapping: # column mapping 10xGenomics: immuneML
                    cdr3: sequence_aa
                    cdr3_nt: sequence
                    v_gene: v_call
                    j_gene: j_call
                    umis: duplicate_count
                    clonotype_id: cell_id
                    consensus_id: sequence_id
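
The effect of region_type: IMGT_CDR3 described above (dropping the first and last amino acid of the IMGT junction) can be sketched as follows. The function name and the example sequence are hypothetical, and the handling of very short inputs is an assumption of this sketch rather than documented immuneML behavior.

```python
def junction_to_imgt_cdr3(junction_aa):
    """Drop the first and last amino acid of an IMGT junction sequence."""
    if len(junction_aa) < 2:
        return ""  # assumed handling: nothing is left after trimming
    return junction_aa[1:-1]


# Hypothetical junction: conserved C at the start, F at the end
cdr3 = junction_to_imgt_cdr3("CASSLGTDTQYF")
print(cdr3)  # ASSLGTDTQY
```

With any other region_type, the sequence would be passed through unchanged.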

VDJdb#

Imports data in VDJdb format into a Repertoire-, Sequence- or ReceptorDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets or ReceptorDatasets should be used when predicting values for unpaired (single-chain) and paired immune receptors respectively, like antigen specificity.

Specification arguments:

  • path (str): For RepertoireDatasets, this is the path to a directory with VDJdb files to import. For Sequence- or ReceptorDatasets this path may either be the path to the file to import, or the path to the folder containing one or multiple files with .tsv, .csv or .txt extensions. By default path is set to the current working directory.

  • is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset or ReceptorDataset. By default, is_repertoire is set to True.

  • metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. For setting Sequence- or ReceptorDataset metadata, metadata_file is ignored, see metadata_column_mapping instead.

  • paired (str): Required for Sequence- or ReceptorDatasets. This parameter determines whether to import a SequenceDataset (paired = False) or a ReceptorDataset (paired = True). In a ReceptorDataset, two sequences with chain types specified by receptor_chains are paired together based on the identifier given in the VDJdb column named ‘complex.id’.

  • receptor_chains (str): Required for ReceptorDatasets. Determines which pair of chains to import for each Receptor. Valid values are TRA_TRB, TRG_TRD, IGH_IGL, IGH_IGK. If receptor_chains is not provided, the chain pair is automatically detected (only one chain pair type allowed per repertoire).

  • import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.

  • import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.

  • import_empty_aa_sequences (bool): imports sequences which have an empty amino acid sequence field; can be True or False; for analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.

  • region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as VDJdb uses IMGT junction as CDR3. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.

  • column_mapping (dict): A mapping from VDJdb column names to immuneML’s internal data representation. A custom column mapping can be specified here if necessary (for example, adding additional data fields if they are present in the VDJdb file, or using alternative column names). Valid immuneML fields that can be specified here are [‘sequence_aa’, ‘sequence’, ‘v_call’, ‘j_call’, ‘chain’, ‘duplicate_count’, ‘frame_type’, ‘sequence_id’, ‘cell_id’]. For VDJdb, this is by default set to:

    V: v_call
    J: j_call
    CDR3: sequence_aa
    complex.id: sequence_id
    Gene: chain
    
  • column_mapping_synonyms (dict): This is a column mapping that can be used if a column could have alternative names. The formatting is the same as column_mapping. If some columns specified in column_mapping are not found in the file, the columns specified in column_mapping_synonyms are instead attempted to be loaded. For VDJdb format, there is no default column_mapping_synonyms.

  • metadata_column_mapping (dict): Specifies metadata for Sequence- and ReceptorDatasets. This should specify a mapping similar to column_mapping where keys are VDJdb column names and values are the names that are internally used in immuneML as metadata fields. This means that epitope, epitope_gene and epitope_species can be used as prediction labels for Sequence- and ReceptorDatasets. This parameter can also be used to specify sequence-level metadata columns for RepertoireDatasets, which can be used by reports. To set prediction label metadata for RepertoireDatasets, see metadata_file instead. For VDJdb format, this parameter is by default set to:

    Epitope: epitope
    Epitope gene: epitope_gene
    Epitope species: epitope_species
    
  • separator (str): Column separator, for VDJdb this is by default “\t”.

YAML specification:

definitions:
    datasets:
        my_vdjdb_dataset:
            format: VDJdb
            params:
                path: path/to/files/
                is_repertoire: True # whether to import a RepertoireDataset
                metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
                paired: False # whether to import SequenceDataset (False) or ReceptorDataset (True) when is_repertoire = False
                receptor_chains: TRA_TRB # what chain pair to import for a ReceptorDataset
                import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
                import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
                import_empty_aa_sequences: False # filter out sequences if they don't have sequence_aa set
                # Optional fields with VDJdb-specific defaults, only change when different behavior is required:
                separator: "\t" # column separator
                region_type: IMGT_CDR3 # what part of the sequence to import
                column_mapping: # column mapping VDJdb: immuneML
                    V: v_call
                    J: j_call
                    CDR3: sequence_aa
                    complex.id: sequence_id
                    Gene: chain
                metadata_column_mapping: # metadata column mapping VDJdb: immuneML
                    Epitope: epitope
                    Epitope gene: epitope_gene
                    Epitope species: epitope_species
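
To make the column renaming and the pairing on ‘complex.id’ more concrete, here is a minimal sketch using only the Python standard library. The file content is invented for illustration, and dropping complexes that lack one of the two chains is an assumption of this sketch, not a statement about how immuneML treats unpaired entries.

```python
import csv
import io
from collections import defaultdict

# Hypothetical tab-separated VDJdb-style export, for illustration only
vdjdb_text = (
    "complex.id\tGene\tCDR3\tV\tJ\tEpitope\n"
    "1\tTRA\tCAVRDSNYQLIW\tTRAV3\tTRAJ33\tNLVPMVATV\n"
    "1\tTRB\tCASSLGTDTQYF\tTRBV7-9\tTRBJ2-3\tNLVPMVATV\n"
    "2\tTRB\tCASSPGQGNEQFF\tTRBV5-1\tTRBJ2-1\tGILGFVFTL\n"
)

# Subsets of the default mappings listed above
column_mapping = {"V": "v_call", "J": "j_call", "CDR3": "sequence_aa", "Gene": "chain"}
metadata_column_mapping = {"Epitope": "epitope"}
all_columns = {**column_mapping, **metadata_column_mapping}

# Rename the columns and group the chains by their pairing identifier
chains_by_complex = defaultdict(dict)
for row in csv.DictReader(io.StringIO(vdjdb_text), delimiter="\t"):
    renamed = {new_name: row[old_name] for old_name, new_name in all_columns.items()}
    chains_by_complex[row["complex.id"]][renamed["chain"]] = renamed

# receptor_chains: TRA_TRB, so keep only complexes with both chains (assumed handling)
receptors = {
    complex_id: (chains["TRA"], chains["TRB"])
    for complex_id, chains in chains_by_complex.items()
    if "TRA" in chains and "TRB" in chains
}
```

In this example only complex 1 yields a receptor, and its epitope value would be available as a prediction label.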

Encodings#

Under the definitions/encodings component, the user can specify how to encode a given dataset. An encoding is a numerical data representation, which may be used as input for a machine learning algorithm.

AtchleyKmer#

Represents a repertoire through Atchley factors and relative abundance of k-mers. Should be used in combination with the AtchleyKmerMILClassifier.

For more details, see the original publication: Ostmeyer J, Christley S, Toby IT, Cowell LG. Biophysicochemical motifs in T cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocytes and adjacent healthy tissue. Cancer Res. Published online January 1, 2019:canres.2292.2018. doi:10.1158/0008-5472.CAN-18-2292 .

Note that sequences in the repertoire with length shorter than skip_first_n_aa + skip_last_n_aa + k will not be encoded.

Specification arguments:

  • k (int): k-mer length

  • skip_first_n_aa (int): number of amino acids to remove from the beginning of the receptor sequence

  • skip_last_n_aa (int): number of amino acids to remove from the end of the receptor sequence

  • abundance: how to compute abundance term for k-mers; valid values are RELATIVE_ABUNDANCE, TCRB_RELATIVE_ABUNDANCE.

  • normalize_all_features (bool): when normalizing features to have 0 mean and unit variance, this parameter indicates if the abundance feature should be included in the normalization

YAML specification:

definitions:
    encodings:
        my_encoder:
            AtchleyKmer:
                k: 4
                skip_first_n_aa: 3
                skip_last_n_aa: 3
                abundance: RELATIVE_ABUNDANCE
                normalize_all_features: False
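
The length rule in the note above follows directly from how the k-mers are cut. A minimal sketch, with a hypothetical function name and example sequence (the Atchley factor lookup and the abundance term are left out):

```python
def trimmed_kmers(sequence_aa, k, skip_first_n_aa, skip_last_n_aa):
    """Trim both ends of the sequence, then list all contiguous k-mers."""
    if len(sequence_aa) < skip_first_n_aa + skip_last_n_aa + k:
        return []  # too short: this sequence is not encoded
    core = sequence_aa[skip_first_n_aa:len(sequence_aa) - skip_last_n_aa]
    return [core[i:i + k] for i in range(len(core) - k + 1)]


kmers = trimmed_kmers("CASSLGTDTQYF", k=4, skip_first_n_aa=3, skip_last_n_aa=3)
print(kmers)  # ['SLGT', 'LGTD', 'GTDT']

# 9 amino acids < 3 + 3 + 4, so nothing is produced
too_short = trimmed_kmers("CASSLGTDT", k=4, skip_first_n_aa=3, skip_last_n_aa=3)
```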

CompAIRRDistance#

Encodes a given RepertoireDataset as a distance matrix, using the Morisita-Horn distance metric. Internally, CompAIRR is used for fast calculation of overlap between repertoires. This creates a pairwise distance matrix between each of the repertoires. The distance is calculated based on the number of matching receptor chain sequences between the repertoires. This matching may be defined to permit 1 or 2 mismatching amino acid/nucleotide positions and 1 indel in the sequence. Furthermore, matching may or may not include V and J gene information, and sequence frequencies may be included or ignored.

When mismatches (differences and indels) are allowed, the Morisita-Horn similarity may exceed 1. In this case, the Morisita-Horn distance (= 1 - similarity) is set to 0 to avoid negative distance scores.

Specification arguments:

  • compairr_path (Path): optional path to the CompAIRR executable. If not given, it is assumed that CompAIRR has been installed such that it can be called directly on the command line with the command ‘compairr’, or that it is located at /usr/local/bin/compairr.

  • keep_compairr_input (bool): whether to keep the input file that was passed to CompAIRR. This may take a lot of storage space if the input dataset is large. By default, the input file is not kept.

  • differences (int): Number of differences allowed between the sequences of two immune receptor chains. This may be between 0 and 2. By default, differences is 0.

  • indels (bool): Whether to allow an indel. This is only possible if differences is 1. By default, indels is False.

  • ignore_counts (bool): Whether to ignore the frequencies of the immune receptor chains. If False, frequencies will be included, meaning the ‘counts’ values for the receptors available in two repertoires are multiplied. If True, only the number of unique overlapping immune receptors (‘clones’) are considered. By default, ignore_counts is False.

  • ignore_genes (bool): Whether to ignore V and J gene information. If False, the V and J genes between two receptor chains have to match. If True, gene information is ignored. By default, ignore_genes is False.

  • threads (int): The number of threads to use for parallelization. Default is 8.

YAML specification:

definitions:
    encodings:
        my_distance_encoder:
            CompAIRRDistance:
                compairr_path: optional/path/to/compairr
                differences: 0
                indels: False
                ignore_counts: False
                ignore_genes: False
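
As an illustration of the distance itself, the following sketch computes the Morisita-Horn distance for exact sequence matches only (differences: 0, indels: False). It is written in plain Python for readability and does not call CompAIRR; the function name and the example repertoires are hypothetical, and non-empty repertoires are assumed.

```python
def morisita_horn_distance(counts_a, counts_b, ignore_counts=False):
    """counts_a, counts_b: {sequence: count} for two non-empty repertoires."""
    if ignore_counts:
        # only presence/absence of each clone is considered
        counts_a = {seq: 1 for seq in counts_a}
        counts_b = {seq: 1 for seq in counts_b}
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    # counts of sequences present in both repertoires are multiplied
    overlap = sum(count * counts_b[seq] for seq, count in counts_a.items() if seq in counts_b)
    simpson_a = sum(c * c for c in counts_a.values()) / total_a ** 2
    simpson_b = sum(c * c for c in counts_b.values()) / total_b ** 2
    similarity = 2 * overlap / ((simpson_a + simpson_b) * total_a * total_b)
    # clip at 0 so that a similarity above 1 never gives a negative distance
    return max(0.0, 1 - similarity)


rep_a = {"CASSLGTDTQYF": 2, "CASSPGQGNEQFF": 1}
rep_b = {"CASSLGTDTQYF": 2, "CASSPGQGNEQFF": 1}
rep_c = {"CAWSVGQYF": 4}

same = morisita_horn_distance(rep_a, rep_b)      # approximately 0
disjoint = morisita_horn_distance(rep_a, rep_c)  # 1.0, no shared sequences
```

With exact matching the similarity cannot exceed 1, so the clipping only matters once mismatches are allowed.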

CompAIRRSequenceAbundance#

This encoder works similarly to the SequenceAbundanceEncoder, but internally uses CompAIRR to accelerate core computations.

This encoder represents the repertoires as vectors where:

  • the first element corresponds to the number of label-associated clonotypes

  • the second element is the total number of unique clonotypes

To determine what clonotypes (amino acid sequences with or without matching V/J genes) are label-associated, Fisher’s exact test (one-sided) is used.

The encoder also writes out files containing the contingency table used for Fisher’s exact test, the resulting p-values, and the significantly abundant sequences (use RelevantSequenceExporter to export these sequences in AIRR format).

Reference: Emerson, Ryan O. et al. ‘Immunosequencing Identifies Signatures of Cytomegalovirus Exposure History and HLA-Mediated Effects on the T Cell Repertoire’. Nature Genetics 49, no. 5 (May 2017): 659–65. doi.org/10.1038/ng.3822.

Note: to use this encoder, it is necessary to explicitly define the positive class for the label when defining the label in the instruction. With positive class defined, it can then be determined which sequences are indicative of the positive class. See Reproduction of the CMV status predictions study for an example using SequenceAbundanceEncoder.

Specification arguments:

  • p_value_threshold (float): The p value threshold to be used by the statistical test.

  • compairr_path (Path): optional path to the CompAIRR executable. If not given, it is assumed that CompAIRR has been installed such that it can be called directly on the command line with the command ‘compairr’, or that it is located at /usr/local/bin/compairr.

  • ignore_genes (bool): Whether to ignore V and J gene information. If False, the V and J genes between two receptor chains have to match. If True, gene information is ignored. By default, ignore_genes is False.

  • sequence_batch_size (int): The number of sequences in a batch when comparing sequences across repertoires, typically 100s of thousands. This does not affect the results of the encoding, but may affect the speed and memory usage. The default value is 1,000,000.

  • threads (int): The number of threads to use for parallelization. This does not affect the results of the encoding, only the speed. The default number of threads is 8.

  • keep_temporary_files (bool): whether to keep temporary files, including CompAIRR input, output and log files, and the sequence presence matrix. This may take a lot of storage space if the input dataset is large. By default, temporary files are not kept.

YAML specification:

definitions:
    encodings:
        my_sa_encoding:
            CompAIRRSequenceAbundance:
                compairr_path: optional/path/to/compairr
                p_value_threshold: 0.05
                ignore_genes: False
                threads: 8
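
The statistical idea behind this encoding can be sketched without CompAIRR. The code below computes a one-sided Fisher’s exact test from the hypergeometric distribution and then builds the two-element vector for one repertoire. All names are hypothetical, and treating a sequence as label-associated when its p-value falls below p_value_threshold is an assumption of this sketch.

```python
from math import comb


def fisher_exact_one_sided(pos_with, pos_without, neg_with, neg_without):
    """P-value that a sequence is over-represented in the positive class.

    Arguments are the four cells of the contingency table: the number of
    positive/negative repertoires with/without the sequence.
    """
    n_pos = pos_with + pos_without
    n_with = pos_with + neg_with
    n_total = n_pos + neg_with + neg_without
    upper = min(n_pos, n_with)
    # sum the hypergeometric probabilities of the observed table and all more extreme ones
    tail = sum(comb(n_with, x) * comb(n_total - n_with, n_pos - x)
               for x in range(pos_with, upper + 1))
    return tail / comb(n_total, n_pos)


def encode_repertoire(repertoire_sequences, label_associated):
    """Return [number of label-associated clonotypes, number of unique clonotypes]."""
    unique = set(repertoire_sequences)
    return [len(unique & label_associated), len(unique)]


# Sequence seen in 3 of 3 positive and 0 of 3 negative repertoires
p_value = fisher_exact_one_sided(pos_with=3, pos_without=0, neg_with=0, neg_without=3)
print(p_value)  # 0.05

encoded = encode_repertoire(["A", "B", "B", "C"], label_associated={"A", "C", "D"})
print(encoded)  # [2, 3]
```

The example also shows why very small cohorts limit this approach: with three repertoires per class, even a perfectly separating sequence cannot reach a p-value below 0.05.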

DeepRC#

DeepRCEncoder should be used in combination with the DeepRC ML method (DeepRC). This encoder writes the data in a RepertoireDataset to .tsv files. For each repertoire, one .tsv file is created containing the amino acid sequences and the counts. Additionally, one metadata .tsv file is created, which describes the subset of repertoires that is encoded by a given instance of the DeepRCEncoder.

Note: for sequences where count is None, the count value will be set to 1.

YAML specification:

definitions:
    encodings:
        my_deeprc_encoder: DeepRC

Distance#

Encodes a given RepertoireDataset as a distance matrix, where the pairwise distance between each pair of repertoires is calculated. The distance is calculated based on the presence/absence of elements defined under attributes_to_match. Thus, if attributes_to_match contains only ‘sequence_aa’, the distance between two repertoires is minimal if they contain the same set of amino acid sequences, and maximal if none of the amino acid sequences are shared between the two repertoires.

Specification arguments:

  • distance_metric (DistanceMetricType): The metric used to calculate the distance between two repertoires. Valid values are: JACCARD, MORISITA_HORN. The default distance metric is JACCARD (inverse Jaccard).

  • sequence_batch_size (int): The number of sequences to be processed at once. Increasing this number increases the memory use. The default value is 1000.

  • attributes_to_match (list): The attributes to consider when determining whether a sequence is present in both repertoires. Only the fields defined under attributes_to_match will be considered, all other fields are ignored. Valid values are sequence_aa, sequence, v_call, j_call, chain, duplicate_count, region_type, frame_type, sequence_id, cell_id. The default value is [‘sequence_aa’].

YAML specification:

definitions:
    encodings:
        my_distance_encoder:
            Distance:
                distance_metric: JACCARD
                sequence_batch_size: 1000
                attributes_to_match:
                    - sequence_aa
                    - v_call
                    - j_call
                    - chain
                    - region_type
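
The role of attributes_to_match can be illustrated with a small sketch of a Jaccard distance (1 minus the Jaccard index) over the selected attributes. This is not immuneML’s implementation; the function name, the example sequences and the convention for two empty repertoires are assumptions.

```python
def jaccard_distance(repertoire_a, repertoire_b, attributes_to_match):
    """Each repertoire is a list of dicts, one dict per sequence."""
    def keys(repertoire):
        # a sequence is identified only by the attributes listed in attributes_to_match
        return {tuple(seq[attr] for attr in attributes_to_match) for seq in repertoire}

    set_a, set_b = keys(repertoire_a), keys(repertoire_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumed convention for two empty repertoires
    return 1 - len(set_a & set_b) / len(union)


rep_a = [
    {"sequence_aa": "CASSLGTDTQYF", "v_call": "TRBV7-9"},
    {"sequence_aa": "CASSPGQGNEQFF", "v_call": "TRBV5-1"},
]
rep_b = [{"sequence_aa": "CASSLGTDTQYF", "v_call": "TRBV7-2"}]

by_sequence = jaccard_distance(rep_a, rep_b, ["sequence_aa"])                 # 0.5
by_sequence_and_v = jaccard_distance(rep_a, rep_b, ["sequence_aa", "v_call"])  # 1.0
```

Adding v_call to attributes_to_match turns the shared amino acid sequence into a mismatch, because the two entries use different V genes.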

EvennessProfile#

The EvennessProfileEncoder class encodes a repertoire based on the clonal frequency distribution. The evenness for a given repertoire is defined as follows:

\[^{\alpha} \mathrm{E}(\mathrm{f})=\frac{\left(\sum_{\mathrm{i}=1}^{\mathrm{n}} \mathrm{f}_{\mathrm{i}}^{\alpha}\right)^{\frac{1}{1-\alpha}}}{\mathrm{n}}\]

That is, it is the exponential of Renyi entropy at a given alpha divided by the species richness, or number of unique sequences.

Reference: Greiff et al. (2015). A bioinformatic framework for immune repertoire diversity profiling enables detection of immunological status. Genome Medicine, 7(1), 49. doi.org/10.1186/s13073-015-0169-8

Specification arguments:

  • min_alpha (float): minimum alpha value to use

  • max_alpha (float): maximum alpha value to use

  • dimension (int): dimension of output evenness profile vector, or the number of alpha values to linearly space between min_alpha and max_alpha

YAML specification:

definitions:
    encodings:
        my_evenness_profile:
            EvennessProfile:
                min_alpha: 0
                max_alpha: 10
                dimension: 51
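
A minimal sketch of the evenness formula above, evaluated at dimension linearly spaced alpha values. The function name is hypothetical; the sketch assumes that the clonal frequencies are strictly positive and sum to 1, that dimension is at least 2, and it uses the exponential of the Shannon entropy as the limit at alpha = 1, where the formula itself is undefined.

```python
import math


def evenness_profile(frequencies, min_alpha, max_alpha, dimension):
    """Evenness at `dimension` alpha values linearly spaced in [min_alpha, max_alpha]."""
    n = len(frequencies)  # species richness: number of unique sequences
    step = (max_alpha - min_alpha) / (dimension - 1)
    profile = []
    for i in range(dimension):
        alpha = min_alpha + i * step
        if math.isclose(alpha, 1.0):
            # limit alpha -> 1: exponential of the Shannon entropy
            diversity = math.exp(-sum(f * math.log(f) for f in frequencies))
        else:
            diversity = sum(f ** alpha for f in frequencies) ** (1 / (1 - alpha))
        profile.append(diversity / n)
    return profile


uniform = evenness_profile([0.25, 0.25, 0.25, 0.25], min_alpha=0, max_alpha=10, dimension=51)
skewed = evenness_profile([0.7, 0.1, 0.1, 0.1], min_alpha=0, max_alpha=10, dimension=51)
```

A perfectly even repertoire has evenness 1 at every alpha, while a repertoire dominated by one clone drops below 1 as alpha increases.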

KmerAbundance#

This encoder is related to the SequenceAbundanceEncoder, but identifies label-associated subsequences (k-mers) instead of full label-associated sequences.

This encoder represents the repertoires as vectors where:

  • the first element corresponds to the number of label-associated k-mers found in a repertoire

  • the second element is the total number of unique k-mers per repertoire

The label-associated k-mers are determined based on a one-sided Fisher’s exact test.

The encoder also writes out files containing the contingency table used for Fisher’s exact test, the resulting p-values, and the significantly abundant k-mers.

Note: to use this encoder, it is necessary to explicitly define the positive class for the label when defining the label in the instruction. With positive class defined, it can then be determined which sequences are indicative of the positive class. See Reproduction of the CMV status predictions study for an example using SequenceAbundanceEncoder.

Specification arguments:

  • p_value_threshold (float): The p value threshold to be used by the statistical test.

  • sequence_encoding (SequenceEncodingType): The type of k-mers that are used. The simplest (default) sequence_encoding is CONTINUOUS_KMER, which uses contiguous subsequences of length k to represent the k-mers. When gapped k-mers are used (GAPPED_KMER), the k-mers may contain gaps with a size between min_gap and max_gap, and the k-mer length is defined as a combination of k_left and k_right. When IMGT k-mers are used (IMGT_CONTINUOUS_KMER, IMGT_GAPPED_KMER), IMGT positional information is taken into account (i.e. the same sequence in a different position is considered to be a different k-mer).

  • k (int): Length of the k-mer (number of amino acids) when ungapped k-mers are used. The default value for k is 3.

  • k_left (int): When gapped k-mers are used, k_left indicates the length of the k-mer left of the gap. The default value for k_left is 1.

  • k_right (int): Same as k_left, but k_right determines the length of the k-mer right of the gap. The default value for k_right is 1.

  • min_gap (int): Minimum gap size when gapped k-mers are used. The default value for min_gap is 0.

  • max_gap (int): Maximum gap size when gapped k-mers are used. The default value for max_gap is 0.

YAML specification:

definitions:
    encodings:
        my_ka_encoding:
            KmerAbundance:
                p_value_threshold: 0.05
                threads: 8
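
How k_left, k_right, min_gap and max_gap interact for gapped k-mers can be sketched as follows. The function name is hypothetical, and writing gap positions as “.” is a notation chosen for this example, not necessarily the representation immuneML uses internally.

```python
def gapped_kmers(sequence, k_left, k_right, min_gap, max_gap):
    """List all gapped k-mers for every gap size from min_gap to max_gap (inclusive)."""
    kmers = []
    for gap in range(min_gap, max_gap + 1):
        span = k_left + gap + k_right  # total window covered by one gapped k-mer
        for start in range(len(sequence) - span + 1):
            left = sequence[start:start + k_left]
            right = sequence[start + k_left + gap:start + span]
            kmers.append(left + "." * gap + right)  # "." marks a gap position
    return kmers


kmers = gapped_kmers("CASSL", k_left=1, k_right=1, min_gap=0, max_gap=1)
print(kmers)  # ['CA', 'AS', 'SS', 'SL', 'C.S', 'A.S', 'S.L']
```

With min_gap and max_gap both 0, the result reduces to contiguous k-mers of length k_left + k_right.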

KmerFrequency#

The KmerFrequencyEncoder class encodes a repertoire, sequence or receptor by frequencies of k-mers it contains. A k-mer is a sequence of letters of length k into which an immune receptor sequence can be decomposed. K-mers can be defined in different ways, as determined by the sequence_encoding.

Specification arguments:

  • sequence_encoding (SequenceEncodingType): The type of k-mers that are used. The simplest sequence_encoding is CONTINUOUS_KMER, which uses contiguous subsequences of length k to represent the k-mers. When gapped k-mers are used (GAPPED_KMER), the k-mers may contain gaps with a size between min_gap and max_gap, and the k-mer length is defined as a combination of k_left and k_right. When IMGT k-mers are used (IMGT_CONTINUOUS_KMER, IMGT_GAPPED_KMER), IMGT positional information is taken into account (i.e. the same sequence in a different position is considered to be a different k-mer). When the identity representation is used (IDENTITY), the k-mers just correspond to the original sequences.

  • normalization_type (NormalizationType): The way in which the k-mer frequencies should be normalized. The default value for normalization_type is l2.

  • reads (ReadsType): The reads type specifies whether the counts of the sequences in the repertoire will be taken into account. If UNIQUE, only unique sequences (clonotypes) are encoded, and if ALL, the sequence ‘count’ value is taken into account when determining the k-mer frequency. The default value for reads is unique.

  • k (int): Length of the k-mer (number of amino acids) when ungapped k-mers are used. The default value for k is 3.

  • k_left (int): When gapped k-mers are used, k_left indicates the length of the k-mer left of the gap. The default value for k_left is 1.

  • k_right (int): Same as k_left, but k_right determines the length of the k-mer right of the gap. The default value for k_right is 1.

  • min_gap (int): Minimum gap size when gapped k-mers are used. The default value for min_gap is 0.

  • max_gap (int): Maximum gap size when gapped k-mers are used. The default value for max_gap is 0.

  • sequence_type (str): Whether to work with nucleotide or amino acid sequences. Amino acid sequences are the default. To work with either sequence type, the sequences of the desired type should be included in the datasets, e.g., listed under ‘columns_to_load’ parameter. By default, both types will be included if available. Valid values are: AMINO_ACID and NUCLEOTIDE.

  • scale_to_unit_variance (bool): whether to scale the design matrix after normalization to have unit variance per feature. Setting this argument to True might improve the subsequent classifier’s performance depending on the type of the classifier. The default value for scale_to_unit_variance is true.

  • scale_to_zero_mean (bool): whether to scale the design matrix after normalization to have zero mean per feature. Setting this argument to True might improve the subsequent classifier’s performance depending on the type of the classifier. However, if the original design matrix was sparse, setting this argument to True will destroy the sparsity and will increase the memory consumption. The default value for scale_to_zero_mean is false.

YAML specification:

definitions:
    encodings:
        my_continuous_kmer:
            KmerFrequency:
                normalization_type: RELATIVE_FREQUENCY
                reads: UNIQUE
                sequence_encoding: CONTINUOUS_KMER
                sequence_type: NUCLEOTIDE
                k: 3
                scale_to_unit_variance: True
                scale_to_zero_mean: True
        my_gapped_kmer:
            KmerFrequency:
                normalization_type: RELATIVE_FREQUENCY
                reads: UNIQUE
                sequence_encoding: GAPPED_KMER
                sequence_type: AMINO_ACID
                k_left: 2
                k_right: 2
                min_gap: 1
                max_gap: 3
                scale_to_unit_variance: True
                scale_to_zero_mean: False
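
The k-mer decomposition described above can be illustrated with a minimal Python sketch. This is not immuneML's implementation: the function names, the use of '.' to mark gap positions, and the (sequence, count) input format are assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def continuous_kmers(sequence, k=3):
    """Contiguous subsequences of length k (the CONTINUOUS_KMER idea)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def gapped_kmers(sequence, k_left=1, k_right=1, min_gap=0, max_gap=0):
    """k_left letters, a gap of min_gap..max_gap positions, then k_right letters.

    Writing the gap as '.' characters is a choice made for this sketch only.
    """
    kmers = []
    for gap in range(min_gap, max_gap + 1):
        span = k_left + gap + k_right
        for start in range(len(sequence) - span + 1):
            left = sequence[start:start + k_left]
            right = sequence[start + k_left + gap:start + span]
            kmers.append(left + "." * gap + right)
    return kmers


def kmer_counts(sequences, reads="unique", k=3):
    """Count k-mers over (sequence, count) pairs.

    With reads='all' each k-mer is weighted by the sequence count,
    with reads='unique' every sequence contributes once.
    """
    counts = Counter()
    for sequence, count in sequences:
        weight = count if reads == "all" else 1
        for kmer in continuous_kmers(sequence, k):
            counts[kmer] += weight
    return counts


def l2_normalize(counts):
    """Scale the count vector to unit Euclidean length."""
    norm = sqrt(sum(v * v for v in counts.values()))
    return {kmer: v / norm for kmer, v in counts.items()} if norm else dict(counts)


print(continuous_kmers("CASTTY", k=3))
print(gapped_kmers("CASS", k_left=1, k_right=1, min_gap=1, max_gap=1))
```

The sketch leaves out the IMGT-positional variants and the scale_to_unit_variance / scale_to_zero_mean steps, which operate on the full design matrix rather than on a single example.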

MatchedReceptors#

Encodes the dataset based on the matches between a dataset containing unpaired (single chain) data, and a paired reference receptor dataset. For each paired reference receptor, the frequency of either chain in the dataset is counted.

This encoding can be used in combination with the Matches report.

When sum_matches and normalize are set to True, this encoder behaves similarly as described in: Yao, Y. et al. ‘T cell receptor repertoire as a potential diagnostic marker for celiac disease’. Clinical Immunology Volume 222 (January 2021): 108621. doi.org/10.1016/j.clim.2020.108621 with the only exception being that this encoder uses paired receptors, while the original publication used single sequences (see also: MatchedSequences encoder).

Specification arguments:

  • reference (dict): A dictionary describing the reference dataset file. Import should be specified the same way as regular dataset import. It is only allowed to import a receptor dataset here (i.e., is_repertoire is False and paired is True by default, and these are not allowed to be changed).

  • max_edit_distances (dict): A dictionary specifying the maximum edit distance between a target sequence (from the repertoire) and the reference sequence. A maximum distance can be specified per chain, for example to allow for less strict matching of TCR alpha and BCR light chains. When only an integer is specified, this distance is applied to all possible chains.

  • reads (ReadsType): The reads type signifies whether the counts of the sequences in the repertoire are taken into account. If UNIQUE, only unique sequences (clonotypes) are counted, and if ALL, the sequence ‘count’ value is summed when determining the number of matches. The default value for reads is all.

  • sum_matches (bool): When sum_matches is False, the resulting encoded data matrix contains multiple columns with the number of matches per reference receptor chain. When sum_matches is true, the columns representing each of the two chains are summed together, meaning that there are only two aggregated sums of matches (one per chain) per repertoire in the encoded data. To use this encoder in combination with the Matches report, sum_matches must be set to False. When sum_matches is set to True, this encoder behaves similarly to the encoder described by Yao, Y. et al. By default, sum_matches is False.

  • normalize (bool): If True, the chain matches are divided by the total number of unique receptors in the repertoire (when reads = unique) or the total number of reads in the repertoire (when reads = all).

YAML specification:

definitions:
    encodings:
        my_mr_encoding:
            MatchedReceptors:
                reference:
                    format: VDJDB
                    params:
                        path: path/to/file.txt
                max_edit_distances:
                    alpha: 1
                    beta: 0

MatchedRegex#

Encodes the dataset based on the matches between a RepertoireDataset and a collection of regular expressions. For each regular expression, the number of sequences in the RepertoireDataset containing the expression is counted. This can also be used to count how often a subsequence occurs in a RepertoireDataset.

The regular expressions are defined per chain, and it is possible to require a V gene match in addition to the CDR3 sequence containing the regular expression.

This encoding can be used in combination with the Matches report.

Specification arguments:

  • match_v_genes (bool): Whether V gene matches are required. If this is True, a match is only counted if the V gene matches the gene specified in the motif input file. By default match_v_genes is False.

  • reads (ReadsType): The reads type signifies whether the counts of the sequences in the repertoire are taken into account. If UNIQUE, only unique sequences (clonotypes) are counted, and if ALL, the sequence ‘count’ value is summed when determining the number of matches. The default value for reads is all.

  • motif_filepath (str): The path to the motif input file. This should be a tab separated file containing a column named ‘id’ and for every chain that should be matched a column containing the regex (<chain>_regex) and a column containing the V gene (<chain>V) if match_v_genes is True. The chains are specified by their three-letter code, see Chain.

In the simplest case, when counting the number of occurrences of a given list of k-mers in TRB sequences, the contents of the motif file could look like this:

    id    TRB_regex
    1     ACG
    2     EDNA
    3     DFWG

It is also possible to test whether paired regular expressions occur in the dataset (for example: regular expressions matching both a TRA chain and a TRB chain) by specifying them on the same line. In a more complex case where both paired and unpaired regular expressions are specified, in addition to matching the V genes, the contents of the motif file could look like this:

    id    TRA_regex    TRAV      TRB_regex    TRBV
    1     AGQ.GSS      TRAV35    S[APL]GQY    TRBV29-1
    2                            ASS.R.*      TRBV7-3

YAML specification:

definitions:
    encodings:
        my_mr_encoding:
            MatchedRegex:
                motif_filepath: path/to/file.txt
                match_v_genes: True
                reads: unique
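
As a rough illustration of how such a motif file could be applied, the sketch below counts, for each regular expression, the sequences that contain it. It is not the immuneML implementation: the function name, the in-memory (sequence, count) representation of a repertoire, and the omission of V gene and paired-chain matching are all simplifying assumptions.

```python
import csv
import io
import re


def count_regex_matches(motif_tsv, sequences, chain="TRB", reads="all"):
    """Count sequences containing each regex from a tab-separated motif table.

    motif_tsv: text with an 'id' column and a '<chain>_regex' column.
    sequences: list of (sequence, count) pairs for one repertoire (assumed format).
    """
    matches = {}
    for row in csv.DictReader(io.StringIO(motif_tsv), delimiter="\t"):
        pattern = re.compile(row[f"{chain}_regex"])
        matches[row["id"]] = sum(
            (count if reads == "all" else 1)
            for sequence, count in sequences
            if pattern.search(sequence)  # 'contains' semantics, not a full match
        )
    return matches


motif_file = "id\tTRB_regex\n1\tACG\n2\tEDNA\n3\tDFWG\n"
repertoire = [("CACGF", 3), ("CEDNAF", 1), ("CASSF", 7)]
print(count_regex_matches(motif_file, repertoire, reads="all"))
```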

MatchedSequences#

Encodes the dataset based on the matches between a RepertoireDataset and a reference sequence dataset.

This encoding can be used in combination with the Matches report.

When sum_matches and normalize are set to True, this encoder behaves as described in: Yao, Y. et al. ‘T cell receptor repertoire as a potential diagnostic marker for celiac disease’. Clinical Immunology Volume 222 (January 2021): 108621. doi.org/10.1016/j.clim.2020.108621

Specification arguments:

  • reference (dict): A dictionary describing the reference dataset file. Import should be specified the same way as regular dataset import. It is only allowed to import a sequence dataset here (i.e., is_repertoire and paired are False by default, and are not allowed to be set to True).

  • max_edit_distance (int): The maximum edit distance between a target sequence (from the repertoire) and the reference sequence.

  • reads (ReadsType): The reads type signifies whether the counts of the sequences in the repertoire are taken into account. If UNIQUE, only unique sequences (clonotypes) are counted, and if ALL, the sequence ‘count’ value is summed when determining the number of matches. The default value for reads is all.

  • sum_matches (bool): When sum_matches is False, the resulting encoded data matrix contains multiple columns with the number of matches per reference sequence. When sum_matches is true, all columns are summed together, meaning that there is only one aggregated sum of matches per repertoire in the encoded data. To use this encoder in combination with the Matches report, sum_matches must be set to False. When sum_matches is set to True, this encoder behaves as described by Yao, Y. et al. By default, sum_matches is False.

  • normalize (bool): If True, the sequence matches are divided by the total number of unique sequences in the repertoire (when reads = unique) or the total number of reads in the repertoire (when reads = all).

YAML specification:

definitions:
    encodings:
        my_ms_encoding:
            MatchedSequences:
                reference:
                    format: VDJDB
                    params:
                        path: path/to/file.txt
                max_edit_distance: 1
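
The interplay of max_edit_distance, reads, sum_matches and normalize can be sketched as follows. This is an illustrative approximation, not immuneML code: the function names and the (sequence, count) repertoire format are hypothetical, and the edit distance is taken to be the Levenshtein distance.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def encode_matches(repertoire, reference, max_edit_distance=0,
                   reads="all", sum_matches=False, normalize=False):
    """repertoire: list of (sequence, count); reference: list of sequences."""
    def weight(count):
        return count if reads == "all" else 1

    per_reference = [
        sum(weight(count) for sequence, count in repertoire
            if edit_distance(sequence, ref) <= max_edit_distance)
        for ref in reference
    ]
    values = [sum(per_reference)] if sum_matches else per_reference
    if normalize:
        # total reads (reads='all') or total unique sequences (reads='unique')
        total = sum(weight(count) for _, count in repertoire)
        values = [value / total for value in values]
    return values


repertoire = [("CASSL", 5), ("CASSF", 2), ("CAWWW", 3), ("CQQQQ", 10)]
reference = ["CASSL", "CAWWQ"]
print(encode_matches(repertoire, reference, max_edit_distance=1))
print(encode_matches(repertoire, reference, max_edit_distance=1,
                     sum_matches=True, normalize=True))
```

With sum_matches set to False the result has one column per reference sequence; with sum_matches set to True it collapses to a single value per repertoire.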

Motif#

This encoder enumerates every possible positional motif, and keeps only the motifs associated with the positive class. A ‘motif’ is defined as a combination of position-specific amino acids. These motifs may contain one or multiple gaps. Motifs are filtered out based on a minimal precision and recall threshold for predicting the positive class.

Note: the MotifEncoder can only be used for sequences of the same length.

The ideal recall threshold(s) given a user-defined precision threshold can be calibrated using the MotifGeneralizationAnalysis report. It is recommended to first run this report in ExploratoryAnalysisInstruction before using this encoder for ML.

This encoder can be used in combination with the BinaryFeatureClassifier in order to learn a minimal set of compatible motifs for predicting the positive class. Alternatively, it may be combined with scikit-learn methods, such as for example LogisticRegression, to learn a weight per motif.

Specification arguments:

  • max_positions (int): The maximum motif size. This is the number of positional amino acids the motif consists of (excluding gaps). The default value for max_positions is 4.

  • min_positions (int): The minimum motif size (see also: max_positions). The default value for min_positions is 1.

  • min_precision (float): The minimum precision threshold for keeping a motif. The default value for min_precision is 0.8.

  • min_recall (float): The minimum recall threshold for keeping a motif. The default value for min_recall is 0. It is also possible to specify a recall threshold for each motif size. In this case, a dictionary must be specified where the motif sizes are keys and the recall values are values. Use the MotifGeneralizationAnalysis report to calibrate the optimal recall threshold given a user-defined precision threshold to ensure generalisability to unseen data.

  • min_true_positives (int): The minimum number of true positive sequences that a motif needs to occur in. The default value for min_true_positives is 10.

  • candidate_motif_filepath (str): Optional filepath for pre-filtered candidate motifs. This may be used to save time. Only the given candidate motifs are considered. When this encoder has been run previously, a candidate motifs file named ‘all_candidate_motifs.tsv’ will have been exported. This file contains all possible motifs with high enough min_true_positives without applying precision and recall thresholds. The file must be a tab-separated file, structured as follows:

    indices    amino_acids
    1&2&3      A&G&C
    5&7        E&D

    The example above contains two motifs: AGC in positions 1, 2 and 3, and E-D in positions 5 and 7 (with a gap at position 6).

  • label (str): The name of the binary label to train the encoder for. This is only necessary when the dataset contains multiple labels.

YAML specification:

definitions:
    encodings:
        my_motif_encoder:
            MotifEncoder:
                max_positions: 4
                min_precision: 0.8
                min_recall:  # different recall thresholds for each motif size
                    1: 0.5   # For shorter motifs, a stricter recall threshold is used
                    2: 0.1
                    3: 0.01
                    4: 0.001
                min_true_positives: 10
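
To make the filtering criteria concrete, the sketch below evaluates a positional motif against labeled sequences of equal length and reports its precision, recall and number of true positives. The helper names are hypothetical, and treating the indices as 1-based is an assumption of this sketch rather than a documented property of the candidate motif file.

```python
def parse_motif(indices, amino_acids):
    """Turn '5&7' and 'E&D' into [(5, 'E'), (7, 'D')]."""
    return [(int(i), aa) for i, aa in zip(indices.split("&"), amino_acids.split("&"))]


def motif_matches(sequence, motif):
    """True if every (position, amino acid) pair occurs in the sequence.

    Positions are treated as 1-based here (an assumption of this sketch).
    """
    return all(sequence[position - 1] == aa for position, aa in motif)


def precision_recall(motif, sequences, labels):
    """Return (precision, recall, true_positives) for predicting the positive class."""
    predicted = [motif_matches(sequence, motif) for sequence in sequences]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, tp


sequences = ["AGCKEFD", "AGCKKKK", "TTTTEFD", "TTTTTTT"]
labels = [True, True, False, False]
print(precision_recall(parse_motif("1&2&3", "A&G&C"), sequences, labels))
print(precision_recall(parse_motif("5&7", "E&D"), sequences, labels))
```

A motif would then be kept only if its precision, recall and true positive count reach min_precision, the min_recall value for its size, and min_true_positives respectively.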

OneHot#

One-hot encoding for repertoires, sequences or receptors. In one-hot encoding, each alphabet character (amino acid or nucleotide) is replaced by a sparse vector with one 1 and the rest zeroes. The position of the 1 represents the alphabet character.

Specification arguments:

  • use_positional_info (bool): whether to include features representing the positional information. If True, three additional feature vectors will be added, representing the sequence start, sequence middle and sequence end. The values in these features are scaled between 0 and 1. A graphical representation of the values of these vectors is given below.

  Value of sequence start:         Value of sequence middle:        Value of sequence end:

1 \                              1    /‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾\         1                          /
   \                                 /                   \                                  /
    \                               /                     \                                /
0    \_____________________      0 /                       \      0  _____________________/
  <----sequence length---->        <----sequence length---->         <----sequence length---->
  • distance_to_seq_middle (int): only applies when use_positional_info is True. This is the distance from the edge of the CDR3 sequence (IMGT positions 105 and 117) to the portion of the sequence that is considered ‘middle’. For example: if distance_to_seq_middle is 6 (default), all IMGT positions in the interval [111, 112) receive positional value 1. When using nucleotide sequences: note that the distance is measured in (amino acid) IMGT positions. If the complete sequence length is smaller than 2 * distance_to_seq_middle, the minimum value of the ‘start’ and ‘end’ vectors will not reach 0, and the maximum value of the ‘middle’ vector will not reach 1. A graphical representation of the positional vectors with a too short sequence is given below:

Value of sequence start         Value of sequence middle        Value of sequence end:
with very short sequence:       with very short sequence:       with very short sequence:

     1 \                               1                                 1    /
        \                                                                    /
         \                                /\                                /
     0                                 0 /  \                            0
       <->                               <-->                               <->
  • flatten (bool): whether to flatten the final one-hot matrix to a 2-dimensional matrix [examples, other_dims_combined]. This must be set to True when using one-hot encoding in combination with scikit-learn ML methods (inheriting SklearnMethod), such as LogisticRegression, SVM, SVC, RandomForestClassifier and KNN.

  • sequence_type: whether to use nucleotide or amino acid sequence for encoding. Valid values are ‘nucleotide’ and ‘amino_acid’.

YAML specification:

definitions:
    encodings:
        one_hot_vanilla:
            OneHot:
                use_positional_info: False
                flatten: False
                sequence_type: amino_acid

        one_hot_positional:
            OneHot:
                use_positional_info: True
                distance_to_seq_middle: 3
                flatten: False
                sequence_type: nucleotide
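
A minimal sketch of plain one-hot encoding and of what flatten does to a single example is given below. The alphabet ordering and function names are assumptions for illustration, and the positional features enabled by use_positional_info are left out.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # ordering chosen for this sketch only


def one_hot(sequence, alphabet=AMINO_ACIDS):
    """Return a [sequence_length x alphabet_size] matrix with a single 1 per row."""
    index = {char: i for i, char in enumerate(alphabet)}
    matrix = [[0] * len(alphabet) for _ in sequence]
    for position, char in enumerate(sequence):
        matrix[position][index[char]] = 1
    return matrix


def flatten(matrix):
    """Concatenate all rows into one feature vector for a single example."""
    return [value for row in matrix for value in row]


encoded = one_hot("CAS")
print(len(encoded), len(encoded[0]))  # rows = positions, columns = alphabet letters
print(len(flatten(encoded)))          # one flat vector per example
```

Flattening every example in this way yields the 2-dimensional [examples, other_dims_combined] layout that scikit-learn style estimators expect.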

SequenceAbundance#

This encoder represents the repertoires as vectors where:

  • the first element corresponds to the number of label-associated clonotypes

  • the second element is the total number of unique clonotypes

To determine what clonotypes (with features defined by comparison_attributes) are label-associated, one-sided Fisher’s exact test is used.

The encoder also writes out files containing the contingency table used for Fisher’s exact test, the resulting p-values, and the significantly abundant sequences (use RelevantSequenceExporter to export these sequences in AIRR format).

Reference: Emerson, Ryan O. et al. ‘Immunosequencing Identifies Signatures of Cytomegalovirus Exposure History and HLA-Mediated Effects on the T Cell Repertoire’. Nature Genetics 49, no. 5 (May 2017): 659–65. doi.org/10.1038/ng.3822.

Note: to use this encoder, it is necessary to explicitly define the positive class for the label when defining the label in the instruction. With the positive class defined, it can then be determined which sequences are indicative of the positive class. For a full example of using this encoder, see Reproduction of the CMV status predictions study.

Specification arguments:

  • comparison_attributes (list): The attributes to be considered to group receptors into clonotypes. Only the fields specified in comparison_attributes will be considered, all other fields are ignored. Valid values are sequence_aa, sequence, v_call, j_call, chain, duplicate_count, region_type, frame_type, sequence_id, cell_id.

  • p_value_threshold (float): The p value threshold to be used by the statistical test.

  • sequence_batch_size (int): The number of sequences in a batch when comparing sequences across repertoires, typically 100s of thousands. This does not affect the results of the encoding, only the speed. The default value is 1,000,000.

  • repertoire_batch_size (int): How many repertoires will be loaded at once. This does not affect the result of the encoding, only the speed. This value is a trade-off between the number of repertoires that can fit in RAM at one time and the loading time from disk.

YAML specification:

definitions:
    encodings:
        my_sa_encoding:
            SequenceAbundance:
                comparison_attributes:
                    - sequence_aa
                    - v_call
                    - j_call
                    - chain
                    - region_type
                p_value_threshold: 0.05
                sequence_batch_size: 100000
                repertoire_batch_size: 32
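
The statistical step behind this encoding can be sketched with the standard library alone. The code below computes a one-sided Fisher's exact test from a 2x2 contingency table (clonotype present or absent versus positive or negative repertoires) and then builds the two-element repertoire vector. All function names are hypothetical, and whether the comparison with p_value_threshold is strict is an assumption of this sketch.

```python
from math import comb


def fisher_exact_greater(pos_with, pos_without, neg_with, neg_without):
    """One-sided p-value: probability of seeing at least pos_with positive
    repertoires containing the clonotype under the hypergeometric null."""
    positives = pos_with + pos_without
    with_clonotype = pos_with + neg_with
    total = positives + neg_with + neg_without
    upper = min(positives, with_clonotype)
    favourable = sum(
        comb(positives, k) * comb(total - positives, with_clonotype - k)
        for k in range(pos_with, upper + 1)
    )
    return favourable / comb(total, with_clonotype)


def label_associated(tables, p_value_threshold):
    """tables: {clonotype: (pos_with, pos_without, neg_with, neg_without)}.
    Strict '<' is an assumption made here."""
    return {clonotype for clonotype, table in tables.items()
            if fisher_exact_greater(*table) < p_value_threshold}


def encode_repertoire(clonotypes, associated):
    """[number of label-associated clonotypes, total number of unique clonotypes]."""
    unique = set(clonotypes)
    return [len(unique & associated), len(unique)]


tables = {"CASSL": (3, 0, 0, 3), "CASSF": (2, 1, 1, 2)}
associated = label_associated(tables, p_value_threshold=0.1)
print(encode_repertoire(["CASSL", "CASSF", "CAWWW", "CASSL"], associated))
```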

SimilarToPositiveSequence#

A simple baseline encoding, to be used in combination with BinaryFeatureClassifier using keep_all = True. This encoder keeps track of all positive sequences in the training set, and ignores the negative sequences. Any sequence within a given Hamming distance from a positive training sequence will be classified positive; all other sequences will be classified negative.

Specification arguments:

  • hamming_distance (int): Maximum number of differences allowed between any positive sequence of the training set and a new observed sequence in order for the observed sequence to be classified as ‘positive’.

  • compairr_path (Path): optional path to the CompAIRR executable. If not given, it is assumed that CompAIRR has been installed such that it can be called directly on the command line with the command ‘compairr’, or that it is located at /usr/local/bin/compairr.

  • ignore_genes (bool): Only used when CompAIRR is used. Whether to ignore V and J gene information. If False, the V and J genes between two sequences have to match for the sequence to be considered ‘similar’. If True, gene information is ignored. By default, ignore_genes is False.

  • threads (int): The number of threads to use for parallelization. This does not affect the results of the encoding, only the speed. The default number of threads is 8.

  • keep_temporary_files (bool): whether to keep temporary files, including CompAIRR input, output and log files, and the sequence presence matrix. This may take a lot of storage space if the input dataset is large. By default temporary files are not kept.

YAML specification:

definitions:
    encodings:
        my_sequence_encoder:
            SimilarToPositiveSequenceEncoder:
                hamming_distance: 2
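
The rule this encoder implements can be written down in a few lines. The following sketch is a plain-Python illustration (immuneML may delegate the search to CompAIRR, which is not modelled here); the function names are hypothetical and V/J gene matching is ignored.

```python
def hamming(a, b):
    """Number of mismatching positions; only defined for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def is_similar_to_positive(sequence, positive_sequences, hamming_distance):
    """True if any positive training sequence of the same length differs
    from the given sequence in at most hamming_distance positions."""
    return any(
        len(positive) == len(sequence) and hamming(sequence, positive) <= hamming_distance
        for positive in positive_sequences
    )


positives = ["CASSL", "CAWWQ"]
print(is_similar_to_positive("CASSF", positives, hamming_distance=1))
print(is_similar_to_positive("CAAAF", positives, hamming_distance=1))
```

The result is a single binary feature per sequence, which is why the classifier is expected to run with keep_all set to true.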

TCRdist#

Encodes the given ReceptorDataset as a distance matrix between all receptors, where the distance is computed using TCRdist from the paper: Dash P, Fiore-Gartland AJ, Hertz T, et al. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature. 2017; 547(7661):89-93. doi:10.1038/nature22383.

For the implementation, TCRdist3 library was used (source code available here).

Specification arguments:

  • cores (int): number of processes to use for the computation

YAML specification:

definitions:
    encodings:
        my_tcr_dist_enc:
            TCRdist:
                cores: 4

Word2Vec#

Word2VecEncoder learns the vector representations of k-mers based on the context (receptor sequence). It works for sequence and repertoire datasets. A similar idea was discussed in: Ostrovsky-Berman, M., Frankel, B., Polak, P. & Yaari, G. Immune2vec: Embedding B/T Cell Receptor Sequences in ℝN Using Natural Language Processing. Frontiers in Immunology 12, (2021).

This encoder relies on gensim’s implementation of Word2Vec and KmerHelper for k-mer extraction. Currently, it works at the amino acid level.

Specification arguments:

  • vector_size (int): The size of the vector to be learnt.

  • model_type (ModelType): The context which will be used to infer the representation of the sequence. If SEQUENCE is used, the context of a k-mer is defined by the sequence it occurs in (e.g. if the sequence is CASTTY and the k-mer is AST, then its context consists of k-mers CAS, STT, TTY). If KMER_PAIR is used, the context for the k-mer is defined as all the k-mers that are within an edit distance of one (e.g. for k-mer CAS, the context includes CAA, CAC, CAD etc.). Valid values are SEQUENCE, KMER_PAIR.

  • k (int): The length of the k-mers used for the encoding.

  • epochs (int): for how many epochs to train the word2vec model for a given set of sentences (corresponding to epochs parameter in gensim package)

  • window (int): max distance between two k-mers in a sequence (same as window parameter in gensim’s word2vec)

YAML specification:

definitions:
    encodings:
        my_w2v:
            Word2Vec:
                vector_size: 16
                k: 3
                model_type: SEQUENCE
                epochs: 100
                window: 8
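
The two context definitions for model_type can be illustrated without gensim. The sketch below only builds the contexts and does not train any embedding; the function names are hypothetical, the window parameter is ignored, and restricting KMER_PAIR to single substitutions is an assumption based on the CAA, CAC, CAD example above.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_context(sequence, kmer, k=3):
    """SEQUENCE: the other k-mers from the sequence the k-mer occurs in."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return [other for other in kmers if other != kmer]


def kmer_pair_context(kmer, alphabet=AMINO_ACIDS):
    """KMER_PAIR: all k-mers reachable by changing exactly one letter
    (substitutions only, which is an assumption of this sketch)."""
    context = []
    for position, original in enumerate(kmer):
        for letter in alphabet:
            if letter != original:
                context.append(kmer[:position] + letter + kmer[position + 1:])
    return context


print(sequence_context("CASTTY", "AST"))
print(len(kmer_pair_context("CAS")))
```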

ML methods#

Under the definitions/ml_methods component, the user can specify different ML methods to use on a given (encoded) dataset.

From version 3, immuneML includes different types of ML methods:

Note

Clustering methods, Generative models and Dimensionality reduction methods are experimental features.

Classifiers#

ML method classifiers are algorithms which can be trained to predict some label on immune repertoires, receptors or sequences.

These methods can be trained using the TrainMLModel instruction, and previously trained models can be applied to new data using the MLApplication instruction.

When choosing which ML method(s) are most suitable for your use-case, please consider the following table:

ML methods properties#

ML method                        binary classification    multi-class classification    sequence dataset    receptor dataset    repertoire dataset    model selection CV
AtchleyKmerMILClassifier
BinaryFeatureClassifier
DeepRC
KNN
KerasSequenceCnn
LogisticRegression
PrecomputedKNN
ProbabalisticBinaryClassifier
RandomForestClassifier
ReceptorCNN
SVC
SVM
TCRdistClassifier

AtchleyKmerMILClassifier#

A binary Repertoire classifier which uses the data encoded by AtchleyKmer encoder to predict the repertoire label.

The original publication: Ostmeyer J, Christley S, Toby IT, Cowell LG. Biophysicochemical motifs in T cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocytes and adjacent healthy tissue. Cancer Res. Published online January 1, 2019:canres.2292.2018. doi:10.1158/0008-5472.CAN-18-2292 .

Specification arguments:

  • iteration_count (int): max number of training iterations

  • threshold (float): loss threshold at which to stop training if reached

  • evaluate_at (int): log model performance every ‘evaluate_at’ iterations and store the model every ‘evaluate_at’ iterations if early stopping is used

  • use_early_stopping (bool): whether to use early stopping

  • learning_rate (float): learning rate for stochastic gradient descent

  • random_seed (int): random seed used

  • zero_abundance_weight_init (bool): whether to use 0 as initial weight for abundance term (if not, a random value is sampled from a normal distribution with mean 0 and variance 1 / total_number_of_features)

  • number_of_threads (int): number of threads to be used for training

  • initialization_count (int): how many times to repeat the fitting procedure from the beginning before choosing the optimal model (trains the model with multiple random initializations)

  • pytorch_device_name (str): The name of the pytorch device to use. This name will be passed to torch.device(pytorch_device_name).

YAML specification:

definitions:
    ml_methods:
        my_kmer_mil_classifier:
            AtchleyKmerMILClassifier:
                iteration_count: 100
                evaluate_at: 15
                use_early_stopping: False
                learning_rate: 0.01
                random_seed: 100
                zero_abundance_weight_init: True
                number_of_threads: 8
                threshold: 0.00001
                initialization_count: 4

BinaryFeatureClassifier#

A simple classifier that takes in encoded data containing features with only 1/0 or True/False values.

This classifier gives a positive prediction if any of the binary features for an example are ‘true’. Optionally, the classifier can select an optimal subset of these features. In this case, the given data is split into a training and validation set, a minimal set of features is learned through greedy forward selection, and the validation set is used to determine when to stop growing the set of features (early stopping). Early stopping is reached when the optimization metric on the validation set no longer improves for a given number of features (patience). The optimization metric is the same metric as the one used for optimization in the TrainMLModelInstruction.

Currently, this classifier can be used in combination with two encoders:

  • The classifier can be used in combination with the MotifEncoder, such that sequences containing any of the positive class-associated motifs are classified as positive. A reduced subset of binding-associated motifs can be learned (when keep_all is false). This results in a set of complementary motifs, minimizing the redundant predictions made by different motifs.

  • Alternatively, this classifier can be combined with the SimilarToPositiveSequenceEncoder, such that any sequence that falls within a given Hamming distance from any of the positive class sequences in the training set is classified as positive. Parameter keep_all should be set to true, since this encoder creates only 1 feature.

Specification arguments:

  • training_percentage (float): What percentage of data to use for training (the rest will be used for validation); values between 0 and 1

  • keep_all (bool): Whether to keep all the input features (true) or learn a reduced subset (false). By default, keep_all is false.

  • random_seed (int): Random seed for splitting the data into training and validation sets when learning a minimal subset of features. This is only used when keep_all is false.

  • max_features (int): The maximum number of features to allow in the reduced subset. When this number is reached, no more features are added even if the early stopping criterion is not reached yet. This is only used when keep_all is false. By default, max_features is 100.

  • patience (int): The patience for early stopping. When early stopping is reached, <patience> more features are added to the reduced set to test whether the optimization metric on the validation set improves again. By default, patience is 5.

  • min_delta (float): The delta value used to test if there was improvement between the previous set of features and the new set of features (+1). By default, min_delta is 0, meaning the new set of features does not need to yield a higher optimization metric score on the validation set, but it needs to be at least equally high as the previous set.

YAML specification:

definitions:
    ml_methods:
        my_motif_classifier:
            BinaryFeatureClassifier:
                training_percentage: 0.7
                max_features: 100
                patience: 5
                min_delta: 0
                keep_all: false
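
The feature selection procedure described for keep_all = false can be approximated by the sketch below. It is a simplified illustration rather than immuneML's code: accuracy stands in for the optimization metric, ties are broken by feature index, the training/validation split is assumed to have been made already, and all function names are hypothetical.

```python
def predict_any(row, selected):
    """Positive prediction if any selected binary feature is true."""
    return any(row[i] for i in selected)


def accuracy(X, y, selected):
    """Stand-in optimization metric for this sketch."""
    return sum(predict_any(row, selected) == bool(label) for row, label in zip(X, y)) / len(y)


def greedy_forward_selection(train_X, train_y, val_X, val_y,
                             max_features=100, patience=5, min_delta=0.0):
    n_features = len(train_X[0])
    selected, best_selected = [], []
    best_score = accuracy(val_X, val_y, selected)
    rounds_without_improvement = 0
    while len(selected) < min(max_features, n_features):
        remaining = [f for f in range(n_features) if f not in selected]
        # greedy step: add the feature that helps most on the training set
        candidate = max(remaining, key=lambda f: accuracy(train_X, train_y, selected + [f]))
        selected = selected + [candidate]
        score = accuracy(val_X, val_y, selected)
        if score - best_score >= min_delta:  # with min_delta = 0, equal counts as improved
            best_score, best_selected = score, list(selected)
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:  # early stopping
                break
    return best_selected


train_X = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0], [0, 0, 1]]
train_y = [True, True, True, False, False, False]
val_X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
val_y = [True, True, False, False]
print(greedy_forward_selection(train_X, train_y, val_X, val_y, patience=1))
```

In this toy data the third feature only fires on negative examples, so adding it lowers the validation score and it is excluded from the returned subset.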

DeepRC#

This classifier uses the DeepRC method for repertoire classification. The DeepRC ML method should be used in combination with the DeepRC encoder. Also consider using the DeepRCMotifDiscovery report for interpretability.

Notes:

  • DeepRC uses PyTorch functionalities that depend on GPU. Therefore, DeepRC does not work on a CPU.

  • This wrapper around DeepRC currently only supports binary classification.

Reference: Michael Widrich, Bernhard Schäfl, Milena Pavlović, Geir Kjetil Sandve, Sepp Hochreiter, Victor Greiff, Günter Klambauer ‘DeepRC: Immune repertoire classification with attention-based deep massive multiple instance learning’. bioRxiv preprint doi: https://doi.org/10.1101/2020.04.12.038158

Specification arguments:

  • validation_part (float): the part of the data that will be used for validation, the rest will be used for training.

  • add_positional_information (bool): whether positional information should be included in the input features.

  • kernel_size (int): the size of the 1D-CNN kernels.

  • n_kernels (int): the number of 1D-CNN kernels in each layer.

  • n_additional_convs (int): Number of additional 1D-CNN layers after first layer

  • n_attention_network_layers (int): Number of attention layers to compute keys

  • n_attention_network_units (int): Number of units in each attention layer

  • n_output_network_units (int): Number of units in the output layer

  • consider_seq_counts (bool): whether the input data should be scaled by the receptor sequence counts.

  • sequence_reduction_fraction (float): Fraction of number of sequences to which to reduce the number of sequences per bag based on attention weights. Has to be in range [0,1].

  • reduction_mb_size (int): Reduction of sequences per bag is performed using minibatches of reduction_mb_size sequences to compute the attention weights.

  • n_updates (int): Number of updates to train for

  • n_torch_threads (int): Number of parallel threads PyTorch is allowed to use

  • learning_rate (float): Learning rate for adam optimizer

  • l1_weight_decay (float): l1 weight decay factor. l1 weight penalty will be added to loss, scaled by l1_weight_decay

  • l2_weight_decay (float): l2 weight decay factor. l2 weight penalty will be added to loss, scaled by l2_weight_decay

  • sequence_counts_scaling_fn: it can either be log (logarithmic scaling of sequence counts) or None

  • evaluate_at (int): Evaluate model on training and validation set every evaluate_at updates. This will also check for a new best model for early stopping.

  • sample_n_sequences (int): Optional random sub-sampling of sample_n_sequences sequences per repertoire. Number of sequences per repertoire might be smaller than sample_n_sequences if repertoire is smaller or random indices have been drawn multiple times. If None, all sequences will be loaded for each repertoire.

  • training_batch_size (int): Number of repertoires per minibatch during training.

  • n_workers (int): Number of background processes to use for converting dataset to hdf5 container and training set data loader.

  • pytorch_device_name (str): The name of the pytorch device to use. This name will be passed to torch.device(self.pytorch_device_name). The default value is cuda:0

YAML specification:

definitions:
    ml_methods:
        my_deeprc_method:
            DeepRC:
                validation_part: 0.2
                add_positional_information: True
                kernel_size: 9
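
As a rough illustration of how sequence_reduction_fraction could act on a single bag, the sketch below keeps the sequences with the highest attention weights. It is not DeepRC's code: the function name reduce_bag, the rounding rule and the choice to always keep at least one sequence are assumptions made for this example, and the minibatching controlled by reduction_mb_size is left out.

```python
def reduce_bag(attention_weights, sequence_reduction_fraction):
    """Return the indices of the sequences kept after attention-based reduction.

    Hypothetical helper: rounding down and keeping at least one sequence
    are assumptions of this sketch, not documented DeepRC behaviour.
    """
    if not 0.0 <= sequence_reduction_fraction <= 1.0:
        raise ValueError("sequence_reduction_fraction has to be in range [0, 1]")
    n_keep = max(1, int(len(attention_weights) * sequence_reduction_fraction))
    # Rank sequence indices from highest to lowest attention weight.
    ranked = sorted(range(len(attention_weights)),
                    key=lambda i: attention_weights[i], reverse=True)
    # Keep the top-ranked indices, returned in their original order.
    return sorted(ranked[:n_keep])


# A bag of four sequences reduced to half its size.
kept = reduce_bag([0.1, 0.9, 0.5, 0.3], 0.5)
print(kept)  # [1, 2]
```

Here the two sequences with the largest weights (0.9 and 0.5) are retained and the rest of the bag is dropped.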

KNN#

This is a wrapper of scikit-learn’s KNeighborsClassifier class. This ML method creates a distance matrix using the given encoded data. If the encoded data is already a distance matrix (for example, when using the Distance or CompAIRRDistance encoders), please use PrecomputedKNN instead.

Please see the scikit-learn documentation of KNeighborsClassifier for the parameters.

Scikit-learn models can be trained in two modes:

1. Creating a model using a given set of hyperparameters, and relying on the selection and assessment loop in the TrainMLModel instruction to select the optimal model.

2. Passing a range of different hyperparameters to KNN, and using a third layer of nested cross-validation to find the optimal hyperparameters through grid search. In this case, only the KNN model with the optimal hyperparameter settings is further used in the inner selection loop of the TrainMLModel instruction.

By default, mode 1 is used. In order to use mode 2, model_selection_cv and model_selection_n_folds must be set.

Specification arguments:

  • KNN (dict): Under this key, hyperparameters can be specified that will be passed to the scikit-learn class. Any scikit-learn hyperparameters can be specified here. In mode 1, a single value must be specified for each of the scikit-learn hyperparameters. In mode 2, it is possible to specify a range of different hyperparameter values in a list. It is also allowed to mix lists and single values in mode 2, in which case the grid search will only be done for the lists, while the single-value hyperparameters will be fixed. In addition to the scikit-learn hyperparameters, the parameter show_warnings (True/False) can be specified here. This determines whether scikit-learn warnings, such as convergence warnings, should be printed. By default show_warnings is True.

  • model_selection_cv (bool): If any of the hyperparameters under KNN is a list and model_selection_cv is True, a grid search will be done over the given hyperparameters, using the number of folds specified in model_selection_n_folds. By default, model_selection_cv is False.

  • model_selection_n_folds (int): The number of folds that should be used for the cross validation grid search if model_selection_cv is True.

YAML specification:

definitions:
    ml_methods:
        my_knn_method:
            KNN:
                # sklearn parameters (same names as in original sklearn class)
                weights: uniform # always use this setting for weights
                n_neighbors: [5, 10, 15] # find the optimal number of neighbors
                # Additional parameter that determines whether to print convergence warnings
                show_warnings: True
            # if any of the parameters under KNN is a list and model_selection_cv is True,
            # a grid search will be done over the given parameters, using the number of folds specified in model_selection_n_folds,
            # and the optimal model will be selected
            model_selection_cv: True
            model_selection_n_folds: 5
        # alternative way to define ML method with default values:
        my_default_knn: KNN
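
The way lists and single values are combined in mode 2 can be pictured as a Cartesian product over the list-valued hyperparameters only. The helper expand_grid below is hypothetical and merely mimics that expansion; it is not immuneML's implementation, and show_warnings is left out because it is not a scikit-learn hyperparameter.

```python
from itertools import product


def expand_grid(hyperparameters):
    """Expand a mix of lists and single values into candidate settings.

    Hypothetical helper for illustration: list-valued entries are searched,
    single-valued entries stay fixed in every candidate.
    """
    searched = {k: v for k, v in hyperparameters.items() if isinstance(v, list)}
    fixed = {k: v for k, v in hyperparameters.items() if not isinstance(v, list)}
    names = list(searched)
    candidates = []
    for values in product(*(searched[name] for name in names)):
        candidates.append({**fixed, **dict(zip(names, values))})
    return candidates


# The same values as under the KNN key in the YAML specification above.
knn_params = {"weights": "uniform", "n_neighbors": [5, 10, 15]}
candidates = expand_grid(knn_params)
for candidate in candidates:
    print(candidate)
```

With these values, three candidate settings are produced, one per value of n_neighbors, each with weights fixed to uniform. If no entry is a list, a single candidate is returned, which corresponds to mode 1.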

KerasSequenceCNN#

A CNN-based classifier for sequence datasets. Should be used in combination with source.encodings.onehot.OneHotEncoder.OneHotEncoder. This classifier integrates the CNN proposed by Mason et al., the original code can be found at: https://github.com/dahjan/DMS_opt/blob/master/scripts/CNN.py

Note: make sure keras and tensorflow dependencies are installed (see installation instructions).

Reference: Derek M. Mason, Simon Friedensohn, Cédric R. Weber, Christian Jordi, Bastian Wagner, Simon M. Meng, Roy A. Ehling, Lucia Bonati, Jan Dahinden, Pablo Gainza, Bruno E. Correia and Sai T. Reddy ‘Optimization of therapeutic antibodies by predicting antigen specificity from antibody sequence via deep learning’. Nat Biomed Eng 5, 600–612 (2021). https://doi.org/10.1038/s41551-021-00699-9

Specification arguments:

  • units_per_layer (list): A nested list specifying the layers of the CNN. The first element in each nested list defines the layer type, other elements define the layer parameters. Valid layer types are: CONV (keras.layers.Conv1D), DROP (keras.layers.Dropout), POOL (keras.layers.MaxPool1D), FLAT (keras.layers.Flatten), DENSE (keras.layers.Dense). The parameters per layer type are as follows:

    • [CONV, <filters>, <kernel_size>, <strides>]

    • [DROP, <rate>]

    • [POOL, <pool_size>, <strides>]

    • [FLAT]

    • [DENSE, <units>]

  • activation (str): The activation function to use in the convolutional or dense layers. Activation functions can be chosen from keras.activations, for example relu or softmax. By default, relu is used.

  • training_percentage (float): The fraction of sequences that will be randomly assigned to form the training set (the rest will be the validation set). Should be a value between 0 and 1. By default, training_percentage is 0.7.

YAML specification:

definitions:
    ml_methods:
        my_cnn:
            KerasSequenceCNN:
                training_percentage: 0.7
                units_per_layer: [[CONV, 400, 3, 1], [DROP, 0.5], [POOL, 2, 1], [FLAT], [DENSE, 50]]
                activation: relu
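
To see what a units_per_layer list implies for the tensor shapes, the sketch below traces the output shape layer by layer. It assumes Keras' default 'valid' padding for Conv1D and MaxPool1D and a hypothetical one-hot input of 10 positions with 20 channels; trace_shapes is an illustrative helper, not part of immuneML or Keras.

```python
def trace_shapes(units_per_layer, sequence_length, n_channels):
    """Trace per-example output shapes for a units_per_layer specification.

    Assumes 'valid' padding, so a window of size k with stride s over a
    length-L input yields (L - k) // s + 1 positions.
    """
    shape = (sequence_length, n_channels)
    shapes = []
    for layer_type, *params in units_per_layer:
        if layer_type == "CONV":
            filters, kernel_size, strides = params
            shape = ((shape[0] - kernel_size) // strides + 1, filters)
        elif layer_type == "POOL":
            pool_size, strides = params
            shape = ((shape[0] - pool_size) // strides + 1, shape[1])
        elif layer_type == "FLAT":
            shape = (shape[0] * shape[1],)
        elif layer_type == "DENSE":
            (units,) = params
            # A dense layer only changes the size of the last axis.
            shape = shape[:-1] + (units,)
        elif layer_type != "DROP":  # dropout leaves the shape unchanged
            raise ValueError(f"Unknown layer type: {layer_type}")
        shapes.append((layer_type, shape))
    return shapes


layers = [["CONV", 400, 3, 1], ["DROP", 0.5], ["POOL", 2, 1], ["FLAT"], ["DENSE", 50]]
shapes = trace_shapes(layers, sequence_length=10, n_channels=20)
for layer_type, shape in shapes:
    print(layer_type, shape)
```

Under these assumptions, the layers from the YAML specification above give (8, 400) after CONV and DROP, (7, 400) after POOL, (2800,) after FLAT and (50,) after DENSE.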

LogisticRegression#

This is a wrapper of scikit-learn’s LogisticRegression class. Please see the scikit-learn documentation of LogisticRegression for the parameters.

Note: if you are interested in plotting the coefficients of the logistic regression model, consider running the Coefficients report.

Scikit-learn models can be trained in two modes:

1. Creating a model using a given set of hyperparameters, and relying on the selection and assessment loop in the TrainMLModel instruction to select the optimal model.

2. Passing a range of different hyperparameters to LogisticRegression, and using a third layer of nested cross-validation to find the optimal hyperparameters through grid search. In this case, only the LogisticRegression model with the optimal hyperparameter settings is further used in the inner selection loop of the TrainMLModel instruction.

By default, mode 1 is used. In order to use mode 2, model_selection_cv and model_selection_n_folds must be set.

Specification arguments:

  • LogisticRegression (dict): Under this key, hyperparameters can be specified that will be passed to the scikit-learn class. Any scikit-learn hyperparameters can be specified here. In mode 1, a single value must be specified for each of the scikit-learn hyperparameters. In mode 2, it is possible to specify a range of different hyperparameter values in a list. It is also allowed to mix lists and single values in mode 2, in which case the grid search will only be done for the lists, while the single-value hyperparameters will be fixed. In addition to the scikit-learn hyperparameters, the parameter show_warnings (True/False) can be specified here. This determines whether scikit-learn warnings, such as convergence warnings, should be printed. By default show_warnings is True.

  • model_selection_cv (bool): If any of the hyperparameters under LogisticRegression is a list and model_selection_cv is True, a grid search will be done over the given hyperparameters, using the number of folds specified in model_selection_n_folds. By default, model_selection_cv is False.

  • model_selection_n_folds (int): The number of folds that should be used for the cross validation grid search if model_selection_cv is True.

YAML specification:

definitions:
    ml_methods:
        my_logistic_regression: # user-defined method name
            LogisticRegression: # name of the ML method
                # sklearn parameters (same names as in original sklearn class)
                penalty: l1 # always use penalty l1
                C: [0.01, 0.1, 1, 10, 100] # find the optimal value for C
                # Additional parameter that determines whether to print convergence warnings
                show_warnings: True
            # if any of the parameters under LogisticRegression is a list and model_selection_cv is True,
            # a grid search will be done over the given parameters, using the number of folds specified in model_selection_n_folds,
            # and the optimal model will be selected
            model_selection_cv: True
            model_selection_n_folds: 5
        # alternative way to define ML method with default values:
        my_default_logistic_regression: LogisticRegression

PrecomputedKNN#

This is a wrapper of scikit-learn’s KNeighborsClassifier class. This ML method takes a pre-computed distance matrix, as created by the Distance or CompAIRRDistance encoders. If you would like to use a different encoding in combination with KNN, please use KNN instead.

Please see the scikit-learn documentation of KNeighborsClassifier for the parameters.

Scikit-learn models can be trained in two modes:

1. Creating a model using a given set of hyperparameters, and relying on the selection and assessment loop in the TrainMLModel instruction to select the optimal model.

2. Passing a range of different hyperparameters to PrecomputedKNN, and using a third layer of nested cross-validation to find the optimal hyperparameters through grid search. In this case, only the PrecomputedKNN model with the optimal hyperparameter settings is further used in the inner selection loop of the TrainMLModel instruction.

By default, mode 1 is used. In order to use mode 2, model_selection_cv and model_selection_n_folds must be set.

Specification arguments:

  • PrecomputedKNN (dict): Under this key, hyperparameters can be specified that will be passed to the scikit-learn class. Any scikit-learn hyperparameters can be specified here. In mode 1, a single value must be specified for each of the scikit-learn hyperparameters. In mode 2, it is possible to specify a range of different hyperparameter values in a list. It is also allowed to mix lists and single values in mode 2, in which case the grid search will only be done for the lists, while the single-value hyperparameters will be fixed. In addition to the scikit-learn hyperparameters, the parameter show_warnings (True/False) can be specified here. This determines whether scikit-learn warnings, such as convergence warnings, should be printed. By default show_warnings is True.

  • model_selection_cv (bool): If any of the hyperparameters under PrecomputedKNN is a list and model_selection_cv is True, a grid search will be done over the given hyperparameters, using the number of folds specified in model_selection_n_folds. By default, model_selection_cv is False.

  • model_selection_n_folds (int): The number of folds that should be used for the cross validation grid search if model_selection_cv is True.

YAML specification:

definitions:
    ml_methods:
        my_knn_method:
            PrecomputedKNN:
                # sklearn parameters (same names as in original sklearn class)
                weights: uniform # always use this setting for weights
                n_neighbors: [5, 10, 15] # find the optimal number of neighbors
                # Additional parameter that determines whether to print convergence warnings
                show_warnings: True
            # if any of the parameters under PrecomputedKNN is a list and model_selection_cv is True,
            # a grid search will be done over the given parameters, using the number of folds specified in model_selection_n_folds,
            # and the optimal model will be selected
            model_selection_cv: True
            model_selection_n_folds: 5
        # alternative way to define ML method with default values:
        my_default_knn: PrecomputedKNN

ProbabilisticBinaryClassifier#

ProbabilisticBinaryClassifier predicts the class in a binary classification setting where each example is encoded by the number of successful trials and the total number of trials. It models this ratio with one beta distribution per class and predicts the class of new examples using the log-posterior odds ratio with a threshold at 0.

ProbabilisticBinaryClassifier is based on the paper (details on the classification can be found in the Online Methods section): Emerson, Ryan O., William S. DeWitt, Marissa Vignali, Jenna Gravley, Joyce K. Hu, Edward J. Osborne, Cindy Desmarais, et al. ‘Immunosequencing Identifies Signatures of Cytomegalovirus Exposure History and HLA-Mediated Effects on the T Cell Repertoire’. Nature Genetics 49, no. 5 (May 2017): 659–65. doi.org/10.1038/ng.3822.

Specification arguments:

  • max_iterations (int): maximum number of iterations while optimizing the parameters of the beta distribution (same for both classes)

  • update_rate (float): how much the computed gradient should influence the updated value of the parameters of the beta distribution

  • likelihood_threshold (float): at which threshold to stop the optimization (default -1e-10)

YAML specification:

definitions:
    ml_methods:
        my_probabilistic_classifier: # user-defined name of the ML method
            ProbabilisticBinaryClassifier: # method name
                max_iterations: 1000
                update_rate: 0.01
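
The decision rule described above can be sketched in plain Python. The example assumes a beta-binomial likelihood for the number of successful trials, in the spirit of Emerson et al., with the beta parameters of each class already fitted; the parameter values, the equal class priors and all function names are hypothetical and not taken from immuneML.

```python
from math import lgamma, log


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_beta_binomial(k, n, alpha, beta):
    """Log-probability of k successes in n trials under Beta(alpha, beta)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def log_posterior_odds(k, n, positive, negative, prior_positive=0.5):
    """Log-posterior odds of the positive class versus the negative class."""
    log_pos = log(prior_positive) + log_beta_binomial(k, n, *positive)
    log_neg = log(1 - prior_positive) + log_beta_binomial(k, n, *negative)
    return log_pos - log_neg


def predict(k, n, positive, negative, prior_positive=0.5):
    """Predict class 1 when the log-posterior odds exceed the threshold 0."""
    return 1 if log_posterior_odds(k, n, positive, negative, prior_positive) > 0 else 0


# Hypothetical fitted (alpha, beta) pairs for the two classes.
positive_params = (8.0, 2.0)  # favours high success ratios
negative_params = (2.0, 8.0)  # favours low success ratios

high_ratio_prediction = predict(8, 10, positive_params, negative_params)
low_ratio_prediction = predict(1, 10, positive_params, negative_params)
print(high_ratio_prediction, low_ratio_prediction)
```

An example with 8 successes out of 10 trials is assigned to the positive class, while one with 1 success out of 10 is assigned to the negative class. In the classifier itself, max_iterations, update_rate and likelihood_threshold control how the beta parameters are fitted, which this sketch does not cover.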

RandomForestClassifier#

This is a wrapper of scikit-learn’s RandomForestClassifier class. Please see the scikit-learn documentation of RandomForestClassifier for the parameters.

Note: if you are interested in plotting the feature importances of the random forest classifier model, consider running the Coefficients report.

Scikit-learn models can be trained in two modes:

1. Creating a model using a given set of hyperparameters, and relying on the selection and assessment loop in the TrainMLModel instruction to select the optimal model.

2. Passing a range of different hyperparameters to RandomForestClassifier, and using a third layer of nested cross-validation to find the optimal hyperparameters through grid search. In this case, only the RandomForestClassifier model with the optimal hyperparameter settings is further used in the inner selection loop of the TrainMLModel instruction.

By default, mode 1 is used. In order to use mode 2, model_selection_cv and model_selection_n_folds must be set.

Specification arguments:

  • RandomForestClassifier (dict): Under this key, hyperparameters can be specified that will be passed to the scikit-learn class. Any scikit-learn hyperparameters can be specified here. In mode 1, a single value must be specified for each of the scikit-learn hyperparameters. In mode 2, it is possible to specify a range of different hyperparameter values in a list. It is also allowed to mix lists and single values in mode 2, in which case the grid search will only be done for the lists, while the single-value hyperparameters will be fixed. In addition to the scikit-learn hyperparameters, the parameter show_warnings (True/False) can be specified here. This determines whether scikit-learn warnings, such as convergence warnings, should be printed. By default show_warnings is True.

  • model_selection_cv (bool): If any of the hyperparameters under RandomForestClassifier is a list and model_selection_cv is True, a grid search will be done over the given hyperparameters, using the number of folds specified in model_selection_n_folds. By default, model_selection_cv is False.

  • model_selection_n_folds (int): The number of folds that should be used for the cross validation grid search if model_selection_cv is True.

YAML specification:

definitions:
    ml_methods:
        my_random_forest_classifier: # user-defined method name
            RandomForestClassifier: # name of the ML method
                # sklearn parameters (same names as in original sklearn class)
                random_state: 100 # always use this value for random state
                n_estimators: [10, 50, 100] # find the optimal number of trees in the forest
                # Additional parameter that determines whether to print convergence warnings
                show_warnings: True
            # if any of the parameters under RandomForestClassifier is a list and model_selection_cv is True,
            # a grid search will be done over the given parameters, using the number of folds specified in model_selection_n_folds,
            # and the optimal model will be selected
            model_selection_cv: True
            model_selection_n_folds: 5
        # alternative way to define ML method with default values:
        my_default_random_forest: RandomForestClassifier

ReceptorCNN#

A CNN which separately detects motifs using CNN kernels in each chain of paired receptor data, combines the kernel activations into a unique representation of the receptor and uses this representation to predict the antigen binding.

Figure: The architecture of the CNN for paired-chain receptor data.

Requires one-hot encoded data as input (as produced by OneHot encoder), where use_positional_info must be set to True.

Notes:

  • ReceptorCNN can only be used with ReceptorDatasets; it does not work with SequenceDatasets.

  • ReceptorCNN can only be used for binary classification, not multi-class classification.

Specification arguments:

  • kernel_count (int): number of kernels that will look for motifs for one chain

  • kernel_size (list): sizes of the kernels, i.e., how many amino acids to consider at the same time in the chain sequence; several values can be given. For example, for kernel_size [3, 4], kernel_count*len(kernel_size) kernels will be created per chain: kernel_count kernels of size 3 and kernel_count kernels of size 4.

  • positional_channels (int): how many positional channels were included in the one-hot encoding of the receptor sequences (the OneHot encoder adds 3 positional channels when positional information is enabled)

  • sequence_type (SequenceType): type of the sequence

  • device: which device to use for the model (cpu or gpu) - for more details see PyTorch documentation on device parameter

  • number_of_threads (int): how many threads to use

  • random_seed (int): number used as a seed for random initialization

  • learning_rate (float): learning rate scaling the step size for optimization algorithm

  • iteration_count (int): for how many iterations to train the model

  • l1_weight_decay (float): weight decay l1 value for the CNN; encourages sparser representations

  • l2_weight_decay (float): weight decay l2 value for the CNN; shrinks weight coefficients towards zero

  • batch_size (int): how many receptors to process at once

  • training_percentage (float): what percentage of data to use for training (the rest will be used for validation); values between 0 and 1

  • evaluate_at (int): when to evaluate the model, e.g. every 100 iterations

  • background_probabilities: used for rescaling the kernel values to produce information gain matrix; represents the background probability of each amino acid (without positional information); if not specified, uniform background is assumed

YAML specification:

definitions:
    ml_methods:
        my_receptor_cnn:
            ReceptorCNN:
                kernel_count: 5
                kernel_size: [3]
                positional_channels: 3
                sequence_type: amino_acid
                device: cpu
                number_of_threads: 16
                random_seed: 100
                learning_rate: 0.01
                iteration_count: 10000
                l1_weight_decay: 0
                l2_weight_decay: 0
                batch_size: 5000
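
The kernel arithmetic described for kernel_size can be checked with a few lines of Python. The function names are hypothetical, and the default of two chains is an assumption that reflects the paired-chain input described above.

```python
def kernels_per_chain(kernel_count, kernel_size):
    """Map each kernel size to the number of kernels created for one chain."""
    return {size: kernel_count for size in kernel_size}


def total_kernels(kernel_count, kernel_size, n_chains=2):
    """Total number of kernels across all chains (two chains assumed)."""
    return n_chains * kernel_count * len(kernel_size)


per_chain = kernels_per_chain(kernel_count=5, kernel_size=[3, 4])
total = total_kernels(kernel_count=5, kernel_size=[3, 4])
print(per_chain)  # {3: 5, 4: 5}
print(total)      # 20
```

With kernel_count set to 5 and kernel_size set to [3, 4], each chain gets 5 kernels of size 3 and 5 kernels of size 4, so 10 kernels per chain.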

SVC#

This is a wrapper of scikit-learn’s LinearSVC class. Please see the scikit-learn documentation of LinearSVC for the parameters.

Note: if you are interested in plotting the coefficients of the SVC model, consider running the Coefficients report.

Scikit-learn models can be trained in two modes:

1. Creating a model using a given set of hyperparameters, and relying on the selection and assessment loop in the TrainMLModel instruction to select the optimal model.

2. Passing a range of different hyperparameters to SVC, and using a third layer of nested cross-validation to find the optimal hyperparameters through grid search. In this case, only the SVC model with the optimal hyperparameter settings is further used in the inner selection loop of the TrainMLModel instruction.

By default, mode 1 is used. In order to use mode 2, model_selection_cv and model_selection_n_folds must be set.

Specification arguments:

  • SVC (dict): Under this key, hyperparameters can be specified that will be passed to the scikit-learn class. Any scikit-learn hyperparameters can be specified here. In mode 1, a single value must be specified for each of the scikit-learn hyperparameters. In mode 2, it is possible to specify a range of different hyperparameter values in a list. It is also allowed to mix lists and single values in mode 2, in which case the grid search will only be done for the lists, while the single-value hyperparameters will be fixed. In addition to the scikit-learn hyperparameters, the parameter show_warnings (True/False) can be specified here. This determines whether scikit-learn warnings, such as convergence warnings, should be printed. By default show_warnings is True.

  • model_selection_cv (bool): If any of the hyperparameters under SVC is a list and model_selection_cv is True, a grid search will be done over the given hyperparameters, using the number of folds specified in model_selection_n_folds. By default, model_selection_cv is False.

  • model_selection_n_folds (int): The number of folds that should be used for the cross validation grid search if model_selection_cv is True.

YAML specification:

definitions:
    ml_methods:
        my_svc: # user-defined method name
            SVC: # name of the ML method
                # sklearn parameters (same names as in original sklearn class)
                C: [0.01, 0.1, 1, 10, 100] # find the optimal value for C
                # Additional parameter that determines whether to print convergence warnings
                show_warnings: True
            # if any of the parameters under SVC is a list and model_selection_cv is True,
            # a grid search will be done over the given parameters, using the number of folds specified in model_selection_n_folds,
            # and the optimal model will be selected
            model_selection_cv: True
            model_selection_n_folds: 5
        # alternative way to define ML method with default values:
        my_default_svc: SVC

SVM#

This is a wrapper of scikit-learn’s SVC class. Please see the scikit-learn documentation of SVC for the parameters.

Note: if you are interested in plotting the coefficients of the SVM model, consider running the Coefficients report.

Scikit-learn models can be trained in two modes:

1. Creating a model using a given set of hyperparameters, and relying on the selection and assessment loop in the TrainMLModel instruction to select the optimal model.

2. Passing a range of different hyperparameters to SVM, and using a third layer of nested cross-validation to find the optimal hyperparameters through grid search. In this case, only the SVM model with the optimal hyperparameter settings is further used in the inner selection loop of the TrainMLModel instruction.

By default, mode 1 is used. In order to use mode 2, model_selection_cv and model_selection_n_folds must be set.

Specification arguments:

  • SVM (dict): Under this key, hyperparameters can be specified that will be passed to the scikit-learn class. Any scikit-learn hyperparameters can be specified here. In mode 1, a single value must be specified for each of the scikit-learn hyperparameters. In mode 2, it is possible to specify a range of different hyperparameter values in a list. It is also allowed to mix lists and single values in mode 2, in which case the grid search will only be done for the lists, while the single-value hyperparameters will be fixed. In addition to the scikit-learn hyperparameters, the parameter show_warnings (True/False) can be specified here. This determines whether scikit-learn warnings, such as convergence warnings, should be printed. By default show_warnings is True.

  • model_selection_cv (bool): If any of the hyperparameters under SVM is a list and model_selection_cv is True, a grid search will be done over the given hyperparameters, using the number of folds specified in model_selection_n_folds. By default, model_selection_cv is False.

  • model_selection_n_folds (int): The number of folds that should be used for the cross validation grid search if model_selection_cv is True.

YAML specification:

definitions:
    ml_methods:
        my_svm: # user-defined method name
            SVM: # name of the ML method
                # sklearn parameters (same names as in original sklearn class)
                C: [0.01, 0.1, 1, 10, 100] # find the optimal value for C
                kernel: linear
                # Additional parameter that determines whether to print convergence warnings
                show_warnings: True
            # if any of the parameters under SVM is a list and model_selection_cv is True,
            # a grid search will be done over the given parameters, using the number of folds specified in model_selection_n_folds,
            # and the optimal model will be selected
            model_selection_cv: True
            model_selection_n_folds: 5
        # alternative way to define ML method with default values:
        my_default_svm: SVM

TCRdistClassifier#

Implementation of a nearest neighbors classifier based on TCR distances as presented in Dash P, Fiore-Gartland AJ, Hertz T, et al. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature. 2017; 547(7661):89-93. doi:10.1038/nature22383.

This method is implemented using scikit-learn’s KNeighborsClassifier with k determined at runtime from the training dataset size and weights linearly scaled to decrease with the distance of examples.

Specification arguments:

  • percentage (float): fraction of the training examples to use as nearest neighbors when determining receptor specificity based on known receptors (a value between 0 and 1)

  • show_warnings (bool): whether to show warnings generated by scikit-learn, by default this is True.

YAML specification:

definitions:
    ml_methods:
        my_tcr_method:
            TCRdistClassifier:
                percentage: 0.1
                show_warnings: True
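
The two ideas behind this classifier, deriving k from the training set size and down-weighting distant neighbors linearly, can be sketched as follows. The rounding rule, the exact weight formula (1 minus the distance divided by the largest distance among the k neighbors) and the function names are assumptions made for this example; the actual implementation relies on scikit-learn's KNeighborsClassifier.

```python
def neighbors_to_use(percentage, n_training_examples):
    """Derive k from the training set size (rounding rule is an assumption)."""
    return max(1, round(percentage * n_training_examples))


def weighted_vote(distances, labels, k):
    """Predict a label from the k nearest examples with linear distance weights.

    Assumed weighting: 1 - d / d_max, where d_max is the largest distance
    among the k nearest neighbors, so closer examples count more.
    """
    nearest = sorted(zip(distances, labels), key=lambda pair: pair[0])[:k]
    max_distance = max(distance for distance, _ in nearest)
    votes = {}
    for distance, label in nearest:
        weight = 1.0 - distance / max_distance if max_distance > 0 else 1.0
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)


k = neighbors_to_use(percentage=0.1, n_training_examples=200)
predicted = weighted_vote(
    distances=[1.0, 2.0, 3.0, 10.0],
    labels=["binder", "non-binder", "non-binder", "binder"],
    k=3,
)
print(k, predicted)
```

In this toy example an unweighted majority vote over the three nearest neighbors would pick "non-binder", but the closest example carries the largest weight, so the weighted vote returns "binder".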

Clustering methods#

Note

This is an experimental feature

Clustering methods are algorithms which can be used to cluster repertoires, receptors or sequences without using external label information (such as disease or antigen binding state).

These methods can be used in the Clustering instruction.

KMeans#

k-means clustering method which wraps scikit-learn’s KMeans. Input arguments for the method are the same as supported by scikit-learn (see KMeans scikit-learn documentation for details).

YAML specification:

definitions:
    ml_methods:
        my_kmeans:
            KMeans:
                # arguments as defined by scikit-learn
                n_clusters: 2

Generative models#

Note

This is an experimental feature

Generative models are algorithms which can be trained to learn patterns in existing datasets, and then be used to generate new synthetic datasets.

These methods can be used in the TrainGenModel instruction, and previously trained models can be used to generate data using the ApplyGenModel instruction.

ExperimentalImport#

Allows importing existing experimental data and performing annotations and simulations on top of it.

YAML specification:

definitions:
    ml_methods:
        generative_model:
            type: ExperimentalImport
            import_format: AIRR
            tmp_import_path: ./tmp/
            import_params:
                path: path/to/files/
                region_type: IMGT_CDR3 # what part of the sequence to import
                column_mapping: # column mapping AIRR: immuneML
                    junction: sequence
                    junction_aa: sequence_aa
                    locus: chain

OLGA#

This is a wrapper for the OLGA package as described by Sethna et al. 2019 (OLGA package on PyPI or GitHub: https://github.com/statbiophys/OLGA).

Reference:

Zachary Sethna, Yuval Elhanati, Curtis G Callan, Jr, Aleksandra M Walczak, Thierry Mora, OLGA: fast computation of generation probabilities of B- and T-cell receptor amino acid sequences and motifs, Bioinformatics, Volume 35, Issue 17, 1 September 2019, Pages 2974–2981, https://doi.org/10.1093/bioinformatics/btz035

Note:

  • OLGA generates sequences that correspond to the IMGT junction and are used for matching as such. See https://github.com/statbiophys/OLGA for more details.

  • Gene names are as provided in OLGA (either in default models or in the user-specified model files). For simulation, one should use gene names in the same format.

Note

While this is a generative model, in the current version of immuneML it cannot be used in combination with TrainGenModel or ApplyGenModel instruction. If you want to use OLGA for sequence simulation, see Dataset simulation with LIgO.

Specification arguments:

  • model_path (str): if not default model, this parameter should point to a folder where the four OLGA/IGOR format files are stored (could also be inferred from some experimental data)

  • default_model_name (str): if not using custom models, one of the OLGA default models can be specified here; the value should be the same as it would be passed to the command line in OLGA, e.g., humanTRB, humanIGH

YAML specification:

definitions:
    ml_methods:
        generative_model:
            type: OLGA
            model_path: None
            default_model_name: humanTRB

PWM#

Note

This is an experimental feature

This is a baseline implementation of a positional weight matrix. It is estimated from a set of sequences for each of the different lengths that appear in the dataset.

Specification arguments:

  • chain (str): which chain is generated (for now, it is only assigned to the generated sequences)

  • sequence_type (str): amino_acid or nucleotide

  • region_type (str): which region type to use (e.g., IMGT_CDR3), this is only assigned to the generated sequences

YAML specification:

definitions:
    ml_methods:
        my_pwm:
            PWM:
                chain: beta
                sequence_type: amino_acid
                region_type: IMGT_CDR3
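
A minimal sketch of the idea, assuming one matrix of per-position residue frequencies for each sequence length and sampling lengths in proportion to how often they occur in the training data. The function names and the length-sampling rule are assumptions for illustration and do not mirror immuneML's implementation.

```python
import random
from collections import Counter, defaultdict


def fit_pwms(sequences):
    """Estimate one positional weight matrix per sequence length."""
    by_length = defaultdict(list)
    for sequence in sequences:
        by_length[len(sequence)].append(sequence)

    pwms = {}
    for length, group in by_length.items():
        columns = []
        for position in range(length):
            counts = Counter(sequence[position] for sequence in group)
            columns.append({aa: count / len(group) for aa, count in counts.items()})
        pwms[length] = columns

    # Assumption: lengths are sampled proportionally to their frequency.
    length_probs = {length: len(group) / len(sequences)
                    for length, group in by_length.items()}
    return pwms, length_probs


def sample_sequence(pwms, length_probs, rng):
    """Draw a length, then draw each position independently from its column."""
    lengths = list(length_probs)
    length = rng.choices(lengths, weights=[length_probs[l] for l in lengths])[0]
    return "".join(
        rng.choices(list(column), weights=list(column.values()))[0]
        for column in pwms[length]
    )


training = ["CAS", "CAT", "CASS"]  # toy sequences, not real data
pwms, length_probs = fit_pwms(training)
generated = sample_sequence(pwms, length_probs, random.Random(0))
print(pwms[3][2])  # {'S': 0.5, 'T': 0.5}
print(generated)
```

Because every position is sampled independently, such a model captures per-position residue preferences but no dependencies between positions, which is why it serves as a baseline.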

SimpleLSTM#

This is a simple generative model for receptor sequences based on LSTM.

Similar models have been proposed in:

Akbar, R. et al. (2022). In silico proof of principle of machine learning-based antibody design at unconstrained scale. mAbs, 14(1), 2031482. https://doi.org/10.1080/19420862.2022.2031482

Saka, K. et al. (2021). Antibody design using LSTM based deep generative model from phage display library for affinity maturation. Scientific Reports, 11(1), Article 1. https://doi.org/10.1038/s41598-021-85274-7

Specification arguments:

  • sequence_type (str): whether the model should work on amino_acid or nucleotide level

  • hidden_size (int): how many LSTM cells should exist per layer

  • num_layers (int): how many hidden LSTM layers should there be

  • num_epochs (int): for how many epochs to train the model

  • learning_rate (float): what learning rate to use for optimization

  • batch_size (int): how many examples (sequences) to use for training for one batch

  • embed_size (int): the dimension of the sequence embedding

  • temperature (float): a higher temperature leads to faster yet more unstable learning

YAML specification:

definitions:
    ml_methods:
        my_simple_lstm:
            SimpleLSTM:
                sequence_type: amino_acid
                hidden_size: 50
                num_layers: 1
                num_epochs: 5000
                learning_rate: 0.001
                batch_size: 100
                embed_size: 100

SimpleVAE#

SimpleVAE is a generative model on the sequence level that relies on a variational autoencoder. This type of model was proposed by Davidsen et al. 2019, and this implementation is inspired by their original implementation available at https://github.com/matsengrp/vampire.

References:

Davidsen, K., Olson, B. J., DeWitt, W. S., III, Feng, J., Harkins, E., Bradley, P., & Matsen, F. A., IV. (2019). Deep generative models for T cell receptor protein sequences. eLife, 8, e46935. https://doi.org/10.7554/eLife.46935

Specification arguments:

  • chain (str): which chain the sequences come from, e.g., TRB

  • beta (float): VAE hyperparameter that balances the reconstruction loss and latent dimension regularization

  • latent_dim (int): latent dimension of the VAE

  • linear_nodes_count (int): in linear layers, how many nodes to use

  • num_epochs (int): how many epochs to use for training

  • batch_size (int): how many examples to consider at the same time

  • j_gene_embed_dim (int): dimension of J gene embedding

  • v_gene_embed_dim (int): dimension of V gene embedding

  • cdr3_embed_dim (int): dimension of the cdr3 embedding

  • pretrains (int): how many times to attempt pretraining to initialize the weights and use warm-up for the beta hyperparameter before the main training process

  • warmup_epochs (int): how many epochs to use for training where the beta hyperparameter is linearly increased from 0 up to its max value; this is in addition to num_epochs set above

  • patience (int): number of epochs to wait before the training is stopped when the loss is not improving

  • iter_count_prob_estimation (int): how many iterations to use to estimate the log probability of the generated sequence (the more iterations, the better the estimated log probability)

  • vocab (list): which letters (amino acids) are allowed - this is automatically filled for new models (no need to set)

  • max_cdr3_len (int): what is the maximum cdr3 length - this is automatically filled for new models (no need to set)

  • unique_v_genes (list): list of allowed V genes (this will be automatically filled from the dataset if not provided here manually)

  • unique_j_genes (list): list of allowed J genes (this will be automatically filled from the dataset if not provided here manually)

  • device (str): name of the device where to train the model (e.g., cpu)

YAML specification:

definitions:
    ml_methods:
        my_vae:
            SimpleVAE:
                chain: beta
                beta: 0.75
                latent_dim: 20
                linear_nodes_count: 75
                num_epochs: 5000
                batch_size: 10000
                j_gene_embed_dim: 13
                v_gene_embed_dim: 30
                cdr3_embed_dim: 21
                pretrains: 10
                warmup_epochs: 20
                patience: 20
                device: cpu
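
The interaction between beta and warmup_epochs can be pictured with a short Python sketch. This is only an illustration under stated assumptions: the function names are hypothetical, the loss is written in the generic beta-weighted VAE form, and the exact epoch indexing used by immuneML may differ.

```python
def beta_for_epoch(epoch, warmup_epochs, max_beta):
    """Linearly ramp beta from 0 to max_beta over warmup_epochs, then hold."""
    if warmup_epochs <= 0:
        return max_beta
    return max_beta * min(1.0, epoch / warmup_epochs)


def vae_loss(reconstruction_loss, kl_divergence, beta):
    """Beta-weighted objective: reconstruction term plus beta times the KL term."""
    return reconstruction_loss + beta * kl_divergence


# Values matching the YAML example above: beta 0.75, 20 warm-up epochs.
schedule = [
    beta_for_epoch(epoch, warmup_epochs=20, max_beta=0.75)
    for epoch in (0, 10, 20, 30)
]
```

With these settings the weight on the regularization term starts at 0, reaches half of its maximum after 10 warm-up epochs, and stays at 0.75 afterwards.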

SoNNia#

SoNNia models the selection process of T and B cell receptor repertoires. It is based on the SoNNia Python package. It supports SequenceDataset as input, but not RepertoireDataset.

Original publication: Isacchini, G., Walczak, A. M., Mora, T., & Nourmohammad, A. (2021). Deep generative selection models of T and B cell receptor repertoires with soNNia. Proceedings of the National Academy of Sciences, 118(14), e2023141118. https://doi.org/10.1073/pnas.2023141118

Specification arguments:

  • chain (str)

  • batch_size (int)

  • epochs (int)

  • deep (bool)

  • include_joint_genes (bool)

  • n_gen_seqs (int)

  • custom_model_path (str)

  • default_model_name (str)

YAML specification:

definitions:
    ml_methods:
        my_sonnia_model:
            SoNNia:
                ...

Dimensionality reduction methods#

Note

This is an experimental feature

Dimensionality reduction methods are algorithms which can be used to reduce the dimensionality of encoded datasets, in order to uncover and analyze patterns present in the data.

These methods can be used in the ExploratoryAnalysis and Clustering instructions.

PCA#

Principal component analysis (PCA) method which wraps scikit-learn’s PCA. Input arguments for the method are the same as supported by scikit-learn (see PCA scikit-learn documentation for details).

YAML specification:

definitions:
    ml_methods:
        my_pca:
            PCA:
                # arguments as defined by scikit-learn
                n_components: 2

TSNE#

t-distributed Stochastic Neighbor Embedding (t-SNE) method which wraps scikit-learn’s TSNE. It can be useful for visualizing high-dimensional data. Input arguments for the method are the same as supported by scikit-learn (see TSNE scikit-learn documentation for details).

YAML specification:

definitions:
    ml_methods:
        my_tsne:
            TSNE:
                # arguments as defined by scikit-learn
                n_components: 2
                init: pca

UMAP#

Uniform manifold approximation and projection (UMAP) method which wraps umap-learn’s UMAP. Input arguments for the method are the same as supported by umap-learn (see UMAP in the umap-learn documentation for details).

Note that when providing the arguments for UMAP in the immuneML specification, it is not possible to set functions as input values (e.g., for the metric parameter, it has to be one of the predefined metrics available in umap-learn).

YAML specification:

definitions:
    ml_methods:
        my_umap:
            UMAP:
                # arguments as defined by umap-learn
                n_components: 2
                n_neighbors: 15
                metric: euclidean

Reports#

Under the definitions/reports component, the user can specify reports which visualise or summarise different properties of the dataset or analysis.

Reports have been divided into different types. Different types of reports can be specified depending on which instruction is run. Click on the name of the report type to see more details.

  • Data reports show some type of features or statistics about a given dataset.

  • Encoding reports show some type of features or statistics about an encoded dataset, or may export relevant sequences or tables.

  • ML model reports show some type of features or statistics about a single trained ML model (e.g., model coefficients).

  • Train ML model reports plot general statistics or export data of multiple models simultaneously when running the TrainMLModel instruction (e.g., performance comparison between models).

  • Multi dataset reports are special reports that can be specified when running immuneML with the MultiDatasetBenchmarkTool. See Manuscript use case 1: Robustness assessment for an example.

Data reports#

Data reports show some type of features or statistics about a given dataset.

When running the TrainMLModel instruction, data reports can be specified inside the ‘selection’ or ‘assessment’ specification under the keys ‘reports/data’ (current cross-validation split) or ‘reports/data_splits’ (train/test sub-splits). Example:

my_instruction:
    type: TrainMLModel
    selection:
        reports:
            data:
                - my_data_report
        # other parameters...
    assessment:
        reports:
            data:
                - my_data_report
        # other parameters...
    # other parameters...

Alternatively, when running the ExploratoryAnalysis instruction, data reports can be specified under ‘report’. Example:

my_instruction:
    type: ExploratoryAnalysis
    analyses:
        my_first_analysis:
            report: my_data_report
            # other parameters...
    # other parameters...

AminoAcidFrequencyDistribution#

Generates a barplot showing the relative frequency of each amino acid at each position in the sequences of a dataset.

Specification arguments:

  • imgt_positions (bool): Whether to use IMGT positional numbering or sequence index numbering. When imgt_positions is True, IMGT positions are used, meaning sequences of unequal length are aligned according to their IMGT positions. By default, imgt_positions is True.

  • relative_frequency (bool): Whether to plot relative frequencies (true) or absolute counts (false) of the positional amino acids. By default, relative_frequency is True.

  • split_by_label (bool): Whether to split the plots by a label. If set to true, the Dataset must either contain a single label, or alternatively the label of interest can be specified under ‘label’. If split_by_label is set to true, the percentage-wise frequency difference between classes is plotted additionally. By default, split_by_label is False.

  • label (str): if split_by_label is set to True, a label can be specified here.

YAML specification:

definitions:
    reports:
        my_aa_freq_report:
            AminoAcidFrequencyDistribution:
                relative_frequency: False
                split_by_label: True
                label: CMV
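
As a rough illustration of what is being plotted, the Python sketch below computes per-position amino acid counts or relative frequencies using plain sequence index numbering (the imgt_positions: False case). The function name is hypothetical, and normalising each position by the number of sequences that reach that position is an assumption rather than a statement about immuneML's internals.

```python
from collections import Counter


def positional_aa_frequencies(sequences, relative_frequency=True):
    """Count amino acids per sequence index (no IMGT alignment)."""
    max_len = max(len(seq) for seq in sequences)
    result = []
    for pos in range(max_len):
        # Only sequences long enough to have this position contribute.
        counts = Counter(seq[pos] for seq in sequences if pos < len(seq))
        if relative_frequency:
            total = sum(counts.values())
            result.append({aa: n / total for aa, n in counts.items()})
        else:
            result.append(dict(counts))
    return result


freqs = positional_aa_frequencies(["CASS", "CAT", "CSS"])
counts = positional_aa_frequencies(["CASS", "CAT", "CSS"], relative_frequency=False)
```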

GLIPH2Exporter#

Report which exports the receptor data to GLIPH2 format so that it can be directly used in the GLIPH2 tool. Currently, the report accepts only receptor datasets.

GLIPH2 publication: Huang H, Wang C, Rubelt F, Scriba TJ, Davis MM. Analyzing the Mycobacterium tuberculosis immune response by T-cell receptor clustering with GLIPH2 and genome-wide antigen screening. Nature Biotechnology. Published online April 27, 2020:1-9. doi:10.1038/s41587-020-0505-4

Specification arguments:

  • condition (str): name of the parameter present in the receptor metadata in the dataset; condition can be anything which can be processed in GLIPH2, such as tissue type or treatment.

YAML specification:

definitions:
    reports:
        my_gliph2_exporter:
            GLIPH2Exporter:
                condition: epitope # for instance, epitope parameter is present in receptors' metadata with values such as "MtbLys" for Mycobacterium tuberculosis (as shown in the original paper).

MotifGeneralizationAnalysis#

This report splits the given dataset into a training and validation set, identifies significant motifs using the MotifEncoder on the training set and plots the precision/recall and precision/true positive predictions of motifs on both the training and validation sets. This can be used to:

  • determine the optimal recall cutoff for motifs of a given size

  • investigate how well motifs learned on a training set generalize to a test set

After running this report and determining the optimal recall cutoffs, the report MotifTestSetPerformance can be run to plot the performance on an independent test set.

Specification arguments:

  • label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.

  • training_set_identifier_path (str): Path to a file containing ‘sequence_identifiers’ of the sequences used for the training set. This file should have a single column named ‘example_id’ and have one sequence identifier per line. If training_set_identifier_path is not set, a random subset of the data (according to training_percentage) will be assigned to be the training set.

  • training_percentage (float): If training_set_identifier_path is not set, this value is used to specify the fraction of sequences that will be randomly assigned to form the training set. Should be a value between 0 and 1. By default, training_percentage is 0.7.

  • random_seed (int): Random seed for splitting the data into training and validation sets if training_set_identifier_path is not provided.

  • split_by_motif_size (bool): Whether to split the analysis per motif size. If true, a recall threshold is learned for each motif size, and figures are generated for each motif size independently. By default, split_by_motif_size is true.

  • min_precision (float): MotifEncoder parameter. The minimum precision threshold for keeping a motif on the training set. By default, min_precision is 0.9.

  • test_precision_threshold (float): The desired precision on the test set, given that motifs are learned by using a training set with a precision threshold of min_precision. It is recommended for test_precision_threshold to be lower than min_precision, e.g., min_precision - 0.1. By default, test_precision_threshold is 0.8.

  • min_recall (float): MotifEncoder parameter. The minimum recall threshold for keeping a motif. Any learned recall threshold will be at least as high as the set min_recall value. The default value for min_recall is 0.

  • min_true_positives (int): MotifEncoder parameter. The minimum number of true positive training sequences that a motif needs to occur in. The default value for min_true_positives is 1.

  • max_positions (int): MotifEncoder parameter. The maximum motif size. This is the number of positional amino acids the motif consists of (excluding gaps). The default value for max_positions is 4.

  • min_positions (int): MotifEncoder parameter. The minimum motif size (see also: max_positions). The default value for min_positions is 1.

  • smoothen_combined_precision (bool): Whether to add a smoothed line representing the combined precision to the precision-vs-TP plot. When set to True, this may take considerable extra time to compute. By default, smoothen_combined_precision is set to True.

  • min_points_in_window (int): Parameter for smoothing the combined_precision line in the precision-vs-TP plot through lognormal kernel smoothing with adaptive window size. This parameter determines the minimum number of points that need to be present in a window to determine the adaptive window size. By default, min_points_in_window is 50.

  • smoothing_constant1: Parameter for smoothing the combined_precision line in the precision-vs-TP plot through lognormal kernel smoothing with adaptive window size. This smoothing constant determines the dependence of the smoothness on the window size. Increasing this increases smoothness for regions where few points are present. By default, smoothing_constant1 is 5.

  • smoothing_constant2: Parameter for smoothing the combined_precision line in the precision-vs-TP plot through lognormal kernel smoothing with adaptive window size. This smoothing constant can be used to scale the overall kernel width, thus influencing the smoothness of all regions regardless of data density. By default, smoothing_constant2 is 10.

  • training_set_name (str): Name of the training set to be used in figures. By default, the training_set_name is ‘training set’.

  • test_set_name (str): Name of the test set to be used in figures. By default, the test_set_name is ‘test set’.

  • highlight_motifs_path (str): Path to a set of motifs of interest to highlight in the output figures (such as implanted ground-truth motifs). By default, no motifs are highlighted.

  • highlight_motifs_name (str): If highlight_motifs_path is defined, this name will be used to label the motifs of interest in the output figures.

YAML specification:

definitions:
    reports:
        my_motif_generalization:
            MotifGeneralizationAnalysis:
                min_precision: 0.9
                min_recall: 0.1
                label: # Define a label, and the positive class for that given label
                    CMV:
                        positive_class: +
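
The two ingredients of this report, a seeded random training/validation split and per-motif precision and recall, can be sketched in a few lines of Python. The helper names are hypothetical and the code only illustrates the concepts behind training_percentage, random_seed and the precision/recall thresholds; it does not reproduce the MotifEncoder.

```python
import random


def split_ids(example_ids, training_percentage=0.7, random_seed=None):
    """Randomly assign a fraction of examples to the training set."""
    ids = list(example_ids)
    random.Random(random_seed).shuffle(ids)
    n_train = round(len(ids) * training_percentage)
    return ids[:n_train], ids[n_train:]


def precision_recall(predicted, labels):
    """Both arguments are equal-length lists of booleans."""
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum(y and not p for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


train_ids, validation_ids = split_ids(range(10), training_percentage=0.7, random_seed=1)

# predicted: whether a motif occurs in each sequence; labels: positive class or not.
precision, recall = precision_recall(
    predicted=[True, True, False, True],
    labels=[True, False, True, True],
)
```

Fixing the seed makes the split reproducible, which matters when precision on the training and validation sets is compared across runs.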

ReceptorDatasetOverview#

This report plots the length distribution per chain for a receptor (paired-chain) dataset.

Specification arguments:

  • batch_size (int): how many receptors to load at once; 50 000 by default

YAML specification:

definitions:
    reports:
        my_receptor_overview_report: ReceptorDatasetOverview

RecoveredSignificantFeatures#

Compares a given collection of groundtruth implanted signals (sequences or k-mers) to the significant label-associated k-mers or sequences according to Fisher’s exact test.

Internally uses the KmerAbundanceEncoder for calculating significant k-mers, and SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder to calculate significant full sequences (depending on whether the argument compairr_path was set).

This report creates two plots:

  • the first plot is a bar chart showing what percentage of the ground truth implanted signals were found to be significant.

  • the second plot is a bar chart showing what percentage of the k-mers/sequences found to be significant match the ground truth implanted signals.

To compare k-mers or sequences of differing lengths, the ground truth sequences or long k-mers are split into k-mers of the given size through a sliding window approach. When comparing ‘full_sequences’ to ground truth sequences, a match is only registered if both sequences are of equal length.
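
The sliding window comparison can be sketched as follows. This is a simplified Python illustration with hypothetical function names; it shows the first of the two plotted quantities (the share of ground truth k-mers that were found significant) and assumes a window step of 1.

```python
def split_into_kmers(sequence, k):
    """Return all overlapping k-mers of a sequence (sliding window, step 1)."""
    if len(sequence) < k:
        return []
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def fraction_recovered(groundtruth_sequences, significant_kmers, k):
    """Fraction of ground truth k-mers that appear among the significant k-mers."""
    truth = {
        kmer
        for seq in groundtruth_sequences
        for kmer in split_into_kmers(seq, k)
    }
    if not truth:
        return 0.0
    return len(truth & set(significant_kmers)) / len(truth)


kmers = split_into_kmers("CASSL", 3)
recovered = fraction_recovered(["CASSL"], {"ASS", "SSL", "QQQ"}, k=3)
```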

Specification arguments:

  • groundtruth_sequences_path (str): Path to a file containing the true implanted (sub)sequences, e.g., full sequences or k-mers. The file should contain one sequence per line, without a header, and without V or J genes.

  • trim_leading_trailing (bool): Whether to trim the leading and trailing first positions from the provided groundtruth sequences, e.g., the leading C and trailing Y/F amino acids. This is necessary for comparing full sequences when the main dataset is imported using settings that also trim the leading and trailing positions (specified by the region_type parameter). By default, trim_leading_trailing is False.

  • p_values (list): The p value thresholds to be used by Fisher’s exact test. Each p-value specified here will become one panel in the output figure.

  • k_values (list): Length of the k-mers (number of amino acids) created by the KmerAbundanceEncoder. When using a full sequence encoding (SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder), specify ‘full_sequence’ here. Each value specified under k_values will represent one bar in the output figure.

  • label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.

  • compairr_path (str): If ‘full_sequence’ is listed under k_values, the path to the CompAIRR executable may be provided. If the compairr_path is specified, the CompAIRRSequenceAbundanceEncoder will be used to compute the significant sequences. If the path is not specified and ‘full_sequence’ is listed under k_values, SequenceAbundanceEncoder will be used.

YAML specification:

definitions:
    reports:
        my_recovered_significant_features_report:
            RecoveredSignificantFeatures:
                groundtruth_sequences_path: path/to/groundtruth/sequences.txt
                trim_leading_trailing: False
                p_values:
                    - 0.1
                    - 0.01
                    - 0.001
                    - 0.0001
                k_values:
                    - 3
                    - 4
                    - 5
                    - full_sequence
                compairr_path: path/to/compairr # can be specified if 'full_sequence' is listed under k_values
                label: # Define a label, and the positive class for that given label
                    CMV:
                        positive_class: +

RepertoireClonotypeSummary#

Shows the number of distinct clonotypes per repertoire in a given dataset as a bar plot.

Specification arguments:

  • color_by_label (str): name of the label to use to color the plot, e.g., could be disease label, or None

YAML specification:

definitions:
    reports:
        my_clonotype_summary_rep:
            RepertoireClonotypeSummary:
                color_by_label: celiac

SequenceLengthDistribution#

Generates a histogram of the lengths of the sequences in a repertoire or sequence dataset.

Specification arguments:

  • sequence_type (str): whether to check the length of amino acid or nucleotide sequences; default value is ‘amino_acid’

YAML specification:

definitions:
    reports:
        my_sld_report:
            SequenceLengthDistribution:
                sequence_type: amino_acid

SequencesWithSignificantKmers#

Given a list of reference sequences, this report writes out the subsets of reference sequences containing significant k-mers (as computed by the KmerAbundanceEncoder using Fisher’s exact test).

For each combination of p-value and k-mer size given, a file is written containing all sequences containing a significant k-mer of the given size at the given p-value.

Specification arguments:

  • reference_sequences_path (str): Path to a file containing the reference sequences. The file should contain one sequence per line, without a header, and without V or J genes.

  • p_values (list): The p-value thresholds to be used by Fisher’s exact test. One output file is written for each combination of p-value and k-mer size.

  • k_values (list): Length of the k-mers (number of amino acids) created by the KmerAbundanceEncoder. One output file is written for each combination of p-value and k-mer size.

  • label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.

YAML specification:

definitions:
    reports:
        my_sequences_with_significant_kmers:
            SequencesWithSignificantKmers:
                reference_sequences_path: path/to/reference/sequences.txt
                p_values:
                    - 0.1
                    - 0.01
                    - 0.001
                    - 0.0001
                k_values:
                    - 3
                    - 4
                    - 5
                label: # Define a label, and the positive class for that given label
                    CMV:
                        positive_class: +
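
The filtering step described above amounts to a substring check per reference sequence. The Python sketch below uses made-up k-mers and sequences and a hypothetical helper name; in the actual report the significant k-mers come from the KmerAbundanceEncoder rather than from a hand-written dictionary.

```python
def sequences_with_kmers(reference_sequences, significant_kmers):
    """Keep the reference sequences that contain at least one of the given k-mers."""
    return [
        seq for seq in reference_sequences
        if any(kmer in seq for kmer in significant_kmers)
    ]


# Hypothetical significant k-mers, keyed by (p_value, k).
significant = {
    (0.01, 3): {"ASS", "GGT"},
    (0.001, 3): {"GGT"},
}
reference = ["CASSLF", "CATGGTF", "CSVQYF"]

# One subset per combination of p-value and k-mer size.
subsets = {
    key: sequences_with_kmers(reference, kmers)
    for key, kmers in significant.items()
}
```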

SignificantFeatures#

Plots a boxplot of the number of significant features (label-associated k-mers or sequences) per Repertoire according to Fisher’s exact test, across different classes for the given label.

Internally uses the KmerAbundanceEncoder for calculating significant k-mers, and SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder to calculate significant full sequences (depending on whether the argument compairr_path was set).

Specification arguments:

  • p_values (list): The p value thresholds to be used by Fisher’s exact test. Each p-value specified here will become one panel in the output figure.

  • k_values (list): Length of the k-mers (number of amino acids) created by the KmerAbundanceEncoder. When using a full sequence encoding (SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder), specify ‘full_sequence’ here. Each value specified under k_values will represent one boxplot in the output figure.

  • label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.

  • compairr_path (str): If ‘full_sequence’ is listed under k_values, the path to the CompAIRR executable may be provided. If the compairr_path is specified, the CompAIRRSequenceAbundanceEncoder will be used to compute the significant sequences. If the path is not specified and ‘full_sequence’ is listed under k_values, SequenceAbundanceEncoder will be used.

  • log_scale (bool): Whether to plot the y axis in log10 scale (log_scale = True) or linear scale (log_scale = False). By default, log_scale is False.

YAML specification:

definitions:
    reports:
        my_significant_features_report:
            SignificantFeatures:
                p_values:
                    - 0.1
                    - 0.01
                    - 0.001
                    - 0.0001
                k_values:
                    - 3
                    - 4
                    - 5
                    - full_sequence
                compairr_path: path/to/compairr # can be specified if 'full_sequence' is listed under k_values
                label: # Define a label, and the positive class for that given label
                    CMV:
                        positive_class: +
                log_scale: False
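
For readers unfamiliar with Fisher’s exact test, the following Python sketch computes a one-sided p-value for a single feature from a 2x2 table using only the standard library. The function name is hypothetical, and both the one-sided alternative and the layout of the table (examples with or without the feature, per class) are assumptions; this page does not state how immuneML sets up the test internally.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    a: positive-class examples containing the feature
    b: negative-class examples containing the feature
    c: positive-class examples without the feature
    d: negative-class examples without the feature
    """
    n_pos, n_neg, n_with = a + c, b + d, a + b
    total = comb(n_pos + n_neg, n_with)
    upper = min(n_pos, n_with)
    # Sum hypergeometric probabilities of tables at least as extreme as observed.
    return sum(
        comb(n_pos, x) * comb(n_neg, n_with - x) for x in range(a, upper + 1)
    ) / total


# A k-mer seen in 4 of 5 positive and 0 of 5 negative repertoires.
p = fisher_exact_greater(4, 0, 1, 5)
significant_at = {threshold: p < threshold for threshold in (0.1, 0.01)}
```

In this toy case the feature would count as significant at the 0.1 threshold but not at 0.01, which is why each entry of p_values yields its own panel.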

SignificantKmerPositions#

Plots the number of significant k-mers (as computed by the KmerAbundanceEncoder using Fisher’s exact test) observed at each IMGT position of a given list of reference sequences. This report creates a stacked bar chart, where each bar represents an IMGT position, and each segment of the stack represents the observed frequency of one ‘significant’ k-mer at that position.

Specification arguments:

  • reference_sequences_path (str): Path to a file containing the reference sequences. The file should contain one sequence per line, without a header, and without V or J genes.

  • p_values (list): The p value thresholds to be used by Fisher’s exact test. Each p-value specified here will become one panel in the output figure.

  • k_values (list): Length of the k-mers (number of amino acids) created by the KmerAbundanceEncoder. Each k-mer length will become one panel in the output figure.

  • label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.

YAML specification:

definitions:
    reports:
        my_significant_kmer_positions_report:
            SignificantKmerPositions:
                reference_sequences_path: path/to/reference/sequences.txt
                p_values:
                    - 0.1
                    - 0.01
                    - 0.001
                    - 0.0001
                k_values:
                    - 3
                    - 4
                    - 5
                label: # Define a label, and the positive class for that given label
                    CMV:
                        positive_class: +

SimpleDatasetOverview#

Generates a simple text-based overview of the properties of any dataset, including the dataset name, size, and metadata labels.

YAML specification:

definitions:
    reports:
        my_overview: SimpleDatasetOverview

VJGeneDistribution#

This report creates several plots to gain insight into the V and J gene distribution of a given dataset. When a label is provided, the information in the plots is separated per label value, either by color or by creating separate plots. This way one can, for example, see if a particular V or J gene is more prevalent across disease-associated receptors.

  • Individual V and J gene distributions: for sequence and receptor datasets, a bar plot is created showing how often each V or J gene occurs in the dataset. For repertoire datasets, boxplots are used to represent how often each V or J gene is used across all repertoires. Since repertoires may differ in size, these counts are normalised by the repertoire size (original count values are additionally exported in tsv files).

  • Combined V and J gene distributions: for sequence and receptor datasets, a heatmap is created showing how often each combination of V and J genes occurs in the dataset. A similar plot is created for repertoire datasets, except in this case only the average value for the normalised gene usage frequencies is shown (original count values are additionally exported in tsv files).

Specification arguments:

  • split_by_label (bool): Whether to split the plots by a label. If set to true, the Dataset must either contain a single label, or alternatively the label of interest can be specified under ‘label’. By default, split_by_label is False.

  • label (str): Optional label for separating the results by color/creating separate plots. Note that this should be the name of a valid dataset label.

YAML specification:

definitions:
    reports:
        my_vj_gene_report:
            VJGeneDistribution:
                label: ag_binding
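
The normalisation by repertoire size mentioned above can be illustrated with a small Python sketch. The gene names and the helper function are hypothetical, and repertoire size is taken here to be the number of sequences in the repertoire, which is an assumption.

```python
from collections import Counter


def normalised_gene_usage(repertoire_genes):
    """Return raw counts and counts divided by the repertoire size."""
    counts = Counter(repertoire_genes)
    size = len(repertoire_genes)
    return dict(counts), {gene: n / size for gene, n in counts.items()}


# Hypothetical V gene calls for two repertoires of different sizes.
repertoires = {
    "rep1": ["TRBV5-1", "TRBV5-1", "TRBV7-2", "TRBV9"],
    "rep2": ["TRBV5-1", "TRBV9"],
}
usage = {
    name: normalised_gene_usage(genes)[1]
    for name, genes in repertoires.items()
}
```

Although TRBV5-1 occurs twice in rep1 and once in rep2, its normalised usage is the same in both, which is what makes repertoires of different sizes comparable in the boxplots.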

Encoding reports#

Encoding reports show some type of features or statistics about an encoded dataset, or may in some cases export relevant sequences or tables.

When running the TrainMLModel instruction, encoding reports can be specified inside the ‘selection’ or ‘assessment’ specification under the key ‘reports/encoding’. Example:

my_instruction:
    type: TrainMLModel
    selection:
        reports:
            encoding:
                - my_encoding_report
        # other parameters...
    assessment:
        reports:
            encoding:
                - my_encoding_report
        # other parameters...
    # other parameters...

Alternatively, when running the ExploratoryAnalysis instruction, encoding reports can be specified under ‘report’. Example:

my_instruction:
    type: ExploratoryAnalysis
    analyses:
        my_first_analysis:
            report: my_encoding_report
            # other parameters...
    # other parameters...

DesignMatrixExporter#

Exports the design matrix and related information of a given encoded Dataset to csv files. If the encoded data has more than 2 dimensions (such as when using the OneHot encoder with option Flatten=False), the data are then exported to different formats to facilitate their import with external software.

Specification arguments:

  • file_format (str): the format and extension of the file to store the design matrix. The supported formats are: npy, csv, hdf5, npy.zip, csv.zip or hdf5.zip.

Note: when using hdf5 or hdf5.zip output formats, make sure the ‘hdf5’ dependency is installed.

YAML specification:

definitions:
    reports:
        my_dme_report:
            DesignMatrixExporter:
                file_format: csv

DimensionalityReduction#

This report visualizes the data obtained by dimensionality reduction.

Specification arguments:

  • label (str): name of the label to use for highlighting data points

YAML specification:

definitions:
    reports:
        rep1:
            DimensionalityReduction:
                label: epitope

FeatureComparison#

Compares the feature values in a given encoded data matrix across two values for a metadata label. These labels are specified in the metadata file for repertoire datasets, or as metadata columns for sequence and receptor datasets. Can be used in combination with any encoding and dataset type. This report produces a scatterplot, where each point represents one feature, and the values on the x and y axes are the average feature values across two subsets of the data. For example, when the KmerFrequency encoder is used, and the comparison_label is used to represent a disease (true/false), then the features are the k-mers (AAA, AAC, etc.) and their x and y positions in the scatterplot are determined by their frequency in the subsets of the data where disease=true and disease=false.

Optional metadata labels can be specified to divide the scatterplot into groups based on color, row facets or column facets.

Alternatively, when the feature values are of interest without comparing them between labelled subgroups of the data, please use FeatureValueBarplot or FeatureDistribution instead.

Specification arguments:

  • comparison_label (str): Mandatory label. This label is used to split the encoded data matrix and define the x and y axes of the plot. This label is only allowed to have 2 classes (for example: sick and healthy, binding and non-binding).

  • color_grouping_label (str): Optional label that is used to color the points in the scatterplot. This can not be the same as comparison_label.

  • row_grouping_label (str): Optional label that is used to group scatterplots into different row facets. This can not be the same as comparison_label.

  • column_grouping_label (str): Optional label that is used to group scatterplots into different column facets. This can not be the same as comparison_label.

  • show_error_bar (bool): Whether to show the error bar (standard deviation) for the points, both in the x and y dimension.

  • log_scale (bool): Whether to plot the x and y axes in log10 scale (log_scale = True) or linear scale (log_scale = False). By default, log_scale is False.

  • keep_fraction (float): The total number of features may be very large and only the features differing significantly across comparison labels may be of interest. When the keep_fraction parameter is set below 1, only the fraction of features that differs the most across comparison labels is kept for plotting (note that the produced .csv file still contains all data). By default, keep_fraction is 1, meaning that all features are plotted.

  • opacity (float): a value between 0 and 1 setting the opacity for data points making it easier to see if there are overlapping points

YAML specification:

definitions:
    reports:
        my_comparison_report:
            FeatureComparison: # compare the different classes defined in the label disease
                comparison_label: disease

FeatureDistribution#

Plots a boxplot for each feature in the encoded data matrix. Can be used in combination with any encoding and dataset type. Each boxplot represents a feature and shows the distribution of values for that feature. For example, when the KmerFrequency encoder is used, the features are the k-mers (AAA, AAC, etc.) and the feature values are the frequencies per k-mer.

Two modes can be used: in the ‘normal’ mode, there is one boxplot for each column of the encoded data matrix; in the ‘sparse’ mode, all zero cells are removed before the data is passed to the boxplots. If mode is set to ‘auto’, it is automatically set to ‘sparse’ when the density of the matrix is below 0.01.
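The mode selection can be sketched in a few lines of Python. This is an illustration under assumptions, not immuneML code: the function names are hypothetical, density is taken to mean the fraction of non-zero cells, and ‘auto’ is assumed to fall back to ‘normal’ when the matrix is not sparse enough.

```python
def matrix_density(matrix):
    """Fraction of non-zero cells in a list-of-rows matrix."""
    total = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for value in row if value != 0)
    return nonzero / total if total else 0.0


def resolve_mode(mode, matrix, density_threshold=0.01):
    """Map 'auto' to 'sparse' or 'normal'; other modes pass through unchanged."""
    if mode != "auto":
        return mode
    # Assumption: 'auto' resolves to 'normal' when the density is not below the threshold
    return "sparse" if matrix_density(matrix) < density_threshold else "normal"


def values_for_boxplot(column, mode):
    """In 'sparse' mode, drop zero cells before plotting a feature column."""
    return [v for v in column if v != 0] if mode == "sparse" else list(column)


dense_matrix = [[1, 0], [0, 2]]                 # density 0.5
sparse_matrix = [[0] * 100, [0] * 99 + [3]]     # density 1 / 200 = 0.005

print(resolve_mode("auto", dense_matrix), resolve_mode("auto", sparse_matrix))
```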

Optional metadata labels can be specified to divide the boxplots into groups based on color, row facets or column facets. These labels are specified in the metadata file for repertoire datasets, or as metadata columns for sequence and receptor datasets.

Alternatively, when only the mean feature values are of interest (as opposed to showing the complete distribution, as done here), please consider using FeatureValueBarplot instead. When comparing the feature values between two subsets of the data, please use FeatureComparison.

Specification arguments:

  • color_grouping_label (str): The label that is used to color each box.

  • row_grouping_label (str): The label that is used to group boxes into different row facets.

  • column_grouping_label (str): The label that is used to group boxes into different column facets.

  • mode (str): either ‘normal’, ‘sparse’ or ‘auto’ (default)

  • x_title (str): x-axis label

  • y_title (str): y-axis label

YAML specification:

definitions:
    reports:
        my_fdistr_report:
            FeatureDistribution:
                mode: sparse

FeatureValueBarplot#

Plots a barplot of the feature values in a given encoded data matrix, averaged across examples. Can be used in combination with any encoding and dataset type. Each bar in the barplot represents the mean value of a given feature, and along the x-axis are the different features. For example, when the KmerFrequency encoder is used, the features are the k-mers (AAA, AAC, etc.) and the feature values are the frequencies per k-mer.

Optional metadata labels can be specified to divide the barplot into groups based on color, row facets or column facets. In this case, the average feature values in each group are plotted. These labels are specified in the metadata file for repertoire datasets, or as metadata columns for sequence and receptor datasets.

Alternatively, when the distribution of feature values is of interest (as opposed to showing only the mean, as done here), please consider using FeatureDistribution instead. When comparing the feature values between two subsets of the data, please use FeatureComparison.

Specification arguments:

  • color_grouping_label (str): The label that is used to color each bar, at each level of the grouping_label.

  • row_grouping_label (str): The label that is used to group bars into different row facets.

  • column_grouping_label (str): The label that is used to group bars into different column facets.

  • show_error_bar (bool): Whether to show the error bar (standard deviation) for the bars.

  • x_title (str): x-axis label

  • y_title (str): y-axis label

  • plot_top_n (int): plot n of the largest features on average separately (useful when there are too many features to plot at the same time)

  • plot_bottom_n (int): plot n of the smallest features on average separately (useful when there are too many features to plot at the same time)

  • plot_all_features (bool): whether to plot all (might be slow for large number of features)

YAML specification:

definitions:
    reports:
        my_fvb_report:
            FeatureValueBarplot: # timepoint, disease_status and age_group are metadata labels
                column_grouping_label: timepoint
                row_grouping_label: disease_status
                color_grouping_label: age_group
                plot_all_features: true
                plot_top_n: 10
                plot_bottom_n: 5

GroundTruthMotifOverlap#

Creates a report displaying the overlap between learned motifs and groundtruth motifs implanted in a given sequence dataset. This report must be used in combination with the MotifEncoder.

Specification arguments:

  • groundtruth_motifs_path (str): Path to a .tsv file containing groundtruth position-specific motifs. The file should specify the motifs as position-specific amino acids, one column representing the positions concatenated with an ‘&’ symbol, the next column specifying the amino acids concatenated with ‘&’ symbol, and the last column specifying the implant rate.

    Example:

    indices    amino_acids    n_sequences
    0          A              4
    4&8&9      G&A&C          30

    This file shows a motif ‘A’ at position 0 implanted in 4 sequences, and motif G---AC implanted between positions 4 and 9 in 30 sequences.
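To make the file layout concrete, the following Python sketch parses the example table and renders each motif as a gapped string. It is only an illustration: the function names are hypothetical, the last column is read as an integer count as in the example, and the indices are assumed to be listed in increasing order.

```python
import csv
import io


def read_groundtruth_motifs(handle):
    """Parse rows of a ground-truth motif .tsv into (indices, amino_acids, n_sequences)."""
    motifs = []
    for row in csv.DictReader(handle, delimiter="\t"):
        indices = [int(i) for i in row["indices"].split("&")]
        amino_acids = row["amino_acids"].split("&")
        motifs.append((indices, amino_acids, int(row["n_sequences"])))
    return motifs


def gapped_motif(indices, amino_acids):
    """Render a motif as a string, with '-' for unspecified positions between residues."""
    by_position = dict(zip(indices, amino_acids))
    return "".join(by_position.get(pos, "-") for pos in range(indices[0], indices[-1] + 1))


# The example table from above, written as tab-separated text
tsv_text = "indices\tamino_acids\tn_sequences\n0\tA\t4\n4&8&9\tG&A&C\t30\n"
motifs = read_groundtruth_motifs(io.StringIO(tsv_text))

for indices, amino_acids, n_sequences in motifs:
    print(gapped_motif(indices, amino_acids), "from position", indices[0], "in", n_sequences, "sequences")
```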

YAML specification:

definitions:
    reports:
        my_ground_truth_motif_report:
            GroundTruthMotifOverlap:
                groundtruth_motifs_path: path/to/file.tsv

Matches#

Reports the number of matches that were found when using one of the following encoders: MatchedSequences, MatchedReceptors or MatchedRegex.

Report results are:

  • A table containing all matches, where the rows correspond to the Repertoires, and the columns correspond to the objects to match (regular expressions or receptor sequences).

  • The repertoire sizes (read frequencies and the number of unique sequences per repertoire), for each of the chains. This can be used to calculate the percentage of matched sequences in a repertoire.

  • When using MatchedSequences encoder or MatchedReceptors encoder, tables describing the chains and receptors (ids, chains, V and J genes and sequences).

  • When using MatchedReceptors encoder or using MatchedRegex encoder with chain pairs, tables describing the paired matches (where a match was found in both chains) per repertoire.

YAML specification:

definitions:
    reports:
        my_match_report: Matches

MotifTestSetPerformance#

This report can be used to show the performance of a set of motifs learned using the MotifEncoder on an independent test set of unseen data.

It is recommended to first run the report MotifGeneralizationAnalysis in order to calibrate the optimal recall thresholds and plot the performance of motifs on training- and validation sets.

Specification arguments:

  • test_dataset (dict): parameters for importing a SequenceDataset to use as an independent test set. By default, the import parameters ‘is_repertoire’ and ‘paired’ will be set to False to ensure a SequenceDataset is imported.

YAML specification:

definitions:
    reports:
        my_motif_report:
            MotifTestSetPerformance:
                test_dataset:
                    format: AIRR # choose any valid import format
                    params:
                        path: path/to/files/
                        is_repertoire: False  # is_repertoire must be False to import a SequenceDataset
                        paired: False         # paired must be False to import a SequenceDataset
                        # optional other parameters...

NonMotifSequenceSimilarity#

Plots the similarity of positions outside the motifs of interest. This report can be used to investigate whether the motifs of interest, as determined by the MotifEncoder, tend to occur in sequences that are naturally very similar or dissimilar.

For each motif, the subset of sequences containing the motif is selected, and the Hamming distances are computed between all sequences in this subset. Finally, a plot is created showing the distribution of Hamming distances between the sequences containing the motif. For motifs occurring in sets of very similar sequences, this distribution will lean towards small Hamming distances. Likewise, for motifs occurring in a very diverse set of sequences, the distribution will lean towards large Hamming distances.
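The procedure can be sketched in Python as follows. This is a simplified illustration rather than the report's implementation: the function names and toy sequences are hypothetical, and all sequences are assumed to have the same length. Because every sequence in the subset shares the motif positions, those positions contribute nothing to the distance, so only the positions outside the motif are effectively compared.

```python
from collections import Counter
from itertools import combinations


def hamming_distance(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def distance_distribution(sequences, motif_indices, motif_amino_acids):
    """Count pairwise Hamming distances among the sequences that contain the motif."""
    subset = [
        seq for seq in sequences
        if all(len(seq) > i and seq[i] == aa for i, aa in zip(motif_indices, motif_amino_acids))
    ]
    return Counter(hamming_distance(a, b) for a, b in combinations(subset, 2))


# Toy data: the motif is 'C' at position 0, so 'GASSA' is excluded
sequences = ["CASSA", "CASSG", "CATTG", "GASSA"]
distribution = distance_distribution(sequences, [0], ["C"])
print(sorted(distribution.items()))
```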

Specification arguments:

  • motif_color_map (dict): An optional mapping between motif sizes and colors. If no mapping is given, default colors will be chosen.

YAML specification:

definitions:
    reports:
        my_motif_sim:
            NonMotifSequenceSimilarity:
                motif_color_map:
                    3: "#66C5CC"
                    4: "#F6CF71"
                    5: "#F89C74"

PositionalMotifFrequencies#

This report must be used in combination with the MotifEncoder. Plots a stacked bar plot of amino acid occurrence at different indices in any given dataset, along with a plot investigating motif continuity which displays a bar plot of the gap sizes between the amino acids in the motifs in the given dataset. Note that a distance of 1 means that the amino acids are continuous (next to each other).
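The gap sizes shown in the continuity plot can be understood as the differences between consecutive motif positions. The sketch below illustrates this reading in Python; the function names and the example motifs are hypothetical and the code is not taken from immuneML.

```python
from collections import Counter


def gap_sizes(indices):
    """Distances between consecutive motif positions; 1 means adjacent amino acids."""
    return [later - earlier for earlier, later in zip(indices, indices[1:])]


def gap_size_counts(motifs):
    """Count how often each gap size occurs across a collection of motifs."""
    counts = Counter()
    for indices in motifs:
        counts.update(gap_sizes(indices))
    return counts


# Three hypothetical motifs, given by their position indices
motif_indices = [[4, 8, 9], [2, 3], [0]]
counts = gap_size_counts(motif_indices)
print(sorted(counts.items()))
```

A motif with a single position, such as the last one above, has no gaps and does not contribute to the counts.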

Specification arguments:

  • motif_color_map (dict): Optional mapping between motif lengths and specific colors to be used. Example:

    motif_color_map:
        1: "#66C5CC"
        2: "#F6CF71"
        3: "#F89C74"

YAML specification:

definitions:
    reports:
        my_pos_motif_report:
            PositionalMotifFrequencies:
                motif_color_map:
                    1: "#66C5CC"
                    2: "#F6CF71"
                    3: "#F89C74"

RelevantSequenceExporter#

Exports the sequences that are extracted as label-associated when using the SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder. The sequences are exported in AIRR-compliant format.

YAML specification:

definitions:
    reports:
        my_relevant_sequences: RelevantSequenceExporter

ML model reports#

ML model reports show some type of features or statistics about a single trained ML model.

In the TrainMLModel instruction, ML model reports can be specified inside the ‘selection’ or ‘assessment’ specification under the key ‘reports/models’. Example:

my_instruction:
    type: TrainMLModel
    selection:
        reports:
            models:
                - my_ml_report
        # other parameters...
    assessment:
        reports:
            models:
                - my_ml_report
        # other parameters...
    # other parameters...

BinaryFeaturePrecisionRecall#

Plots the precision and recall scores obtained after each feature is added to the collection of features selected by the BinaryFeatureClassifier.

YAML specification:

definitions:
    reports:
        my_report: BinaryFeaturePrecisionRecall

Coefficients#

A report that plots the coefficients for a given ML method in a barplot. Can be used for LogisticRegression, SVM, SVC, and RandomForestClassifier. In the case of RandomForestClassifier, the feature importances will be plotted.

When used in TrainMLModel instruction, the report can be specified under ‘models’, both on the selection and assessment levels.

Which coefficients should be plotted (for example: only nonzero, above a certain threshold, …) can be specified. Multiple options can be specified simultaneously. By default the 25 largest coefficients are plotted. The full set of coefficients will also be exported as a csv file.

Specification arguments:

  • coefs_to_plot (list): A list specifying which coefficients should be plotted. Valid values are: ALL, NONZERO, CUTOFF, N_LARGEST.

  • cutoff (list): If ‘cutoff’ is specified under ‘coefs_to_plot’, the cutoff values can be specified here. The coefficients which have an absolute value equal to or greater than the cutoff will be plotted.

  • n_largest (list): If ‘n_largest’ is specified under ‘coefs_to_plot’, the values for n can be specified here. These should be integer values. The n largest coefficients are determined based on their absolute values.
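The selection rules above can be summarized in a short Python sketch. It is an illustration only: select_coefficients and the toy coefficients are hypothetical, and the sketch handles a single cutoff or n value at a time rather than the lists accepted by the report.

```python
def select_coefficients(coefficients, option, cutoff=None, n_largest=None):
    """Select entries of a {feature: coefficient} dict according to one plotting option."""
    if option == "all":
        return dict(coefficients)
    if option == "nonzero":
        return {f: c for f, c in coefficients.items() if c != 0}
    if option == "cutoff":
        # Keep coefficients whose absolute value is equal to or greater than the cutoff
        return {f: c for f, c in coefficients.items() if abs(c) >= cutoff}
    if option == "n_largest":
        # Rank by absolute value, so large negative coefficients are kept as well
        ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)
        return dict(ranked[:n_largest])
    raise ValueError(f"Unknown option: {option}")


coefs = {"AAA": 0.5, "AAC": -0.2, "ACA": 0.0, "CAA": 0.05}

print(select_coefficients(coefs, "nonzero"))
print(select_coefficients(coefs, "cutoff", cutoff=0.1))
print(select_coefficients(coefs, "n_largest", n_largest=2))
```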

YAML specification:

definitions:
    reports:
        my_coef_report:
            Coefficients:
                coefs_to_plot:
                    - all
                    - nonzero
                    - cutoff
                    - n_largest
                cutoff:
                    - 0.1
                    - 0.01
                n_largest:
                    - 5
                    - 10

ConfounderAnalysis#

A report that plots the numbers of false positives and false negatives with respect to each value of the metadata features specified by the user. This allows checking whether a given machine learning model makes more misclassifications for some values of a metadata feature than for the others.
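The counting behind such a plot can be sketched as follows for a binary label. This is a hypothetical illustration in Python, not the report's code; the function name, the positive_class argument and the toy data are assumptions.

```python
from collections import Counter


def misclassifications_per_value(true_labels, predictions, metadata_values, positive_class=True):
    """Count false positives and false negatives for each value of a metadata feature."""
    false_positives, false_negatives = Counter(), Counter()
    for truth, predicted, value in zip(true_labels, predictions, metadata_values):
        if predicted == positive_class and truth != positive_class:
            false_positives[value] += 1
        elif predicted != positive_class and truth == positive_class:
            false_negatives[value] += 1
    return false_positives, false_negatives


# Toy example with five repertoires and the metadata feature 'sex'
truth = [True, False, False, True, True]
predicted = [True, True, True, False, True]
sex = ["F", "F", "M", "M", "M"]

fp, fn = misclassifications_per_value(truth, predicted, sex)
print(dict(fp), dict(fn))
```

If the errors are concentrated in one metadata value, as the false negatives are here, the metadata feature may be acting as a confounder.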

Specification arguments:

  • metadata_labels (list): A list of the metadata features to use as a basis for the calculations

YAML specification:

definitions:
    reports:
        my_confounder_report:
            ConfounderAnalysis:
                metadata_labels:
                  - age
                  - sex

DeepRCMotifDiscovery#

This report plots the contributions of (i) input sequences and (ii) kernels to a trained DeepRC model with respect to the test dataset. Contributions are computed using integrated gradients (IG). This report produces two figures:

  • inputs_integrated_gradients: Shows the contributions of the characters within the input sequences (test dataset) that were most important for immune status prediction of the repertoire. IG is only applied to sequences of positive class repertoires.

  • kernel_integrated_gradients: Shows the 1D CNN kernels with the highest contribution over all positions and amino acids.

For both inputs and kernels: Larger characters in the extracted motifs indicate higher contribution, with blue indicating positive contribution and red indicating negative contribution towards the prediction of the immune status. For kernels only: contributions to positional encoding are indicated by < (beginning of sequence), ∧ (center of sequence), and > (end of sequence).

See DeepRCMotifDiscovery for repertoire classification for a usage example.

Reference:

Widrich, M., et al. (2020). Modern Hopfield Networks and Attention for Immune Repertoire Classification. Advances in Neural Information Processing Systems, 33. https://proceedings.neurips.cc//paper/2020/hash/da4902cb0bc38210839714ebdcf0efc3-Abstract.html

Specification arguments:

  • n_steps (int): Number of IG steps (more steps -> better path integral -> finer contribution values). 50 is usually good enough.

  • threshold (float): Only applies to the plotting of kernels. Contributions are normalized to range [0, 1], and only kernels with normalized contributions above threshold are plotted.
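To illustrate why a larger n_steps yields finer contribution values, the sketch below approximates integrated gradients for a toy function with a midpoint Riemann sum along the straight path from a baseline to the input. It is a generic illustration of the technique, not the DeepRC code; the function names, the toy function and the all-zero baseline are assumptions.

```python
def integrated_gradients(gradient_fn, inputs, baseline, n_steps=50):
    """Approximate IG with a midpoint Riemann sum over n_steps points on the straight path."""
    totals = [0.0] * len(inputs)
    for step in range(n_steps):
        alpha = (step + 0.5) / n_steps
        point = [b + alpha * (x - b) for x, b in zip(inputs, baseline)]
        totals = [t + g for t, g in zip(totals, gradient_fn(point))]
    # Scale the averaged gradient by the distance from the baseline
    return [(x - b) * t / n_steps for x, b, t in zip(inputs, baseline, totals)]


def f(x):
    """Toy model output: sum of cubes."""
    return sum(v ** 3 for v in x)


def grad_f(x):
    """Analytic gradient of f."""
    return [3 * v ** 2 for v in x]


inputs, baseline = [1.0, 2.0], [0.0, 0.0]
coarse = integrated_gradients(grad_f, inputs, baseline, n_steps=2)
fine = integrated_gradients(grad_f, inputs, baseline, n_steps=50)

# The contributions should sum to f(inputs) - f(baseline)
target = f(inputs) - f(baseline)
print(abs(sum(coarse) - target), abs(sum(fine) - target))
```

With more steps, the sum of the contributions gets closer to the difference between the output at the input and the output at the baseline.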

YAML specification:

definitions:
    reports:
        my_deeprc_report:
            DeepRCMotifDiscovery:
                threshold: 0.5
                n_steps: 50

MotifSeedRecovery#

This report can be used to show how well implanted motifs (for example, through the Simulation instruction) can be recovered by various machine learning methods using the k-mer encoding. This report creates a boxplot, where the x axis (box grouping) represents the maximum possible overlap between an implanted motif seed and a kmer feature (measured in number of positions), and the y axis shows the coefficient size of the respective kmer feature. If the machine learning method has learned the implanted motif seeds, the coefficient size is expected to be largest for the kmer features with high overlap to the motif seeds.

Note that to use this report, the following criteria must be met:

  • KmerFrequencyEncoder must be used.

  • One of the following classifiers must be used: RandomForestClassifier, LogisticRegression, SVM, SVC

  • For each label, the implanted motif seeds relevant to that label must be specified

To find the overlap score between kmer features and implanted motif seeds, the two sequences are compared in a sliding window approach, and the maximum overlap is calculated.

Overlap scores between kmer features and implanted motifs are calculated differently based on the Hamming distance that was allowed during implanting.

Without hamming distance:
Seed:     AAA  -> score = 3
Feature: xAAAx
          ^^^

Seed:     AAA  -> score = 0
Feature: xAAxx

With hamming distance:
Seed:     AAA  -> score = 3
Feature: xAAAx
          ^^^

Seed:     AAA  -> score = 2
Feature: xAAxx
          ^^

Furthermore, gap positions in the motif seed are ignored:
Seed:     A/AA  -> score = 3
Feature: xAxAAx
          ^/^^

See Recovering simulated immune signals for more details and an example plot.
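The scoring examples above can be reproduced with the following Python sketch. It is a simplified illustration under assumptions, not the report's implementation: overlap_score is a hypothetical name, only placements where the seed lies fully inside the feature are considered, and a gapped seed is expanded with a single gap size passed by the caller.

```python
def overlap_score(seed, feature, hamming_distance, gap_size=0):
    """Maximum number of matching positions between a motif seed and a k-mer feature."""
    # Expand the gap marker '/' into gap_size ignored positions
    pattern = seed.replace("/", "/" * gap_size)
    best = 0
    for start in range(len(feature) - len(pattern) + 1):
        window = feature[start:start + len(pattern)]
        # Gap positions are ignored when counting matches
        informative = [(p, w) for p, w in zip(pattern, window) if p != "/"]
        matches = sum(p == w for p, w in informative)
        # Without Hamming distance, only a complete match counts
        if hamming_distance or matches == len(informative):
            best = max(best, matches)
    return best


print(overlap_score("AAA", "xAAAx", hamming_distance=False))
print(overlap_score("AAA", "xAAxx", hamming_distance=False))
print(overlap_score("AAA", "xAAxx", hamming_distance=True))
print(overlap_score("A/AA", "xAxAAx", hamming_distance=False, gap_size=1))
```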

Specification arguments:

  • implanted_motifs_per_label (dict): a nested dictionary that specifies the motif seeds that were implanted in the given dataset. The first level of keys in this dictionary represents the different labels. In the inner dictionary, the following keys can be specified: “seeds”, “hamming_distance” and “gap_sizes”:

    • seeds: a list of motif seeds. The seeds may contain gaps, specified by a ‘/’ symbol.

    • hamming_distance: A boolean value that specifies whether hamming distance was allowed when implanting the motif seeds for a given label. Note that this applies to all seeds for this label.

    • gap_sizes: a list of all the possible gap sizes that were used when implanting a gapped motif seed. When no gapped seeds are used, this value has no effect.

YAML specification:

definitions:
    reports:
        my_motif_report:
            MotifSeedRecovery:
                implanted_motifs_per_label:
                    CD:
                        seeds:
                        - AA/A
                        - AAA
                        hamming_distance: False
                        gap_sizes:
                        - 0
                        - 1
                        - 2
                    T1D:
                        seeds:
                        - CC/C
                        - CCC
                        hamming_distance: True
                        gap_sizes:
                        - 2

ROCCurve#

A report that plots the ROC curve for a binary classifier.

YAML specification:

definitions:
    reports:
        my_roc_report: ROCCurve

SequenceAssociationLikelihood#

Plots the beta distribution used as a prior for class assignment in ProbabilisticBinaryClassifier. The distribution plotted shows the probability that a sequence is associated with a given class for a label.

YAML specification:

definitions:
    reports:
        my_sequence_assoc_report: SequenceAssociationLikelihood

TCRdistMotifDiscovery#

The report for discovering motifs in paired immune receptor data of given specificity based on TCRdist3. The receptors are hierarchically clustered based on the tcrdist distance and then motifs are discovered for each cluster. The report outputs logo plots for the motifs along with the raw data used for plotting in csv format.

For the implementation, TCRdist3 library was used (source code available here). More details on the functionality used for this report are available here.

Original publications:

Dash P, Fiore-Gartland AJ, Hertz T, et al. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature. 2017; 547(7661):89-93. doi:10.1038/nature22383

Mayer-Blackwell K, Schattgen S, Cohen-Lavi L, et al. TCR meta-clonotypes for biomarker discovery with tcrdist3: quantification of public, HLA-restricted TCR biomarkers of SARS-CoV-2 infection. bioRxiv. Published online December 26, 2020:2020.12.24.424260. doi:10.1101/2020.12.24.424260

Specification arguments:

  • positive_class_name (str): the class value (e.g., epitope) used to select only the receptors that are specific to the given epitope so that only those sequences are used to infer motifs; the reference receptors as required by TCRdist will be the ones from the dataset that have different or no epitope specified in their metadata; if the labels are available only on the epitope level (e.g., label is “AVFDRKSDAK” and classes are True and False), then here it should be specified that only the receptors with value “True” for label “AVFDRKSDAK” should be used; there is no default value for this argument

  • cores (int): number of processes to use for the computation of the distance and motifs

  • min_cluster_size (int): the minimum size of the cluster to discover the motifs for

  • use_reference_sequences (bool): when showing motifs, this parameter defines if reference sequences should be provided as well as a background

YAML specification:

definitions:
    reports:
        my_tcr_dist_report: # user-defined name
            TCRdistMotifDiscovery:
                positive_class_name: True # class name, could also be epitope name, depending on how it's defined in the dataset
                cores: 4
                min_cluster_size: 30
                use_reference_sequences: False

TrainingPerformance#

A report that plots the evaluation metrics for the performance of a given machine learning model on the training dataset. The available metrics are accuracy, balanced_accuracy, confusion_matrix, f1_micro, f1_macro, f1_weighted, precision, recall, auc and log_loss (see immuneML.environment.Metric.Metric).

Specification arguments:

  • metrics (list): A list of metrics used to evaluate training performance. See immuneML.environment.Metric.Metric for available options.

YAML specification:

definitions:
    reports:
        my_performance_report:
            TrainingPerformance:
                metrics:
                    - accuracy
                    - balanced_accuracy
                    - confusion_matrix
                    - f1_micro
                    - f1_macro
                    - f1_weighted
                    - precision
                    - recall
                    - auc
                    - log_loss

Train ML model reports#

Train ML model reports plot general statistics or export data of multiple models simultaneously when running the TrainMLModel instruction.

In the TrainMLModel instruction, train ML model reports can be specified under ‘reports’. Example:

my_instruction:
    type: TrainMLModel
    reports:
        - my_train_ml_model_report
    # other parameters...

CVFeaturePerformance#

This report plots the average training vs. test performance with respect to a given encoding parameter, which is explicitly set in the feature attribute. It can only be used in combination with the TrainMLModel instruction and can only be specified under ‘reports’.

Specification arguments:

  • feature (str): name of the encoder parameter with respect to which the performance across training and test will be shown. Possible values depend on the encoder on which it is used.

  • is_feature_axis_categorical (bool): if the x-axis of the plot where features are shown should be categorical; alternatively it is automatically determined based on the feature values

YAML specification:

definitions:
    reports:
        report1:
            CVFeaturePerformance:
                feature: p_value_threshold # parameter value of SequenceAbundance encoder
                is_feature_axis_categorical: True # show x-axis as categorical

DiseaseAssociatedSequenceCVOverlap#

DiseaseAssociatedSequenceCVOverlap report makes one heatmap per label showing the overlap of disease-associated sequences (or k-mers) produced by the SequenceAbundanceEncoder, CompAIRRSequenceAbundanceEncoder or KmerAbundanceEncoder between folds of cross-validation (either inner or outer loop of the nested CV). The overlap is computed by the following equation:

\[overlap(X,Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)} \times 100\]

For details, see Greiff V, Menzel U, Miho E, et al. Systems Analysis Reveals High Genetic and Antigen-Driven Predetermination of Antibody Repertoires throughout B Cell Development. Cell Reports. 2017;19(7):1467-1478. doi:10.1016/j.celrep.2017.04.054.
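The overlap equation translates directly into Python. The sketch below is illustrative only; the function name, the handling of empty sets and the toy folds are assumptions, not part of immuneML.

```python
def overlap(x, y):
    """Percentage overlap of two collections, relative to the smaller one."""
    x, y = set(x), set(y)
    if not x or not y:
        # Assumption: the overlap with an empty set is reported as 0
        return 0.0
    return len(x & y) / min(len(x), len(y)) * 100


# Hypothetical disease-associated sequences found in three CV folds
folds = {
    "fold_1": {"CASS", "CAST", "CATT"},
    "fold_2": {"CASS", "CAST", "CAAA", "CGGG"},
    "fold_3": {"CASS", "CAST"},
}

# Pairwise overlaps, as they would fill the cells of a heatmap
heatmap = {(a, b): overlap(folds[a], folds[b]) for a in folds for b in folds}
print(round(heatmap[("fold_1", "fold_2")], 1), heatmap[("fold_2", "fold_3")])
```

Because the denominator is the size of the smaller set, a set that is fully contained in a larger one gives an overlap of 100.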

Specification arguments:

  • compare_in_selection (bool): whether to compute the overlap over the inner loop of the nested CV - the sequence overlap is shown across CV folds for the model chosen as optimal within that selection

  • compare_in_assessment (bool): whether to compute the overlap over the optimal models in the outer loop of the nested CV

YAML specification:

definitions:
    reports:
        my_overlap_report:
            DiseaseAssociatedSequenceCVOverlap:
                compare_in_selection: True
                compare_in_assessment: True

MLSettingsPerformance#

Report for TrainMLModel instruction: plots the performance for each of the setting combinations as defined under ‘settings’ in the assessment (outer validation) loop.

The performances are grouped by label (horizontal panels), encoding (vertical panels) and ML method (bar color). When multiple data splits are used, the average performance over the data splits is shown with an error bar representing the standard deviation.

This report can be used only with TrainMLModel instruction under ‘reports’.

Specification arguments:

  • single_axis_labels (bool): whether to use single axis labels. Note that using single axis labels makes the figure unsuited for rescaling, as the label position is given in a fixed distance from the axis. By default, single_axis_labels is False, resulting in standard plotly axis labels.

  • x_label_position (float): if single_axis_labels is True, this should be a number specifying the x axis label position relative to the x axis. The default value for x_label_position is -0.1.

  • y_label_position (float): same as x_label_position, but for the y-axis.

YAML specification:

definitions:
    reports:
        my_hp_report: MLSettingsPerformance

ROCCurveSummary#

This report plots ROC curves for all trained ML settings ([preprocessing], encoding, ML model) in the outer loop of cross-validation in the TrainMLModel instruction. If there are multiple splits in the outer loop, this report will make one plot per split. This report is defined only for binary classification. If there are multiple labels defined in the instruction, each label has to have two classes to be included in this report.

YAML specification:

definitions:
    reports:
        my_roc_summary_report: ROCCurveSummary

ReferenceSequenceOverlap#

The ReferenceSequenceOverlap report compares a list of disease-associated sequences (or k-mers) produced by the SequenceAbundanceEncoder, CompAIRRSequenceAbundanceEncoder or KmerAbundanceEncoder to a list of reference sequences. It outputs a Venn diagram and a list of sequences found both in the encoder and reference list.

The report compares the sequences by their sequence content and the additional comparison_attributes (such as V or J gene), as specified by the user.

Specification arguments:

  • reference_path (str): path to the reference file in csv format which contains one entry per row and has columns that correspond to the attributes listed under comparison_attributes argument

  • comparison_attributes (list): list of attributes to use for comparison; all of them have to be present in the reference file where they should be the names of the columns

  • label (str): name of the label for which the reference sequences/k-mers should be compared to the model; if none, it takes the one label from the instruction; if it is none and multiple labels were specified for the instruction, the report will not be generated
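The comparison by sequence content and additional attributes can be pictured as a set intersection over tuples of the chosen attributes. The following Python sketch shows this idea; it is not the report's implementation, and the function name and toy entries are hypothetical.

```python
def sequence_overlap(model_sequences, reference_sequences, comparison_attributes):
    """Split entries into shared, model-only and reference-only sets of attribute tuples."""
    def key(entry):
        return tuple(entry[attribute] for attribute in comparison_attributes)

    model_keys = {key(entry) for entry in model_sequences}
    reference_keys = {key(entry) for entry in reference_sequences}
    return model_keys & reference_keys, model_keys - reference_keys, reference_keys - model_keys


model = [
    {"sequence_aa": "CASSA", "v_call": "TRBV1", "j_call": "TRBJ1"},
    {"sequence_aa": "CASSG", "v_call": "TRBV2", "j_call": "TRBJ1"},
]
reference = [
    {"sequence_aa": "CASSA", "v_call": "TRBV1", "j_call": "TRBJ1"},
    {"sequence_aa": "CASSG", "v_call": "TRBV9", "j_call": "TRBJ1"},
]

strict = sequence_overlap(model, reference, ["sequence_aa", "v_call", "j_call"])
loose = sequence_overlap(model, reference, ["sequence_aa"])
print(len(strict[0]), len(loose[0]))
```

The three returned sets correspond to the regions of a Venn diagram. Adding attributes makes the comparison stricter, so the shared set can only shrink.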

YAML specification:

definitions:
    reports:
        my_reference_overlap_report:
            ReferenceSequenceOverlap:
                reference_path: reference_sequences.csv  # example usage with SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder
                comparison_attributes:
                    - sequence_aa
                    - v_call
                    - j_call
        my_reference_overlap_report_with_kmers:
            ReferenceSequenceOverlap:
                reference_path: reference_kmers.csv  # example usage with KmerAbundanceEncoder
                comparison_attributes:
                    - k-mer

Multi dataset reports#

Multi dataset reports are special reports that can be specified when running immuneML with the MultiDatasetBenchmarkTool. See Manuscript use case 1: Robustness assessment for an example.

When running the MultiDatasetBenchmarkTool, multi dataset reports can be specified under ‘benchmark_reports’. Example:

my_instruction:
    type: TrainMLModel
    benchmark_reports:
        - my_benchmark_report
    # other parameters...

DiseaseAssociatedSequenceOverlap#

DiseaseAssociatedSequenceOverlap report makes a heatmap showing the overlap of disease-associated sequences (or k-mers) produced by the SequenceAbundanceEncoder, CompAIRRSequenceAbundanceEncoder or KmerAbundanceEncoder between multiple datasets of different sizes (different number of repertoires per dataset).

This plot can be used only with MultiDatasetBenchmarkTool.

The overlap is computed by the following equation:

\[overlap(X,Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)} \times 100\]

For details, see: Greiff V, Menzel U, Miho E, et al. Systems Analysis Reveals High Genetic and Antigen-Driven Predetermination of Antibody Repertoires throughout B Cell Development. Cell Reports. 2017;19(7):1467-1478. doi:10.1016/j.celrep.2017.04.054.

YAML specification:

definitions:
    reports:
        my_overlap_report: DiseaseAssociatedSequenceOverlap # report has no parameters

PerformanceOverview#

PerformanceOverview report creates an ROC plot and a precision-recall plot for optimal trained models on multiple datasets. The labels on the plots are the names of the datasets, so it is advisable to give the datasets user-friendly names, which must still consist only of letters, numbers and the underscore sign.

This report can be used only with MultiDatasetBenchmarkTool as it will plot ROC and PR curve for trained models across datasets. Also, it requires the task to be immune repertoire classification and cannot be used for receptor or sequence classification. Furthermore, it uses predictions on the test dataset to assess the performance and plot the curves. If the parameter refit_optimal_model is set to True, all data will be used to fit the optimal model, so there will not be a test dataset which can be used to assess performance and the report will not be generated.

If datasets have the same number of examples, the baseline PR curve will be plotted as described in this publication: Saito T, Rehmsmeier M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS ONE. 2015;10(3):e0118432. doi:10.1371/journal.pone.0118432

If the datasets have different numbers of examples, the baseline PR curve will not be plotted.

YAML specification:

definitions:
    reports:
        my_performance_report: PerformanceOverview

Preprocessings#

Under the definitions/preprocessing_sequences component, the user can specify different preprocessing steps to apply to a dataset before performing an analysis. This is optional.

ChainRepertoireFilter#

Removes all repertoires from the RepertoireDataset object which contain at least one sequence with a chain different from the one specified by the “keep_chain” parameter. Note that this filter filters out repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets.

Since the filter removes repertoires from the dataset (examples in machine learning setting), it cannot be used with TrainMLModel instruction. If you want to filter out repertoires including a given chain, see DatasetExport instruction with preprocessing.

Specification arguments:

  • keep_chain (str): Which chain should be kept, valid values are “TRA”, “TRB”, “IGH”, “IGL”, “IGK”

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            ChainRepertoireFilter:
                keep_chain: TRB

ClonesPerRepertoireFilter#

Removes all repertoires from the RepertoireDataset that contain fewer clonotypes than specified by lower_limit, or more clonotypes than specified by upper_limit. Note that this filter filters out repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets. When a lower or upper limit is not specified, or is set to -1, that limit is ignored.

Since the filter removes repertoires from the dataset (examples in a machine learning setting), it cannot be used with the TrainMLModel instruction. If you want to use this filter, see the DatasetExport instruction with preprocessing.

Specification arguments:

  • lower_limit (int): The inclusive lower limit for the number of clonotypes allowed in a repertoire.

  • upper_limit (int): The inclusive upper limit for the number of clonotypes allowed in a repertoire.

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            ClonesPerRepertoireFilter:
                lower_limit: 100
                upper_limit: 100000
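
The limit semantics described above (inclusive bounds, with -1 disabling a bound) can be sketched as follows. This is a minimal illustration and not immuneML's implementation. The function keep_repertoire and the example counts are hypothetical.

```python
def keep_repertoire(clonotype_count, lower_limit=-1, upper_limit=-1):
    """Return True if the count lies within both inclusive limits; -1 disables a limit."""
    if lower_limit != -1 and clonotype_count < lower_limit:
        return False
    if upper_limit != -1 and clonotype_count > upper_limit:
        return False
    return True


# hypothetical clonotype counts per repertoire
counts = {"rep1": 50, "rep2": 100, "rep3": 5000, "rep4": 250000}
kept = [name for name, n in counts.items() if keep_repertoire(n, 100, 100000)]
print(kept)  # ['rep2', 'rep3']
```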

CountPerSequenceFilter#

Removes all sequences from a Repertoire when they have a count below low_count_limit, or sequences with no count value if remove_without_count is True. This filter can be applied to Repertoires and RepertoireDatasets.

Specification arguments:

  • low_count_limit (int): The inclusive minimal count value in order to retain a given sequence.

  • remove_without_count (bool): Whether the sequences without a reported count value should be removed.

  • remove_empty_repertoires (bool): Whether repertoires without sequences should be removed. Only has an effect when remove_without_count is also set to True. If this is true, this preprocessing cannot be used with TrainMLModel instruction, but only with DatasetExport instruction instead.

  • batch_size (int): number of repertoires that can be loaded at the same time (only affects the speed when applying this filter on a RepertoireDataset)

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            CountPerSequenceFilter:
                remove_without_count: True
                remove_empty_repertoires: True
                low_count_limit: 3
                batch_size: 4
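
As a rough sketch of how low_count_limit and remove_without_count interact, the snippet below filters a list of sequence records. It assumes that a missing count is represented as None and that records are plain dictionaries. Both are illustrative choices, and filter_by_count is a hypothetical helper rather than an immuneML function.

```python
def filter_by_count(sequences, low_count_limit, remove_without_count=False):
    """Keep sequences with count >= low_count_limit; handle missing counts as configured."""
    kept = []
    for seq in sequences:
        count = seq.get("count")
        if count is None:
            # no reported count: keep only if the user did not ask to remove these
            if not remove_without_count:
                kept.append(seq)
        elif count >= low_count_limit:
            # the limit is inclusive
            kept.append(seq)
    return kept


# hypothetical repertoire content
repertoire = [
    {"sequence_aa": "CASSL", "count": 5},
    {"sequence_aa": "CASRG", "count": 2},
    {"sequence_aa": "CAWSV", "count": None},
]
print([s["sequence_aa"] for s in filter_by_count(repertoire, 3, remove_without_count=True)])
# ['CASSL']
```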

DuplicateSequenceFilter#

Collapses duplicate nucleotide or amino acid sequences within each repertoire in the given RepertoireDataset. This filter can be applied to Repertoires and RepertoireDatasets.

Sequences are considered duplicates if the following fields are identical:

  • amino acid or nucleotide sequence (whichever is specified)

  • v and j genes (note that the full field including subgroup + gene is used for matching, i.e. V1 and V1-1 are not considered duplicates)

  • chain

  • region type

For all other fields (the non-specified sequence type, custom lists, sequence identifier) only the first occurring value is kept.

Note that this means the count value of a sequence with a given sequence identifier might not be the same as before removing duplicates, unless count_agg = FIRST is used.

Specification arguments:

  • filter_sequence_type (SequenceType): Whether the sequences should be collapsed on the nucleotide or amino acid level. Valid values are: [‘amino_acid’, ‘nucleotide’].

  • batch_size (int): number of repertoires that can be loaded at the same time (only affects the speed)

  • count_agg (CountAggregationFunction): determines how the sequence counts of duplicate sequences are aggregated. Valid values are: [‘sum’, ‘max’, ‘min’, ‘mean’, ‘first’, ‘last’].

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            DuplicateSequenceFilter:
                # required parameters:
                filter_sequence_type: AMINO_ACID
                # optional parameters (if not specified the values below will be used):
                batch_size: 4
                count_agg: SUM
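
The collapsing rule described above can be sketched in plain Python: records are grouped on the chosen sequence type, V and J genes, chain and region type, their counts are aggregated, and every other field keeps its first occurring value. The field names (sequence_aa, sequence, v_call, j_call, chain, region_type, count) and the example records are assumptions for illustration, and collapse_duplicates is not an immuneML function.

```python
from statistics import mean

# aggregation functions named after the valid count_agg values
AGGREGATORS = {
    "sum": sum, "max": max, "min": min, "mean": mean,
    "first": lambda counts: counts[0], "last": lambda counts: counts[-1],
}


def collapse_duplicates(sequences, filter_sequence_type="amino_acid", count_agg="sum"):
    """Collapse records that share sequence, genes, chain and region type."""
    seq_field = "sequence_aa" if filter_sequence_type == "amino_acid" else "sequence"
    groups = {}  # dicts preserve insertion order, so first occurrences stay first
    for seq in sequences:
        key = (seq[seq_field], seq["v_call"], seq["j_call"], seq["chain"], seq["region_type"])
        groups.setdefault(key, []).append(seq)

    collapsed = []
    for duplicates in groups.values():
        merged = dict(duplicates[0])  # all other fields: first occurring value
        merged["count"] = AGGREGATORS[count_agg]([d["count"] for d in duplicates])
        collapsed.append(merged)
    return collapsed


# hypothetical records: s1 and s2 are amino acid duplicates, s3 differs in its V gene field
base = {"sequence_aa": "CASSL", "j_call": "TRBJ1", "chain": "TRB", "region_type": "IMGT_CDR3"}
rows = [
    {**base, "sequence_id": "s1", "sequence": "TGTGCC", "v_call": "TRBV1", "count": 3},
    {**base, "sequence_id": "s2", "sequence": "TGCGCC", "v_call": "TRBV1", "count": 5},
    {**base, "sequence_id": "s3", "sequence": "TGTGCC", "v_call": "TRBV1-1", "count": 2},
]
for row in collapse_duplicates(rows):
    print(row["sequence_id"], row["count"])
```

With count_agg set to sum, the record that keeps identifier s1 ends up with the combined count of s1 and s2. This is why the count for a given sequence identifier can change unless first is used.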

MetadataRepertoireFilter#

Removes repertoires from a RepertoireDataset based on information stored in the metadata_file. Note that this filters out repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets.

Since this filter changes the number of repertoires (examples for the machine learning task), it cannot be used with TrainMLModel instruction. To filter out repertoires, use preprocessing from the DatasetExport instruction that will create a new dataset ready to be used for training machine learning models.

Specification arguments:

  • criteria (dict): a nested dictionary that specifies the criteria for keeping certain repertoires, based on the values in columns of the metadata_file. See CriteriaMatcher for a more detailed explanation.

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            # Example filter that keeps repertoires with values greater than 1 in the "my_column_name" column of the metadata_file
            MetadataRepertoireFilter:
                type: GREATER_THAN
                value:
                    type: COLUMN
                    name: my_column_name
                threshold: 1

ReferenceSequenceAnnotator#

Annotates each sequence in each repertoire if it matches any of the reference sequences provided as an input parameter. This preprocessing uses CompAIRR internally. To match CDR3 sequences (and not JUNCTION), CompAIRR v1.10 or later is needed.

Specification arguments:

  • reference_sequences (dict): A dictionary describing the reference dataset file. Import should be specified the same way as regular dataset import. It is only allowed to import a receptor dataset here (i.e., is_repertoire is False and paired is True by default, and these are not allowed to be changed).

  • max_edit_distance (int): The maximum edit distance between a target sequence (from the repertoire) and the reference sequence.

  • compairr_path (str): optional path to the CompAIRR executable. If not given, it is assumed that CompAIRR has been installed such that it can be called directly on the command line with the command ‘compairr’, or that it is located at /usr/local/bin/compairr.

  • threads (int): how many threads to be used by CompAIRR for sequence matching

  • ignore_genes (bool): Whether to ignore V and J gene information. If False, the V and J genes between two receptor chains have to match. If True, gene information is ignored. By default, ignore_genes is False.

  • output_column_name (str): in case there are multiple annotations, it is possible here to define the name of the column in the output repertoire files for this specific annotation

  • repertoire_batch_size (int): how many repertoires to process simultaneously; depending on the repertoire size, this parameter can be used to limit the memory usage

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - step1:
            ReferenceSequenceAnnotator:
                reference_sequences:
                    format: VDJDB
                    params:
                        path: path/to/file.csv
                compairr_path: optional/path/to/compairr
                ignore_genes: False
                max_edit_distance: 0
                output_column_name: matched
                threads: 4
                repertoire_batch_size: 5

SequenceLengthFilter#

Removes sequences with length out of the predefined range.

Specification arguments:

  • sequence_type (SequenceType): Whether the sequences should be filtered on the nucleotide or amino acid level. Valid options are defined by the SequenceType enum.

  • min_len (int): minimum length of the sequence (sequences shorter than min_len will be removed); to not use min_len, set it to -1

  • max_len (int): maximum length of the sequence (sequences longer than max_len will be removed); to not use max_len, set it to -1

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            SequenceLengthFilter:
                sequence_type: AMINO_ACID
                min_len: 3 # -> remove all sequences shorter than 3
                max_len: -1 # -> no upper bound on the sequence length

SubjectRepertoireCollector#

Merges all the Repertoires in a RepertoireDataset that have the same ‘subject_id’ specified in the metadata. The result is a RepertoireDataset with one Repertoire per subject. This preprocessing cannot be used in combination with TrainMLModel instruction because it can change the number of examples. To combine the repertoires in this way, use this preprocessing with DatasetExport instruction.

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter: SubjectRepertoireCollector

Simulation#

Under the definitions/simulation component, the user can specify parameters necessary for simulating synthetic immune signals into an AIRR dataset. See also Dataset simulation with LIgO.

Motifs#

Motifs are the objects which are implanted into sequences during simulation. They are defined under definitions/motifs. There are several different motif types, each having their own parameters.

SeedMotif#

Describes motifs by seed, possible gaps, allowed hamming distances, positions that can be changed and what they can be changed to.

Specification arguments:

  • seed (str): An amino acid sequence that represents the basic motif seed. All implanted motifs correspond to the seed, or a modified version thereof, as specified in its instantiation strategy. If this argument is set, seed_chain1 and seed_chain2 arguments are not used.

  • min_gap (int): The minimum gap length, in case the original seed contains a gap.

  • max_gap (int): The maximum gap length, in case the original seed contains a gap.

  • hamming_distance_probabilities (dict): The probability of modifying the given seed with each number of modifications. The keys represent the number of modifications (hamming distance) between the original seed and the implanted motif, and the values represent the probabilities for the respective number of modifications. For example {0: 0.7, 1: 0.3} means that 30% of the time one position will be modified, and the remaining 70% of the time the motif will remain unmodified with respect to the seed. The values of hamming_distance_probabilities must sum to 1.

  • position_weights (dict): A dictionary containing the relative probabilities of choosing each position for hamming distance modification. The keys represent the position in the seed, where counting starts at 0. If the index of a gap is specified in position_weights, it will be removed. The values represent the relative probabilities for modifying each position when it gets selected for modification. For example {0: 0.6, 1: 0, 2: 0.4} means that when a sequence is selected for a modification (as specified in hamming_distance_probabilities), then 60% of the time the amino acid at index 0 is modified, and the remaining 40% of the time the amino acid at index 2. If the values of position_weights do not sum to 1, the remainder will be redistributed over all positions, including those not specified.

  • alphabet_weights (dict): A dictionary describing the relative probabilities of choosing each amino acid for hamming distance modification. The keys of the dictionary represent the amino acids and the values are the relative probabilities for choosing this amino acid. If the values of alphabet_weights do not sum to 1, the remainder will be redistributed over all possible amino acids, including those not specified.

YAML specification:

definitions:
    motifs:
        # examples for single chain receptor data
        my_simple_motif: # this will be the identifier of the motif
            seed: AAA # motif is always AAA
        my_gapped_motif:
            seed: AA/A # this motif can be AAA, AA_A, CAA, CA_A, DAA, DA_A, EAA, EA_A
            min_gap: 0
            max_gap: 1
            hamming_distance_probabilities: # it can have a max of 1 substitution
                0: 0.7
                1: 0.3
            position_weights: # note that index 2, the position of the gap, is excluded from position_weights
                0: 1 # only first position can be changed
                1: 0
                3: 0
            alphabet_weights: # the first A can be replaced by C, D or E
                C: 0.4
                D: 0.4
                E: 0.2
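
To make the interplay of the SeedMotif parameters more concrete, the sketch below draws motif instances from a seed. It is a simplified illustration and not immuneML's or LIgO's implementation. It assumes that all weights are fully specified (no redistribution of a remainder), that positions are indexed on the seed string including the gap character, and that a gap is rendered as underscores, as in the YAML comment above. The function instantiate_motif is hypothetical.

```python
import random


def instantiate_motif(seed, hamming_distance_probabilities=None, position_weights=None,
                      alphabet_weights=None, min_gap=0, max_gap=0, rng=random):
    """Draw one motif instance from a seed (the '/' character marks the gap)."""
    hamming = hamming_distance_probabilities or {0: 1.0}
    chars = list(seed)

    # 1. draw the number of positions to modify (the Hamming distance)
    distances = list(hamming)
    n_subs = rng.choices(distances, weights=[hamming[d] for d in distances])[0]

    # 2. choose positions by weight, without replacement, never the gap itself
    candidates = {p: w for p, w in (position_weights or {}).items()
                  if w > 0 and chars[p] != "/"}
    for _ in range(min(n_subs, len(candidates))):
        positions = list(candidates)
        pos = rng.choices(positions, weights=[candidates[p] for p in positions])[0]
        del candidates[pos]
        # 3. choose the replacement amino acid by weight
        letters = list(alphabet_weights)
        chars[pos] = rng.choices(letters, weights=[alphabet_weights[a] for a in letters])[0]

    # 4. draw the gap length and render the gap as underscores
    gap = "_" * rng.randint(min_gap, max_gap)
    return "".join(chars).replace("/", gap)


rng = random.Random(42)  # fixed seed so the draws are reproducible
instances = {
    instantiate_motif("AA/A", {0: 0.7, 1: 0.3}, {0: 1, 1: 0, 3: 0},
                      {"C": 0.4, "D": 0.4, "E": 0.2}, min_gap=0, max_gap=1, rng=rng)
    for _ in range(200)
}
print(sorted(instances))
```

Every drawn instance should be one of the eight variants listed in the comment next to my_gapped_motif. A seed without further parameters, such as my_simple_motif, is always returned unchanged.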

PWM#

Motifs defined by a positional weight matrix and using bionumpy’s PWM internally. For more details on bionumpy’s implementation of PWM, as well as for supported formats, see the documentation at https://bionumpy.github.io/bionumpy/tutorials/position_weight_matrix.html.

Specification arguments:

  • file_path: path to the file where the PWM is stored

  • threshold (float): when matching PWM to a sequence, this is the threshold to consider the sequence as containing the motif

YAML specification:

definitions:
    motifs:
        my_custom_pwm: # this will be the identifier of the motif
            file_path: my_pwm_1.csv
            threshold: 2

Signals#

A signal represents a collection of motifs, and optionally, position weights showing where one of the motifs of the signal can occur in a sequence. The signals are defined under definitions/signals.

A signal is associated with a metadata label, which is assigned to a receptor or repertoire, for example antigen-specific or disease-associated (receptor), or diseased (repertoire).

Note

IMGT positions

To use sequence position weights, IMGT positions should be explicitly specified as strings, in quotation marks, so that all positions can be properly distinguished.

Specification arguments:

  • motifs (list): A list of the motifs associated with this signal, either defined by seed or by position weight matrix. Alternatively, it can be a list of lists of motifs, in which case the motifs in the same sublist (max 2 motifs) have to co-occur in the same sequence.

  • sequence_position_weights (dict): a dictionary specifying for each IMGT position in the sequence how likely it is for the signal to be there. If the position is not present in the sequence, the probability of the signal occurring at that position will be redistributed to other positions with probabilities that are not explicitly set to 0 by the user.

  • v_call (str): V gene, with allele if available, that has to co-occur with one of the motifs for the signal to exist. It can be used in combination with rejection sampling or full sequence implanting, and is otherwise ignored. To match a sequence during rejection sampling, it is checked whether this value is contained in the same field of the generated sequence.

  • j_call (str): J gene, with allele if available, that has to co-occur with one of the motifs for the signal to exist. It can be used in combination with rejection sampling or full sequence implanting, and is otherwise ignored. To match a sequence during rejection sampling, it is checked whether this value is contained in the same field of the generated sequence.

  • source_file (str): path to the file where the custom signal function is; cannot be combined with the arguments listed above (motifs, v_call, j_call, sequence_position_weights)

  • is_present_func (str): name of the function from the source_file file that will be used to specify the signal; the function’s signature must be:

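def is_present(sequence_aa: str, sequence: str, v_call: str, j_call: str) -> bool:
    # custom implementation where all or some of these arguments can be used

  • clonal_frequency (dict): clonal frequency parameters for this signal, given in the same form as default_clonal_frequency described under Simulation config item, for example:

clonal_frequency:
  a: 2 # shape parameter of the distribution
  loc: 0 # 0 by default but can be used to shift the distribution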

YAML specification:

definitions:
    signals:
        my_signal:
            motifs:
                - my_simple_motif
                - my_gapped_motif
            sequence_position_weights:
                '109': 0.5
                '110': 0.5
            v_call: TRBV1
            j_call: TRBJ1
            clonal_frequency:
                a: 2
                loc: 0
        signal_with_custom_func:
            source_file: signal_func.py
            is_present_func: is_signal_present
            clonal_frequency:
                a: 2
                loc: 0
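
For signal_with_custom_func, the file given as source_file has to contain a function with the signature listed above and the name given as is_present_func. A minimal sketch of what signal_func.py could contain is shown below. The rule itself (a fixed amino acid motif together with a V gene substring) is purely hypothetical and only illustrates the expected signature.

```python
def is_signal_present(sequence_aa: str, sequence: str, v_call: str, j_call: str) -> bool:
    """Hypothetical signal: motif 'AAA' in the amino acid sequence and a TRBV1 V gene."""
    # only some of the arguments need to be used; the others are ignored here
    has_motif = "AAA" in sequence_aa
    has_v_gene = v_call is not None and "TRBV1" in v_call
    return has_motif and has_v_gene


print(is_signal_present("CASSAAAF", "", "TRBV1*01", "TRBJ1-1*01"))  # True
print(is_signal_present("CASSLGQF", "", "TRBV1*01", "TRBJ1-1*01"))  # False
```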

Simulation config#

The simulation config defines all parameters of the simulation. It can contain one or more simulation config items, which define groups of repertoires or receptors that have the same simulation parameters, such as signals, generative model, clonal frequencies, and noise parameters.

Specification arguments:

  • sim_items (dict): a dictionary of SimConfigItems, keyed by name, defining individual units of simulation

  • is_repertoire (bool): whether the simulation is on a repertoire (person) or sequence/receptor level

  • paired: if the simulation should output paired data, this parameter should contain a list of lists, each naming a pair of sim_items that should be combined; if paired data is not needed, it should be False

  • sequence_type (str): either amino_acid or nucleotide

  • simulation_strategy (str): either RejectionSampling or Implanting, see the tutorials for more information on choosing one of these

  • keep_p_gen_dist (bool): if possible, whether to keep the distribution of generation probabilities of the sequences the same as provided by the model without any signals

  • p_gen_bin_count (int): if keep_p_gen_dist is true, how many bins to use to approximate the generation probability distribution

  • remove_seqs_with_signals (bool): if true, it explicitly controls the proportions of signals in sequences and removes any accidental occurrences

  • species (str): species that the sequences come from; used to select correct genes to export full length sequences; default is ‘human’

  • implanting_scaling_factor (int): determines in how many receptors to implant the signal in each iteration; this is computed as number_of_receptors_needed_for_signal * implanting_scaling_factor. It is useful when using the Implanting simulation strategy in combination with importance sampling, since receptors with implanted signals may have very low generation probabilities and may therefore rarely be kept by importance sampling. This parameter is only used when keep_p_gen_dist is set to True.

YAML specification:

definitions:
    simulations:
        sim1:
            is_repertoire: false
            paired: false
            sequence_type: amino_acid
            simulation_strategy: RejectionSampling
            sim_items:
                sim_item1: # group of sequences with same simulation params
                    generative_model:
                        chain: beta
                        default_model_name: humanTRB
                        model_path: null
                        type: OLGA
                    number_of_examples: 100
                    seed: 1002
                    signals:
                        signal1: 1

Simulation config item#

When performing a simulation, one or more simulation config items can be specified. Config items define groups of repertoires or receptors that have the same simulation parameters, such as signals, generative model, clonal frequencies, and noise parameters.

Specification arguments:

  • signals (dict): signals for the simulation item and the proportion of sequences in the repertoire that will have the given signal. For receptor-level simulation, the proportion will always be 1.

  • is_noise (bool): indicates whether the implanting should be regarded as noise; if it is True, the signals will be implanted as specified, but the repertoire/receptor in question will have negative class.

  • generative_model: parameters of the generative model, including its type, path to the model; currently supported models are OLGA and ExperimentalImport

  • seed (int): starting random seed for the generative model (it should differ across simulation items, or it can be set to null when not used)

  • false_positives_prob_in_receptors (float): when performing repertoire level simulation, what percentage of sequences should be false positives

  • false_negative_prob_in_receptors (float): when performing repertoire level simulation, what percentage of sequences should be false negatives

  • immune_events (dict): a set of key-value pairs that will be added to the metadata (same values for all data generated in one simulation sim_item) and can be later used as labels

  • default_clonal_frequency (dict): clonal frequency in LIgO is simulated through scipy’s zeta distribution function for generating random numbers, with parameters provided under the default_clonal_frequency parameter. These parameters are used to assign count values to sequences that do not contain any signals, if counts are required by the simulation. If clonal frequency should not be used, this parameter can be None.

default_clonal_frequency:
    a: 2 # shape parameter of the distribution
    loc: 0 # 0 by default but can be used to shift the distribution

  • sequence_len_limits (dict): allows for filtering the generated sequences by length; the parameters min and max must both be specified, and a limit that is not used should be set to -1

sequence_len_limits:
    min: 4 # keep sequences of length 4 and longer
    max: -1 # no limit on the max length of the sequences

YAML specification:

definitions:
    simulations: # definitions of simulations should be under key simulations in the definitions part of the specification
        # one simulation with multiple implanting objects, a part of definition section
        my_simulation:
            sim_item1:
                number_of_examples: 10
                seed: null # don't use seed
                receptors_in_repertoire_count: 100
                generative_model:
                    chain: beta
                    default_model_name: humanTRB
                    model_path: null
                    type: OLGA
                signals:
                    my_signal: 0.25
                    my_signal2: 0.01
                    my_signal__my_signal2: 0.02 # my_signal and my_signal2 will co-occur in 2% of the receptors in all 10 repertoires
            sim_item2:
                number_of_examples: 5
                receptors_in_repertoire_count: 150
                seed: 10
                generative_model:
                    chain: beta
                    default_model_name: humanTRB
                    model_path: null
                    type: OLGA
                signals:
                    my_signal: 0.75
                default_clonal_frequency:
                    a: 2
                sequence_len_limits:
                    min: 3