Definitions¶
The different components used inside an immuneML analysis are called definitions. These analysis components are used inside instructions to perform an analysis.
This page documents all possible definitions and their parameters in detail. For general usage examples please check out the Tutorials.
Datasets¶
Under the definitions/datasets component, the user can specify how to import a dataset from files.
The file format determines which importer should be used, as listed below. See also: How to import data into immuneML.
For testing purposes, it is also possible to generate a random dataset instead of importing from files, using RandomReceptorDataset, RandomSequenceDataset or RandomRepertoireDataset import types. See also: How to generate a dataset with random sequences.
AIRR¶
Imports data in AIRR format into a Repertoire-, Sequence- or ReceptorDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets or ReceptorDatasets should be used when predicting values for unpaired (single-chain) and paired immune receptors respectively, like antigen specificity.
The AIRR .tsv format is explained here: https://docs.airr-community.org/en/stable/datarep/format.html and the AIRR rearrangement schema can be found here: https://docs.airr-community.org/en/stable/datarep/rearrangements.html
When importing a ReceptorDataset, the AIRR field cell_id is used to determine the chain pairs.
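To illustrate how pairing on cell_id can work, here is a minimal sketch in plain Python. It is not immuneML's implementation: the function name pair_chains_by_cell_id is hypothetical, the rows are made-up examples using the AIRR fields cell_id, locus and junction_aa, and keeping the first chain seen per locus is an assumption made only for this example.

```python
from collections import defaultdict


def pair_chains_by_cell_id(rows, chain_pair=("TRA", "TRB")):
    """Group AIRR rows by cell_id and keep the cells that contain both chains.

    Hypothetical helper for illustration only; not part of immuneML.
    """
    by_cell = defaultdict(dict)
    for row in rows:
        locus = row["locus"]
        if locus in chain_pair:
            # Assumption: if a cell has several chains of one locus, keep the first.
            by_cell[row["cell_id"]].setdefault(locus, row)

    first, second = chain_pair
    return {
        cell_id: (chains[first], chains[second])
        for cell_id, chains in by_cell.items()
        if first in chains and second in chains
    }


rows = [
    {"cell_id": "cell1", "locus": "TRA", "junction_aa": "CAVSDLEPNSSASKIIF"},
    {"cell_id": "cell1", "locus": "TRB", "junction_aa": "CASSIRSSYEQYF"},
    {"cell_id": "cell2", "locus": "TRB", "junction_aa": "CASSLGQAYEQYF"},
]
receptors = pair_chains_by_cell_id(rows)
```

In this sketch, cell2 is dropped because it has no TRA chain to pair with.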
Specification arguments:
path (str): For RepertoireDatasets, this is the path to a directory with AIRR files to import. For Sequence- or ReceptorDatasets, this may be either the path to the file to import or the path to a folder containing one or more files with .tsv, .csv or .txt extensions. By default, path is set to the current working directory.
is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset or ReceptorDataset. By default, is_repertoire is set to True.
metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. Only the AIRR files included under the column ‘filename’ are imported into the RepertoireDataset. For setting Sequence- or ReceptorDataset labels, metadata_file is ignored, use label_columns instead.
label_columns (list): For Sequence- or ReceptorDatasets, this parameter can be used to explicitly set the column names of labels to import. These labels can be used as prediction targets. When label_columns is not set, immuneML attempts to discover label names automatically (any column name which is not used in the column_mapping). For setting RepertoireDataset labels, label_columns is ignored; use metadata_file instead.
paired (bool): Required for Sequence- or ReceptorDatasets. This parameter determines whether to import a SequenceDataset (paired = False) or a ReceptorDataset (paired = True). In a ReceptorDataset, two sequences with chain types specified by receptor_chains are paired together based on the identifier given in the AIRR column named ‘cell_id’.
receptor_chains (str): Required for ReceptorDatasets. Determines which pair of chains to import for each Receptor. Valid values are TRA_TRB, TRG_TRD, IGH_IGL, IGH_IGK. If receptor_chains is not provided, the chain pair is automatically detected (only one chain pair type allowed per repertoire).
import_productive (bool): Whether productive sequences (with value ‘T’ in column productive) should be included in the imported sequences. By default, import_productive is True.
import_unknown_productivity (bool): Whether sequences with unknown productivity (missing value in column productive) should be included in the imported sequences. By default, import_unknown_productivity is True.
import_with_stop_codon (bool): Whether sequences with stop codons (with value ‘T’ in column stop_codon) should be included in the imported sequences. This only applies if column stop_codon is present. By default, import_with_stop_codon is False.
import_out_of_frame (bool): Whether out of frame sequences (with value ‘F’ in column vj_in_frame) should be included in the imported sequences. This only applies if column vj_in_frame is present. By default, import_out_of_frame is False.
import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.
import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.
import_empty_aa_sequences (bool): imports sequences which have an empty amino acid sequence field; can be True or False; for analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.
column_mapping (dict): A mapping from AIRR column names to immuneML’s internal data representation. A custom column mapping can be specified here if necessary (for example, to add additional data fields if they are present in the AIRR file, or to use alternative column names):
    additional_column_in_the_file: column_name_to_be_used_in_analysis
separator (str): Column separator. For AIRR, this is by default “\t” (tab).
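The behaviour of import_illegal_characters = False can be sketched as follows. This is an illustration under stated assumptions, not immuneML's code: the alphabets (20 amino acids plus ‘*’, and the nucleotides A, C, G, T), the helper names, and the use of the AIRR columns junction_aa and junction are all assumptions made for this example.

```python
# Assumed alphabets for this sketch; immuneML's actual alphabets may differ.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY*")  # 20 amino acids plus stop codon '*'
NUCLEOTIDES = set("ACGT")

# Which column and alphabet to check for each sequence type (hypothetical names).
CHECKS = {
    "amino_acid": ("junction_aa", AMINO_ACIDS),
    "nucleotide": ("junction", NUCLEOTIDES),
}


def has_illegal_characters(sequence, alphabet):
    """Return True if any character of the sequence is outside the alphabet."""
    return any(char not in alphabet for char in sequence.upper())


def filter_illegal(rows, sequence_type="amino_acid"):
    """Drop rows with illegal characters, checking only the sequence type of interest."""
    column, alphabet = CHECKS[sequence_type]
    return [row for row in rows if not has_illegal_characters(row[column], alphabet)]


rows = [
    {"junction": "TGTGCCAGC", "junction_aa": "CAS"},
    {"junction": "TGTNCCAGC", "junction_aa": "CAS"},  # 'N' is illegal only in nucleotide mode
    {"junction": "TGTGCCAGC", "junction_aa": "C_S"},  # '_' is illegal only in amino acid mode
]
kept_aa = filter_illegal(rows, "amino_acid")
kept_nt = filter_illegal(rows, "nucleotide")
```

The second row survives in amino acid mode even though its nucleotide sequence contains an ‘N’, mirroring the rule that filtering applies only to the sequence type in use.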
YAML specification:
definitions:
    datasets:
        my_airr_dataset:
            format: AIRR
            params:
                path: path/to/files/
                is_repertoire: True # whether to import a RepertoireDataset
                metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
                import_productive: True # whether to include productive sequences in the dataset
                import_with_stop_codon: False # whether to include sequences with stop codon in the dataset
                import_out_of_frame: False # whether to include out of frame sequences in the dataset
                import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
                import_empty_nt_sequences: True # keep sequences even if the `sequences` column is empty (provided that other fields are as specified here)
                import_empty_aa_sequences: False # remove all sequences with empty `sequence_aas` column
                # Optional fields with AIRR-specific defaults, only change when different behavior is required:
                separator: "\t" # column separator
                region_type: IMGT_CDR3 # what part of the sequence to check for import
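The metadata file for a RepertoireDataset is an ordinary csv file, so it can be produced with any tool. As a minimal sketch, the following uses Python's csv module; the file names, subject ids and the disease label are invented for illustration, and write_metadata is a hypothetical helper, not an immuneML function.

```python
import csv
import io


def write_metadata(handle, records, label_names):
    """Write a metadata csv with the required columns followed by the label columns."""
    fieldnames = ["filename", "subject_id", *label_names]
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)


# Hypothetical repertoires; 'disease' is an arbitrary label column.
records = [
    {"filename": "rep1.tsv", "subject_id": "subject_1", "disease": True},
    {"filename": "rep2.tsv", "subject_id": "subject_2", "disease": False},
]

buffer = io.StringIO()  # replace with open("metadata.csv", "w", newline="") to write a file
write_metadata(buffer, records, ["disease"])
metadata_csv = buffer.getvalue()
```

Only the files listed in the filename column would then be imported, and the disease column becomes available as a label in instructions.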
Generic¶
Imports data from any tabular file into a Repertoire-, Sequence- or ReceptorDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets or ReceptorDatasets should be used when predicting values for unpaired (single-chain) and paired immune receptors respectively, like antigen specificity.
This importer works similarly to other importers, but has no predefined default values for which fields are imported, and can therefore be tailored to import data from various different tabular files with headers.
For ReceptorDatasets: this importer assumes the two receptor sequences appear on different lines in the file, and can be paired together by a common sequence identifier. If you instead want to import a ReceptorDataset from a tabular file that contains both receptor chains on one line, see SingleLineReceptor import.
Specification arguments:
path (str): For RepertoireDatasets, this is the path to a directory with files to import. For Sequence- or ReceptorDatasets, this may be either the path to the file to import or the path to a folder containing one or more files with .tsv, .csv or .txt extensions. By default, path is set to the current working directory.
is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset or ReceptorDataset. By default, is_repertoire is set to True.
metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. For setting Sequence- or ReceptorDataset labels, metadata_file is ignored, use label_columns instead.
label_columns (list): For Sequence- or ReceptorDatasets, this parameter can be used to explicitly set the column names of labels to import. These labels can be used as prediction targets. When label_columns is not set, immuneML attempts to discover label names automatically (any column name which is not used in the column_mapping). For setting RepertoireDataset labels, label_columns is ignored; use metadata_file instead.
paired (bool): Required for Sequence- or ReceptorDatasets. This parameter determines whether to import a SequenceDataset (paired = False) or a ReceptorDataset (paired = True). In a ReceptorDataset, two sequences with chain types specified by receptor_chains are paired together based on a common identifier. This identifier should be mapped to the immuneML field ‘sequence_identifiers’ using the column_mapping.
receptor_chains (str): Required for ReceptorDatasets. Determines which pair of chains to import for each Receptor. Valid values are TRA_TRB, TRG_TRD, IGH_IGL, IGH_IGK.
import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.
import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.
import_empty_aa_sequences (bool): imports sequences which have an empty amino acid sequence field; can be True or False; for analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.
region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means immuneML assumes the IMGT junction (including leading C and trailing Y/F amino acids) is used in the input file, and the first and last amino acids will be removed from the sequences to retrieve the IMGT CDR3 sequence. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.
column_mapping (dict): Required for all datasets. A mapping where the keys are the column names in the input file, and the values correspond to the names in the AIRR format. A column mapping can look for example like this:
    file_column_amino_acids: cdr3_aa
    file_column_v_genes: v_call
    file_column_j_genes: j_call
    file_column_frequencies: duplicate_count
column_mapping_synonyms (dict): This is a column mapping that can be used if a column could have alternative names. The formatting is the same as column_mapping. If some columns specified in column_mapping are not found in the file, the columns specified in column_mapping_synonyms are instead attempted to be loaded. For Generic import, there is no default column_mapping_synonyms.
columns_to_load (list): Optional; specifies which columns to load from the input file. This may be useful if the input files contain many unused columns. If no value is specified, all columns are loaded.
separator (str): Required parameter. Column separator, for example “\t” or “,”. The default value is “\t”.
YAML specification:
definitions:
    datasets:
        my_generic_dataset:
            format: Generic
            params:
                path: path/to/files/
                is_repertoire: True # whether to import a RepertoireDataset
                metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
                paired: False # whether to import SequenceDataset (False) or ReceptorDataset (True) when is_repertoire = False
                receptor_chains: TRA_TRB # what chain pair to import for a ReceptorDataset
                separator: "\t" # column separator
                import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
                import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
                import_empty_aa_sequences: False # filter out sequences if they don't have amino acid sequence set
                region_type: IMGT_CDR3 # which column to check for illegal characters/empty strings etc
                column_mapping: # column mapping file: immuneML/AIRR column names
                    file_column_amino_acids: junction_aa
                    file_column_v_genes: v_call
                    file_column_j_genes: j_call
                    file_column_frequencies: duplicate_count
                    file_column_antigen_specificity: antigen_specificity
                columns_to_load: # which subset of columns to load from the file
                    - file_column_amino_acids
                    - file_column_v_genes
                    - file_column_j_genes
                    - file_column_frequencies
                    - file_column_antigen_specificity
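The interplay of columns_to_load, column_mapping and column_mapping_synonyms can be pictured with the sketch below. It is only an approximation of the described behaviour: rename_columns is a hypothetical helper, rows are plain dictionaries rather than immuneML objects, and the exact fallback rule for synonyms (use them whenever any mapped column is missing) is an assumption.

```python
def rename_columns(row, column_mapping, column_mapping_synonyms=None, columns_to_load=None):
    """Select and rename the columns of one input row (hypothetical helper)."""
    if columns_to_load is not None:
        # Keep only the requested subset of file columns.
        row = {name: value for name, value in row.items() if name in columns_to_load}

    mapping = dict(column_mapping)
    if not all(name in row for name in column_mapping):
        # Assumption: fall back to the synonyms when a mapped column is missing.
        mapping.update(column_mapping_synonyms or {})

    # Columns without a mapping keep their original name.
    return {mapping.get(name, name): value for name, value in row.items()}


row = {
    "file_column_amino_acids": "CASSLGQAYEQYF",
    "file_column_v_genes": "TRBV7-9",
    "unused_column": "x",
}
mapped = rename_columns(
    row,
    column_mapping={"file_column_amino_acids": "junction_aa", "file_column_v_genes": "v_call"},
    columns_to_load=["file_column_amino_acids", "file_column_v_genes"],
)

# A file that uses an alternative column name is handled through the synonyms.
mapped_synonym = rename_columns(
    {"aminoAcid": "CASSLGQAYEQYF"},
    column_mapping={"amino_acid": "junction_aa"},
    column_mapping_synonyms={"aminoAcid": "junction_aa"},
)
```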
IGoR¶
Imports data generated by IGoR simulations into a Repertoire-, or SequenceDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets should be used when predicting values for unpaired (single-chain) immune receptors, like antigen specificity.
Note that you should run IGoR with the --CDR3 option specified, as this importer reads the generated CDR3 files. Sequences with missing anchors are not imported, meaning only sequences with value ‘1’ in the anchors_found column are imported. Nucleotide sequences are automatically translated to amino acid sequences.
Reference: Quentin Marcou, Thierry Mora, Aleksandra M. Walczak ‘High-throughput immune repertoire analysis with IGoR’. Nature Communications, (2018) doi.org/10.1038/s41467-018-02832-w.
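The anchor filter and the translation step described above can be sketched in plain Python. This is not immuneML's implementation: the helper names are hypothetical, the standard genetic code is assumed, trailing nucleotides that do not form a full codon are simply ignored, and the output column name cdr3_aa is chosen for this example. The input column names nt_CDR3, anchors_found and seq_index are the ones mentioned in this section.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nt_sequence):
    """Translate a nucleotide sequence codon by codon ('*' marks a stop codon)."""
    nt_sequence = nt_sequence.upper()
    usable_length = len(nt_sequence) - len(nt_sequence) % 3  # ignore incomplete codon
    return "".join(
        CODON_TABLE[nt_sequence[i:i + 3]] for i in range(0, usable_length, 3)
    )


def import_igor_rows(rows):
    """Keep rows where anchors were found and add the translated sequence."""
    return [
        dict(row, cdr3_aa=translate(row["nt_CDR3"]))
        for row in rows
        if row["anchors_found"] == "1"
    ]


rows = [
    {"seq_index": "0", "nt_CDR3": "TGTGCCAGCAGC", "anchors_found": "1"},
    {"seq_index": "1", "nt_CDR3": "TGTGCC", "anchors_found": "0"},
]
imported = import_igor_rows(rows)
```

Note that this sketch raises a KeyError for codons containing characters other than A, C, G and T; real data would need explicit handling of such cases.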
Specification arguments:
path (str): For RepertoireDatasets, this is the path to a directory with IGoR files to import. For Sequence- or ReceptorDatasets, this may be either the path to the file to import or the path to a folder containing one or more files with .tsv, .csv or .txt extensions. By default, path is set to the current working directory.
is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset. By default, is_repertoire is set to True.
metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. Only the IGoR files included under the column ‘filename’ are imported into the RepertoireDataset. For setting Sequence- or ReceptorDataset labels, metadata_file is ignored, use label_columns instead.
label_columns (list): For Sequence- or ReceptorDatasets, this parameter can be used to explicitly set the column names of labels to import. These labels can be used as prediction targets. When label_columns is not set, immuneML attempts to discover label names automatically (any column name which is not used in the column_mapping). For setting RepertoireDataset labels, label_columns is ignored; use metadata_file instead.
import_with_stop_codon (bool): Whether sequences with stop codons should be included in the imported sequences. By default, import_with_stop_codon is False.
import_out_of_frame (bool): Whether out of frame sequences (with value ‘0’ in column is_inframe) should be included in the imported sequences. By default, import_out_of_frame is False.
import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default, import_illegal_characters is False.
import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.
region_type (str): Which part of the sequence to check when importing. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as IGoR uses the IMGT junction. Specifying any other value will result in importing the sequences as they are. Valid values for region_type are the names of the RegionType enum.
column_mapping (dict): A mapping from IGoR column names to immuneML’s internal data representation. A custom column mapping can be specified here if necessary (for example, to add additional data fields if they are present in the IGoR file, or to use alternative column names). Valid immuneML fields that can be specified here are defined by Repertoire.FIELDS. For IGoR, this is by default set to:
    nt_CDR3: cdr3
    seq_index: sequence_id
separator (str): Column separator, for IGoR this is by default “,”.
YAML specification:
definitions:
    datasets:
        my_igor_dataset:
            format: IGoR
            params:
                path: path/to/files/
                is_repertoire: True # whether to import a RepertoireDataset (True) or a SequenceDataset (False)
                metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
                import_with_stop_codon: False # whether to include sequences with stop codon in the dataset
                import_out_of_frame: False # whether to include out of frame sequences in the dataset
                import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
                import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
                # Optional fields with IGoR-specific defaults, only change when different behavior is required:
                separator: "," # column separator
                region_type: IMGT_CDR3 # what part of the sequence to import
                column_mapping: # column mapping IGoR: immuneML
                    nt_CDR3: cdr3
                    seq_index: sequence_id
                    igor_column_name1: metadata_label1
                    igor_column_name2: metadata_label2
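The effect of region_type on amino acid sequences, as described in the arguments above, comes down to a small trimming rule. The sketch below only illustrates that rule; extract_region is a hypothetical helper and the example junction is invented.

```python
def extract_region(junction_aa, region_type="IMGT_CDR3"):
    """Return the sequence to import for the given region_type.

    With IMGT_CDR3, the input is assumed to be an IMGT junction, so the
    leading and trailing amino acids are removed. Any other value leaves
    the sequence untouched.
    """
    if region_type == "IMGT_CDR3":
        return junction_aa[1:-1]
    return junction_aa


junction = "CASSLGQAYEQYF"
cdr3 = extract_region(junction)                         # conserved C and F removed
unchanged = extract_region(junction, "IMGT_JUNCTION")   # imported as it is
```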
IReceptor¶
Imports AIRR datasets retrieved through the iReceptor Gateway into a Repertoire-, Sequence- or ReceptorDataset. The differences between this importer and the AIRR importer are:
This importer takes in a list of .zip files, which must contain one or more AIRR tsv files, and for each AIRR file, a corresponding metadata json file must be present.
This importer does not require a metadata csv file for RepertoireDataset import, it is generated automatically from the metadata json files.
RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets or ReceptorDatasets should be used when predicting values for unpaired (single-chain) and paired immune receptors respectively, like antigen specificity.
AIRR rearrangement schema can be found here: https://docs.airr-community.org/en/stable/datarep/rearrangements.html
When importing a ReceptorDataset, the AIRR field cell_id is used to determine the chain pairs.
Specification arguments:
path (str): This is the path to a directory with .zip files retrieved from the iReceptor Gateway. These .zip files should include AIRR files (with .tsv extension) and corresponding metadata.json files with matching names (e.g., for my_dataset.tsv the corresponding metadata file is called my_dataset-metadata.json). The zip files must use the .zip extension.
is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset or ReceptorDataset. By default, is_repertoire is set to True.
label_columns (list): For Sequence- or ReceptorDatasets, this parameter can be used to explicitly set the column names of labels to import. These labels can be used as prediction targets. When label_columns is not set, immuneML attempts to discover label names automatically (any column name which is not used in the column_mapping). For RepertoireDataset labels, label_columns is ignored; metadata is discovered automatically from the metadata json.
paired (bool): Required for Sequence- or ReceptorDatasets. This parameter determines whether to import a SequenceDataset (paired = False) or a ReceptorDataset (paired = True). In a ReceptorDataset, two sequences with chain types specified by receptor_chains are paired together based on the identifier given in the AIRR column named ‘cell_id’.
receptor_chains (str): Required for ReceptorDatasets. Determines which pair of chains to import for each Receptor. Valid values are TRA_TRB, TRG_TRD, IGH_IGL, IGH_IGK. If receptor_chains is not provided, the chain pair is automatically detected (only one chain pair type allowed per repertoire).
import_productive (bool): Whether productive sequences (with value ‘T’ in column productive) should be included in the imported sequences. By default, import_productive is True.
import_with_stop_codon (bool): Whether sequences with stop codons (with value ‘T’ in column stop_codon) should be included in the imported sequences. This only applies if column stop_codon is present. By default, import_with_stop_codon is False.
import_out_of_frame (bool): Whether out of frame sequences (with value ‘F’ in column vj_in_frame) should be included in the imported sequences. This only applies if column vj_in_frame is present. By default, import_out_of_frame is False.
import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.
import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.
import_empty_aa_sequences (bool): imports sequences which have an empty amino acid sequence field; can be True or False; for analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.
region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as AIRR uses the IMGT junction. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.
separator (str): Column separator. For AIRR, this is by default “\t” (tab).
YAML specification:
definitions:
    datasets:
        my_airr_dataset:
            format: IReceptor
            params:
                path: path/to/zipfiles/
                is_repertoire: True # whether to import a RepertoireDataset
                metadata_column_mapping: # metadata column mapping AIRR: immuneML for Sequence- or ReceptorDataset
                    airr_column_name1: metadata_label1
                    airr_column_name2: metadata_label2
                import_productive: True # whether to include productive sequences in the dataset
                import_with_stop_codon: False # whether to include sequences with stop codon in the dataset
                import_out_of_frame: False # whether to include out of frame sequences in the dataset
                import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
                import_empty_nt_sequences: True # keep sequences even if the `sequences` column is empty (provided that other fields are as specified here)
                import_empty_aa_sequences: False # remove all sequences with empty `sequence_aas` column
                # Optional fields with AIRR-specific defaults, only change when different behavior is required:
                separator: "\t" # column separator
                region_type: IMGT_CDR3 # what part of the sequence to import
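Before running an import, it can be useful to check that each .zip file follows the naming convention described for the path argument (my_dataset.tsv alongside my_dataset-metadata.json). The following sketch does this with Python's zipfile module. It is a standalone check, not part of immuneML; find_airr_files is a hypothetical helper, and the archive is built in memory purely for demonstration.

```python
import io
import zipfile


def find_airr_files(zip_file):
    """Map each .tsv file in the archive to its '-metadata.json' companion.

    zip_file may be a path or a binary file-like object.
    Raises FileNotFoundError if a companion metadata file is missing.
    """
    with zipfile.ZipFile(zip_file) as archive:
        names = set(archive.namelist())

    pairs = {}
    for name in sorted(names):
        if name.endswith(".tsv"):
            metadata_name = name[: -len(".tsv")] + "-metadata.json"
            if metadata_name not in names:
                raise FileNotFoundError(f"{metadata_name} is missing for {name}")
            pairs[name] = metadata_name
    return pairs


# Build a small example archive in memory (contents are placeholders).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("my_dataset.tsv", "sequence_id\tjunction_aa\n")
    archive.writestr("my_dataset-metadata.json", "{}")
buffer.seek(0)

pairs = find_airr_files(buffer)
```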
ImmunoSEQRearrangement¶
Imports data from Adaptive Biotechnologies immunoSEQ Analyzer rearrangement-level .tsv files into a Repertoire-, or SequenceDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets should be used when predicting values for unpaired (single-chain) immune receptors, like antigen specificity.
The format of the files imported by this importer is described here: https://www.adaptivebiotech.com/wp-content/uploads/2019/07/MRK-00342_immunoSEQ_TechNote_DataExport_WEB_REV.pdf
Alternatively, to import sample-level .tsv files, see ImmunoSEQSample import. The only difference between these two importers is which columns they load from the .tsv files.
Specification arguments:
path (str): For RepertoireDatasets, this is the path to a directory with files to import. For Sequence- or ReceptorDatasets, this may be either the path to the file to import or the path to a folder containing one or more files with .tsv, .csv or .txt extensions. By default, path is set to the current working directory.
is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset. By default, is_repertoire is set to True.
metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. Only the files included under the column ‘filename’ are imported into the RepertoireDataset. For setting Sequence- or ReceptorDataset labels, metadata_file is ignored, use label_columns instead.
label_columns (list): For Sequence- or ReceptorDatasets, this parameter can be used to explicitly set the column names of labels to import. These labels can be used as prediction targets. When label_columns is not set, immuneML attempts to discover label names automatically (any column name which is not used in the column_mapping). For setting RepertoireDataset labels, label_columns is ignored; use metadata_file instead.
import_productive (bool): Whether productive sequences (with value ‘In’ in column frame_type) should be included in the imported sequences. By default, import_productive is True.
import_with_stop_codon (bool): Whether sequences with stop codons (with value ‘Stop’ in column frame_type) should be included in the imported sequences. By default, import_with_stop_codon is False.
import_out_of_frame (bool): Whether out of frame sequences (with value ‘Out’ in column frame_type) should be included in the imported sequences. By default, import_out_of_frame is False.
import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.
import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.
import_empty_aa_sequences (bool): imports sequences which have an empty amino acid sequence field; can be True or False; for analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.
region_type (str): Which part of the sequence to check when importing. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as immunoSEQ files use the IMGT junction. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.
column_mapping (dict): A mapping from immunoSEQ column names to immuneML’s internal data representation. For immunoSEQ rearrangement-level files, this is by default set to the values shown below in YAML format. A custom column mapping can be specified here if necessary (for example, to add additional data fields if they are present in the file, or to use alternative column names). Valid immuneML fields that can be specified here are defined by Repertoire.FIELDS.
    rearrangement: sequence
    amino_acid: junction_aa
    v_resolved: v_call
    j_resolved: j_call
    templates: duplicate_count
columns_to_load (list): Specifies which subset of columns must be loaded from the file. By default, this is: [rearrangement, v_family, v_gene, v_allele, j_family, j_gene, j_allele, amino_acid, templates, frame_type, locus]
separator (str): Column separator. For ImmunoSEQ files, this is by default “\t” (tab).
YAML specification:
definitions:
    datasets:
        my_immunoseq_dataset:
            format: ImmunoSEQRearrangement
            params:
                path: path/to/files/
                is_repertoire: True # whether to import a RepertoireDataset (True) or a SequenceDataset (False)
                metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
                metadata_column_mapping: # metadata column mapping ImmunoSEQ: immuneML for SequenceDataset
                    immunoseq_column_name1: metadata_label1
                    immunoseq_column_name2: metadata_label2
                import_productive: True # whether to include productive sequences in the dataset
                import_with_stop_codon: False # whether to include sequences with stop codon in the dataset
                import_out_of_frame: False # whether to include out of frame sequences in the dataset
                import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
                import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
                import_empty_aa_sequences: False # filter out sequences if they don't have sequence_aa set
                # Optional fields with ImmunoSEQ rearrangement-specific defaults, only change when different behavior is required:
                separator: "\t" # column separator
                columns_to_load: # subset of columns to load
                    - rearrangement
                    - v_family
                    - v_gene
                    - v_resolved
                    - j_family
                    - j_gene
                    - j_resolved
                    - amino_acid
                    - templates
                    - frame_type
                    - locus
                region_type: IMGT_CDR3 # what part of the sequence to import
                column_mapping: # column mapping immunoSEQ: immuneML
                    rearrangement: cdr3
                    amino_acid: cdr3_aa
                    v_resolved: v_call
                    j_resolved: j_call
                    templates: duplicate_count
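For immunoSEQ files, the three import flags all read the same frame_type column (values ‘In’, ‘Stop’ and ‘Out’). A minimal sketch of that decision is shown below. It is an illustration rather than immuneML's code: keep_row is a hypothetical helper, and dropping rows with any other frame_type value is an assumption.

```python
def keep_row(row, import_productive=True, import_with_stop_codon=False,
             import_out_of_frame=False):
    """Decide whether a row is imported, based on its frame_type value."""
    allowed = {
        "In": import_productive,         # productive sequences
        "Stop": import_with_stop_codon,  # sequences containing a stop codon
        "Out": import_out_of_frame,      # out of frame sequences
    }
    # Assumption: rows with an unrecognised frame_type are not imported.
    return allowed.get(row["frame_type"], False)


rows = [
    {"amino_acid": "CASSLGQAYEQYF", "frame_type": "In"},
    {"amino_acid": "CASS*GQAYEQYF", "frame_type": "Stop"},
    {"amino_acid": "", "frame_type": "Out"},
]

default_import = [row for row in rows if keep_row(row)]
with_out_of_frame = [row for row in rows if keep_row(row, import_out_of_frame=True)]
```

With the default settings only the ‘In’ row is kept; enabling import_out_of_frame additionally keeps the ‘Out’ row.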
ImmunoSEQSample¶
Imports data from Adaptive Biotechnologies immunoSEQ Analyzer sample-level .tsv files into a Repertoire-, or SequenceDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets should be used when predicting values for unpaired (single-chain) immune receptors, like antigen specificity.
The format of the files imported by this importer is described in section 3.4.13 of this document: https://clients.adaptivebiotech.com/assets/downloads/immunoSEQ_AnalyzerManual.pdf
Alternatively, to import rearrangement-level .tsv files, see ImmunoSEQRearrangement import. The only difference between these two importers is which columns they load from the .tsv files.
Specification arguments:
path (str): For RepertoireDatasets, this is the path to a directory with files to import. For Sequence- or ReceptorDatasets, this may be either the path to the file to import or the path to a folder containing one or more files with .tsv, .csv or .txt extensions. By default, path is set to the current working directory.
is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset. By default, is_repertoire is set to True.
metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. Only the files included under the column ‘filename’ are imported into the RepertoireDataset. For setting Sequence- or ReceptorDataset labels, metadata_file is ignored, use label_columns instead.
label_columns (list): For Sequence- or ReceptorDataset, this parameter can be used to explicitly set the column names of labels to import. These labels can be used as prediction target. When label_columns are not set, label names are attempted to be discovered automatically (any column name which is not used in the column_mapping). For setting RepertoireDataset labels, label_columns is ignored, use metadata_file instead.
import_productive (bool): Whether productive sequences (with value ‘In’ in column frame_type) should be included in the imported sequences. By default, import_productive is True.
import_with_stop_codon (bool): Whether sequences with stop codons (with value ‘Stop’ in column frame_type) should be included in the imported sequences. By default, import_with_stop_codon is False.
import_out_of_frame (bool): Whether out of frame sequences (with value ‘Out’ in column frame_type) should be included in the imported sequences. By default, import_out_of_frame is False.
import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.
import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.
import_empty_aa_sequences (bool): imports sequences which have an empty amino acid sequence field; can be True or False; for analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.
region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as immunoSEQ files use the IMGT junction. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.
column_mapping (dict): A mapping from immunoSEQ column names to immuneML’s internal data representation. For immunoSEQ sample-level files, this is by default set to the values shown below in YAML format. A custom column mapping can be specified here if necessary (for example, adding additional data fields if they are present in the file, or using alternative column names). Valid immuneML fields that can be specified here are defined by Repertoire.FIELDS.
nucleotide: cdr3
aminoAcid: cdr3_aa
count (templates/reads): duplicate_count
column_mapping_synonyms (dict): This is a column mapping that can be used if a column could have alternative names. The formatting is the same as column_mapping. If some columns specified in column_mapping are not found in the file, the columns specified in column_mapping_synonyms are instead attempted to be loaded. For immunoSEQ sample .tsv files, there is no default column_mapping_synonyms.
columns_to_load (list): Specifies which subset of columns from the original file must be loaded and imported. By default, this is: [nucleotide, aminoAcid, count (templates/reads), vFamilyName, vGeneName, vGeneAllele, jFamilyName, jGeneName, jGeneAllele, sequenceStatus].
metadata_column_mapping (dict): Specifies metadata for Sequence- and ReceptorDatasets. This should specify a mapping similar to column_mapping where keys are immunoSEQ column names and values are the names that are internally used in immuneML as metadata fields. These metadata fields can be used as prediction labels for Sequence- and ReceptorDatasets. This parameter can also be used to specify sequence-level metadata columns for RepertoireDatasets, which can be used by reports. To set prediction label metadata for RepertoireDatasets, see metadata_file instead. For immunoSEQ sample .tsv files, there is no default metadata_column_mapping.
separator (str): Column separator, for ImmunoSEQ files this is by default “\t”.
YAML specification:
definitions:
    datasets:
        my_immunoseq_dataset:
            format: ImmunoSEQSample
            params:
                path: path/to/files/
                is_repertoire: True # whether to import a RepertoireDataset (True) or a SequenceDataset (False)
                metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
                metadata_column_mapping: # metadata column mapping ImmunoSEQ: immuneML for SequenceDataset
                    immunoseq_column_name1: metadata_label1
                    immunoseq_column_name2: metadata_label2
                import_productive: True # whether to include productive sequences in the dataset
                import_with_stop_codon: False # whether to include sequences with stop codon in the dataset
                import_out_of_frame: False # whether to include out of frame sequences in the dataset
                import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
                import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
                import_empty_aa_sequences: False # filter out sequences if they don't have sequence_aa set
                # Optional fields with ImmunoSEQ sample-specific defaults, only change when different behavior is required:
                separator: "\t" # column separator
                columns_to_load: # subset of columns to load
                    - nucleotide
                    - aminoAcid
                    - count (templates/reads)
                    - vFamilyName
                    - vGeneName
                    - vGeneAllele
                    - jFamilyName
                    - jGeneName
                    - jGeneAllele
                    - sequenceStatus
                region_type: IMGT_CDR3 # what part of the sequence to import
                column_mapping: # column mapping immunoSEQ: immuneML
                    nucleotide: sequence
                    aminoAcid: junction_aa
                    vGeneName: v_call
                    jGeneName: j_call
                    sequenceStatus: frame_type
                    vFamilyName: v_family
                    jFamilyName: j_family
                    vGeneAllele: v_allele
                    jGeneAllele: j_allele
                    count (templates/reads): duplicate_count
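The interplay of import_productive, import_with_stop_codon and import_out_of_frame can be illustrated with a small Python sketch. This is not immuneML's implementation: the helper name keep_by_frame_type and the example rows are hypothetical, and the sketch only assumes what is stated above, namely that frame_type takes the values ‘In’, ‘Stop’ and ‘Out’.

```python
# Hypothetical helper (not part of immuneML): filter rows on their frame_type value.
def keep_by_frame_type(rows, import_productive=True,
                       import_with_stop_codon=False,
                       import_out_of_frame=False):
    """Return only the rows whose frame_type is enabled by the three flags."""
    allowed = set()
    if import_productive:
        allowed.add("In")      # productive sequences
    if import_with_stop_codon:
        allowed.add("Stop")    # sequences containing a stop codon
    if import_out_of_frame:
        allowed.add("Out")     # out-of-frame sequences
    return [row for row in rows if row.get("frame_type") in allowed]


# Made-up example rows, for illustration only.
rows = [
    {"aminoAcid": "CASSLGTDTQYF", "frame_type": "In"},
    {"aminoAcid": "CASS*GTDTQYF", "frame_type": "Stop"},
    {"aminoAcid": "", "frame_type": "Out"},
]

default_rows = keep_by_frame_type(rows)                            # documented defaults
with_stop = keep_by_frame_type(rows, import_with_stop_codon=True)  # also keep 'Stop' rows
```

With the documented defaults only the ‘In’ row is kept; enabling import_with_stop_codon additionally keeps the ‘Stop’ row.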
MiXCR¶
Imports data in MiXCR format into a Repertoire-, or SequenceDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets should be used when predicting values for unpaired (single-chain) immune receptors, like antigen specificity.
Specification arguments:
path (str): For RepertoireDatasets, this is the path to a directory with MiXCR files to import. For Sequence- or ReceptorDatasets this path may either be the path to the file to import, or the path to the folder locating one or multiple files with .tsv, .csv or .txt extensions. By default path is set to the current working directory.
is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset. By default, is_repertoire is set to True.
metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. Only the MiXCR files included under the column ‘filename’ are imported into the RepertoireDataset. For setting Sequence- or ReceptorDataset labels, metadata_file is ignored, use label_columns instead.
label_columns (list): For Sequence- or ReceptorDataset, this parameter can be used to explicitly set the column names of labels to import. These labels can be used as prediction target. When label_columns are not set, label names are attempted to be discovered automatically (any column name which is not used in the column_mapping). For setting RepertoireDataset labels, label_columns is ignored, use metadata_file instead.
import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence, such as ‘_’, are removed). By default import_illegal_characters is False.
import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.
import_empty_aa_sequences (bool): imports sequences which have an empty amino acid sequence field; can be True or False; for analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.
region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as MiXCR uses IMGT junction as CDR3. Alternatively to importing the CDR3 sequence, other region types can be specified here as well. Valid values for region_type are IMGT_CDR3, IMGT_JUNCTION, IMGT_CDR1, IMGT_CDR2, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4.
column_mapping (dict): A mapping from MiXCR column names to immuneML’s data representation. The columns that specify the sequences to import are handled by the region_type parameter. A custom column mapping can be specified here if necessary (for example; adding additional data fields if they are present in the MiXCR file, or using alternative column names). immuneML uses fields as defined in the AIRR schema. For MiXCR, this is by default set to:
cloneCount: duplicate_count
allVHitsWithScore: v_call
allJHitsWithScore: j_call
columns_to_load (list): Specifies which subset of columns must be loaded from the MiXCR file. By default, this is: [cloneCount, allVHitsWithScore, allJHitsWithScore, aaSeqCDR3, nSeqCDR3]
separator (str): Column separator, for MiXCR this is by default “\t”.
YAML specification:
definitions:
    datasets:
        my_mixcr_dataset:
            format: MiXCR
            params:
                path: path/to/files/
                is_repertoire: True # whether to import a RepertoireDataset (True) or a SequenceDataset (False)
                metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
                region_type: IMGT_CDR3 # what part of the sequence to import
                import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
                import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
                import_empty_aa_sequences: False # filter out sequences if they don't have sequence_aa set
                # Optional fields with MiXCR-specific defaults, only change when different behavior is required:
                separator: "\t" # column separator
                columns_to_load: # subset of columns to load, sequence columns are handled by region_type parameter
                    - cloneCount
                    - allVHitsWithScore
                    - allJHitsWithScore
                    - aaSeqCDR3
                    - nSeqCDR3
                column_mapping: # column mapping MiXCR: immuneML
                    cloneCount: duplicate_count
                    allVHitsWithScore: v_call
                    allJHitsWithScore: j_call
                    mixcrColumnName1: metadata_label1
                    mixcrColumnName2: metadata_label2
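The effect of region_type: IMGT_CDR3 on amino acid sequences, as described above, is that the first and last amino acids of the IMGT junction are removed. The following Python sketch shows only that trimming step; the function name junction_to_cdr3 and the example sequence are hypothetical, and this is not the code immuneML itself uses.

```python
# Hypothetical helper (not part of immuneML): trim an IMGT junction to the IMGT CDR3.
def junction_to_cdr3(junction_aa: str) -> str:
    """Drop the first and last amino acid of an IMGT junction.

    Sequences shorter than three residues yield an empty string.
    """
    return junction_aa[1:-1]


junction = "CASSLGTDTQYF"          # made-up junction, starts with C and ends with F
cdr3 = junction_to_cdr3(junction)  # conserved C and F are removed
```

Here the junction CASSLGTDTQYF becomes the CDR3 ASSLGTDTQY.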
OLGA¶
Imports data generated by OLGA simulations into a Repertoire-, or SequenceDataset. Assumes that the columns in each file correspond to: nucleotide sequences, amino acid sequences, v genes, j genes.
Reference: Sethna, Zachary et al. ‘OLGA: fast computation of generation probabilities of B- and T-cell receptor amino acid sequences and motifs’. Bioinformatics, (2019) doi.org/10.1093/bioinformatics/btz035.
Specification arguments:
path (str): For RepertoireDatasets, this is the path to a directory with OLGA files to import. For Sequence- or ReceptorDatasets this path may either be the path to the file to import, or the path to the folder locating one or multiple files with .tsv, .csv or .txt extensions. By default path is set to the current working directory.
is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset. By default, is_repertoire is set to True.
metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. Only the OLGA files included under the column ‘filename’ are imported into the RepertoireDataset. SequenceDataset metadata is currently not supported.
import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.
import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.
import_empty_aa_sequences (bool): imports sequences which have an empty amino acid sequence field; can be True or False; for analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.
region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as OLGA uses the IMGT junction. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.
separator (str): Column separator, for OLGA this is by default “\t”.
column_mapping (dict): Defines which columns to import from the OLGA format: keys are the column numbers (positions in the file) and values are the names of the fields they are mapped to.
YAML specification:
definitions:
    datasets:
        my_olga_dataset:
            format: OLGA
            params:
                path: path/to/files/
                is_repertoire: True # whether to import a RepertoireDataset (True) or a SequenceDataset (False)
                metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
                import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
                import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
                import_empty_aa_sequences: False # filter out sequences if they don't have amino acid sequence set
                # Optional fields with OLGA-specific defaults, only change when different behavior is required:
                separator: "\t" # column separator
                columns_to_load: [0, 1, 2, 3]
                column_mapping:
                    0: junction
                    1: junction_aa
                    2: v_call
                    3: j_call
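Because OLGA files are addressed by column position rather than by column name, the column_mapping keys are integers. The Python sketch below illustrates how such a positional mapping could be applied to a headerless, tab-separated file using only the standard library. The function name read_olga_rows and the example line are hypothetical; this is an illustration of the idea, not immuneML's importer.

```python
import csv
import io

# Positional mapping as in the YAML specification: column index -> field name.
COLUMN_MAPPING = {0: "junction", 1: "junction_aa", 2: "v_call", 3: "j_call"}


def read_olga_rows(handle, column_mapping=COLUMN_MAPPING, separator="\t"):
    """Read a headerless delimited file and name its columns by position."""
    reader = csv.reader(handle, delimiter=separator)
    return [
        {name: fields[index] for index, name in column_mapping.items()}
        for fields in reader
        if fields  # skip empty lines
    ]


# Made-up single-line file content, for illustration only.
example = io.StringIO("TGTGCCAGCAGTTTAGGG\tCASSLG\tTRBV7-9\tTRBJ2-3\n")
rows = read_olga_rows(example)
```

Each resulting row is a dictionary keyed by the mapped field names, for example rows[0]["junction_aa"].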
RandomReceptorDataset¶
Returns a ReceptorDataset consisting of randomly generated sequences, which can be used for benchmarking purposes. The sequences consist of uniformly chosen amino acids or nucleotides.
Specification arguments:
receptor_count (int): The number of receptors the ReceptorDataset should contain.
chain_1_length_probabilities (dict): A mapping where the keys correspond to different sequence lengths for chain 1, and the values are the probabilities for choosing each sequence length. For example, to create a random ReceptorDataset where 40% of the sequences for chain 1 would be of length 10, and 60% of the sequences would have length 12, this mapping would need to be specified:
10: 0.4
12: 0.6
chain_2_length_probabilities (dict): Same as chain_1_length_probabilities, but for chain 2.
labels (dict): A mapping that specifies randomly chosen labels to be assigned to the receptors. One or multiple labels can be specified here. The keys of this mapping are the labels, and the values consist of another mapping between label classes and their probabilities. For example, to create a random ReceptorDataset with the label cmv_epitope where 70% of the receptors have class binding and the remaining 30% have class not_binding, the following mapping should be specified:
cmv_epitope:
    binding: 0.7
    not_binding: 0.3
YAML specification:
definitions:
    datasets:
        my_random_dataset:
            format: RandomReceptorDataset
            params:
                receptor_count: 100 # number of random receptors to generate
                chain_1_length_probabilities:
                    14: 0.8 # 80% of all generated sequences for all receptors (for chain 1) will have length 14
                    15: 0.2 # 20% of all generated sequences across all receptors (for chain 1) will have length 15
                chain_2_length_probabilities:
                    14: 0.8 # 80% of all generated sequences for all receptors (for chain 2) will have length 14
                    15: 0.2 # 20% of all generated sequences across all receptors (for chain 2) will have length 15
                labels:
                    epitope1: # label name
                        True: 0.5 # 50% of the receptors will have class True
                        False: 0.5 # 50% of the receptors will have class False
                    epitope2: # next label with classes that will be assigned to receptors independently of the previous label or other parameters
                        1: 0.3 # 30% of the generated receptors will have class 1
                        0: 0.7 # 70% of the generated receptors will have class 0
RandomRepertoireDataset¶
Returns a RepertoireDataset consisting of randomly generated sequences, which can be used for benchmarking purposes. The sequences consist of uniformly chosen amino acids or nucleotides.
Specification arguments:
repertoire_count (int): The number of repertoires the RepertoireDataset should contain.
sequence_count_probabilities (dict): A mapping where the keys are the number of sequences per repertoire, and the values are the probabilities that any of the repertoires would have that number of sequences. For example, to create a random RepertoireDataset where 40% of the repertoires would have 1000 sequences, and the other 60% would have 1100 sequences, this mapping would need to be specified:
1000: 0.4
1100: 0.6
sequence_length_probabilities (dict): A mapping where the keys correspond to different sequence lengths, and the values are the probabilities for choosing each sequence length. For example, to create a random RepertoireDataset where 40% of the sequences would be of length 10, and 60% of the sequences would have length 12, this mapping would need to be specified:
10: 0.4
12: 0.6
labels (dict): A mapping that specifies randomly chosen labels to be assigned to the Repertoires. One or multiple labels can be specified here. The keys of this mapping are the labels, and the values consist of another mapping between label classes and their probabilities. For example, to create a random RepertoireDataset with the label CMV where 70% of the Repertoires have class cmv_positive and the remaining 30% have class cmv_negative, the following mapping should be specified:
CMV:
    cmv_positive: 0.7
    cmv_negative: 0.3
YAML specification:
definitions:
    datasets:
        my_random_dataset:
            format: RandomRepertoireDataset
            params:
                repertoire_count: 100 # number of random repertoires to generate
                sequence_count_probabilities:
                    10: 0.5 # probability that any of the repertoires would have 10 receptor sequences
                    20: 0.5
                sequence_length_probabilities:
                    10: 0.5 # probability that any of the receptor sequences would be 10 amino acids in length
                    12: 0.5
                labels: # randomly assigned labels (only useful for simple benchmarking)
                    cmv:
                        True: 0.5 # probability of value True for label cmv to be assigned to any repertoire
                        False: 0.5
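To make the meaning of the probability mappings concrete, the following Python sketch draws one random repertoire from mappings shaped like the parameters above. It is an illustration under stated assumptions, not immuneML's generator: the helper names sample_from and random_repertoire are hypothetical, sequences are drawn uniformly from the 20 standard amino acids, and each value is sampled independently.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet: 20 standard amino acids


def sample_from(probabilities, rng):
    """Draw one key of a {value: probability} mapping according to its probability."""
    values = list(probabilities)
    weights = list(probabilities.values())
    return rng.choices(values, weights=weights, k=1)[0]


def random_repertoire(sequence_count_probabilities,
                      sequence_length_probabilities, labels, rng):
    """Hypothetical sketch: generate the sequences and labels of one repertoire."""
    sequence_count = sample_from(sequence_count_probabilities, rng)
    sequences = [
        "".join(rng.choices(AMINO_ACIDS,
                            k=sample_from(sequence_length_probabilities, rng)))
        for _ in range(sequence_count)
    ]
    # Every label is sampled independently of the others and of the sequences.
    metadata = {label: sample_from(classes, rng) for label, classes in labels.items()}
    return sequences, metadata


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sequences, metadata = random_repertoire(
    {10: 0.5, 20: 0.5},                 # sequence_count_probabilities
    {10: 0.5, 12: 0.5},                 # sequence_length_probabilities
    {"cmv": {True: 0.5, False: 0.5}},   # labels
    rng,
)
```

The repertoire contains either 10 or 20 sequences, each of length 10 or 12, and receives one randomly chosen class for the label cmv.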
RandomSequenceDataset¶
Returns a SequenceDataset consisting of randomly generated sequences, which can be used for benchmarking purposes. The sequences consist of uniformly chosen amino acids or nucleotides.
Specification arguments:
sequence_count (int): The number of sequences the SequenceDataset should contain.
length_probabilities (dict): A mapping where the keys correspond to different sequence lengths and the values are the probabilities for choosing each sequence length. For example, to create a random SequenceDataset where 40% of the sequences would be of length 10, and 60% of the sequences would have length 12, this mapping would need to be specified:
10: 0.4
12: 0.6
labels (dict): A mapping that specifies randomly chosen labels to be assigned to the sequences. One or multiple labels can be specified here. The keys of this mapping are the labels, and the values consist of another mapping between label classes and their probabilities. For example, to create a random SequenceDataset with the label cmv_epitope where 70% of the sequences have class binding and the remaining 30% have class not_binding, the following mapping should be specified:
cmv_epitope:
    binding: 0.7
    not_binding: 0.3
region_type (str): Which region_type to assign to all randomly generated sequences.
YAML specification:
definitions:
    datasets:
        my_random_dataset:
            format: RandomSequenceDataset
            params:
                sequence_count: 100 # number of random sequences to generate
                length_probabilities:
                    14: 0.8 # 80% of all generated sequences will have length 14
                    15: 0.2 # 20% of all generated sequences will have length 15
                labels:
                    epitope1: # label name
                        True: 0.5 # 50% of the sequences will have class True
                        False: 0.5 # 50% of the sequences will have class False
                    epitope2: # next label with classes that will be assigned to sequences independently of the previous label or other parameters
                        1: 0.3 # 30% of the generated sequences will have class 1
                        0: 0.7 # 70% of the generated sequences will have class 0
TenxGenomics¶
Imports data from the 10x Genomics Cell Ranger analysis pipeline into a Repertoire-, Sequence- or ReceptorDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets or ReceptorDatasets should be used when predicting values for unpaired (single-chain) and paired immune receptors respectively, like antigen specificity.
The files that should be used as input are named ‘Clonotype consensus annotations (CSV)’, as described here: https://support.10xgenomics.com/single-cell-vdj/software/pipelines/latest/output/annotation#consensus
Note: by default the 10xGenomics field ‘umis’ is used to define the immuneML field duplicate_count. If you want to use the 10x Genomics field reads instead, this can be changed in the column_mapping (set reads: duplicate_count). Furthermore, the 10xGenomics field clonotype_id is used for the immuneML field cell_id.
Specification arguments:
path (str): For RepertoireDatasets, this is the path to a directory with 10xGenomics files to import. For Sequence- or ReceptorDatasets this path may either be the path to the file to import, or the path to the folder locating one or multiple files with .tsv, .csv or .txt extensions. By default path is set to the current working directory.
is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset or ReceptorDataset. By default, is_repertoire is set to True.
metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. For setting Sequence- or ReceptorDataset labels, metadata_file is ignored, use label_columns instead.
label_columns (list): For Sequence- or ReceptorDataset, this parameter can be used to explicitly set the column names of labels to import. These labels can be used as prediction target. When label_columns are not set, label names are attempted to be discovered automatically (any column name which is not used in the column_mapping). For setting RepertoireDataset labels, label_columns is ignored, use metadata_file instead.
paired (str): Required for Sequence- or ReceptorDatasets. This parameter determines whether to import a SequenceDataset (paired = False) or a ReceptorDataset (paired = True). In a ReceptorDataset, two sequences with chain types specified by receptor_chains are paired together based on the identifier given in the 10xGenomics column named ‘clonotype_id’.
receptor_chains (str): Required for ReceptorDatasets. Determines which pair of chains to import for each Receptor. Valid values are TRA_TRB, TRG_TRD, IGH_IGL, IGH_IGK. If receptor_chains is not provided, the chain pair is automatically detected (only one chain pair type allowed per repertoire).
import_productive (bool): Whether productive sequences (with value ‘True’ in column productive) should be included in the imported sequences. By default, import_productive is True.
import_unproductive (bool): Whether unproductive sequences (with value ‘False’ in column productive) should be included in the imported sequences. By default, import_unproductive is False.
import_unknown_productivity (bool): Whether sequences with unknown productivity (missing or ‘NA’ value in column productive) should be included in the imported sequences. By default, import_unknown_productivity is True.
import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.
import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.
import_empty_aa_sequences (bool): imports sequences which have an empty amino acid sequence field; can be True or False; for analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.
region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as 10xGenomics uses IMGT junction as CDR3. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.
column_mapping (dict): A mapping from 10xGenomics column names to immuneML’s internal data representation. A custom column mapping can be specified here if necessary (for example, adding additional data fields if they are present in the 10xGenomics file, or using alternative column names). Valid immuneML fields that can be specified here are defined by the AIRR schema. For 10xGenomics, this is by default set to:
cdr3: junction_aa
cdr3_nt: junction
v_gene: v_call
j_gene: j_call
umis: duplicate_count
clonotype_id: cell_id
consensus_id: sequence_id
column_mapping_synonyms (dict): This is a column mapping that can be used if a column could have alternative names. The formatting is the same as column_mapping. If some columns specified in column_mapping are not found in the file, the columns specified in column_mapping_synonyms are instead attempted to be loaded. For 10xGenomics format, there is no default column_mapping_synonyms.
separator (str): Column separator, for 10xGenomics this is by default “,”.
YAML specification:
definitions:
    datasets:
        my_10x_dataset:
            format: 10xGenomics
            params:
                path: path/to/files/
                is_repertoire: True # whether to import a RepertoireDataset
                metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
                paired: False # whether to import SequenceDataset (False) or ReceptorDataset (True) when is_repertoire = False
                receptor_chains: TRA_TRB # what chain pair to import for a ReceptorDataset
                import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
                import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
                import_empty_aa_sequences: False # filter out sequences if they don't have amino acid sequence set
                # Optional fields with 10xGenomics-specific defaults, only change when different behavior is required:
                separator: "," # column separator
                region_type: IMGT_CDR3 # what part of the sequence to import
                column_mapping: # column mapping 10xGenomics: immuneML
                    cdr3: junction_aa
                    cdr3_nt: junction
                    v_gene: v_call
                    j_gene: j_call
                    umis: duplicate_count
                    clonotype_id: cell_id
                    consensus_id: sequence_id
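The pairing described for paired: True, where two chains of the types given by receptor_chains are joined on the clonotype_id column, can be sketched in Python as follows. The function name pair_chains and the example rows are hypothetical. How immuneML treats clonotypes with more than one sequence per chain is not stated here, so this sketch simply keeps the last one it sees; treat that as an assumption.

```python
from collections import defaultdict


def pair_chains(rows, receptor_chains="TRA_TRB"):
    """Hypothetical sketch: pair sequences that share a clonotype_id.

    Returns {clonotype_id: (first_chain_cdr3, second_chain_cdr3)} for every
    clonotype that has both chains named in receptor_chains.
    """
    first, second = receptor_chains.split("_")
    by_clonotype = defaultdict(dict)
    for row in rows:
        if row["chain"] in (first, second):
            # Assumption: a later row with the same chain overwrites an earlier one.
            by_clonotype[row["clonotype_id"]][row["chain"]] = row["cdr3"]
    return {
        clonotype_id: (chains[first], chains[second])
        for clonotype_id, chains in by_clonotype.items()
        if first in chains and second in chains  # drop unpaired clonotypes
    }


# Made-up rows using 10x-style column names, for illustration only.
rows = [
    {"clonotype_id": "clonotype1", "chain": "TRA", "cdr3": "CAVRDSNYQLIW"},
    {"clonotype_id": "clonotype1", "chain": "TRB", "cdr3": "CASSLGTDTQYF"},
    {"clonotype_id": "clonotype2", "chain": "TRB", "cdr3": "CASSPGQGNYGYTF"},
]
receptors = pair_chains(rows)
```

Only clonotype1 yields a receptor here, because clonotype2 has no TRA chain to pair with.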
VDJdb¶
Imports data in VDJdb format into a Repertoire-, Sequence- or ReceptorDataset. RepertoireDatasets should be used when making predictions per repertoire, such as predicting a disease state. SequenceDatasets or ReceptorDatasets should be used when predicting values for unpaired (single-chain) and paired immune receptors respectively, like antigen specificity.
Specification arguments:
path (str): For RepertoireDatasets, this is the path to a directory with VDJdb files to import. For Sequence- or ReceptorDatasets this path may either be the path to the file to import, or the path to the folder locating one or multiple files with .tsv, .csv or .txt extensions. By default path is set to the current working directory.
is_repertoire (bool): If True, this imports a RepertoireDataset. If False, it imports a SequenceDataset or ReceptorDataset. By default, is_repertoire is set to True.
metadata_file (str): Required for RepertoireDatasets. This parameter specifies the path to the metadata file. This is a csv file with columns filename, subject_id and arbitrary other columns which can be used as labels in instructions. For setting Sequence- or ReceptorDataset labels, metadata_file is ignored, use label_columns instead.
label_columns (list): For Sequence- or ReceptorDataset, this parameter can be used to explicitly set the column names of labels to import. By default, label_columns for VDJdbImport are [Epitope, Epitope gene, Epitope species]. These labels can be used as prediction target. When label_columns are not set, label names are attempted to be discovered automatically (any column name which is not used in the column_mapping). For setting RepertoireDataset labels, label_columns is ignored, use metadata_file instead.
paired (str): Required for Sequence- or ReceptorDatasets. This parameter determines whether to import a SequenceDataset (paired = False) or a ReceptorDataset (paired = True). In a ReceptorDataset, two sequences with chain types specified by receptor_chains are paired together based on the identifier given in the VDJdb column named ‘complex.id’.
receptor_chains (str): Required for ReceptorDatasets. Determines which pair of chains to import for each Receptor. Valid values are TRA_TRB, TRG_TRD, IGH_IGL, IGH_IGK. If receptor_chains is not provided, the chain pair is automatically detected (only one chain pair type allowed per repertoire).
import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon ‘*’, or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.
import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.
import_empty_aa_sequences (bool): imports sequences which have an empty amino acid sequence field; can be True or False; for analysis on amino acid sequences, this parameter should be False (import only non-empty amino acid sequences). By default, import_empty_aa_sequences is set to False.
region_type (str): Which part of the sequence to import. By default, this value is set to IMGT_CDR3. This means the first and last amino acids are removed from the CDR3 sequence, as VDJdb uses IMGT junction as CDR3. Specifying any other value will result in importing the sequences as they are. Valid values are IMGT_CDR1, IMGT_CDR2, IMGT_CDR3, IMGT_FR1, IMGT_FR2, IMGT_FR3, IMGT_FR4, IMGT_JUNCTION, FULL_SEQUENCE.
column_mapping (dict): A mapping from VDJdb column names to immuneML’s internal data representation. A custom column mapping can be specified here if necessary (for example; adding additional data fields if they are present in the VDJdb file, or using alternative column names). Valid immuneML fields that can be specified here are defined by Repertoire.FIELDS. For VDJdb, this is by default set to:
    V: v_call
    J: j_call
    CDR3: junction_aa
    complex.id: cell_id
    Gene: locus
separator (str): Column separator, for VDJdb this is by default “\t”.
YAML specification:
definitions:
    datasets:
        my_vdjdb_dataset:
            format: VDJdb
            params:
                path: path/to/files/
                is_repertoire: True # whether to import a RepertoireDataset
                metadata_file: path/to/metadata.csv # metadata file for RepertoireDataset
                paired: False # whether to import SequenceDataset (False) or ReceptorDataset (True) when is_repertoire = False
                receptor_chains: TRA_TRB # what chain pair to import for a ReceptorDataset
                import_illegal_characters: False # remove sequences with illegal characters for the sequence_type being used
                import_empty_nt_sequences: True # keep sequences even though the nucleotide sequence might be empty
                import_empty_aa_sequences: False # filter out sequences if they don't have amino acid sequence set
                # Optional fields with VDJdb-specific defaults, only change when different behavior is required:
                separator: "\t" # column separator
                region_type: IMGT_CDR3 # what part of the sequence to import
                column_mapping: # column mapping VDJdb: immuneML
                    V: v_call
                    J: j_call
                    CDR3: junction_aa
                    complex.id: cell_id
                    Gene: locus
                    Epitope: epitope
                    Epitope gene: epitope_gene
                    Epitope species: epitope_species
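The effect of region_type: IMGT_CDR3 described above (dropping the first and last amino acid of the IMGT junction) can be illustrated with a minimal Python sketch. The function name and the example row are hypothetical and are not part of the immuneML API; the sketch only assumes that the junction carries exactly one flanking residue on each side.

```python
def junction_to_imgt_cdr3(junction_aa: str) -> str:
    """Return the junction without its first and last residue.

    Assumption: the input is an IMGT junction, so exactly one conserved
    residue flanks the CDR3 on each side.
    """
    return junction_aa[1:-1]


# Hypothetical VDJdb-style row; only the 'CDR3' column is used here.
row = {"CDR3": "CASSLAPGATNEKLFF", "V": "TRBV7-9*01", "J": "TRBJ1-4*01"}
cdr3 = junction_to_imgt_cdr3(row["CDR3"])
print(cdr3)
```

With any other region_type the sequence would be kept as it appears in the file, so no trimming step like this one applies.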
Encodings¶
Under the definitions/encodings
component, the user can specify how to encode a given dataset.
An encoding is a numerical data representation, which may be used as input for a machine learning algorithm.
AtchleyKmer¶
Represents a repertoire through Atchley factors and relative abundance of k-mers. Should be used in combination with the AtchleyKmerMILClassifier.
For more details, see the original publication: Ostmeyer J, Christley S, Toby IT, Cowell LG. Biophysicochemical motifs in T cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocytes and adjacent healthy tissue. Cancer Res. Published online January 1, 2019:canres.2292.2018. doi:10.1158/0008-5472.CAN-18-2292 .
Note that sequences in the repertoire with length shorter than skip_first_n_aa + skip_last_n_aa + k will not be encoded.
Dataset type:
RepertoireDatasets
Specification arguments:
k (int): k-mer length
skip_first_n_aa (int): number of amino acids to remove from the beginning of the receptor sequence
skip_last_n_aa (int): number of amino acids to remove from the end of the receptor sequence
abundance: how to compute abundance term for k-mers; valid values are RELATIVE_ABUNDANCE, TCRB_RELATIVE_ABUNDANCE.
normalize_all_features (bool): when normalizing features to have 0 mean and unit variance, this parameter indicates if the abundance feature should be included in the normalization
YAML specification:
definitions:
    encodings:
        my_encoder:
            AtchleyKmer:
                k: 4
                skip_first_n_aa: 3
                skip_last_n_aa: 3
                abundance: RELATIVE_ABUNDANCE
                normalize_all_features: False
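The length rule noted above (sequences shorter than skip_first_n_aa + skip_last_n_aa + k are not encoded) follows from how the sequence is trimmed before k-mers are extracted. The sketch below illustrates only that trimming and k-mer step; the function name is hypothetical, and the Atchley factor lookup and abundance term of the actual encoder are not modelled.

```python
def trimmed_kmers(sequence, k, skip_first_n_aa, skip_last_n_aa):
    """List the k-mers of a sequence after trimming both ends.

    Returns an empty list when the sequence is too short to yield a
    single k-mer, mirroring the rule that such sequences are skipped.
    """
    if len(sequence) < skip_first_n_aa + skip_last_n_aa + k:
        return []
    core = sequence[skip_first_n_aa:len(sequence) - skip_last_n_aa]
    return [core[i:i + k] for i in range(len(core) - k + 1)]


# Parameters taken from the YAML specification above.
print(trimmed_kmers("CASSLGTDTQYF", k=4, skip_first_n_aa=3, skip_last_n_aa=3))
print(trimmed_kmers("CASSF", k=4, skip_first_n_aa=3, skip_last_n_aa=3))
```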
CompAIRRDistance¶
Encodes a given RepertoireDataset as a distance matrix, using the Morisita-Horn distance metric. Internally, CompAIRR is used for fast calculation of overlap between repertoires. This creates a pairwise distance matrix between each of the repertoires. The distance is calculated based on the number of matching receptor chain sequences between the repertoires. This matching may be defined to permit 1 or 2 mismatching amino acid/nucleotide positions and 1 indel in the sequence. Furthermore, matching may or may not include V and J gene information, and sequence frequencies may be included or ignored.
When mismatches (differences and indels) are allowed, the Morisita-Horn similarity may exceed 1. In this case, the Morisita-Horn distance (= 1 - similarity) is set to 0 to avoid negative distance scores.
Dataset type:
RepertoireDatasets
Specification arguments:
compairr_path (Path): optional path to the CompAIRR executable. If not given, it is assumed that CompAIRR has been installed such that it can be called directly on the command line with the command ‘compairr’, or that it is located at /usr/local/bin/compairr.
keep_compairr_input (bool): whether to keep the input file that was passed to CompAIRR. This may take a lot of storage space if the input dataset is large. By default, the input file is not kept.
differences (int): Number of differences allowed between the sequences of two immune receptor chains, this may be between 0 and 2. By default, differences is 0.
indels (bool): Whether to allow an indel. This is only possible if differences is 1. By default, indels is False.
ignore_counts (bool): Whether to ignore the frequencies of the immune receptor chains. If False, frequencies will be included, meaning the ‘counts’ values for the receptors available in two repertoires are multiplied. If True, only the number of unique overlapping immune receptors (‘clones’) is considered. By default, ignore_counts is False.
ignore_genes (bool): Whether to ignore V and J gene information. If False, the V and J genes between two receptor chains have to match. If True, gene information is ignored. By default, ignore_genes is False.
threads (int): The number of threads to use for parallelization. Default is 8.
YAML specification:
definitions:
    encodings:
        my_distance_encoder:
            CompAIRRDistance:
                compairr_path: optional/path/to/compairr
                differences: 0
                indels: False
                ignore_counts: False
                ignore_genes: False
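As a rough illustration of the distance described above, the following sketch computes a Morisita-Horn distance between two repertoires given as dictionaries of sequence counts. It uses the standard Morisita-Horn formula with exact sequence matching only (differences = 0), does not call CompAIRR, and assumes both repertoires are non-empty; the function name and inputs are hypothetical.

```python
def morisita_horn_distance(counts_a, counts_b, ignore_counts=False):
    """Morisita-Horn distance (1 - similarity), clipped at 0.

    counts_a / counts_b map a sequence to its count. With
    ignore_counts=True every sequence is weighted as 1.
    """
    if ignore_counts:
        counts_a = {seq: 1 for seq in counts_a}
        counts_b = {seq: 1 for seq in counts_b}
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    # Product of counts, summed over sequences present in both repertoires.
    overlap = sum(counts_a[s] * counts_b[s] for s in counts_a.keys() & counts_b.keys())
    simpson_a = sum(c * c for c in counts_a.values()) / total_a ** 2
    simpson_b = sum(c * c for c in counts_b.values()) / total_b ** 2
    similarity = 2 * overlap / ((simpson_a + simpson_b) * total_a * total_b)
    # Clip so that a similarity above 1 never yields a negative distance.
    return max(0.0, 1.0 - similarity)


rep_a = {"CASSL": 2, "CASSF": 1}
rep_b = {"CAWWW": 4}
print(morisita_horn_distance(rep_a, rep_a))
print(morisita_horn_distance(rep_a, rep_b))
```

With exact matching the similarity cannot exceed 1, so the clipping only matters once mismatches or indels let one sequence match several others.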
CompAIRRSequenceAbundance¶
This encoder works similarly to the SequenceAbundanceEncoder, but internally uses CompAIRR to accelerate core computations.
This encoder represents the repertoires as vectors where:
the first element corresponds to the number of label-associated clonotypes
the second element is the total number of unique clonotypes
To determine what clonotypes (amino acid sequences with or without matching V/J genes) are label-associated, Fisher’s exact test (one-sided) is used.
The encoder also writes out files containing the contingency table used for Fisher’s exact test, the resulting p-values, and the significantly abundant sequences (use RelevantSequenceExporter to export these sequences in AIRR format).
Reference: Emerson, Ryan O. et al. ‘Immunosequencing Identifies Signatures of Cytomegalovirus Exposure History and HLA-Mediated Effects on the T Cell Repertoire’. Nature Genetics 49, no. 5 (May 2017): 659–65. doi.org/10.1038/ng.3822.
Note: to use this encoder, it is necessary to explicitly define the positive class for the label when defining the label in the instruction. With positive class defined, it can then be determined which sequences are indicative of the positive class. See Reproduction of the CMV status predictions study for an example using SequenceAbundanceEncoder.
Dataset type:
RepertoireDatasets
Specification arguments:
p_value_threshold (float): The p value threshold to be used by the statistical test.
compairr_path (Path): optional path to the CompAIRR executable. If not given, it is assumed that CompAIRR has been installed such that it can be called directly on the command line with the command ‘compairr’, or that it is located at /usr/local/bin/compairr.
ignore_genes (bool): Whether to ignore V and J gene information. If False, the V and J genes between two receptor chains have to match. If True, gene information is ignored. By default, ignore_genes is False.
sequence_batch_size (int): The number of sequences in a batch when comparing sequences across repertoires, typically hundreds of thousands. This does not affect the results of the encoding, but may affect the speed and memory usage. The default value is 1,000,000.
threads (int): The number of threads to use for parallelization. This does not affect the results of the encoding, only the speed. The default number of threads is 8.
keep_temporary_files (bool): whether to keep temporary files, including CompAIRR input, output and log files, and the sequence presence matrix. This may take a lot of storage space if the input dataset is large. By default, temporary files are not kept.
YAML specification:
definitions:
    encodings:
        my_sa_encoding:
            CompAIRRSequenceAbundance:
                compairr_path: optional/path/to/compairr
                p_value_threshold: 0.05
                ignore_genes: False
                threads: 8
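The two-element representation and the one-sided Fisher’s exact test described above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the encoder’s implementation: repertoires are given as sets of clonotypes, the contingency table counts repertoires (not reads) with and without each clonotype, and a clonotype is treated as label-associated when its p-value is strictly below the threshold. All names are hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    a/b: positive repertoires with/without the clonotype,
    c/d: negative repertoires with/without the clonotype.
    """
    n_pos, n_present, total = a + b, a + c, a + b + c + d
    tail = sum(comb(n_present, k) * comb(total - n_present, n_pos - k)
               for k in range(a, min(n_pos, n_present) + 1))
    return tail / comb(total, n_pos)


def encode(repertoires, labels, positive_class, p_value_threshold):
    """Return [label-associated clonotypes, unique clonotypes] per repertoire."""
    pos = [r for r, y in zip(repertoires, labels) if y == positive_class]
    neg = [r for r, y in zip(repertoires, labels) if y != positive_class]
    associated = set()
    for clonotype in set().union(*repertoires):
        a = sum(clonotype in r for r in pos)
        c = sum(clonotype in r for r in neg)
        if fisher_exact_greater(a, len(pos) - a, c, len(neg) - c) < p_value_threshold:
            associated.add(clonotype)
    return [[len(r & associated), len(r)] for r in repertoires]


repertoires = [{"X", "P1"}, {"X", "P2"}, {"X", "P3"}, {"N1"}, {"N2"}, {"N3"}]
labels = ["+", "+", "+", "-", "-", "-"]
print(encode(repertoires, labels, positive_class="+", p_value_threshold=0.1))
```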
DeepRC¶
DeepRCEncoder should be used in combination with the DeepRC ML method (DeepRC). This encoder writes the data in a RepertoireDataset to .tsv files. For each repertoire, one .tsv file is created containing the amino acid sequences and the counts. Additionally, one metadata .tsv file is created, which describes the subset of repertoires that is encoded by a given instance of the DeepRCEncoder.
Note: for sequences where count is None, the count value will be set to 1.
Dataset type:
RepertoireDatasets
YAML specification:
definitions:
    encodings:
        my_deeprc_encoder: DeepRC
Distance¶
Encodes a given RepertoireDataset as a distance matrix, where the pairwise distance between each of the repertoires is calculated. The distance is calculated based on the presence/absence of elements defined under attributes_to_match. Thus, if attributes_to_match contains only ‘sequence_aas’, the distance between two repertoires is minimal if they contain the same set of sequence_aas, and maximal if none of the sequence_aas are shared between the two repertoires.
Specification arguments:
distance_metric (DistanceMetricType): The metric used to calculate the distance between two repertoires. Valid values are: JACCARD, MORISITA_HORN. The default distance metric is JACCARD (inverse Jaccard).
sequence_batch_size (int): The number of sequences to be processed at once. Increasing this number increases the memory use. The default value is 1000.
attributes_to_match (list): The attributes to consider when determining whether a sequence is present in both repertoires. Only the fields defined under attributes_to_match will be considered, all other fields are ignored. Valid values include any repertoire attribute as defined in AIRR rearrangement schema (cdr3_aa, v_call, j_call, etc).
YAML specification:
definitions:
    encodings:
        my_distance_encoder:
            Distance:
                distance_metric: JACCARD
                sequence_batch_size: 1000
                attributes_to_match:
                    - cdr3_aa
                    - v_call
                    - j_call
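To make the role of attributes_to_match concrete, here is a small sketch of a Jaccard distance (1 minus the Jaccard similarity) between two repertoires represented as lists of dictionaries. The representation and the function name are assumptions made for illustration; only the fields listed in attributes_to_match contribute to whether two sequences count as the same element.

```python
def jaccard_distance(repertoire_a, repertoire_b, attributes_to_match):
    """Jaccard distance based on presence/absence of matched elements."""
    keys_a = {tuple(seq[attr] for attr in attributes_to_match) for seq in repertoire_a}
    keys_b = {tuple(seq[attr] for attr in attributes_to_match) for seq in repertoire_b}
    union = keys_a | keys_b
    if not union:
        return 0.0  # assumption: two empty repertoires are treated as identical
    return 1.0 - len(keys_a & keys_b) / len(union)


rep_a = [{"cdr3_aa": "ASSL", "v_call": "TRBV7-9", "j_call": "TRBJ1-4"},
         {"cdr3_aa": "ASSF", "v_call": "TRBV7-9", "j_call": "TRBJ1-4"}]
rep_b = [{"cdr3_aa": "ASSL", "v_call": "TRBV5-1", "j_call": "TRBJ1-4"}]

# Matching on the CDR3 alone finds one shared element out of two...
print(jaccard_distance(rep_a, rep_b, ["cdr3_aa"]))
# ...while also requiring the V gene to match leaves nothing shared.
print(jaccard_distance(rep_a, rep_b, ["cdr3_aa", "v_call"]))
```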
EvennessProfile¶
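The EvennessProfileEncoder class encodes a repertoire based on the clonal frequency distribution. The evenness for a given repertoire is defined as follows:
$$ {}^{\alpha}E(f) = \frac{\exp\left(H_{\alpha}(f)\right)}{n}, \qquad H_{\alpha}(f) = \frac{1}{1-\alpha} \ln \sum_{i=1}^{n} f_i^{\alpha} $$
That is, it is the exponential of the Renyi entropy $H_{\alpha}$ at a given alpha divided by the species richness $n$, or number of unique sequences, where $f_i$ is the frequency of clone $i$.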
Reference: Greiff et al. (2015). A bioinformatic framework for immune repertoire diversity profiling enables detection of immunological status. Genome Medicine, 7(1), 49. doi.org/10.1186/s13073-015-0169-8
Dataset type:
RepertoireDatasets
Specification arguments:
min_alpha (float): minimum alpha value to use
max_alpha (float): maximum alpha value to use
dimension (int): dimension of output evenness profile vector, or the number of alpha values to linearly space between min_alpha and max_alpha
YAML specification:
definitions:
    encodings:
        my_evenness_profile:
            EvennessProfile:
                min_alpha: 0
                max_alpha: 10
                dimension: 51
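The definition above (exponential of the Renyi entropy divided by the species richness, evaluated at linearly spaced alpha values) can be sketched as follows. The function name and the list-of-counts input are hypothetical, and the alpha = 1 case is handled through its limit, the Shannon entropy, which is an assumption about how that singular point is treated.

```python
import math


def evenness_profile(counts, min_alpha, max_alpha, dimension):
    """Evenness at `dimension` alpha values spaced linearly in [min_alpha, max_alpha]."""
    total = sum(counts)
    freqs = [c / total for c in counts if c > 0]
    richness = len(freqs)
    step = (max_alpha - min_alpha) / (dimension - 1) if dimension > 1 else 0.0
    profile = []
    for i in range(dimension):
        alpha = min_alpha + i * step
        if math.isclose(alpha, 1.0):
            # Limit of the Renyi entropy for alpha -> 1 (Shannon entropy).
            entropy = -sum(f * math.log(f) for f in freqs)
        else:
            entropy = math.log(sum(f ** alpha for f in freqs)) / (1 - alpha)
        profile.append(math.exp(entropy) / richness)
    return profile


even = evenness_profile([5, 5, 5, 5], min_alpha=0, max_alpha=10, dimension=51)
skewed = evenness_profile([97, 1, 1, 1], min_alpha=0, max_alpha=10, dimension=51)
print(even[0], even[-1])
print(skewed[0], skewed[-1])
```

A perfectly even repertoire stays at 1 for every alpha, while a repertoire dominated by one clone drops towards 1/n as alpha grows.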
KmerAbundance¶
This encoder is related to the SequenceAbundanceEncoder, but identifies label-associated subsequences (k-mers) instead of full label-associated sequences.
This encoder represents the repertoires as vectors where:
the first element corresponds to the number of label-associated k-mers found in a repertoire
the second element is the total number of unique k-mers per repertoire
The label-associated k-mers are determined based on a one-sided Fisher’s exact test.
The encoder also writes out files containing the contingency table used for Fisher’s exact test, the resulting p-values, and the significantly abundant k-mers.
Note: to use this encoder, it is necessary to explicitly define the positive class for the label when defining the label in the instruction. With positive class defined, it can then be determined which sequences are indicative of the positive class. See Reproduction of the CMV status predictions study for an example using SequenceAbundanceEncoder.
Dataset type:
RepertoireDatasets
Specification arguments:
p_value_threshold (float): The p value threshold to be used by the statistical test.
sequence_encoding (SequenceEncodingType): The type of k-mers that are used. The simplest (default) sequence_encoding is CONTINUOUS_KMER, which uses contiguous subsequences of length k to represent the k-mers. When gapped k-mers are used (GAPPED_KMER, IMGT_GAPPED_KMER), the k-mers may contain gaps with a size between min_gap and max_gap, and the k-mer length is defined as a combination of k_left and k_right. When IMGT k-mers are used (IMGT_CONTINUOUS_KMER, IMGT_GAPPED_KMER), IMGT positional information is taken into account (i.e. the same sequence in a different position is considered to be a different k-mer).
k (int): Length of the k-mer (number of amino acids) when ungapped k-mers are used. The default value for k is 3.
k_left (int): When gapped k-mers are used, k_left indicates the length of the k-mer left of the gap. The default value for k_left is 1.
k_right (int): Same as k_left, but k_right determines the length of the k-mer right of the gap. The default value for k_right is 1.
min_gap (int): Minimum gap size when gapped k-mers are used. The default value for min_gap is 0.
max_gap (int): Maximum gap size when gapped k-mers are used. The default value for max_gap is 0.
YAML specification:
definitions:
    encodings:
        my_ka_encoding:
            KmerAbundance:
                p_value_threshold: 0.05
                threads: 8
KmerFrequency¶
The KmerFrequencyEncoder class encodes a repertoire, sequence or receptor by frequencies of k-mers it contains. A k-mer is a sequence of letters of length k into which an immune receptor sequence can be decomposed. K-mers can be defined in different ways, as determined by the sequence_encoding.
Dataset type:
SequenceDatasets
ReceptorDatasets
RepertoireDatasets
Specification arguments:
sequence_encoding (SequenceEncodingType): The type of k-mers that are used. The simplest sequence_encoding is CONTINUOUS_KMER, which uses contiguous subsequences of length k to represent the k-mers. When gapped k-mers are used (GAPPED_KMER, IMGT_GAPPED_KMER), the k-mers may contain gaps with a size between min_gap and max_gap, and the k-mer length is defined as a combination of k_left and k_right. When IMGT k-mers are used (IMGT_CONTINUOUS_KMER, IMGT_GAPPED_KMER), IMGT positional information is taken into account (i.e. the same sequence in a different position is considered to be a different k-mer). When the identity representation is used (IDENTITY), the k-mers just correspond to the original sequences.
normalization_type (NormalizationType): The way in which the k-mer frequencies should be normalized. The default value for normalization_type is l2.
reads (ReadsType): Reads type signifies whether the counts of the sequences in the repertoire will be taken into account. If UNIQUE, only unique sequences (clonotypes) are encoded, and if ALL, the sequence ‘count’ value is taken into account when determining the k-mer frequency. The default value for reads is unique.
k (int): Length of the k-mer (number of amino acids) when ungapped k-mers are used. The default value for k is 3.
k_left (int): When gapped k-mers are used, k_left indicates the length of the k-mer left of the gap. The default value for k_left is 1.
k_right (int): Same as k_left, but k_right determines the length of the k-mer right of the gap. The default value for k_right is 1.
min_gap (int): Minimum gap size when gapped k-mers are used. The default value for min_gap is 0.
max_gap (int): Maximum gap size when gapped k-mers are used. The default value for max_gap is 0.
sequence_type (str): Whether to work with nucleotide or amino acid sequences. Amino acid sequences are the default. To work with either sequence type, the sequences of the desired type should be included in the datasets, e.g., listed under ‘columns_to_load’ parameter. By default, both types will be included if available. Valid values are: AMINO_ACID and NUCLEOTIDE.
scale_to_unit_variance (bool): whether to scale the design matrix after normalization to have unit variance per feature. Setting this argument to True might improve the subsequent classifier’s performance depending on the type of the classifier. The default value for scale_to_unit_variance is true.
scale_to_zero_mean (bool): whether to scale the design matrix after normalization to have zero mean per feature. Setting this argument to True might improve the subsequent classifier’s performance depending on the type of the classifier. However, if the original design matrix was sparse, setting this argument to True will destroy the sparsity and will increase the memory consumption. The default value for scale_to_zero_mean is false.
YAML specification:
definitions:
    encodings:
        my_continuous_kmer:
            KmerFrequency:
                normalization_type: RELATIVE_FREQUENCY
                reads: UNIQUE
                sequence_encoding: CONTINUOUS_KMER
                sequence_type: NUCLEOTIDE
                k: 3
                scale_to_unit_variance: True
                scale_to_zero_mean: True
        my_gapped_kmer:
            KmerFrequency:
                normalization_type: RELATIVE_FREQUENCY
                reads: UNIQUE
                sequence_encoding: GAPPED_KMER
                sequence_type: AMINO_ACID
                k_left: 2
                k_right: 2
                min_gap: 1
                max_gap: 3
                scale_to_unit_variance: True
                scale_to_zero_mean: False
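The difference between continuous and gapped k-mers, and the l2 normalization mentioned above, can be illustrated with the following sketch. The function names are hypothetical, and writing a gap as one "." per skipped position is an illustrative choice, not necessarily the feature naming used by the encoder. IMGT-positional k-mers and the reads setting are not modelled.

```python
import math
from collections import Counter


def continuous_kmers(sequence, k):
    """All contiguous subsequences of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def gapped_kmers(sequence, k_left, k_right, min_gap, max_gap):
    """k-mers made of a left and a right part separated by a gap."""
    kmers = []
    for gap in range(min_gap, max_gap + 1):
        span = k_left + gap + k_right
        for i in range(len(sequence) - span + 1):
            left = sequence[i:i + k_left]
            right = sequence[i + k_left + gap:i + span]
            kmers.append(left + "." * gap + right)  # hypothetical notation
    return kmers


def l2_normalized_frequencies(kmers):
    """k-mer counts scaled so that the feature vector has unit l2 norm."""
    counts = Counter(kmers)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {kmer: c / norm for kmer, c in counts.items()} if norm else {}


print(continuous_kmers("CASSL", k=3))
print(gapped_kmers("CASSLF", k_left=2, k_right=2, min_gap=1, max_gap=2))
print(l2_normalized_frequencies(continuous_kmers("AAAB", k=2)))
```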
MatchedReceptors¶
Encodes the dataset based on the matches between a dataset containing unpaired (single chain) data, and a paired reference receptor dataset. For each paired reference receptor, the frequency of either chain in the dataset is counted.
This encoding can be used in combination with the Matches report.
When sum_matches and normalize are set to True, this encoder behaves similarly as described in: Yao, Y. et al. ‘T cell receptor repertoire as a potential diagnostic marker for celiac disease’. Clinical Immunology Volume 222 (January 2021): 108621. doi.org/10.1016/j.clim.2020.108621 with the only exception being that this encoder uses paired receptors, while the original publication used single sequences (see also: MatchedSequences encoder).
Dataset type:
RepertoireDatasets
Specification arguments:
reference (dict): A dictionary describing the reference dataset file. Import should be specified the same way as regular dataset import. It is only allowed to import a receptor dataset here (i.e., is_repertoire is False and paired is True by default, and these are not allowed to be changed).
max_edit_distances (dict): A dictionary specifying the maximum edit distance between a target sequence (from the repertoire) and the reference sequence. A maximum distance can be specified per chain, for example to allow for less strict matching of TCR alpha and BCR light chains. When only an integer is specified, this distance is applied to all possible chains.
reads (ReadsType): Reads type signifies whether the counts of the sequences in the repertoire will be taken into account. If UNIQUE, only unique sequences (clonotypes) are counted, and if ALL, the sequence ‘count’ value is summed when determining the number of matches. The default value for reads is all.
sum_matches (bool): When sum_matches is False, the resulting encoded data matrix contains multiple columns with the number of matches per reference receptor chain. When sum_matches is True, the columns representing each of the two chains are summed together, meaning that there are only two aggregated sums of matches (one per chain) per repertoire in the encoded data. To use this encoder in combination with the Matches report, sum_matches must be set to False. When sum_matches is set to True, this encoder behaves similarly to the encoder described by Yao, Y. et al. By default, sum_matches is False.
normalize (bool): If True, the chain matches are divided by the total number of unique receptors in the repertoire (when reads = unique) or the total number of reads in the repertoire (when reads = all).
YAML specification:
definitions:
    encodings:
        my_mr_encoding:
            MatchedReceptors:
                reference:
                    format: VDJDB
                    params:
                        path: path/to/file.txt
                max_edit_distances:
                    alpha: 1
                    beta: 0
MatchedRegex¶
Encodes the dataset based on the matches between a RepertoireDataset and a collection of regular expressions. For each regular expression, the number of sequences in the RepertoireDataset containing the expression is counted. This can also be used to count how often a subsequence occurs in a RepertoireDataset.
The regular expressions are defined per chain, and it is possible to require a V gene match in addition to the CDR3 sequence containing the regular expression.
This encoding can be used in combination with the Matches report.
Dataset type:
RepertoireDatasets
Specification arguments:
match_v_genes (bool): Whether V gene matches are required. If this is True, a match is only counted if the V gene matches the gene specified in the motif input file. By default match_v_genes is False.
reads (ReadsType): Reads type signifies whether the counts of the sequences in the repertoire will be taken into account. If UNIQUE, only unique sequences (clonotypes) are counted, and if ALL, the sequence ‘count’ value is summed when determining the number of matches. The default value for reads is all.
motif_filepath (str): The path to the motif input file. This should be a tab separated file containing a column named ‘id’ and for every chain that should be matched a column containing the regex (<chain>_regex) and a column containing the V gene (<chain>V) if match_v_genes is True. The chains are specified by their three-letter code, see Chain.
In the simplest case, when counting the number of occurrences of a given list of k-mers in TRB sequences, the contents of the motif file could look like this:
id    TRB_regex
1     ACG
2     EDNA
3     DFWG
It is also possible to test whether paired regular expressions occur in the dataset (for example: regular expressions matching both a TRA chain and a TRB chain) by specifying them on the same line. In a more complex case where both paired and unpaired regular expressions are specified, in addition to matching the V genes, the contents of the motif file could look like this:
id    TRA_regex    TRAV      TRB_regex    TRBV
1     AGQ.GSS      TRAV35    S[APL]GQY    TRBV29-1
2                            ASS.R.*      TRBV7-3
YAML specification:
definitions:
    encodings:
        my_mr_encoding:
            MatchedRegex:
                motif_filepath: path/to/file.txt
                match_v_genes: True
                reads: unique
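For the simplest motif file shown above (an ‘id’ column and a TRB_regex column), the counting can be sketched as below. This is an illustrative approximation: the function name and the (sequence, count) input are hypothetical, a sequence is counted when the regular expression is found anywhere in it, and V gene matching and paired expressions are left out.

```python
import csv
import io
import re

# Contents of the simple tab-separated motif file from the example above.
MOTIF_FILE = "id\tTRB_regex\n1\tACG\n2\tEDNA\n3\tDFWG\n"


def count_regex_matches(motif_text, sequences, chain="TRB", reads="all"):
    """Count matching sequences per motif id.

    sequences: list of (amino acid sequence, count) tuples for one repertoire.
    reads: "all" sums the counts, "unique" counts each sequence once.
    """
    result = {}
    for motif in csv.DictReader(io.StringIO(motif_text), delimiter="\t"):
        pattern = re.compile(motif[f"{chain}_regex"])
        result[motif["id"]] = sum(
            (count if reads == "all" else 1)
            for sequence, count in sequences
            if pattern.search(sequence)
        )
    return result


repertoire = [("CASSEDNAF", 4), ("CACGDFWGF", 2), ("CASSLF", 1)]
print(count_regex_matches(MOTIF_FILE, repertoire, reads="all"))
print(count_regex_matches(MOTIF_FILE, repertoire, reads="unique"))
```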
MatchedSequences¶
Encodes the dataset based on the matches between a RepertoireDataset and a reference sequence dataset.
This encoding can be used in combination with the Matches report.
When sum_matches and normalize are set to True, this encoder behaves as described in: Yao, Y. et al. ‘T cell receptor repertoire as a potential diagnostic marker for celiac disease’. Clinical Immunology Volume 222 (January 2021): 108621. doi.org/10.1016/j.clim.2020.108621
Dataset type:
RepertoireDatasets
Specification arguments:
reference (dict): A dictionary describing the reference dataset file. Import should be specified the same way as regular dataset import. It is only allowed to import a sequence dataset here (i.e., is_repertoire and paired are False by default, and are not allowed to be set to True).
max_edit_distance (int): The maximum edit distance between a target sequence (from the repertoire) and the reference sequence.
reads (ReadsType): Reads type signifies whether the counts of the sequences in the repertoire will be taken into account. If UNIQUE, only unique sequences (clonotypes) are counted, and if ALL, the sequence ‘count’ value is summed when determining the number of matches. The default value for reads is all.
sum_matches (bool): When sum_matches is False, the resulting encoded data matrix contains multiple columns with the number of matches per reference sequence. When sum_matches is True, all columns are summed together, meaning that there is only one aggregated sum of matches per repertoire in the encoded data. To use this encoder in combination with the Matches report, sum_matches must be set to False. When sum_matches is set to True, this encoder behaves as described by Yao, Y. et al. By default, sum_matches is False.
normalize (bool): If True, the sequence matches are divided by the total number of unique sequences in the repertoire (when reads = unique) or the total number of reads in the repertoire (when reads = all).
YAML specification:
definitions:
    encodings:
        my_ms_encoding:
            MatchedSequences:
                reference:
                    format: VDJDB
                    params:
                        path: path/to/file.txt
                max_edit_distance: 1
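How max_edit_distance, reads, sum_matches and normalize interact can be sketched as follows. The sketch assumes that "edit distance" means Levenshtein distance and represents a repertoire as a list of (sequence, count) tuples; both the function names and this representation are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance (assumed meaning of 'edit distance' here)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def matched_sequences(repertoire, reference, max_edit_distance,
                      reads="all", sum_matches=False, normalize=False):
    """Matches per reference sequence, or a single sum if sum_matches is True."""
    weighted = [(seq, count if reads == "all" else 1) for seq, count in repertoire]
    matches = [sum(w for seq, w in weighted if edit_distance(seq, ref) <= max_edit_distance)
               for ref in reference]
    if normalize:
        total = sum(w for _, w in weighted)  # total reads or total unique sequences
        matches = [m / total for m in matches]
    return [sum(matches)] if sum_matches else matches


repertoire = [("CASSL", 3), ("CASSF", 2), ("CAWWW", 5)]
reference = ["CASSL", "CAWWY"]
print(matched_sequences(repertoire, reference, max_edit_distance=1))
print(matched_sequences(repertoire, reference, 1, reads="unique"))
print(matched_sequences(repertoire, reference, 1, sum_matches=True, normalize=True))
```

With sum_matches and normalize both enabled, the repertoire is reduced to a single matched fraction, which is the configuration the text associates with Yao, Y. et al.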
Motif¶
This encoder enumerates every possible positional motif in a sequence dataset, and keeps only the motifs associated with the positive class. A ‘motif’ is defined as a combination of position-specific amino acids. These motifs may contain one or multiple gaps. Motifs are filtered out based on a minimal precision and recall threshold for predicting the positive class.
Note: the MotifEncoder can only be used for sequences of the same length.
The ideal recall threshold(s) given a user-defined precision threshold can be calibrated using the MotifGeneralizationAnalysis report. It is recommended to first run this report in ExploratoryAnalysisInstruction before using this encoder for ML.
This encoder can be used in combination with the BinaryFeatureClassifier in order to learn a minimal set of compatible motifs for predicting the positive class. Alternatively, it may be combined with scikit-learn methods, such as for example LogisticRegression, to learn a weight per motif.
Dataset type:
SequenceDatasets
Specification arguments:
max_positions (int): The maximum motif size. This is the number of positional amino acids the motif consists of (excluding gaps). The default value for max_positions is 4.
min_positions (int): The minimum motif size (see also: max_positions). The default value for min_positions is 1.
no_gaps (bool): Must be set to True if only contiguous motifs (position-specific k-mers) are allowed. By default, no_gaps is False, meaning both gapped and ungapped motifs are searched for.
min_precision (float): The minimum precision threshold for keeping a motif. The default value for min_precision is 0.8.
min_recall (float): The minimum recall threshold for keeping a motif. The default value for min_recall is 0. It is also possible to specify a recall threshold for each motif size. In this case, a dictionary must be specified where the motif sizes are keys and the recall values are values. Use the MotifGeneralizationAnalysis report to calibrate the optimal recall threshold given a user-defined precision threshold to ensure generalisability to unseen data.
min_true_positives (int): The minimum number of true positive sequences that a motif needs to occur in. The default value for min_true_positives is 10.
candidate_motif_filepath (str): Optional filepath for pre-filtered candidate motifs. This may be used to save time. Only the given candidate motifs are considered. When this encoder has been run previously, a candidate motifs file named ‘all_candidate_motifs.tsv’ will have been exported. This file contains all possible motifs with high enough min_true_positives without applying precision and recall thresholds. The file must be a tab-separated file, structured as follows:
indices    amino_acids
1&2&3      A&G&C
5&7        E&D
The example above contains two motifs: AGC in positions 1, 2 and 3, and E-D in positions 5 and 7 (with a gap at position 6).
label (str): The name of the binary label to train the encoder for. This is only necessary when the dataset contains multiple labels.
YAML specification:
definitions:
    encodings:
        my_motif_encoder:
            MotifEncoder:
                max_positions: 4
                min_precision: 0.8
                min_recall: # different recall thresholds for each motif size
                    1: 0.5 # For shorter motifs, a stricter recall threshold is used
                    2: 0.1
                    3: 0.01
                    4: 0.001
                min_true_positives: 10
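The filtering logic described above (keep a positional motif only if its precision, recall and number of true positives for the positive class are high enough) can be sketched like this. All function names are hypothetical. The motif is parsed from the ‘&’-separated notation of the candidate motif file, and the indices are treated here as 0-based string offsets, which is an assumption because the indexing convention is not stated above.

```python
def parse_motif(indices, amino_acids):
    """Split the '&'-separated file notation into positions and residues."""
    return [int(i) for i in indices.split("&")], amino_acids.split("&")


def motif_present(sequence, positions, residues):
    """True if every motif position holds the required amino acid."""
    return all(sequence[p] == r for p, r in zip(positions, residues))


def precision_recall(sequences, is_positive, positions, residues):
    """Precision, recall and true positives of 'motif present => positive'."""
    predicted = [motif_present(s, positions, residues) for s in sequences]
    tp = sum(1 for p, y in zip(predicted, is_positive) if p and y)
    fp = sum(1 for p, y in zip(predicted, is_positive) if p and not y)
    fn = sum(1 for p, y in zip(predicted, is_positive) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, tp


def keep_motif(n_positions, precision, recall, tp,
               min_precision=0.8, min_recall=0.0, min_true_positives=10):
    """Apply the thresholds; min_recall may be a dict keyed by motif size."""
    recall_threshold = min_recall[n_positions] if isinstance(min_recall, dict) else min_recall
    return (precision >= min_precision and recall >= recall_threshold
            and tp >= min_true_positives)


positions, residues = parse_motif("0&1&2", "A&G&C")
sequences = ["AGCDE", "AGCKK", "AGCDD", "TTTDE"]  # all of the same length
is_positive = [True, True, False, True]
print(precision_recall(sequences, is_positive, positions, residues))
```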
OneHot¶
One-hot encoding for repertoires, sequences or receptors. In one-hot encoding, each alphabet character (amino acid or nucleotide) is replaced by a sparse vector with one 1 and the rest zeroes. The position of the 1 represents the alphabet character.
Dataset type:
SequenceDatasets
ReceptorDatasets
RepertoireDatasets
Specification arguments:
use_positional_info (bool): whether to include features representing the positional information. If True, three additional feature vectors will be added, representing the sequence start, sequence middle and sequence end. The values in these features are scaled between 0 and 1. A graphical representation of the values of these vectors is given below.
Value of sequence start:          Value of sequence middle:         Value of sequence end:

1 \                               1    /‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾\          1                         /
   \                                  /                   \                                  /
    \                                /                     \                                /
0    \_____________________       0 /                       \       0 _____________________/
  <----sequence length---->         <----sequence length---->         <----sequence length---->
distance_to_seq_middle (int): only applies when use_positional_info is True. This is the distance from the edge of the CDR3 sequence (IMGT positions 105 and 117) to the portion of the sequence that is considered ‘middle’. For example: if distance_to_seq_middle is 6 (default), all IMGT positions in the interval [111, 112) receive positional value 1. When using nucleotide sequences: note that the distance is measured in (amino acid) IMGT positions. If the complete sequence length is smaller than 2 * distance_to_seq_middle, the minimum value of the ‘start’ and ‘end’ vectors will not reach 0, and the maximum value of the ‘middle’ vector will not reach 1. A graphical representation of the positional vectors for a sequence that is too short is given below:
Value of sequence start           Value of sequence middle          Value of sequence end
with very short sequence:         with very short sequence:         with very short sequence:

1 \                               1                                 1   /
   \                                                                   /
    \                                /\                               /
0                                 0 /  \                            0
  <->                               <-->                              <->
flatten (bool): whether to flatten the final onehot matrix to a 2-dimensional matrix [examples, other_dims_combined]. This must be set to True when using onehot encoding in combination with scikit-learn ML methods (inheriting SklearnMethod), such as LogisticRegression, SVM, SVC, RandomForestClassifier and KNN.
sequence_type: whether to use nucleotide or amino acid sequence for encoding. Valid values are ‘nucleotide’ and ‘amino_acid’.
YAML specification:
definitions:
    encodings:
        one_hot_vanilla:
            OneHot:
                use_positional_info: False
                flatten: False
                sequence_type: amino_acid
        one_hot_positional:
            OneHot:
                use_positional_info: True
                distance_to_seq_middle: 3
                flatten: False
                sequence_type: nucleotide
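A minimal sketch of one-hot encoding a single amino acid sequence, and of what flatten does to the resulting matrix, is given below. The alphabet order and function names are assumptions for illustration, and neither the positional features nor padding of sequences with different lengths is covered.

```python
# Assumed alphabet order; the encoder's actual ordering may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence, alphabet=AMINO_ACIDS):
    """Return a [sequence length x alphabet size] matrix of 0s and 1s."""
    index = {char: i for i, char in enumerate(alphabet)}
    matrix = []
    for char in sequence:
        row = [0] * len(alphabet)
        row[index[char]] = 1  # the position of the 1 identifies the character
        matrix.append(row)
    return matrix


def flatten(matrix):
    """Concatenate the rows into one feature vector per example."""
    return [value for row in matrix for value in row]


encoded = one_hot("CASS")
print(len(encoded), len(encoded[0]))
print(len(flatten(encoded)))
```

The flattened form is what a 2-dimensional design matrix needs: one row per example, with all other dimensions combined into columns.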
SequenceAbundance¶
This encoder represents the repertoires as vectors where:
the first element corresponds to the number of label-associated clonotypes
the second element is the total number of unique clonotypes
To determine what clonotypes (with features defined by comparison_attributes) are label-associated, one-sided Fisher’s exact test is used.
The encoder also writes out files containing the contingency table used for Fisher’s exact test, the resulting p-values, and the significantly abundant sequences (use RelevantSequenceExporter to export these sequences in AIRR format).
Reference: Emerson, Ryan O. et al. ‘Immunosequencing Identifies Signatures of Cytomegalovirus Exposure History and HLA-Mediated Effects on the T Cell Repertoire’. Nature Genetics 49, no. 5 (May 2017): 659–65. doi.org/10.1038/ng.3822.
Note: to use this encoder, it is necessary to explicitly define the positive class for the label when defining the label in the instruction. With positive class defined, it can then be determined which sequences are indicative of the positive class. For a full example of using this encoder, see Reproduction of the CMV status predictions study.
Dataset type:
RepertoireDatasets
Specification arguments:
comparison_attributes (list): The attributes used to group receptors into clonotypes. Only the fields specified in comparison_attributes are considered; all other fields are ignored. Valid values are any repertoire field names (e.g., as specified in the AIRR rearrangement schema).
p_value_threshold (float): The p-value threshold to be used by the statistical test.
sequence_batch_size (int): The number of sequences in a batch when comparing sequences across repertoires, typically hundreds of thousands. This does not affect the results of the encoding, only the speed. The default value is 1,000,000.
repertoire_batch_size (int): How many repertoires will be loaded at once. This does not affect the result of the encoding, only the speed. This value is a trade-off between the number of repertoires that can fit in RAM at one time and the loading time from disk.
YAML specification:
definitions:
    encodings:
        my_sa_encoding:
            SequenceAbundance:
                comparison_attributes:
                    - cdr3_aa
                    - v_call
                    - j_call
                p_value_threshold: 0.05
                sequence_batch_size: 100000
                repertoire_batch_size: 32
SimilarToPositiveSequence¶
A simple baseline encoding, to be used in combination with BinaryFeatureClassifier using keep_all = True. This encoder keeps track of all positive sequences in the training set, and ignores the negative sequences. Any sequence within a given Hamming distance from a positive training sequence will be classified positive; all other sequences will be classified negative.
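As a rough illustration of the rule described above, the sketch below checks whether a sequence lies within a given Hamming distance of any positive training sequence. The function names and toy sequences are hypothetical, and treating sequences of unequal length as never similar is an assumption of this example; immuneML itself may delegate the matching to CompAIRR.

```python
def hamming_distance(a, b):
    """Number of mismatching positions, or None if the lengths differ."""
    if len(a) != len(b):
        return None  # Hamming distance is undefined for unequal lengths
    return sum(x != y for x, y in zip(a, b))


def is_similar_to_positive(sequence, positive_sequences, max_distance):
    """True if `sequence` is within `max_distance` of any positive sequence."""
    for positive in positive_sequences:
        distance = hamming_distance(sequence, positive)
        if distance is not None and distance <= max_distance:
            return True
    return False


# Toy positive sequences remembered from the training set
positives = ["CASSLGTDTQYF", "CASSPGQGNEQFF"]

print(is_similar_to_positive("CASSLGADTQYW", positives, 2))  # True (2 mismatches)
print(is_similar_to_positive("CASSAAADTQYW", positives, 2))  # False (4 mismatches)
```

The single binary feature produced this way is why the encoder is paired with BinaryFeatureClassifier using keep_all = True.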
Dataset type:
SequenceDatasets
Specification arguments:
hamming_distance (int): Maximum number of differences allowed between any positive sequence of the training set and a new observed sequence in order for the observed sequence to be classified as ‘positive’.
compairr_path (Path): optional path to the CompAIRR executable. If not given, it is assumed that CompAIRR has been installed such that it can be called directly on the command line with the command ‘compairr’, or that it is located at /usr/local/bin/compairr.
ignore_genes (bool): Only used when compairr is used. Whether to ignore V and J gene information. If False, the V and J genes between two sequences have to match for the sequence to be considered ‘similar’. If True, gene information is ignored. By default, ignore_genes is False.
threads (int): The number of threads to use for parallelization. This does not affect the results of the encoding, only the speed. The default number of threads is 8.
keep_temporary_files (bool): whether to keep temporary files, including CompAIRR input, output and log files, and the sequence presence matrix. This may take a lot of storage space if the input dataset is large. By default temporary files are not kept.
YAML specification:
definitions:
    encodings:
        my_sequence_encoder:
            SimilarToPositiveSequence:
                hamming_distance: 2
TCRdist¶
Encodes the given ReceptorDataset as a distance matrix between all receptors, where the distance is computed using TCRdist from the paper: Dash P, Fiore-Gartland AJ, Hertz T, et al. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature. 2017; 547(7661):89-93. doi:10.1038/nature22383.
For the implementation, the TCRdist3 library was used (source code available here).
Dataset type:
ReceptorDatasets
Specification arguments:
cores (int): number of processes to use for the computation
YAML specification:
definitions:
    encodings:
        my_tcr_dist_enc:
            TCRdist:
                cores: 4
Word2Vec¶
Word2VecEncoder learns the vector representations of k-mers based on the context (receptor sequence). A similar idea was discussed in: Ostrovsky-Berman, M., Frankel, B., Polak, P. & Yaari, G. Immune2vec: Embedding B/T Cell Receptor Sequences in ℝ^N Using Natural Language Processing. Frontiers in Immunology 12, (2021).
This encoder relies on gensim’s implementation of Word2Vec and KmerHelper for k-mer extraction. Currently it works on the amino acid level.
Dataset type:
SequenceDatasets
RepertoireDatasets
Specification arguments:
vector_size (int): The size of the vector to be learnt.
model_type (ModelType): The context which will be used to infer the representation of the sequence. If SEQUENCE is used, the context of a k-mer is defined by the sequence it occurs in (e.g. if the sequence is CASTTY and the k-mer is AST, then its context consists of k-mers CAS, STT, TTY). If KMER_PAIR is used, the context for the k-mer is defined as all the k-mers that are within one edit distance (e.g. for k-mer CAS, the context includes CAA, CAC, CAD etc.). Valid values are SEQUENCE, KMER_PAIR.
k (int): The length of the k-mers used for the encoding.
epochs (int): for how many epochs to train the word2vec model for a given set of sentences (corresponding to epochs parameter in gensim package)
window (int): max distance between two k-mers in a sequence (same as window parameter in gensim’s word2vec)
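The two context definitions for model_type can be sketched in a few lines of plain Python. This is only an illustration: the function names are hypothetical, the window parameter is not modelled, and "one edit distance" is interpreted here as a single substitution, which is an assumption based on the CAS example above.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet for this sketch


def kmers(sequence, k):
    """All overlapping k-mers of a sequence, in order of occurrence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def sequence_context(sequence, kmer, k):
    """SEQUENCE-style context: the other k-mers of the same sequence."""
    return [other for other in kmers(sequence, k) if other != kmer]


def kmer_pair_context(kmer, alphabet=AMINO_ACIDS):
    """KMER_PAIR-style context: k-mers one substitution away (assumed)."""
    return [kmer[:i] + letter + kmer[i + 1:]
            for i in range(len(kmer))
            for letter in alphabet
            if letter != kmer[i]]


print(sequence_context("CASTTY", "AST", 3))  # ['CAS', 'STT', 'TTY']

neighbours = kmer_pair_context("CAS")
print(len(neighbours))  # 57
print(all(x in neighbours for x in ["CAA", "CAC", "CAD"]))  # True
```

With a 20-letter alphabet, each of the 3 positions can be replaced by 19 other letters, which gives the 57 neighbouring k-mers.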
YAML specification:
definitions:
    encodings:
        my_w2v:
            Word2Vec:
                vector_size: 16
                k: 3
                model_type: SEQUENCE
                epochs: 100
                window: 8
ML methods¶
Under the definitions/ml_methods component, the user can specify different ML methods to use on a given (encoded) dataset.
From version 3, immuneML includes different types of ML methods:
Classifiers which make predictions about labelled data.
Clustering methods which can cluster unlabelled data.
Generative models to generate new AIR sequences.
Dimensionality reduction methods to reduce the dimensionality of the data before analysing it.
Note
Clustering methods, Generative models and Dimensionality reduction methods are experimental features.
Classifiers¶
ML method classifiers are algorithms which can be trained to predict some label on immune repertoires, receptors or sequences.
These methods can be trained using the TrainMLModel instruction, and previously trained models can be applied to new data using the MLApplication instruction.
When choosing which ML method(s) are most suitable for your use-case, please consider the following table:
ML method | binary classification | multi-class classification | sequence dataset | receptor dataset | repertoire dataset | model selection CV |
---|---|---|---|---|---|---|
AtchleyKmerMILClassifier | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ |
BinaryFeatureClassifier | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ |
DeepRC | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ |
KNN | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
KerasSequenceCNN | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ |
LogisticRegression | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
PrecomputedKNN | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
ProbabilisticBinaryClassifier | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ |
RandomForestClassifier | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
ReceptorCNN | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ |
SVC | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
SVM | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
TCRdistClassifier | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
AtchleyKmerMILClassifier¶
A binary Repertoire classifier which uses the data encoded by AtchleyKmer encoder to predict the repertoire label.
The original publication: Ostmeyer J, Christley S, Toby IT, Cowell LG. Biophysicochemical motifs in T cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocytes and adjacent healthy tissue. Cancer Res. Published online January 1, 2019:canres.2292.2018. doi:10.1158/0008-5472.CAN-18-2292.
Specification arguments:
iteration_count (int): max number of training iterations
threshold (float): loss threshold at which to stop training if reached
evaluate_at (int): log model performance every ‘evaluate_at’ iterations and store the model every ‘evaluate_at’ iterations if early stopping is used
use_early_stopping (bool): whether to use early stopping
learning_rate (float): learning rate for stochastic gradient descent
random_seed (int): random seed used
zero_abundance_weight_init (bool): whether to use 0 as initial weight for abundance term (if not, a random value is sampled from a normal distribution with mean 0 and variance 1 / total_number_of_features)
number_of_threads (int): number of threads to be used for training
initialization_count (int): how many times to repeat the fitting procedure from the beginning before choosing the optimal model (trains the model with multiple random initializations)
pytorch_device_name (str): The name of the pytorch device to use. This name will be passed to torch.device(pytorch_device_name).
YAML specification:
definitions:
    ml_methods:
        my_kmer_mil_classifier:
            AtchleyKmerMILClassifier:
                iteration_count: 100
                evaluate_at: 15
                use_early_stopping: False
                learning_rate: 0.01
                random_seed: 100
                zero_abundance_weight_init: True
                number_of_threads: 8
                threshold: 0.00001
                initialization_count: 4
BinaryFeatureClassifier¶
A simple classifier that takes in encoded data containing features with only 1/0 or True/False values.
This classifier gives a positive prediction if any of the binary features for an example are ‘true’.
Optionally, the classifier can select an optimal subset of these features. In this case, the given data is split into a training and validation set, a minimal set of features is learned through greedy forward selection, and the validation set is used to determine when to stop growing the set of features (early stopping). Early stopping is reached when the optimization metric on the validation set no longer improves for a given number of features (patience). The optimization metric is the same metric as the one used for optimization in the TrainMLModelInstruction.
Currently, this classifier can be used in combination with two encoders:
The classifier can be used in combination with the MotifEncoder, such that sequences containing any of the positive class-associated motifs are classified as positive. A reduced subset of binding-associated motifs can be learned (when keep_all is false). This results in a set of complementary motifs, minimizing the redundant predictions made by different motifs.
Alternatively, this classifier can be combined with the SimilarToPositiveSequenceEncoder such that any sequence that falls within a given Hamming distance from any of the positive class sequences in the training set is classified as positive. Parameter keep_all should be set to true, since this encoder creates only 1 feature.
Specification arguments:
training_percentage (float): What percentage of data to use for training (the rest will be used for validation); values between 0 and 1
keep_all (bool): Whether to keep all the input features (true) or learn a reduced subset (false). By default, keep_all is false.
random_seed (int): Random seed for splitting the data into training and validation sets when learning a minimal subset of features. This is only used when keep_all is false.
max_features (int): The maximum number of features to allow in the reduced subset. When this number is reached, no more features are added even if the early stopping criterion is not reached yet. This is only used when keep_all is false. By default, max_features is 100.
patience (int): The patience for early stopping. When early stopping is reached, <patience> more features are added to the reduced set to test whether the optimization metric on the validation set improves again. By default, patience is 5.
min_delta (float): The delta value used to test if there was improvement between the previous set of features and the new set of features (+1). By default, min_delta is 0, meaning the new set of features does not need to yield a higher optimization metric score on the validation set, but it needs to be at least equally high as the previous set.
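The interplay of max_features, patience and min_delta can be illustrated with a small greedy forward selection loop. This is a sketch under assumptions rather than the classifier's actual code: accuracy stands in for the optimization metric, candidate features are ranked on the training split, and all function names and the toy data are hypothetical.

```python
def predict(rows, selected):
    """Positive (1) if any of the selected binary features is set."""
    return [int(any(row[f] for f in selected)) for row in rows]


def accuracy(rows, labels, selected):
    predictions = predict(rows, selected)
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def select_features(train, train_y, val, val_y,
                    max_features=100, patience=5, min_delta=0.0):
    """Greedy forward selection with early stopping on the validation set."""
    n_features = len(train[0])
    selected, best_selected = [], []
    best_score = accuracy(val, val_y, [])
    rounds_without_improvement = 0
    while (len(selected) < min(max_features, n_features)
           and rounds_without_improvement < patience):
        remaining = [f for f in range(n_features) if f not in selected]
        # Add the feature that helps most on the training split.
        candidate = max(remaining,
                        key=lambda f: accuracy(train, train_y, selected + [f]))
        selected = selected + [candidate]
        score = accuracy(val, val_y, selected)
        if score >= best_score + min_delta:
            best_score, best_selected = score, list(selected)
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
    return best_selected


# Toy data: features 0 and 1 mark positives, feature 2 is noise.
train = [[1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 0, 1],
         [0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 0]]
train_y = [1, 1, 1, 0, 0, 0, 1, 0]
val = [[1, 0, 0], [0, 1, 1], [0, 0, 1], [0, 0, 0]]
val_y = [1, 1, 0, 0]

chosen = select_features(train, train_y, val, val_y, max_features=3, patience=1)
print(chosen)  # [0, 1]
```

In this toy run the noisy third feature lowers the validation score, so it is tried once (patience of 1) and then discarded, leaving the two complementary features.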
YAML specification:
definitions:
    ml_methods:
        my_motif_classifier:
            BinaryFeatureClassifier:
                training_percentage: 0.7
                max_features: 100
                patience: 5
                min_delta: 0
                keep_all: false
DeepRC¶
This classifier uses the DeepRC method for repertoire classification. The DeepRC ML method should be used in combination with the DeepRC encoder. Also consider using the DeepRCMotifDiscovery report for interpretability.
Notes:
DeepRC uses PyTorch functionalities that depend on GPU. Therefore, DeepRC does not work on a CPU.
This wrapper around DeepRC currently only supports binary classification.
Reference: Michael Widrich, Bernhard Schäfl, Milena Pavlović, Geir Kjetil Sandve, Sepp Hochreiter, Victor Greiff, Günter Klambauer ‘DeepRC: Immune repertoire classification with attention-based deep massive multiple instance learning’. bioRxiv preprint doi: https://doi.org/10.1101/2020.04.12.038158
Specification arguments:
validation_part (float): the part of the data that will be used for validation, the rest will be used for training.
add_positional_information (bool): whether positional information should be included in the input features.
kernel_size (int): the size of the 1D-CNN kernels.
n_kernels (int): the number of 1D-CNN kernels in each layer.
n_additional_convs (int): Number of additional 1D-CNN layers after first layer
n_attention_network_layers (int): Number of attention layers to compute keys
n_attention_network_units (int): Number of units in each attention layer
n_output_network_units (int): Number of units in the output layer
consider_seq_counts (bool): whether the input data should be scaled by the receptor sequence counts.
sequence_reduction_fraction (float): Fraction of number of sequences to which to reduce the number of sequences per bag based on attention weights. Has to be in range [0,1].
reduction_mb_size (int): Reduction of sequences per bag is performed using minibatches of reduction_mb_size sequences to compute the attention weights.
n_updates (int): Number of updates to train for
n_torch_threads (int): Number of parallel threads to allow PyTorch
learning_rate (float): Learning rate for adam optimizer
l1_weight_decay (float): l1 weight decay factor. l1 weight penalty will be added to loss, scaled by l1_weight_decay
l2_weight_decay (float): l2 weight decay factor. l2 weight penalty will be added to loss, scaled by l2_weight_decay
sequence_counts_scaling_fn: it can either be log (logarithmic scaling of sequence counts) or None
evaluate_at (int): Evaluate model on training and validation set every evaluate_at updates. This will also check for a new best model for early stopping.
sample_n_sequences (int): Optional random sub-sampling of sample_n_sequences sequences per repertoire. Number of sequences per repertoire might be smaller than sample_n_sequences if repertoire is smaller or random indices have been drawn multiple times. If None, all sequences will be loaded for each repertoire.
training_batch_size (int): Number of repertoires per minibatch during training.
n_workers (int): Number of background processes to use for converting dataset to hdf5 container and training set data loader.
pytorch_device_name (str): The name of the pytorch device to use. This name will be passed to torch.device(pytorch_device_name). The default value is cuda:0.
YAML specification:
definitions:
    ml_methods:
        my_deeprc_method:
            DeepRC:
                validation_part: 0.2
                add_positional_information: True
                kernel_size: 9
KNN¶
This is a wrapper of scikit-learn’s KNeighborsClassifier class. This ML method creates a distance matrix using the given encoded data. If the encoded data is already a distance matrix (for example, when using the Distance or CompAIRRDistance encoders), please use PrecomputedKNN instead.
Please see the scikit-learn documentation of KNeighborsClassifier for the parameters.
Scikit-learn models can be trained in two modes:
1. Creating a model using a given set of hyperparameters, and relying on the selection and assessment loop in the TrainMLModel instruction to select the optimal model.
2. Passing a range of different hyperparameters to KNN, and using a third layer of nested cross-validation to find the optimal hyperparameters through grid search. In this case, only the KNN model with the optimal hyperparameter settings is further used in the inner selection loop of the TrainMLModel instruction.
By default, mode 1 is used. In order to use mode 2, model_selection_cv and model_selection_n_folds must be set.
Specification arguments:
KNN (dict): Under this key, hyperparameters can be specified that will be passed to the scikit-learn class. Any scikit-learn hyperparameters can be specified here. In mode 1, a single value must be specified for each of the scikit-learn hyperparameters. In mode 2, it is possible to specify a range of different hyperparameter values in a list. It is also allowed to mix lists and single values in mode 2, in which case the grid search will only be done for the lists, while the single-value hyperparameters will be fixed. In addition to the scikit-learn hyperparameters, parameter show_warnings (True/False) can be specified here. This determines whether scikit-learn warnings, such as convergence warnings, should be printed. By default show_warnings is True.
model_selection_cv (bool): If any of the hyperparameters under KNN is a list and model_selection_cv is True, a grid search will be done over the given hyperparameters, using the number of folds specified in model_selection_n_folds. By default, model_selection_cv is False.
model_selection_n_folds (int): The number of folds that should be used for the cross validation grid search if model_selection_cv is True.
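To make mode 2 concrete, the sketch below expands a mix of lists and single values into the individual hyperparameter combinations that a grid search would evaluate. The helper `expand_grid` is hypothetical and only mirrors the idea of a parameter grid; it is not how immuneML or scikit-learn implement the search.

```python
from itertools import product


def expand_grid(hyperparameters):
    """Return one dict per combination; single values are kept fixed."""
    as_lists = {name: value if isinstance(value, list) else [value]
                for name, value in hyperparameters.items()}
    names = list(as_lists)
    return [dict(zip(names, combination))
            for combination in product(*(as_lists[name] for name in names))]


# Same values as in a mode 2 specification: one fixed value, one list
grid = expand_grid({"weights": "uniform", "n_neighbors": [5, 10, 15]})
for candidate in grid:
    print(candidate)
# {'weights': 'uniform', 'n_neighbors': 5}
# {'weights': 'uniform', 'n_neighbors': 10}
# {'weights': 'uniform', 'n_neighbors': 15}
```

Each candidate would then be scored with cross-validation over model_selection_n_folds folds, and only the best-scoring setting is passed on to the TrainMLModel instruction. The same reasoning applies to the other scikit-learn wrappers on this page.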
YAML specification:
definitions:
    ml_methods:
        my_knn_method:
            KNN:
                # sklearn parameters (same names as in original sklearn class)
                weights: uniform # always use this setting for weights
                n_neighbors: [5, 10, 15] # find the optimal number of neighbors
                # Additional parameter that determines whether to print convergence warnings
                show_warnings: True
            # if any of the parameters under KNN is a list and model_selection_cv is True,
            # a grid search will be done over the given parameters, using the number of folds specified in model_selection_n_folds,
            # and the optimal model will be selected
            model_selection_cv: True
            model_selection_n_folds: 5
        # alternative way to define ML method with default values:
        my_default_knn: KNN
KerasSequenceCNN¶
A CNN-based classifier for sequence datasets. Should be used in combination with source.encodings.onehot.OneHotEncoder.OneHotEncoder.
This classifier integrates the CNN proposed by Mason et al., the original code can be found at: https://github.com/dahjan/DMS_opt/blob/master/scripts/CNN.py
Note: make sure keras and tensorflow dependencies are installed (see installation instructions).
Reference: Derek M. Mason, Simon Friedensohn, Cédric R. Weber, Christian Jordi, Bastian Wagner, Simon M. Meng, Roy A. Ehling, Lucia Bonati, Jan Dahinden, Pablo Gainza, Bruno E. Correia and Sai T. Reddy ‘Optimization of therapeutic antibodies by predicting antigen specificity from antibody sequence via deep learning’. Nat Biomed Eng 5, 600–612 (2021). https://doi.org/10.1038/s41551-021-00699-9
Specification arguments:
units_per_layer (list): A nested list specifying the layers of the CNN. The first element in each nested list defines the layer type, other elements define the layer parameters. Valid layer types are: CONV (keras.layers.Conv1D), DROP (keras.layers.Dropout), POOL (keras.layers.MaxPool1D), FLAT (keras.layers.Flatten), DENSE (keras.layers.Dense). The parameters per layer type are as follows:
[CONV, <filters>, <kernel_size>, <strides>]
[DROP, <rate>]
[POOL, <pool_size>, <strides>]
[FLAT]
[DENSE, <units>]
activation (str): The activation function to use in the convolutional or dense layers. Activation functions can be chosen from keras.activations, for example relu or softmax. By default, relu is used.
training_percentage (float): The fraction of sequences that will be randomly assigned to form the training set (the rest will be the validation set). Should be a value between 0 and 1. By default, training_percentage is 0.7.
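The nested units_per_layer list can be read as follows. This sketch only pairs each value with the parameter name listed above and checks the number of values per layer type; the helper `describe_layers` is hypothetical and does not build a Keras model.

```python
# Parameter names per layer type, following the list above.
LAYER_PARAMETERS = {
    "CONV": ("filters", "kernel_size", "strides"),
    "DROP": ("rate",),
    "POOL": ("pool_size", "strides"),
    "FLAT": (),
    "DENSE": ("units",),
}


def describe_layers(units_per_layer):
    """Pair each layer's values with the parameter names they stand for."""
    described = []
    for layer_type, *values in units_per_layer:
        if layer_type not in LAYER_PARAMETERS:
            raise ValueError(f"Unknown layer type: {layer_type}")
        names = LAYER_PARAMETERS[layer_type]
        if len(values) != len(names):
            raise ValueError(f"{layer_type} expects {len(names)} value(s), got {len(values)}")
        described.append((layer_type, dict(zip(names, values))))
    return described


layers = describe_layers([["CONV", 400, 3, 1], ["DROP", 0.5],
                          ["POOL", 2, 1], ["FLAT"], ["DENSE", 50]])
for layer_type, parameters in layers:
    print(layer_type, parameters)
```

For the example specification, the first entry reads as a Conv1D layer with 400 filters, kernel size 3 and stride 1, and the last as a Dense layer with 50 units.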
YAML specification:
definitions:
    ml_methods:
        my_cnn:
            KerasSequenceCNN:
                training_percentage: 0.7
                units_per_layer: [[CONV, 400, 3, 1], [DROP, 0.5], [POOL, 2, 1], [FLAT], [DENSE, 50]]
                activation: relu
LogisticRegression¶
This is a wrapper of scikit-learn’s LogisticRegression class. Please see the scikit-learn documentation of LogisticRegression for the parameters.
Note: if you are interested in plotting the coefficients of the logistic regression model, consider running the Coefficients report.
Scikit-learn models can be trained in two modes:
1. Creating a model using a given set of hyperparameters, and relying on the selection and assessment loop in the TrainMLModel instruction to select the optimal model.
2. Passing a range of different hyperparameters to LogisticRegression, and using a third layer of nested cross-validation to find the optimal hyperparameters through grid search. In this case, only the LogisticRegression model with the optimal hyperparameter settings is further used in the inner selection loop of the TrainMLModel instruction.
By default, mode 1 is used. In order to use mode 2, model_selection_cv and model_selection_n_folds must be set.
Specification arguments:
LogisticRegression (dict): Under this key, hyperparameters can be specified that will be passed to the scikit-learn class. Any scikit-learn hyperparameters can be specified here. In mode 1, a single value must be specified for each of the scikit-learn hyperparameters. In mode 2, it is possible to specify a range of different hyperparameter values in a list. It is also allowed to mix lists and single values in mode 2, in which case the grid search will only be done for the lists, while the single-value hyperparameters will be fixed. In addition to the scikit-learn hyperparameters, parameter show_warnings (True/False) can be specified here. This determines whether scikit-learn warnings, such as convergence warnings, should be printed. By default show_warnings is True.
model_selection_cv (bool): If any of the hyperparameters under LogisticRegression is a list and model_selection_cv is True, a grid search will be done over the given hyperparameters, using the number of folds specified in model_selection_n_folds. By default, model_selection_cv is False.
model_selection_n_folds (int): The number of folds that should be used for the cross validation grid search if model_selection_cv is True.
YAML specification:
definitions:
    ml_methods:
        my_logistic_regression: # user-defined method name
            LogisticRegression: # name of the ML method
                # sklearn parameters (same names as in original sklearn class)
                penalty: l1 # always use penalty l1
                C: [0.01, 0.1, 1, 10, 100] # find the optimal value for C
                # Additional parameter that determines whether to print convergence warnings
                show_warnings: True
            # if any of the parameters under LogisticRegression is a list and model_selection_cv is True,
            # a grid search will be done over the given parameters, using the number of folds specified in model_selection_n_folds,
            # and the optimal model will be selected
            model_selection_cv: True
            model_selection_n_folds: 5
        # alternative way to define ML method with default values:
        my_default_logistic_regression: LogisticRegression
PrecomputedKNN¶
This is a wrapper of scikit-learn’s KNeighborsClassifier class. This ML method takes a pre-computed distance matrix, as created by the Distance or CompAIRRDistance encoders. If you would like to use a different encoding in combination with KNN, please use KNN instead.
Please see the scikit-learn documentation of KNeighborsClassifier for the parameters.
Scikit-learn models can be trained in two modes:
1. Creating a model using a given set of hyperparameters, and relying on the selection and assessment loop in the TrainMLModel instruction to select the optimal model.
2. Passing a range of different hyperparameters to KNN, and using a third layer of nested cross-validation to find the optimal hyperparameters through grid search. In this case, only the KNN model with the optimal hyperparameter settings is further used in the inner selection loop of the TrainMLModel instruction.
By default, mode 1 is used. In order to use mode 2, model_selection_cv and model_selection_n_folds must be set.
Specification arguments:
PrecomputedKNN (dict): Under this key, hyperparameters can be specified that will be passed to the scikit-learn class. Any scikit-learn hyperparameters can be specified here. In mode 1, a single value must be specified for each of the scikit-learn hyperparameters. In mode 2, it is possible to specify a range of different hyperparameter values in a list. It is also allowed to mix lists and single values in mode 2, in which case the grid search will only be done for the lists, while the single-value hyperparameters will be fixed. In addition to the scikit-learn hyperparameters, parameter show_warnings (True/False) can be specified here. This determines whether scikit-learn warnings, such as convergence warnings, should be printed. By default show_warnings is True.
model_selection_cv (bool): If any of the hyperparameters under PrecomputedKNN is a list and model_selection_cv is True, a grid search will be done over the given hyperparameters, using the number of folds specified in model_selection_n_folds. By default, model_selection_cv is False.
model_selection_n_folds (int): The number of folds that should be used for the cross validation grid search if model_selection_cv is True.
YAML specification:
definitions:
    ml_methods:
        my_knn_method:
            PrecomputedKNN:
                # sklearn parameters (same names as in original sklearn class)
                weights: uniform # always use this setting for weights
                n_neighbors: [5, 10, 15] # find the optimal number of neighbors
                # Additional parameter that determines whether to print convergence warnings
                show_warnings: True
            # if any of the parameters under PrecomputedKNN is a list and model_selection_cv is True,
            # a grid search will be done over the given parameters, using the number of folds specified in model_selection_n_folds,
            # and the optimal model will be selected
            model_selection_cv: True
            model_selection_n_folds: 5
        # alternative way to define ML method with default values:
        my_default_knn: PrecomputedKNN
ProbabilisticBinaryClassifier¶
ProbabilisticBinaryClassifier predicts the class assignment in the binary classification case, based on examples encoded by the number of successful trials and the total number of trials. It models this ratio by one beta distribution per class and predicts the class of new examples using the log-posterior odds ratio with a threshold at 0.
ProbabilisticBinaryClassifier is based on the paper (details on the classification can be found in the Online Methods section): Emerson, Ryan O., William S. DeWitt, Marissa Vignali, Jenna Gravley, Joyce K. Hu, Edward J. Osborne, Cindy Desmarais, et al. ‘Immunosequencing Identifies Signatures of Cytomegalovirus Exposure History and HLA-Mediated Effects on the T Cell Repertoire’. Nature Genetics 49, no. 5 (May 2017): 659–65. doi.org/10.1038/ng.3822.
Specification arguments:
max_iterations (int): maximum number of iterations while optimizing the parameters of the beta distribution (same for both classes)
update_rate (float): how much the computed gradient should influence the updated value of the parameters of the beta distribution
likelihood_threshold (float): at which threshold to stop the optimization (default -1e-10)
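The decision rule can be sketched with a beta-binomial likelihood per class. This is a simplified illustration and not the classifier's code: the beta parameters and the equal class prior below are made-up values (the real parameters are fitted during training), and the function names are hypothetical.

```python
from math import lgamma, log


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_beta_binomial(k, n, alpha, beta):
    """Log-probability of k successes in n trials under Beta(alpha, beta)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def log_posterior_odds(k, n, positive, negative, prior_positive=0.5):
    """Log-odds of the positive class; predict positive when above 0."""
    log_positive = log(prior_positive) + log_beta_binomial(k, n, *positive)
    log_negative = log(1 - prior_positive) + log_beta_binomial(k, n, *negative)
    return log_positive - log_negative


# Hypothetical (alpha, beta) pairs for the two classes
positive_params = (5.0, 2000.0)
negative_params = (1.0, 4000.0)

# k successful trials out of n total trials for two toy examples
print(log_posterior_odds(30, 10000, positive_params, negative_params) > 0)  # True
print(log_posterior_odds(1, 10000, positive_params, negative_params) > 0)   # False
```

An example with many successes relative to its number of trials is better explained by the positive-class distribution, so its log-posterior odds are above the threshold of 0.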
YAML specification:
definitions:
    ml_methods:
        my_probabilistic_classifier: # user-defined name of the ML method
            ProbabilisticBinaryClassifier: # method name
                max_iterations: 1000
                update_rate: 0.01
RandomForestClassifier¶
This is a wrapper of scikit-learn’s RandomForestClassifier class. Please see the scikit-learn documentation of RandomForestClassifier for the parameters.
Note: if you are interested in plotting the coefficients of the random forest classifier model, consider running the Coefficients report.
Scikit-learn models can be trained in two modes:
1. Creating a model using a given set of hyperparameters, and relying on the selection and assessment loop in the TrainMLModel instruction to select the optimal model.
2. Passing a range of different hyperparameters to RandomForestClassifier, and using a third layer of nested cross-validation to find the optimal hyperparameters through grid search. In this case, only the RandomForestClassifier model with the optimal hyperparameter settings is further used in the inner selection loop of the TrainMLModel instruction.
By default, mode 1 is used. In order to use mode 2, model_selection_cv and model_selection_n_folds must be set.
Specification arguments:
RandomForestClassifier (dict): Under this key, hyperparameters can be specified that will be passed to the scikit-learn class. Any scikit-learn hyperparameters can be specified here. In mode 1, a single value must be specified for each of the scikit-learn hyperparameters. In mode 2, it is possible to specify a range of different hyperparameter values in a list. It is also allowed to mix lists and single values in mode 2, in which case the grid search will only be done for the lists, while the single-value hyperparameters will be fixed. In addition to the scikit-learn hyperparameters, parameter show_warnings (True/False) can be specified here. This determines whether scikit-learn warnings, such as convergence warnings, should be printed. By default show_warnings is True.
model_selection_cv (bool): If any of the hyperparameters under RandomForestClassifier is a list and model_selection_cv is True, a grid search will be done over the given hyperparameters, using the number of folds specified in model_selection_n_folds. By default, model_selection_cv is False.
model_selection_n_folds (int): The number of folds that should be used for the cross validation grid search if model_selection_cv is True.
YAML specification:
definitions:
    ml_methods:
        my_random_forest_classifier: # user-defined method name
            RandomForestClassifier: # name of the ML method
                # sklearn parameters (same names as in original sklearn class)
                random_state: 100 # always use this value for random state
                n_estimators: [10, 50, 100] # find the optimal number of trees in the forest
                # Additional parameter that determines whether to print convergence warnings
                show_warnings: True
            # if any of the parameters under RandomForestClassifier is a list and model_selection_cv is True,
            # a grid search will be done over the given parameters, using the number of folds specified in model_selection_n_folds,
            # and the optimal model will be selected
            model_selection_cv: True
            model_selection_n_folds: 5
        # alternative way to define ML method with default values:
        my_default_random_forest: RandomForestClassifier
ReceptorCNN¶
A CNN which separately detects motifs using CNN kernels in each chain of paired receptor data, combines the kernel activations into a unique representation of the receptor and uses this representation to predict the antigen binding.
Requires one-hot encoded data as input (as produced by OneHot encoder), where use_positional_info must be set to True.
Notes:
ReceptorCNN can only be used with ReceptorDatasets, it does not work with SequenceDatasets
ReceptorCNN can only be used for binary classification, not multi-class classification.
Specification arguments:
kernel_count (int): number of kernels that will look for motifs for one chain
kernel_size (list): sizes of the kernels, i.e. how many amino acids to consider at the same time in the chain sequence; can be a list of values. For example, for value [3, 4] of kernel_size, kernel_count*len(kernel_size) kernels will be created, with kernel_count kernels of size 3 and kernel_count kernels of size 4 per chain
positional_channels (int): how many positional channels were included in the one-hot encoding of the receptor sequences (the OneHot encoder adds 3 positional channels when positional information is enabled)
sequence_type (SequenceType): type of the sequence
device: which device to use for the model (cpu or gpu) - for more details see PyTorch documentation on device parameter
number_of_threads (int): how many threads to use
random_seed (int): number used as a seed for random initialization
learning_rate (float): learning rate scaling the step size for optimization algorithm
iteration_count (int): for how many iterations to train the model
l1_weight_decay (float): weight decay l1 value for the CNN; encourages sparser representations
l2_weight_decay (float): weight decay l2 value for the CNN; shrinks weight coefficients towards zero
batch_size (int): how many receptors to process at once
training_percentage (float): what percentage of data to use for training (the rest will be used for validation); values between 0 and 1
evaluate_at (int): when to evaluate the model, e.g. every 100 iterations
background_probabilities: used for rescaling the kernel values to produce information gain matrix; represents the background probability of each amino acid (without positional information); if not specified, uniform background is assumed
YAML specification:
definitions:
    ml_methods:
        my_receptor_cnn:
            ReceptorCNN:
                kernel_count: 5
                kernel_size: [3]
                positional_channels: 3
                sequence_type: amino_acid
                device: cpu
                number_of_threads: 16
                random_seed: 100
                learning_rate: 0.01
                iteration_count: 10000
                l1_weight_decay: 0
                l2_weight_decay: 0
                batch_size: 5000
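The kernel arithmetic described for kernel_count and kernel_size can be sketched as follows. This is a minimal illustration of the counting rule only, not immuneML code; the function name kernels_per_chain is hypothetical.

```python
def kernels_per_chain(kernel_count, kernel_size):
    """Map each kernel size to the number of kernels of that size for one chain.

    Hypothetical helper: it only mirrors the rule that kernel_count kernels
    are created for every entry in kernel_size.
    """
    return {size: kernel_count for size in kernel_size}


layout = kernels_per_chain(kernel_count=5, kernel_size=[3, 4])
total_per_chain = sum(layout.values())  # kernel_count * len(kernel_size)
print(layout, total_per_chain)
```

With kernel_size: [3], as in the YAML specification above, the same rule gives kernel_count kernels per chain.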
SVC¶
This is a wrapper of scikit-learn’s LinearSVC class. Please see the scikit-learn documentation of LinearSVC for the parameters.
Note: if you are interested in plotting the coefficients of the SVC model, consider running the Coefficients report.
Scikit-learn models can be trained in two modes:
1. Creating a model using a given set of hyperparameters, and relying on the selection and assessment loop in the TrainMLModel instruction to select the optimal model.
2. Passing a range of different hyperparameters to SVC, and using a third layer of nested cross-validation to find the optimal hyperparameters through grid search. In this case, only the SVC model with the optimal hyperparameter settings is further used in the inner selection loop of the TrainMLModel instruction.
By default, mode 1 is used. In order to use mode 2, model_selection_cv and model_selection_n_folds must be set.
Specification arguments:
SVC (dict): Under this key, hyperparameters can be specified that will be passed to the scikit-learn class. Any scikit-learn hyperparameters can be specified here. In mode 1, a single value must be specified for each of the scikit-learn hyperparameters. In mode 2, it is possible to specify a range of different hyperparameters values in a list. It is also allowed to mix lists and single values in mode 2, in which case the grid search will only be done for the lists, while the single-value hyperparameters will be fixed. In addition to the scikit-learn hyperparameters, parameter show_warnings (True/False) can be specified here. This determines whether scikit-learn warnings, such as convergence warnings, should be printed. By default show_warnings is True.
model_selection_cv (bool): If any of the hyperparameters under SVC is a list and model_selection_cv is True, a grid search will be done over the given hyperparameters, using the number of folds specified in model_selection_n_folds. By default, model_selection_cv is False.
model_selection_n_folds (int): The number of folds that should be used for the cross validation grid search if model_selection_cv is True.
YAML specification:
definitions:
    ml_methods:
        my_svc: # user-defined method name
            SVC: # name of the ML method
                # sklearn parameters (same names as in original sklearn class)
                C: [0.01, 0.1, 1, 10, 100] # find the optimal value for C
                # Additional parameter that determines whether to print convergence warnings
                show_warnings: True
            # if any of the parameters under SVC is a list and model_selection_cv is True,
            # a grid search will be done over the given parameters, using the number of folds specified in model_selection_n_folds,
            # and the optimal model will be selected
            model_selection_cv: True
            model_selection_n_folds: 5
        # alternative way to define ML method with default values:
        my_default_svc: SVC
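To make the mode 2 behaviour concrete, the sketch below expands a hyperparameter dictionary that mixes lists and single values into the individual settings a grid search would visit. It is an illustration using only the Python standard library, not the code immuneML or scikit-learn run internally; expand_grid is a hypothetical helper.

```python
from itertools import product


def expand_grid(params):
    """Return one dict per grid point; only list-valued entries are varied."""
    list_keys = [key for key, value in params.items() if isinstance(value, list)]
    fixed = {key: value for key, value in params.items() if not isinstance(value, list)}
    combos = product(*(params[key] for key in list_keys))
    return [{**fixed, **dict(zip(list_keys, combo))} for combo in combos]


# C is searched over five values, penalty stays fixed at a single value
grid = expand_grid({"C": [0.01, 0.1, 1, 10, 100], "penalty": "l2"})
for setting in grid:
    print(setting)
```

When no entry is a list, the grid contains exactly one setting, which corresponds to mode 1.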
SVM¶
This is a wrapper of scikit-learn’s SVC class. Please see the scikit-learn documentation of SVC for the parameters.
Note: if you are interested in plotting the coefficients of the SVM model, consider running the Coefficients report.
Scikit-learn models can be trained in two modes:
1. Creating a model using a given set of hyperparameters, and relying on the selection and assessment loop in the TrainMLModel instruction to select the optimal model.
2. Passing a range of different hyperparameters to SVM, and using a third layer of nested cross-validation to find the optimal hyperparameters through grid search. In this case, only the SVM model with the optimal hyperparameter settings is further used in the inner selection loop of the TrainMLModel instruction.
By default, mode 1 is used. In order to use mode 2, model_selection_cv and model_selection_n_folds must be set.
Specification arguments:
SVM (dict): Under this key, hyperparameters can be specified that will be passed to the scikit-learn class. Any scikit-learn hyperparameters can be specified here. In mode 1, a single value must be specified for each of the scikit-learn hyperparameters. In mode 2, it is possible to specify a range of different hyperparameters values in a list. It is also allowed to mix lists and single values in mode 2, in which case the grid search will only be done for the lists, while the single-value hyperparameters will be fixed. In addition to the scikit-learn hyperparameters, parameter show_warnings (True/False) can be specified here. This determines whether scikit-learn warnings, such as convergence warnings, should be printed. By default show_warnings is True.
model_selection_cv (bool): If any of the hyperparameters under SVM is a list and model_selection_cv is True, a grid search will be done over the given hyperparameters, using the number of folds specified in model_selection_n_folds. By default, model_selection_cv is False.
model_selection_n_folds (int): The number of folds that should be used for the cross validation grid search if model_selection_cv is True.
YAML specification:
definitions:
    ml_methods:
        my_svm: # user-defined method name
            SVM: # name of the ML method
                # sklearn parameters (same names as in original sklearn class)
                C: [0.01, 0.1, 1, 10, 100] # find the optimal value for C
                kernel: linear
                # Additional parameter that determines whether to print convergence warnings
                show_warnings: True
            # if any of the parameters under SVM is a list and model_selection_cv is True,
            # a grid search will be done over the given parameters, using the number of folds specified in model_selection_n_folds,
            # and the optimal model will be selected
            model_selection_cv: True
            model_selection_n_folds: 5
        # alternative way to define ML method with default values:
        my_default_svm: SVM
TCRdistClassifier¶
Implementation of a nearest neighbors classifier based on TCR distances as presented in Dash P, Fiore-Gartland AJ, Hertz T, et al. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature. 2017; 547(7661):89-93. doi:10.1038/nature22383.
This method is implemented using scikit-learn’s KNeighborsClassifier with k determined at runtime from the training dataset size and weights linearly scaled to decrease with the distance of examples.
Specification arguments:
percentage (float): percentage of nearest neighbors to consider when determining receptor specificity based on known receptors (between 0 and 1)
show_warnings (bool): whether to show warnings generated by scikit-learn, by default this is True.
YAML specification:
definitions:
    ml_methods:
        my_tcr_method:
            TCRdistClassifier:
                percentage: 0.1
                show_warnings: True
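The idea of a nearest neighbors classifier whose k depends on the training set size and whose votes shrink linearly with distance can be sketched in plain Python. The exact rule for deriving k from percentage and the exact weight formula are assumptions made for illustration; they are not taken from the immuneML or scikit-learn source.

```python
from collections import defaultdict


def predict_label(distances, labels, percentage):
    """Distance-weighted vote over the nearest training receptors.

    distances[i] is the distance from the query to training receptor i.
    Assumption: k = percentage * training set size, with at least 1 neighbor.
    Assumption: weight = 1 - d / (largest distance among the k neighbors).
    """
    n_train = len(distances)
    k = max(1, int(percentage * n_train))
    nearest = sorted(range(n_train), key=lambda i: distances[i])[:k]
    max_distance = max(distances[i] for i in nearest)
    votes = defaultdict(float)
    for i in nearest:
        weight = 1.0 - distances[i] / max_distance if max_distance > 0 else 1.0
        votes[labels[i]] += weight
    return max(votes, key=votes.get)


prediction = predict_label(
    distances=[1.0, 2.0, 5.0, 9.0],
    labels=["binder", "non_binder", "non_binder", "non_binder"],
    percentage=0.5,
)
print(prediction)
```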
Clustering methods¶
Clustering methods are algorithms which can be used to cluster repertoires, receptors or sequences without using external label information (such as disease or antigen binding state).
These methods can be used in the Clustering instruction.
KMeans¶
k-means clustering method which wraps scikit-learn’s KMeans. Input arguments for the method are the same as supported by scikit-learn (see KMeans scikit-learn documentation for details).
YAML specification:
definitions:
    ml_methods:
        my_kmeans:
            KMeans:
                # arguments as defined by scikit-learn
                n_clusters: 2
Generative models¶
Generative models are algorithms which can be trained to learn patterns in existing datasets, and then be used to generate new synthetic datasets.
These methods can be used in the TrainGenModel instruction, and previously trained models can be used to generate data using the ApplyGenModel instruction.
ExperimentalImport¶
Imports existing experimental data so that annotations and simulations can be run on top of it. This model should be used only for LIgO simulation and not with the TrainGenModel instruction.
YAML specification:
definitions:
    ml_methods:
        generative_model:
            type: ExperimentalImport
            import_format: AIRR
            tmp_import_path: ./tmp/
            import_params:
                path: path/to/files/
                region_type: IMGT_CDR3 # what part of the sequence to import
                column_mapping: # column mapping AIRR: immuneML
                    junction: sequence
                    junction_aa: sequence_aa
                    locus: chain
OLGA¶
This is a wrapper for the OLGA package as described by Sethna et al. 2019 (OLGA package on PyPI or GitHub: https://github.com/statbiophys/OLGA ). This model should be used only for LIgO simulation and is not yet supported for use with TrainGenModel instruction.
Reference:
Zachary Sethna, Yuval Elhanati, Curtis G Callan, Jr, Aleksandra M Walczak, Thierry Mora, OLGA: fast computation of generation probabilities of B- and T-cell receptor amino acid sequences and motifs, Bioinformatics, Volume 35, Issue 17, 1 September 2019, Pages 2974–2981, https://doi.org/10.1093/bioinformatics/btz035
Note:
OLGA generates sequences that correspond to the IMGT junction, and they are used for matching as such. See https://github.com/statbiophys/OLGA for more details.
Gene names are as provided in OLGA (either in default models or in the user-specified model files). For simulation, one should use gene names in the same format.
Note
While this is a generative model, in the current version of immuneML it cannot be used in combination with TrainGenModel or ApplyGenModel instruction. If you want to use OLGA for sequence simulation, see Dataset simulation with LIgO.
Specification arguments:
model_path (str): if not default model, this parameter should point to a folder where the four OLGA/IGOR format files are stored (could also be inferred from some experimental data)
default_model_name (str): if not using custom models, one of the OLGA default models can be specified here; the value should be the same as it would be passed to the command line in OLGA, e.g., humanTRB, humanIGH
YAML specification:
definitions:
    ml_methods:
        generative_model:
            type: OLGA
            model_path: None
            default_model_name: humanTRB
PWM¶
This is a baseline implementation of a positional weight matrix. It is estimated from a set of sequences for each of the different lengths that appear in the dataset.
Specification arguments:
locus (str): which chain is generated (for now, it is only assigned to the generated sequences)
sequence_type (str): amino_acid or nucleotide
region_type (str): which region type to use (e.g., IMGT_CDR3), this is only assigned to the generated sequences
YAML specification:
definitions:
    ml_methods:
        my_pwm:
            PWM:
                locus: beta
                sequence_type: amino_acid
                region_type: IMGT_CDR3
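A positional weight matrix estimated separately for each sequence length can be sketched as below. This is a minimal standard-library illustration of the technique, not the immuneML implementation; fit_pwms and sample_sequence are hypothetical names, and no pseudocounts or smoothing are applied.

```python
import random
from collections import Counter, defaultdict


def fit_pwms(sequences):
    """Estimate one PWM per sequence length: a list of per-position frequency dicts."""
    by_length = defaultdict(list)
    for sequence in sequences:
        by_length[len(sequence)].append(sequence)
    pwms = {}
    for length, group in by_length.items():
        pwms[length] = [
            {aa: count / len(group) for aa, count in Counter(s[pos] for s in group).items()}
            for pos in range(length)
        ]
    return pwms


def sample_sequence(pwm, rng):
    """Draw one letter per position according to the estimated frequencies."""
    return "".join(
        rng.choices(list(column.keys()), weights=list(column.values()))[0]
        for column in pwm
    )


pwms = fit_pwms(["CASSL", "CASSF", "CATSL", "CAS"])
rng = random.Random(0)
generated = sample_sequence(pwms[5], rng)
print(sorted(pwms), generated)
```

Each position is sampled independently, which is why a PWM serves as a baseline: it cannot capture dependencies between positions.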
SimpleLSTM¶
This is a simple generative model for receptor sequences based on LSTM.
Similar models have been proposed in:
Akbar, R. et al. (2022). In silico proof of principle of machine learning-based antibody design at unconstrained scale. mAbs, 14(1), 2031482. https://doi.org/10.1080/19420862.2022.2031482
Saka, K. et al. (2021). Antibody design using LSTM based deep generative model from phage display library for affinity maturation. Scientific Reports, 11(1), Article 1. https://doi.org/10.1038/s41598-021-85274-7
Specification arguments:
sequence_type (str): whether the model should work on amino_acid or nucleotide level
hidden_size (int): how many LSTM cells should exist per layer
num_layers (int): how many hidden LSTM layers should there be
num_epochs (int): for how many epochs to train the model
learning_rate (float): what learning rate to use for optimization
batch_size (int): how many examples (sequences) to use for training for one batch
embed_size (int): the dimension of the sequence embedding
temperature (float): a higher temperature leads to faster yet more unstable learning
YAML specification:
definitions:
    ml_methods:
        my_simple_lstm:
            SimpleLSTM:
                sequence_type: amino_acid
                hidden_size: 50
                num_layers: 1
                num_epochs: 5000
                learning_rate: 0.001
                batch_size: 100
                embed_size: 100
SimpleVAE¶
SimpleVAE is a generative model on sequence level that relies on variational autoencoder. This type of model was proposed by Davidsen et al. 2019, and this implementation is inspired by their original implementation available at https://github.com/matsengrp/vampire.
References:
Davidsen, K., Olson, B. J., DeWitt, W. S., III, Feng, J., Harkins, E., Bradley, P., & Matsen, F. A., IV. (2019). Deep generative models for T cell receptor protein sequences. eLife, 8, e46935. https://doi.org/10.7554/eLife.46935
Specification arguments:
locus (str): which locus the sequences come from, e.g., TRB
beta (float): VAE hyperparameter that balances the reconstruction loss and latent dimension regularization
latent_dim (int): latent dimension of the VAE
linear_nodes_count (int): in linear layers, how many nodes to use
num_epochs (int): how many epochs to use for training
batch_size (int): how many examples to consider at the same time
j_gene_embed_dim (int): dimension of J gene embedding
v_gene_embed_dim (int): dimension of V gene embedding
cdr3_embed_dim (int): dimension of the cdr3 embedding
pretrains (int): how many times to attempt pretraining to initialize the weights and use warm-up for the beta hyperparameter before the main training process
warmup_epochs (int): how many epochs to use for training where beta hyperparameter is linearly increased from 0 up to its max value; this is in addition to num_epochs set above
patience (int): number of epochs to wait before the training is stopped when the loss is not improving
iter_count_prob_estimation (int): how many iterations to use to estimate the log probability of the generated sequence (the more iterations, the better the estimated log probability)
vocab (list): which letters (amino acids) are allowed - this is automatically filled for new models (no need to set)
max_cdr3_len (int): what is the maximum cdr3 length - this is automatically filled for new models (no need to set)
unique_v_genes (list): list of allowed V genes (this will be automatically filled from the dataset if not provided here manually)
unique_j_genes (list): list of allowed J genes (this will be automatically filled from the dataset if not provided here manually)
device (str): name of the device where to train the model (e.g., cpu)
YAML specification:
definitions:
    ml_methods:
        my_vae:
            SimpleVAE:
                locus: beta
                beta: 0.75
                latent_dim: 20
                linear_nodes_count: 75
                num_epochs: 5000
                batch_size: 10000
                j_gene_embed_dim: 13
                v_gene_embed_dim: 30
                cdr3_embed_dim: 21
                pretrains: 10
                warmup_epochs: 20
                patience: 20
                device: cpu
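The linear warm-up of the beta hyperparameter described under warmup_epochs can be sketched as a simple schedule. This assumes epochs are counted from 0 and that beta reaches its maximum exactly at warmup_epochs; the function beta_schedule is hypothetical and is not the immuneML implementation.

```python
def beta_schedule(epoch, warmup_epochs, beta_max):
    """Linearly increase beta from 0 to beta_max over warmup_epochs, then hold it."""
    if warmup_epochs <= 0:
        return beta_max
    return beta_max * min(1.0, epoch / warmup_epochs)


# values from the YAML specification above: beta 0.75, warmup_epochs 20
for epoch in (0, 10, 20, 30):
    print(epoch, beta_schedule(epoch, warmup_epochs=20, beta_max=0.75))
```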
SoNNia¶
SoNNia models the selection process of T and B cell receptor repertoires. It is based on the SoNNia Python package. It supports SequenceDataset as input, but not RepertoireDataset.
Original publication: Isacchini, G., Walczak, A. M., Mora, T., & Nourmohammad, A. (2021). Deep generative selection models of T and B cell receptor repertoires with soNNia. Proceedings of the National Academy of Sciences, 118(14), e2023141118. https://doi.org/10.1073/pnas.2023141118
Specification arguments:
locus (str)
batch_size (int)
epochs (int)
deep (bool)
include_joint_genes (bool)
n_gen_seqs (int)
custom_model_path (str)
default_model_name (str)
YAML specification:
definitions:
    ml_methods:
        my_sonnia_model:
            SoNNia:
                ...
Dimensionality reduction methods¶
Dimensionality reduction methods are algorithms which can be used to reduce the dimensionality of encoded datasets, in order to uncover and analyze patterns present in the data.
These methods can be used in the ExploratoryAnalysis and Clustering instructions.
PCA¶
Principal component analysis (PCA) method which wraps scikit-learn’s PCA. Input arguments for the method are the same as supported by scikit-learn (see PCA scikit-learn documentation for details).
YAML specification:
definitions:
    ml_methods:
        my_pca:
            PCA:
                # arguments as defined by scikit-learn
                n_components: 2
TSNE¶
t-distributed Stochastic Neighbor Embedding (t-SNE) method which wraps scikit-learn’s TSNE. It can be useful for visualizing high-dimensional data. Input arguments for the method are the same as supported by scikit-learn (see TSNE scikit-learn documentation for details).
YAML specification:
definitions:
    ml_methods:
        my_tsne:
            TSNE:
                # arguments as defined by scikit-learn
                n_components: 2
                init: pca
UMAP¶
Uniform manifold approximation and projection (UMAP) method which wraps umap-learn’s UMAP. Input arguments for the method are the same as supported by umap-learn (see UMAP in the umap-learn documentation for details).
Note that when providing the arguments for UMAP in the immuneML specification, it is not possible to set functions as input values (e.g., the metric parameter has to be one of the predefined metrics available in umap-learn).
YAML specification:
definitions:
    ml_methods:
        my_umap:
            UMAP:
                # arguments as defined by umap-learn
                n_components: 2
                n_neighbors: 15
                metric: euclidean
Reports¶
Under the definitions/reports component, the user can specify reports which visualise or summarise different properties of the dataset or analysis.
Reports have been divided into different types. Different types of reports can be specified depending on which instruction is run. Click on the name of the report type to see more details.
Data reports show some type of features or statistics about a given dataset.
Encoding reports show some type of features or statistics about an encoded dataset, or may export relevant sequences or tables.
ML model reports show some type of features or statistics about a single trained ML model (e.g., model coefficients).
Train ML model reports plot general statistics or export data of multiple models simultaneously when running the TrainMLModel instruction (e.g., performance comparison between models).
Multi dataset reports are special reports that can be specified when running immuneML with the MultiDatasetBenchmarkTool. See Manuscript use case 1: Robustness assessment for an example.
Data reports¶
Data reports show some type of features or statistics about a given dataset.
When running the TrainMLModel instruction, data reports can be specified inside the ‘selection’ or ‘assessment’ specification under the keys ‘reports/data’ (current cross-validation split) or ‘reports/data_splits’ (train/test sub-splits). Example:
my_instruction:
    type: TrainMLModel
    selection:
        reports:
            data:
                - my_data_report
        # other parameters...
    assessment:
        reports:
            data:
                - my_data_report
        # other parameters...
    # other parameters...
Alternatively, when running the ExploratoryAnalysis instruction, data reports can be specified under ‘report’. Example:
my_instruction:
    type: ExploratoryAnalysis
    analyses:
        my_first_analysis:
            report: my_data_report
            # other parameters...
    # other parameters...
AminoAcidFrequencyDistribution¶
Generates a barplot showing the relative frequency of each amino acid at each position in the sequences of a dataset.
Specification arguments:
alignment (str): Alignment style for aligning sequences of different lengths. Options are as follows:
CENTER: center-align sequences of different lengths. The middle amino acid of any sequence will be labelled position 0. By default, alignment is CENTER.
LEFT: left-align sequences of different lengths, starting at 0.
RIGHT: right-align sequences of different lengths, ending at 0 (counting towards negative numbers).
IMGT: align sequences based on their IMGT positional numbering, considering the sequence region_type (IMGT_CDR3 or IMGT_JUNCTION). The main difference between CENTER and IMGT is that IMGT aligns the first and last amino acids, adding gaps in the middle, whereas CENTER aligns the middle of the sequences, padding with gaps at the start and end of the sequence. When region_type is IMGT_JUNCTION, the IMGT positions run from 104 (conserved C) to 118 (conserved W/F). When IMGT_CDR3 is used, these positions are 105 to 117. For long CDR3 sequences, additional numbers are added in between IMGT positions 111 and 112. See the official IMGT documentation for more details: https://www.imgt.org/IMGTScientificChart/Numbering/CDR3-IMGTgaps.html
relative_frequency (bool): Whether to plot relative frequencies (true) or absolute counts (false) of the positional amino acids. Note that when sequences are of different length, setting relative_frequency to True will produce different results depending on the alignment type, as some positions are only covered by the longest sequences. By default, relative_frequency is False.
split_by_label (bool): Whether to split the plots by a label. If set to true, the Dataset must either contain a single label, or alternatively the label of interest can be specified under ‘label’. If split_by_label is set to true, the percentage-wise frequency difference between classes is plotted additionally. By default, split_by_label is False.
label (str): if split_by_label is set to True, a label can be specified here.
region_type (str): which part of the sequence to check; e.g., IMGT_CDR3
YAML specification:
definitions:
    reports:
        my_aa_freq_report:
            AminoAcidFrequencyDistribution:
                relative_frequency: False
                split_by_label: True
                label: CMV
                region_type: IMGT_CDR3
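The position labels produced by the LEFT, RIGHT and CENTER alignment styles can be sketched as follows. This is an illustration only: the handling of even-length sequences under CENTER is an assumption (the text above only defines the middle amino acid for odd lengths), IMGT alignment is not covered, and position_labels is a hypothetical helper.

```python
def position_labels(length, alignment):
    """Return the position label of every amino acid in a sequence of the given length."""
    if alignment == "LEFT":
        return list(range(length))          # 0, 1, 2, ...
    if alignment == "RIGHT":
        return list(range(-length + 1, 1))  # ..., -2, -1, 0
    if alignment == "CENTER":
        middle = length // 2                # assumption: for even lengths, the right-of-centre residue is 0
        return [index - middle for index in range(length)]
    raise ValueError(f"unsupported alignment: {alignment}")


for style in ("LEFT", "RIGHT", "CENTER"):
    print(style, position_labels(5, style))
```

Because only the longest sequences reach the outermost positions, relative frequencies at those positions are computed over fewer sequences, which is why the choice of alignment changes the result.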
GLIPH2Exporter¶
Report which exports the receptor data to GLIPH2 format so that it can be directly used in the GLIPH2 tool. Currently, the report accepts only receptor datasets.
GLIPH2 publication: Huang H, Wang C, Rubelt F, Scriba TJ, Davis MM. Analyzing the Mycobacterium tuberculosis immune response by T-cell receptor clustering with GLIPH2 and genome-wide antigen screening. Nature Biotechnology. Published online April 27, 2020:1-9. doi:10.1038/s41587-020-0505-4
Specification arguments:
condition (str): name of the parameter present in the receptor metadata in the dataset; condition can be anything which can be processed in GLIPH2, such as tissue type or treatment.
YAML specification:
definitions:
    reports:
        my_gliph2_exporter:
            GLIPH2Exporter:
                condition: epitope # for instance, epitope parameter is present in receptors' metadata with values such as "MtbLys" for Mycobacterium tuberculosis (as shown in the original paper).
MotifGeneralizationAnalysis¶
This report splits the given dataset into a training and validation set, identifies significant motifs using the MotifEncoder on the training set and plots the precision/recall and precision/true positive predictions of motifs on both the training and validation sets. This can be used to:
determine the optimal recall cutoff for motifs of a given size
investigate how well motifs learned on a training set generalize to a test set
After running this report and determining the optimal recall cutoffs, the report MotifTestSetPerformance can be run to plot the performance on an independent test set.
Note: the MotifEncoder (and thus this report) can only be used for sequences of the same length.
Specification arguments:
label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.
training_set_identifier_path (str): Path to a file containing ‘sequence_identifiers’ of the sequences used for the training set. This file should have a single column named ‘example_id’ and have one sequence identifier per line. If training_set_identifier_path is not set, a random subset of the data (according to training_percentage) will be assigned to be the training set.
training_percentage (float): If training_set_identifier_path is not set, this value is used to specify the fraction of sequences that will be randomly assigned to form the training set. Should be a value between 0 and 1. By default, training_percentage is 0.7.
random_seed (int): Random seed for splitting the data into training and validation sets when a training_set_identifier_path is not provided.
split_by_motif_size (bool): Whether to split the analysis per motif size. If true, a recall threshold is learned for each motif size, and figures are generated for each motif size independently. By default, split_by_motif_size is true.
min_precision (float): MotifEncoder parameter. The minimum precision threshold for keeping a motif on the training set. By default, min_precision is 0.9.
test_precision_threshold (float): The desired precision on the test set, given that motifs are learned by using a training set with a precision threshold of min_precision. It is recommended for test_precision_threshold to be lower than min_precision, e.g., min_precision - 0.1. By default, test_precision_threshold is 0.8.
min_recall (float): MotifEncoder parameter. The minimum recall threshold for keeping a motif. Any learned recall threshold will be at least as high as the set min_recall value. The default value for min_recall is 0.
min_true_positives (int): MotifEncoder parameter. The minimum number of true positive training sequences that a motif needs to occur in. The default value for min_true_positives is 1.
max_positions (int): MotifEncoder parameter. The maximum motif size. This is the number of positional amino acids the motif consists of (excluding gaps). The default value for max_positions is 4.
min_positions (int): MotifEncoder parameter. The minimum motif size (see also: max_positions). The default value for min_positions is 1.
no_gaps (bool): MotifEncoder parameter. Must be set to True if only contiguous motifs (position-specific k-mers) are allowed. By default, no_gaps is False, meaning both gapped and ungapped motifs are searched for.
smoothen_combined_precision (bool): Whether to add a smoothed line representing the combined precision to the precision-vs-TP plot. When set to True, this may take considerable extra time to compute. By default, smoothen_combined_precision is set to True.
min_points_in_window (int): Parameter for smoothing the combined_precision line in the precision-vs-TP plot through lognormal kernel smoothing with adaptive window size. This parameter determines the minimum number of points that need to be present in a window to determine the adaptive window size. By default, min_points_in_window is 50.
smoothing_constant1: Parameter for smoothing the combined_precision line in the precision-vs-TP plot through lognormal kernel smoothing with adaptive window size. This smoothing constant determines the dependence of the smoothness on the window size. Increasing this increases smoothness for regions where few points are present. By default, smoothing_constant1 is 5.
smoothing_constant2: Parameter for smoothing the combined_precision line in the precision-vs-TP plot through lognormal kernel smoothing with adaptive window size. This smoothing constant can be used to scale the overall kernel width, thus influencing the smoothness of all regions regardless of data density. By default, smoothing_constant2 is 10.
training_set_name (str): Name of the training set to be used in figures. By default, the training_set_name is ‘training set’.
test_set_name (str): Name of the test set to be used in figures. By default, the test_set_name is ‘test set’.
highlight_motifs_path (str): Path to a set of motifs of interest to highlight in the output figures (such as implanted ground-truth motifs). By default, no motifs are highlighted.
highlight_motifs_name (str): If highlight_motifs_path is defined, this name will be used to label the motifs of interest in the output figures.
YAML specification:
definitions:
    reports:
        my_motif_generalization:
            MotifGeneralizationAnalysis:
                min_precision: 0.9
                min_recall: 0.1
                label: # Define a label, and the positive class for that given label
                    CMV:
                        positive_class: +
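The precision and recall of a single positional motif, the quantities that min_precision and min_recall are compared against, can be sketched as below. The representation of a motif as a dictionary from 0-based position to amino acid is an assumption made for this example and not the MotifEncoder's internal format; both function names are hypothetical.

```python
def motif_matches(sequence, motif):
    """True if the sequence carries the required amino acid at every motif position."""
    return all(
        position < len(sequence) and sequence[position] == amino_acid
        for position, amino_acid in motif.items()
    )


def precision_recall(sequences, is_positive, motif):
    """Return (precision, recall, true positives) of a motif on labelled sequences."""
    tp = fp = fn = 0
    for sequence, positive in zip(sequences, is_positive):
        hit = motif_matches(sequence, motif)
        if hit and positive:
            tp += 1
        elif hit:
            fp += 1
        elif positive:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, tp


sequences = ["CASSL", "CATSL", "CASSF", "CGSSL"]
is_positive = [True, True, False, False]
# gapped motif: S at position 2 and L at position 4
result = precision_recall(sequences, is_positive, {2: "S", 4: "L"})
print(result)
```

Computing the same quantities on the validation set shows how far precision drops for unseen sequences, which is the reason test_precision_threshold is recommended to be lower than min_precision.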
ReceptorDatasetOverview¶
This report plots the length distribution per chain for a receptor (paired-chain) dataset.
Specification arguments:
batch_size (int): how many receptors to load at once; 50 000 by default
YAML specification:
definitions:
    reports:
        my_receptor_overview_report: ReceptorDatasetOverview
RecoveredSignificantFeatures¶
Compares a given collection of ground truth implanted signals (sequences or k-mers) to the significant label-associated k-mers or sequences according to Fisher’s exact test.
Internally uses the KmerAbundanceEncoder for calculating significant k-mers, and SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder to calculate significant full sequences (depending on whether the argument compairr_path was set).
This report creates two plots:
the first plot is a bar chart showing what percentage of the ground truth implanted signals were found to be significant.
the second plot is a bar chart showing what percentage of the k-mers/sequences found to be significant match the ground truth implanted signals.
To compare k-mers or sequences of differing lengths, the ground truth sequences or long k-mers are split into k-mers of the given size through a sliding window approach. When comparing ‘full_sequences’ to ground truth sequences, a match is only registered if both sequences are of equal length.
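The sliding window comparison and the two percentages shown in the plots can be sketched as follows. This is a minimal illustration with made-up inputs, not the report's implementation; kmers and recovery are hypothetical names, and the set of significant k-mers is assumed to be given.

```python
def kmers(sequence, k):
    """All overlapping k-mers of a sequence (empty if the sequence is shorter than k)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def recovery(ground_truth, significant, k):
    """Return (fraction of ground truth k-mers found significant,
    fraction of significant k-mers that match the ground truth)."""
    truth_kmers = set().union(*(kmers(sequence, k) for sequence in ground_truth))
    found = truth_kmers & set(significant)
    truth_found = len(found) / len(truth_kmers) if truth_kmers else 0.0
    significant_true = len(found) / len(significant) if significant else 0.0
    return truth_found, significant_true


# hypothetical implanted signal and hypothetical significant 3-mers
fractions = recovery(["CASSL"], {"ASS", "SSL", "QQQ", "GGG"}, k=3)
print(fractions)
```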
Specification arguments:
ground_truth_sequences_path (str): Path to a file containing the true implanted (sub)sequences, e.g., full sequences or k-mers. The file should contain one sequence per line, without a header, and without V or J genes.
sequence_type (str): either amino acid or nucleotide; which type of sequence to use for the analysis
region_type (str): which AIRR field to use for comparison, e.g. IMGT_CDR3 or IMGT_JUNCTION
p_values (list): The p value thresholds to be used by Fisher’s exact test. Each p-value specified here will become one panel in the output figure.
k_values (list): Length of the k-mers (number of amino acids) created by the KmerAbundanceEncoder. When using a full sequence encoding (SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder), specify ‘full_sequence’ here. Each value specified under k_values will represent one bar in the output figure.
label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.
compairr_path (str): If ‘full_sequence’ is listed under k_values, the path to the CompAIRR executable may be provided. If the compairr_path is specified, the CompAIRRSequenceAbundanceEncoder will be used to compute the significant sequences. If the path is not specified and ‘full_sequence’ is listed under k_values, SequenceAbundanceEncoder will be used.
YAML specification:
definitions:
    reports:
        my_recovered_significant_features_report:
            RecoveredSignificantFeatures:
                groundtruth_sequences_path: path/to/groundtruth/sequences.txt
                trim_leading_trailing: False
                p_values:
                    - 0.1
                    - 0.01
                    - 0.001
                    - 0.0001
                k_values:
                    - 3
                    - 4
                    - 5
                    - full_sequence
                compairr_path: path/to/compairr # can be specified if 'full_sequence' is listed under k_values
                label: # Define a label, and the positive class for that given label
                    CMV:
                        positive_class: +
RepertoireClonotypeSummary¶
Shows the number of distinct clonotypes per repertoire in a given dataset as a bar plot.
Specification arguments:
color_by_label (str): name of the label to use to color the plot, e.g., could be disease label, or None
YAML specification:
definitions:
    reports:
        my_clonotype_summary_rep:
            RepertoireClonotypeSummary:
                color_by_label: celiac
SequenceCountDistribution¶
Generates a histogram of the duplicate counts of the sequences in a dataset.
Specification arguments:
split_by_label (bool): Whether to split the plots by a label. If set to true, the Dataset must either contain a single label, or alternatively the label of interest can be specified under ‘label’. By default, split_by_label is False.
label (str): Optional label for separating the results by color/creating separate plots. Note that this should be the name of a valid dataset label.
YAML specification:
definitions:
    reports:
        my_sld_report:
            SequenceCountDistribution:
                label: disease
SequenceLengthDistribution¶
Generates a histogram of the lengths of the sequences in a dataset.
Specification arguments:
sequence_type (str): whether to check the length of amino acid or nucleotide sequences; default value is ‘amino_acid’
region_type (str): which part of the sequence to examine; e.g., IMGT_CDR3
YAML specification:
definitions:
    reports:
        my_sld_report:
            SequenceLengthDistribution:
                sequence_type: amino_acid
                region_type: IMGT_CDR3
SequencesWithSignificantKmers¶
Given a list of reference sequences, this report writes out the subsets of reference sequences containing significant k-mers (as computed by the KmerAbundanceEncoder using Fisher’s exact test).
For each combination of p-value and k-mer size given, a file is written containing all sequences containing a significant k-mer of the given size at the given p-value.
Specification arguments:
reference_sequences_path (str): Path to a file containing the reference sequences. The file should contain one sequence per line, without a header, and without V or J genes.
p_values (list): The p value thresholds to be used by Fisher’s exact test. Each p-value specified here will become one panel in the output figure.
k_values (list): Length of the k-mers (number of amino acids) created by the KmerAbundanceEncoder. Each k-mer length will become one panel in the output figure.
label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.
YAML specification:
definitions:
    reports:
        my_sequences_with_significant_kmers:
            SequencesWithSignificantKmers:
                reference_sequences_path: path/to/reference/sequences.txt
                p_values:
                    - 0.1
                    - 0.01
                    - 0.001
                    - 0.0001
                k_values:
                    - 3
                    - 4
                    - 5
                label: # Define a label, and the positive class for that given label
                    CMV:
                        positive_class: +
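The selection step described above can be sketched in a few lines of Python. This is only an illustration, not the report's implementation: the function name `sequences_with_kmers` and the example data are hypothetical, and the set of significant k-mers is assumed to have been computed already.

```python
def sequences_with_kmers(reference_sequences, significant_kmers):
    """Return the reference sequences that contain at least one significant k-mer."""
    return [
        sequence
        for sequence in reference_sequences
        if any(kmer in sequence for kmer in significant_kmers)
    ]


# Hypothetical reference sequences and significant 3-mers.
reference = ["CASSLGTDTQYF", "CASSPGQGYEQYF", "CAWSVGNEQFF"]
selected = sequences_with_kmers(reference, {"LGT", "GNE"})
```

In the report, one such subset would be written per combination of p-value and k-mer size.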
SignificantFeatures¶
Plots a boxplot of the number of significant features (label-associated k-mers or sequences) per Repertoire according to Fisher’s exact test, across different classes for the given label.
Internally uses the KmerAbundanceEncoder for calculating significant k-mers, and SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder to calculate significant full sequences (depending on whether the argument compairr_path was set).
Specification arguments:
p_values (list): The p value thresholds to be used by Fisher’s exact test. Each p-value specified here will become one panel in the output figure.
k_values (list): Length of the k-mers (number of amino acids) created by the KmerAbundanceEncoder. When using a full sequence encoding (SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder), specify ‘full_sequence’ here. Each value specified under k_values will represent one boxplot in the output figure.
label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.
compairr_path (str): If ‘full_sequence’ is listed under k_values, the path to the CompAIRR executable may be provided. If the compairr_path is specified, the CompAIRRSequenceAbundanceEncoder will be used to compute the significant sequences. If the path is not specified and ‘full_sequence’ is listed under k_values, SequenceAbundanceEncoder will be used.
log_scale (bool): Whether to plot the y axis in log10 scale (log_scale = True) or continuous scale (log_scale = False). By default, log_scale is False.
YAML specification:
definitions:
    reports:
        my_significant_features_report:
            SignificantFeatures:
                p_values:
                    - 0.1
                    - 0.01
                    - 0.001
                    - 0.0001
                k_values:
                    - 3
                    - 4
                    - 5
                    - full_sequence
                compairr_path: path/to/compairr # can be specified if 'full_sequence' is listed under k_values
                label: # Define a label, and the positive class for that given label
                    CMV:
                        positive_class: +
                log_scale: False
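To make the role of the p-value thresholds concrete, the sketch below counts how many features pass each threshold under Fisher’s exact test. It is not the encoder’s code: it assumes a one-sided test on a 2x2 table (feature present/absent versus positive/negative class) and a "less than or equal to" comparison against the threshold, and the function name and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    a: positive-class examples with the feature, b: negative-class examples with it,
    c: positive-class examples without it, d: negative-class examples without it.
    Returns P(X >= a) under the hypergeometric null distribution.
    """
    with_feature, positives, total = a + b, a + c, a + b + c + d
    upper = min(with_feature, positives)
    numerator = sum(
        comb(with_feature, x) * comb(total - with_feature, positives - x)
        for x in range(a, upper + 1)
    )
    return numerator / comb(total, positives)


# Hypothetical contingency tables for two k-mers across six repertoires.
tables = {"AAA": (3, 0, 0, 3), "CAS": (2, 1, 1, 2)}
p_value_thresholds = [0.1, 0.01]
significant_counts = {
    threshold: sum(fisher_exact_greater(*table) <= threshold for table in tables.values())
    for threshold in p_value_thresholds
}
```

Here ‘AAA’ has p = 0.05 and ‘CAS’ has p = 0.5, so one feature is significant at 0.1 and none at 0.01.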
SignificantKmerPositions¶
Plots the number of significant k-mers (as computed by the KmerAbundanceEncoder using Fisher’s exact test) observed at each IMGT position of a given list of reference sequences.
This report creates a stacked bar chart, where each bar represents an IMGT position, and each segment of the stack represents the observed frequency of one ‘significant’ k-mer at that position.
Specification arguments:
reference_sequences_path (str): Path to a file containing the reference sequences. The file should contain one sequence per line, without a header, and without V or J genes.
p_values (list): The p value thresholds to be used by Fisher’s exact test. Each p-value specified here will become one panel in the output figure.
k_values (list): Length of the k-mers (number of amino acids) created by the KmerAbundanceEncoder. Each k-mer length will become one panel in the output figure.
label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.
sequence_type (str): nucleotide or amino_acid
region_type (str): which AIRR field to consider, e.g., IMGT_CDR3 or IMGT_JUNCTION
YAML specification:
definitions:
    reports:
        my_significant_kmer_positions_report:
            SignificantKmerPositions:
                reference_sequences_path: path/to/reference/sequences.txt
                p_values:
                    - 0.1
                    - 0.01
                    - 0.001
                    - 0.0001
                k_values:
                    - 3
                    - 4
                    - 5
                label: # Define a label, and the positive class for that given label
                    CMV:
                        positive_class: +
SimpleDatasetOverview¶
Generates a simple text-based overview of the properties of any dataset, including the dataset name, size, and metadata labels.
YAML specification:
definitions:
    reports:
        my_overview: SimpleDatasetOverview
VJGeneDistribution¶
This report creates several plots to gain insight into the V and J gene distribution of a given dataset. When a label is provided, the information in the plots is separated per label value, either by color or by creating separate plots. This way one can for example see if a particular V or J gene is more prevalent across disease associated receptors.
Individual V and J gene distributions: for sequence and receptor datasets, a bar plot is created showing how often each V or J gene occurs in the dataset. For repertoire datasets, boxplots are used to represent how often each V or J gene is used across all repertoires. Since repertoires may differ in size, these counts are normalised by the repertoire size (original count values are additionally exported in tsv files).
Combined V and J gene distributions: for sequence and receptor datasets, a heatmap is created showing how often each combination of V and J genes occurs in the dataset. A similar plot is created for repertoire datasets, except in this case only the average values of the normalised gene usage frequencies are shown (original count values are additionally exported in tsv files).
Specification arguments:
split_by_label (bool): Whether to split the plots by a label. If set to true, the Dataset must either contain a single label, or alternatively the label of interest can be specified under ‘label’. By default, split_by_label is False.
label (str): Optional label for separating the results by color/creating separate plots. Note that this should be the name of a valid dataset label.
YAML specification:
definitions:
    reports:
        my_vj_gene_report:
            VJGeneDistribution:
                label: ag_binding
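The normalisation by repertoire size mentioned above can be illustrated with a short Python sketch. It is not the report’s code: the function name is hypothetical, and it assumes that "repertoire size" means the number of sequences in the repertoire.

```python
from collections import Counter


def normalised_gene_usage(genes):
    """Map each gene to its count divided by the number of sequences in the repertoire."""
    counts = Counter(genes)
    return {gene: count / len(genes) for gene, count in counts.items()}


# Hypothetical V gene calls for the four sequences of one small repertoire.
usage = normalised_gene_usage(["TRBV1", "TRBV1", "TRBV2", "TRBV3"])
```

Because the values sum to 1 within each repertoire, repertoires of different sizes can be compared in the same boxplot.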
Encoding reports¶
Encoding reports show some type of features or statistics about an encoded dataset, or may in some cases export relevant sequences or tables.
When running the TrainMLModel instruction, encoding reports can be specified inside the ‘selection’ or ‘assessment’ specification under the key ‘reports/encoding’. Example:
my_instruction:
    type: TrainMLModel
    selection:
        reports:
            encoding:
                - my_encoding_report
        # other parameters...
    assessment:
        reports:
            encoding:
                - my_encoding_report
        # other parameters...
    # other parameters...
Alternatively, when running the ExploratoryAnalysis instruction, encoding reports can be specified under ‘report’. Example:
my_instruction:
    type: ExploratoryAnalysis
    analyses:
        my_first_analysis:
            report: my_encoding_report
            # other parameters...
    # other parameters...
DesignMatrixExporter¶
Exports the design matrix and related information of a given encoded Dataset to csv files. If the encoded data has more than 2 dimensions (such as when using the OneHot encoder with option Flatten=False), the data are then exported to different formats to facilitate their import with external software.
Specification arguments:
file_format (str): the format and extension of the file to store the design matrix. The supported formats are: npy, csv, pt, hdf5, npy.zip, csv.zip or hdf5.zip.
Note: when using hdf5 or hdf5.zip output formats, make sure the ‘hdf5’ dependency is installed.
YAML specification:
definitions:
    reports:
        my_dme_report:
            DesignMatrixExporter:
                file_format: csv
DimensionalityReduction¶
This report visualizes the data obtained by dimensionality reduction.
Specification arguments:
label (str): name of the label to use for highlighting data points; or None
YAML specification:
definitions:
    reports:
        rep1:
            DimensionalityReduction:
                label: epitope
FeatureComparison¶
Encoding a dataset results in a numeric matrix, where the rows are examples (e.g., sequences, receptors, repertoires) and the columns are features. For example, when the KmerFrequency encoder is used, the features are the k-mers (AAA, AAC, etc.) and the feature values are the frequencies per k-mer.
This report separates the examples based on a binary metadata label, and plots the mean feature value of each feature in one example group against the other example group (for example: plot the feature value of ‘sick’ repertoires on the x axis, and ‘healthy’ repertoires on the y axis to spot consistent differences). The plot can be separated into different colors or facets using other metadata labels (for example: plot the average feature values of ‘cohort1’, ‘cohort2’ and ‘cohort3’ in different colors to spot biases).
Alternatively, when plotting features without comparing them across a binary label, see:
FeatureValueBarplot report to plot a simple bar chart per feature (average across examples).
Or FeatureDistribution report to plot the distribution of each feature across examples, rather than only showing the mean value in a bar plot.
Specification arguments:
comparison_label (str): Mandatory label. This label is used to split the encoded data matrix and define the x and y axes of the plot. This label is only allowed to have 2 classes (for example: sick and healthy, binding and non-binding).
color_grouping_label (str): Optional label that is used to color the points in the scatterplot. This can not be the same as comparison_label.
row_grouping_label (str): Optional label that is used to group scatterplots into different row facets. This can not be the same as comparison_label.
column_grouping_label (str): Optional label that is used to group scatterplots into different column facets. This can not be the same as comparison_label.
show_error_bar (bool): Whether to show the error bar (standard deviation) for the points, both in the x and y dimension.
log_scale (bool): Whether to plot the x and y axes in log10 scale (log_scale = True) or continuous scale (log_scale = False). By default, log_scale is False.
keep_fraction (float): The total number of features may be very large and only the features differing significantly across comparison labels may be of interest. When the keep_fraction parameter is set below 1, only the fraction of features that differs the most across comparison labels is kept for plotting (note that the produced .csv file still contains all data). By default, keep_fraction is 1, meaning that all features are plotted.
opacity (float): a value between 0 and 1 setting the opacity for data points making it easier to see if there are overlapping points
YAML specification:
definitions:
    reports:
        my_comparison_report:
            FeatureComparison: # compare the different classes defined in the label disease
                comparison_label: disease
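The effect of keep_fraction can be sketched as follows. This is an illustration under stated assumptions, not the report’s implementation: "differs the most" is taken to mean the largest absolute difference between the two class means, the number of kept features is rounded up, and the function name and data are hypothetical.

```python
from math import ceil


def keep_most_different(mean_class_a, mean_class_b, keep_fraction=1.0):
    """Return the features with the largest absolute difference in mean value.

    mean_class_a and mean_class_b map feature names to the mean feature value
    within each of the two classes of the comparison label.
    """
    ranked = sorted(
        mean_class_a,
        key=lambda feature: abs(mean_class_a[feature] - mean_class_b[feature]),
        reverse=True,
    )
    return ranked[: ceil(len(ranked) * keep_fraction)]


# Hypothetical mean k-mer frequencies for 'sick' and 'healthy' repertoires.
sick = {"AAA": 0.9, "AAC": 0.5, "ACA": 0.2, "CAA": 0.3}
healthy = {"AAA": 0.1, "AAC": 0.5, "ACA": 0.6, "CAA": 0.35}
kept = keep_most_different(sick, healthy, keep_fraction=0.5)
```

With keep_fraction set to 0.5, only ‘AAA’ and ‘ACA’ would be plotted here, while the exported table would still list all four features.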
FeatureDistribution¶
Encoding a dataset results in a numeric matrix, where the rows are examples (e.g., sequences, receptors, repertoires) and the columns are features. For example, when the KmerFrequency encoder is used, the features are the k-mers (AAA, AAC, etc.) and the feature values are the frequencies per k-mer.
This report plots the distribution of feature values. For each feature, a violin plot is created to show the distribution of feature values across all examples. The violin plots can be separated into different colors or facets using metadata labels (for example: plot the feature distributions of ‘cohort1’, ‘cohort2’ and ‘cohort3’ in different colors to spot biases).
See also: FeatureValueBarplot report to plot a simple bar chart per feature (average across examples), rather than the entire distribution.
Or FeatureComparison report to compare features across binary metadata labels (e.g., plot the feature value of ‘sick’ repertoires on the x axis, and ‘healthy’ repertoires on the y axis).
Specification arguments:
color_grouping_label (str): The label that is used to color each bar, at each level of the grouping_label.
row_grouping_label (str): The label that is used to group bars into different row facets.
column_grouping_label (str): The label that is used to group bars into different column facets.
mode (str): either ‘normal’, ‘sparse’ or ‘auto’ (default). In ‘normal’ mode, boxplots are drawn for each column of the encoded dataset matrix; in ‘sparse’ mode, all zero cells are removed before the data are passed to the boxplots. If mode is set to ‘auto’, it is automatically set to ‘sparse’ when the density of the matrix is below 0.01.
x_title (str): x-axis label
y_title (str): y-axis label
YAML specification:
definitions:
    reports:
        my_fdistr_report:
            FeatureDistribution:
                mode: sparse
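The ‘auto’ rule for the mode parameter can be sketched in plain Python. The function names below are hypothetical and the encoded matrix is represented as a list of rows; density is assumed to be the fraction of non-zero cells.

```python
def resolve_mode(matrix, mode="auto", density_threshold=0.01):
    """Pick 'sparse' when fewer than density_threshold of the cells are non-zero."""
    if mode != "auto":
        return mode
    n_cells = sum(len(row) for row in matrix)
    n_nonzero = sum(1 for row in matrix for value in row if value != 0)
    density = n_nonzero / n_cells if n_cells else 0.0
    return "sparse" if density < density_threshold else "normal"


def nonzero_values_per_feature(matrix):
    """In 'sparse' mode, zero cells are dropped before the values are plotted."""
    n_features = len(matrix[0])
    return [[row[j] for row in matrix if row[j] != 0] for j in range(n_features)]


# A 10 x 20 matrix with a single non-zero cell has density 0.005.
sparse_matrix = [[0] * 20 for _ in range(10)]
sparse_matrix[3][7] = 2
dense_matrix = [[1, 0], [0, 2]]
```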
FeatureValueBarplot¶
Encoding a dataset results in a numeric matrix, where the rows are examples (e.g., sequences, receptors, repertoires) and the columns are features. For example, when the KmerFrequency encoder is used, the features are the k-mers (AAA, AAC, etc.) and the feature values are the frequencies per k-mer.
This report plots the mean feature values per feature. A bar plot is created where the average feature value across all examples is shown, with optional error bars. The bar plots can be separated into different colors or facets using metadata labels (for example: plot the average feature values of ‘cohort1’, ‘cohort2’ and ‘cohort3’ in different colors to spot biases).
See also: FeatureDistribution report to plot the distribution of each feature across examples, rather than only showing the mean value in a bar plot.
Or FeatureComparison report to compare features across binary metadata labels (e.g., plot the feature value of ‘sick’ repertoires on the x axis, and ‘healthy’ repertoires on the y axis).
Specification arguments:
color_grouping_label (str): The label that is used to color each bar, at each level of the grouping_label.
row_grouping_label (str): The label that is used to group bars into different row facets.
column_grouping_label (str): The label that is used to group bars into different column facets.
show_error_bar (bool): Whether to show the error bar (standard deviation) for the bars.
x_title (str): x-axis label
y_title (str): y-axis label
plot_top_n (int): plot n of the largest features on average separately (useful when there are too many features to plot at the same time)
plot_bottom_n (int): plot n of the smallest features on average separately (useful when there are too many features to plot at the same time)
plot_all_features (bool): whether to plot all (might be slow for large number of features)
YAML specification:
definitions:
    reports:
        my_fvb_report:
            FeatureValueBarplot: # timepoint, disease_status and age_group are metadata labels
                column_grouping_label: timepoint
                row_grouping_label: disease_status
                color_grouping_label: age_group
                plot_all_features: true
                plot_top_n: 10
                plot_bottom_n: 5
GroundTruthMotifOverlap¶
Creates a report displaying the overlap between learned motifs and groundtruth motifs implanted in a given sequence dataset. This report must be used in combination with the MotifEncoder.
Specification arguments:
groundtruth_motifs_path (str): Path to a .tsv file containing groundtruth position-specific motifs. The file should specify the motifs as position-specific amino acids, one column representing the positions concatenated with an ‘&’ symbol, the next column specifying the amino acids concatenated with ‘&’ symbol, and the last column specifying the implant rate.
Example:
indices    amino_acids    n_sequences
0          A              4
4&8&9      G&A&C          30
This file shows a motif ‘A’ at position 0 implanted in 4 sequences, and motif G---AC implanted between positions 4 and 9 in 30 sequences.
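As an illustration of the file layout described above, the following Python sketch parses such a file into position-to-amino-acid mappings. It is not part of immuneML: the function name is hypothetical, and the column names are taken from the example table.

```python
import csv
import io


def read_groundtruth_motifs(handle):
    """Parse a tab-separated motif file into (position -> amino acid, n_sequences) pairs."""
    motifs = []
    for row in csv.DictReader(handle, delimiter="\t"):
        indices = [int(index) for index in row["indices"].split("&")]
        amino_acids = row["amino_acids"].split("&")
        motifs.append((dict(zip(indices, amino_acids)), int(row["n_sequences"])))
    return motifs


# The example table from above, written out with tab separators.
example_file = "indices\tamino_acids\tn_sequences\n0\tA\t4\n4&8&9\tG&A&C\t30\n"
motifs = read_groundtruth_motifs(io.StringIO(example_file))
```

The second motif is read as G at position 4, A at position 8 and C at position 9, with positions 5 to 7 left unconstrained.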
YAML specification:
definitions:
    reports:
        my_ground_truth_motif_report:
            GroundTruthMotifOverlap:
                groundtruth_motifs_path: path/to/file.tsv
Matches¶
Reports the number of matches that were found when using one of the following encoders:
MatchedSequences encoder
MatchedReceptors encoder
MatchedRegex encoder
Report results are:
A table containing all matches, where the rows correspond to the Repertoires, and the columns correspond to the objects to match (regular expressions or receptor sequences).
The repertoire sizes (read frequencies and the number of unique sequences per repertoire), for each of the chains. This can be used to calculate the percentage of matched sequences in a repertoire.
When using MatchedSequences encoder or MatchedReceptors encoder, tables describing the chains and receptors (ids, chains, V and J genes and sequences).
When using MatchedReceptors encoder or using MatchedRegex encoder with chain pairs, tables describing the paired matches (where a match was found in both chains) per repertoire.
YAML specification:
definitions:
    reports:
        my_match_report: Matches
MotifTestSetPerformance¶
This report can be used to show the performance of a set of motifs learned using the MotifEncoder on an independent test set of unseen data.
It is recommended to first run the report MotifGeneralizationAnalysis in order to calibrate the optimal recall thresholds and plot the performance of motifs on training and validation sets.
Specification arguments:
test_dataset (dict): parameters for importing a SequenceDataset to use as an independent test set. By default, the import parameters ‘is_repertoire’ and ‘paired’ will be set to False to ensure a SequenceDataset is imported.
YAML specification:
definitions:
    reports:
        my_motif_report:
            MotifTestSetPerformance:
                test_dataset:
                    format: AIRR # choose any valid import format
                    params:
                        path: path/to/files/
                        is_repertoire: False # is_repertoire must be False to import a SequenceDataset
                        paired: False # paired must be False to import a SequenceDataset
                        # optional other parameters...
NonMotifSequenceSimilarity¶
Plots the similarity of positions outside the motifs of interest. This report can be used to investigate if the motifs of interest as determined by the MotifEncoder have a tendency to occur in sequences that are naturally very similar or dissimilar.
For each motif, the subset of sequences containing the motif is selected, and the hamming distances are computed between all sequences in this subset. Finally, a plot is created showing the distribution of hamming distances between the sequences containing the motif. For motifs occurring in sets of very similar sequences, this distribution will lean towards small hamming distances. Likewise, for motifs occurring in a very diverse set of sequences, the distribution will lean towards containing more large hamming distances.
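The procedure in the paragraph above can be sketched in plain Python. This is an illustration only: the function names are hypothetical, a motif is represented as a mapping from position to amino acid, and all sequences are assumed to have the same length.

```python
from collections import Counter
from itertools import combinations


def hamming(first, second):
    """Number of positions at which two equal-length sequences differ."""
    if len(first) != len(second):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(a != b for a, b in zip(first, second))


def hamming_distribution(sequences, motif):
    """Count pairwise Hamming distances among the sequences that contain the motif."""
    subset = [
        sequence
        for sequence in sequences
        if all(i < len(sequence) and sequence[i] == aa for i, aa in motif.items())
    ]
    return Counter(hamming(a, b) for a, b in combinations(subset, 2))


# Hypothetical sequences; the motif is 'C' at position 0.
distribution = hamming_distribution(["CASSA", "CASTA", "CGSSA", "AASSA"], {0: "C"})
```

Since every sequence in the subset shares the motif positions, those positions contribute nothing to the distance, so the result reflects only the positions outside the motif.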
Specification arguments:
motif_color_map (dict): An optional mapping between motif sizes and colors. If no mapping is given, default colors will be chosen.
YAML specification:
definitions:
    reports:
        my_motif_sim:
            NonMotifSequenceSimilarity:
                motif_color_map:
                    3: "#66C5CC"
                    4: "#F6CF71"
                    5: "#F89C74"
PositionalMotifFrequencies¶
This report must be used in combination with the MotifEncoder.
Plots a stacked bar plot of amino acid occurrence at different indices in any given dataset, along with a plot investigating motif continuity which displays a bar plot of the gap sizes between the amino acids in the motifs in the given dataset. Note that a distance of 1 means that the amino acids are continuous (next to each other).
Specification arguments:
motif_color_map (dict): Optional mapping between motif lengths and specific colors to be used. Example:
    motif_color_map:
        1: "#66C5CC"
        2: "#F6CF71"
        3: "#F89C74"
YAML specification:
definitions:
    reports:
        my_pos_motif_report:
            PositionalMotifFrequencies:
                motif_color_map:
                    1: "#66C5CC"
                    2: "#F6CF71"
                    3: "#F89C74"
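The gap sizes mentioned in the description can be illustrated with a small Python sketch. The function name is hypothetical, and a motif is represented only by the indices of its amino acids.

```python
def gap_sizes(indices):
    """Distances between consecutive motif positions; 1 means adjacent amino acids."""
    ordered = sorted(indices)
    return [second - first for first, second in zip(ordered, ordered[1:])]


# A motif with amino acids at positions 4, 8 and 9.
gaps = gap_sizes([4, 8, 9])
```

Here the first two amino acids are 4 positions apart, while the last two are continuous (distance 1).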
RelevantSequenceExporter¶
Exports the sequences that are extracted as label-associated when using the SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder in AIRR-compliant format.
YAML specification:
definitions:
    reports:
        my_relevant_sequences: RelevantSequenceExporter
ML model reports¶
ML model reports show some type of features or statistics about a single trained ML model.
In the TrainMLModel instruction, ML model reports can be specified inside the ‘selection’ or ‘assessment’ specification under the key ‘reports/models’. Example:
my_instruction:
    type: TrainMLModel
    selection:
        reports:
            models:
                - my_ml_report
        # other parameters...
    assessment:
        reports:
            models:
                - my_ml_report
        # other parameters...
    # other parameters...
BinaryFeaturePrecisionRecall¶
Plots the precision and recall scores for each added feature to the collection of features selected by the BinaryFeatureClassifier.
YAML specification:
definitions:
    reports:
        my_report: BinaryFeaturePrecisionRecall
Coefficients¶
A report that plots the coefficients for a given ML method in a barplot. Can be used for LogisticRegression, SVM, SVC, and RandomForestClassifier. In the case of RandomForest, the feature importances will be plotted.
When used in TrainMLModel instruction, the report can be specified under ‘models’, both on the selection and assessment levels.
Which coefficients should be plotted (for example: only nonzero, above a certain threshold, …) can be specified. Multiple options can be specified simultaneously. By default the 25 largest coefficients are plotted. The full set of coefficients will also be exported as a csv file.
Specification arguments:
coefs_to_plot (list): A list specifying which coefficients should be plotted. Valid values are: ALL, NONZERO, CUTOFF, N_LARGEST.
cutoff (list): If ‘cutoff’ is specified under ‘coefs_to_plot’, the cutoff values can be specified here. The coefficients which have an absolute value equal to or greater than the cutoff will be plotted.
n_largest (list): If ‘n_largest’ is specified under ‘coefs_to_plot’, the values for n can be specified here. These should be integer values. The n largest coefficients are determined based on their absolute values.
YAML specification:
definitions:
    reports:
        my_coef_report:
            Coefficients:
                coefs_to_plot:
                    - all
                    - nonzero
                    - cutoff
                    - n_largest
                cutoff:
                    - 0.1
                    - 0.01
                n_largest:
                    - 5
                    - 10
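The four selection strategies can be sketched as a single Python function. This is only an illustration of the rules stated in the argument descriptions (absolute values for cutoff and n_largest), not the report’s code; the function name and the coefficient values are hypothetical.

```python
def select_coefficients(coefficients, strategy, cutoff=None, n_largest=None):
    """Select the coefficients to plot from a feature -> coefficient mapping."""
    if strategy == "all":
        return dict(coefficients)
    if strategy == "nonzero":
        return {f: v for f, v in coefficients.items() if v != 0}
    if strategy == "cutoff":
        # Keep coefficients whose absolute value is equal to or greater than the cutoff.
        return {f: v for f, v in coefficients.items() if abs(v) >= cutoff}
    if strategy == "n_largest":
        # Rank by absolute value, so large negative coefficients are kept as well.
        top = sorted(coefficients, key=lambda f: abs(coefficients[f]), reverse=True)
        return {f: coefficients[f] for f in top[:n_largest]}
    raise ValueError(f"Unknown strategy: {strategy}")


# Hypothetical coefficients of a k-mer based model.
coefs = {"AAA": 0.5, "AAC": 0.0, "ACA": -0.2, "CAA": 0.05}
```

With the YAML above, one plot would be produced per strategy and per listed cutoff or n value.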
ConfounderAnalysis¶
A report that plots the numbers of false positives and false negatives with respect to each value of the metadata features specified by the user. This allows checking whether a given machine learning model makes more misclassifications for some values of a metadata feature than for the others.
Specification arguments:
metadata_labels (list): A list of the metadata features to use as a basis for the calculations
YAML specification:
definitions:
    reports:
        my_confounder_report:
            ConfounderAnalysis:
                metadata_labels:
                    - age
                    - sex
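The counting behind this report can be sketched in Python as follows. It is an illustration only, assuming a binary label with a known positive class; the function name and the example data are hypothetical.

```python
from collections import Counter


def errors_per_metadata_value(true_labels, predicted_labels, metadata_values, positive_class):
    """Count false positives and false negatives for each value of one metadata feature."""
    false_positives, false_negatives = Counter(), Counter()
    for true, predicted, value in zip(true_labels, predicted_labels, metadata_values):
        if predicted == positive_class and true != positive_class:
            false_positives[value] += 1
        elif predicted != positive_class and true == positive_class:
            false_negatives[value] += 1
    return false_positives, false_negatives


# Hypothetical predictions for five repertoires, grouped by the metadata feature 'sex'.
fp, fn = errors_per_metadata_value(
    true_labels=["sick", "sick", "healthy", "healthy", "sick"],
    predicted_labels=["sick", "healthy", "sick", "healthy", "healthy"],
    metadata_values=["F", "M", "F", "M", "M"],
    positive_class="sick",
)
```

In this example both false negatives fall in group ‘M’, which is the kind of imbalance the report is meant to reveal.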
DeepRCMotifDiscovery¶
This report plots the contributions of (i) input sequences and (ii) kernels to a trained DeepRC model with respect to the test dataset. Contributions are computed using integrated gradients (IG). This report produces two figures:
inputs_integrated_gradients: Shows the contributions of the characters within the input sequences (test dataset) that were most important for immune status prediction of the repertoire. IG is only applied to sequences of positive class repertoires.
kernel_integrated_gradients: Shows the 1D CNN kernels with the highest contribution over all positions and amino acids.
For both inputs and kernels: Larger characters in the extracted motifs indicate higher contribution, with blue indicating positive contribution and red indicating negative contribution towards the prediction of the immune status. For kernels only: contributions to positional encoding are indicated by < (beginning of sequence), ∧ (center of sequence), and > (end of sequence).
See DeepRCMotifDiscovery for repertoire classification for a more detailed example.
Reference:
Widrich, M., et al. (2020). Modern Hopfield Networks and Attention for Immune Repertoire Classification. Advances in Neural Information Processing Systems, 33. https://proceedings.neurips.cc//paper/2020/hash/da4902cb0bc38210839714ebdcf0efc3-Abstract.html
Specification arguments:
n_steps (int): Number of IG steps (more steps -> better path integral -> finer contribution values). 50 is usually good enough.
threshold (float): Only applies to the plotting of kernels. Contributions are normalized to range [0, 1], and only kernels with normalized contributions above threshold are plotted.
YAML specification:
definitions:
    reports:
        my_deeprc_report:
            DeepRCMotifDiscovery:
                threshold: 0.5
                n_steps: 50
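Why more IG steps give finer contribution values can be seen in a toy example. The sketch below approximates integrated gradients for a scalar function with a Riemann sum; it is unrelated to the DeepRC code, and the function name and the test function f(x) = x^3 are chosen purely for illustration.

```python
def integrated_gradients(gradient, x, baseline, n_steps):
    """Riemann-sum approximation of integrated gradients for a scalar input.

    The gradient is averaged at n_steps points on the straight path from the
    baseline to x and scaled by (x - baseline).
    """
    total = 0.0
    for step in range(1, n_steps + 1):
        point = baseline + (step / n_steps) * (x - baseline)
        total += gradient(point)
    return (x - baseline) * total / n_steps


# For f(x) = x**3 the exact attribution from baseline 0 to x = 2 is f(2) - f(0) = 8.
coarse = integrated_gradients(lambda v: 3 * v ** 2, x=2.0, baseline=0.0, n_steps=5)
fine = integrated_gradients(lambda v: 3 * v ** 2, x=2.0, baseline=0.0, n_steps=50)
```

With 5 steps the approximation is 10.56, and with 50 steps it is about 8.24, much closer to the exact value of 8.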
KernelSequenceLogo¶
A report that plots kernels of a CNN model as sequence logos. It works only with trained ReceptorCNN models, which have kernels already normalized to represent information gain matrices. Additionally, it also plots the weights in the final fully-connected layer of the network associated with kernel outputs. For more information on how the model works, see ReceptorCNN.
The kernels are visualized using Logomaker. Original publication: Tareen A, Kinney JB. Logomaker: beautiful sequence logos in Python. Bioinformatics. 2020; 36(7):2272-2274. doi:10.1093/bioinformatics/btz921.
YAML specification:
definitions:
    reports:
        my_kernel_seq_logo: KernelSequenceLogo
MotifSeedRecovery¶
This report can be used to show how well implanted motifs (for example, through the Simulation instruction) can be recovered by various machine learning methods using the k-mer encoding. This report creates a boxplot, where the x axis (box grouping) represents the maximum possible overlap between an implanted motif seed and a kmer feature (measured in number of positions), and the y axis shows the coefficient size of the respective kmer feature. If the machine learning method has learned the implanted motif seeds, the coefficient size is expected to be largest for the kmer features with high overlap to the motif seeds.
Note that to use this report, the following criteria must be met:
KmerFrequencyEncoder must be used.
One of the following classifiers must be used: RandomForestClassifier, LogisticRegression, SVM, SVC
For each label, the implanted motif seeds relevant to that label must be specified
To find the overlap score between kmer features and implanted motif seeds, the two sequences are compared in a sliding window approach, and the maximum overlap is calculated.
Overlap scores between kmer features and implanted motifs are calculated differently based on the Hamming distance that was allowed during implanting.
Without hamming distance:
    Seed:     AAA  -> score = 3
    Feature: xAAAx
              ^^^
    Seed:     AAA  -> score = 0
    Feature: xAAxx
With hamming distance:
    Seed:     AAA  -> score = 3
    Feature: xAAAx
              ^^^
    Seed:     AAA  -> score = 2
    Feature: xAAxx
              ^^
Furthermore, gap positions in the motif seed are ignored:
    Seed:     A/AA  -> score = 3
    Feature: xAxAAx
              ^/^^
See Recovering simulated immune signals for more details.
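The scoring rules illustrated above can be sketched in Python. This is a simplified reading of the examples rather than the report’s implementation: the function names are hypothetical, the gap size is passed explicitly, and only windows in which the whole seed fits inside the feature are considered (the report may treat seeds that hang over the edge of a feature differently).

```python
def expand_seed(seed, gap_size):
    """Replace the '/' gap marker with gap_size wildcard positions (None)."""
    pattern = []
    for char in seed:
        if char == "/":
            pattern.extend([None] * gap_size)
        else:
            pattern.append(char)
    return pattern


def overlap_score(seed, feature, hamming_distance, gap_size=0):
    """Maximum number of matching non-gap positions over all sliding windows."""
    pattern = expand_seed(seed, gap_size)
    best = 0
    for start in range(len(feature) - len(pattern) + 1):
        window = feature[start:start + len(pattern)]
        # Gap positions are ignored when comparing seed and feature.
        pairs = [(p, f) for p, f in zip(pattern, window) if p is not None]
        matches = sum(p == f for p, f in pairs)
        # Without hamming distance, only a window matching every position counts.
        if hamming_distance or matches == len(pairs):
            best = max(best, matches)
    return best
```

For the examples above, overlap_score("AAA", "xAAxx", hamming_distance=False) gives 0, the same call with hamming_distance=True gives 2, and overlap_score("A/AA", "xAxAAx", hamming_distance=False, gap_size=1) gives 3.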
Specification arguments:
implanted_motifs_per_label (dict): a nested dictionary that specifies the motif seeds that were implanted in the given dataset. The first level of keys in this dictionary represents the different labels. In the inner dictionary there should be two keys: “seeds” and “hamming_distance”:
seeds: a list of motif seeds. The seeds may contain gaps, specified by a ‘/’ symbol.
hamming_distance: A boolean value that specifies whether hamming distance was allowed when implanting the motif seeds for a given label. Note that this applies to all seeds for this label.
gap_sizes: a list of all the possible gap sizes that were used when implanting a gapped motif seed. When no gapped seeds are used, this value has no effect.
YAML specification:
definitions:
    reports:
        my_motif_report:
            MotifSeedRecovery:
                implanted_motifs_per_label:
                    CD:
                        seeds:
                            - AA/A
                            - AAA
                        hamming_distance: False
                        gap_sizes:
                            - 0
                            - 1
                            - 2
                    T1D:
                        seeds:
                            - CC/C
                            - CCC
                        hamming_distance: True
                        gap_sizes:
                            - 2
ROCCurve¶
A report that plots the ROC curve for a binary classifier.
YAML specification:
definitions:
    reports:
        my_roc_report: ROCCurve
SequenceAssociationLikelihood¶
Plots the beta distribution used as a prior for class assignment in ProbabilisticBinaryClassifier. The distribution plotted shows the probability that a sequence is associated with a given class for a label.
YAML specification:
definitions:
    reports:
        my_sequence_assoc_report: SequenceAssociationLikelihood
TCRdistMotifDiscovery¶
The report for discovering motifs in paired immune receptor data of given specificity based on TCRdist3. The receptors are hierarchically clustered based on the tcrdist distance and then motifs are discovered for each cluster. The report outputs logo plots for the motifs along with the raw data used for plotting in csv format.
For the implementation, TCRdist3 library was used (source code available here). More details on the functionality used for this report are available here.
Original publications:
Dash P, Fiore-Gartland AJ, Hertz T, et al. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature. 2017; 547(7661):89-93. doi:10.1038/nature22383
Mayer-Blackwell K, Schattgen S, Cohen-Lavi L, et al. TCR meta-clonotypes for biomarker discovery with tcrdist3: quantification of public, HLA-restricted TCR biomarkers of SARS-CoV-2 infection. bioRxiv. Published online December 26, 2020:2020.12.24.424260. doi:10.1101/2020.12.24.424260
Specification arguments:
positive_class_name (str): the class value (e.g., epitope) used to select only the receptors that are specific to the given epitope so that only those sequences are used to infer motifs; the reference receptors as required by TCRdist will be the ones from the dataset that have different or no epitope specified in their metadata; if the labels are available only on the epitope level (e.g., label is “AVFDRKSDAK” and classes are True and False), then here it should be specified that only the receptors with value “True” for label “AVFDRKSDAK” should be used; there is no default value for this argument
cores (int): number of processes to use for the computation of the distance and motifs
min_cluster_size (int): the minimum size of the cluster to discover the motifs for
use_reference_sequences (bool): when showing motifs, this parameter defines if reference sequences should be provided as well as a background
YAML specification:
definitions:
    reports:
        my_tcr_dist_report: # user-defined name
            TCRdistMotifDiscovery:
                positive_class_name: True # class name, could also be epitope name, depending on how it's defined in the dataset
                cores: 4
                min_cluster_size: 30
                use_reference_sequences: False
TrainingPerformance¶
A report that plots the evaluation metrics for the performance of a given machine learning model on the training dataset.
The available metrics are accuracy, balanced_accuracy, confusion_matrix, f1_micro, f1_macro, f1_weighted, precision, recall, auc and log_loss (see immuneML.environment.Metric.Metric).
Specification arguments:
metrics (list): A list of metrics used to evaluate training performance. See immuneML.environment.Metric.Metric for available options.
YAML specification:
definitions:
    reports:
        my_performance_report:
            TrainingPerformance:
                metrics:
                    - accuracy
                    - balanced_accuracy
                    - confusion_matrix
                    - f1_micro
                    - f1_macro
                    - f1_weighted
                    - precision
                    - recall
                    - auc
                    - log_loss
Train ML model reports¶
Train ML model reports plot general statistics or export data of multiple models simultaneously when running the TrainMLModel instruction.
In the TrainMLModel instruction, train ML model reports can be specified under ‘reports’. Example:
my_instruction:
  type: TrainMLModel
  reports:
    - my_train_ml_model_report
  # other parameters...
CVFeaturePerformance¶
This report plots the average training vs test performance with respect to a given encoding parameter, which is explicitly set in the feature attribute. It can be used only in combination with the TrainMLModel instruction and can only be specified under ‘reports’.
Specification arguments:
feature: name of the encoder parameter w.r.t. which the performance across training and test will be shown. Possible values depend on the encoder on which it is used.
is_feature_axis_categorical (bool): if the x-axis of the plot where features are shown should be categorical; alternatively it is automatically determined based on the feature values
YAML specification:
definitions:
  reports:
    report1:
      CVFeaturePerformance:
        feature: p_value_threshold # parameter value of SequenceAbundance encoder
        is_feature_axis_categorical: True # show x-axis as categorical
DiseaseAssociatedSequenceCVOverlap¶
DiseaseAssociatedSequenceCVOverlap report makes one heatmap per label showing the overlap of disease-associated sequences (or k-mers) produced by the SequenceAbundanceEncoder, CompAIRRSequenceAbundanceEncoder or KmerAbundanceEncoder between folds of cross-validation (either inner or outer loop of the nested CV). The overlap is computed by the following equation:
For details, see Greiff V, Menzel U, Miho E, et al. Systems Analysis Reveals High Genetic and Antigen-Driven Predetermination of Antibody Repertoires throughout B Cell Development. Cell Reports. 2017;19(7):1467-1478. doi:10.1016/j.celrep.2017.04.054.
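The equation itself is not reproduced in this text. As a minimal sketch, assuming the overlap between two sets of sequences X and Y is the size of their intersection divided by the size of the smaller set, expressed as a percentage (an assumption in the spirit of the pairwise overlap used in the cited paper, not confirmed here), the values of such a heatmap could be computed as follows. The function names and the example sequences are hypothetical.

```python
from itertools import combinations


def overlap_percentage(x: set, y: set) -> float:
    """Assumed overlap measure: |X ∩ Y| / min(|X|, |Y|) * 100."""
    if not x or not y:
        return 0.0  # convention chosen for this sketch when one set is empty
    return len(x & y) / min(len(x), len(y)) * 100


def pairwise_overlap(folds: dict) -> dict:
    """Overlap for every pair of CV folds, keyed by (fold_a, fold_b)."""
    return {(a, b): overlap_percentage(folds[a], folds[b])
            for a, b in combinations(sorted(folds), 2)}


# hypothetical disease-associated sequences found in three CV folds
folds = {
    "fold_1": {"CASSLGTDTQYF", "CASSPGQGYEQYF", "CASSIRSSYEQYF"},
    "fold_2": {"CASSLGTDTQYF", "CASSPGQGYEQYF"},
    "fold_3": {"CASSLGTDTQYF", "CASRDRGNTEAFF"},
}
result = pairwise_overlap(folds)
```

Dividing by the smaller set keeps the measure at 100 whenever one fold's sequences are fully contained in another's, regardless of how many more sequences the larger fold has.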
Specification arguments:
compare_in_selection (bool): whether to compute the overlap over the inner loop of the nested CV - the sequence overlap is shown across CV folds for the model chosen as optimal within that selection
compare_in_assessment (bool): whether to compute the overlap over the optimal models in the outer loop of the nested CV
YAML specification:
definitions:
  reports:
    my_overlap_report: DiseaseAssociatedSequenceCVOverlap # no parameters are set in this example
MLSettingsPerformance¶
Report for TrainMLModel instruction: plots the performance for each of the setting combinations as defined under ‘settings’ in the assessment (outer validation) loop.
The performances are grouped by label (horizontal panels), encoding (vertical panels) and ML method (bar color). When multiple data splits are used, the average performance over the data splits is shown with an error bar representing the standard deviation.
This report can be used only with TrainMLModel instruction under ‘reports’.
Specification arguments:
single_axis_labels (bool): whether to use single axis labels. Note that using single axis labels makes the figure unsuited for rescaling, as the label position is given in a fixed distance from the axis. By default, single_axis_labels is False, resulting in standard plotly axis labels.
x_label_position (float): if single_axis_labels is True, this should be a number specifying the x axis label position relative to the x axis. The default value for x_label_position is -0.1.
y_label_position (float): same as x_label_position, but for the y-axis.
YAML specification:
definitions:
  reports:
    my_hp_report: MLSettingsPerformance
ROCCurveSummary¶
This report plots ROC curves for all trained ML settings ([preprocessing], encoding, ML model) in the outer loop of cross-validation in the TrainMLModel instruction. If there are multiple splits in the outer loop, this report will make one plot per split. This report is defined only for binary classification. If there are multiple labels defined in the instruction, each label has to have two classes to be included in this report.
YAML specification:
definitions:
  reports:
    my_roc_summary_report: ROCCurveSummary
ReferenceSequenceOverlap¶
The ReferenceSequenceOverlap report compares a list of disease-associated sequences (or k-mers) produced by the SequenceAbundanceEncoder, CompAIRRSequenceAbundanceEncoder or KmerAbundanceEncoder to a list of reference sequences. It outputs a Venn diagram and a list of sequences found both in the encoder and reference list.
The report compares the sequences by their sequence content and the additional comparison_attributes (such as V or J gene), as specified by the user.
Specification arguments:
reference_path (str): path to the reference file in csv format which contains one entry per row and has columns that correspond to the attributes listed under comparison_attributes argument
comparison_attributes (list): list of attributes to use for comparison; all of them have to be present in the reference file where they should be the names of the columns
label (str): name of the label for which the reference sequences/k-mers should be compared to the model; if none, it takes the one label from the instruction; if it is none and multiple labels were specified for the instruction, the report will not be generated
YAML specification:
definitions:
  reports:
    my_reference_overlap_report:
      ReferenceSequenceOverlap:
        reference_path: reference_sequences.csv # example usage with SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder
        comparison_attributes:
          - sequence_aa
          - v_call
          - j_call
    my_reference_overlap_report_with_kmers:
      ReferenceSequenceOverlap:
        reference_path: reference_kmers.csv # example usage with KmerAbundanceEncoder
        comparison_attributes:
          - k-mer
Multi dataset reports¶
Multi dataset reports are special reports that can be specified when running immuneML with the MultiDatasetBenchmarkTool.
See Manuscript use case 1: Robustness assessment for an example.
When running the MultiDatasetBenchmarkTool, multi dataset reports can be specified under ‘benchmark_reports’.
Example:
my_instruction:
  type: TrainMLModel
  benchmark_reports:
    - my_benchmark_report
  # other parameters...
DiseaseAssociatedSequenceOverlap¶
DiseaseAssociatedSequenceOverlap report makes a heatmap showing the overlap of disease-associated sequences (or k-mers) produced by the SequenceAbundanceEncoder, CompAIRRSequenceAbundanceEncoder or KmerAbundanceEncoder between multiple datasets of different sizes (different number of repertoires per dataset).
This plot can be used only with MultiDatasetBenchmarkTool.
The overlap is computed by the following equation:
For details, see: Greiff V, Menzel U, Miho E, et al. Systems Analysis Reveals High Genetic and Antigen-Driven Predetermination of Antibody Repertoires throughout B Cell Development. Cell Reports. 2017;19(7):1467-1478. doi:10.1016/j.celrep.2017.04.054.
YAML specification:
definitions:
  reports:
    my_overlap_report: DiseaseAssociatedSequenceOverlap # report has no parameters
PerformanceOverview¶
PerformanceOverview report creates an ROC plot and a precision-recall plot for optimal trained models on multiple datasets. The labels on the plots are the names of the datasets, so it helps to choose readable dataset names when defining datasets, while still using only letters, numbers and the underscore sign.
This report can be used only with MultiDatasetBenchmarkTool as it will plot ROC and PR curve for trained models across datasets. Also, it requires the task to be immune repertoire classification and cannot be used for receptor or sequence classification. Furthermore, it uses predictions on the test dataset to assess the performance and plot the curves. If the parameter refit_optimal_model is set to True, all data will be used to fit the optimal model, so there will not be a test dataset which can be used to assess performance and the report will not be generated.
If datasets have the same number of examples, the baseline PR curve will be plotted as described in this publication: Saito T, Rehmsmeier M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS ONE. 2015;10(3):e0118432. doi:10.1371/journal.pone.0118432
If the datasets have different numbers of examples, the baseline PR curve will not be plotted.
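As a minimal sketch of the baseline described in the cited publication: the precision of a random classifier equals the fraction of positive examples, P / (P + N), which gives a horizontal line in the precision-recall plot. How immuneML computes and draws this line is not stated here, so the function below is an illustration with a hypothetical name, not the library's implementation.

```python
def pr_baseline(labels):
    """Baseline precision of a random classifier: P / (P + N)."""
    positives = sum(1 for label in labels if label)
    return positives / len(labels)


# hypothetical test set: 2 diseased and 6 healthy repertoires
test_labels = [True, True, False, False, False, False, False, False]
baseline = pr_baseline(test_labels)
```

Because the baseline depends on the class balance, a single baseline is only meaningful when the compared datasets share it.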
YAML specification:
definitions:
  reports:
    my_performance_report: PerformanceOverview
Preprocessings¶
Under the definitions/preprocessing_sequences component, the user can specify different preprocessing steps to apply to a dataset before performing an analysis. This is optional.
ChainRepertoireFilter¶
Removes all repertoires from the RepertoireDataset object which contain at least one sequence with a chain different from the one given by the keep_chain parameter. Note that this filter filters out repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets.
Since the filter removes repertoires from the dataset (examples in machine learning setting), it cannot be used with TrainMLModel instruction. If you want to filter out repertoires including a given chain, see DatasetExport instruction with preprocessing.
Specification arguments:
keep_chain (str): Which chain should be kept, valid values are “TRA”, “TRB”, “IGH”, “IGL”, “IGK”
YAML specification:
preprocessing_sequences:
  my_preprocessing:
    - my_filter:
        ChainRepertoireFilter:
          keep_chain: TRB
ClonesPerRepertoireFilter¶
Removes all repertoires from the RepertoireDataset, which contain fewer clonotypes than specified by the lower_limit, or more clonotypes than specified by the upper_limit. Note that this filter filters out repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets. When no lower or upper limit is specified, or the value -1 is specified, the limit is ignored.
Since the filter removes repertoires from the dataset (examples in machine learning setting), it cannot be used with TrainMLModel instruction. If you want to use this filter, see DatasetExport instruction with preprocessing.
Specification arguments:
lower_limit (int): The minimal inclusive lower limit for the number of clonotypes allowed in a repertoire.
upper_limit (int): The maximal inclusive upper limit for the number of clonotypes allowed in a repertoire.
YAML specification:
preprocessing_sequences:
  my_preprocessing:
    - my_filter:
        ClonesPerRepertoireFilter:
          lower_limit: 100
          upper_limit: 100000
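The limit semantics described above (both limits inclusive, -1 or an unspecified value meaning the limit is ignored) can be sketched as follows. This is an illustration under those stated rules, not immuneML's implementation; the function name and the repertoire names are hypothetical.

```python
def keep_repertoire(clonotype_count: int, lower_limit: int = -1, upper_limit: int = -1) -> bool:
    """Return True if the repertoire passes both inclusive limits; -1 disables a limit."""
    if lower_limit != -1 and clonotype_count < lower_limit:
        return False
    if upper_limit != -1 and clonotype_count > upper_limit:
        return False
    return True


# hypothetical clonotype counts per repertoire
counts = {"rep_a": 50, "rep_b": 100, "rep_c": 250000}
kept = [name for name, n in counts.items() if keep_repertoire(n, 100, 100000)]
```

With the limits from the YAML example, only rep_b is kept: rep_a has too few clonotypes and rep_c too many.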
CountPerSequenceFilter¶
Removes all sequences from a Repertoire when they have a count below low_count_limit, or sequences with no count value if remove_without_count is True. This filter can be applied to Repertoires and RepertoireDatasets.
Specification arguments:
low_count_limit (int): The inclusive minimal count value in order to retain a given sequence.
remove_without_count (bool): Whether the sequences without a reported count value should be removed.
remove_empty_repertoires (bool): Whether repertoires without sequences should be removed. Only has an effect when remove_without_count is also set to True. If this is true, this preprocessing cannot be used with TrainMLModel instruction, but only with DatasetExport instruction instead.
batch_size (int): number of repertoires that can be loaded at the same time (only affects the speed when applying this filter on a RepertoireDataset)
YAML specification:
preprocessing_sequences:
  my_preprocessing:
    - my_filter:
        CountPerSequenceFilter:
          remove_without_count: True
          remove_empty_repertoires: True
          low_count_limit: 3
          batch_size: 4
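A minimal sketch of the filtering rule described above, assuming each sequence is a dictionary whose count is stored under the AIRR field name duplicate_count (the data layout and the function name are assumptions made for illustration):

```python
def filter_by_count(sequences, low_count_limit, remove_without_count):
    """Keep sequences with count >= low_count_limit; optionally drop those without a count."""
    kept = []
    for seq in sequences:
        count = seq.get("duplicate_count")
        if count is None:
            if not remove_without_count:
                kept.append(seq)
        elif count >= low_count_limit:
            kept.append(seq)
    return kept


# hypothetical repertoire with one sequence lacking a count value
repertoire = [
    {"sequence_aa": "CASSLGTDTQYF", "duplicate_count": 10},
    {"sequence_aa": "CASSPGQGYEQYF", "duplicate_count": 2},
    {"sequence_aa": "CASRDRGNTEAFF", "duplicate_count": None},
]
strict = filter_by_count(repertoire, low_count_limit=3, remove_without_count=True)
lenient = filter_by_count(repertoire, low_count_limit=3, remove_without_count=False)
```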
DuplicateSequenceFilter¶
Collapses duplicate nucleotide or amino acid sequences within each repertoire in the given RepertoireDataset. This filter can be applied to Repertoires and RepertoireDatasets.
Sequences are considered duplicates if the following fields are identical:
amino acid or nucleotide sequence (whichever is specified)
v and j genes (note that the full field including subgroup + gene is used for matching, i.e. V1 and V1-1 are not considered duplicates)
chain
region type
For all other fields (the non-specified sequence type, custom lists, sequence identifier) only the first occurring value is kept.
Note that this means the count value of a sequence with a given sequence identifier might not be the same as before removing duplicates, unless count_agg = FIRST is used.
Specification arguments:
filter_sequence_type (SequenceType): Whether the sequences should be collapsed on the nucleotide or amino acid level. Valid values are: [‘amino_acid’, ‘nucleotide’].
region_type (str): which part of the sequence to examine, by default, this is IMGT_CDR3
count_agg (CountAggregationFunction): determines how the sequence counts of duplicate sequences are aggregated. Valid values are: [‘sum’, ‘max’, ‘min’, ‘mean’, ‘first’, ‘last’].
YAML specification:
preprocessing_sequences:
  my_preprocessing:
    - my_filter:
        DuplicateSequenceFilter:
          # required parameters:
          filter_sequence_type: AMINO_ACID
          # optional parameters (if not specified the values below will be used):
          batch_size: 4
          count_agg: SUM
          region_type: IMGT_CDR3
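The collapsing rule described above can be sketched as follows: records are grouped by the chosen sequence field plus V gene, J gene, chain and region type, the counts within a group are aggregated, and every other field keeps its first occurring value. This is an illustration, not immuneML's implementation; the record layout (AIRR-style field names such as duplicate_count) and all function names are assumptions.

```python
# Aggregation functions matching the valid count_agg values
AGGREGATORS = {
    "sum": sum,
    "max": max,
    "min": min,
    "mean": lambda counts: sum(counts) / len(counts),
    "first": lambda counts: counts[0],
    "last": lambda counts: counts[-1],
}


def collapse_duplicates(sequences, sequence_field="sequence_aa", count_agg="sum"):
    """Collapse records that agree on sequence, V/J gene, chain and region type."""
    groups = {}
    for record in sequences:
        key = (record[sequence_field], record["v_call"], record["j_call"],
               record["chain"], record["region_type"])
        groups.setdefault(key, []).append(record)

    collapsed = []
    for members in groups.values():
        merged = dict(members[0])  # all other fields keep the first occurring value
        counts = [m["duplicate_count"] for m in members]
        merged["duplicate_count"] = AGGREGATORS[count_agg](counts)
        collapsed.append(merged)
    return collapsed


def make_record(seq_id, nt, v_call, count):
    # hypothetical helper that fills in the fields shared by all example records
    return {"sequence_id": seq_id, "sequence_aa": "CASSL", "sequence": nt,
            "v_call": v_call, "j_call": "TRBJ1-1", "chain": "TRB",
            "region_type": "IMGT_CDR3", "duplicate_count": count}


repertoire = [
    make_record("s1", "TGTGCCAGCAGCTTA", "TRBV1-1", 3),
    make_record("s2", "TGCGCCAGCAGCTTA", "TRBV1-1", 5),  # same amino acids, different nucleotides
    make_record("s3", "TGTGCCAGCAGCTTA", "TRBV1", 2),    # V1 is not merged with V1-1
]
by_amino_acid = collapse_duplicates(repertoire, "sequence_aa", "sum")
by_nucleotide = collapse_duplicates(repertoire, "sequence", "sum")
```

On the amino acid level s1 and s2 collapse into one record that keeps the identifier s1 and a summed count, which illustrates why the count attached to a given identifier can change unless count_agg is set to first.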
MetadataRepertoireFilter¶
Removes repertoires from a RepertoireDataset based on information stored in the metadata_file. Note that this filters out repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets.
Since this filter changes the number of repertoires (examples for the machine learning task), it cannot be used with TrainMLModel instruction. To filter out repertoires, use preprocessing from the DatasetExport instruction that will create a new dataset ready to be used for training machine learning models.
Specification arguments:
criteria (dict): a nested dictionary that specifies the criteria for keeping repertoires, based on the columns of the metadata file. See CriteriaMatcher for a more detailed explanation.
YAML specification:
preprocessing_sequences:
  my_preprocessing:
    - my_filter:
        # Example filter that keeps repertoires with values greater than 1 in the "my_column_name" column of the metadata_file
        MetadataRepertoireFilter:
          type: GREATER_THAN
          value:
            type: COLUMN
            name: my_column_name
          threshold: 1
ReferenceSequenceAnnotator¶
Annotates each sequence in each repertoire if it matches any of the reference sequences provided as input parameter. This preprocessing uses CompAIRR internally. To match CDR3 sequences (and not JUNCTION), CompAIRR v1.10 or later is needed.
Specification arguments:
reference_sequences (dict): A dictionary describing the reference dataset file. Import should be specified the same way as regular dataset import. It is only allowed to import a receptor dataset here (i.e., is_repertoire is False and paired is True by default, and these are not allowed to be changed).
max_edit_distance (int): The maximum edit distance between a target sequence (from the repertoire) and the reference sequence.
compairr_path (str): optional path to the CompAIRR executable. If not given, it is assumed that CompAIRR has been installed such that it can be called directly on the command line with the command ‘compairr’, or that it is located at /usr/local/bin/compairr.
threads (int): how many threads to be used by CompAIRR for sequence matching
ignore_genes (bool): Whether to ignore V and J gene information. If False, the V and J genes between two receptor chains have to match. If True, gene information is ignored. By default, ignore_genes is False.
output_column_name (str): in case there are multiple annotations, it is possible here to define the name of the column in the output repertoire files for this specific annotation
repertoire_batch_size (int): how many repertoires to process simultaneously; depending on the repertoire size, this parameter might be used to limit the memory usage
region_type (str): which region type to check, default is IMGT_CDR3
YAML specification:
preprocessing_sequences:
  my_preprocessing:
    - step1:
        ReferenceSequenceAnnotator:
          reference_sequences:
            format: VDJDB
            params:
              path: path/to/file.csv
          compairr_path: optional/path/to/compairr
          ignore_genes: False
          max_edit_distance: 0
          output_column_name: matched
          threads: 4
          repertoire_batch_size: 5
          region_type: IMGT_CDR3
SequenceLengthFilter¶
Removes sequences with length out of the predefined range.
Specification arguments:
sequence_type (SequenceType): Whether the sequences should be filtered on the nucleotide or amino acid level. Valid options are defined by the SequenceType enum.
min_len (int): minimum length of the sequence (sequences shorter than min_len will be removed); to not use min_len, set it to -1
max_len (int): maximum length of the sequence (sequences longer than max_len will be removed); to not use max_len, set it to -1
region_type (str): which part of the sequence to examine, by default, this is IMGT_CDR3
YAML specification:
preprocessing_sequences:
  my_preprocessing:
    - my_filter:
        SequenceLengthFilter:
          sequence_type: AMINO_ACID
          min_len: 3 # -> remove all sequences shorter than 3
          max_len: -1 # -> no upper bound on the sequence length
SubjectRepertoireCollector¶
Merges all the Repertoires in a RepertoireDataset that have the same ‘subject_id’ specified in the metadata. The result is a RepertoireDataset with one Repertoire per subject. This preprocessing cannot be used in combination with TrainMLModel instruction because it can change the number of examples. To combine the repertoires in this way, use this preprocessing with DatasetExport instruction.
YAML specification:
preprocessing_sequences:
  my_preprocessing:
    - my_filter: SubjectRepertoireCollector
Simulation¶
Under the definitions/simulation component, the user can specify parameters necessary for simulating synthetic immune signals into an AIRR dataset. See also Dataset simulation with LIgO.
Motifs¶
Motifs are the objects which are implanted into sequences during simulation.
They are defined under definitions/motifs. There are several different motif types, each having their own parameters.
SeedMotif¶
Describes motifs by seed, possible gaps, allowed hamming distances, positions that can be changed and what they can be changed to.
Specification arguments:
seed (str): An amino acid sequence that represents the basic motif seed. All implanted motifs correspond to the seed, or a modified version thereof, as specified in its instantiation strategy. If this argument is set, seed_chain1 and seed_chain2 arguments are not used.
min_gap (int): The minimum gap length, in case the original seed contains a gap.
max_gap (int): The maximum gap length, in case the original seed contains a gap.
hamming_distance_probabilities (dict): The probability of modifying the given seed with each number of modifications. The keys represent the number of modifications (hamming distance) between the original seed and the implanted motif, and the values represent the probabilities for the respective number of modifications. For example {0: 0.7, 1: 0.3} means that 30% of the time one position will be modified, and the remaining 70% of the time the motif will remain unmodified with respect to the seed. The values of hamming_distance_probabilities must sum to 1.
position_weights (dict): A dictionary containing the relative probabilities of choosing each position for hamming distance modification. The keys represent the position in the seed, where counting starts at 0. If the index of a gap is specified in position_weights, it will be removed. The values represent the relative probabilities for modifying each position when it gets selected for modification. For example {0: 0.6, 1: 0, 2: 0.4} means that when a sequence is selected for a modification (as specified in hamming_distance_probabilities), then 60% of the time the amino acid at index 0 is modified, and the remaining 40% of the time the amino acid at index 2. If the values of position_weights do not sum to 1, the remainder will be redistributed over all positions, including those not specified.
alphabet_weights (dict): A dictionary describing the relative probabilities of choosing each amino acid for hamming distance modification. The keys of the dictionary represent the amino acids and the values are the relative probabilities for choosing this amino acid. If the values of alphabet_weights do not sum to 1, the remainder will be redistributed over all possible amino acids, including those not specified.
YAML specification:
definitions:
  motifs:
    # examples for single chain receptor data
    my_simple_motif: # this will be the identifier of the motif
      seed: AAA # motif is always AAA
    my_gapped_motif:
      seed: AA/A # this motif can be AAA, AA_A, CAA, CA_A, DAA, DA_A, EAA, EA_A
      min_gap: 0
      max_gap: 1
      hamming_distance_probabilities: # it can have a max of 1 substitution
        0: 0.7
        1: 0.3
      position_weights: # note that index 2, the position of the gap, is excluded from position_weights
        0: 1 # only first position can be changed
        1: 0
        3: 0
      alphabet_weights: # the first A can be replaced by C, D or E
        C: 0.4
        D: 0.4
        E: 0.2
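To illustrate how the SeedMotif parameters interact, here is a minimal sketch that draws one motif instance from a seed: it samples the number of modifications from hamming_distance_probabilities, picks the positions from position_weights, picks replacement amino acids from alphabet_weights, and finally samples the gap length. This is not immuneML's implementation. The function name is hypothetical, gaps are rendered as "_" following the notation in the YAML comment, and the redistribution of weights that do not sum to 1 is deliberately left out.

```python
import random


def instantiate_motif(seed, min_gap, max_gap, hamming_distance_probabilities,
                      position_weights, alphabet_weights, rng):
    """Draw one motif instance from a seed such as 'AA/A' ('/' marks the gap)."""
    # 1. sample how many positions are modified
    distances = list(hamming_distance_probabilities)
    probabilities = [hamming_distance_probabilities[d] for d in distances]
    n_changes = rng.choices(distances, weights=probabilities)[0]

    # 2. sample the positions and the replacement amino acids
    letters = list(seed)
    candidates = {pos: w for pos, w in position_weights.items()
                  if w > 0 and letters[pos] != "/"}
    for _ in range(min(n_changes, len(candidates))):
        pos = rng.choices(list(candidates), weights=list(candidates.values()))[0]
        del candidates[pos]  # each position is modified at most once
        options = {aa: w for aa, w in alphabet_weights.items() if aa != letters[pos]}
        letters[pos] = rng.choices(list(options), weights=list(options.values()))[0]

    # 3. sample the gap length (inclusive bounds) and fill in the gap
    gap = "_" * rng.randint(min_gap, max_gap)
    return "".join(letters).replace("/", gap)


rng = random.Random(0)
instances = {
    instantiate_motif(
        seed="AA/A", min_gap=0, max_gap=1,
        hamming_distance_probabilities={0: 0.7, 1: 0.3},
        position_weights={0: 1, 1: 0, 3: 0},
        alphabet_weights={"C": 0.4, "D": 0.4, "E": 0.2},
        rng=rng,
    )
    for _ in range(200)
}
```

With the parameters of my_gapped_motif, every instance drawn this way is one of the eight variants listed in the YAML comment.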
PWM¶
Motifs defined by a positional weight matrix and using bionumpy’s PWM internally. For more details on bionumpy’s implementation of PWM, as well as for supported formats, see the documentation at https://bionumpy.github.io/bionumpy/tutorials/position_weight_matrix.html.
Specification arguments:
file_path: path to the file where the PWM is stored
threshold (float): when matching PWM to a sequence, this is the threshold to consider the sequence as containing the motif
YAML specification:
definitions:
  motifs:
    my_custom_pwm: # this will be the identifier of the motif
      file_path: my_pwm_1.csv
      threshold: 2
Signals¶
A signal represents a collection of motifs, and optionally, position weights showing where one of the motifs of the signal can occur in a sequence.
The signals are defined under definitions/signals.
A signal is associated with a metadata label, which is assigned to a receptor or repertoire. For example antigen-specific/disease-associated (receptor) or diseased (repertoire).
Note
IMGT positions
To use sequence position weights, IMGT positions should be explicitly specified as strings, under quotation marks, to allow for all positions to be properly distinguished.
Specification arguments:
motifs (list): A list of the motifs associated with this signal, either defined by seed or by position weight matrix. Alternatively, it can be a list of a list of motifs, in which case the motifs in the same sublist (max 2 motifs) have to co-occur in the same sequence
sequence_position_weights (dict): a dictionary specifying for each IMGT position in the sequence how likely it is for the signal to be there. If the position is not present in the sequence, the probability of the signal occurring at that position will be redistributed to other positions with probabilities that are not explicitly set to 0 by the user.
v_call (str): V gene with allele if available that has to co-occur with one of the motifs for the signal to exist; can be used in combination with rejection sampling, or full sequence implanting, otherwise ignored; to match in a sequence for rejection sampling, it is checked if this value is contained in the same field of generated sequence;
j_call (str): J gene with allele if available that has to co-occur with one of the motifs for the signal to exist; can be used in combination with rejection sampling, or full sequence implanting, otherwise ignored; to match in a sequence for rejection sampling, it is checked if this value is contained in the same field of generated sequence;
source_file (str): path to the file where the custom signal function is; cannot be combined with the arguments listed above (motifs, v_call, j_call, sequence_position_weights)
is_present_func (str): name of the function from the source_file file that will be used to specify the signal; the function’s signature must be:
def is_present(sequence_aa: str, sequence: str, v_call: str, j_call: str) -> bool:
    # custom implementation where all or some of these arguments can be used
clonal_frequency (dict): clonal frequency in LIgO is simulated through scipy’s zeta distribution function for generating random numbers, with parameters provided under the clonal_frequency parameter. If clonal frequency should not be used, this parameter can be None.
clonal_frequency:
  a: 2 # shape parameter of the distribution
  loc: 0 # 0 by default but can be used to shift the distribution
YAML specification:
definitions:
  signals:
    my_signal:
      motifs:
        - my_simple_motif
        - my_gapped_motif
      sequence_position_weights:
        '109': 0.5
        '110': 0.5
      v_call: TRBV1
      j_call: TRBJ1
      clonal_frequency:
        a: 2
        loc: 0
    signal_with_custom_func:
      source_file: signal_func.py
      is_present_func: is_signal_present
      clonal_frequency:
        a: 2
        loc: 0
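For signal_with_custom_func, the file signal_func.py has to contain a function with the signature given above. A minimal sketch of what such a file might contain is shown here; the rule itself (the motif GTD combined with a specific V gene) is purely hypothetical and only illustrates how the four arguments can be used.

```python
def is_signal_present(sequence_aa: str, sequence: str, v_call: str, j_call: str) -> bool:
    """Hypothetical rule: 'GTD' in the amino acid sequence together with the TRBV20-1 gene."""
    gene = v_call.split("*")[0]  # drop the allele, if present (e.g. TRBV20-1*01)
    return "GTD" in sequence_aa and gene == "TRBV20-1"


# example calls with hypothetical receptor data; sequence and j_call are not used by this rule
with_signal = is_signal_present("CASSLGTDTQYF", "", "TRBV20-1*01", "TRBJ2-3*01")
without_signal = is_signal_present("CASSLGTDTQYF", "", "TRBV2*01", "TRBJ2-3*01")
```

The function name matches the is_present_func value in the YAML example, and the function returns a bool for every sequence it is given.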
Simulation config¶
The simulation config defines all parameters of the simulation. It can contain one or more simulation config items, which define groups of repertoires or receptors that have the same simulation parameters, such as signals, generative model, clonal frequencies, and noise parameters.
Specification arguments:
sim_items (dict): a dictionary of named SimConfigItems, each defining an individual unit of simulation
is_repertoire (bool): whether the simulation is on a repertoire (person) or sequence/receptor level
paired: if the simulation should output paired data, this parameter should contain a list of a list of sim_item pairs referenced by name that should be combined; if paired data is not needed, then it should be False
sequence_type (str): either amino_acid or nucleotide
simulation_strategy (str): either RejectionSampling or Implanting, see the tutorials for more information on choosing one of these
keep_p_gen_dist (bool): if possible, whether to keep the distribution of generation probabilities of the sequences the same as provided by the model without any signals
p_gen_bin_count (int): if keep_p_gen_dist is true, how many bins to use to approximate the generation probability distribution
remove_seqs_with_signals (bool): if true, it explicitly controls the proportions of signals in sequences and removes any accidental occurrences
species (str): species that the sequences come from; used to select correct genes to export full length sequences; default is ‘human’
implanting_scaling_factor (int): determines in how many receptors to implant the signal in each iteration; this is computed as number_of_receptors_needed_for_signal * implanting_scaling_factor; useful when using the Implanting simulation strategy in combination with importance sampling, since the generation probability of some receptors with implanted signals might be very low and those receptors might often not be kept under importance sampling; this parameter is only used when keep_p_gen_dist is set to True
YAML specification:
definitions:
  simulations:
    sim1:
      is_repertoire: false
      paired: false
      sequence_type: amino_acid
      simulation_strategy: RejectionSampling
      sim_items:
        sim_item1: # group of sequences with same simulation params
          generative_model:
            chain: beta
            default_model_name: humanTRB
            model_path: null
            type: OLGA
          number_of_examples: 100
          seed: 1002
          signals:
            signal1: 1
Simulation config item¶
When performing a simulation, one or more simulation config items can be specified. Config items define groups of repertoires or receptors that have the same simulation parameters, such as signals, generative model, clonal frequencies, noise parameters.
Specification arguments:
signals (dict): signals for the simulation item and the proportion of sequences in the repertoire that will have the given signal. For receptor-level simulation, the proportion will always be 1.
is_noise (bool): indicates whether the implanting should be regarded as noise; if it is True, the signals will be implanted as specified, but the repertoire/receptor in question will have negative class.
generative_model: parameters of the generative model, including its type, path to the model; currently supported models are OLGA and ExperimentalImport
seed (int): starting random seed for the generative model (it should differ across simulation items, or it can be set to null when not used)
false_positives_prob_in_receptors (float): when performing repertoire level simulation, what percentage of sequences should be false positives
false_negative_prob_in_receptors (float): when performing repertoire level simulation, what percentage of sequences should be false negatives
immune_events (dict): a set of key-value pairs that will be added to the metadata (same values for all data generated in one simulation sim_item) and can be later used as labels
default_clonal_frequency (dict): clonal frequency in LIgO is simulated through scipy’s zeta distribution function for generating random numbers, with parameters provided under the default_clonal_frequency parameter. These parameters will be used to assign count values to sequences that do not contain any signals if they are required by the simulation. If clonal frequency should not be used, this parameter can be None.
default_clonal_frequency:
  a: 2 # shape parameter of the distribution
  loc: 0 # 0 by default but can be used to shift the distribution
sequence_len_limits (dict): allows for filtering the generated sequences by length, needs to have parameters min and max specified; if not used, min/max should be -1
sequence_len_limits:
  min: 4 # keep sequences of length 4 and longer
  max: -1 # no limit on the max length of the sequences
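For the signals argument described above, the proportions translate into receptor counts per repertoire. The sketch below shows the arithmetic under two assumptions that are not confirmed by this text: that the entries are treated as mutually exclusive groups (a receptor with both signals is counted only under the combined entry), and that counts are obtained by rounding. The function name and the numbers are illustrative.

```python
def receptors_per_signal(receptors_in_repertoire_count, signals):
    """Translate signal proportions into receptor counts for one repertoire."""
    counts = {name: round(proportion * receptors_in_repertoire_count)
              for name, proportion in signals.items()}
    # the remaining receptors carry none of the listed signals
    counts["no_signal"] = receptors_in_repertoire_count - sum(counts.values())
    return counts


# hypothetical item: 100 receptors per repertoire, two signals and their co-occurrence
counts = receptors_per_signal(
    100, {"my_signal": 0.25, "my_signal2": 0.01, "my_signal__my_signal2": 0.02}
)
```

Under these assumptions, a repertoire of 100 receptors would contain 25 receptors with my_signal, 1 with my_signal2, 2 with both, and 72 without any signal.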
YAML specification:
definitions:
  simulations: # definitions of simulations should be under key simulations in the definitions part of the specification
    # one simulation with multiple implanting objects, a part of definition section
    my_simulation:
      sim_item1:
        number_of_examples: 10
        seed: null # don't use seed
        receptors_in_repertoire_count: 100
        generative_model:
          chain: beta
          default_model_name: humanTRB
          model_path: null
          type: OLGA
        signals:
          my_signal: 0.25
          my_signal2: 0.01
          my_signal__my_signal2: 0.02 # my_signal and my_signal2 will co-occur in 2% of the receptors in all 10 repertoires
      sim_item2:
        number_of_examples: 5
        receptors_in_repertoire_count: 150
        seed: 10
        generative_model:
          chain: beta
          default_model_name: humanTRB
          model_path: null
          type: OLGA
        signals:
          my_signal: 0.75
        default_clonal_frequency:
          a: 2
        sequence_len_limits:
          min: 3