immuneML.encodings.abundance_encoding package¶
Submodules¶
immuneML.encodings.abundance_encoding.AbundanceEncoderHelper module¶
- class immuneML.encodings.abundance_encoding.AbundanceEncoderHelper.AbundanceEncoderHelper[source]¶
Bases:
object
- INVALID_P_VALUE = 2.0¶
- static build_abundance_matrix(sequence_presence_matrix, matrix_repertoire_ids, dataset_repertoire_ids, sequence_p_values_indices)[source]¶
- static check_is_positive_class(dataset, matrix_repertoire_ids, label_config: LabelConfiguration)[source]¶
immuneML.encodings.abundance_encoding.CompAIRRBatchIterator module¶
immuneML.encodings.abundance_encoding.CompAIRRSequenceAbundanceEncoder module¶
- class immuneML.encodings.abundance_encoding.CompAIRRSequenceAbundanceEncoder.CompAIRRSequenceAbundanceEncoder(p_value_threshold: float, compairr_path: str, sequence_batch_size: int, ignore_genes: bool, keep_temporary_files: bool, threads: int, name: str = None)[source]¶
Bases:
DatasetEncoder
This encoder works similarly to the
SequenceAbundanceEncoder
, but internally uses CompAIRR to accelerate core computations. This encoder represents the repertoires as vectors where:
the first element corresponds to the number of label-associated clonotypes
the second element is the total number of unique clonotypes
To determine which clonotypes (amino acid sequences with or without matching V/J genes) are label-associated, a one-sided Fisher’s exact test is used.
The encoder also writes out files containing the contingency table used for Fisher’s exact test, the resulting p-values, and the significantly abundant sequences (use
RelevantSequenceExporter
to export these sequences in AIRR format). Reference: Emerson, Ryan O. et al. ‘Immunosequencing Identifies Signatures of Cytomegalovirus Exposure History and HLA-Mediated Effects on the T Cell Repertoire’. Nature Genetics 49, no. 5 (May 2017): 659–65. doi.org/10.1038/ng.3822.
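The one-sided Fisher’s exact test used here can be sketched with the Python standard library alone. This is an illustrative reimplementation, not immuneML’s internal code; in practice the counts come from the sequence presence matrix computed by CompAIRR:

```python
from math import comb

def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided ('greater') Fisher's exact test p-value for the 2x2 table
    [[a, b], [c, d]], e.g.:
        a = positive-class repertoires containing the clonotype
        b = positive-class repertoires lacking it
        c = negative-class repertoires containing it
        d = negative-class repertoires lacking it
    """
    n, row1, col1 = a + b + c + d, a + b, a + c
    # Sum hypergeometric probabilities over all tables at least as extreme
    # as the observed one (k = a up to the maximum feasible count).
    return sum(
        comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)
        for k in range(a, min(row1, col1) + 1)
    )

# A clonotype present in 9/10 positive but only 1/10 negative repertoires is
# strongly label-associated; one spread evenly across classes is not.
p_assoc = fisher_exact_greater(9, 1, 1, 9)
p_even = fisher_exact_greater(5, 5, 5, 5)
```

Clonotypes with a p-value below p_value_threshold are then counted as label-associated in the first element of each repertoire vector.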
Note: to use this encoder, it is necessary to explicitly define the positive class for the label when defining the label in the instruction. With the positive class defined, it can be determined which sequences are indicative of the positive class. See Reproduction of the CMV status predictions study for an example using
SequenceAbundanceEncoder
.
Dataset type:
RepertoireDatasets
Specification arguments:
p_value_threshold (float): The p value threshold to be used by the statistical test.
compairr_path (Path): optional path to the CompAIRR executable. If not given, it is assumed that CompAIRR has been installed such that it can be called directly on the command line with the command ‘compairr’, or that it is located at /usr/local/bin/compairr.
ignore_genes (bool): Whether to ignore V and J gene information. If False, the V and J genes between two receptor chains have to match. If True, gene information is ignored. By default, ignore_genes is False.
sequence_batch_size (int): The number of sequences in a batch when comparing sequences across repertoires, typically hundreds of thousands. This does not affect the results of the encoding, but may affect the speed and memory usage. The default value is 1,000,000.
threads (int): The number of threads to use for parallelization. This does not affect the results of the encoding, only the speed. The default number of threads is 8.
keep_temporary_files (bool): whether to keep temporary files, including CompAIRR input, output and log files, and the sequence presence matrix. This may take a lot of storage space if the input dataset is large. By default, temporary files are not kept.
YAML specification:
definitions:
    encodings:
        my_sa_encoding:
            CompAIRRSequenceAbundance:
                compairr_path: optional/path/to/compairr
                p_value_threshold: 0.05
                ignore_genes: False
                threads: 8
- LOG_FILENAME = 'compairr_log.txt'¶
- OUTPUT_FILENAME = 'compairr_out.tsv'¶
- RELEVANT_SEQUENCE_ABUNDANCE = 'relevant_sequence_abundance'¶
- TOTAL_SEQUENCE_ABUNDANCE = 'total_sequence_abundance'¶
- static build_object(dataset, **params)[source]¶
Creates an instance of the relevant subclass of the DatasetEncoder class using the given parameters. This method will be called during parsing time (early in the immuneML run), such that parameters and dataset type can be tested here.
The build_object method should do the following:
Check parameters: immuneML should crash if wrong user parameters are specified. The ParameterValidator utility class may be used for parameter testing.
Check the dataset type: immuneML should crash if the wrong dataset type is specified for this encoder. For example, DeepRCEncoder should only work for RepertoireDatasets and crash if the dataset is of another type.
Create an instance of the correct Encoder class, using the given parameters. Return this object. Some encoders have different subclasses depending on the dataset type. Make sure to return an instance of the correct subclass. For instance: KmerFrequencyEncoder has different subclasses for each dataset type. When the dataset is a Repertoire dataset, KmerFreqRepertoireEncoder should be returned.
- Parameters:
**params – keyword arguments that will be provided by users in the specification (if immuneML is used as a command line tool) or in the dictionary when calling the method from the code, and which should be used to create the Encoder object
- Returns:
the object of the appropriate Encoder class
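The three steps above can be sketched as follows. MyAbundanceEncoder and the bare RepertoireDataset stand-in are hypothetical names used only for illustration, not immuneML’s actual classes:

```python
class RepertoireDataset:  # stand-in for the real dataset class
    pass

class MyAbundanceEncoder:  # hypothetical DatasetEncoder subclass
    def __init__(self, p_value_threshold: float, name: str = None):
        self.p_value_threshold = p_value_threshold
        self.name = name

    @staticmethod
    def build_object(dataset, **params):
        # 1. Check parameters: crash early on invalid user input.
        if not isinstance(params.get("p_value_threshold"), float):
            raise ValueError("p_value_threshold must be a float")
        # 2. Check the dataset type: this encoder only supports repertoire data.
        if not isinstance(dataset, RepertoireDataset):
            raise ValueError("MyAbundanceEncoder requires a RepertoireDataset")
        # 3. Return an instance of the correct encoder class for this dataset type.
        return MyAbundanceEncoder(**params)

encoder = MyAbundanceEncoder.build_object(RepertoireDataset(), p_value_threshold=0.05)
```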
- encode(dataset, params: EncoderParams)[source]¶
This is the main encoding method of the Encoder. It takes in a given dataset, computes an EncodedData object, and returns a copy of the dataset with the attached EncodedData object.
- Parameters:
dataset – A dataset object (Sequence, Receptor or RepertoireDataset)
params – An EncoderParams object containing a few utility parameters which may be used during encoding (e.g., the number of parallel processes to use).
- Returns:
A copy of the original dataset, with an EncodedData object added to the dataset.encoded_data field.
- get_additional_files() List[Path] [source]¶
Should return a list with all the files that need to be stored when storing the encoder. For example, SimilarToPositiveSequenceEncoder stores all ‘positive’ sequences in the training data, and predicts a sequence to be ‘positive’ if it is similar to any positive sequences in the training data. In that case, these positive sequences are stored in a file.
For many encoders, it may not be necessary to store additional files.
- static load_encoder(encoder_file: Path)[source]¶
The load_encoder method loads the encoder from the file where an encoder of the same class was previously stored using the store() function. Encoders are stored in pickle format. If the encoder uses additional files, they should be explicitly loaded here as well.
If there are no additional files, this method does not need to be overwritten. If there are additional files, its contents should be as follows:
encoder = DatasetEncoder.load_encoder(encoder_file)
encoder.my_additional_file = DatasetEncoder.load_attribute(encoder, encoder_file, "my_additional_file")
- Parameters:
encoder_file (Path) – path to the encoder file where the encoder was stored using the store() function
- Returns:
the loaded Encoder object
- set_context(context: dict)[source]¶
This method can be used to attach the full dataset (as part of a dictionary), as opposed to the dataset which is passed to the .encode() method. When training ML models, the dataset passed to .encode() is usually a training/validation subset of the total dataset.
In most cases, an encoder should only use the ‘dataset’ argument passed to the .encode() method to compute the encoded data. Using information from the full dataset, which includes the test data, may result in data leakage. For example, some encoders normalise the computed feature values (e.g., KmerFrequencyEncoder). Such normalised feature values should be based only on the current data split, and test data should remain unseen.
To avoid confusion about which version of the dataset to use, the full dataset is by default not attached, and attaching the full dataset should be done explicitly when required. For instance, if the encoded data is some kind of distance matrix (e.g., DistanceEncoder), the distance between examples in the training and test dataset should be included. Note that this does not entail data leakage: the test examples are not used to improve the computation of distances. The distances to test examples are determined by an algorithm which does not ‘learn’ from test data.
To explicitly enable using the full dataset in the encoder, the contents of this method should be as follows:
self.context = context
return self
- Parameters:
context – a dictionary containing the full dataset
immuneML.encodings.abundance_encoding.KmerAbundanceEncoder module¶
- class immuneML.encodings.abundance_encoding.KmerAbundanceEncoder.KmerAbundanceEncoder(p_value_threshold: float, sequence_encoding: SequenceEncodingType, k: int, k_left: int, k_right: int, min_gap: int, max_gap: int, name: str = None)[source]¶
Bases:
DatasetEncoder
This encoder is related to the
SequenceAbundanceEncoder
, but identifies label-associated subsequences (k-mers) instead of full label-associated sequences. This encoder represents the repertoires as vectors where:
the first element corresponds to the number of label-associated k-mers found in a repertoire
the second element is the total number of unique k-mers per repertoire
The label-associated k-mers are determined based on a one-sided Fisher’s exact test.
The encoder also writes out files containing the contingency table used for Fisher’s exact test, the resulting p-values, and the significantly abundant k-mers.
Note: to use this encoder, it is necessary to explicitly define the positive class for the label when defining the label in the instruction. With the positive class defined, it can be determined which sequences are indicative of the positive class. See Reproduction of the CMV status predictions study for an example using
SequenceAbundanceEncoder
.
Dataset type:
RepertoireDatasets
Specification arguments:
p_value_threshold (float): The p value threshold to be used by the statistical test.
sequence_encoding (SequenceEncodingType): The type of k-mers that are used. The simplest (default) sequence_encoding is CONTINUOUS_KMER, which uses contiguous subsequences of length k to represent the k-mers. When gapped k-mers are used (GAPPED_KMER, IMGT_GAPPED_KMER), the k-mers may contain gaps with a size between min_gap and max_gap, and the k-mer length is defined as a combination of k_left and k_right. When IMGT k-mers are used (IMGT_CONTINUOUS_KMER, IMGT_GAPPED_KMER), IMGT positional information is taken into account (i.e. the same sequence in a different position is considered to be a different k-mer).
k (int): Length of the k-mer (number of amino acids) when ungapped k-mers are used. The default value for k is 3.
k_left (int): When gapped k-mers are used, k_left indicates the length of the k-mer left of the gap. The default value for k_left is 1.
k_right (int): Same as k_left, but k_right determines the length of the k-mer right of the gap. The default value for k_right is 1.
min_gap (int): Minimum gap size when gapped k-mers are used. The default value for min_gap is 0.
max_gap (int): Maximum gap size when gapped k-mers are used. The default value for max_gap is 0.
YAML specification:
definitions:
    encodings:
        my_ka_encoding:
            KmerAbundance:
                p_value_threshold: 0.05
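The gapped and ungapped k-mer definitions above can be sketched as follows. This is an illustrative helper, not immuneML’s implementation; using a dot character to mark gap positions is an arbitrary choice here:

```python
def gapped_kmers(sequence: str, k_left: int, k_right: int,
                 min_gap: int, max_gap: int) -> set:
    """Enumerate k-mers made of k_left residues, a gap of min_gap..max_gap
    positions, then k_right residues. With min_gap = max_gap = 0, this
    yields plain contiguous k-mers of length k_left + k_right."""
    kmers = set()
    for gap in range(min_gap, max_gap + 1):
        span = k_left + gap + k_right
        for i in range(len(sequence) - span + 1):
            left = sequence[i:i + k_left]
            right = sequence[i + k_left + gap:i + span]
            kmers.add(left + "." * gap + right)
    return kmers

# With k_left = k_right = 1 and gap sizes 0..1, "CASSL" yields both the
# contiguous 2-mers and gapped pairs such as "C.S".
kmers = gapped_kmers("CASSL", k_left=1, k_right=1, min_gap=0, max_gap=1)
```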
- RELEVANT_SEQUENCE_ABUNDANCE = 'relevant_sequence_abundance'¶
- TOTAL_SEQUENCE_ABUNDANCE = 'total_sequence_abundance'¶
- static build_object(dataset, **params)[source]¶
Creates an instance of the relevant subclass of the DatasetEncoder class using the given parameters. This method will be called during parsing time (early in the immuneML run), such that parameters and dataset type can be tested here.
The build_object method should do the following:
Check parameters: immuneML should crash if wrong user parameters are specified. The ParameterValidator utility class may be used for parameter testing.
Check the dataset type: immuneML should crash if the wrong dataset type is specified for this encoder. For example, DeepRCEncoder should only work for RepertoireDatasets and crash if the dataset is of another type.
Create an instance of the correct Encoder class, using the given parameters. Return this object. Some encoders have different subclasses depending on the dataset type. Make sure to return an instance of the correct subclass. For instance: KmerFrequencyEncoder has different subclasses for each dataset type. When the dataset is a Repertoire dataset, KmerFreqRepertoireEncoder should be returned.
- Parameters:
**params – keyword arguments that will be provided by users in the specification (if immuneML is used as a command line tool) or in the dictionary when calling the method from the code, and which should be used to create the Encoder object
- Returns:
the object of the appropriate Encoder class
- encode(dataset, params: EncoderParams)[source]¶
This is the main encoding method of the Encoder. It takes in a given dataset, computes an EncodedData object, and returns a copy of the dataset with the attached EncodedData object.
- Parameters:
dataset – A dataset object (Sequence, Receptor or RepertoireDataset)
params – An EncoderParams object containing a few utility parameters which may be used during encoding (e.g., the number of parallel processes to use).
- Returns:
A copy of the original dataset, with an EncodedData object added to the dataset.encoded_data field.
- get_additional_files() List[Path] [source]¶
Should return a list with all the files that need to be stored when storing the encoder. For example, SimilarToPositiveSequenceEncoder stores all ‘positive’ sequences in the training data, and predicts a sequence to be ‘positive’ if it is similar to any positive sequences in the training data. In that case, these positive sequences are stored in a file.
For many encoders, it may not be necessary to store additional files.
- static load_encoder(encoder_file: Path)[source]¶
The load_encoder method loads the encoder from the file where an encoder of the same class was previously stored using the store() function. Encoders are stored in pickle format. If the encoder uses additional files, they should be explicitly loaded here as well.
If there are no additional files, this method does not need to be overwritten. If there are additional files, its contents should be as follows:
encoder = DatasetEncoder.load_encoder(encoder_file)
encoder.my_additional_file = DatasetEncoder.load_attribute(encoder, encoder_file, "my_additional_file")
- Parameters:
encoder_file (Path) – path to the encoder file where the encoder was stored using the store() function
- Returns:
the loaded Encoder object
- set_context(context: dict)[source]¶
This method can be used to attach the full dataset (as part of a dictionary), as opposed to the dataset which is passed to the .encode() method. When training ML models, the dataset passed to .encode() is usually a training/validation subset of the total dataset.
In most cases, an encoder should only use the ‘dataset’ argument passed to the .encode() method to compute the encoded data. Using information from the full dataset, which includes the test data, may result in data leakage. For example, some encoders normalise the computed feature values (e.g., KmerFrequencyEncoder). Such normalised feature values should be based only on the current data split, and test data should remain unseen.
To avoid confusion about which version of the dataset to use, the full dataset is by default not attached, and attaching the full dataset should be done explicitly when required. For instance, if the encoded data is some kind of distance matrix (e.g., DistanceEncoder), the distance between examples in the training and test dataset should be included. Note that this does not entail data leakage: the test examples are not used to improve the computation of distances. The distances to test examples are determined by an algorithm which does not ‘learn’ from test data.
To explicitly enable using the full dataset in the encoder, the contents of this method should be as follows:
self.context = context
return self
- Parameters:
context – a dictionary containing the full dataset
immuneML.encodings.abundance_encoding.SequenceAbundanceEncoder module¶
- class immuneML.encodings.abundance_encoding.SequenceAbundanceEncoder.SequenceAbundanceEncoder(comparison_attributes, p_value_threshold: float, sequence_batch_size: int, repertoire_batch_size: int, name: str = None)[source]¶
Bases:
DatasetEncoder
This encoder represents the repertoires as vectors where:
the first element corresponds to the number of label-associated clonotypes
the second element is the total number of unique clonotypes
To determine which clonotypes (with features defined by comparison_attributes) are label-associated, a one-sided Fisher’s exact test is used.
The encoder also writes out files containing the contingency table used for Fisher’s exact test, the resulting p-values, and the significantly abundant sequences (use
RelevantSequenceExporter
to export these sequences in AIRR format). Reference: Emerson, Ryan O. et al. ‘Immunosequencing Identifies Signatures of Cytomegalovirus Exposure History and HLA-Mediated Effects on the T Cell Repertoire’. Nature Genetics 49, no. 5 (May 2017): 659–65. doi.org/10.1038/ng.3822.
Note: to use this encoder, it is necessary to explicitly define the positive class for the label when defining the label in the instruction. With the positive class defined, it can be determined which sequences are indicative of the positive class. For a full example of using this encoder, see Reproduction of the CMV status predictions study.
Dataset type:
RepertoireDatasets
Specification arguments:
comparison_attributes (list): The attributes to be considered when grouping receptors into clonotypes. Only the fields specified in comparison_attributes are considered; all other fields are ignored. A valid comparison attribute can be any repertoire field name (e.g., as specified in the AIRR rearrangement schema).
p_value_threshold (float): The p value threshold to be used by the statistical test.
sequence_batch_size (int): The number of sequences in a batch when comparing sequences across repertoires, typically hundreds of thousands. This does not affect the results of the encoding, only the speed. The default value is 1,000,000.
repertoire_batch_size (int): How many repertoires will be loaded at once. This does not affect the result of the encoding, only the speed. This value is a trade-off between the number of repertoires that can fit in RAM at a time and the loading time from disk.
YAML specification:
definitions:
    encodings:
        my_sa_encoding:
            SequenceAbundance:
                comparison_attributes:
                    - cdr3_aa
                    - v_call
                    - j_call
                p_value_threshold: 0.05
                sequence_batch_size: 100000
                repertoire_batch_size: 32
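How comparison_attributes control clonotype grouping can be sketched as follows. The field names follow the AIRR rearrangement schema; the helper itself is an illustration, not immuneML’s implementation:

```python
from collections import defaultdict

def group_clonotypes(sequences, comparison_attributes):
    """Group sequence records into clonotypes keyed by the chosen
    attributes; all other fields are ignored."""
    clonotypes = defaultdict(list)
    for seq in sequences:
        key = tuple(seq[attr] for attr in comparison_attributes)
        clonotypes[key].append(seq)
    return clonotypes

repertoire = [
    {"cdr3_aa": "CASSLGQF", "v_call": "TRBV7-9", "j_call": "TRBJ2-3"},
    {"cdr3_aa": "CASSLGQF", "v_call": "TRBV7-9", "j_call": "TRBJ2-3"},
    {"cdr3_aa": "CASSLGQF", "v_call": "TRBV5-1", "j_call": "TRBJ2-3"},
]
# With V/J genes in the key, the TRBV5-1 record forms a distinct clonotype;
# comparing on cdr3_aa alone merges all three records into one clonotype.
by_full = group_clonotypes(repertoire, ["cdr3_aa", "v_call", "j_call"])
by_cdr3 = group_clonotypes(repertoire, ["cdr3_aa"])
```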
- RELEVANT_SEQUENCE_ABUNDANCE = 'relevant_sequence_abundance'¶
- TOTAL_SEQUENCE_ABUNDANCE = 'total_sequence_abundance'¶
- static build_object(dataset, **params)[source]¶
Creates an instance of the relevant subclass of the DatasetEncoder class using the given parameters. This method will be called during parsing time (early in the immuneML run), such that parameters and dataset type can be tested here.
The build_object method should do the following:
Check parameters: immuneML should crash if wrong user parameters are specified. The ParameterValidator utility class may be used for parameter testing.
Check the dataset type: immuneML should crash if the wrong dataset type is specified for this encoder. For example, DeepRCEncoder should only work for RepertoireDatasets and crash if the dataset is of another type.
Create an instance of the correct Encoder class, using the given parameters. Return this object. Some encoders have different subclasses depending on the dataset type. Make sure to return an instance of the correct subclass. For instance: KmerFrequencyEncoder has different subclasses for each dataset type. When the dataset is a Repertoire dataset, KmerFreqRepertoireEncoder should be returned.
- Parameters:
**params – keyword arguments that will be provided by users in the specification (if immuneML is used as a command line tool) or in the dictionary when calling the method from the code, and which should be used to create the Encoder object
- Returns:
the object of the appropriate Encoder class
- encode(dataset, params: EncoderParams)[source]¶
This is the main encoding method of the Encoder. It takes in a given dataset, computes an EncodedData object, and returns a copy of the dataset with the attached EncodedData object.
- Parameters:
dataset – A dataset object (Sequence, Receptor or RepertoireDataset)
params – An EncoderParams object containing a few utility parameters which may be used during encoding (e.g., the number of parallel processes to use).
- Returns:
A copy of the original dataset, with an EncodedData object added to the dataset.encoded_data field.
- get_additional_files() List[Path] [source]¶
Should return a list with all the files that need to be stored when storing the encoder. For example, SimilarToPositiveSequenceEncoder stores all ‘positive’ sequences in the training data, and predicts a sequence to be ‘positive’ if it is similar to any positive sequences in the training data. In that case, these positive sequences are stored in a file.
For many encoders, it may not be necessary to store additional files.
- static load_encoder(encoder_file: Path)[source]¶
The load_encoder method loads the encoder from the file where an encoder of the same class was previously stored using the store() function. Encoders are stored in pickle format. If the encoder uses additional files, they should be explicitly loaded here as well.
If there are no additional files, this method does not need to be overwritten. If there are additional files, its contents should be as follows:
encoder = DatasetEncoder.load_encoder(encoder_file)
encoder.my_additional_file = DatasetEncoder.load_attribute(encoder, encoder_file, "my_additional_file")
- Parameters:
encoder_file (Path) – path to the encoder file where the encoder was stored using the store() function
- Returns:
the loaded Encoder object
- set_context(context: dict)[source]¶
This method can be used to attach the full dataset (as part of a dictionary), as opposed to the dataset which is passed to the .encode() method. When training ML models, the dataset passed to .encode() is usually a training/validation subset of the total dataset.
In most cases, an encoder should only use the ‘dataset’ argument passed to the .encode() method to compute the encoded data. Using information from the full dataset, which includes the test data, may result in data leakage. For example, some encoders normalise the computed feature values (e.g., KmerFrequencyEncoder). Such normalised feature values should be based only on the current data split, and test data should remain unseen.
To avoid confusion about which version of the dataset to use, the full dataset is by default not attached, and attaching the full dataset should be done explicitly when required. For instance, if the encoded data is some kind of distance matrix (e.g., DistanceEncoder), the distance between examples in the training and test dataset should be included. Note that this does not entail data leakage: the test examples are not used to improve the computation of distances. The distances to test examples are determined by an algorithm which does not ‘learn’ from test data.
To explicitly enable using the full dataset in the encoder, the contents of this method should be as follows:
self.context = context
return self
- Parameters:
context – a dictionary containing the full dataset