immuneML.encodings.distance_encoding package¶
Submodules¶
immuneML.encodings.distance_encoding.CompAIRRDistanceEncoder module¶
- class immuneML.encodings.distance_encoding.CompAIRRDistanceEncoder.CompAIRRDistanceEncoder(compairr_path: Path, keep_compairr_input: bool, differences: int, indels: bool, ignore_counts: bool, ignore_genes: bool, threads: int, context: dict = None, name: str = None)[source]¶
Bases:
DatasetEncoder
Encodes a given RepertoireDataset as a distance matrix, using the Morisita-Horn distance metric. Internally, CompAIRR is used for fast calculation of overlap between repertoires. This creates a pairwise distance matrix between each of the repertoires. The distance is calculated based on the number of matching receptor chain sequences between the repertoires. This matching may be defined to permit 1 or 2 mismatching amino acid/nucleotide positions and 1 indel in the sequence. Furthermore, matching may or may not include V and J gene information, and sequence frequencies may be included or ignored.
When mismatches (differences and indels) are allowed, the Morisita-Horn similarity may exceed 1. In this case, the Morisita-Horn distance (= 1 - similarity) would be negative, so it is set to 0 to avoid negative distance scores.
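This clamping can be illustrated with a minimal sketch (not the CompAIRR-backed implementation; the function name and the count-vector representation are illustrative):

    import numpy as np

    def morisita_horn_distance(x: np.ndarray, y: np.ndarray) -> float:
        # x and y are clone count vectors aligned over the union of the
        # sequences in both repertoires (0 where a sequence is absent)
        total_x, total_y = x.sum(), y.sum()
        dx = (x * x).sum() / (total_x * total_x)  # Simpson-like concentration index
        dy = (y * y).sum() / (total_y * total_y)
        similarity = 2 * (x * y).sum() / ((dx + dy) * total_x * total_y)
        # with exact matching, similarity lies in [0, 1]; with mismatches
        # allowed it may exceed 1, so the distance is clamped at 0
        return max(0.0, 1.0 - similarity)

    counts = np.array([5.0, 3.0, 2.0])
    print(morisita_horn_distance(counts, counts))  # 0.0 for identical repertoires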
Dataset type:
RepertoireDatasets
Specification arguments:
compairr_path (Path): optional path to the CompAIRR executable. If not given, it is assumed that CompAIRR has been installed such that it can be called directly on the command line with the command ‘compairr’, or that it is located at /usr/local/bin/compairr.
keep_compairr_input (bool): whether to keep the input file that was passed to CompAIRR. This may take a lot of storage space if the input dataset is large. By default, the input file is not kept.
differences (int): Number of differences allowed between the sequences of two immune receptor chains, this may be between 0 and 2. By default, differences is 0.
indels (bool): Whether to allow an indel. This is only possible if differences is 1. By default, indels is False.
ignore_counts (bool): Whether to ignore the frequencies of the immune receptor chains. If False, frequencies will be included, meaning the ‘counts’ values for the receptors available in two repertoires are multiplied. If True, only the number of unique overlapping immune receptors (‘clones’) is considered. By default, ignore_counts is False.
ignore_genes (bool): Whether to ignore V and J gene information. If False, the V and J genes between two receptor chains have to match. If True, gene information is ignored. By default, ignore_genes is False.
threads (int): The number of threads to use for parallelization. Default is 8.
YAML specification:
definitions:
    encodings:
        my_distance_encoder:
            CompAIRRDistance:
                compairr_path: optional/path/to/compairr
                differences: 0
                indels: False
                ignore_counts: False
                ignore_genes: False
- INPUT_FILENAME = 'compairr_input.tsv'¶
- LOG_FILENAME = 'compairr_log.txt'¶
- OUTPUT_FILENAME = 'compairr_results.txt'¶
- build_distance_matrix(dataset: RepertoireDataset, params: EncoderParams, train_repertoire_ids: list)[source]¶
- build_labels(dataset: RepertoireDataset, params: EncoderParams) → dict [source]¶
- static build_object(dataset, **params)[source]¶
Creates an instance of the relevant subclass of the DatasetEncoder class using the given parameters. This method will be called during parsing time (early in the immuneML run), such that parameters and dataset type can be tested here.
The build_object method should do the following:
- Check parameters: immuneML should crash if wrong user parameters are specified. The ParameterValidator utility class may be used for parameter testing.
- Check the dataset type: immuneML should crash if the wrong dataset type is specified for this encoder. For example, DeepRCEncoder should only work for RepertoireDatasets and crash if the dataset is of another type.
- Create an instance of the correct Encoder class, using the given parameters, and return this object. Some encoders have different subclasses depending on the dataset type. Make sure to return an instance of the correct subclass. For instance, KmerFrequencyEncoder has different subclasses for each dataset type: when the dataset is a RepertoireDataset, KmerFreqRepertoireEncoder should be returned.
- Parameters:
**params – keyword arguments that will be provided by users in the specification (if immuneML is used as a command line tool) or in the dictionary when calling the method from the code, and which should be used to create the Encoder object
- Returns:
the object of the appropriate Encoder class
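A minimal sketch of the three steps above, using an illustrative encoder class and plain assertions in place of immuneML's ParameterValidator (names are illustrative, not the encoder's actual validation code):

    class ExampleDistanceEncoder:
        # illustrative encoder used only to demonstrate the build_object contract

        def __init__(self, differences: int = 0, indels: bool = False, name: str = None):
            self.differences = differences
            self.indels = indels
            self.name = name

        @staticmethod
        def build_object(dataset, **params):
            # step 1 - check parameters: crash early on invalid user input
            assert params.get("differences", 0) in (0, 1, 2), "differences must be 0, 1 or 2"
            if params.get("indels", False):
                assert params.get("differences") == 1, "indels requires differences == 1"
            # step 2 - check the dataset type: only repertoire data is supported
            assert type(dataset).__name__ == "RepertoireDataset", \
                f"{type(dataset).__name__} is not supported by this encoder"
            # step 3 - create and return an instance of the correct encoder class
            return ExampleDistanceEncoder(**params)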
- encode(dataset: RepertoireDataset, params: EncoderParams) → RepertoireDataset [source]¶
This is the main encoding method of the Encoder. It takes in a given dataset, computes an EncodedData object, and returns a copy of the dataset with the attached EncodedData object.
- Parameters:
dataset – A dataset object (Sequence, Receptor or RepertoireDataset)
params – An EncoderParams object containing a few utility parameters which may be used during encoding (e.g., the number of parallel processes to use).
- Returns:
A copy of the original dataset, with an EncodedData object added to the dataset.encoded_data field.
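In outline, the method computes the encoded data and attaches it to a copy of the dataset. The sketch below uses a stand-in dataclass instead of immuneML's EncodedData container, and the accessor and field names are assumptions:

    import copy
    from dataclasses import dataclass

    @dataclass
    class EncodedDataStub:
        # stand-in for immuneML's EncodedData container (illustration only)
        examples: object         # here: the pairwise distance matrix
        labels: dict = None
        encoding: str = None

    def encode(self, dataset, params):
        ids = dataset.get_repertoire_ids()               # assumed accessor
        matrix = self.build_distance_matrix(dataset, params, ids)
        labels = self.build_labels(dataset, params)
        encoded_dataset = copy.deepcopy(dataset)         # the input is not mutated
        encoded_dataset.encoded_data = EncodedDataStub(
            examples=matrix, labels=labels, encoding="CompAIRRDistanceEncoder")
        return encoded_dataset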
- set_context(context: dict)[source]¶
This method can be used to attach the full dataset (as part of a dictionary), as opposed to the dataset which is passed to the .encode() method. When training ML models, that data split is usually a training/validation subset of the total dataset.
In most cases, an encoder should only use the ‘dataset’ argument passed to the .encode() method to compute the encoded data. Using information from the full dataset, which includes the test data, may result in data leakage. For example, some encoders normalise the computed feature values (e.g., KmerFrequencyEncoder). Such normalised feature values should be based only on the current data split, and test data should remain unseen.
To avoid confusion about which version of the dataset to use, the full dataset is by default not attached, and attaching the full dataset should be done explicitly when required. For instance, if the encoded data is some kind of distance matrix (e.g., DistanceEncoder), the distance between examples in the training and test dataset should be included. Note that this does not entail data leakage: the test examples are not used to improve the computation of distances. The distances to test examples are determined by an algorithm which does not ‘learn’ from test data.
To explicitly enable using the full dataset in the encoder, the contents of this method should be as follows:
self.context = context
return self
- Parameters:
context – a dictionary containing the full dataset
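Once attached, an encoder might consume the context along these lines (a sketch; the ‘dataset’ key and the helper calls are assumptions based on the description above):

    def encode(self, dataset, params):
        # prefer the full dataset when it was attached via set_context,
        # otherwise fall back to the current training/validation split
        full_dataset = self.context["dataset"] if self.context else dataset
        # distances are computed against all repertoires in the full dataset,
        # so train and test examples share one consistent distance space
        reference_ids = full_dataset.get_repertoire_ids()
        matrix = self.build_distance_matrix(dataset, params, reference_ids)
        # ... the matrix is then attached to a copy of the dataset, as sketched
        # for the encode() method above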
immuneML.encodings.distance_encoding.DistanceEncoder module¶
- class immuneML.encodings.distance_encoding.DistanceEncoder.DistanceEncoder(distance_metric: DistanceMetricType, attributes_to_match: list, sequence_batch_size: int, context: dict = None, name: str = None)[source]¶
Bases:
DatasetEncoder
Encodes a given RepertoireDataset as a distance matrix, where the pairwise distance between each of the repertoires is calculated. The distance is calculated based on the presence/absence of elements defined under attributes_to_match. Thus, if attributes_to_match contains only ‘sequence_aas’, the distance between two repertoires is minimal if they contain the same set of sequence_aas, and maximal if none of the sequence_aas are shared between the two repertoires.
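The presence/absence logic can be illustrated with a minimal, self-contained sketch in which each repertoire is reduced to a set of attribute tuples (names and example values are illustrative):

    def jaccard_distance(rep_a: set, rep_b: set) -> float:
        # Jaccard distance: 1 - |intersection| / |union| of the attribute tuples
        union = rep_a | rep_b
        if not union:
            return 0.0
        return 1.0 - len(rep_a & rep_b) / len(union)

    # with attributes_to_match: [cdr3_aa, v_call, j_call], each repertoire is
    # reduced to a set of (cdr3_aa, v_call, j_call) tuples
    rep1 = {("CASSLGTDTQYF", "TRBV7-9", "TRBJ2-3"),
            ("CASSIRSSYEQYF", "TRBV19", "TRBJ2-7")}
    rep2 = {("CASSLGTDTQYF", "TRBV7-9", "TRBJ2-3")}
    print(jaccard_distance(rep1, rep2))  # 0.5: one of the two unique tuples is shared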
Specification arguments:
distance_metric (DistanceMetricType): The metric used to calculate the distance between two repertoires. Names of different distance metric types are allowed values in the specification. The default distance metric is JACCARD (inverse Jaccard).
sequence_batch_size (int): The number of sequences to be processed at once. Increasing this number increases memory use. The default value is 1000.
attributes_to_match (list): The attributes to consider when determining whether a sequence is present in both repertoires. Only the fields defined under attributes_to_match will be considered, all other fields are ignored. Valid values include any repertoire attribute as defined in AIRR rearrangement schema (cdr3_aa, v_call, j_call, etc).
YAML specification:
definitions:
    encodings:
        my_distance_encoder:
            Distance:
                distance_metric: JACCARD
                sequence_batch_size: 1000
                attributes_to_match:
                    - cdr3_aa
                    - v_call
                    - j_call
- build_distance_matrix(dataset: RepertoireDataset, params: EncoderParams, train_repertoire_ids: list)[source]¶
- build_labels(dataset: RepertoireDataset, params: EncoderParams) → dict [source]¶
- static build_object(dataset, **params)[source]¶
Creates an instance of the relevant subclass of the DatasetEncoder class using the given parameters. This method will be called during parsing time (early in the immuneML run), such that parameters and dataset type can be tested here.
The build_object method should do the following:
- Check parameters: immuneML should crash if wrong user parameters are specified. The ParameterValidator utility class may be used for parameter testing.
- Check the dataset type: immuneML should crash if the wrong dataset type is specified for this encoder. For example, DeepRCEncoder should only work for RepertoireDatasets and crash if the dataset is of another type.
- Create an instance of the correct Encoder class, using the given parameters, and return this object. Some encoders have different subclasses depending on the dataset type. Make sure to return an instance of the correct subclass. For instance, KmerFrequencyEncoder has different subclasses for each dataset type: when the dataset is a RepertoireDataset, KmerFreqRepertoireEncoder should be returned.
- Parameters:
**params – keyword arguments that will be provided by users in the specification (if immuneML is used as a command line tool) or in the dictionary when calling the method from the code, and which should be used to create the Encoder object
- Returns:
the object of the appropriate Encoder class
- encode(dataset, params: EncoderParams) → RepertoireDataset [source]¶
This is the main encoding method of the Encoder. It takes in a given dataset, computes an EncodedData object, and returns a copy of the dataset with the attached EncodedData object.
- Parameters:
dataset – A dataset object (Sequence, Receptor or RepertoireDataset)
params – An EncoderParams object containing a few utility parameters which may be used during encoding (e.g., the number of parallel processes to use).
- Returns:
A copy of the original dataset, with an EncodedData object added to the dataset.encoded_data field.
- static load_encoder(encoder_file: Path)[source]¶
The load_encoder method can load the encoder given the folder where the same class of the model was previously stored using the store function. Encoders are stored in pickle format. If the encoder uses additional files, they should be explicitly loaded here as well.
If there are no additional files, this method does not need to be overwritten. If there are additional files, its contents should be as follows:
encoder = DatasetEncoder.load_encoder(encoder_file)
encoder.my_additional_file = DatasetEncoder.load_attribute(encoder, encoder_file, "my_additional_file")
- Parameters:
encoder_file (Path) – path to the encoder file where the encoder was stored using store() function
- Returns:
the loaded Encoder object
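For an encoder with one additional file, the full override would look roughly as follows (my_additional_file is the hypothetical attribute from the snippet above; the import path reflects the immuneML code base):

    from pathlib import Path
    from immuneML.encodings.DatasetEncoder import DatasetEncoder

    # inside the encoder class:
    @staticmethod
    def load_encoder(encoder_file: Path):
        encoder = DatasetEncoder.load_encoder(encoder_file)
        encoder.my_additional_file = DatasetEncoder.load_attribute(
            encoder, encoder_file, "my_additional_file")
        return encoder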
- set_context(context: dict)[source]¶
This method can be used to attach the full dataset (as part of a dictionary), as opposed to the dataset which is passed to the .encode() method. When training ML models, that data split is usually a training/validation subset of the total dataset.
In most cases, an encoder should only use the ‘dataset’ argument passed to the .encode() method to compute the encoded data. Using information from the full dataset, which includes the test data, may result in data leakage. For example, some encoders normalise the computed feature values (e.g., KmerFrequencyEncoder). Such normalised feature values should be based only on the current data split, and test data should remain unseen.
To avoid confusion about which version of the dataset to use, the full dataset is by default not attached, and attaching the full dataset should be done explicitly when required. For instance, if the encoded data is some kind of distance matrix (e.g., DistanceEncoder), the distance between examples in the training and test dataset should be included. Note that this does not entail data leakage: the test examples are not used to improve the computation of distances. The distances to test examples are determined by an algorithm which does not ‘learn’ from test data.
To explicitly enable using the full dataset in the encoder, the contents of this method should be as follows:
self.context = context
return self
- Parameters:
context – a dictionary containing the full dataset
immuneML.encodings.distance_encoding.DistanceMetricType module¶
immuneML.encodings.distance_encoding.TCRdistEncoder module¶
- class immuneML.encodings.distance_encoding.TCRdistEncoder.TCRdistEncoder(cores: int, name: str = None)[source]¶
Bases:
DatasetEncoder
Encodes the given ReceptorDataset as a distance matrix between all receptors, where the distance is computed using TCRdist from the paper: Dash P, Fiore-Gartland AJ, Hertz T, et al. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature. 2017; 547(7661):89-93. doi:10.1038/nature22383.
For the implementation, the tcrdist3 library was used.
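A minimal stand-alone use of tcrdist3 looks roughly like the sketch below; the TCRrep interface and column names follow the tcrdist3 documentation, though how this encoder invokes the library internally may differ:

    import pandas as pd
    from tcrdist.repertoire import TCRrep  # provided by the tcrdist3 package

    # beta-chain receptors; tcrdist3 expects IMGT gene names with allele suffixes
    cell_df = pd.DataFrame({
        "cdr3_b_aa": ["CASSLGTDTQYF", "CASSIRSSYEQYF"],
        "v_b_gene":  ["TRBV7-9*01", "TRBV19*01"],
        "j_b_gene":  ["TRBJ2-3*01", "TRBJ2-7*01"],
        "count":     [1, 1],
    })
    tr = TCRrep(cell_df=cell_df, organism="human", chains=["beta"])
    print(tr.pw_beta)  # pairwise TCRdist matrix between the two receptors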
Dataset type:
ReceptorDatasets
Specification arguments:
cores (int): number of processes to use for the computation
YAML specification:
definitions:
    encodings:
        my_tcr_dist_enc:
            TCRdist:
                cores: 4
- static build_object(dataset, **params)[source]¶
Creates an instance of the relevant subclass of the DatasetEncoder class using the given parameters. This method will be called during parsing time (early in the immuneML run), such that parameters and dataset type can be tested here.
The build_object method should do the following:
- Check parameters: immuneML should crash if wrong user parameters are specified. The ParameterValidator utility class may be used for parameter testing.
- Check the dataset type: immuneML should crash if the wrong dataset type is specified for this encoder. For example, DeepRCEncoder should only work for RepertoireDatasets and crash if the dataset is of another type.
- Create an instance of the correct Encoder class, using the given parameters, and return this object. Some encoders have different subclasses depending on the dataset type. Make sure to return an instance of the correct subclass. For instance, KmerFrequencyEncoder has different subclasses for each dataset type: when the dataset is a RepertoireDataset, KmerFreqRepertoireEncoder should be returned.
- Parameters:
**params – keyword arguments that will be provided by users in the specification (if immuneML is used as a command line tool) or in the dictionary when calling the method from the code, and which should be used to create the Encoder object
- Returns:
the object of the appropriate Encoder class
- encode(dataset, params: EncoderParams)[source]¶
This is the main encoding method of the Encoder. It takes in a given dataset, computes an EncodedData object, and returns a copy of the dataset with the attached EncodedData object.
- Parameters:
dataset – A dataset object (Sequence, Receptor or RepertoireDataset)
params – An EncoderParams object containing a few utility parameters which may be used during encoding (e.g., the number of parallel processes to use).
- Returns:
A copy of the original dataset, with an EncodedData object added to the dataset.encoded_data field.
- set_context(context: dict)[source]¶
This method can be used to attach the full dataset (as part of a dictionary), as opposed to the dataset which is passed to the .encode() method. When training ML models, that data split is usually a training/validation subset of the total dataset.
In most cases, an encoder should only use the ‘dataset’ argument passed to the .encode() method to compute the encoded data. Using information from the full dataset, which includes the test data, may result in data leakage. For example, some encoders normalise the computed feature values (e.g., KmerFrequencyEncoder). Such normalised feature values should be based only on the current data split, and test data should remain unseen.
To avoid confusion about which version of the dataset to use, the full dataset is by default not attached, and attaching the full dataset should be done explicitly when required. For instance, if the encoded data is some kind of distance matrix (e.g., DistanceEncoder), the distance between examples in the training and test dataset should be included. Note that this does not entail data leakage: the test examples are not used to improve the computation of distances. The distances to test examples are determined by an algorithm which does not ‘learn’ from test data.
To explicitly enable using the full dataset in the encoder, the contents of this method should be as follows:
self.context = context
return self
- Parameters:
context – a dictionary containing the full dataset