immuneML.encodings.motif_encoding package¶
Submodules¶
immuneML.encodings.motif_encoding.MotifEncoder module¶
- class immuneML.encodings.motif_encoding.MotifEncoder.MotifEncoder(max_positions: int = None, min_positions: int = None, min_precision: float = None, min_recall: dict = None, min_true_positives: int = None, no_gaps: bool = False, candidate_motif_filepath: str = None, label: str = None, name: str = None)[source]¶
Bases:
DatasetEncoder
This encoder enumerates every possible positional motif in a sequence dataset, and keeps only the motifs associated with the positive class. A ‘motif’ is defined as a combination of position-specific amino acids. These motifs may contain one or multiple gaps. Motifs are filtered out based on a minimal precision and recall threshold for predicting the positive class.
Note: the MotifEncoder can only be used for sequences of the same length.
The ideal recall threshold(s) given a user-defined precision threshold can be calibrated using the MotifGeneralizationAnalysis report. It is recommended to first run this report in the ExploratoryAnalysisInstruction before using this encoder for ML.
This encoder can be used in combination with the BinaryFeatureClassifier in order to learn a minimal set of compatible motifs for predicting the positive class. Alternatively, it may be combined with scikit-learn methods, such as LogisticRegression, to learn a weight per motif.
Dataset type:
SequenceDatasets
Specification arguments:
max_positions (int): The maximum motif size. This is number of positional amino acids the motif consists of (excluding gaps). The default value for max_positions is 4.
min_positions (int): The minimum motif size (see also: max_positions). The default value for min_positions is 1.
no_gaps (bool): Must be set to True if only contiguous motifs (position-specific k-mers) are allowed. By default, no_gaps is False, meaning both gapped and ungapped motifs are searched for.
min_precision (float): The minimum precision threshold for keeping a motif. The default value for min_precision is 0.8.
min_recall (float): The minimum recall threshold for keeping a motif. The default value for min_recall is 0. It is also possible to specify a recall threshold for each motif size. In this case, a dictionary must be specified where the motif sizes are keys and the recall values are values. Use the MotifGeneralizationAnalysis report to calibrate the optimal recall threshold given a user-defined precision threshold to ensure generalisability to unseen data.
min_true_positives (int): The minimum number of true positive sequences that a motif needs to occur in. The default value for min_true_positives is 10.
candidate_motif_filepath (str): Optional filepath for pre-filtered candidate motifs. This may be used to save time. Only the given candidate motifs are considered. When this encoder has been run previously, a candidate motifs file named ‘all_candidate_motifs.tsv’ will have been exported. This file contains all possible motifs with high enough min_true_positives without applying precision and recall thresholds. The file must be a tab-separated file, structured as follows:
indices	amino_acids
1&2&3	A&G&C
5&7	E&D
The example above contains two motifs: A, G and C at positions 1, 2 and 3, and E and D at positions 5 and 7 (with a gap at position 6).
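As an illustration, the tab-separated candidate motif format described above can be read with a short helper. This is a hypothetical sketch, not part of the immuneML API; the function name and the returned tuple structure are assumptions for demonstration only.

```python
def parse_candidate_motifs(filepath):
    """Read a tab-separated motif file with an 'indices\tamino_acids' header.

    Each row holds '&'-separated position indices and the matching
    '&'-separated amino acids, as in the format shown above.
    Returns a list of (indices, amino_acids) tuples.
    """
    motifs = []
    with open(filepath) as file:
        header = file.readline()
        assert header == "indices\tamino_acids\n", "unexpected header"
        for line in file:
            indices_str, aas_str = line.rstrip("\n").split("\t")
            indices = tuple(int(i) for i in indices_str.split("&"))
            amino_acids = tuple(aas_str.split("&"))
            motifs.append((indices, amino_acids))
    return motifs
```

For the example file above, this would yield the motifs ((1, 2, 3), ('A', 'G', 'C')) and ((5, 7), ('E', 'D')).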
label (str): The name of the binary label to train the encoder for. This is only necessary when the dataset contains multiple labels.
YAML specification:
definitions:
  encodings:
    my_motif_encoder:
      MotifEncoder:
        max_positions: 4
        min_precision: 0.8
        min_recall: # different recall thresholds for each motif size
          1: 0.5 # for shorter motifs, a stricter recall threshold is used
          2: 0.1
          3: 0.01
          4: 0.001
        min_true_positives: 10
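The filtering rule these parameters control (keep a motif only if its precision, recall and true positive count for the positive class meet the thresholds) can be sketched in plain Python. This is an illustrative sketch under simplified assumptions, not the immuneML implementation; all names below are hypothetical.

```python
def motif_matches(sequence, indices, amino_acids):
    """True if the sequence carries the given amino acids at the given
    zero-based positions (unlisted positions act as gaps)."""
    return all(sequence[i] == aa for i, aa in zip(indices, amino_acids))

def passes_thresholds(sequences, labels, indices, amino_acids,
                      min_precision=0.8, min_recall=0.0, min_true_positives=10):
    """Apply the precision/recall/true-positive filtering described above."""
    matches = [motif_matches(seq, indices, amino_acids) for seq in sequences]
    tp = sum(m and y for m, y in zip(matches, labels))       # positive & matched
    fp = sum(m and not y for m, y in zip(matches, labels))   # negative & matched
    n_positives = sum(labels)
    precision = tp / (tp + fp) if tp + fp > 0 else 0
    recall = tp / n_positives if n_positives > 0 else 0
    return (tp >= min_true_positives
            and precision >= min_precision
            and recall >= min_recall)
```

Per-size recall thresholds, as in the YAML example, would simply select min_recall based on len(indices).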
- static build_object(dataset=None, **params)[source]¶
Creates an instance of the relevant subclass of the DatasetEncoder class using the given parameters. This method will be called during parsing time (early in the immuneML run), such that parameters and dataset type can be tested here.
The build_object method should do the following:
Check parameters: immuneML should crash if wrong user parameters are specified. The ParameterValidator utility class may be used for parameter testing.
Check the dataset type: immuneML should crash if the wrong dataset type is specified for this encoder. For example, DeepRCEncoder should only work for RepertoireDatasets and crash if the dataset is of another type.
Create an instance of the correct Encoder class, using the given parameters. Return this object. Some encoders have different subclasses depending on the dataset type. Make sure to return an instance of the correct subclass. For instance: KmerFrequencyEncoder has different subclasses for each dataset type. When the dataset is a Repertoire dataset, KmerFreqRepertoireEncoder should be returned.
- Parameters:
**params – keyword arguments that will be provided by users in the specification (if immuneML is used as a command line tool) or in the dictionary when calling the method from the code, and which should be used to create the Encoder object
- Returns:
the object of the appropriate Encoder class
- encode(dataset, params: EncoderParams)[source]¶
This is the main encoding method of the Encoder. It takes in a given dataset, computes an EncodedData object, and returns a copy of the dataset with the attached EncodedData object.
- Parameters:
dataset – A dataset object (Sequence, Receptor or RepertoireDataset)
params – An EncoderParams object containing a few utility parameters which may be used during encoding (e.g., number of parallel processes to use).
- Returns:
A copy of the original dataset, with an EncodedData object added to the dataset.encoded_data field.
- set_context(context: dict)[source]¶
This method can be used to attach the full dataset (as part of a dictionary), as opposed to the dataset which is passed to the .encode() method. When training ML models, the dataset passed to .encode() is usually a training/validation subset of the total dataset.
In most cases, an encoder should only use the ‘dataset’ argument passed to the .encode() method to compute the encoded data. Using information from the full dataset, which includes the test data, may result in data leakage. For example, some encoders normalise the computed feature values (e.g., KmerFrequencyEncoder). Such normalised feature values should be based only on the current data split, and test data should remain unseen.
To avoid confusion about which version of the dataset to use, the full dataset is by default not attached, and attaching the full dataset should be done explicitly when required. For instance, if the encoded data is some kind of distance matrix (e.g., DistanceEncoder), the distance between examples in the training and test dataset should be included. Note that this does not entail data leakage: the test examples are not used to improve the computation of distances. The distances to test examples are determined by an algorithm which does not ‘learn’ from test data.
To explicitly enable using the full dataset in the encoder, the contents of this method should be as follows:
self.context = context
return self
- Parameters:
context – a dictionary containing the full dataset
immuneML.encodings.motif_encoding.PositionalMotifHelper module¶
- class immuneML.encodings.motif_encoding.PositionalMotifHelper.PositionalMotifHelper[source]¶
Bases:
object
- static check_motif(motif, np_sequences, y_true, weights, min_precision, min_true_positives, min_recall)[source]¶
- static check_motif_filepath(motif_filepath, location, parameter_name, expected_header='indices\tamino_acids\n')[source]¶
- static compute_all_candidate_motifs(np_sequences, params: PositionalMotifParams)[source]¶
- static compute_numpy_sequence_representation(dataset, location=None)[source]¶
Computes an efficient unicode representation for SequenceDatasets where all sequences have the same length
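As a rough illustration of such a representation (an assumption for demonstration, not the actual immuneML code), equal-length sequences can be stored as a 2D NumPy array of single unicode characters, so that position-specific amino acid checks become vectorized comparisons:

```python
import numpy as np

def to_numpy_representation(sequences):
    """Convert equal-length sequences to an (n_sequences, seq_length) array
    of single characters (dtype 'U1')."""
    assert len({len(s) for s in sequences}) == 1, "sequences must have equal length"
    return np.array([list(s) for s in sequences], dtype="U1")

# Vectorized positional check: which sequences have 'A' at position 0
# and 'G' at position 1?
np_seqs = to_numpy_representation(["AGC", "AGD", "TGC"])
mask = (np_seqs[:, 0] == "A") & (np_seqs[:, 1] == "G")
```

Such a layout allows a motif to be tested against all sequences at once with boolean masks instead of per-sequence loops.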
- static extend_motif(base_motif, np_sequences, legal_positional_aas, count_threshold=10, negative_aa=False, no_gaps=False)[source]¶
- static get_generalized_motifs(motifs)[source]¶
Generalized motifs option is temporarily not in use by MotifEncoder, as there does not seem to be a clear purpose as of now.
immuneML.encodings.motif_encoding.PositionalMotifParams module¶
- class immuneML.encodings.motif_encoding.PositionalMotifParams.PositionalMotifParams(max_positions: int, min_positions: int, count_threshold: int, pool_size: int = 4, allow_negative_aas: bool = False, no_gaps: bool = False)[source]¶
Bases:
object
- allow_negative_aas: bool = False¶
- count_threshold: int¶
- max_positions: int¶
- min_positions: int¶
- no_gaps: bool = False¶
- pool_size: int = 4¶
immuneML.encodings.motif_encoding.SimilarToPositiveSequenceEncoder module¶
- class immuneML.encodings.motif_encoding.SimilarToPositiveSequenceEncoder.SimilarToPositiveSequenceEncoder(hamming_distance: int = None, compairr_path: str = None, ignore_genes: bool = None, threads: int = None, keep_temporary_files: bool = None, name: str = None)[source]¶
Bases:
DatasetEncoder
A simple baseline encoding, to be used in combination with the BinaryFeatureClassifier using keep_all = True. This encoder keeps track of all positive sequences in the training set and ignores the negative sequences. Any sequence within a given Hamming distance of a positive training sequence will be classified positive; all other sequences will be classified negative.
Dataset type:
SequenceDatasets
Specification arguments:
hamming_distance (int): Maximum number of differences allowed between any positive sequence of the training set and a new observed sequence in order for the observed sequence to be classified as ‘positive’.
compairr_path (Path): optional path to the CompAIRR executable. If not given, it is assumed that CompAIRR has been installed such that it can be called directly on the command line with the command ‘compairr’, or that it is located at /usr/local/bin/compairr.
ignore_genes (bool): Only used when compairr is used. Whether to ignore V and J gene information. If False, the V and J genes between two sequences have to match for the sequence to be considered ‘similar’. If True, gene information is ignored. By default, ignore_genes is False.
threads (int): The number of threads to use for parallelization. This does not affect the results of the encoding, only the speed. The default number of threads is 8.
keep_temporary_files (bool): Whether to keep temporary files, including CompAIRR input, output and log files, and the sequence presence matrix. This may take a lot of storage space if the input dataset is large. By default, temporary files are not kept.
YAML specification:
definitions:
  encodings:
    my_sequence_encoder:
      SimilarToPositiveSequenceEncoder:
        hamming_distance: 2
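The classification rule described above can be sketched in plain Python. This is a minimal sketch assuming equal-length sequences and ignoring gene matching; the actual encoder may delegate the distance computation to CompAIRR for speed, and the function names below are hypothetical.

```python
def hamming_distance(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b))

def is_similar_to_positive(sequence, positive_train_sequences, max_distance=2):
    """Classify a sequence as positive if it lies within max_distance of
    any positive training sequence, as the encoder's rule describes."""
    return any(hamming_distance(sequence, pos) <= max_distance
               for pos in positive_train_sequences)
```

With hamming_distance: 2 as in the YAML example, a sequence differing from some positive training sequence in at most two positions would be classified positive.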
- static build_object(dataset=None, **params)[source]¶
Creates an instance of the relevant subclass of the DatasetEncoder class using the given parameters. This method will be called during parsing time (early in the immuneML run), such that parameters and dataset type can be tested here.
The build_object method should do the following:
Check parameters: immuneML should crash if wrong user parameters are specified. The ParameterValidator utility class may be used for parameter testing.
Check the dataset type: immuneML should crash if the wrong dataset type is specified for this encoder. For example, DeepRCEncoder should only work for RepertoireDatasets and crash if the dataset is of another type.
Create an instance of the correct Encoder class, using the given parameters. Return this object. Some encoders have different subclasses depending on the dataset type. Make sure to return an instance of the correct subclass. For instance: KmerFrequencyEncoder has different subclasses for each dataset type. When the dataset is a Repertoire dataset, KmerFreqRepertoireEncoder should be returned.
- Parameters:
**params – keyword arguments that will be provided by users in the specification (if immuneML is used as a command line tool) or in the dictionary when calling the method from the code, and which should be used to create the Encoder object
- Returns:
the object of the appropriate Encoder class
- encode(dataset, params: EncoderParams)[source]¶
This is the main encoding method of the Encoder. It takes in a given dataset, computes an EncodedData object, and returns a copy of the dataset with the attached EncodedData object.
- Parameters:
dataset – A dataset object (Sequence, Receptor or RepertoireDataset)
params – An EncoderParams object containing a few utility parameters which may be used during encoding (e.g., number of parallel processes to use).
- Returns:
A copy of the original dataset, with an EncodedData object added to the dataset.encoded_data field.
- get_sequence_matching_feature(dataset, params: EncoderParams)[source]¶
- get_sequence_matching_feature_with_compairr(dataset, params: EncoderParams)[source]¶
- set_context(context: dict)[source]¶
This method can be used to attach the full dataset (as part of a dictionary), as opposed to the dataset which is passed to the .encode() method. When training ML models, the dataset passed to .encode() is usually a training/validation subset of the total dataset.
In most cases, an encoder should only use the ‘dataset’ argument passed to the .encode() method to compute the encoded data. Using information from the full dataset, which includes the test data, may result in data leakage. For example, some encoders normalise the computed feature values (e.g., KmerFrequencyEncoder). Such normalised feature values should be based only on the current data split, and test data should remain unseen.
To avoid confusion about which version of the dataset to use, the full dataset is by default not attached, and attaching the full dataset should be done explicitly when required. For instance, if the encoded data is some kind of distance matrix (e.g., DistanceEncoder), the distance between examples in the training and test dataset should be included. Note that this does not entail data leakage: the test examples are not used to improve the computation of distances. The distances to test examples are determined by an algorithm which does not ‘learn’ from test data.
To explicitly enable using the full dataset in the encoder, the contents of this method should be as follows:
self.context = context
return self
- Parameters:
context – a dictionary containing the full dataset