immuneML.encodings.motif_encoding package¶
Submodules¶
immuneML.encodings.motif_encoding.MotifEncoder module¶
- class immuneML.encodings.motif_encoding.MotifEncoder.MotifEncoder(max_positions: int = None, min_positions: int = None, min_precision: float = None, min_recall: dict = None, min_true_positives: int = None, no_gaps: bool = False, candidate_motif_filepath: str = None, label: str = None, name: str = None)[source]¶
Bases:
DatasetEncoder
This encoder enumerates every possible positional motif in a sequence dataset, and keeps only the motifs associated with the positive class. A ‘motif’ is defined as a combination of position-specific amino acids. These motifs may contain one or multiple gaps. Motifs are filtered out based on a minimal precision and recall threshold for predicting the positive class.
Note: the MotifEncoder can only be used for sequences of the same length.
The ideal recall threshold(s) given a user-defined precision threshold can be calibrated using the MotifGeneralizationAnalysis report. It is recommended to first run this report in the ExploratoryAnalysisInstruction before using this encoder for ML.
This encoder can be used in combination with the BinaryFeatureClassifier in order to learn a minimal set of compatible motifs for predicting the positive class. Alternatively, it may be combined with scikit-learn methods, such as LogisticRegression, to learn a weight per motif.
Dataset type:
SequenceDatasets
Specification arguments:
max_positions (int): The maximum motif size. This is number of positional amino acids the motif consists of (excluding gaps). The default value for max_positions is 4.
min_positions (int): The minimum motif size (see also: max_positions). The default value for min_positions is 1.
no_gaps (bool): Must be set to True if only contiguous motifs (position-specific k-mers) are allowed. By default, no_gaps is False, meaning both gapped and ungapped motifs are searched for.
min_precision (float): The minimum precision threshold for keeping a motif. The default value for min_precision is 0.8.
min_recall (float): The minimum recall threshold for keeping a motif. The default value for min_recall is 0. It is also possible to specify a recall threshold for each motif size. In this case, a dictionary must be specified where the motif sizes are keys and the recall values are values. Use the MotifGeneralizationAnalysis report to calibrate the optimal recall threshold given a user-defined precision threshold to ensure generalisability to unseen data.
min_true_positives (int): The minimum number of true positive sequences that a motif needs to occur in. The default value for min_true_positives is 10.
candidate_motif_filepath (str): Optional filepath for pre-filtered candidate motifs. This may be used to save time. Only the given candidate motifs are considered. When this encoder has been run previously, a candidate motifs file named ‘all_candidate_motifs.tsv’ will have been exported. This file contains all possible motifs with high enough min_true_positives without applying precision and recall thresholds. The file must be a tab-separated file, structured as follows:
indices	amino_acids
1&2&3	A&G&C
5&7	E&D
The example above contains two motifs: A, G and C at positions 1, 2 and 3, and E and D at positions 5 and 7 (with a gap at position 6).
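As an illustration, the tab-separated candidate motif format described above can be read with a short helper. This is a hypothetical sketch, not part of the immuneML API; the function name and the returned tuple structure are assumptions for demonstration only.

```python
def parse_candidate_motifs(filepath):
    """Read a tab-separated motif file with an 'indices\tamino_acids' header.

    Each row holds '&'-separated position indices and the matching
    '&'-separated amino acids, as in the format shown above.
    Returns a list of (indices, amino_acids) tuples.
    """
    motifs = []
    with open(filepath) as file:
        header = file.readline()
        assert header == "indices\tamino_acids\n", "unexpected header"
        for line in file:
            indices_str, aas_str = line.rstrip("\n").split("\t")
            indices = tuple(int(i) for i in indices_str.split("&"))
            amino_acids = tuple(aas_str.split("&"))
            motifs.append((indices, amino_acids))
    return motifs
```

For the example file above, this would yield the motifs ((1, 2, 3), ('A', 'G', 'C')) and ((5, 7), ('E', 'D')).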
label (str): The name of the binary label to train the encoder for. This is only necessary when the dataset contains multiple labels.
YAML specification:
definitions:
  encodings:
    my_motif_encoder:
      MotifEncoder:
        max_positions: 4
        min_precision: 0.8
        min_recall: # different recall thresholds for each motif size
          1: 0.5 # for shorter motifs, a stricter recall threshold is used
          2: 0.1
          3: 0.01
          4: 0.001
        min_true_positives: 10
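The filtering rule these parameters control (keep a motif only if its precision, recall and true positive count for the positive class meet the thresholds) can be sketched in plain Python. This is an illustrative sketch under simplified assumptions, not the immuneML implementation; all names below are hypothetical.

```python
def motif_matches(sequence, indices, amino_acids):
    """True if the sequence carries the given amino acids at the given
    zero-based positions (unlisted positions act as gaps)."""
    return all(sequence[i] == aa for i, aa in zip(indices, amino_acids))

def passes_thresholds(sequences, labels, indices, amino_acids,
                      min_precision=0.8, min_recall=0.0, min_true_positives=10):
    """Apply the precision/recall/true-positive filtering described above."""
    matches = [motif_matches(seq, indices, amino_acids) for seq in sequences]
    tp = sum(m and y for m, y in zip(matches, labels))       # positive & matched
    fp = sum(m and not y for m, y in zip(matches, labels))   # negative & matched
    n_positives = sum(labels)
    precision = tp / (tp + fp) if tp + fp > 0 else 0
    recall = tp / n_positives if n_positives > 0 else 0
    return (tp >= min_true_positives
            and precision >= min_precision
            and recall >= min_recall)
```

Per-size recall thresholds, as in the YAML example, would simply select min_recall based on len(indices).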
- static build_object(dataset=None, **params)[source]¶
Creates an instance of the relevant subclass of the DatasetEncoder class using the given parameters. This method will be called during parsing time (early in the immuneML run), such that parameters and dataset type can be tested here.
The build_object method should do the following:
Check parameters: immuneML should crash if wrong user parameters are specified. The ParameterValidator utility class may be used for parameter testing.
Check the dataset type: immuneML should crash if the wrong dataset type is specified for this encoder. For example, DeepRCEncoder should only work for RepertoireDatasets and crash if the dataset is of another type.
Create an instance of the correct Encoder class, using the given parameters. Return this object. Some encoders have different subclasses depending on the dataset type. Make sure to return an instance of the correct subclass. For instance: KmerFrequencyEncoder has different subclasses for each dataset type. When the dataset is a Repertoire dataset, KmerFreqRepertoireEncoder should be returned.
- Parameters:
**params – keyword arguments that will be provided by users in the specification (if immuneML is used as a command line tool) or in the dictionary when calling the method from the code, and which should be used to create the Encoder object
- Returns:
the object of the appropriate Encoder class
- encode(dataset, params: EncoderParams)[source]¶
This is the main encoding method of the Encoder. It takes in a given dataset, computes an EncodedData object, and returns a copy of the dataset with the attached EncodedData object.
- Parameters:
dataset – A dataset object (Sequence, Receptor or RepertoireDataset)
params – An EncoderParams object containing a few utility parameters which may be used during encoding (e.g., number of parallel processes to use).
- Returns:
A copy of the original dataset, with an EncodedData object added to the dataset.encoded_data field.
- set_context(context: dict)[source]¶
This method can be used to attach the full dataset (as part of a dictionary), as opposed to the dataset which is passed to the .encode() method. When training ML models, the dataset passed to .encode() is usually a training/validation subset of the total dataset.
In most cases, an encoder should only use the ‘dataset’ argument passed to the .encode() method to compute the encoded data. Using information from the full dataset, which includes the test data, may result in data leakage. For example, some encoders normalise the computed feature values (e.g., KmerFrequencyEncoder). Such normalised feature values should be based only on the current data split, and test data should remain unseen.
To avoid confusion about which version of the dataset to use, the full dataset is by default not attached, and attaching the full dataset should be done explicitly when required. For instance, if the encoded data is some kind of distance matrix (e.g., DistanceEncoder), the distance between examples in the training and test dataset should be included. Note that this does not entail data leakage: the test examples are not used to improve the computation of distances. The distances to test examples are determined by an algorithm which does not ‘learn’ from test data.
To explicitly enable using the full dataset in the encoder, the contents of this method should be as follows:
self.context = context
return self
- Parameters:
context – a dictionary containing the full dataset
immuneML.encodings.motif_encoding.PositionalMotifHelper module¶
- class immuneML.encodings.motif_encoding.PositionalMotifHelper.PositionalMotifHelper[source]¶
Bases:
object
- static check_motif(motif, np_sequences, y_true, weights, min_precision, min_true_positives, min_recall)[source]¶
- static check_motif_filepath(motif_filepath, location, parameter_name, expected_header='indices\tamino_acids\n')[source]¶
- static compute_all_candidate_motifs(np_sequences, params: PositionalMotifParams)[source]¶
- static compute_numpy_sequence_representation(dataset, location=None)[source]¶
Computes an efficient unicode representation for SequenceDatasets where all sequences have the same length
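As a rough illustration of such a representation (an assumption for demonstration, not the actual immuneML code), equal-length sequences can be stored as a 2D NumPy array of single unicode characters, so that position-specific amino acid checks become vectorized comparisons:

```python
import numpy as np

def to_numpy_representation(sequences):
    """Convert equal-length sequences to an (n_sequences, seq_length) array
    of single characters (dtype 'U1')."""
    assert len({len(s) for s in sequences}) == 1, "sequences must have equal length"
    return np.array([list(s) for s in sequences], dtype="U1")

# Vectorized positional check: which sequences have 'A' at position 0
# and 'G' at position 1?
np_seqs = to_numpy_representation(["AGC", "AGD", "TGC"])
mask = (np_seqs[:, 0] == "A") & (np_seqs[:, 1] == "G")
```

Such a layout allows a motif to be tested against all sequences at once with boolean masks instead of per-sequence loops.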
- static extend_motif(base_motif, np_sequences, legal_positional_aas, count_threshold=10, negative_aa=False, no_gaps=False)[source]¶
- static get_generalized_motifs(motifs)[source]¶
Generalized motifs option is temporarily not in use by MotifEncoder, as there does not seem to be a clear purpose as of now.
immuneML.encodings.motif_encoding.PositionalMotifParams module¶
- class immuneML.encodings.motif_encoding.PositionalMotifParams.PositionalMotifParams(max_positions: int, min_positions: int, count_threshold: int, pool_size: int = 4, allow_negative_aas: bool = False, no_gaps: bool = False)[source]¶
Bases:
object
- allow_negative_aas: bool = False¶
- count_threshold: int¶
- max_positions: int¶
- min_positions: int¶
- no_gaps: bool = False¶
- pool_size: int = 4¶
immuneML.encodings.motif_encoding.SimilarToPositiveSequenceEncoder module¶
- class immuneML.encodings.motif_encoding.SimilarToPositiveSequenceEncoder.SimilarToPositiveSequenceEncoder(hamming_distance: int = None, compairr_path: str = None, ignore_genes: bool = None, threads: int = None, keep_temporary_files: bool = None, name: str = None)[source]¶
Bases:
DatasetEncoder
A simple baseline encoding, to be used in combination with the BinaryFeatureClassifier using keep_all = True. This encoder keeps track of all positive sequences in the training set and ignores the negative sequences. Any sequence within a given Hamming distance of a positive training sequence will be classified positive; all other sequences will be classified negative.
Dataset type:
SequenceDatasets
Specification arguments:
hamming_distance (int): Maximum number of differences allowed between any positive sequence of the training set and a new observed sequence in order for the observed sequence to be classified as ‘positive’.
compairr_path (Path): optional path to the CompAIRR executable. If not given, it is assumed that CompAIRR has been installed such that it can be called directly on the command line with the command ‘compairr’, or that it is located at /usr/local/bin/compairr.
ignore_genes (bool): Only used when compairr is used. Whether to ignore V and J gene information. If False, the V and J genes between two sequences have to match for the sequence to be considered ‘similar’. If True, gene information is ignored. By default, ignore_genes is False.
threads (int): The number of threads to use for parallelization. This does not affect the results of the encoding, only the speed. The default number of threads is 8.
keep_temporary_files (bool): Whether to keep temporary files, including CompAIRR input, output and log files, and the sequence presence matrix. This may take a lot of storage space if the input dataset is large. By default, temporary files are not kept.
YAML specification:
definitions:
  encodings:
    my_sequence_encoder:
      SimilarToPositiveSequenceEncoder:
        hamming_distance: 2
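The classification rule described above can be sketched in plain Python. This is a minimal sketch assuming equal-length sequences and ignoring gene matching; the actual encoder may delegate the distance computation to CompAIRR for speed, and the function names below are hypothetical.

```python
def hamming_distance(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b))

def is_similar_to_positive(sequence, positive_train_sequences, max_distance=2):
    """Classify a sequence as positive if it lies within max_distance of
    any positive training sequence, as the encoder's rule describes."""
    return any(hamming_distance(sequence, pos) <= max_distance
               for pos in positive_train_sequences)
```

With hamming_distance: 2 as in the YAML example, a sequence differing from some positive training sequence in at most two positions would be classified positive.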
- static build_object(dataset=None, **params)[source]¶
Creates an instance of the relevant subclass of the DatasetEncoder class using the given parameters. This method will be called during parsing time (early in the immuneML run), such that parameters and dataset type can be tested here.
The build_object method should do the following:
Check parameters: immuneML should crash if wrong user parameters are specified. The ParameterValidator utility class may be used for parameter testing.
Check the dataset type: immuneML should crash if the wrong dataset type is specified for this encoder. For example, DeepRCEncoder should only work for RepertoireDatasets and crash if the dataset is of another type.
Create an instance of the correct Encoder class, using the given parameters. Return this object. Some encoders have different subclasses depending on the dataset type. Make sure to return an instance of the correct subclass. For instance: KmerFrequencyEncoder has different subclasses for each dataset type. When the dataset is a Repertoire dataset, KmerFreqRepertoireEncoder should be returned.
- Parameters:
**params – keyword arguments that will be provided by users in the specification (if immuneML is used as a command line tool) or in the dictionary when calling the method from the code, and which should be used to create the Encoder object
- Returns:
the object of the appropriate Encoder class
- encode(dataset, params: EncoderParams)[source]¶
This is the main encoding method of the Encoder. It takes in a given dataset, computes an EncodedData object, and returns a copy of the dataset with the attached EncodedData object.
- Parameters:
dataset – A dataset object (Sequence, Receptor or RepertoireDataset)
params – An EncoderParams object containing a few utility parameters which may be used during encoding (e.g., number of parallel processes to use).
- Returns:
A copy of the original dataset, with an EncodedData object added to the dataset.encoded_data field.
- get_sequence_matching_feature(dataset, params: EncoderParams)[source]¶
- get_sequence_matching_feature_with_compairr(dataset, params: EncoderParams)[source]¶
- set_context(context: dict)[source]¶
This method can be used to attach the full dataset (as part of a dictionary), as opposed to the dataset which is passed to the .encode() method. When training ML models, the dataset passed to .encode() is usually a training/validation subset of the total dataset.
In most cases, an encoder should only use the ‘dataset’ argument passed to the .encode() method to compute the encoded data. Using information from the full dataset, which includes the test data, may result in data leakage. For example, some encoders normalise the computed feature values (e.g., KmerFrequencyEncoder). Such normalised feature values should be based only on the current data split, and test data should remain unseen.
To avoid confusion about which version of the dataset to use, the full dataset is by default not attached, and attaching the full dataset should be done explicitly when required. For instance, if the encoded data is some kind of distance matrix (e.g., DistanceEncoder), the distance between examples in the training and test dataset should be included. Note that this does not entail data leakage: the test examples are not used to improve the computation of distances. The distances to test examples are determined by an algorithm which does not ‘learn’ from test data.
To explicitly enable using the full dataset in the encoder, the contents of this method should be as follows:
self.context = context
return self
- Parameters:
context – a dictionary containing the full dataset