immuneML.encodings.kmer_frequency package

Subpackages

Submodules

immuneML.encodings.kmer_frequency.KmerFreqReceptorEncoder module

class immuneML.encodings.kmer_frequency.KmerFreqReceptorEncoder.KmerFreqReceptorEncoder(normalization_type: NormalizationType, reads: ReadsType, sequence_encoding: SequenceEncodingType, k: int = 0, k_left: int = 0, k_right: int = 0, min_gap: int = 0, max_gap: int = 0, metadata_fields_to_include: list = None, name: str = None, scale_to_unit_variance: bool = False, scale_to_zero_mean: bool = False, sequence_type: SequenceType = None)[source]

Bases: KmerFrequencyEncoder

immuneML.encodings.kmer_frequency.KmerFreqRepertoireEncoder module

class immuneML.encodings.kmer_frequency.KmerFreqRepertoireEncoder.KmerFreqRepertoireEncoder(normalization_type: NormalizationType, reads: ReadsType, sequence_encoding: SequenceEncodingType, k: int = 0, k_left: int = 0, k_right: int = 0, min_gap: int = 0, max_gap: int = 0, metadata_fields_to_include: list = None, name: str = None, scale_to_unit_variance: bool = False, scale_to_zero_mean: bool = False, sequence_type: SequenceType = None)[source]

Bases: KmerFrequencyEncoder

encode_repertoire(repertoire, params: EncoderParams)[source]
get_encoded_repertoire(repertoire, params: EncoderParams)[source]

immuneML.encodings.kmer_frequency.KmerFreqSequenceEncoder module

class immuneML.encodings.kmer_frequency.KmerFreqSequenceEncoder.KmerFreqSequenceEncoder(normalization_type: NormalizationType, reads: ReadsType, sequence_encoding: SequenceEncodingType, k: int = 0, k_left: int = 0, k_right: int = 0, min_gap: int = 0, max_gap: int = 0, metadata_fields_to_include: list = None, name: str = None, scale_to_unit_variance: bool = False, scale_to_zero_mean: bool = False, sequence_type: SequenceType = None)[source]

Bases: KmerFrequencyEncoder

immuneML.encodings.kmer_frequency.KmerFrequencyEncoder module

class immuneML.encodings.kmer_frequency.KmerFrequencyEncoder.KmerFrequencyEncoder(normalization_type: NormalizationType, reads: ReadsType, sequence_encoding: SequenceEncodingType, k: int = 0, k_left: int = 0, k_right: int = 0, min_gap: int = 0, max_gap: int = 0, metadata_fields_to_include: list = None, name: str = None, scale_to_unit_variance: bool = False, scale_to_zero_mean: bool = False, sequence_type: SequenceType = None)[source]

Bases: DatasetEncoder

The KmerFrequencyEncoder class encodes a repertoire, sequence or receptor by frequencies of k-mers it contains. A k-mer is a sequence of letters of length k into which an immune receptor sequence can be decomposed. K-mers can be defined in different ways, as determined by the sequence_encoding.
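As a rough illustration of the decomposition described above, the following Python sketch extracts continuous and gapped k-mers from a single sequence and counts them. The helper names (continuous_kmers, gapped_kmers) and the '.' notation for gap positions are assumptions made for this example; they are not part of the immuneML API, and the library's own implementation may differ.

```python
from collections import Counter


def continuous_kmers(sequence: str, k: int) -> list:
    """Return all contiguous, overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def gapped_kmers(sequence: str, k_left: int, k_right: int,
                 min_gap: int, max_gap: int) -> list:
    """Return k-mers made of a left and a right part separated by a gap.

    Each gap position is written as '.'; this notation is only an
    assumption for illustration, not the format used by immuneML.
    """
    kmers = []
    for gap in range(min_gap, max_gap + 1):
        span = k_left + gap + k_right  # total window covered by one gapped k-mer
        for i in range(len(sequence) - span + 1):
            left = sequence[i:i + k_left]
            right = sequence[i + k_left + gap:i + span]
            kmers.append(left + "." * gap + right)
    return kmers


sequence = "CASSL"  # toy amino acid sequence
continuous_counts = Counter(continuous_kmers(sequence, k=3))
gapped_counts = Counter(
    gapped_kmers(sequence, k_left=1, k_right=1, min_gap=1, max_gap=1)
)
```

With k=3 the toy sequence yields the continuous k-mers CAS, ASS and SSL; with k_left=1, k_right=1 and a gap of exactly 1 it yields C.S, A.S and S.L.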

Dataset type:

  • SequenceDatasets

  • ReceptorDatasets

  • RepertoireDatasets

Specification arguments:

  • sequence_encoding (SequenceEncodingType): The type of k-mers that are used. The simplest sequence_encoding is CONTINUOUS_KMER, which uses contiguous subsequences of length k to represent the k-mers. When gapped k-mers are used (GAPPED_KMER, IMGT_GAPPED_KMER), the k-mers may contain gaps with a size between min_gap and max_gap, and the k-mer length is defined as a combination of k_left and k_right. When IMGT k-mers are used (IMGT_CONTINUOUS_KMER, IMGT_GAPPED_KMER), IMGT positional information is taken into account (i.e. the same sequence in a different position is considered to be a different k-mer). When the identity representation is used (IDENTITY), the k-mers just correspond to the original sequences.

  • normalization_type (NormalizationType): The way in which the k-mer frequencies should be normalized. The default value for normalization_type is l2.

  • reads (ReadsType): Specifies whether the counts of the sequences in the repertoire are taken into account. If UNIQUE, only unique sequences (clonotypes) are encoded; if ALL, the sequence ‘count’ value is taken into account when determining the k-mer frequency. The default value for reads is unique.

  • k (int): Length of the k-mer (number of amino acids) when ungapped k-mers are used. The default value for k is 3.

  • k_left (int): When gapped k-mers are used, k_left indicates the length of the k-mer left of the gap. The default value for k_left is 1.

  • k_right (int): Same as k_left, but k_right determines the length of the k-mer right of the gap. The default value for k_right is 1.

  • min_gap (int): Minimum gap size when gapped k-mers are used. The default value for min_gap is 0.

  • max_gap (int): Maximum gap size when gapped k-mers are used. The default value for max_gap is 0.

  • sequence_type (str): Whether to work with nucleotide or amino acid sequences. Amino acid sequences are the default. To work with either sequence type, the sequences of the desired type should be included in the datasets, e.g., listed under the ‘columns_to_load’ parameter. By default, both types will be included if available. Valid values are: AMINO_ACID and NUCLEOTIDE.

  • scale_to_unit_variance (bool): whether to scale the design matrix after normalization to have unit variance per feature. Setting this argument to True might improve the subsequent classifier’s performance depending on the type of the classifier. The default value for scale_to_unit_variance is true.

  • scale_to_zero_mean (bool): whether to scale the design matrix after normalization to have zero mean per feature. Setting this argument to True might improve the subsequent classifier’s performance depending on the type of the classifier. However, if the original design matrix was sparse, setting this argument to True will destroy the sparsity and will increase the memory consumption. The default value for scale_to_zero_mean is false.
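The effect of normalization followed by scaling can be sketched in plain Python. The example below applies L2 normalization per example (row) and then scales per feature (column). It follows the usual standardization convention and is not taken from the immuneML source; the function names and the toy count matrix are assumptions for illustration. It also shows why centering to zero mean destroys sparsity: entries that were zero become non-zero.

```python
import math

# Toy k-mer count matrix: rows are examples, columns are k-mer features.
counts = [
    [2.0, 0.0, 1.0],
    [0.0, 3.0, 4.0],
]


def l2_normalize(row):
    """Divide a row by its Euclidean norm (all-zero rows are left as zeros)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm if norm else 0.0 for v in row]


def scale_columns(matrix, with_mean, with_std):
    """Scale each column to zero mean and/or unit variance (population std)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    scaled = [row[:] for row in matrix]
    for j in range(n_cols):
        column = [row[j] for row in matrix]
        mean = sum(column) / n_rows
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / n_rows)
        for i in range(n_rows):
            value = matrix[i][j]
            if with_mean:
                value -= mean
            if with_std and std > 0:
                value /= std
            scaled[i][j] = value
    return scaled


normalized = [l2_normalize(row) for row in counts]
# Unit variance only: zero entries stay zero, so sparsity is preserved.
unit_variance = scale_columns(normalized, with_mean=False, with_std=True)
# Zero mean as well: zero entries are shifted away from zero.
centered = scale_columns(normalized, with_mean=True, with_std=True)
```

In this sketch, unit_variance keeps every zero of the normalized matrix, while centered has no zero entries left, which is the memory trade-off noted for scale_to_zero_mean.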

YAML specification:

definitions:
    encodings:
        my_continuous_kmer:
            KmerFrequency:
                normalization_type: RELATIVE_FREQUENCY
                reads: UNIQUE
                sequence_encoding: CONTINUOUS_KMER
                sequence_type: NUCLEOTIDE
                k: 3
                scale_to_unit_variance: True
                scale_to_zero_mean: True
        my_gapped_kmer:
            KmerFrequency:
                normalization_type: RELATIVE_FREQUENCY
                reads: UNIQUE
                sequence_encoding: GAPPED_KMER
                sequence_type: AMINO_ACID
                k_left: 2
                k_right: 2
                min_gap: 1
                max_gap: 3
                scale_to_unit_variance: True
                scale_to_zero_mean: False
STEP_ENCODED = 'encoded'
STEP_NORMALIZED = 'normalized'
STEP_SCALED = 'scaled'
STEP_VECTORIZED = 'vectorized'
static build_object(dataset=None, **params)[source]

Creates an instance of the relevant subclass of the DatasetEncoder class using the given parameters. This method will be called during parsing time (early in the immuneML run), such that parameters and dataset type can be tested here.

The build_object method should do the following:

  1. Check parameters: immuneML should crash if wrong user parameters are specified. The ParameterValidator utility class may be used for parameter testing.

  2. Check the dataset type: immuneML should crash if the wrong dataset type is specified for this encoder. For example, DeepRCEncoder should only work for RepertoireDatasets and crash if the dataset is of another type.

  3. Create an instance of the correct Encoder class, using the given parameters. Return this object. Some encoders have different subclasses depending on the dataset type. Make sure to return an instance of the correct subclass. For instance: KmerFrequencyEncoder has different subclasses for each dataset type. When the dataset is a Repertoire dataset, KmerFreqRepertoireEncoder should be returned.

Parameters:

**params – keyword arguments that will be provided by users in the specification (if immuneML is used as a command line tool) or in the dictionary when calling the method from the code, and which should be used to create the Encoder object

Returns:

the object of the appropriate Encoder class
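The three steps listed for build_object can be sketched as follows. This is a minimal, self-contained illustration and not the immuneML implementation: every class below is a hypothetical stand-in defined only for this example, the parameter check is deliberately simplistic, and only the mapping from dataset type to encoder name is copied from the documented dataset_mapping attribute.

```python
class ReceptorDataset:
    """Stand-in dataset type (hypothetical, not the immuneML class)."""


class RepertoireDataset:
    """Stand-in dataset type (hypothetical, not the immuneML class)."""


class SequenceDataset:
    """Stand-in dataset type (hypothetical, not the immuneML class)."""


class KmerFrequencyEncoderSketch:
    # Same content as the documented dataset_mapping attribute.
    dataset_mapping = {
        "ReceptorDataset": "KmerFreqReceptorEncoder",
        "RepertoireDataset": "KmerFreqRepertoireEncoder",
        "SequenceDataset": "KmerFreqSequenceEncoder",
    }

    def __init__(self, **params):
        self.params = params

    @staticmethod
    def build_object(dataset=None, **params):
        # Step 1: check parameters and fail early on invalid values.
        k = params.get("k", 3)
        if not isinstance(k, int) or k < 1:
            raise ValueError(f"k must be a positive integer, got {k!r}")

        # Step 2: check the dataset type.
        dataset_type = type(dataset).__name__
        mapping = KmerFrequencyEncoderSketch.dataset_mapping
        if dataset_type not in mapping:
            raise ValueError(f"Unsupported dataset type: {dataset_type}")

        # Step 3: return an instance of the subclass for this dataset type.
        return SUBCLASSES[mapping[dataset_type]](**params)


class KmerFreqReceptorEncoder(KmerFrequencyEncoderSketch):
    """Stand-in subclass for receptor datasets."""


class KmerFreqRepertoireEncoder(KmerFrequencyEncoderSketch):
    """Stand-in subclass for repertoire datasets."""


class KmerFreqSequenceEncoder(KmerFrequencyEncoderSketch):
    """Stand-in subclass for sequence datasets."""


# Lookup from subclass name to class, used in step 3.
SUBCLASSES = {
    cls.__name__: cls
    for cls in (KmerFreqReceptorEncoder, KmerFreqRepertoireEncoder, KmerFreqSequenceEncoder)
}

encoder = KmerFrequencyEncoderSketch.build_object(RepertoireDataset(), k=3)
```

Here encoder is an instance of the stand-in KmerFreqRepertoireEncoder, while passing an unsupported dataset object or a non-positive k raises a ValueError at construction time.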

dataset_mapping = {'ReceptorDataset': 'KmerFreqReceptorEncoder', 'RepertoireDataset': 'KmerFreqRepertoireEncoder', 'SequenceDataset': 'KmerFreqSequenceEncoder'}
encode(dataset, params: EncoderParams)[source]

This is the main encoding method of the Encoder. It takes in a given dataset, computes an EncodedData object, and returns a copy of the dataset with the attached EncodedData object.

Parameters:
  • dataset – A dataset object (SequenceDataset, ReceptorDataset or RepertoireDataset)

  • params – An EncoderParams object containing a few utility parameters which may be used during encoding (e.g., number of parallel processes to use).

Returns:

A copy of the original dataset, with an EncodedData object added to the dataset.encoded_data field.
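The contract described above (the input dataset is left untouched and a copy carrying the encoded data is returned) can be illustrated with a small sketch. DatasetSketch and EncodedDataSketch are hypothetical stand-ins, their fields are assumptions, and the function takes k directly instead of an EncoderParams object; none of this reproduces the actual immuneML classes.

```python
import copy
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EncodedDataSketch:
    """Stand-in for EncodedData; the field names are assumptions."""
    examples: List[List[int]]   # one row of k-mer counts per example
    feature_names: List[str]    # the k-mer belonging to each column


@dataclass
class DatasetSketch:
    """Stand-in for a sequence dataset."""
    sequences: List[str]
    encoded_data: Optional[EncodedDataSketch] = None


def encode(dataset: DatasetSketch, k: int = 3) -> DatasetSketch:
    # Count continuous k-mers separately for every example.
    counts_per_example = [
        Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        for seq in dataset.sequences
    ]
    # Use the sorted union of all observed k-mers as the feature columns.
    feature_names = sorted(set().union(*counts_per_example))
    examples = [
        [counts[name] for name in feature_names]
        for counts in counts_per_example
    ]
    # Attach the result to a shallow copy so the input dataset is not modified.
    encoded_dataset = copy.copy(dataset)
    encoded_dataset.encoded_data = EncodedDataSketch(examples, feature_names)
    return encoded_dataset


original = DatasetSketch(sequences=["CASS", "ASSL"])
encoded = encode(original)
```

After the call, original.encoded_data is still None, while encoded.encoded_data holds the count matrix and the k-mer names for its columns.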

get_additional_files() → List[str][source]

Should return a list with all the files that need to be stored when storing the encoder. For example, SimilarToPositiveSequenceEncoder stores all ‘positive’ sequences in the training data, and predicts a sequence to be ‘positive’ if it is similar to any positive sequences in the training data. In that case, these positive sequences are stored in a file.

For many encoders, it may not be necessary to store additional files.

scale_normalized(params, dataset, normalized_examples)[source]

Module contents