immuneML.encodings.onehot package

Submodules

immuneML.encodings.onehot.OneHotEncoder module

class immuneML.encodings.onehot.OneHotEncoder.OneHotEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: str = None, sequence_type: SequenceType = None)

Bases: DatasetEncoder

One-hot encoding for repertoires, sequences or receptors. In one-hot encoding, each alphabet character (amino acid or nucleotide) is replaced by a sparse vector with one 1 and the rest zeroes. The position of the 1 represents the alphabet character.
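
The idea can be illustrated with a minimal, dependency-free sketch. It is not immuneML's implementation: the alphabet ordering and the helper name one_hot are assumptions made for this example only.

```python
# The 20 standard amino acids; this particular ordering is an assumption.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
CHAR_TO_INDEX = {char: index for index, char in enumerate(ALPHABET)}


def one_hot(sequence: str) -> list[list[int]]:
    """Return one row per character, each with a single 1 at the character's index."""
    rows = []
    for char in sequence:
        row = [0] * len(ALPHABET)
        row[CHAR_TO_INDEX[char]] = 1
        rows.append(row)
    return rows


# A 5-character sequence becomes a 5 x 20 matrix.
encoded = one_hot("CASSL")
```

Each row contains exactly one 1, so the matrix size grows with both the sequence length and the alphabet size.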

Dataset type:

  • SequenceDatasets

  • ReceptorDatasets

  • RepertoireDatasets

Specification arguments:

  • use_positional_info (bool): whether to include features representing the positional information. If True, three additional feature vectors will be added, representing the sequence start, sequence middle and sequence end. The values in these features are scaled between 0 and 1. A graphical representation of the values of these vectors is given below.

  Value of sequence start:         Value of sequence middle:        Value of sequence end:

1 \                              1    /‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾\         1                          /
   \                                 /                   \                                  /
    \                               /                     \                                /
0    \_____________________      0 /                       \      0  _____________________/
  <----sequence length---->        <----sequence length---->         <----sequence length---->
  • distance_to_seq_middle (int): only applies when use_positional_info is True. This is the distance from the edges of the CDR3 sequence (IMGT positions 105 and 117) to the portion of the sequence that is considered ‘middle’. For example, if distance_to_seq_middle is 6 (default), all IMGT positions in the interval [111, 112) receive positional value 1. When using nucleotide sequences, note that the distance is still measured in (amino acid) IMGT positions. If the complete sequence length is smaller than 2 * distance_to_seq_middle, the minimum value of the ‘start’ and ‘end’ vectors will not reach 0, and the maximum value of the ‘middle’ vector will not reach 1. A graphical representation of the positional vectors for a sequence that is too short is given below:

Value of sequence start         Value of sequence middle        Value of sequence end:
with very short sequence:       with very short sequence:       with very short sequence:

     1 \                               1                                 1    /
        \                                                                    /
         \                                /\                                /
     0                                 0 /  \                            0
       <->                               <-->                               <->
  • flatten (bool): whether to flatten the final one-hot matrix to a 2-dimensional matrix [examples, other_dims_combined]. This must be set to True when using one-hot encoding in combination with scikit-learn ML methods (inheriting SklearnMethod), such as LogisticRegression, SVM, SVC, RandomForestClassifier and KNN.

  • sequence_type: whether to use nucleotide or amino acid sequence for encoding. Valid values are ‘nucleotide’ and ‘amino_acid’.
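
The shape of the three positional vectors described for use_positional_info and distance_to_seq_middle can be approximated with the sketch below. It is a simplification under stated assumptions: it works on plain sequence indices rather than IMGT positions, and the function name positional_vectors is hypothetical. The exact sequence length at which the vectors stop reaching 0 or 1 in immuneML may therefore differ from this sketch.

```python
def positional_vectors(length: int, distance_to_seq_middle: int):
    """Return (start, middle, end) lists of positional values in [0, 1]."""
    if length < 1 or distance_to_seq_middle < 1:
        raise ValueError("length and distance_to_seq_middle must be positive")
    d = distance_to_seq_middle
    # Start: 1 at the first position, decreasing linearly, 0 from index d onwards.
    start = [max(0.0, 1.0 - i / d) for i in range(length)]
    # End: mirror image of the start vector.
    end = start[::-1]
    # Middle: rises with the distance to the nearer edge, capped at 1.
    middle = [min(1.0, min(i, length - 1 - i) / d) for i in range(length)]
    return start, middle, end


# 13 positions (like IMGT 105 to 117) with the default distance of 6:
# only the central position reaches the middle value 1.
start, middle, end = positional_vectors(13, 6)

# A very short sequence: start and end never drop to 0, middle never reaches 1.
short_start, short_middle, short_end = positional_vectors(4, 6)
```

With 13 positions and a distance of 6, the only index with middle value 1 is index 6, which corresponds to IMGT position 111 in the example given for distance_to_seq_middle.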

YAML specification:

definitions:
    encodings:
        one_hot_vanilla:
            OneHot:
                use_positional_info: False
                flatten: False
                sequence_type: amino_acid

        one_hot_positional:
            OneHot:
                use_positional_info: True
                distance_to_seq_middle: 3
                flatten: False
                sequence_type: nucleotide
static build_object(dataset=None, **params)

Creates an instance of the relevant subclass of the DatasetEncoder class using the given parameters. This method is called at parsing time (early in the immuneML run), so that parameters and dataset type can be validated here.

The build_object method should do the following:

  1. Check parameters: immuneML should crash if wrong user parameters are specified. The ParameterValidator utility class may be used for parameter testing.

  2. Check the dataset type: immuneML should crash if the wrong dataset type is specified for this encoder. For example, DeepRCEncoder should only work for RepertoireDatasets and crash if the dataset is of another type.

  3. Create an instance of the correct Encoder class, using the given parameters. Return this object. Some encoders have different subclasses depending on the dataset type. Make sure to return an instance of the correct subclass. For instance: KmerFrequencyEncoder has different subclasses for each dataset type. When the dataset is a Repertoire dataset, KmerFreqRepertoireEncoder should be returned.

Parameters:

**params – keyword arguments that will be provided by users in the specification (if immuneML is used as a command line tool) or in the dictionary when calling the method from the code, and which should be used to create the Encoder object

Returns:

the object of the appropriate Encoder class

dataset_mapping = {'ReceptorDataset': 'OneHotReceptorEncoder', 'RepertoireDataset': 'OneHotRepertoireEncoder', 'SequenceDataset': 'OneHotSequenceEncoder'}
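
The three steps listed for build_object, combined with a dataset_mapping attribute like the one above, can be sketched as follows. All classes here (the Toy encoders and the empty dataset stand-ins) are hypothetical and exist only for this illustration; immuneML's own validation uses its ParameterValidator utility and may be structured differently.

```python
class SequenceDataset:
    """Stand-in for a sequence dataset (hypothetical, for illustration only)."""


class ReceptorDataset:
    """Stand-in for a receptor dataset (hypothetical, for illustration only)."""


class RepertoireDataset:
    """Stand-in for a repertoire dataset (hypothetical, for illustration only)."""


class ToyOneHotEncoder:
    # Same idea as the documented dataset_mapping, but pointing at toy subclasses.
    dataset_mapping = {
        "ReceptorDataset": "ToyOneHotReceptorEncoder",
        "RepertoireDataset": "ToyOneHotRepertoireEncoder",
        "SequenceDataset": "ToyOneHotSequenceEncoder",
    }

    def __init__(self, use_positional_info, distance_to_seq_middle, flatten):
        self.use_positional_info = use_positional_info
        self.distance_to_seq_middle = distance_to_seq_middle
        self.flatten = flatten

    @staticmethod
    def build_object(dataset=None, **params):
        # Step 1: check parameters and fail early on invalid values.
        for key in ("use_positional_info", "flatten"):
            if not isinstance(params.get(key), bool):
                raise ValueError(f"{key} must be a bool, got {params.get(key)!r}")
        distance = params.get("distance_to_seq_middle")
        if isinstance(distance, bool) or not isinstance(distance, int) or distance < 1:
            raise ValueError(f"distance_to_seq_middle must be a positive int, got {distance!r}")

        # Step 2: check the dataset type.
        dataset_type = type(dataset).__name__
        if dataset_type not in ToyOneHotEncoder.dataset_mapping:
            raise ValueError(f"unsupported dataset type: {dataset_type}")

        # Step 3: look up the matching subclass by name and instantiate it.
        subclasses = {cls.__name__: cls for cls in ToyOneHotEncoder.__subclasses__()}
        return subclasses[ToyOneHotEncoder.dataset_mapping[dataset_type]](**params)


class ToyOneHotReceptorEncoder(ToyOneHotEncoder):
    pass


class ToyOneHotRepertoireEncoder(ToyOneHotEncoder):
    pass


class ToyOneHotSequenceEncoder(ToyOneHotEncoder):
    pass


encoder = ToyOneHotEncoder.build_object(
    RepertoireDataset(), use_positional_info=True, distance_to_seq_middle=3, flatten=False
)
```

Keeping the mapping as names rather than class objects lets the base class be defined before its subclasses exist, at the cost of a lookup at build time.
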
encode(dataset, params: EncoderParams)

This is the main encoding method of the Encoder. It takes in a given dataset, computes an EncodedData object, and returns a copy of the dataset with the attached EncodedData object.

Parameters:
  • dataset – A dataset object (Sequence, Receptor or RepertoireDataset)

  • params – An EncoderParams object containing a few utility parameters which may be used during encoding (e.g., number of parallel processes to use).

Returns:

A copy of the original dataset, with an EncodedData object added to the dataset.encoded_data field.
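
The contract described for encode (the input dataset is left untouched and a copy carrying the encoded data is returned) can be sketched as follows. ToyDataset and ToyEncodedData are hypothetical stand-ins, the EncoderParams argument is omitted for brevity, and a shallow copy is an assumption of this sketch rather than a statement about how immuneML copies datasets.

```python
import copy
from dataclasses import dataclass
from typing import Optional

NUCLEOTIDES = "ACGT"


@dataclass
class ToyEncodedData:
    examples: list


@dataclass
class ToyDataset:
    sequences: list
    encoded_data: Optional[ToyEncodedData] = None


def encode(dataset: ToyDataset) -> ToyDataset:
    """Return a copy of the dataset with one-hot encoded data attached."""
    examples = [
        [[int(char == nucleotide) for nucleotide in NUCLEOTIDES] for char in sequence]
        for sequence in dataset.sequences
    ]
    encoded_dataset = copy.copy(dataset)  # shallow copy; the original stays unencoded
    encoded_dataset.encoded_data = ToyEncodedData(examples=examples)
    return encoded_dataset


original = ToyDataset(sequences=["AG", "TC"])
result = encode(original)
```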

immuneML.encodings.onehot.OneHotReceptorEncoder module

class immuneML.encodings.onehot.OneHotReceptorEncoder.OneHotReceptorEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: str = None, sequence_type: SequenceType = None)

Bases: OneHotEncoder

One-hot encoded receptor data is represented in a matrix with dimensions:

[receptors, chains, sequence_lengths, one_hot_characters]

When use_positional_info is True, the last 3 indices in one_hot_characters represent the positional information:
  • start position (high when close to start)

  • middle position (high in the middle of the sequence)

  • end position (high when close to end)
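
As an illustration of this layout, the sketch below separates the positional features from the one-hot part for a single position of a toy [receptors, chains, sequence_lengths, one_hot_characters] structure. The helper name split_features, the nested-list representation, the start/middle/end ordering (taken from the list above) and the positional values are all assumptions made for this example.

```python
N_POSITIONAL = 3  # start, middle and end, assumed to be stored in that order


def split_features(position_features):
    """Split one position's feature vector into (one_hot, (start, middle, end))."""
    one_hot_part = position_features[:-N_POSITIONAL]
    start, middle, end = position_features[-N_POSITIONAL:]
    return one_hot_part, (start, middle, end)


# Toy data: 1 receptor, 2 chains, 2 positions, 4 nucleotide features + 3 positional.
# The positional values are made up for illustration.
encoded_receptors = [
    [
        [[1, 0, 0, 0, 1.0, 0.0, 0.0], [0, 0, 1, 0, 0.0, 0.0, 1.0]],  # first chain
        [[0, 1, 0, 0, 1.0, 0.0, 0.0], [0, 0, 0, 1, 0.0, 0.0, 1.0]],  # second chain
    ]
]

# First receptor, first chain, first position.
one_hot_part, (start, middle, end) = split_features(encoded_receptors[0][0][0])
```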

immuneML.encodings.onehot.OneHotRepertoireEncoder module

class immuneML.encodings.onehot.OneHotRepertoireEncoder.OneHotRepertoireEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: str = None, sequence_type: SequenceType = None)

Bases: OneHotEncoder

One-hot encoded repertoire data is represented in a matrix with dimensions:

[repertoires, sequences, sequence_lengths, one_hot_characters]

When use_positional_info is True, the last 3 indices in one_hot_characters represent the positional information:
  • start position (high when close to start)

  • middle position (high in the middle of the sequence)

  • end position (high when close to end)

immuneML.encodings.onehot.OneHotSequenceEncoder module

class immuneML.encodings.onehot.OneHotSequenceEncoder.OneHotSequenceEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: str = None, sequence_type: SequenceType = None)

Bases: OneHotEncoder

One-hot encoded sequence data is represented in a matrix with dimensions:

[sequences, sequence_lengths, one_hot_characters]

When use_positional_info is True, the last 3 indices in one_hot_characters represent the positional information:
  • start position (high when close to start)

  • middle position (high in the middle of the sequence)

  • end position (high when close to end)
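
To show what the flatten option means for this layout, the sketch below collapses a [sequences, sequence_lengths, one_hot_characters] structure into [examples, other_dims_combined]. It uses plain nested lists to stay dependency-free; the function name flatten_examples and the row-major ordering of the combined features are assumptions of this example. With a NumPy array, the equivalent operation would be a reshape to (n_examples, -1).

```python
def flatten_examples(encoded):
    """Flatten each example's [positions, characters] matrix into one feature list."""
    return [
        [value for position in example for value in position]
        for example in encoded
    ]


# Toy data: 2 sequences, 3 positions each, 4 nucleotide features per position.
encoded_sequences = [
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],
    [[0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]],
]

flat = flatten_examples(encoded_sequences)
```

The result has one row per example and 3 * 4 = 12 columns, which is the 2-dimensional shape that scikit-learn estimators expect as input.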

Module contents