immuneML.encodings.onehot package

Submodules

immuneML.encodings.onehot.OneHotEncoder module

class immuneML.encodings.onehot.OneHotEncoder.OneHotEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: Optional[str] = None, sequence_type: Optional[immuneML.environment.SequenceType.SequenceType] = None)

Bases: immuneML.encodings.DatasetEncoder.DatasetEncoder

One-hot encoding for repertoires, sequences or receptors. In one-hot encoding, each alphabet character (amino acid or nucleotide) is replaced by a sparse vector with one 1 and the rest zeroes. The position of the 1 represents the alphabet character.
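The replacement described above can be sketched in a few lines of plain Python. This is an illustration only, not the immuneML implementation: the helper name `one_hot` and the alphabetical ordering of the amino acid alphabet are assumptions made for the example.

```python
# Illustrative alphabet: the 20 standard amino acids in alphabetical order.
# Assumption: the ordering immuneML uses internally may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence, alphabet=AMINO_ACIDS):
    """Return one row per character, with a single 1 at the character's alphabet index."""
    index = {char: i for i, char in enumerate(alphabet)}
    encoded = []
    for char in sequence:
        row = [0] * len(alphabet)
        row[index[char]] = 1  # raises KeyError for characters outside the alphabet
        encoded.append(row)
    return encoded


# A sequence of length 3 becomes a 3 x 20 matrix in which every row sums to 1.
matrix = one_hot("CAS")
```

For nucleotide sequences the same sketch applies with a 4-letter alphabet passed as `alphabet`.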

Parameters
  • use_positional_info (bool) – whether to include features representing the positional information. If True, three additional feature vectors will be added, representing the sequence start, sequence middle and sequence end. The values in these features are scaled between 0 and 1. A graphical representation of the values of these vectors is given below.

    Value of sequence start:         Value of sequence middle:        Value of sequence end:

    1 \                              1    /‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾\         1                         /
       \                                 /                   \                                 /
        \                               /                     \                               /
    0    \_____________________      0 /                       \      0 _____________________/
      <----sequence length---->        <----sequence length---->        <----sequence length---->

  • distance_to_seq_middle (int) – only applies when use_positional_info is True. This is the distance from the edge of the CDR3 sequence to the part of the sequence that is treated as the middle. For example: if distance_to_seq_middle is 6 (default), all IMGT positions in the interval [111, 112) receive positional value 1. When using nucleotide sequences, note that the distance is measured in (amino acid) IMGT positions. If the complete sequence length is smaller than 2 * distance_to_seq_middle, the minimum value of the 'start' and 'end' vectors will not reach 0, and the maximum value of the 'middle' vector will not reach 1. A graphical representation of the positional vectors with a too-short sequence is given below.

    Value of sequence start         Value of sequence middle        Value of sequence end
    with very short sequence:       with very short sequence:       with very short sequence:

    1 \                             1                               1   /
       \                                                               /
        \                              /\                             /
    0                               0 /  \                          0
      <->                             <-->                            <->

  • flatten (bool) – whether to flatten the final onehot matrix to a 2-dimensional matrix [examples, other_dims_combined]. This must be set to True when using onehot encoding in combination with scikit-learn ML methods, such as LogisticRegression, SVM, RandomForestClassifier and KNN.

  • sequence_type – whether to use nucleotide or amino acid sequence for encoding. Valid values are 'nucleotide' and 'amino_acid'.
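The behaviour of the three positional vectors can be approximated with simple linear ramps. The sketch below is a simplification under stated assumptions: it works on plain 0-based sequence indices rather than IMGT positions, the function name `positional_features` is hypothetical, and the exact scaling used by immuneML may differ. It reproduces only the qualitative shapes described for use_positional_info and distance_to_seq_middle.

```python
def positional_features(seq_length, distance_to_seq_middle=6):
    """Return (start, middle, end) lists with one value in [0, 1] per sequence index.

    Assumption: linear ramps over plain indices, not IMGT positions.
    """
    start, middle, end = [], [], []
    for i in range(seq_length):
        from_start = i / distance_to_seq_middle
        from_end = (seq_length - 1 - i) / distance_to_seq_middle
        start.append(max(0.0, 1.0 - from_start))        # 1 at the first index, ramps down
        end.append(max(0.0, 1.0 - from_end))            # 1 at the last index, ramps up
        middle.append(min(1.0, from_start, from_end))   # plateau of 1 away from both edges
    return start, middle, end


# Long enough sequence: 'start' and 'end' reach 0, 'middle' reaches 1.
start, middle, end = positional_features(13, distance_to_seq_middle=6)

# Too-short sequence (length < 2 * distance_to_seq_middle):
# 'start' never drops to 0 and 'middle' never rises to 1.
short_start, short_middle, short_end = positional_features(5, distance_to_seq_middle=6)
```

In this simplified form the short-sequence case falls out of the same formulas, because the ramps are cut off before they reach their limits.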

YAML specification:

one_hot_vanilla:
    OneHot:
        use_positional_info: False
        flatten: False
        sequence_type: amino_acid

one_hot_positional:
    OneHot:
        use_positional_info: True
        distance_to_seq_middle: 3
        flatten: False
        sequence_type: nucleotide

static build_object(dataset=None, **params)

dataset_mapping = {'ReceptorDataset': 'OneHotReceptorEncoder', 'RepertoireDataset': 'OneHotRepertoireEncoder', 'SequenceDataset': 'OneHotSequenceEncoder'}

encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams)

store(encoded_dataset, params: immuneML.encodings.EncoderParams.EncoderParams)
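The dataset_mapping attribute suggests that the encoder subclass is chosen from the type of the dataset being encoded. The following sketch shows one way such a lookup could work. The helper `resolve_encoder_name` and the stand-in `SequenceDataset` class are hypothetical and exist only for this example; how build_object actually resolves and instantiates the subclass is not shown in this reference.

```python
# Copied from the dataset_mapping attribute documented above.
dataset_mapping = {
    "ReceptorDataset": "OneHotReceptorEncoder",
    "RepertoireDataset": "OneHotRepertoireEncoder",
    "SequenceDataset": "OneHotSequenceEncoder",
}


def resolve_encoder_name(dataset):
    """Hypothetical helper: pick the encoder class name from the dataset's class name."""
    dataset_type = type(dataset).__name__
    if dataset_type not in dataset_mapping:
        raise ValueError(f"No one-hot encoder registered for dataset type '{dataset_type}'")
    return dataset_mapping[dataset_type]


class SequenceDataset:
    """Stand-in for illustration only, not the immuneML class."""


encoder_name = resolve_encoder_name(SequenceDataset())
```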

immuneML.encodings.onehot.OneHotReceptorEncoder module

class immuneML.encodings.onehot.OneHotReceptorEncoder.OneHotReceptorEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: Optional[str] = None, sequence_type: Optional[immuneML.environment.SequenceType.SequenceType] = None)

Bases: immuneML.encodings.onehot.OneHotEncoder.OneHotEncoder

One-hot encoded receptor data is represented in a matrix with dimensions:

[receptors, chains, sequence_lengths, one_hot_characters]

When use_positional_info is True, the last 3 indices in one_hot_characters represent the positional information:
  • start position (high when close to start)

  • middle position (high in the middle of the sequence)

  • end position (high when close to end)

immuneML.encodings.onehot.OneHotRepertoireEncoder module

class immuneML.encodings.onehot.OneHotRepertoireEncoder.OneHotRepertoireEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: Optional[str] = None, sequence_type: Optional[immuneML.environment.SequenceType.SequenceType] = None)

Bases: immuneML.encodings.onehot.OneHotEncoder.OneHotEncoder

One-hot encoded repertoire data is represented in a matrix with dimensions:

[repertoires, sequences, sequence_lengths, one_hot_characters]

When use_positional_info is True, the last 3 indices in one_hot_characters represent the positional information:
  • start position (high when close to start)

  • middle position (high in the middle of the sequence)

  • end position (high when close to end)
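When flatten is set to True, a 4-dimensional matrix like the one above is reduced to the 2-dimensional form [examples, other_dims_combined]. A minimal sketch of that reshaping on nested Python lists follows. The helper name `flatten_examples` and the toy dimensions are assumptions for illustration; the order in which immuneML combines the remaining dimensions is not specified here.

```python
def flatten_examples(encoded):
    """Keep the first axis (examples) and merge all remaining axes into one feature list."""
    def flat(item):
        if isinstance(item, list):
            for sub in item:
                yield from flat(sub)
        else:
            yield item

    return [list(flat(example)) for example in encoded]


# Toy data: 2 repertoires x 3 sequences x 4 positions x 5 one-hot characters, all zeros.
encoded = [[[[0] * 5 for _ in range(4)] for _ in range(3)] for _ in range(2)]

# Result: 2 rows, each holding 3 * 4 * 5 = 60 values.
flat_matrix = flatten_examples(encoded)
```

The 2-dimensional layout is what scikit-learn estimators expect as input, which is why flatten must be True for those methods.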

immuneML.encodings.onehot.OneHotSequenceEncoder module

class immuneML.encodings.onehot.OneHotSequenceEncoder.OneHotSequenceEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: Optional[str] = None, sequence_type: Optional[immuneML.environment.SequenceType.SequenceType] = None)

Bases: immuneML.encodings.onehot.OneHotEncoder.OneHotEncoder

One-hot encoded sequence data is represented in a matrix with dimensions:

[sequences, sequence_lengths, one_hot_characters]

When use_positional_info is True, the last 3 indices in one_hot_characters represent the positional information:
  • start position (high when close to start)

  • middle position (high in the middle of the sequence)

  • end position (high when close to end)
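Because the positional values occupy the last 3 indices of the one_hot_characters axis, they can be separated from the character encoding by slicing. The sketch below does this for a single encoded sequence. The helper name `split_positional`, the toy 4-letter alphabet and the example values are assumptions, as is the start, middle, end ordering, which simply follows the list above.

```python
def split_positional(encoded_sequence):
    """Split each position's feature vector into (one-hot part, positional part)."""
    one_hot_part = [row[:-3] for row in encoded_sequence]
    positional_part = [row[-3:] for row in encoded_sequence]
    return one_hot_part, positional_part


# One sequence of length 2: a toy 4-letter alphabet followed by 3 positional values
# (assumed order: start, middle, end).
encoded_sequence = [
    [1, 0, 0, 0, 1.0, 0.0, 0.0],
    [0, 0, 1, 0, 0.0, 0.0, 1.0],
]

characters, positions = split_positional(encoded_sequence)
```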

Module contents