immuneML.encodings.onehot package

Submodules

immuneML.encodings.onehot.OneHotEncoder module

class immuneML.encodings.onehot.OneHotEncoder.OneHotEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: str = None, sequence_type: SequenceType = None)[source]

Bases: DatasetEncoder

One-hot encoding for repertoires, sequences or receptors. In one-hot encoding, each alphabet character (amino acid or nucleotide) is replaced by a sparse vector with one 1 and the rest zeroes. The position of the 1 represents the alphabet character.
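The idea can be illustrated with a minimal sketch in plain Python. This is not immuneML's implementation: the `one_hot` helper and the ordering of the `AMINO_ACIDS` alphabet are assumptions made for illustration only.

```python
# The 20 standard amino acids; the ordering is an arbitrary choice for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(sequence, alphabet=AMINO_ACIDS):
    """Return one row per sequence position, each with a single 1."""
    index = {char: i for i, char in enumerate(alphabet)}
    rows = []
    for char in sequence:
        row = [0] * len(alphabet)
        row[index[char]] = 1  # the position of the 1 identifies the character
        rows.append(row)
    return rows

encoded = one_hot("CASSL")
print(len(encoded), len(encoded[0]))  # 5 20
print(one_hot("ACGT", alphabet="ACGT"))
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```

The same helper works for nucleotide sequences by passing a 4-character alphabet, as in the last call.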

Parameters:
  • use_positional_info (bool) – whether to include features representing the positional information. If True, three additional feature vectors will be added, representing the sequence start, sequence middle and sequence end. The values in these features are scaled between 0 and 1. A graphical representation of the values of these vectors is given below.

    [Diagram: three curves over the sequence length, each ranging from 0 to 1: value of sequence start, value of sequence middle, value of sequence end.]

  • distance_to_seq_middle (int) – only applies when use_positional_info is True. This is the distance from the edge of the CDR3 sequence to the part of the sequence that is considered the middle. For example, if distance_to_seq_middle is 6 (default), all IMGT positions in the interval [111, 112) are considered the middle. When using nucleotide sequences, note that the distance is measured in (amino acid) IMGT positions. If the complete sequence length is smaller than 2 * distance_to_seq_middle, the 'start' and 'end' vectors will not reach 0, and the maximum value of the 'middle' vector will not reach 1. A graphical representation of the positional vectors with a too short sequence is given below.

    [Diagram: value of sequence start, value of sequence middle and value of sequence end with a very short sequence; the start and end curves stay above 0 and the middle curve stays below 1.]

  • flatten (bool) – whether to flatten the final onehot matrix to a 2-dimensional matrix [examples, other_dims_combined]. This must be set to True when using onehot encoding in combination with scikit-learn ML methods, such as LogisticRegression, SVM, SVC, RandomForestClassifier and KNN.

  • sequence_type (SequenceType) – whether to use nucleotide or amino acid sequence for encoding. Valid values are ‘nucleotide’ and ‘amino_acid’.
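The behaviour of the three positional vectors can be approximated with a simplified sketch. It is an assumption-laden illustration, not the immuneML code: the hypothetical `positional_features` helper measures distances in plain sequence indices rather than IMGT positions, so the exact length at which the vectors stop reaching 0 or 1 may differ from immuneML.

```python
def positional_features(length, distance_to_seq_middle=6):
    """Return [start, middle, end] values in [0, 1] for every position.

    Simplified: distances are counted in sequence indices, not IMGT positions.
    """
    d = distance_to_seq_middle
    features = []
    for i in range(length):
        from_start, from_end = i, length - 1 - i
        start = max(0.0, 1 - from_start / d)             # high close to the start
        end = max(0.0, 1 - from_end / d)                 # high close to the end
        middle = min(1.0, min(from_start, from_end) / d) # high away from both edges
        features.append([start, middle, end])
    return features

long_seq = positional_features(20)
short_seq = positional_features(4)
print(long_seq[0], long_seq[-1])  # [1.0, 0.0, 0.0] [0.0, 0.0, 1.0]
```

With the very short sequence (`short_seq`), the start and end values never drop to 0 and the middle value never reaches 1, which mirrors the short-sequence case described for distance_to_seq_middle.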

YAML specification:

one_hot_vanilla:
    OneHot:
        use_positional_info: False
        flatten: False
        sequence_type: amino_acid

one_hot_positional:
    OneHot:
        use_positional_info: True
        distance_to_seq_middle: 3
        flatten: False
        sequence_type: nucleotide
static build_object(dataset=None, **params)[source]
dataset_mapping = {'ReceptorDataset': 'OneHotReceptorEncoder', 'RepertoireDataset': 'OneHotRepertoireEncoder', 'SequenceDataset': 'OneHotSequenceEncoder'}
encode(dataset, params: EncoderParams)[source]
store(encoded_dataset, params: EncoderParams)[source]
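
The dataset_mapping attribute suggests that build_object selects an encoder subclass from the type of the dataset it receives. The sketch below shows one way such a lookup could work; the `encoder_name_for` helper and the empty dataset classes are hypothetical stand-ins, and the actual lookup inside immuneML may differ.

```python
# Copied from the documented dataset_mapping attribute.
dataset_mapping = {
    "ReceptorDataset": "OneHotReceptorEncoder",
    "RepertoireDataset": "OneHotRepertoireEncoder",
    "SequenceDataset": "OneHotSequenceEncoder",
}

# Hypothetical stand-ins for the real immuneML dataset classes.
class RepertoireDataset: ...
class SequenceDataset: ...

def encoder_name_for(dataset):
    """Look up the encoder class name by the dataset's class name."""
    dataset_type = type(dataset).__name__
    try:
        return dataset_mapping[dataset_type]
    except KeyError:
        raise ValueError(f"No one-hot encoder registered for {dataset_type}") from None

print(encoder_name_for(RepertoireDataset()))  # OneHotRepertoireEncoder
print(encoder_name_for(SequenceDataset()))    # OneHotSequenceEncoder
```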

immuneML.encodings.onehot.OneHotReceptorEncoder module

class immuneML.encodings.onehot.OneHotReceptorEncoder.OneHotReceptorEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: str = None, sequence_type: SequenceType = None)[source]

Bases: OneHotEncoder

One-hot encoded receptor data is represented in a matrix with dimensions:

[receptors, chains, sequence_lengths, one_hot_characters]

When use_positional_info is True, the last 3 indices in one_hot_characters represent the positional information:
  • start position (high when close to start)

  • middle position (high in the middle of the sequence)

  • end position (high when close to end)
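
As a rough illustration of this layout, the following NumPy sketch separates the one-hot part from the positional part of a receptor encoding. The array is a zero-filled placeholder, the sizes are arbitrary, and the start, middle, end ordering of the last three indices is assumed from the order of the list above.

```python
import numpy as np

n_receptors, n_chains, seq_len, alphabet_size = 2, 2, 5, 20

# Placeholder shaped [receptors, chains, sequence_lengths, one_hot_characters];
# with positional information the last axis holds alphabet_size + 3 features.
encoded = np.zeros((n_receptors, n_chains, seq_len, alphabet_size + 3))

characters = encoded[..., :-3]  # the one-hot characters
start, middle, end = (encoded[..., i] for i in (-3, -2, -1))  # assumed order

print(characters.shape)  # (2, 2, 5, 20)
print(start.shape)       # (2, 2, 5)
```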

immuneML.encodings.onehot.OneHotRepertoireEncoder module

class immuneML.encodings.onehot.OneHotRepertoireEncoder.OneHotRepertoireEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: str = None, sequence_type: SequenceType = None)[source]

Bases: OneHotEncoder

One-hot encoded repertoire data is represented in a matrix with dimensions:

[repertoires, sequences, sequence_lengths, one_hot_characters]

When use_positional_info is True, the last 3 indices in one_hot_characters represent the positional information:
  • start position (high when close to start)

  • middle position (high in the middle of the sequence)

  • end position (high when close to end)
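
For repertoire data, the flatten option of OneHotEncoder reduces this 4-dimensional matrix to [examples, other_dims_combined]. A minimal NumPy sketch of that reshaping follows; the sizes are arbitrary, and the use of `reshape` is an assumption about how such flattening can be done, not a statement about immuneML internals.

```python
import numpy as np

# Placeholder shaped [repertoires, sequences, sequence_lengths, one_hot_characters].
encoded = np.zeros((3, 10, 12, 20))

# Keep the examples axis and merge all remaining axes into one feature axis.
flat = encoded.reshape(encoded.shape[0], -1)

print(flat.shape)  # (3, 2400)
```

A 2-dimensional matrix of this form is what scikit-learn estimators expect as input, which is why flatten must be True for those methods.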

immuneML.encodings.onehot.OneHotSequenceEncoder module

class immuneML.encodings.onehot.OneHotSequenceEncoder.OneHotSequenceEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: str = None, sequence_type: SequenceType = None)[source]

Bases: OneHotEncoder

One-hot encoded sequence data is represented in a matrix with dimensions:

[sequences, sequence_lengths, one_hot_characters]

When use_positional_info is True, the last 3 indices in one_hot_characters represent the positional information:
  • start position (high when close to start)

  • middle position (high in the middle of the sequence)

  • end position (high when close to end)
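
The resulting matrix shape for a sequence dataset can be worked out from the encoder parameters. The helper below is a hypothetical sketch: `encoded_shape` is not part of immuneML, and the alphabet sizes (20 amino acids, 4 nucleotides) are assumptions that ignore any extra characters the library may support.

```python
# Assumed alphabet sizes: 20 standard amino acids, 4 nucleotides.
ALPHABET_SIZE = {"amino_acid": 20, "nucleotide": 4}

def encoded_shape(n_sequences, sequence_length, sequence_type="amino_acid",
                  use_positional_info=False, flatten=False):
    """Return the expected shape of the encoded matrix."""
    # Positional information adds three features: start, middle and end.
    n_features = ALPHABET_SIZE[sequence_type] + (3 if use_positional_info else 0)
    if flatten:
        return (n_sequences, sequence_length * n_features)
    return (n_sequences, sequence_length, n_features)

print(encoded_shape(100, 15))                                          # (100, 15, 20)
print(encoded_shape(100, 15, "nucleotide", use_positional_info=True))  # (100, 15, 7)
print(encoded_shape(100, 15, flatten=True))                            # (100, 300)
```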

Module contents