immuneML.encodings.onehot package

Submodules

immuneML.encodings.onehot.OneHotEncoder module

class immuneML.encodings.onehot.OneHotEncoder.OneHotEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: Optional[str] = None, sequence_type: Optional[immuneML.environment.SequenceType.SequenceType] = None)

Bases: immuneML.encodings.DatasetEncoder.DatasetEncoder

One-hot encoding for repertoires, sequences or receptors. In one-hot encoding, each alphabet character (amino acid or nucleotide) is replaced by a sparse vector with one 1 and the rest zeroes. The position of the 1 represents the alphabet character.
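As a rough illustration of the idea described above, the following sketch one-hot encodes a short amino acid sequence with NumPy. It is not immuneML's implementation: the helper `one_hot_encode`, the constant `AMINO_ACIDS` and the alphabet ordering are assumptions made for this example.

```python
import numpy as np

# The 20 standard amino acids; this ordering is an assumption for the example.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_encode(sequence: str, alphabet: str = AMINO_ACIDS) -> np.ndarray:
    """Return a [len(sequence), len(alphabet)] matrix with a single 1 per row."""
    index = {char: i for i, char in enumerate(alphabet)}
    encoded = np.zeros((len(sequence), len(alphabet)), dtype=np.int8)
    for position, char in enumerate(sequence):
        # The column of the 1 identifies the alphabet character.
        encoded[position, index[char]] = 1
    return encoded


encoded = one_hot_encode("CAS")
print(encoded.shape)  # (3, 20)
```

Each row contains exactly one 1, so the original sequence can be recovered by reading off the column index of the 1 in every row.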

Parameters

use_positional_info (bool) – whether to include features representing the positional information. If True, three additional feature vectors will be added, representing the sequence start, sequence middle and sequence end. The values in these features are scaled between 0 and 1. A graphical representation of the values of these vectors is given below.

[Figure: three plots over the sequence length, titled "Value of sequence start", "Value of sequence middle" and "Value of sequence end", each with a vertical axis from 0 to 1. The start vector is high close to the sequence start, the middle vector is high in the middle of the sequence, and the end vector is high close to the sequence end.]

distance_to_seq_middle (int) – only applies when use_positional_info is True. This is the distance from the edge of the CDR3 sequence to the part of the sequence that is treated as the middle. For example: if distance_to_seq_middle is 6 (default), all IMGT positions in the interval [111, 112) receive positional value 1. When using nucleotide sequences: note that the distance is measured in (amino acid) IMGT positions. If the complete sequence length is smaller than 2 * distance_to_seq_middle, the values of the 'start' and 'end' vectors will not reach 0, and the values of the 'middle' vector will not reach 1. A graphical representation of the positional vectors with a too short sequence is given below.

[Figure: three plots over a very short sequence length, titled "Value of sequence start with very short sequence", "Value of sequence middle with very short sequence" and "Value of sequence end with very short sequence", each with a vertical axis from 0 to 1.]

flatten (bool) – whether to flatten the final onehot matrix to a 2-dimensional matrix [examples, other_dims_combined]. This must be set to True when using onehot encoding in combination with scikit-learn ML methods, such as LogisticRegression, SVM, RandomForestClassifier and KNN.

sequence_type – whether to use nucleotide or amino acid sequence for encoding. Valid values are 'nucleotide' and 'amino_acid'.
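To make the effect of flatten concrete, here is a minimal NumPy sketch of collapsing a one-hot array into the 2-dimensional [examples, other_dims_combined] layout. The array sizes and the randomly generated character indices are placeholders for this example, and the reshape only illustrates the resulting shape; it is not taken from immuneML's code.

```python
import numpy as np

n_examples, seq_len, n_chars = 4, 10, 20

# Random character indices stand in for real sequences (illustrative data only).
rng = np.random.default_rng(0)
char_idx = rng.integers(0, n_chars, size=(n_examples, seq_len))

# Indexing an identity matrix yields one-hot rows: [examples, sequence_lengths, one_hot_characters].
one_hot = np.eye(n_chars, dtype=np.int8)[char_idx]

# Collapse every dimension except the first: [examples, other_dims_combined].
flattened = one_hot.reshape(n_examples, -1)

print(one_hot.shape)    # (4, 10, 20)
print(flattened.shape)  # (4, 200)
```

A 2-dimensional [n_samples, n_features] array is the input layout that scikit-learn estimators expect, which is why the unflattened form cannot be passed to them directly.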

YAML specification:

one_hot_vanilla:
    OneHot:
        use_positional_info: False
        flatten: False
        sequence_type: amino_acid

one_hot_positional:
    OneHot:
        use_positional_info: True
        distance_to_seq_middle: 3
        flatten: False
        sequence_type: nucleotide

static build_object(dataset=None, **params)

dataset_mapping = {'ReceptorDataset': 'OneHotReceptorEncoder', 'RepertoireDataset': 'OneHotRepertoireEncoder', 'SequenceDataset': 'OneHotSequenceEncoder'}

encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams)

store(encoded_dataset, params: immuneML.encodings.EncoderParams.EncoderParams)
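The dataset_mapping attribute suggests that the concrete encoder is chosen from the class name of the dataset. The sketch below shows one way such a lookup could work; how build_object actually uses the mapping is an assumption here. The function `resolve_encoder_name` and the stand-in classes are hypothetical and exist only for this example.

```python
dataset_mapping = {
    "ReceptorDataset": "OneHotReceptorEncoder",
    "RepertoireDataset": "OneHotRepertoireEncoder",
    "SequenceDataset": "OneHotSequenceEncoder",
}


class SequenceDataset:
    """Hypothetical stand-in for the real dataset class."""


class UnknownDataset:
    """Hypothetical dataset type that has no entry in the mapping."""


def resolve_encoder_name(dataset) -> str:
    """Look up the encoder class name by the dataset's class name."""
    dataset_type = type(dataset).__name__
    try:
        return dataset_mapping[dataset_type]
    except KeyError:
        raise ValueError(
            f"No one-hot encoder is defined for dataset type {dataset_type}."
        ) from None


print(resolve_encoder_name(SequenceDataset()))  # OneHotSequenceEncoder
```

Keying the mapping on class names keeps the dispatch table declarative: supporting another dataset type would only require a new entry.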

immuneML.encodings.onehot.OneHotReceptorEncoder module

class immuneML.encodings.onehot.OneHotReceptorEncoder.OneHotReceptorEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: Optional[str] = None, sequence_type: Optional[immuneML.environment.SequenceType.SequenceType] = None)

Bases: immuneML.encodings.onehot.OneHotEncoder.OneHotEncoder

One-hot encoded receptor data is represented in a matrix with dimensions:

[receptors, chains, sequence_lengths, one_hot_characters]

When use_positional_info is True, the last 3 indices in one_hot_characters represent the positional information:

start position (high when close to start)
middle position (high in the middle of the sequence)
end position (high when close to end)
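The general shape of the three positional features listed above can be sketched with simple linear ramps. This is an illustration only, not immuneML's computation: the function `positional_features` is hypothetical, and it works on 0-based sequence indices rather than IMGT positions, which is a simplifying assumption.

```python
import numpy as np


def positional_features(seq_len: int, distance_to_seq_middle: int = 6) -> np.ndarray:
    """Return a [seq_len, 3] array with columns (start, middle, end), each in [0, 1]."""
    idx = np.arange(seq_len)
    # Linear ramp: 1 at the first position, falling to 0 after distance_to_seq_middle steps.
    start = np.clip(1 - idx / distance_to_seq_middle, 0, 1)
    # Mirror image of the start ramp: 1 at the last position.
    end = start[::-1]
    # High wherever neither edge feature is active.
    middle = 1 - np.maximum(start, end)
    return np.stack([start, middle, end], axis=1)


long_features = positional_features(seq_len=20)
short_features = positional_features(seq_len=4)
print(long_features.shape)   # (20, 3)
print(short_features.shape)  # (4, 3)
```

With a sequence shorter than 2 * distance_to_seq_middle, as in `short_features`, the start and end columns never drop to 0 and the middle column never reaches 1, mirroring the behaviour described for the distance_to_seq_middle parameter.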

immuneML.encodings.onehot.OneHotRepertoireEncoder module

class immuneML.encodings.onehot.OneHotRepertoireEncoder.OneHotRepertoireEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: Optional[str] = None, sequence_type: Optional[immuneML.environment.SequenceType.SequenceType] = None)

Bases: immuneML.encodings.onehot.OneHotEncoder.OneHotEncoder

One-hot encoded repertoire data is represented in a matrix with dimensions:

[repertoires, sequences, sequence_lengths, one_hot_characters]

When use_positional_info is True, the last 3 indices in one_hot_characters represent the positional information:

start position (high when close to start)
middle position (high in the middle of the sequence)
end position (high when close to end)

immuneML.encodings.onehot.OneHotSequenceEncoder module

class immuneML.encodings.onehot.OneHotSequenceEncoder.OneHotSequenceEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: Optional[str] = None, sequence_type: Optional[immuneML.environment.SequenceType.SequenceType] = None)

Bases: immuneML.encodings.onehot.OneHotEncoder.OneHotEncoder

One-hot encoded sequence data is represented in a matrix with dimensions:

[sequences, sequence_lengths, one_hot_characters]

When use_positional_info is True, the last 3 indices in one_hot_characters represent the positional information:

start position (high when close to start)
middle position (high in the middle of the sequence)
end position (high when close to end)
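Because the positional features occupy the last 3 indices of the final axis, they can be separated from the character channels in the same way for every layout documented on this page. The sketch below assumes a 20-letter amino acid alphabet and assumes the three features are stored in the order listed (start, middle, end). The helper `split_positional` is hypothetical, and the zero-filled arrays are placeholders for real encoded data.

```python
import numpy as np

N_ALPHABET = 20    # assumed amino acid alphabet size
N_POSITIONAL = 3   # start, middle, end

# Placeholder arrays with the documented layouts.
sequence_level = np.zeros((5, 12, N_ALPHABET + N_POSITIONAL))     # [sequences, sequence_lengths, one_hot_characters]
receptor_level = np.zeros((5, 2, 12, N_ALPHABET + N_POSITIONAL))  # [receptors, chains, sequence_lengths, one_hot_characters]


def split_positional(encoded: np.ndarray):
    """Split the last axis into character channels and the three positional features."""
    characters = encoded[..., :-N_POSITIONAL]
    start = encoded[..., -3]
    middle = encoded[..., -2]
    end = encoded[..., -1]
    return characters, start, middle, end


characters, start, middle, end = split_positional(sequence_level)
print(characters.shape, start.shape)  # (5, 12, 20) (5, 12)
```

The ellipsis index leaves all leading axes untouched, so the same helper applies to the 4-dimensional receptor and repertoire layouts.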
