immuneML.encodings.onehot package
Submodules
immuneML.encodings.onehot.OneHotEncoder module
class immuneML.encodings.onehot.OneHotEncoder.OneHotEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: Optional[str] = None, sequence_type: Optional[immuneML.environment.SequenceType.SequenceType] = None)
Bases: immuneML.encodings.DatasetEncoder.DatasetEncoder
One-hot encoding for repertoires, sequences or receptors. In one-hot encoding, each alphabet character (amino acid or nucleotide) is replaced by a sparse vector with one 1 and the rest zeroes. The position of the 1 represents the alphabet character.
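The idea can be illustrated with a minimal NumPy sketch. This is not immuneML's implementation: the function name `one_hot` and the alphabet order in `AMINO_ACIDS` are assumptions made for illustration only.

```python
import numpy as np

# Assumed alphabet order for illustration; the order used by immuneML may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence: str, alphabet: str = AMINO_ACIDS) -> np.ndarray:
    """Return a [sequence_length, alphabet_size] matrix with a single 1 per row."""
    index = {char: i for i, char in enumerate(alphabet)}
    encoded = np.zeros((len(sequence), len(alphabet)), dtype=np.int8)
    for position, char in enumerate(sequence):
        # The column of the 1 identifies the alphabet character.
        encoded[position, index[char]] = 1
    return encoded


encoded = one_hot("CAS")
print(encoded.shape)  # (3, 20)
```

Each row contains exactly one 1, so summing over the last axis gives a vector of ones.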
Parameters
use_positional_info (bool) – whether to include features representing the positional information. If True, three additional feature vectors will be added, representing the sequence start, sequence middle and sequence end. The values in these features are scaled between 0 and 1. A graphical representation of the values of these vectors is given below.
    Value of sequence start:       Value of sequence middle:      Value of sequence end:

    1 \                            1    /‾‾‾‾‾‾‾‾‾‾‾‾‾\           1                      /
       \                               /               \                                /
        \                             /                 \                              /
    0    \_________________        0 /                   \        0 _________________/
      <---sequence length--->        <---sequence length--->        <---sequence length--->
distance_to_seq_middle (int) – only applies when use_positional_info is True. This is the distance from the edge of the CDR3 sequence to the part of the sequence treated as the 'middle'. For example, if distance_to_seq_middle is 6 (default), all IMGT positions in the interval [111, 112) receive positional value 1. When using nucleotide sequences, note that the distance is measured in (amino acid) IMGT positions. If the complete sequence length is smaller than 2 * distance_to_seq_middle, the 'start' and 'end' vectors will not drop to 0, and the 'middle' vector will not reach 1. A graphical representation of the positional vectors for a sequence that is too short is given below.
    Value of sequence start        Value of sequence middle       Value of sequence end
    with very short sequence:      with very short sequence:      with very short sequence:

    1 \                            1                              1    /
       \                                                              /
        \                              /\                            /
    0                              0  /  \                        0
      <->                             <-->                          <->
flatten (bool) – whether to flatten the final one-hot matrix to a 2-dimensional matrix [examples, other_dims_combined]. This must be set to True when using one-hot encoding in combination with scikit-learn ML methods, such as LogisticRegression, SVM, RandomForestClassifier and KNN.
sequence_type – whether to use nucleotide or amino acid sequences for encoding. Valid values are 'nucleotide' and 'amino_acid'.
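The behaviour of the three positional vectors described under use_positional_info and distance_to_seq_middle can be sketched in plain Python. This is a simplified illustration, not immuneML's implementation: the function name `positional_features` is hypothetical, and distances are counted in sequence indices rather than IMGT positions, so the exact thresholds may differ from the library's.

```python
from typing import List, Tuple


def positional_features(
    length: int, distance_to_seq_middle: int
) -> Tuple[List[float], List[float], List[float]]:
    """Return (start, middle, end) vectors with values between 0 and 1.

    Assumption: piecewise-linear ramps over `distance_to_seq_middle`
    sequence indices, counted from either edge of the sequence.
    """
    start, middle, end = [], [], []
    for i in range(length):
        from_start = i
        from_end = length - 1 - i
        # 1 at the first position, falling linearly to 0.
        start.append(max(0.0, 1.0 - from_start / distance_to_seq_middle))
        # 1 at the last position, falling linearly to 0 towards the start.
        end.append(max(0.0, 1.0 - from_end / distance_to_seq_middle))
        # 0 at both edges, rising linearly to 1 towards the middle.
        middle.append(min(1.0, min(from_start, from_end) / distance_to_seq_middle))
    return start, middle, end


long_start, long_middle, long_end = positional_features(20, 6)
short_start, short_middle, short_end = positional_features(4, 6)
```

With a sequence of 20 positions the 'start' vector reaches 0 and the 'middle' vector reaches 1. With only 4 positions the ramps are cut off early, which mirrors the too-short case shown in the second diagram.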
YAML specification:
    one_hot_vanilla:
        OneHot:
            use_positional_info: False
            flatten: False
            sequence_type: amino_acid

    one_hot_positional:
        OneHot:
            use_positional_info: True
            distance_to_seq_middle: 3
            flatten: False
            sequence_type: nucleotide
dataset_mapping = {'ReceptorDataset': 'OneHotReceptorEncoder', 'RepertoireDataset': 'OneHotRepertoireEncoder', 'SequenceDataset': 'OneHotSequenceEncoder'}
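One way such a mapping could be used is to pick the encoder subclass from the class name of the dataset. The sketch below only illustrates that lookup pattern: the helper `encoder_name_for` and the stand-in `RepertoireDataset` class are hypothetical, and immuneML's actual dispatch logic may work differently.

```python
dataset_mapping = {
    "ReceptorDataset": "OneHotReceptorEncoder",
    "RepertoireDataset": "OneHotRepertoireEncoder",
    "SequenceDataset": "OneHotSequenceEncoder",
}


class RepertoireDataset:
    """Hypothetical stand-in for the real dataset class."""


def encoder_name_for(dataset) -> str:
    """Look up the encoder class name for a dataset by its class name."""
    dataset_type = type(dataset).__name__
    if dataset_type not in dataset_mapping:
        raise ValueError(f"No one-hot encoder registered for {dataset_type}")
    return dataset_mapping[dataset_type]


print(encoder_name_for(RepertoireDataset()))  # OneHotRepertoireEncoder
```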
encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams)
store(encoded_dataset, params: immuneML.encodings.EncoderParams.EncoderParams)
immuneML.encodings.onehot.OneHotReceptorEncoder module
class immuneML.encodings.onehot.OneHotReceptorEncoder.OneHotReceptorEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: Optional[str] = None, sequence_type: Optional[immuneML.environment.SequenceType.SequenceType] = None)
Bases: immuneML.encodings.onehot.OneHotEncoder.OneHotEncoder
One-hot encoded receptor data is represented in a matrix with dimensions:
[receptors, chains, sequence_lengths, one_hot_characters]
When use_positional_info is True, the last 3 indices in one_hot_characters represent the positional information:
- start position (high when close to start)
- middle position (high in the middle of the sequence)
- end position (high when close to end)
immuneML.encodings.onehot.OneHotRepertoireEncoder module
class immuneML.encodings.onehot.OneHotRepertoireEncoder.OneHotRepertoireEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: Optional[str] = None, sequence_type: Optional[immuneML.environment.SequenceType.SequenceType] = None)
Bases: immuneML.encodings.onehot.OneHotEncoder.OneHotEncoder
One-hot encoded repertoire data is represented in a matrix with dimensions:
[repertoires, sequences, sequence_lengths, one_hot_characters]
When use_positional_info is True, the last 3 indices in one_hot_characters represent the positional information:
- start position (high when close to start)
- middle position (high in the middle of the sequence)
- end position (high when close to end)
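To make the effect of the flatten parameter concrete, the following NumPy sketch reshapes a 4-dimensional array of the shape described above into [examples, other_dims_combined]. The sizes are made up for illustration, and the reshape is an assumption about what flattening amounts to, not a copy of immuneML's code.

```python
import numpy as np

# Hypothetical sizes: 2 repertoires, 5 sequences each,
# maximum sequence length 10, 20 amino acid characters.
encoded = np.zeros((2, 5, 10, 20))

# Keep the examples axis and merge all remaining axes into one feature axis,
# which is the 2-dimensional layout scikit-learn estimators expect.
flattened = encoded.reshape(encoded.shape[0], -1)
print(flattened.shape)  # (2, 1000)
```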
immuneML.encodings.onehot.OneHotSequenceEncoder module
class immuneML.encodings.onehot.OneHotSequenceEncoder.OneHotSequenceEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: Optional[str] = None, sequence_type: Optional[immuneML.environment.SequenceType.SequenceType] = None)
Bases: immuneML.encodings.onehot.OneHotEncoder.OneHotEncoder
One-hot encoded sequence data is represented in a matrix with dimensions:
[sequences, sequence_lengths, one_hot_characters]
When use_positional_info is True, the last 3 indices in one_hot_characters represent the positional information:
- start position (high when close to start)
- middle position (high in the middle of the sequence)
- end position (high when close to end)
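Given that layout, the positional channels can be separated from the character channels by slicing the last axis. The sketch below uses a zero-filled array with made-up sizes, and it assumes the three channels appear in the order listed above (start, middle, end).

```python
import numpy as np

# Hypothetical sizes: 4 sequences, maximum length 12, 20 amino acid characters.
n_sequences, max_length, alphabet_size = 4, 12, 20

# With use_positional_info, the last axis holds the alphabet plus 3 extra channels.
encoded = np.zeros((n_sequences, max_length, alphabet_size + 3))

# Character channels: everything except the last three indices.
one_hot_part = encoded[..., :-3]

# Positional channels, assuming the order start, middle, end.
start = encoded[..., -3]
middle = encoded[..., -2]
end = encoded[..., -1]

print(one_hot_part.shape, start.shape)  # (4, 12, 20) (4, 12)
```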