immuneML.encodings.onehot package
Submodules
immuneML.encodings.onehot.OneHotEncoder module
- class immuneML.encodings.onehot.OneHotEncoder.OneHotEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: str = None, sequence_type: SequenceType = None)[source]
Bases:
DatasetEncoder
One-hot encoding for repertoires, sequences or receptors. In one-hot encoding, each alphabet character (amino acid or nucleotide) is replaced by a sparse vector with one 1 and the rest zeroes. The position of the 1 represents the alphabet character.
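As a rough illustration of the idea described above, the following sketch one-hot encodes a single amino acid sequence with NumPy. It is not immuneML's implementation: the `ALPHABET` constant, its ordering, and the `one_hot` helper are assumptions made for this example only.

```python
import numpy as np

# Assumption: the 20 standard amino acids in alphabetical order. The alphabet
# and ordering used internally by immuneML may differ.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence: str, alphabet: str = ALPHABET) -> np.ndarray:
    """Return a [len(sequence), len(alphabet)] matrix with a single 1 per row."""
    index = {char: i for i, char in enumerate(alphabet)}
    encoded = np.zeros((len(sequence), len(alphabet)), dtype=np.float32)
    for row, char in enumerate(sequence):
        # The column of the 1 identifies the alphabet character.
        encoded[row, index[char]] = 1.0
    return encoded


encoded = one_hot("CASSL")
print(encoded.shape)           # (5, 20)
print(encoded.argmax(axis=1))  # column index of the 1 in each row
```

Each row contains exactly one 1, so the original sequence can be recovered from the column indices.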
- Parameters:
use_positional_info (bool) – whether to include features representing the positional information. If True, three additional feature vectors will be added, representing the sequence start, sequence middle and sequence end. The values in these features are scaled between 0 and 1. A graphical representation of the values of these vectors is given below.
      Value of sequence start:       Value of sequence middle:      Value of sequence end:
    1 \                            1    /‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾\       1                       /
       \                               /                   \                             /
        \                             /                     \                           /
    0    \_____________________    0 /                       \    0 _____________________/
      <----sequence length---->      <----sequence length---->      <----sequence length---->
distance_to_seq_middle (int) – only applies when use_positional_info is True. This is the distance from the edge of the CDR3 sequence (IMGT positions 105 and 117) to the portion of the sequence that is considered 'middle'. For example: if distance_to_seq_middle is 6 (default), all IMGT positions in the interval [111, 112) are treated as the middle. When using nucleotide sequences: note that the distance is measured in (amino acid) IMGT positions. If the complete sequence length is smaller than 2 * distance_to_seq_middle, the 'start' and 'end' vectors will not reach 0, and the 'middle' vector will not reach 1. A graphical representation of the positional vectors with a too short sequence is given below.
      Value of sequence start        Value of sequence middle       Value of sequence end
      with very short sequence:      with very short sequence:      with very short sequence:
    1 \                            1                              1    /
       \                                                              /
        \                               /\                           /
    0                              0   /  \                       0
      <->                              <-->                          <->
flatten (bool) – whether to flatten the final onehot matrix to a 2-dimensional matrix [examples, other_dims_combined]. This must be set to True when using onehot encoding in combination with scikit-learn ML methods, such as LogisticRegression, SVM, SVC, RandomForestClassifier and KNN.
sequence_type – whether to use nucleotide or amino acid sequence for encoding. Valid values are 'nucleotide' and 'amino_acid'.
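The shape of the three positional vectors can be sketched as follows. This is a simplified illustration, not immuneML's implementation: it measures distances in plain sequence indices rather than IMGT positions, and the `positional_vectors` helper and its piecewise-linear formulas are assumptions chosen to reproduce the qualitative behaviour described for use_positional_info and distance_to_seq_middle.

```python
import numpy as np


def positional_vectors(seq_length: int, distance_to_seq_middle: int = 6) -> np.ndarray:
    """Return a [seq_length, 3] array with columns (start, middle, end).

    Hypothetical helper: distances are counted in sequence indices, which is a
    simplification of the IMGT-position-based behaviour described above.
    """
    idx = np.arange(seq_length)
    from_start = idx                   # distance to the first position
    from_end = seq_length - 1 - idx    # distance to the last position
    d = distance_to_seq_middle
    start = np.clip(1 - from_start / d, 0, 1)   # high when close to the start
    end = np.clip(1 - from_end / d, 0, 1)       # high when close to the end
    middle = np.clip(np.minimum(from_start, from_end) / d, 0, 1)  # high in the middle
    return np.stack([start, middle, end], axis=1)


long_seq = positional_vectors(20)   # long enough for every vector to span 0..1
short_seq = positional_vectors(4)   # much shorter than 2 * distance_to_seq_middle

print(long_seq[:, 0].min(), long_seq[:, 1].max())
print(short_seq[:, 0].min(), short_seq[:, 1].max())
```

For the short sequence, the 'start' and 'end' columns stay above 0 and the 'middle' column stays below 1, which matches the second diagram qualitatively.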
YAML specification:
    one_hot_vanilla:
        OneHot:
            use_positional_info: False
            flatten: False
            sequence_type: amino_acid
    one_hot_positional:
        OneHot:
            use_positional_info: True
            distance_to_seq_middle: 3
            flatten: False
            sequence_type: nucleotide
- dataset_mapping = {'ReceptorDataset': 'OneHotReceptorEncoder', 'RepertoireDataset': 'OneHotRepertoireEncoder', 'SequenceDataset': 'OneHotSequenceEncoder'}
- encode(dataset, params: EncoderParams)[source]
- store(encoded_dataset, params: EncoderParams)[source]
immuneML.encodings.onehot.OneHotReceptorEncoder module
- class immuneML.encodings.onehot.OneHotReceptorEncoder.OneHotReceptorEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: str = None, sequence_type: SequenceType = None)[source]
Bases:
OneHotEncoder
- One-hot encoded receptor data is represented in a matrix with dimensions:
[receptors, chains, sequence_lengths, one_hot_characters]
- When use_positional_info is True, the last 3 indices in one_hot_characters represent the positional information:
start position (high when close to start)
middle position (high in the middle of the sequence)
end position (high when close to end)
immuneML.encodings.onehot.OneHotRepertoireEncoder module
- class immuneML.encodings.onehot.OneHotRepertoireEncoder.OneHotRepertoireEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: str = None, sequence_type: SequenceType = None)[source]
Bases:
OneHotEncoder
- One-hot encoded repertoire data is represented in a matrix with dimensions:
[repertoires, sequences, sequence_lengths, one_hot_characters]
- When use_positional_info is True, the last 3 indices in one_hot_characters represent the positional information:
start position (high when close to start)
middle position (high in the middle of the sequence)
end position (high when close to end)
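The effect of the flatten parameter on a matrix of this shape can be illustrated with a plain NumPy reshape. This is a sketch only: the array sizes are hypothetical, and immuneML may perform the flattening differently internally.

```python
import numpy as np

# Hypothetical sizes: 2 repertoires, 3 sequences each, sequence length 5,
# and 20 one-hot characters (amino acids, no positional information).
encoded = np.zeros((2, 3, 5, 20), dtype=np.float32)

# Keep the first axis (examples) and merge all remaining axes, giving the
# 2-dimensional [examples, other_dims_combined] layout.
flattened = encoded.reshape(encoded.shape[0], -1)

print(flattened.shape)  # (2, 300)
```

A 2-dimensional [examples, features] matrix is the input layout that scikit-learn estimators expect, which is why flatten must be True for those methods.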
immuneML.encodings.onehot.OneHotSequenceEncoder module
- class immuneML.encodings.onehot.OneHotSequenceEncoder.OneHotSequenceEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: str = None, sequence_type: SequenceType = None)[source]
Bases:
OneHotEncoder
- One-hot encoded sequence data is represented in a matrix with dimensions:
[sequences, sequence_lengths, one_hot_characters]
- When use_positional_info is True, the last 3 indices in one_hot_characters represent the positional information:
start position (high when close to start)
middle position (high in the middle of the sequence)
end position (high when close to end)
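To make the layout of the last axis concrete, the sketch below appends three positional channels to a one-hot matrix and then slices them off again. Everything here is illustrative: the sizes, the random character indices, and the linear positional values are assumptions, not the values immuneML computes.

```python
import numpy as np

n_sequences, seq_length, n_chars = 4, 7, 20  # hypothetical sizes

# Random character indices stand in for real sequences.
rng = np.random.default_rng(0)
char_idx = rng.integers(0, n_chars, size=(n_sequences, seq_length))
one_hot = np.eye(n_chars, dtype=np.float32)[char_idx]  # [4, 7, 20]

# Placeholder positional values (assumed linear ramps, for illustration only).
start = np.linspace(1.0, 0.0, seq_length)   # high when close to start
end = start[::-1]                            # high when close to end
middle = 1.0 - np.abs(start - end)           # high in the middle
positional = np.broadcast_to(
    np.stack([start, middle, end], axis=-1), (n_sequences, seq_length, 3)
)

# The positional channels occupy the last 3 indices of the final axis.
with_positions = np.concatenate([one_hot, positional], axis=-1)  # [4, 7, 23]

chars_part = with_positions[..., :-3]  # one-hot characters
pos_part = with_positions[..., -3:]    # start, middle, end

print(with_positions.shape)
print(pos_part[0, 0], pos_part[0, -1])
```

Slicing with `[..., -3:]` works the same way for the receptor and repertoire layouts, since only the last axis is affected.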