immuneML.encodings.onehot package

Submodules

immuneML.encodings.onehot.OneHotEncoder module

class immuneML.encodings.onehot.OneHotEncoder.OneHotEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: Optional[str] = None, sequence_type: Optional[immuneML.environment.SequenceType.SequenceType] = None)

Bases: immuneML.encodings.DatasetEncoder.DatasetEncoder

One-hot encoding for repertoires, sequences or receptors. In one-hot encoding, each alphabet character (amino acid or nucleotide) is replaced by a sparse vector with one 1 and the rest zeroes. The position of the 1 represents the alphabet character.
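The idea can be sketched in plain Python. This is a minimal illustration, not immuneML's implementation: the helper `one_hot` and the alphabet ordering in `AMINO_ACIDS` are assumptions made for the example.

```python
# Illustrative alphabet: the 20 standard amino acids in alphabetical order.
# The ordering immuneML uses internally is not stated here, so this is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence, alphabet=AMINO_ACIDS):
    """Return one row per character, each with a single 1 at the character's index."""
    index = {char: i for i, char in enumerate(alphabet)}
    encoded = []
    for char in sequence:
        row = [0] * len(alphabet)
        row[index[char]] = 1
        encoded.append(row)
    return encoded


encoded = one_hot("CAS")
print(len(encoded), len(encoded[0]))  # 3 20
print(encoded[0].index(1))            # 1, the index of 'C' in AMINO_ACIDS
```

Each row has exactly one non-zero entry, so the matrix for a sequence has shape [sequence length, alphabet size].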

Parameters

use_positional_info (bool) – whether to include features representing the positional information. If True, three additional feature vectors will be added, representing the sequence start, sequence middle and sequence end. The values in these features are scaled between 0 and 1. A graphical representation of the values of these vectors is given below.

      Value of sequence start:          Value of sequence middle:         Value of sequence end:

    1 \                               1    /‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾\          1                          /
       \                                  /                   \                                   /
        \                                /                     \                                 /
    0    \_____________________       0 /                       \       0  _____________________/
      <----sequence length---->         <----sequence length---->         <----sequence length---->

distance_to_seq_middle (int) – only applies when use_positional_info is True. This is the distance from the edge of the CDR3 sequence to the part of the sequence that is considered the middle. For example: if distance_to_seq_middle is 6 (default), all IMGT positions in the interval [111, 112) are considered the middle. When using nucleotide sequences: note that the distance is measured in (amino acid) IMGT positions. If the complete sequence length is smaller than 2 * distance_to_seq_middle, the 'start' and 'end' vectors will not reach 0, and the maximum value of the 'middle' vector will not reach 1. A graphical representation of the positional vectors with a too short sequence is given below.

    Value of sequence start           Value of sequence middle          Value of sequence end
    with very short sequence:         with very short sequence:         with very short sequence:

         1 \                               1                                 1    /
            \                                                                    /
             \                                /\                                /
         0                                 0 /  \                            0
           <->                               <-->                               <->

flatten (bool) – whether to flatten the final onehot matrix to a 2-dimensional matrix [examples, other_dims_combined]. This must be set to True when using onehot encoding in combination with scikit-learn ML methods, such as LogisticRegression, SVM, SVC, RandomForestClassifier and KNN.
sequence_type – whether to use nucleotide or amino acid sequence for encoding. Valid values are ‘nucleotide’ and ‘amino_acid’.
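The behaviour of the three positional vectors can be approximated with simple linear ramps. The sketch below is a simplification under stated assumptions: `positional_vectors` is a hypothetical helper, it works on plain sequence indices rather than IMGT positions, and the exact values and thresholds in immuneML may differ. It only reproduces the qualitative shapes described for use_positional_info and distance_to_seq_middle.

```python
def positional_vectors(seq_length, distance_to_seq_middle):
    """Return (start, middle, end) ramps for a sequence of seq_length positions.

    Hypothetical helper for illustration; not part of immuneML.
    """
    d = distance_to_seq_middle
    last = seq_length - 1
    # 'start' falls linearly from 1 at the first position and is clipped at 0.
    start = [max(0.0, 1 - i / d) for i in range(seq_length)]
    # 'end' mirrors 'start': it rises to 1 at the last position.
    end = [max(0.0, 1 - (last - i) / d) for i in range(seq_length)]
    # 'middle' grows with the distance to the nearest edge and is capped at 1.
    middle = [min(1.0, i / d, (last - i) / d) for i in range(seq_length)]
    return start, middle, end


# Long sequence: 'start' and 'end' reach 0 and 'middle' reaches 1.
start, middle, end = positional_vectors(seq_length=12, distance_to_seq_middle=3)
print(min(start), max(middle), min(end))  # 0.0 1.0 0.0

# Very short sequence: 'start' never drops to 0 and 'middle' never reaches 1.
short_start, short_middle, short_end = positional_vectors(seq_length=3, distance_to_seq_middle=6)
print(min(short_start) > 0, max(short_middle) < 1)  # True True
```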

YAML specification:

one_hot_vanilla:
    OneHot:
        use_positional_info: False
        flatten: False
        sequence_type: amino_acid

one_hot_positional:
    OneHot:
        use_positional_info: True
        distance_to_seq_middle: 3
        flatten: False
        sequence_type: nucleotide

static build_object(dataset=None, **params)

dataset_mapping = {'ReceptorDataset': 'OneHotReceptorEncoder', 'RepertoireDataset': 'OneHotRepertoireEncoder', 'SequenceDataset': 'OneHotSequenceEncoder'}
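The mapping suggests that the encoder subclass is chosen from the class name of the dataset. A minimal sketch of such a lookup is given below; `encoder_name_for` and the empty `SequenceDataset` stand-in are hypothetical and only illustrate the dispatch idea, not the actual body of build_object.

```python
# Mapping copied from the documented class attribute.
dataset_mapping = {
    "ReceptorDataset": "OneHotReceptorEncoder",
    "RepertoireDataset": "OneHotRepertoireEncoder",
    "SequenceDataset": "OneHotSequenceEncoder",
}


class SequenceDataset:
    """Hypothetical stand-in for an immuneML dataset; only its class name matters here."""


def encoder_name_for(dataset):
    """Look up the encoder class name from the dataset's class name."""
    dataset_type = type(dataset).__name__
    if dataset_type not in dataset_mapping:
        raise ValueError(f"One-hot encoding is not defined for dataset type {dataset_type}.")
    return dataset_mapping[dataset_type]


print(encoder_name_for(SequenceDataset()))  # OneHotSequenceEncoder
```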

encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams)

store(encoded_dataset, params: immuneML.encodings.EncoderParams.EncoderParams)

immuneML.encodings.onehot.OneHotReceptorEncoder module

class immuneML.encodings.onehot.OneHotReceptorEncoder.OneHotReceptorEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: Optional[str] = None, sequence_type: Optional[immuneML.environment.SequenceType.SequenceType] = None)

Bases: immuneML.encodings.onehot.OneHotEncoder.OneHotEncoder

One-hot encoded receptor data is represented in a matrix with dimensions:

[receptors, chains, sequence_lengths, one_hot_characters]

When use_positional_info is True, the last 3 indices in one_hot_characters represent the positional information:

start position (high when close to start)
middle position (high in the middle of the sequence)
end position (high when close to end)

immuneML.encodings.onehot.OneHotRepertoireEncoder module

class immuneML.encodings.onehot.OneHotRepertoireEncoder.OneHotRepertoireEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: Optional[str] = None, sequence_type: Optional[immuneML.environment.SequenceType.SequenceType] = None)

Bases: immuneML.encodings.onehot.OneHotEncoder.OneHotEncoder

One-hot encoded repertoire data is represented in a matrix with dimensions:

[repertoires, sequences, sequence_lengths, one_hot_characters]

When use_positional_info is True, the last 3 indices in one_hot_characters represent the positional information:

start position (high when close to start)
middle position (high in the middle of the sequence)
end position (high when close to end)
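To make the layout concrete, the sketch below builds a toy nested list with these four dimensions and splits one position into its character part and its positional part. The dimension sizes are made up for the example, and the order start, middle, end within the last 3 indices is assumed to follow the order listed above.

```python
# Toy dimensions (assumptions for illustration): 2 repertoires, 2 sequences each,
# sequence length 4, 20 amino acid channels plus 3 positional channels.
N_REPERTOIRES, N_SEQUENCES, SEQ_LENGTH, N_CHARS, N_POSITIONAL = 2, 2, 4, 20, 3

# Zero-filled nested list with shape
# [repertoires, sequences, sequence_lengths, one_hot_characters].
encoded = [[[[0.0] * (N_CHARS + N_POSITIONAL) for _ in range(SEQ_LENGTH)]
            for _ in range(N_SEQUENCES)]
           for _ in range(N_REPERTOIRES)]

# Mark the first position of the first sequence: character channel 1, start value 1.
encoded[0][0][0][1] = 1.0
encoded[0][0][0][-3] = 1.0  # assumed to be the 'start' channel

position = encoded[0][0][0]
characters = position[:-3]          # one-hot character part
start, middle, end = position[-3:]  # positional part, in the assumed order

print(len(characters), start, middle, end)  # 20 1.0 0.0 0.0
```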

immuneML.encodings.onehot.OneHotSequenceEncoder module

class immuneML.encodings.onehot.OneHotSequenceEncoder.OneHotSequenceEncoder(use_positional_info: bool, distance_to_seq_middle: int, flatten: bool, name: Optional[str] = None, sequence_type: Optional[immuneML.environment.SequenceType.SequenceType] = None)

Bases: immuneML.encodings.onehot.OneHotEncoder.OneHotEncoder

One-hot encoded sequence data is represented in a matrix with dimensions:

[sequences, sequence_lengths, one_hot_characters]

When use_positional_info is True, the last 3 indices in one_hot_characters represent the positional information:

start position (high when close to start)
middle position (high in the middle of the sequence)
end position (high when close to end)
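When flatten is True, the dimensions after the first are combined into a single feature axis, giving the 2-dimensional layout [examples, other_dims_combined]. The sketch below shows this on a toy nucleotide example; `flatten_examples` is a hypothetical helper, and the alphabet order A, C, G, T is an assumption.

```python
# Toy encoded data with shape [sequences, sequence_lengths, one_hot_characters] = [2, 3, 4].
# A 4-letter nucleotide alphabet (assumed order A, C, G, T) keeps the example small.
encoded = [
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],  # "ACG"
    [[0, 0, 0, 1], [0, 0, 0, 1], [1, 0, 0, 0]],  # "TTA"
]


def flatten_examples(examples):
    """Collapse every dimension after the first into one feature axis."""
    return [[value for position in example for value in position] for example in examples]


flat = flatten_examples(encoded)
print(len(flat), len(flat[0]))  # 2 12
print(flat[0])                  # [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
```

The result has one row per example and one column per combination of position and character, which is the 2-dimensional input that scikit-learn estimators expect.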
