immuneML.encodings.word2vec package
Submodules
immuneML.encodings.word2vec.W2VRepertoireEncoder module
- class immuneML.encodings.word2vec.W2VRepertoireEncoder.W2VRepertoireEncoder(vector_size: int, k: int, model_type: immuneML.encodings.word2vec.model_creator.ModelType.ModelType, epochs: int, window: int, name: Optional[str] = None)[source]
Bases:
immuneML.encodings.word2vec.Word2VecEncoder.Word2VecEncoder
immuneML.encodings.word2vec.Word2VecEncoder module
- class immuneML.encodings.word2vec.Word2VecEncoder.Word2VecEncoder(vector_size: int, k: int, model_type: immuneML.encodings.word2vec.model_creator.ModelType.ModelType, epochs: int, window: int, name: Optional[str] = None)[source]
Bases:
immuneML.encodings.DatasetEncoder.DatasetEncoder
Word2VecEncoder learns vector representations of k-mers based on their context (the receptor sequence). It works for sequence and repertoire datasets. A similar idea was discussed in: Ostrovsky-Berman, M., Frankel, B., Polak, P. & Yaari, G. Immune2vec: Embedding B/T Cell Receptor Sequences in ℝN Using Natural Language Processing. Frontiers in Immunology 12, (2021).
This encoder relies on gensim’s implementation of Word2Vec and on KmerHelper for k-mer extraction. Currently, it works at the amino acid level.
- Parameters
vector_size (int) – The size of the vector to be learnt.
model_type (ModelType) – The context which will be used to infer the representation of the sequence. If SEQUENCE is used, the context of a k-mer is defined by the sequence it occurs in (e.g. if the sequence is CASTTY and the k-mer is AST, then its context consists of k-mers CAS, STT, TTY). If KMER_PAIR is used, the context for the k-mer is defined as all the k-mers that are within one edit distance (e.g. for k-mer CAS, the context includes CAA, CAC, CAD etc.). Valid values for this parameter are names of the ModelType enum.
k (int) – The length of the k-mers used for the encoding.
epochs (int) – For how many epochs to train the word2vec model for a given set of sentences (corresponds to the epochs parameter in the gensim package).
window (int) – Max distance between two k-mers in a sequence (same as the window parameter in gensim’s word2vec).
YAML specification:
encodings:
    my_w2v:
        Word2Vec:
            vector_size: 16
            k: 3
            model_type: SEQUENCE
            epochs: 100
            window: 8
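The two context definitions accepted by model_type can be sketched in plain Python (an illustrative example, not immuneML code; KMER_PAIR is shown here with single-substitution neighbours only):

```python
# Illustrative sketch of the SEQUENCE and KMER_PAIR context definitions.

def kmers(sequence: str, k: int = 3):
    """All overlapping k-mers of a sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def sequence_context(sequence: str, kmer: str, k: int = 3):
    """SEQUENCE: the context of a k-mer is the other k-mers of the
    sequence it occurs in."""
    return [km for km in kmers(sequence, k) if km != kmer]

def kmer_pair_context(kmer: str, alphabet: str = "ACDEFGHIKLMNPQRSTVWY"):
    """KMER_PAIR: the context is every k-mer one substitution away
    from the given k-mer."""
    context = set()
    for i in range(len(kmer)):
        for aa in alphabet:
            if aa != kmer[i]:
                context.add(kmer[:i] + aa + kmer[i + 1:])
    return context

print(sequence_context("CASTTY", "AST"))       # context of AST in CASTTY
print("CAA" in kmer_pair_context("CAS"))       # CAA is one substitution from CAS
```

For the sequence CASTTY and k-mer AST, the SEQUENCE context is CAS, STT, TTY, matching the example in the parameter description above.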
- DESCRIPTION_LABELS = 'labels'
- DESCRIPTION_REPERTOIRES = 'repertoires'
- dataset_mapping = {'RepertoireDataset': 'W2VRepertoireEncoder', 'SequenceDataset': 'W2VSequenceEncoder'}
- encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams)[source]