immuneML.encodings.word2vec package
Subpackages
- immuneML.encodings.word2vec.model_creator package
Submodules
immuneML.encodings.word2vec.W2VRepertoireEncoder module
- class immuneML.encodings.word2vec.W2VRepertoireEncoder.W2VRepertoireEncoder(vector_size: int, k: int, model_type: ModelType, epochs: int, window: int, name: str = None)[source]
Bases: Word2VecEncoder
immuneML.encodings.word2vec.W2VSequenceEncoder module
- class immuneML.encodings.word2vec.W2VSequenceEncoder.W2VSequenceEncoder(vector_size: int, k: int, model_type: ModelType, epochs: int, window: int, name: str = None)[source]
Bases: Word2VecEncoder
immuneML.encodings.word2vec.Word2VecEncoder module
- class immuneML.encodings.word2vec.Word2VecEncoder.Word2VecEncoder(vector_size: int, k: int, model_type: ModelType, epochs: int, window: int, name: str = None)[source]
Bases: DatasetEncoder
Word2VecEncoder learns vector representations of k-mers based on their context (the receptor sequence). It works for sequence and repertoire datasets. A similar idea was discussed in: Ostrovsky-Berman, M., Frankel, B., Polak, P. & Yaari, G. Immune2vec: Embedding B/T Cell Receptor Sequences in ℝN Using Natural Language Processing. Frontiers in Immunology 12, (2021).
This encoder relies on gensim’s implementation of Word2Vec and on KmerHelper for k-mer extraction. Currently, it works at the amino acid level.
- Parameters:
  - vector_size (int) – The size of the vector to be learnt.
  - model_type (ModelType) – The context which will be used to infer the representation of the sequence. If SEQUENCE is used, the context of a k-mer is defined by the sequence it occurs in (e.g. if the sequence is CASTTY and the k-mer is AST, then its context consists of the k-mers CAS, STT and TTY). If KMER_PAIR is used, the context of a k-mer is defined as all k-mers within one edit distance (e.g. for the k-mer CAS, the context includes CAA, CAC, CAD, etc.). Valid values for this parameter are the names of the ModelType enum. See the sketch after the YAML specification below for an illustration of both contexts.
  - k (int) – The length of the k-mers used for the encoding.
  - epochs (int) – The number of epochs for which to train the word2vec model on a given set of sentences (corresponds to the epochs parameter in the gensim package).
  - window (int) – The maximum distance between two k-mers in a sequence (same as the window parameter in gensim’s word2vec).

YAML specification:
    encodings:
        my_w2v:
            Word2Vec:
                vector_size: 16
                k: 3
                model_type: SEQUENCE
                epochs: 100
                window: 8
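The following is a minimal, self-contained sketch of the idea behind this encoder, not the immuneML implementation: sequences are turned into sentences of overlapping k-mers and a gensim Word2Vec model is trained on them. The example sequences, the k=3 setting and the edit_distance_one_neighbours helper are illustrative assumptions; gensim >= 4.0 parameter names (vector_size, epochs) are assumed.

    from itertools import product

    from gensim.models import Word2Vec

    # Toy amino acid sequences standing in for a dataset (illustrative only).
    sequences = ["CASTTY", "CASSLGTDTQYF", "CASSIRSSYEQYF"]

    def kmers(sequence, k=3):
        # Overlapping k-mers: in SEQUENCE mode, each sequence becomes one "sentence".
        return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

    # kmers("CASTTY") == ["CAS", "AST", "STT", "TTY"]
    sentences = [kmers(sequence, k=3) for sequence in sequences]

    # Train a small word2vec model over the k-mer sentences.
    model = Word2Vec(sentences=sentences, vector_size=16, window=8, epochs=100, min_count=1)

    print(model.wv["AST"])  # learnt 16-dimensional vector for the k-mer AST

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def edit_distance_one_neighbours(kmer):
        # Rough illustration of the KMER_PAIR context: only substitution
        # neighbours are generated here (e.g. CAS -> CAA, CAC, CAD, ...).
        neighbours = set()
        for position, letter in product(range(len(kmer)), AMINO_ACIDS):
            candidate = kmer[:position] + letter + kmer[position + 1:]
            if candidate != kmer:
                neighbours.add(candidate)
        return neighbours

The sketch stops at the k-mer level; the encoder itself goes on to produce one representation per example (sequence or repertoire) in the dataset.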
- DESCRIPTION_LABELS = 'labels'
- DESCRIPTION_REPERTOIRES = 'repertoires'
- dataset_mapping = {'RepertoireDataset': 'W2VRepertoireEncoder', 'SequenceDataset': 'W2VSequenceEncoder'}
- encode(dataset, params: EncoderParams)[source]
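As a purely hypothetical illustration of how the dataset_mapping attribute above could be used to pick the concrete encoder subclass for a given dataset type (the actual immuneML factory logic may differ):

    # Hypothetical dispatch sketch; not the immuneML implementation.
    dataset_mapping = {"RepertoireDataset": "W2VRepertoireEncoder",
                       "SequenceDataset": "W2VSequenceEncoder"}

    def select_encoder_class_name(dataset) -> str:
        dataset_type = type(dataset).__name__  # e.g. "RepertoireDataset"
        if dataset_type not in dataset_mapping:
            raise ValueError(f"Word2Vec encoding is not defined for {dataset_type}.")
        return dataset_mapping[dataset_type]   # e.g. "W2VRepertoireEncoder"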