immuneML.encodings.word2vec package

Submodules

immuneML.encodings.word2vec.W2VRepertoireEncoder module

class immuneML.encodings.word2vec.W2VRepertoireEncoder.W2VRepertoireEncoder(vector_size: int, k: int, model_type: ModelType, epochs: int, window: int, name: str = None)[source]

Bases: Word2VecEncoder

immuneML.encodings.word2vec.W2VSequenceEncoder module

class immuneML.encodings.word2vec.W2VSequenceEncoder.W2VSequenceEncoder(vector_size: int, k: int, model_type: ModelType, epochs: int, window: int, name: str = None)[source]

Bases: Word2VecEncoder

immuneML.encodings.word2vec.Word2VecEncoder module

class immuneML.encodings.word2vec.Word2VecEncoder.Word2VecEncoder(vector_size: int, k: int, model_type: ModelType, epochs: int, window: int, name: str = None)[source]

Bases: DatasetEncoder

Word2VecEncoder learns vector representations of k-mers based on the context in which they occur (the receptor sequence). A similar idea was discussed in: Ostrovsky-Berman, M., Frankel, B., Polak, P. & Yaari, G. Immune2vec: Embedding B/T Cell Receptor Sequences in ℝ^N Using Natural Language Processing. Frontiers in Immunology 12 (2021).

This encoder relies on gensim’s implementation of Word2Vec and on KmerHelper for k-mer extraction. Currently, it works at the amino acid level.
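To illustrate the underlying mechanics, here is a minimal, self-contained sketch (not immuneML’s internal code): the list comprehension that builds the corpus stands in for what KmerHelper does, and gensim then learns one vector per k-mer. The toy sequences are made up.

from gensim.models import Word2Vec

# toy amino acid sequences standing in for receptor sequences
sequences = ["CASSLGTDTQYF", "CASSPGQGNYGYTF", "CASRDRGNTEAFF"]

k = 3
# each sequence becomes a "sentence" of overlapping k-mers,
# analogous to what KmerHelper produces for the SEQUENCE model type
corpus = [[seq[i:i + k] for i in range(len(seq) - k + 1)] for seq in sequences]

# gensim learns a 16-dimensional vector per k-mer from these sentences
model = Word2Vec(sentences=corpus, vector_size=16, window=8, min_count=1, epochs=100)
print(model.wv["CAS"])  # embedding of the k-mer CAS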

Dataset type:

  • SequenceDatasets

  • RepertoireDatasets

Specification arguments:

  • vector_size (int): The size of the vector to be learnt.

  • model_type (ModelType): The context which will be used to infer the representation of a k-mer. If SEQUENCE is used, the context of a k-mer is defined by the sequence it occurs in (e.g., if the sequence is CASTTY and the k-mer is AST, its context consists of the k-mers CAS, STT and TTY). If KMER_PAIR is used, the context of a k-mer is defined as all the k-mers that are within one edit distance of it (e.g., for the k-mer CAS, the context includes CAA, CAC, CAD, etc.). Valid values for this parameter are the names of the ModelType enum. Both context definitions are illustrated in the sketch after this list.

  • k (int): The length of the k-mers used for the encoding.

  • epochs (int): The number of epochs for which to train the word2vec model on a given set of sentences (corresponds to the epochs parameter in the gensim package).

  • window (int): The maximum distance between two k-mers in a sequence (same as the window parameter in gensim’s word2vec).
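The following sketch makes the two context definitions concrete (illustrative only, not immuneML’s implementation; for KMER_PAIR, substitution-only distance between equal-length k-mers is used here as a stand-in for edit distance):

def sequence_context(sequence: str, kmer: str, k: int = 3) -> list:
    """SEQUENCE model type: the context of a k-mer is every other k-mer
    occurring in the same sequence."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return [km for km in kmers if km != kmer]

def kmer_pair_context(kmer: str, vocabulary: list) -> list:
    """KMER_PAIR model type: the context of a k-mer is every k-mer in the
    vocabulary within one (substitution) edit of it."""
    def distance(a: str, b: str) -> int:
        return sum(c1 != c2 for c1, c2 in zip(a, b))
    return [km for km in vocabulary if km != kmer and distance(km, kmer) <= 1]

print(sequence_context("CASTTY", "AST"))  # ['CAS', 'STT', 'TTY']
print(kmer_pair_context("CAS", ["CAA", "CAC", "CAD", "TTY"]))  # ['CAA', 'CAC', 'CAD']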

YAML specification:

definitions:
    encodings:
        my_w2v:
            Word2Vec:
                vector_size: 16
                k: 3
                model_type: SEQUENCE
                epochs: 100
                window: 8
DESCRIPTION_LABELS = 'labels'
DESCRIPTION_REPERTOIRES = 'repertoires'
static build_object(dataset=None, **params)[source]

Creates an instance of the relevant subclass of the DatasetEncoder class using the given parameters. This method is called at parsing time (early in the immuneML run), so that the parameters and the dataset type can be validated before the analysis starts.

The build_object method should do the following:

  1. Check parameters: immuneML should crash if invalid user parameters are specified. The ParameterValidator utility class may be used for parameter testing.

  2. Check the dataset type: immuneML should crash if the wrong dataset type is specified for this encoder. For example, DeepRCEncoder should only work for RepertoireDatasets and crash if the dataset is of another type.

  3. Create an instance of the correct Encoder class using the given parameters, and return this object. Some encoders have different subclasses depending on the dataset type; make sure to return an instance of the correct subclass. For instance, KmerFrequencyEncoder has a different subclass for each dataset type: when the dataset is a RepertoireDataset, KmerFreqRepertoireEncoder should be returned. A simplified sketch of this contract for Word2VecEncoder follows below.

Parameters:

**params – keyword arguments provided by the user in the YAML specification (when immuneML is used as a command-line tool) or passed in a dictionary when calling the method from code; these should be used to create the Encoder object

Returns:

the object of the appropriate Encoder class
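As a simplified sketch of this contract for Word2VecEncoder (not the actual implementation, which relies on ParameterValidator and immuneML’s reflection utilities; this version dispatches via the dataset_mapping attribute documented below and assumes the two subclasses are imported):

@staticmethod
def build_object(dataset=None, **params):
    # 1. check parameters (a stand-in for the ParameterValidator checks)
    if not (isinstance(params.get("k"), int) and params["k"] > 0):
        raise ValueError("Word2VecEncoder: k must be a positive integer.")
    if not isinstance(params.get("vector_size"), int):
        raise ValueError("Word2VecEncoder: vector_size must be an integer.")

    # 2. check the dataset type: crash on unsupported types
    dataset_type = type(dataset).__name__
    if dataset_type not in Word2VecEncoder.dataset_mapping:
        raise ValueError(f"Word2VecEncoder is not defined for dataset type {dataset_type}.")

    # 3. instantiate the subclass matching the dataset type
    subclasses = {"W2VRepertoireEncoder": W2VRepertoireEncoder,
                  "W2VSequenceEncoder": W2VSequenceEncoder}
    return subclasses[Word2VecEncoder.dataset_mapping[dataset_type]](**params)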

dataset_mapping = {'RepertoireDataset': 'W2VRepertoireEncoder', 'SequenceDataset': 'W2VSequenceEncoder'}
encode(dataset, params: EncoderParams)[source]

This is the main encoding method of the Encoder. It takes in a given dataset, computes an EncodedData object, and returns a copy of the dataset with the attached EncodedData object.

Parameters:
  • dataset – A dataset object (Sequence, Receptor or RepertoireDataset)

  • params – An EncoderParams object containing a few utility parameters which may be used during encoding (e.g., number of parallel processes to use).

Returns:

A copy of the original dataset, with an EncodedData object added to the dataset.encoded_data field.
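A hypothetical usage sketch (the EncoderParams fields shown, such as result_path and learn_model, and the import paths are assumptions, not verified against a specific immuneML version; model_type is given as the enum’s name, as in the YAML specification):

from pathlib import Path
from immuneML.encodings.word2vec.Word2VecEncoder import Word2VecEncoder
from immuneML.encodings.EncoderParams import EncoderParams

# 'dataset' is assumed to be an existing SequenceDataset or RepertoireDataset
encoder = Word2VecEncoder.build_object(dataset, vector_size=16, k=3,
                                       model_type="SEQUENCE", epochs=100,
                                       window=8, name="my_w2v")
params = EncoderParams(result_path=Path("w2v_result/"), learn_model=True)
encoded_dataset = encoder.encode(dataset, params)

# the original dataset is unchanged; the returned copy carries the encoding
print(encoded_dataset.encoded_data.examples.shape)  # (n_examples, vector_size)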

get_additional_files() → List[str][source]

Should return a list with all the files that need to be stored when storing the encoder. For example, SimilarToPositiveSequenceEncoder stores all ‘positive’ sequences in the training data, and predicts a sequence to be ‘positive’ if it is similar to any positive sequences in the training data. In that case, these positive sequences are stored in a file.

For many encoders, it may not be necessary to store additional files.
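For an encoder that does keep extra state on disk, an override might look like the following (illustrative; model_path is a hypothetical attribute naming the stored gensim model file, not immuneML’s actual field):

from typing import List

def get_additional_files(self) -> List[str]:
    # e.g. the trained gensim model stored next to the encoder;
    # 'model_path' is a hypothetical attribute name
    return [str(self.model_path)] if self.model_path is not None else []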

static get_documentation()[source]
static load_encoder(encoder_file: Path)[source]

The load_encoder method loads the encoder from the file where an encoder of the same class was previously stored using the store function. Encoders are stored in pickle format. If the encoder uses additional files, they should be explicitly loaded here as well.

If there are no additional files, this method does not need to be overwritten. If there are additional files, the overriding method’s contents should be as follows:

encoder = DatasetEncoder.load_encoder(encoder_file)
encoder.my_additional_file = DatasetEncoder.load_attribute(encoder, encoder_file, "my_additional_file")

Parameters:

encoder_file (Path) – path to the encoder file where the encoder was stored using the store() function

Returns:

the loaded Encoder object
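A short usage sketch (the file name encoder.pickle is illustrative, not a guaranteed convention):

from pathlib import Path

# load an encoder previously persisted with the store function
encoder = Word2VecEncoder.load_encoder(Path("result/encoder.pickle"))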

Module contents