immuneML.encodings.protein_embedding package

Submodules

immuneML.encodings.protein_embedding.ProtT5Encoder module

class immuneML.encodings.protein_embedding.ProtT5Encoder.ProtT5Encoder(name: str = None, region_type: RegionType = RegionType.IMGT_CDR3, device: str = 'cpu', num_processes: int = 1)[source]

Bases: ProteinEmbeddingEncoder

Encoder based on the pretrained protein language model ProtT5 by Elnaggar et al. (2021). The transformer model used is “Rostlab/prot_t5_xl_half_uniref50-enc”.

Original publication: Elnaggar, A., Heinzinger, M., Dallago, C., Rihawi, G., Wang, Y., Jones, L., Gibbs, T., Feher, T., Angerer, C., Steinegger, M., Bhowmik, D., & Rost, B. (2021). ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing (No. arXiv:2007.06225). arXiv. https://doi.org/10.48550/arXiv.2007.06225

Original GitHub repository with license information: https://github.com/agemagician/ProtTrans
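The ProtTrans repository describes the input format expected by the ProtT5 tokenizer: residues are separated by spaces, and the rare or ambiguous amino acids U, Z, O and B are mapped to X. A minimal sketch of that preprocessing step is shown below. The function name `prepare_for_prot_t5` is hypothetical, and whether ProtT5Encoder performs exactly this step internally is an assumption, not something this page states.

```python
import re


def prepare_for_prot_t5(sequence: str) -> str:
    """Format one amino acid sequence the way the ProtTrans README describes.

    Hypothetical helper for illustration; not part of the immuneML API.
    """
    # Map rare/ambiguous residues (U, Z, O, B) to the unknown residue X.
    cleaned = re.sub(r"[UZOB]", "X", sequence.upper())
    # The ProtT5 tokenizer expects one token per residue, separated by spaces.
    return " ".join(cleaned)


prepared = prepare_for_prot_t5("CASSLB")
```

Here `prepared` holds the space-separated sequence, with the trailing B replaced by X.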

Dataset type:

  • SequenceDataset

  • ReceptorDataset

  • RepertoireDataset

Specification arguments:

  • region_type (RegionType): Which part of the receptor sequence to encode. Defaults to IMGT_CDR3.

  • device (str): Device to use for model inference, as defined by PyTorch: ‘cpu’, ‘cuda’, or ‘mps’. Defaults to ‘cpu’.

  • num_processes (int): Number of processes to use for parallel processing. Defaults to 1.

YAML specification:

definitions:
    encodings:
        my_prot_t5_encoder:
            ProtT5:
                region_type: IMGT_CDR3
                device: cpu
                num_processes: 4
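
When immuneML is driven from code rather than from a YAML file, the same specification is a nested mapping. The sketch below shows one way such a mapping could be merged with the documented defaults and checked against the documented options. `PROT_T5_DEFAULTS`, `ALLOWED_DEVICES` and `resolve_prot_t5_params` are hypothetical names, and restricting `device` to the three listed values is an assumption (PyTorch itself accepts other device strings); immuneML's own parsing may differ.

```python
from typing import Any, Dict

# Defaults as documented in the specification arguments above.
PROT_T5_DEFAULTS: Dict[str, Any] = {
    "region_type": "IMGT_CDR3",
    "device": "cpu",
    "num_processes": 1,
}
# Assumption: only the three devices named in the documentation are accepted.
ALLOWED_DEVICES = {"cpu", "cuda", "mps"}


def resolve_prot_t5_params(user_params: Dict[str, Any]) -> Dict[str, Any]:
    """Merge user parameters with defaults and reject invalid values."""
    unknown = set(user_params) - set(PROT_T5_DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    params = {**PROT_T5_DEFAULTS, **user_params}
    if params["device"] not in ALLOWED_DEVICES:
        raise ValueError(f"Unsupported device: {params['device']!r}")
    n = params["num_processes"]
    if isinstance(n, bool) or not isinstance(n, int) or n < 1:
        raise ValueError("num_processes must be a positive integer")
    return params


# Python equivalent of the YAML specification shown above.
specification = {
    "definitions": {
        "encodings": {
            "my_prot_t5_encoder": {
                "ProtT5": {
                    "region_type": "IMGT_CDR3",
                    "device": "cpu",
                    "num_processes": 4,
                }
            }
        }
    }
}
encoder_spec = specification["definitions"]["encodings"]["my_prot_t5_encoder"]
resolved = resolve_prot_t5_params(encoder_spec["ProtT5"])
```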

static build_object(dataset: Dataset, **params)[source]

Creates an instance of the relevant subclass of the DatasetEncoder class using the given parameters. This method will be called during parsing time (early in the immuneML run), such that parameters and dataset type can be tested here.

The build_object method should do the following:

  1. Check parameters: immuneML should crash if wrong user parameters are specified. The ParameterValidator utility class may be used for parameter testing.

  2. Check the dataset type: immuneML should crash if the wrong dataset type is specified for this encoder. For example, DeepRCEncoder should only work for RepertoireDatasets and crash if the dataset is of another type.

  3. Create an instance of the correct Encoder class, using the given parameters. Return this object. Some encoders have different subclasses depending on the dataset type. Make sure to return an instance of the correct subclass. For instance: KmerFrequencyEncoder has different subclasses for each dataset type. When the dataset is a Repertoire dataset, KmerFreqRepertoireEncoder should be returned.

Parameters:

**params – keyword arguments that will be provided by users in the specification (if immuneML is used as a command line tool) or in the dictionary when calling the method from the code, and which should be used to create the Encoder object

Returns:

the object of the appropriate Encoder class
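The three steps listed for build_object can be sketched with stand-in classes. Everything below is hypothetical and self-contained: the `Dataset` classes are empty placeholders rather than the immuneML classes, `ToyEmbeddingEncoder` is not a real encoder, and plain exceptions are used where immuneML code may rely on the ParameterValidator utility.

```python
from typing import Any


# Empty placeholders standing in for the immuneML dataset classes.
class Dataset: ...
class SequenceDataset(Dataset): ...
class ReceptorDataset(Dataset): ...
class RepertoireDataset(Dataset): ...
class UnsupportedDataset(Dataset): ...


class ToyEmbeddingEncoder:
    SUPPORTED_DATASETS = (SequenceDataset, ReceptorDataset, RepertoireDataset)
    ALLOWED_PARAMS = {"name", "device", "num_processes"}

    def __init__(self, name: str = None, device: str = "cpu", num_processes: int = 1):
        self.name = name
        self.device = device
        self.num_processes = num_processes

    @staticmethod
    def build_object(dataset: Dataset, **params: Any) -> "ToyEmbeddingEncoder":
        # 1. Check parameters: fail early on anything the encoder does not know.
        unknown = set(params) - ToyEmbeddingEncoder.ALLOWED_PARAMS
        if unknown:
            raise ValueError(f"Unknown parameters: {sorted(unknown)}")
        # 2. Check the dataset type: fail early on unsupported datasets.
        if not isinstance(dataset, ToyEmbeddingEncoder.SUPPORTED_DATASETS):
            raise TypeError(f"Unsupported dataset type: {type(dataset).__name__}")
        # 3. Create and return the encoder instance.
        return ToyEmbeddingEncoder(**params)


encoder = ToyEmbeddingEncoder.build_object(RepertoireDataset(), name="toy", num_processes=2)
```

Failing at this stage means a misconfigured run stops during parsing, before any expensive encoding work starts.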

immuneML.encodings.protein_embedding.ProteinEmbeddingEncoder module

class immuneML.encodings.protein_embedding.ProteinEmbeddingEncoder.ProteinEmbeddingEncoder(region_type: RegionType, name: str = None, num_processes: int = 1, device: str = 'cpu')[source]

Bases: DatasetEncoder, ABC

Abstract base class for protein embedding encoders that handles dataset-type specific logic. Subclasses must implement the _embed_sequence_set method.
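The division of labour between the base class and its subclasses follows the usual abstract-method pattern, sketched below with toy classes. The real signature of `_embed_sequence_set` is not given on this page, so the one used here (a list of strings in, a list of float vectors out) is an assumption, as are the names `ToyProteinEmbeddingEncoder`, `encode_sequences` and `LengthEmbeddingEncoder`.

```python
from abc import ABC, abstractmethod
from typing import List


class ToyProteinEmbeddingEncoder(ABC):
    """Holds the shared logic; subclasses only supply the embedding step."""

    def encode_sequences(self, sequences: List[str]) -> List[List[float]]:
        # Shared preprocessing lives in the base class (here: drop empty strings).
        non_empty = [s for s in sequences if s]
        return self._embed_sequence_set(non_empty)

    @abstractmethod
    def _embed_sequence_set(self, sequences: List[str]) -> List[List[float]]:
        """Return one embedding vector per sequence (assumed signature)."""


class LengthEmbeddingEncoder(ToyProteinEmbeddingEncoder):
    """Trivial stand-in for a language model: [length, number of cysteines]."""

    def _embed_sequence_set(self, sequences: List[str]) -> List[List[float]]:
        return [[float(len(s)), float(s.count("C"))] for s in sequences]


embeddings = LengthEmbeddingEncoder().encode_sequences(["CASS", "", "CC"])
```

Instantiating `ToyProteinEmbeddingEncoder` directly raises a `TypeError`, which is how the abstract method requirement is enforced.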

abstract static build_object(dataset: Dataset, **params)[source]

Creates an instance of the relevant subclass of the DatasetEncoder class using the given parameters. This method will be called during parsing time (early in the immuneML run), such that parameters and dataset type can be tested here.

The build_object method should do the following:

  1. Check parameters: immuneML should crash if wrong user parameters are specified. The ParameterValidator utility class may be used for parameter testing.

  2. Check the dataset type: immuneML should crash if the wrong dataset type is specified for this encoder. For example, DeepRCEncoder should only work for RepertoireDatasets and crash if the dataset is of another type.

  3. Create an instance of the correct Encoder class, using the given parameters. Return this object. Some encoders have different subclasses depending on the dataset type. Make sure to return an instance of the correct subclass. For instance: KmerFrequencyEncoder has different subclasses for each dataset type. When the dataset is a Repertoire dataset, KmerFreqRepertoireEncoder should be returned.

Parameters:

**params – keyword arguments that will be provided by users in the specification (if immuneML is used as a command line tool) or in the dictionary when calling the method from the code, and which should be used to create the Encoder object

Returns:

the object of the appropriate Encoder class

encode(dataset: Dataset, params: EncoderParams) → Dataset[source]

This is the main encoding method of the Encoder. It takes in a given dataset, computes an EncodedData object, and returns a copy of the dataset with the attached EncodedData object.

Parameters:
  • dataset – A dataset object (SequenceDataset, ReceptorDataset, or RepertoireDataset)

  • params – An EncoderParams object containing a few utility parameters which may be used during encoding (e.g., number of parallel processes to use).

Returns:

A copy of the original dataset, with an EncodedData object added to the dataset.encoded_data field.
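The contract described for encode (the input dataset is left untouched, and a copy carrying the encoded data is returned) can be illustrated with toy data classes. `ToyDataset`, `ToyEncodedData` and the standalone `encode` function are hypothetical stand-ins, and the field names other than `encoded_data` are assumptions.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ToyEncodedData:
    examples: List[List[float]]  # one embedding vector per example
    example_ids: List[str]


@dataclass
class ToyDataset:
    sequences: Dict[str, str]  # example id -> amino acid sequence
    encoded_data: Optional[ToyEncodedData] = None


def encode(dataset: ToyDataset, embed: Callable[[str], List[float]]) -> ToyDataset:
    """Return a copy of the dataset with encoded_data attached."""
    ids = list(dataset.sequences)
    examples = [embed(dataset.sequences[i]) for i in ids]
    # A shallow copy is enough here: only the encoded_data field is replaced.
    encoded = copy.copy(dataset)
    encoded.encoded_data = ToyEncodedData(examples=examples, example_ids=ids)
    return encoded


original = ToyDataset(sequences={"seq1": "CASSL", "seq2": "CAS"})
encoded = encode(original, embed=lambda s: [float(len(s))])
```

After the call, `original.encoded_data` is still `None`, while `encoded.encoded_data` holds one vector per sequence.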

immuneML.encodings.protein_embedding.TCRBertEncoder module

class immuneML.encodings.protein_embedding.TCRBertEncoder.TCRBertEncoder(name: str = None, region_type: RegionType = RegionType.IMGT_CDR3, model: str = None, layers: list = None, method: str = None, batch_size: int = None, device: str = 'cpu')[source]

Bases: ProteinEmbeddingEncoder

TCRBertEncoder is based on TCR-BERT, a large language model trained on TCR sequences. TCRBertEncoder embeds TCR sequences using either of the two pre-trained models available on the HuggingFace model hub.

Original publication: Wu, K. E., Yost, K., Daniel, B., Belk, J., Xia, Y., Egawa, T., Satpathy, A., Chang, H., & Zou, J. (2024). TCR-BERT: Learning the grammar of T-cell receptors for flexible antigen-binding analyses. Proceedings of the 18th Machine Learning in Computational Biology Meeting, 194–229. https://proceedings.mlr.press/v240/wu24b.html

Dataset type:

  • SequenceDataset

  • ReceptorDataset

  • RepertoireDataset

Specification arguments:

  • model (str): The pre-trained model to use (huggingface model hub identifier). Available options are ‘tcr-bert’ and ‘tcr-bert-mlm-only’.

  • layers (list): The hidden layers to use for encoding. Layers should be given as negative integers, where -1 indicates the last representation, -2 second to last, etc. Default is [-1].

  • method (str): The method to use for pooling the hidden states. Available options are ‘mean’, ‘max’, ‘cls’, and ‘pool’. Default is ‘mean’. For an explanation of the methods, see the TCR-BERT GitHub repository.

  • batch_size (int): The batch size to use for encoding. Default is 256.
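The interplay of `layers` and `method` can be sketched in plain Python. This is an illustration under stated assumptions, not the TCR-BERT implementation: the first token of each layer is assumed to be the [CLS] token, ‘mean’ and ‘max’ are assumed to pool over the remaining tokens, vectors from several layers are assumed to be concatenated, and ‘pool’ is left out because its definition is given in the TCR-BERT repository rather than on this page. The names `pool_layer` and `embed` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def pool_layer(tokens: Sequence[Vector], method: str) -> Vector:
    """Pool the token vectors of one layer; tokens[0] is assumed to be [CLS]."""
    if method == "cls":
        return list(tokens[0])
    residues = tokens[1:]  # assumption: pooling skips the [CLS] token
    if not residues:
        raise ValueError("no residue tokens to pool")
    if method == "mean":
        return [sum(column) / len(residues) for column in zip(*residues)]
    if method == "max":
        return [max(column) for column in zip(*residues)]
    raise ValueError(f"Unsupported method in this sketch: {method!r}")


def embed(hidden_states: Sequence[Sequence[Vector]],
          layers: Sequence[int] = (-1,),
          method: str = "mean") -> Vector:
    """Pool each selected layer and concatenate the results (assumed behaviour)."""
    embedding: Vector = []
    for layer in layers:
        if layer >= 0:
            raise ValueError("layers must be negative integers, e.g. -1 for the last layer")
        embedding.extend(pool_layer(hidden_states[layer], method))
    return embedding


# Two layers, three tokens each ([CLS] plus two residues), two dimensions.
hidden_states = [
    [[1.0, 1.0], [2.0, 4.0], [4.0, 8.0]],  # layer -2
    [[0.0, 0.5], [1.0, 3.0], [3.0, 5.0]],  # layer -1 (last)
]
last_layer_mean = embed(hidden_states)
two_layer_max = embed(hidden_states, layers=[-1, -2], method="max")
```

With these toy values, the default call averages the two residue vectors of the last layer, and the two-layer call yields a vector twice as long.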

YAML specification:

definitions:
    encodings:
        my_tcr_bert_encoder: TCRBert

static build_object(dataset: Dataset, **params)[source]

Creates an instance of the relevant subclass of the DatasetEncoder class using the given parameters. This method will be called during parsing time (early in the immuneML run), such that parameters and dataset type can be tested here.

The build_object method should do the following:

  1. Check parameters: immuneML should crash if wrong user parameters are specified. The ParameterValidator utility class may be used for parameter testing.

  2. Check the dataset type: immuneML should crash if the wrong dataset type is specified for this encoder. For example, DeepRCEncoder should only work for RepertoireDatasets and crash if the dataset is of another type.

  3. Create an instance of the correct Encoder class, using the given parameters. Return this object. Some encoders have different subclasses depending on the dataset type. Make sure to return an instance of the correct subclass. For instance: KmerFrequencyEncoder has different subclasses for each dataset type. When the dataset is a Repertoire dataset, KmerFreqRepertoireEncoder should be returned.

Parameters:

**params – keyword arguments that will be provided by users in the specification (if immuneML is used as a command line tool) or in the dictionary when calling the method from the code, and which should be used to create the Encoder object

Returns:

the object of the appropriate Encoder class

Module contents