immuneML.encodings.protein_embedding package
Submodules
immuneML.encodings.protein_embedding.ProtT5Encoder module
- class immuneML.encodings.protein_embedding.ProtT5Encoder.ProtT5Encoder(name: str = None, region_type: RegionType = RegionType.IMGT_CDR3, device: str = 'cpu', num_processes: int = 1)
Bases: ProteinEmbeddingEncoder
Encoder based on a pretrained protein language model by Elnaggar et al. (2021). The transformer model used is “Rostlab/prot_t5_xl_half_uniref50-enc”.
Original publication: Elnaggar, A., Heinzinger, M., Dallago, C., Rihawi, G., Wang, Y., Jones, L., Gibbs, T., Feher, T., Angerer, C., Steinegger, M., Bhowmik, D., & Rost, B. (2021). ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing (No. arXiv:2007.06225). arXiv. https://doi.org/10.48550/arXiv.2007.06225
Original GitHub repository with license information: https://github.com/agemagician/ProtTrans
Dataset type:
SequenceDatasets
ReceptorDatasets
RepertoireDatasets
Specification arguments:
region_type (RegionType): Which part of the receptor sequence to encode. Defaults to IMGT_CDR3.
device (str): Which device to use for model inference: one of ‘cpu’, ‘cuda’ or ‘mps’, as defined by PyTorch. Defaults to ‘cpu’.
num_processes (int): Number of processes to use for parallel processing. Defaults to 1.
YAML specification:
definitions:
  encodings:
    my_prot_t5_encoder:
      ProtT5:
        region_type: IMGT_CDR3
        device: cpu
        num_processes: 4
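To make the nesting of the YAML specification above explicit, the sketch below writes it as the Python dictionary a YAML loader would produce and splits the encoding entry into an encoder name and its keyword arguments. The helper `split_encoder_spec` is hypothetical and only illustrates the structure; it is not immuneML's own parser.

```python
# Hypothetical helper, not part of immuneML: split one encoding entry
# into (encoder name, keyword arguments).
def split_encoder_spec(spec):
    if isinstance(spec, str):
        # Bare form (only the encoder name): no arguments, defaults apply.
        return spec, {}
    if isinstance(spec, dict) and len(spec) == 1:
        # Nested form: a single key (encoder name) mapping to its arguments.
        (encoder_name, args), = spec.items()
        return encoder_name, dict(args or {})
    raise ValueError(f"Unsupported encoder specification: {spec!r}")


# The YAML specification above, as it would look after loading into Python.
specification = {
    "definitions": {
        "encodings": {
            "my_prot_t5_encoder": {
                "ProtT5": {
                    "region_type": "IMGT_CDR3",
                    "device": "cpu",
                    "num_processes": 4,
                }
            }
        }
    }
}

entry = specification["definitions"]["encodings"]["my_prot_t5_encoder"]
encoder_name, encoder_args = split_encoder_spec(entry)
print(encoder_name, encoder_args)
```

The extracted keyword arguments are the kind of values that reach `build_object` as `**params`.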
- static build_object(dataset: Dataset, **params)
Creates an instance of the relevant subclass of the DatasetEncoder class using the given parameters. This method is called at parsing time (early in the immuneML run), so parameters and the dataset type can be validated here.
The build_object method should do the following:
Check parameters: immuneML should crash if wrong user parameters are specified. The ParameterValidator utility class may be used for parameter testing.
Check the dataset type: immuneML should crash if the wrong dataset type is specified for this encoder. For example, DeepRCEncoder should only work for RepertoireDatasets and crash if the dataset is of another type.
Create an instance of the correct Encoder class, using the given parameters. Return this object. Some encoders have different subclasses depending on the dataset type. Make sure to return an instance of the correct subclass. For instance: KmerFrequencyEncoder has different subclasses for each dataset type. When the dataset is a Repertoire dataset, KmerFreqRepertoireEncoder should be returned.
- Parameters:
**params – keyword arguments that will be provided by users in the specification (if immuneML is used as a command line tool) or in the dictionary when calling the method from the code, and which should be used to create the Encoder object
- Returns:
the object of the appropriate Encoder class
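The three responsibilities of build_object listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the dataset classes are empty stand-ins for the real immuneML classes, `ToyEmbeddingEncoder` is a hypothetical encoder, and plain `ValueError`/`TypeError` checks replace the ParameterValidator utility.

```python
from dataclasses import dataclass


# Stand-in dataset classes (assumption); the real ones live in immuneML.
class Dataset:
    pass


class SequenceDataset(Dataset):
    pass


class ReceptorDataset(Dataset):
    pass


class RepertoireDataset(Dataset):
    pass


VALID_DEVICES = ("cpu", "cuda", "mps")
SUPPORTED_DATASETS = (SequenceDataset, ReceptorDataset, RepertoireDataset)


@dataclass
class ToyEmbeddingEncoder:
    """Hypothetical encoder used only to illustrate the build_object contract."""
    name: str = None
    region_type: str = "IMGT_CDR3"
    device: str = "cpu"
    num_processes: int = 1

    @staticmethod
    def build_object(dataset, **params):
        # 1. Check parameters: fail early on invalid user input.
        device = params.get("device", "cpu")
        if device not in VALID_DEVICES:
            raise ValueError(f"device must be one of {VALID_DEVICES}, got {device!r}")
        num_processes = params.get("num_processes", 1)
        if not isinstance(num_processes, int) or num_processes < 1:
            raise ValueError("num_processes must be a positive integer")

        # 2. Check the dataset type: fail early on unsupported datasets.
        if not isinstance(dataset, SUPPORTED_DATASETS):
            raise TypeError(f"Unsupported dataset type: {type(dataset).__name__}")

        # 3. Create and return the encoder instance.
        return ToyEmbeddingEncoder(**params)


encoder = ToyEmbeddingEncoder.build_object(SequenceDataset(), device="cpu", num_processes=4)
print(encoder)
```

Because all checks run before any encoding work starts, a misconfigured specification fails at parsing time rather than partway through a long run.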
immuneML.encodings.protein_embedding.ProteinEmbeddingEncoder module
- class immuneML.encodings.protein_embedding.ProteinEmbeddingEncoder.ProteinEmbeddingEncoder(region_type: RegionType, name: str = None, num_processes: int = 1, device: str = 'cpu')
Bases: DatasetEncoder, ABC
Abstract base class for protein embedding encoders that handles dataset-type specific logic. Subclasses must implement the _embed_sequence_set method.
- abstract static build_object(dataset: Dataset, **params)
Creates an instance of the relevant subclass of the DatasetEncoder class using the given parameters. This method is called at parsing time (early in the immuneML run), so parameters and the dataset type can be validated here.
The build_object method should do the following:
Check parameters: immuneML should crash if wrong user parameters are specified. The ParameterValidator utility class may be used for parameter testing.
Check the dataset type: immuneML should crash if the wrong dataset type is specified for this encoder. For example, DeepRCEncoder should only work for RepertoireDatasets and crash if the dataset is of another type.
Create an instance of the correct Encoder class, using the given parameters. Return this object. Some encoders have different subclasses depending on the dataset type. Make sure to return an instance of the correct subclass. For instance: KmerFrequencyEncoder has different subclasses for each dataset type. When the dataset is a Repertoire dataset, KmerFreqRepertoireEncoder should be returned.
- Parameters:
**params – keyword arguments that will be provided by users in the specification (if immuneML is used as a command line tool) or in the dictionary when calling the method from the code, and which should be used to create the Encoder object
- Returns:
the object of the appropriate Encoder class
- encode(dataset: Dataset, params: EncoderParams) → Dataset
This is the main encoding method of the Encoder. It takes in a given dataset, computes an EncodedData object, and returns a copy of the dataset with the attached EncodedData object.
- Parameters:
dataset – A dataset object (SequenceDataset, ReceptorDataset or RepertoireDataset)
params – An EncoderParams object containing a few utility parameters which may be used during encoding (e.g., the number of parallel processes to use).
- Returns:
A copy of the original dataset, with an EncodedData object added to the dataset.encoded_data field.
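The contract described above (compute an EncodedData object, attach it to a copy, leave the input untouched) can be illustrated with a small self-contained sketch. The classes `ToyDataset` and `EncodedData` below are simplified stand-ins, and `toy_embed` is a placeholder for a real protein language model; none of them reproduce immuneML's actual implementation.

```python
import copy
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EncodedData:
    """Simplified stand-in for immuneML's EncodedData (assumption)."""
    examples: List[List[float]]
    example_ids: List[str]


@dataclass
class ToyDataset:
    """Simplified stand-in for a sequence dataset (assumption)."""
    sequences: List[str]
    ids: List[str]
    encoded_data: Optional[EncodedData] = None


def toy_embed(sequence: str) -> List[float]:
    # Placeholder "embedding": sequence length and fraction of cysteines.
    return [float(len(sequence)), sequence.count("C") / max(len(sequence), 1)]


def encode(dataset: ToyDataset) -> ToyDataset:
    # Compute one embedding vector per sequence.
    encoded = EncodedData(
        examples=[toy_embed(seq) for seq in dataset.sequences],
        example_ids=list(dataset.ids),
    )
    # Attach the result to a shallow copy so the input object is not modified.
    encoded_dataset = copy.copy(dataset)
    encoded_dataset.encoded_data = encoded
    return encoded_dataset


original = ToyDataset(sequences=["CASSL", "CAS"], ids=["s1", "s2"])
result = encode(original)
print(result.encoded_data.examples)
```

Returning a copy means the same unencoded dataset can be passed to several encoders without one encoding overwriting another.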
immuneML.encodings.protein_embedding.TCRBertEncoder module
- class immuneML.encodings.protein_embedding.TCRBertEncoder.TCRBertEncoder(name: str = None, region_type: RegionType = RegionType.IMGT_CDR3, model: str = None, layers: list = None, method: str = None, batch_size: int = None, device: str = 'cpu')
Bases: ProteinEmbeddingEncoder
TCRBertEncoder is based on TCR-BERT, a large language model trained on TCR sequences. TCRBertEncoder embeds TCR sequences using either of the two pre-trained models available on the HuggingFace model hub.
Original publication: Wu, K. E., Yost, K., Daniel, B., Belk, J., Xia, Y., Egawa, T., Satpathy, A., Chang, H., & Zou, J. (2024). TCR-BERT: Learning the grammar of T-cell receptors for flexible antigen-binding analyses. Proceedings of the 18th Machine Learning in Computational Biology Meeting, 194–229. https://proceedings.mlr.press/v240/wu24b.html
Dataset type:
SequenceDataset
ReceptorDataset
RepertoireDataset
Specification arguments:
model (str): The pre-trained model to use (huggingface model hub identifier). Available options are ‘tcr-bert’ and ‘tcr-bert-mlm-only’.
layers (list): The hidden layers to use for encoding. Layers should be given as negative integers, where -1 indicates the last representation, -2 the second to last, etc. Default is [-1].
method (str): The method to use for pooling the hidden states. Available options are ‘mean’, ‘max’, ‘cls’, and ‘pool’. Default is ‘mean’. For an explanation of the methods, see the GitHub repository of TCR-BERT.
batch_size (int): The batch size to use for encoding. Default is 256.
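The interplay of the layers and method arguments can be illustrated with plain Python lists. This is a sketch under assumptions, not the TCR-BERT implementation: it pools over all token positions, treats the first token as the classification token for ‘cls’, concatenates the pooled vectors when several layers are requested, and omits ‘pool’ because that option depends on the model's own pooling head. The function names `pool_tokens` and `embed` are hypothetical.

```python
def pool_tokens(token_vectors, method):
    """Reduce a (tokens x dim) list of vectors to a single vector."""
    if method == "cls":
        # Assumption: the first position holds the classification token.
        return list(token_vectors[0])
    columns = list(zip(*token_vectors))  # one tuple per embedding dimension
    if method == "mean":
        return [sum(col) / len(col) for col in columns]
    if method == "max":
        return [max(col) for col in columns]
    raise ValueError(f"Unsupported pooling method in this sketch: {method!r}")


def embed(hidden_states, layers=(-1,), method="mean"):
    """Pool each selected layer and concatenate the results (assumption)."""
    embedding = []
    for layer in layers:
        # Negative indices count from the last hidden layer, as in the docs.
        embedding.extend(pool_tokens(hidden_states[layer], method))
    return embedding


# Two layers, two tokens per sequence, embedding dimension 2.
hidden_states = [
    [[1.0, 2.0], [3.0, 4.0]],  # second-to-last layer (index -2)
    [[0.0, 2.0], [4.0, 6.0]],  # last layer (index -1)
]

print(embed(hidden_states))                          # last layer, mean pooling
print(embed(hidden_states, layers=[-1, -2]))         # two layers, concatenated
print(embed(hidden_states, method="max"))            # last layer, max pooling
```

Under these assumptions, requesting more layers lengthens the resulting feature vector, while the pooling method only changes how token positions are collapsed.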
YAML specification:
definitions:
  encodings:
    my_tcr_bert_encoder: TCRBert
- static build_object(dataset: Dataset, **params)
Creates an instance of the relevant subclass of the DatasetEncoder class using the given parameters. This method is called at parsing time (early in the immuneML run), so parameters and the dataset type can be validated here.
The build_object method should do the following:
Check parameters: immuneML should crash if wrong user parameters are specified. The ParameterValidator utility class may be used for parameter testing.
Check the dataset type: immuneML should crash if the wrong dataset type is specified for this encoder. For example, DeepRCEncoder should only work for RepertoireDatasets and crash if the dataset is of another type.
Create an instance of the correct Encoder class, using the given parameters. Return this object. Some encoders have different subclasses depending on the dataset type. Make sure to return an instance of the correct subclass. For instance: KmerFrequencyEncoder has different subclasses for each dataset type. When the dataset is a Repertoire dataset, KmerFreqRepertoireEncoder should be returned.
- Parameters:
**params – keyword arguments that will be provided by users in the specification (if immuneML is used as a command line tool) or in the dictionary when calling the method from the code, and which should be used to create the Encoder object
- Returns:
the object of the appropriate Encoder class