immuneML.encodings.amino_acid_property_encoding package
Submodules
immuneML.encodings.amino_acid_property_encoding.AminoAcidPropertyEncoder module
- class immuneML.encodings.amino_acid_property_encoding.AminoAcidPropertyEncoder.AminoAcidPropertyEncoder(factors: str, region_type: RegionType, scale_to_zero_mean: bool = False, scale_to_unit_variance: bool = False, name: str = None)
Bases: DatasetEncoder
Encodes a dataset by replacing each amino acid in a sequence with its biophysicochemical factor vector and averaging those vectors across all positions in the sequence. Three factor sets are supported, each stored as a TSV file under immuneML/config/physicochemical_factors/:
atchley: 5 factors per amino acid (Atchley et al., 2005).
kidera: 10 factors per amino acid (Kidera et al., 1985).
amino_acid_property: 14 mixed physicochemical descriptors per amino acid, compiled from several published sources and originally aggregated in VDJtools (Shugay et al., 2015).
Characters outside the standard 20-amino-acid alphabet (gaps, X, etc.) are silently skipped; a sequence with no known amino acids is encoded as an all-zero vector.
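The averaging and skipping behaviour described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the immuneML implementation: the `TOY_FACTORS` table and the `encode_sequence` helper are hypothetical, and the factor values are made up rather than taken from the Atchley or Kidera tables.

```python
import numpy as np

# Hypothetical two-factor table for three amino acids. The values are invented
# for illustration and are NOT the Atchley, Kidera or VDJtools factors.
TOY_FACTORS = {
    "A": np.array([1.0, 0.0]),
    "C": np.array([0.0, 2.0]),
    "D": np.array([-1.0, 1.0]),
}


def encode_sequence(sequence, factors=TOY_FACTORS):
    """Average the factor vectors of all known residues in `sequence`.

    Characters missing from `factors` (gaps, X, ...) are skipped. A sequence
    without any known residue is encoded as an all-zero vector.
    """
    n_factors = len(next(iter(factors.values())))
    known = [factors[aa] for aa in sequence if aa in factors]
    if not known:
        return np.zeros(n_factors)
    return np.mean(known, axis=0)


full = encode_sequence("ACD")      # all three residues contribute
gapped = encode_sequence("AX-C")   # 'X' and '-' are skipped
empty = encode_sequence("XX")      # no known residues: all-zero vector
```

Stacking the vectors of several sequences (for example with `np.vstack`) would give the `[n_sequences, n_factors]` matrix.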
For SequenceDatasets the output shape is [n_sequences, n_factors]. For ReceptorDatasets each chain is encoded independently and the resulting vectors are concatenated (chains ordered alphabetically by locus name), giving shape [n_receptors, 2 × n_factors].
Dataset type:
SequenceDatasets
ReceptorDatasets
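For receptors, the concatenation of per-chain vectors in alphabetical locus order might look like the sketch below. The `encode_receptor` helper and the dictionary layout are assumptions made for illustration; immuneML's own data structures differ.

```python
import numpy as np


def encode_receptor(chain_vectors):
    """Concatenate per-chain vectors, ordering chains alphabetically by locus name.

    `chain_vectors` is a hypothetical mapping from locus name to the already
    averaged factor vector of that chain.
    """
    return np.concatenate([chain_vectors[locus] for locus in sorted(chain_vectors)])


# The insertion order of the dictionary does not matter: TRA is placed before TRB.
receptor = {"TRB": np.array([0.5, 1.0]), "TRA": np.array([0.0, 1.0])}
encoded = encode_receptor(receptor)
```

With two chains of `n_factors` values each, the result has length `2 × n_factors`, matching the receptor output shape given above.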
Specification arguments:
factors (str): Which set of biophysicochemical factors to use. Valid values: atchley (5 factors), kidera (10 factors), or amino_acid_property (14 factors).
region_type (str): Which part of the receptor sequence to encode (e.g. imgt_cdr3).
scale_to_zero_mean (bool): Whether to scale each feature to zero mean across examples after encoding. Defaults to true.
scale_to_unit_variance (bool): Whether to scale each feature to unit variance across examples after encoding. Defaults to true.
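The effect of the two scaling flags can be illustrated with plain NumPy. This is a sketch of standard per-feature standardisation across examples, assuming population standard deviation and leaving constant features unscaled; the `scale_features` helper is hypothetical and the scaler immuneML actually uses may handle these details differently.

```python
import numpy as np


def scale_features(X, zero_mean=True, unit_variance=True):
    """Standardise each column (feature) of X across rows (examples)."""
    X = np.asarray(X, dtype=float)
    if zero_mean:
        X = X - X.mean(axis=0)
    if unit_variance:
        std = X.std(axis=0)
        std[std == 0] = 1.0  # assumption: constant features are left unscaled
        X = X / std
    return X


encoded = np.array([[1.0, 2.0],
                    [3.0, 2.0]])
scaled = scale_features(encoded)
```

In this toy input the first feature becomes `[-1, 1]` and the constant second feature becomes `[0, 0]`.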
References:
Factor values downloaded from vadimnazarov/kidera-atchley.
W.R. Atchley, J. Zhao, A.D. Fernandes, & T. Drüke, Solving the protein sequence metric problem, Proc. Natl. Acad. Sci. U.S.A. 102 (18) 6395-6400, https://doi.org/10.1073/pnas.0408677102 (2005).
Kidera, A., Konishi, Y., Oka, M. et al. Statistical analysis of the physical properties of the 20 naturally occurring amino acids. J Protein Chem 4, 23–55 (1985). https://doi.org/10.1007/BF01025492
Shugay M et al. VDJtools: Unifying Post-analysis of T Cell Receptor Repertoires. PLoS Comp Biol 2015; 11(11):e1004503-e1004503.
YAML specification:
    definitions:
      encodings:
        my_atchley_encoder:
          AminoAcidProperty:
            factors: atchley
            region_type: imgt_cdr3
            scale_to_zero_mean: true
            scale_to_unit_variance: true
        my_kidera_encoder:
          AminoAcidProperty:
            factors: kidera
            region_type: imgt_cdr3
        my_aa_property_encoder:
          AminoAcidProperty:
            factors: amino_acid_property
            region_type: imgt_cdr3
- static build_object(dataset: Dataset, **params)
Creates an instance of the relevant subclass of the DatasetEncoder class using the given parameters. This method will be called during parsing time (early in the immuneML run), such that parameters and dataset type can be tested here.
The build_object method should do the following:
Check parameters: immuneML should crash if wrong user parameters are specified. The ParameterValidator utility class may be used for parameter testing.
Check the dataset type: immuneML should crash if the wrong dataset type is specified for this encoder. For example, DeepRCEncoder should only work for RepertoireDatasets and crash if the dataset is of another type.
Create an instance of the correct Encoder class, using the given parameters. Return this object. Some encoders have different subclasses depending on the dataset type. Make sure to return an instance of the correct subclass. For instance: KmerFrequencyEncoder has different subclasses for each dataset type. When the dataset is a Repertoire dataset, KmerFreqRepertoireEncoder should be returned.
- Parameters:
dataset – Dataset object of the same class as the dataset to be encoded later; in case there are multiple dataset types supported by the encoder, the dataset should be of one of these types and the correct subclass of the encoder should be returned
**params – keyword arguments that will be provided by users in the specification (if immuneML is used as a command line tool) or in the dictionary when calling the method from the code, and which should be used to create the Encoder object
- Returns:
the object of the appropriate Encoder class
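The three responsibilities of build_object listed above (check parameters, check the dataset type, return the encoder) can be sketched with stand-in classes. Everything here is hypothetical: `ToySequenceDataset`, `ToyReceptorDataset`, `ToyRepertoireDataset` and `ToyPropertyEncoder` are placeholders, and the real encoder would typically rely on immuneML's ParameterValidator rather than hand-written checks.

```python
VALID_FACTORS = ("atchley", "kidera", "amino_acid_property")


# Stand-in dataset classes; these are NOT the immuneML dataset classes.
class ToySequenceDataset:
    pass


class ToyReceptorDataset:
    pass


class ToyRepertoireDataset:
    pass


class ToyPropertyEncoder:
    """Hypothetical encoder used only to illustrate the build_object contract."""

    def __init__(self, factors, region_type, name=None):
        self.factors = factors
        self.region_type = region_type
        self.name = name

    @staticmethod
    def build_object(dataset, **params):
        # 1. Check parameters: fail early on an unknown factor set.
        if params.get("factors") not in VALID_FACTORS:
            raise ValueError(f"factors must be one of {VALID_FACTORS}")
        # 2. Check the dataset type: only sequence and receptor datasets are accepted.
        if not isinstance(dataset, (ToySequenceDataset, ToyReceptorDataset)):
            raise TypeError(f"unsupported dataset type: {type(dataset).__name__}")
        # 3. Create and return the encoder instance.
        return ToyPropertyEncoder(**params)


encoder = ToyPropertyEncoder.build_object(
    ToySequenceDataset(), factors="kidera", region_type="imgt_cdr3"
)
```

Raising at this stage means a misconfigured specification fails at parsing time, before any encoding work starts.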
- encode(dataset: Dataset, params: EncoderParams) → Dataset
This is the main encoding method of the Encoder. It takes in a given dataset, computes an EncodedData object, and returns a copy of the dataset with the attached EncodedData object.
- Parameters:
dataset – A dataset object (Sequence, Receptor or RepertoireDataset)
params – An EncoderParams object containing a few utility parameters that may be used during encoding (e.g., the number of parallel processes to use).
- Returns:
A copy of the original dataset, with an EncodedData object added to the dataset.encoded_data field.