immuneML.encodings.atchley_kmer_encoding package

Submodules

immuneML.encodings.atchley_kmer_encoding.AtchleyKmerEncoder module

class immuneML.encodings.atchley_kmer_encoding.AtchleyKmerEncoder.AtchleyKmerEncoder(k: int, skip_first_n_aa: int, skip_last_n_aa: int, abundance: str, normalize_all_features: bool, name: str = None)[source]

Bases: DatasetEncoder

Represents a repertoire through Atchley factors and relative abundance of k-mers. Should be used in combination with the AtchleyKmerMILClassifier.

For more details, see the original publication: Ostmeyer J, Christley S, Toby IT, Cowell LG. Biophysicochemical motifs in T cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocytes and adjacent healthy tissue. Cancer Res. Published online January 1, 2019:canres.2292.2018. doi:10.1158/0008-5472.CAN-18-2292

Note that sequences in the repertoire with length shorter than skip_first_n_aa + skip_last_n_aa + k will not be encoded.
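The length rule can be illustrated with a minimal sketch. The helper names `is_encodable` and `trimmed_kmers` are hypothetical and not part of immuneML; the first only restates the condition given in the note, and the second assumes that k-mers are taken as overlapping windows (step 1) from the trimmed sequence.

```python
def is_encodable(sequence: str, k: int, skip_first_n_aa: int, skip_last_n_aa: int) -> bool:
    """Return True if at least one k-mer remains after trimming both ends."""
    return len(sequence) >= skip_first_n_aa + skip_last_n_aa + k


def trimmed_kmers(sequence: str, k: int, skip_first_n_aa: int, skip_last_n_aa: int) -> list:
    """Trim both ends, then slide a window of length k (assumed: overlapping, step 1)."""
    if not is_encodable(sequence, k, skip_first_n_aa, skip_last_n_aa):
        return []
    # Compute the end index explicitly so that skip_last_n_aa == 0 also works.
    end = len(sequence) - skip_last_n_aa
    core = sequence[skip_first_n_aa:end]
    return [core[i:i + k] for i in range(len(core) - k + 1)]


# Made-up example sequences; with k=4 and 3 residues skipped on each side,
# a sequence needs at least 10 amino acids to be encoded.
sequences = ["CASSLGQAYEQYF", "CASSQYF", "CASSLAPGATNEKLFF"]
kept = [s for s in sequences if is_encodable(s, k=4, skip_first_n_aa=3, skip_last_n_aa=3)]
```

Here the second sequence has only 7 amino acids and would be dropped under the stated rule, while the other two are kept.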

Dataset type:

  • RepertoireDatasets

Specification arguments:

  • k (int): k-mer length

  • skip_first_n_aa (int): number of amino acids to remove from the beginning of the receptor sequence

  • skip_last_n_aa (int): number of amino acids to remove from the end of the receptor sequence

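  • abundance (str): how to compute the abundance term for k-mers; valid values are RELATIVE_ABUNDANCE and TCRB_RELATIVE_ABUNDANCE (see RelativeAbundanceType)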

  • normalize_all_features (bool): when normalizing features to have 0 mean and unit variance, this parameter indicates if the abundance feature should be included in the normalization
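The effect of normalize_all_features can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the function names are hypothetical, the abundance feature is assumed to be the last column of each row, and the population standard deviation is assumed for scaling.

```python
from statistics import mean, pstdev


def standardize_columns(rows, columns):
    """Return a copy of rows where the selected columns have zero mean and unit variance."""
    result = [list(row) for row in rows]
    for col in columns:
        values = [row[col] for row in rows]
        mu, sigma = mean(values), pstdev(values)
        for row in result:
            # Constant columns are mapped to 0.0 to avoid division by zero.
            row[col] = (row[col] - mu) / sigma if sigma > 0 else 0.0
    return result


def normalize(rows, normalize_all_features: bool):
    """Assumption: the last column is the abundance feature, the rest are Atchley factors."""
    n_columns = len(rows[0])
    columns = range(n_columns) if normalize_all_features else range(n_columns - 1)
    return standardize_columns(rows, columns)


# Toy rows: two made-up factor values followed by an abundance value.
rows = [[1.0, 2.0, 0.2], [3.0, 6.0, 0.8]]
only_factors = normalize(rows, normalize_all_features=False)  # abundance left untouched
everything = normalize(rows, normalize_all_features=True)     # abundance also standardized
```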

YAML specification:

definitions:
    encodings:
        my_encoder:
            AtchleyKmer:
                k: 4
                skip_first_n_aa: 3
                skip_last_n_aa: 3
                abundance: RELATIVE_ABUNDANCE
                normalize_all_features: False
static build_object(dataset, **params)[source]

Creates an instance of the relevant subclass of the DatasetEncoder class using the given parameters. This method will be called during parsing time (early in the immuneML run), such that parameters and dataset type can be tested here.

The build_object method should do the following:

  1. Check parameters: immuneML should crash if wrong user parameters are specified. The ParameterValidator utility class may be used for parameter testing.

  2. Check the dataset type: immuneML should crash if the wrong dataset type is specified for this encoder. For example, DeepRCEncoder should only work for RepertoireDatasets and crash if the dataset is of another type.

  3. Create an instance of the correct Encoder class, using the given parameters. Return this object. Some encoders have different subclasses depending on the dataset type. Make sure to return an instance of the correct subclass. For instance: KmerFrequencyEncoder has different subclasses for each dataset type. When the dataset is a Repertoire dataset, KmerFreqRepertoireEncoder should be returned.

Parameters:

**params – keyword arguments that will be provided by users in the specification (if immuneML is used as a command line tool) or in the dictionary when calling the method from the code, and which should be used to create the Encoder object

Returns:

the object of the appropriate Encoder class
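The three steps listed for build_object can be sketched with self-contained stand-ins. All class names below (`ToyRepertoireDataset`, `ToySequenceDataset`, `ToyKmerEncoder`) are hypothetical and do not exist in immuneML; a real encoder would subclass DatasetEncoder and could use the ParameterValidator utility instead of the manual checks shown here.

```python
class ToyRepertoireDataset:
    """Hypothetical stand-in for a repertoire dataset class."""


class ToySequenceDataset:
    """Hypothetical stand-in for a dataset type this encoder does not support."""


class ToyKmerEncoder:
    """Hypothetical encoder used only to illustrate the build_object pattern."""

    def __init__(self, k: int, name: str = None):
        self.k = k
        self.name = name

    @staticmethod
    def build_object(dataset, **params):
        # Step 1: check parameters and fail early on invalid values.
        k = params.get("k")
        if not isinstance(k, int) or isinstance(k, bool) or k < 1:
            raise ValueError(f"ToyKmerEncoder: k must be a positive integer, got {k!r}.")

        # Step 2: check the dataset type.
        if not isinstance(dataset, ToyRepertoireDataset):
            raise TypeError("ToyKmerEncoder only supports repertoire datasets.")

        # Step 3: create and return the encoder instance.
        return ToyKmerEncoder(**params)


encoder = ToyKmerEncoder.build_object(ToyRepertoireDataset(), k=4, name="my_encoder")
```

Failing in build_object rather than in encode means that a misconfigured specification is rejected at parsing time, before any expensive computation starts.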

encode(dataset, params: EncoderParams)[source]

This is the main encoding method of the Encoder. It takes in a given dataset, computes an EncodedData object, and returns a copy of the dataset with the attached EncodedData object.

Parameters:
  • dataset – A dataset object (Sequence, Receptor or RepertoireDataset)

  • params – An EncoderParams object containing a few utility parameters which may be used during encoding (e.g., number of parallel processes to use).

Returns:

A copy of the original dataset, with an EncodedData object added to the dataset.encoded_data field.

get_additional_files() → List[str][source]

Should return a list with all the files that need to be stored when storing the encoder. For example, SimilarToPositiveSequenceEncoder stores all ‘positive’ sequences in the training data, and predicts a sequence to be ‘positive’ if it is similar to any positive sequences in the training data. In that case, these positive sequences are stored in a file.

For many encoders, it may not be necessary to store additional files.

static get_documentation()[source]
static load_encoder(encoder_file: Path)[source]

The load_encoder method loads an encoder from the file in which an encoder of the same class was previously stored using the store function. Encoders are stored in pickle format. If the encoder uses additional files, they should be explicitly loaded here as well.

If there are no additional files, this method does not need to be overridden. If there are additional files, its contents should be as follows:

    encoder = DatasetEncoder.load_encoder(encoder_file)
    encoder.my_additional_file = DatasetEncoder.load_attribute(encoder, encoder_file, "my_additional_file")

Parameters:

encoder_file (Path) – path to the encoder file where the encoder was stored using the store() function

Returns:

the loaded Encoder object

immuneML.encodings.atchley_kmer_encoding.RelativeAbundanceType module

class immuneML.encodings.atchley_kmer_encoding.RelativeAbundanceType.RelativeAbundanceType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

RELATIVE_ABUNDANCE = 'relative_abundance'
TCRB_RELATIVE_ABUNDANCE = 'tcrb_relative_abundance'

immuneML.encodings.atchley_kmer_encoding.Util module

class immuneML.encodings.atchley_kmer_encoding.Util.Util[source]

Bases: object

ATCHLEY_FACTORS = None
ATCHLEY_FACTOR_COUNT = 5
static compute_abundance(sequences: ndarray, counts: ndarray, k: int, abundance: RelativeAbundanceType)[source]
static compute_relative_abundance(sequences: ndarray, counts: ndarray, k: int) → dict[source]

Computes the relative abundance of k-mers in the repertoire according to the following equations, where C is the template count, T is the total count, and RA is the relative abundance (the output of the function for each k-mer separately):

\[ \begin{aligned} C^{kmer} &= \sum_{\underset{\text{with } kmer}{TCR\beta}} C^{TCR\beta} \\ T^{kmer} &= \sum_{kmer} C^{kmer} \\ RA &= \frac{C^{kmer}}{T^{kmer}} \end{aligned} \]

For more details, please see the original publication: Ostmeyer J, Christley S, Toby IT, Cowell LG. Biophysicochemical motifs in T cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocytes and adjacent healthy tissue. Cancer Res. Published online January 1, 2019:canres.2292.2018. doi:10.1158/0008-5472.CAN-18-2292

Parameters:
  • sequences – an array of (amino acid) sequences (corresponding to a repertoire)

  • counts – an array of counts for each of the sequences

  • k – the length of the k-mer (in the publication referenced above, k is 4)

Returns:

a dictionary where keys are k-mers and values are their relative abundances in the given list of sequences
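The equations for compute_relative_abundance can be traced with a small pure-Python sketch. This is not the library's code: the function names are hypothetical, and it assumes that each receptor sequence contributes its count once per distinct k-mer it contains, which is one reading of the sum over sequences "with kmer".

```python
from collections import defaultdict


def kmers_of(sequence: str, k: int) -> set:
    """Distinct overlapping k-mers of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def relative_abundance(sequences, counts, k: int) -> dict:
    # C^kmer: sum of template counts over the sequences that contain the k-mer.
    c_kmer = defaultdict(float)
    for sequence, count in zip(sequences, counts):
        for kmer in kmers_of(sequence, k):
            c_kmer[kmer] += count
    # T^kmer: total of C^kmer over all k-mers.
    total = sum(c_kmer.values())
    # RA: each k-mer's share of the total.
    return {kmer: c / total for kmer, c in c_kmer.items()}


# Toy repertoire: "AAAC" seen 2 times, "AACD" seen 3 times, k = 3.
ra = relative_abundance(["AAAC", "AACD"], [2, 3], k=3)
```

In this toy input, the k-mer AAC occurs in both sequences, so its count is 2 + 3 = 5 out of a total of 10, while AAA and ACD get 2 and 3 respectively.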

static compute_tcrb_relative_abundance(sequences: ndarray, counts: ndarray, k: int) → dict[source]

Computes the relative abundance of k-mers in the repertoire according to the following equations, where C is the template count for the given receptor sequence and T is the total count across all receptor sequences. The relative abundance is first computed per receptor sequence, and only the maximum is used for each k-mer, so that the k-mer’s relative abundance equals the relative abundance of the most frequent receptor sequence in which the k-mer appears:

\[ \begin{aligned} T^{TCR\beta} &= \sum_{TCR\beta} C^{TCR\beta} \\ RA^{TCR\beta} &= \frac{C^{TCR\beta}}{T^{TCR\beta}} \\ RA &= \max_{\underset{\text{with } kmer}{TCR\beta}} RA^{TCR\beta} \end{aligned} \]

For more details, please see the original publication: Ostmeyer J, Christley S, Toby IT, Cowell LG. Biophysicochemical motifs in T cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocytes and adjacent healthy tissue. Cancer Res. Published online January 1, 2019:canres.2292.2018. doi:10.1158/0008-5472.CAN-18-2292

Parameters:
  • sequences – an array of (amino acid) sequences (corresponding to a repertoire)

  • counts – an array of counts for each of the sequences

  • k – the length of the k-mer (in the publication referenced above, k is 4)

Returns:

a dictionary where keys are k-mers and values are their relative abundances in the given list of sequences
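The equations for compute_tcrb_relative_abundance can be followed in the same way. The function below is a hypothetical sketch, not the library's implementation; it assumes overlapping k-mers and takes, for each k-mer, the maximum relative abundance over the sequences that contain it.

```python
def tcrb_relative_abundance(sequences, counts, k: int) -> dict:
    # T^{TCRb}: total template count across all receptor sequences.
    total = sum(counts)
    ra = {}
    for sequence, count in zip(sequences, counts):
        # RA^{TCRb}: relative abundance of this receptor sequence.
        sequence_ra = count / total
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            # RA: keep the maximum over all sequences containing the k-mer.
            ra[kmer] = max(ra.get(kmer, 0.0), sequence_ra)
    return ra


# Toy repertoire: "AAAC" seen 2 times, "AACD" seen 3 times, k = 3.
tcrb_ra = tcrb_relative_abundance(["AAAC", "AACD"], [2, 3], k=3)
```

Unlike the k-mer based variant, the values here do not have to sum to 1: the shared k-mer AAC takes the abundance of the more frequent sequence (3/5), and ACD receives the same value.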

static get_atchely_factors(kmers: list, k: int) → pandas.DataFrame[source]

Returns values of Atchley factors for each amino acid in the sequence. The data was downloaded from the publication: Atchley WR, Zhao J, Fernandes AD, Drüke T. Solving the protein sequence metric problem. PNAS. 2005;102(18):6395-6400. doi:10.1073/pnas.0408677102

Parameters:
  • kmers – a list of amino acid sequences

  • k – length of k-mers

Returns:

values of Atchley factors for each amino acid in the sequence
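The mapping from k-mers to factor values can be sketched as below. The table `TOY_FACTORS` holds placeholder numbers for illustration only (they are NOT the published Atchley factors), the function name `kmer_factor_rows` is hypothetical, and the layout of one row per k-mer with k × 5 values is an assumption about how the result is organized.

```python
ATCHLEY_FACTOR_COUNT = 5

# Placeholder values for illustration only; NOT the published Atchley factors.
TOY_FACTORS = {
    "A": [0.1, 0.2, 0.3, 0.4, 0.5],
    "C": [1.1, 1.2, 1.3, 1.4, 1.5],
}


def kmer_factor_rows(kmers, k, factors=TOY_FACTORS):
    """Map each k-mer to a flat list of k * ATCHLEY_FACTOR_COUNT factor values."""
    rows = {}
    for kmer in kmers:
        if len(kmer) != k:
            raise ValueError(f"Expected k-mers of length {k}, got {kmer!r}.")
        # Concatenate the five factors of each amino acid, in sequence order.
        rows[kmer] = [value for amino_acid in kmer for value in factors[amino_acid]]
    return rows


factor_rows = kmer_factor_rows(["AC", "CA"], k=2)
```

A dictionary of this shape could be turned into a table with `pandas.DataFrame.from_dict(factor_rows, orient="index")`, giving one row per k-mer.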

Module contents