immuneML.encodings.atchley_kmer_encoding package

Submodules

immuneML.encodings.atchley_kmer_encoding.AtchleyKmerEncoder module

class immuneML.encodings.atchley_kmer_encoding.AtchleyKmerEncoder.AtchleyKmerEncoder(k: int, skip_first_n_aa: int, skip_last_n_aa: int, abundance: str, normalize_all_features: bool, name: Optional[str] = None)[source]

Bases: immuneML.encodings.DatasetEncoder.DatasetEncoder

Represents a repertoire through Atchley factors and relative abundance of k-mers. Should be used in combination with the AtchleyKmerMILClassifier.

For more details, see the original publication: Ostmeyer J, Christley S, Toby IT, Cowell LG. Biophysicochemical motifs in T cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocytes and adjacent healthy tissue. Cancer Res. Published online January 1, 2019:canres.2292.2018. doi:10.1158/0008-5472.CAN-18-2292

Note that sequences in the repertoire with length shorter than skip_first_n_aa + skip_last_n_aa + k will not be encoded.
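The length rule above can be illustrated with a minimal sketch. The helper name trimmed_kmers is hypothetical and not part of immuneML; it only assumes that the skipped amino acids are removed from both ends before overlapping k-mers are extracted.

```python
def trimmed_kmers(sequence: str, k: int, skip_first_n_aa: int, skip_last_n_aa: int) -> list:
    """Hypothetical helper: k-mers of a sequence after trimming both ends."""
    # Sequences shorter than skip_first_n_aa + skip_last_n_aa + k yield no k-mers,
    # which mirrors the note that such sequences are not encoded.
    if len(sequence) < skip_first_n_aa + skip_last_n_aa + k:
        return []
    # Drop the first and last amino acids as requested.
    trimmed = sequence[skip_first_n_aa:len(sequence) - skip_last_n_aa]
    # Slide a window of length k over the trimmed sequence.
    return [trimmed[i:i + k] for i in range(len(trimmed) - k + 1)]


kmers = trimmed_kmers("CASSLGTDTQYF", k=4, skip_first_n_aa=3, skip_last_n_aa=3)
too_short = trimmed_kmers("CASSLGYF", k=4, skip_first_n_aa=3, skip_last_n_aa=3)
```

With k = 4 and three amino acids skipped at each end, a sequence needs at least 10 amino acids to contribute a k-mer.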

Parameters
  • k (int) – k-mer length

  • skip_first_n_aa (int) – number of amino acids to remove from the beginning of the receptor sequence

  • skip_last_n_aa (int) – number of amino acids to remove from the end of the receptor sequence

  • abundance (str) – how to compute the abundance term for k-mers; one of the RelativeAbundanceType names (RELATIVE_ABUNDANCE or TCRB_RELATIVE_ABUNDANCE)

  • normalize_all_features (bool) – when normalizing features to have zero mean and unit variance, indicates whether the abundance feature should be included in the normalization

YAML specification:

my_encoder:
    AtchleyKmer:
        k: 4
        skip_first_n_aa: 3
        skip_last_n_aa: 3
        abundance: RELATIVE_ABUNDANCE
        normalize_all_features: False
static build_object(dataset, **params)[source]
encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams)[source]
static export_encoder(path: pathlib.Path, encoder) pathlib.Path[source]
get_additional_files() List[str][source]
static get_documentation()[source]
static load_encoder(encoder_file: pathlib.Path)[source]

immuneML.encodings.atchley_kmer_encoding.RelativeAbundanceType module

class immuneML.encodings.atchley_kmer_encoding.RelativeAbundanceType.RelativeAbundanceType(value)[source]

Bases: enum.Enum

An enumeration.

RELATIVE_ABUNDANCE = 'relative_abundance'
TCRB_RELATIVE_ABUNDANCE = 'tcrb_relative_abundance'

immuneML.encodings.atchley_kmer_encoding.Util module

class immuneML.encodings.atchley_kmer_encoding.Util.Util[source]

Bases: object

ATCHLEY_FACTORS = None
ATCHLEY_FACTOR_COUNT = 5
static compute_abundance(sequences: numpy.ndarray, counts: numpy.ndarray, k: int, abundance: immuneML.encodings.atchley_kmer_encoding.RelativeAbundanceType.RelativeAbundanceType)[source]
static compute_relative_abundance(sequences: numpy.ndarray, counts: numpy.ndarray, k: int) dict[source]

Computes the relative abundance of k-mers in the repertoire according to the following equations, where C is the template count, T is the total count, and RA is the relative abundance (the output of the function, computed separately for each k-mer):

\[ \begin{aligned} C^{kmer} &= \sum_{\underset{\text{with kmer}}{TCR\beta}} C^{TCR\beta} \\ T^{kmer} &= \sum_{kmer} C^{kmer} \\ RA &= \frac{C^{kmer}}{T^{kmer}} \end{aligned} \]

For more details, please see the original publication: Ostmeyer J, Christley S, Toby IT, Cowell LG. Biophysicochemical motifs in T cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocytes and adjacent healthy tissue. Cancer Res. Published online January 1, 2019:canres.2292.2018. doi:10.1158/0008-5472.CAN-18-2292

Parameters
  • sequences – an array of (amino acid) sequences (corresponding to a repertoire)

  • counts – an array of counts for each of the sequences

  • k – the length of the k-mer (in the publication referenced above, k is 4)

Returns

a dictionary where keys are k-mers and values are their relative abundances in the given list of sequences
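The equations for compute_relative_abundance can be sketched as follows. This is an illustrative reimplementation, not the immuneML source: the function name relative_abundance_sketch is hypothetical, the sequences are assumed to be already trimmed, and a k-mer occurring several times in one sequence is assumed to contribute that sequence's count only once.

```python
from collections import defaultdict

import numpy as np


def relative_abundance_sketch(sequences: np.ndarray, counts: np.ndarray, k: int) -> dict:
    """Hypothetical sketch of RA = C^kmer / T^kmer for every k-mer."""
    kmer_counts = defaultdict(float)
    for sequence, count in zip(sequences, counts):
        # Assumption: each sequence contributes its count once per distinct k-mer.
        distinct_kmers = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
        for kmer in distinct_kmers:
            kmer_counts[kmer] += float(count)  # C^kmer
    total = sum(kmer_counts.values())  # T^kmer: sum of C^kmer over all k-mers
    return {kmer: c / total for kmer, c in kmer_counts.items()}


example_ra = relative_abundance_sketch(np.array(["ACDE", "CDEF"]), np.array([2, 3]), k=3)
```

In this example the k-mer CDE occurs in both sequences, so its template count is 2 + 3 = 5 out of a total of 10.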

static compute_tcrb_relative_abundance(sequences: numpy.ndarray, counts: numpy.ndarray, k: int) dict[source]

Computes the relative abundance of k-mers in the repertoire according to the following equations, where C is the template count for the given receptor sequence and T is the total count across all receptor sequences. The relative abundance is first computed per receptor sequence; each k-mer is then assigned the maximum of these values, so that the k-mer’s relative abundance equals the abundance of the most frequent receptor sequence in which the k-mer appears:

\[ \begin{aligned} T^{TCR\beta} &= \sum_{TCR\beta} C^{TCR\beta} \\ RA^{TCR\beta} &= \frac{C^{TCR\beta}}{T^{TCR\beta}} \\ RA &= \max_{\underset{\text{with kmer}}{TCR\beta}} RA^{TCR\beta} \end{aligned} \]

For more details, please see the original publication: Ostmeyer J, Christley S, Toby IT, Cowell LG. Biophysicochemical motifs in T cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocytes and adjacent healthy tissue. Cancer Res. Published online January 1, 2019:canres.2292.2018. doi:10.1158/0008-5472.CAN-18-2292

Parameters
  • sequences – an array of (amino acid) sequences (corresponding to a repertoire)

  • counts – an array of counts for each of the sequences

  • k – the length of the k-mer (in the publication referenced above, k is 4)

Returns

a dictionary where keys are k-mers and values are their relative abundances in the given list of sequences
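The TCRβ-based variant can be sketched in the same way. Again, this is an illustrative reimplementation under stated assumptions rather than the immuneML source: the function name tcrb_relative_abundance_sketch is hypothetical and the sequences are assumed to be already trimmed.

```python
import numpy as np


def tcrb_relative_abundance_sketch(sequences: np.ndarray, counts: np.ndarray, k: int) -> dict:
    """Hypothetical sketch: each k-mer gets the maximum relative abundance
    of the receptor sequences that contain it."""
    total = float(np.sum(counts))  # T^TCRb: total count over all sequences
    abundance = {}
    for sequence, count in zip(sequences, counts):
        sequence_ra = float(count) / total  # RA^TCRb for this sequence
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            # Keep the largest per-sequence abundance seen for this k-mer.
            abundance[kmer] = max(abundance.get(kmer, 0.0), sequence_ra)
    return abundance


example_tcrb_ra = tcrb_relative_abundance_sketch(np.array(["ACDE", "CDEF"]), np.array([2, 3]), k=3)
```

Unlike the k-mer-based relative abundance, these values do not have to sum to 1, because every k-mer inherits the abundance of a whole receptor sequence.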

static get_atchely_factors(kmers: list, k: int) pandas.DataFrame[source]

Returns values of Atchley factors for each amino acid in the sequence. The data was downloaded from the publication: Atchley WR, Zhao J, Fernandes AD, Drüke T. Solving the protein sequence metric problem. PNAS. 2005;102(18):6395-6400. doi:10.1073/pnas.0408677102

Parameters
  • kmers – a list of amino acid sequences

  • k – length of k-mers

Returns

values of Atchley factors for each amino acid in the sequence

Module contents