immuneML.encodings.atchley_kmer_encoding package
Submodules
immuneML.encodings.atchley_kmer_encoding.AtchleyKmerEncoder module

class immuneML.encodings.atchley_kmer_encoding.AtchleyKmerEncoder.AtchleyKmerEncoder(k: int, skip_first_n_aa: int, skip_last_n_aa: int, abundance: str, normalize_all_features: bool, name: Optional[str] = None)
Bases: immuneML.encodings.DatasetEncoder.DatasetEncoder
Represents a repertoire through Atchley factors and relative abundance of kmers. Should be used in combination with the AtchleyKmerMILClassifier.
For more details, see the original publication: Ostmeyer J, Christley S, Toby IT, Cowell LG. Biophysicochemical motifs in T cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocytes and adjacent healthy tissue. Cancer Res. Published online January 1, 2019:canres.2292.2018. doi:10.1158/0008-5472.CAN-18-2292
Note that sequences in the repertoire with length shorter than skip_first_n_aa + skip_last_n_aa + k will not be encoded.
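The length rule above can be illustrated with a minimal sketch. The helper names `is_encodable` and `trimmed_kmers` are hypothetical and not part of immuneML, and the exact way the encoder trims sequences and extracts kmers is an assumption made for illustration only.

```python
from typing import List


def is_encodable(sequence: str, k: int, skip_first_n_aa: int, skip_last_n_aa: int) -> bool:
    # A sequence can be encoded only if at least one kmer remains
    # after the leading and trailing amino acids are skipped.
    return len(sequence) >= skip_first_n_aa + skip_last_n_aa + k


def trimmed_kmers(sequence: str, k: int, skip_first_n_aa: int, skip_last_n_aa: int) -> List[str]:
    # Hypothetical helper: returns the kmers of the trimmed sequence,
    # or an empty list if the sequence is too short to be encoded.
    if not is_encodable(sequence, k, skip_first_n_aa, skip_last_n_aa):
        return []
    core = sequence[skip_first_n_aa:len(sequence) - skip_last_n_aa]
    return [core[i:i + k] for i in range(len(core) - k + 1)]


# With k=4 and 3 amino acids skipped at each end, a sequence needs
# at least 10 amino acids to contribute any kmers.
print(trimmed_kmers("CASSLGTDTQYF", k=4, skip_first_n_aa=3, skip_last_n_aa=3))
print(trimmed_kmers("CASSQYF", k=4, skip_first_n_aa=3, skip_last_n_aa=3))
```

Under these assumptions, the first call yields the kmers of the trimmed core `SLGTDT`, while the second sequence (7 amino acids) is skipped entirely.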
 Parameters
k (int) – kmer length
skip_first_n_aa (int) – number of amino acids to remove from the beginning of the receptor sequence
skip_last_n_aa (int) – number of amino acids to remove from the end of the receptor sequence
abundance (str) – how to compute the abundance term for kmers
normalize_all_features (bool) – when normalizing features to have zero mean and unit variance, this parameter indicates whether the abundance feature should be included in the normalization
YAML specification:
my_encoder:
    AtchleyKmer:
        k: 4
        skip_first_n_aa: 3
        skip_last_n_aa: 3
        abundance: RELATIVE_ABUNDANCE
        normalize_all_features: False

encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams)
immuneML.encodings.atchley_kmer_encoding.RelativeAbundanceType module
immuneML.encodings.atchley_kmer_encoding.Util module

class immuneML.encodings.atchley_kmer_encoding.Util.Util
Bases: object

ATCHLEY_FACTORS = None

ATCHLEY_FACTOR_COUNT = 5

static compute_abundance(sequences: numpy.ndarray, counts: numpy.ndarray, k: int, abundance: immuneML.encodings.atchley_kmer_encoding.RelativeAbundanceType.RelativeAbundanceType)

static compute_relative_abundance(sequences: numpy.ndarray, counts: numpy.ndarray, k: int) → dict
Computes the relative abundance of kmers in the repertoire according to the following equations, where C is the template count, T is the total count, and RA is the relative abundance (computed separately for each kmer and returned by the function):
\[ \begin{aligned} C^{kmer} &= \sum_{\underset{\text{with kmer}}{TCR\beta}} C^{TCR\beta} \\ T^{kmer} &= \sum_{kmer} C^{kmer} \\ RA &= \frac{C^{kmer}}{T^{kmer}} \end{aligned} \]
For more details, please see the original publication: Ostmeyer J, Christley S, Toby IT, Cowell LG. Biophysicochemical motifs in T cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocytes and adjacent healthy tissue. Cancer Res. Published online January 1, 2019:canres.2292.2018. doi:10.1158/0008-5472.CAN-18-2292
 Parameters
sequences – an array of (amino acid) sequences (corresponding to a repertoire)
counts – an array of counts for each of the sequences
k – the length of the kmer (in the publication referenced above, k is 4)
 Returns
a dictionary where keys are kmers and values are their relative abundances in the given list of sequences
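The equations for this method can be sketched in plain Python as follows. This is an illustrative sketch, not the immuneML implementation: the function name `relative_abundance_sketch` is hypothetical, and it assumes that a kmer occurring several times within one sequence contributes that sequence's count only once, which is one reading of the sum over receptor sequences "with kmer".

```python
from collections import defaultdict
from typing import Dict, Sequence


def relative_abundance_sketch(sequences: Sequence[str], counts: Sequence[float], k: int) -> Dict[str, float]:
    # C^kmer: sum of the template counts of all sequences containing the kmer.
    kmer_counts = defaultdict(float)
    for sequence, count in zip(sequences, counts):
        # Assumption: each distinct kmer is counted once per sequence.
        kmers_in_sequence = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
        for kmer in kmers_in_sequence:
            kmer_counts[kmer] += count

    # T^kmer: total of C^kmer over all kmers.
    total = sum(kmer_counts.values())
    if total == 0:
        return {}

    # RA = C^kmer / T^kmer for each kmer.
    return {kmer: c / total for kmer, c in kmer_counts.items()}


# Toy repertoire with two sequences and template counts 2 and 6.
result = relative_abundance_sketch(["AAAC", "AACD"], [2, 6], k=3)
print(result)
```

In this toy input, `AAC` occurs in both sequences, so its count is 2 + 6 = 8 out of a total of 16 across all kmers. By construction, the returned values sum to 1.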

static compute_tcrb_relative_abundance(sequences: numpy.ndarray, counts: numpy.ndarray, k: int) → dict
Computes the relative abundance of kmers in the repertoire according to the following equations, where C is the template count for the given receptor sequence and T is the total count across all receptor sequences. The relative abundance is first computed per receptor sequence; for each kmer, only the maximum of these values is used, so that the kmer's relative abundance equals the abundance of the most frequent receptor sequence in which the kmer appears:
\[ \begin{aligned} T^{TCR\beta} &= \sum_{TCR\beta} C^{TCR\beta} \\ RA^{TCR\beta} &= \frac{C^{TCR\beta}}{T^{TCR\beta}} \\ RA &= \max_{\underset{\text{with kmer}}{TCR\beta}} RA^{TCR\beta} \end{aligned} \]
For more details, please see the original publication: Ostmeyer J, Christley S, Toby IT, Cowell LG. Biophysicochemical motifs in T cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocytes and adjacent healthy tissue. Cancer Res. Published online January 1, 2019:canres.2292.2018. doi:10.1158/0008-5472.CAN-18-2292
 Parameters
sequences – an array of (amino acid) sequences (corresponding to a repertoire)
counts – an array of counts for each of the sequences
k – the length of the kmer (in the publication referenced above, k is 4)
 Returns
a dictionary where keys are kmers and values are their relative abundances in the given list of sequences
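The maximum-based definition for this method can be sketched in plain Python as follows. This is an illustrative sketch under the equations given for this method, not the immuneML implementation; the function name `tcrb_relative_abundance_sketch` is hypothetical.

```python
from typing import Dict, Sequence


def tcrb_relative_abundance_sketch(sequences: Sequence[str], counts: Sequence[float], k: int) -> Dict[str, float]:
    # T^TCRb: total template count across all receptor sequences.
    total = sum(counts)
    if total == 0:
        return {}

    abundances: Dict[str, float] = {}
    for sequence, count in zip(sequences, counts):
        # RA^TCRb: relative abundance of this receptor sequence.
        sequence_abundance = count / total
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            # RA: keep the maximum over all sequences containing the kmer.
            abundances[kmer] = max(abundances.get(kmer, 0.0), sequence_abundance)
    return abundances


# Toy repertoire with two sequences and template counts 2 and 6.
result = tcrb_relative_abundance_sketch(["AAAC", "AACD"], [2, 6], k=3)
print(result)
```

Unlike the sum-based variant, these values generally do not sum to 1: in the toy input, `AAC` appears in both sequences and takes the abundance of the more frequent one (6 / 8).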

static get_atchely_factors(kmers: list, k: int) → pandas.DataFrame
Returns values of Atchley factors for each amino acid in the sequence. The data was downloaded from the publication: Atchley WR, Zhao J, Fernandes AD, Drüke T. Solving the protein sequence metric problem. PNAS. 2005;102(18):6395-6400. doi:10.1073/pnas.0408677102
 Parameters
kmers – a list of amino acid sequences
k – length of kmers
 Returns
values of Atchley factors for each amino acid in the sequence
