immuneML.encodings.atchley_kmer_encoding package
Submodules
immuneML.encodings.atchley_kmer_encoding.AtchleyKmerEncoder module
class immuneML.encodings.atchley_kmer_encoding.AtchleyKmerEncoder.AtchleyKmerEncoder(k: int, skip_first_n_aa: int, skip_last_n_aa: int, abundance: str, normalize_all_features: bool, name: Optional[str] = None)
Bases: immuneML.encodings.DatasetEncoder.DatasetEncoder
Represents a repertoire through Atchley factors and relative abundance of k-mers. Should be used in combination with the AtchleyKmerMILClassifier.
For more details, see the original publication: Ostmeyer J, Christley S, Toby IT, Cowell LG. Biophysicochemical motifs in T cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocytes and adjacent healthy tissue. Cancer Res. Published online January 1, 2019:canres.2292.2018. doi:10.1158/0008-5472.CAN-18-2292 .
Note that sequences in the repertoire with length shorter than skip_first_n_aa + skip_last_n_aa + k will not be encoded.
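The length rule above can be illustrated with a minimal sketch. This is not the immuneML implementation: the helper name `trim_and_split` is hypothetical, and the sketch assumes that k-mers are taken as overlapping windows over the trimmed sequence.

```python
def trim_and_split(sequence: str, k: int, skip_first_n_aa: int, skip_last_n_aa: int) -> list:
    """Hypothetical helper: trim both ends, then list overlapping k-mers."""
    # Sequences shorter than skip_first_n_aa + skip_last_n_aa + k yield no k-mers.
    if len(sequence) < skip_first_n_aa + skip_last_n_aa + k:
        return []
    # Slice with an explicit end index so that skip_last_n_aa == 0 also works.
    trimmed = sequence[skip_first_n_aa:len(sequence) - skip_last_n_aa]
    return [trimmed[i:i + k] for i in range(len(trimmed) - k + 1)]


print(trim_and_split("CASSLGTDTQYF", k=4, skip_first_n_aa=3, skip_last_n_aa=3))
# ['SLGT', 'LGTD', 'GTDT']
print(trim_and_split("CASSF", k=4, skip_first_n_aa=3, skip_last_n_aa=3))
# []
```

With k = 4 and three amino acids skipped at each end, a sequence needs at least 10 amino acids to contribute a single k-mer.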
- Parameters
k (int) – k-mer length
skip_first_n_aa (int) – number of amino acids to remove from the beginning of the receptor sequence
skip_last_n_aa (int) – number of amino acids to remove from the end of the receptor sequence
abundance (str) – how to compute the abundance term for k-mers (for example, RELATIVE_ABUNDANCE)
normalize_all_features (bool) – when normalizing features to have zero mean and unit variance, indicates whether the abundance feature should be included in the normalization
YAML specification:
my_encoder:
    AtchleyKmer:
        k: 4
        skip_first_n_aa: 3
        skip_last_n_aa: 3
        abundance: RELATIVE_ABUNDANCE
        normalize_all_features: False
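The effect of normalize_all_features can be sketched as column-wise standardization that optionally leaves the abundance column untouched. The helper `standardize`, the row layout, and the use of the population standard deviation are all assumptions made for illustration; the encoder's actual normalization code may differ.

```python
from statistics import mean, pstdev


def standardize(rows, abundance_index, normalize_all_features):
    """Hypothetical helper: column-wise z-scoring of a list of feature rows."""
    n_columns = len(rows[0])
    result = [list(row) for row in rows]
    for col in range(n_columns):
        # Leave the abundance column untouched unless asked to normalize everything.
        if col == abundance_index and not normalize_all_features:
            continue
        values = [row[col] for row in rows]
        mu, sigma = mean(values), pstdev(values)
        for row in result:
            # A constant column has zero variance; map it to 0.0 instead of dividing by zero.
            row[col] = (row[col] - mu) / sigma if sigma > 0 else 0.0
    return result


rows = [[1.0, 0.2], [3.0, 0.6]]  # toy data: one factor column, one abundance column
partial = standardize(rows, abundance_index=1, normalize_all_features=False)
full = standardize(rows, abundance_index=1, normalize_all_features=True)
```

In `partial` the abundance column keeps its original values (0.2 and 0.6), while in `full` it is rescaled like every other column.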
encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams)
immuneML.encodings.atchley_kmer_encoding.RelativeAbundanceType module
immuneML.encodings.atchley_kmer_encoding.Util module
class immuneML.encodings.atchley_kmer_encoding.Util.Util
Bases: object
ATCHLEY_FACTORS = None
ATCHLEY_FACTOR_COUNT = 5
static compute_abundance(sequences: numpy.ndarray, counts: numpy.ndarray, k: int, abundance: immuneML.encodings.atchley_kmer_encoding.RelativeAbundanceType.RelativeAbundanceType)
static compute_relative_abundance(sequences: numpy.ndarray, counts: numpy.ndarray, k: int) → dict
Computes the relative abundance of k-mers in the repertoire according to the following equations, where C is the template count, T is the total count, and RA is the relative abundance (the output of the function, computed for each k-mer separately):
\[ \begin{aligned} C^{kmer} &= \sum_{\underset{\text{with kmer}}{TCR\beta}} C^{TCR\beta} \\ T^{kmer} &= \sum_{kmer} C^{kmer} \\ RA &= \frac{C^{kmer}}{T^{kmer}} \end{aligned} \]
For more details, please see the original publication: Ostmeyer J, Christley S, Toby IT, Cowell LG. Biophysicochemical motifs in T cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocytes and adjacent healthy tissue. Cancer Res. Published online January 1, 2019:canres.2292.2018. doi:10.1158/0008-5472.CAN-18-2292
- Parameters
sequences – an array of (amino acid) sequences (corresponding to a repertoire)
counts – an array of counts for each of the sequences
k – the length of the k-mer (in the publication referenced above, k is 4)
- Returns
a dictionary where keys are k-mers and values are their relative abundances in the given list of sequences
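The three equations above can be sketched in plain Python as follows. The function name `relative_abundance_sketch` is hypothetical, the input sequences are assumed to be already trimmed, and the sketch assumes that a k-mer contributes a sequence's count once even if it occurs several times in that sequence; the library implementation may handle these details differently.

```python
from collections import defaultdict


def relative_abundance_sketch(sequences, counts, k):
    """Hypothetical re-implementation of the three equations above."""
    kmer_counts = defaultdict(float)
    for sequence, count in zip(sequences, counts):
        # Assumption: a k-mer contributes once per sequence, even if it repeats in it.
        kmers = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
        for kmer in kmers:
            kmer_counts[kmer] += count  # C^{kmer}
    total = sum(kmer_counts.values())  # T^{kmer}
    return {kmer: c / total for kmer, c in kmer_counts.items()}  # RA


ra = relative_abundance_sketch(["CASS", "ASSL"], [3, 1], k=3)
print(ra["CAS"], ra["ASS"], ra["SSL"])  # 0.375 0.5 0.125
```

Here ASS occurs in both sequences, so its template count is 3 + 1 = 4 out of a total of 8 across all k-mers.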
static compute_tcrb_relative_abundance(sequences: numpy.ndarray, counts: numpy.ndarray, k: int) → dict
Computes the relative abundance of k-mers in the repertoire according to the following equations, where C is the template count for the given receptor sequence and T is the total count across all receptor sequences. The relative abundance is first computed per receptor sequence, and only the maximum value is used for each k-mer, so that the k-mer’s relative abundance equals the abundance of the most frequent receptor sequence in which the k-mer appears:
\[ \begin{aligned} T^{TCR\beta} &= \sum_{TCR\beta} C^{TCR\beta} \\ RA^{TCR\beta} &= \frac{C^{TCR\beta}}{T^{TCR\beta}} \\ RA &= \max_{\underset{\text{with kmer}}{TCR\beta}} RA^{TCR\beta} \end{aligned} \]
For more details, please see the original publication: Ostmeyer J, Christley S, Toby IT, Cowell LG. Biophysicochemical motifs in T cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocytes and adjacent healthy tissue. Cancer Res. Published online January 1, 2019:canres.2292.2018. doi:10.1158/0008-5472.CAN-18-2292
- Parameters
sequences – an array of (amino acid) sequences (corresponding to a repertoire)
counts – an array of counts for each of the sequences
k – the length of the k-mer (in the publication referenced above, k is 4)
- Returns
a dictionary where keys are k-mers and values are their relative abundances in the given list of sequences
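A minimal sketch of this maximum-based variant, under the same assumptions as before (hypothetical function name `tcrb_relative_abundance_sketch`, sequences already trimmed, plain Python containers instead of numpy arrays), might look like this:

```python
def tcrb_relative_abundance_sketch(sequences, counts, k):
    """Hypothetical re-implementation: each k-mer gets the highest relative
    abundance among the receptor sequences that contain it."""
    total = sum(counts)  # T^{TCRbeta}: total count over all receptor sequences
    result = {}
    for sequence, count in zip(sequences, counts):
        sequence_ra = count / total  # RA^{TCRbeta} for this sequence
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            # Keep the maximum over all sequences containing the k-mer.
            result[kmer] = max(result.get(kmer, 0.0), sequence_ra)
    return result


ra = tcrb_relative_abundance_sketch(["CASS", "ASSL"], [3, 1], k=3)
print(ra["CAS"], ra["ASS"], ra["SSL"])  # 0.75 0.75 0.25
```

Unlike the summed variant, the values returned here need not add up to 1: ASS inherits 0.75 from the more frequent of the two sequences that contain it.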
static get_atchely_factors(kmers: list, k: int) → pandas.DataFrame
Returns values of Atchley factors for each amino acid in the sequence. The data was downloaded from the publication: Atchley WR, Zhao J, Fernandes AD, Drüke T. Solving the protein sequence metric problem. PNAS. 2005;102(18):6395-6400. doi:10.1073/pnas.0408677102
- Parameters
kmers – a list of amino acid sequences
k – length of k-mers
- Returns
values of Atchley factors for each amino acid in the sequence
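The lookup described above can be sketched as a per-position table lookup. Everything in this example is an assumption made for illustration: the helper `kmers_to_factor_rows` is hypothetical, the numbers in `PLACEHOLDER_FACTORS` are invented placeholders rather than the published Atchley factors, the layout (k × 5 values concatenated position by position) is a guess, and the result is a plain dictionary rather than the pandas.DataFrame that the documented method returns.

```python
FACTOR_COUNT = 5  # mirrors ATCHLEY_FACTOR_COUNT documented above

# Placeholder numbers for illustration only; NOT the published Atchley factors.
PLACEHOLDER_FACTORS = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "C": [6.0, 7.0, 8.0, 9.0, 10.0],
}


def kmers_to_factor_rows(kmers, k, factors):
    """Hypothetical helper: map each k-mer to k * FACTOR_COUNT factor values."""
    rows = {}
    for kmer in kmers:
        if len(kmer) != k:
            raise ValueError(f"expected a k-mer of length {k}, got {kmer!r}")
        row = []
        for amino_acid in kmer:
            # Append the five factor values of this position's amino acid.
            row.extend(factors[amino_acid])
        rows[kmer] = row
    return rows


rows = kmers_to_factor_rows(["AC", "CA"], k=2, factors=PLACEHOLDER_FACTORS)
print(rows["AC"])  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
```

Each k-mer is thus described by k × 5 numbers, one block of five factors per amino acid position.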