immuneML.encodings.atchley_kmer_encoding package
Submodules
immuneML.encodings.atchley_kmer_encoding.AtchleyKmerEncoder module
- class immuneML.encodings.atchley_kmer_encoding.AtchleyKmerEncoder.AtchleyKmerEncoder(k: int, skip_first_n_aa: int, skip_last_n_aa: int, abundance: str, normalize_all_features: bool, name: Optional[str] = None)[source]
Bases:
immuneML.encodings.DatasetEncoder.DatasetEncoder
Represents a repertoire through Atchley factors and relative abundance of k-mers. Should be used in combination with the AtchleyKmerMILClassifier.
For more details, see the original publication: Ostmeyer J, Christley S, Toby IT, Cowell LG. Biophysicochemical motifs in T cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocytes and adjacent healthy tissue. Cancer Res. Published online January 1, 2019:canres.2292.2018. doi:10.1158/0008-5472.CAN-18-2292
Note that sequences in the repertoire with length shorter than skip_first_n_aa + skip_last_n_aa + k will not be encoded.
- Parameters
k (int) – k-mer length
skip_first_n_aa (int) – number of amino acids to remove from the beginning of the receptor sequence
skip_last_n_aa (int) – number of amino acids to remove from the end of the receptor sequence
abundance (str) – how to compute the abundance term for k-mers (for example, RELATIVE_ABUNDANCE)
normalize_all_features (bool) – when normalizing features to have 0 mean and unit variance, this parameter indicates if the abundance feature should be included in the normalization
YAML specification:
my_encoder:
    AtchleyKmer:
        k: 4
        skip_first_n_aa: 3
        skip_last_n_aa: 3
        abundance: RELATIVE_ABUNDANCE
        normalize_all_features: False
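The length constraint noted above follows from how the three parameters interact: after skip_first_n_aa and skip_last_n_aa residues are removed, at least k residues must remain to form one k-mer. The following is a minimal sketch of that logic, not the encoder's actual implementation; the helper names trim_and_filter and extract_kmers are hypothetical and are not part of immuneML.

```python
def trim_and_filter(sequences, k, skip_first_n_aa, skip_last_n_aa):
    """Trim both ends of each sequence; drop sequences too short to yield a k-mer.

    Hypothetical helper for illustration only (not an immuneML function).
    """
    min_length = skip_first_n_aa + skip_last_n_aa + k
    trimmed = []
    for seq in sequences:
        if len(seq) < min_length:
            continue  # too short: such sequences are not encoded
        end = len(seq) - skip_last_n_aa  # explicit end index also works when skip_last_n_aa is 0
        trimmed.append(seq[skip_first_n_aa:end])
    return trimmed


def extract_kmers(sequence, k):
    """Return all overlapping k-mers of a sequence, in order of occurrence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# With the values from the YAML specification (k=4, skip 3 from each end),
# a sequence needs at least 3 + 3 + 4 = 10 amino acids to be kept.
kept = trim_and_filter(["CASSLGTDTQYF", "CASSF"], k=4, skip_first_n_aa=3, skip_last_n_aa=3)
kmers = [extract_kmers(seq, 4) for seq in kept]
```

Here the 12-residue sequence is trimmed to its 6 central residues and yields three 4-mers, while the 5-residue sequence is dropped.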
- encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams)[source]
immuneML.encodings.atchley_kmer_encoding.RelativeAbundanceType module
immuneML.encodings.atchley_kmer_encoding.Util module
- class immuneML.encodings.atchley_kmer_encoding.Util.Util[source]
Bases:
object
- ATCHLEY_FACTORS = None
- ATCHLEY_FACTOR_COUNT = 5
- static compute_abundance(sequences: numpy.ndarray, counts: numpy.ndarray, k: int, abundance: immuneML.encodings.atchley_kmer_encoding.RelativeAbundanceType.RelativeAbundanceType)[source]
- static compute_relative_abundance(sequences: numpy.ndarray, counts: numpy.ndarray, k: int) dict [source]
Computes the relative abundance of k-mers in the repertoire according to the following equations, where C is the template count, T is the total count, and RA is the relative abundance (the output of the function, computed separately for each k-mer):
\[
\begin{aligned}
C^{kmer} &= \sum_{\underset{\text{with kmer}}{TCR\beta}} C^{TCR\beta} \\
T^{kmer} &= \sum_{kmer} C^{kmer} \\
RA &= \frac{C^{kmer}}{T^{kmer}}
\end{aligned}
\]
For more details, please see the original publication: Ostmeyer J, Christley S, Toby IT, Cowell LG. Biophysicochemical motifs in T cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocytes and adjacent healthy tissue. Cancer Res. Published online January 1, 2019:canres.2292.2018. doi:10.1158/0008-5472.CAN-18-2292
- Parameters
sequences – an array of (amino acid) sequences (corresponding to a repertoire)
counts – an array of counts for each of the sequences
k – the length of the k-mer (in the publication referenced above, k is 4)
- Returns
a dictionary where keys are k-mers and values are their relative abundances in the given list of sequences
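As an illustration of the equations for compute_relative_abundance, the following is a minimal sketch in plain Python rather than the library's implementation. It assumes that a k-mer contributes a sequence's count once per sequence, even if it occurs several times within that sequence (one reading of the sum over receptor sequences "with kmer"); the function name relative_abundance_sketch is hypothetical.

```python
from collections import defaultdict


def relative_abundance_sketch(sequences, counts, k):
    """Illustrative re-implementation of the RA equations (not immuneML code)."""
    kmer_counts = defaultdict(float)  # C^{kmer}
    for seq, count in zip(sequences, counts):
        # Assumption: each distinct k-mer is counted once per sequence.
        kmers_in_seq = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        for kmer in kmers_in_seq:
            kmer_counts[kmer] += count

    total = sum(kmer_counts.values())  # T^{kmer}
    if total == 0:
        return {}
    return {kmer: c / total for kmer, c in kmer_counts.items()}  # RA


# Toy repertoire: two sequences with template counts 3 and 1, k = 3.
# C: CAS = 3, ASS = 3 + 1 = 4, SSL = 1; T = 8.
ra = relative_abundance_sketch(["CASS", "ASSL"], [3, 1], k=3)
```

For this toy input the relative abundances are 3/8, 4/8 and 1/8, and they sum to 1 by construction.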
- static compute_tcrb_relative_abundance(sequences: numpy.ndarray, counts: numpy.ndarray, k: int) dict [source]
Computes the relative abundance of k-mers in the repertoire according to the following equations, where C is the template count for the given receptor sequence and T is the total count across all receptor sequences. The relative abundance is first computed per receptor sequence; each k-mer is then assigned the maximum of these values, so that the k-mer’s relative abundance is equal to the abundance of the most frequent receptor sequence in which the k-mer appears:
\[
\begin{aligned}
T^{TCR\beta} &= \sum_{TCR\beta} C^{TCR\beta} \\
RA^{TCR\beta} &= \frac{C^{TCR\beta}}{T^{TCR\beta}} \\
RA &= \max_{\underset{\text{with kmer}}{TCR\beta}} RA^{TCR\beta}
\end{aligned}
\]
For more details, please see the original publication: Ostmeyer J, Christley S, Toby IT, Cowell LG. Biophysicochemical motifs in T cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocytes and adjacent healthy tissue. Cancer Res. Published online January 1, 2019:canres.2292.2018. doi:10.1158/0008-5472.CAN-18-2292
- Parameters
sequences – an array of (amino acid) sequences (corresponding to a repertoire)
counts – an array of counts for each of the sequences
k – the length of the k-mer (in the publication referenced above, k is 4)
- Returns
a dictionary where keys are k-mers and values are their relative abundances in the given list of sequences
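The maximum-based variant can be sketched in the same way. This is an illustrative reading of the equations for compute_tcrb_relative_abundance in plain Python, not the library's implementation; the function name tcrb_relative_abundance_sketch is hypothetical.

```python
def tcrb_relative_abundance_sketch(sequences, counts, k):
    """Illustrative re-implementation of the max-based RA equations (not immuneML code)."""
    total = sum(counts)  # T^{TCRb}: total count across all receptor sequences
    if total == 0:
        return {}

    result = {}
    for seq, count in zip(sequences, counts):
        seq_ra = count / total  # RA^{TCRb} of this receptor sequence
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Keep the largest sequence-level abundance seen for this k-mer.
            result[kmer] = max(result.get(kmer, 0.0), seq_ra)
    return result


# Toy repertoire: counts 3 and 1 give sequence abundances 0.75 and 0.25.
# ASS occurs in both sequences, so it takes the larger value.
tcrb_ra = tcrb_relative_abundance_sketch(["CASS", "ASSL"], [3, 1], k=3)
```

Unlike the sum-based definition, these values generally do not sum to 1 across k-mers, because several k-mers can inherit the abundance of the same receptor sequence.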
- static get_atchely_factors(kmers: list, k: int) pandas.DataFrame [source]
Returns values of Atchley factors for each amino acid in the sequence. The data was downloaded from the publication: Atchley WR, Zhao J, Fernandes AD, Drüke T. Solving the protein sequence metric problem. PNAS. 2005;102(18):6395-6400. doi:10.1073/pnas.0408677102
- Parameters
kmers – a list of amino acid sequences
k – length of k-mers
- Returns
values of Atchley factors for each amino acid in the sequence
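To make the mapping from k-mers to factor values concrete, the sketch below looks up five values per amino acid and concatenates them along the k-mer. It rests on two loudly stated assumptions: the numbers in PLACEHOLDER_FACTORS are made up for illustration and are not the published Atchley factors, and the flattened layout (k × 5 values per k-mer) is a guess rather than the documented column layout of the returned DataFrame. The name kmer_to_factor_row is hypothetical.

```python
FACTOR_COUNT = 5  # same value as ATCHLEY_FACTOR_COUNT above

# PLACEHOLDER numbers for illustration only; NOT the published Atchley factors.
PLACEHOLDER_FACTORS = {
    "A": [0.1, 0.2, 0.3, 0.4, 0.5],
    "C": [1.1, 1.2, 1.3, 1.4, 1.5],
    "S": [2.1, 2.2, 2.3, 2.4, 2.5],
}


def kmer_to_factor_row(kmer, factors):
    """Concatenate the per-residue factor values of a k-mer (assumed layout)."""
    row = []
    for amino_acid in kmer:
        row.extend(factors[amino_acid])
    return row


# One row of k * FACTOR_COUNT values per k-mer (here k = 3, so 15 values).
rows = {kmer: kmer_to_factor_row(kmer, PLACEHOLDER_FACTORS) for kmer in ["CAS", "ASS"]}
```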