immuneML.encodings.reference_encoding package
Submodules
immuneML.encodings.reference_encoding.MatchedReceptorsEncoder module
class immuneML.encodings.reference_encoding.MatchedReceptorsEncoder.MatchedReceptorsEncoder(reference_receptors: List[immuneML.data_model.receptor.Receptor.Receptor], max_edit_distances: dict, name: Optional[str] = None)
Bases: immuneML.encodings.DatasetEncoder.DatasetEncoder
Encodes the dataset based on the matches between a dataset containing unpaired (single chain) data, and a paired reference receptor dataset. For each paired reference receptor, the frequency of either chain in the dataset is counted.
This encoding should be used in combination with the Matches report.
- Parameters
reference (dict) – A dictionary describing the reference dataset file, specified the same as regular data import. See sequence_import for specification details. Must contain paired receptor sequences.
max_edit_distances (dict) – A dictionary specifying the maximum edit distance between a target sequence (from the repertoire) and the reference sequence. A maximum distance can be specified per chain, for example to allow for less strict matching of TCR alpha and BCR light chains. When only an integer is specified, this distance is applied to all possible chains.
YAML Specification:
my_mr_encoding:
    MatchedReceptors:
        reference:
            format: IRIS
            params:
                path: path/to/file.txt
                paired: True
                all_dual_chains: True
                all_genes: True
        max_edit_distances:
            alpha: 1
            beta: 0
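The rule that a single integer applies to every chain, while a dictionary sets the distance per chain, can be sketched in a few lines of Python. This is an illustration only and not immuneML's own code: the function name normalize_max_edit_distances and the chain list ASSUMED_CHAINS are hypothetical, and the real set of "all possible chains" is not given in this documentation.

```python
# Hypothetical chain names for illustration; the YAML above uses "alpha" and "beta".
ASSUMED_CHAINS = ("alpha", "beta")


def normalize_max_edit_distances(value, chains=ASSUMED_CHAINS):
    """Return a per-chain dictionary of maximum edit distances.

    A bare integer is applied to every chain in `chains`;
    a dictionary is taken as an explicit per-chain specification.
    """
    if isinstance(value, int) and not isinstance(value, bool):
        distances = {chain: value for chain in chains}
    elif isinstance(value, dict):
        distances = dict(value)
    else:
        raise TypeError("max_edit_distances must be an int or a dict")

    # Reject negative or non-integer distances early.
    for chain, distance in distances.items():
        if isinstance(distance, bool) or not isinstance(distance, int) or distance < 0:
            raise ValueError(
                f"invalid maximum edit distance for chain {chain!r}: {distance!r}"
            )
    return distances


per_chain = normalize_max_edit_distances({"alpha": 1, "beta": 0})
uniform = normalize_max_edit_distances(1)
```

With the per-chain form, alpha chains may differ from the reference by one edit while beta chains must match exactly; with the integer form, both chains receive the same tolerance.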
dataset_mapping = {'RepertoireDataset': 'MatchedReceptorsRepertoireEncoder'}
encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams)
immuneML.encodings.reference_encoding.MatchedReceptorsRepertoireEncoder module
class immuneML.encodings.reference_encoding.MatchedReceptorsRepertoireEncoder.MatchedReceptorsRepertoireEncoder(reference_receptors: List[immuneML.data_model.receptor.Receptor.Receptor], max_edit_distances: dict, name: Optional[str] = None)
Bases: immuneML.encodings.reference_encoding.MatchedReceptorsEncoder.MatchedReceptorsEncoder
immuneML.encodings.reference_encoding.MatchedReferenceUtil module
immuneML.encodings.reference_encoding.MatchedRegexEncoder module
class immuneML.encodings.reference_encoding.MatchedRegexEncoder.MatchedRegexEncoder(motif_filepath: pathlib.Path, match_v_genes: bool, sum_counts: bool, chains: list, name: Optional[str] = None)
Bases: immuneML.encodings.DatasetEncoder.DatasetEncoder
Encodes the dataset based on the matches between a RepertoireDataset and a collection of regular expressions. For each regular expression, the number of sequences in the RepertoireDataset containing the expression is counted. This can also be used to count how often a subsequence occurs in a RepertoireDataset.
The regular expressions are defined per chain, and it is possible to require a V gene match in addition to the CDR3 sequence containing the regular expression.
This encoding should be used in combination with the Matches report.
- Parameters
match_v_genes (bool) – Whether V gene matches are required. If this is True, a match is only counted if the V gene matches the gene specified in the motif input file. By default match_v_genes is False.
sum_counts (bool) – When counting the number of matches, one can choose to count the number of matching sequences or sum the frequencies of those sequences. If sum_counts is True, the sequence frequencies are summed. Otherwise, if sum_counts is False, the number of matching unique sequences is counted. By default sum_counts is False.
motif_filepath (str) – The path to the motif input file. This should be a tab separated file containing a column named ‘id’ and, for every chain that should be matched, a column containing the regex (<chain>_regex) and a column containing the V gene (<chain>V). The chains are specified by their three letter code, see Chain.

In the simplest case, when counting the number of occurrences of a given list of k-mers in TRB sequences, the contents of the motif file could look like this:

====  ==========
id    TRB_regex
====  ==========
1     ACG
2     EDNA
3     DFWG
====  ==========

It is also possible to test whether paired regular expressions occur in the dataset (for example, regular expressions matching both a TRA chain and a TRB chain) by specifying them on the same line. In a more complex case where both paired and unpaired regular expressions are specified, in addition to matching the V genes, the contents of the motif file could look like this:

====  ==========  =======  ==========  ========
id    TRA_regex   TRAV     TRB_regex   TRBV
====  ==========  =======  ==========  ========
1     AGQ.GSS     TRAV35   S[APL]GQY   TRBV29-1
2                          ASS.R.*     TRBV7-3
====  ==========  =======  ==========  ========

YAML Specification:
my_mr_encoding:
    MatchedRegex:
        motif_filepath: path/to/file.txt
        match_v_genes: True
        sum_counts: False
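To make the roles of match_v_genes and sum_counts concrete, the following Python sketch reads a motif table in the format described above and counts matches against a small, made-up list of sequences. It is not the immuneML implementation. The helpers load_motifs and count_matches, the dictionary layout of each sequence, and the use of exact string equality for V genes are all assumptions made for illustration. The sketch also treats each chain column independently and does not model how paired expressions on one line are combined.

```python
import csv
import io
import re

# The paired/unpaired motif table from the example above, tab separated.
MOTIF_FILE = (
    "id\tTRA_regex\tTRAV\tTRB_regex\tTRBV\n"
    "1\tAGQ.GSS\tTRAV35\tS[APL]GQY\tTRBV29-1\n"
    "2\t\t\tASS.R.*\tTRBV7-3\n"
)


def load_motifs(handle, chains):
    """Read one motif per (row, chain) pair that has a non-empty regex."""
    motifs = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for chain in chains:
            pattern = row.get(f"{chain}_regex")
            if pattern:
                motifs.append({
                    "id": row["id"],
                    "chain": chain,
                    "regex": re.compile(pattern),
                    "v_gene": row.get(f"{chain}V") or None,
                })
    return motifs


def count_matches(motif, sequences, match_v_genes=False, sum_counts=False):
    """Count sequences of the motif's chain whose CDR3 contains the regex."""
    total = 0
    for seq in sequences:
        if seq["chain"] != motif["chain"]:
            continue
        if motif["regex"].search(seq["cdr3"]) is None:
            continue
        # Assumption: V genes are compared by exact string equality.
        if match_v_genes and seq["v_gene"] != motif["v_gene"]:
            continue
        # sum_counts=True adds the sequence frequency, otherwise each
        # matching unique sequence contributes 1.
        total += seq["count"] if sum_counts else 1
    return total


# Made-up sequences for illustration only.
sequences = [
    {"chain": "TRB", "cdr3": "CASSLRGEQYF", "v_gene": "TRBV7-3", "count": 5},
    {"chain": "TRB", "cdr3": "CASSARTDTQYF", "v_gene": "TRBV5-1", "count": 2},
    {"chain": "TRA", "cdr3": "CAGQTGSSNTGKLIF", "v_gene": "TRAV35", "count": 3},
]

motifs = load_motifs(io.StringIO(MOTIF_FILE), chains=["TRA", "TRB"])
trb_motif_2 = motifs[2]  # row 2, TRB column: ASS.R.* with TRBV7-3

unique_any_v = count_matches(trb_motif_2, sequences)
unique_same_v = count_matches(trb_motif_2, sequences, match_v_genes=True)
summed_same_v = count_matches(
    trb_motif_2, sequences, match_v_genes=True, sum_counts=True
)
```

Both made-up TRB sequences contain the expression ASS.R.*, but only one of them carries TRBV7-3, so requiring the V gene narrows the match; summing counts then reports that sequence's frequency rather than 1.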
dataset_mapping = {'RepertoireDataset': 'MatchedRegexRepertoireEncoder'}
encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams)
immuneML.encodings.reference_encoding.MatchedRegexRepertoireEncoder module
class immuneML.encodings.reference_encoding.MatchedRegexRepertoireEncoder.MatchedRegexRepertoireEncoder(motif_filepath: pathlib.Path, match_v_genes: bool, sum_counts: bool, chains: list, name: Optional[str] = None)
Bases: immuneML.encodings.reference_encoding.MatchedRegexEncoder.MatchedRegexEncoder
immuneML.encodings.reference_encoding.MatchedSequencesEncoder module
class immuneML.encodings.reference_encoding.MatchedSequencesEncoder.MatchedSequencesEncoder(max_edit_distance: int, reference_sequences: immuneML.data_model.receptor.receptor_sequence.ReceptorSequenceList.ReceptorSequenceList, name: Optional[str] = None)
Bases: immuneML.encodings.DatasetEncoder.DatasetEncoder
Encodes the dataset based on the matches between a RepertoireDataset and a reference sequence dataset.
This encoding should be used in combination with the Matches report.
- Parameters
reference (dict) – A dictionary describing the reference dataset file. See sequence_import for specification details.
max_edit_distance (dict) – The maximum edit distance between a target sequence (from the repertoire) and the reference sequence. A maximum distance can be specified per chain.
YAML Specification:
my_ms_encoding:
    MatchedSequences:
        reference:
            path: path/to/file.txt
            format: VDJDB
        max_edit_distance: 1
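As a rough illustration of what matching within a maximum edit distance means, the sketch below counts, for each reference sequence, how many repertoire sequences lie within a given Levenshtein distance. It is a minimal sketch under stated assumptions rather than immuneML's code: the functions edit_distance and count_matches are hypothetical, sequences are plain strings, every repertoire sequence counts once, and the example sequences are made up.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion from a
                current[j - 1] + 1,      # insertion into a
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def count_matches(repertoire, reference_sequences, max_edit_distance):
    """For each reference sequence, count repertoire sequences within the distance."""
    return [
        sum(1 for seq in repertoire if edit_distance(seq, ref) <= max_edit_distance)
        for ref in reference_sequences
    ]


# Made-up sequences for illustration only.
repertoire = ["CASSLGQYF", "CASSLGQYW", "CATSDGQYF"]
references = ["CASSLGQYF", "CATSDGQYY"]

exact = count_matches(repertoire, references, max_edit_distance=0)
near = count_matches(repertoire, references, max_edit_distance=1)
```

Raising max_edit_distance from 0 to 1 lets the second reference, which has no exact copy in the repertoire, pick up a match that differs by one substitution.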
dataset_mapping = {'RepertoireDataset': 'MatchedSequencesRepertoireEncoder'}
encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams)
immuneML.encodings.reference_encoding.MatchedSequencesRepertoireEncoder module
class immuneML.encodings.reference_encoding.MatchedSequencesRepertoireEncoder.MatchedSequencesRepertoireEncoder(max_edit_distance: int, reference_sequences: immuneML.data_model.receptor.receptor_sequence.ReceptorSequenceList.ReceptorSequenceList, name: Optional[str] = None)
Bases: immuneML.encodings.reference_encoding.MatchedSequencesEncoder.MatchedSequencesEncoder