immuneML.encodings.reference_encoding package
Submodules
immuneML.encodings.reference_encoding.MatchedReceptorsEncoder module
- class immuneML.encodings.reference_encoding.MatchedReceptorsEncoder.MatchedReceptorsEncoder(reference_receptors: List[immuneML.data_model.receptor.Receptor.Receptor], max_edit_distances: dict, name: Optional[str] = None)
Bases: immuneML.encodings.DatasetEncoder.DatasetEncoder
Encodes the dataset based on the matches between a dataset containing unpaired (single chain) data, and a paired reference receptor dataset. For each paired reference receptor, the frequency of either chain in the dataset is counted.
This encoding should be used in combination with the Matches report.
- Parameters
reference (dict) – A dictionary describing the reference dataset file, specified the same as regular data import. See the sequence_import for specification details. Must contain paired receptor sequences.
max_edit_distances (dict) – A dictionary specifying the maximum edit distance between a target sequence (from the repertoire) and the reference sequence. A maximum distance can be specified per chain, for example to allow for less strict matching of TCR alpha and BCR light chains. When only an integer is specified, this distance is applied to all possible chains.
YAML Specification:
my_mr_encoding:
    MatchedReceptors:
        reference:
            format: IRIS
            params:
                path: path/to/file.txt
                paired: True
                all_dual_chains: True
                all_genes: True
        max_edit_distances:
            alpha: 1
            beta: 0
- dataset_mapping = {'RepertoireDataset': 'MatchedReceptorsRepertoireEncoder'}
- encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams)
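The max_edit_distances parameter accepts either a per-chain dictionary or a single integer that applies to all chains. The following minimal sketch shows one way such a value could be normalized before matching. It is not immuneML's implementation: the function name normalize_max_edit_distances, the CHAINS tuple (taken from the keys of the YAML example), and the choice to default unlisted chains to 0 are assumptions made for illustration.

```python
from typing import Dict, Tuple, Union

# Hypothetical chain names, taken from the keys used in the YAML example.
CHAINS: Tuple[str, ...] = ("alpha", "beta")


def normalize_max_edit_distances(value: Union[int, Dict[str, int]],
                                 chains: Tuple[str, ...] = CHAINS) -> Dict[str, int]:
    """Return a per-chain dictionary of maximum edit distances."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        raise TypeError("max_edit_distances must be an int or a dict")
    if isinstance(value, int):
        if value < 0:
            raise ValueError("edit distance must be non-negative")
        # A single integer applies to all possible chains.
        return {chain: value for chain in chains}
    if isinstance(value, dict):
        unknown = sorted(set(value) - set(chains))
        if unknown:
            raise ValueError(f"unknown chains: {unknown}")
        # Assumption: chains that are not listed require an exact match (0).
        result = {chain: value.get(chain, 0) for chain in chains}
        if any(distance < 0 for distance in result.values()):
            raise ValueError("edit distance must be non-negative")
        return result
    raise TypeError("max_edit_distances must be an int or a dict")


if __name__ == "__main__":
    print(normalize_max_edit_distances(1))
    print(normalize_max_edit_distances({"alpha": 1, "beta": 0}))
```

With this helper, the integer form and the dictionary form both lead to the same per-chain lookup during matching.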
immuneML.encodings.reference_encoding.MatchedReceptorsRepertoireEncoder module
- class immuneML.encodings.reference_encoding.MatchedReceptorsRepertoireEncoder.MatchedReceptorsRepertoireEncoder(reference_receptors: List[immuneML.data_model.receptor.Receptor.Receptor], max_edit_distances: dict, name: Optional[str] = None)
Bases: immuneML.encodings.reference_encoding.MatchedReceptorsEncoder.MatchedReceptorsEncoder
immuneML.encodings.reference_encoding.MatchedReferenceUtil module
immuneML.encodings.reference_encoding.MatchedRegexEncoder module
- class immuneML.encodings.reference_encoding.MatchedRegexEncoder.MatchedRegexEncoder(motif_filepath: pathlib.Path, match_v_genes: bool, sum_counts: bool, chains: list, name: Optional[str] = None)
Bases: immuneML.encodings.DatasetEncoder.DatasetEncoder
Encodes the dataset based on the matches between a RepertoireDataset and a collection of regular expressions. For each regular expression, the number of sequences in the RepertoireDataset containing the expression is counted. This can also be used to count how often a subsequence occurs in a RepertoireDataset.
The regular expressions are defined per chain, and it is possible to require a V gene match in addition to the CDR3 sequence containing the regular expression.
This encoding should be used in combination with the Matches report.
- Parameters
match_v_genes (bool) – Whether V gene matches are required. If this is True, a match is only counted if the V gene matches the gene specified in the motif input file. By default match_v_genes is False.
sum_counts (bool) – When counting the number of matches, one can choose to count the number of matching sequences or sum the frequencies of those sequences. If sum_counts is True, the sequence frequencies are summed. Otherwise, if sum_counts is False, the number of matching unique sequences is counted. By default sum_counts is False.
motif_filepath (str) – The path to the motif input file. This should be a tab separated file containing a column named ‘id’ and, for every chain that should be matched, a column containing the regex (<chain>_regex) and a column containing the V gene (<chain>V). The chains are specified by their three letter code, see Chain.
In the simplest case, when counting the number of occurrences of a given list of k-mers in TRB sequences, the contents of the motif file could look like this:
====  ==========
id    TRB_regex
====  ==========
1     ACG
2     EDNA
3     DFWG
====  ==========
It is also possible to test whether paired regular expressions occur in the dataset (for example: regular expressions matching both a TRA chain and a TRB chain) by specifying them on the same line. In a more complex case where both paired and unpaired regular expressions are specified, in addition to matching the V genes, the contents of the motif file could look like this:
====  ==========  =======  ==========  ========
id    TRA_regex   TRAV     TRB_regex   TRBV
====  ==========  =======  ==========  ========
1     AGQ.GSS     TRAV35   S[APL]GQY   TRBV29-1
2                          ASS.R.*     TRBV7-3
====  ==========  =======  ==========  ========
YAML Specification:
my_mr_encoding:
    MatchedRegex:
        motif_filepath: path/to/file.txt
        match_v_genes: True
        sum_counts: False
- dataset_mapping = {'RepertoireDataset': 'MatchedRegexRepertoireEncoder'}
- encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams)
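The counting rules for match_v_genes and sum_counts can be sketched with the Python standard library. This is an illustrative sketch under stated assumptions, not the library's code: load_motifs, count_matches, and the dictionary layout used for repertoire sequences (chain, cdr3, v_gene, count) are hypothetical names. Each chain's regular expression is counted independently here, which simplifies the paired case.

```python
import csv
import io
import re

# Motif file contents mirroring the paired/unpaired example (tab separated).
MOTIF_TSV = (
    "id\tTRA_regex\tTRAV\tTRB_regex\tTRBV\n"
    "1\tAGQ.GSS\tTRAV35\tS[APL]GQY\tTRBV29-1\n"
    "2\t\t\tASS.R.*\tTRBV7-3\n"
)


def load_motifs(handle, chains):
    """Read one motif per (row, chain) pair that has a non-empty regex."""
    motifs = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for chain in chains:
            pattern = row.get(f"{chain}_regex")
            if pattern:
                motifs.append({
                    "id": row["id"],
                    "chain": chain,
                    "regex": re.compile(pattern),
                    "v_gene": row.get(f"{chain}V") or None,
                })
    return motifs


def count_matches(sequences, motifs, match_v_genes=False, sum_counts=False):
    """Count, per motif, the sequences whose CDR3 contains the regex."""
    counts = {(m["id"], m["chain"]): 0 for m in motifs}
    for seq in sequences:
        for m in motifs:
            if seq["chain"] != m["chain"]:
                continue
            if match_v_genes and seq["v_gene"] != m["v_gene"]:
                continue
            # re.search: the regex may occur anywhere in the CDR3 sequence.
            if m["regex"].search(seq["cdr3"]):
                key = (m["id"], m["chain"])
                # sum_counts=True adds the sequence frequency, else counts 1.
                counts[key] += seq["count"] if sum_counts else 1
    return counts


# Hypothetical repertoire sequences for demonstration only.
SEQUENCES = [
    {"chain": "TRA", "cdr3": "CAGQTGSSF", "v_gene": "TRAV35", "count": 3},
    {"chain": "TRB", "cdr3": "CASSPGQYF", "v_gene": "TRBV29-1", "count": 2},
    {"chain": "TRB", "cdr3": "CASSLRGEQF", "v_gene": "TRBV5-1", "count": 5},
]

if __name__ == "__main__":
    motifs = load_motifs(io.StringIO(MOTIF_TSV), chains=["TRA", "TRB"])
    print(count_matches(SEQUENCES, motifs))
    print(count_matches(SEQUENCES, motifs, match_v_genes=True, sum_counts=True))
```

In the second call, the third sequence matches the regex of motif 2 but carries a different V gene, so it is not counted when match_v_genes is True.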
immuneML.encodings.reference_encoding.MatchedRegexRepertoireEncoder module
- class immuneML.encodings.reference_encoding.MatchedRegexRepertoireEncoder.MatchedRegexRepertoireEncoder(motif_filepath: pathlib.Path, match_v_genes: bool, sum_counts: bool, chains: list, name: Optional[str] = None)
Bases: immuneML.encodings.reference_encoding.MatchedRegexEncoder.MatchedRegexEncoder
immuneML.encodings.reference_encoding.MatchedSequencesEncoder module
- class immuneML.encodings.reference_encoding.MatchedSequencesEncoder.MatchedSequencesEncoder(max_edit_distance: int, reference_sequences: immuneML.data_model.receptor.receptor_sequence.ReceptorSequenceList.ReceptorSequenceList, name: Optional[str] = None)
Bases: immuneML.encodings.DatasetEncoder.DatasetEncoder
Encodes the dataset based on the matches between a RepertoireDataset and a reference sequence dataset.
This encoding should be used in combination with the Matches report.
- Parameters
reference (dict) – A dictionary describing the reference dataset file. See the sequence_import for specification details.
max_edit_distance (dict) – The maximum edit distance between a target sequence (from the repertoire) and the reference sequence. A maximum distance can be specified per chain.
YAML Specification:
my_ms_encoding:
    MatchedSequences:
        reference:
            path: path/to/file.txt
            format: VDJDB
        max_edit_distance: 1
- dataset_mapping = {'RepertoireDataset': 'MatchedSequencesRepertoireEncoder'}
- encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams)
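Matching repertoire sequences against reference sequences under a maximum edit distance can be sketched as follows. The sketch assumes the edit distance is the Levenshtein distance (insertions, deletions, and substitutions); the exact distance used by the encoder is not stated in this section. The names edit_distance and count_reference_matches, and the example sequences, are hypothetical.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def count_reference_matches(repertoire: List[str],
                            references: List[str],
                            max_edit_distance: int) -> List[int]:
    """For each reference, count repertoire sequences within the distance."""
    return [
        sum(1 for seq in repertoire
            if edit_distance(seq, ref) <= max_edit_distance)
        for ref in references
    ]


# Hypothetical sequences for demonstration only.
REPERTOIRE = ["CASSLGQYF", "CASSLGQYY", "CATTLGQYF"]
REFERENCES = ["CASSLGQYF", "CAWWWWWWF"]

if __name__ == "__main__":
    for distance in (0, 1, 2):
        print(distance, count_reference_matches(REPERTOIRE, REFERENCES, distance))
```

Each returned list holds one count per reference sequence, which corresponds to one feature per reference in an encoded repertoire.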
immuneML.encodings.reference_encoding.MatchedSequencesRepertoireEncoder module
- class immuneML.encodings.reference_encoding.MatchedSequencesRepertoireEncoder.MatchedSequencesRepertoireEncoder(max_edit_distance: int, reference_sequences: immuneML.data_model.receptor.receptor_sequence.ReceptorSequenceList.ReceptorSequenceList, name: Optional[str] = None)
Bases: immuneML.encodings.reference_encoding.MatchedSequencesEncoder.MatchedSequencesEncoder