immuneML.encodings.reference_encoding package

Submodules

immuneML.encodings.reference_encoding.MatchedReceptorsEncoder module

class immuneML.encodings.reference_encoding.MatchedReceptorsEncoder.MatchedReceptorsEncoder(reference_receptors: List[immuneML.data_model.receptor.Receptor.Receptor], max_edit_distances: dict, name: Optional[str] = None)[source]

Bases: immuneML.encodings.DatasetEncoder.DatasetEncoder

Encodes the dataset based on the matches between a dataset containing unpaired (single chain) data and a paired reference receptor dataset. For each paired reference receptor, the frequency of either chain in the dataset is counted.

This encoding should be used in combination with the Matches report.

Parameters

  • reference (dict) – A dictionary describing the reference dataset file, specified the same as regular data import. See the sequence_import for specification details. Must contain paired receptor sequences.

  • max_edit_distances (dict) – A dictionary specifying the maximum edit distance between a target sequence (from the repertoire) and the reference sequence. A maximum distance can be specified per chain, for example to allow for less strict matching of TCR alpha and BCR light chains. When only an integer is specified, this distance is applied to all possible chains.

YAML Specification:

my_mr_encoding:
    MatchedReceptors:
        reference:
            format: IRIS
            params:
                path: path/to/file.txt
                paired: True
                all_dual_chains: True
                all_genes: True
        max_edit_distances:
            alpha: 1
            beta: 0
static build_object(dataset=None, **params)[source]
dataset_mapping = {'RepertoireDataset': 'MatchedReceptorsRepertoireEncoder'}
encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams)[source]
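
The matching described above can be illustrated with a minimal Python sketch. It is not the immuneML implementation: the data layout (plain dictionaries and tuples), the chain names taken from the YAML example, and the helper names edit_distance, normalise_distances and count_chain_matches are all assumptions made for illustration. It shows how a single integer could be expanded to every chain, and how each chain of a paired reference receptor could be counted in an unpaired repertoire.

```python
from typing import Dict, List, Tuple, Union

CHAINS = ("alpha", "beta")  # assumed chain names, taken from the YAML example


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def normalise_distances(max_edit_distances: Union[int, Dict[str, int]]) -> Dict[str, int]:
    """Apply a single integer to every chain; copy a per-chain dict unchanged."""
    if isinstance(max_edit_distances, int):
        return {chain: max_edit_distances for chain in CHAINS}
    return dict(max_edit_distances)


def count_chain_matches(reference_receptors: List[Dict[str, str]],
                        repertoire: List[Tuple[str, str]],
                        max_edit_distances: Union[int, Dict[str, int]]) -> List[Dict[str, int]]:
    """For each paired reference receptor, count matching repertoire sequences per chain."""
    limits = normalise_distances(max_edit_distances)
    return [
        {chain: sum(1 for seq_chain, seq in repertoire
                    if seq_chain == chain
                    and edit_distance(seq, receptor[chain]) <= limits[chain])
         for chain in CHAINS}
        for receptor in reference_receptors
    ]


# Hypothetical data: one paired reference receptor and four single-chain sequences.
reference = [{"alpha": "CAVRDNYGQNFVF", "beta": "CASSLGTDTQYF"}]
repertoire = [("alpha", "CAVRDNYGQNFVF"), ("alpha", "CAVRDSYGQNFVF"),
              ("beta", "CASSLGTDTQYF"), ("beta", "CASSLGTDTQYV")]

print(count_chain_matches(reference, repertoire, {"alpha": 1, "beta": 0}))
# [{'alpha': 2, 'beta': 1}]
print(count_chain_matches(reference, repertoire, 0))
# [{'alpha': 1, 'beta': 1}]
```

With alpha: 1 and beta: 0, as in the YAML specification, the alpha chain also matches a sequence that differs by one substitution, while the beta chain only matches exactly.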

immuneML.encodings.reference_encoding.MatchedReceptorsRepertoireEncoder module

class immuneML.encodings.reference_encoding.MatchedReceptorsRepertoireEncoder.MatchedReceptorsRepertoireEncoder(reference_receptors: List[immuneML.data_model.receptor.Receptor.Receptor], max_edit_distances: dict, name: Optional[str] = None)[source]

Bases: immuneML.encodings.reference_encoding.MatchedReceptorsEncoder.MatchedReceptorsEncoder

immuneML.encodings.reference_encoding.MatchedReferenceUtil module

class immuneML.encodings.reference_encoding.MatchedReferenceUtil.MatchedReferenceUtil[source]

Bases: object

Utility class for MatchedSequencesEncoder and MatchedReceptorsEncoder.

static prepare_reference(reference_params: dict, location: str, paired: bool)[source]

immuneML.encodings.reference_encoding.MatchedRegexEncoder module

class immuneML.encodings.reference_encoding.MatchedRegexEncoder.MatchedRegexEncoder(motif_filepath: pathlib.Path, match_v_genes: bool, sum_counts: bool, chains: list, name: Optional[str] = None)[source]

Bases: immuneML.encodings.DatasetEncoder.DatasetEncoder

Encodes the dataset based on the matches between a RepertoireDataset and a collection of regular expressions. For each regular expression, the number of sequences in the RepertoireDataset containing the expression is counted. This can also be used to count how often a subsequence occurs in a RepertoireDataset.

The regular expressions are defined per chain, and it is possible to require a V gene match in addition to the CDR3 sequence containing the regular expression.

This encoding should be used in combination with the Matches report.

Parameters
  • match_v_genes (bool) – Whether V gene matches are required. If this is True, a match is only counted if the V gene matches the gene specified in the motif input file. By default match_v_genes is False.

  • sum_counts (bool) – When counting the number of matches, one can choose to count the number of matching sequences or sum the frequencies of those sequences. If sum_counts is True, the sequence frequencies are summed. Otherwise, if sum_counts is False, the number of matching unique sequences is counted. By default sum_counts is False.

  • motif_filepath (str) – The path to the motif input file. This should be a tab separated file containing a column named ‘id’ and, for every chain that should be matched, a column containing the regex (<chain>_regex) and a column containing the V gene (<chain>V). The chains are specified by their three letter code, see Chain.

In the simplest case, when counting the number of occurrences of a given list of k-mers in TRB sequences, the contents of the motif file could look like this:

==== ==========
id   TRB_regex
==== ==========
1    ACG
2    EDNA
3    DFWG
==== ==========

It is also possible to test whether paired regular expressions occur in the dataset (for example: regular expressions matching both a TRA chain and a TRB chain) by specifying them on the same line. In a more complex case where both paired and unpaired regular expressions are specified, in addition to matching the V genes, the contents of the motif file could look like this:

==== ========== ======= ========== ========
id   TRA_regex  TRAV    TRB_regex  TRBV
==== ========== ======= ========== ========
1    AGQ.GSS    TRAV35  S[APL]GQY  TRBV29-1
2                       ASS.R.*    TRBV7-3
==== ========== ======= ========== ========

YAML Specification:

my_mr_encoding:
    MatchedRegex:
        motif_filepath: path/to/file.txt
        match_v_genes: True
        sum_counts: False
static build_object(dataset=None, **params)[source]
dataset_mapping = {'RepertoireDataset': 'MatchedRegexRepertoireEncoder'}
encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams)[source]
static get_documentation()[source]
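
The effect of sum_counts can be sketched in a few lines of Python using the simplest motif file described above. This is an illustration under stated assumptions, not the immuneML implementation: the repertoire is a hypothetical list of (chain, CDR3 sequence, clonal count) tuples, the helper names read_motifs and count_regex_matches are invented here, and "containing the expression" is interpreted as re.search.

```python
import csv
import io
import re

# The simplest motif file from the description above, as tab separated text.
MOTIF_FILE = "id\tTRB_regex\n1\tACG\n2\tEDNA\n3\tDFWG\n"

# Hypothetical repertoire: (chain, CDR3 amino acid sequence, clonal count).
REPERTOIRE = [
    ("TRB", "CASSACGF", 3),
    ("TRB", "CASSEDNAF", 1),
    ("TRB", "CACGEDNAF", 2),
    ("TRA", "CAACGF", 5),
]


def read_motifs(text):
    """Parse the tab separated motif file into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def count_regex_matches(motifs, repertoire, chain="TRB", sum_counts=False):
    """Count, per motif id, the sequences of one chain that contain the regex."""
    result = {}
    for row in motifs:
        pattern = re.compile(row[f"{chain}_regex"])
        # Clonal counts of every sequence of the requested chain that matches.
        hits = [count for seq_chain, seq, count in repertoire
                if seq_chain == chain and pattern.search(seq)]
        # sum_counts=True sums the counts; otherwise matching sequences are counted.
        result[row["id"]] = sum(hits) if sum_counts else len(hits)
    return result


motifs = read_motifs(MOTIF_FILE)
print(count_regex_matches(motifs, REPERTOIRE))                   # {'1': 2, '2': 2, '3': 0}
print(count_regex_matches(motifs, REPERTOIRE, sum_counts=True))  # {'1': 5, '2': 3, '3': 0}
```

The TRA sequence is ignored because only a TRB_regex column is present. Requiring a V gene match would add one more condition to the filter, comparing the sequence's V gene with the <chain>V column.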

immuneML.encodings.reference_encoding.MatchedRegexRepertoireEncoder module

class immuneML.encodings.reference_encoding.MatchedRegexRepertoireEncoder.MatchedRegexRepertoireEncoder(motif_filepath: pathlib.Path, match_v_genes: bool, sum_counts: bool, chains: list, name: Optional[str] = None)[source]

Bases: immuneML.encodings.reference_encoding.MatchedRegexEncoder.MatchedRegexEncoder

immuneML.encodings.reference_encoding.MatchedSequencesEncoder module

class immuneML.encodings.reference_encoding.MatchedSequencesEncoder.MatchedSequencesEncoder(max_edit_distance: int, reference_sequences: immuneML.data_model.receptor.receptor_sequence.ReceptorSequenceList.ReceptorSequenceList, name: Optional[str] = None)[source]

Bases: immuneML.encodings.DatasetEncoder.DatasetEncoder

Encodes the dataset based on the matches between a RepertoireDataset and a reference sequence dataset.

This encoding should be used in combination with the Matches report.

Parameters
  • reference (dict) – A dictionary describing the reference dataset file. See the sequence_import for specification details.

  • max_edit_distance (int) – The maximum edit distance between a target sequence (from the repertoire) and the reference sequence.

YAML Specification:

my_ms_encoding:
    MatchedSequences:
        reference:
            path: path/to/file.txt
            format: VDJDB
        max_edit_distance: 1
static build_object(dataset=None, **params)[source]
dataset_mapping = {'RepertoireDataset': 'MatchedSequencesRepertoireEncoder'}
encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams)[source]
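
As a rough picture of the resulting encoding, the sketch below builds one row per repertoire and one column per reference sequence, counting the repertoire sequences that lie within max_edit_distance of each reference. The layout of the output, the plain-string sequences and the helper names within_edit_distance and encode_repertoires are assumptions for illustration and may differ from what immuneML actually produces.

```python
from typing import Dict, List


def within_edit_distance(a: str, b: str, limit: int) -> bool:
    """True if the Levenshtein distance between a and b is at most limit."""
    if abs(len(a) - len(b)) > limit:  # the length gap alone already exceeds the limit
        return False
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1] <= limit


def encode_repertoires(repertoires: Dict[str, List[str]],
                       reference_sequences: List[str],
                       max_edit_distance: int) -> Dict[str, List[int]]:
    """Map each repertoire to a list of match counts, one per reference sequence."""
    return {
        name: [sum(within_edit_distance(seq, ref, max_edit_distance) for seq in sequences)
               for ref in reference_sequences]
        for name, sequences in repertoires.items()
    }


# Hypothetical reference sequences and repertoires.
reference_sequences = ["CASSLGTDTQYF", "CASSPGQGAYEQYF"]
repertoires = {
    "rep_1": ["CASSLGTDTQYF", "CASSLGTDTQYV", "CASRRRF"],
    "rep_2": ["CASSPGQGAYEQYF"],
}

print(encode_repertoires(repertoires, reference_sequences, max_edit_distance=1))
# {'rep_1': [2, 0], 'rep_2': [0, 1]}
```

Setting max_edit_distance to 0 reduces the first row to [1, 0], since only the exact copy of the first reference sequence still matches.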

immuneML.encodings.reference_encoding.MatchedSequencesRepertoireEncoder module

class immuneML.encodings.reference_encoding.MatchedSequencesRepertoireEncoder.MatchedSequencesRepertoireEncoder(max_edit_distance: int, reference_sequences: immuneML.data_model.receptor.receptor_sequence.ReceptorSequenceList.ReceptorSequenceList, name: Optional[str] = None)[source]

Bases: immuneML.encodings.reference_encoding.MatchedSequencesEncoder.MatchedSequencesEncoder

immuneML.encodings.reference_encoding.SequenceMatchingSummaryType module

class immuneML.encodings.reference_encoding.SequenceMatchingSummaryType.SequenceMatchingSummaryType(value)[source]

Bases: enum.Enum

An enumeration.

CLONAL_PERCENTAGE = 1
COUNT = 0
PERCENTAGE = 2

Module contents