immuneML.encodings.distance_encoding package

Submodules

immuneML.encodings.distance_encoding.CompAIRRDistanceEncoder module

class immuneML.encodings.distance_encoding.CompAIRRDistanceEncoder.CompAIRRDistanceEncoder(compairr_path: Path, keep_compairr_input: bool, differences: int, indels: bool, ignore_counts: bool, ignore_genes: bool, threads: int, context: dict = None, name: str = None)

Bases: DatasetEncoder

Encodes a given RepertoireDataset as a distance matrix, using the Morisita-Horn distance metric. Internally, CompAIRR is used for fast calculation of overlap between repertoires. This creates a pairwise distance matrix between each of the repertoires. The distance is calculated based on the number of matching receptor chain sequences between the repertoires. This matching may be defined to permit 1 or 2 mismatching amino acid/nucleotide positions and 1 indel in the sequence. Furthermore, matching may or may not include V and J gene information, and sequence frequencies may be included or ignored.

When mismatches (differences and indels) are allowed, the Morisita-Horn similarity may exceed 1. In this case, the Morisita-Horn distance (= 1 - similarity) is set to 0 to avoid negative distance scores.
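The clipping rule described above can be illustrated with a minimal sketch. It uses the textbook Morisita-Horn formula on exact sequence matches; this is an assumption for illustration and is not necessarily the computation performed by CompAIRR or immuneML. The function names and the dictionary input format (sequence mapped to count) are hypothetical.

```python
def morisita_horn_similarity(counts_a: dict, counts_b: dict) -> float:
    """Textbook Morisita-Horn similarity between two count dictionaries."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    # Sum of count products over sequences present in both repertoires.
    shared = sum(c * counts_b[s] for s, c in counts_a.items() if s in counts_b)
    # Simpson indices of the two repertoires.
    simpson_a = sum(c * c for c in counts_a.values()) / total_a ** 2
    simpson_b = sum(c * c for c in counts_b.values()) / total_b ** 2
    return 2 * shared / ((simpson_a + simpson_b) * total_a * total_b)


def similarity_to_distance(similarity: float) -> float:
    """Distance = 1 - similarity, clipped at 0 so it never becomes negative."""
    return max(0.0, 1.0 - similarity)


rep_a = {"CASSL": 2, "CASSQ": 1}
rep_b = {"CASRR": 4}
distance_same = similarity_to_distance(morisita_horn_similarity(rep_a, rep_a))
distance_disjoint = similarity_to_distance(morisita_horn_similarity(rep_a, rep_b))
# A similarity above 1 (possible with approximate matching) is clipped to distance 0.
distance_clipped = similarity_to_distance(1.2)
```

With exact matching the similarity stays within [0, 1], so the clipping only matters once differences or indels are permitted.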

Parameters:
  • compairr_path (Path) – optional path to the CompAIRR executable. If not given, it is assumed that CompAIRR has been installed such that it can be called directly on the command line with the command 'compairr', or that it is located at /usr/local/bin/compairr.

  • keep_compairr_input (bool) – whether to keep the input file that was passed to CompAIRR. This may take a lot of storage space if the input dataset is large. By default the input file is not kept.

  • differences (int) – Number of differences allowed between the sequences of two immune receptor chains; this may be between 0 and 2. By default, differences is 0.

  • indels (bool) – Whether to allow an indel. This is only possible if differences is 1. By default, indels is False.

  • ignore_counts (bool) – Whether to ignore the frequencies of the immune receptor chains. If False, frequencies will be included, meaning the ‘counts’ values for the receptors available in two repertoires are multiplied. If True, only the number of unique overlapping immune receptors (‘clones’) is considered. By default, ignore_counts is False.

  • ignore_genes (bool) – Whether to ignore V and J gene information. If False, the V and J genes between two receptor chains have to match. If True, gene information is ignored. By default, ignore_genes is False.

  • threads (int) – The number of threads to use for parallelization. Default is 8.
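The constraints on differences and indels stated in the parameter descriptions can be summarized as a small check. This is a hypothetical helper written for illustration only; it is not part of immuneML, and the library's own validation may differ.

```python
def check_compairr_params(differences: int = 0, indels: bool = False) -> None:
    """Raise ValueError if the documented constraints are violated."""
    # differences may be between 0 and 2 (inclusive).
    if not 0 <= differences <= 2:
        raise ValueError(f"differences must be between 0 and 2, got {differences}")
    # An indel may only be allowed when differences is exactly 1.
    if indels and differences != 1:
        raise ValueError("indels can only be True when differences is 1")


check_compairr_params(differences=0, indels=False)  # documented defaults
check_compairr_params(differences=1, indels=True)   # one difference, indel allowed
```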

YAML specification:

my_distance_encoder:
    CompAIRRDistance:
        compairr_path: optional/path/to/compairr
        differences: 0
        indels: False
        ignore_counts: False
        ignore_genes: False

INPUT_FILENAME = 'compairr_input.tsv'
LOG_FILENAME = 'compairr_log.txt'
OUTPUT_FILENAME = 'compairr_results.txt'
build_distance_matrix(dataset: RepertoireDataset, params: EncoderParams, train_repertoire_ids: list)
build_labels(dataset: RepertoireDataset, params: EncoderParams) -> dict
static build_object(dataset, **params)
encode(dataset: RepertoireDataset, params: EncoderParams) -> RepertoireDataset
static export_encoder(path: Path, encoder) -> Path
static load_encoder(encoder_file: Path)
set_context(context: dict)

immuneML.encodings.distance_encoding.DistanceEncoder module

class immuneML.encodings.distance_encoding.DistanceEncoder.DistanceEncoder(distance_metric: DistanceMetricType, attributes_to_match: list, sequence_batch_size: int, context: dict = None, name: str = None)

Bases: DatasetEncoder

Encodes a given RepertoireDataset as a distance matrix, where the pairwise distance between each of the repertoires is calculated. The distance is calculated based on the presence/absence of elements defined under attributes_to_match. Thus, if attributes_to_match contains only ‘sequence_aas’, the distance between two repertoires is minimal if they contain the same set of sequence_aas, and maximal if none of the sequence_aas are shared between the two repertoires.

Parameters:
  • distance_metric (DistanceMetricType) – The metric used to calculate the distance between two repertoires. Names of different distance metric types are allowed values in the specification. The default distance metric is JACCARD.

  • sequence_batch_size (int) – The number of sequences to be processed at once. Increasing this number increases the memory use. The default value is 1000.

  • attributes_to_match (list) – The attributes to consider when determining whether a sequence is present in both repertoires. Only the fields defined under attributes_to_match will be considered; all other fields are ignored. Valid values include any repertoire attribute.
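The effect of attributes_to_match on a presence/absence distance can be sketched as follows. The example assumes the standard Jaccard distance (1 minus intersection over union) and represents each sequence as a plain dictionary; both the helper names and the data layout are hypothetical and do not reflect immuneML's internal implementation.

```python
def project(sequences: list, attributes_to_match: list) -> set:
    """Keep only the listed attributes of each sequence; other fields are ignored."""
    return {tuple(seq[attr] for attr in attributes_to_match) for seq in sequences}


def jaccard_distance(rep_a: list, rep_b: list, attributes_to_match: list) -> float:
    """Standard Jaccard distance: 0 for identical sets, 1 for disjoint sets."""
    set_a = project(rep_a, attributes_to_match)
    set_b = project(rep_b, attributes_to_match)
    union = set_a | set_b
    if not union:  # two empty repertoires are treated as identical here
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


rep_a = [{"sequence_aas": "CASSL", "v_genes": "TRBV5"},
         {"sequence_aas": "CASSQ", "v_genes": "TRBV7"}]
rep_b = [{"sequence_aas": "CASSL", "v_genes": "TRBV9"},
         {"sequence_aas": "CASRR", "v_genes": "TRBV7"}]

# Matching on the amino acid sequence only: one of three distinct sequences is shared.
d_seq = jaccard_distance(rep_a, rep_b, ["sequence_aas"])
# Matching on sequence and V gene: the shared sequence has different V genes.
d_seq_v = jaccard_distance(rep_a, rep_b, ["sequence_aas", "v_genes"])
```

Adding attributes makes the match stricter, so the distance can only stay the same or grow.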

YAML specification:

my_distance_encoder:
    Distance:
        distance_metric: JACCARD
        sequence_batch_size: 1000
        attributes_to_match:
            - sequence_aas
            - v_genes
            - j_genes
            - chains
            - region_types

build_distance_matrix(dataset: RepertoireDataset, params: EncoderParams, train_repertoire_ids: list)
build_labels(dataset: RepertoireDataset, params: EncoderParams) -> dict
static build_object(dataset, **params)
encode(dataset, params: EncoderParams) -> RepertoireDataset
static export_encoder(path: Path, encoder) -> Path
static get_documentation()
static load_encoder(encoder_file: Path)
set_context(context: dict)

immuneML.encodings.distance_encoding.DistanceMetricType module

class immuneML.encodings.distance_encoding.DistanceMetricType.DistanceMetricType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

JACCARD = 'jaccard'
MORISITA_HORN = 'morisita_horn'
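Because the member names are the values used in a specification, a name such as JACCARD can be resolved with standard Enum lookup. The sketch below defines a stand-in enum that mirrors the two documented members so that it runs without immuneML installed; the helper metric_from_spec is hypothetical.

```python
from enum import Enum


class DistanceMetricType(Enum):
    """Stand-in mirroring the documented members (not imported from immuneML)."""
    JACCARD = "jaccard"
    MORISITA_HORN = "morisita_horn"


def metric_from_spec(name: str) -> DistanceMetricType:
    """Resolve a member by its name; raises KeyError for unknown names."""
    return DistanceMetricType[name]


metric = metric_from_spec("JACCARD")
```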

immuneML.encodings.distance_encoding.TCRdistEncoder module

class immuneML.encodings.distance_encoding.TCRdistEncoder.TCRdistEncoder(cores: int, name: str = None)

Bases: DatasetEncoder

Encodes the given ReceptorDataset as a distance matrix between all receptors, where the distance is computed using TCRdist from the paper: Dash P, Fiore-Gartland AJ, Hertz T, et al. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature. 2017; 547(7661):89-93. doi:10.1038/nature22383.

The implementation uses the TCRdist3 library, whose source code is publicly available.

Parameters:
  • cores (int) – number of processes to use for the computation

YAML specification:

my_tcr_dist_enc: # user-defined name
    TCRdist:
        cores: 4

static build_object(dataset, **params)
encode(dataset, params: EncoderParams)
static export_encoder(path: Path, encoder) -> str
set_context(context: dict)

Module contents