immuneML.encodings.distance_encoding package


immuneML.encodings.distance_encoding.DistanceEncoder module

class immuneML.encodings.distance_encoding.DistanceEncoder.DistanceEncoder(distance_metric: immuneML.encodings.distance_encoding.DistanceMetricType.DistanceMetricType, attributes_to_match: list, sequence_batch_size: int, context: Optional[dict] = None, name: Optional[str] = None)[source]

Bases: immuneML.encodings.DatasetEncoder.DatasetEncoder

Encodes a given RepertoireDataset as a distance matrix, where the pairwise distance between each pair of repertoires is calculated. The distance is calculated based on the presence/absence of elements defined under attributes_to_match. Thus, if attributes_to_match contains only ‘sequence_aas’, the distance between two repertoires is minimal if they contain the same set of sequence_aas, and maximal if none of the sequence_aas are shared between the two repertoires.
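
The presence/absence comparison can be sketched in plain Python. This is a minimal illustration, not immuneML's implementation: the helper names (`element_set`, `jaccard_distance`, `distance_matrix`) and the dictionary-based repertoire records are assumptions made for this example. It uses the standard Jaccard distance, 1 - |A ∩ B| / |A ∪ B|, under which repertoires with identical element sets are at distance 0 and repertoires with no shared elements are at distance 1.

```python
from itertools import combinations


def element_set(repertoire, attributes_to_match):
    """Reduce a repertoire to the set of distinct attribute tuples."""
    return {tuple(seq[attr] for attr in attributes_to_match) for seq in repertoire}


def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here: two empty sets count as identical
    return 1.0 - len(a & b) / len(union)


def distance_matrix(repertoires, attributes_to_match):
    """Symmetric matrix of pairwise Jaccard distances between repertoires."""
    sets = [element_set(r, attributes_to_match) for r in repertoires]
    n = len(sets)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = jaccard_distance(sets[i], sets[j])
    return matrix


# Hypothetical toy repertoires; the field names follow the YAML example.
rep_a = [{"sequence_aas": "CASSL", "v_genes": "TRBV1"},
         {"sequence_aas": "CASSQ", "v_genes": "TRBV2"}]
rep_b = [{"sequence_aas": "CASSL", "v_genes": "TRBV9"},
         {"sequence_aas": "CASSQ", "v_genes": "TRBV2"}]
rep_c = [{"sequence_aas": "CATTT", "v_genes": "TRBV3"}]

by_sequence = distance_matrix([rep_a, rep_b, rep_c], ["sequence_aas"])
by_sequence_and_v = distance_matrix([rep_a, rep_b, rep_c], ["sequence_aas", "v_genes"])
```

Matching on `sequence_aas` alone puts `rep_a` and `rep_b` at distance 0, while also matching on `v_genes` raises their distance to 2/3, because only one of the three distinct (sequence, V gene) pairs is shared.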

  • distance_metric (DistanceMetricType) – The metric used to calculate the distance between two repertoires. Names of different distance metric types are allowed values in the specification. The default distance metric is JACCARD.

  • sequence_batch_size (int) – The number of sequences to be processed at once. Increasing this number increases the memory use. The default value is 1000.

  • attributes_to_match (list) – The attributes to consider when determining whether a sequence is present in both repertoires. Only the fields defined under attributes_to_match will be considered; all other fields are ignored. Valid values include any repertoire attribute.
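
To illustrate the memory trade-off behind sequence_batch_size, the sketch below collects the distinct elements of a repertoire while holding at most one batch of sequences in memory at a time. The function names and the iterable-based input are assumptions for this example; they do not reflect how immuneML reads repertoires internally.

```python
from itertools import islice


def batches(iterable, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def unique_elements(sequences, sequence_batch_size=1000):
    """Collect the distinct elements batch by batch and count the batches."""
    seen = set()
    batch_count = 0
    for batch in batches(sequences, sequence_batch_size):
        seen.update(batch)
        batch_count += 1
    return seen, batch_count


# A generator stands in for sequences streamed from disk (hypothetical data).
stream = (f"SEQ{i % 2500}" for i in range(5000))
elements, n_batches = unique_elements(stream, sequence_batch_size=1000)
```

A larger batch size means fewer iterations but a larger list held in memory per step; the set of distinct elements is the same either way.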

YAML specification:

        distance_metric: JACCARD
        sequence_batch_size: 1000
        attributes_to_match:
            - sequence_aas
            - v_genes
            - j_genes
            - chains
            - region_types
build_distance_matrix(dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, params: immuneML.encodings.EncoderParams.EncoderParams, train_repertoire_ids: list)[source]
build_labels(dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, params: immuneML.encodings.EncoderParams.EncoderParams) → dict[source]
static build_object(dataset, **params)[source]
encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams) → immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset[source]
static export_encoder(path: pathlib.Path, encoder) → pathlib.Path[source]
static get_documentation()[source]
static load_encoder(encoder_file: pathlib.Path)[source]
set_context(context: dict)[source]

immuneML.encodings.distance_encoding.DistanceMetricType module

class immuneML.encodings.distance_encoding.DistanceMetricType.DistanceMetricType(value)[source]

Bases: enum.Enum

An enumeration.

JACCARD = 'jaccard'
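
Because the YAML specification refers to the metric by its member name (JACCARD) while the member's value is the lowercase string 'jaccard', the two standard Enum lookups take different inputs. The sketch below redefines a stand-in enum with the same single member so that it runs without immuneML installed; `parse_metric` is a hypothetical helper, and how immuneML itself parses the specification is not shown here.

```python
from enum import Enum


class DistanceMetricType(Enum):
    # Stand-in mirroring the single member documented above.
    JACCARD = "jaccard"


by_name = DistanceMetricType["JACCARD"]   # lookup by member name
by_value = DistanceMetricType("jaccard")  # lookup by member value


def parse_metric(name: str) -> DistanceMetricType:
    """Hypothetical helper: case-insensitive lookup by member name."""
    try:
        return DistanceMetricType[name.upper()]
    except KeyError:
        valid = ", ".join(member.name for member in DistanceMetricType)
        raise ValueError(f"Unknown distance metric '{name}'; valid names: {valid}") from None
```

Both lookups return the same member object, so either form can be compared with `is` against `DistanceMetricType.JACCARD`.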

immuneML.encodings.distance_encoding.TCRdistEncoder module

class immuneML.encodings.distance_encoding.TCRdistEncoder.TCRdistEncoder(cores: int, name: Optional[str] = None)[source]

Bases: immuneML.encodings.DatasetEncoder.DatasetEncoder

Encodes the given ReceptorDataset as a distance matrix between all receptors, where the distance is computed using TCRdist from the paper: Dash P, Fiore-Gartland AJ, Hertz T, et al. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature. 2017; 547(7661):89-93. doi:10.1038/nature22383.

The implementation uses the TCRdist3 library.


  • cores (int) – The number of processes to use for the computation.

YAML specification:

my_tcr_dist_enc: # user-defined name
        cores: 4
static build_object(dataset, **params)[source]
encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams)[source]
static export_encoder(path: pathlib.Path, encoder) → str[source]
set_context(context: dict)[source]

Module contents