immuneML.encodings.abundance_encoding package

Submodules

immuneML.encodings.abundance_encoding.AbundanceEncoderHelper module

class immuneML.encodings.abundance_encoding.AbundanceEncoderHelper.AbundanceEncoderHelper

Bases: object

INVALID_P_VALUE = 2.0
static build_abundance_matrix(sequence_presence_matrix, matrix_repertoire_ids, dataset_repertoire_ids, sequence_p_values_indices)
static check_is_positive_class(dataset, matrix_repertoire_ids, label_config: LabelConfiguration)
static check_labels(label_config: LabelConfiguration, location: str)
static get_relevant_sequence_indices(sequence_presence_iterator, is_positive_class, p_value_threshold, relevant_indices_path, params, cache_params=None)

immuneML.encodings.abundance_encoding.CompAIRRBatchIterator module

class immuneML.encodings.abundance_encoding.CompAIRRBatchIterator.CompAIRRBatchIterator(paths, sequence_batch_size)

Bases: object

compute_sequence_count()
get_batch_dict(paths)
get_batch_from_path(path)
get_batches(repertoire_ids=None)
get_sequence_vectors(repertoire_ids=None)
set_repertoire_ids(repertoire_ids)

immuneML.encodings.abundance_encoding.CompAIRRSequenceAbundanceEncoder module

class immuneML.encodings.abundance_encoding.CompAIRRSequenceAbundanceEncoder.CompAIRRSequenceAbundanceEncoder(p_value_threshold: float, compairr_path: str, sequence_batch_size: int, ignore_genes: bool, keep_temporary_files: bool, threads: int, name: str = None)

Bases: DatasetEncoder

This encoder works similarly to the SequenceAbundanceEncoder, but internally uses CompAIRR to accelerate core computations.

This encoder represents the repertoires as vectors where:

  • the first element corresponds to the number of label-associated clonotypes

  • the second element is the total number of unique clonotypes

To determine what clonotypes (amino acid sequences with or without matching V/J genes) are label-associated, Fisher’s exact test (one-sided) is used.
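
As an illustration of the test described above, the following is a minimal sketch of a one-sided Fisher's exact test for a single clonotype, written with the Python standard library only. The function name fisher_exact_greater and the layout of the contingency table (repertoires of the positive and negative class in which the clonotype is present or absent) are assumptions made for this example; it is not the code used by the encoder.

```python
from math import comb


def fisher_exact_greater(pos_present: int, neg_present: int,
                         pos_absent: int, neg_absent: int) -> float:
    """One-sided Fisher's exact test (hypothetical helper, for illustration).

    Returns the probability, under the null hypothesis of no association,
    of seeing the clonotype in at least `pos_present` positive repertoires,
    given the row and column totals of the 2x2 contingency table.
    """
    n_pos = pos_present + pos_absent        # repertoires in the positive class
    n_neg = neg_present + neg_absent        # repertoires in the negative class
    n_present = pos_present + neg_present   # repertoires containing the clonotype

    # Number of ways to choose which repertoires contain the clonotype.
    total = comb(n_pos + n_neg, n_present)

    # Sum the hypergeometric probabilities for tables at least as extreme.
    upper = min(n_pos, n_present)
    tail = sum(comb(n_pos, k) * comb(n_neg, n_present - k)
               for k in range(pos_present, upper + 1))
    return tail / total


# A clonotype seen in 3 of 4 positive and 0 of 4 negative repertoires.
p_value = fisher_exact_greater(pos_present=3, neg_present=0,
                               pos_absent=1, neg_absent=4)
```

In this sketch, a clonotype would be treated as label-associated when its p-value is below p_value_threshold.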

The encoder also writes out files containing the contingency table used for Fisher’s exact test, the resulting p-values, and the significantly abundant sequences (use RelevantSequenceExporter to export these sequences in AIRR format).

Reference: Emerson, Ryan O. et al. ‘Immunosequencing Identifies Signatures of Cytomegalovirus Exposure History and HLA-Mediated Effects on the T Cell Repertoire’. Nature Genetics 49, no. 5 (May 2017): 659–65. doi.org/10.1038/ng.3822.

Note: to use this encoder, the positive class must be defined explicitly when defining the label in the instruction. With the positive class defined, it can then be determined which sequences are indicative of the positive class. See Reproduction of the CMV status predictions study for an example using SequenceAbundanceEncoder.

Parameters:
  • p_value_threshold (float) – The p value threshold to be used by the statistical test.

  • compairr_path (Path) – Optional path to the CompAIRR executable. If not given, it is assumed that CompAIRR has been installed such that it can be called directly on the command line with the command 'compairr', or that it is located at /usr/local/bin/compairr.

  • ignore_genes (bool) – Whether to ignore V and J gene information. If False, the V and J genes between two receptor chains have to match. If True, gene information is ignored. By default, ignore_genes is False.

  • sequence_batch_size (int) – The number of sequences in a batch when comparing sequences across repertoires, typically 100s of thousands. This does not affect the results of the encoding, but may affect the speed and memory usage. The default value is 1,000,000.

  • threads (int) – The number of threads to use for parallelization. This does not affect the results of the encoding, only the speed. The default number of threads is 8.

  • keep_temporary_files (bool) – Whether to keep temporary files, including CompAIRR input, output and log files, and the sequence presence matrix. This may take a lot of storage space if the input dataset is large. By default, temporary files are not kept.
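
The lookup order described for compairr_path can be sketched as follows. The helper resolve_compairr_path and its injectable which argument are hypothetical names introduced for this example; the encoder's own lookup logic may differ.

```python
import shutil
from pathlib import Path
from typing import Callable, Optional


def resolve_compairr_path(compairr_path: Optional[str] = None,
                          which: Callable[[str], Optional[str]] = shutil.which) -> Path:
    """Pick a CompAIRR executable (hypothetical helper, for illustration)."""
    # 1. An explicitly configured path takes precedence.
    if compairr_path is not None:
        return Path(compairr_path)

    # 2. Otherwise look for a 'compairr' command on the PATH.
    found = which("compairr")
    if found is not None:
        return Path(found)

    # 3. Fall back to the conventional install location.
    return Path("/usr/local/bin/compairr")
```

Passing which as an argument keeps the sketch testable on machines where CompAIRR is not installed.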

YAML specification:

my_sa_encoding:
    CompAIRRSequenceAbundance:
        compairr_path: optional/path/to/compairr
        p_value_threshold: 0.05
        ignore_genes: False
        threads: 8
LOG_FILENAME = 'compairr_log.txt'
OUTPUT_FILENAME = 'compairr_out.tsv'
RELEVANT_SEQUENCE_ABUNDANCE = 'relevant_sequence_abundance'
TOTAL_SEQUENCE_ABUNDANCE = 'total_sequence_abundance'
static build_object(dataset, **params)
encode(dataset, params: EncoderParams)
static export_encoder(path: Path, encoder) → Path
get_additional_files() → List[Path]

Should return a list with all the files that need to be stored when storing the encoder.

get_relevant_sequence_attributes()
get_sequence_set(repertoire_dataset)
get_sequence_set_for_repertoire(repertoire, sequence_attributes)
static load_encoder(encoder_file: Path)
set_context(context: dict)
store(encoded_dataset, params: EncoderParams)
write_sequence_set_file(sequence_set, filename, offset=0, region_type=RegionType.IMGT_JUNCTION)

immuneML.encodings.abundance_encoding.KmerAbundanceEncoder module

class immuneML.encodings.abundance_encoding.KmerAbundanceEncoder.KmerAbundanceEncoder(p_value_threshold: float, sequence_encoding: SequenceEncodingType, k: int, k_left: int, k_right: int, min_gap: int, max_gap: int, name: str = None)

Bases: DatasetEncoder

This encoder is related to the SequenceAbundanceEncoder, but identifies label-associated subsequences (k-mers) instead of full label-associated sequences.

This encoder represents the repertoires as vectors where:

  • the first element corresponds to the number of label-associated k-mers found in a repertoire

  • the second element is the total number of unique k-mers per repertoire

The label-associated k-mers are determined based on a one-sided Fisher’s exact test.

The encoder also writes out files containing the contingency table used for Fisher’s exact test, the resulting p-values, and the significantly abundant k-mers.

Note: to use this encoder, the positive class must be defined explicitly when defining the label in the instruction. With the positive class defined, it can then be determined which k-mers are indicative of the positive class. See Reproduction of the CMV status predictions study for an example using SequenceAbundanceEncoder.

Parameters:
  • p_value_threshold (float) – The p value threshold to be used by the statistical test.

  • sequence_encoding (SequenceEncodingType) – The type of k-mers that are used. The simplest (default) sequence_encoding is CONTINUOUS_KMER, which uses contiguous subsequences of length k to represent the k-mers. When gapped k-mers are used (GAPPED_KMER, IMGT_GAPPED_KMER), the k-mers may contain gaps with a size between min_gap and max_gap, and the k-mer length is defined as a combination of k_left and k_right. When IMGT k-mers are used (IMGT_CONTINUOUS_KMER, IMGT_GAPPED_KMER), IMGT positional information is taken into account (i.e. the same sequence in a different position is considered to be a different k-mer).

  • k (int) – Length of the k-mer (number of amino acids) when ungapped k-mers are used. The default value for k is 3.

  • k_left (int) – When gapped k-mers are used, k_left indicates the length of the k-mer left of the gap. The default value for k_left is 1.

  • k_right (int) – Same as k_left, but k_right determines the length of the k-mer right of the gap. The default value for k_right is 1.

  • min_gap (int) – Minimum gap size when gapped k-mers are used. The default value for min_gap is 0.

  • max_gap (int) – Maximum gap size when gapped k-mers are used. The default value for max_gap is 0.
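
To make the k-mer parameters above concrete, here is a minimal sketch of how continuous and gapped k-mers could be enumerated from one amino acid sequence. The function names and the use of '.' to mark gap positions are assumptions for illustration, not immuneML's internal representation, and IMGT positional information is not modelled.

```python
from typing import Set


def continuous_kmers(sequence: str, k: int = 3) -> Set[str]:
    """All contiguous subsequences of length k (empty if the sequence is shorter)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def gapped_kmers(sequence: str, k_left: int = 1, k_right: int = 1,
                 min_gap: int = 0, max_gap: int = 0) -> Set[str]:
    """k-mers made of k_left residues, a gap, then k_right residues.

    Every gap size from min_gap to max_gap (inclusive) is tried; skipped
    positions are written as '.' so that different gap sizes stay distinct.
    """
    kmers = set()
    for gap in range(min_gap, max_gap + 1):
        span = k_left + gap + k_right  # total window covered by this k-mer
        for start in range(len(sequence) - span + 1):
            left = sequence[start:start + k_left]
            right = sequence[start + k_left + gap:start + span]
            kmers.add(left + "." * gap + right)
    return kmers


unique_continuous = continuous_kmers("CASSL", k=3)
unique_gapped = gapped_kmers("CASS", k_left=1, k_right=1, min_gap=0, max_gap=1)
```

With the default min_gap and max_gap of 0, the gapped variant reduces to contiguous k-mers of length k_left + k_right.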

YAML specification:

my_sa_encoding:
    KmerAbundance:
        p_value_threshold: 0.05
        k: 3
RELEVANT_SEQUENCE_ABUNDANCE = 'relevant_sequence_abundance'
TOTAL_SEQUENCE_ABUNDANCE = 'total_sequence_abundance'
static build_object(dataset, **params)
encode(dataset, params: EncoderParams)
static export_encoder(path: Path, encoder) → Path
get_additional_files() → List[Path]

Should return a list with all the files that need to be stored when storing the encoder.

static load_encoder(encoder_file: Path)
set_context(context: dict)
store(encoded_dataset, params: EncoderParams)

immuneML.encodings.abundance_encoding.SequenceAbundanceEncoder module

class immuneML.encodings.abundance_encoding.SequenceAbundanceEncoder.SequenceAbundanceEncoder(comparison_attributes, p_value_threshold: float, sequence_batch_size: int, repertoire_batch_size: int, name: str = None)

Bases: DatasetEncoder

This encoder represents the repertoires as vectors where:

  • the first element corresponds to the number of label-associated clonotypes

  • the second element is the total number of unique clonotypes

To determine what clonotypes (with features defined by comparison_attributes) are label-associated, one-sided Fisher’s exact test is used.
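
The two-element representation can be sketched as below, assuming the set of label-associated clonotypes has already been determined. Repertoires are modelled as plain lists of dictionaries and the names clonotype_key and abundance_vector are hypothetical; the sketch only illustrates how comparison_attributes define a clonotype and how the two counts relate, not how the encoder is implemented.

```python
from typing import List, Set, Tuple


def clonotype_key(receptor: dict, comparison_attributes: List[str]) -> Tuple:
    """Reduce a receptor to the fields that define its clonotype."""
    return tuple(receptor[attribute] for attribute in comparison_attributes)


def abundance_vector(repertoire: List[dict], relevant_clonotypes: Set[Tuple],
                     comparison_attributes: List[str]) -> List[int]:
    """Return [label-associated clonotypes present, unique clonotypes]."""
    unique = {clonotype_key(r, comparison_attributes) for r in repertoire}
    return [len(unique & relevant_clonotypes), len(unique)]


attributes = ["sequence_aas", "v_genes", "j_genes"]
repertoire = [
    {"sequence_aas": "CASSL", "v_genes": "TRBV1", "j_genes": "TRBJ1", "counts": 5},
    # Same clonotype as above: fields outside comparison_attributes are ignored.
    {"sequence_aas": "CASSL", "v_genes": "TRBV1", "j_genes": "TRBJ1", "counts": 2},
    # Same amino acid sequence but a different V gene: a different clonotype.
    {"sequence_aas": "CASSL", "v_genes": "TRBV2", "j_genes": "TRBJ1", "counts": 1},
    {"sequence_aas": "CASRG", "v_genes": "TRBV1", "j_genes": "TRBJ2", "counts": 1},
]
relevant = {("CASSL", "TRBV1", "TRBJ1")}  # assumed output of the statistical test

vector = abundance_vector(repertoire, relevant, attributes)
```

Here the first two receptors collapse into one clonotype because they agree on every compared attribute, while the third differs in its V gene and is counted separately.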

The encoder also writes out files containing the contingency table used for Fisher’s exact test, the resulting p-values, and the significantly abundant sequences (use RelevantSequenceExporter to export these sequences in AIRR format).

Reference: Emerson, Ryan O. et al. ‘Immunosequencing Identifies Signatures of Cytomegalovirus Exposure History and HLA-Mediated Effects on the T Cell Repertoire’. Nature Genetics 49, no. 5 (May 2017): 659–65. doi.org/10.1038/ng.3822.

Note: to use this encoder, the positive class must be defined explicitly when defining the label in the instruction. With the positive class defined, it can then be determined which sequences are indicative of the positive class. For a full example of using this encoder, see Reproduction of the CMV status predictions study.

Parameters:
  • comparison_attributes (list) – The attributes to be considered to group receptors into clonotypes. Only the fields specified in comparison_attributes will be considered; all other fields are ignored. A valid comparison value can be any repertoire field name.

  • p_value_threshold (float) – The p value threshold to be used by the statistical test.

  • sequence_batch_size (int) – The number of sequences in a batch when comparing sequences across repertoires, typically 100s of thousands. This does not affect the results of the encoding, only the speed. The default value is 1,000,000.

  • repertoire_batch_size (int) – How many repertoires will be loaded at once. This does not affect the result of the encoding, only the speed. This value is a trade-off between the number of repertoires that can fit in RAM at a time and the loading time from disk.
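
The statement that batch sizes change only speed and memory use, not the result, can be illustrated with a small sketch. The helpers batched and count_shared are hypothetical and stand in for the much larger comparison the encoder performs across repertoires.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Set


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_shared(sequences: Iterable[str], repertoire: Set[str],
                 sequence_batch_size: int) -> int:
    """Count sequences found in the repertoire, one batch at a time."""
    total = 0
    for batch in batched(sequences, sequence_batch_size):
        # Only one batch is held in memory here; the sum is the same
        # regardless of how the sequences were split.
        total += sum(1 for sequence in batch if sequence in repertoire)
    return total


all_sequences = ["CASSL", "CASRG", "CASSP", "CAWSV", "CASTD"]
one_repertoire = {"CASSL", "CAWSV"}
counts = [count_shared(all_sequences, one_repertoire, size) for size in (1, 2, 1000000)]
```

Smaller batches lower peak memory at the cost of more iterations, which is the trade-off the two batch size parameters expose.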

YAML specification:

my_sa_encoding:
    SequenceAbundance:
        comparison_attributes:
            - sequence_aas
            - v_genes
            - j_genes
            - chains
            - region_types
        p_value_threshold: 0.05
        sequence_batch_size: 100000
        repertoire_batch_size: 32
RELEVANT_SEQUENCE_ABUNDANCE = 'relevant_sequence_abundance'
TOTAL_SEQUENCE_ABUNDANCE = 'total_sequence_abundance'
static build_object(dataset, **params)
encode(dataset, params: EncoderParams)
static export_encoder(path: Path, encoder) → Path
get_additional_files() → List[Path]

Should return a list with all the files that need to be stored when storing the encoder.

static get_documentation()
static load_encoder(encoder_file: Path)
set_context(context: dict)
store(encoded_dataset, params: EncoderParams)

Module contents