immuneML.encodings.filtered_sequence_encoding package

Submodules

immuneML.encodings.filtered_sequence_encoding.SequenceAbundanceEncoder module

class immuneML.encodings.filtered_sequence_encoding.SequenceAbundanceEncoder.SequenceAbundanceEncoder(comparison_attributes, p_value_threshold: float, sequence_batch_size: int, repertoire_batch_size: int, name: Optional[str] = None)

Bases: immuneML.encodings.DatasetEncoder.DatasetEncoder

This encoder represents the repertoires as vectors where:

  • the first element corresponds to the number of label-associated clonotypes

  • the second element is the total number of unique clonotypes

Which clonotypes (with features defined by comparison_attributes) are label-associated is determined by a statistical test. The test used is the one-sided Fisher’s exact test.

Reference: Emerson, Ryan O. et al. ‘Immunosequencing Identifies Signatures of Cytomegalovirus Exposure History and HLA-Mediated Effects on the T Cell Repertoire’. Nature Genetics 49, no. 5 (May 2017): 659–65. doi.org/10.1038/ng.3822.
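The one-sided Fisher’s exact test mentioned above can be illustrated with a minimal standard-library sketch. This is not immuneML's implementation. It assumes a 2x2 table per clonotype that counts repertoires by class and by presence of the clonotype, and that the one-sided alternative is enrichment in the positive class. The function name fisher_exact_greater is hypothetical.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout (illustrative, not taken from immuneML):
      a: positive-class repertoires that contain the clonotype
      b: positive-class repertoires that lack it
      c: negative-class repertoires that contain it
      d: negative-class repertoires that lack it
    """
    positives = a + b            # number of positive-class repertoires
    with_clonotype = a + c       # number of repertoires containing the clonotype
    total = a + b + c + d
    # Sum the hypergeometric probabilities of every table whose top-left
    # cell is at least as large as the observed one (same margins).
    upper = min(positives, with_clonotype)
    numerator = sum(
        comb(positives, k) * comb(total - positives, with_clonotype - k)
        for k in range(a, upper + 1)
    )
    return numerator / comb(total, with_clonotype)


# A clonotype seen in 3 of 3 positive and 0 of 3 negative repertoires
p_value = fisher_exact_greater(3, 0, 0, 3)
is_label_associated = p_value <= 0.05  # compared against p_value_threshold
```

Under these assumptions, a clonotype would be treated as label-associated when its p-value does not exceed p_value_threshold.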

Note: to use this encoder, the positive class for the label must be defined explicitly when the label is specified in the instruction. With the positive class defined, it can then be determined which sequences are indicative of the positive class. For a full example of using this encoder, see Reproduction of the CMV status predictions study.
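The two-element representation described for this encoder can be sketched in a few lines. The sketch assumes that clonotypes are tuples of the compared attribute values, that each clonotype is counted once per repertoire, and that the set of label-associated clonotypes has already been selected by the statistical test. The function name encode_repertoire and the example values are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# A clonotype is represented here as a tuple of attribute values,
# e.g. (sequence_aa, v_gene); this representation is an assumption.
Clonotype = Tuple[str, ...]


def encode_repertoire(clonotypes: Iterable[Clonotype],
                      label_associated: Set[Clonotype]) -> List[int]:
    """Return [label-associated clonotype count, unique clonotype count]."""
    unique = set(clonotypes)  # duplicates within a repertoire collapse
    return [len(unique & label_associated), len(unique)]


repertoire = [("CASS", "V1"), ("CASS", "V1"), ("CAST", "V2"), ("CASR", "V3")]
associated = {("CASS", "V1"), ("CAAA", "V9")}
vector = encode_repertoire(repertoire, associated)  # [1, 3]
```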

Parameters
  • comparison_attributes (list) – The attributes to be considered to group receptors into clonotypes. Only the fields specified in comparison_attributes will be considered; all other fields are ignored. A valid comparison value can be any repertoire field name.

  • p_value_threshold (float) – The p value threshold to be used by the statistical test.

  • sequence_batch_size (int) – The number of sequences in a batch when comparing sequences across repertoires, typically 100s of thousands. This does not affect the results of the encoding, only the speed.

  • repertoire_batch_size (int) – How many repertoires will be loaded at once. This does not affect the result of the encoding, only the speed. This value is a trade-off between the number of repertoires that can fit in RAM at one time and the loading time from disk.
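The statement that batch sizes affect only speed, not results, can be illustrated with a small sketch. It is a simplified stand-in rather than immuneML's comparison code: repertoires are modelled as plain sets of sequences, and the function name count_presence is hypothetical.

```python
from typing import Dict, List, Set


def count_presence(sequences: List[str],
                   repertoires: List[Set[str]],
                   sequence_batch_size: int) -> Dict[str, int]:
    """Count, for each sequence, how many repertoires contain it."""
    counts: Dict[str, int] = {}
    # Walk over the sequences one batch at a time; the batch boundaries
    # only change how much work is done per step, not the final counts.
    for start in range(0, len(sequences), sequence_batch_size):
        batch = sequences[start:start + sequence_batch_size]
        for seq in batch:
            counts[seq] = sum(seq in rep for rep in repertoires)
    return counts


sequences = ["CASS", "CAST", "CASR", "CASQ", "CASP"]
repertoires = [{"CASS", "CAST"}, {"CASS", "CASR"}, {"CASS"}]

small_batches = count_presence(sequences, repertoires, sequence_batch_size=2)
one_batch = count_presence(sequences, repertoires, sequence_batch_size=100000)
same_result = small_batches == one_batch
```

In this sketch a smaller batch lowers the amount of data handled per step at the cost of more iterations, which mirrors the memory versus speed trade-off described for both batch-size parameters.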

YAML specification:

my_sa_encoding:
    SequenceAbundance:
        comparison_attributes:
            - sequence_aas
            - v_genes
            - j_genes
            - chains
            - region_types
        p_value_threshold: 0.05
        sequence_batch_size: 100000
        repertoire_batch_size: 32

RELEVANT_SEQUENCE_ABUNDANCE = 'relevant_sequence_abundance'
TOTAL_SEQUENCE_ABUNDANCE = 'total_sequence_abundance'
static build_object(dataset, **params)
encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams)
static export_encoder(path: pathlib.Path, encoder) → pathlib.Path
get_additional_files() → List[pathlib.Path]
static get_documentation()
static load_encoder(encoder_file: pathlib.Path)
set_context(context: dict)
store(encoded_dataset, params: immuneML.encodings.EncoderParams.EncoderParams)

immuneML.encodings.filtered_sequence_encoding.SequenceFilterHelper module

class immuneML.encodings.filtered_sequence_encoding.SequenceFilterHelper.SequenceFilterHelper

Bases: object

INVALID_P_VALUE = 2
static build_comparison_data(dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, context: dict, comparison_attributes: list, params: immuneML.encodings.EncoderParams.EncoderParams, sequence_batch_size: int)
static filter_sequences(dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, comparison_data: immuneML.pairwise_repertoire_comparison.ComparisonData.ComparisonData, label: immuneML.environment.Label.Label, p_value_threshold: float)
static find_label_associated_sequence_p_values(comparison_data: immuneML.pairwise_repertoire_comparison.ComparisonData.ComparisonData, repertoires: List[immuneML.data_model.repertoire.Repertoire.Repertoire], label: immuneML.environment.Label.Label)
static get_relevant_sequences(dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, params: immuneML.encodings.EncoderParams.EncoderParams, comparison_data: immuneML.pairwise_repertoire_comparison.ComparisonData.ComparisonData, label: str, p_value_threshold, comparison_attributes: list, sequence_indices_path: pathlib.Path)

Module contents