immuneML.encodings.filtered_sequence_encoding package
Submodules
immuneML.encodings.filtered_sequence_encoding.CompAIRRSequenceAbundanceEncoder module
- class immuneML.encodings.filtered_sequence_encoding.CompAIRRSequenceAbundanceEncoder.CompAIRRSequenceAbundanceEncoder(p_value_threshold: float, compairr_path: str, sequence_batch_size: int, ignore_genes: bool, threads: int, name: Optional[str] = None)[source]
Bases: immuneML.encodings.DatasetEncoder.DatasetEncoder
This encoder works similarly to the SequenceAbundanceEncoder, but internally uses CompAIRR to accelerate core computations.
This encoder represents the repertoires as vectors where:
- the first element corresponds to the number of label-associated clonotypes
- the second element is the total number of unique clonotypes
Which clonotypes (with or without matching V/J genes) are label-associated is determined by a statistical test. The statistical test used is Fisher’s exact test (one-sided).
Reference: Emerson, Ryan O. et al. ‘Immunosequencing Identifies Signatures of Cytomegalovirus Exposure History and HLA-Mediated Effects on the T Cell Repertoire’. Nature Genetics 49, no. 5 (May 2017): 659–65. doi.org/10.1038/ng.3822.
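The encoding described above can be illustrated with a minimal sketch. It is not immuneML's implementation: the repertoires, clonotype names, the helper fisher_exact_greater, and the threshold of 0.1 are all made up for illustration, and the strict comparison against the threshold is an assumption.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a: positive repertoires containing the clonotype
    b: negative repertoires containing the clonotype
    c: positive repertoires without it, d: negative repertoires without it
    Returns the probability of seeing a or more positive repertoires
    with the clonotype if presence were independent of the label.
    """
    with_clonotype = a + b
    positives = a + c
    n = a + b + c + d
    tail = sum(
        comb(positives, k) * comb(n - positives, with_clonotype - k)
        for k in range(a, min(with_clonotype, positives) + 1)
    )
    return tail / comb(n, with_clonotype)


# Toy data (hypothetical): repertoire name -> (set of clonotypes, is positive class)
repertoires = {
    "r1": ({"CASSA", "CASSB"}, True),
    "r2": ({"CASSA", "CASSB", "CASSD"}, True),
    "r3": ({"CASSA", "CASSB"}, True),
    "r4": ({"CASSB", "CASSC"}, False),
    "r5": ({"CASSB", "CASSC"}, False),
    "r6": ({"CASSB"}, False),
}
P_VALUE_THRESHOLD = 0.1  # illustrative value only

n_pos = sum(is_pos for _, is_pos in repertoires.values())
n_neg = len(repertoires) - n_pos
all_clonotypes = set().union(*(seqs for seqs, _ in repertoires.values()))

# Keep the clonotypes whose presence is associated with the positive class.
relevant = set()
for clonotype in all_clonotypes:
    a = sum(clonotype in seqs for seqs, is_pos in repertoires.values() if is_pos)
    b = sum(clonotype in seqs for seqs, is_pos in repertoires.values() if not is_pos)
    p_value = fisher_exact_greater(a, b, n_pos - a, n_neg - b)
    if p_value < P_VALUE_THRESHOLD:  # assumption: strict inequality
        relevant.add(clonotype)

# Each repertoire becomes [label-associated clonotypes, total unique clonotypes].
encoded = {name: [len(seqs & relevant), len(seqs)] for name, (seqs, _) in repertoires.items()}
print(encoded["r2"])  # [1, 3]
```

In this toy dataset only CASSA, which is present in all three positive repertoires and in no negative one, passes the test (p = 0.05).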
Note: to use this encoder, the positive class must be defined explicitly when the label is defined in the instruction. With the positive class defined, it can then be determined which sequences are indicative of the positive class. See Reproduction of the CMV status predictions study for an example using SequenceAbundanceEncoder.
- Parameters
p_value_threshold (float) – The p value threshold to be used by the statistical test.
compairr_path (Path) – optional path to the CompAIRR executable. If not given, it is assumed that CompAIRR has been installed such that it can be called directly on the command line with the command 'compairr', or that it is located at /usr/local/bin/compairr.
ignore_genes (bool) – Whether to ignore V and J gene information. If False, the V and J genes between two receptor chains have to match. If True, gene information is ignored. By default, ignore_genes is False.
sequence_batch_size (int) – The number of sequences in a batch when comparing sequences across repertoires, typically 100s of thousands. This does not affect the results of the encoding, only the speed and memory usage.
threads (int) – The number of threads to use for parallelization. This does not affect the results of the encoding, only the speed. The default number of threads is 8.
YAML specification:
my_sa_encoding:
    CompAIRRSequenceAbundance:
        compairr_path: optional/path/to/compairr
        p_value_threshold: 0.05
        ignore_genes: False
        threads: 8
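The lookup order described for compairr_path (an explicit path, then the 'compairr' command, then /usr/local/bin/compairr) could be sketched as follows. The function resolve_compairr_path is hypothetical and written only for illustration; it is not part of immuneML and may differ from how the encoder actually locates the executable.

```python
import shutil
from pathlib import Path
from typing import Optional

# Fallback location named in the parameter description.
DEFAULT_COMPAIRR_LOCATION = Path("/usr/local/bin/compairr")


def resolve_compairr_path(compairr_path: Optional[str] = None) -> Path:
    """Hypothetical helper: pick the CompAIRR executable to call."""
    if compairr_path is not None:
        # An explicitly configured path takes precedence.
        return Path(compairr_path)
    # Otherwise look for a 'compairr' command on the PATH.
    on_path = shutil.which("compairr")
    if on_path is not None:
        return Path(on_path)
    # Last resort: the conventional install location.
    return DEFAULT_COMPAIRR_LOCATION


print(resolve_compairr_path("optional/path/to/compairr"))
print(resolve_compairr_path())
```

This sketch does not check that the returned file exists or is executable; a real check would have to happen before CompAIRR is invoked.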
- LOG_FILENAME = 'compairr_log.txt'
- OUTPUT_FILENAME = 'compairr_out.tsv'
- RELEVANT_SEQUENCE_ABUNDANCE = 'relevant_sequence_abundance'
- TOTAL_SEQUENCE_ABUNDANCE = 'total_sequence_abundance'
- encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams)[source]
- store(encoded_dataset, params: immuneML.encodings.EncoderParams.EncoderParams)[source]
immuneML.encodings.filtered_sequence_encoding.SequenceAbundanceEncoder module
- class immuneML.encodings.filtered_sequence_encoding.SequenceAbundanceEncoder.SequenceAbundanceEncoder(comparison_attributes, p_value_threshold: float, sequence_batch_size: int, repertoire_batch_size: int, name: Optional[str] = None)[source]
Bases: immuneML.encodings.DatasetEncoder.DatasetEncoder
This encoder represents the repertoires as vectors where:
- the first element corresponds to the number of label-associated clonotypes
- the second element is the total number of unique clonotypes
Which clonotypes (with features defined by comparison_attributes) are label-associated is determined by a statistical test. The statistical test used is Fisher’s exact test (one-sided).
Reference: Emerson, Ryan O. et al. ‘Immunosequencing Identifies Signatures of Cytomegalovirus Exposure History and HLA-Mediated Effects on the T Cell Repertoire’. Nature Genetics 49, no. 5 (May 2017): 659–65. doi.org/10.1038/ng.3822.
Note: to use this encoder, the positive class must be defined explicitly when the label is defined in the instruction. With the positive class defined, it can then be determined which sequences are indicative of the positive class. For a full example of using this encoder, see Reproduction of the CMV status predictions study.
- Parameters
comparison_attributes (list) – The attributes to be considered to group receptors into clonotypes. Only the fields specified in comparison_attributes will be considered; all other fields are ignored. A valid comparison value can be any repertoire field name.
p_value_threshold (float) – The p value threshold to be used by the statistical test.
sequence_batch_size (int) – The number of sequences in a batch when comparing sequences across repertoires, typically 100s of thousands. This does not affect the results of the encoding, only the speed.
repertoire_batch_size (int) – How many repertoires will be loaded at once. This does not affect the result of the encoding, only the speed. This value is a trade-off between the number of repertoires that can fit in RAM at one time and the loading time from disk.
YAML specification:
my_sa_encoding:
    SequenceAbundance:
        comparison_attributes:
            - sequence_aas
            - v_genes
            - j_genes
            - chains
            - region_types
        p_value_threshold: 0.05
        sequence_batch_size: 100000
        repertoire_batch_size: 32
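The effect of comparison_attributes can be illustrated with a small sketch. Receptors that agree on every listed field fall into the same clonotype, and all other fields are ignored. The function group_into_clonotypes and the records below are hypothetical toy data for illustration, not immuneML code. The field names sequence_aas, v_genes and j_genes are taken from the YAML example; counts stands in for any field that is not compared.

```python
from collections import defaultdict


def group_into_clonotypes(receptors, comparison_attributes):
    """Group receptor records by the values of the selected fields only."""
    clonotypes = defaultdict(list)
    for receptor in receptors:
        # The key uses only the compared fields; everything else is ignored.
        key = tuple(receptor[attr] for attr in comparison_attributes)
        clonotypes[key].append(receptor)
    return dict(clonotypes)


# Toy records: same amino acid sequence, two different V genes.
receptors = [
    {"sequence_aas": "CASSLGTDTQYF", "v_genes": "TRBV7-9", "j_genes": "TRBJ2-3", "counts": 12},
    {"sequence_aas": "CASSLGTDTQYF", "v_genes": "TRBV7-9", "j_genes": "TRBJ2-3", "counts": 3},
    {"sequence_aas": "CASSLGTDTQYF", "v_genes": "TRBV5-1", "j_genes": "TRBJ2-3", "counts": 7},
]

by_sequence = group_into_clonotypes(receptors, ["sequence_aas"])
by_sequence_and_genes = group_into_clonotypes(
    receptors, ["sequence_aas", "v_genes", "j_genes"]
)

print(len(by_sequence))            # 1
print(len(by_sequence_and_genes))  # 2
```

Adding more attributes makes the clonotype definition stricter, so the same data yields more distinct clonotypes, each observed in fewer repertoires.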
- RELEVANT_SEQUENCE_ABUNDANCE = 'relevant_sequence_abundance'
- TOTAL_SEQUENCE_ABUNDANCE = 'total_sequence_abundance'
- encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams)[source]
- store(encoded_dataset, params: immuneML.encodings.EncoderParams.EncoderParams)[source]
immuneML.encodings.filtered_sequence_encoding.SequenceFilterHelper module
- class immuneML.encodings.filtered_sequence_encoding.SequenceFilterHelper.SequenceFilterHelper[source]
Bases: object
- INVALID_P_VALUE = 2
- static build_comparison_data(dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, context: dict, comparison_attributes: list, params: immuneML.encodings.EncoderParams.EncoderParams, sequence_batch_size: int)[source]
- static filter_sequences(dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, comparison_data: immuneML.pairwise_repertoire_comparison.ComparisonData.ComparisonData, label: immuneML.environment.Label.Label, p_value_threshold: float)[source]
- static find_label_associated_sequence_p_values(comparison_data: immuneML.pairwise_repertoire_comparison.ComparisonData.ComparisonData, repertoires: List[immuneML.data_model.repertoire.Repertoire.Repertoire], label: immuneML.environment.Label.Label)[source]
- static get_relevant_sequences(dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, params: immuneML.encodings.EncoderParams.EncoderParams, comparison_data: immuneML.pairwise_repertoire_comparison.ComparisonData.ComparisonData, label_name: str, p_value_threshold, comparison_attributes: list, sequence_indices_path: pathlib.Path)[source]