immuneML.encodings.filtered_sequence_encoding package
Submodules
immuneML.encodings.filtered_sequence_encoding.SequenceAbundanceEncoder module
class immuneML.encodings.filtered_sequence_encoding.SequenceAbundanceEncoder.SequenceAbundanceEncoder(comparison_attributes, p_value_threshold: float, sequence_batch_size: int, repertoire_batch_size: int, name: Optional[str] = None)
Bases: immuneML.encodings.DatasetEncoder.DatasetEncoder
This encoder represents the repertoires as vectors where:
the first element corresponds to the number of label-associated clonotypes
the second element is the total number of unique clonotypes
A statistical test determines which clonotypes (with features defined by comparison_attributes) are label-associated. The test used is a one-sided Fisher’s exact test.
Reference: Emerson, Ryan O. et al. ‘Immunosequencing Identifies Signatures of Cytomegalovirus Exposure History and HLA-Mediated Effects on the T Cell Repertoire’. Nature Genetics 49, no. 5 (May 2017): 659–65. doi.org/10.1038/ng.3822.
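The one-sided Fisher’s exact test mentioned above can be illustrated with a minimal, standard-library-only sketch. It is not immuneML's implementation: the function name, the argument names, and the choice of alternative hypothesis (the clonotype is enriched in repertoires carrying the positive label) are assumptions made for illustration.

```python
from math import comb


def fisher_exact_one_sided(pos_with, pos_without, neg_with, neg_without):
    """One-sided Fisher's exact test on a 2x2 table of repertoire counts.

    Hypothetical helper, not part of immuneML. Arguments count repertoires:
      pos_with / pos_without: positive-label repertoires with / without the clonotype
      neg_with / neg_without: negative-label repertoires with / without the clonotype
    Returns the probability of seeing at least `pos_with` positive repertoires
    among those containing the clonotype, under the hypergeometric null.
    """
    n_with = pos_with + neg_with        # repertoires containing the clonotype
    n_pos = pos_with + pos_without      # repertoires with the positive label
    n_neg = neg_with + neg_without      # repertoires with the negative label
    n_total = n_pos + n_neg

    denominator = comb(n_total, n_with)
    upper = min(n_with, n_pos)
    # Sum the hypergeometric probabilities of all tables at least as extreme.
    numerator = sum(
        comb(n_pos, k) * comb(n_neg, n_with - k)
        for k in range(pos_with, upper + 1)
    )
    return numerator / denominator


# A clonotype present in all 3 positive and none of the 3 negative repertoires.
p_value = fisher_exact_one_sided(3, 0, 0, 3)
is_label_associated = p_value <= 0.05  # compare against p_value_threshold
```

With only six repertoires, even a perfectly separating clonotype reaches just p = 1/20, which is why a test like this generally needs cohorts of many repertoires to flag anything at stricter thresholds. Whether the comparison against the threshold is strict or inclusive in immuneML is not stated here; the inclusive comparison above is an assumption.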
Parameters
comparison_attributes (list) – The attributes used to group receptors into clonotypes. Only the fields specified in comparison_attributes are considered; all other fields are ignored. Any repertoire field name is a valid comparison attribute.
p_value_threshold (float) – The p-value threshold to be used by the statistical test.
sequence_batch_size (int) – The number of sequences in a batch when comparing sequences across repertoires, typically hundreds of thousands. This does not affect the results of the encoding, only the speed.
repertoire_batch_size (int) – How many repertoires are loaded at once. This does not affect the result of the encoding, only the speed. The value is a trade-off between the number of repertoires that fit in RAM at one time and the loading time from disk.
YAML specification:
    my_sa_encoding:
        SequenceAbundance:
            comparison_attributes:
                - sequence_aas
                - v_genes
                - j_genes
                - chains
                - region_types
            p_value_threshold: 0.05
            sequence_batch_size: 100000
            repertoire_batch_size: 32
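As a rough illustration of the two-element vector described above, the sketch below encodes one repertoire given an already determined set of label-associated clonotypes. The function name and the representation of a clonotype as a tuple of comparison attribute values are assumptions for this example, not immuneML's data model.

```python
def encode_repertoire(clonotypes, label_associated):
    """Return [label-associated clonotype count, unique clonotype count].

    Hypothetical helper, not part of immuneML.
    clonotypes: iterable of tuples, one value per comparison attribute
    label_associated: set of tuples that passed the statistical test
    """
    unique = set(clonotypes)  # duplicates within a repertoire count once
    return [len(unique & label_associated), len(unique)]


# Clonotypes keyed on (sequence_aas, v_genes, j_genes) for brevity.
repertoire = [
    ("CASSLGTDTQYF", "TRBV5-1", "TRBJ2-3"),
    ("CASSPGQGNEQFF", "TRBV7-9", "TRBJ2-1"),
    ("CASSLGTDTQYF", "TRBV5-1", "TRBJ2-3"),  # duplicate of the first entry
    ("CASRDRGYEQYF", "TRBV6-5", "TRBJ2-7"),
]
label_associated = {("CASSLGTDTQYF", "TRBV5-1", "TRBJ2-3")}

vector = encode_repertoire(repertoire, label_associated)
```

Because clonotypes are compared as whole tuples, two receptors that share an amino acid sequence but differ in any listed attribute count as different clonotypes; this mirrors the statement that only the fields in comparison_attributes are considered.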
RELEVANT_SEQUENCE_ABUNDANCE = 'relevant_sequence_abundance'
TOTAL_SEQUENCE_ABUNDANCE = 'total_sequence_abundance'
encode(dataset, params: immuneML.encodings.EncoderParams.EncoderParams)
store(encoded_dataset, params: immuneML.encodings.EncoderParams.EncoderParams)
immuneML.encodings.filtered_sequence_encoding.SequenceFilterHelper module
class immuneML.encodings.filtered_sequence_encoding.SequenceFilterHelper.SequenceFilterHelper
Bases: object
INVALID_P_VALUE = 2
static build_comparison_data(dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, context: dict, comparison_attributes: list, params: immuneML.encodings.EncoderParams.EncoderParams, sequence_batch_size: int)
static filter_sequences(dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, comparison_data: immuneML.pairwise_repertoire_comparison.ComparisonData.ComparisonData, label: immuneML.environment.Label.Label, p_value_threshold: float)
static find_label_associated_sequence_p_values(comparison_data: immuneML.pairwise_repertoire_comparison.ComparisonData.ComparisonData, repertoires: List[immuneML.data_model.repertoire.Repertoire.Repertoire], label: immuneML.environment.Label.Label)
static get_relevant_sequences(dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, params: immuneML.encodings.EncoderParams.EncoderParams, comparison_data: immuneML.pairwise_repertoire_comparison.ComparisonData.ComparisonData, label: str, p_value_threshold, comparison_attributes: list, sequence_indices_path: pathlib.Path)