immuneML.preprocessing.filters package
Submodules
immuneML.preprocessing.filters.ChainRepertoireFilter module
class immuneML.preprocessing.filters.ChainRepertoireFilter.ChainRepertoireFilter(keep_chain: immuneML.data_model.receptor.receptor_sequence.Chain.Chain)
Bases: immuneML.preprocessing.filters.Filter.Filter
Removes all repertoires from the RepertoireDataset that contain at least one sequence whose chain differs from the keep_chain parameter. Note that this filter removes whole repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets.
- Parameters
keep_chain (Chain) – Which chain should be kept.
YAML specification:
preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            ChainRepertoireFilter:
                keep_chain: TRB
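The repertoire-level behaviour of this filter can be illustrated with a minimal sketch in plain Python. It does not use the immuneML API: each repertoire is modelled as a list of chain labels, and `keep_repertoires_with_chain` is a hypothetical helper introduced only for illustration.

```python
from typing import Dict, List


def keep_repertoires_with_chain(repertoires: Dict[str, List[str]],
                                keep_chain: str) -> Dict[str, List[str]]:
    """Keep only repertoires in which every sequence has chain `keep_chain`.

    Hypothetical stand-in for the filter's logic: each repertoire is
    represented as a list of chain labels, one label per sequence.
    """
    return {
        name: chains
        for name, chains in repertoires.items()
        # a single sequence with another chain removes the whole repertoire
        if all(chain == keep_chain for chain in chains)
    }


repertoires = {
    "rep_1": ["TRB", "TRB", "TRB"],
    "rep_2": ["TRB", "TRA"],  # contains a TRA sequence, so it is dropped
}
filtered = keep_repertoires_with_chain(repertoires, keep_chain="TRB")
print(sorted(filtered))
```

In this sketch only `rep_1` is retained, because `rep_2` contains one sequence with a different chain.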
static process(dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, params: dict) → immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset
process_dataset(dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, result_path: Optional[pathlib.Path] = None)
immuneML.preprocessing.filters.ClonesPerRepertoireFilter module
class immuneML.preprocessing.filters.ClonesPerRepertoireFilter.ClonesPerRepertoireFilter(lower_limit: int = -1, upper_limit: int = -1)
Bases: immuneML.preprocessing.filters.Filter.Filter
Removes all repertoires from the RepertoireDataset that contain fewer clonotypes than lower_limit or more clonotypes than upper_limit. Note that this filter removes whole repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets.
- Parameters
lower_limit (int) – The inclusive lower limit for the number of clonotypes allowed in a repertoire.
upper_limit (int) – The inclusive upper limit for the number of clonotypes allowed in a repertoire.
When no lower or upper limit is specified, or the value -1 is specified, the limit is ignored.
YAML specification:
preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            ClonesPerRepertoireFilter:
                lower_limit: 100
                upper_limit: 100000
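How the two inclusive limits and the -1 sentinel interact can be sketched as follows. This is an illustration in plain Python under the assumptions stated in the parameter descriptions; `within_clone_limits` is a hypothetical helper, not part of immuneML.

```python
def within_clone_limits(clone_count: int, lower_limit: int = -1, upper_limit: int = -1) -> bool:
    """Return True if a repertoire with `clone_count` clonotypes would be kept.

    Hypothetical helper: both limits are inclusive, and -1 disables a limit.
    """
    if lower_limit != -1 and clone_count < lower_limit:
        return False
    if upper_limit != -1 and clone_count > upper_limit:
        return False
    return True


# clonotype counts per repertoire (made-up values for illustration)
clone_counts = {"rep_1": 50, "rep_2": 100, "rep_3": 100000, "rep_4": 250000}
kept = [
    name
    for name, count in clone_counts.items()
    if within_clone_limits(count, lower_limit=100, upper_limit=100000)
]
print(kept)
```

Repertoires whose clonotype count equals a limit exactly are kept, since both limits are inclusive.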
static process(dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, params: dict) → immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset
process_dataset(dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, result_path: Optional[pathlib.Path] = None)
immuneML.preprocessing.filters.CountAggregationFunction module
immuneML.preprocessing.filters.CountPerSequenceFilter module
class immuneML.preprocessing.filters.CountPerSequenceFilter.CountPerSequenceFilter(low_count_limit: int, remove_without_count: bool, remove_empty_repertoires: bool, batch_size: int)
Bases: immuneML.preprocessing.filters.Filter.Filter
Removes all sequences from a Repertoire that have a count below low_count_limit, as well as sequences with no count value if remove_without_count is True. This filter can be applied to Repertoires and RepertoireDatasets.
- Parameters
low_count_limit (int) – The minimal count (inclusive) a sequence must have in order to be retained.
remove_without_count (bool) – Whether the sequences without a reported count value should be removed.
remove_empty_repertoires (bool) – Whether repertoires without sequences should be removed. Only has an effect when remove_without_count is also set to True.
batch_size (int) – Number of repertoires that can be loaded at the same time (only affects the speed when applying this filter on a RepertoireDataset).
YAML specification:
preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            CountPerSequenceFilter:
                remove_without_count: True
                remove_empty_repertoires: True
                low_count_limit: 3
                batch_size: 4
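The per-sequence decision can be sketched in plain Python as below. The sketch covers only the count check within a single repertoire and does not model remove_empty_repertoires or batching; the record layout and the `filter_by_count` helper are assumptions made for illustration, not the immuneML implementation.

```python
from typing import List, Optional, Tuple

# Hypothetical record layout: (sequence, count), where count may be missing
SequenceRecord = Tuple[str, Optional[int]]


def filter_by_count(sequences: List[SequenceRecord], low_count_limit: int,
                    remove_without_count: bool) -> List[SequenceRecord]:
    """Keep sequences whose count is at least `low_count_limit` (inclusive)."""
    kept = []
    for sequence, count in sequences:
        if count is None:
            # sequences without a count are kept unless removal is requested
            if not remove_without_count:
                kept.append((sequence, count))
        elif count >= low_count_limit:
            kept.append((sequence, count))
    return kept


records = [("CASSL", 5), ("CASSF", 2), ("CASRG", None), ("CASSP", 3)]
strict = filter_by_count(records, low_count_limit=3, remove_without_count=True)
lenient = filter_by_count(records, low_count_limit=3, remove_without_count=False)
print(strict)
print(lenient)
```

With remove_without_count set to False, the record without a count survives the filter; the record with count 2 is removed in both cases.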
static process(dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, params: dict) → immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset
process_dataset(dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, result_path: pathlib.Path) → immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset
static process_repertoire(repertoire: immuneML.data_model.repertoire.Repertoire.Repertoire, params: dict) → immuneML.data_model.repertoire.Repertoire.Repertoire
immuneML.preprocessing.filters.DuplicateSequenceFilter module
class immuneML.preprocessing.filters.DuplicateSequenceFilter.DuplicateSequenceFilter(filter_sequence_type: immuneML.environment.SequenceType.SequenceType, batch_size: int, count_agg: immuneML.preprocessing.filters.CountAggregationFunction.CountAggregationFunction)
Bases: immuneML.preprocessing.filters.Filter.Filter
Collapses duplicate nucleotide or amino acid sequences within each repertoire in the given RepertoireDataset. This filter can be applied to Repertoires and RepertoireDatasets.
Sequences are considered duplicates if the following fields are identical:
- amino acid or nucleotide sequence (whichever is specified)
- v and j genes (note that the full field including subgroup + gene is used for matching, i.e. V1 and V1-1 are not considered duplicates)
- chain
- region type
For all other fields (the non-specified sequence type, custom lists, sequence identifier) only the first occurring value is kept.
Note that this means the count value of a sequence with a given sequence identifier might not be the same as before removing duplicates, unless count_agg = FIRST is used.
- Parameters
filter_sequence_type (SequenceType) – Whether the sequences should be collapsed on the nucleotide or amino acid level. Valid options are defined by the SequenceType enum.
batch_size (int) – Number of repertoires that can be loaded at the same time (only affects the speed).
count_agg (CountAggregationFunction) – Determines how the sequence counts of duplicate sequences are aggregated. Valid options are defined by the CountAggregationFunction enum.
YAML specification:
preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            DuplicateSequenceFilter:
                # required parameters:
                filter_sequence_type: AMINO_ACID
                # optional parameters (if not specified the values below will be used):
                batch_size: 4
                count_agg: SUM
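The duplicate-matching rule and the count aggregation can be illustrated with a small sketch in plain Python. It is not the immuneML implementation: sequences are modelled as dictionaries, all field names (`sequence_aa`, `v_gene`, and so on) are hypothetical, and the aggregation is passed in as an ordinary function instead of a CountAggregationFunction value.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical field names; matching uses these fields plus the chosen sequence field
MATCH_FIELDS = ("v_gene", "j_gene", "chain", "region_type")


def collapse_duplicates(rows: List[dict], sequence_field: str,
                        count_agg: Callable[[List[int]], int] = sum) -> List[dict]:
    """Collapse rows that agree on `sequence_field` and on all MATCH_FIELDS."""
    first_rows: Dict[Tuple, dict] = {}
    counts: Dict[Tuple, List[int]] = {}
    for row in rows:
        key = (row[sequence_field],) + tuple(row[field] for field in MATCH_FIELDS)
        if key not in first_rows:
            # the first occurrence supplies all fields that are not matched on
            first_rows[key] = dict(row)
            counts[key] = []
        counts[key].append(row["count"])
    # replace each kept row's count with the aggregated count of its group
    return [{**row, "count": count_agg(counts[key])} for key, row in first_rows.items()]


shared = {"j_gene": "J1", "chain": "TRB", "region_type": "IMGT_CDR3"}
rows = [
    {"sequence_id": "s1", "sequence_aa": "CASSL", "v_gene": "V1", "count": 3, **shared},
    {"sequence_id": "s2", "sequence_aa": "CASSL", "v_gene": "V1", "count": 2, **shared},
    {"sequence_id": "s3", "sequence_aa": "CASSL", "v_gene": "V1-1", "count": 4, **shared},
]
collapsed = collapse_duplicates(rows, sequence_field="sequence_aa")
for row in collapsed:
    print(row["sequence_id"], row["count"])
```

Here `s1` and `s2` are merged under the identifier `s1` with a summed count, while `s3` stays separate because V1 and V1-1 do not match. Passing a function such as `lambda c: c[0]` as `count_agg` would keep the first count instead, which is one plausible reading of a "first" aggregation.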
static process(dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, params: dict) → immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset
process_dataset(dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, result_path: pathlib.Path) → immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset
static process_repertoire(repertoire: immuneML.data_model.repertoire.Repertoire.Repertoire, params: dict) → immuneML.data_model.repertoire.Repertoire.Repertoire
immuneML.preprocessing.filters.Filter module
class immuneML.preprocessing.filters.Filter.Filter
Bases: immuneML.preprocessing.Preprocessor.Preprocessor, abc.ABC
static build_new_metadata(dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, indices_to_keep: list, result_path: pathlib.Path)
static check_dataset_not_empty(processed_dataset: immuneML.data_model.dataset.Dataset.Dataset, location='Filter')
-
static
immuneML.preprocessing.filters.MetadataRepertoireFilter module
class immuneML.preprocessing.filters.MetadataRepertoireFilter.MetadataRepertoireFilter(criteria: dict)
Bases: immuneML.preprocessing.filters.Filter.Filter
Removes repertoires from a RepertoireDataset based on information stored in the metadata_file. Note that this filter filters out repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets.
- Parameters
criteria (dict) – A nested dictionary that specifies the criteria repertoires must meet in order to be kept, based on the values in columns of the metadata_file. See CriteriaMatcher for a more detailed explanation.
YAML specification:
preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            # Example filter that keeps repertoires with values greater than 1 in the "my_column_name" column of the metadata_file
            MetadataRepertoireFilter:
                type: GREATER_THAN
                value:
                    type: COLUMN
                    name: my_column_name
                threshold: 1
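To make the example criterion concrete, the following sketch evaluates it against a few metadata rows in plain Python. It handles only the GREATER_THAN / COLUMN combination shown in the YAML example and is not the CriteriaMatcher implementation; `matching_indices` and the metadata values are hypothetical.

```python
from typing import List


def matching_indices(metadata_rows: List[dict], criteria: dict) -> List[int]:
    """Return the indices of metadata rows that satisfy `criteria`.

    Hypothetical evaluator supporting only GREATER_THAN on a COLUMN value.
    """
    if criteria["type"] != "GREATER_THAN" or criteria["value"]["type"] != "COLUMN":
        raise NotImplementedError("this sketch only covers GREATER_THAN on a COLUMN")
    column = criteria["value"]["name"]
    threshold = criteria["threshold"]
    return [index for index, row in enumerate(metadata_rows) if row[column] > threshold]


criteria = {
    "type": "GREATER_THAN",
    "value": {"type": "COLUMN", "name": "my_column_name"},
    "threshold": 1,
}
# one row per repertoire, as it might appear in a metadata file (made-up values)
metadata_rows = [
    {"filename": "rep_1.tsv", "my_column_name": 0},
    {"filename": "rep_2.tsv", "my_column_name": 2},
    {"filename": "rep_3.tsv", "my_column_name": 1},
]
indices_to_keep = matching_indices(metadata_rows, criteria)
print(indices_to_keep)
```

Only the second repertoire is kept in this sketch, because the comparison is strict and a value equal to the threshold does not match.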
static get_matching_indices(dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, criteria)
static process(dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, params: dict) → immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset
process_dataset(dataset: immuneML.data_model.dataset.RepertoireDataset.RepertoireDataset, result_path: pathlib.Path)