immuneML.preprocessing.filters package¶

Submodules¶

immuneML.preprocessing.filters.ChainRepertoireFilter module¶

class immuneML.preprocessing.filters.ChainRepertoireFilter.ChainRepertoireFilter(keep_chain, result_path: Path = None)[source]¶

Bases: Filter

Removes from the RepertoireDataset every repertoire that contains at least one sequence whose chain differs from the keep_chain parameter. Note that this filter removes whole repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets.

Since the filter removes repertoires from the dataset (examples in the machine learning setting), it cannot be used with the TrainMLModel instruction. If you want to filter repertoires by chain, see the DatasetExport instruction with preprocessing.

Specification arguments:

keep_chain (str): Which chain should be kept; valid values are “TRA”, “TRB”, “IGH”, “IGL”, “IGK”.

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            ChainRepertoireFilter:
                keep_chain: TRB
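
The repertoire-level behaviour described above can be sketched in plain Python. This is only an illustration of the rule, not immuneML's implementation: the dictionary of chain labels and the function name keep_repertoires_with_single_chain are assumptions made for the example.

```python
# Illustrative stand-in data (not immuneML objects): each repertoire is
# represented by the list of chain labels of its sequences.
repertoires = {
    "rep1": ["TRB", "TRB", "TRB"],
    "rep2": ["TRB", "TRA"],  # contains a non-TRB sequence, so it is dropped
    "rep3": ["TRA"],
}


def keep_repertoires_with_single_chain(repertoires, keep_chain):
    """Keep only repertoires in which every sequence has chain == keep_chain."""
    return {
        name: chains
        for name, chains in repertoires.items()
        if all(chain == keep_chain for chain in chains)
    }


kept = keep_repertoires_with_single_chain(repertoires, "TRB")
print(sorted(kept))  # ['rep1']
```

The number of repertoires drops from three to one, which is why a filter of this kind changes the example count.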

keeps_example_count() → bool[source]¶: Defines whether the preprocessing can be run with the TrainMLModel instruction; to be usable there, the preprocessing must not change the number of examples in the dataset.

process_dataset(dataset: RepertoireDataset, result_path: Path, number_of_processes=1)[source]¶

immuneML.preprocessing.filters.ClonesPerRepertoireFilter module¶

class immuneML.preprocessing.filters.ClonesPerRepertoireFilter.ClonesPerRepertoireFilter(result_path: Path = None, lower_limit: int = -1, upper_limit: int = -1)[source]¶

Bases: Filter

Removes all repertoires from the RepertoireDataset that contain fewer clonotypes than lower_limit or more clonotypes than upper_limit. Note that this filter removes whole repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets. When a limit is not specified, or is set to -1, that limit is ignored.

Since the filter removes repertoires from the dataset (examples in the machine learning setting), it cannot be used with the TrainMLModel instruction. If you want to use this filter, see the DatasetExport instruction with preprocessing.

Specification arguments:

lower_limit (int): The inclusive lower limit for the number of clonotypes allowed in a repertoire.
upper_limit (int): The inclusive upper limit for the number of clonotypes allowed in a repertoire.

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            ClonesPerRepertoireFilter:
                lower_limit: 100
                upper_limit: 100000
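
A minimal sketch of the limit check, assuming the clonotype count of each repertoire is already known. The helper name within_limits and the example counts are hypothetical; the sketch only mirrors the documented rules (inclusive limits, -1 meaning "no limit").

```python
def within_limits(clone_count, lower_limit=-1, upper_limit=-1):
    """Return True if clone_count lies inside the inclusive limits.

    A limit of -1 is treated as "not set" and is ignored.
    """
    if lower_limit != -1 and clone_count < lower_limit:
        return False
    if upper_limit != -1 and clone_count > upper_limit:
        return False
    return True


# Hypothetical clonotype counts per repertoire.
clone_counts = {"rep1": 50, "rep2": 100, "rep3": 5000, "rep4": 250000}

kept = [name for name, n in clone_counts.items()
        if within_limits(n, lower_limit=100, upper_limit=100000)]
print(kept)  # ['rep2', 'rep3']
```

Because the limits are inclusive, a repertoire with exactly 100 clonotypes is kept.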

keeps_example_count() → bool[source]¶: Defines whether the preprocessing can be run with the TrainMLModel instruction; to be usable there, the preprocessing must not change the number of examples in the dataset.

process_dataset(dataset: RepertoireDataset, result_path: Path, number_of_processes=1)[source]¶

immuneML.preprocessing.filters.CountAggregationFunction module¶

class immuneML.preprocessing.filters.CountAggregationFunction.CountAggregationFunction(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶

Bases: Enum

FIRST = 'first'¶

LAST = 'last'¶

MAX = 'max'¶

MEAN = 'mean'¶

MIN = 'min'¶

SUM = 'sum'¶
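
To make the meaning of the members concrete, the sketch below maps each one to an ordinary Python aggregation over a list of counts. The CountAgg class is a stand-in that only copies the member names and values listed above, and the AGGREGATORS mapping is an assumption for illustration; how immuneML applies the enum internally is not shown here.

```python
from enum import Enum
from statistics import mean


class CountAgg(Enum):
    """Stand-in enum with the same member names and values as listed above."""
    FIRST = "first"
    LAST = "last"
    MAX = "max"
    MEAN = "mean"
    MIN = "min"
    SUM = "sum"


# Hypothetical mapping from each member to a function over a list of counts.
AGGREGATORS = {
    CountAgg.FIRST: lambda counts: counts[0],
    CountAgg.LAST: lambda counts: counts[-1],
    CountAgg.MAX: max,
    CountAgg.MEAN: mean,
    CountAgg.MIN: min,
    CountAgg.SUM: sum,
}

counts = [3, 10, 2]  # counts of three duplicate sequences, in order of occurrence
results = {agg.name: AGGREGATORS[agg](counts) for agg in CountAgg}

# Standard Enum lookup by member name, e.g. for a value read from a config file.
chosen = CountAgg["SUM"]
print(AGGREGATORS[chosen](counts))  # 15
```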

immuneML.preprocessing.filters.CountPerSequenceFilter module¶

class immuneML.preprocessing.filters.CountPerSequenceFilter.CountPerSequenceFilter(low_count_limit: int, remove_without_count: bool, remove_empty_repertoires: bool, batch_size: int, result_path: Path = None)[source]¶

Bases: Filter

Removes all sequences from a Repertoire that have a count below low_count_limit, as well as sequences with no count value if remove_without_count is True. This filter can be applied to Repertoires and RepertoireDatasets.

Specification arguments:

low_count_limit (int): The minimum count (inclusive) a sequence must have to be retained.
remove_without_count (bool): Whether the sequences without a reported count value should be removed.
remove_empty_repertoires (bool): Whether repertoires without sequences should be removed. Only has an effect when remove_without_count is also set to True. If this is True, this preprocessing cannot be used with the TrainMLModel instruction, only with the DatasetExport instruction.
batch_size (int): number of repertoires that can be loaded at the same time (only affects the speed when applying this filter on a RepertoireDataset)

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            CountPerSequenceFilter:
                remove_without_count: True
                remove_empty_repertoires: True
                low_count_limit: 3
                batch_size: 4
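
The sequence-level rule can be sketched as follows, using the same parameter values as the YAML above. Repertoires are modelled as plain lists of counts, with None standing for a missing count; these structures and the function name filter_counts are assumptions for the example, not immuneML's data model.

```python
def filter_counts(counts, low_count_limit, remove_without_count):
    """Keep counts >= low_count_limit; drop missing counts only if requested."""
    kept = []
    for count in counts:
        if count is None:
            if not remove_without_count:
                kept.append(count)
        elif count >= low_count_limit:
            kept.append(count)
    return kept


# Hypothetical repertoires: lists of sequence counts, None = no count reported.
repertoires = {"rep1": [5, 1, None, 3], "rep2": [1, 2, None]}

filtered = {
    name: filter_counts(counts, low_count_limit=3, remove_without_count=True)
    for name, counts in repertoires.items()
}
print(filtered)  # {'rep1': [5, 3], 'rep2': []}

# With remove_empty_repertoires set, repertoires left without sequences are
# dropped, which changes the number of examples in the dataset.
non_empty = {name: counts for name, counts in filtered.items() if counts}
print(sorted(non_empty))  # ['rep1']
```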

keeps_example_count() → bool[source]¶: Defines whether the preprocessing can be run with the TrainMLModel instruction; to be usable there, the preprocessing must not change the number of examples in the dataset.

process_dataset(dataset: RepertoireDataset, result_path: Path, number_of_processes=1) → RepertoireDataset[source]¶

immuneML.preprocessing.filters.DuplicateSequenceFilter module¶

class immuneML.preprocessing.filters.DuplicateSequenceFilter.DuplicateSequenceFilter(filter_sequence_type: SequenceType, batch_size: int, count_agg: CountAggregationFunction, result_path: Path = None, region_type: RegionType = RegionType.IMGT_CDR3)[source]¶

Bases: Filter

Collapses duplicate nucleotide or amino acid sequences within each repertoire in the given RepertoireDataset or within a SequenceDataset. This filter can be applied to Repertoires, RepertoireDatasets, and SequenceDatasets.

Sequences are considered duplicates if the following fields are identical:

amino acid or nucleotide sequence (whichever is specified)
v and j genes (note that the full field including subgroup + gene is used for matching, i.e. V1 and V1-1 are not considered duplicates)
chain
region type

For all other fields (the non-specified sequence type, custom lists, sequence identifier) only the first occurring value is kept.

Note that this means the count value of a sequence with a given sequence identifier might not be the same as before removing duplicates, unless count_agg = FIRST is used.

Specification arguments:

filter_sequence_type (SequenceType): Whether the sequences should be collapsed on the nucleotide or amino acid level. Valid options are defined by the SequenceType enum.
region_type (str): which part of the sequence to examine, by default, this is IMGT_CDR3
count_agg (CountAggregationFunction): determines how the sequence counts of duplicate sequences are aggregated. Valid options are defined by the CountAggregationFunction enum.

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            DuplicateSequenceFilter:
                # required parameters:
                filter_sequence_type: AMINO_ACID
                # optional parameters (if not specified the values below will be used):
                batch_size: 4
                count_agg: SUM
                region_type: IMGT_CDR3
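
The collapsing rule described above (identical key fields, first value kept for everything else, counts aggregated) can be sketched with plain dictionaries. The field names such as sequence_aa, v_gene and j_gene, and the function collapse_duplicates, are hypothetical and chosen only for this illustration.

```python
# Hypothetical records; the field names are illustrative, not immuneML's schema.
records = [
    {"id": "s1", "sequence_aa": "CASSL", "v_gene": "TRBV1", "j_gene": "TRBJ1",
     "chain": "TRB", "region_type": "IMGT_CDR3", "count": 3},
    {"id": "s2", "sequence_aa": "CASSL", "v_gene": "TRBV1", "j_gene": "TRBJ1",
     "chain": "TRB", "region_type": "IMGT_CDR3", "count": 4},
    # Same sequence but V1-1 instead of V1: not a duplicate of s1/s2.
    {"id": "s3", "sequence_aa": "CASSL", "v_gene": "TRBV1-1", "j_gene": "TRBJ1",
     "chain": "TRB", "region_type": "IMGT_CDR3", "count": 2},
]

KEY_FIELDS = ("sequence_aa", "v_gene", "j_gene", "chain", "region_type")


def collapse_duplicates(records, key_fields=KEY_FIELDS, aggregate=sum):
    """Merge records that agree on key_fields.

    All other fields take the first occurring value; counts are aggregated.
    """
    groups = {}
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key not in groups:
            groups[key] = {"first": dict(record), "counts": []}
        groups[key]["counts"].append(record["count"])

    collapsed = []
    for group in groups.values():
        merged = group["first"]
        merged["count"] = aggregate(group["counts"])
        collapsed.append(merged)
    return collapsed


collapsed = collapse_duplicates(records)
print([(r["id"], r["count"]) for r in collapsed])  # [('s1', 7), ('s3', 2)]
```

With summed counts, the record that keeps identifier s1 now reports a count of 7 rather than its original 3, which is the effect the note on sequence identifiers describes.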

classmethod build_object(**kwargs)[source]¶

static get_documentation()[source]¶

process_dataset(dataset, result_path: Path, number_of_processes=1)[source]¶

immuneML.preprocessing.filters.Filter module¶

class immuneML.preprocessing.filters.Filter.Filter(result_path: Path = None)[source]¶

Bases: Preprocessor, ABC

check_dataset_not_empty(processed_dataset: Dataset, location='Filter')[source]¶

immuneML.preprocessing.filters.MetadataFilter module¶

class immuneML.preprocessing.filters.MetadataFilter.MetadataFilter(criteria: dict, result_path: Path = None)[source]¶

Bases: Filter

Removes examples from a dataset based on the examples’ metadata. It works for any dataset type. Note that for repertoire datasets this means that whole repertoires are filtered out, and for sequence datasets, individual sequences.

Since this filter changes the number of examples, it cannot be used with the TrainMLModel instruction. Use it with the DatasetExport instruction instead.

Specification arguments:

criteria (dict): a nested dictionary that specifies the criteria for keeping the dataset examples based on the column values. It contains the type of evaluation, the name of the column, and additional parameters depending on the evaluation; alternatively, it can contain a combination of multiple (evaluation, column, parameters) groups. The evaluation types are IN, NOT_IN, NOT_NA, GREATER_THAN, LESS_THAN, TOP_N, RANDOM_N. For IN and NOT_IN the parameter name is ‘values’, for GREATER_THAN and LESS_THAN the parameter name is ‘threshold’, and for TOP_N and RANDOM_N the parameter name is ‘number’. Supported boolean combinations of groups are AND and OR, with the (evaluation, column, parameters) groups specified under the ‘operands’ key; see the YAML below for an example.

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            # Example filter that keeps e.g., repertoires with values greater than 1 in the "my_column_name"
            # column of the metadata_file
            MetadataFilter:
                type: GREATER_THAN
                column: my_column_name
                threshold: 1
    my_second_preprocessing:
        - my_filter2: # only examples which in column "label" have values 'label_val1' or 'label_val2' are kept
            MetadataFilter:
                type: IN
                values: [label_val1, label_val2]
                column: label
    my_third_preprocessing_example:
        - my_combined_filter:
            MetadataFilter:
            # keeps examples that have label_val1 or label_val2 in the column label and
            # that at the same time have a value larger than 1.3 in another_metadata_column
                type: AND
                operands:
                - type: IN
                  values: [label_val1, label_val2]
                  column: label
                - type: GREATER_THAN
                  column: another_metadata_column
                  threshold: 1.3
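
How a nested criteria dictionary of this shape could be evaluated against one metadata row is sketched below. The matches function and the example rows are assumptions for illustration, not immuneML's implementation; TOP_N and RANDOM_N are left out because they depend on the whole dataset rather than on a single row.

```python
def matches(row, criteria):
    """Recursively evaluate a criteria dict against one metadata row (a dict)."""
    kind = criteria["type"]
    if kind == "AND":
        return all(matches(row, operand) for operand in criteria["operands"])
    if kind == "OR":
        return any(matches(row, operand) for operand in criteria["operands"])

    value = row.get(criteria["column"])
    if kind == "IN":
        return value in criteria["values"]
    if kind == "NOT_IN":
        return value not in criteria["values"]
    if kind == "NOT_NA":
        return value is not None
    if kind == "GREATER_THAN":
        return value is not None and value > criteria["threshold"]
    if kind == "LESS_THAN":
        return value is not None and value < criteria["threshold"]
    raise ValueError(f"Criteria type not covered by this sketch: {kind}")


# Same structure as the combined AND example in the YAML above.
criteria = {
    "type": "AND",
    "operands": [
        {"type": "IN", "values": ["label_val1", "label_val2"], "column": "label"},
        {"type": "GREATER_THAN", "column": "another_metadata_column", "threshold": 1.3},
    ],
}

# Hypothetical metadata rows, one per example.
rows = [
    {"subject": "a", "label": "label_val1", "another_metadata_column": 2.0},
    {"subject": "b", "label": "label_val2", "another_metadata_column": 1.0},
    {"subject": "c", "label": "other", "another_metadata_column": 5.0},
]

kept = [row["subject"] for row in rows if matches(row, criteria)]
print(kept)  # ['a']
```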

keeps_example_count() → bool[source]¶: Defines whether the preprocessing can be run with the TrainMLModel instruction; to be usable there, the preprocessing must not change the number of examples in the dataset.

process_dataset(dataset: Dataset, result_path: Path, number_of_processes=1)[source]¶

immuneML.preprocessing.filters.SequenceLengthFilter module¶

class immuneML.preprocessing.filters.SequenceLengthFilter.SequenceLengthFilter(min_len: int, max_len: int, sequence_type: SequenceType, region_type: RegionType, name: str = None)[source]¶

Bases: Filter

Removes sequences whose length falls outside the specified range.

Supported dataset types:

SequenceDataset
ReceptorDataset
RepertoireDataset

Specification arguments:

sequence_type (SequenceType): Whether the sequences should be filtered on the nucleotide or amino acid level. Valid options are defined by the SequenceType enum.
min_len (int): minimum length of the sequence (sequences shorter than min_len will be removed); to not use min_len, set it to -1
max_len (int): maximum length of the sequence (sequences longer than max_len will be removed); to not use max_len, set it to -1
region_type (str): which part of the sequence to examine, by default, this is IMGT_CDR3

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            SequenceLengthFilter:
                sequence_type: AMINO_ACID
                min_len: 3 # -> remove all sequences shorter than 3
                max_len: -1 # -> no upper bound on the sequence length
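
A minimal sketch of the length check with the same settings as the YAML above. The helper name length_in_range and the example sequences are hypothetical; the sketch only mirrors the documented rule that -1 disables a bound.

```python
def length_in_range(sequence, min_len=-1, max_len=-1):
    """Return True if len(sequence) is within [min_len, max_len]; -1 disables a bound."""
    if min_len != -1 and len(sequence) < min_len:
        return False
    if max_len != -1 and len(sequence) > max_len:
        return False
    return True


# Hypothetical amino acid sequences.
sequences = ["CA", "CAS", "CASSLGQAYEQYF"]

kept = [s for s in sequences if length_in_range(s, min_len=3, max_len=-1)]
print(kept)  # ['CAS', 'CASSLGQAYEQYF']
```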

classmethod build_object(**kwargs)[source]¶

process_dataset(dataset, result_path: Path, number_of_processes: int = 1)[source]¶

Module contents¶