Preprocessing parameters

Under the definitions/preprocessing_sequences component, the user can specify different preprocessing steps to apply to a dataset before performing an analysis. This is optional.

ChainRepertoireFilter

Removes all repertoires from the RepertoireDataset object that contain at least one sequence whose chain differs from the keep_chain parameter. Note that this filter removes whole repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets.

Since the filter removes repertoires from the dataset (the examples in a machine learning setting), it cannot be used with the TrainMLModel instruction. To filter repertoires by chain, use this preprocessing with the DatasetExport instruction instead.

Specification arguments:

  • keep_chain (str): Which chain should be kept. Valid values are “TRA”, “TRB”, “IGH”, “IGL”, “IGK”.

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            ChainRepertoireFilter:
                keep_chain: TRB
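
The all-or-nothing behaviour of this filter can be illustrated with a minimal Python sketch. It is not immuneML's implementation: the function name and the representation of a repertoire as a plain list of chain labels are assumptions made for illustration only.

```python
def filter_repertoires_by_chain(repertoires, keep_chain):
    """Keep only repertoires in which every sequence has the chain keep_chain.

    `repertoires` is a hypothetical mapping: repertoire name -> list of chain labels.
    """
    return {
        name: chains
        for name, chains in repertoires.items()
        if all(chain == keep_chain for chain in chains)
    }


repertoires = {
    "rep1": ["TRB", "TRB", "TRB"],
    "rep2": ["TRB", "TRA"],  # a single TRA sequence removes the whole repertoire
    "rep3": ["TRA"],
}

kept = filter_repertoires_by_chain(repertoires, keep_chain="TRB")
print(sorted(kept))  # ['rep1']
```

Note that rep2 is dropped as a whole even though it contains one TRB sequence; no individual sequences are removed.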

ClonesPerRepertoireFilter

Removes all repertoires from the RepertoireDataset that contain fewer clonotypes than lower_limit or more clonotypes than upper_limit. Note that this filter removes whole repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets. A limit that is not specified, or that is set to -1, is ignored.

Since the filter removes repertoires from the dataset (the examples in a machine learning setting), it cannot be used with the TrainMLModel instruction. To use this filter, apply it as preprocessing in the DatasetExport instruction.

Specification arguments:

  • lower_limit (int): The inclusive lower limit for the number of clonotypes allowed in a repertoire.

  • upper_limit (int): The inclusive upper limit for the number of clonotypes allowed in a repertoire.

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            ClonesPerRepertoireFilter:
                lower_limit: 100
                upper_limit: 100000
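
As a rough illustration of how the inclusive limits and the -1 sentinel interact, consider the Python sketch below. The function name and the use of a bare clonotype count are hypothetical; this is not the library's code.

```python
def within_clone_limits(n_clones, lower_limit=-1, upper_limit=-1):
    """Return True if a repertoire with n_clones clonotypes would be kept.

    A limit of -1 means "no limit on this side"; both limits are inclusive.
    """
    if lower_limit != -1 and n_clones < lower_limit:
        return False
    if upper_limit != -1 and n_clones > upper_limit:
        return False
    return True


# Using the limits from the YAML example above (100 and 100000):
print(within_clone_limits(99, 100, 100000))      # False
print(within_clone_limits(100, 100, 100000))     # True
print(within_clone_limits(100001, 100, 100000))  # False
print(within_clone_limits(5))                    # True
```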

CountPerSequenceFilter

Removes all sequences from a Repertoire that have a count below low_count_limit, as well as sequences with no count value if remove_without_count is True. This filter can be applied to Repertoires and RepertoireDatasets.

Specification arguments:

  • low_count_limit (int): The minimal count value (inclusive) a sequence must have in order to be retained.

  • remove_without_count (bool): Whether the sequences without a reported count value should be removed.

  • remove_empty_repertoires (bool): Whether repertoires without sequences should be removed. Only has an effect when remove_without_count is also set to True. If this is True, this preprocessing cannot be used with the TrainMLModel instruction, only with the DatasetExport instruction.

  • batch_size (int): number of repertoires that can be loaded at the same time (only affects the speed when applying this filter on a RepertoireDataset)

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            CountPerSequenceFilter:
                remove_without_count: True
                remove_empty_repertoires: True
                low_count_limit: 3
                batch_size: 4
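
The per-sequence rule can be sketched in plain Python as follows. This is an illustration under assumptions, not immuneML's implementation: the function name is hypothetical, sequences are modelled as (sequence, count) tuples, and a missing count is modelled as None.

```python
def filter_by_count(sequences, low_count_limit, remove_without_count):
    """Return the (sequence, count) pairs that survive the filter.

    A count of None stands for "no count value reported".
    """
    kept = []
    for sequence, count in sequences:
        if count is None:
            # Sequences without a count are dropped only on request.
            if not remove_without_count:
                kept.append((sequence, count))
        elif count >= low_count_limit:
            # The limit is inclusive: a count equal to the limit is retained.
            kept.append((sequence, count))
    return kept


repertoire = [("CASSL", 5), ("CASSP", 3), ("CASSQ", 2), ("CASSR", None)]

kept_lenient = filter_by_count(repertoire, low_count_limit=3, remove_without_count=False)
kept_strict = filter_by_count(repertoire, low_count_limit=3, remove_without_count=True)
print(kept_lenient)  # [('CASSL', 5), ('CASSP', 3), ('CASSR', None)]
print(kept_strict)   # [('CASSL', 5), ('CASSP', 3)]
```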

DuplicateSequenceFilter

Collapses duplicate nucleotide or amino acid sequences within each repertoire in the given RepertoireDataset. This filter can be applied to Repertoires and RepertoireDatasets.

Sequences are considered duplicates if the following fields are identical:

  • amino acid or nucleotide sequence (whichever is specified)

  • v and j genes (note that the full field including subgroup + gene is used for matching, i.e. V1 and V1-1 are not considered duplicates)

  • chain

  • region type

For all other fields (the non-specified sequence type, custom lists, sequence identifier) only the first occurring value is kept.

Note that this means the count value of a sequence with a given sequence identifier might not be the same as before removing duplicates, unless count_agg = FIRST is used.

Specification arguments:

  • filter_sequence_type (SequenceType): Whether the sequences should be collapsed on the nucleotide or amino acid level. Valid values are: [‘amino_acid’, ‘nucleotide’].

  • region_type (str): which part of the sequence to examine, by default, this is IMGT_CDR3

  • count_agg (CountAggregationFunction): determines how the sequence counts of duplicate sequences are aggregated. Valid values are: [‘sum’, ‘max’, ‘min’, ‘mean’, ‘first’, ‘last’].

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            DuplicateSequenceFilter:
                # required parameters:
                filter_sequence_type: AMINO_ACID
                # optional parameters (if not specified the values below will be used):
                batch_size: 4
                count_agg: SUM
                region_type: IMGT_CDR3
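
The collapsing rules above can be made concrete with a small Python sketch. It is only an approximation of the described behaviour: the function name, the dictionary field names (id, amino_acid, nucleotide, v_gene, j_gene, chain, region_type, count) and the lower-case aggregation keys are assumptions for illustration, not immuneML's data model.

```python
AGGREGATORS = {
    "sum": sum,
    "max": max,
    "min": min,
    "mean": lambda counts: sum(counts) / len(counts),
    "first": lambda counts: counts[0],
    "last": lambda counts: counts[-1],
}


def collapse_duplicates(rows, sequence_field="amino_acid", count_agg="sum"):
    """Collapse rows that agree on sequence, V gene, J gene, chain and region type."""
    groups = {}
    for row in rows:
        key = (row[sequence_field], row["v_gene"], row["j_gene"],
               row["chain"], row["region_type"])
        groups.setdefault(key, []).append(row)

    collapsed = []
    for duplicates in groups.values():
        merged = dict(duplicates[0])  # all other fields: first occurrence wins
        merged["count"] = AGGREGATORS[count_agg]([r["count"] for r in duplicates])
        collapsed.append(merged)
    return collapsed


shared = {"j_gene": "J2", "chain": "TRB", "region_type": "IMGT_CDR3"}
rows = [
    {"id": "s1", "amino_acid": "CASSL", "nucleotide": "TGTGCC", "v_gene": "V1-1", "count": 2, **shared},
    {"id": "s2", "amino_acid": "CASSL", "nucleotide": "TGCGCC", "v_gene": "V1-1", "count": 5, **shared},
    {"id": "s3", "amino_acid": "CASSL", "nucleotide": "TGTGCC", "v_gene": "V1", "count": 1, **shared},
]

result = collapse_duplicates(rows, sequence_field="amino_acid", count_agg="sum")
print([(r["id"], r["count"]) for r in result])  # [('s1', 7), ('s3', 1)]
```

Here s1 and s2 are merged because they share the amino acid sequence and genes; s3 stays separate because V1 and V1-1 are compared as full strings. The merged row keeps the identifier s1 but carries the summed count 7, which is the effect the note on count_agg describes.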

MetadataRepertoireFilter

Removes repertoires from a RepertoireDataset based on information stored in the metadata_file. Note that this filters out repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets.

Since this filter changes the number of repertoires (examples for the machine learning task), it cannot be used with TrainMLModel instruction. To filter out repertoires, use preprocessing from the DatasetExport instruction that will create a new dataset ready to be used for training machine learning models.

Specification arguments:

  • criteria (dict): a nested dictionary that specifies the criteria for keeping certain repertoires, based on the columns of the metadata_file. See CriteriaMatcher for a more detailed explanation.

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            # Example filter that keeps repertoires with values greater than 1 in the "my_column_name" column of the metadata_file
            MetadataRepertoireFilter:
                criteria:
                    type: GREATER_THAN
                    value:
                        type: COLUMN
                        name: my_column_name
                    threshold: 1
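
To show what a criteria dictionary of the shape used in this example means for individual metadata rows, here is a deliberately simplified Python sketch. It handles only GREATER_THAN applied to a COLUMN value and is not the CriteriaMatcher implementation; the function name and the metadata rows are hypothetical.

```python
def keep_repertoire(criteria, metadata_row):
    """Evaluate a GREATER_THAN-on-COLUMN criterion against one metadata row."""
    if criteria["type"] != "GREATER_THAN" or criteria["value"]["type"] != "COLUMN":
        raise NotImplementedError("this sketch only handles GREATER_THAN on a COLUMN")
    column_name = criteria["value"]["name"]
    return metadata_row[column_name] > criteria["threshold"]


criteria = {
    "type": "GREATER_THAN",
    "value": {"type": "COLUMN", "name": "my_column_name"},
    "threshold": 1,
}

# Hypothetical metadata rows, one per repertoire.
metadata = [
    {"subject_id": "subj1", "my_column_name": 0},
    {"subject_id": "subj2", "my_column_name": 1},
    {"subject_id": "subj3", "my_column_name": 2},
]

kept_subjects = [row["subject_id"] for row in metadata if keep_repertoire(criteria, row)]
print(kept_subjects)  # ['subj3']
```

The comparison is strict, so the repertoire whose value equals the threshold is not kept.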

ReferenceSequenceAnnotator

Annotates each sequence in each repertoire if it matches any of the reference sequences provided as an input parameter. This preprocessing uses CompAIRR internally. To match CDR3 sequences (and not JUNCTION), CompAIRR v1.10 or later is needed.

Specification arguments:

  • reference_sequences (dict): A dictionary describing the reference dataset file. Import should be specified the same way as regular dataset import. It is only allowed to import a receptor dataset here (i.e., is_repertoire is False and paired is True by default, and these are not allowed to be changed).

  • max_edit_distance (int): The maximum edit distance between a target sequence (from the repertoire) and the reference sequence.

  • compairr_path (str): optional path to the CompAIRR executable. If not given, it is assumed that CompAIRR has been installed such that it can be called directly on the command line with the command ‘compairr’, or that it is located at /usr/local/bin/compairr.

  • threads (int): how many threads to be used by CompAIRR for sequence matching

  • ignore_genes (bool): Whether to ignore V and J gene information. If False, the V and J genes between two receptor chains have to match. If True, gene information is ignored. By default, ignore_genes is False.

  • output_column_name (str): the name of the column in the output repertoire files that stores this annotation; setting it makes it possible to tell multiple annotations apart

  • repertoire_batch_size (int): how many repertoires to process simultaneously; depending on the repertoire size, this parameter can be used to limit memory usage

  • region_type (str): which region type to check, default is IMGT_CDR3

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - step1:
            ReferenceSequenceAnnotator:
                reference_sequences:
                    format: VDJDB
                    params:
                        path: path/to/file.csv
                compairr_path: optional/path/to/compairr
                ignore_genes: False
                max_edit_distance: 0
                output_column_name: matched
                threads: 4
                repertoire_batch_size: 5
                region_type: IMGT_CDR3
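
The effect of ignore_genes can be sketched in Python for the simplest case, max_edit_distance: 0 (exact matching). This sketch does not call CompAIRR and does not reproduce its matching algorithm or output format; the function name, the field names (cdr3, v_gene, j_gene) and the boolean flag per sequence are assumptions for illustration.

```python
def annotate_exact_matches(repertoire, references, ignore_genes=False):
    """Return one flag per repertoire sequence: True if it matches a reference.

    Only exact sequence matching (edit distance 0) is sketched here.
    """
    def key(seq):
        if ignore_genes:
            return (seq["cdr3"],)
        return (seq["cdr3"], seq["v_gene"], seq["j_gene"])

    reference_keys = {key(ref) for ref in references}
    return [key(seq) in reference_keys for seq in repertoire]


references = [{"cdr3": "CASSL", "v_gene": "TRBV7-2", "j_gene": "TRBJ2-1"}]
repertoire = [
    {"cdr3": "CASSL", "v_gene": "TRBV7-2", "j_gene": "TRBJ2-1"},  # same sequence and genes
    {"cdr3": "CASSL", "v_gene": "TRBV5-1", "j_gene": "TRBJ2-1"},  # same sequence, other V gene
    {"cdr3": "CASSF", "v_gene": "TRBV7-2", "j_gene": "TRBJ2-1"},  # different sequence
]

with_genes = annotate_exact_matches(repertoire, references, ignore_genes=False)
without_genes = annotate_exact_matches(repertoire, references, ignore_genes=True)
print(with_genes)     # [True, False, False]
print(without_genes)  # [True, True, False]
```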

SequenceLengthFilter

Removes sequences whose length falls outside the predefined range.

Specification arguments:

  • sequence_type (SequenceType): Whether the sequences should be filtered on the nucleotide or amino acid level. Valid options are defined by the SequenceType enum.

  • min_len (int): minimum length of the sequence (sequences shorter than min_len will be removed); to not use min_len, set it to -1

  • max_len (int): maximum length of the sequence (sequences longer than max_len will be removed); to not use max_len, set it to -1

  • region_type (str): which part of the sequence to examine, by default, this is IMGT_CDR3

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter:
            SequenceLengthFilter:
                sequence_type: AMINO_ACID
                min_len: 3 # -> remove all sequences shorter than 3
                max_len: -1 # -> no upper bound on the sequence length
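
A minimal Python sketch of the length rule, including the -1 convention for disabling a bound, is shown below. The function name is hypothetical and the sketch operates on plain strings rather than on immuneML objects.

```python
def keep_by_length(sequence, min_len=-1, max_len=-1):
    """Return True if the sequence length lies within [min_len, max_len].

    A bound of -1 disables that side of the range.
    """
    if min_len != -1 and len(sequence) < min_len:
        return False
    if max_len != -1 and len(sequence) > max_len:
        return False
    return True


sequences = ["CA", "CAS", "CASSLGQAYEQYF"]

# Same settings as the YAML example above: min_len 3, no upper bound.
kept_sequences = [s for s in sequences if keep_by_length(s, min_len=3, max_len=-1)]
print(kept_sequences)  # ['CAS', 'CASSLGQAYEQYF']
```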

SubjectRepertoireCollector

Merges all the Repertoires in a RepertoireDataset that have the same ‘subject_id’ specified in the metadata. The result is a RepertoireDataset with one Repertoire per subject. This preprocessing cannot be used in combination with TrainMLModel instruction because it can change the number of examples. To combine the repertoires in this way, use this preprocessing with DatasetExport instruction.

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter: SubjectRepertoireCollector
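
The merging step can be pictured with the following Python sketch, which groups repertoires by subject_id and concatenates their sequences. The function name and the dictionary layout are assumptions for illustration; immuneML's own handling of metadata and repertoire files is not shown.

```python
def merge_by_subject(repertoires):
    """Merge repertoires sharing a subject_id into one sequence list per subject."""
    merged = {}
    for repertoire in repertoires:
        merged.setdefault(repertoire["subject_id"], []).extend(repertoire["sequences"])
    return merged


repertoires = [
    {"subject_id": "subj1", "sequences": ["CASSL", "CASSP"]},
    {"subject_id": "subj2", "sequences": ["CASSQ"]},
    {"subject_id": "subj1", "sequences": ["CASSR"]},
]

merged = merge_by_subject(repertoires)
print(merged)  # {'subj1': ['CASSL', 'CASSP', 'CASSR'], 'subj2': ['CASSQ']}
```

Three input repertoires become two, which is why the number of examples can change and the step is restricted to DatasetExport.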