immuneML.preprocessing package¶

Subpackages¶

immuneML.preprocessing.filters package

Submodules¶

immuneML.preprocessing.Preprocessor module¶

class immuneML.preprocessing.Preprocessor.Preprocessor(result_path: Path = None)[source]¶

Bases: object

check_dataset_type(dataset, valid_dataset_types: list, location: str)[source]¶

keeps_example_count() → bool[source]¶: Defines whether the preprocessing can be run with the TrainMLModel instruction; to be compatible with it, the preprocessing must not change the number of examples in the dataset.

abstract process_dataset(dataset: RepertoireDataset, result_path: Path, number_of_processes: int = 1) → RepertoireDataset[source]¶
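As a rough illustration of the interface listed above, the following sketch uses a stand-in abstract class rather than immuneML itself, so it runs without the library installed. The names PreprocessorSketch and DropEmptyExamples, the list-based dataset, and the default return value of keeps_example_count are assumptions made for illustration; only the method names process_dataset and keeps_example_count are taken from the signatures above.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List, Optional


class PreprocessorSketch(ABC):
    """Stand-in that mirrors the documented method names; NOT the immuneML class."""

    def __init__(self, result_path: Optional[Path] = None):
        self.result_path = result_path

    @abstractmethod
    def process_dataset(self, dataset: List[list], result_path: Path,
                        number_of_processes: int = 1) -> List[list]:
        """Return a (possibly modified) copy of the dataset."""

    def keeps_example_count(self) -> bool:
        # Assumed default for this sketch: the number of examples is unchanged.
        return True


class DropEmptyExamples(PreprocessorSketch):
    """Hypothetical preprocessing that removes empty examples."""

    def process_dataset(self, dataset, result_path, number_of_processes=1):
        # Examples are plain lists of sequences here, not immuneML repertoires.
        return [example for example in dataset if len(example) > 0]

    def keeps_example_count(self) -> bool:
        # Filtering can remove examples, so the example count is not preserved.
        return False


if __name__ == "__main__":
    step = DropEmptyExamples()
    print(step.process_dataset([["CASS"], [], ["CAS", "CSA"]], Path(".")))
    print(step.keeps_example_count())
```

In this sketch, a step that can drop examples reports False from keeps_example_count, which corresponds to the documented restriction on use with the TrainMLModel instruction.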

immuneML.preprocessing.ReferenceSequenceAnnotator module¶

class immuneML.preprocessing.ReferenceSequenceAnnotator.ReferenceSequenceAnnotator(reference_sequences: List[ReceptorSequence], max_edit_distance: int, compairr_path: str, ignore_genes: bool, threads: int, output_column_name: str, repertoire_batch_size: int, region_type: RegionType)[source]¶

Bases: Preprocessor

Annotates each sequence in each repertoire with whether it matches any of the reference sequences provided as an input parameter. This preprocessing uses CompAIRR internally. To match CDR3 sequences (and not JUNCTION), CompAIRR v1.10 or later is needed.

Specification arguments:

reference_sequences (dict): A dictionary describing the reference dataset file. Import should be specified the same way as regular dataset import. It is only allowed to import a receptor dataset here (i.e., is_repertoire is False and paired is True by default, and these are not allowed to be changed).
max_edit_distance (int): The maximum edit distance between a target sequence (from the repertoire) and the reference sequence.
compairr_path (str): optional path to the CompAIRR executable. If not given, it is assumed that CompAIRR has been installed such that it can be called directly on the command line with the command ‘compairr’, or that it is located at /usr/local/bin/compairr.
threads (int): how many threads CompAIRR should use for sequence matching
ignore_genes (bool): Whether to ignore V and J gene information. If False, the V and J genes between two receptor chains have to match. If True, gene information is ignored. By default, ignore_genes is False.
output_column_name (str): in case there are multiple annotations, it is possible here to define the name of the column in the output repertoire files for this specific annotation
repertoire_batch_size (int): how many repertoires to process simultaneously; depending on the repertoire size, this parameter might be used to limit the memory usage
region_type (str): which region type to check, default is IMGT_CDR3
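The fallback order described for compairr_path can be sketched roughly as follows. This is not the lookup immuneML performs; resolve_compairr is a hypothetical helper, and the order (explicit path, then the command line, then /usr/local/bin/compairr) is only one plausible reading of the description above.

```python
import shutil
from pathlib import Path
from typing import Optional


def resolve_compairr(compairr_path: Optional[str] = None,
                     fallback: str = "/usr/local/bin/compairr") -> Optional[str]:
    """Hypothetical helper: return a path to a CompAIRR executable, or None."""
    if compairr_path is not None:
        # An explicitly configured path takes priority over any lookup.
        return compairr_path if Path(compairr_path).is_file() else None

    # Otherwise, check whether 'compairr' can be called from the command line.
    on_path = shutil.which("compairr")
    if on_path is not None:
        return on_path

    # Last resort: the fixed location mentioned in the argument description.
    return fallback if Path(fallback).is_file() else None


if __name__ == "__main__":
    print(resolve_compairr())
```

Returning None instead of raising lets the caller decide how to report a missing installation.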

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - step1:
            ReferenceSequenceAnnotator:
                reference_sequences:
                    format: VDJDB
                    params:
                        path: path/to/file.csv
                compairr_path: optional/path/to/compairr
                ignore_genes: False
                max_edit_distance: 0
                output_column_name: matched
                threads: 4
                repertoire_batch_size: 5
                region_type: IMGT_CDR3
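To make the effect of ignore_genes concrete for the max_edit_distance: 0 case in the specification above, here is a minimal pure-Python sketch of exact matching. It does not call CompAIRR and is not the immuneML implementation: annotate_exact is a hypothetical function, the field names cdr3_aa, v_call and j_call are assumed for illustration, and the boolean output value is an assumption (the actual annotation written to the repertoire files may differ).

```python
from typing import Dict, List


def annotate_exact(repertoire: List[Dict[str, str]],
                   references: List[Dict[str, str]],
                   ignore_genes: bool = False,
                   output_column_name: str = "matched") -> List[Dict[str, object]]:
    """Hypothetical exact-match annotation (edit distance 0 only)."""

    def key(seq: Dict[str, str]) -> tuple:
        # With ignore_genes=True only the CDR3 sequence is compared;
        # otherwise the V and J genes have to match as well.
        if ignore_genes:
            return (seq["cdr3_aa"],)
        return (seq["cdr3_aa"], seq["v_call"], seq["j_call"])

    reference_keys = {key(ref) for ref in references}
    # Add one new column per sequence without modifying the input records.
    return [{**seq, output_column_name: key(seq) in reference_keys}
            for seq in repertoire]


if __name__ == "__main__":
    refs = [{"cdr3_aa": "ASSLG", "v_call": "TRBV5-1", "j_call": "TRBJ2-1"}]
    rep = [{"cdr3_aa": "ASSLG", "v_call": "TRBV7-2", "j_call": "TRBJ2-1"},
           {"cdr3_aa": "ASSPQ", "v_call": "TRBV5-1", "j_call": "TRBJ2-1"}]
    print(annotate_exact(rep, refs))
    print(annotate_exact(rep, refs, ignore_genes=True))
```

In this toy data, the first repertoire sequence shares its CDR3 with the reference but has a different V gene, so it is annotated as matched only when ignore_genes is True.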

classmethod build_object(**kwargs)[source]¶

process_dataset(dataset: RepertoireDataset, result_path: Path, number_of_processes=1) → RepertoireDataset[source]¶

immuneML.preprocessing.SubjectRepertoireCollector module¶

class immuneML.preprocessing.SubjectRepertoireCollector.SubjectRepertoireCollector(result_path: Path = None)[source]¶