immuneML.preprocessing package

Subpackages

Submodules

immuneML.preprocessing.Preprocessor module

class immuneML.preprocessing.Preprocessor.Preprocessor(result_path: Path = None)[source]

Bases: object

check_dataset_type(dataset, valid_dataset_types: list, location: str)[source]
keeps_example_count() → bool[source]

Defines whether the preprocessing can be run with the TrainMLModel instruction. To be compatible with it, the preprocessing must not change the number of examples in the dataset.

abstract process_dataset(dataset: RepertoireDataset, result_path: Path, number_of_processes: int = 1) → RepertoireDataset[source]
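The contract between keeps_example_count and process_dataset can be illustrated with a minimal, self-contained sketch. It does not import immuneML: SketchPreprocessor, DropEmptyExamples and can_run_with_train_ml_model are hypothetical names, a dataset is represented as a plain list of examples, and the default return value of True is an assumption rather than something stated above.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class SketchPreprocessor(ABC):
    """Hypothetical stand-in for the Preprocessor base class (not immuneML code)."""

    def __init__(self, result_path: Path = None):
        self.result_path = result_path

    def keeps_example_count(self) -> bool:
        # Assumption: by default a preprocessing leaves the number of examples unchanged.
        return True

    @abstractmethod
    def process_dataset(self, dataset: list, result_path: Path, number_of_processes: int = 1) -> list:
        """Return a processed copy of the dataset (here: a plain list of examples)."""


class DropEmptyExamples(SketchPreprocessor):
    """Hypothetical filter that removes empty examples, so the example count may change."""

    def keeps_example_count(self) -> bool:
        return False

    def process_dataset(self, dataset: list, result_path: Path, number_of_processes: int = 1) -> list:
        return [example for example in dataset if len(example) > 0]


def can_run_with_train_ml_model(preprocessing_sequence: list) -> bool:
    # A sequence is compatible only if every step keeps the example count.
    return all(step.keeps_example_count() for step in preprocessing_sequence)
```

In this sketch, a preprocessing sequence containing DropEmptyExamples would be rejected by can_run_with_train_ml_model, which mirrors the restriction described for keeps_example_count.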

immuneML.preprocessing.ReferenceSequenceAnnotator module

class immuneML.preprocessing.ReferenceSequenceAnnotator.ReferenceSequenceAnnotator(reference_sequences: List[ReceptorSequence], max_edit_distance: int, compairr_path: str, ignore_genes: bool, threads: int, output_column_name: str, repertoire_batch_size: int)[source]

Bases: Preprocessor

Annotates each sequence in each repertoire with whether it matches any of the reference sequences provided as an input parameter. This preprocessing uses CompAIRR internally. To match CDR3 sequences (and not JUNCTION), CompAIRR v1.10 or later is needed.

Parameters:
  • reference_sequences (dict) – A dictionary describing the reference dataset file. Import should be specified the same way as regular dataset import. It is only allowed to import a receptor dataset here (i.e., is_repertoire is False and paired is True by default, and these are not allowed to be changed).

  • max_edit_distance (int) – The maximum edit distance between a target sequence (from the repertoire) and the reference sequence.

  • compairr_path (str) – Optional path to the CompAIRR executable. If not given, it is assumed that CompAIRR has been installed such that it can be called directly on the command line with the command ‘compairr’, or that it is located at /usr/local/bin/compairr.

  • threads (int) – The number of threads CompAIRR uses for sequence matching.

  • ignore_genes (bool) – Whether to ignore V and J gene information. If False, the V and J genes between two receptor chains have to match. If True, gene information is ignored. By default, ignore_genes is False.

  • output_column_name (str) – The name of the column in the output repertoire files that holds this specific annotation. Setting it is useful when there are multiple annotations.

  • repertoire_batch_size (int) – The number of repertoires to process simultaneously. Depending on the repertoire size, this parameter can be used to limit memory usage.
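How max_edit_distance, ignore_genes and output_column_name interact can be illustrated with a small pure-Python sketch. This is not how immuneML or CompAIRR implement matching: Seq, hamming, matches_reference and annotate are hypothetical names, Hamming distance on equal-length CDR3 strings is used as a simplified stand-in for the edit distance, and writing a boolean into the output column is an assumption.

```python
from typing import List, NamedTuple, Optional


class Seq(NamedTuple):
    """Hypothetical minimal sequence record: CDR3 amino acids plus V and J gene names."""
    cdr3: str
    v_gene: str
    j_gene: str


def hamming(a: str, b: str) -> Optional[int]:
    # Simplification: only equal-length strings are comparable; otherwise no distance.
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def matches_reference(target: Seq, references: List[Seq],
                      max_edit_distance: int = 0, ignore_genes: bool = False) -> bool:
    for ref in references:
        # With ignore_genes=False, both V and J genes must be identical.
        if not ignore_genes and (target.v_gene, target.j_gene) != (ref.v_gene, ref.j_gene):
            continue
        distance = hamming(target.cdr3, ref.cdr3)
        if distance is not None and distance <= max_edit_distance:
            return True
    return False


def annotate(repertoire: List[Seq], references: List[Seq], max_edit_distance: int = 0,
             ignore_genes: bool = False, output_column_name: str = "matched") -> List[dict]:
    # Each sequence becomes a row with one extra column holding the match result.
    return [
        {**seq._asdict(),
         output_column_name: matches_reference(seq, references, max_edit_distance, ignore_genes)}
        for seq in repertoire
    ]
```

For example, with a single reference Seq("CASSLGTF", "TRBV7-9", "TRBJ2-1"), a repertoire sequence that differs in one CDR3 position is annotated as a match only when max_edit_distance is at least 1, and a sequence with a different V gene is annotated as a match only when ignore_genes is True.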

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - step1:
            ReferenceSequenceAnnotator:
                reference_sequences:
                    format: VDJDB
                    params:
                        path: path/to/file.csv
                compairr_path: optional/path/to/compairr
                ignore_genes: False
                max_edit_distance: 0
                output_column_name: matched
                threads: 4
                repertoire_batch_size: 5

classmethod build_object(**kwargs)[source]
process_dataset(dataset: RepertoireDataset, result_path: Path, number_of_processes=1) → RepertoireDataset[source]

immuneML.preprocessing.SubjectRepertoireCollector module

class immuneML.preprocessing.SubjectRepertoireCollector.SubjectRepertoireCollector(result_path: Path = None)[source]

Bases: Preprocessor

Merges all the Repertoires in a RepertoireDataset that have the same ‘subject_id’ specified in the metadata. The result is a RepertoireDataset with one Repertoire per subject. This preprocessing cannot be used in combination with the TrainMLModel instruction because it can change the number of examples. To combine the repertoires in this way, use this preprocessing with the DatasetExport instruction.
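The merging step can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions, not the immuneML implementation: each repertoire is represented as a dictionary with a "subject_id" and a list of "sequences", and collect_by_subject is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, List


def collect_by_subject(repertoires: List[dict]) -> List[dict]:
    """Merge repertoires that share a subject_id into one repertoire per subject."""
    merged: Dict[str, list] = defaultdict(list)
    for repertoire in repertoires:
        # Sequences of the same subject are concatenated in input order.
        merged[repertoire["subject_id"]].extend(repertoire["sequences"])
    return [{"subject_id": subject_id, "sequences": sequences}
            for subject_id, sequences in merged.items()]
```

Because several input repertoires can collapse into one, the output can contain fewer examples than the input, which is why this kind of step is incompatible with instructions that require a fixed example count.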

YAML specification:

preprocessing_sequences:
    my_preprocessing:
        - my_filter: SubjectRepertoireCollector

keeps_example_count() → bool[source]

Defines whether the preprocessing can be run with the TrainMLModel instruction. To be compatible with it, the preprocessing must not change the number of examples in the dataset.

process_dataset(dataset: RepertoireDataset, result_path: Path, number_of_processes=1)[source]

Module contents