immuneML.preprocessing package
Subpackages
- immuneML.preprocessing.filters package
- Submodules
- immuneML.preprocessing.filters.ChainRepertoireFilter module
- immuneML.preprocessing.filters.ClonesPerRepertoireFilter module
- immuneML.preprocessing.filters.CountAggregationFunction module
- immuneML.preprocessing.filters.CountPerSequenceFilter module
- immuneML.preprocessing.filters.DuplicateSequenceFilter module
- immuneML.preprocessing.filters.Filter module
- immuneML.preprocessing.filters.MetadataRepertoireFilter module
- immuneML.preprocessing.filters.SequenceLengthFilter module
- Module contents
Submodules
immuneML.preprocessing.Preprocessor module
- class immuneML.preprocessing.Preprocessor.Preprocessor(result_path: Path = None)
Bases: object
immuneML.preprocessing.ReferenceSequenceAnnotator module
- class immuneML.preprocessing.ReferenceSequenceAnnotator.ReferenceSequenceAnnotator(reference_sequences: List[ReceptorSequence], max_edit_distance: int, compairr_path: str, ignore_genes: bool, threads: int, output_column_name: str, repertoire_batch_size: int, region_type: RegionType)
Bases: Preprocessor
Annotates each sequence in each repertoire with whether it matches any of the reference sequences provided as an input parameter. This preprocessing uses CompAIRR internally. To match CDR3 sequences (and not JUNCTION), CompAIRR v1.10 or later is needed.
Specification arguments:
reference_sequences (dict): A dictionary describing the reference dataset file. Import should be specified the same way as regular dataset import. It is only allowed to import a receptor dataset here (i.e., is_repertoire is False and paired is True by default, and these are not allowed to be changed).
max_edit_distance (int): The maximum edit distance between a target sequence (from the repertoire) and the reference sequence.
compairr_path (str): optional path to the CompAIRR executable. If not given, it is assumed that CompAIRR has been installed such that it can be called directly on the command line with the command ‘compairr’, or that it is located at /usr/local/bin/compairr.
threads (int): the number of threads CompAIRR uses for sequence matching.
ignore_genes (bool): Whether to ignore V and J gene information. If False, the V and J genes between two receptor chains have to match. If True, gene information is ignored. By default, ignore_genes is False.
output_column_name (str): the name of the column in the output repertoire files that stores this specific annotation; setting it is useful when there are multiple annotations.
repertoire_batch_size (int): how many repertoires to process simultaneously; depending on the repertoire size, this parameter can be used to limit memory usage.
region_type (str): which region type to check, default is IMGT_CDR3
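The lookup order described for compairr_path can be sketched as follows. This is a minimal illustration, not immuneML's own code: the helper name resolve_compairr_path and the constant DEFAULT_COMPAIRR_PATH are hypothetical, and only the command name and the fallback location come from the description above.

```python
import shutil
from typing import Optional

# Fallback location named in the argument description above.
DEFAULT_COMPAIRR_PATH = "/usr/local/bin/compairr"


def resolve_compairr_path(compairr_path: Optional[str] = None) -> str:
    """Hypothetical helper: decide which CompAIRR executable to call."""
    # 1. An explicitly configured path takes precedence.
    if compairr_path is not None:
        return compairr_path
    # 2. Otherwise look for a 'compairr' command on the PATH.
    found = shutil.which("compairr")
    if found is not None:
        return found
    # 3. Otherwise fall back to the default install location.
    return DEFAULT_COMPAIRR_PATH
```

The sketch does not check that the returned path exists or that the installed CompAIRR version is recent enough for CDR3 matching.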
YAML specification:
preprocessing_sequences:
    my_preprocessing:
        - step1:
            ReferenceSequenceAnnotator:
                reference_sequences:
                    format: VDJDB
                    params:
                        path: path/to/file.csv
                compairr_path: optional/path/to/compairr
                ignore_genes: False
                max_edit_distance: 0
                output_column_name: matched
                threads: 4
                repertoire_batch_size: 5
                region_type: IMGT_CDR3
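To make the roles of max_edit_distance and ignore_genes concrete, the following pure-Python sketch annotates sequences against a reference list. It is an illustration only: the actual matching is delegated to CompAIRR, whose distance definition may differ from the Levenshtein distance used here. The dictionary keys (cdr3, v_gene, j_gene), the function names, and the boolean return values are all assumptions made for this example.

```python
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def annotate(sequences: List[Dict[str, str]],
             references: List[Dict[str, str]],
             max_edit_distance: int = 0,
             ignore_genes: bool = False) -> List[bool]:
    """Return one flag per sequence: True if it matches any reference."""
    flags = []
    for seq in sequences:
        matched = False
        for ref in references:
            # With ignore_genes=False, both V and J genes must agree.
            genes_ok = ignore_genes or (
                seq["v_gene"] == ref["v_gene"] and seq["j_gene"] == ref["j_gene"]
            )
            if genes_ok and edit_distance(seq["cdr3"], ref["cdr3"]) <= max_edit_distance:
                matched = True
                break
        flags.append(matched)
    return flags
```

In this sketch, the returned flags play the role of the column named by output_column_name. The comparison of every sequence with every reference is quadratic, which is one reason a dedicated tool such as CompAIRR is used in practice.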
immuneML.preprocessing.SubjectRepertoireCollector module
- class immuneML.preprocessing.SubjectRepertoireCollector.SubjectRepertoireCollector(result_path: Path = None)
Bases: Preprocessor
Merges all the Repertoires in a RepertoireDataset that have the same ‘subject_id’ specified in the metadata. The result is a RepertoireDataset with one Repertoire per subject. This preprocessing cannot be used in combination with the TrainMLModel instruction because it can change the number of examples. To combine the repertoires in this way, use this preprocessing with the DatasetExport instruction.
YAML specification:
preprocessing_sequences:
    my_preprocessing:
        - my_filter: SubjectRepertoireCollector
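The merge itself can be pictured with plain dictionaries. The sketch below is not immuneML's implementation: it assumes each repertoire is a dict with a subject_id and a list of sequences, and the function name collect_by_subject is hypothetical. How other metadata fields are combined is not specified above, so the sketch leaves them out.

```python
from collections import defaultdict
from typing import Dict, List


def collect_by_subject(repertoires: List[Dict]) -> List[Dict]:
    """Merge repertoires that share a subject_id into one repertoire each."""
    sequences_per_subject = defaultdict(list)
    for repertoire in repertoires:
        # Append this repertoire's sequences to those already seen for the subject.
        sequences_per_subject[repertoire["subject_id"]].extend(repertoire["sequences"])
    # One merged repertoire per subject, in order of first appearance.
    return [
        {"subject_id": subject_id, "sequences": sequences}
        for subject_id, sequences in sequences_per_subject.items()
    ]
```

Three input repertoires from two subjects produce two merged repertoires, which shows why the number of examples can change and why the step is unsuitable before model training.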