How to add a new preprocessing¶
Preprocessings are applied to modify a dataset before encoding the data, for example, removing certain sequences from a repertoire. In immuneML, the sequence of preprocessing steps applied to a given dataset before training an ML model is considered a hyperparameter that can be optimized using nested cross validation.
Adding an example preprocessor to the immuneML codebase¶
This tutorial describes how to add a new Preprocessor
class to immuneML,
using a simple example preprocessor. We highly recommend completing this tutorial to get a better understanding of the immuneML
interfaces before continuing to implement your own preprocessor.
Step-by-step tutorial¶
For this tutorial, we provide a SillyFilter
(download here
or view below)
to practice adding a new Preprocessor file to immuneML. This preprocessor acts as a filter: it randomly selects
a subset of repertoires to keep.
SillyFilter.py
import random
from pathlib import Path

from immuneML.data_model.dataset.Dataset import Dataset
from immuneML.data_model.dataset.RepertoireDataset import RepertoireDataset
from immuneML.preprocessing.filters.Filter import Filter
from immuneML.util.ParameterValidator import ParameterValidator
from immuneML.util.PathBuilder import PathBuilder


class SillyFilter(Filter):
    """
    This SillyFilter class is a placeholder for a real Preprocessor.
    It randomly selects a fraction of the repertoires to keep in the dataset.

    **Specification arguments:**

    - fraction_to_keep (float): The fraction of repertoires to keep

    **YAML specification:**

    .. indent with spaces
    .. code-block:: yaml

        definitions:
            preprocessing_sequences:
                my_preprocessing:
                    - step1:
                        SillyFilter:
                            fraction_to_keep: 0.8
    """

    def __init__(self, fraction_to_keep: float = None):
        super().__init__()
        self.fraction_to_keep = fraction_to_keep

    @classmethod
    def build_object(cls, **kwargs):
        # build_object is called early in the immuneML run, before the analysis takes place.
        # Its purpose is to fail early when a class is called incorrectly (checking parameters and dataset),
        # and provide user-friendly error messages.

        # ParameterValidator contains many utility functions for checking user parameters
        ParameterValidator.assert_type_and_value(kwargs['fraction_to_keep'], float,
                                                 SillyFilter.__name__, 'fraction_to_keep', min_inclusive=0)

        return SillyFilter(**kwargs)

    def process_dataset(self, dataset: RepertoireDataset, result_path: Path, number_of_processes=1) -> RepertoireDataset:
        self.result_path = PathBuilder.build(result_path if result_path is not None else self.result_path)

        # utility function to ensure that the dataset type is RepertoireDataset
        self.check_dataset_type(dataset, [RepertoireDataset], SillyFilter.__name__)

        processed_dataset = self._create_random_dataset_subset(dataset)

        # utility function to ensure the remaining dataset is not empty
        self.check_dataset_not_empty(processed_dataset, SillyFilter.__name__)

        return processed_dataset

    def _create_random_dataset_subset(self, dataset):
        # Select a random fraction of the repertoire indices, and use them to create a subset of the original dataset
        n_new_examples = round(dataset.get_example_count() * self.fraction_to_keep)
        new_example_indices = random.sample(range(dataset.get_example_count()), n_new_examples)

        preprocessed_dataset = dataset.make_subset(example_indices=new_example_indices,
                                                   path=self.result_path,
                                                   dataset_type=Dataset.SUBSAMPLED)
        return preprocessed_dataset

    def keeps_example_count(self):
        # Overwrite keeps_example_count to return False since some examples (repertoires) are removed
        return False
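To see the fail-early behaviour of build_object in action, the class can be called directly from a Python session. A minimal sketch (the exact exception type raised by ParameterValidator is an implementation detail, so it is caught broadly here):

from immuneML.preprocessing.filters.SillyFilter import SillyFilter

# valid parameters: returns a configured SillyFilter instance
silly_filter = SillyFilter.build_object(fraction_to_keep=0.8)

# invalid parameters (string instead of float): ParameterValidator raises an
# error with a user-friendly message before any analysis is run
try:
    SillyFilter.build_object(fraction_to_keep="0.8")
except Exception as error:
    print(f"build_object failed early: {error}")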
1. Add a new class to the filters package inside the preprocessing package. The new class should inherit from the base class Filter. A filter is a special category of preprocessors which removes examples (repertoires) from the dataset. Other preprocessors, which for example only annotate the dataset, should be placed directly inside the preprocessing package and inherit from the Preprocessor class instead.

2. If the preprocessor has any default parameters, they should be added in a default parameters YAML file. This file should be added to the folder config/default_params/preprocessing. The default parameters file is automatically discovered based on the name of the class, converted to snake case and with the suffix '_params.yaml' appended. For the SillyFilter, this is silly_filter_params.yaml, which could for example contain the following:

fraction_to_keep: 0.8

In rare cases where classes have unconventional names that do not translate well to CamelCase (e.g., MiXCR, VDJdb), this needs to be accounted for in convert_to_snake_case().
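As a rough illustration of the expected mapping (this is not the actual immuneML implementation), the conversion could look like this:

import re

def to_snake_case(class_name: str) -> str:
    # insert an underscore before each capital letter that starts a new word, then lowercase
    return re.sub(r'(?<!^)(?=[A-Z][a-z])|(?<=[a-z])(?=[A-Z])', '_', class_name).lower()

print(to_snake_case("SillyFilter"))  # silly_filter
print(to_snake_case("MiXCR"))        # mi_xcr: unconventional names need special handling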
Test running the new preprocessing with a YAML specification¶
If you want to use immuneML directly to test-run your preprocessor, the YAML example below may be used.
This example analysis creates a randomly generated dataset, applies the SillyFilter, and
runs the SimpleDatasetOverview report on the preprocessed dataset to inspect the results.
test_run_silly_filter.yaml
definitions:
  datasets:
    my_dataset:
      format: RandomRepertoireDataset
      params:
        repertoire_count: 100
  preprocessing_sequences:
    my_preprocessing:
      - step1:
          SillyFilter:
            fraction_to_keep: 0.8
  reports:
    simple_overview: SimpleDatasetOverview
instructions:
  exploratory_instr:
    type: ExploratoryAnalysis
    analyses:
      analysis_1:
        dataset: my_dataset
        preprocessing_sequence: my_preprocessing
        report: simple_overview
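The specification can then be run with the immune-ml command-line tool, or from Python via the ImmuneMLApp entry point that the tool wraps. The snippet below sketches the Python route; the paths are placeholders, and the constructor arguments reflect our reading of the immuneML codebase:

from pathlib import Path
from immuneML.app.ImmuneMLApp import ImmuneMLApp

# equivalent to: immune-ml test_run_silly_filter.yaml silly_filter_output/
app = ImmuneMLApp(Path("test_run_silly_filter.yaml"), Path("silly_filter_output/"))
app.run()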
Adding a Unit test for a Preprocessing¶
Add a unit test for the new SillyFilter
(download the example test file or view below):
test_sillyFilter.py
import os
import shutil
from unittest import TestCase

from immuneML.caching.CacheType import CacheType
from immuneML.environment.Constants import Constants
from immuneML.environment.EnvironmentSettings import EnvironmentSettings
from immuneML.preprocessing.filters.SillyFilter import SillyFilter
from immuneML.simulation.dataset_generation.RandomDatasetGenerator import RandomDatasetGenerator


class TestSillyFilter(TestCase):

    def setUp(self) -> None:
        os.environ[Constants.CACHE_TYPE] = CacheType.TEST.name

    def _get_mock_repertoire_dataset(self, path):
        # Create a mock RepertoireDataset with 10 repertoires, each containing 50 sequences of length 15
        dataset = RandomDatasetGenerator.generate_repertoire_dataset(repertoire_count=10,
                                                                     sequence_count_probabilities={50: 1},
                                                                     sequence_length_probabilities={15: 1},
                                                                     labels={},
                                                                     path=path)
        return dataset

    def test_process_dataset(self):
        tmp_path = EnvironmentSettings.tmp_test_path / "silly_filter/"
        dataset = self._get_mock_repertoire_dataset(tmp_path / "original_dataset")

        params = {"fraction_to_keep": 0.8}
        filter = SillyFilter.build_object(**params)

        processed_dataset = filter.process_dataset(dataset, tmp_path / "filtered_dataset")

        # 10 original repertoires, keep 80%
        assert len(processed_dataset.repertoires) == 8

        shutil.rmtree(tmp_path)
1. Add a new file to the test.preprocessing.filters package named test_sillyFilter.py.

2. Add a class TestSillyFilter that inherits unittest.TestCase to the new file.

3. Add a function setUp() to set up the cache used for testing. This should ensure that the cache location will be set to EnvironmentSettings.tmp_test_path / "cache/".

4. Define one or more tests for the class and functions you implemented.

- It is recommended to at least test building the Preprocessor and running the preprocessing.
- Mock data is typically used to test new classes. Tip: the RandomDatasetGenerator class can be used to generate Repertoire, Sequence or Receptor datasets with random sequences (a standalone example follows this list).
- If you need to write data to a path (for example, test datasets or results), use the following location: EnvironmentSettings.tmp_test_path / "some_unique_foldername".
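The standalone example referenced above: the generator can also be called outside a TestCase, with the probability dictionaries mapping values to their sampling probabilities (the 'disease' label here is purely illustrative):

from immuneML.environment.EnvironmentSettings import EnvironmentSettings
from immuneML.simulation.dataset_generation.RandomDatasetGenerator import RandomDatasetGenerator

path = EnvironmentSettings.tmp_test_path / "random_data_playground"

# 10 repertoires; each contains 40 or 60 sequences (50/50 chance) of length 15,
# with an illustrative binary 'disease' label assigned at random
dataset = RandomDatasetGenerator.generate_repertoire_dataset(repertoire_count=10,
                                                             sequence_count_probabilities={40: 0.5, 60: 0.5},
                                                             sequence_length_probabilities={15: 1},
                                                             labels={"disease": {True: 0.5, False: 0.5}},
                                                             path=path)

print(dataset.get_example_count())  # 10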
Implementing a new Preprocessor¶
This section describes tips and tricks for implementing your own new Preprocessor
from scratch.
Detailed instructions on how to implement each method, as well as some special cases, can be found in the
Preprocessor
base class.
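Based on the methods implemented by the SillyFilter above, a skeleton for a new (non-filter) preprocessor typically looks as follows; the class name is hypothetical, and the Preprocessor base class remains the authoritative reference for the abstract methods:

from pathlib import Path

from immuneML.data_model.dataset.Dataset import Dataset
from immuneML.preprocessing.Preprocessor import Preprocessor


class MyNewPreprocessor(Preprocessor):  # hypothetical example class

    @classmethod
    def build_object(cls, **kwargs):
        # validate user-supplied parameters here (e.g., with ParameterValidator) and fail early
        return cls(**kwargs)

    def process_dataset(self, dataset: Dataset, result_path: Path, number_of_processes=1) -> Dataset:
        # clone the dataset, modify the clone, and return it; never modify the input dataset
        raise NotImplementedError

    def keeps_example_count(self) -> bool:
        # return False only if the preprocessor removes examples from the dataset
        return True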
Note

Coding conventions and tips

- Class names are written in CamelCase.
- Class methods are written in snake_case.
- Abstract base classes MLMethod, DatasetEncoder, and Report define an interface for their inheriting subclasses. These classes contain abstract methods which should be overwritten.
- Class methods starting with _underscore are generally considered "private" methods, only to be called by the class itself. If a method is expected to be called from another class, the method name should not start with an underscore.
- When familiarising yourself with existing code, we recommend focusing on public methods. Private methods are typically very specific to a class (internal class-specific calculations), whereas public methods contain more general functionalities (e.g., returning a main result).
- If your class should have any default parameters, they should be defined in a default parameters file under config/default_params/.
- Some utility classes are available in the util package to provide useful functionalities. For example, ParameterValidator can be used to check user input and generate error messages, and PathBuilder can be used to add and remove folders (see the sketch below).
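For instance, a minimal sketch of these two utilities in use, mirroring how the SillyFilter calls them (the preprocessor name and paths are illustrative):

from pathlib import Path

from immuneML.util.ParameterValidator import ParameterValidator
from immuneML.util.PathBuilder import PathBuilder

kwargs = {"fraction_to_keep": 0.8}

# raises a descriptive error if the value has the wrong type or is negative
ParameterValidator.assert_type_and_value(kwargs["fraction_to_keep"], float,
                                         "MyNewPreprocessor", "fraction_to_keep",
                                         min_inclusive=0)

# creates the folder if it does not exist yet, and returns the path
result_path = PathBuilder.build(Path("my_results/preprocessed/"))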
Implementing the process() method in a new preprocessor class¶
The main functionality of the preprocessor class is implemented in its process(dataset, params)
method.
This method takes in a dataset, modifies the dataset according to the given instructions, and returns
the new modified dataset.
When implementing the process(dataset, params)
method, take the following points into account:
- The method takes in the argument params, which is a dictionary containing any relevant parameters. One of these parameters is the result path params["result_path"], which should be used as the location to store the metadata file of a new repertoire dataset.
- Check if the given dataset is of the correct dataset type, for example by using the static method check_dataset_type(dataset, valid_dataset_types, location). Some preprocessings are only sensible for a given type of dataset. Datasets can be of the type RepertoireDataset, SequenceDataset and ReceptorDataset (see: immuneML data model).
- Do not modify the given dataset object; create a clone instead.
- When your preprocessor is a filter (i.e., when it removes sequences or repertoires from the dataset), extra precautions need to be taken to ensure that the dataset does not contain empty repertoires and that the entries in the metadata file match the new dataset. The utility functions provided by the Filter class can be useful for this (a sketch combining these points follows this list):
  - remove_empty_repertoires(repertoires) checks whether any of the provided repertoires are empty (this might happen when filtering out sequences based on strict criteria), and returns a list containing only non-empty repertoires.
  - check_dataset_not_empty(processed_dataset, location) checks whether there is still any data left in the dataset. If all sequences or repertoires were removed by filtering, an error will be thrown.
  - build_new_metadata(dataset, indices_to_keep, result_path) creates a new metadata file based on a subset of the existing metadata file. When removing repertoires from a repertoire dataset, this method should always be applied.
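A minimal sketch of a filter-style process() method tying these points together. The class name and the size criterion are invented for illustration, and the assumption that a Repertoire exposes its sequence count via get_element_count() (as well as direct access to the repertoires and metadata_file attributes) should be checked against the immuneML data model:

import copy
from pathlib import Path

from immuneML.data_model.dataset.RepertoireDataset import RepertoireDataset
from immuneML.preprocessing.filters.Filter import Filter


class TinyRepertoireFilter(Filter):  # hypothetical example class
    """Removes repertoires with fewer than 10 sequences from the dataset."""

    @staticmethod
    def process(dataset: RepertoireDataset, params: dict) -> RepertoireDataset:
        # 1. fail fast if the dataset type does not fit this preprocessing
        TinyRepertoireFilter.check_dataset_type(dataset, [RepertoireDataset], TinyRepertoireFilter.__name__)

        # 2. work on a clone, never on the original dataset
        processed_dataset = copy.deepcopy(dataset)

        # 3. keep the indices of repertoires that satisfy the criterion
        #    (assumes Repertoire exposes its sequence count via get_element_count())
        indices = [i for i, repertoire in enumerate(processed_dataset.repertoires)
                   if repertoire.get_element_count() >= 10]
        processed_dataset.repertoires = [processed_dataset.repertoires[i] for i in indices]

        # 4. rebuild the metadata file so it matches the remaining repertoires,
        #    storing it under params["result_path"]
        processed_dataset.metadata_file = TinyRepertoireFilter.build_new_metadata(processed_dataset, indices,
                                                                                  Path(params["result_path"]))

        # 5. fail with a clear error if filtering removed everything
        TinyRepertoireFilter.check_dataset_not_empty(processed_dataset, TinyRepertoireFilter.__name__)

        return processed_dataset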
Class documentation standards for preprocessors¶
Class documentation should be added as a docstring to all new Encoder, MLMethod, Report or Preprocessing classes. The class docstrings are used to automatically generate the documentation web pages, using Sphinx reStructuredText, and should adhere to a standard format:
A short, general description of the functionality
Optional extended description, including any references or specific cases that should be considered. For instance: if a class can only be used for a particular dataset type. Compatibility between Encoders, MLMethods and Reports should also be described.
For encoders, the appropriate dataset type(s). For example:
**Dataset type:**

- SequenceDatasets
- RepertoireDatasets
A list of arguments, when applicable. This should follow the format below:
**Specification arguments:**

- parameter_name (type): a short description
- other_parameter_name (type): a short description
A YAML snippet, to show an example of how the new component should be called. Make sure to test your YAML snippet in an immuneML run to ensure it is specified correctly. The following formatting should be used to ensure the YAML snippet is rendered correctly:
**YAML specification:**

.. indent with spaces
.. code-block:: yaml

    definitions:
        yaml_keyword: # could be encodings/ml_methods/reports/etc...
            my_new_class:
                MyNewClass:
                    parameter_name: 0
                    other_parameter_name: 1
Below is a full example of Preprocessor class documentation:
This SillyFilter class is a placeholder for a real Preprocessor.
It randomly selects a fraction of the repertoires to keep in the dataset.

**Specification arguments:**

- fraction_to_keep (float): The fraction of repertoires to keep

**YAML specification:**

.. indent with spaces
.. code-block:: yaml

    definitions:
        preprocessing_sequences:
            my_preprocessing:
                - step1:
                    SillyFilter:
                        fraction_to_keep: 0.8