How to add a new preprocessing#

Preprocessings are applied to modify a dataset before encoding the data, for example, removing certain sequences from a repertoire. In immuneML, the sequence of preprocessing steps applied to a given dataset before training an ML model is considered a hyperparameter that can be optimized using nested cross validation.

Adding an example preprocessor to the immuneML codebase#

This tutorial describes how to add a new Preprocessor class to immuneML, using a simple example preprocessor. We highly recommend completing this tutorial to get a better understanding of the immuneML interfaces before continuing to implement your own preprocessor.

Step-by-step tutorial#

For this tutorial, we provide a SillyFilter (download here or view below) to practice adding a new Preprocessor to immuneML. This preprocessor acts as a filter that randomly selects a subset of repertoires to keep.

SillyFilter.py
import random
from pathlib import Path

from immuneML.data_model.dataset.Dataset import Dataset
from immuneML.data_model.dataset.RepertoireDataset import RepertoireDataset
from immuneML.preprocessing.filters.Filter import Filter
from immuneML.util.ParameterValidator import ParameterValidator
from immuneML.util.PathBuilder import PathBuilder


class SillyFilter(Filter):
    """
    This SillyFilter class is a placeholder for a real Preprocessor.
    It randomly selects a fraction of the repertoires to keep in the dataset.


    **Specification arguments:**

    - fraction_to_keep (float): The fraction of repertoires to keep


    **YAML specification:**

    .. indent with spaces
    .. code-block:: yaml

        definitions:
            preprocessing_sequences:
                my_preprocessing:
                    - step1:
                        SillyFilter:
                            fraction_to_keep: 0.8

    """

    def __init__(self, fraction_to_keep: float = None):
        super().__init__()
        self.fraction_to_keep = fraction_to_keep

    @classmethod
    def build_object(cls, **kwargs):
        # build_object is called early in the immuneML run, before the analysis takes place.
        # Its purpose is to fail early when a class is called incorrectly (checking parameters and dataset),
        # and provide user-friendly error messages.

        # ParameterValidator contains many utility functions for checking user parameters
        ParameterValidator.assert_type_and_value(kwargs['fraction_to_keep'], float, SillyFilter.__name__, 'fraction_to_keep', min_inclusive=0)

        return SillyFilter(**kwargs)

    def process_dataset(self, dataset: RepertoireDataset, result_path: Path, number_of_processes=1) -> RepertoireDataset:
        self.result_path = PathBuilder.build(result_path if result_path is not None else self.result_path)

        # utility function to ensure that the dataset type is RepertoireDataset
        self.check_dataset_type(dataset, [RepertoireDataset], SillyFilter.__name__)

        processed_dataset = self._create_random_dataset_subset(dataset)

        # utility function to ensure the remaining dataset is not empty
        self.check_dataset_not_empty(processed_dataset, SillyFilter.__name__)

        return processed_dataset

    def _create_random_dataset_subset(self, dataset):
        # select a random subset of example indices, and use it to create a subset of the original dataset
        n_new_examples = round(dataset.get_example_count() * self.fraction_to_keep)
        new_example_indices = random.sample(range(dataset.get_example_count()), n_new_examples)

        preprocessed_dataset = dataset.make_subset(example_indices=new_example_indices,
                                                   path=self.result_path,
                                                   dataset_type=Dataset.SUBSAMPLED)

        return preprocessed_dataset

    def keeps_example_count(self):
        # Overwrite keeps_example_count to return False since some examples (repertoires) are removed
        return False

  1. Add a new class to the filters package inside the preprocessing package. The new class should inherit from the base class Filter. A filter is a special category of preprocessors which removes examples (repertoires) from the dataset. Other preprocessors, which for example just annotate the dataset, should be placed directly inside the preprocessing package and inherit the Preprocessor class instead.

  2. If the preprocessor has any default parameters, they should be defined in a default parameters YAML file. This file should be added to the folder config/default_params/preprocessing. The default parameters file is automatically discovered based on the class name converted to snake_case, with the suffix ‘_params.yaml’ appended. For the SillyFilter, this is silly_filter_params.yaml, which could for example contain the following:

    fraction_to_keep: 0.8
    

    In rare cases where class names do not follow standard CamelCase conventions and therefore do not translate well to snake_case (e.g., MiXCR, VDJdb), this needs to be accounted for in convert_to_snake_case(), as illustrated in the sketch below.
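
    For illustration, a minimal sketch of such a conversion is shown below. This is a stand-in, not the exact immuneML implementation; refer to convert_to_snake_case() in the codebase for the authoritative version:

        import re

        def convert_to_snake_case(name: str) -> str:
            # insert an underscore before every uppercase letter that follows
            # a lowercase letter or digit, then lowercase the whole string
            return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

        assert convert_to_snake_case("SillyFilter") == "silly_filter"

        # names that do not follow CamelCase conventions require special-casing:
        # convert_to_snake_case("MiXCR") returns "mi_xcr" rather than the desired "mixcr"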

Test running the new preprocessing with a YAML specification#

If you want to use immuneML directly to test run your preprocessor, the YAML example below may be used. This example analysis creates a randomly generated dataset, runs the SillyFilter, and runs the SimpleDatasetOverview report on the preprocessed dataset to inspect the results.

test_run_silly_filter.yaml
definitions:
  datasets:
    my_dataset:
      format: RandomRepertoireDataset
      params:
        repertoire_count: 100

  preprocessing_sequences:
    my_preprocessing:
    - step1:
        SillyFilter:
          fraction_to_keep: 0.8

  reports:
    simple_overview: SimpleDatasetOverview


instructions:
  exploratory_instr:
    type: ExploratoryAnalysis
    analyses:
      analysis_1:
        dataset: my_dataset
        preprocessing_sequence: my_preprocessing
        report: simple_overview

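To launch this test run programmatically, the sketch below may be used. It assumes the ImmuneMLApp entry point accepts the specification path and result path, as in recent immuneML versions; the file and folder names are illustrative:

run_silly_filter.py
from pathlib import Path

from immuneML.app.ImmuneMLApp import ImmuneMLApp

# run the YAML specification above; the result path should not exist yet,
# since immuneML creates it and writes all analysis outputs there
app = ImmuneMLApp(specification_path=Path("test_run_silly_filter.yaml"),
                  result_path=Path("silly_filter_results/"))
app.run()

Equivalently, the run can be started from the command line with: immune-ml test_run_silly_filter.yaml silly_filter_results/
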
Adding a unit test for a Preprocessor#

Add a unit test for the new SillyFilter (download the example test file or view below).

test_sillyFilter.py
import os
import shutil
from unittest import TestCase

from immuneML.caching.CacheType import CacheType
from immuneML.environment.Constants import Constants
from immuneML.environment.EnvironmentSettings import EnvironmentSettings
from immuneML.preprocessing.filters.SillyFilter import SillyFilter
from immuneML.simulation.dataset_generation.RandomDatasetGenerator import RandomDatasetGenerator


class TestSillyFilter(TestCase):

    def setUp(self) -> None:
        os.environ[Constants.CACHE_TYPE] = CacheType.TEST.name

    def _get_mock_repertoire_dataset(self, path):
        # Create a mock RepertoireDataset with 10 repertoires, each containing 50 sequences of length 15
        dataset = RandomDatasetGenerator.generate_repertoire_dataset(repertoire_count=10,
                                                                     sequence_count_probabilities={50: 1},
                                                                     sequence_length_probabilities={15: 1},
                                                                     labels={},
                                                                     path=path)

        return dataset

    def test_process_dataset(self):
        tmp_path = EnvironmentSettings.tmp_test_path / "silly_filter/"

        dataset = self._get_mock_repertoire_dataset(tmp_path / "original_dataset")

        params = {"fraction_to_keep": 0.8}
        silly_filter = SillyFilter.build_object(**params)

        processed_dataset = silly_filter.process_dataset(dataset, tmp_path / "filtered_dataset")

        # 10 original repertoires, keep 80%
        assert len(processed_dataset.repertoires) == 8

        shutil.rmtree(tmp_path)

  1. Add a new file to the test.preprocessing.filters package named test_sillyFilter.py.

  2. Add a class TestSillyFilter that inherits unittest.TestCase to the new file.

  3. Add a function setUp() to set up the cache used for testing. This should ensure that the cache location is set to EnvironmentSettings.tmp_test_path / "cache/".

  4. Define one or more tests for the class and functions you implemented.

    • It is recommended to at least test building the Preprocessor and running the preprocessing

    • Mock data is typically used to test new classes. Tip: the RandomDatasetGenerator class can be used to generate Repertoire, Sequence or Receptor datasets with random sequences (see the sketch after this list).

    • If you need to write data to a path (for example test datasets or results), use the following location: EnvironmentSettings.tmp_test_path / "some_unique_foldername"
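
    For example, a minimal sketch of generating a labelled mock repertoire dataset (the label name and probabilities are illustrative; the arguments follow the generator signature used in the unit test above):

        from immuneML.environment.EnvironmentSettings import EnvironmentSettings
        from immuneML.simulation.dataset_generation.RandomDatasetGenerator import RandomDatasetGenerator

        # generate 10 mock repertoires, each containing exactly 50 sequences of length 15;
        # every repertoire gets the label 'diseased' set to True or False with equal probability
        dataset = RandomDatasetGenerator.generate_repertoire_dataset(
            repertoire_count=10,
            sequence_count_probabilities={50: 1},
            sequence_length_probabilities={15: 1},
            labels={"diseased": {True: 0.5, False: 0.5}},
            path=EnvironmentSettings.tmp_test_path / "my_unique_test_folder")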

Implementing a new Preprocessor#

This section describes tips and tricks for implementing your own new Preprocessor from scratch. Detailed instructions on how to implement each method, as well as some special cases, can be found in the Preprocessor base class.

Note

Coding conventions and tips

  1. Class names are written in CamelCase

  2. Class methods are written in snake_case

  3. Abstract base classes, such as Preprocessor, MLMethod, DatasetEncoder and Report, define an interface for their inheriting subclasses. These classes contain abstract methods which should be overwritten.

  4. Class methods starting with an underscore are generally considered “private” methods, only to be called by the class itself. If a method is expected to be called from another class, the method name should not start with an underscore.

  5. When familiarising yourself with existing code, we recommend focusing on public methods. Private methods are typically very unique to a class (internal class-specific calculations), whereas the public methods contain more general functionalities (e.g., returning a main result).

  6. If your class should have any default parameters, they should be defined in a default parameters file under config/default_params/.

  7. Some utility classes are available in the util package to provide useful functionalities. For example, ParameterValidator can be used to check user input and generate error messages, or PathBuilder can be used to add and remove folders.
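
    For instance, a short sketch of how these utilities are typically used (the class and parameter names are illustrative; the calls mirror those in the SillyFilter example above):

        from pathlib import Path

        from immuneML.util.ParameterValidator import ParameterValidator
        from immuneML.util.PathBuilder import PathBuilder

        # fail early with a user-friendly error message if the user-supplied
        # parameter has the wrong type or an invalid value
        ParameterValidator.assert_type_and_value(0.8, float, "MyNewFilter", "fraction_to_keep", min_inclusive=0)

        # create the folder (including any missing parent folders) and return its path
        result_path = PathBuilder.build(Path("results/my_new_filter/"))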

Implementing the process() method in a new preprocessor class#

The main functionality of the preprocessor class is implemented in its process(dataset, params) method. This method takes in a dataset, applies the preprocessing described by the given parameters, and returns a new, modified dataset.

When implementing the process(dataset, params) method, take the following points into account (a combined sketch follows this list):

  • The method takes in the argument params, which is a dictionary containing any relevant parameters. One of these parameters is the result path params["result_path"] which should be used as the location to store the metadata file of a new repertoire dataset.

  • Check if the given dataset is the correct dataset type, for example by using the static method check_dataset_type(dataset, valid_dataset_types, location). Some preprocessings are only sensible for a given type of dataset. Datasets can be of the type RepertoireDataset, SequenceDataset and ReceptorDataset (see: immuneML data model).

  • Do not modify the given dataset object, but create a clone instead.

  • When your preprocessor is a filter (i.e., when it removes sequences or repertoires from the dataset), extra precautions need to be taken to ensure that the dataset does not contain empty repertoires and that the entries in the metadata file match the new dataset. The utility functions provided by the Filter class can be useful for this:

    • remove_empty_repertoires(repertoires) checks whether any of the provided repertoires are empty (this might happen when filtering out sequences based on strict criteria), and returns a list containing only non-empty repertoires.

    • check_dataset_not_empty(processed_dataset, location) checks whether there is still any data left in the dataset. If all sequences or repertoires were removed by filtering, an error will be thrown.

    • build_new_metadata(dataset, indices_to_keep, result_path) creates a new metadata file based on a subset of the existing metadata file. When removing repertoires from a repertoire dataset, this method should always be applied.
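
Putting these points together, a minimal sketch of a filter's process() method is shown below. MyNewFilter and its fraction_to_keep parameter are hypothetical, process() is assumed to be implemented as a static method receiving the params dictionary as described above, and the utility calls follow the signatures used in the SillyFilter example earlier in this tutorial:

import random

from immuneML.data_model.dataset.Dataset import Dataset
from immuneML.data_model.dataset.RepertoireDataset import RepertoireDataset
from immuneML.preprocessing.filters.Filter import Filter


class MyNewFilter(Filter):

    @staticmethod
    def process(dataset: RepertoireDataset, params: dict) -> RepertoireDataset:
        # this filter only makes sense for repertoire datasets
        Filter.check_dataset_type(dataset, [RepertoireDataset], MyNewFilter.__name__)

        # select a random subset of repertoire indices (illustrative filtering criterion);
        # make_subset returns a new dataset, so the original dataset object is not modified
        n_to_keep = round(dataset.get_example_count() * params["fraction_to_keep"])
        indices = random.sample(range(dataset.get_example_count()), n_to_keep)
        processed_dataset = dataset.make_subset(example_indices=indices,
                                                path=params["result_path"],
                                                dataset_type=Dataset.SUBSAMPLED)

        # ensure the filtering did not leave an empty dataset
        Filter.check_dataset_not_empty(processed_dataset, MyNewFilter.__name__)

        return processed_dataset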

Class documentation standards for preprocessors#

Class documentation should be added as a docstring to all new Encoder, MLMethod, Report or Preprocessor classes. The class docstrings are used to automatically generate the documentation web pages, using Sphinx reStructuredText, and should adhere to a standard format:

  1. A short, general description of the functionality

  2. An optional extended description, including any references or specific cases that should be considered. For instance: if a class can only be used for a particular dataset type. Compatibility between Encoders, MLMethods and Reports should also be described.

  3. A list of arguments, when applicable. This should follow the format below:

    **Specification arguments:**
    
    - parameter_name (type): a short description
    
    - other_parameter_name (type): a short description
    
  4. A YAML snippet, to show an example of how the new component should be called. Make sure to test your YAML snippet in an immuneML run to ensure it is specified correctly. The following formatting should be used to ensure the YAML snippet is rendered correctly:

    **YAML specification:**
    
    .. indent with spaces
    .. code-block:: yaml
    
        definitions:
            yaml_keyword: # could be encodings/ml_methods/reports/etc...
                my_new_class:
                    MyNewClass:
                        parameter_name: 0
                    other_parameter_name: 1
    
A full example of Preprocessor class documentation is shown below:
This SillyFilter class is a placeholder for a real Preprocessor.
It randomly selects a fraction of the repertoires to keep in the dataset.


**Specification arguments:**

- fraction_to_keep (float): The fraction of repertoires to keep


**YAML specification:**

.. indent with spaces
.. code-block:: yaml

    definitions:
        preprocessing_sequences:
            my_preprocessing:
                - step1:
                    SillyFilter:
                        fraction_to_keep: 0.8