How to add a new encoding

Adding an example encoder to the immuneML codebase

This tutorial describes how to add a new DatasetEncoder class to immuneML, using a simple example encoder. We highly recommend completing this tutorial to get a better understanding of the immuneML interfaces before continuing to implement your own encoder.

Step-by-step tutorial

For this tutorial, we provide a SillyEncoder (shown below) to practice adding a new encoder file to immuneML. This encoder ignores the data of the input examples and generates a few random features per example.

SillyEncoder.py
import numpy as np

from immuneML.data_model.dataset.ReceptorDataset import ReceptorDataset
from immuneML.data_model.dataset.RepertoireDataset import RepertoireDataset
from immuneML.data_model.dataset.SequenceDataset import SequenceDataset
from immuneML.data_model.encoded_data.EncodedData import EncodedData
from immuneML.encodings.DatasetEncoder import DatasetEncoder
from immuneML.encodings.EncoderParams import EncoderParams
from immuneML.data_model.dataset.Dataset import Dataset
from immuneML.util.EncoderHelper import EncoderHelper
from immuneML.util.ParameterValidator import ParameterValidator


class SillyEncoder(DatasetEncoder):
    """
    This SillyEncoder class is a placeholder for a real encoder.
    It computes a set of random numbers as features for a given dataset.

    **Dataset type:**

    - SequenceDatasets

    - ReceptorDatasets

    - RepertoireDatasets


    **Specification arguments:**

    - random_seed (int): The random seed for generating random features.

    - embedding_len (int): The number of random features to generate per example.


    **YAML specification:**

    .. indent with spaces
    .. code-block:: yaml

        definitions:
            encodings:
                my_silly_encoder:
                    Silly: # name of the class (without 'Encoder' suffix)
                        random_seed: 1
                        embedding_len: 5
    """

    def __init__(self, random_seed: int, embedding_len: int, name: str = None):
        # The encoder name contains the user-defined name for the encoder. It may be used by some reports.
        super().__init__(name=name)

        # All user parameters are set here.
        # Default parameters must not be defined in the Encoder class, but in a default parameters file.
        self.random_seed = random_seed
        self.embedding_len = embedding_len

    @staticmethod
    def build_object(dataset=None, **params):
        # build_object is called early in the immuneML run, before the analysis takes place.
        # Its purpose is to fail early when a class is called incorrectly (checking parameters and dataset),
        # and provide user-friendly error messages.

        # ParameterValidator contains many utility functions for checking user parameters
        ParameterValidator.assert_type_and_value(params['random_seed'], int, SillyEncoder.__name__, 'random_seed', min_inclusive=1)
        ParameterValidator.assert_type_and_value(params['embedding_len'], int, SillyEncoder.__name__, 'embedding_len', min_inclusive=1, max_inclusive=100)

        # An error should be thrown if the dataset type is incompatible with the Encoder.
        # If different sub-classes are defined for each dataset type (e.g., OneHotRepertoireEncoder),
        # an instance of the dataset-specific class must be returned here.
        if isinstance(dataset, SequenceDataset) or isinstance(dataset, ReceptorDataset) or isinstance(dataset, RepertoireDataset):
            return SillyEncoder(**params)
        else:
            raise ValueError("SillyEncoder is only defined for dataset types SequenceDataset, ReceptorDataset or RepertoireDataset")

    def encode(self, dataset, params: EncoderParams) -> Dataset:
        np.random.seed(self.random_seed)

        # Generate the design matrix from the sequence dataset
        encoded_examples = self._get_encoded_examples(dataset)

        # EncoderHelper contains some utility functions, including this function for encoding the labels
        labels = EncoderHelper.encode_dataset_labels(dataset, params.label_config, params.encode_labels)

        # Each feature is represented by some meaningful name
        feature_names = [f"random_number_{i}" for i in range(self.embedding_len)]

        encoded_data = EncodedData(examples=encoded_examples,
                                   labels=labels,
                                   example_ids=dataset.get_example_ids(),
                                   feature_names=feature_names,
                                   encoding=SillyEncoder.__name__) # When using dataset-specific encoders,
                                                                   # make sure to use the general encoder name here
                                                                   # (e.g., OneHotEncoder.__name__, not OneHotSequenceEncoder.__name__)

        encoded_dataset = dataset.clone()
        encoded_dataset.encoded_data = encoded_data

        return encoded_dataset

    def _get_encoded_examples(self, dataset: Dataset) -> np.ndarray:
        if isinstance(dataset, SequenceDataset):
            return self._get_encoded_sequences(dataset)
        elif isinstance(dataset, ReceptorDataset):
            return self._get_encoded_receptors(dataset)
        elif isinstance(dataset, RepertoireDataset):
            return self._get_encoded_repertoires(dataset)

    def _get_encoded_sequences(self, dataset: SequenceDataset) -> np.ndarray:
        encoded_sequences = []

        for sequence in dataset.get_data():
            # Each sequence is a ReceptorSequence object.
            # Different properties of the sequence can be retrieved here, examples:
            identifier = sequence.get_id()
            aa_seq = sequence.get_sequence() # gets the amino acid sequence by default (alternative: nucleotide)
            v_gene = sequence.get_attribute("v_gene") # gets the v and j genes (without *allele)
            j_gene = sequence.get_attribute("j_gene")

            # In this encoding, sequence information is ignored, random features are generated
            random_encoding = np.random.rand(self.embedding_len)
            encoded_sequences.append(random_encoding)

        return np.array(encoded_sequences)

    def _get_encoded_receptors(self, dataset: ReceptorDataset) -> np.ndarray:
        encoded_receptors = []

        for receptor in dataset.get_data():
            # Each receptor is a Receptor subclass object (e.g., TCABReceptor, BCReceptor)
            # A Receptor contains two paired ReceptorSequence objects
            identifier = receptor.get_id()
            chain1, chain2 = receptor.get_chains()
            sequence1 = receptor.get_chain(chain1)
            sequence2 = receptor.get_chain(chain2)

            # Properties of the specific ReceptorSequences can be retrieved, examples:
            aa_seq1 = sequence1.get_sequence() # gets the amino acid sequence by default (alternative: nucleotide)
            v_gene_seq1 = sequence1.get_attribute("v_gene") # gets the v and j genes (without *allele)
            j_gene_seq1 = sequence1.get_attribute("j_gene")

            # It's also possible to retrieve this information for both chains at the Receptor level:
            aa_seq1, aa_seq2 = receptor.get_attribute("sequence_aa")
            v_gene_seq1, v_gene_seq2 = receptor.get_attribute("v_gene")

            # In this encoding, sequence information is ignored, random features are generated
            random_encoding = np.random.rand(self.embedding_len)
            encoded_receptors.append(random_encoding)

        return np.array(encoded_receptors)

    def _get_encoded_repertoires(self, dataset: RepertoireDataset) -> np.ndarray:
        encoded_repertoires = []

        for repertoire in dataset.get_data():
            # Each repertoire is a Repertoire object.
            # Different properties of the repertoire can be retrieved here, examples:
            identifiers = repertoire.get_sequence_identifiers(as_list=True)
            aa_sequences = repertoire.get_sequence_aas(as_list=True)
            v_genes = repertoire.get_v_genes() # gets the v and j genes (without *allele)
            j_genes = repertoire.get_j_genes()
            sequence_counts = repertoire.get_counts()
            chains = repertoire.get_chains()

            # In this encoding, repertoire information is ignored, random features are generated
            random_encoding = np.random.rand(self.embedding_len)
            encoded_repertoires.append(random_encoding)

        return np.array(encoded_repertoires)
  1. Add a new Python package to the encodings package. This means: a new folder (with a meaningful name) containing an empty __init__.py file.

  2. Add a new encoder class to the package. The new class should inherit from the base class DatasetEncoder. The name of the class should end with ‘Encoder’, and when calling this class in the YAML specification, the ‘Encoder’ suffix is omitted. In the test example, the class is called SillyEncoder, which would be referred to as Silly in the YAML specification.

  3. If the encoder has any default parameters, they should be added in a default parameters YAML file. This file should be added to the folder config/default_params/encodings. The default parameters file is automatically discovered based on the name of the class using the base name (without ‘Encoder’ suffix) converted to snake case, and with an added ‘_params.yaml’ suffix. For the SillyEncoder, this is silly_params.yaml, which could for example contain the following:

    random_seed: 1
    embedding_len: 5
    

    In rare cases where classes have unconventional names that do not translate well to snake case (e.g., MiXCR, VDJdb), this needs to be accounted for in convert_to_snake_case().
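
    As an illustration only (the actual conversion is implemented in immuneML's convert_to_snake_case() utility), the name mapping works roughly like this:

    import re

    def camel_to_snake(class_name: str) -> str:
        # Insert an underscore before each capital letter (except the first), then lowercase
        return re.sub(r'(?<!^)(?=[A-Z])', '_', class_name).lower()

    print(camel_to_snake("Silly") + "_params.yaml")          # silly_params.yaml
    print(camel_to_snake("KmerFrequency") + "_params.yaml")  # kmer_frequency_params.yaml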

  4. Use the automated script check_new_encoder.py to test the newly added encoder. This script will throw errors or warnings if the DatasetEncoder class implementation is incorrect or if files are put in the wrong place. Example command to test the SillyEncoder for sequence datasets:

    python3 ./scripts/check_new_encoder.py -e ./immuneML/encodings/silly/SillyEncoder.py -d sequence
    
  5. If a compatible ML method is already available, add the new encoder class to the list of compatible encoders returned by the get_compatible_encoders() method of the MLMethod of interest. See also Adding encoder compatibility to an ML method.
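
    For example (a sketch; the surrounding MLMethod class is omitted), the method could look like this:

    def get_compatible_encoders(self):
        from immuneML.encodings.onehot.OneHotEncoder import OneHotEncoder
        from immuneML.encodings.silly.SillyEncoder import SillyEncoder
        # Return the encoder classes (not instances) that this ML method can be combined with
        return [OneHotEncoder, SillyEncoder]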

Test running the new encoding with a YAML specification

If you want to test-run your encoder directly in immuneML, the YAML example below may be used. This example analysis creates a randomly generated dataset, encodes the data using the SillyEncoder, and exports the encoded data as a CSV file.

test_run_silly_encoder.yaml
definitions:
  datasets:
    my_dataset:
      format: RandomSequenceDataset
      params:
        sequence_count: 100
        labels:
          binds_epitope:
            True: 0.6
            False: 0.4

  encodings:
    my_silly_encoder:
      Silly:
        random_seed: 3

  reports:
    my_design_matrix: DesignMatrixExporter

instructions:
  my_instruction:
    type: ExploratoryAnalysis
    analyses:
      my_analysis_1:
        dataset: my_dataset
        encoding: my_silly_encoder
        report: my_design_matrix
        labels:
        - binds_epitope
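
Assuming immuneML is installed in your environment, this specification can then be run with the immune-ml command line tool, providing the YAML file and an output folder:

immune-ml test_run_silly_encoder.yaml ./silly_output/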

Adding a Unit test for a DatasetEncoder

Add a unit test for the new SillyEncoder (shown below):

test_sillyEncoder.py
import os
import shutil
import unittest

from immuneML.caching.CacheType import CacheType
from immuneML.encodings.EncoderParams import EncoderParams
from immuneML.environment.Constants import Constants
from immuneML.environment.EnvironmentSettings import EnvironmentSettings
from immuneML.environment.LabelConfiguration import LabelConfiguration
from immuneML.data_model.encoded_data.EncodedData import EncodedData
from immuneML.encodings.silly.SillyEncoder import SillyEncoder
from immuneML.environment.Label import Label
from immuneML.simulation.dataset_generation.RandomDatasetGenerator import RandomDatasetGenerator


class TestSillyEncoder(unittest.TestCase):

    def setUp(self) -> None:
        os.environ[Constants.CACHE_TYPE] = CacheType.TEST.name

    def _get_mock_sequence_dataset(self, path):
        # Create a mock SequenceDataset with 10 sequences of length 15,
        # and a label called 'binding' with 50% chance of having status 'yes' or 'no'
        dataset = RandomDatasetGenerator.generate_sequence_dataset(sequence_count=10,
                                                                   length_probabilities={15: 1},
                                                                   labels={"binding": {"yes": 0.5, "no": 0.5}},
                                                                   path=path)

        label_config = LabelConfiguration(labels=[Label(name="binding", values=["yes", "no"])])

        return dataset, label_config

    def _get_mock_receptor_dataset(self, path):
        # Create a mock ReceptorDataset with 10 receptors with sequences of length 15,
        # and a label called 'binding' with 50% chance of having status 'yes' or 'no'
        dataset = RandomDatasetGenerator.generate_receptor_dataset(receptor_count=10,
                                                                   chain_1_length_probabilities={15: 1},
                                                                   chain_2_length_probabilities={15: 1},
                                                                   labels={"binding": {"yes": 0.5, "no": 0.5}},
                                                                   path=path)

        label_config = LabelConfiguration(labels=[Label(name="binding", values=["yes", "no"])])

        return dataset, label_config

    def _get_mock_repertoire_dataset(self, path):
        # Create a mock RepertoireDataset with 10 repertoires, each containing 50 sequences of length 15,
        # and a label called 'disease' with 50% chance of having status 'yes' or 'no'
        dataset = RandomDatasetGenerator.generate_repertoire_dataset(repertoire_count=10,
                                                           sequence_count_probabilities={50: 1},
                                                           sequence_length_probabilities={15: 1},
                                                           labels={"disease": {"yes": 0.5, "no": 0.5}},
                                                           path=path)

        label_config = LabelConfiguration(labels=[Label(name="disease", values=["yes", "no"])])

        return dataset, label_config

    def test_silly_sequence_encoder(self):
        tmp_path = EnvironmentSettings.tmp_test_path / "silly_sequence/"
        sequence_dataset, label_config = self._get_mock_sequence_dataset(tmp_path)
        self._test_silly_encoder(tmp_path, sequence_dataset, label_config)

    def test_silly_receptor_encoder(self):
        tmp_path = EnvironmentSettings.tmp_test_path / "silly_receptor/"
        receptor_dataset, label_config = self._get_mock_receptor_dataset(tmp_path)
        self._test_silly_encoder(tmp_path, receptor_dataset, label_config)

    def test_silly_repertoire_encoder(self):
        tmp_path = EnvironmentSettings.tmp_test_path / "silly_repertoire/"
        repertoire_dataset, label_config = self._get_mock_repertoire_dataset(tmp_path)
        self._test_silly_encoder(tmp_path, repertoire_dataset, label_config)

    def _test_silly_encoder(self, tmp_path, dataset, label_config):
        # test getting a SillyEncoder from the build_object method
        params = {"random_seed": 1, "embedding_len": 3}
        encoder = SillyEncoder.build_object(dataset, **params)
        self.assertIsInstance(encoder, SillyEncoder)

        # test encoding data
        encoded_dataset = encoder.encode(dataset,
                                         params=EncoderParams(result_path=tmp_path,
                                                              label_config=label_config))

        # the result must be a Dataset (of the same subtype as the original dataset) with EncodedData attached
        self.assertIsInstance(encoded_dataset, dataset.__class__)
        self.assertIsInstance(encoded_dataset.encoded_data, EncodedData)

        # testing the validity of the encoded data
        self.assertEqual(dataset.get_example_ids(), encoded_dataset.encoded_data.example_ids)
        self.assertTrue((encoded_dataset.encoded_data.examples >= 0).all())
        self.assertTrue((encoded_dataset.encoded_data.examples <= 1).all())

        # don't forget to remove the temporary data
        shutil.rmtree(tmp_path)
  1. Add a new package to the test.encodings package which matches the package name of your encoder code. In this case, the new package would be test.encodings.silly.

  2. To the new test.encodings.silly package, add a new file named test_sillyEncoder.py.

  3. Add a class TestSillyEncoder that inherits unittest.TestCase to the new file.

  4. Add a function setUp() to set up the cache used for testing. This should ensure that the cache location is set to EnvironmentSettings.tmp_test_path / "cache/".

  5. Define one or more tests for the class and functions you implemented. For the SillyEncoder example, these have already been added. Note:

    • It is recommended to at least test the output of the ‘encode’ method (ensure that a valid EncodedData object with a correct examples matrix is returned).

    • Make sure to add tests for every relevant dataset type. Tests for different dataset types may be split into several different classes/files if desired (e.g., test_oneHotReceptorEncoder.py, test_oneHotSequenceEncoder.py, …). For the SillyEncoder, all tests are in the same file.

    • Mock data is typically used to test new classes. Tip: the RandomDatasetGenerator class can be used to generate Repertoire, Sequence or Receptor datasets with random sequences.

    • If you need to write data to a path (for example test datasets or results), use the following location: EnvironmentSettings.tmp_test_path / "some_unique_foldername"
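
After adding the test, it can be run from the immuneML project root, for example with the standard unittest runner:

python3 -m unittest test.encodings.silly.test_sillyEncoder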

Implementing a new encoder

This section describes tips and tricks for implementing your own new DatasetEncoder from scratch. Detailed instructions on how to implement each method, as well as some special cases, can be found in the DatasetEncoder base class.

Note

Coding conventions and tips

  1. Class names are written in CamelCase

  2. Class methods are written in snake_case

  3. The abstract base classes MLMethod, DatasetEncoder, and Report define an interface for their inheriting subclasses. These classes contain abstract methods which should be overridden.

  4. Class methods starting with an underscore are generally considered “private” methods, only to be called by the class itself. If a method is expected to be called from another class, the method name should not start with an underscore.

  5. When familiarising yourself with existing code, we recommend focusing on public methods. Private methods are typically highly specific to a class (internal class-specific calculations), whereas public methods contain more general functionalities (e.g., returning a main result).

  6. If your class should have any default parameters, they should be defined in a default parameters file under config/default_params/.

  7. Some utility classes are available in the util package to provide useful functionalities. For example, ParameterValidator can be used to check user input and generate error messages, or PathBuilder can be used to add and remove folders.
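
As a small illustration of these utility classes (a sketch; the exact helper methods available may differ between immuneML versions):

from pathlib import Path

from immuneML.util.ParameterValidator import ParameterValidator
from immuneML.util.PathBuilder import PathBuilder

# PathBuilder.build creates the folder (including missing parents) if needed and returns the path
result_path = PathBuilder.build(Path("output") / "my_encoder")

# ParameterValidator raises a user-friendly error if the value is not one of the valid options
ParameterValidator.assert_in_valid_list("alpha", ["alpha", "beta"], "MyEncoder", "chain")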

Encoders for different dataset types

Inside immuneML, three different types of datasets are considered: RepertoireDataset for immune repertoires, SequenceDataset for single-chain immune receptor sequences and ReceptorDataset for paired sequences. Encoding should be implemented separately for each dataset type. This can be solved in two different ways:

  • Have a single Encoder class containing separate methods for encoding different dataset types. During encoding, the dataset type is checked, and the corresponding methods are called. An example of this is the SillyEncoder shown above.

  • Have an abstract base Encoder class for the general encoding type, with subclasses for each dataset type. The base Encoder contains all shared functionalities, and the subclasses contain dataset-specific functionalities, such as code for ‘encoding an example’. Note that in this case, the base Encoder implements the method build_object(dataset: Dataset, params), which returns the correct dataset type-specific encoder subclass. An example of this is OneHotEncoder, which has subclasses OneHotSequenceEncoder, OneHotReceptorEncoder and OneHotRepertoireEncoder (a sketch of this dispatch pattern is given below).

When an encoding only makes sense for one possible dataset type, only one class needs to be created. The build_object(dataset: Dataset, params) method should raise a user-friendly error when an illegal dataset type is supplied. An example of this can be found in SimilarToPositiveSequenceEncoder.
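
As a sketch of the second, subclass-based approach (all class names below are hypothetical), the base encoder's build_object() could dispatch to the dataset-specific subclasses as follows:

@staticmethod
def build_object(dataset: Dataset = None, **params):
    if isinstance(dataset, SequenceDataset):
        return MyNewSequenceEncoder(**params)
    elif isinstance(dataset, ReceptorDataset):
        return MyNewReceptorEncoder(**params)
    elif isinstance(dataset, RepertoireDataset):
        return MyNewRepertoireEncoder(**params)
    else:
        raise ValueError("MyNewEncoder is not defined for this dataset type")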

Input and output of the encode() method

The encode() method is called by immuneML to encode a new dataset. This method is called with two arguments: a dataset and params (an EncoderParams object), which contains:

EncoderParams:

  • label_config: a LabelConfiguration object containing the labels that were specified for the analysis. Should be used as an input parameter for EncoderHelper.encode_dataset_labels().

  • encode_labels: boolean value which specifies whether labels must be used when encoding. Should be used as an input parameter for EncoderHelper.encode_dataset_labels().

  • pool_size: the number of parallel processes that the Encoder is allowed to use, for example when parallelising computations using a process pool. This only needs to be used when implementing parallelisation.

  • result_path: this path can optionally be used to store intermediate files, if necessary. For most encoders, this is not necessary.

  • learn_model: a boolean value indicating whether the encoder is called during ‘training’ (learn_model=True) or ‘application’ (learn_model=False). This parameter can thus be used to prevent ‘leakage’ of information from the test set to the training set. It must be taken into account when performing operations over the whole dataset, such as normalising/scaling features (example: Word2VecEncoder). For encoders where the encoding of a single example does not depend on other examples (e.g., OneHotEncoder), this parameter can be ignored. A sketch of how learn_model may be used is given after this list.

  • model: this parameter is used by e.g., KmerFrequencyEncoder to pass its parameters to other classes. This parameter can usually be ignored.
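
As an illustration of learn_model, an encoder that scales features across the whole dataset could fit the scaler only during training (a sketch; the scaler attribute and helper method are assumptions, not part of the DatasetEncoder interface):

from sklearn.preprocessing import StandardScaler

def _scale_examples(self, examples, params: EncoderParams):
    if params.learn_model:
        # Fit the scaling parameters on the training data only
        self.scaler = StandardScaler().fit(examples)
    # Apply the previously fitted scaler, whether encoding training or test data
    return self.scaler.transform(examples)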

The encode() method should return a new dataset object, which is a copy of the original input dataset, but with an added encoded_data attribute. The encoded_data attribute should contain an EncodedData object, which is created with the following arguments:

EncodedData:

  • examples: a design matrix where the rows represent Repertoires, Receptors or Sequences (‘examples’), and the columns represent the encoding-specific features. This is typically a numpy matrix, but may also be another matrix type (e.g., scipy sparse matrix, pytorch tensor, pandas dataframe).

  • encoding: a string denoting the encoder base class that was used.

  • labels: a dictionary of labels, where each label name is a key, and the values are the label values across the examples (for example: {disease1: [positive, positive, negative]} if there are 3 repertoires). This parameter should only be set if EncoderParams.encode_labels is True; otherwise it should be set to None. It can be created by calling the utility function EncoderHelper.encode_dataset_labels().

  • example_ids: a list of identifiers for the examples (Repertoires, Receptors or Sequences). This can be retrieved using Dataset.get_example_ids().

  • feature_names: a list of feature names, i.e., the names given to the encoding-specific features. When included, the list must be as long as the number of features.

  • feature_annotations: an optional pandas dataframe with additional information about the features. When included, the number of rows in this dataframe must equal the number of features. This parameter is not typically used.

  • info: an optional dictionary that may be used to store any additional information that is relevant (for example paths to additional output files). This parameter is not typically used.

The examples attribute of the EncodedData object is passed directly to the ML models for training. The other attributes are used for reports and interpretability.

Caching intermediate results

To avoid recomputing the same result, immuneML uses caching. Caching can be applied to methods which compute an (intermediate) result: the result is stored to a file, and when the same method call is made again, the previously stored result is retrieved from the file and returned.

We recommend applying caching to methods which are computationally expensive and may be called multiple times in the same way. For example, encoders are a good target for caching, as they may take a long time to compute and can be called multiple times on the same data when combined with different ML methods. In contrast, ML methods typically do not require caching, as you would want to apply ML methods with different parameters or to differently encoded data.

Any method call in immuneML can be cached as follows:

result = CacheHandler.memo_by_params(params=cache_params, fn=lambda: my_method_for_caching(my_method_param1, my_method_param2, ...))

The CacheHandler.memo_by_params method does the following:

  • Using the caching parameters, a unique cache key (a string) is deterministically computed.

  • CacheHandler checks whether a previously computed result is already associated with this key.

  • If the result exists, the result is returned without (re)computing the method.

  • If the result does not exist, the method is computed, its result is stored using the cache key, and the result is returned.

The lambda function call simply calls the method to be cached, using any required parameters. The cache_params argument represents the unique, immutable parameters used to compute the cache key. It should have the following properties:

  • It must be a nested tuple containing only immutable items such as strings, booleans and integers. It cannot contain mutable items like lists, dictionaries, sets and objects (these all need to be converted to nested tuples of immutable items).

  • It should include every factor that can contribute to a difference in the results of the computed method. For example, when caching the encode_data step, the following should be included:

    • dataset descriptors (dataset id, example ids, dataset type),

    • encoding name,

    • labels,

    • EncoderParams.learn_model if used,

    • all relevant input parameters to the encoder. These are preferably retrieved automatically (e.g., via vars(self)), as this ensures that newly added encoder parameters are automatically included in the caching params.

For example, OneHotEncoder computes its caching parameters as follows:

def _prepare_caching_params(self, dataset, params: EncoderParams):
    return (("dataset_identifier", dataset.identifier),
            ("example_identifiers", tuple(dataset.get_example_ids())),
            ("dataset_type", dataset.__class__.__name__),
            ("encoding", OneHotEncoder.__name__),
            ("labels", tuple(params.label_config.get_labels_by_name())),
            ("encoding_params", tuple(vars(self).items())))

The construction of caching parameters must be done carefully, as caching bugs are extremely difficult to discover. Rather add ‘too much’ information than too little. A missing parameter will not lead to an error, but can result in silently copying over results from previous method calls.
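
Putting this together, an encoder could wrap its encoding step as follows (a sketch; _encode_new_dataset is a hypothetical private helper method):

from immuneML.caching.CacheHandler import CacheHandler

def encode(self, dataset, params: EncoderParams):
    return CacheHandler.memo_by_params(params=self._prepare_caching_params(dataset, params),
                                       fn=lambda: self._encode_new_dataset(dataset, params))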

Class documentation standards for encodings

Class documentation should be added as a docstring to all new Encoder, MLMethod, Report or Preprocessing classes. The class docstrings are used to automatically generate the documentation web pages, using Sphinx reStructuredText, and should adhere to a standard format:

  1. A short, general description of the functionality

  2. Optional extended description, including any references or specific cases that should be considered. For instance: if a class can only be used for a particular dataset type. Compatibility between Encoders, MLMethods and Reports should also be described.

  3. For encoders, the appropriate dataset type(s). For example:

    **Dataset type:**
    
    - SequenceDatasets
    
    - RepertoireDatasets
    
  4. A list of arguments, when applicable. This should follow the format below:

    **Specification arguments:**
    
    - parameter_name (type): a short description
    
    - other_parameter_name (type): a short description
    
  5. A YAML snippet, to show an example of how the new component should be called. Make sure to test your YAML snippet in an immuneML run to ensure it is specified correctly. The following formatting should be used to ensure the YAML snippet is rendered correctly:

    **YAML specification:**
    
    .. indent with spaces
    .. code-block:: yaml
    
        definitions:
            yaml_keyword: # could be encodings/ml_methods/reports/etc...
                my_new_class:
                    MyNewClass:
                        parameter_name: 0
                        other_parameter_name: 1
    
A full example of DatasetEncoder class documentation is given below:
This SillyEncoder class is a placeholder for a real encoder.
It computes a set of random numbers as features for a given dataset.

**Dataset type:**

- SequenceDatasets

- ReceptorDatasets

- RepertoireDatasets


**Specification arguments:**

- random_seed (int): The random seed for generating random features.

- embedding_len (int): The number of random features to generate per example.


**YAML specification:**

.. indent with spaces
.. code-block:: yaml

    definitions:
        encodings:
            my_silly_encoder:
                Silly:
                    random_seed: 1
                    embedding_len: 5