How to add a new encoding#
Adding an example encoder to the immuneML codebase#
This tutorial describes how to add a new DatasetEncoder
class to immuneML,
using a simple example encoder. We highly recommend completing this tutorial to get a better understanding of the immuneML
interfaces before continuing to implement your own encoder.
Step-by-step tutorial#
For this tutorial, we provide a SillyEncoder
(download here
or view below), in order to test adding a new Encoder file to immuneML.
This encoder ignores the data of the input examples, and generates a few random features per example.
SillyEncoder.py
import numpy as np

from immuneML.data_model.dataset.ReceptorDataset import ReceptorDataset
from immuneML.data_model.dataset.RepertoireDataset import RepertoireDataset
from immuneML.data_model.dataset.SequenceDataset import SequenceDataset
from immuneML.data_model.encoded_data.EncodedData import EncodedData
from immuneML.encodings.DatasetEncoder import DatasetEncoder
from immuneML.encodings.EncoderParams import EncoderParams
from immuneML.data_model.dataset.Dataset import Dataset
from immuneML.util.EncoderHelper import EncoderHelper
from immuneML.util.ParameterValidator import ParameterValidator


class SillyEncoder(DatasetEncoder):
    """
    This SillyEncoder class is a placeholder for a real encoder.
    It computes a set of random numbers as features for a given dataset.

    **Specification arguments:**

    - random_seed (int): The random seed for generating random features.

    - embedding_len (int): The number of random features to generate per example.


    **YAML specification:**

    .. indent with spaces
    .. code-block:: yaml

        definitions:
            encodings:
                my_silly_encoder:
                    Silly: # name of the class (without 'Encoder' suffix)
                        random_seed: 1
                        embedding_len: 5
    """

    def __init__(self, random_seed: int, embedding_len: int, name: str = None):
        # The encoder name contains the user-defined name for the encoder. It may be used by some reports.
        super().__init__(name=name)

        # All user parameters are set here.
        # Default parameters must not be defined in the Encoder class, but in a default parameters file.
        self.random_seed = random_seed
        self.embedding_len = embedding_len

    @staticmethod
    def build_object(dataset=None, **params):
        # build_object is called early in the immuneML run, before the analysis takes place.
        # Its purpose is to fail early when a class is called incorrectly (checking parameters and dataset),
        # and provide user-friendly error messages.

        # ParameterValidator contains many utility functions for checking user parameters
        ParameterValidator.assert_type_and_value(params['random_seed'], int, SillyEncoder.__name__, 'random_seed', min_inclusive=1)
        ParameterValidator.assert_type_and_value(params['embedding_len'], int, SillyEncoder.__name__, 'embedding_len', min_inclusive=1, max_inclusive=100)

        # An error should be thrown if the dataset type is incompatible with the Encoder.
        # If different sub-classes are defined for each dataset type (e.g., OneHotRepertoireEncoder),
        # an instance of the dataset-specific class must be returned here.
        if isinstance(dataset, SequenceDataset) or isinstance(dataset, ReceptorDataset) or isinstance(dataset, RepertoireDataset):
            return SillyEncoder(**params)
        else:
            raise ValueError("SillyEncoder is only defined for dataset types SequenceDataset, ReceptorDataset or RepertoireDataset")

    def encode(self, dataset, params: EncoderParams) -> Dataset:
        np.random.seed(self.random_seed)

        # Generate the design matrix from the sequence dataset
        encoded_examples = self._get_encoded_examples(dataset)

        # EncoderHelper contains some utility functions, including this function for encoding the labels
        labels = EncoderHelper.encode_dataset_labels(dataset, params.label_config, params.encode_labels)

        # Each feature is represented by some meaningful name
        feature_names = [f"random_number_{i}" for i in range(self.embedding_len)]

        encoded_data = EncodedData(examples=encoded_examples,
                                   labels=labels,
                                   example_ids=dataset.get_example_ids(),
                                   feature_names=feature_names,
                                   encoding=SillyEncoder.__name__)  # When using dataset-specific encoders,
                                                                    # make sure to use the general encoder name here
                                                                    # (e.g., OneHotEncoder.__name__, not OneHotSequenceEncoder.__name__)

        encoded_dataset = dataset.clone()
        encoded_dataset.encoded_data = encoded_data

        return encoded_dataset

    def _get_encoded_examples(self, dataset: Dataset) -> np.array:
        if isinstance(dataset, SequenceDataset):
            return self._get_encoded_sequences(dataset)
        elif isinstance(dataset, ReceptorDataset):
            return self._get_encoded_receptors(dataset)
        elif isinstance(dataset, RepertoireDataset):
            return self._get_encoded_repertoires(dataset)

    def _get_encoded_sequences(self, dataset: SequenceDataset) -> np.array:
        encoded_sequences = []

        for sequence in dataset.get_data():
            # Each sequence is a ReceptorSequence object.
            # Different properties of the sequence can be retrieved here, examples:
            identifier = sequence.get_id()
            aa_seq = sequence.get_sequence()  # gets the amino acid sequence by default (alternative: nucleotide)
            v_gene = sequence.get_attribute("v_gene")  # gets the v and j genes (without *allele)
            j_gene = sequence.get_attribute("j_gene")

            # In this encoding, sequence information is ignored, random features are generated
            random_encoding = np.random.rand(self.embedding_len)
            encoded_sequences.append(random_encoding)

        return np.array(encoded_sequences)

    def _get_encoded_receptors(self, dataset: ReceptorDataset) -> np.array:
        encoded_receptors = []

        for receptor in dataset.get_data():
            # Each receptor is a Receptor subclass object (e.g., TCABReceptor, BCReceptor)
            # A Receptor contains two paired ReceptorSequence objects
            identifier = receptor.get_id()
            chain1, chain2 = receptor.get_chains()
            sequence1 = receptor.get_chain(chain1)
            sequence2 = receptor.get_chain(chain2)

            # Properties of the specific ReceptorSequences can be retrieved, examples:
            aa_seq1 = sequence1.get_sequence()  # gets the amino acid sequence by default (alternative: nucleotide)
            v_gene_seq1 = sequence1.get_attribute("v_gene")  # gets the v and j genes (without *allele)
            j_gene_seq1 = sequence1.get_attribute("j_gene")

            # It's also possible to retrieve this information for both chains at the Receptor level:
            aa_seq1, aa_seq2 = receptor.get_attribute("sequence_aa")
            v_gene_seq1, v_gene_seq2 = receptor.get_attribute("v_gene")

            # In this encoding, sequence information is ignored, random features are generated
            random_encoding = np.random.rand(self.embedding_len)
            encoded_receptors.append(random_encoding)

        return np.array(encoded_receptors)

    def _get_encoded_repertoires(self, dataset: RepertoireDataset) -> np.array:
        encoded_repertoires = []

        for repertoire in dataset.get_data():
            # Each repertoire is a Repertoire object.
            # Different properties of the repertoire can be retrieved here, examples:
            identifiers = repertoire.get_sequence_identifiers(as_list=True)
            aa_sequences = repertoire.get_sequence_aas(as_list=True)
            v_genes = repertoire.get_v_genes()  # gets the v and j genes (without *allele)
            j_genes = repertoire.get_j_genes()
            sequence_counts = repertoire.get_counts()
            chains = repertoire.get_chains()

            # In this encoding, repertoire information is ignored, random features are generated
            random_encoding = np.random.rand(self.embedding_len)
            encoded_repertoires.append(random_encoding)

        return np.array(encoded_repertoires)
1. Add a new Python package to the encodings package. This means: a new folder (with a meaningful name) containing an empty __init__.py file.

2. Add a new encoder class to the package. The new class should inherit from the base class DatasetEncoder. The name of the class should end with ‘Encoder’, and when calling this class in the YAML specification, the ‘Encoder’ suffix is omitted. In the test example, the class is called SillyEncoder, which would be referred to as Silly in the YAML specification.

3. If the encoder has any default parameters, they should be added in a default parameters YAML file. This file should be added to the folder config/default_params/encodings. The default parameters file is automatically discovered based on the name of the class, using the base name (without ‘Encoder’ suffix) converted to snake case, and with an added ‘_params.yaml’ suffix. For the SillyEncoder, this is silly_params.yaml, which could for example contain the following:

   random_seed: 1
   embedding_len: 5

   In rare cases where classes have unconventional names that do not translate well from CamelCase to snake case (e.g., MiXCR, VDJdb), this needs to be accounted for in convert_to_snake_case().

4. Use the automated script check_new_encoder.py to test the newly added encoder. This script will throw errors or warnings if the DatasetEncoder class implementation is incorrect or if files are put in the wrong place. Example command to test the SillyEncoder for sequence datasets:

   python3 ./scripts/check_new_encoder.py -e ./immuneML/encodings/silly/SillyEncoder.py -d sequence

5. If a compatible ML method is already available, add the new encoder class to the list of compatible encoders returned by the get_compatible_encoders() method of the MLMethod of interest. See also Adding encoder compatibility to an ML method.
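The file-name rule from the default parameters step can be illustrated with a small sketch. The helper default_params_filename below is hypothetical and is not the immuneML implementation; it only assumes the rule as described above (drop the ‘Encoder’ suffix, convert to snake case, append ‘_params.yaml’).

```python
import re


def default_params_filename(class_name: str, suffix: str = "Encoder") -> str:
    """Hypothetical helper: derive the default parameters file name for a class."""
    # Drop the class-type suffix, e.g. 'SillyEncoder' -> 'Silly'
    base_name = class_name[:-len(suffix)] if class_name.endswith(suffix) else class_name
    # Insert an underscore before every capital letter except the first, then lowercase
    snake_case = re.sub(r"(?<!^)(?=[A-Z])", "_", base_name).lower()
    return f"{snake_case}_params.yaml"


print(default_params_filename("SillyEncoder"))   # silly_params.yaml
print(default_params_filename("OneHotEncoder"))  # one_hot_params.yaml
print(default_params_filename("MiXCR"))          # mi_x_c_r_params.yaml
```

The last call shows why names such as MiXCR need special handling: a naive conversion splits every capital letter, which is rarely the file name one would want.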
Test running the new encoding with a YAML specification#
If you want to use immuneML directly to test run your encoder, the YAML example below may be used.
This example analysis creates a randomly generated dataset, encodes the data using the SillyEncoder
and exports the encoded data as a csv file.
test_run_silly_encoder.yaml
definitions:
  datasets:
    my_dataset:
      format: RandomSequenceDataset
      params:
        sequence_count: 100
        labels:
          binds_epitope:
            True: 0.6
            False: 0.4

  encodings:
    my_silly_encoder:
      Silly:
        random_seed: 3

  reports:
    my_design_matrix: DesignMatrixExporter

instructions:
  my_instruction:
    type: ExploratoryAnalysis
    analyses:
      my_analysis_1:
        dataset: my_dataset
        encoding: my_silly_encoder
        report: my_design_matrix
        labels:
        - binds_epitope
Adding a Unit test for a DatasetEncoder#
Add a unit test for the new SillyEncoder (download
the example testfile or view below):
test_sillyEncoder.py
import os
import shutil
import unittest

from immuneML.caching.CacheType import CacheType
from immuneML.encodings.EncoderParams import EncoderParams
from immuneML.environment.Constants import Constants
from immuneML.environment.EnvironmentSettings import EnvironmentSettings
from immuneML.environment.LabelConfiguration import LabelConfiguration
from immuneML.data_model.encoded_data.EncodedData import EncodedData
from immuneML.encodings.silly.SillyEncoder import SillyEncoder
from immuneML.environment.Label import Label
from immuneML.simulation.dataset_generation.RandomDatasetGenerator import RandomDatasetGenerator


class TestSillyEncoder(unittest.TestCase):

    def setUp(self) -> None:
        os.environ[Constants.CACHE_TYPE] = CacheType.TEST.name

    def _get_mock_sequence_dataset(self, path):
        # Create a mock SequenceDataset with 10 sequences of length 15,
        # and a label called 'binding' with 50% chance of having status 'yes' or 'no'
        dataset = RandomDatasetGenerator.generate_sequence_dataset(sequence_count=10,
                                                                   length_probabilities={15: 1},
                                                                   labels={"binding": {"yes": 0.5, "no": 0.5}},
                                                                   path=path)

        label_config = LabelConfiguration(labels=[Label(name="binding", values=["yes", "no"])])

        return dataset, label_config

    def _get_mock_receptor_dataset(self, path):
        # Create a mock ReceptorDataset with 10 receptors with sequences of length 15,
        # and a label called 'binding' with 50% chance of having status 'yes' or 'no'
        dataset = RandomDatasetGenerator.generate_receptor_dataset(receptor_count=10,
                                                                   chain_1_length_probabilities={15: 1},
                                                                   chain_2_length_probabilities={15: 1},
                                                                   labels={"binding": {"yes": 0.5, "no": 0.5}},
                                                                   path=path)

        label_config = LabelConfiguration(labels=[Label(name="binding", values=["yes", "no"])])

        return dataset, label_config

    def _get_mock_repertoire_dataset(self, path):
        # Create a mock RepertoireDataset with 10 repertoires, each containing 50 sequences of length 15,
        # and a label called 'disease' with 50% chance of having status 'yes' or 'no'
        dataset = RandomDatasetGenerator.generate_repertoire_dataset(repertoire_count=10,
                                                                     sequence_count_probabilities={50: 1},
                                                                     sequence_length_probabilities={15: 1},
                                                                     labels={"disease": {"yes": 0.5, "no": 0.5}},
                                                                     path=path)

        label_config = LabelConfiguration(labels=[Label(name="disease", values=["yes", "no"])])

        return dataset, label_config

    def test_silly_sequence_encoder(self):
        tmp_path = EnvironmentSettings.tmp_test_path / "silly_sequence/"
        sequence_dataset, label_config = self._get_mock_sequence_dataset(tmp_path)
        self._test_silly_encoder(tmp_path, sequence_dataset, label_config)

    def test_silly_receptor_encoder(self):
        tmp_path = EnvironmentSettings.tmp_test_path / "silly_receptor/"
        receptor_dataset, label_config = self._get_mock_receptor_dataset(tmp_path)
        self._test_silly_encoder(tmp_path, receptor_dataset, label_config)

    def test_silly_repertoire_encoder(self):
        tmp_path = EnvironmentSettings.tmp_test_path / "silly_repertoire/"
        receptor_dataset, label_config = self._get_mock_repertoire_dataset(tmp_path)
        self._test_silly_encoder(tmp_path, receptor_dataset, label_config)

    def _test_silly_encoder(self, tmp_path, dataset, label_config):
        # test getting a SillyEncoder from the build_object method
        params = {"random_seed": 1, "embedding_len": 3}
        encoder = SillyEncoder.build_object(dataset, **params)
        self.assertIsInstance(encoder, SillyEncoder)

        # test encoding data
        encoded_dataset = encoder.encode(dataset, params=EncoderParams(result_path=tmp_path,
                                                                       label_config=label_config))

        # the result must be a Dataset (of the same subtype as the original dataset) with EncodedData attached
        self.assertIsInstance(encoded_dataset, dataset.__class__)
        self.assertIsInstance(encoded_dataset.encoded_data, EncodedData)

        # testing the validity of the encoded data
        self.assertEqual(dataset.get_example_ids(), encoded_dataset.encoded_data.example_ids)
        self.assertTrue((encoded_dataset.encoded_data.examples >= 0).all())
        self.assertTrue((encoded_dataset.encoded_data.examples <= 1).all())

        # don't forget to remove the temporary data
        shutil.rmtree(tmp_path)
1. Add a new package to the test.encodings package which matches the package name of your encoder code. In this case, the new package would be test.encodings.silly.

2. To the new test.encodings.silly package, add a new file named test_sillyEncoder.py.

3. Add a class TestSillyEncoder that inherits unittest.TestCase to the new file.

4. Add a function setUp() to set up the cache used for testing. This should ensure that the cache location will be set to EnvironmentSettings.tmp_test_path / "cache/"

5. Define one or more tests for the class and functions you implemented. For the SillyEncoder example, these have already been added. Note:

   - It is recommended to at least test the output of the ‘encode’ method (ensure a valid EncodedData object with correct examples matrix is returned).
   - Make sure to add tests for every relevant dataset type. Tests for different dataset types may be split into several different classes/files if desired (e.g., test_oneHotReceptorEncoder.py, test_oneHotSequenceEncoder.py, …). For the SillyEncoder, all tests are in the same file.
   - Mock data is typically used to test new classes. Tip: the RandomDatasetGenerator class can be used to generate Repertoire, Sequence or Receptor datasets with random sequences.

6. If you need to write data to a path (for example test datasets or results), use the following location: EnvironmentSettings.tmp_test_path / "some_unique_foldername"
Implementing a new encoder#
This section describes tips and tricks for implementing your own new DatasetEncoder
from scratch.
Detailed instructions of how to implement each method, as well as some special cases, can be found in the
DatasetEncoder
base class.
Note
Coding conventions and tips
- Class names are written in CamelCase.
- Class methods are written in snake_case.
- Abstract base classes (MLMethod, DatasetEncoder and Report) define an interface for their inheriting subclasses. These classes contain abstract methods which should be overridden.
- Class methods starting with an _underscore are generally considered “private” methods, only to be called by the class itself. If a method is expected to be called from another class, the method name should not start with an underscore.
- When familiarising yourself with existing code, we recommend focusing on public methods. Private methods are typically specific to a class (internal class-specific calculations), whereas the public methods contain more general functionalities (e.g., returning a main result).
- If your class should have any default parameters, they should be defined in a default parameters file under config/default_params/.
- Some utility classes are available in the util package to provide useful functionalities. For example, ParameterValidator can be used to check user input and generate error messages, and PathBuilder can be used to add and remove folders.
Encoders for different dataset types#
Inside immuneML, three different types of datasets are considered: RepertoireDataset for immune repertoires, SequenceDataset for single-chain immune receptor sequences and ReceptorDataset for paired sequences.
Encoding should be implemented separately for each dataset type. This can be solved in two different ways:
- Have a single Encoder class containing separate methods for encoding different dataset types. During encoding, the dataset type is checked, and the corresponding methods are called. An example of this is the SillyEncoder from the step-by-step tutorial above.
- Have an abstract base Encoder class for the general encoding type, with subclasses for each dataset type. The base Encoder contains all shared functionalities, and the subclasses contain dataset-specific functionalities, such as code for ‘encoding an example’. Note that in this case, the base Encoder implements the method build_object(dataset: Dataset, params), which returns the correct dataset type-specific encoder subclass. An example of this is OneHotEncoder, which has subclasses OneHotSequenceEncoder, OneHotReceptorEncoder and OneHotRepertoireEncoder.

When an encoding only makes sense for one possible dataset type, only one class needs to be created.
The build_object(dataset: Dataset, params) method should raise a user-friendly error when an illegal dataset type is supplied.
An example of this can be found in SimilarToPositiveSequenceEncoder.
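The second approach (a base class whose build_object returns a dataset-specific subclass) can be sketched as follows. This is a minimal, self-contained illustration: the dataset classes are empty stand-ins for the immuneML classes of the same name, and ToyEncoder, its subclasses, the scale parameter and encode_example are hypothetical names that do not exist in immuneML.

```python
from abc import ABC, abstractmethod


# Empty stand-ins for the immuneML dataset classes, so that the sketch runs on its own
class SequenceDataset:
    pass


class ReceptorDataset:
    pass


class ToyEncoder(ABC):
    """Hypothetical base encoder holding the shared functionality."""

    def __init__(self, scale: float):
        self.scale = scale

    @staticmethod
    def build_object(dataset=None, **params):
        # Map each supported dataset type to its dataset-specific subclass
        subclasses = {SequenceDataset: ToySequenceEncoder, ReceptorDataset: ToyReceptorEncoder}
        for dataset_type, encoder_class in subclasses.items():
            if isinstance(dataset, dataset_type):
                return encoder_class(**params)
        # Fail early with a user-friendly message for unsupported dataset types
        supported = ", ".join(dataset_type.__name__ for dataset_type in subclasses)
        raise ValueError(f"ToyEncoder is not defined for dataset type {type(dataset).__name__}. "
                         f"Supported dataset types: {supported}")

    @abstractmethod
    def encode_example(self, example):
        """Dataset-specific encoding of a single example."""


class ToySequenceEncoder(ToyEncoder):
    def encode_example(self, example: str):
        # One feature: the scaled sequence length
        return [len(example) * self.scale]


class ToyReceptorEncoder(ToyEncoder):
    def encode_example(self, example: tuple):
        # Two features: the scaled length of each chain
        chain1, chain2 = example
        return [len(chain1) * self.scale, len(chain2) * self.scale]


encoder = ToyEncoder.build_object(SequenceDataset(), scale=2.0)
print(type(encoder).__name__, encoder.encode_example("CASS"))
```

Callers only ever use ToyEncoder.build_object, so the choice of subclass stays an internal detail of the encoder.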
Input and output of the encode() method#
The encode() method is called by immuneML to encode a new dataset.
This method is called with two arguments: a dataset and params (an EncoderParams object), which contains:

EncoderParams:
- label_config: a LabelConfiguration object containing the labels that were specified for the analysis. Should be used as an input parameter for EncoderHelper.encode_dataset_labels().
- encode_labels: boolean value which specifies whether labels must be used when encoding. Should be used as an input parameter for EncoderHelper.encode_dataset_labels().
- pool_size: the number of parallel processes that the Encoder is allowed to use, for example when using parallelisation using the package pool. This only needs to be used when implementing parallelisation.
- result_path: this path can optionally be used to store intermediate files, if necessary. For most encoders, this is not necessary.
- learn_model: a boolean value indicating whether the encoder is called during ‘training’ (learn_model=True) or ‘application’ (learn_model=False). Thus, this parameter can be used to prevent ‘leakage’ of information from the test to training set. This must be taken into account when performing operations over the whole dataset, such as normalising/scaling features (example: Word2VecEncoder). For encoders where the encoding of a single example is not dependent on other examples (e.g., OneHotEncoder), this parameter can be ignored.
- model: this parameter is used by e.g., KmerFrequencyEncoder to pass its parameters to other classes. This parameter can usually be ignored.
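To make the role of learn_model concrete, here is a minimal sketch of an encoder that scales a feature over the whole dataset. ToyScalingEncoder is a hypothetical class that works on plain lists of numbers rather than immuneML datasets; it only illustrates the principle that dataset-wide statistics are learned during training and reused during application.

```python
from statistics import mean, pstdev


class ToyScalingEncoder:
    """Hypothetical encoder that standardises one numeric feature per example."""

    def __init__(self):
        self.mean = None
        self.std = None

    def encode(self, values, learn_model: bool):
        if learn_model:
            # Training: estimate the scaling parameters from the training examples only
            self.mean = mean(values)
            self.std = pstdev(values) or 1.0  # fall back to 1.0 for constant features
        elif self.mean is None:
            raise RuntimeError("encode() must be called with learn_model=True before application")
        # Training and application both use the parameters learned during training
        return [(value - self.mean) / self.std for value in values]


encoder = ToyScalingEncoder()
train_encoded = encoder.encode([1.0, 2.0, 3.0], learn_model=True)
test_encoded = encoder.encode([2.0, 100.0], learn_model=False)
print(train_encoded, test_encoded)
```

Because the second call does not update the stored mean and standard deviation, the extreme test value 100.0 cannot influence how the training data was scaled.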
The encode() method should return a new dataset object, which is a copy of the original input dataset, but with an added encoded_data attribute.
The encoded_data attribute should contain an EncodedData object, which is created with the following arguments:

EncodedData:
- examples: a design matrix where the rows represent Repertoires, Receptors or Sequences (‘examples’), and the columns the encoding-specific features. This is typically a numpy matrix, but may also be another matrix type (e.g., scipy sparse matrix, pytorch tensor, pandas dataframe).
- encoding: a string denoting the encoder base class that was used.
- labels: a dictionary of labels, where each label is a key, and the values are the label values across the examples (for example: {disease1: [positive, positive, negative]} if there are 3 repertoires). This parameter should be set only if EncoderParams.encode_labels is True, otherwise it should be set to None. This can be created by calling the utility function EncoderHelper.encode_dataset_labels().
- example_ids: a list of identifiers for the examples (Repertoires, Receptors or Sequences). This can be retrieved using Dataset.get_example_ids().
- feature_names: a list of feature names, i.e., the names given to the encoding-specific features. When included, the list must be as long as the number of features.
- feature_annotations: an optional pandas dataframe with additional information about the features. When included, the number of rows in this dataframe must correspond to the number of features. This parameter is not typically used.
- info: an optional dictionary that may be used to store any additional information that is relevant (for example paths to additional output files). This parameter is not typically used.

The examples attribute of the EncodedData objects will be directly passed to the ML models for training.
Other attributes are used for reports and interpretability.
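The size constraints between these arguments can be checked with a small helper before constructing the EncodedData object. The function check_encoded_data below is a hypothetical sketch and not part of immuneML; it assumes the design matrix is given as a list of rows, one row per example.

```python
def check_encoded_data(examples, example_ids, feature_names=None, labels=None):
    """Hypothetical sanity check for the arguments passed to EncodedData."""
    n_examples = len(examples)

    # Every row of the design matrix needs exactly one identifier
    if len(example_ids) != n_examples:
        raise ValueError(f"Expected {n_examples} example ids, found {len(example_ids)}")

    # Every column of the design matrix needs exactly one feature name
    if feature_names is not None:
        for row in examples:
            if len(row) != len(feature_names):
                raise ValueError(f"Expected {len(feature_names)} features per example, found {len(row)}")

    # Every label needs exactly one value per example
    if labels is not None:
        for label_name, values in labels.items():
            if len(values) != n_examples:
                raise ValueError(f"Label '{label_name}' has {len(values)} values for {n_examples} examples")

    return True


examples = [[0.1, 0.9], [0.4, 0.2], [0.7, 0.3]]  # 3 examples, 2 features
print(check_encoded_data(examples,
                         example_ids=["rep1", "rep2", "rep3"],
                         feature_names=["random_number_0", "random_number_1"],
                         labels={"disease1": ["positive", "positive", "negative"]}))
```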
Caching intermediate results#
To prevent recomputing the same result a second time, immuneML uses caching. Caching can be applied to methods which compute an (intermediate) result. The result is stored to a file, and when the same method call is made, the previously stored result is retrieved from the file and returned.
We recommend applying caching to methods which are computationally expensive and may be called multiple times in the same way. For example, encoders are a good target for caching as they may take long to compute and can be called multiple times on the same data when combined with different ML methods. But ML methods typically do not require caching, as you would want to apply ML methods with different parameters or to differently encoded data.
Any method call in immuneML can be cached as follows:
result = CacheHandler.memo_by_params(params = cache_params, fn = lambda: my_method_for_caching(my_method_param1, my_method_param2, ...))
The CacheHandler.memo_by_params method does the following:

1. Using the caching parameters, a unique cache key (a hash string) is created.
2. CacheHandler checks if there already exists a previously computed result that is associated with this key.
3. If the result exists, the result is returned without (re)computing the method.
4. If the result does not exist, the method is computed, its result is stored using the cache key, and the result is returned.

The lambda function call simply calls the method to be cached, using any required parameters.
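The steps above can be sketched with a simplified stand-in for the real method. The function below is not the immuneML implementation: the cache directory, the use of SHA-256 over repr(params) and the pickle file format are all assumptions made for this illustration.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp())  # hypothetical cache location for this sketch


def memo_by_params(params: tuple, fn):
    """Simplified stand-in for CacheHandler.memo_by_params."""
    # Step 1: derive a deterministic key from the caching parameters
    cache_key = hashlib.sha256(repr(params).encode("utf-8")).hexdigest()
    cache_file = CACHE_DIR / f"{cache_key}.pickle"

    # Steps 2 and 3: return the stored result if this key was seen before
    if cache_file.is_file():
        with cache_file.open("rb") as file:
            return pickle.load(file)

    # Step 4: otherwise compute the result, store it under the key and return it
    result = fn()
    with cache_file.open("wb") as file:
        pickle.dump(result, file)
    return result


calls = []  # records how often the expensive function really runs


def expensive_square(x):
    calls.append(x)
    return x * x


first = memo_by_params((("x", 4),), lambda: expensive_square(4))
second = memo_by_params((("x", 4),), lambda: expensive_square(4))
print(first, second, len(calls))
```

The lambda delays the computation, so the expensive function only runs when no stored result is found for the key.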
The cache_params argument represents the unique, immutable parameters used to compute the cache key.
It should have the following properties:

- It must be a nested tuple containing only immutable items such as strings, booleans and integers. It cannot contain mutable items like lists, dictionaries, sets and objects (they all need to be converted to nested tuples of immutable items).
- It should include every factor that can contribute to a difference in the results of the computed method. For example, when caching the encode_data step, the following should be included:
  - dataset descriptors (dataset id, example ids, dataset type),
  - encoding name,
  - labels,
  - EncoderParams.learn_model if used,
  - all relevant input parameters to the encoder. These are preferably retrieved automatically (such as by vars(self)), as this ensures that if new parameters are added to the encoder, they are always added to the caching params.
For example, OneHotEncoder computes its caching parameters as follows:

def _prepare_caching_params(self, dataset, params: EncoderParams):
    return (("dataset_identifier", dataset.identifier),
            ("example_identifiers", tuple(dataset.get_example_ids())),
            ("dataset_type", dataset.__class__.__name__),
            ("encoding", OneHotEncoder.__name__),
            ("labels", tuple(params.label_config.get_labels_by_name())),
            ("encoding_params", tuple(vars(self).items())))
The construction of caching parameters must be done carefully, as caching bugs are extremely difficult to discover. Rather add ‘too much’ information than too little. A missing parameter will not lead to an error, but can result in silently copying over results from previous method calls.
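When encoder parameters contain lists, dictionaries or sets, they have to be converted before they can be part of the caching parameters. The helper to_immutable below is a hypothetical sketch of such a conversion and is not part of immuneML; it assumes dictionary keys and set items are sortable.

```python
def to_immutable(value):
    """Hypothetical helper: recursively convert containers to nested tuples."""
    if isinstance(value, dict):
        # Sort by key so that insertion order does not change the result
        return tuple((key, to_immutable(item)) for key, item in sorted(value.items()))
    if isinstance(value, (set, frozenset)):
        # Sets have no order, so sort the converted items
        return tuple(sorted(to_immutable(item) for item in value))
    if isinstance(value, (list, tuple)):
        return tuple(to_immutable(item) for item in value)
    if isinstance(value, (str, int, float, bool, type(None))):
        return value
    # Arbitrary objects are rejected rather than silently cached by identity
    raise TypeError(f"Cannot use object of type {type(value).__name__} in caching parameters")


encoder_params = {"random_seed": 1, "embedding_len": 5, "genes": ["v", "j"]}
cache_params = (("encoding", "SillyEncoder"),
                ("encoding_params", to_immutable(encoder_params)))
print(cache_params)
```

Raising an error for unknown object types follows the advice above: a loud failure is preferable to a cache key that silently ignores a relevant parameter.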
Class documentation standards for encodings#
Class documentation should be added as a docstring to all new Encoder, MLMethod, Report or Preprocessing classes. The class docstrings are used to automatically generate the documentation web pages, using Sphinx reStructuredText, and should adhere to a standard format:
1. A short, general description of the functionality.

2. Optional extended description, including any references or specific cases that should be considered. For instance: if a class can only be used for a particular dataset type. Compatibility between Encoders, MLMethods and Reports should also be described.

3. A list of arguments, when applicable. This should follow the format below:

   **Specification arguments:**

   - parameter_name (type): a short description

   - other_parameter_name (type): a short description

4. A YAML snippet, to show an example of how the new component should be called. Make sure to test your YAML snippet in an immuneML run to ensure it is specified correctly. The following formatting should be used to ensure the YAML snippet is rendered correctly:

   **YAML specification:**

   .. indent with spaces
   .. code-block:: yaml

       definitions:
           yaml_keyword: # could be encodings/ml_methods/reports/etc...
               my_new_class:
                   MyNewClass:
                       parameter_name: 0
                       other_parameter_name: 1
A full example of DatasetEncoder class documentation:

This SillyEncoder class is a placeholder for a real encoder.
It computes a set of random numbers as features for a given dataset.

**Specification arguments:**

- random_seed (int): The random seed for generating random features.

- embedding_len (int): The number of random features to generate per example.


**YAML specification:**

.. indent with spaces
.. code-block:: yaml

    definitions:
        encodings:
            my_silly_encoder:
                Silly:
                    random_seed: 1
                    embedding_len: 5