How to add a new machine learning method

Adding an example classifier to the immuneML codebase

This tutorial describes how to add a new MLMethod class to immuneML, using a simple example classifier. We highly recommend completing this tutorial to get a better understanding of the immuneML interfaces before continuing to implement your own classifier.

Step-by-step tutorial

For this tutorial, we provide a SillyClassifier (download here or view below) that can be used to test adding a new MLMethod file to immuneML. This method ignores the input dataset and makes a random prediction for each example.

SillyClassifier.py

import copy
import yaml
import numpy as np
from pathlib import Path

from immuneML.environment.Label import Label
from immuneML.util.PathBuilder import PathBuilder
from immuneML.ml_methods.classifiers.MLMethod import MLMethod
from immuneML.data_model.encoded_data.EncodedData import EncodedData


class SillyClassifier(MLMethod):
    """
    This SillyClassifier is a placeholder for a real ML method.
    It generates random predictions ignoring the input features.

    **Specification arguments:**

    - random_seed (int): The random seed for generating random predictions.


    **YAML specification:**

    .. indent with spaces
    .. code-block:: yaml

        definitions:
            ml_methods:
                my_silly_method:
                    SillyClassifier:
                        random_seed: 100

    """
    def __init__(self, random_seed: int = None):
        super().__init__()
        self.random_seed = random_seed
        self.silly_model_fitted = False

    def _fit(self, encoded_data: EncodedData, cores_for_training: int = 2):
        # Since the silly classifier makes random predictions and ignores training data, no model is fitted during training.
        # For any other method, model fitting should be implemented here.
        self.silly_model_fitted = True

    def _predict_proba(self, encoded_data: EncodedData):
        np.random.seed(self.random_seed)

        # Generate an array containing a random prediction probability for each example
        pred_probabilities = np.random.rand(len(encoded_data.examples))

        return {self.label.name: {self.label.positive_class: pred_probabilities,
                                  self.label.get_binary_negative_class(): 1 - pred_probabilities}}

    def _predict(self, encoded_data: EncodedData):
        predictions_proba = self.predict_proba(encoded_data, self.label)
        proba_positive_class = predictions_proba[self.label.name][self.label.positive_class]

        predictions = []

        for proba in proba_positive_class:
            if proba > 0.5:
                predictions.append(self.label.positive_class)
            else:
                predictions.append(self.label.get_binary_negative_class())

        # Shorter alternative using class mapping:
        # return {self.label.name: np.array([self.class_mapping[val] for val in (proba_positive_class > 0.5).tolist()])}

        return {self.label.name: np.array(predictions)}

    def can_predict_proba(self) -> bool:
        return True

    def can_fit_with_example_weights(self) -> bool:
        return False

    def get_compatible_encoders(self):
        # Every encoder that is compatible with the ML method should be listed here.
        # The SillyClassifier can in principle be used with any encoder; a few examples are listed below.
        from immuneML.encodings.abundance_encoding.SequenceAbundanceEncoder import SequenceAbundanceEncoder
        from immuneML.encodings.abundance_encoding.KmerAbundanceEncoder import KmerAbundanceEncoder
        from immuneML.encodings.atchley_kmer_encoding.AtchleyKmerEncoder import AtchleyKmerEncoder
        from immuneML.encodings.distance_encoding.DistanceEncoder import DistanceEncoder
        from immuneML.encodings.evenness_profile.EvennessProfileEncoder import EvennessProfileEncoder
        from immuneML.encodings.kmer_frequency.KmerFrequencyEncoder import KmerFrequencyEncoder
        from immuneML.encodings.motif_encoding.MotifEncoder import MotifEncoder
        from immuneML.encodings.onehot.OneHotEncoder import OneHotEncoder

        return [SequenceAbundanceEncoder, KmerAbundanceEncoder, DistanceEncoder, EvennessProfileEncoder,
                AtchleyKmerEncoder, KmerFrequencyEncoder, MotifEncoder, OneHotEncoder]

    def store(self, path: Path):
        # The most basic way of storing a model is to get the parameters in a yaml-friendly format (get_params)
        # and store this in a file.
        # Depending on the method, more files (e.g., internal pickle, pytorch or keras files) may need to be stored.
        # The 'store' method should be compatible with 'load'
        PathBuilder.build(path)
        class_parameters = self.get_params()
        params_path = path / "custom_params.yaml"

        with params_path.open('w') as file:
            yaml.dump(class_parameters, file)

    def get_params(self) -> dict:
        # Returns a yaml-friendly dictionary (only simple types, no objects) with all parameters of this ML method
        params = copy.deepcopy(vars(self))

        if self.label:
            # the 'Label' object must be converted to a yaml-friendly version
            params["label"] = self.label.get_desc_for_storage()

        return params

    def load(self, path: Path):
        # The 'load' method is called on a new (untrained) object of the specific MLMethod class.
        # This method is used to load parameters from a previously trained model from storage,
        # thus creating a copy of the original trained model

        # Load the dictionary of parameters from the YAML file
        params_path = path / "custom_params.yaml"

        with params_path.open("r") as file:
            custom_params = yaml.load(file, Loader=yaml.SafeLoader)

        # Loop through the dictionary and set each parameter
        for param, value in custom_params.items():
            if hasattr(self, param):
                if param == "label":
                    # Special case: if the parameter is 'label', convert to a Label object
                    setattr(self, "label", Label(**value))
                else:
                    # Other cases: directly set the parameter to the given value
                    setattr(self, param, value)

Add a new class to the immuneML.ml_methods.classifiers package. The new class should inherit from the base class MLMethod.
If the ML method has any default parameters, they should be added in a default parameters YAML file in the folder config/default_params/ml_methods. The file is discovered automatically from the class name: the name is converted to snake case and the suffix ‘_params.yaml’ is added. For the SillyClassifier, this is silly_classifier_params.yaml, which could for example contain the following:
```
random_seed: 1
```
In rare cases where classes have unconventional names that do not convert cleanly from CamelCase to snake case (e.g., MiXCR, VDJdb), this needs to be accounted for in convert_to_snake_case().
Use the automated script check_new_ml_method.py to test the newly added ML method. This script will throw errors or warnings if the MLMethod class implementation is incorrect.
- Note: this script will try running the new classifier with a random EncodedData object (a matrix of random numbers), which may not be compatible with your particular MLMethod. You may overwrite the function get_example_encoded_data() to supply a custom EncodedData object which meets the requirements of your MLMethod.
Example command to test the SillyClassifier:
```
python3 ./scripts/check_new_ml_method.py -m ./immuneML/ml_methods/classifiers/SillyClassifier.py
```
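The file-name lookup described in the steps above can be illustrated with a small stand-alone sketch. The helper names `camel_to_snake` and `default_params_filename` are hypothetical, and the regular expressions are an assumption; immuneML's own convert_to_snake_case() may behave differently, for instance by special-casing names such as MiXCR and VDJdb.

```python
import re


def camel_to_snake(name: str) -> str:
    # ASSUMPTION: a generic CamelCase splitter, not immuneML's convert_to_snake_case()
    # Insert "_" between a lowercase letter or digit and a following capital
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Insert "_" before the last capital of an acronym that is followed by lowercase
    name = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", "_", name)
    return name.lower()


def default_params_filename(class_name: str) -> str:
    # Class name in snake case plus the '_params.yaml' suffix
    return f"{camel_to_snake(class_name)}_params.yaml"


print(default_params_filename("SillyClassifier"))  # silly_classifier_params.yaml
print(camel_to_snake("MiXCR"))                     # mi_xcr
print(camel_to_snake("VDJdb"))                     # vd_jdb
```

The last two calls show why unconventional names need special handling: a generic splitter turns them into awkward names that are unlikely to match an intended file name.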

Test running the new ML method with a YAML specification

If you want to use immuneML directly to test run your ML method, the YAML example below may be used. This example analysis encodes a random dataset using k-mer encoding, trains and compares the performance of two silly classifiers which were initialised with different random seeds, and shows the results in a report. Note that when you test your own classifier, a compatible encoding must be used.

test_run_silly_classifier.yaml

definitions:
  datasets:
    my_dataset:
      format: RandomSequenceDataset
      params:
        sequence_count: 100
        labels:
          binds_epitope:
            True: 0.6
            False: 0.4

  encodings:
    my_encoding:
      KmerFrequency:
        k: 3

  ml_methods:
    my_first_silly_classifier:
      SillyClassifier:
        random_seed: 1
    my_second_silly_classifier:
      SillyClassifier:
        random_seed: 2

  reports:
    my_training_performance: TrainingPerformance
    my_settings_performance: MLSettingsPerformance

instructions:
  my_instruction:
    type: TrainMLModel

    dataset: my_dataset
    labels:
    - binds_epitope

    settings:
    - encoding: my_encoding
      ml_method: my_first_silly_classifier
    - encoding: my_encoding
      ml_method: my_second_silly_classifier

    assessment:
      split_strategy: random
      split_count: 1
      training_percentage: 0.7
      reports:
        models: [my_training_performance]
    selection:
      split_strategy: random
      split_count: 1
      training_percentage: 0.7

    optimization_metric: balanced_accuracy
    reports: [my_settings_performance]

Adding a Unit test for an MLMethod

Add a unit test for the new SillyClassifier (download the example test file or view below).

test_sillyClassifier.py

import os
import shutil
import numpy as np
from unittest import TestCase

from immuneML.environment.Label import Label
from immuneML.caching.CacheType import CacheType
from immuneML.util.PathBuilder import PathBuilder
from immuneML.environment.Constants import Constants
from immuneML.data_model.encoded_data.EncodedData import EncodedData
from immuneML.environment.EnvironmentSettings import EnvironmentSettings
from immuneML.ml_methods.classifiers.SillyClassifier import SillyClassifier


class TestSillyClassifier(TestCase):

    def setUp(self) -> None:
        os.environ[Constants.CACHE_TYPE] = CacheType.TEST.name

    def get_enc_data(self):
        # Creates a mock encoded data object with 8 examples (fixed values, not randomly generated)
        enc_data = EncodedData(examples=np.array([[1, 0, 0],
                                                  [0, 1, 1],
                                                  [1, 1, 1],
                                                  [0, 1, 1],
                                                  [1, 0, 0],
                                                  [0, 1, 1],
                                                  [1, 1, 1],
                                                  [0, 1, 1]]),
                               example_ids=list(range(8)),
                               feature_names=["a", "b", "c"],
                               labels={"my_label": ["yes", "no", "yes", "no", "yes", "no", "yes", "no"]},
                               encoding="random")

        label = Label(name="my_label", values=["yes", "no"], positive_class="yes")

        return enc_data, label

    def test_predictions(self):
        enc_data, label = self.get_enc_data()
        classifier = SillyClassifier(random_seed=50)

        # test fitting
        classifier.fit(enc_data, label)
        self.assertTrue(classifier.silly_model_fitted)

        # test 'predict'
        predictions = classifier.predict(enc_data, label)
        self.assertEqual(len(predictions[label.name]), len(enc_data.examples))

        # test 'predict_proba'
        prediction_probabilities = classifier.predict_proba(enc_data, label)
        self.assertEqual(len(prediction_probabilities[label.name][label.positive_class]), len(enc_data.examples))
        self.assertEqual(len(prediction_probabilities[label.name][label.get_binary_negative_class()]), len(enc_data.examples))
        self.assertTrue(all(0 <= pred <= 1 for pred in prediction_probabilities[label.name][label.positive_class]))
        self.assertTrue(all(0 <= pred <= 1 for pred in prediction_probabilities[label.name][label.get_binary_negative_class()]))

    def test_store_and_load(self):
        path = PathBuilder.build(EnvironmentSettings.tmp_test_path / "silly")
        enc_data, label = self.get_enc_data()
        classifier = SillyClassifier(random_seed=50)
        classifier.fit(enc_data, label)
        classifier.store(path)

        # Loading should be done in an 'empty' model (no parameters)
        classifier2 = SillyClassifier()
        classifier2.load(path)

        self.assertEqual(classifier.get_params(), classifier2.get_params())
        shutil.rmtree(path)

Add a new file to the test.ml_methods package named test_sillyClassifier.py.
Add a class TestSillyClassifier that inherits unittest.TestCase to the new file.
Add a function setUp() to set up cache used for testing. This should ensure that the cache location will be set to EnvironmentSettings.tmp_test_path / "cache/"
Define one or more tests for the class and functions you implemented.
- It is recommended to at least test fitting, prediction and storing/loading of the model.
- Mock data is typically used to test new classes.
- If you need to write data to a path (for example test datasets or results), use the following location: EnvironmentSettings.tmp_test_path / "some_unique_foldername"
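The store/load round-trip test recommended above follows a general pattern that can be tried without immuneML. The sketch below is a minimal illustration under stated assumptions: `ToyModel` is a hypothetical stand-in for an MLMethod, JSON replaces YAML so that only the standard library is needed, and a temporary folder replaces EnvironmentSettings.tmp_test_path.

```python
import json
import shutil
import tempfile
import unittest
from pathlib import Path


class ToyModel:
    """Hypothetical stand-in for an MLMethod; stores parameters as JSON, not YAML."""

    def __init__(self, random_seed=None):
        self.random_seed = random_seed
        self.fitted = False

    def fit(self):
        self.fitted = True

    def get_params(self) -> dict:
        return dict(vars(self))

    def store(self, path: Path):
        path.mkdir(parents=True, exist_ok=True)
        (path / "params.json").write_text(json.dumps(self.get_params()))

    def load(self, path: Path):
        stored = json.loads((path / "params.json").read_text())
        for name, value in stored.items():
            if hasattr(self, name):
                setattr(self, name, value)


class TestToyModel(unittest.TestCase):

    def setUp(self):
        # Each test gets its own uniquely named folder
        self.path = Path(tempfile.mkdtemp(prefix="toy_model_"))

    def tearDown(self):
        # Runs even when an assertion fails, so no files are left behind
        shutil.rmtree(self.path, ignore_errors=True)

    def test_store_and_load(self):
        model = ToyModel(random_seed=50)
        model.fit()
        model.store(self.path)

        # Loading is done into a fresh, untrained object
        restored = ToyModel()
        restored.load(self.path)
        self.assertEqual(model.get_params(), restored.get_params())
```

Saved to a file, this can be run with `python -m unittest`. Cleaning up in tearDown() is a design choice of this sketch; removing the folder at the end of the test method works too, but that line is skipped when an assertion fails.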

Implementing a new classifier

This section describes tips and tricks for implementing your own new MLMethod from scratch. Detailed instructions on how to implement each method, as well as some special cases, can be found in the MLMethod base class.

Note

Coding conventions and tips

- Class names are written in CamelCase.
- Class methods are written in snake_case.
- Abstract base classes MLMethod, DatasetEncoder, and Report define an interface for their inheriting subclasses. These classes contain abstract methods which should be overwritten.
- Class methods starting with _underscore are generally considered “private” methods, only to be called by the class itself. If a method is expected to be called from another class, the method name should not start with an underscore.
- When familiarising yourself with existing code, we recommend focusing on public methods. Private methods are typically specific to a class (internal class-specific calculations), whereas the public methods contain more general functionalities (e.g., returning a main result).
- If your class should have any default parameters, they should be defined in a default parameters file under config/default_params/.
- Some utility classes are available in the util package to provide useful functionalities. For example, ParameterValidator can be used to check user input and generate error messages, or PathBuilder can be used to add and remove folders.
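The relationship between an abstract base class, its public methods and the “private” methods that subclasses overwrite can be sketched as follows. `BaseMethod` and `CountingMethod` are hypothetical, heavily simplified classes written for this illustration; they are not the actual MLMethod source.

```python
from abc import ABC, abstractmethod


class BaseMethod(ABC):
    """Hypothetical, simplified base class (NOT the immuneML MLMethod)."""

    def __init__(self):
        self.label = None

    def fit(self, examples, label):
        # Public entry point: bookkeeping shared by all subclasses lives here
        self.label = label
        self._fit(examples)

    @abstractmethod
    def _fit(self, examples):
        """'Private' hook that every subclass must overwrite."""


class CountingMethod(BaseMethod):
    """Toy subclass that only remembers how many examples it has seen."""

    def __init__(self):
        super().__init__()
        self.n_examples_seen = 0

    def _fit(self, examples):
        self.n_examples_seen = len(examples)


method = CountingMethod()
method.fit([[1, 0], [0, 1]], label="my_label")  # callers use fit(), never _fit()
```

Because `_fit` is abstract, Python refuses to instantiate `BaseMethod` directly or any subclass that forgets to overwrite it, which surfaces an incomplete implementation early.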

Developing a method outside immuneML with a sample design matrix

The initial development of the new ML method need not take place within immuneML. immuneML can be used to encode and export an example design matrix using the DesignMatrixExporter report with an appropriate encoding in the ExploratoryAnalysis instruction. The method can then be developed and debugged separately, and afterwards be integrated into the platform.

The following YAML example shows how to generate some random example data (detailed description here), encode it using a k-mer encoding and export the design matrix to .csv format. Note that for design matrices beyond 2 dimensions (such as OneHotEncoder with flatten = False), the matrix is exported as a .npy file instead of a .csv file.

export_design_matrix.yaml

definitions:
  datasets:
    my_simulated_data:
      format: RandomRepertoireDataset
      params:
        repertoire_count: 5 # a dataset with 5 repertoires
        sequence_count_probabilities: # each repertoire has 10 sequences
          10: 1
        sequence_length_probabilities: # each sequence has length 15
          15: 1
        labels:
          my_label: # half of the repertoires have my_label = true, the rest have false
            false: 0.5
            true: 0.5
  encodings:
    my_3mer_encoding:
      KmerFrequency:
        k: 3
  reports:
    my_design_matrix:
      DesignMatrixExporter:
        name: my_design_matrix
instructions:
  my_instruction:
    type: ExploratoryAnalysis
    analyses:
      my_analysis:
        dataset: my_simulated_data
        encoding: my_3mer_encoding
        labels:
        - my_label
        report: my_design_matrix

The resulting design matrix can be found in the sub-folder my_instruction/analysis_my_analysis/report/design_matrix.csv, and the true classes for each repertoire can be found in labels.csv. To load files into an EncodedData object, the function immuneML.dev_util.util.load_encoded_data can be used.
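When developing outside immuneML, the exported files can also be read with the Python standard library alone. The sketch below assumes that design_matrix.csv has one header row with feature names followed by one numeric row per example, and that labels.csv has one header row with label names; this layout is an assumption, so check the exported files before relying on it. The helpers `load_matrix` and `load_labels` are hypothetical, and the demo writes two tiny stand-in files rather than using real immuneML output.

```python
import csv
import tempfile
from pathlib import Path


def load_matrix(path: Path):
    """Returns (feature_names, rows); ASSUMES a single header row of feature names."""
    with path.open(newline="") as file:
        reader = csv.reader(file)
        feature_names = next(reader)
        rows = [[float(value) for value in row] for row in reader]
    return feature_names, rows


def load_labels(path: Path):
    """Returns {label_name: [values]}; ASSUMES a single header row of label names."""
    with path.open(newline="") as file:
        records = list(csv.DictReader(file))
    if not records:
        return {}
    return {name: [record[name] for record in records] for name in records[0]}


with tempfile.TemporaryDirectory() as tmp:
    report_dir = Path(tmp)
    # Tiny stand-in files with made-up values, only to make the sketch runnable
    (report_dir / "design_matrix.csv").write_text("AAA,AAC,ACA\n0.5,0.0,0.5\n0.0,1.0,0.0\n")
    (report_dir / "labels.csv").write_text("my_label\nTrue\nFalse\n")

    feature_names, examples = load_matrix(report_dir / "design_matrix.csv")
    labels = load_labels(report_dir / "labels.csv")
```

Label values are read as strings here; convert them as needed for the method under development.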

Input and output for the fit() and predict() methods

Inside immuneML, the design matrix is passed to an MLMethod wrapped in an EncodedData object. This is the main input to the fitting and prediction methods. Additional inputs to the MLMethod during fitting are set in MLMethod._initialize_fit().

The EncodedData object contains the following fields:

EncodedData:

examples: a design matrix where the rows represent Repertoires, Receptors or Sequences (‘examples’), and the columns the encoding-specific features. This is typically a numpy matrix, but may also be another matrix type (e.g., scipy sparse matrix, pytorch tensor, pandas dataframe).

encoding: a string denoting the encoder base class that was used.

labels: a dictionary of labels, where each label is a key, and the values are the label values across the examples (for example: {disease1: [positive, positive, negative]} if there are 3 repertoires). This parameter should be set only if EncoderParams.encode_labels is True, otherwise it should be set to None. This can be created by calling utility function EncoderHelper.encode_dataset_labels().

example_ids: a list of identifiers for the examples (Repertoires, Receptors or Sequences). This can be retrieved using Dataset.get_example_ids().

feature_names: a list of feature names, i.e., the names given to the encoding-specific features. When included, the list must be as long as the number of features.

feature_annotations: an optional pandas dataframe with additional information about the features. When included, the number of rows in this dataframe must correspond to the number of features. This parameter is not typically used.

info: an optional dictionary that may be used to store any additional information that is relevant (for example paths to additional output files). This parameter is not typically used.
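The shape constraints between these fields can be made concrete with a small stand-in class. `EncodedDataSketch` is hypothetical and is not the immuneML EncodedData class: it uses plain lists instead of numpy, leaves out feature_annotations to avoid a pandas dependency, and its `validate` method is an illustration only. The feature_names check follows the description above; the checks on example_ids and labels are assumptions about what a consistent object looks like.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class EncodedDataSketch:
    """Hypothetical stand-in mirroring the fields above (NOT the immuneML class)."""
    examples: List[List[float]]
    encoding: str
    labels: Optional[Dict[str, list]] = None
    example_ids: Optional[list] = None
    feature_names: Optional[List[str]] = None
    info: Dict[str, Any] = field(default_factory=dict)

    def validate(self) -> None:
        n_examples = len(self.examples)
        n_features = len(self.examples[0]) if n_examples else 0

        # One name per column of the design matrix
        if self.feature_names is not None and len(self.feature_names) != n_features:
            raise ValueError(f"expected {n_features} feature names, got {len(self.feature_names)}")
        # ASSUMPTION: one identifier per row of the design matrix
        if self.example_ids is not None and len(self.example_ids) != n_examples:
            raise ValueError(f"expected {n_examples} example ids, got {len(self.example_ids)}")
        # ASSUMPTION: one value per example for every label
        for name, values in (self.labels or {}).items():
            if len(values) != n_examples:
                raise ValueError(f"label '{name}' has {len(values)} values for {n_examples} examples")


data = EncodedDataSketch(examples=[[1, 0, 0], [0, 1, 1], [1, 1, 1]],
                         encoding="random",
                         labels={"disease1": ["positive", "positive", "negative"]},
                         example_ids=[0, 1, 2],
                         feature_names=["a", "b", "c"])
data.validate()  # passes silently: 3 examples, 3 features, 3 label values
```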

The output predictions should be formatted the same way as the EncodedData.labels:

{'label_name': ['class1', 'class1', 'class2']}

When predicting probabilities, a nested dictionary should be used to give the probabilities per class:

{'label_name': {'class1': [0.9, 0.8, 0.3],
                'class2': [0.1, 0.2, 0.7]}}
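To show how the two output formats relate, the sketch below derives class predictions from per-class probabilities for a binary label. The function `to_predictions` and the 0.5 threshold are assumptions made for this illustration (the threshold mirrors the SillyClassifier); a real method may derive its predictions differently.

```python
def to_predictions(probabilities: dict, positive_class, negative_class, threshold: float = 0.5) -> dict:
    """Turns {'label': {'class': [p, ...]}} into {'label': ['class', ...]} for a binary label."""
    predictions = {}
    for label_name, per_class in probabilities.items():
        # Only the positive-class probabilities are needed for a binary decision
        predictions[label_name] = [positive_class if proba > threshold else negative_class
                                   for proba in per_class[positive_class]]
    return predictions


probabilities = {"label_name": {"class1": [0.9, 0.8, 0.3],
                                "class2": [0.1, 0.2, 0.7]}}

predictions = to_predictions(probabilities, positive_class="class1", negative_class="class2")
print(predictions)  # {'label_name': ['class1', 'class1', 'class2']}
```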

Adding encoder compatibility to an ML method

Each ML method is only compatible with a limited set of encoders. immuneML automatically checks if the given encoder and ML method are compatible when running the TrainMLModel instruction, and raises an error if they are not compatible. To ensure immuneML recognizes the encoder-ML method compatibility, make sure that the encoder is added to the list of encoder classes returned by the get_compatible_encoders() method of the ML method(s) of interest.
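The kind of check described above can be sketched in a few lines. Everything in this example is hypothetical: the encoder classes, `ToyMethod` and `check_compatibility` only illustrate how a list returned by get_compatible_encoders() might be used, and the actual check inside immuneML may be implemented differently.

```python
class KmerEncoderSketch:
    """Hypothetical encoder class used for illustration."""


class OneHotEncoderSketch:
    """Hypothetical encoder class used for illustration."""


class ToyMethod:
    """Hypothetical ML method that declares a single compatible encoder."""

    def get_compatible_encoders(self):
        return [KmerEncoderSketch]


def check_compatibility(encoder, ml_method) -> None:
    # ASSUMPTION: compatibility means the encoder is an instance of a listed class
    compatible = tuple(ml_method.get_compatible_encoders())
    if not isinstance(encoder, compatible):
        allowed = ", ".join(cls.__name__ for cls in compatible)
        raise ValueError(f"{type(encoder).__name__} is not compatible with "
                         f"{type(ml_method).__name__}; compatible encoders: {allowed}")


check_compatibility(KmerEncoderSketch(), ToyMethod())  # passes silently

try:
    check_compatibility(OneHotEncoderSketch(), ToyMethod())
except ValueError as error:
    print(error)
```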

Implementing fitting through cross-validation

By default, models in immuneML are fitted through nested cross-validation. This allows for both hyperparameter selection and model comparison. immuneML also allows for the implementation of a third level of k-fold cross-validation for hyperparameter selection within the ML model (model_selection_cv in the YAML specification). This can be useful when a large number or range of hyperparameters is typically considered (e.g., regularisation parameters in logistic regression). Such additional cross-validation should be implemented inside the method _fit_by_cross_validation. The result should be that a single model (with optimal hyperparameters) is saved in the MLMethod object. See SklearnMethod for a detailed example. Note: this is an advanced feature, which most ML methods do not need to implement.
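The idea of an internal k-fold cross-validation that ends with one refitted model can be sketched without immuneML. `KnnSketch` is a hypothetical one-dimensional k-nearest-neighbours classifier written for this illustration, and the simplified signature of `_fit_by_cross_validation` (including the `number_of_splits` argument) is an assumption; see the MLMethod base class and SklearnMethod for the real interface.

```python
def k_fold_indices(n_examples: int, k: int):
    """Yields (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_examples) if i < start or i >= start + size]
        yield train, validation
        start += size


def knn_predict(train_x, train_y, x, n_neighbors):
    # Majority vote among the n_neighbors closest training points
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:n_neighbors]
    votes = [train_y[i] for i in nearest]
    return max(sorted(set(votes)), key=votes.count)


def accuracy(true_y, predicted_y):
    return sum(t == p for t, p in zip(true_y, predicted_y)) / len(true_y)


class KnnSketch:
    """Hypothetical classifier; n_neighbors is the hyperparameter being selected."""

    def __init__(self, candidate_neighbors=(1, 3, 5)):
        self.candidate_neighbors = candidate_neighbors
        self.n_neighbors = None
        self.train_x, self.train_y = None, None

    def _fit(self, x, y, n_neighbors):
        self.n_neighbors, self.train_x, self.train_y = n_neighbors, list(x), list(y)

    def _fit_by_cross_validation(self, x, y, number_of_splits=3):
        scores = {}
        for n_neighbors in self.candidate_neighbors:
            fold_scores = []
            for train, validation in k_fold_indices(len(x), number_of_splits):
                train_x, train_y = [x[i] for i in train], [y[i] for i in train]
                predicted = [knn_predict(train_x, train_y, x[i], n_neighbors) for i in validation]
                fold_scores.append(accuracy([y[i] for i in validation], predicted))
            scores[n_neighbors] = sum(fold_scores) / len(fold_scores)

        # Refit on all data with the best hyperparameter, so a single model is kept
        self._fit(x, y, max(scores, key=scores.get))
        return scores

    def predict(self, x):
        return [knn_predict(self.train_x, self.train_y, value, self.n_neighbors) for value in x]


# Made-up toy data: negatives lie near 0, positives near 1
x = [0.0, 1.0, 0.1, 0.9, 0.2, 0.8, 0.15, 0.85, 0.05, 0.95, 0.12, 0.88]
y = ["neg", "pos"] * 6

model = KnnSketch()
cv_scores = model._fit_by_cross_validation(x, y)
```

After the call, `model` holds exactly one fitted configuration, which matches the requirement that a single model with the selected hyperparameters is stored in the object.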

Class documentation standards for ML methods

Class documentation should be added as a docstring to all new Encoder, MLMethod, Report or Preprocessing classes. The class docstrings are used to automatically generate the documentation web pages, using Sphinx reStructuredText, and should adhere to a standard format:

A short, general description of the functionality
Optional extended description, including any references or specific cases that should be considered. For instance: if a class can only be used for a particular dataset type. Compatibility between Encoders, MLMethods and Reports should also be described.

For encoders, the appropriate dataset type(s). For example:

**Dataset type:**

- SequenceDatasets

- RepertoireDatasets

A list of arguments, when applicable. This should follow the format below:

**Specification arguments:**

- parameter_name (type): a short description

- other_parameter_name (type): a short description

A YAML snippet, to show an example of how the new component should be called. Make sure to test your YAML snippet in an immuneML run to ensure it is specified correctly. The following formatting should be used to ensure the YAML snippet is rendered correctly:

**YAML specification:**

.. indent with spaces
.. code-block:: yaml

    definitions:
        yaml_keyword: # could be encodings/ml_methods/reports/etc...
            my_new_class:
                MyNewClass:
                    parameter_name: 0
                    other_parameter_name: 1

A full example of MLMethod class documentation:

This SillyClassifier is a placeholder for a real ML method.
It generates random predictions ignoring the input features.


**Specification arguments:**

- random_seed (int): The random seed for generating random predictions.


**YAML specification:**

.. indent with spaces
.. code-block:: yaml

    definitions:
        ml_methods:
            my_silly_method:
                SillyClassifier:
                    random_seed: 100