How to add a new machine learning method¶
Adding an example classifier to the immuneML codebase¶
This tutorial describes how to add a new MLMethod class to immuneML, using a simple example classifier.
We highly recommend completing this tutorial to get a better understanding of the immuneML interfaces before continuing to implement your own classifier.
Step-by-step tutorial¶
For this tutorial, we provide a SillyClassifier (download here or view below), in order to test adding a new MLMethod file to immuneML.
This method ignores the input dataset and makes a random prediction per example.
SillyClassifier.py
import copy
import yaml
import numpy as np
from pathlib import Path

from immuneML.environment.Label import Label
from immuneML.util.PathBuilder import PathBuilder
from immuneML.ml_methods.classifiers.MLMethod import MLMethod
from immuneML.data_model.encoded_data.EncodedData import EncodedData


class SillyClassifier(MLMethod):
    """
    This SillyClassifier is a placeholder for a real ML method.
    It generates random predictions ignoring the input features.

    **Specification arguments:**

    - random_seed (int): The random seed for generating random predictions.

    **YAML specification:**

    .. indent with spaces
    .. code-block:: yaml

        definitions:
            ml_methods:
                my_silly_method:
                    SillyClassifier:
                        random_seed: 100
    """

    def __init__(self, random_seed: int = None):
        super().__init__()
        self.random_seed = random_seed
        self.silly_model_fitted = False

    def _fit(self, encoded_data: EncodedData, cores_for_training: int = 2):
        # Since the silly classifier makes random predictions and ignores training data, no model is fitted during training.
        # For any other method, model fitting should be implemented here.
        self.silly_model_fitted = True

    def _predict_proba(self, encoded_data: EncodedData):
        np.random.seed(self.random_seed)

        # Generate an array containing a random prediction probability for each example
        pred_probabilities = np.random.rand(len(encoded_data.examples))

        return {self.label.name: {self.label.positive_class: pred_probabilities,
                                  self.label.get_binary_negative_class(): 1 - pred_probabilities}}

    def _predict(self, encoded_data: EncodedData):
        predictions_proba = self.predict_proba(encoded_data, self.label)
        proba_positive_class = predictions_proba[self.label.name][self.label.positive_class]

        predictions = []
        for proba in proba_positive_class:
            if proba > 0.5:
                predictions.append(self.label.positive_class)
            else:
                predictions.append(self.label.get_binary_negative_class())

        # Shorter alternative using class mapping:
        # return {self.label.name: np.array([self.class_mapping[val] for val in (proba_positive_class > 0.5).tolist()])}

        return {self.label.name: np.array(predictions)}

    def can_predict_proba(self) -> bool:
        return True

    def can_fit_with_example_weights(self) -> bool:
        return False

    def get_compatible_encoders(self):
        # Every encoder that is compatible with the ML method should be listed here.
        # The SillyClassifier can in principle be used with any encoder, few examples are listed
        from immuneML.encodings.abundance_encoding.SequenceAbundanceEncoder import SequenceAbundanceEncoder
        from immuneML.encodings.abundance_encoding.KmerAbundanceEncoder import KmerAbundanceEncoder
        from immuneML.encodings.atchley_kmer_encoding.AtchleyKmerEncoder import AtchleyKmerEncoder
        from immuneML.encodings.distance_encoding.DistanceEncoder import DistanceEncoder
        from immuneML.encodings.evenness_profile.EvennessProfileEncoder import EvennessProfileEncoder
        from immuneML.encodings.kmer_frequency.KmerFrequencyEncoder import KmerFrequencyEncoder
        from immuneML.encodings.motif_encoding.MotifEncoder import MotifEncoder
        from immuneML.encodings.onehot.OneHotEncoder import OneHotEncoder

        return [SequenceAbundanceEncoder, KmerAbundanceEncoder, DistanceEncoder, EvennessProfileEncoder,
                AtchleyKmerEncoder, KmerFrequencyEncoder, MotifEncoder, OneHotEncoder]

    def store(self, path: Path):
        # The most basic way of storing a model is to get the parameters in a yaml-friendly format (get_params)
        # and store this in a file.
        # Depending on the method, more files (e.g., internal pickle, pytorch or keras files) may need to be stored.
        # The 'store' method should be compatible with 'load'
        PathBuilder.build(path)

        class_parameters = self.get_params()
        params_path = path / "custom_params.yaml"

        with params_path.open('w') as file:
            yaml.dump(class_parameters, file)

    def get_params(self) -> dict:
        # Returns a yaml-friendly dictionary (only simple types, no objects) with all parameters of this ML method
        params = copy.deepcopy(vars(self))

        if self.label:
            # the 'Label' object must be converted to a yaml-friendly version
            params["label"] = self.label.get_desc_for_storage()

        return params

    def load(self, path: Path):
        # The 'load' method is called on a new (untrained) object of the specific MLMethod class.
        # This method is used to load parameters from a previously trained model from storage,
        # thus creating a copy of the original trained model

        # Load the dictionary of parameters from the YAML file
        params_path = path / "custom_params.yaml"
        with params_path.open("r") as file:
            custom_params = yaml.load(file, Loader=yaml.SafeLoader)

        # Loop through the dictionary and set each parameter
        for param, value in custom_params.items():
            if hasattr(self, param):
                if param == "label":
                    # Special case: if the parameter is 'label', convert to a Label object
                    setattr(self, "label", Label(**value))
                else:
                    # Other cases: directly set the parameter to the given value
                    setattr(self, param, value)
1. Add a new class to the immuneML.ml_methods.classifiers package. The new class should inherit from the base class MLMethod.

2. If the ML method has any default parameters, they should be added in a default parameters YAML file. This file should be added to the folder config/default_params/ml_methods. The default parameters file is automatically discovered based on the name of the class: the class name is converted to snake case and the suffix '_params.yaml' is added. For the SillyClassifier, this is silly_classifier_params.yaml, which could for example contain the following:

   random_seed: 1

   In rare cases where classes have unconventional names that do not translate well to CamelCase (e.g., MiXCR, VDJdb), this needs to be accounted for in convert_to_snake_case().

3. Use the automated script check_new_ml_method.py to test the newly added ML method. This script will throw errors or warnings if the MLMethod class implementation is incorrect.

   Note: this script will try running the new classifier with a random EncodedData object (a matrix of random numbers), which may not be compatible with your particular MLMethod. You may overwrite the function get_example_encoded_data() to supply a custom EncodedData object which meets the requirements of your MLMethod.

   Example command to test the SillyClassifier:

   python3 ./scripts/check_new_ml_method.py -m ./immuneML/ml_methods/classifiers/SillyClassifier.py
Test running the new ML method with a YAML specification¶
If you want to use immuneML directly to test run your ML method, the YAML example below may be used. This example analysis encodes a random dataset using k-mer encoding, trains and compares the performance of two silly classifiers which were initialised with different random seeds, and shows the results in a report. Note that when you test your own classifier, a compatible encoding must be used.
test_run_silly_classifier.yaml
definitions:
    datasets:
        my_dataset:
            format: RandomSequenceDataset
            params:
                sequence_count: 100
                labels:
                    binds_epitope:
                        True: 0.6
                        False: 0.4
    encodings:
        my_encoding:
            KmerFrequency:
                k: 3
    ml_methods:
        my_first_silly_classifier:
            SillyClassifier:
                random_seed: 1
        my_second_silly_classifier:
            SillyClassifier:
                random_seed: 2
    reports:
        my_training_performance: TrainingPerformance
        my_settings_performance: MLSettingsPerformance
instructions:
    my_instruction:
        type: TrainMLModel
        dataset: my_dataset
        labels:
        - binds_epitope
        settings:
        - encoding: my_encoding
          ml_method: my_first_silly_classifier
        - encoding: my_encoding
          ml_method: my_second_silly_classifier
        assessment:
            split_strategy: random
            split_count: 1
            training_percentage: 0.7
            reports:
                models: [my_training_performance]
        selection:
            split_strategy: random
            split_count: 1
            training_percentage: 0.7
        optimization_metric: balanced_accuracy
        reports: [my_settings_performance]
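Assuming immuneML is installed in your environment and the specification above is saved as test_run_silly_classifier.yaml, the analysis can be started from the command line (the name of the output folder is arbitrary):

immune-ml test_run_silly_classifier.yaml silly_classifier_output/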
Adding a Unit test for an MLMethod¶
Add a unit test for the new SillyClassifier (download the example testfile or view below).
test_sillyClassifier.py
import os
import shutil
import numpy as np
from unittest import TestCase

from immuneML.environment.Label import Label
from immuneML.caching.CacheType import CacheType
from immuneML.util.PathBuilder import PathBuilder
from immuneML.environment.Constants import Constants
from immuneML.data_model.encoded_data.EncodedData import EncodedData
from immuneML.environment.EnvironmentSettings import EnvironmentSettings
from immuneML.ml_methods.classifiers.SillyClassifier import SillyClassifier


class TestSillyClassifier(TestCase):

    def setUp(self) -> None:
        os.environ[Constants.CACHE_TYPE] = CacheType.TEST.name

    def get_enc_data(self):
        # Creates a mock encoded data object with 8 random examples
        enc_data = EncodedData(examples=np.array([[1, 0, 0], [0, 1, 1], [1, 1, 1], [0, 1, 1],
                                                  [1, 0, 0], [0, 1, 1], [1, 1, 1], [0, 1, 1]]),
                               example_ids=list(range(8)),
                               feature_names=["a", "b", "c"],
                               labels={"my_label": ["yes", "no", "yes", "no", "yes", "no", "yes", "no"]},
                               encoding="random")

        label = Label(name="my_label", values=["yes", "no"], positive_class="yes")

        return enc_data, label

    def test_predictions(self):
        enc_data, label = self.get_enc_data()
        classifier = SillyClassifier(random_seed=50)

        # test fitting
        classifier.fit(enc_data, label)
        self.assertTrue(classifier.silly_model_fitted)

        # test 'predict'
        predictions = classifier.predict(enc_data, label)
        self.assertEqual(len(predictions[label.name]), len(enc_data.examples))

        # test 'predict_proba'
        prediction_probabilities = classifier.predict_proba(enc_data, label)
        self.assertEqual(len(prediction_probabilities[label.name][label.positive_class]), len(enc_data.examples))
        self.assertEqual(len(prediction_probabilities[label.name][label.get_binary_negative_class()]), len(enc_data.examples))
        self.assertTrue(all(0 <= pred <= 1 for pred in prediction_probabilities[label.name][label.positive_class]))
        self.assertTrue(all(0 <= pred <= 1 for pred in prediction_probabilities[label.name][label.get_binary_negative_class()]))

    def test_store_and_load(self):
        path = PathBuilder.build(EnvironmentSettings.tmp_test_path / "silly")
        enc_data, label = self.get_enc_data()
        classifier = SillyClassifier(random_seed=50)
        classifier.fit(enc_data, label)

        classifier.store(path)

        # Loading should be done in an 'empty' model (no parameters)
        classifier2 = SillyClassifier()
        classifier2.load(path)

        self.assertEqual(classifier.get_params(), classifier2.get_params())

        shutil.rmtree(path)
1. Add a new file to the test.ml_methods package named test_sillyClassifier.py.

2. Add a class TestSillyClassifier that inherits unittest.TestCase to the new file.

3. Add a function setUp() to set up the cache used for testing. This should ensure that the cache location will be set to EnvironmentSettings.tmp_test_path / "cache/".

4. Define one or more tests for the class and functions you implemented. It is recommended to at least test fitting, prediction and storing/loading of the model.

   - Mock data is typically used to test new classes.
   - If you need to write data to a path (for example test datasets or results), use the following location: EnvironmentSettings.tmp_test_path / "some_unique_foldername"
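Once the test file is in place, it can also be run on its own with Python's built-in unittest runner from the project root, for example:

python3 -m unittest test.ml_methods.test_sillyClassifier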
Implementing a new classifier¶
This section describes tips and tricks for implementing your own new MLMethod from scratch.
Detailed instructions on how to implement each method, as well as some special cases, can be found in the MLMethod base class.
Note
Coding conventions and tips
- Class names are written in CamelCase.
- Class methods are written in snake_case.
- Abstract base classes MLMethod, DatasetEncoder, and Report define an interface for their inheriting subclasses. These classes contain abstract methods which should be overwritten.
- Class methods starting with an underscore are generally considered "private" methods, only to be called by the class itself. If a method is expected to be called from another class, the method name should not start with an underscore.
- When familiarising yourself with existing code, we recommend focusing on public methods. Private methods are typically very unique to a class (internal class-specific calculations), whereas the public methods contain more general functionalities (e.g., returning a main result).
- If your class should have any default parameters, they should be defined in a default parameters file under config/default_params/.
- Some utility classes are available in the util package to provide useful functionalities. For example, ParameterValidator can be used to check user input and generate error messages, or PathBuilder can be used to add and remove folders.
Developing a method outside immuneML with a sample design matrix¶
The initial development of the new ML method need not take place within immuneML. immuneML can be used to encode and export an example design matrix using the DesignMatrixExporter report with an appropriate encoding in the ExploratoryAnalysis instruction. The method can then be developed and debugged separately, and afterwards be integrated into the platform.
The following YAML example shows how to generate some random example data (detailed description here),
encode it using a k-mer encoding and export the design matrix to .csv format.
Note that for design matrices beyond 2 dimensions (such as OneHotEncoder
with flatten = False), the matrix is exported as a .npy file instead of a .csv file.
export_design_matrix.yaml
definitions:
    datasets:
        my_simulated_data:
            format: RandomRepertoireDataset
            params:
                repertoire_count: 5 # a dataset with 5 repertoires
                sequence_count_probabilities: # each repertoire has 10 sequences
                    10: 1
                sequence_length_probabilities: # each sequence has length 15
                    15: 1
                labels:
                    my_label: # half of the repertoires has my_label = true, the rest has false
                        false: 0.5
                        true: 0.5
    encodings:
        my_3mer_encoding:
            KmerFrequency:
                k: 3
    reports:
        my_design_matrix:
            DesignMatrixExporter:
                name: my_design_matrix
instructions:
    my_instruction:
        type: ExploratoryAnalysis
        analyses:
            my_analysis:
                dataset: my_simulated_data
                encoding: my_3mer_encoding
                labels:
                - my_label
                report: my_design_matrix
The resulting design matrix can be found in the sub-folder my_instruction/analysis_my_analysis/report/design_matrix.csv,
and the true classes for each repertoire can be found in labels.csv.
To load files into an EncodedData
object, the function immuneML.dev_util.util.load_encoded_data
can be used.
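If you prefer to prototype entirely outside immuneML, the exported files can also be read with standard tooling. The sketch below is only an illustration: the file paths follow the folder structure described above, the label column name (my_label) is taken from the example YAML, and logistic regression is used as a stand-in for the method under development.

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Paths assume the ExploratoryAnalysis output structure described above
design_matrix = pd.read_csv("my_instruction/analysis_my_analysis/report/design_matrix.csv")
labels = pd.read_csv("my_instruction/analysis_my_analysis/report/labels.csv")

X = design_matrix.values        # examples (rows) x encoding-specific features (columns)
y = labels["my_label"].values   # true class per repertoire (assumed column name)

# Any prototype model can be developed and debugged here before porting it to an MLMethod subclass
prototype_model = LogisticRegression()
prototype_model.fit(X, y)
print(prototype_model.predict(X))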
Input and output for the fit() and predict() methods¶
Inside immuneML, the design matrix is passed to an MLMethod wrapped in an EncodedData
object.
This is the main input to the fitting and prediction methods.
Additional inputs to the MLMethod during fitting are set in MLMethod._initialize_fit()
.
The EncodedData
object contains the following fields:
- examples: a design matrix where the rows represent Repertoires, Receptors or Sequences ('examples'), and the columns the encoding-specific features. This is typically a numpy matrix, but may also be another matrix type (e.g., scipy sparse matrix, pytorch tensor, pandas dataframe).
- encoding: a string denoting the encoder base class that was used.
- labels: a dictionary of labels, where each label is a key, and the values are the label values across the examples (for example: {disease1: [positive, positive, negative]} if there are 3 repertoires). This parameter should be set only if EncoderParams.encode_labels is True, otherwise it should be set to None. This can be created by calling the utility function EncoderHelper.encode_dataset_labels().
- example_ids: a list of identifiers for the examples (Repertoires, Receptors or Sequences). This can be retrieved using Dataset.get_example_ids().
- feature_names: a list of feature names, i.e., the names given to the encoding-specific features. When included, this list must be as long as the number of features.
- feature_annotations: an optional pandas dataframe with additional information about the features. When included, the number of rows in this dataframe must correspond to the number of features. This parameter is not typically used.
- info: an optional dictionary that may be used to store any additional information that is relevant (for example paths to additional output files). This parameter is not typically used.
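As an illustration of the fields listed above, a minimal EncodedData object with two examples and three features could be constructed as follows (the values, identifiers and feature names are arbitrary):

import numpy as np
from immuneML.data_model.encoded_data.EncodedData import EncodedData

# 2 examples (rows) x 3 encoding-specific features (columns)
encoded_data = EncodedData(examples=np.array([[1, 0, 2],
                                              [0, 3, 1]]),
                           encoding="random",
                           labels={"disease1": ["positive", "negative"]},
                           example_ids=["repertoire_1", "repertoire_2"],
                           feature_names=["feature_a", "feature_b", "feature_c"])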
The output predictions should be formatted the same way as the EncodedData.labels
:
{'label_name': ['class1', 'class1', 'class2']}
When predicting probabilities, a nested dictionary should be used to give the probabilities per class:
{'label_name': {'class1': [0.9, 0.8, 0.3],
                'class2': [0.1, 0.2, 0.7]}}
Adding encoder compatibility to an ML method¶
Each ML method is only compatible with a limited set of encoders. immuneML automatically checks if the given encoder and ML method are
compatible when running the TrainMLModel instruction, and raises an error if they are not compatible.
To ensure immuneML recognizes the encoder-ML method compatibility, make sure that the encoder is added to the list of encoder classes
returned by the get_compatible_encoders()
method of the ML method(s) of interest.
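For example, to make a (hypothetical) MyNewEncoder usable with the SillyClassifier, its class is appended to the list returned by get_compatible_encoders(). A minimal sketch, with an assumed import path for the new encoder:

def get_compatible_encoders(self):
    from immuneML.encodings.kmer_frequency.KmerFrequencyEncoder import KmerFrequencyEncoder
    from immuneML.encodings.onehot.OneHotEncoder import OneHotEncoder
    from immuneML.encodings.my_new_encoding.MyNewEncoder import MyNewEncoder  # hypothetical new encoder

    # The new encoder is simply added to the existing list of compatible encoder classes
    return [KmerFrequencyEncoder, OneHotEncoder, MyNewEncoder]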
Implementing fitting through cross-validation¶
By default, models in immuneML are fitted through nested cross-validation.
This allows for both hyperparameter selection and model comparison.
immuneML also allows for the implementation of a third level of k-fold cross-validation for hyperparameter selection within
the ML model (model_selection_cv in the YAML specification).
This can be useful when a large number or range of hyperparameters is typically considered
(e.g., regularisation parameters in logistic regression).
Such additional cross-validation should be implemented inside the method _fit_by_cross_validation.
The result should be that a single model (with optimal hyperparameters) is saved in the MLMethod object.
See SklearnMethod for a detailed example.

Note: this is an advanced part of model implementation, which is usually not necessary to implement.
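The exact signature and bookkeeping required for _fit_by_cross_validation are defined in the MLMethod base class (see SklearnMethod for a real implementation). The sketch below only illustrates the general idea using the SillyClassifier, with assumed parameter names and a hypothetical per-fold scoring helper: score each candidate hyperparameter value with k-fold cross-validation, keep the best one, and refit a single model on all data.

import numpy as np

def _fit_by_cross_validation(self, encoded_data, number_of_splits: int = 5, cores_for_training: int = 2):
    # Illustrative sketch only: candidate values stand in for a real hyperparameter grid
    candidate_seeds = [1, 10, 100]
    folds = np.array_split(np.arange(len(encoded_data.examples)), number_of_splits)

    mean_scores = {}
    for seed in candidate_seeds:
        fold_scores = []
        for validation_indices in folds:
            # In a real method: fit on the remaining folds and evaluate on the validation fold
            fold_scores.append(self._score_on_fold(encoded_data, validation_indices, seed))  # hypothetical helper
        mean_scores[seed] = np.mean(fold_scores)

    # Keep only the best hyperparameter value and refit a single model on all data
    self.random_seed = max(mean_scores, key=mean_scores.get)
    self._fit(encoded_data, cores_for_training)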
Class documentation standards for ML methods¶
Class documentation should be added as a docstring to all new Encoder, MLMethod, Report or Preprocessing classes. The class docstrings are used to automatically generate the documentation web pages, using Sphinx reStructuredText, and should adhere to a standard format:
A short, general description of the functionality
Optional extended description, including any references or specific cases that should be considered. For instance: if a class can only be used for a particular dataset type. Compatibility between Encoders, MLMethods and Reports should also be described.
For encoders, the appropriate dataset type(s). For example:
**Dataset type:**

- SequenceDatasets
- RepertoireDatasets
A list of arguments, when applicable. This should follow the format below:
**Specification arguments:**

- parameter_name (type): a short description
- other_parameter_name (type): a short description
A YAML snippet, to show an example of how the new component should be called. Make sure to test your YAML snippet in an immuneML run to ensure it is specified correctly. The following formatting should be used to ensure the YAML snippet is rendered correctly:
**YAML specification:**

.. indent with spaces
.. code-block:: yaml

    definitions:
        yaml_keyword: # could be encodings/ml_methods/reports/etc...
            my_new_class:
                MyNewClass:
                    parameter_name: 0
                    other_parameter_name: 1
Click to view a full example of MLMethod class documentation.
This SillyClassifier is a placeholder for a real ML method.
It generates random predictions ignoring the input features.
**Specification arguments:**
- random_seed (int): The random seed for generating random predictions.
**YAML specification:**
.. indent with spaces
.. code-block:: yaml
    definitions:
        ml_methods:
            my_silly_method:
                SillyClassifier:
                    random_seed: 100