How to add a new machine learning method
==========================================

.. meta::

   :twitter:card: summary
   :twitter:site: @immuneml
   :twitter:title: immuneML dev docs: add a new machine learning method
   :twitter:description: See how to add a new machine learning method to the immuneML platform.
   :twitter:image: https://docs.immuneml.uio.no/_images/extending_immuneML.png

In this tutorial, we will add a new machine learning method. The tutorial assumes you have installed immuneML for development as described at :ref:`Set up immuneML for development`.

To add a new ML method to immuneML, add a class that inherits the :py:obj:`~immuneML.ml_methods.MLMethod.MLMethod` class to the :py:mod:`immuneML.ml_methods` package and implement its abstract methods. The name of the new class must differ from the ML method classes already defined in the same package.

.. note::

  The initial development of the new ML method need not take place within immuneML. immuneML can be used to prepare, encode and export the data for developing the method using the :ref:`ExploratoryAnalysis` instruction, the desired encoding, and the :ref:`DesignMatrixExporter` report. For more details, see :ref:`Testing the ML method outside immuneML with a sample design matrix`. The method can then be developed and debugged separately, and later integrated into the platform as described below to fully benefit from the immuneML functionality for importing datasets from different formats, using various data representations, benchmarking against existing methods and robustly assessing performance.

Adding a new MLMethod class
-----------------------------------

For methods based on scikit-learn, read how to do this under :ref:`Adding a method based on scikit-learn`. For other methods, see :ref:`Adding native methods`.

The ML models in immuneML support a single label. If multiple labels are specified, e.g., in the :ref:`TrainMLModel` instruction, a separate model will be fitted to each label.

.. include:: ./dev_docs_util.rst

Adding a method based on scikit-learn
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To add a method from the scikit-learn package, go through this step-by-step guide, in which, as an example, an SVM class based on scikit-learn's LinearSVC is added:

#. Add a new class to the package ml_methods.
#. Make SklearnMethod the base class of the new class.
#. Implement a constructor (handle parameters and parameter_grid as inputs).
#. Implement the _get_ml_model(cores_for_training: int) function, which should return a new instance of the desired scikit-learn class with the parameters that were passed to the constructor of the new class.
#. Implement can_predict_proba() to return True or False, indicating whether the method can output class probabilities.
#. Implement get_params() to return the coefficients and/or other trained parameters of the model.
#. Add class documentation describing how to use the new ML method.

Example scikit-learn-based SVM implementation:

.. code-block:: python

    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.svm import LinearSVC

    from immuneML.ml_methods.SklearnMethod import SklearnMethod


    class NewSVM(SklearnMethod):
        """
        This is a wrapper of scikit-learn’s LinearSVC class. Please see the
        `scikit-learn documentation <https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html>`_
        of LinearSVC for the parameters.

        Note: if you are interested in plotting the coefficients of the SVM model, consider running the :ref:`Coefficients` report.
        For usage instructions, check :py:obj:`~immuneML.ml_methods.SklearnMethod.SklearnMethod`.

        YAML specification:

        .. indent with spaces
        .. code-block:: yaml

            my_svm: # user-defined method name
                SVM: # name of the ML method
                    # sklearn parameters (same names as in original sklearn class)
                    penalty: l1 # always use penalty l1
                    C: [0.01, 0.1, 1, 10, 100] # find the optimal value for C
                    # Additional parameter that determines whether to print convergence warnings
                    show_warnings: True
                # if any of the parameters under SVM is a list and model_selection_cv is True,
                # a grid search will be done over the given parameters, using the number of folds specified in model_selection_n_folds,
                # and the optimal model will be selected
                model_selection_cv: True
                model_selection_n_folds: 5
            # alternative way to define ML method with default values:
            my_default_svm: SVM
        """

        def __init__(self, parameter_grid: dict = None, parameters: dict = None):
            _parameters = parameters if parameters is not None else {"max_iter": 10000, "multi_class": "crammer_singer"}
            _parameter_grid = parameter_grid if parameter_grid is not None else {}

            super(NewSVM, self).__init__(parameter_grid=_parameter_grid, parameters=_parameters)

        def _get_ml_model(self, cores_for_training: int = 2, X=None):
            return LinearSVC(**self._parameters)

        def can_predict_proba(self) -> bool:
            return False

        def get_params(self):
            params = self.model.get_params()
            params["coefficients"] = self.model.coef_[0].tolist()
            params["intercept"] = self.model.intercept_.tolist()
            return params

Adding native methods
^^^^^^^^^^^^^^^^^^^^^^^

To add a new ML method:

#. Add a new class in a new file, where the class name and the file name must match.
#. Make the new class inherit the :py:obj:`~immuneML.ml_methods.MLMethod.MLMethod` class.
#. Define an init function. The constructor arguments of the new class will be the required parameters in the specification file.
#. Implement all abstract methods as defined in the :py:obj:`~immuneML.ml_methods.MLMethod.MLMethod` class (see the sketch after this list).
#. Add class documentation describing how to use the new ML method.
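The authoritative list of abstract methods to implement is given by the :py:obj:`~immuneML.ml_methods.MLMethod.MLMethod` class itself, so use that class as the reference. Purely as an orientation, a partial (hypothetical) skeleton of a native method might look as follows; the class name is made up, the fit() and predict() signatures mirror how the methods are called in the unit test example below, and the remaining abstract methods still need to be implemented:

.. code-block:: python

    import numpy as np

    from immuneML.data_model.encoded_data.EncodedData import EncodedData
    from immuneML.environment.Label import Label
    from immuneML.ml_methods.MLMethod import MLMethod


    class MajorityClassBaseline(MLMethod):
        """
        Hypothetical native method that always predicts the majority class seen during
        fitting. Only fit() and predict() are sketched here; a complete method must
        implement every abstract method defined in MLMethod (for example, storing and
        loading the trained model), so check the MLMethod class for the full interface.
        """

        def __init__(self):
            super().__init__()
            self._majority_class = None

        def fit(self, encoded_data: EncodedData, label: Label, cores_for_training: int = 2):
            # encoded_data.labels maps the label name to the class of each example
            # (the same structure as in the unit test example below)
            classes = np.array(encoded_data.labels[label.name])
            values, counts = np.unique(classes, return_counts=True)
            self._majority_class = values[np.argmax(counts)]

        def predict(self, encoded_data: EncodedData, label: Label):
            # predictions are returned as a dictionary keyed by the label name
            example_count = encoded_data.examples.shape[0]
            return {label.name: np.array([self._majority_class] * example_count)}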
Unit testing the new MLMethod
-------------------------------

Add a unit test for the new ML method:

#. Add a new file to the :py:mod:`~test.ml_methods` package named test_newSVM.py.
#. Add a class TestNewSVM that inherits :code:`unittest.TestCase` to the new file.
#. Add a function :code:`setUp()` to set up the cache used for testing (see the example below). This ensures that the cache location is set to :code:`EnvironmentSettings.tmp_test_path / "cache/"`.
#. Define one or more tests for the class and functions you implemented.
#. If you need to write data to a path (for example, test datasets or results), use the following location: :code:`EnvironmentSettings.tmp_test_path / "some_unique_foldername"`

When building unit tests, a useful class is :py:obj:`~immuneML.simulation.dataset_generation.RandomDatasetGenerator.RandomDatasetGenerator`, which can create a dataset with random sequences.

An example of the unit test TestNewSVM is given below.

.. code-block:: python

    import os
    import pickle
    import shutil
    from unittest import TestCase

    import numpy as np
    from sklearn.svm import LinearSVC

    from immuneML.caching.CacheType import CacheType
    from immuneML.data_model.encoded_data.EncodedData import EncodedData
    from immuneML.environment.Constants import Constants
    from immuneML.environment.EnvironmentSettings import EnvironmentSettings
    from immuneML.environment.Label import Label
    from immuneML.ml_methods.NewSVM import NewSVM  # newly added method
    from immuneML.util.PathBuilder import PathBuilder


    class TestNewSVM(TestCase):

        def setUp(self) -> None:
            os.environ[Constants.CACHE_TYPE] = CacheType.TEST.name  # set up cache, always the same

        def test_fit(self):
            x = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 1], [0, 1, 1]])
            y = {"default": np.array([1, 0, 2, 0])}

            svm = NewSVM()
            svm.fit(EncodedData(x, y), Label("default"))  # just test if nothing breaks

        def test_predict(self):
            x = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 1], [0, 1, 1]])  # encoded data, where one row is one example and columns are features
            y = {"test": np.array([1, 0, 2, 0])}  # classes for the given label, for each example in the dataset

            svm = NewSVM()  # create an instance of the class for testing
            svm.fit(EncodedData(x, y), Label("test"))  # fit the classifier using the EncodedData object, which includes the encoded data and the classes for each of the examples

            test_x = np.array([[0, 1, 0], [1, 0, 0]])  # new encoded data for testing the method
            y = svm.predict(EncodedData(test_x), Label("test"))["test"]  # extract predictions for the new encoded data for the given label

            self.assertTrue(len(y) == 2)  # check the number of predictions (2 because there were 2 examples in the new encoded data)
            self.assertTrue(y[0] in [0, 1, 2])  # check that classes are in the list of valid classes for each prediction
            self.assertTrue(y[1] in [0, 1, 2])

        def test_fit_by_cross_validation(self):
            x = EncodedData(np.array([[1, 0, 0], [0, 1, 1], [1, 1, 1], [0, 1, 1], [1, 0, 0], [0, 1, 1], [1, 1, 1], [0, 1, 1]]),
                            {"t1": [1, 0, 2, 0, 1, 0, 2, 0], "t2": [1, 0, 2, 0, 1, 0, 2, 0]})

            svm = NewSVM()
            svm.fit_by_cross_validation(x, number_of_splits=2, label=Label("t1"))  # check if nothing fails

        def test_store(self):
            x = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 1], [0, 1, 1]])
            y = {"default": np.array(['a', "b", "c", "a"])}

            svm = NewSVM()
            svm.fit(EncodedData(x, y), Label("default"))

            path = EnvironmentSettings.root_path / "my_svm/"

            # when the trained method is stored, check if the format is as defined in store()
            svm.store(path)
            self.assertTrue(os.path.isfile(path / "svm.pickle"))

            with open(path / "svm.pickle", "rb") as file:
                svm2 = pickle.load(file)

            self.assertTrue(isinstance(svm2, LinearSVC))

            shutil.rmtree(path)

        def test_load(self):
            x = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 1], [0, 1, 1]])
            y = {"default": np.array([1, 0, 2, 0])}

            svm = NewSVM()
            svm.fit(EncodedData(x, y), Label("default"))

            path = EnvironmentSettings.tmp_test_path / "my_svm2/"
            PathBuilder.build(path)

            with open(path / "svm.pickle", "wb") as file:
                pickle.dump(svm.model, file)

            svm2 = NewSVM()
            svm2.load(path)  # when the model is loaded from disk, check if the class matches

            self.assertTrue(isinstance(svm2.model, LinearSVC))

            shutil.rmtree(path)  # optionally, more checks can be added

Adding a new MLMethod: additional information
-----------------------------------------------

To test the method outside immuneML, see :ref:`Testing the ML method outside immuneML with a sample design matrix`.
Using ML methods from specification
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To use an ML method from the specification, it is necessary to define:

#. The method class,
#. Parameters for the method,
#. And, if applicable, whether cross-validation should be performed to determine the optimal parameters. The cross-validation performs a grid search over the parameters if any of the parameters is specified as a list of potential values.

An example specification for a support vector machine without cross-validation (my_svm) and a support vector machine with cross-validation (my_svm_cv) would be:

.. indent with spaces
.. code-block:: yaml

    ml_methods:
        my_svm: # the name of the method which will be used in the specification to refer to the method
            NewSVM: # class name of the method
                penalty: l1 # parameters of the model
            model_selection_cv: False # should there be a grid search and cross-validation - not here
            model_selection_n_folds: -1 # no number of folds for cross-validation as it is not used here
        my_svm_cv: # the name of the next method
            NewSVM: # class name of the method
                penalty: # parameter of the model
                - l1 # value of the parameter to test
                - l2 # another value of the parameter to test
            model_selection_cv: True # perform cross-validation and grid search
            model_selection_n_folds: 5 # do 5-fold cross-validation

The parameters model_selection_cv and model_selection_n_folds have the values False and -1, respectively, and can be omitted if there should be no model selection on this level. Also, if no parameters of the model are specified (such as penalty in the example), default values will be used.

During parsing, the parameters of the model are assigned to the parameters attribute of the ML method object if none of them is a list of possible values. Otherwise, the parameters are assigned to the parameter_grid argument, which is later used for grid search with cross-validation.
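For illustration, assuming the :code:`NewSVM` constructor shown earlier in this tutorial, the two methods above roughly correspond to the following constructor calls (a sketch of the parsing outcome, not the actual parser code):

.. code-block:: python

    from immuneML.ml_methods.NewSVM import NewSVM  # the class added in this tutorial

    # my_svm: no model parameter is given as a list, so the values end up in `parameters`
    my_svm = NewSVM(parameters={"penalty": "l1"})

    # my_svm_cv: `penalty` is a list of candidate values, so it ends up in `parameter_grid`
    # and is later used for grid search with cross-validation
    my_svm_cv = NewSVM(parameter_grid={"penalty": ["l1", "l2"]})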
A full specification that simulates data and trains the added ML method on that data would look like this:

.. indent with spaces
.. code-block:: yaml

    definitions:
        datasets:
            my_simulated_data:
                format: RandomRepertoireDataset
                params:
                    repertoire_count: 50 # a dataset with 50 repertoires
                    sequence_count_probabilities: # each repertoire has 10 sequences
                        10: 1
                    sequence_length_probabilities: # each sequence has length 15
                        15: 1
                    labels:
                        my_label: # half of the repertoires have my_label = true, the rest have false
                            false: 0.5
                            true: 0.5
        encodings:
            my_3mer_encoding:
                KmerFrequency:
                    k: 3
        ml_methods:
            my_svm: # the name of the method which will be used in the specification to refer to the method
                NewSVM: # class name of the method
                    C: 10 # parameters of the model
                model_selection_cv: False # should there be a grid search and cross-validation - not here
                model_selection_n_folds: -1 # no number of folds for cross-validation as it is not used here
            my_svm_cv: # the name of the next method
                NewSVM: # class name of the method
                    penalty: # parameter of the model
                    - l1 # value of the parameter to test
                    - l2 # another value of the parameter to test
                model_selection_cv: True # perform cross-validation and grid search
                model_selection_n_folds: 5 # do 5-fold cross-validation
    instructions:
        train_new_ml_methods_inst:
            type: TrainMLModel
            dataset: my_simulated_data
            assessment: # outer cross-validation loop splitting the data into training and test datasets
                split_strategy: random
                split_count: 1
                training_percentage: 0.7
            selection: # inner cross-validation loop splitting the training data into training and validation datasets
                split_strategy: random
                split_count: 1
                training_percentage: 0.7
            labels: [my_label]
            optimization_metric: balanced_accuracy
            settings:
                - encoding: my_3mer_encoding
                  ml_method: my_svm
                - encoding: my_3mer_encoding
                  ml_method: my_svm_cv # this method will do a third level of cross-validation to select the optimal penalty as described above

To run this from the root directory of the project, save the specification to specs.yaml and run the following:

.. code-block:: console

  immune-ml specs.yaml output_dir/

Compatible encoders
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Each ML method is only compatible with a limited set of encoders. immuneML automatically checks whether the given encoder and ML method are compatible when running the TrainMLModel instruction, and raises an error if they are not. To ensure that immuneML recognizes the encoder-ML method compatibility, make sure the encoder(s) of interest are included in the list of encoder classes returned by the :code:`get_compatible_encoders()` method of the ML method.
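As a rough sketch only, such an override might look as follows; :code:`KmerFrequencyEncoder` is used purely as an assumed example of a compatible encoder, and the import path should be checked against your immuneML version:

.. code-block:: python

    from immuneML.encodings.kmer_frequency.KmerFrequencyEncoder import KmerFrequencyEncoder
    from immuneML.ml_methods.SklearnMethod import SklearnMethod


    class NewSVM(SklearnMethod):
        # ... constructor and other methods as shown earlier in this tutorial ...

        def get_compatible_encoders(self):
            # list the encoder classes whose encoded data this method can handle;
            # KmerFrequencyEncoder is only an example here, not a requirement
            return [KmerFrequencyEncoder]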