How to add a new report#
Adding an example data report to the immuneML codebase#
In this tutorial, we will show how to add a new report to plot sequence length distribution in repertoire datasets. This tutorial assumes you have installed immuneML for development as described at Set up immuneML for development.
Step-by-step tutorial#
For this tutorial, we provide a RandomDataPlot report (shown below), in order to test adding a new Report file to immuneML.
This report ignores the input dataset, and generates a scatterplot containing random values.
RandomDataPlot.py
from pathlib import Path

import numpy as np
import pandas as pd
import plotly.express as px

from immuneML.data_model.dataset.Dataset import Dataset
from immuneML.reports.ReportOutput import ReportOutput
from immuneML.reports.ReportResult import ReportResult
from immuneML.reports.data_reports.DataReport import DataReport
from immuneML.util.ParameterValidator import ParameterValidator
from immuneML.util.PathBuilder import PathBuilder


class RandomDataPlot(DataReport):
    """
    This RandomDataPlot is a placeholder for a real Report.
    It plots some random numbers.

    **Specification arguments:**

    - n_points_to_plot (int): The number of random points to plot.


    **YAML specification:**

    .. indent with spaces
    .. code-block:: yaml

        definitions:
            reports:
                my_report:
                    RandomDataPlot:
                        n_points_to_plot: 10

    """

    @classmethod
    def build_object(cls, **kwargs):
        # Here you may check the values of given user parameters
        # This will ensure immuneML will crash early (upon parsing the specification) if incorrect parameters are specified
        ParameterValidator.assert_type_and_value(kwargs['n_points_to_plot'], int, RandomDataPlot.__name__, 'n_points_to_plot', min_inclusive=1)

        return RandomDataPlot(**kwargs)

    def __init__(self, dataset: Dataset = None, result_path: Path = None, number_of_processes: int = 1, name: str = None, n_points_to_plot: int = None):
        super().__init__(dataset=dataset, result_path=result_path, number_of_processes=number_of_processes, name=name)
        self.n_points_to_plot = n_points_to_plot

    def check_prerequisites(self):
        # Here you may check properties of the dataset (e.g. dataset type), or parameter-dataset compatibility
        # and return False if the prerequisites are incorrect.
        # This will generate a user-friendly error message and ensure immuneML does not crash, but instead skips the report.
        # Note: parameters should be checked in 'build_object'
        return True

    def _generate(self) -> ReportResult:
        PathBuilder.build(self.result_path)
        df = self._get_random_data()

        # utility function for writing a dataframe to a csv file
        # and creating a ReportOutput object containing the reference
        report_output_table = self._write_output_table(df, self.result_path / 'random_data.csv', name="Random data file")

        # Calling _safe_plot will internally call _plot, but ensure immuneML does not crash if errors occur
        report_output_fig = self._safe_plot(df=df)

        # Ensure output is either None or a list with item (not an empty list or list containing None)
        output_tables = None if report_output_table is None else [report_output_table]
        output_figures = None if report_output_fig is None else [report_output_fig]

        return ReportResult(name=self.name,
                            info="Some random numbers",
                            output_tables=output_tables,
                            output_figures=output_figures)

    def _get_random_data(self):
        return pd.DataFrame({"random_data_dim1": np.random.rand(self.n_points_to_plot),
                             "random_data_dim2": np.random.rand(self.n_points_to_plot)})

    def _plot(self, df: pd.DataFrame) -> ReportOutput:
        figure = px.scatter(df, x="random_data_dim1", y="random_data_dim2", template="plotly_white")
        figure.update_layout(template="plotly_white")

        file_path = self.result_path / "random_data.html"
        figure.write_html(str(file_path))

        return ReportOutput(path=file_path, name="Random data plot")
Add a new Python package to the data_reports package. This means: a new folder (with a meaningful name) containing an empty __init__.py file.
Add a new class to the data_reports package (other report types should be placed in the appropriate sub-package of reports). The new class should inherit from the base class DataReport.
If the report has any default parameters, they should be added in a default parameters YAML file. This file should be added to the folder config/default_params/reports. The default parameters file is automatically discovered based on the name of the class, using the base name converted to snake case with an added '_params.yaml' suffix. For the RandomDataPlot, this is random_data_plot_params.yaml, which could for example contain the following:
n_points_to_plot: 10
In rare cases where classes have unconventional names that do not translate well from CamelCase to snake case (e.g., MiXCR, VDJdb), this needs to be accounted for in convert_to_snake_case().
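The naming rule for default parameter files can be sketched as follows. This is a minimal illustration of a CamelCase-to-snake-case conversion, not immuneML's own convert_to_snake_case(); the helpers to_snake_case and default_params_filename and the class name MyNewDataReport are hypothetical. The second call shows why a name such as MiXCR needs special handling.

```python
import re


def to_snake_case(class_name: str) -> str:
    # Hypothetical helper: insert "_" before every capital letter
    # except the first one, then lowercase the whole name
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()


def default_params_filename(class_name: str) -> str:
    # Naming rule described in the text: snake-case class name + '_params.yaml'
    return f"{to_snake_case(class_name)}_params.yaml"


# A conventional CamelCase name converts cleanly
print(default_params_filename("MyNewDataReport"))

# An unconventional name is split at every capital letter,
# which is usually not the intended file name
print(to_snake_case("MiXCR"))
```

With this naive rule, MyNewDataReport maps to my_new_data_report_params.yaml, whereas MiXCR becomes mi_x_c_r, which is the kind of case that has to be handled explicitly.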
Test running the new report with a YAML specification#
If you want to use immuneML directly to test run your report, the YAML example below may be used.
This example analysis creates a randomly generated dataset, and runs the RandomDataPlot
(which ignores the dataset).
test_run_random_data_report.yaml
definitions:
  datasets:
    my_dataset:
      format: RandomSequenceDataset
      params:
        sequence_count: 100
  reports:
    my_random_report:
      RandomDataPlot:
        n_points_to_plot: 10
instructions:
  my_instruction:
    type: ExploratoryAnalysis
    analyses:
      my_analysis_1:
        dataset: my_dataset
        report: my_random_report
Adding a unit test for a Report#
Add a unit test for the new RandomDataPlot (the example test file is shown below).
test_randomDataPlot.py
import os
import shutil
from unittest import TestCase

from immuneML.caching.CacheType import CacheType
from immuneML.environment.Constants import Constants
from immuneML.environment.EnvironmentSettings import EnvironmentSettings
from immuneML.reports.data_reports.RandomDataPlot import RandomDataPlot
from immuneML.util.PathBuilder import PathBuilder


class TestRandomDataPlot(TestCase):

    def setUp(self) -> None:
        os.environ[Constants.CACHE_TYPE] = CacheType.TEST.name

    def test_random_data_plot(self):
        path = PathBuilder.remove_old_and_build(EnvironmentSettings.tmp_test_path / "random_data_plot")

        params = {"n_points_to_plot": 10}
        report = RandomDataPlot.build_object(**params)

        # make sure to set the 'path' manually
        report.result_path = path

        self.assertTrue(report.check_prerequisites())
        result = report._generate()

        # ensure result files are generated
        self.assertTrue(os.path.isfile(result.output_figures[0].path))
        self.assertTrue(os.path.isfile(result.output_tables[0].path))

        # don't forget to remove the temporary path
        shutil.rmtree(path)
Add a new file to data_reports package named test_randomDataPlot.py.
Add a class TestRandomDataPlot that inherits unittest.TestCase to the new file.
Add a function setUp() to set up the cache used for testing. This should ensure that the cache location will be set to EnvironmentSettings.tmp_test_path / "cache/".
Define one or more tests for the class and functions you implemented.
It is recommended to at least test building and generating the report.
Mock data is typically used to test new classes. Tip: the RandomDatasetGenerator class can be used to generate Repertoire, Sequence or Receptor datasets with random sequences.
If you need to write data to a path (for example test datasets or results), use the following location: EnvironmentSettings.tmp_test_path / "some_unique_foldername"
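The pattern of writing test output to a uniquely named temporary folder and cleaning it up afterwards can be sketched without immuneML as shown below. TMP_TEST_PATH is only a stand-in for EnvironmentSettings.tmp_test_path, and write_dummy_result and TestDummyResult are hypothetical names used for illustration.

```python
import shutil
import tempfile
import unittest
from pathlib import Path

# Stand-in for EnvironmentSettings.tmp_test_path (assumption: any writable temporary location)
TMP_TEST_PATH = Path(tempfile.gettempdir()) / "immuneml_tmp_test"


def write_dummy_result(result_path: Path) -> Path:
    # Hypothetical function under test: writes one result file and returns its path
    result_path.mkdir(parents=True, exist_ok=True)
    file_path = result_path / "result.csv"
    file_path.write_text("value\n1\n")
    return file_path


class TestDummyResult(unittest.TestCase):

    def setUp(self) -> None:
        # Use a folder name that no other test uses, and start from a clean state
        self.path = TMP_TEST_PATH / "dummy_result_test"
        shutil.rmtree(self.path, ignore_errors=True)

    def tearDown(self) -> None:
        # Remove the temporary folder, also when the test has failed
        shutil.rmtree(self.path, ignore_errors=True)

    def test_write_dummy_result(self):
        file_path = write_dummy_result(self.path)
        # Ensure the result file was actually generated
        self.assertTrue(file_path.is_file())
```

Placing the cleanup in tearDown() rather than at the end of the test method is a design choice: the folder is then also removed when an assertion fails halfway through the test. The test can be run with python -m unittest.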
Implementing a new Report#
This section describes tips and tricks for implementing your own new Report from scratch.
Detailed instructions on how to implement each method, as well as some special cases, can be found in the Report base class.
The Report type is determined by subclassing one of the following:
DataReport – reports examining some aspect of the dataset (such as sequence length distribution, gene usage)
EncodingReport – shows some aspect of the encoded dataset (such as the feature values of an encoded dataset)
MLReport – shows the characteristics of an inferred machine learning model (such as coefficient values for logistic regression or kernel visualization for CNN)
TrainMLModelReport – shows statistics of multiple trained ML models in the TrainMLModelInstruction (such as comparing performance statistics between models, or performance w.r.t. an encoding parameter)
MultiDatasetReport – shows statistics when running immuneML with the MultiDatasetBenchmarkTool
Note
Coding conventions and tips
Class names are written in CamelCase.
Class methods are written in snake_case.
Abstract base classes MLMethod, DatasetEncoder, and Report define an interface for their inheriting subclasses. These classes contain abstract methods which should be overridden.
Class methods starting with _underscore are generally considered “private” methods, only to be called by the class itself. If a method is expected to be called from another class, the method name should not start with an underscore.
When familiarising yourself with existing code, we recommend focusing on public methods. Private methods are typically specific to a class (internal class-specific calculations), whereas the public methods contain more general functionalities (e.g., returning a main result).
If your class should have any default parameters, they should be defined in a default parameters file under config/default_params/.
Some utility classes are available in the util package to provide useful functionalities. For example, ParameterValidator can be used to check user input and generate error messages, and PathBuilder can be used to add and remove folders.
Determine the type of report#
First, it is important to determine what the type of the report is, as this defines which report class should be inherited.
Report types for dataset analysis#
If the report will be used to analyze a Dataset (such as a RepertoireDataset), either a DataReport or an EncodingReport should be used. The simplest report is the DataReport, which should typically be used when summarizing some qualities of a dataset. This dataset can be found in the report attribute dataset.
Use the EncodingReport when it is necessary to access the encoded_data attribute of a Dataset. The encoded_data attribute is an instance of the EncodedData class. This report should be used when the data representation first needs to be changed before running the report, either through an existing or a custom encoding (see: How to add a new encoding). For example, the Matches report represents a RepertoireDataset based on matches to a given reference dataset, and must first be encoded using a MatchedSequences, MatchedReceptors or MatchedRegex encoding.
Report types for trained ML model analysis#
When the results of an experiment with a machine learning method should be analyzed, an MLReport or TrainMLModelReport should be used. These reports are more advanced and require understanding of the TrainMLModelInstruction. The MLReport should be used when plotting statistics or exporting information about one trained ML model. This report can be executed on any trained ML model (MLMethod subclass object), both in the assessment and selection loop of the TrainMLModel. An MLReport has the following attributes:
train_dataset: a Dataset (e.g., RepertoireDataset) object containing the training data used for the given classifier
test_dataset: similar to train_dataset, but containing the test data
method: the MLMethod object containing trained classifiers for each of the labels.
label: the label that the report is executed for (the same report may be executed several times when training classifiers for multiple labels), can be used to retrieve relevant information from the MLMethod object.
hp_setting: the HPSetting object, containing all information about which preprocessing, encoding, and ML methods were used up to this point. This parameter can usually be ignored unless specific information from previous analysis steps is needed.
In contrast, TrainMLModelReport is used to compare several [optimal] ML models. This report has access to the attribute state: a TrainMLModelState object, containing information that has been collected through the execution of the TrainMLModelInstruction. This includes all datasets, trained models, labels, internal state objects for selection and assessment loops (nested cross-validation), optimal models, and more.
Finally, the MultiDatasetReport is used in rare cases when running immuneML with the MultiDatasetBenchmarkTool. This is an advanced report type and is not typically used. This report type can be used when comparing the performance of classifiers over several datasets and accumulating the results. This report has the attribute instruction_states: a list of several TrainMLModelState objects.
Input and output of the _generate() method#
The abstract method _generate() must be implemented, which has the following responsibilities:
It should create the report results, for example, compute the data or create the plots that should be returned by the report.
It should write the report results to the folder given at the variable self.result_path.
It should return a ReportResult object, which contains lists of ReportOutput objects. These ReportOutput objects simply contain the path to a figure, table, text, or another type of result. One report can have multiple outputs, as long as they are all referred to in the returned ReportResult object. This is used to format the summary of the results in the HTML output file.
When the main result of the report is a plot, it is good practice to also make the raw data available to the user, for example as a csv file.
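The three responsibilities of _generate() can be sketched without immuneML as shown below. The ReportOutput and ReportResult dataclasses are simplified stand-ins for the immuneML classes of the same name (only the fields used in the RandomDataPlot example are included), and SequenceLengthCounts is a hypothetical report; a real data report would subclass DataReport and receive a Dataset instead of a list of strings.

```python
import csv
import tempfile
from collections import Counter
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class ReportOutput:
    # Simplified stand-in: the path to one result file plus a display name
    path: Path
    name: str


@dataclass
class ReportResult:
    # Simplified stand-in: groups all outputs produced by one report
    name: str
    info: str
    output_tables: Optional[List[ReportOutput]] = None
    output_figures: Optional[List[ReportOutput]] = None


class SequenceLengthCounts:
    # Hypothetical report used for illustration only

    def __init__(self, sequences: List[str], result_path: Path, name: str = "length_counts"):
        self.sequences = sequences
        self.result_path = result_path
        self.name = name

    def _generate(self) -> ReportResult:
        # 1. Create the report results
        counts = Counter(len(sequence) for sequence in self.sequences)

        # 2. Write the results to the folder given by self.result_path
        self.result_path.mkdir(parents=True, exist_ok=True)
        table_path = self.result_path / "length_counts.csv"
        with table_path.open("w", newline="") as file:
            writer = csv.writer(file)
            writer.writerow(["sequence_length", "count"])
            writer.writerows(sorted(counts.items()))

        # 3. Return a ReportResult that refers to every output file
        table = ReportOutput(path=table_path, name="Sequence length counts")
        return ReportResult(name=self.name, info="Number of sequences per length",
                            output_tables=[table])


report = SequenceLengthCounts(["CASSL", "CASSLG", "CAWSV"], Path(tempfile.mkdtemp()) / "report")
result = report._generate()
print(result.output_tables[0].path)
```

Here output_figures is left as None because the sketch produces no plot, which follows the convention in the RandomDataPlot example of returning either None or a non-empty list.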
Creating plots#
The preferred method for plotting data is through plotly, as it creates interactive and rescalable plots in HTML format [recommended] that display nicely in the HTML output file. Alternatively, plots can also be in pdf, png, jpg and svg format.
Note
When plotting data with plotly, we recommend using the following color schemes for consistency: plotly.colors.sequential.Teal, plotly.colors.sequential.Viridis, or plotly.colors.diverging.Tealrose. Additionally, most immuneML plots use the ‘plotly_white’ theme for the background.
For the overview of color schemes, visit this link. For plotly themes, visit this link.
Checking prerequisites and parameters#
New report objects are created by immuneML by calling the build_object() method. This method can take in any custom parameters and should return an instance of the report object. The parameters of the method build_object() can be directly specified in the YAML specification, nested under the report type, for example:
MyNewReport:
  custom_parameter: "value"
Inside the build_object() method, you can check if the correct parameters are specified and raise an exception when the user input is incorrect (for example using the ParameterValidator utility class). Furthermore, it is possible to resolve more complex input parameters, such as loading reference sequences from an external input file, before passing them to the __init__() method of the report.
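The two roles of build_object(), validating user input early and resolving complex parameters, can be sketched as follows. This sketch does not use immuneML: the checks are written as plain Python raising ValueError instead of calling ParameterValidator, and the class MyNewReport and its parameters n_points_to_plot and reference_path are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List


class MyNewReport:
    # Hypothetical report; a real report would subclass an immuneML report base class

    @classmethod
    def build_object(cls, **kwargs):
        # Validate simple parameters so that incorrect input fails early.
        # bool is a subclass of int in Python, so it is excluded explicitly.
        n_points = kwargs.get("n_points_to_plot")
        if isinstance(n_points, bool) or not isinstance(n_points, int) or n_points < 1:
            raise ValueError(f"{cls.__name__}: n_points_to_plot must be an integer >= 1, "
                             f"got {n_points!r}.")

        # Resolve a more complex parameter: replace the file path given by the
        # user with the reference sequences stored in that file
        reference_path = kwargs.pop("reference_path", None)
        if reference_path is None or not Path(reference_path).is_file():
            raise ValueError(f"{cls.__name__}: reference_path must point to an existing file, "
                             f"got {reference_path!r}.")
        lines = Path(reference_path).read_text().splitlines()
        reference_sequences = [line.strip() for line in lines if line.strip()]

        return cls(reference_sequences=reference_sequences, **kwargs)

    def __init__(self, n_points_to_plot: int, reference_sequences: List[str], name: str = None):
        self.n_points_to_plot = n_points_to_plot
        self.reference_sequences = reference_sequences
        self.name = name


# Usage: write a small reference file, then build the report from user-style parameters
reference_file = Path(tempfile.mkdtemp()) / "reference.txt"
reference_file.write_text("CASSL\nCASSLG\n")

report = MyNewReport.build_object(n_points_to_plot=10, reference_path=str(reference_file))
print(report.reference_sequences)
```

Keeping the file loading in build_object() means that __init__() only receives ready-to-use values, which also makes the report easier to construct directly in unit tests.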
It is important to consider whether the method check_prerequisites() should be implemented. This method should return a boolean value describing whether the prerequisites are met, and print a warning message to the user when this condition is false. The report will only be generated when check_prerequisites() returns true. This method should not be used to raise exceptions. Instead, it is used to prevent exceptions from happening during execution, as this might cause lost results. Situations to consider are:
When implementing an EncodingReport, use this function to check that the data has been encoded and that the correct encoder has been used.
Similarly, when creating an MLReport or TrainMLModelReport, check that the appropriate ML methods have been used.
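For the EncodingReport case, such a check could look roughly like the sketch below. It does not use immuneML: Dataset and EncodedData are minimal stand-ins, the encoding field holding the name of the encoder is an assumption made for this example, and MyNewEncodingReport with its REQUIRED_ENCODING constant is hypothetical.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class EncodedData:
    # Stand-in: 'encoding' is assumed to hold the name of the encoder that produced the data
    encoding: str


@dataclass
class Dataset:
    # Stand-in: encoded_data stays None until the dataset has been encoded
    encoded_data: Optional[EncodedData] = None


class MyNewEncodingReport:
    # Hypothetical report that only works with one specific encoding
    REQUIRED_ENCODING = "KmerFrequencyEncoder"

    def __init__(self, dataset: Dataset):
        self.dataset = dataset

    def check_prerequisites(self) -> bool:
        # Warn and return False instead of raising, so that the report is
        # skipped while the rest of the analysis can continue
        encoded_data = self.dataset.encoded_data
        if encoded_data is None:
            warnings.warn(f"{type(self).__name__}: the dataset is not encoded, "
                          f"skipping this report.")
            return False
        if encoded_data.encoding != self.REQUIRED_ENCODING:
            warnings.warn(f"{type(self).__name__}: expected encoding {self.REQUIRED_ENCODING}, "
                          f"got {encoded_data.encoding}, skipping this report.")
            return False
        return True


encoded = Dataset(encoded_data=EncodedData(encoding="KmerFrequencyEncoder"))
print(MyNewEncodingReport(encoded).check_prerequisites())
print(MyNewEncodingReport(Dataset()).check_prerequisites())
```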
Note
Please see the Report class for the detailed description of the methods to be implemented.
Specifying different report types in YAML#
Custom reports may be defined in the YAML specification under the key ‘definitions’ the same way as any other reports. The easiest way to test run Data reports and Encoding reports is through the ExploratoryAnalysis instruction. They may also be specified in the TrainMLModel instruction in the selection and assessment loop under reports:data_splits and reports:encoding respectively.
ML model reports and Train ML model reports can only be run through the TrainMLModel instruction. ML reports can be specified inside both the selection and assessment loop under reports:models. Train ML model reports must be specified under reports.
Finally, Multi dataset reports can be specified under benchmark_reports when running the MultiDatasetBenchmarkTool.
The following specification shows the places where Data reports, Encoding reports, ML model reports, and Train ML model reports can be specified:
definitions:
  reports:
    my_data_report: MyNewDataReport # example data report without parameters
    my_encoding_report: # example encoding report with a parameter
      MyNewEncodingReport:
        parameter: value
    my_ml_report: MyNewMLReport # ml model report
    my_trainml_report: MyNewTrainMLModelReport # train ml model report
  datasets:
    d1:
      # if you do not have real data to test your report with, consider
      # using a randomly generated dataset, see the documentation:
      # “How to generate a random receptor or repertoire dataset”
      format: RandomRepertoireDataset
      params:
        labels: {disease: {True: 0.5, False: 0.5}}
        repertoire_count: 50
  encodings:
    e1: KmerFrequency
  ml_methods:
    m1: LogisticRegression
instructions:
  exploratory_instr: # Example of specifying reports in ExploratoryAnalysis
    type: ExploratoryAnalysis
    analyses:
      analysis_1: # Example analysis with data report
        dataset: d1
        report: my_data_report
      analysis_2: # Example analysis with encoding report
        dataset: d1
        encoding: e1
        report: my_encoding_report
        labels: # when running an encoding report, labels must be specified
          - disease
  trainmlmodel_instr: # Example of specifying reports in TrainMLModel instruction
    type: TrainMLModel
    settings:
      - encoding: e1
        ml_method: m1
    assessment: # running reports in the assessment (outer) loop
      reports:
        data: # execute before splitting to training/(validation+test)
          - my_data_report
        data_splits: # execute on training and (validation+test) sets
          - my_data_report
        encoding:
          - my_encoding_report
        models:
          - my_ml_report
    selection: # running reports in the selection (inner) loop
      reports:
        data: # execute before splitting to validation/test
          - my_data_report
        data_splits: # execute on validation and test sets
          - my_data_report
        encoding:
          - my_encoding_report
        models:
          - my_ml_report
    reports:
      - my_trainml_report
    labels:
      - disease
Class documentation standards for reports#
Class documentation should be added as a docstring to all new Encoder, MLMethod, Report or Preprocessing classes. The class docstrings are used to automatically generate the documentation web pages, using Sphinx reStructuredText, and should adhere to a standard format:
A short, general description of the functionality
Optional extended description, including any references or specific cases that should be considered. For instance: if a class can only be used for a particular dataset type. Compatibility between Encoders, MLMethods and Reports should also be described.
A list of arguments, when applicable. This should follow the format below:
**Specification arguments:**

- parameter_name (type): a short description

- other_paramer_name (type): a short description

A YAML snippet, to show an example of how the new component should be called. Make sure to test your YAML snippet in an immuneML run to ensure it is specified correctly. The following formatting should be used to ensure the YAML snippet is rendered correctly:
**YAML specification:**

.. indent with spaces
.. code-block:: yaml

    definitions:
        yaml_keyword: # could be encodings/ml_methods/reports/etc...
            my_new_class:
                MyNewClass:
                    parameter_name: 0
                    other_paramer_name: 1
A full example of Report class documentation:
This RandomDataPlot is a placeholder for a real Report.
It plots some random numbers.

**Specification arguments:**

- n_points_to_plot (int): The number of random points to plot.


**YAML specification:**

.. indent with spaces
.. code-block:: yaml

    definitions:
        reports:
            my_report:
                RandomDataPlot:
                    n_points_to_plot: 10