Report parameters¶
Under the definitions/reports
component, the user can specify reports which visualise or summarise different properties
of the dataset or analysis.
Reports have been divided into different types. Different types of reports can be specified depending on which instruction is run. Click on the name of the report type to see more details.
Data reports show some type of features or statistics about a given dataset.
Encoding reports show some type of features or statistics about an encoded dataset, or may export relevant sequences or tables.
ML model reports show some type of features or statistics about a single trained ML model (e.g., model coefficients).
Train ML model reports plot general statistics or export data of multiple models simultaneously when running the TrainMLModel instruction (e.g., performance comparison between models).
Multi dataset reports are special reports that can be specified when running immuneML with the
MultiDatasetBenchmarkTool
. See Manuscript use case 1: Robustness assessment for an example.
Data reports¶
Data reports show some type of features or statistics about a given dataset.
When running the TrainMLModel instruction, data reports can be specified inside the ‘selection’ or ‘assessment’ specification under the keys ‘reports/data’ (current cross-validation split) or ‘reports/data_splits’ (train/test sub-splits). Example:
my_instruction:
type: TrainMLModel
selection:
reports:
data:
- my_data_report
# other parameters...
assessment:
reports:
data:
- my_data_report
# other parameters...
# other parameters...
Alternatively, when running the ExploratoryAnalysis instruction, data reports can be specified under ‘report’. Example:
my_instruction:
type: ExploratoryAnalysis
analyses:
my_first_analysis:
report: my_data_report
# other parameters...
# other parameters...
AminoAcidFrequencyDistribution¶
Generates a barplot showing the relative frequency of each amino acid at each position in the sequences of a dataset.
Example output:
Specification arguments:
alignment (str): Alignment style for aligning sequences of different lengths. Options are as follows:
CENTER: center-align sequences of different lengths. The middle amino acid of any sequence be labelled position 0. By default, alignment is CENTER.
LEFT: left-align sequences of different lengths, starting at 0.
RIGHT: right align sequences of different lengths, ending at 0 (counting towards negative numbers).
IMGT: align sequences based on their IMGT positional numbering, considering the sequence region_type (IMGT_CDR3 or IMGT_JUNCTION). The main difference between CENTER and IMGT is that IMGT aligns the first and last amino acids, adding gaps in the middle, whereas CENTER aligns the middle of the sequences, padding with gaps at the start and end of the sequence. When region_type is IMGT_JUNCTION, the IMGT positions run from 104 (conserved C) to 118 (conserved W/F). When IMGT_CDR3 is used, these positions are 105 to 117. For long CDR3 sequences, additional numbers are added in between IMGT positions 111 and 112. See the official IMGT documentation for more details: https://www.imgt.org/IMGTScientificChart/Numbering/CDR3-IMGTgaps.html
relative_frequency (bool): Whether to plot relative frequencies (true) or absolute counts (false) of the positional amino acids. Note that when sequences are of different length, setting relative_frequency to True will produce different results depending on the alignment type, as some positions are only covered by the longest sequences. By default, relative_frequency is False.
split_by_label (bool): Whether to split the plots by a label. If set to true, the Dataset must either contain a single label, or alternatively the label of interest can be specified under ‘label’. If split_by_label is set to true, the percentage-wise frequency difference between classes is plotted additionally. By default, split_by_label is False.
label (str): if split_by_label is set to True, a label can be specified here.
region_type (str): which part of the sequence to check; e.g., IMGT_CDR3
YAML specification:
definitions:
reports:
my_aa_freq_report:
AminoAcidFrequencyDistribution:
relative_frequency: False
split_by_label: True
label: CMV
region_type: IMGT_CDR3
GLIPH2Exporter¶
Report which exports the receptor data to GLIPH2 format so that it can be directly used in GLIPH2 tool. Currently, the report accepts only receptor datasets.
GLIPH2 publication: Huang H, Wang C, Rubelt F, Scriba TJ, Davis MM. Analyzing the Mycobacterium tuberculosis immune response by T-cell receptor clustering with GLIPH2 and genome-wide antigen screening. Nature Biotechnology. Published online April 27, 2020:1-9. doi:10.1038/s41587-020-0505-4
Specification arguments:
condition (str): name of the parameter present in the receptor metadata in the dataset; condition can be anything which can be processed in GLIPH2, such as tissue type or treatment.
YAML specification:
definitions:
reports:
my_gliph2_exporter:
GLIPH2Exporter:
condition: epitope # for instance, epitope parameter is present in receptors' metadata with values such as "MtbLys" for Mycobacterium tuberculosis (as shown in the original paper).
MotifGeneralizationAnalysis¶
This report splits the given dataset into a training and validation set, identifies significant motifs using the
MotifEncoder
on the training set and plots the precision/recall and precision/true positive predictions of motifs
on both the training and validation sets. This can be used to:
determine the optimal recall cutoff for motifs of a given size
investigate how well motifs learned on a training set generalize to a test set
After running this report and determining the optimal recall cutoffs, the report
MotifTestSetPerformance
can be run to
plot the performance on an independent test set.
Note: the MotifEncoder (and thus this report) can only be used for sequences of the same length.
Specification arguments:
label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.
training_set_identifier_path (str): Path to a file containing ‘sequence_identifiers’ of the sequences used for the training set. This file should have a single column named ‘example_id’ and have one sequence identifier per line. If training_set_identifier_path is not set, a random subset of the data (according to training_percentage) will be assigned to be the training set.
training_percentage (float): If training_set_identifier_path is not set, this value is used to specify the fraction of sequences that will be randomly assigned to form the training set. Should be a value between 0 and 1. By default, training_percentage is 0.7.
random_seed (int): Random seed for splitting the data into training and validation sets a training_set_identifier_path is not provided.
split_by_motif_size (bool): Whether to split the analysis per motif size. If true, a recall threshold is learned for each motif size, and figures are generated for each motif size independently. By default, split_by_motif_size is true.
min_precision:
MotifEncoder
parameter. The minimum precision threshold for keeping a motif on the training set. By default, min_precision is 0.9.test_precision_threshold (float). The desired precision on the test set, given that motifs are learned by using a training set with a precision threshold of min_precision. It is recommended for test_precision_threshold to be lower than min_precision, e.g., min_precision - 0.1. By default, test_precision_threshold is 0.8.
min_recall (float):
MotifEncoder
parameter. The minimum recall threshold for keeping a motif. Any learned recall threshold will be at least as high as the set min_recall value. The default value for min_recall is 0.min_true_positives (int):
MotifEncoder
parameter. The minimum number of true positive training sequences that a motif needs to occur in. The default value for min_true_positives is 1.max_positions (int):
MotifEncoder
parameter. The maximum motif size. This is number of positional amino acids the motif consists of (excluding gaps). The default value for max_positions is 4.min_positions (int):
MotifEncoder
parameter. The minimum motif size (see also: max_positions). The default value for min_positions is 1.no_gaps (bool):
MotifEncoder
parameter. Must be set to True if only contiguous motifs (position-specific k-mers) are allowed. By default, no_gaps is False, meaning both gapped and ungapped motifs are searched for.smoothen_combined_precision (bool): whether to add a smoothed line representing the combined precision to the precision-vs-TP plot. When set to True, this may take considerable extra time to compute. By default, plot_smoothed_combined_precision is set to True.
min_points_in_window (int): Parameter for smoothing the combined_precision line in the precision-vs-TP plot through lognormal kernel smoothing with adaptive window size. This parameter determines the minimum number of points that need to be present in a window to determine the adaptive window size. By default, min_points_in_window is 50.
smoothing_constant1: Parameter for smoothing the combined_precision line in the precision-vs-TP plot through lognormal kernel smoothing with adaptive window size. This smoothing constant determines the dependence of the smoothness on the window size. Increasing this increases smoothness for regions where few points are present. By default, smoothing_constant1 is 5.
smoothing_constant2: Parameter for smoothing the combined_precision line in the precision-vs-TP plot through lognormal kernel smoothing. with adaptive window size. This smoothing constant can be used to scale the overall kernel width, thus influencing the smoothness of all regions regardless of data density. By default, smoothing_constant2 is 10.
training_set_name (str): Name of the training set to be used in figures. By default, the training_set_name is ‘training set’.
test_set_name (str): Name of the test set to be used in figures. By default, the test_set_name is ‘test set’.
highlight_motifs_path (str): Path to a set of motifs of interest to highlight in the output figures (such as implanted ground-truth motifs). By default, no motifs are highlighted.
highlight_motifs_name (str): IF highlight_motifs_path is defined, this name will be used to label the motifs of interest in the output figures.
YAML specification:
definitions:
reports:
my_motif_generalization:
MotifGeneralizationAnalysis:
min_precision: 0.9
min_recall: 0.1
label: # Define a label, and the positive class for that given label
CMV:
positive_class: +
ReceptorDatasetOverview¶
This report plots the length distribution per chain for a receptor (paired-chain) dataset.
Specification arguments:
batch_size (int): how many receptors to load at once; 50 000 by default
YAML specification:
definitions:
reports:
my_receptor_overview_report: ReceptorDatasetOverview
RecoveredSignificantFeatures¶
Compares a given collection of ground truth implanted signals (sequences or k-mers) to the significant label-associated k-mers or sequences according to Fisher’s exact test.
Internally uses the KmerAbundanceEncoder
for calculating
significant k-mers, and
SequenceAbundanceEncoder
or
CompAIRRSequenceAbundanceEncoder
to calculate significant full sequences (depending on whether the argument compairr_path was set).
This report creates two plots:
the first plot is a bar chart showing what percentage of the ground truth implanted signals were found to be significant.
the second plot is a bar chart showing what percentage of the k-mers/sequences found to be significant match the ground truth implanted signals.
To compare k-mers or sequences of differing lengths, the ground truth sequences or long k-mers are split into k-mers of the given size through a sliding window approach. When comparing ‘full_sequences’ to ground truth sequences, a match is only registered if both sequences are of equal length.
Specification arguments:
ground_truth_sequences_path (str): Path to a file containing the true implanted (sub)sequences, e.g., full sequences or k-mers. The file should contain one sequence per line, without a header, and without V or J genes.
sequence_type (str): either amino acid or nucleotide; which type of sequence to use for the analysis
region_type (str): which AIRR field to use for comparison, e.g. IMGT_CDR3 or IMGT_JUNCTION
p_values (list): The p value thresholds to be used by Fisher’s exact test. Each p-value specified here will become one panel in the output figure.
k_values (list): Length of the k-mers (number of amino acids) created by the
KmerAbundanceEncoder
. When using a full sequence encoding (SequenceAbundanceEncoder
orCompAIRRSequenceAbundanceEncoder
), specify ‘full_sequence’ here. Each value specified under k_values will represent one bar in the output figure.label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.
compairr_path (str): If ‘full_sequence’ is listed under k_values, the path to the CompAIRR executable may be provided. If the compairr_path is specified, the
CompAIRRSequenceAbundanceEncoder
will be used to compute the significant sequences. If the path is not specified and ‘full_sequence’ is listed under k-values,SequenceAbundanceEncoder
will be used.
YAML specification:
definitions:
reports:
my_recovered_significant_features_report:
RecoveredSignificantFeatures:
groundtruth_sequences_path: path/to/groundtruth/sequences.txt
trim_leading_trailing: False
p_values:
- 0.1
- 0.01
- 0.001
- 0.0001
k_values:
- 3
- 4
- 5
- full_sequence
compairr_path: path/to/compairr # can be specified if 'full_sequence' is listed under k_values
label: # Define a label, and the positive class for that given label
CMV:
positive_class: +
RepertoireClonotypeSummary¶
Shows the number of distinct clonotypes per repertoire in a given dataset as a bar plot.
Specification arguments:
color_by_label (str): name of the label to use to color the plot, e.g., could be disease label, or None
YAML specification:
definitions:
reports:
my_clonotype_summary_rep:
RepertoireClonotypeSummary:
color_by_label: celiac
SequenceCountDistribution¶
Generates a histogram of the duplicate counts of the sequences in a dataset.
Specification arguments:
split_by_label (bool): Whether to split the plots by a label. If set to true, the Dataset must either contain a single label, or alternatively the label of interest can be specified under ‘label’. By default, split_by_label is False.
label (str): Optional label for separating the results by color/creating separate plots. Note that this should the name of a valid dataset label.
YAML specification:
my_sld_report:
SequenceCountDistribution:
label: disease
SequenceLengthDistribution¶
Generates a histogram of the lengths of the sequences in a dataset.
Specification arguments:
sequence_type (str): whether to check the length of amino acid or nucleotide sequences; default value is ‘amino_acid’
region_type (str): which part of the sequence to examine; e.g., IMGT_CDR3
YAML specification:
definitions:
reports:
my_sld_report:
SequenceLengthDistribution:
sequence_type: amino_acid
region_type: IMGT_CDR3
SequencesWithSignificantKmers¶
Given a list of reference sequences, this report writes out the subsets of reference sequences containing significant k-mers
(as computed by the KmerAbundanceEncoder
using Fisher’s exact test).
For each combination of p-value and k-mer size given, a file is written containing all sequences containing a significant k-mer of the given size at the given p-value.
Specification arguments:
reference_sequences_path (str): Path to a file containing the reference sequences, The file should contain one sequence per line, without a header, and without V or J genes.
p_values (list): The p value thresholds to be used by Fisher’s exact test. Each p-value specified here will become one panel in the output figure.
k_values (list): Length of the k-mers (number of amino acids) created by the
KmerAbundanceEncoder
. Each k-mer length will become one panel in the output figure.label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.
YAML specification:
definitions:
reports:
my_sequences_with_significant_kmers:
SequencesWithSignificantKmers:
reference_sequences_path: path/to/reference/sequences.txt
p_values:
- 0.1
- 0.01
- 0.001
- 0.0001
k_values:
- 3
- 4
- 5
label: # Define a label, and the positive class for that given label
CMV:
positive_class: +
SignificantFeatures¶
Plots a boxplot of the number of significant features (label-associated k-mers or sequences) per Repertoire according to Fisher’s exact test, across different classes for the given label.
Internally uses the KmerAbundanceEncoder
for calculating
significant k-mers, and
SequenceAbundanceEncoder
or
CompAIRRSequenceAbundanceEncoder
to calculate significant full sequences (depending on whether the argument compairr_path was set).
Specification arguments:
p_values (list): The p value thresholds to be used by Fisher’s exact test. Each p-value specified here will become one panel in the output figure.
k_values (list): Length of the k-mers (number of amino acids) created by the
KmerAbundanceEncoder
. When using a full sequence encoding (SequenceAbundanceEncoder
orCompAIRRSequenceAbundanceEncoder
), specify ‘full_sequence’ here. Each value specified under k_values will represent one boxplot in the output figure.label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.
compairr_path (str): If ‘full_sequence’ is listed under k_values, the path to the CompAIRR executable may be provided. If the compairr_path is specified, the
CompAIRRSequenceAbundanceEncoder
will be used to compute the significant sequences. If the path is not specified and ‘full_sequence’ is listed under k-values,SequenceAbundanceEncoder
will be used.log_scale (bool): Whether to plot the y axis in log10 scale (log_scale = True) or continuous scale (log_scale = False). By default, log_scale is False.
YAML specification:
definitions:
reports:
my_significant_features_report:
SignificantFeatures:
p_values:
- 0.1
- 0.01
- 0.001
- 0.0001
k_values:
- 3
- 4
- 5
- full_sequence
compairr_path: path/to/compairr # can be specified if 'full_sequence' is listed under k_values
label: # Define a label, and the positive class for that given label
CMV:
positive_class: +
log_scale: False
SignificantKmerPositions¶
Plots the number of significant k-mers (as computed by the KmerAbundanceEncoder
using Fisher’s exact test)
observed at each IMGT position of a given list of reference sequences.
This report creates a stacked bar chart, where each bar represents an IMGT position, and each segment of the stack represents the observed frequency
of one ‘significant’ k-mer at that position.
Specification arguments:
reference_sequences_path (str): Path to a file containing the reference sequences, The file should contain one sequence per line, without a header, and without V or J genes.
p_values (list): The p value thresholds to be used by Fisher’s exact test. Each p-value specified here will become one panel in the output figure.
k_values (list): Length of the k-mers (number of amino acids) created by the
KmerAbundanceEncoder
. Each k-mer length will become one panel in the output figure.label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.
sequence_type (str): nucleotide or amino_acid
region_type (str): which AIRR field to consider, e.g., IMGT_CDR3 or IMGT_JUNCTION
YAML specification:
definitions:
reports:
my_significant_kmer_positions_report:
SignificantKmerPositions:
reference_sequences_path: path/to/reference/sequences.txt
p_values:
- 0.1
- 0.01
- 0.001
- 0.0001
k_values:
- 3
- 4
- 5
label: # Define a label, and the positive class for that given label
CMV:
positive_class: +
SimpleDatasetOverview¶
Generates a simple text-based overview of the properties of any dataset, including the dataset name, size, and metadata labels.
YAML specification:
definitions:
reports:
my_overview: SimpleDatasetOverview
VJGeneDistribution¶
This report creates several plots to gain insight into the V and J gene distribution of a given dataset. When a label is provided, the information in the plots is separated per label value, either by color or by creating separate plots. This way one can for example see if a particular V or J gene is more prevalent across disease associated receptors.
Individual V and J gene distributions: for sequence and receptor datasets, a bar plot is created showing how often
each V or J gene occurs in the dataset. For repertoire datasets, boxplots are used to represent how often each V or J gene is used across all repertoires. Since repertoires may differ in size, these counts are normalised by the repertoire size (original count values are additionaly exported in tsv files).
Combined V and J gene distributions: for sequence and receptor datasets, a heatmap is created showing how often each
combination of V and J genes occurs in the dataset. A similar plot is created for repertoire datasets, except in this case only the average value for the normalised gene usage frequencies are shown (original count values are additionaly exported in tsv files).
Specification arguments:
split_by_label (bool): Whether to split the plots by a label. If set to true, the Dataset must either contain a single label, or alternatively the label of interest can be specified under ‘label’. By default, split_by_label is False.
label (str): Optional label for separating the results by color/creating separate plots. Note that this should the name of a valid dataset label.
is_sequence_label (bool): for RepertoireDatasets, indicates if the label applies to the sequence level (e.g., antigen binding versus non-binding across repertoires) or repertoire level (e.g., diseased repertoires versus healthy repertoires). By default, is_sequence_label is False. For Sequence- and ReceptorDatasets, this parameter is ignored.
YAML specification:
definitions:
reports:
my_vj_gene_report:
VJGeneDistribution:
label: ag_binding
Encoding reports¶
Encoding reports show some type of features or statistics about an encoded dataset, or may in some cases export relevant sequences or tables.
When running the TrainMLModel instruction, encoding reports can be specified inside the ‘selection’ or ‘assessment’ specification under the key ‘reports/encoding’. Example:
my_instruction:
type: TrainMLModel
selection:
reports:
encoding:
- my_encoding_report
# other parameters...
assessment:
reports:
encoding:
- my_encoding_report
# other parameters...
# other parameters...
Alternatively, when running the ExploratoryAnalysis instruction, encoding reports can be specified under ‘report’. Example:
my_instruction:
type: ExploratoryAnalysis
analyses:
my_first_analysis:
report: my_encoding_report
# other parameters...
# other parameters...
DesignMatrixExporter¶
Exports the design matrix and related information of a given encoded Dataset to csv files. If the encoded data has more than 2 dimensions (such as when using the OneHot encoder with option Flatten=False), the data are then exported to different formats to facilitate their import with external software.
Specification arguments:
file_format (str): the format and extension of the file to store the design matrix. The supported formats are: npy, csv, pt, hdf5, npy.zip, csv.zip or hdf5.zip.
Note: when using hdf5 or hdf5.zip output formats, make sure the ‘hdf5’ dependency is installed.
YAML specification:
definitions:
reports:
my_dme_report:
DesignMatrixExporter:
file_format: csv
DimensionalityReduction¶
This report visualizes the data obtained by dimensionality reduction.
Specification arguments:
label (str): name of the label to use for highlighting data points; or None
dim_red_method (str): name of the dimensionality reduction method defined under ml_methods that will be used to transform the data for plotting; if None, it will visualize the encoded data of reduced dimensionality if set
YAML specification:
definitions:
reports:
rep1:
DimensionalityReduction:
label: epitope
dim_red_method:
PCA:
n_components: 2
FeatureComparison¶
Encoding a dataset results in a numeric matrix, where the rows are examples (e.g., sequences, receptors, repertoires) and the columns are features. For example, when KmerFrequency encoder is used, the features are the k-mers (AAA, AAC, etc..) and the feature values are the frequencies per k-mer.
This report separates the examples based on a binary metadata label, and plots the mean feature value of each feature in one example group against the other example group (for example: plot the feature value of ‘sick’ repertoires on the x axis, and ‘healthy’ repertoires on the y axis to spot consistent differences). The plot can be separated into different colors or facets using other metadata labels (for example: plot the average feature values of ‘cohort1’, ‘cohort2’ and ‘cohort3’ in different colors to spot biases).
Alternatively, when plotting features without comparing them across a binary label, see:
FeatureValueBarplot
report to plot
a simple bar chart per feature (average across examples).
Or FeatureDistribution
report to plot
the distribution of each feature across examples, rather than only showing the mean value in a bar plot.
Example output:
Specification arguments:
comparison_label (str): Mandatory label. This label is used to split the encoded data matrix and define the x and y axes of the plot. This label is only allowed to have 2 classes (for example: sick and healthy, binding and non-binding).
color_grouping_label (str): Optional label that is used to color the points in the scatterplot. This can not be the same as comparison_label.
row_grouping_label (str): Optional label that is used to group scatterplots into different row facets. This can not be the same as comparison_label.
column_grouping_label (str): Optional label that is used to group scatterplots into different column facets. This can not be the same as comparison_label.
show_error_bar (bool): Whether to show the error bar (standard deviation) for the points, both in the x and y dimension.
log_scale (bool): Whether to plot the x and y axes in log10 scale (log_scale = True) or continuous scale (log_scale = False). By default, log_scale is False.
keep_fraction (float): The total number of features may be very large and only the features differing significantly across comparison labels may be of interest. When the keep_fraction parameter is set below 1, only the fraction of features that differs the most across comparison labels is kept for plotting (note that the produced .csv file still contains all data). By default, keep_fraction is 1, meaning that all features are plotted.
opacity (float): a value between 0 and 1 setting the opacity for data points making it easier to see if there are overlapping points
YAML specification:
definitions:
reports:
my_comparison_report:
FeatureComparison: # compare the different classes defined in the label disease
comparison_label: disease
FeatureDistribution¶
Encoding a dataset results in a numeric matrix, where the rows are examples (e.g., sequences, receptors, repertoires) and the columns are features. For example, when KmerFrequency encoder is used, the features are the k-mers (AAA, AAC, etc..) and the feature values are the frequencies per k-mer.
This report plots the distribution of feature values. For each feature, a violin plot is created to show the distribution of feature values across all examples. The violin plots can be separated into different colors or facets using metadata labels (for example: plot the feature distributions of ‘cohort1’, ‘cohort2’ and ‘cohort3’ in different colors to spot biases).
See also: FeatureValueBarplot
report to plot
a simple bar chart per feature (average across examples), rather than the entire distribution.
Or FeatureComparison
report to compare
features across binary metadata labels (e.g., plot the feature value of ‘sick’ repertoires on the x axis,
and ‘healthy’ repertoires on the y axis).
Example output:
Specification arguments:
color_grouping_label (str): The label that is used to color each bar, at each level of the grouping_label.
row_grouping_label (str): The label that is used to group bars into different row facets.
column_grouping_label (str): The label that is used to group bars into different column facets.
mode (str): either ‘normal’, ‘sparse’ or ‘auto’ (default). in the ‘normal’ mode there are normal boxplots corresponding to each column of the encoded dataset matrix; in the ‘sparse’ mode all zero cells are eliminated before passing the data to the boxplots. If mode is set to ‘auto’, then it will automatically set to ‘sparse’ if the density of the matrix is below 0.01
x_title (str): x-axis label
y_title (str): y-axis label
YAML specification:
definitions:
reports:
my_fdistr_report:
FeatureDistribution:
mode: sparse
FeatureValueBarplot¶
Encoding a dataset results in a numeric matrix, where the rows are examples (e.g., sequences, receptors, repertoires) and the columns are features. For example, when KmerFrequency encoder is used, the features are the k-mers (AAA, AAC, etc..) and the feature values are the frequencies per k-mer.
This report plots the mean feature values per feature. A bar plot is created where the average feature value across all examples is shown, with optional error bars. The bar plots can be separated into different colors or facets using metadata labels (for example: plot the average feature values of ‘cohort1’, ‘cohort2’ and ‘cohort3’ in different colors to spot biases).
See also: FeatureDistribution
report to plot
the distribution of each feature across examples, rather than only showin the mean value in a bar plot.
Or FeatureComparison
report to compare
features across binary metadata labels (e.g., plot the feature value of ‘sick’ repertoires on the x axis,
and ‘healthy’ repertoires on the y axis.).
Example output:
Specification arguments:
color_grouping_label (str): The label that is used to color each bar, at each level of the grouping_label.
row_grouping_label (str): The label that is used to group bars into different row facets.
column_grouping_label (str): The label that is used to group bars into different column facets.
show_error_bar (bool): Whether to show the error bar (standard deviation) for the bars.
x_title (str): x-axis label
y_title (str): y-axis label
plot_top_n (int): plot n of the largest features on average separately (useful when there are too many features to plot at the same time)
plot_bottom_n (int): plot n of the smallest features on average separately (useful when there are too many features to plot at the same time)
plot_all_features (bool): whether to plot all (might be slow for large number of features)
YAML specification:
definitions:
reports:
my_fvb_report:
FeatureValueBarplot: # timepoint, disease_status and age_group are metadata labels
column_grouping_label: timepoint
row_grouping_label: disease_status
color_grouping_label: age_group
plot_all_features: true
plot_top_n: 10
plot_bottom_n: 5
GroundTruthMotifOverlap¶
Creates report displaying overlap between learned motifs and groundtruth motifs implanted in a given sequence dataset. This report must be used in combination with the MotifEncoder.
Specification arguments:
groundtruth_motifs_path (str): Path to a .tsv file containing groundtruth position-specific motifs. The file should specify the motifs as position-specific amino acids, one column representing the positions concatenated with an ‘&’ symbol, the next column specifying the amino acids concatenated with ‘&’ symbol, and the last column specifying the implant rate.
Example:
indices
amino_acids
n_sequences
0
A
4
4&8&9
G&A&C
30
This file shows a motif ‘A’ at position 0 implanted in 4 sequences, and motif G—AC implanted between positions 4 and 9 in 30 sequences
YAML specification:
definitions:
reports:
my_ground_truth_motif_report:
GroundTruthMotifOverlap:
groundtruth_motifs_path: path/to/file.tsv
Matches¶
Reports the number of matches that were found when using one of the following encoders:
MatchedSequences encoder
MatchedReceptors encoder
MatchedRegex encoder
Report results are:
A table containing all matches, where the rows correspond to the Repertoires, and the columns correspond to the objects to match (regular expressions or receptor sequences).
The repertoire sizes (read frequencies and the number of unique sequences per repertoire), for each of the chains. This can be used to calculate the percentage of matched sequences in a repertoire.
When using MatchedSequences encoder or MatchedReceptors encoder, tables describing the chains and receptors (ids, chains, V and J genes and sequences).
When using MatchedReceptors encoder or using MatchedRegex encoder with chain pairs, tables describing the paired matches (where a match was found in both chains) per repertoire.
YAML specification:
definitions:
reports:
my_match_report: Matches
MotifTestSetPerformance¶
This report can be used to show the performance of a learned set motifs using the MotifEncoder
on an independent test set of unseen data.
It is recommended to first run the report MotifGeneralizationAnalysis
in order to calibrate the optimal recall thresholds and plot the performance of motifs on training- and validation sets.
Specification arguments:
test_dataset (dict): parameters for importing a SequenceDataset to use as an independent test set. By default, the import parameters ‘is_repertoire’ and ‘paired’ will be set to False to ensure a SequenceDataset is imported.
YAML specification:
definitions:
reports:
my_motif_report:
MotifTestSetPerformance:
test_dataset:
format: AIRR # choose any valid import format
params:
path: path/to/files/
is_repertoire: False # is_repertoire must be False to import a SequenceDataset
paired: False # paired must be False to import a SequenceDataset
# optional other parameters...
NonMotifSequenceSimilarity¶
Plots the similarity of positions outside the motifs of interest. This report can be used to investigate if the
motifs of interest as determined by the MotifEncoder
have a tendency occur in sequences that are naturally very similar or dissimilar.
For each motif, the subset of sequences containing the motif is selected, and the hamming distances are computed between all sequences in this subset. Finally, a plot is created showing the distribution of hamming distances between the sequences containing the motif. For motifs occurring in sets of very similar sequences, this distribution will lean towards small hamming distances. Likewise, for motifs occurring in a very diverse set of sequences, the distribution will lean towards containing more large hamming distances.
Specification arguments:
motif_color_map (dict): An optional mapping between motif sizes and colors. If no mapping is given, default colors will be chosen.
YAML specification:
definitions:
reports:
my_motif_sim:
NonMotifSimilarity:
motif_color_map:
3: "#66C5CC"
4: "#F6CF71"
5: "#F89C74"
PositionalMotifFrequencies¶
This report must be used in combination with the MotifEncoder
.
Plots a stacked bar plot of amino acid occurrence at different indices in any given dataset, along with a plot
investigating motif continuity which displays a bar plot of the gap sizes between the amino acids in the motifs in
the given dataset. Note that a distance of 1 means that the amino acids are continuous (next to each other).
Specification arguments:
motif_color_map (dict): Optional mapping between motif lengths and specific colors to be used. Example:
- motif_color_map:
1: #66C5CC 2: #F6CF71 3: #F89C74
YAML specification:
definitions:
reports:
my_pos_motif_report:
PositionalMotifFrequencies:
motif_color_map:
RelevantSequenceExporter¶
Exports the sequences that are extracted as label-associated when using the SequenceAbundanceEncoder
or
CompAIRRSequenceAbundanceEncoder
in AIRR-compliant format.
YAML specification:
definitions:
reports:
my_relevant_sequences: RelevantSequenceExporter
ML model reports¶
ML model reports show some type of features or statistics about a single trained ML model.
In the TrainMLModel instruction, ML model reports can be specified inside the ‘selection’ or ‘assessment’ specification under the key ‘reports/models’. Example:
my_instruction:
type: TrainMLModel
selection:
reports:
models:
- my_ml_report
# other parameters...
assessment:
reports:
models:
- my_ml_report
# other parameters...
# other parameters...
BinaryFeaturePrecisionRecall¶
Plots the precision and recall scores for each added feature to the collection of features selected by the BinaryFeatureClassifier.
YAML specification:
definitions:
reports:
my_report: BinaryFeaturePrecisionRecall
Coefficients¶
A report that plots the coefficients for a given ML method in a barplot. Can be used for LogisticRegression, SVM, SVC, and RandomForestClassifier. In the case of RandomForest, the feature importances will be plotted.
When used in TrainMLModel instruction, the report can be specified under ‘models’, both on the selection and assessment levels.
Which coefficients should be plotted (for example: only nonzero, above a certain threshold, …) can be specified. Multiple options can be specified simultaneously. By default the 25 largest coefficients are plotted. The full set of coefficients will also be exported as a csv file.
Example output:
Specification arguments:
coefs_to_plot (list): A list specifying which coefficients should be plotted. Valid values are: ALL, NONZERO, CUTOFF, N_LARGEST.
cutoff (list): If ‘cutoff’ is specified under ‘coefs_to_plot’, the cutoff values can be specified here. The coefficients which have an absolute value equal to or greater than the cutoff will be plotted.
n_largest (list): If ‘n_largest’ is specified under ‘coefs_to_plot’, the values for n can be specified here. These should be integer values. The n largest coefficients are determined based on their absolute values.
YAML specification:
definitions:
reports:
my_coef_report:
Coefficients:
coefs_to_plot:
- all
- nonzero
- cutoff
- n_largest
cutoff:
- 0.1
- 0.01
n_largest:
- 5
- 10
ConfounderAnalysis¶
A report that plots the numbers of false positives and false negatives with respect to each value of the metadata features specified by the user. This allows checking whether a given machine learning model makes more misclassifications for some values of a metadata feature than for the others.
Specification arguments:
metadata_labels (list): A list of the metadata features to use as a basis for the calculations
YAML specification:
definitions:
reports:
my_confounder_report:
ConfounderAnalysis:
metadata_labels:
- age
- sex
DeepRCMotifDiscovery¶
This report plots the contributions of (i) input sequences and (ii) kernels to trained DeepRC model with respect to the test dataset. Contributions are computed using integrated gradients (IG). This report produces two figures:
inputs_integrated_gradients: Shows the contributions of the characters within the input sequences (test dataset) that was most important for immune status prediction of the repertoire. IG is only applied to sequences of positive class repertoires.
kernel_integrated_gradients: Shows the 1D CNN kernels with the highest contribution over all positions and amino acids.
For both inputs and kernels: Larger characters in the extracted motifs indicate higher contribution, with blue indicating positive contribution and red indicating negative contribution towards the prediction of the immune status. For kernels only: contributions to positional encoding are indicated by < (beginning of sequence), ∧ (center of sequence), and > (end of sequence).
See DeepRCMotifDiscovery for repertoire classification for a more detailed example.
Reference:
Widrich, M., et al. (2020). Modern Hopfield Networks and Attention for Immune Repertoire Classification. Advances in Neural Information Processing Systems, 33. https://proceedings.neurips.cc//paper/2020/hash/da4902cb0bc38210839714ebdcf0efc3-Abstract.html
Example output:
Specification arguments:
n_steps (int): Number of IG steps (more steps -> better path integral -> finer contribution values). 50 is usually good enough.
threshold (float): Only applies to the plotting of kernels. Contributions are normalized to range [0, 1], and only kernels with normalized contributions above threshold are plotted.
YAML specification:
definitions:
reports:
my_deeprc_report:
DeepRCMotifDiscovery:
threshold: 0.5
n_steps: 50
KernelSequenceLogo¶
A report that plots kernels of a CNN model as sequence logos. It works only with trained ReceptorCNN models which has kernels already normalized to represent information gain matrices. Additionally, it also plots the weights in the final fully-connected layer of the network associated with kernel outputs. For more information on how the model works, see ReceptorCNN.
The kernels are visualized using Logomaker. Original publication: Tareen A, Kinney JB. Logomaker: beautiful sequence logos in Python. Bioinformatics. 2020; 36(7):2272-2274. doi:10.1093/bioinformatics/btz921.
YAML specification:
definitions:
reports:
my_kernel_seq_logo: KernelSequenceLogo
MotifSeedRecovery¶
This report can be used to show how well implanted motifs (for example, through the Simulation instruction) can be recovered by various machine learning methods using the k-mer encoding. This report creates a boxplot, where the x axis (box grouping) represents the maximum possible overlap between an implanted motif seed and a kmer feature (measured in number of positions), and the y axis shows the coefficient size of the respective kmer feature. If the machine learning method has learned the implanted motif seeds, the coefficient size is expected to be largest for the kmer features with high overlap to the motif seeds.
Note that to use this report, the following criteria must be met:
KmerFrequencyEncoder must be used.
One of the following classifiers must be used: RandomForestClassifier, LogisticRegression, SVM, SVC
For each label, the implanted motif seeds relevant to that label must be specified
To find the overlap score between kmer features and implanted motif seeds, the two sequences are compared in a sliding window approach, and the maximum overlap is calculated.
Overlap scores between kmer features and implanted motifs are calculated differently based on the Hamming distance that was allowed during implanting.
Without hamming distance:
Seed: AAA -> score = 3
Feature: xAAAx
^^^
Seed: AAA -> score = 0
Feature: xAAxx
With hamming distance:
Seed: AAA -> score = 3
Feature: xAAAx
^^^
Seed: AAA -> score = 2
Feature: xAAxx
^^
Furthermore, gap positions in the motif seed are ignored:
Seed: A/AA -> score = 3
Feature: xAxAAx
^/^^
See Recovering simulated immune signals for more details.
Example output:
Specification arguments:
implanted_motifs_per_label (dict): a nested dictionary that specifies the motif seeds that were implanted in the given dataset. The first level of keys in this dictionary represents the different labels. In the inner dictionary there should be two keys: “seeds” and “hamming_distance”:
seeds: a list of motif seeds. The seeds may contain gaps, specified by a ‘/’ symbol.
hamming_distance: A boolean value that specifies whether hamming distance was allowed when implanting the motif seeds for a given label. Note that this applies to all seeds for this label.
gap_sizes: a list of all the possible gap sizes that were used when implanting a gapped motif seed. When no gapped seeds are used, this value has no effect.
YAML specification:
definitions:
reports:
my_motif_report:
MotifSeedRecovery:
implanted_motifs_per_label:
CD:
seeds:
- AA/A
- AAA
hamming_distance: False
gap_sizes:
- 0
- 1
- 2
T1D:
seeds:
- CC/C
- CCC
hamming_distance: True
gap_sizes:
- 2
ROCCurve¶
A report that plots the ROC curve for a binary classifier.
YAML specification:
definitions:
reports:
my_roc_report: ROCCurve
SequenceAssociationLikelihood¶
Plots the beta distribution used as a prior for class assignment in ProbabilisticBinaryClassifier. The distribution plotted shows the probability that a sequence is associated with a given class for a label.
YAML specification:
definitions:
reports:
my_sequence_assoc_report: SequenceAssociationLikelihood
TCRdistMotifDiscovery¶
The report for discovering motifs in paired immune receptor data of given specificity based on TCRdist3. The receptors are hierarchically clustered based on the tcrdist distance and then motifs are discovered for each cluster. The report outputs logo plots for the motifs along with the raw data used for plotting in csv format.
For the implementation, TCRdist3 library was used (source code available here). More details on the functionality used for this report are available here.
Original publications:
Dash P, Fiore-Gartland AJ, Hertz T, et al. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature. 2017; 547(7661):89-93. doi:10.1038/nature22383
Mayer-Blackwell K, Schattgen S, Cohen-Lavi L, et al. TCR meta-clonotypes for biomarker discovery with tcrdist3: quantification of public, HLA-restricted TCR biomarkers of SARS-CoV-2 infection. bioRxiv. Published online December 26, 2020:2020.12.24.424260. doi:10.1101/2020.12.24.424260
Example output:
Specification arguments:
positive_class_name (str): the class value (e.g., epitope) used to select only the receptors that are specific to the given epitope so that only those sequences are used to infer motifs; the reference receptors as required by TCRdist will be the ones from the dataset that have different or no epitope specified in their metadata; if the labels are available only on the epitope level (e.g., label is “AVFDRKSDAK” and classes are True and False), then here it should be specified that only the receptors with value “True” for label “AVFDRKSDAK” should be used; there is no default value for this argument
cores (int): number of processes to use for the computation of the distance and motifs
min_cluster_size (int): the minimum size of the cluster to discover the motifs for
use_reference_sequences (bool): when showing motifs, this parameter defines if reference sequences should be provided as well as a background
YAML specification:
definitions:
reports:
my_tcr_dist_report: # user-defined name
TCRdistMotifDiscovery:
positive_class_name: True # class name, could also be epitope name, depending on how it's defined in the dataset
cores: 4
min_cluster_size: 30
use_reference_sequences: False
TrainingPerformance¶
A report that plots the evaluation metrics for the performance given machine learning model and training dataset.
The available metrics are accuracy, balanced_accuracy, confusion_matrix, f1_micro, f1_macro, f1_weighted, precision,
recall, auc and log_loss (see immuneML.environment.Metric.Metric
).
Specification arguments:
metrics (list): A list of metrics used to evaluate training performance. See
immuneML.environment.Metric.Metric
for available options.
YAML specification:
definitions:
reports:
my_performance_report:
TrainingPerformance:
metrics:
- accuracy
- balanced_accuracy
- confusion_matrix
- f1_micro
- f1_macro
- f1_weighted
- precision
- recall
- auc
- log_loss
Train ML model reports¶
Train ML model reports plot general statistics or export data of multiple models simultaneously when running the TrainMLModel instruction.
In the TrainMLModel instruction, train ML model reports can be specified under ‘reports’. Example:
my_instruction:
type: TrainMLModel
reports:
- my_train_ml_model_report
# other parameters...
CVFeaturePerformance¶
This report plots the average training vs test performance w.r.t. given encoding parameter which is explicitly set in the feature attribute. It can be used only in combination with TrainMLModel instruction and can be only specified under ‘reports’
Specification arguments:
feature: name of the encoder parameter w.r.t. which the performance across training and test will be shown. Possible values depend on the encoder on which it is used.
is_feature_axis_categorical (bool): if the x-axis of the plot where features are shown should be categorical; alternatively it is automatically determined based on the feature values
YAML specification:
definitions:
reports:
report1:
CVFeaturePerformance:
feature: p_value_threshold # parameter value of SequenceAbundance encoder
is_feature_axis_categorical: True # show x-axis as categorical
DiseaseAssociatedSequenceCVOverlap¶
DiseaseAssociatedSequenceCVOverlap report makes one heatmap per label showing the overlap of disease-associated sequences (or k-mers)
produced by the SequenceAbundanceEncoder
,
CompAIRRSequenceAbundanceEncoder
or
KmerAbundanceEncoder
between folds of cross-validation (either inner or outer loop of the nested CV). The overlap is computed by the following equation:
For details, see Greiff V, Menzel U, Miho E, et al. Systems Analysis Reveals High Genetic and Antigen-Driven Predetermination of Antibody Repertoires throughout B Cell Development. Cell Reports. 2017;19(7):1467-1478. doi:10.1016/j.celrep.2017.04.054.
Specification arguments:
compare_in_selection (bool): whether to compute the overlap over the inner loop of the nested CV - the sequence overlap is shown across CV folds for the model chosen as optimal within that selection
compare_in_assessment (bool): whether to compute the overlap over the optimal models in the outer loop of the nested CV
YAML specification:
definitions:
reports:
my_overlap_report: DiseaseAssociatedSequenceCVOverlap # report has no parameters
MLSettingsPerformance¶
Report for TrainMLModel instruction: plots the performance for each of the setting combinations as defined under ‘settings’ in the assessment (outer validation) loop.
The performances are grouped by label (horizontal panels) encoding (vertical panels) and ML method (bar color). When multiple data splits are used, the average performance over the data splits is shown with an error bar representing the standard deviation.
This report can be used only with TrainMLModel instruction under ‘reports’.
Specification arguments:
single_axis_labels (bool): whether to use single axis labels. Note that using single axis labels makes the figure unsuited for rescaling, as the label position is given in a fixed distance from the axis. By default, single_axis_labels is False, resulting in standard plotly axis labels.
x_label_position (float): if single_axis_labels is True, this should be an integer specifying the x axis label position relative to the x axis. The default value for label_position is -0.1.
y_label_position (float): same as x_label_position, but for the y-axis.
YAML specification:
definitions:
reports:
my_hp_report: MLSettingsPerformance
ROCCurveSummary¶
This report plots ROC curves for all trained ML settings ([preprocessing], encoding, ML model) in the outer loop of cross-validation in the TrainMLModel instruction. If there are multiple splits in the outer loop, this report will make one plot per split. This report is defined only for binary classification. If there are multiple labels defined in the instruction, each label has to have two classes to be included in this report.
YAML specification:
definitions:
reports:
my_roc_summary_report: ROCCurveSummary
ReferenceSequenceOverlap¶
The ReferenceSequenceOverlap report compares a list of disease-associated sequences (or k-mers) produced by the
SequenceAbundanceEncoder
,
CompAIRRSequenceAbundanceEncoder
or
KmerAbundanceEncoder
to
a list of reference sequences. It outputs a Venn diagram and a list of sequences found both in the encoder and reference list.
The report compares the sequences by their sequence content and the additional comparison_attributes (such as V or J gene), as specified by the user.
Specification arguments:
reference_path (str): path to the reference file in csv format which contains one entry per row and has columns that correspond to the attributes listed under comparison_attributes argument
comparison_attributes (list): list of attributes to use for comparison; all of them have to be present in the reference file where they should be the names of the columns
label (str): name of the label for which the reference sequences/k-mers should be compared to the model; if none, it takes the one label from the instruction; if it is none and multiple labels were specified for the instruction, the report will not be generated
YAML specification:
definitions:
reports:
my_reference_overlap_report:
ReferenceSequenceOverlap:
reference_path: reference_sequences.csv # example usage with SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder
comparison_attributes:
- sequence_aa
- v_call
- j_call
my_reference_overlap_report_with_kmers:
ReferenceSequenceOverlap:
reference_path: reference_kmers.csv # example usage with KmerAbundanceEncoder
comparison_attributes:
- k-mer
Multi dataset reports¶
Multi dataset reports are special reports that can be specified when running immuneML with the MultiDatasetBenchmarkTool
.
See Manuscript use case 1: Robustness assessment for an example.
When running the MultiDatasetBenchmarkTool
, multi dataset reports can be specified under ‘benchmark_reports’.
Example:
my_instruction:
type: TrainMLModel
benchmark_reports:
- my_benchmark_report
# other parameters...
DiseaseAssociatedSequenceOverlap¶
DiseaseAssociatedSequenceOverlap report makes a heatmap showing the overlap of disease-associated sequences (or k-mers)
produced by the SequenceAbundanceEncoder
,
CompAIRRSequenceAbundanceEncoder
or
KmerAbundanceEncoder
between multiple datasets of different sizes (different number of repertoires per dataset).
This plot can be used only with MultiDatasetBenchmarkTool.
The overlap is computed by the following equation:
For details, see: Greiff V, Menzel U, Miho E, et al. Systems Analysis Reveals High Genetic and Antigen-Driven Predetermination of Antibody Repertoires throughout B Cell Development. Cell Reports. 2017;19(7):1467-1478. doi:10.1016/j.celrep.2017.04.054.
YAML specification:
definitions:
reports:
my_overlap_report: DiseaseAssociatedSequenceOverlap # report has no parameters
PerformanceOverview¶
PerformanceOverview report creates an ROC plot and precision-recall plot for optimal trained models on multiple datasets. The labels on the plots are the names of the datasets, so it might be good to have user-friendly names when defining datasets that are still a combination of letters, numbers and the underscore sign.
This report can be used only with MultiDatasetBenchmarkTool as it will plot ROC and PR curve for trained models across datasets. Also, it requires the task to be immune repertoire classification and cannot be used for receptor or sequence classification. Furthermore, it uses predictions on the test dataset to assess the performance and plot the curves. If the parameter refit_optimal_model is set to True, all data will be used to fit the optimal model, so there will not be a test dataset which can be used to assess performance and the report will not be generated.
If datasets have the same number of examples, the baseline PR curve will be plotted as described in this publication: Saito T, Rehmsmeier M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS ONE. 2015;10(3):e0118432. doi:10.1371/journal.pone.0118432
If the datasets have different number of examples, the baseline PR curve will not be plotted.
YAML specification:
definitions:
reports:
my_performance_report: PerformanceOverview