Report parameters

Under the definitions/reports component, the user can specify reports which visualise or summarise different properties of the dataset or analysis.

Reports have been divided into different types. Different types of reports can be specified depending on which instruction is run. Click on the name of the report type to see more details.

  • Data reports show some type of features or statistics about a given dataset.

  • Encoding reports show some type of features or statistics about an encoded dataset, or may export relevant sequences or tables.

  • ML model reports show some type of features or statistics about a single trained ML model (e.g., model coefficients).

  • Train ML model reports plot general statistics or export data of multiple models simultaneously when running the TrainMLModel instruction (e.g., performance comparison between models).

  • Multi dataset reports are special reports that can be specified when running immuneML with the MultiDatasetBenchmarkTool. See Manuscript use case 1: Robustness assessment for an example.

Data reports

Data reports show some type of features or statistics about a given dataset.

When running the TrainMLModel instruction, data reports can be specified inside the ‘selection’ or ‘assessment’ specification under the keys ‘reports/data’ (current cross-validation split) or ‘reports/data_splits’ (train/test sub-splits). Example:

my_instruction:
    type: TrainMLModel
    selection:
        reports:
            data:
                - my_data_report
        # other parameters...
    assessment:
        reports:
            data:
                - my_data_report
        # other parameters...
    # other parameters...
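
The example above only uses the ‘data’ key. As a sketch, assuming ‘data_splits’ is given at the same level as ‘data’ (as the key names ‘reports/data’ and ‘reports/data_splits’ suggest), a data report can also be run on the train/test sub-splits. The report name my_data_report is a placeholder for a report defined under definitions/reports:

```yaml
my_instruction:
    type: TrainMLModel
    assessment:
        reports:
            data:            # run on the current cross-validation split
                - my_data_report
            data_splits:     # run on the train/test sub-splits
                - my_data_report
        # other parameters...
    # other parameters...
```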

Alternatively, when running the ExploratoryAnalysis instruction, data reports can be specified under ‘report’. Example:

my_instruction:
    type: ExploratoryAnalysis
    analyses:
        my_first_analysis:
            report: my_data_report
            # other parameters...
    # other parameters...
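
To show how the pieces fit together, the following sketch defines a data report under definitions/reports and refers to it by name from an ExploratoryAnalysis instruction. The top-level instructions key, the dataset key and the names my_dataset and my_overview are assumptions for illustration; they are not described in this section, and the dataset definition itself is omitted.

```yaml
definitions:
    reports:
        my_overview: SimpleDatasetOverview   # report defined once, by name
    # datasets and other definitions omitted...
instructions:
    my_instruction:
        type: ExploratoryAnalysis
        analyses:
            my_first_analysis:
                dataset: my_dataset          # assumed: name of a dataset defined elsewhere
                report: my_overview          # refers to the report defined above
```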

AminoAcidFrequencyDistribution

Generates a barplot showing the relative frequency of each amino acid at each position in the sequences of a dataset.

Example output:

Coefficients report

Specification arguments:

  • alignment (str): Alignment style for aligning sequences of different lengths. Options are as follows:

    • CENTER: center-align sequences of different lengths. The middle amino acid of any sequence is labelled position 0. By default, alignment is CENTER.

    • LEFT: left-align sequences of different lengths, starting at 0.

    • RIGHT: right-align sequences of different lengths, ending at 0 (counting towards negative numbers).

    • IMGT: align sequences based on their IMGT positional numbering, considering the sequence region_type (IMGT_CDR3 or IMGT_JUNCTION). The main difference between CENTER and IMGT is that IMGT aligns the first and last amino acids, adding gaps in the middle, whereas CENTER aligns the middle of the sequences, padding with gaps at the start and end of the sequence. When region_type is IMGT_JUNCTION, the IMGT positions run from 104 (conserved C) to 118 (conserved W/F). When IMGT_CDR3 is used, these positions are 105 to 117. For long CDR3 sequences, additional numbers are added in between IMGT positions 111 and 112. See the official IMGT documentation for more details: https://www.imgt.org/IMGTScientificChart/Numbering/CDR3-IMGTgaps.html

  • relative_frequency (bool): Whether to plot relative frequencies (true) or absolute counts (false) of the positional amino acids. Note that when sequences are of different length, setting relative_frequency to True will produce different results depending on the alignment type, as some positions are only covered by the longest sequences. By default, relative_frequency is False.

  • split_by_label (bool): Whether to split the plots by a label. If set to true, the Dataset must either contain a single label, or alternatively the label of interest can be specified under ‘label’. If split_by_label is set to true, the percentage-wise frequency difference between classes is plotted additionally. By default, split_by_label is False.

  • label (str): if split_by_label is set to True, a label can be specified here.

  • region_type (str): which part of the sequence to check; e.g., IMGT_CDR3

YAML specification:

definitions:
    reports:
        my_aa_freq_report:
            AminoAcidFrequencyDistribution:
                relative_frequency: False
                split_by_label: True
                label: CMV
                region_type: IMGT_CDR3
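
As a further sketch, the specification below uses only the arguments listed above to plot relative frequencies with IMGT alignment on junction sequences. The report name is a placeholder.

```yaml
definitions:
    reports:
        my_aa_freq_imgt_report:
            AminoAcidFrequencyDistribution:
                alignment: IMGT              # align first and last amino acids, gaps in the middle
                relative_frequency: True     # plot relative frequencies instead of absolute counts
                split_by_label: False
                region_type: IMGT_JUNCTION   # IMGT positions 104 to 118
```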

GLIPH2Exporter

This report exports the receptor data to GLIPH2 format so that it can be used directly in the GLIPH2 tool. Currently, the report accepts only receptor datasets.

GLIPH2 publication: Huang H, Wang C, Rubelt F, Scriba TJ, Davis MM. Analyzing the Mycobacterium tuberculosis immune response by T-cell receptor clustering with GLIPH2 and genome-wide antigen screening. Nature Biotechnology. Published online April 27, 2020:1-9. doi:10.1038/s41587-020-0505-4

Specification arguments:

  • condition (str): name of the parameter present in the receptor metadata in the dataset; condition can be anything which can be processed in GLIPH2, such as tissue type or treatment.

YAML specification:

definitions:
    reports:
        my_gliph2_exporter:
            GLIPH2Exporter:
                condition: epitope # for instance, epitope parameter is present in receptors' metadata with values such as "MtbLys" for Mycobacterium tuberculosis (as shown in the original paper).

MotifGeneralizationAnalysis

This report splits the given dataset into a training and validation set, identifies significant motifs using the MotifEncoder on the training set and plots the precision/recall and precision/true positive predictions of motifs on both the training and validation sets. This can be used to:

  • determine the optimal recall cutoff for motifs of a given size

  • investigate how well motifs learned on a training set generalize to a test set

After running this report and determining the optimal recall cutoffs, the report MotifTestSetPerformance can be run to plot the performance on an independent test set.

Note: the MotifEncoder (and thus this report) can only be used for sequences of the same length.

Specification arguments:

  • label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.

  • training_set_identifier_path (str): Path to a file containing ‘sequence_identifiers’ of the sequences used for the training set. This file should have a single column named ‘example_id’ and have one sequence identifier per line. If training_set_identifier_path is not set, a random subset of the data (according to training_percentage) will be assigned to be the training set.

  • training_percentage (float): If training_set_identifier_path is not set, this value is used to specify the fraction of sequences that will be randomly assigned to form the training set. Should be a value between 0 and 1. By default, training_percentage is 0.7.

  • random_seed (int): Random seed for splitting the data into training and validation sets when a training_set_identifier_path is not provided.

  • split_by_motif_size (bool): Whether to split the analysis per motif size. If true, a recall threshold is learned for each motif size, and figures are generated for each motif size independently. By default, split_by_motif_size is true.

  • min_precision (float): MotifEncoder parameter. The minimum precision threshold for keeping a motif on the training set. By default, min_precision is 0.9.

  • test_precision_threshold (float): The desired precision on the test set, given that motifs are learned by using a training set with a precision threshold of min_precision. It is recommended for test_precision_threshold to be lower than min_precision, e.g., min_precision - 0.1. By default, test_precision_threshold is 0.8.

  • min_recall (float): MotifEncoder parameter. The minimum recall threshold for keeping a motif. Any learned recall threshold will be at least as high as the set min_recall value. The default value for min_recall is 0.

  • min_true_positives (int): MotifEncoder parameter. The minimum number of true positive training sequences that a motif needs to occur in. The default value for min_true_positives is 1.

  • max_positions (int): MotifEncoder parameter. The maximum motif size. This is the number of positional amino acids the motif consists of (excluding gaps). The default value for max_positions is 4.

  • min_positions (int): MotifEncoder parameter. The minimum motif size (see also: max_positions). The default value for min_positions is 1.

  • no_gaps (bool): MotifEncoder parameter. Must be set to True if only contiguous motifs (position-specific k-mers) are allowed. By default, no_gaps is False, meaning both gapped and ungapped motifs are searched for.

  • smoothen_combined_precision (bool): Whether to add a smoothed line representing the combined precision to the precision-vs-TP plot. When set to True, this may take considerable extra time to compute. By default, smoothen_combined_precision is set to True.

  • min_points_in_window (int): Parameter for smoothing the combined_precision line in the precision-vs-TP plot through lognormal kernel smoothing with adaptive window size. This parameter determines the minimum number of points that need to be present in a window to determine the adaptive window size. By default, min_points_in_window is 50.

  • smoothing_constant1: Parameter for smoothing the combined_precision line in the precision-vs-TP plot through lognormal kernel smoothing with adaptive window size. This smoothing constant determines the dependence of the smoothness on the window size. Increasing this increases smoothness for regions where few points are present. By default, smoothing_constant1 is 5.

  • smoothing_constant2: Parameter for smoothing the combined_precision line in the precision-vs-TP plot through lognormal kernel smoothing with adaptive window size. This smoothing constant can be used to scale the overall kernel width, thus influencing the smoothness of all regions regardless of data density. By default, smoothing_constant2 is 10.

  • training_set_name (str): Name of the training set to be used in figures. By default, the training_set_name is ‘training set’.

  • test_set_name (str): Name of the test set to be used in figures. By default, the test_set_name is ‘test set’.

  • highlight_motifs_path (str): Path to a set of motifs of interest to highlight in the output figures (such as implanted ground-truth motifs). By default, no motifs are highlighted.

  • highlight_motifs_name (str): If highlight_motifs_path is defined, this name will be used to label the motifs of interest in the output figures.

YAML specification:

definitions:
    reports:
        my_motif_generalization:
            MotifGeneralizationAnalysis:
                min_precision: 0.9
                min_recall: 0.1
                label: # Define a label, and the positive class for that given label
                    CMV:
                        positive_class: +
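
The example above relies on default values for most arguments. A sketch with a random, seeded training/validation split and a lower test precision threshold could look as follows; the report name and the seed are placeholders, and the values are illustrative, not recommendations.

```yaml
definitions:
    reports:
        my_motif_generalization_split:
            MotifGeneralizationAnalysis:
                training_percentage: 0.7        # 70% of sequences form the training set
                random_seed: 1                  # illustrative seed; applies because no training_set_identifier_path is set
                split_by_motif_size: True
                min_precision: 0.9
                test_precision_threshold: 0.8   # min_precision - 0.1, as suggested above
                max_positions: 4
                min_positions: 1
                no_gaps: False
                label: # Define a label, and the positive class for that given label
                    CMV:
                        positive_class: +
```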

ReceptorDatasetOverview

This report plots the length distribution per chain for a receptor (paired-chain) dataset.

Specification arguments:

  • batch_size (int): how many receptors to load at once; 50 000 by default

YAML specification:

definitions:
    reports:
        my_receptor_overview_report: ReceptorDatasetOverview
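
If receptors should be loaded in smaller batches, batch_size can be set explicitly. A sketch with an arbitrary value:

```yaml
definitions:
    reports:
        my_receptor_overview_report:
            ReceptorDatasetOverview:
                batch_size: 10000   # arbitrary example value; the default is 50 000
```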

RecoveredSignificantFeatures

Compares a given collection of ground truth implanted signals (sequences or k-mers) to the significant label-associated k-mers or sequences according to Fisher’s exact test.

Internally uses the KmerAbundanceEncoder for calculating significant k-mers, and SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder to calculate significant full sequences (depending on whether the argument compairr_path was set).

This report creates two plots:

  • the first plot is a bar chart showing what percentage of the ground truth implanted signals were found to be significant.

  • the second plot is a bar chart showing what percentage of the k-mers/sequences found to be significant match the ground truth implanted signals.

To compare k-mers or sequences of differing lengths, the ground truth sequences or long k-mers are split into k-mers of the given size through a sliding window approach. When comparing ‘full_sequences’ to ground truth sequences, a match is only registered if both sequences are of equal length.

Specification arguments:

  • ground_truth_sequences_path (str): Path to a file containing the true implanted (sub)sequences, e.g., full sequences or k-mers. The file should contain one sequence per line, without a header, and without V or J genes.

  • sequence_type (str): either amino acid or nucleotide; which type of sequence to use for the analysis

  • region_type (str): which AIRR field to use for comparison, e.g. IMGT_CDR3 or IMGT_JUNCTION

  • p_values (list): The p-value thresholds to be used by Fisher’s exact test. Each p-value specified here will become one panel in the output figure.

  • k_values (list): Length of the k-mers (number of amino acids) created by the KmerAbundanceEncoder. When using a full sequence encoding (SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder), specify ‘full_sequence’ here. Each value specified under k_values will represent one bar in the output figure.

  • label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.

  • compairr_path (str): If ‘full_sequence’ is listed under k_values, the path to the CompAIRR executable may be provided. If the compairr_path is specified, the CompAIRRSequenceAbundanceEncoder will be used to compute the significant sequences. If the path is not specified and ‘full_sequence’ is listed under k_values, SequenceAbundanceEncoder will be used.

YAML specification:

definitions:
    reports:
        my_recovered_significant_features_report:
            RecoveredSignificantFeatures:
                ground_truth_sequences_path: path/to/groundtruth/sequences.txt
                trim_leading_trailing: False
                p_values:
                    - 0.1
                    - 0.01
                    - 0.001
                    - 0.0001
                k_values:
                    - 3
                    - 4
                    - 5
                    - full_sequence
                compairr_path: path/to/compairr # can be specified if 'full_sequence' is listed under k_values
                label: # Define a label, and the positive class for that given label
                    CMV:
                        positive_class: +

RepertoireClonotypeSummary

Shows the number of distinct clonotypes per repertoire in a given dataset as a bar plot.

Specification arguments:

  • color_by_label (str): name of the label to use to color the plot, e.g., could be disease label, or None

YAML specification:

definitions:
    reports:
        my_clonotype_summary_rep:
            RepertoireClonotypeSummary:
                color_by_label: celiac

SequenceCountDistribution

Generates a histogram of the duplicate counts of the sequences in a dataset.

Specification arguments:

  • split_by_label (bool): Whether to split the plots by a label. If set to true, the Dataset must either contain a single label, or alternatively the label of interest can be specified under ‘label’. By default, split_by_label is False.

  • label (str): Optional label for separating the results by color/creating separate plots. Note that this should be the name of a valid dataset label.

YAML specification:

definitions:
    reports:
        my_sld_report:
            SequenceCountDistribution:
                label: disease
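
A sketch that sets both arguments explicitly, written with the definitions/reports nesting used by the other reports in this section. The report name is a placeholder and the label name disease is taken from the example above:

```yaml
definitions:
    reports:
        my_count_distribution_report:
            SequenceCountDistribution:
                split_by_label: True   # create separate plots per class of the label
                label: disease         # must be the name of a valid dataset label
```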

SequenceLengthDistribution

Generates a histogram of the lengths of the sequences in a dataset.

Specification arguments:

  • sequence_type (str): whether to check the length of amino acid or nucleotide sequences; default value is ‘amino_acid’

  • region_type (str): which part of the sequence to examine; e.g., IMGT_CDR3

YAML specification:

definitions:
    reports:
        my_sld_report:
            SequenceLengthDistribution:
                sequence_type: amino_acid
                region_type: IMGT_CDR3

SequencesWithSignificantKmers

Given a list of reference sequences, this report writes out the subsets of reference sequences containing significant k-mers (as computed by the KmerAbundanceEncoder using Fisher’s exact test).

For each combination of p-value and k-mer size given, a file is written containing all sequences containing a significant k-mer of the given size at the given p-value.

Specification arguments:

  • reference_sequences_path (str): Path to a file containing the reference sequences. The file should contain one sequence per line, without a header, and without V or J genes.

  • p_values (list): The p-value thresholds to be used by Fisher’s exact test. One output file is written for each combination of p-value and k-mer size.

  • k_values (list): Length of the k-mers (number of amino acids) created by the KmerAbundanceEncoder. One output file is written for each combination of k-mer size and p-value.

  • label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.

YAML specification:

definitions:
    reports:
        my_sequences_with_significant_kmers:
            SequencesWithSignificantKmers:
                reference_sequences_path: path/to/reference/sequences.txt
                p_values:
                    - 0.1
                    - 0.01
                    - 0.001
                    - 0.0001
                k_values:
                    - 3
                    - 4
                    - 5
                label: # Define a label, and the positive class for that given label
                    CMV:
                        positive_class: +

SignificantFeatures

Plots a boxplot of the number of significant features (label-associated k-mers or sequences) per Repertoire according to Fisher’s exact test, across different classes for the given label.

Internally uses the KmerAbundanceEncoder for calculating significant k-mers, and SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder to calculate significant full sequences (depending on whether the argument compairr_path was set).

Specification arguments:

  • p_values (list): The p-value thresholds to be used by Fisher’s exact test. Each p-value specified here will become one panel in the output figure.

  • k_values (list): Length of the k-mers (number of amino acids) created by the KmerAbundanceEncoder. When using a full sequence encoding (SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder), specify ‘full_sequence’ here. Each value specified under k_values will represent one boxplot in the output figure.

  • label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.

  • compairr_path (str): If ‘full_sequence’ is listed under k_values, the path to the CompAIRR executable may be provided. If the compairr_path is specified, the CompAIRRSequenceAbundanceEncoder will be used to compute the significant sequences. If the path is not specified and ‘full_sequence’ is listed under k_values, SequenceAbundanceEncoder will be used.

  • log_scale (bool): Whether to plot the y axis in log10 scale (log_scale = True) or continuous scale (log_scale = False). By default, log_scale is False.

YAML specification:

definitions:
    reports:
        my_significant_features_report:
            SignificantFeatures:
                p_values:
                    - 0.1
                    - 0.01
                    - 0.001
                    - 0.0001
                k_values:
                    - 3
                    - 4
                    - 5
                    - full_sequence
                compairr_path: path/to/compairr # can be specified if 'full_sequence' is listed under k_values
                label: # Define a label, and the positive class for that given label
                    CMV:
                        positive_class: +
                log_scale: False

SignificantKmerPositions

Plots the number of significant k-mers (as computed by the KmerAbundanceEncoder using Fisher’s exact test) observed at each IMGT position of a given list of reference sequences. This report creates a stacked bar chart, where each bar represents an IMGT position, and each segment of the stack represents the observed frequency of one ‘significant’ k-mer at that position.

Specification arguments:

  • reference_sequences_path (str): Path to a file containing the reference sequences. The file should contain one sequence per line, without a header, and without V or J genes.

  • p_values (list): The p-value thresholds to be used by Fisher’s exact test. Each p-value specified here will become one panel in the output figure.

  • k_values (list): Length of the k-mers (number of amino acids) created by the KmerAbundanceEncoder. Each k-mer length will become one panel in the output figure.

  • label (dict): A label configuration. One label should be specified, and the positive_class for this label should be defined. See the YAML specification below for an example.

  • sequence_type (str): nucleotide or amino_acid

  • region_type (str): which AIRR field to consider, e.g., IMGT_CDR3 or IMGT_JUNCTION

YAML specification:

definitions:
    reports:
        my_significant_kmer_positions_report:
            SignificantKmerPositions:
                reference_sequences_path: path/to/reference/sequences.txt
                p_values:
                    - 0.1
                    - 0.01
                    - 0.001
                    - 0.0001
                k_values:
                    - 3
                    - 4
                    - 5
                label: # Define a label, and the positive class for that given label
                    CMV:
                        positive_class: +
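
The example above omits sequence_type and region_type. A sketch that sets them explicitly, using values named in the argument list and a single p-value and k-mer size for brevity:

```yaml
definitions:
    reports:
        my_significant_kmer_positions_report:
            SignificantKmerPositions:
                reference_sequences_path: path/to/reference/sequences.txt
                p_values:
                    - 0.001
                k_values:
                    - 3
                label: # Define a label, and the positive class for that given label
                    CMV:
                        positive_class: +
                sequence_type: amino_acid    # nucleotide or amino_acid
                region_type: IMGT_JUNCTION   # or IMGT_CDR3
```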

SimpleDatasetOverview

Generates a simple text-based overview of the properties of any dataset, including the dataset name, size, and metadata labels.

YAML specification:

definitions:
    reports:
        my_overview: SimpleDatasetOverview

VJGeneDistribution

This report creates several plots to gain insight into the V and J gene distribution of a given dataset. When a label is provided, the information in the plots is separated per label value, either by color or by creating separate plots. This way one can, for example, see whether a particular V or J gene is more prevalent among disease-associated receptors.

  • Individual V and J gene distributions: for sequence and receptor datasets, a bar plot is created showing how often each V or J gene occurs in the dataset. For repertoire datasets, boxplots are used to represent how often each V or J gene is used across all repertoires. Since repertoires may differ in size, these counts are normalised by the repertoire size (original count values are additionally exported in tsv files).

  • Combined V and J gene distributions: for sequence and receptor datasets, a heatmap is created showing how often each combination of V and J genes occurs in the dataset. A similar plot is created for repertoire datasets, except in this case only the average value of the normalised gene usage frequencies is shown (original count values are additionally exported in tsv files).

Specification arguments:

  • split_by_label (bool): Whether to split the plots by a label. If set to true, the Dataset must either contain a single label, or alternatively the label of interest can be specified under ‘label’. By default, split_by_label is False.

  • label (str): Optional label for separating the results by color/creating separate plots. Note that this should be the name of a valid dataset label.

  • is_sequence_label (bool): for RepertoireDatasets, indicates if the label applies to the sequence level (e.g., antigen binding versus non-binding across repertoires) or repertoire level (e.g., diseased repertoires versus healthy repertoires). By default, is_sequence_label is False. For Sequence- and ReceptorDatasets, this parameter is ignored.

YAML specification:

definitions:
    reports:
        my_vj_gene_report:
            VJGeneDistribution:
                label: ag_binding

Encoding reports

Encoding reports show some type of features or statistics about an encoded dataset, or may in some cases export relevant sequences or tables.

When running the TrainMLModel instruction, encoding reports can be specified inside the ‘selection’ or ‘assessment’ specification under the key ‘reports/encoding’. Example:

my_instruction:
    type: TrainMLModel
    selection:
        reports:
            encoding:
                - my_encoding_report
        # other parameters...
    assessment:
        reports:
            encoding:
                - my_encoding_report
        # other parameters...
    # other parameters...

Alternatively, when running the ExploratoryAnalysis instruction, encoding reports can be specified under ‘report’. Example:

my_instruction:
    type: ExploratoryAnalysis
    analyses:
        my_first_analysis:
            report: my_encoding_report
            # other parameters...
    # other parameters...

DesignMatrixExporter

Exports the design matrix and related information of a given encoded Dataset to csv files. If the encoded data has more than 2 dimensions (such as when using the OneHot encoder with option Flatten=False), the data are then exported to different formats to facilitate their import with external software.

Specification arguments:

  • file_format (str): the format and extension of the file to store the design matrix. The supported formats are: npy, csv, pt, hdf5, npy.zip, csv.zip or hdf5.zip.

Note: when using hdf5 or hdf5.zip output formats, make sure the ‘hdf5’ dependency is installed.

YAML specification:

definitions:
    reports:
        my_dme_report:
            DesignMatrixExporter:
                file_format: csv
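
For encoded data with more than 2 dimensions, one of the non-csv formats listed above can be chosen. A sketch using compressed NumPy output; the report name is a placeholder:

```yaml
definitions:
    reports:
        my_dme_npy_report:
            DesignMatrixExporter:
                file_format: npy.zip   # one of: npy, csv, pt, hdf5, npy.zip, csv.zip, hdf5.zip
```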

DimensionalityReduction

This report visualizes the data obtained by dimensionality reduction.

Specification arguments:

  • label (str): name of the label to use for highlighting data points; or None

  • dim_red_method (str): name of the dimensionality reduction method defined under ml_methods that will be used to transform the data for plotting; if None, the report visualizes the dimensionality-reduced encoded data, if that has been set

YAML specification:

definitions:
    reports:
        rep1:
            DimensionalityReduction:
                label: epitope
                dim_red_method:
                    PCA:
                        n_components: 2

FeatureComparison

Encoding a dataset results in a numeric matrix, where the rows are examples (e.g., sequences, receptors, repertoires) and the columns are features. For example, when the KmerFrequency encoder is used, the features are the k-mers (AAA, AAC, etc.) and the feature values are the frequencies per k-mer.

This report separates the examples based on a binary metadata label, and plots the mean feature value of each feature in one example group against the other example group (for example: plot the feature value of ‘sick’ repertoires on the x axis, and ‘healthy’ repertoires on the y axis to spot consistent differences). The plot can be separated into different colors or facets using other metadata labels (for example: plot the average feature values of ‘cohort1’, ‘cohort2’ and ‘cohort3’ in different colors to spot biases).

Alternatively, to plot features without comparing them across a binary label, see the FeatureValueBarplot report, which plots a simple bar chart per feature (average across examples), or the FeatureDistribution report, which plots the distribution of each feature across examples rather than only the mean value.

Example output:

Feature comparison zoomed in plot with VLEQ highlighted

Specification arguments:

  • comparison_label (str): Mandatory label. This label is used to split the encoded data matrix and define the x and y axes of the plot. This label is only allowed to have 2 classes (for example: sick and healthy, binding and non-binding).

  • color_grouping_label (str): Optional label that is used to color the points in the scatterplot. This cannot be the same as comparison_label.

  • row_grouping_label (str): Optional label that is used to group scatterplots into different row facets. This cannot be the same as comparison_label.

  • column_grouping_label (str): Optional label that is used to group scatterplots into different column facets. This cannot be the same as comparison_label.

  • show_error_bar (bool): Whether to show the error bar (standard deviation) for the points, both in the x and y dimension.

  • log_scale (bool): Whether to plot the x and y axes in log10 scale (log_scale = True) or continuous scale (log_scale = False). By default, log_scale is False.

  • keep_fraction (float): The total number of features may be very large and only the features differing significantly across comparison labels may be of interest. When the keep_fraction parameter is set below 1, only the fraction of features that differs the most across comparison labels is kept for plotting (note that the produced .csv file still contains all data). By default, keep_fraction is 1, meaning that all features are plotted.

  • opacity (float): a value between 0 and 1 setting the opacity for data points making it easier to see if there are overlapping points

YAML specification:

definitions:
    reports:
        my_comparison_report:
            FeatureComparison: # compare the different classes defined in the label disease
                comparison_label: disease

FeatureDistribution

Encoding a dataset results in a numeric matrix, where the rows are examples (e.g., sequences, receptors, repertoires) and the columns are features. For example, when the KmerFrequency encoder is used, the features are the k-mers (AAA, AAC, etc.) and the feature values are the frequencies per k-mer.

This report plots the distribution of feature values. For each feature, a violin plot is created to show the distribution of feature values across all examples. The violin plots can be separated into different colors or facets using metadata labels (for example: plot the feature distributions of ‘cohort1’, ‘cohort2’ and ‘cohort3’ in different colors to spot biases).

See also: FeatureValueBarplot report to plot a simple bar chart per feature (average across examples), rather than the entire distribution. Or FeatureComparison report to compare features across binary metadata labels (e.g., plot the feature value of ‘sick’ repertoires on the x axis, and ‘healthy’ repertoires on the y axis).

Example output:

Feature distribution report example

Specification arguments:

  • color_grouping_label (str): The label that is used to color each bar, at each level of the grouping_label.

  • row_grouping_label (str): The label that is used to group bars into different row facets.

  • column_grouping_label (str): The label that is used to group bars into different column facets.

  • mode (str): either ‘normal’, ‘sparse’ or ‘auto’ (default). In ‘normal’ mode, one boxplot is drawn for each column of the encoded dataset matrix; in ‘sparse’ mode, all zero cells are removed before the data are passed to the boxplots. If mode is set to ‘auto’, ‘sparse’ mode is used automatically when the density of the matrix is below 0.01.

  • x_title (str): x-axis label

  • y_title (str): y-axis label

YAML specification:

definitions:
    reports:
        my_fdistr_report:
            FeatureDistribution:
                mode: sparse

FeatureValueBarplot

Encoding a dataset results in a numeric matrix, where the rows are examples (e.g., sequences, receptors, repertoires) and the columns are features. For example, when the KmerFrequency encoder is used, the features are the k-mers (AAA, AAC, etc.) and the feature values are the frequencies per k-mer.

This report plots the mean feature values per feature. A bar plot is created where the average feature value across all examples is shown, with optional error bars. The bar plots can be separated into different colors or facets using metadata labels (for example: plot the average feature values of ‘cohort1’, ‘cohort2’ and ‘cohort3’ in different colors to spot biases).

See also: FeatureDistribution report to plot the distribution of each feature across examples, rather than only showing the mean value in a bar plot. Or FeatureComparison report to compare features across binary metadata labels (e.g., plot the feature value of ‘sick’ repertoires on the x axis, and ‘healthy’ repertoires on the y axis).

Example output:

Feature value barplot report example

Specification arguments:

  • color_grouping_label (str): The label that is used to color each bar, at each level of the grouping_label.

  • row_grouping_label (str): The label that is used to group bars into different row facets.

  • column_grouping_label (str): The label that is used to group bars into different column facets.

  • show_error_bar (bool): Whether to show the error bar (standard deviation) for the bars.

  • x_title (str): x-axis label

  • y_title (str): y-axis label

  • plot_top_n (int): plot n of the largest features on average separately (useful when there are too many features to plot at the same time)

  • plot_bottom_n (int): plot n of the smallest features on average separately (useful when there are too many features to plot at the same time)

  • plot_all_features (bool): whether to plot all (might be slow for large number of features)
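The quantities behind the bar plot can be sketched as follows: a mean and a standard deviation per feature column, plus a ranking by mean to pick the features for plot_top_n and plot_bottom_n. The helper names are hypothetical, and the use of the sample standard deviation is an assumption; the report itself may compute these differently:

```python
from statistics import mean, stdev


def summarize_features(matrix, feature_names):
    """Map each feature name to (mean, standard deviation) across examples."""
    columns = zip(*matrix)  # transpose: one tuple of values per feature
    return {name: (mean(col), stdev(col))
            for name, col in zip(feature_names, columns)}


def rank_features(summary, plot_top_n, plot_bottom_n):
    """Return the names of the n largest and n smallest features by mean."""
    ranked = sorted(summary, key=lambda name: summary[name][0], reverse=True)
    top = ranked[:plot_top_n]
    bottom = ranked[max(0, len(ranked) - plot_bottom_n):] if plot_bottom_n > 0 else []
    return top, bottom


# rows are examples, columns are the k-mer features AAA, AAC and AAG
encoded = [[0.5, 0.1, 0.0],
           [0.3, 0.1, 0.2],
           [0.4, 0.4, 0.1]]
summary = summarize_features(encoded, ["AAA", "AAC", "AAG"])
print(rank_features(summary, plot_top_n=2, plot_bottom_n=1))
```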

YAML specification:

definitions:
    reports:
        my_fvb_report:
            FeatureValueBarplot: # timepoint, disease_status and age_group are metadata labels
                column_grouping_label: timepoint
                row_grouping_label: disease_status
                color_grouping_label: age_group
                plot_all_features: true
                plot_top_n: 10
                plot_bottom_n: 5

GroundTruthMotifOverlap

Creates a report displaying the overlap between learned motifs and the groundtruth motifs implanted in a given sequence dataset. This report must be used in combination with the MotifEncoder.

Specification arguments:

  • groundtruth_motifs_path (str): Path to a .tsv file containing groundtruth position-specific motifs. The file should specify the motifs as position-specific amino acids, one column representing the positions concatenated with an ‘&’ symbol, the next column specifying the amino acids concatenated with ‘&’ symbol, and the last column specifying the implant rate.

    Example:

    indices    amino_acids    n_sequences
    0          A              4
    4&8&9      G&A&C          30

    This file shows a motif ‘A’ at position 0 implanted in 4 sequences, and motif G---AC implanted between positions 4 and 9 in 30 sequences.
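To make the file layout concrete, the following sketch parses such a tab-separated file with Python's standard csv module. The function name read_groundtruth_motifs is hypothetical, and the column names are assumed to match the example above; this is not the parser used by immuneML:

```python
import csv
import io


def read_groundtruth_motifs(handle):
    """Parse a tab-separated groundtruth motif file into a list of dicts."""
    motifs = []
    for row in csv.DictReader(handle, delimiter="\t"):
        motifs.append({
            # positions and amino acids are both concatenated with '&'
            "indices": [int(index) for index in row["indices"].split("&")],
            "amino_acids": row["amino_acids"].split("&"),
            "n_sequences": int(row["n_sequences"]),
        })
    return motifs


example_file = io.StringIO(
    "indices\tamino_acids\tn_sequences\n"
    "0\tA\t4\n"
    "4&8&9\tG&A&C\t30\n"
)
for motif in read_groundtruth_motifs(example_file):
    print(motif)
```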

YAML specification:

definitions:
    reports:
        my_ground_truth_motif_report:
            GroundTruthMotifOverlap:
                groundtruth_motifs_path: path/to/file.tsv

Matches

Reports the number of matches that were found when using one of the following encoders: MatchedSequences encoder, MatchedReceptors encoder or MatchedRegex encoder.

Report results are:

  • A table containing all matches, where the rows correspond to the Repertoires, and the columns correspond to the objects to match (regular expressions or receptor sequences).

  • The repertoire sizes (read frequencies and the number of unique sequences per repertoire), for each of the chains. This can be used to calculate the percentage of matched sequences in a repertoire.

  • When using MatchedSequences encoder or MatchedReceptors encoder, tables describing the chains and receptors (ids, chains, V and J genes and sequences).

  • When using MatchedReceptors encoder or using MatchedRegex encoder with chain pairs, tables describing the paired matches (where a match was found in both chains) per repertoire.

YAML specification:

definitions:
    reports:
        my_match_report: Matches

MotifTestSetPerformance

This report can be used to show the performance of a set of motifs learned using the MotifEncoder on an independent test set of unseen data.

It is recommended to first run the report MotifGeneralizationAnalysis in order to calibrate the optimal recall thresholds and plot the performance of motifs on training and validation sets.

Specification arguments:

  • test_dataset (dict): parameters for importing a SequenceDataset to use as an independent test set. By default, the import parameters ‘is_repertoire’ and ‘paired’ will be set to False to ensure a SequenceDataset is imported.

YAML specification:

definitions:
    reports:
        my_motif_report:
            MotifTestSetPerformance:
                test_dataset:
                    format: AIRR # choose any valid import format
                    params:
                        path: path/to/files/
                        is_repertoire: False  # is_repertoire must be False to import a SequenceDataset
                        paired: False         # paired must be False to import a SequenceDataset
                        # optional other parameters...

NonMotifSequenceSimilarity

Plots the similarity of positions outside the motifs of interest. This report can be used to investigate whether the motifs of interest, as determined by the MotifEncoder, tend to occur in sequences that are naturally very similar or dissimilar.

For each motif, the subset of sequences containing the motif is selected, and the hamming distances are computed between all sequences in this subset. Finally, a plot is created showing the distribution of hamming distances between the sequences containing the motif. For motifs occurring in sets of very similar sequences, this distribution will lean towards small hamming distances. Likewise, for motifs occurring in a very diverse set of sequences, the distribution will lean towards containing more large hamming distances.
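The procedure above can be sketched as follows, assuming a position-specific motif given as a list of indices and a list of amino acids, and assuming all sequences have the same length so that the Hamming distance is defined. All function names are hypothetical and chosen for illustration:

```python
from collections import Counter
from itertools import combinations


def contains_motif(sequence, indices, amino_acids):
    """True if the sequence has each motif amino acid at its position."""
    return all(index < len(sequence) and sequence[index] == amino_acid
               for index, amino_acid in zip(indices, amino_acids))


def hamming_distance(first, second):
    """Number of positions at which two equal-length sequences differ."""
    if len(first) != len(second):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(a != b for a, b in zip(first, second))


def hamming_distribution(sequences, indices, amino_acids):
    """Count pairwise Hamming distances among sequences containing the motif."""
    subset = [s for s in sequences if contains_motif(s, indices, amino_acids)]
    return Counter(hamming_distance(a, b) for a, b in combinations(subset, 2))


sequences = ["CASSA", "CASTA", "CGSTA", "AASSA"]
# motif: C at position 0 and S at position 2 (AASSA does not contain it)
print(hamming_distribution(sequences, indices=[0, 2], amino_acids=["C", "S"]))
```

Because the motif positions are identical in every selected sequence, they contribute nothing to the distances, so the distribution reflects only the positions outside the motif.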

Specification arguments:

  • motif_color_map (dict): An optional mapping between motif sizes and colors. If no mapping is given, default colors will be chosen.

YAML specification:

definitions:
    reports:
        my_motif_sim:
            NonMotifSequenceSimilarity:
                motif_color_map:
                    3: "#66C5CC"
                    4: "#F6CF71"
                    5: "#F89C74"

PositionalMotifFrequencies

This report must be used in combination with the MotifEncoder. Plots a stacked bar plot of amino acid occurrence at different indices in any given dataset, along with a plot investigating motif continuity which displays a bar plot of the gap sizes between the amino acids in the motifs in the given dataset. Note that a distance of 1 means that the amino acids are continuous (next to each other).

Specification arguments:

  • motif_color_map (dict): Optional mapping between motif lengths and specific colors to be used. Example:

    motif_color_map:
        1: "#66C5CC"
        2: "#F6CF71"
        3: "#F89C74"

YAML specification:

definitions:
    reports:
        my_pos_motif_report:
            PositionalMotifFrequencies:
                motif_color_map:

RelevantSequenceExporter

Exports the sequences that are extracted as label-associated when using the SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder in AIRR-compliant format.

YAML specification:

definitions:
    reports:
        my_relevant_sequences: RelevantSequenceExporter

ML model reports

ML model reports show some type of features or statistics about a single trained ML model.

In the TrainMLModel instruction, ML model reports can be specified inside the ‘selection’ or ‘assessment’ specification under the key ‘reports/models’. Example:

my_instruction:
    type: TrainMLModel
    selection:
        reports:
            models:
                - my_ml_report
        # other parameters...
    assessment:
        reports:
            models:
                - my_ml_report
        # other parameters...
    # other parameters...

BinaryFeaturePrecisionRecall

Plots the precision and recall scores for each added feature to the collection of features selected by the BinaryFeatureClassifier.

YAML specification:

definitions:
    reports:
        my_report: BinaryFeaturePrecisionRecall

Coefficients

A report that plots the coefficients for a given ML method in a barplot. Can be used for LogisticRegression, SVM, SVC, and RandomForestClassifier. In the case of RandomForestClassifier, the feature importances will be plotted.

When used in TrainMLModel instruction, the report can be specified under ‘models’, both on the selection and assessment levels.

Which coefficients should be plotted (for example: only nonzero, above a certain threshold, …) can be specified. Multiple options can be specified simultaneously. By default the 25 largest coefficients are plotted. The full set of coefficients will also be exported as a csv file.

Example output:

Coefficients report

Specification arguments:

  • coefs_to_plot (list): A list specifying which coefficients should be plotted. Valid values are: ALL, NONZERO, CUTOFF, N_LARGEST.

  • cutoff (list): If ‘cutoff’ is specified under ‘coefs_to_plot’, the cutoff values can be specified here. The coefficients which have an absolute value equal to or greater than the cutoff will be plotted.

  • n_largest (list): If ‘n_largest’ is specified under ‘coefs_to_plot’, the values for n can be specified here. These should be integer values. The n largest coefficients are determined based on their absolute values.
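The selection rules described by these arguments can be sketched in a few lines. The helper names are hypothetical; the rules follow the descriptions above (the cutoff is inclusive and applied to absolute values, and the n largest coefficients are ranked by absolute value):

```python
def nonzero(coefficients):
    """Keep only coefficients that are not exactly zero."""
    return {name: value for name, value in coefficients.items() if value != 0}


def above_cutoff(coefficients, cutoff):
    """Keep coefficients whose absolute value is equal to or greater than cutoff."""
    return {name: value for name, value in coefficients.items()
            if abs(value) >= cutoff}


def n_largest(coefficients, n):
    """Keep the n coefficients with the largest absolute values."""
    ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)
    return dict(ranked[:n])


coefficients = {"AAA": 0.5, "AAC": -0.2, "AAG": 0.0, "AAT": 0.05}
print(above_cutoff(coefficients, 0.1))  # keeps AAA and the negative AAC
print(n_largest(coefficients, 2))
```

Ranking by absolute value means that strongly negative coefficients are shown alongside strongly positive ones.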

YAML specification:

definitions:
    reports:
        my_coef_report:
            Coefficients:
                coefs_to_plot:
                    - all
                    - nonzero
                    - cutoff
                    - n_largest
                cutoff:
                    - 0.1
                    - 0.01
                n_largest:
                    - 5
                    - 10

ConfounderAnalysis

A report that plots the numbers of false positives and false negatives with respect to each value of the metadata features specified by the user. This allows checking whether a given machine learning model makes more misclassifications for some values of a metadata feature than for the others.

Specification arguments:

  • metadata_labels (list): A list of the metadata features to use as a basis for the calculations
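As an illustration of the counts this report is based on, the sketch below tallies false positives and false negatives for each value of one metadata feature. The function name and the positive_class argument are assumptions for this example and are not part of the report's specification:

```python
from collections import Counter


def count_errors_by_value(true_classes, predicted_classes, metadata_values,
                          positive_class=True):
    """Count false positives and false negatives per metadata value."""
    false_positives, false_negatives = Counter(), Counter()
    for truth, prediction, value in zip(true_classes, predicted_classes, metadata_values):
        if prediction == positive_class and truth != positive_class:
            false_positives[value] += 1
        elif prediction != positive_class and truth == positive_class:
            false_negatives[value] += 1
    return false_positives, false_negatives


true_classes = [True, True, False, False, True]
predicted_classes = [False, True, True, True, False]
age = ["young", "old", "old", "young", "young"]
fp, fn = count_errors_by_value(true_classes, predicted_classes, age)
print(dict(fp), dict(fn))
```

If one metadata value collects most of the errors, as ‘young’ does for the false negatives here, that feature may be confounding the model.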

YAML specification:

definitions:
    reports:
        my_confounder_report:
            ConfounderAnalysis:
                metadata_labels:
                  - age
                  - sex

DeepRCMotifDiscovery

This report plots the contributions of (i) input sequences and (ii) kernels to a trained DeepRC model with respect to the test dataset. Contributions are computed using integrated gradients (IG). This report produces two figures:

  • inputs_integrated_gradients: Shows the contributions of the characters within the input sequences (test dataset) that were most important for immune status prediction of the repertoire. IG is only applied to sequences of positive class repertoires.

  • kernel_integrated_gradients: Shows the 1D CNN kernels with the highest contribution over all positions and amino acids.

For both inputs and kernels: Larger characters in the extracted motifs indicate higher contribution, with blue indicating positive contribution and red indicating negative contribution towards the prediction of the immune status. For kernels only: contributions to positional encoding are indicated by < (beginning of sequence), ∧ (center of sequence), and > (end of sequence).

See DeepRCMotifDiscovery for repertoire classification for a more detailed example.

Reference:

Widrich, M., et al. (2020). Modern Hopfield Networks and Attention for Immune Repertoire Classification. Advances in Neural Information Processing Systems, 33. https://proceedings.neurips.cc//paper/2020/hash/da4902cb0bc38210839714ebdcf0efc3-Abstract.html

Example output:

DeepRC IG over inputs DeepRC IG over kernels

Specification arguments:

  • n_steps (int): Number of IG steps (more steps -> better path integral -> finer contribution values). 50 is usually good enough.

  • threshold (float): Only applies to the plotting of kernels. Contributions are normalized to range [0, 1], and only kernels with normalized contributions above threshold are plotted.

YAML specification:

definitions:
    reports:
        my_deeprc_report:
            DeepRCMotifDiscovery:
                threshold: 0.5
                n_steps: 50

MotifSeedRecovery

This report can be used to show how well implanted motifs (for example, through the Simulation instruction) can be recovered by various machine learning methods using the k-mer encoding. This report creates a boxplot, where the x axis (box grouping) represents the maximum possible overlap between an implanted motif seed and a kmer feature (measured in number of positions), and the y axis shows the coefficient size of the respective kmer feature. If the machine learning method has learned the implanted motif seeds, the coefficient size is expected to be largest for the kmer features with high overlap to the motif seeds.

Note that to use this report, the following criteria must be met:

  • KmerFrequencyEncoder must be used.

  • One of the following classifiers must be used: RandomForestClassifier, LogisticRegression, SVM, SVC

  • For each label, the implanted motif seeds relevant to that label must be specified

To find the overlap score between kmer features and implanted motif seeds, the two sequences are compared in a sliding window approach, and the maximum overlap is calculated.

Overlap scores between kmer features and implanted motifs are calculated differently based on the Hamming distance that was allowed during implanting.

Without hamming distance:
Seed:     AAA  -> score = 3
Feature: xAAAx
          ^^^

Seed:     AAA  -> score = 0
Feature: xAAxx

With hamming distance:
Seed:     AAA  -> score = 3
Feature: xAAAx
          ^^^

Seed:     AAA  -> score = 2
Feature: xAAxx
          ^^

Furthermore, gap positions in the motif seed are ignored:
Seed:     A/AA  -> score = 3
Feature: xAxAAx
          ^/^^

See Recovering simulated immune signals for more details.
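The scoring examples above can be reproduced with the following sketch. It is a simplified reading of the description rather than immuneML's implementation: it only considers alignments where the seed lies fully inside the k-mer feature, and it expands the ‘/’ gap marker into wildcard positions for each allowed gap size. All names are hypothetical:

```python
GAP = "-"  # wildcard used for gap positions after expanding '/'


def expand_seed(seed, gap_sizes):
    """Replace the '/' marker by each allowed number of gap positions."""
    if "/" not in seed:
        return [seed]
    left, right = seed.split("/")
    return [left + GAP * size + right for size in gap_sizes]


def window_score(seed, window, hamming_distance):
    """Score one alignment; gap positions in the seed are ignored."""
    informative = [(s, w) for s, w in zip(seed, window) if s != GAP]
    matches = sum(s == w for s, w in informative)
    if hamming_distance:
        return matches  # mismatches are tolerated, count matching positions
    return matches if matches == len(informative) else 0


def overlap_score(seed, feature, hamming_distance, gap_sizes=(0,)):
    """Maximum score over all seed variants and sliding-window positions."""
    best = 0
    for variant in expand_seed(seed, gap_sizes):
        for start in range(len(feature) - len(variant) + 1):
            window = feature[start:start + len(variant)]
            best = max(best, window_score(variant, window, hamming_distance))
    return best


print(overlap_score("AAA", "xAAxx", hamming_distance=False))  # 0
print(overlap_score("AAA", "xAAxx", hamming_distance=True))   # 2
print(overlap_score("A/AA", "xAxAAx", hamming_distance=False, gap_sizes=(1,)))  # 3
```

Alignments where the seed only partially overhangs the feature are not handled in this sketch and may be scored differently by the report.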

Example output:

Motif seed recovery report

Specification arguments:

  • implanted_motifs_per_label (dict): a nested dictionary that specifies the motif seeds that were implanted in the given dataset. The first level of keys in this dictionary represents the different labels. The inner dictionary uses the following keys:

    • seeds: a list of motif seeds. The seeds may contain gaps, specified by a ‘/’ symbol.

    • hamming_distance: A boolean value that specifies whether hamming distance was allowed when implanting the motif seeds for a given label. Note that this applies to all seeds for this label.

    • gap_sizes: a list of all the possible gap sizes that were used when implanting a gapped motif seed. When no gapped seeds are used, this value has no effect.

YAML specification:

definitions:
    reports:
        my_motif_report:
            MotifSeedRecovery:
                implanted_motifs_per_label:
                    CD:
                        seeds:
                        - AA/A
                        - AAA
                        hamming_distance: False
                        gap_sizes:
                        - 0
                        - 1
                        - 2
                    T1D:
                        seeds:
                        - CC/C
                        - CCC
                        hamming_distance: True
                        gap_sizes:
                        - 2

ROCCurve

A report that plots the ROC curve for a binary classifier.

YAML specification:

definitions:
    reports:
        my_roc_report: ROCCurve

SequenceAssociationLikelihood

Plots the beta distribution used as a prior for class assignment in ProbabilisticBinaryClassifier. The distribution plotted shows the probability that a sequence is associated with a given class for a label.

YAML specification:

definitions:
    reports:
        my_sequence_assoc_report: SequenceAssociationLikelihood

TCRdistMotifDiscovery

The report for discovering motifs in paired immune receptor data of given specificity based on TCRdist3. The receptors are hierarchically clustered based on the tcrdist distance and then motifs are discovered for each cluster. The report outputs logo plots for the motifs along with the raw data used for plotting in csv format.

For the implementation, the TCRdist3 library was used (source code available here). More details on the functionality used for this report are available here.

Original publications:

Dash P, Fiore-Gartland AJ, Hertz T, et al. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature. 2017; 547(7661):89-93. doi:10.1038/nature22383

Mayer-Blackwell K, Schattgen S, Cohen-Lavi L, et al. TCR meta-clonotypes for biomarker discovery with tcrdist3: quantification of public, HLA-restricted TCR biomarkers of SARS-CoV-2 infection. bioRxiv. Published online December 26, 2020:2020.12.24.424260. doi:10.1101/2020.12.24.424260

Example output:

TCRdist alpha chain logo plot TCRdist beta chain logo plot

Specification arguments:

  • positive_class_name (str): the class value (e.g., epitope) used to select only the receptors that are specific to the given epitope so that only those sequences are used to infer motifs; the reference receptors as required by TCRdist will be the ones from the dataset that have different or no epitope specified in their metadata; if the labels are available only on the epitope level (e.g., label is “AVFDRKSDAK” and classes are True and False), then here it should be specified that only the receptors with value “True” for label “AVFDRKSDAK” should be used; there is no default value for this argument

  • cores (int): number of processes to use for the computation of the distance and motifs

  • min_cluster_size (int): the minimum size of the cluster to discover the motifs for

  • use_reference_sequences (bool): when showing motifs, this parameter defines if reference sequences should be provided as well as a background

YAML specification:

definitions:
    reports:
        my_tcr_dist_report: # user-defined name
            TCRdistMotifDiscovery:
                positive_class_name: True # class name, could also be epitope name, depending on how it's defined in the dataset
                cores: 4
                min_cluster_size: 30
                use_reference_sequences: False

TrainingPerformance

A report that plots the evaluation metrics for the performance of a given machine learning model on the training dataset. The available metrics are accuracy, balanced_accuracy, confusion_matrix, f1_micro, f1_macro, f1_weighted, precision, recall, auc and log_loss (see immuneML.environment.Metric.Metric).

Specification arguments:

  • metrics (list): A list of metrics used to evaluate training performance. See immuneML.environment.Metric.Metric for available options.

YAML specification:

definitions:
    reports:
        my_performance_report:
            TrainingPerformance:
                metrics:
                    - accuracy
                    - balanced_accuracy
                    - confusion_matrix
                    - f1_micro
                    - f1_macro
                    - f1_weighted
                    - precision
                    - recall
                    - auc
                    - log_loss

Train ML model reports

Train ML model reports plot general statistics or export data of multiple models simultaneously when running the TrainMLModel instruction.

In the TrainMLModel instruction, train ML model reports can be specified under ‘reports’. Example:

my_instruction:
    type: TrainMLModel
    reports:
        - my_train_ml_model_report
    # other parameters...

CVFeaturePerformance

This report plots the average training vs test performance with respect to a given encoding parameter, which is set explicitly in the feature argument. It can be used only in combination with the TrainMLModel instruction and can only be specified under ‘reports’.

Specification arguments:

  • feature: name of the encoder parameter w.r.t. which the performance across training and test will be shown. Possible values depend on the encoder on which it is used.

  • is_feature_axis_categorical (bool): if the x-axis of the plot where features are shown should be categorical; alternatively it is automatically determined based on the feature values

YAML specification:

definitions:
    reports:
        report1:
            CVFeaturePerformance:
                feature: p_value_threshold # parameter value of SequenceAbundance encoder
                is_feature_axis_categorical: True # show x-axis as categorical

DiseaseAssociatedSequenceCVOverlap

DiseaseAssociatedSequenceCVOverlap report makes one heatmap per label showing the overlap of disease-associated sequences (or k-mers) produced by the SequenceAbundanceEncoder, CompAIRRSequenceAbundanceEncoder or KmerAbundanceEncoder between folds of cross-validation (either inner or outer loop of the nested CV). The overlap is computed by the following equation:

\[overlap(X,Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)} \times 100\]

For details, see Greiff V, Menzel U, Miho E, et al. Systems Analysis Reveals High Genetic and Antigen-Driven Predetermination of Antibody Repertoires throughout B Cell Development. Cell Reports. 2017;19(7):1467-1478. doi:10.1016/j.celrep.2017.04.054.
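As a worked example of the overlap equation, the sketch below computes the percentage for two sets of sequences. The function name is hypothetical, and returning 0 when one of the sets is empty is an assumption made here to avoid a division by zero:

```python
def overlap_percentage(first, second):
    """Shared items as a percentage of the smaller of the two sets."""
    first, second = set(first), set(second)
    smaller = min(len(first), len(second))
    if smaller == 0:
        return 0.0  # assumption: an empty set has no overlap
    return len(first & second) / smaller * 100


fold_1 = {"CASSL", "CASSF", "CASSP"}
fold_2 = {"CASSL", "CASSF", "CASSQ", "CASSR"}
# 2 shared sequences, smaller set has 3 sequences
print(round(overlap_percentage(fold_1, fold_2), 1))  # 66.7
```

Dividing by the size of the smaller set means that a set fully contained in a larger one yields an overlap of 100.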

Specification arguments:

  • compare_in_selection (bool): whether to compute the overlap over the inner loop of the nested CV - the sequence overlap is shown across CV folds for the model chosen as optimal within that selection

  • compare_in_assessment (bool): whether to compute the overlap over the optimal models in the outer loop of the nested CV

YAML specification:

definitions:
    reports:
        my_overlap_report: DiseaseAssociatedSequenceCVOverlap

MLSettingsPerformance

Report for TrainMLModel instruction: plots the performance for each of the setting combinations as defined under ‘settings’ in the assessment (outer validation) loop.

The performances are grouped by label (horizontal panels), encoding (vertical panels), and ML method (bar color). When multiple data splits are used, the average performance over the data splits is shown with an error bar representing the standard deviation.

This report can be used only with TrainMLModel instruction under ‘reports’.

Specification arguments:

  • single_axis_labels (bool): whether to use single axis labels. Note that using single axis labels makes the figure unsuited for rescaling, as the label position is given in a fixed distance from the axis. By default, single_axis_labels is False, resulting in standard plotly axis labels.

  • x_label_position (float): if single_axis_labels is True, this should be a number specifying the x axis label position relative to the x axis. The default value for x_label_position is -0.1.

  • y_label_position (float): same as x_label_position, but for the y-axis.

YAML specification:

definitions:
    reports:
        my_hp_report: MLSettingsPerformance

ROCCurveSummary

This report plots ROC curves for all trained ML settings ([preprocessing], encoding, ML model) in the outer loop of cross-validation in the TrainMLModel instruction. If there are multiple splits in the outer loop, this report will make one plot per split. This report is defined only for binary classification. If there are multiple labels defined in the instruction, each label has to have two classes to be included in this report.

YAML specification:

definitions:
    reports:
        my_roc_summary_report: ROCCurveSummary

ReferenceSequenceOverlap

The ReferenceSequenceOverlap report compares a list of disease-associated sequences (or k-mers) produced by the SequenceAbundanceEncoder, CompAIRRSequenceAbundanceEncoder or KmerAbundanceEncoder to a list of reference sequences. It outputs a Venn diagram and a list of sequences found both in the encoder and reference list.

The report compares the sequences by their sequence content and the additional comparison_attributes (such as V or J gene), as specified by the user.

Specification arguments:

  • reference_path (str): path to the reference file in csv format which contains one entry per row and has columns that correspond to the attributes listed under comparison_attributes argument

  • comparison_attributes (list): list of attributes to use for comparison; all of them have to be present in the reference file where they should be the names of the columns

  • label (str): name of the label for which the reference sequences/k-mers should be compared to the model; if none, it takes the one label from the instruction; if it is none and multiple labels were specified for the instruction, the report will not be generated

YAML specification:

definitions:
    reports:
        my_reference_overlap_report:
            ReferenceSequenceOverlap:
                reference_path: reference_sequences.csv  # example usage with SequenceAbundanceEncoder or CompAIRRSequenceAbundanceEncoder
                comparison_attributes:
                    - sequence_aa
                    - v_call
                    - j_call
        my_reference_overlap_report_with_kmers:
            ReferenceSequenceOverlap:
                reference_path: reference_kmers.csv  # example usage with KmerAbundanceEncoder
                comparison_attributes:
                    - k-mer

Multi dataset reports

Multi dataset reports are special reports that can be specified when running immuneML with the MultiDatasetBenchmarkTool. See Manuscript use case 1: Robustness assessment for an example.

When running the MultiDatasetBenchmarkTool, multi dataset reports can be specified under ‘benchmark_reports’. Example:

my_instruction:
    type: TrainMLModel
    benchmark_reports:
        - my_benchmark_report
    # other parameters...

DiseaseAssociatedSequenceOverlap

DiseaseAssociatedSequenceOverlap report makes a heatmap showing the overlap of disease-associated sequences (or k-mers) produced by the SequenceAbundanceEncoder, CompAIRRSequenceAbundanceEncoder or KmerAbundanceEncoder between multiple datasets of different sizes (different number of repertoires per dataset).

This report can be used only with MultiDatasetBenchmarkTool.

The overlap is computed by the following equation:

\[overlap(X,Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)} \times 100\]

For details, see: Greiff V, Menzel U, Miho E, et al. Systems Analysis Reveals High Genetic and Antigen-Driven Predetermination of Antibody Repertoires throughout B Cell Development. Cell Reports. 2017;19(7):1467-1478. doi:10.1016/j.celrep.2017.04.054.

YAML specification:

definitions:
    reports:
        my_overlap_report: DiseaseAssociatedSequenceOverlap # report has no parameters

PerformanceOverview

PerformanceOverview report creates an ROC plot and precision-recall plot for optimal trained models on multiple datasets. The labels on the plots are the names of the datasets, so it is advisable to choose user-friendly dataset names, which must still consist only of letters, numbers and the underscore sign.

This report can be used only with MultiDatasetBenchmarkTool as it will plot ROC and PR curve for trained models across datasets. Also, it requires the task to be immune repertoire classification and cannot be used for receptor or sequence classification. Furthermore, it uses predictions on the test dataset to assess the performance and plot the curves. If the parameter refit_optimal_model is set to True, all data will be used to fit the optimal model, so there will not be a test dataset which can be used to assess performance and the report will not be generated.

If datasets have the same number of examples, the baseline PR curve will be plotted as described in this publication: Saito T, Rehmsmeier M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS ONE. 2015;10(3):e0118432. doi:10.1371/journal.pone.0118432

If the datasets have different numbers of examples, the baseline PR curve will not be plotted.

YAML specification:

definitions:
    reports:
        my_performance_report: PerformanceOverview