Manuscript use case 1: Reproduction of a published study inside immuneML
==========================================================================

.. meta::
   :twitter:card: summary
   :twitter:site: @immuneml
   :twitter:title: immuneML use case: reproduction of a published study inside immuneML
   :twitter:description: See how the study by Emerson et al. 2017 on CMV-status prediction from TCRbeta repertoires was reproduced in the immuneML manuscript.
   :twitter:image: https://docs.immuneml.uio.no/_images/emerson_reproduction.png

In this use case, we show how the study by Emerson and colleagues on CMV status prediction from TCRbeta repertoires (`Emerson et al. 2017 `_) can be reproduced within immuneML. Additionally, we test the approach on datasets subsampled from the original study (including randomly chosen subsets of 400, 200, 100 and 50 subjects) and estimate the performance of the approach when fewer examples are available.

The dataset was downloaded from Adaptive Biotechnologies' `website `_. Out of the 786 subjects (cohort 1: 666, cohort 2: 120), we removed 103 subjects from cohort 1: 1 with missing repertoire data, 25 with unknown CMV status, 3 with negative template counts for some of the sequences, and the rest with no template count information.

The complete collection of original files used in this use case can be found in the NIRD research data archive (DOI: `10.11582/2021.00008 `_). The archive also includes the metadata files for cohorts 1 and 2 with the list of subjects included in this use case.

Note that the YAML specifications stored in the archive were compatible with immuneML version 1.0.1. This documentation page contains the YAML specifications for equivalent analyses with the latest immuneML version.

Reproduction of the CMV status prediction study
-------------------------------------------------

To reproduce the analysis, we used immuneML as a command line tool. In the YAML specification, we define the dataset to be used, the data representation (encoding) called `SequenceAbundance`, the statistical model called `ProbabilisticBinaryClassifier`, and a set of reports. The encoding represents each repertoire by two numbers: the number of disease-associated sequences, as determined by Fisher's exact test, and the total number of sequences in the repertoire. The p-value threshold used to determine which sequences are disease-associated can be set as a hyperparameter of the encoding. As in the original publication, we try out different p-value thresholds through cross-validation.

The statistical model `ProbabilisticBinaryClassifier` relies on the `SequenceAbundance` encoding and fits two sets of parameters ({:math:`\alpha_0`, :math:`\beta_0`} and {:math:`\alpha_1`, :math:`\beta_1`}) describing beta-distributed priors for CMV-negative and CMV-positive subjects, respectively. These parameters are then used to compute the log-posterior odds ratio for class assignment of new subjects.
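For intuition, the model can be sketched as follows (a summary of the beta-binomial formulation from Emerson et al. 2017, written in our own notation rather than immuneML's API): if a repertoire contains :math:`k` disease-associated sequences out of :math:`n` total sequences, the class-conditional likelihood and the resulting log-posterior odds ratio are

.. math::

    P(k \mid n, c) = \binom{n}{k} \frac{B(k + \alpha_c,\, n - k + \beta_c)}{B(\alpha_c, \beta_c)}, \qquad c \in \{0, 1\},

.. math::

    F(k, n) = \log \frac{P(c = 1 \mid k, n)}{P(c = 0 \mid k, n)} = \log \frac{P(k \mid n, c = 1)}{P(k \mid n, c = 0)} + \log \frac{P(c = 1)}{P(c = 0)},

where :math:`B` denotes the beta function; a new subject is predicted CMV-positive when :math:`F(k, n) > 0`.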
.. note::

   When used on large datasets, the `SequenceAbundance` encoder might be slow. If you want to reproduce the analysis faster and you are using immuneML version 2.2.0 or later, use the :ref:`CompAIRRSequenceAbundance` encoder instead. The `p_value_threshold` parameter is the same, and by default the analysis is performed using the amino acid sequence and the V and J genes. The use of V and J genes can be turned off by setting `ignore_genes` to True (it is False by default).

To find the optimal p-value threshold, we used 10-fold cross-validation on cohort 1 and chose the threshold minimizing the cross-entropy loss (also called logarithmic loss). We then tested the performance of the optimal model (the optimal p-value threshold and the classifier fitted on the resulting data representation) on cohort 2, as was done in the original study.

The full YAML specification:

.. collapse:: emerson_reproduction.yaml

    .. code-block:: yaml

        definitions:
          datasets:
            cmv2017: # the full dataset including cohort 1 and cohort 2
              format: ImmunoSEQRearrangement
              params:
                path: ../emerson_adaptive/ # dataset files as provided by Adaptive Biotechnologies, unzipped into the emerson_adaptive folder
                metadata_file: cmv_metadata.csv # metadata file including subjects for cohorts 1 and 2
                result_path: imported_data/ # where to store imported data [immuneML implementation detail]
          encodings: # multiple encodings are defined that differ only in the p-value threshold used to determine CMV-associated sequences
            enc01:
              SequenceAbundance:
                comparison_attributes: # how we define a sequence: as a combination of CDR3, V and J gene (as in the original publication)
                  - sequence_aas
                  - v_genes
                  - j_genes
                p_value_threshold: 0.01
                sequence_batch_size: 1000000 # immuneML implementation detail
                repertoire_batch_size: 150 # immuneML implementation detail
            enc001: # second encoding with a different p-value threshold
              SequenceAbundance:
                comparison_attributes:
                  - sequence_aas
                  - v_genes
                  - j_genes
                p_value_threshold: 0.001
                sequence_batch_size: 1000000
                repertoire_batch_size: 150
            enc0001:
              SequenceAbundance:
                comparison_attributes:
                  - sequence_aas
                  - v_genes
                  - j_genes
                p_value_threshold: 0.0001
                sequence_batch_size: 1000000
                repertoire_batch_size: 150
            enc00001:
              SequenceAbundance:
                comparison_attributes:
                  - sequence_aas
                  - v_genes
                  - j_genes
                p_value_threshold: 0.00001
                sequence_batch_size: 1000000
                repertoire_batch_size: 150
            enc000001:
              SequenceAbundance:
                comparison_attributes:
                  - sequence_aas
                  - v_genes
                  - j_genes
                p_value_threshold: 0.000001
                sequence_batch_size: 1000000
                repertoire_batch_size: 150
          ml_methods: # here we define the classifier
            ml:
              ProbabilisticBinaryClassifier:
                max_iterations: 50000 # how many iterations are allowed while estimating the distribution parameters
                update_rate: 0.01
          reports:
            enc_data: DesignMatrixExporter # export encoded data in csv format
            sequence_association_likelihood: SequenceAssociationLikelihood # plot sequence association likelihood with estimated parameters
            feature_performance_plot: # show how performance changes depending on the p-value threshold
              CVFeaturePerformance:
                feature: p_value_threshold
            sequence_overlap: # show if CMV-associated sequences overlap between CV folds
              DiseaseAssociatedSequenceCVOverlap:
                compare_in_selection: True
                compare_in_assessment: True
            relevant_sequences: RelevantSequenceExporter # export CMV-associated sequences
            emerson_reference_overlap: # check how many sequences overlap with the original study
              ReferenceSequenceOverlap:
                reference_path: emerson_reference.csv
                comparison_attributes:
                  - sequence_aas
                  - v_genes
                  - j_genes
        instructions:
          cmv_study_reproduction: # defines the analysis to be performed
            reports: [feature_performance_plot, sequence_overlap, emerson_reference_overlap] # reports to run after nested CV is finished
            assessment: # outer loop of nested cross-validation: split manually into training (cohort 1) and test (cohort 2)
              split_strategy: manual
              split_count: 1
              manual_config:
                train_metadata_path: cmv_train_metadata.csv # cohort 1
                test_metadata_path: cmv_test_metadata.csv # cohort 2
              reports:
                encoding: [enc_data, relevant_sequences]
                models: [sequence_association_likelihood]
            selection: # inner loop of nested cross-validation: 10-fold cross-validation to choose the best p-value threshold
              split_strategy: k_fold
              split_count: 10
              reports:
                encoding: [enc_data, relevant_sequences]
            number_of_processes: 32
            dataset: cmv2017
            labels: # which labels to use from the metadata, here: CMV
              - CMV:
                  positive_class: True # the positive class is called "True" - implementation detail used for encodings and some reports
            optimization_metric: log_loss # which metric to use to choose the best encoding
            metrics: # additional metrics to compute
              - balanced_accuracy
              - auc
            strategy: GridSearch # try out all settings combinations (all encodings here)
            type: TrainMLModel # the type of the instruction
            refit_optimal_model: True # whether to refit the model on the whole dataset when all training and testing is finished
            settings: # combinations of encodings and classifiers to try out; here, just listing all encodings with different p-value thresholds
              - encoding: enc01
                ml_method: ml
              - encoding: enc001
                ml_method: ml
              - encoding: enc0001
                ml_method: ml
              - encoding: enc00001
                ml_method: ml
              - encoding: enc000001
                ml_method: ml
            output:
              format: HTML # output the results as HTML
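The settings above vary only the `p_value_threshold` of the `SequenceAbundance` encoding, i.e., the cutoff for the per-sequence Fisher's exact test described earlier. As a rough illustration of what such a test looks like, here is a minimal, hypothetical sketch in Python (an illustration of the idea only, not immuneML's actual implementation):

.. code-block:: python

    from scipy.stats import fisher_exact

    def is_cmv_associated(pos_with, pos_without, neg_with, neg_without,
                          p_value_threshold=0.001):
        """One-sided Fisher's exact test for a single TCRbeta sequence
        (CDR3 amino acids + V gene + J gene): rows of the 2x2 table are
        CMV-positive / CMV-negative repertoires, columns count repertoires
        containing / lacking the sequence."""
        contingency_table = [[pos_with, pos_without],
                             [neg_with, neg_without]]
        # 'greater' tests for over-representation in CMV-positive repertoires
        _, p_value = fisher_exact(contingency_table, alternative="greater")
        return p_value < p_value_threshold

    # e.g., a sequence found in 50 of 300 CMV-positive repertoires but only
    # 5 of 385 CMV-negative repertoires is flagged as CMV-associated:
    print(is_cmv_associated(50, 250, 5, 380))  # True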
Robustness assessment
--------------------------

After reproducing the study, we also assessed the robustness of the method on smaller datasets. To do that, we first constructed smaller datasets and then reproduced the analysis on each of them.

Constructing subsampled datasets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To construct smaller datasets of 400, 200, 100 and 50 subjects randomly chosen from both cohorts, we used the Subsampling instruction with the following YAML specification.

.. collapse:: subsample_datasets.yaml

    .. code-block:: yaml

        definitions:
          datasets:
            cmv2017: # we import the full dataset with 683 subjects, as previously imported in the immuneML-optimized format
              format: ImmuneML
              params:
                path: imported_data/cmv2017.yaml # note that in older versions of immuneML (<3), immuneML datasets had the .iml_dataset extension instead of .yaml
        instructions:
          subsampling_inst: # user-defined name of the instruction
            type: Subsampling # which instruction to execute
            dataset: cmv2017 # original dataset to be subsampled
            subsampled_dataset_sizes: # how large the subsampled datasets should be; one dataset will be created for each list item
              - 400
              - 200
              - 100
              - 50
            dataset_export_formats: # in which formats to export the subsampled datasets
              - ImmuneML
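If the specification above is stored in a file named `subsample_datasets.yaml`, it can be run like any other immuneML command line analysis, for example with `immune-ml subsample_datasets.yaml subsampled_datasets/` (the result folder name is arbitrary; a folder named `subsampled_datasets/` matches the dataset paths used in the robustness specification below).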
Running the analysis on subsampled datasets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To analyze and compare performance on datasets of different sizes, we use the MultiDatasetBenchmarkTool. It can be run from the command line by providing the `tool` parameter in addition to the YAML specification and the result folder:

.. code-block:: console

    immune-ml robustness_assessment_specs.yaml robustness_assessment_result/ --tool MultiDatasetBenchmarkTool

The YAML specification is mostly the same as when only the TrainMLModel instruction is used, except that:

- the `dataset` parameter is now called `datasets` and accepts a list of datasets on which the TrainMLModel instruction will be performed (format and functionality are the same as described under the reproduction above), and
- there is one additional parameter called `benchmark_reports`, listing reports that are executed after all datasets have been processed in order to compare performance across them.

The YAML specification is given below.

.. collapse:: robustness_assessment.yaml

    .. code-block:: yaml

        definitions:
          datasets: # datasets for assessing robustness
            cmv2017_400: # with 400 repertoires
              format: ImmuneML
              params:
                path: subsampled_datasets/subsampling_inst/cmv2017_400_subsampled_1/exported/immuneml/cmv2017_400_subsampled_1.yaml # note that in older versions of immuneML (<3), immuneML datasets had the .iml_dataset extension instead of .yaml
            cmv2017_200: # with 200 repertoires
              format: ImmuneML
              params:
                path: subsampled_datasets/subsampling_inst/cmv2017_200_subsampled_2/exported/immuneml/cmv2017_200_subsampled_2.yaml
            cmv2017_100: # with 100 repertoires
              format: ImmuneML
              params:
                path: subsampled_datasets/subsampling_inst/cmv2017_100_subsampled_3/exported/immuneml/cmv2017_100_subsampled_3.yaml
            cmv2017_50: # with 50 repertoires
              format: ImmuneML
              params:
                path: subsampled_datasets/subsampling_inst/cmv2017_50_subsampled_4/exported/immuneml/cmv2017_50_subsampled_4.yaml
          encodings: # encodings as in Emerson et al. 2017, with different p-value thresholds to discover disease-associated combinations of amino acid sequence and V and J genes
            enc01:
              SequenceAbundance:
                comparison_attributes:
                  - sequence_aas
                  - v_genes
                  - j_genes
                p_value_threshold: 0.01
                sequence_batch_size: 1000000 # implementation detail not affecting the algorithm, only the speed
                repertoire_batch_size: 150 # implementation detail not affecting the algorithm, only the speed
            enc001:
              SequenceAbundance:
                comparison_attributes:
                  - sequence_aas
                  - v_genes
                  - j_genes
                p_value_threshold: 0.001
                sequence_batch_size: 1000000
                repertoire_batch_size: 150
            enc0001:
              SequenceAbundance:
                comparison_attributes:
                  - sequence_aas
                  - v_genes
                  - j_genes
                p_value_threshold: 0.0001
                sequence_batch_size: 1000000
                repertoire_batch_size: 150
            enc00001:
              SequenceAbundance:
                comparison_attributes:
                  - sequence_aas
                  - v_genes
                  - j_genes
                p_value_threshold: 0.00001
                sequence_batch_size: 1000000
                repertoire_batch_size: 150
            enc000001:
              SequenceAbundance:
                comparison_attributes:
                  - sequence_aas
                  - v_genes
                  - j_genes
                p_value_threshold: 0.000001
                sequence_batch_size: 1000000
                repertoire_batch_size: 150
          ml_methods:
            ml:
              ProbabilisticBinaryClassifier: # classifier as described in Emerson et al. 2017
                max_iterations: 50000
                update_rate: 0.01
          reports:
            enc_data: DesignMatrixExporter # export encoded data as a csv file, along with labels and other info
            sequence_association_likelihood: SequenceAssociationLikelihood
            feature_performance_plot: # show how performance changes on average for different p-value thresholds for determining disease-associated sequences
              CVFeaturePerformance:
                feature: p_value_threshold
                is_feature_axis_categorical: True
            sequence_overlap: # check how stable the estimates of disease-associated sequences are across folds
              DiseaseAssociatedSequenceCVOverlap:
                compare_in_selection: True # compare disease-associated sequences across CV folds only for the chosen optimal model
                compare_in_assessment: True
            relevant_sequences: RelevantSequenceExporter # export disease-associated sequences
            emerson_reference_overlap: # check how much the discovered disease-associated sequences overlap with the results published in the paper
              ReferenceSequenceOverlap:
                reference_path: emerson_reference.csv
                comparison_attributes:
                  - sequence_aas
                  - v_genes
                  - j_genes
            sequence_overlap_across_datasets: DiseaseAssociatedSequenceOverlap # check how much disease-associated sequences overlap across datasets of different sizes
            performance_report: PerformanceOverview # show AUROC and AUPRC across datasets
        instructions:
          cmv_study_reproduction: # same format as above, except for the additional benchmark_reports parameter listing reports run when all instructions have finished
            reports: [feature_performance_plot, sequence_overlap, emerson_reference_overlap] # reports to run after nested CV is finished
            benchmark_reports: [sequence_overlap_across_datasets, performance_report] # reports to run after all datasets have been benchmarked
            assessment: # outer loop of nested 5-fold CV
              split_strategy: k_fold
              split_count: 5
              reports:
                encoding: [enc_data, relevant_sequences]
                models: [sequence_association_likelihood]
            selection: # inner loop of nested 5-fold CV
              split_strategy: k_fold
              split_count: 5
              reports:
                encoding: [enc_data, relevant_sequences]
            number_of_processes: 32
            datasets: # instead of one dataset, there are now 4, since we use MultiDatasetBenchmarkTool
              - cmv2017_400
              - cmv2017_200
              - cmv2017_100
              - cmv2017_50
            labels:
              - CMV:
                  positive_class: True
            metrics:
              - balanced_accuracy
              - auc
            optimization_metric: log_loss
            strategy: GridSearch # try out all combinations of encoding and ml_method listed under settings
            type: TrainMLModel # the type of the instruction, executed once per dataset
            refit_optimal_model: False
            settings: # combinations to try out to choose the best p-value threshold
              - encoding: enc01
                ml_method: ml
              - encoding: enc001
                ml_method: ml
              - encoding: enc0001
                ml_method: ml
              - encoding: enc00001
                ml_method: ml
            output:
              format: HTML

Results
-------------------------------------------------

The results of reproducing the study by Emerson et al. are shown in the following figure:

.. figure:: ../_static/images/usecases/emerson_reproduction.png
   :alt: Emerson reproduction results
   :width: 80%

   Reproducing the CMV status prediction study by Emerson et al. A. The overlap of the 164 disease-associated TCRβ sequences (V-TCRβaa-J) determined in the original study by Emerson et al., labeled “reference”, with those determined by the optimal model as reproduced here with a p-value threshold of 0.001 (labeled “model”). B. The percentage overlap of disease-associated TCRβ sequences for the optimal model with the p-value threshold of 0.001 between different data splits in 10-fold cross-validation (between 50% and 65% overlap). C. The probability that a TCRβ sequence is CMV-associated, following a beta distribution estimated separately for CMV-positive and CMV-negative subjects, which is then used for CMV status prediction of new subjects. D. Area under the ROC curve (AUROC) across p-value thresholds on training data (average AUROC over 10 cross-validation splits) and test data (AUROC on cohort 2).

Furthermore, the results of the robustness assessment (reproducing the study with a lower number of repertoires) are shown here:

.. figure:: ../_static/images/usecases/emerson_robustness.png
   :alt: Emerson robustness results
   :width: 70%

   Decreasing the number of repertoires (400, 200, 100, and 50) leads to decreased prediction accuracy (AUROC: 0.86–0.46).