How to perform clustering analysis

In this tutorial, we will generate a synthetic dataset and perform clustering analysis on it.

Step 1: Creating a dataset

First, we will create a synthetic dataset using the LIgO tool from immuneML. LIgO generates immune receptor sequences using OLGA and simulates an immune event by implanting a list of k-mers. We will create a dataset of 100 sequences: 50 will contain signal1 (that is, they will contain either AAA or GGG) and 50 will not contain the signal.
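
To illustrate what implanting a signal means, here is a minimal Python sketch. It is not LIgO's implementation: the uniformly random background generator is only a stand-in for OLGA, and the helper names (`random_background`, `implant`, `contains_signal`) and the fixed sequence length of 14 are assumptions made for this example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SEEDS = ["AAA", "GGG"]  # seeds of motif1 and motif2 from the specification


def contains_signal(seq, seeds=SEEDS):
    """True if the sequence contains at least one of the motif seeds."""
    return any(seed in seq for seed in seeds)


def random_background(rng, length=14):
    """Hypothetical stand-in for OLGA: a uniformly random sequence.

    Sequences that already contain a seed are rejected, which mirrors
    the intent of remove_seqs_with_signals: true.
    """
    while True:
        seq = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        if not contains_signal(seq):
            return seq


def implant(seq, seed, rng):
    """Overwrite a random window of seq with the seed, keeping the length."""
    start = rng.randrange(len(seq) - len(seed) + 1)
    return seq[:start] + seed + seq[start + len(seed):]


rng = random.Random(0)
with_signal = [implant(random_background(rng), rng.choice(SEEDS), rng)
               for _ in range(50)]
without_signal = [random_background(rng) for _ in range(50)]

print(with_signal[:3])
print(without_signal[:3])
```

The two lists correspond to the two groups of 50 sequences described above; in the real simulation the background sequences come from the OLGA humanTRB model rather than from uniform sampling.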

Here is the YAML configuration file:

ligo_complete_specification.yaml

definitions:
  motifs:
    motif1:
      seed: AAA
    motif2:
      seed: GGG
  signals:
    signal1:
      motifs: [motif1, motif2]
  simulations:
    sim1:
      is_repertoire: false # the simulation is on the sequence level (not repertoire level)
      paired: false # we are simulating single-chain sequences
      sequence_type: amino_acid
      simulation_strategy: Implanting # how to simulate the signals
      remove_seqs_with_signals: true # remove signal-specific AIRs from the background
      sim_items:
        sim_item: # group of AIRs with the same parameters
          AIRR1:
            signals:
              signal1: 1 # all sequences in this group will have signal1
            number_of_examples: 50 # simulate 50 sequences
            generative_model: # how to generate background AIRs
              default_model_name: humanTRB # use default model
              type: OLGA # use OLGA for background simulation
          AIRR2: # another set of sequences, but with different parameters
            signals: {} # no signals here
            number_of_examples: 50
            generative_model:
              default_model_name: humanTRB
              type: OLGA
instructions:
  my_sim_inst:
    export_p_gens: false
    max_iterations: 100
    number_of_processes: 4
    sequence_batch_size: 1000
    simulation: sim1
    type: LigoSim

To run this analysis from the command line with immuneML installed, run:

immune-ml ligo_complete_specification.yaml ./simulated_dataset/

Step 2: Clustering analysis

To perform the clustering, we will use KmerFrequencyEncoding, PCA, and KMeans from immuneML and scikit-learn. We will split the data into discovery and validation sets: the discovery set is used to fit the clustering model, which is then used to predict the clusters of the validation set. This is of special interest if:

  • we want to see how well the model generalizes to new data (even in this unsupervised setting),

  • we want to compare different clustering settings (e.g. different number of clusters or different ways of encoding the data).
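
As a rough illustration of the first two ingredients, the sketch below encodes sequences by k-mer frequencies and splits them randomly into discovery and validation sets. It does not use immuneML's API: the function names, the choice of k=3 with overlapping k-mers, the relative-frequency normalization, and the example sequences are assumptions made for this example.

```python
import random
from collections import Counter


def kmer_frequencies(seq, k=3):
    """Relative frequencies of the overlapping k-mers in a sequence."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return {kmer: n / len(kmers) for kmer, n in counts.items()}


def random_split(items, training_percentage=0.5, seed=0):
    """Shuffle a copy of the items and cut it into discovery/validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * training_percentage)
    return shuffled[:cut], shuffled[cut:]


# Made-up example sequences, two of them carrying AAA or GGG
sequences = ["CASSAAAEQYF", "CASSLGGGTQYF", "CASSPDRNTEAFF", "CAWSVQETQYF"]

discovery, validation = random_split(sequences, training_percentage=0.5)
encoded_discovery = {seq: kmer_frequencies(seq) for seq in discovery}

for seq, features in encoded_discovery.items():
    print(seq, features)
```

Each sequence becomes a sparse vector indexed by k-mer; a clustering method such as KMeans then operates on these vectors, optionally after a dimensionality reduction step such as PCA.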

Following the paper by Ullmann and colleagues (2023), immuneML supports two types of validation: method-based and result-based. In method-based validation, we perform the same preprocessing+encoding+clustering on discovery and validation sets and compare the results. In result-based validation, we fit a supervised classifier to the clusters determined on the discovery dataset and use it to predict cluster assignments for the validation data, which shows whether the clustering result itself carries over to the validation data.
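
The difference between the two validation types can be sketched in plain Python on toy 2-D data. This is only an illustration under several assumptions, not how immuneML implements validation: the small `kmeans` function, the nearest-centroid rule used as a stand-in for the supervised classifier, the `two_blobs` toy data, and the final comparison of the two labelings by adjusted Rand index are all choices made for this example.

```python
import random
from collections import Counter
from math import comb


def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda c: dist2(point, centroids[c]))


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(n_iter):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, [nearest(p, centroids) for p in points]


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two labelings of the same items."""
    sum_ab = sum(comb(n, 2) for n in Counter(zip(a, b)).values())
    sum_a = sum(comb(n, 2) for n in Counter(a).values())
    sum_b = sum(comb(n, 2) for n in Counter(b).values())
    expected = sum_a * sum_b / comb(len(a), 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_ab - expected) / (max_index - expected)


def two_blobs(rng, n_per_blob):
    """Toy data: two well-separated Gaussian blobs in 2-D."""
    centers = [(0.0, 0.0), (10.0, 10.0)]
    return [(cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
            for cx, cy in centers for _ in range(n_per_blob)]


rng = random.Random(1)
discovery = two_blobs(rng, 25)
validation = two_blobs(rng, 25)

# Fit the clustering on the discovery set
disc_centroids, disc_labels = kmeans(discovery, k=2)

# Result-based: carry the discovery clusters over to the validation set
# with a classifier (here simply the nearest discovery centroid)
result_based_labels = [nearest(p, disc_centroids) for p in validation]

# Method-based: rerun the same clustering method on the validation set
_, method_based_labels = kmeans(validation, k=2)

agreement = adjusted_rand_index(result_based_labels, method_based_labels)
print("agreement between the two validation labelings:", agreement)
```

Cluster indices are arbitrary, so the two labelings are compared with a permutation-invariant score rather than by direct equality.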

Additionally, immuneML supports different metrics for clustering evaluation: internal metrics, which evaluate the quality of the clustering using only the data and the cluster assignments, and external metrics, which compare the clustering to external information such as known labels.
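
To make the distinction concrete, the following sketch computes one internal metric (the mean silhouette coefficient) and one simple external comparison (a contingency table of cluster assignments against a known label). It is a plain-Python illustration on made-up points, not the scikit-learn or immuneML implementation, and the function names are chosen for this example.

```python
from collections import Counter
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient; assumes at least two clusters."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Distances from p to every other point, grouped by cluster
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lab, []).append(dist(p, q))
        if own not in by_cluster:  # singleton cluster: score is 0
            scores.append(0.0)
            continue
        a = sum(by_cluster[own]) / len(by_cluster[own])  # own-cluster mean
        b = min(sum(d) / len(d)                          # nearest other cluster
                for lab, d in by_cluster.items() if lab != own)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def contingency(cluster_labels, external_labels):
    """Count how often each (cluster, external label) pair occurs."""
    return Counter(zip(cluster_labels, external_labels))


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
clusters = [0, 0, 1, 1]
has_signal = [True, True, False, False]  # hypothetical external label

# Internal: uses only the points and the cluster assignments
print("silhouette:", silhouette_score(points, clusters))

# External: compares the cluster assignments with known labels
for (cluster, label), count in sorted(contingency(clusters, has_signal).items()):
    print(f"cluster={cluster} signal={label} count={count}")
```

The silhouette needs no labels and can therefore be computed on any dataset, while the contingency table (like adjusted_rand_score and adjusted_mutual_info_score) requires labels such as signal1 to be available.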

In this tutorial, we will use the following settings:

clustering_analysis.yaml

definitions:
  datasets:
    d1:
      format: AIRR
      params:
        path: simulated_dataset/simulated_dataset.tsv # paths to files from the previous step
        dataset_file: simulated_dataset/simulated_dataset.yaml
  encodings:
    kmer: KmerFrequency # we encode the sequences using k-mer frequencies
  ml_methods:
    kmeans2: # we try out kmeans with k=2
      KMeans:
        n_clusters: 2
    kmeans3: # and k=3
      KMeans:
        n_clusters: 3
    pca:
      PCA:
        n_components: 4
  reports:
    rep1: # this is how we will visualize the data
      DimensionalityReduction:
        dim_red_method:
          PCA:
            n_components: 2
        label: signal1 # we will color the graph by the signal we implanted
    cluster_vis: # this will visualize clustering results
      ClusteringVisualization: # plot a scatter plot of dim-reduced data and color the points by cluster assignments
        dim_red_method:
          KernelPCA: # here we can use any dimensionality reduction method supported in immuneML (see docs)
            n_components: 2
            kernel: rbf
    stability: # for each split, assess how well the clusters from discovery data correspond to validation data (see docs)
      ClusteringStabilityReport:
        metric: adjusted_rand_score
    external_labels_summary: # show heatmap of how cluster assignments correspond to external labels
      ExternalLabelClusterSummary:
        external_labels: [signal1]
instructions:
  clustering_instruction_with_ligo_data:
    clustering_settings: # what combinations of encoding+dim_reduction+clustering we want to try
    - encoding: kmer
      method: kmeans2
    - dim_reduction: pca
      encoding: kmer
      method: kmeans3
    dataset: d1
    labels: # here we list external labels we want to compare against if available
    - signal1
    metrics: # list metrics we want to use, both internal, and external (if labels are available)
    - adjusted_rand_score
    - adjusted_mutual_info_score
    - silhouette_score
    - calinski_harabasz_score
    number_of_processes: 4
    reports:
    - rep1
    - stability
    - external_labels_summary
    - cluster_vis
    split_config: # we want to repeat the analysis on different splits of the data to assess stability of the results
      split_count: 2
      split_strategy: random # the splits will be random
      training_percentage: 0.5 # we will use 50% of the data for discovery and 50% for validation
    type: Clustering
    validation_type: # the type of validation we want to perform [here we do both]
    - result_based
    - method_based

To run the clustering analysis from the command line with immuneML installed, run:

immune-ml clustering_analysis.yaml ./clustering_results/

This will generate a report with the clustering results in the specified directory. To explore the results, open the index.html file in the output directory.