Integration use case: Performing analysis on immuneSIM-generated repertoires

This use case shows how to use immuneML in conjunction with immuneSIM (Weber et al. 2020), a tunable multi-feature simulation tool for B- and T-cell receptor repertoires designed for immunoinformatics benchmarking.

The combined use of these tools enables the user to generate datasets with known signals as a baseline for ML classification. The user can either feed already existing immuneSIM repertoires into the workflow, or use a single cross-platform metadata file both to simulate the datasets and to provide the label information that immuneML needs as input.

For reference, detailed immuneSIM documentation, including an installation guide, can be found here. Additionally, example files for this particular workflow are available on the immuneSIM GitHub. Note: This use case is based on immuneSIM v0.9.0.

figure_immuneSIML

Combined workflow with an immuneSIM and immuneML compatible metadata file

The most efficient way to combine immuneSIM and immuneML is to use a single metadata file both for the simulation in immuneSIM and for training the model in immuneML. Below is an example of a metadata table that contains the parameters for simulating immuneSIM repertoires and can also be used as the metadata file in an immuneML workflow.

metadata_immuneSIML

This metadata file (metadata_full_sim.csv) is used by the following immuneSIM script to generate a set of repertoires. The metadata file can then be fed into immuneML together with the resulting simulated repertoires as described in the How to import data into immuneML section.
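Because the same file drives both tools, it can help to check its layout before starting a simulation. The following Python snippet is a minimal sketch, not part of either tool: the helper name `check_metadata` is hypothetical, and the column list is inferred from the columns that the R script and the YAML specification on this page refer to (`filename`, `nb_seqs`, `species`, `receptor`, `v_drop`, `ins_del`, `motif`, and the label column `label`).

```python
import csv

# Columns read by the R script (simulation) plus the label column used by immuneML.
# Assumption: the label column is called "label", as in the YAML example on this page.
REQUIRED_COLUMNS = {"filename", "label", "nb_seqs", "species", "receptor",
                    "v_drop", "ins_del", "motif"}


def check_metadata(metadata_path):
    """Return the metadata rows after checking required columns and unique filenames."""
    with open(metadata_path, newline="") as handle:
        reader = csv.DictReader(handle)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"metadata is missing columns: {sorted(missing)}")
        rows = list(reader)
    filenames = [row["filename"] for row in rows]
    if len(filenames) != len(set(filenames)):
        raise ValueError("filenames in metadata must be unique")
    return rows
```

For the folder layout used on this page, a call would look like `check_metadata("./immuneML_Sim/metadata_full_sim.csv")`.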

The following R script generates the simulated repertoires using immuneSIM. The script can also be downloaded here: immuneSIM_for_ML.R.

## ImmuneML use case (https://immuneml.uio.no/)
# This script simulates immuneSIM repertoires based on an immuneML compatible metadata file.
# requires immuneSIM 0.9.0 (github: https://github.com/GreiffLab/immuneSIM)

library(immuneSIM)

PATH <- "./immuneML_Sim"

#load metadata file
metadata <- read.delim(file.path(PATH,"metadata_full_sim.csv"),sep=",")


#Define motif for cases where motif==TRUE. Here two motifs are inserted with a probability of 0.5 at a fixed position.
motif <- data.frame(aa=c("AA","FF"),nt=c("gccgcc","tttttt"),freq=c(0.5,0.5))
fixed_pos <- 4


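#make sure the output folder exists (write.table does not create it)
dir.create(file.path(PATH, "data"), showWarnings = FALSE, recursive = TRUE)

#for each line in metadata simulate a repertoire and write out.
for(i in seq_len(nrow(metadata))){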

  #simulate repertoire
  curr_df <- immuneSIM(number_of_seqs = metadata$nb_seqs[i],
                       vdj_list = list_germline_genes_allele_01,
                       species = metadata$species[i],
                       receptor = substr(metadata$receptor[i],1,2),
                       chain = substr(metadata$receptor[i],3,3),
                       insertions_and_deletion_lengths = insertions_and_deletion_lengths_df,
                       user_defined_alpha = 2,
                       name_repertoire = metadata$filename[i],
                       length_distribution_rand = length_dist_simulation,
                       random = FALSE,
                       shm.mode = 'none',
                       shm.prob = 15/350,
                       vdj_noise = 0,
                       vdj_dropout = c(V=metadata$v_drop[i],D=0,J=0),
                       ins_del_dropout = metadata$ins_del[i],
                       equal_cc = FALSE,
                       freq_update_time = round(0.5*metadata$nb_seqs[i]),
                       max_cdr3_length = 100,
                       min_cdr3_length = 6,
                       verbose = TRUE,
                       airr_compliant = TRUE)

  #after simulation implant motifs
  if(metadata$motif[i]==TRUE){
    curr_df <- motif_implantation(curr_df, motif,fixed_pos)
  }

  #write repertoire to file
  write.table(curr_df,file=file.path(PATH, "data", metadata$filename[i]),sep="\t",quote=FALSE,row.names=FALSE)
}
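After the simulation, a quick sanity check can confirm that repertoires flagged with motif==TRUE carry the implanted motifs more often than the others. The Python sketch below assumes that the tab-separated output has an amino acid junction column named `junction_aa` (the AIRR standard name; check the header of your files), and the helper `motif_fraction` is a hypothetical name. Short motifs such as "AA" also occur by chance, so the result is only meaningful when compared against a repertoire simulated without motifs.

```python
import csv


def motif_fraction(repertoire_path, motifs=("AA", "FF"), column="junction_aa"):
    """Fraction of sequences whose amino acid junction contains any of the motifs."""
    total = 0
    hits = 0
    with open(repertoire_path, newline="") as handle:
        # The R script writes tab-separated files (sep="\t").
        for row in csv.DictReader(handle, delimiter="\t"):
            total += 1
            if any(motif in row[column] for motif in motifs):
                hits += 1
    return hits / total if total else 0.0
```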

Using existing immuneSIM repertoires with immuneML

As immuneSIM repertoires use AIRR-compliant column naming, they can be fed directly into any immuneML workflow using the AIRR importer. For this, a metadata file has to be created that indexes the existing immuneSIM repertoires and indicates the classification-relevant labels.
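Such a metadata file can be written by hand or generated with a few lines of code. The Python sketch below is one possible approach under stated assumptions: the minimal layout is a `filename` column plus one label column (named `label` here to match the specification on this page), the repertoire files end in `.tsv`, and both the helper `write_metadata` and the `labels` mapping are hypothetical and supplied by the user.

```python
import csv
from pathlib import Path


def write_metadata(data_dir, labels, metadata_path, pattern="*.tsv"):
    """Write a metadata CSV listing each repertoire file and its label.

    labels maps a file name to its label value (user-supplied).
    """
    files = sorted(p.name for p in Path(data_dir).glob(pattern))
    unlabeled = [name for name in files if name not in labels]
    if unlabeled:
        raise ValueError(f"no label given for: {unlabeled}")
    with open(metadata_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "label"])
        for name in files:
            writer.writerow([name, labels[name]])
    return files
```

Raising an error for unlabeled files, rather than skipping them, keeps the metadata file and the data folder in sync.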

If the user chooses to write their own .yaml specification file, declaring format: AIRR in the definitions section is sufficient to ensure compatibility with immuneSIM datasets.

Here we show an example analysis that uses immuneSIM-generated repertoires to train a logistic regression model and examine its coefficients.

definitions:
  datasets:
    immuneSIM_dataset: # user-defined dataset name: here describing the immuneSIM dataset
      format: AIRR
      params:
        path: ./immuneML_Sim/data/         # path to the folder containing the repertoire files generated by immuneSIM
        metadata_file: ./immuneML_Sim/metadata_full_sim.csv

  encodings:
    my_kmer_frequency: # user-defined encoding name
      KmerFrequency:   # encoding type
        k: 3           # encoding parameters

  ml_methods:
    my_logistic_regression: LogisticRegression # user-defined ML model name: ML model type (no user-specified parameters)

  reports:
    my_coefficients: Coefficients # user-defined report name: report type (no user-specified parameters)

instructions:
  my_training_instruction: # user-defined instruction name
    type: TrainMLModel

    dataset: immuneSIM_dataset # use the same dataset name as in definitions
    labels:
    - label    # use a label available in the metadata file

    settings: # which combinations of ML settings to run
    - encoding: my_kmer_frequency
      ml_method: my_logistic_regression

    assessment: # parameters in the assessment (outer) cross-validation loop
      reports:  # plot the coefficients for the trained model
        models:
        - my_coefficients
      split_strategy: random   # how to split the data - here: split randomly
      split_count: 1           # how many times (here once - just to train and test)
      training_percentage: 0.7 # use 70% of the data for training

    selection: # parameters in the selection (inner) cross-validation loop
      split_strategy: random
      split_count: 1
      training_percentage: 1 # use all data for training

    optimization_metric: balanced_accuracy # the metric to optimize during nested cross-validation when comparing multiple models
    metrics: # other metrics to compute for reference
    - auc # area under the ROC curve
    - precision
    - recall
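To give an intuition for the encoding used above, the Python sketch below computes relative frequencies of overlapping 3-mers over a set of sequences. It is a simplified illustration only, not immuneML's KmerFrequency implementation, which offers further options such as normalization and scaling; the function name `kmer_frequencies` is hypothetical.

```python
from collections import Counter


def kmer_frequencies(sequences, k=3):
    """Relative frequencies of overlapping k-mers pooled over all sequences."""
    counts = Counter()
    for seq in sequences:
        # Slide a window of length k over the sequence, one position at a time.
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: count / total for kmer, count in counts.items()}
```

Each repertoire then becomes one feature vector with one entry per k-mer, and the Coefficients report shows the weight the logistic regression assigned to each of these features.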