YAML specification of the LigoSim instruction for introducing immune signals

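The YAML definition consists of three components: motif, signal and simulation definitions. Immune events, also described below, are not a separate component; they are set within the simulation definitions.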

  • motifs: defined either by a (gapped) k-mer and the ways it might vary (see: SeedMotif) or by a position weight matrix (see: PWM).

  • signals (see: Signal): defined as a set of motifs combined with AIR-specific information, such as the V or J gene or the IMGT position of the motif in the CDR3 sequence. An immune signal corresponds to, e.g., antigen specificity.

  • immune_events: sets of immune signals together with their proportions in an AIRR. They correspond to, e.g., diseases, vaccinations or allergies. In practice, we define which signals are present in a set of examples and how often, and the immune event is assigned as a label to that example set.

  • simulations: define how the signals will be combined and simulated in the receptors or repertoires.
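The definition sections refer to each other by name: signals list motif names, and simulation items list signal names. The following is a minimal sketch of that structure using plain Python dictionaries. It is not LIgO's own parser; the helper name `find_dangling_references` and the dictionary layout are assumptions made for illustration only.

```python
def find_dangling_references(motifs, signals, sim_items):
    """Return descriptions of names that are referenced but never defined.

    Hypothetical layout (an assumption for this sketch):
      motifs:    dict of motif name -> motif parameters
      signals:   dict of signal name -> {"motifs": [motif names]}
      sim_items: dict of item name -> {"signals": {signal name: proportion}}
    """
    missing = []
    # Every motif listed by a signal must exist in the motifs section.
    for signal_name, signal in signals.items():
        for motif_name in signal.get("motifs", []):
            if motif_name not in motifs:
                missing.append(f"signal '{signal_name}' -> motif '{motif_name}'")
    # Every signal listed by a simulation item must exist in the signals section.
    for item_name, item in sim_items.items():
        for signal_name in item.get("signals", {}):
            if signal_name not in signals:
                missing.append(f"item '{item_name}' -> signal '{signal_name}'")
    return missing


motifs = {"motif1": {"seed": "AA"}, "motif2": {"seed": "GG"}}
signals = {"signal1": {"motifs": ["motif1"]}, "signal2": {"motifs": ["motif2"]}}
sim_items = {
    "AIRR1": {"signals": {"signal1": 0.3, "signal2": 0.3}},
    "AIRR2": {"signals": {"signal1": 0.5, "signal2": 0.5}},
}

print(find_dangling_references(motifs, signals, sim_items))  # prints []
```

An empty list means every referenced motif and signal is defined; a misspelled name would show up as one entry per broken reference.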

A simulation (as defined in Simulation config in the specification) groups examples (receptors or repertoires, depending on the level of simulation desired by the user) with the same characteristics into simulation items (Simulation config item). Each simulation item precisely defines how its set of examples should be simulated.

Each simulation item defines the following for a set of examples:

  • which signals should exist in that set of examples (and, for repertoire-level simulation, in what percentage of the sequences of each individual repertoire)

  • which generative model creates the background AIR sequences that are used as a starting point for the simulation. Currently supported generative models for this purpose are OLGA (which can generate sequences from either one of the default OLGA models or from a custom model) and ExperimentalImport (which allows any set of sequences to be imported and used as background).

  • which immune events apply: immune events are defined at this level and have the same value for all examples within the given set.
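For a repertoire-level item, the signal proportions translate into sequence counts per repertoire. The sketch below is illustrative arithmetic only: the function name `expected_signal_counts` is hypothetical, it assumes that each signal occupies its own share of the sequences, and it reports unrounded expected counts because how LIgO handles fractional counts is not stated here.

```python
def expected_signal_counts(receptors_in_repertoire_count, signals):
    """Expected (unrounded) number of sequences per repertoire for each signal,
    plus the remaining background sequences that carry no signal.

    signals: dict of signal name -> proportion of sequences in each repertoire.
    Assumption for this sketch: signals do not overlap within a repertoire.
    """
    total_proportion = sum(signals.values())
    if total_proportion > 1 + 1e-9:  # small tolerance for floating-point sums
        raise ValueError("signal proportions add up to more than 1")
    counts = {
        name: receptors_in_repertoire_count * proportion
        for name, proportion in signals.items()
    }
    counts["background"] = receptors_in_repertoire_count * max(0.0, 1 - total_proportion)
    return counts


# 6 receptor sequences per repertoire, half with signal1 and half with signal2
print(expected_signal_counts(6, {"signal1": 0.5, "signal2": 0.5}))
# prints {'signal1': 3.0, 'signal2': 3.0, 'background': 0.0}
```

With small repertoires, proportions such as 0.3 of 6 sequences give fractional expected counts (1.8), so larger values of `receptors_in_repertoire_count` make the requested proportions easier to realize.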

See also the tutorial about recovering simulated immune signals.

An example of a simulation with disease-associated signals is given below.

definitions:
  motifs:
    motif1:
      seed: AA
    motif2:
      seed: GG
  signals:
    signal1:
      motifs: [motif1]
    signal2:
      motifs: [motif2]
  simulations:
    sim1:
      is_repertoire: true # whether to simulate at repertoire level (true) or receptor level (false); here: repertoire level
      paired: false # whether to simulate paired chain data or not
      sequence_type: amino_acid
      simulation_strategy: Implanting # how to simulate the signals
      remove_seqs_with_signals: true # remove signal-specific AIRs from the background
      sim_items:
        sim_item: # group of AIRs with the same parameters
          AIRR1: # first set of repertoires
            immune_events: # all repertoires in this set will have these values for immune events
              ievent1: True
              ievent2: False
            signals:
              signal1: 0.3 # in each repertoire 30% of sequences will have signal1
              signal2: 0.3 # in each repertoire another 30% of sequences will have signal2
            number_of_examples: 10 # simulate 10 repertoires
            receptors_in_repertoire_count: 6 # how many receptor sequences should be in each repertoire
            generative_model: # how to generate background AIRs
              chain: heavy
              default_model_name: humanIGH # use default model
              type: OLGA # use OLGA for background simulation
          AIRR2: # another set of repertoires, but with different parameters
            immune_events:
              ievent1: False
              ievent2: True
            signals: {signal1: 0.5, signal2: 0.5}
            number_of_examples: 10
            receptors_in_repertoire_count: 6
            generative_model:
              chain: heavy
              default_model_name: humanIGH
              model_path: null # if there was a custom model to use, path to the folder should be given here
              type: OLGA
instructions:
  my_sim_inst: # instruction name chosen by the user (placeholder)
    export_p_gens: false
    max_iterations: 100
    number_of_processes: 4
    sequence_batch_size: 1000
    simulation: sim1
    type: LigoSim