How to simulate immune events into an existing AIRR dataset in Galaxy
The Galaxy tool 'Simulate immune events into an existing repertoire dataset' implants short disease-associated motifs into an existing repertoire dataset. A dataset with simulated immune signals can then be used to benchmark different ML methods. Any repertoire dataset (experimental or simulated) can serve as the starting point for an immune event simulation, as long as it contains amino acid sequences.
If you instead want to simulate a synthetic dataset from scratch, start by simulating an AIRR dataset in Galaxy.
An example Galaxy history showing how to use this tool can be found here.
Creating the YAML specification
The Galaxy tool takes in an immuneML dataset from the history and a YAML specification.
In the YAML specification, the Simulation instruction must be used. Additionally, the elements motifs, signals and simulations must be defined in the definitions section of the YAML specification.
A motif describes a short amino acid sequence (motif seed) which may be implanted into an immune receptor sequence with small modifications.
A signal is a group of motifs, and may for example represent a disease. A simulation consists of implantings, each of which represents one or more signals. This can be used to simulate confounding factors (signals occurring together), or to ensure that signals do not overlap between repertoires.
These elements are described in more detail in YAML specification of the Simulation instruction for introducing immune signals.
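To make the idea of "a seed implanted with small modifications" concrete, here is a minimal Python sketch of how one motif instance could be drawn from a gapped seed. It is an illustration only, not immuneML's implementation: the function `instantiate_motif` is hypothetical, its parameter names simply mirror the keys used in the YAML specification in this section, and the use of '.' to mark gap positions is an assumption made for display.

```python
import random


def instantiate_motif(seed, min_gap, max_gap, hamming_probs,
                      position_weights, alphabet_weights, rng):
    """Hypothetical helper: draw one concrete instance of a gapped seed."""
    chars = list(seed)

    # Draw how many seed positions to substitute (the Hamming distance).
    n_mod = rng.choices(list(hamming_probs),
                        weights=list(hamming_probs.values()))[0]

    # Only non-gap positions with a positive weight can be substituted.
    candidates = {i: w for i, w in position_weights.items()
                  if w > 0 and chars[i] != "/"}
    for _ in range(min(n_mod, len(candidates))):
        pos = rng.choices(list(candidates),
                          weights=list(candidates.values()))[0]
        del candidates[pos]  # each position is substituted at most once
        # Simplification: the drawn letter may equal the original one.
        chars[pos] = rng.choices(list(alphabet_weights),
                                 weights=list(alphabet_weights.values()))[0]

    # Expand '/' into a gap of random length; '.' marks a gap position
    # where the residues of the host sequence would stay unchanged.
    gap_length = rng.randint(min_gap, max_gap)
    return "".join(chars).replace("/", "." * gap_length)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
for _ in range(3):
    print(instantiate_motif(
        seed="AA/A", min_gap=1, max_gap=2,
        hamming_probs={0: 0.7, 1: 0.3},
        position_weights={0: 1, 1: 0, 3: 0},
        alphabet_weights={"A": 0.2, "C": 0.2, "D": 0.4, "E": 0.2},
        rng=rng,
    ))
```

With these settings only seed position 0 can change, so every instance has the shape X, A, one or two gap positions, A, where X is one of A, C, D or E.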
This figure shows how the different concepts in a Simulation relate to each other:
When writing an analysis specification for Galaxy, it can be assumed that all selected files are present in the current working directory. A path to an additional file thus consists only of the filename. A complete YAML specification for immune event simulation in Galaxy is shown here:
definitions:
  datasets:
    dataset: # user-defined dataset name
      format: ImmuneML # the default format used by the 'Create dataset' galaxy tool is ImmuneML
      params:
        path: dataset.iml_dataset # specify the dataset name, the default name used by
                                  # the 'Create dataset' galaxy tool is dataset.iml_dataset
  motifs:
    my_simple_motif: # a simple motif without gaps or hamming distance
      seed: AAA
      instantiation: GappedKmer
    my_complex_motif: # complex motif containing a gap + hamming distance
      seed: AA/A # '/' denotes gap position if present, if not, there's no gap
      instantiation:
        GappedKmer:
          min_gap: 1
          max_gap: 2
          hamming_distance_probabilities: # probabilities for each number of
            0: 0.7                        # modifications to the seed
            1: 0.3
          position_weights: # probabilities for modification per position
            0: 1
            1: 0 # note that index 2, the position of the gap,
            3: 0 # is excluded from position_weights
          alphabet_weights: # probabilities for using each amino acid in
            A: 0.2          # a hamming distance modification
            C: 0.2
            D: 0.4
            E: 0.2
  signals:
    my_signal:
      motifs: # specify one or more motifs defined above
        - my_simple_motif
        - my_complex_motif
      implanting: HealthySequence
      sequence_position_weights: # the relative likelihood of implanting
        109: 1                   # the motif at each IMGT starting position
        110: 2
        111: 5
        112: 1
  simulations:
    my_simulation:
      my_implanting:
        signals:
          - my_signal # specify one or more signals defined above
        dataset_implanting_rate: 0.5 # 50% of the repertoires will contain my_signal
        repertoire_implanting_rate: 0.25 # 25% of the sequences in a repertoire will contain one motif
instructions:
  my_simulation_instruction:
    type: Simulation
    dataset: dataset # specify the dataset defined above
    simulation: my_simulation
    export_formats:
      # only one format can be specified here and the dataset in this format will be
      # available as a Galaxy collection afterwards
      - ImmuneML # Can be AIRR (human-readable) or ImmuneML (recommended for further Galaxy-analysis)
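The two implanting rates and the sequence_position_weights in the specification above are easiest to read as simple arithmetic. The Python sketch below works through the numbers for a hypothetical dataset of 100 repertoires with 1000 sequences each; those sizes, the helper names `expected_counts` and `normalise`, and the use of rounding are assumptions for illustration and are not taken from immuneML.

```python
def expected_counts(n_repertoires, n_sequences,
                    dataset_implanting_rate, repertoire_implanting_rate):
    """Hypothetical helper: how many repertoires and sequences get a signal."""
    # Rounding to the nearest integer is an assumption of this sketch.
    repertoires_with_signal = round(n_repertoires * dataset_implanting_rate)
    sequences_with_motif = round(n_sequences * repertoire_implanting_rate)
    return repertoires_with_signal, sequences_with_motif


def normalise(weights):
    """Turn relative weights into probabilities that sum to 1."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}


# Assumed dataset size: 100 repertoires of 1000 sequences each.
print(expected_counts(100, 1000, 0.5, 0.25))  # (50, 250)

# Relative IMGT position weights from the specification above.
position_probabilities = normalise({109: 1, 110: 2, 111: 5, 112: 1})
print(position_probabilities[111])  # 5/9, about 0.556
```

Under these assumptions, 50 repertoires receive my_signal, each with about 250 motif-carrying sequences, and a little over half of the implanted motifs start at IMGT position 111.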
Tool output
This Galaxy tool will produce the following history elements:
Summary: immune signal simulation: an HTML page describing general characteristics of the dataset, including the name of the dataset (this name should be specified when importing the dataset later in immuneML), the dataset type and size, and the details of the simulated immune signals.
Archive: immune signal simulation: a .zip file containing the complete output folder as it was produced by immuneML. This folder contains the output of the Simulation instruction including all raw data files. Furthermore, the folder contains the complete YAML specification file for the immuneML run, the HTML output and a log file.
immuneML dataset with simulated immune events: a Galaxy collection containing all relevant files for the new dataset.