immuneML data model

immuneML works with adaptive immune receptor (AIR) sequencing data. Internally, the classes and data structures used to represent this data adhere to the AIRR Rearrangement Schema, although data can be imported from a wider variety of common formats.

Most immuneML analyses are based on the amino acid CDR3 junction. Some analyses also use the V and J gene name (‘call’) information. Although importing full-length (V + CDR3 + J) sequences is supported, immuneML has no functionality designed for analysing sequences at that level.
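As a rough illustration of the fields mentioned above, the sketch below reads a tiny AIRR-style TSV and keeps only the amino acid junction and the V and J gene calls. The column names `sequence_id`, `junction_aa`, `v_call` and `j_call` come from the AIRR Rearrangement Schema; the helper `read_minimal_fields` and the example rows are hypothetical, and the code uses only the Python standard library rather than immuneML's own import functionality.

```python
import csv
import io

# Made-up AIRR-style TSV content; the rows are illustrative only.
AIRR_TSV = (
    "sequence_id\tjunction_aa\tv_call\tj_call\n"
    "seq1\tCASSLGTDTQYF\tTRBV7-9*01\tTRBJ2-3*01\n"
    "seq2\tCASSPGQGNEQFF\tTRBV5-1*01\tTRBJ2-1*01\n"
)


def read_minimal_fields(handle):
    """Hypothetical helper: keep only the junction and V/J call columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        {
            "junction_aa": row["junction_aa"],
            "v_call": row["v_call"],
            "j_call": row["j_call"],
        }
        for row in reader
    ]


records = read_minimal_fields(io.StringIO(AIRR_TSV))
for record in records:
    print(record["junction_aa"], record["v_call"], record["j_call"])
```

For a real file, the same helper could be given an open file handle instead of the in-memory string.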

An immuneML dataset consists of a set of ‘examples’. These examples are the

The immuneML data model supports three types of datasets that can be used for analyses:

  1. Repertoire dataset (RepertoireDataset) - each example in the dataset is a large set of AIR sequences which are typically derived from one subject (individual).

  2. Receptor dataset (ReceptorDataset) - each example is one paired-chain receptor consisting of two AIR sequences (e.g., TCR alpha-beta, or IG heavy-light).

  3. Sequence dataset (SequenceDataset) - each example is a single AIR sequence chain.

A single AIR rearrangement is represented by the ReceptorSequence class. A Sequence dataset contains a set of ReceptorSequence objects. A Receptor dataset contains a set of Receptor objects, each of which contains two ReceptorSequence objects. Code shared by SequenceDataset and ReceptorDataset can be found in the ElementDataset class. A Repertoire dataset contains a set of Repertoire objects, each of which contains a set of ReceptorSequence objects.
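The containment relationships described above can be sketched with plain Python dataclasses. These are simplified stand-ins that only borrow the class names from the text. They are not the immuneML classes themselves: every field name (`junction_aa`, `chain_1`, `subject_id`, `examples`, and so on) and the method `example_count` are assumptions made for illustration, as is the choice to model ElementDataset as a common base class.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReceptorSequence:
    """Stand-in for a single AIR rearrangement (one chain)."""
    junction_aa: str
    v_call: str = ""
    j_call: str = ""


@dataclass
class Receptor:
    """Stand-in for a paired-chain receptor: exactly two sequences."""
    chain_1: ReceptorSequence
    chain_2: ReceptorSequence


@dataclass
class Repertoire:
    """Stand-in for the set of sequences derived from one subject."""
    subject_id: str
    sequences: List[ReceptorSequence] = field(default_factory=list)


@dataclass
class ElementDataset:
    """Assumed shared base: a dataset whose examples are single elements."""
    examples: list = field(default_factory=list)

    def example_count(self) -> int:
        return len(self.examples)


@dataclass
class SequenceDataset(ElementDataset):
    """Each example is one ReceptorSequence."""


@dataclass
class ReceptorDataset(ElementDataset):
    """Each example is one Receptor."""


@dataclass
class RepertoireDataset:
    """Each example is one Repertoire."""
    repertoires: List[Repertoire] = field(default_factory=list)


# Illustrative, made-up sequences.
alpha = ReceptorSequence("CAVRDSNYQLIW", "TRAV3*01", "TRAJ33*01")
beta = ReceptorSequence("CASSLGTDTQYF", "TRBV7-9*01", "TRBJ2-3*01")

sequence_dataset = SequenceDataset(examples=[alpha, beta])
receptor_dataset = ReceptorDataset(examples=[Receptor(alpha, beta)])
repertoire_dataset = RepertoireDataset(
    repertoires=[Repertoire("subject_1", [alpha, beta])]
)
```

In this sketch the number of examples differs by dataset type even though the underlying sequences are the same: two for the Sequence dataset, one for the Receptor dataset, and one for the Repertoire dataset.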