immuneML.api.aggregated_runs package

Submodules

immuneML.api.aggregated_runs.MultiDatasetBenchmarkTool module

class immuneML.api.aggregated_runs.MultiDatasetBenchmarkTool.MultiDatasetBenchmarkTool(specification_path: pathlib.Path, result_path: pathlib.Path, **kwargs)[source]

Bases: object

MultiDatasetBenchmarkTool trains models using nested cross-validation (CV) to determine the optimal model separately on each of multiple datasets. Internally, it runs the TrainMLModel instruction on each of the listed datasets, performs nested CV on each, accumulates the results of these runs, and then generates reports on the cumulative results.
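For reference, a minimal sketch of invoking the tool programmatically, assuming a specification file like the one shown below has already been written to disk (the file and directory names here are illustrative, not part of the API):

from pathlib import Path

from immuneML.api.aggregated_runs.MultiDatasetBenchmarkTool import MultiDatasetBenchmarkTool

# "specs.yaml" and "benchmark_results/" are placeholder paths
tool = MultiDatasetBenchmarkTool(specification_path=Path("specs.yaml"),
                                 result_path=Path("benchmark_results/"))
tool.run()  # one TrainMLModel run per dataset, then the benchmark reports on the combined results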

YAML specification:

definitions: # everything under definitions can be defined in a standard way
    datasets:
        d1: ...
        d2: ...
        d3: ...
    ml_methods:
        ml1: ...
        ml2: ...
        ml3: ...
    encodings:
        enc1: ...
        enc2: ...
    reports:
        r1: ...
        r2: ...
        rep1: ... # data report referenced under the instruction's reports field below
instructions: # there can be only one instruction
    benchmark_instruction:
        type: TrainMLModel
        benchmark_reports: [r1, r2] # list of reports that will be executed on the results for all datasets
        datasets: [d1, d2, d3] # the same optimization will be performed separately for each dataset
        settings: # a list of combinations of preprocessing, encoding and ml_method to optimize over
        - encoding: enc1 # mandatory field
          ml_method: ml1 # mandatory field
        - encoding: enc2
          ml_method: ml2
        - encoding: enc2
          ml_method: ml3
        assessment: # outer loop of nested CV (see the illustrative sketch after this specification)
            split_strategy: random # perform Monte Carlo CV (randomly split the data into train and test)
            split_count: 1 # how many train/test datasets to generate
            training_percentage: 0.7 # what percentage of the original data should be used for the training set
        selection: # inner loop of nested CV
            split_strategy: k_fold # perform k-fold CV
            split_count: 5 # how many folds to create; together, these two parameters mean: perform 5-fold CV
        labels: # list of labels to optimize the classifier for, as given in the metadata for the dataset
            - celiac
        strategy: GridSearch # how to choose which combinations from settings to test (GridSearch means test all)
        metrics: # list of metrics to compute for all settings, but these do not influence the choice of optimal model
            - accuracy
            - auc
        reports: # reports to execute on the full dataset (before CV splitting, encoding etc.)
            - rep1
        number_of_processes: 4 # number of parallel processes to create (could speed up the computation)
        optimization_metric: balanced_accuracy # the metric to use for choosing the optimal model and during training
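To make the assessment/selection settings concrete, here is an illustrative sketch of the equivalent nested splitting scheme written with scikit-learn (an analogy, not immuneML's implementation): the outer Monte Carlo split corresponds to split_strategy: random with training_percentage: 0.7, and the inner 5-fold split corresponds to split_strategy: k_fold with split_count: 5.

import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(100).reshape(50, 2)  # toy data standing in for an encoded dataset

# assessment (outer loop): split_strategy: random, split_count: 1, training_percentage: 0.7
outer = ShuffleSplit(n_splits=1, train_size=0.7, random_state=0)
for train_idx, test_idx in outer.split(X):
    # selection (inner loop): split_strategy: k_fold, split_count: 5
    inner = KFold(n_splits=5)
    for fit_idx, val_idx in inner.split(X[train_idx]):
        pass  # fit each settings combination on fit_idx, evaluate on val_idx
    # the best combination by optimization_metric is then retrained on train_idx
    # and assessed once on test_idx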
run()[source]

Module contents