MultiDatasetBenchmarkTool(specification_path: pathlib.Path, result_path: pathlib.Path, **kwargs)
MultiDatasetBenchmarkTool trains models using nested cross-validation (CV) to determine the optimal model for each of multiple datasets. Internally, it runs the TrainMLModel instruction on each of the listed datasets, performing nested CV on each, accumulates the results of these runs, and then generates reports on the cumulative results.
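The per-dataset nested CV described above can be illustrated with a minimal, library-free sketch. This is not the tool's implementation: the function names (`k_fold_indices`, `nested_cv`, `benchmark`), the toy classifiers, and the toy datasets are all hypothetical. To keep it short, it assumes a single random outer split and plain accuracy as the optimization metric.

```python
import random
from statistics import mean


def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) index lists for k-fold CV over range(n)."""
    idx = list(range(n))
    for fold in range(k):
        val = idx[fold::k]
        val_set = set(val)
        yield [i for i in idx if i not in val_set], val


def accuracy(model, rows):
    """Fraction of (x, y) rows for which model(x) == y."""
    return mean(1.0 if model(x) == y else 0.0 for x, y in rows)


def fit_majority(train):
    """Toy 'ML method': always predict the majority class of the training rows."""
    majority = int(sum(y for _, y in train) * 2 >= len(train))
    return lambda x: majority


def fit_threshold(train):
    """Toy 'ML method': threshold halfway between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    if not pos or not neg:  # only one class seen: fall back to majority
        return fit_majority(train)
    t = (mean(pos) + mean(neg)) / 2
    return lambda x: int(x >= t)


def nested_cv(rows, settings, k=5, training_percentage=0.7, seed=0):
    """Pick the best setting in the inner loop, then score it on the outer test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * training_percentage)
    train, test = shuffled[:cut], shuffled[cut:]  # outer loop: one random split

    inner_scores = {}
    for name, fit in settings.items():  # inner loop: k-fold CV per setting
        fold_scores = []
        for tr_idx, val_idx in k_fold_indices(len(train), k):
            model = fit([train[i] for i in tr_idx])
            fold_scores.append(accuracy(model, [train[i] for i in val_idx]))
        inner_scores[name] = mean(fold_scores)

    best = max(inner_scores, key=inner_scores.get)
    final_model = settings[best](train)  # refit the selected setting on all training rows
    return {"optimal_setting": best, "test_score": accuracy(final_model, test)}


def benchmark(datasets, settings):
    """Run the same nested CV separately on each dataset and accumulate the results."""
    return {name: nested_cv(rows, settings) for name, rows in datasets.items()}


settings = {"threshold": fit_threshold, "majority": fit_majority}
datasets = {
    "d1": [(x, int(x >= 10)) for x in range(20)],
    "d2": [(x, int(x >= 25)) for x in range(50)],
}
results = benchmark(datasets, settings)
for name, result in results.items():
    print(name, result)
```

The accumulated `results` dictionary plays the role of the cumulative results on which benchmark-level reports would then be generated.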
definitions: # everything under definitions can be defined in a standard way
  datasets:
    d1: ...
    d2: ...
    d3: ...
  ml_methods:
    ml1: ...
    ml2: ...
    ml3: ...
  encodings:
    enc1: ...
    enc2: ...
  reports:
    r1: ...
    r2: ...
instructions: # there can be only one instruction
  benchmark_instruction:
    type: TrainMLModel
    benchmark_reports: [r1, r2] # list of reports that will be executed on the results for all datasets
    datasets: [d1, d2, d3] # the same optimization will be performed separately for each dataset
    settings: # a list of combinations of preprocessing, encoding and ml_method to optimize over
      - encoding: enc1 # mandatory field
        ml_method: ml1 # mandatory field
      - encoding: enc2
        ml_method: ml2
      - encoding: enc2
        ml_method: ml3
    assessment: # outer loop of nested CV
      split_strategy: random # perform Monte Carlo CV (randomly split the data into train and test)
      split_count: 1 # how many train/test datasets to generate
      training_percentage: 0.7 # what percentage of the original data should be used for the training set
    selection: # inner loop of nested CV
      split_strategy: k_fold # perform k-fold CV
      split_count: 5 # how many folds to create; together, these two parameters mean: do 5-fold CV
    labels: # list of labels to optimize the classifier for, as given in the metadata for the dataset
      - celiac
    strategy: GridSearch # how to choose which combinations from settings to test (GridSearch means test all)
    metrics: # list of metrics to compute for all settings, but these do not influence the choice of optimal model
      - accuracy
      - auc
    reports: # reports to execute on the dataset (before CV, splitting, encoding etc.)
      - rep1
    number_of_processes: 4 # number of parallel processes to create (could speed up the computation)
    optimization_metric: balanced_accuracy # the metric to use for choosing the optimal model and during training
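To make the "one TrainMLModel run per dataset" idea concrete, the sketch below expands a benchmark specification (already loaded into a Python dictionary) into one single-dataset specification per listed dataset. It is an illustration under stated assumptions, not the tool's actual code: the helper name `split_per_dataset` is hypothetical, and the single-dataset key `dataset` in the resulting instruction is assumed rather than taken from the text above.

```python
import copy


def split_per_dataset(spec):
    """Return {dataset_key: single-dataset spec} for a benchmark specification."""
    instructions = spec["instructions"]
    if len(instructions) != 1:
        raise ValueError("the specification must contain exactly one instruction")
    (name, instruction), = instructions.items()

    # encoding and ml_method are mandatory in every setting
    for setting in instruction["settings"]:
        missing = {"encoding", "ml_method"} - set(setting)
        if missing:
            raise ValueError(f"setting is missing mandatory fields: {sorted(missing)}")

    per_dataset = {}
    for dataset_key in instruction["datasets"]:
        single = copy.deepcopy(spec)
        single_instruction = single["instructions"][name]
        del single_instruction["datasets"]
        # benchmark-level reports apply to the accumulated results, not to one run
        single_instruction.pop("benchmark_reports", None)
        single_instruction["dataset"] = dataset_key  # assumed key name
        single["definitions"]["datasets"] = {
            dataset_key: spec["definitions"]["datasets"][dataset_key]
        }
        per_dataset[dataset_key] = single
    return per_dataset


# Hypothetical, heavily abbreviated specification mirroring the structure above.
example_spec = {
    "definitions": {"datasets": {"d1": {}, "d2": {}, "d3": {}}},
    "instructions": {
        "benchmark_instruction": {
            "type": "TrainMLModel",
            "benchmark_reports": ["r1", "r2"],
            "datasets": ["d1", "d2", "d3"],
            "settings": [
                {"encoding": "enc1", "ml_method": "ml1"},
                {"encoding": "enc2", "ml_method": "ml2"},
            ],
        }
    },
}

per_dataset = split_per_dataset(example_spec)
for key, single_spec in per_dataset.items():
    print(key, single_spec["instructions"]["benchmark_instruction"]["dataset"])
```

Copying the whole specification for each dataset keeps the settings, split strategies, labels, and metrics identical across runs, which is what makes the per-dataset results comparable.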