immuneML.hyperparameter_optimization.config package
Submodules
immuneML.hyperparameter_optimization.config.LeaveOneOutConfig module
immuneML.hyperparameter_optimization.config.ManualSplitConfig module
immuneML.hyperparameter_optimization.config.ReportConfig module
- class immuneML.hyperparameter_optimization.config.ReportConfig.ReportConfig(data_splits: dict = None, models: dict = None, data: dict = None, encoding: dict = None)
Bases: object
A class encapsulating the lists of reports that can be executed while performing nested cross-validation (CV) using the TrainMLModel instruction. All arguments are optional.
- Parameters:
data (dict) – Data reports to be executed on the whole dataset before it is split into training/test or training/validation datasets
data_splits (dict) – Data reports to be executed after the data has been split into training and test (assessment CV loop) or training and validation (selection CV loop) datasets before they are encoded
models (dict) – ML model reports to be executed on all trained classifiers
encoding (dict) – Encoding reports to be executed on each of the encoded training/test datasets or training/validation datasets
YAML specification:
    # as a part of a TrainMLModel instruction, defining the outer (assessment) loop of nested cross-validation:
    assessment: # outer loop of nested CV
        split_strategy: random # perform Monte Carlo CV (randomly split the data into train and test)
        split_count: 5 # how many train/test datasets to generate
        training_percentage: 0.7 # what percentage of the original data should be used for the training set
        reports: # reports to execute on training/test datasets, encoded datasets and trained ML methods
            data_splits: # list of reports to execute on training/test datasets (before they are preprocessed and encoded)
                - my_data_split_report
            encoding: # list of reports to execute on encoded training/test datasets
                - my_encoding_report

    # as a part of a TrainMLModel instruction, defining the inner (selection) loop of nested cross-validation:
    selection: # inner loop of nested CV
        split_strategy: random # perform Monte Carlo CV (randomly split the data into train and validation)
        split_count: 5 # how many train/validation datasets to generate
        training_percentage: 0.7 # what percentage of the original data should be used for the training set
        reports: # reports to execute on training/validation datasets, encoded datasets and trained ML methods
            data_splits: # list of reports to execute on training/validation datasets (before they are preprocessed and encoded)
                - my_data_split_report
            encoding: # list of reports to execute on encoded training/validation datasets
                - my_encoding_report
            models:
                - my_ml_model_report
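The four report groups described above correspond to the keys allowed under reports. As a minimal sketch (this is not immuneML's own parsing code), the snippet below checks a reports mapping, written as a plain Python dict, against those four keys. The helper name check_report_keys is hypothetical, and the report names are the placeholders from the specification.

```python
# Hypothetical helper, not part of immuneML: validates the keys of a
# `reports` mapping against the four ReportConfig arguments.
ALLOWED_REPORT_KEYS = {"data", "data_splits", "models", "encoding"}


def check_report_keys(reports: dict) -> dict:
    """Return a copy with every allowed key present, rejecting unknown keys."""
    unknown = set(reports) - ALLOWED_REPORT_KEYS
    if unknown:
        raise ValueError(f"Unknown report groups: {sorted(unknown)}")
    # All arguments are optional, so a missing group becomes an empty list here
    # (lists mirror the YAML form; this default is an assumption of the sketch).
    return {key: list(reports.get(key, [])) for key in sorted(ALLOWED_REPORT_KEYS)}


selection_reports = check_report_keys({
    "data_splits": ["my_data_split_report"],
    "encoding": ["my_encoding_report"],
    "models": ["my_ml_model_report"],
})
print(selection_reports)
```

A misspelled group such as "model" raises ValueError in this sketch, which catches a typo that could otherwise go unnoticed in a specification.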
immuneML.hyperparameter_optimization.config.SplitConfig module
- class immuneML.hyperparameter_optimization.config.SplitConfig.SplitConfig(split_strategy: SplitType, split_count: int, training_percentage: float = None, reports: ReportConfig = None, manual_config: ManualSplitConfig = None, leave_one_out_config: LeaveOneOutConfig = None)
Bases: object
SplitConfig describes how to split the data for cross-validation. It supports the following split strategies:
loocv (leave-one-out cross-validation)
k_fold (k-fold cross-validation)
stratified_k_fold (stratified k-fold cross-validation that can be used when immuneML is used for single-label classification; see the scikit-learn StratifiedKFold documentation at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html for more details on how this is implemented)
random (Monte Carlo cross-validation - randomly splitting the dataset into training and test datasets)
manual (train and test dataset are explicitly specified by providing metadata files for the two datasets)
leave_one_out_stratification (leave-one-out CV where one refers to a specific parameter, e.g. if subject is known in a receptor dataset, it is possible to have leave-subject-out CV - currently only available for receptor and sequence datasets).
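To make the difference between k_fold and random concrete, here is a minimal pure-Python sketch that generates index-based train/test pairs. It is an illustration only and does not reproduce immuneML's splitting code; the function names, the seed argument and the rounding of the training set size are assumptions of the sketch, while split_count and training_percentage are the SplitConfig parameters.

```python
import random


def k_fold_splits(n_examples: int, split_count: int):
    """Yield (train, test) index lists; each example is tested exactly once."""
    indices = list(range(n_examples))
    for fold in range(split_count):
        test = indices[fold::split_count]  # every split_count-th example
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


def random_splits(n_examples: int, split_count: int,
                  training_percentage: float, seed: int = 0):
    """Yield split_count independent random (train, test) index lists."""
    rng = random.Random(seed)
    n_train = int(round(n_examples * training_percentage))
    for _ in range(split_count):
        # A fresh shuffle per split, so test sets may overlap across splits.
        shuffled = rng.sample(range(n_examples), n_examples)
        yield sorted(shuffled[:n_train]), sorted(shuffled[n_train:])


folds = list(k_fold_splits(10, split_count=5))
monte_carlo = list(random_splits(10, split_count=5, training_percentage=0.7))
```

In this sketch, leave-one-out CV is the special case k_fold_splits(n, split_count=n). The random strategy differs in that an example can land in several test sets, or in none.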
- Parameters:
split_strategy (SplitType) – one of the types of cross-validation listed above (LOOCV, K_FOLD, STRATIFIED_K_FOLD, MANUAL, LEAVE_ONE_OUT_STRATIFICATION or RANDOM)
split_count (int) – if split_strategy is K_FOLD, this defines how many splits to make (K); if split_strategy is RANDOM, it defines how many random splits to make, resulting in split_count training/test dataset pairs; if split_strategy is LOOCV, MANUAL or LEAVE_ONE_OUT_STRATIFICATION, split_count does not need to be specified.
training_percentage (float) – if split_strategy is RANDOM, this defines which portion of the original dataset to use for creating the training dataset; for other values of split_strategy, this parameter is not used.
reports (ReportConfig) – defines which reports to execute on which datasets or settings. See ReportConfig for more details.
manual_config (ManualSplitConfig) – if split strategy is MANUAL, the paths to the metadata files should be given here. Repertoire datasets are matched using the "subject_id" field, so it has to be present in both the original dataset and the metadata files provided here. For receptor and sequence datasets, the "example_id" field needs to be provided in the metadata files and it will be mapped to either 'sequence_identifiers' or 'receptor_identifiers' in the original dataset. If split strategy is anything other than MANUAL, this field has no effect and can be omitted.
leave_one_out_config (LeaveOneOutConfig) – if split strategy is LEAVE_ONE_OUT_STRATIFICATION, this config describes which parameter to use for stratification, producing a list of train/test dataset combinations in which the test set contains examples with only one value of the specified parameter. The leave_one_out_config argument accepts two inputs: parameter, which is the name of the parameter to use for stratification, and min_count, which defines the minimum number of examples that can be present in the test dataset. This way of generating train and test datasets is only supported for receptor and sequence datasets so far. If split strategy is anything else, this field can be omitted.
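The parameter descriptions above imply that each split strategy needs a different subset of fields. The sketch below encodes that reading as a lookup table and reports which fields are missing from a specification given as a plain dict. It is a hypothetical check, not immuneML's validation logic: treating training_percentage as required for random is an assumption, and stratified_k_fold is left without requirements because the text above does not say which fields it needs.

```python
# Hypothetical lookup table derived from the parameter descriptions;
# immuneML's own validation may differ.
REQUIRED_FIELDS = {
    "loocv": [],
    "k_fold": ["split_count"],
    "stratified_k_fold": [],  # not specified in the text above
    "random": ["split_count", "training_percentage"],  # assumption
    "manual": ["manual_config"],
    "leave_one_out_stratification": ["leave_one_out_config"],
}


def missing_fields(spec: dict) -> list:
    """List the fields the chosen split_strategy needs but the spec lacks."""
    strategy = spec["split_strategy"]
    if strategy not in REQUIRED_FIELDS:
        raise ValueError(f"Unknown split_strategy: {strategy}")
    return [name for name in REQUIRED_FIELDS[strategy] if spec.get(name) is None]


print(missing_fields({"split_strategy": "random", "split_count": 5}))
```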
YAML specification:
    # as a part of a TrainMLModel instruction, defining the outer (assessment) loop of nested cross-validation:
    assessment: # outer loop of nested CV
        split_strategy: random # perform Monte Carlo CV (randomly split the data into train and test)
        split_count: 5 # how many train/test datasets to generate
        training_percentage: 0.7 # what percentage of the original data should be used for the training set
        reports: # reports to execute on training/test datasets, encoded datasets and trained ML methods
            data_splits: # list of data reports to execute on training/test datasets (before they are encoded)
                - rep1
            encoding: # list of encoding reports to execute on encoded training/test datasets
                - rep2
            models: # list of ML model reports to execute on the trained classifiers in the assessment loop
                - rep3

    # as a part of a TrainMLModel instruction, defining the inner (selection) loop of nested cross-validation:
    selection: # inner loop of nested CV
        split_strategy: leave_one_out_stratification
        leave_one_out_config: # perform leave-(subject)-out CV
            parameter: subject # which parameter to use for splitting, must be present in the metadata for each example
            min_count: 1 # what is the minimum number of examples with unique value of the parameter specified above for the analysis to be valid
        reports: # reports to execute on training/test datasets, encoded datasets and trained ML methods
            data_splits: # list of data reports to execute on training/test datasets (before they are encoded)
                - rep1
            encoding: # list of encoding reports to execute on encoded training/test datasets
                - rep2
            models: # list of ML model reports to execute on the trained classifiers in the selection loop
                - rep3
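The leave_one_out_stratification setting in the selection loop above can be pictured as grouping examples by the chosen parameter and holding out one group at a time. The following is a minimal sketch under stated assumptions, not immuneML's implementation: the function name leave_one_value_out and the toy receptor records are hypothetical, and min_count is interpreted as the smallest allowed group size, with a smaller group making the analysis invalid.

```python
from collections import defaultdict


def leave_one_value_out(examples: list, parameter: str, min_count: int = 1):
    """Yield (held_out_value, train, test); test holds one parameter value."""
    groups = defaultdict(list)
    for example in examples:
        groups[example[parameter]].append(example)
    # Assumption: a group smaller than min_count invalidates the analysis.
    too_small = [value for value, members in groups.items()
                 if len(members) < min_count]
    if too_small:
        raise ValueError(f"Fewer than {min_count} examples for: {too_small}")
    for value, test in groups.items():
        train = [e for e in examples if e[parameter] != value]
        yield value, train, test


# Toy receptor records; "subject" plays the role of `parameter: subject`.
receptors = [
    {"id": "r1", "subject": "s1"},
    {"id": "r2", "subject": "s1"},
    {"id": "r3", "subject": "s2"},
    {"id": "r4", "subject": "s3"},
]
splits = list(leave_one_value_out(receptors, parameter="subject", min_count=1))
for value, train, test in splits:
    print(value, [e["id"] for e in train], [e["id"] for e in test])
```

The number of train/test pairs equals the number of distinct values of the parameter, which is consistent with split_count not being needed for this strategy.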