immuneML.hyperparameter_optimization.config package

Submodules

immuneML.hyperparameter_optimization.config.LeaveOneOutConfig module

class immuneML.hyperparameter_optimization.config.LeaveOneOutConfig.LeaveOneOutConfig(parameter: str = None, min_count: int = None)[source]

Bases: object

min_count: int = None
parameter: str = None
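
The attributes double as the constructor arguments. As an illustration only (not prescribed by this page), the class can be instantiated directly in Python; the values below mirror the leave-(subject)-out example in the SplitConfig YAML further down:

# a minimal sketch of direct instantiation; 'subject' must exist as a metadata parameter in the dataset
from immuneML.hyperparameter_optimization.config.LeaveOneOutConfig import LeaveOneOutConfig

loo_config = LeaveOneOutConfig(parameter="subject", min_count=1)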

immuneML.hyperparameter_optimization.config.ManualSplitConfig module

class immuneML.hyperparameter_optimization.config.ManualSplitConfig.ManualSplitConfig(train_metadata_path: pathlib.Path = None, test_metadata_path: pathlib.Path = None)[source]

Bases: object

test_metadata_path: pathlib.Path = None
train_metadata_path: pathlib.Path = None
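
As an illustration only, the class can be instantiated directly in Python; the file names below are hypothetical, and both metadata files need to contain the "subject_id" field used for matching examples (see SplitConfig below):

from pathlib import Path
from immuneML.hyperparameter_optimization.config.ManualSplitConfig import ManualSplitConfig

# hypothetical metadata files describing the manually defined train/test split
manual_config = ManualSplitConfig(train_metadata_path=Path("train_metadata.csv"),
                                  test_metadata_path=Path("test_metadata.csv"))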

immuneML.hyperparameter_optimization.config.ReportConfig module

class immuneML.hyperparameter_optimization.config.ReportConfig.ReportConfig(data_splits: Optional[dict] = None, models: Optional[dict] = None, data: Optional[dict] = None, encoding: Optional[dict] = None)[source]

Bases: object

A class encapsulating the different report lists which can be executed while performing nested cross-validation (CV) using the TrainMLModel instruction. All arguments are optional.

Parameters
  • data (dict) – Data reports to be executed on the whole dataset, before it is split into training/test or training/validation datasets

  • data_splits (dict) – Data reports to be executed after the data has been split into training and test (assessment CV loop) or training and validation (selection CV loop) datasets, before they are encoded

  • models (dict) – ML model reports to be executed on all trained classifiers

  • encoding (dict) – Encoding reports to be executed on each of the encoded training/test or training/validation datasets

YAML specification:

# as a part of a TrainMLModel instruction, defining the outer (assessment) loop of nested cross-validation:
assessment: # outer loop of nested CV
    split_strategy: random # perform Monte Carlo CV (randomly split the data into train and test)
    split_count: 5 # how many train/test datasets to generate
    training_percentage: 0.7 # what percentage of the original data should be used for the training set
    reports: # reports to execute on training/test datasets, encoded datasets and trained ML methods
        data_splits: # list of reports to execute on training/test datasets (before they are preprocessed and encoded)
            - my_data_split_report
        encoding: # list of reports to execute on encoded training/test datasets
            - my_encoding_report

# as a part of a TrainMLModel instruction, defining the inner (selection) loop of nested cross-validation:
selection: # inner loop of nested CV
    split_strategy: random # perform Monte Carlo CV (randomly split the data into train and validation)
    split_count: 5 # how many train/validation datasets to generate
    training_percentage: 0.7 # what percentage of the original data should be used for the training set
    reports: # reports to execute on training/validation datasets, encoded datasets and trained ML methods
        data_splits: # list of reports to execute on training/validation datasets (before they are preprocessed and encoded)
            - my_data_split_report
        encoding: # list of reports to execute on encoded training/validation datasets
            - my_encoding_report
        models:
            - my_ml_model_report
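
For completeness, a minimal Python-level sketch of the same configuration. It assumes each argument is a dict keyed by report name; the None values are placeholders standing in for report objects, which is an assumption rather than something this page specifies. In typical use, the reports are declared in the YAML specification as shown above rather than constructed by hand:

from immuneML.hyperparameter_optimization.config.ReportConfig import ReportConfig

# placeholder entries keyed by report name (assumed structure; the values would be report objects in a real run)
report_config = ReportConfig(data_splits={"my_data_split_report": None},
                             encoding={"my_encoding_report": None},
                             models={"my_ml_model_report": None})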
static get_documentation()[source]

immuneML.hyperparameter_optimization.config.SplitConfig module

class immuneML.hyperparameter_optimization.config.SplitConfig.SplitConfig(split_strategy: immuneML.hyperparameter_optimization.config.SplitType.SplitType, split_count: int, training_percentage: Optional[float] = None, reports: Optional[immuneML.hyperparameter_optimization.config.ReportConfig.ReportConfig] = None, manual_config: Optional[immuneML.hyperparameter_optimization.config.ManualSplitConfig.ManualSplitConfig] = None, leave_one_out_config: Optional[immuneML.hyperparameter_optimization.config.LeaveOneOutConfig.LeaveOneOutConfig] = None)[source]

Bases: object

SplitConfig describes how to split the data for cross-validation. It allows for the following combinations:

  • loocv (leave-one-out cross-validation)

  • k_fold (k-fold cross-validation)

  • stratified_k_fold (stratified k-fold cross-validation, applicable when immuneML is used for single-label classification; see the scikit-learn StratifiedKFold documentation at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html for more details on how this is implemented)

  • random (Monte Carlo cross-validation - randomly splitting the dataset to training and test datasets)

  • manual (train and test datasets are explicitly specified by providing metadata files for the two datasets - currently available only for repertoire datasets)

  • leave_one_out_stratification (leave-one-out CV where "one" refers to a specific parameter; e.g., if the subject is known in a receptor dataset, it is possible to have leave-subject-out CV - currently available only for receptor datasets).

Parameters
  • split_strategy (SplitType) – one of the types of cross-validation listed above (LOOCV, K_FOLD, STRATIFIED_K_FOLD, MANUAL, LEAVE_ONE_OUT_STRATIFICATION or RANDOM)

  • split_count (int) – if split_strategy is K_FOLD, this defines how many splits to make (K); if split_strategy is RANDOM, split_count defines how many random splits to make, resulting in split_count training/test dataset pairs; if split_strategy is LOOCV, MANUAL or LEAVE_ONE_OUT_STRATIFICATION, split_count does not need to be specified.

  • training_percentage – if split_strategy is RANDOM, this defines which portion of the original dataset to use for creating the training dataset; for other values of split_strategy, this parameter is not used.

  • reports (ReportConfig) – defines which reports to execute on which datasets or settings. See ReportConfig for more details.

  • manual_config (ManualSplitConfig) – if split strategy is MANUAL, here the paths to the metadata files for the train and test datasets should be given. Examples are matched to the metadata using the "subject_id" field, so it has to be present in both the original dataset and the metadata files provided here. Manual splitting is available only for repertoire datasets so far. If split strategy is anything else, this field has no effect and can be omitted.

  • leave_one_out_config (LeaveOneOutConfig) – if split strategy is LEAVE_ONE_OUT_STRATIFICATION, this config describes which parameter to use for stratification, thus making a list of train/test dataset combinations in which the test set contains examples with only one value of the specified parameter. The leave_one_out_config argument accepts two inputs: parameter, which is the name of the parameter to use for stratification, and min_count, which defines the minimum number of examples that can be present in the test dataset. This type of generating train and test datasets is only supported for receptor datasets so far. If split strategy is anything else, this field has no effect and can be omitted.

YAML specification:

# as a part of a TrainMLModel instruction, defining the outer (assessment) loop of nested cross-validation:
assessment: # outer loop of nested CV
    split_strategy: random # perform Monte Carlo CV (randomly split the data into train and test)
    split_count: 5 # how many train/test datasets to generate
    training_percentage: 0.7 # what percentage of the original data should be used for the training set
    reports: # reports to execute on training/test datasets, encoded datasets and trained ML methods
        data_splits: # list of data reports to execute on training/test datasets (before they are encoded)
            - rep1
        encoding: # list of encoding reports to execute on encoded training/test datasets
            - rep2
        models: # list of ML model reports to execute on the trained classifiers in the assessment loop
            - rep3

# as a part of a TrainMLModel instruction, defining the inner (selection) loop of nested cross-validation:
selection: # inner loop of nested CV
    split_strategy: leave_one_out_stratification
    leave_one_out_config: # perform leave-(subject)-out CV
        parameter: subject # which parameter to use for splitting, must be present in the metadata for each example
        min_count: 1 # minimum number of examples with a unique value of the parameter specified above needed for the analysis to be valid
    reports: # reports to execute on training/test datasets, encoded datasets and trained ML methods
        data_splits: # list of data reports to execute on training/test datasets (before they are encoded)
            - rep1
        encoding: # list of encoding reports to execute on encoded training/test datasets
            - rep2
        models: # list of ML model reports to execute on the trained classifiers in the selection loop
            - rep3
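
A minimal Python-level sketch of constructing the two split configurations above directly, assuming the documented constructor signatures; in typical use, the TrainMLModel instruction builds these objects from the YAML specification:

from immuneML.hyperparameter_optimization.config.SplitType import SplitType
from immuneML.hyperparameter_optimization.config.SplitConfig import SplitConfig
from immuneML.hyperparameter_optimization.config.ReportConfig import ReportConfig
from immuneML.hyperparameter_optimization.config.LeaveOneOutConfig import LeaveOneOutConfig

# outer (assessment) loop: 5 Monte Carlo splits, 70% of the data used for training
assessment = SplitConfig(split_strategy=SplitType.RANDOM,
                         split_count=5,
                         training_percentage=0.7,
                         reports=ReportConfig())

# inner (selection) loop: leave-(subject)-out stratification; split_count is not
# meaningful for this strategy (see above), but the constructor requires a value,
# so a placeholder of 1 is passed here
selection = SplitConfig(split_strategy=SplitType.LEAVE_ONE_OUT_STRATIFICATION,
                        split_count=1,
                        leave_one_out_config=LeaveOneOutConfig(parameter="subject", min_count=1))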
static get_documentation()[source]

immuneML.hyperparameter_optimization.config.SplitType module

class immuneML.hyperparameter_optimization.config.SplitType.SplitType(value)[source]

Bases: enum.Enum

An enumeration.

K_FOLD = 0
LEAVE_ONE_OUT_STRATIFICATION = 4
LOOCV = 1
MANUAL = 3
RANDOM = 2
STRATIFIED_K_FOLD = 5
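
The enum members correspond to the split_strategy values used in the YAML specification (written in lower case there). A trivial usage sketch:

from immuneML.hyperparameter_optimization.config.SplitType import SplitType

# pick a strategy and inspect its name and value as listed above
strategy = SplitType.RANDOM
print(strategy.name, strategy.value)  # prints: RANDOM 2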

Module contents