immuneML.hyperparameter_optimization.config package

Submodules

immuneML.hyperparameter_optimization.config.LeaveOneOutConfig module

class immuneML.hyperparameter_optimization.config.LeaveOneOutConfig.LeaveOneOutConfig(parameter: str = None, min_count: int = None)

Bases: object

min_count: int = None
parameter: str = None

immuneML.hyperparameter_optimization.config.ManualSplitConfig module

class immuneML.hyperparameter_optimization.config.ManualSplitConfig.ManualSplitConfig(train_metadata_path: pathlib.Path = None, test_metadata_path: pathlib.Path = None)

Bases: object

test_metadata_path: Path = None
train_metadata_path: Path = None
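
As described for manual_config under SplitConfig, a manual split of a repertoire dataset matches examples through the "subject_id" field of the two metadata files. The following is a minimal sketch of that matching, not immuneML's implementation: the function names, the inline CSV strings (standing in for the files at train_metadata_path and test_metadata_path) and the overlap check are assumptions made for illustration.

```python
import csv
import io


def read_subject_ids(metadata_text):
    """Collect the subject_id column of a CSV metadata table (given here as text)."""
    return {row["subject_id"] for row in csv.DictReader(io.StringIO(metadata_text))}


def manual_split(dataset, train_metadata_text, test_metadata_text):
    """Partition dataset entries by the subject ids listed in each metadata table."""
    train_ids = read_subject_ids(train_metadata_text)
    test_ids = read_subject_ids(test_metadata_text)
    # Assumption for this sketch: a subject may not appear in both tables.
    overlap = train_ids & test_ids
    if overlap:
        raise ValueError(f"subject_id present in both train and test: {sorted(overlap)}")
    train = [entry for entry in dataset if entry["subject_id"] in train_ids]
    test = [entry for entry in dataset if entry["subject_id"] in test_ids]
    return train, test


# Hypothetical dataset and metadata contents.
dataset = [{"subject_id": "s1"}, {"subject_id": "s2"}, {"subject_id": "s3"}]
train_metadata = "subject_id,label\ns1,True\ns2,False\n"
test_metadata = "subject_id,label\ns3,True\n"

train, test = manual_split(dataset, train_metadata, test_metadata)
print([entry["subject_id"] for entry in train])
print([entry["subject_id"] for entry in test])
```

In practice the two tables would be read from the paths stored in ManualSplitConfig; inline strings are used here only to keep the sketch self-contained.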

immuneML.hyperparameter_optimization.config.ReportConfig module

class immuneML.hyperparameter_optimization.config.ReportConfig.ReportConfig(data_splits: dict = None, models: dict = None, data: dict = None, encoding: dict = None)

Bases: object

Encapsulates the lists of reports that can be executed while performing nested cross-validation (CV) with the TrainMLModel instruction. All arguments are optional.

Parameters:
  • data (dict) – Data reports to be executed on the whole dataset before it is split into training/test or training/validation datasets

  • data_splits (dict) – Data reports to be executed after the data has been split into training and test (assessment CV loop) or training and validation (selection CV loop) datasets before they are encoded

  • models (dict) – ML model reports to be executed on all trained classifiers

  • encoding (dict) – Encoding reports to be executed on each of the encoded training/test datasets or training/validation datasets
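
To illustrate that every report group is optional, here is a minimal sketch built on a stand-in class. ReportConfigSketch and configured_groups are hypothetical names that only mirror the documented constructor signature; they are not part of immuneML, and the None values are placeholders for report objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReportConfigSketch:
    """Hypothetical stand-in mirroring the documented ReportConfig signature."""
    data_splits: Optional[dict] = None
    models: Optional[dict] = None
    data: Optional[dict] = None
    encoding: Optional[dict] = None

    def configured_groups(self) -> list:
        """Return the names of the report groups that contain at least one report."""
        return [name for name in ("data", "data_splits", "encoding", "models")
                if getattr(self, name)]


# Only two of the four groups are specified; the others keep their default of None.
reports = ReportConfigSketch(
    data_splits={"my_data_split_report": None},  # placeholder for a report object
    encoding={"my_encoding_report": None},       # placeholder for a report object
)
print(reports.configured_groups())
```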

YAML specification:

# as a part of a TrainMLModel instruction, defining the outer (assessment) loop of nested cross-validation:
assessment: # outer loop of nested CV
    split_strategy: random # perform Monte Carlo CV (randomly split the data into train and test)
    split_count: 5 # how many train/test datasets to generate
    training_percentage: 0.7 # what percentage of the original data should be used for the training set
    reports: # reports to execute on training/test datasets, encoded datasets and trained ML methods
        data_splits: # list of reports to execute on training/test datasets (before they are preprocessed and encoded)
            - my_data_split_report
        encoding: # list of reports to execute on encoded training/test datasets
            - my_encoding_report

# as a part of a TrainMLModel instruction, defining the inner (selection) loop of nested cross-validation:
selection: # inner loop of nested CV
    split_strategy: random # perform Monte Carlo CV (randomly split the data into train and validation)
    split_count: 5 # how many train/validation datasets to generate
    training_percentage: 0.7 # what percentage of the original data should be used for the training set
    reports: # reports to execute on training/validation datasets, encoded datasets and trained ML methods
        data_splits: # list of reports to execute on training/validation datasets (before they are preprocessed and encoded)
            - my_data_split_report
        encoding: # list of reports to execute on encoded training/validation datasets
            - my_encoding_report
        models:
            - my_ml_model_report
static get_documentation()

immuneML.hyperparameter_optimization.config.SplitConfig module

class immuneML.hyperparameter_optimization.config.SplitConfig.SplitConfig(split_strategy: SplitType, split_count: int, training_percentage: float = None, reports: ReportConfig = None, manual_config: ManualSplitConfig = None, leave_one_out_config: LeaveOneOutConfig = None)

Bases: object

SplitConfig describes how to split the data for cross-validation. It supports the following split strategies:

  • loocv (leave-one-out cross-validation)

  • k_fold (k-fold cross-validation)

  • stratified_k_fold (stratified k-fold cross-validation that can be used when immuneML is used for single-label classification, see `this documentation <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html>`_ for more details on how this is implemented)

  • random (Monte Carlo cross-validation - randomly splitting the dataset to training and test datasets)

  • manual (train and test datasets are explicitly specified by providing metadata files for the two datasets)

  • leave_one_out_stratification (leave-one-out CV where one refers to a specific parameter, e.g. if subject is known in a receptor dataset, it is possible to have leave-subject-out CV - currently only available for receptor and sequence datasets).
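
As an illustration of the random strategy listed above (Monte Carlo cross-validation), the following minimal sketch draws independent random train/test splits using only the standard library. The function name and the seed argument are assumptions made for this example; this is not immuneML's implementation.

```python
import random


def monte_carlo_splits(example_ids, split_count, training_percentage, seed=0):
    """Yield (train, test) lists of ids; every split is an independent random shuffle."""
    rng = random.Random(seed)
    # Number of examples assigned to the training set in each split.
    n_train = int(round(len(example_ids) * training_percentage))
    for _ in range(split_count):
        shuffled = list(example_ids)
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]


# Mirrors split_count: 5 and training_percentage: 0.7 on a toy dataset of 10 examples.
splits = list(monte_carlo_splits(range(10), split_count=5, training_percentage=0.7))
for train, test in splits:
    print(len(train), len(test))
```

Unlike k-fold cross-validation, the test sets of different splits may overlap, because each split is drawn independently.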

Parameters:
  • split_strategy (SplitType) – one of the types of cross-validation listed above (LOOCV, K_FOLD, STRATIFIED_K_FOLD, MANUAL, LEAVE_ONE_OUT_STRATIFICATION or RANDOM)

  • split_count (int) – if split_strategy is K_FOLD, this defines how many splits to make (K); if split_strategy is RANDOM, it defines how many random splits to make, resulting in split_count training/test dataset pairs; if split_strategy is LOOCV, MANUAL or LEAVE_ONE_OUT_STRATIFICATION, split_count does not need to be specified.

  • training_percentage (float) – if split_strategy is RANDOM, this defines which portion of the original dataset to use for creating the training dataset; for other values of split_strategy, this parameter is not used.

  • reports (ReportConfig) – defines which reports to execute on which datasets or settings. See ReportConfig for more details.

  • manual_config (ManualSplitConfig) – if split_strategy is MANUAL, the paths to the metadata files should be given here. For repertoire datasets, the "subject_id" field is used, so it has to be present in both the original dataset and the metadata files provided here. For receptor and sequence datasets, the "example_id" field needs to be provided in the metadata files, and it will be mapped to either 'sequence_identifiers' or 'receptor_identifiers' in the original dataset. If split_strategy is anything other than MANUAL, this field has no effect and can be omitted.

  • leave_one_out_config (LeaveOneOutConfig) – if split_strategy is LEAVE_ONE_OUT_STRATIFICATION, this config describes which parameter to use for stratification, thus making a list of train/test dataset combinations in which the test set contains examples with only one value of the specified parameter. The leave_one_out_config argument accepts two inputs: parameter, which is the name of the parameter to use for stratification, and min_count, which defines the minimum number of examples that can be present in the test dataset. This type of generating train and test datasets is only supported for receptor and sequence datasets so far. If split_strategy is anything else, this field can be omitted.
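
The splitting that leave_one_out_config describes can be sketched as follows: examples are grouped by the value of the chosen parameter, and each group in turn becomes the test set. This is an illustration under stated assumptions, not immuneML's code. The function name and the dict-based examples are hypothetical, and a group smaller than min_count is treated here as an error, which is one possible reading of the description.

```python
from collections import defaultdict


def leave_one_value_out(examples, parameter, min_count=1):
    """Return (held_out_value, train, test) triples; each test set holds one parameter value."""
    groups = defaultdict(list)
    for example in examples:
        groups[example[parameter]].append(example)

    # Assumption for this sketch: every group must contain at least min_count examples.
    too_small = [value for value, group in groups.items() if len(group) < min_count]
    if too_small:
        raise ValueError(f"fewer than {min_count} examples for values: {too_small}")

    splits = []
    for value, test in groups.items():
        train = [example for example in examples if example[parameter] != value]
        splits.append((value, train, test))
    return splits


# Hypothetical receptor examples annotated with the subject they come from.
examples = [
    {"id": "r1", "subject": "A"},
    {"id": "r2", "subject": "A"},
    {"id": "r3", "subject": "B"},
    {"id": "r4", "subject": "C"},
]
splits = leave_one_value_out(examples, parameter="subject", min_count=1)
for value, train, test in splits:
    print(value, [e["id"] for e in train], [e["id"] for e in test])
```

With three distinct subjects, the sketch produces three train/test pairs, so the number of splits follows from the data and split_count is not needed.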

YAML specification:

# as a part of a TrainMLModel instruction, defining the outer (assessment) loop of nested cross-validation:
assessment: # outer loop of nested CV
    split_strategy: random # perform Monte Carlo CV (randomly split the data into train and test)
    split_count: 5 # how many train/test datasets to generate
    training_percentage: 0.7 # what percentage of the original data should be used for the training set
    reports: # reports to execute on training/test datasets, encoded datasets and trained ML methods
        data_splits: # list of data reports to execute on training/test datasets (before they are encoded)
            - rep1
        encoding: # list of encoding reports to execute on encoded training/test datasets
            - rep2
        models: # list of ML model reports to execute on the trained classifiers in the assessment loop
            - rep3

# as a part of a TrainMLModel instruction, defining the inner (selection) loop of nested cross-validation:
selection: # inner loop of nested CV
    split_strategy: leave_one_out_stratification
    leave_one_out_config: # perform leave-(subject)-out CV
        parameter: subject # which parameter to use for splitting, must be present in the metadata for each example
        min_count: 1 # what is the minimum number of examples with unique value of the parameter specified above for the analysis to be valid
    reports: # reports to execute on training/validation datasets, encoded datasets and trained ML methods
        data_splits: # list of data reports to execute on training/validation datasets (before they are encoded)
            - rep1
        encoding: # list of encoding reports to execute on encoded training/validation datasets
            - rep2
        models: # list of ML model reports to execute on the trained classifiers in the selection loop
            - rep3
static get_documentation()
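
The parameter descriptions above imply that each split strategy needs a different subset of fields. The following minimal sketch checks a split specification, given as a plain dict shaped like the YAML above, against those rules. REQUIRED_FIELDS and missing_fields are hypothetical names. Requiring split_count for stratified_k_fold and training_percentage for random is an assumption of this sketch, not something the documentation states.

```python
# Fields that each strategy is expected to define, following the parameter descriptions.
REQUIRED_FIELDS = {
    "loocv": [],
    "k_fold": ["split_count"],
    "stratified_k_fold": ["split_count"],  # assumption: same requirement as k_fold
    "random": ["split_count", "training_percentage"],  # assumption: both are needed
    "manual": ["manual_config"],
    "leave_one_out_stratification": ["leave_one_out_config"],
}


def missing_fields(split_spec: dict) -> list:
    """Return the fields that the chosen strategy needs but the specification lacks."""
    strategy = split_spec["split_strategy"]
    return [field for field in REQUIRED_FIELDS[strategy] if split_spec.get(field) is None]


assessment = {"split_strategy": "random", "split_count": 5, "training_percentage": 0.7}
selection = {"split_strategy": "leave_one_out_stratification"}

print(missing_fields(assessment))  # []
print(missing_fields(selection))   # ['leave_one_out_config']
```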

immuneML.hyperparameter_optimization.config.SplitType module

class immuneML.hyperparameter_optimization.config.SplitType.SplitType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

K_FOLD = 0
LEAVE_ONE_OUT_STRATIFICATION = 4
LOOCV = 1
MANUAL = 3
RANDOM = 2
STRATIFIED_K_FOLD = 5
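
The YAML examples write the strategy in lowercase (for example random), while the enum members are uppercase. A minimal sketch of one way to map between the two is shown below. SplitTypeSketch is a stand-in that copies the member values listed above, and parse_split_strategy is a hypothetical helper; how immuneML itself performs this conversion is not covered here.

```python
from enum import Enum


class SplitTypeSketch(Enum):
    """Stand-in enum with the same member names and values as listed above."""
    K_FOLD = 0
    LOOCV = 1
    RANDOM = 2
    MANUAL = 3
    LEAVE_ONE_OUT_STRATIFICATION = 4
    STRATIFIED_K_FOLD = 5


def parse_split_strategy(name: str) -> SplitTypeSketch:
    """Look up an enum member by name, ignoring case and surrounding whitespace."""
    key = name.strip().upper()
    try:
        return SplitTypeSketch[key]
    except KeyError:
        valid = ", ".join(member.name.lower() for member in SplitTypeSketch)
        raise ValueError(f"Unknown split_strategy '{name}'; expected one of: {valid}") from None


print(parse_split_strategy("random"))  # SplitTypeSketch.RANDOM
```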

Module contents