immuneML.workflows.instructions.clustering package¶

Submodules¶

immuneML.workflows.instructions.clustering.ClusteringInstruction module¶

class immuneML.workflows.instructions.clustering.ClusteringInstruction.ClusteringInstruction(dataset: Dataset, metrics: List[str], clustering_settings: List[ClusteringSetting], name: str, label_config: LabelConfiguration = None, reports: List[Report] = None, number_of_processes: int = None, split_config: SplitConfig = None, sequence_type: SequenceType = None, region_type: RegionType = None, validation_type: List[str] = None)[source]¶

Bases: Instruction

The Clustering instruction fits clustering methods to the provided encoded dataset and compares each combination of encoding, clustering method, and hyperparameters across a predefined set of metrics. The dataset is split into discovery and validation datasets, and the clustering results are reported on both. Reports can optionally be included to visualize the results.

For more details on choosing the clustering algorithm and its hyperparameters, see the paper: Ullmann, T., Hennig, C., & Boulesteix, A.-L. (2022). Validation of cluster analysis results on validation data: A systematic framework. WIREs Data Mining and Knowledge Discovery, 12(3), e1444. https://doi.org/10.1002/widm.1444

Specification arguments:

dataset (str): name of the dataset to be clustered
metrics (list): a list of metrics used to compare clustering algorithms and encodings; it can include internal evaluation metrics, which need no labels, and external evaluation metrics, which compare the clusters against a list of predefined labels
labels (list): an optional list of labels to use for external evaluation of clustering
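
To make the internal/external distinction concrete: an external metric such as adjusted_rand_score compares cluster assignments against a predefined label. The sketch below computes the adjusted Rand index in plain Python from its textbook definition. It only illustrates what such a metric measures and is not the code immuneML runs; the function name, the epitope labels, and the cluster ids are made up for the example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from the contingency table of two labelings."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("both labelings must cover the same examples")

    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two examples: the labelings trivially agree

    # Count pairs of examples that share a (label, cluster) cell, a label, or a cluster.
    pairs_same_cell = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_same_label = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_same_cluster = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = pairs_same_label * pairs_same_cluster / total_pairs
    max_index = (pairs_same_label + pairs_same_cluster) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. everything in one group on both sides
    return (pairs_same_cell - expected) / (max_index - expected)


# Hypothetical data: the value of a label per example, and cluster ids from some method.
epitope = ["A", "A", "B", "B"]
clusters_matching = [1, 1, 0, 0]
clusters_mixed = [0, 1, 0, 1]

print(adjusted_rand_index(epitope, clusters_matching))
print(adjusted_rand_index(epitope, clusters_mixed))
```

A clustering that recovers the label groups exactly scores 1.0 regardless of how the cluster ids are named, while a clustering that agrees with the labels less often than chance would scores below zero. Internal metrics, by contrast, are computed from the encoded data and the cluster assignments alone.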

split_config (SplitConfig): how to split the original dataset into discovery and validation data; for this parameter, specify split_strategy (leave_one_out_stratification, manual, or random), training_percentage if split_strategy is random, and manual_config or leave_one_out_config for the corresponding split strategy; all three options are illustrated here:

split_config:
    split_strategy: manual
    manual_config:
        discovery_data: file_with_ids_of_examples_for_discovery_data.csv
        validation_data: file_with_ids_of_examples_for_validation_data.csv

split_config:
    split_strategy: random
    training_percentage: 0.5

split_config:
    split_strategy: leave_one_out_stratification
    leave_one_out_config:
        parameter: subject_id # any name of the parameter for split, must be present in the metadata
        min_count: 1 # minimum number of examples that can be present in the validation dataset
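
As a rough illustration of the random strategy, the sketch below shuffles example identifiers and cuts them into a discovery part and a validation part. It is a minimal sketch under assumptions: the helper name, the seed argument, the rounding rule, and the mapping of training_percentage to the discovery share are all hypothetical and may differ from what immuneML does internally.

```python
import random


def random_split(example_ids, training_percentage, seed=None):
    """Shuffle example ids and cut them into (discovery, validation) lists."""
    if not 0 < training_percentage < 1:
        raise ValueError("training_percentage must be strictly between 0 and 1")

    ids = list(example_ids)
    random.Random(seed).shuffle(ids)  # local RNG, so the global random state is untouched

    # Assumption: training_percentage is the share of examples used for discovery.
    cut = round(len(ids) * training_percentage)
    return ids[:cut], ids[cut:]


# Hypothetical example ids; real ids would come from the dataset.
discovery_ids, validation_ids = random_split([f"ex{i}" for i in range(10)], 0.5, seed=0)
print(len(discovery_ids), len(validation_ids))
```

With the manual strategy, the two id lists would instead be read from the files named under manual_config.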

clustering_settings (list): a list where each element represents a ClusteringSetting: a combination of an encoding, an optional dimensionality reduction algorithm, and the clustering algorithm that will be evaluated
reports (list): a list of reports to be run on the clustering results or the encoded data
number_of_processes (int): how many processes to use for parallelization
sequence_type (str): whether to do analysis on the amino_acid or nucleotide level; this value is used only if nothing is specified on the encoder level
region_type (str): which part of the receptor sequence to analyze (e.g., IMGT_CDR3); this value is used only if nothing is specified on the encoder level
validation_type (list): a list of validation types to use for comparison of clustering algorithms and encodings; it can be method_based and/or result_based

YAML specification:

instructions:
    my_clustering_instruction:
        type: Clustering
        dataset: d1
        metrics: [adjusted_rand_score, adjusted_mutual_info_score]
        labels: [epitope, v_call]
        sequence_type: amino_acid
        region_type: imgt_cdr3
        validation_type: [method_based, result_based]
        split_config:
            split_strategy: manual
            manual_config:
                discovery_data: file_with_ids_of_examples_for_discovery_data.csv
                validation_data: file_with_ids_of_examples_for_validation_data.csv
        clustering_settings:
            - encoding: e1
              dim_reduction: pca
              method: k_means1
            - encoding: e2
              method: dbscan
        reports: [rep1, rep2]
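
The constraints stated in the argument list can be checked before running an analysis. The sketch below does this on a plain dictionary that mirrors the YAML above. It is a hypothetical pre-flight check and not immuneML's own validation: the function and constant names are invented here, and only the rules spelled out in this section are enforced.

```python
ALLOWED_SPLIT_STRATEGIES = {"manual", "random", "leave_one_out_stratification"}
ALLOWED_VALIDATION_TYPES = {"method_based", "result_based"}


def check_clustering_spec(spec):
    """Return a list of human-readable problems found in a Clustering instruction spec."""
    problems = []

    # split_config is only checked when present.
    if "split_config" in spec:
        strategy = spec["split_config"].get("split_strategy")
        if strategy not in ALLOWED_SPLIT_STRATEGIES:
            problems.append(f"unknown split_strategy: {strategy!r}")
        elif strategy == "random" and "training_percentage" not in spec["split_config"]:
            problems.append("split_strategy random needs training_percentage")

    for validation_type in spec.get("validation_type", []):
        if validation_type not in ALLOWED_VALIDATION_TYPES:
            problems.append(f"unknown validation_type: {validation_type!r}")

    # Each clustering setting combines an encoding and a method; dim_reduction is optional.
    for index, setting in enumerate(spec.get("clustering_settings", [])):
        for key in ("encoding", "method"):
            if key not in setting:
                problems.append(f"clustering_settings[{index}] is missing {key!r}")

    return problems


# The instruction from the YAML specification above, written as a Python dict.
my_clustering_instruction = {
    "type": "Clustering",
    "dataset": "d1",
    "metrics": ["adjusted_rand_score", "adjusted_mutual_info_score"],
    "labels": ["epitope", "v_call"],
    "validation_type": ["method_based", "result_based"],
    "split_config": {
        "split_strategy": "manual",
        "manual_config": {
            "discovery_data": "file_with_ids_of_examples_for_discovery_data.csv",
            "validation_data": "file_with_ids_of_examples_for_validation_data.csv",
        },
    },
    "clustering_settings": [
        {"encoding": "e1", "dim_reduction": "pca", "method": "k_means1"},
        {"encoding": "e2", "method": "dbscan"},
    ],
    "reports": ["rep1", "rep2"],
}

print(check_clustering_spec(my_clustering_instruction))
```

Returning a list of problems rather than raising on the first one lets a user fix several specification mistakes in one pass.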

run(result_path: Path)[source]¶

Execute the clustering instruction workflow.

immuneML.workflows.instructions.clustering.clustering_run_model module¶

class immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringItem(dataset: immuneML.data_model.datasets.Dataset.Dataset = None, method: immuneML.ml_methods.clustering.ClusteringMethod.ClusteringMethod = None, encoder: immuneML.encodings.DatasetEncoder.DatasetEncoder = None, internal_performance: immuneML.workflows.instructions.clustering.clustering_run_model.DataFrameWrapper = None, external_performance: immuneML.workflows.instructions.clustering.clustering_run_model.DataFrameWrapper = None, predictions: numpy.ndarray = None, cl_setting: immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringSetting = None)[source]¶

Bases: object

cl_setting: ClusteringSetting = None¶

dataset: Dataset = None¶

encoder: DatasetEncoder = None¶

external_performance: DataFrameWrapper = None¶

internal_performance: DataFrameWrapper = None¶

method: ClusteringMethod = None¶

predictions: ndarray = None¶

class immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringSetting(encoder: immuneML.encodings.DatasetEncoder.DatasetEncoder, encoder_params: dict, encoder_name: str, clustering_method: immuneML.ml_methods.clustering.ClusteringMethod.ClusteringMethod, clustering_params: dict, clustering_method_name: str, dim_reduction_method: immuneML.ml_methods.dim_reduction.DimRedMethod.DimRedMethod = None, dim_red_params: dict = None, dim_red_name: str = None, path: pathlib.Path = None)[source]¶

Bases: object

clustering_method: ClusteringMethod¶

clustering_method_name: str¶

clustering_params: dict¶

dim_red_name: str = None¶

dim_red_params: dict = None¶

dim_reduction_method: DimRedMethod = None¶

encoder: DatasetEncoder¶

encoder_name: str¶

encoder_params: dict¶

get_key() → str[source]¶

path: Path = None¶

class immuneML.workflows.instructions.clustering.clustering_run_model.DataFrameWrapper(path: Path, df: pandas.DataFrame = None)[source]¶

Bases: object

get_df()[source]¶
