immuneML.workflows.instructions.clustering package

Submodules

immuneML.workflows.instructions.clustering.ClusteringInstruction module

class immuneML.workflows.instructions.clustering.ClusteringInstruction.ClusteringInstruction(dataset: Dataset, metrics: List[str], clustering_settings: List[ClusteringSetting], name: str, label_config: LabelConfiguration = None, reports: List[Report] = None, number_of_processes: int = None, split_config: SplitConfig = None, sequence_type: SequenceType = None, region_type: RegionType = None, validation_type: List[str] = None)[source]

Bases: Instruction

The clustering instruction fits clustering methods to the provided encoded dataset and compares combinations of encoding, clustering method, and hyperparameters across a predefined set of metrics. The dataset is split into discovery and validation datasets, and the clustering results are reported on both. Optionally, a set of reports can be included to visualize the results.

See also: How to perform clustering analysis

For more details on choosing the clustering algorithm and its hyperparameters, see the paper: Ullmann, T., Hennig, C., & Boulesteix, A.-L. (2022). Validation of cluster analysis results on validation data: A systematic framework. WIREs Data Mining and Knowledge Discovery, 12(3), e1444. https://doi.org/10.1002/widm.1444

Specification arguments:

  • dataset (str): name of the dataset to be clustered

  • metrics (list): a list of metrics used to compare clustering algorithms and encodings; it can include internal evaluation metrics, which need no labels, and external evaluation metrics, which compare the clusters against a list of predefined labels; supported metrics include adjusted_rand_score, completeness_score, homogeneity_score, and silhouette_score; for the full list, see scikit-learn’s documentation of clustering metrics at https://scikit-learn.org/stable/api/sklearn.metrics.html#module-sklearn.metrics.cluster

  • labels (list): an optional list of labels to use for external evaluation of clustering

  • split_config (SplitConfig): how to split the original dataset into discovery and validation data; for this parameter, specify split_strategy (leave_one_out_stratification, manual, or random), training_percentage if split_strategy is random, and manual_config or leave_one_out_config for the corresponding split strategy; all three options are illustrated here:

    split_config:
        split_strategy: manual
        manual_config:
            discovery_data: file_with_ids_of_examples_for_discovery_data.csv
            validation_data: file_with_ids_of_examples_for_validation_data.csv
    
    split_config:
        split_strategy: random
        training_percentage: 0.5
        split_count: 3 # repeat the random split 3 times -> 3 discovery and 3 validation datasets
    
    split_config:
        split_strategy: leave_one_out_stratification
        leave_one_out_config:
            parameter: subject_id # name of the parameter to split on; must be present in the metadata
            min_count: 1 # minimum number of examples that can be present in the validation dataset
    
  • clustering_settings (list): a list where each element represents a ClusteringSetting: a combination of an encoding, an optional dimensionality reduction algorithm, and the clustering algorithm to be evaluated

  • reports (list): a list of reports to be run on the clustering results or the encoded data

  • number_of_processes (int): how many processes to use for parallelization

  • sequence_type (str): whether to do analysis on the amino_acid or nucleotide level; this value is used only if nothing is specified on the encoder level

  • region_type (str): which part of the receptor sequence to analyze (e.g., IMGT_CDR3); this value is used only if nothing is specified on the encoder level

  • validation_type (list): a list of validation types to use for comparison of clustering algorithms and encodings; it can be method_based and/or result_based
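To make the random split strategy described above more concrete, here is a minimal sketch in plain Python. It is not immuneML's implementation: it assumes that training_percentage is the fraction of examples assigned to the discovery data and that each of the split_count repetitions reshuffles independently. The function name random_discovery_validation_splits and the seed argument are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def random_discovery_validation_splits(
    example_ids: Sequence[str],
    training_percentage: float,
    split_count: int,
    seed: int = 0,  # hypothetical argument, only here to make the sketch reproducible
) -> List[Tuple[List[str], List[str]]]:
    """Return split_count (discovery, validation) pairs of example ids."""
    if not 0 < training_percentage < 1:
        raise ValueError("training_percentage must lie strictly between 0 and 1")
    rng = random.Random(seed)
    # assumption: training_percentage is the share of examples used for discovery
    n_discovery = round(len(example_ids) * training_percentage)
    splits = []
    for _ in range(split_count):
        shuffled = list(example_ids)
        rng.shuffle(shuffled)  # independent shuffle for every repetition
        splits.append((shuffled[:n_discovery], shuffled[n_discovery:]))
    return splits


splits = random_discovery_validation_splits(
    [f"example_{i}" for i in range(10)], training_percentage=0.5, split_count=3
)
for discovery, validation in splits:
    print(len(discovery), len(validation))
```

Within each repetition the discovery and validation ids are disjoint and together cover the whole dataset, which is the property the manual strategy expects the two id files to provide as well.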

YAML specification:

instructions:
    my_clustering_instruction:
        type: Clustering
        dataset: d1
        metrics: [adjusted_rand_score, adjusted_mutual_info_score]
        labels: [epitope, v_call]
        sequence_type: amino_acid
        region_type: imgt_cdr3
        validation_type: [method_based, result_based]
        split_config:
            split_count: 1
            split_strategy: manual
            manual_config:
                discovery_data: file_with_ids_of_examples_for_discovery_data.csv
                validation_data: file_with_ids_of_examples_for_validation_data.csv
        clustering_settings:
            - encoding: e1
              dim_reduction: pca
              method: k_means1
            - encoding: e2
              method: dbscan
        reports: [rep1, rep2]
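
The external evaluation metrics named in the specification above, such as adjusted_rand_score, compare cluster assignments against predefined labels. As an illustration of what such a metric measures, the sketch below computes the adjusted Rand index from scratch using the standard contingency-table formula. It is a stand-alone sketch for intuition only: the function name adjusted_rand_index is hypothetical, and in practice the scikit-learn function of the same metric name would be used.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(labels_true: Sequence[Hashable], labels_pred: Sequence[Hashable]) -> float:
    """Adjusted Rand index computed from the contingency table of two partitions."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both label sequences must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0
    # number of example pairs that share a contingency cell, a true label, or a cluster
    pairs_in_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_in_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_in_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = pairs_in_true * pairs_in_pred / total_pairs  # agreement expected by chance
    maximum = (pairs_in_true + pairs_in_pred) / 2
    if maximum == expected:  # degenerate case, e.g. one cluster in both partitions
        return 1.0
    return (pairs_in_cells - expected) / (maximum - expected)


epitope = ["A", "A", "B", "B"]  # hypothetical label values
clusters = [1, 1, 0, 0]         # same grouping, different cluster ids
print(adjusted_rand_index(epitope, clusters))
```

Because the score depends only on which examples are grouped together, renaming the cluster ids does not change it; a score near zero indicates agreement no better than chance.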
run(result_path: Path)[source]

Execute the clustering instruction workflow.

immuneML.workflows.instructions.clustering.ClusteringReportHandler module

class immuneML.workflows.instructions.clustering.ClusteringReportHandler.ClusteringReportHandler(reports: List[Report])[source]

Bases: object

Manages report generation for clustering results.

run_clustering_reports(state: ClusteringState)[source]

Generate overall clustering reports.

run_item_reports(cl_item: ClusteringItem, analysis_desc: str, run_id: int, path: Path, state: ClusteringState) -> list[source]

Generate reports for individual clustering items.

immuneML.workflows.instructions.clustering.ClusteringRunner module

class immuneML.workflows.instructions.clustering.ClusteringRunner.ClusteringRunner(config: ClusteringConfig, n_processes: int, report_handler: ClusteringReportHandler)[source]

Bases: object

Handles core clustering operations like fitting, prediction and evaluation.

evaluate_clustering(predictions: DataFrame, cl_setting: ClusteringSetting, features) -> Dict[str, Path][source]
run_all_settings(dataset: Dataset, analysis_desc: str, path: Path, run_id: int, predictions_df: DataFrame, state: ClusteringState)[source]
run_setting(dataset: Dataset, cl_setting: ClusteringSetting, analysis_desc: str, path: Path, run_id: int, predictions_df: DataFrame, state: ClusteringState) -> Tuple[ClusteringItemResult, DataFrame][source]
immuneML.workflows.instructions.clustering.ClusteringRunner.encode_dataset(dataset: Dataset, cl_setting: ClusteringSetting, number_of_processes: int, label_config: LabelConfiguration, learn_model: bool, sequence_type: SequenceType, region_type: RegionType, encoder: DatasetEncoder = None)[source]
immuneML.workflows.instructions.clustering.ClusteringRunner.encode_dataset_internal(dataset: Dataset, cl_setting: ClusteringSetting, number_of_processes: int, label_config: LabelConfiguration, learn_model: bool, sequence_type: SequenceType, region_type: RegionType, encoder: DatasetEncoder = None)[source]
immuneML.workflows.instructions.clustering.ClusteringRunner.get_features(dataset: Dataset, cl_setting: ClusteringSetting)[source]

Get features from encoded dataset.

immuneML.workflows.instructions.clustering.ClusteringState module

class immuneML.workflows.instructions.clustering.ClusteringState.ClusteringConfig(name: str, dataset: immuneML.data_model.datasets.Dataset.Dataset, metrics: List[str], split_config: immuneML.hyperparameter_optimization.config.SplitConfig.SplitConfig, validation_type: List[str], clustering_settings: List[immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringSetting], region_type: immuneML.data_model.SequenceParams.RegionType = <RegionType.IMGT_CDR3: 'cdr3'>, label_config: immuneML.environment.LabelConfiguration.LabelConfiguration = None, sequence_type: immuneML.environment.SequenceType.SequenceType = <SequenceType.AMINO_ACID: 'sequence_aa'>)[source]

Bases: object

clustering_settings: List[ClusteringSetting]
dataset: Dataset
label_config: LabelConfiguration = None
metrics: List[str]
name: str
region_type: RegionType = 'cdr3'
sequence_type: SequenceType = 'sequence_aa'
split_config: SplitConfig
validation_type: List[str]
class immuneML.workflows.instructions.clustering.ClusteringState.ClusteringItemResult(item: immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringItem, report_results: List[immuneML.reports.ReportResult.ReportResult] = <factory>)[source]

Bases: object

item: ClusteringItem
report_results: List[ReportResult]
class immuneML.workflows.instructions.clustering.ClusteringState.ClusteringResultPerRun(run_id: int, run_type: str, items: Dict[str, immuneML.workflows.instructions.clustering.ClusteringState.ClusteringItemResult] = <factory>)[source]

Bases: object

get_cl_item(cl_setting: str | ClusteringSetting)[source]
items: Dict[str, ClusteringItemResult]
run_id: int
run_type: str
class immuneML.workflows.instructions.clustering.ClusteringState.ClusteringResults(discovery: immuneML.workflows.instructions.clustering.ClusteringState.ClusteringResultPerRun = None, method_based_validation: immuneML.workflows.instructions.clustering.ClusteringState.ClusteringResultPerRun = None, result_based_validation: immuneML.workflows.instructions.clustering.ClusteringState.ClusteringResultPerRun = None)[source]

Bases: object

discovery: ClusteringResultPerRun = None
method_based_validation: ClusteringResultPerRun = None
result_based_validation: ClusteringResultPerRun = None
class immuneML.workflows.instructions.clustering.ClusteringState.ClusteringState(name: str, config: immuneML.workflows.instructions.clustering.ClusteringState.ClusteringConfig, result_path: pathlib.Path = None, clustering_items: List[immuneML.workflows.instructions.clustering.ClusteringState.ClusteringResults] = <factory>, predictions_paths: List[Dict[str, pathlib.Path]] = None, discovery_datasets: List[immuneML.data_model.datasets.Dataset.Dataset] = None, validation_datasets: List[immuneML.data_model.datasets.Dataset.Dataset] = None, clustering_report_results: List[immuneML.reports.ReportResult.ReportResult] = <factory>)[source]

Bases: object

add_cl_result_per_run(run_id: int, analysis_desc: str, cl_item_result: ClusteringResultPerRun)[source]
clustering_items: List[ClusteringResults]
clustering_report_results: List[ReportResult]
config: ClusteringConfig
discovery_datasets: List[Dataset] = None
name: str
predictions_paths: List[Dict[str, Path]] = None
result_path: Path = None
validation_datasets: List[Dataset] = None
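
The <factory> defaults in the signatures above are how Python's dataclasses module displays fields declared with default_factory, which suggests that these state classes are dataclasses. The sketch below uses simplified stand-in classes (ItemResult and ResultPerRun are hypothetical, not the immuneML classes) to show why a factory is used for mutable defaults such as lists and dicts.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ItemResult:  # hypothetical stand-in for ClusteringItemResult
    item: str
    report_results: List[str] = field(default_factory=list)


@dataclass
class ResultPerRun:  # hypothetical stand-in for ClusteringResultPerRun
    run_id: int
    run_type: str
    items: Dict[str, ItemResult] = field(default_factory=dict)


discovery = ResultPerRun(run_id=0, run_type="discovery")
validation = ResultPerRun(run_id=0, run_type="validation")

# each instance gets its own dict, so adding to one run does not leak into the other
discovery.items["e1_k_means1"] = ItemResult(item="fitted clustering for e1 + k_means1")
print(len(discovery.items), len(validation.items))
```

A plain `= {}` or `= []` default would be rejected by the dataclass decorator precisely because it would be shared between instances.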

immuneML.workflows.instructions.clustering.ValidationHandler module

class immuneML.workflows.instructions.clustering.ValidationHandler.ValidationHandler(config: ClusteringConfig, runner: ClusteringRunner, report_handler: ClusteringReportHandler, num_of_processes: int)[source]

Bases: object

Handles different validation strategies for clustering.

run_method_based_validation(dataset: Dataset, run_id: int, path: Path, predictions_df: DataFrame, state: ClusteringState)[source]

Run method-based validation.

run_result_based_validation(dataset: Dataset, run_id: int, path: Path, predictions_df: DataFrame, state: ClusteringState)[source]

Run result-based validation by training a classifier on discovery clusters.

immuneML.workflows.instructions.clustering.ValidationHandler.get_complementary_classifier(cl_setting: ClusteringSetting)[source]

Returns a complementary classifier based on the clustering method.

Parameters:

cl_setting – ClusteringSetting object containing the clustering method configuration

Returns:

An instance of the appropriate classifier; NearestCentroid if no matches are found
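
As an illustration of result-based validation with the fallback classifier mentioned above, the sketch below implements a nearest-centroid rule in plain Python: centroids are computed from the discovery clusters, and each validation example is assigned to the closest one. This is a conceptual sketch that assumes Euclidean distance on numeric feature vectors; the function names fit_centroids and predict_nearest_centroid and the toy data are hypothetical, and immuneML itself returns a classifier object rather than using this code.

```python
from collections import defaultdict
from math import dist
from typing import Dict, Hashable, List, Sequence

Vector = Sequence[float]


def fit_centroids(features: Sequence[Vector], cluster_ids: Sequence[Hashable]) -> Dict[Hashable, List[float]]:
    """Average the feature vectors of each discovery cluster."""
    grouped = defaultdict(list)
    for vector, cluster_id in zip(features, cluster_ids):
        grouped[cluster_id].append(vector)
    return {
        cluster_id: [sum(column) / len(vectors) for column in zip(*vectors)]
        for cluster_id, vectors in grouped.items()
    }


def predict_nearest_centroid(centroids: Dict[Hashable, List[float]], features: Sequence[Vector]) -> List[Hashable]:
    """Assign each example to the cluster whose centroid is closest (Euclidean distance)."""
    return [min(centroids, key=lambda cid: dist(centroids[cid], vector)) for vector in features]


# toy discovery data: two well-separated clusters found by some clustering method
discovery_features = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
discovery_clusters = [0, 0, 1, 1]
centroids = fit_centroids(discovery_features, discovery_clusters)

# transfer the discovery clustering to unseen validation examples
validation_features = [(0.5, 0.2), (5.5, 4.0)]
transferred = predict_nearest_centroid(centroids, validation_features)
print(transferred)
```

The transferred cluster ids give the validation data a clustering derived from the discovery result, which can then be scored with the configured metrics.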

immuneML.workflows.instructions.clustering.clustering_run_model module

class immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringItem(dataset: immuneML.data_model.datasets.Dataset.Dataset = None, method: immuneML.ml_methods.clustering.ClusteringMethod.ClusteringMethod = None, encoder: immuneML.encodings.DatasetEncoder.DatasetEncoder = None, internal_performance: immuneML.workflows.instructions.clustering.clustering_run_model.DataFrameWrapper = None, external_performance: immuneML.workflows.instructions.clustering.clustering_run_model.DataFrameWrapper = None, predictions: numpy.ndarray = None, cl_setting: immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringSetting = None)[source]

Bases: object

cl_setting: ClusteringSetting = None
dataset: Dataset = None
encoder: DatasetEncoder = None
external_performance: DataFrameWrapper = None
internal_performance: DataFrameWrapper = None
method: ClusteringMethod = None
predictions: ndarray = None
class immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringSetting(encoder: immuneML.encodings.DatasetEncoder.DatasetEncoder, encoder_params: dict, encoder_name: str, clustering_method: immuneML.ml_methods.clustering.ClusteringMethod.ClusteringMethod, clustering_params: dict, clustering_method_name: str, dim_reduction_method: immuneML.ml_methods.dim_reduction.DimRedMethod.DimRedMethod = None, dim_red_params: dict = None, dim_red_name: str = None, path: pathlib.Path = None)[source]

Bases: object

clustering_method: ClusteringMethod
clustering_method_name: str
clustering_params: dict
dim_red_name: str = None
dim_red_params: dict = None
dim_reduction_method: DimRedMethod = None
encoder: DatasetEncoder
encoder_name: str
encoder_params: dict
get_key() -> str[source]
path: Path = None
class immuneML.workflows.instructions.clustering.clustering_run_model.DataFrameWrapper(path: Path, df: DataFrame = None)[source]

Bases: object

get_df()[source]

Module contents