immuneML.workflows.instructions.clustering package

Submodules

immuneML.workflows.instructions.clustering.ClusteringInstruction module

class immuneML.workflows.instructions.clustering.ClusteringInstruction.ClusteringInstruction(dataset: Dataset, metrics: List[str], clustering_settings: List[ClusteringSetting], name: str, label_config: LabelConfiguration = None, reports: List[Report] = None, number_of_processes: int = None, sample_config: SampleConfig = None, stability_config: StabilityConfig = None, sequence_type: SequenceType = None, region_type: RegionType = None)[source]

Bases: Instruction

The Clustering instruction fits clustering methods to the provided encoded dataset and compares combinations of encoding, clustering method, and hyperparameters across a predefined set of metrics. It provides results either for the full discovery dataset or for multiple subsets of the discovery data, as a way to assess the stability of different metrics (Liu et al., 2022; Dangl and Leisch, 2020; Lange et al., 2004). Finally, it provides options to include a set of reports to visualize the results.

See also: How to perform clustering analysis for more details on the clustering procedure.

References:

Lange, T., Roth, V., Braun, M. L., & Buhmann, J. M. (2004). Stability-Based Validation of Clustering Solutions. Neural Computation, 16(6), 1299–1323. https://doi.org/10.1162/089976604773717621

Dangl, R., & Leisch, F. (2020). Effects of Resampling in Determining the Number of Clusters in a Data Set. Journal of Classification, 37(3), 558–583. https://doi.org/10.1007/s00357-019-09328-2

Liu, T., Yu, H., & Blair, R. H. (2022). Stability estimation for unsupervised clustering: A review. WIREs Computational Statistics, 14(6), e1575. https://doi.org/10.1002/wics.1575

Specification arguments:

  • dataset (str): name of the dataset to be clustered

  • metrics (list): a list of metrics to use for comparison of clustering algorithms and encodings (it can include metrics for either internal evaluation if no labels are provided or metrics for external evaluation so that the clusters can be compared against a list of predefined labels); some of the supported metrics include adjusted_rand_score, completeness_score, homogeneity_score, silhouette_score; for the full list, see scikit-learn’s documentation of clustering metrics at https://scikit-learn.org/stable/api/sklearn.metrics.html#module-sklearn.metrics.cluster.

  • labels (list): an optional list of labels to use for external evaluation of clustering

  • sample_config (SampleConfig): configuration describing how to construct the data subsets to estimate different clustering settings’ performance with different internal and external validation indices; with parameters percentage, split_count, random_seed:

sample_config: # make 5 subsets with 80% of the data each
    split_count: 5
    percentage: 0.8
    random_seed: 42
  • stability_config (StabilityConfig): configuration describing how to compute clustering stability; currently, clustering stability is computed following the approach of Lange et al. (2004), and the configuration takes the number of repetitions (split_count) and a random seed (random_seed) as parameters. Other strategies to compute clustering stability will be added in the future.

stability_config:
    split_count: 5 # number of times to repeat clustering for stability estimation
    random_seed: 12
  • clustering_settings (list): a list where each element represents a ClusteringSetting: a combination of an encoding, an optional dimensionality reduction algorithm, and a clustering algorithm that will be evaluated

  • reports (list): a list of reports to be run on the clustering results or the encoded data

  • number_of_processes (int): how many processes to use for parallelization

  • sequence_type (str): whether to do analysis on the amino_acid or nucleotide level; this value is used only if nothing is specified on the encoder level

  • region_type (str): which part of the receptor sequence to analyze (e.g., IMGT_CDR3); this value is used only if nothing is specified on the encoder level
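
The effect of the sample_config parameters can be pictured with a small standard-library sketch. It is not immuneML's implementation: the function name make_subsets is hypothetical, and drawing each subset independently and without replacement is an assumption, since the exact sampling scheme is not stated here.

```python
import random
from typing import List


def make_subsets(example_ids: List[str], split_count: int, percentage: float,
                 random_seed: int) -> List[List[str]]:
    """Draw split_count subsets, each holding `percentage` of the examples.

    Assumption: every subset is sampled independently and without replacement.
    """
    rng = random.Random(random_seed)  # seeded so the subsets are reproducible
    subset_size = round(len(example_ids) * percentage)
    return [rng.sample(example_ids, subset_size) for _ in range(split_count)]


ids = [f"seq_{i}" for i in range(10)]
# mirrors: split_count: 5, percentage: 0.8, random_seed: 42
subsets = make_subsets(ids, split_count=5, percentage=0.8, random_seed=42)
```

Reusing the same random_seed reproduces the same subsets, which keeps the comparison between clustering settings repeatable.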

YAML specification:

instructions:
    my_clustering_instruction:
        type: Clustering
        dataset: d1
        metrics: [adjusted_rand_score, adjusted_mutual_info_score]
        labels: [epitope, v_call]
        sequence_type: amino_acid
        region_type: imgt_cdr3
        sample_config:
            split_count: 5
            percentage: 0.8
            random_seed: 42
        stability_config:
            split_count: 5
            random_seed: 12
        clustering_settings:
            - encoding: e1
              dim_reduction: pca
              method: k_means1
            - encoding: e2
              method: dbscan
        reports: [rep1, rep2]
run(result_path: Path)[source]

Main entry point: computes validation indices and estimates stability.
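
As an illustration of one external validation index, adjusted_rand_score (listed among the supported metrics) can be written out with the standard library alone. This is a sketch of the textbook formula, not the scikit-learn or immuneML code path; the function name adjusted_rand_index and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb
from typing import Sequence


def adjusted_rand_index(labels_true: Sequence, labels_pred: Sequence) -> float:
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    cells = Counter(zip(labels_true, labels_pred))  # contingency table cells
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:  # degenerate case, e.g. one cluster in both labelings
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


epitope = ["A", "A", "B", "B"]   # predefined labels
clusters = [0, 0, 1, 2]          # cluster assignments
score = adjusted_rand_index(epitope, clusters)
```

The index is invariant to how the clusters are named, so swapping cluster ids leaves the score unchanged.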

immuneML.workflows.instructions.clustering.ClusteringReportHandler module

class immuneML.workflows.instructions.clustering.ClusteringReportHandler.ClusteringReportHandler(reports: List[Report])[source]

Bases: object

Manages report generation for clustering results.

run_clustering_reports(state: ClusteringState)[source]

Generate overall clustering reports.

run_item_reports(cl_item: ClusteringItem, run_id: int, path: Path, state: ClusteringState) → list[source]

Generate reports for individual clustering items.

immuneML.workflows.instructions.clustering.ClusteringState module

class immuneML.workflows.instructions.clustering.ClusteringState.ClusteringConfig(name: str, dataset: immuneML.data_model.datasets.Dataset.Dataset, metrics: List[str], sample_config: immuneML.hyperparameter_optimization.config.SampleConfig.SampleConfig, stability_config: immuneML.workflows.instructions.clustering.ClusteringState.StabilityConfig, clustering_settings: List[immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringSetting], region_type: immuneML.data_model.SequenceParams.RegionType = <RegionType.IMGT_CDR3: 'cdr3'>, label_config: immuneML.environment.LabelConfiguration.LabelConfiguration = None, sequence_type: immuneML.environment.SequenceType.SequenceType = <SequenceType.AMINO_ACID: 'sequence_aa'>)[source]

Bases: object

clustering_settings: List[ClusteringSetting]
dataset: Dataset
get_cl_setting_by_key(key: str) → ClusteringSetting[source]
label_config: LabelConfiguration = None
metrics: List[str]
name: str
region_type: RegionType = 'cdr3'
sample_config: SampleConfig
sequence_type: SequenceType = 'sequence_aa'
stability_config: StabilityConfig
class immuneML.workflows.instructions.clustering.ClusteringState.ClusteringItemResult(item: immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringItem, report_results: List[immuneML.reports.ReportResult.ReportResult] = <factory>)[source]

Bases: object

item: ClusteringItem
report_results: List[ReportResult]
class immuneML.workflows.instructions.clustering.ClusteringState.ClusteringResultPerRun(run_id: int, items: Dict[str, immuneML.workflows.instructions.clustering.ClusteringState.ClusteringItemResult] = <factory>)[source]

Bases: object

get_cl_item(cl_setting: str | ClusteringSetting)[source]
items: Dict[str, ClusteringItemResult]
run_id: int
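
The `<factory>` default in the signature above is how Python dataclasses display a default_factory, so each instance presumably starts with its own empty dict. The sketch below mirrors the documented fields with stand-in classes; SettingStub, ResultPerRunSketch, the key format, and the body of get_cl_item are assumptions, not the library's code.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class SettingStub:
    """Hypothetical stand-in for ClusteringSetting."""
    encoder_name: str
    clustering_method_name: str

    def get_key(self) -> str:
        # Assumption: the real key format is not documented here.
        return f"{self.encoder_name}_{self.clustering_method_name}"


@dataclass
class ResultPerRunSketch:
    run_id: int
    items: Dict[str, object] = field(default_factory=dict)  # rendered as <factory>

    def get_cl_item(self, cl_setting: Union[str, SettingStub]):
        # Accept either the key itself or an object that can produce it.
        key = cl_setting if isinstance(cl_setting, str) else cl_setting.get_key()
        return self.items[key]


first, second = ResultPerRunSketch(run_id=0), ResultPerRunSketch(run_id=1)
first.items["e1_kmeans"] = "item result placeholder"
```

Because the dict comes from a factory, adding an item to one run does not leak into another run.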
class immuneML.workflows.instructions.clustering.ClusteringState.ClusteringState(name: str, config: immuneML.workflows.instructions.clustering.ClusteringState.ClusteringConfig, result_path: pathlib.Path = None, clustering_items: List[immuneML.workflows.instructions.clustering.ClusteringState.ClusteringResultPerRun] = <factory>, predictions_paths: List[pathlib.Path] = None, subsampled_datasets: List[immuneML.data_model.datasets.Dataset.Dataset] = None, clustering_report_results: List[immuneML.reports.ReportResult.ReportResult] = <factory>, metrics_performance_paths: Dict[str, pathlib.Path] = <factory>, optimal_settings_on_discovery: Dict[str, immuneML.workflows.instructions.clustering.ClusteringState.ClusteringItemResult] = <factory>, final_predictions_path: pathlib.Path = None, best_settings_zip_paths: Dict[str, Dict[str, Any]] = <factory>)[source]

Bases: object

add_cl_result_per_run(run_id: int, cl_item_result: ClusteringResultPerRun)[source]
best_settings_zip_paths: Dict[str, Dict[str, Any]]
clustering_items: List[ClusteringResultPerRun]
clustering_report_results: List[ReportResult]
config: ClusteringConfig
final_predictions_path: Path = None
metrics_performance_paths: Dict[str, Path]
name: str
optimal_settings_on_discovery: Dict[str, ClusteringItemResult]
predictions_paths: List[Path] = None
result_path: Path = None
subsampled_datasets: List[Dataset] = None
class immuneML.workflows.instructions.clustering.ClusteringState.StabilityConfig(split_count: int = None, random_seed: int = None)[source]

Bases: object

random_seed: int = None
split_count: int = None

immuneML.workflows.instructions.clustering.ValidateClusteringInstruction module

class immuneML.workflows.instructions.clustering.ValidateClusteringInstruction.ValidateClusteringInstruction(clustering_item: ClusteringItem, dataset: Dataset, metrics: List[str], validation_type: List[str], label_config: LabelConfiguration = None, sequence_type: SequenceType = SequenceType.AMINO_ACID, region_type: RegionType = RegionType.IMGT_CDR3, number_of_processes: int = 1, reports: List[Report] = None, name: str = 'validate_clustering', result_path: Path = None)[source]

Bases: Instruction

The ValidateClustering instruction applies a chosen clustering setting (preprocessing, encoding, clustering, with all hyperparameters) to a new dataset for validation.

For more details on validating the clustering algorithm and its hyperparameters, see the paper: Ullmann, T., Hennig, C., & Boulesteix, A.-L. (2022). Validation of cluster analysis results on validation data: A systematic framework. WIREs Data Mining and Knowledge Discovery, 12(3), e1444. https://doi.org/10.1002/widm.1444

Specification arguments:

  • clustering_config_path (str): path to the clustering configuration exported by the Clustering instruction that will be applied to the new dataset

  • dataset (str): name of the validation dataset to which the clustering will be applied, as defined under definitions

  • metrics (list): a list of metrics to use for comparison of clustering algorithms and encodings (it can include metrics for either internal evaluation if no labels are provided or metrics for external evaluation so that the clusters can be compared against a list of predefined labels); some of the supported metrics include adjusted_rand_score, completeness_score, homogeneity_score, silhouette_score; for the full list, see scikit-learn’s documentation of clustering metrics at https://scikit-learn.org/stable/api/sklearn.metrics.html#module-sklearn.metrics.cluster.

  • validation_type (list): how to perform validation; options are method_based validation (refit the clustering algorithm to the new dataset and compare the clusterings) and result_based validation (transfer the clustering from the original dataset to the validation dataset using a supervised classifier and compare the clusterings)

  • reports (list): a list of reports to run on the validation results; supported report types include:

    • ClusteringMethodReport: reports that analyze the clustering method results (e.g., ClusteringVisualization)

    • EncodingReport: reports that analyze the encoded dataset

    • DataReport: reports that analyze the raw dataset
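
The difference between the two validation_type options can be sketched on toy 1-D data. Everything here is a stand-in chosen for brevity: the 2-means routine and the 1-nearest-neighbour classifier are hypothetical helpers, not the clustering methods or classifiers immuneML uses.

```python
from typing import List


def two_means_1d(points: List[float], iterations: int = 10) -> List[int]:
    """Toy 2-means on 1-D data, initialised at the smallest and largest value."""
    centers = [min(points), max(points)]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1 for p in points]
        for k in (0, 1):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels


def nearest_neighbor_predict(train_x: List[float], train_y: List[int],
                             new_x: List[float]) -> List[int]:
    """1-nearest-neighbour classifier used to transfer cluster labels to new data."""
    return [train_y[min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))]
            for x in new_x]


discovery = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
validation = [0.9, 1.1, 5.1, 4.9]
discovery_clusters = two_means_1d(discovery)

# method_based: refit the same clustering algorithm on the validation data
method_based = two_means_1d(validation)
# result_based: train a classifier on the discovery clusters, then predict validation labels
result_based = nearest_neighbor_predict(discovery, discovery_clusters, validation)
```

On this well-separated toy data both routes assign the validation points to the same two groups; on real data, the two labelings would be compared with the chosen metrics.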

YAML specification:

instructions:
    validate_clustering_inst:
        type: ValidateClustering
        clustering_config_path: /path/to/exported_clustering.zip
        dataset: val_dataset
        metrics: [adjusted_rand_score, silhouette_score]
        validation_type: [method_based, result_based]
        reports: [cluster_vis, encoding_report]
run(result_path: Path) → ValidateClusteringState[source]
class immuneML.workflows.instructions.clustering.ValidateClusteringInstruction.ValidateClusteringState(cl_item: immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringItem = None, dataset: immuneML.data_model.datasets.Dataset.Dataset = None, metrics: List[str] = None, validation_type: List[str] = None, result_path: pathlib.Path = None, name: str = 'validate_clustering', label_config: immuneML.environment.LabelConfiguration.LabelConfiguration = None, sequence_type: immuneML.environment.SequenceType.SequenceType = <SequenceType.AMINO_ACID: 'sequence_aa'>, region_type: immuneML.data_model.SequenceParams.RegionType = <RegionType.IMGT_CDR3: 'cdr3'>, number_of_processes: int = 1, method_based_result: immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringItem = None, result_based_result: immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringItem = None, predictions_path: pathlib.Path = None, method_based_report_results: List[immuneML.reports.ReportResult.ReportResult] = <factory>, result_based_report_results: List[immuneML.reports.ReportResult.ReportResult] = <factory>, data_report_results: List[immuneML.reports.ReportResult.ReportResult] = <factory>)[source]

Bases: object

cl_item: ClusteringItem = None
data_report_results: List[ReportResult]
dataset: Dataset = None
label_config: LabelConfiguration = None
method_based_report_results: List[ReportResult]
method_based_result: ClusteringItem = None
metrics: List[str] = None
name: str = 'validate_clustering'
number_of_processes: int = 1
predictions_path: Path = None
region_type: RegionType = 'cdr3'
result_based_report_results: List[ReportResult]
result_based_result: ClusteringItem = None
result_path: Path = None
sequence_type: SequenceType = 'sequence_aa'
validation_type: List[str] = None

immuneML.workflows.instructions.clustering.ValidationHandler module

immuneML.workflows.instructions.clustering.clustering_run_model module

class immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringItem(dataset: immuneML.data_model.datasets.Dataset.Dataset = None, method: immuneML.ml_methods.clustering.ClusteringMethod.ClusteringMethod = None, encoder: immuneML.encodings.DatasetEncoder.DatasetEncoder = None, dim_red_method: immuneML.ml_methods.dim_reduction.DimRedMethod.DimRedMethod = None, internal_performance: immuneML.workflows.instructions.clustering.clustering_run_model.DataFrameWrapper = None, external_performance: immuneML.workflows.instructions.clustering.clustering_run_model.DataFrameWrapper = None, predictions: numpy.ndarray = None, cl_setting: immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringSetting = None, classifier: immuneML.ml_methods.classifiers.MLMethod.MLMethod = None)[source]

Bases: object

cl_setting: ClusteringSetting = None
classifier: MLMethod = None
dataset: Dataset = None
dim_red_method: DimRedMethod = None
encoder: DatasetEncoder = None
external_performance: DataFrameWrapper = None
internal_performance: DataFrameWrapper = None
method: ClusteringMethod = None
predictions: ndarray = None
class immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringSetting(encoder: immuneML.encodings.DatasetEncoder.DatasetEncoder, encoder_params: dict, encoder_name: str, clustering_method: immuneML.ml_methods.clustering.ClusteringMethod.ClusteringMethod, clustering_params: dict, clustering_method_name: str, dim_reduction_method: immuneML.ml_methods.dim_reduction.DimRedMethod.DimRedMethod = None, dim_red_params: dict = None, dim_red_name: str = None, path: pathlib.Path = None)[source]

Bases: object

clustering_method: ClusteringMethod
clustering_method_name: str
clustering_params: dict
dim_red_name: str = None
dim_red_params: dict = None
dim_reduction_method: DimRedMethod = None
encoder: DatasetEncoder
encoder_name: str
encoder_params: dict
get_key() → str[source]
path: Path = None
class immuneML.workflows.instructions.clustering.clustering_run_model.DataFrameWrapper(path: Path, df: DataFrame = None)[source]

Bases: object

get_df()[source]

immuneML.workflows.instructions.clustering.clustering_runner module

immuneML.workflows.instructions.clustering.clustering_runner.apply_cluster_classifier(dataset: Dataset, cl_setting: ClusteringSetting, classifier, encoder: DatasetEncoder, dim_red_method: DimRedMethod, predictions_path: Path, number_of_processes: int, sequence_type: SequenceType, region_type: RegionType) → ClusteringItem[source]
immuneML.workflows.instructions.clustering.clustering_runner.encode_dataset(dataset: Dataset, cl_setting: ClusteringSetting, number_of_processes: int, label_config: LabelConfiguration, learn_model: bool, sequence_type: SequenceType, region_type: RegionType, encoder: DatasetEncoder = None, dim_red_method: DimRedMethod = None)[source]

Encode a dataset using the specified clustering setting’s encoder. Results are cached based on parameters.

Parameters:
  • dataset – The dataset to encode

  • cl_setting – The clustering setting containing encoder configuration

  • number_of_processes – Number of processes for parallelization

  • label_config – Label configuration

  • learn_model – Whether to learn the encoder model or use an existing one

  • sequence_type – Sequence type for encoding

  • region_type – Region type for encoding

  • encoder – Optional pre-configured encoder

  • dim_red_method – Optional pre-configured dimensionality reduction method

Returns:

Encoded dataset, encoder, and dimensionality reduction method
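
Caching "based on parameters" can be pictured as memoization keyed by a hash of the inputs. The sketch below shows only that general idea; cache_key, memoize_by_params, and the in-memory dict are hypothetical and are not how immuneML stores its cache.

```python
import hashlib
import json
from typing import Callable, Dict

_cache: Dict[str, object] = {}  # in-memory store, for illustration only


def cache_key(params: dict) -> str:
    """Stable hash of the parameters that determine the result."""
    payload = json.dumps(params, sort_keys=True)  # sorted so key order is irrelevant
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def memoize_by_params(params: dict, compute: Callable[[], object]) -> object:
    """Return the cached result for params, computing it only on a cache miss."""
    key = cache_key(params)
    if key not in _cache:
        _cache[key] = compute()
    return _cache[key]


calls = []


def expensive_encoding() -> str:
    calls.append(1)  # record how often the work is actually done
    return "encoded dataset placeholder"


params = {"dataset": "d1", "encoder": "e1", "learn_model": True, "region_type": "cdr3"}
first = memoize_by_params(params, expensive_encoding)
# Same parameters in a different order hit the same cache entry.
second = memoize_by_params(dict(reversed(list(params.items()))), expensive_encoding)
```

Changing any parameter value yields a different key, so the encoding is recomputed rather than served from a stale entry.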

immuneML.workflows.instructions.clustering.clustering_runner.encode_dataset_internal(dataset: Dataset, cl_setting: ClusteringSetting, number_of_processes: int, label_config: LabelConfiguration, learn_model: bool, sequence_type: SequenceType, region_type: RegionType, encoder: DatasetEncoder = None, dim_red_method: DimRedMethod = None) → Tuple[Dataset, DatasetEncoder, DimRedMethod][source]

Internal function to encode a dataset (called by encode_dataset with caching).

Parameters:
  • dataset – The dataset to encode

  • cl_setting – The clustering setting containing encoder configuration

  • number_of_processes – Number of processes for parallelization

  • label_config – Label configuration

  • learn_model – Whether to learn the encoder model or use an existing one

  • sequence_type – Sequence type for encoding

  • region_type – Region type for encoding

  • encoder – Optional pre-configured encoder

  • dim_red_method – Optional pre-configured dimensionality reduction method

Returns:

Encoded dataset with optional dimensionality reduction

immuneML.workflows.instructions.clustering.clustering_runner.eval_external_metrics(predictions_df: DataFrame, cl_setting: ClusteringSetting, metrics: List[str], label_config: LabelConfiguration, predictions_col_name: str = None) → Path | None[source]
immuneML.workflows.instructions.clustering.clustering_runner.eval_internal_metrics(predictions_df: DataFrame, cl_setting: ClusteringSetting, features, metrics: List[str], predictions_col_name: str = None) → Path[source]
immuneML.workflows.instructions.clustering.clustering_runner.evaluate_clustering(predictions_df: DataFrame, cl_setting: ClusteringSetting, features, metrics: List[str], label_config: LabelConfiguration, cl_item: ClusteringItem, predictions_col_name: str = None) → ClusteringItem[source]

Evaluate clustering results using internal and external metrics.

Parameters:
  • predictions_col_name – name of the predictions column in predictions_df

  • predictions_df – DataFrame containing predictions and labels

  • cl_setting – The clustering setting used

  • features – Feature matrix for internal metrics

  • metrics – List of metric names to compute

  • label_config – Label configuration for external metrics

  • cl_item – Clustering item to evaluate and update with the paths to the performance CSV files

Returns:

Updated ClusteringItem with performance CSV file paths
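
Internal metrics need the feature matrix, while external metrics need reference labels, so a list of metric names has to be split before evaluation. The sketch below shows one way to do that split; the metric names are real scikit-learn function names, but split_metrics and the way unknown names are handled are assumptions rather than immuneML's logic.

```python
from typing import Dict, List

# Function names from sklearn.metrics: internal metrics score clusters against
# the features, external metrics score them against reference labels.
INTERNAL_METRICS = {"silhouette_score", "calinski_harabasz_score", "davies_bouldin_score"}
EXTERNAL_METRICS = {"adjusted_rand_score", "adjusted_mutual_info_score",
                    "completeness_score", "homogeneity_score", "v_measure_score"}


def split_metrics(metrics: List[str]) -> Dict[str, List[str]]:
    """Group metric names by the inputs they require, preserving their order."""
    groups: Dict[str, List[str]] = {"internal": [], "external": [], "unknown": []}
    for name in metrics:
        if name in INTERNAL_METRICS:
            groups["internal"].append(name)
        elif name in EXTERNAL_METRICS:
            groups["external"].append(name)
        else:
            groups["unknown"].append(name)  # assumption: surfaced to the caller
    return groups


groups = split_metrics(["adjusted_rand_score", "silhouette_score", "homogeneity_score"])
```

With no labels configured, only the "internal" group could be computed, which matches the note that external evaluation is optional.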

immuneML.workflows.instructions.clustering.clustering_runner.fit_and_predict(dataset: Dataset, method: ClusteringMethod) → ndarray[source]

Fit clustering method and get predictions.

immuneML.workflows.instructions.clustering.clustering_runner.get_complementary_classifier(cl_setting: ClusteringSetting)[source]

Returns a complementary classifier based on the clustering method.

Parameters:

cl_setting – ClusteringSetting object containing the clustering method configuration

Returns:

An instance of the appropriate classifier; kNN if no matches are found
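
The lookup-with-fallback behaviour described above can be sketched as a dictionary lookup with a default. The pairing table is entirely hypothetical (placeholder names, not immuneML's mapping); only the kNN fallback comes from the description.

```python
# Hypothetical pairing table: placeholder names, not the mapping immuneML uses.
COMPLEMENTARY_CLASSIFIERS = {
    "centroid_based_clustering": "nearest_centroid_classifier",
    "density_based_clustering": "radius_neighbors_classifier",
}
DEFAULT_CLASSIFIER = "kNN"  # documented fallback when no match is found


def pick_classifier(clustering_method_name: str) -> str:
    """Return the paired classifier name, or the kNN fallback if none matches."""
    return COMPLEMENTARY_CLASSIFIERS.get(clustering_method_name, DEFAULT_CLASSIFIER)


matched = pick_classifier("centroid_based_clustering")
fallback = pick_classifier("some_other_method")
```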

immuneML.workflows.instructions.clustering.clustering_runner.get_features(dataset: Dataset, cl_setting: ClusteringSetting)[source]

Get features from encoded dataset.

immuneML.workflows.instructions.clustering.clustering_runner.run_all_settings(dataset: Dataset, clustering_settings: List[ClusteringSetting], path: Path, predictions_df: DataFrame, metrics: List[str], label_config: LabelConfiguration, number_of_processes: int, sequence_type: SequenceType, region_type: RegionType, report_handler=None, run_id: int = None, state=None) → Tuple[Dict, DataFrame][source]

Run all clustering settings on a dataset and collect results.

Parameters:
  • dataset – The dataset to cluster

  • clustering_settings – List of clustering settings to evaluate

  • path – Output path for results

  • predictions_df – DataFrame to store predictions

  • metrics – List of metric names to compute

  • label_config – Label configuration for external metrics

  • number_of_processes – Number of processes for parallelization

  • sequence_type – Sequence type for encoding

  • region_type – Region type for encoding

  • report_handler – Optional report handler for running item reports

  • run_id – Optional run identifier

  • state – Optional clustering state for report handler

Returns:

Tuple of (clustering_items dict, updated predictions_df)
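
The shape of this function, a loop over settings that fills a results dict and adds predictions to a shared table, can be sketched without any dependencies. The names run_all and the "predictions_<key>" column convention are hypothetical, and a plain dict of columns stands in for the DataFrame.

```python
from typing import Callable, Dict, List, Tuple

ClusterFn = Callable[[List[float]], List[int]]


def run_all(settings: Dict[str, ClusterFn], data: List[float],
            predictions: Dict[str, List[int]]) -> Tuple[Dict[str, dict], Dict[str, List[int]]]:
    """Apply every setting, keep a result per setting key, and add one predictions column each."""
    items: Dict[str, dict] = {}
    predictions = dict(predictions)  # copy so the caller's table is not mutated
    for key, cluster_fn in settings.items():
        labels = cluster_fn(data)
        predictions[f"predictions_{key}"] = labels  # hypothetical column naming
        items[key] = {"n_clusters": len(set(labels))}
    return items, predictions


toy_settings: Dict[str, ClusterFn] = {
    "threshold_2": lambda xs: [int(x > 2) for x in xs],
    "all_in_one": lambda xs: [0] * len(xs),
}
items, table = run_all(toy_settings, [1.0, 1.5, 4.0], {})
```

Keying the results by setting makes it straightforward to look up one setting's outcome later, for example when selecting the best setting per metric.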

immuneML.workflows.instructions.clustering.clustering_runner.run_setting(dataset: Dataset, cl_setting: ClusteringSetting, path: Path, predictions_df: DataFrame, metrics: List[str], label_config: LabelConfiguration, number_of_processes: int, sequence_type: SequenceType, region_type: RegionType, report_handler=None, run_id: int = None, state=None, evaluate: bool = True) → Tuple[ClusteringItemResult, DataFrame][source]

Run a single clustering setting on a dataset.

Parameters:
  • dataset – The dataset to cluster

  • cl_setting – The clustering setting to use

  • path – Output path for results

  • predictions_df – DataFrame to store predictions

  • metrics – List of metric names to compute

  • label_config – Label configuration for external metrics

  • number_of_processes – Number of processes for parallelization

  • sequence_type – Sequence type for encoding

  • region_type – Region type for encoding

  • report_handler – Optional report handler for running item reports

  • run_id – Optional run identifier

  • state – Optional clustering state for report handler

  • evaluate – Whether to compute internal/external evaluation metrics

Returns:

Tuple of (ClusteringItemResult, updated predictions_df)

immuneML.workflows.instructions.clustering.clustering_runner.train_cluster_classifier(cl_item: ClusteringItem)[source]

Module contents