immuneML.workflows.instructions.clustering package
Submodules
immuneML.workflows.instructions.clustering.ClusteringInstruction module
- class immuneML.workflows.instructions.clustering.ClusteringInstruction.ClusteringInstruction(dataset: Dataset, metrics: List[str], clustering_settings: List[ClusteringSetting], name: str, label_config: LabelConfiguration = None, reports: List[Report] = None, number_of_processes: int = None, sample_config: SampleConfig = None, stability_config: StabilityConfig = None, sequence_type: SequenceType = None, region_type: RegionType = None)[source]¶
Bases: Instruction
The Clustering instruction fits clustering methods to the provided encoded dataset and compares combinations of encoding, clustering method, and hyperparameters across a predefined set of metrics. It reports results either for the full discovery dataset or for multiple subsets of the discovery data as a way to assess the stability of the different metrics (Liu et al., 2022; Dangl and Leisch, 2020; Lange et al., 2004). It can also run a set of reports to visualize the results.
See also: How to perform clustering analysis for more details on the clustering procedure.
References:
Lange, T., Roth, V., Braun, M. L., & Buhmann, J. M. (2004). Stability-Based Validation of Clustering Solutions. Neural Computation, 16(6), 1299–1323. https://doi.org/10.1162/089976604773717621
Dangl, R., & Leisch, F. (2020). Effects of Resampling in Determining the Number of Clusters in a Data Set. Journal of Classification, 37(3), 558–583. https://doi.org/10.1007/s00357-019-09328-2
Liu, T., Yu, H., & Blair, R. H. (2022). Stability estimation for unsupervised clustering: A review. WIREs Computational Statistics, 14(6), e1575. https://doi.org/10.1002/wics.1575
Specification arguments:
dataset (str): name of the dataset to be clustered
metrics (list): a list of metrics to use for comparison of clustering algorithms and encodings (it can include metrics for either internal evaluation if no labels are provided or metrics for external evaluation so that the clusters can be compared against a list of predefined labels); some of the supported metrics include adjusted_rand_score, completeness_score, homogeneity_score, silhouette_score; for the full list, see scikit-learn’s documentation of clustering metrics at https://scikit-learn.org/stable/api/sklearn.metrics.html#module-sklearn.metrics.cluster.
labels (list): an optional list of labels to use for external evaluation of clustering
sample_config (SampleConfig): configuration describing how to construct the data subsets to estimate different clustering settings’ performance with different internal and external validation indices; with parameters percentage, split_count, random_seed:
sample_config: # make 5 subsets with 80% of the data each
    split_count: 5
    percentage: 0.8
    random_seed: 42
stability_config (StabilityConfig): configuration describing how to compute clustering stability; currently, clustering stability is computed following the approach of Lange et al. (2004), and the configuration takes only the number of repetitions (split_count) and a random seed (random_seed) as parameters. Other strategies for computing clustering stability will be added in the future.
stability_config:
    split_count: 5 # number of times to repeat clustering for stability estimation
    random_seed: 12
clustering_settings (list): a list where each element represents a ClusteringSetting: a combination of an encoding, an optional dimensionality reduction algorithm, and the clustering algorithm that will be evaluated
reports (list): a list of reports to be run on the clustering results or the encoded data
number_of_processes (int): how many processes to use for parallelization
sequence_type (str): whether to do analysis on the amino_acid or nucleotide level; this value is used only if nothing is specified on the encoder level
region_type (str): which part of the receptor sequence to analyze (e.g., IMGT_CDR3); this value is used only if nothing is specified on the encoder level
YAML specification:
instructions:
    my_clustering_instruction:
        type: Clustering
        dataset: d1
        metrics: [adjusted_rand_score, adjusted_mutual_info_score]
        labels: [epitope, v_call]
        sequence_type: amino_acid
        region_type: imgt_cdr3
        sample_config:
            split_count: 5
            percentage: 0.8
            random_seed: 42
        stability_config:
            split_count: 5
            random_seed: 12
        clustering_settings:
            - encoding: e1
              dim_reduction: pca
              method: k_means1
            - encoding: e2
              method: dbscan
        reports: [rep1, rep2]
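The sample_config parameters (split_count, percentage, random_seed) can be illustrated with a minimal sketch. It assumes that each subset is drawn independently and without replacement from the full dataset; whether immuneML samples exactly this way is not stated here, and the function name make_subsets is hypothetical.

```python
import random
from typing import List


def make_subsets(n_examples: int, split_count: int, percentage: float,
                 random_seed: int) -> List[List[int]]:
    """Draw `split_count` subsets of example indices, each holding
    `percentage` of the data (assumption: sampling without replacement)."""
    rng = random.Random(random_seed)  # seeded so the subsets are reproducible
    subset_size = round(n_examples * percentage)
    return [sorted(rng.sample(range(n_examples), subset_size))
            for _ in range(split_count)]


# Mirrors the YAML above: 5 subsets with 80% of the data each.
subsets = make_subsets(n_examples=100, split_count=5, percentage=0.8, random_seed=42)
print(len(subsets), len(subsets[0]))
```

Each clustering setting would then be fitted once per subset, so that the spread of a metric across subsets gives an impression of how stable it is.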
immuneML.workflows.instructions.clustering.ClusteringReportHandler module
- class immuneML.workflows.instructions.clustering.ClusteringReportHandler.ClusteringReportHandler(reports: List[Report])[source]¶
Bases: object
Manages report generation for clustering results.
- run_clustering_reports(state: ClusteringState)[source]¶
Generate overall clustering reports.
- run_item_reports(cl_item: ClusteringItem, run_id: int, path: Path, state: ClusteringState) list[source]¶
Generate reports for individual clustering items.
immuneML.workflows.instructions.clustering.ClusteringState module
- class immuneML.workflows.instructions.clustering.ClusteringState.ClusteringConfig(name: str, dataset: immuneML.data_model.datasets.Dataset.Dataset, metrics: List[str], sample_config: immuneML.hyperparameter_optimization.config.SampleConfig.SampleConfig, stability_config: immuneML.workflows.instructions.clustering.ClusteringState.StabilityConfig, clustering_settings: List[immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringSetting], region_type: immuneML.data_model.SequenceParams.RegionType = <RegionType.IMGT_CDR3: 'cdr3'>, label_config: immuneML.environment.LabelConfiguration.LabelConfiguration = None, sequence_type: immuneML.environment.SequenceType.SequenceType = <SequenceType.AMINO_ACID: 'sequence_aa'>)[source]¶
Bases: object
- clustering_settings: List[ClusteringSetting]
- get_cl_setting_by_key(key: str) -> ClusteringSetting
- label_config: LabelConfiguration = None
- metrics: List[str]
- name: str
- region_type: RegionType = 'cdr3'
- sample_config: SampleConfig
- sequence_type: SequenceType = 'sequence_aa'
- stability_config: StabilityConfig
- class immuneML.workflows.instructions.clustering.ClusteringState.ClusteringItemResult(item: immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringItem, report_results: List[immuneML.reports.ReportResult.ReportResult] = <factory>)[source]¶
Bases: object
- item: ClusteringItem
- report_results: List[ReportResult]
- class immuneML.workflows.instructions.clustering.ClusteringState.ClusteringResultPerRun(run_id: int, items: Dict[str, immuneML.workflows.instructions.clustering.ClusteringState.ClusteringItemResult] = <factory>)[source]¶
Bases: object
- get_cl_item(cl_setting: str | ClusteringSetting)
- items: Dict[str, ClusteringItemResult]
- run_id: int
- class immuneML.workflows.instructions.clustering.ClusteringState.ClusteringState(name: str, config: immuneML.workflows.instructions.clustering.ClusteringState.ClusteringConfig, result_path: pathlib.Path = None, clustering_items: List[immuneML.workflows.instructions.clustering.ClusteringState.ClusteringResultPerRun] = <factory>, predictions_paths: List[pathlib.Path] = None, subsampled_datasets: List[immuneML.data_model.datasets.Dataset.Dataset] = None, clustering_report_results: List[immuneML.reports.ReportResult.ReportResult] = <factory>, metrics_performance_paths: Dict[str, pathlib.Path] = <factory>, optimal_settings_on_discovery: Dict[str, immuneML.workflows.instructions.clustering.ClusteringState.ClusteringItemResult] = <factory>, final_predictions_path: pathlib.Path = None, best_settings_zip_paths: Dict[str, Dict[str, Any]] = <factory>)[source]¶
Bases: object
- add_cl_result_per_run(run_id: int, cl_item_result: ClusteringResultPerRun)
- best_settings_zip_paths: Dict[str, Dict[str, Any]]
- clustering_items: List[ClusteringResultPerRun]
- clustering_report_results: List[ReportResult]
- config: ClusteringConfig
- final_predictions_path: Path = None
- metrics_performance_paths: Dict[str, Path]
- name: str
- optimal_settings_on_discovery: Dict[str, ClusteringItemResult]
- predictions_paths: List[Path] = None
- result_path: Path = None
immuneML.workflows.instructions.clustering.ValidateClusteringInstruction module
- class immuneML.workflows.instructions.clustering.ValidateClusteringInstruction.ValidateClusteringInstruction(clustering_item: ClusteringItem, dataset: Dataset, metrics: List[str], validation_type: List[str], label_config: LabelConfiguration = None, sequence_type: SequenceType = SequenceType.AMINO_ACID, region_type: RegionType = RegionType.IMGT_CDR3, number_of_processes: int = 1, reports: List[Report] = None, name: str = 'validate_clustering', result_path: Path = None)[source]¶
Bases: Instruction
The ValidateClustering instruction applies the chosen clustering setting (preprocessing, encoding, and clustering, with all hyperparameters) to a new dataset for validation.
For more details on validating the clustering algorithm and its hyperparameters, see the paper: Ullmann, T., Hennig, C., & Boulesteix, A.-L. (2022). Validation of cluster analysis results on validation data: A systematic framework. WIREs Data Mining and Knowledge Discovery, 12(3), e1444. https://doi.org/10.1002/widm.1444
Specification arguments:
clustering_config_path (str): path to the clustering exported by the Clustering instruction that will be applied to the new dataset
dataset (str): name of the validation dataset to which the clustering will be applied, as defined under definitions
metrics (list): a list of metrics to use for comparison of clustering algorithms and encodings (it can include metrics for either internal evaluation if no labels are provided or metrics for external evaluation so that the clusters can be compared against a list of predefined labels); some of the supported metrics include adjusted_rand_score, completeness_score, homogeneity_score, silhouette_score; for the full list, see scikit-learn’s documentation of clustering metrics at https://scikit-learn.org/stable/api/sklearn.metrics.html#module-sklearn.metrics.cluster.
validation_type (list): how to perform validation; options are method_based validation (refit the clustering algorithm to the new dataset and compare the clusterings) and result_based validation (transfer the clustering from the original dataset to the validation dataset using a supervised classifier and compare the clusterings)
reports (list): a list of reports to run on the validation results; supported report types include:
ClusteringMethodReport: reports that analyze the clustering method results (e.g., ClusteringVisualization)
EncodingReport: reports that analyze the encoded dataset
DataReport: reports that analyze the raw dataset
YAML specification:
instructions:
    validate_clustering_inst:
        type: ValidateClustering
        clustering_config_path: /path/to/exported_clustering.zip
        dataset: val_dataset
        metrics: [adjusted_rand_score, silhouette_score]
        validation_type: [method_based, result_based]
        reports: [cluster_vis, encoding_report]
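The difference between the two validation_type options can be shown on toy data. The sketch below is only an illustration on one-dimensional points: two_means_1d stands in for a real clustering method and one_nn_predict for the supervised classifier, and both names are hypothetical rather than part of immuneML.

```python
from typing import List


def two_means_1d(points: List[float], iterations: int = 10) -> List[int]:
    """Toy 2-cluster k-means on 1-D data (stand-in for a clustering method)."""
    centers = [min(points), max(points)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to the closer center.
        labels = [0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
                  for p in points]
        # Move each center to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels


def one_nn_predict(train_x: List[float], train_y: List[int], x: float) -> int:
    """Give x the label of its nearest training point (1-nearest-neighbor)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


discovery = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
validation = [0.9, 1.1, 5.1, 4.8]
discovery_labels = two_means_1d(discovery)

# method_based: refit the same clustering method on the validation data.
method_based = two_means_1d(validation)

# result_based: transfer the discovery clusters to the validation data
# through a classifier trained on the discovery cluster assignments.
result_based = [one_nn_predict(discovery, discovery_labels, x) for x in validation]

print(method_based, result_based)
```

In both cases the resulting validation clustering is then compared with the discovery clustering using the chosen metrics.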
- run(result_path: Path) ValidateClusteringState[source]¶
- class immuneML.workflows.instructions.clustering.ValidateClusteringInstruction.ValidateClusteringState(cl_item: immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringItem = None, dataset: immuneML.data_model.datasets.Dataset.Dataset = None, metrics: List[str] = None, validation_type: List[str] = None, result_path: pathlib.Path = None, name: str = 'validate_clustering', label_config: immuneML.environment.LabelConfiguration.LabelConfiguration = None, sequence_type: immuneML.environment.SequenceType.SequenceType = <SequenceType.AMINO_ACID: 'sequence_aa'>, region_type: immuneML.data_model.SequenceParams.RegionType = <RegionType.IMGT_CDR3: 'cdr3'>, number_of_processes: int = 1, method_based_result: immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringItem = None, result_based_result: immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringItem = None, predictions_path: pathlib.Path = None, method_based_report_results: List[immuneML.reports.ReportResult.ReportResult] = <factory>, result_based_report_results: List[immuneML.reports.ReportResult.ReportResult] = <factory>, data_report_results: List[immuneML.reports.ReportResult.ReportResult] = <factory>)[source]¶
Bases: object
- cl_item: ClusteringItem = None
- data_report_results: List[ReportResult]
- label_config: LabelConfiguration = None
- method_based_report_results: List[ReportResult]
- method_based_result: ClusteringItem = None
- metrics: List[str] = None
- name: str = 'validate_clustering'
- number_of_processes: int = 1
- predictions_path: Path = None
- region_type: RegionType = 'cdr3'
- result_based_report_results: List[ReportResult]
- result_based_result: ClusteringItem = None
- result_path: Path = None
- sequence_type: SequenceType = 'sequence_aa'
- validation_type: List[str] = None
immuneML.workflows.instructions.clustering.ValidationHandler module
immuneML.workflows.instructions.clustering.clustering_run_model module
- class immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringItem(dataset: immuneML.data_model.datasets.Dataset.Dataset = None, method: immuneML.ml_methods.clustering.ClusteringMethod.ClusteringMethod = None, encoder: immuneML.encodings.DatasetEncoder.DatasetEncoder = None, dim_red_method: immuneML.ml_methods.dim_reduction.DimRedMethod.DimRedMethod = None, internal_performance: immuneML.workflows.instructions.clustering.clustering_run_model.DataFrameWrapper = None, external_performance: immuneML.workflows.instructions.clustering.clustering_run_model.DataFrameWrapper = None, predictions: numpy.ndarray = None, cl_setting: immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringSetting = None, classifier: immuneML.ml_methods.classifiers.MLMethod.MLMethod = None)[source]¶
Bases: object
- cl_setting: ClusteringSetting = None
- dim_red_method: DimRedMethod = None
- encoder: DatasetEncoder = None
- external_performance: DataFrameWrapper = None
- internal_performance: DataFrameWrapper = None
- method: ClusteringMethod = None
- predictions: ndarray = None
- class immuneML.workflows.instructions.clustering.clustering_run_model.ClusteringSetting(encoder: immuneML.encodings.DatasetEncoder.DatasetEncoder, encoder_params: dict, encoder_name: str, clustering_method: immuneML.ml_methods.clustering.ClusteringMethod.ClusteringMethod, clustering_params: dict, clustering_method_name: str, dim_reduction_method: immuneML.ml_methods.dim_reduction.DimRedMethod.DimRedMethod = None, dim_red_params: dict = None, dim_red_name: str = None, path: pathlib.Path = None)[source]¶
Bases: object
- clustering_method: ClusteringMethod
- clustering_method_name: str
- clustering_params: dict
- dim_red_name: str = None
- dim_red_params: dict = None
- dim_reduction_method: DimRedMethod = None
- encoder: DatasetEncoder
- encoder_name: str
- encoder_params: dict
- path: Path = None
immuneML.workflows.instructions.clustering.clustering_runner module
- immuneML.workflows.instructions.clustering.clustering_runner.apply_cluster_classifier(dataset: Dataset, cl_setting: ClusteringSetting, classifier, encoder: DatasetEncoder, dim_red_method: DimRedMethod, predictions_path: Path, number_of_processes: int, sequence_type: SequenceType, region_type: RegionType) ClusteringItem[source]¶
- immuneML.workflows.instructions.clustering.clustering_runner.encode_dataset(dataset: Dataset, cl_setting: ClusteringSetting, number_of_processes: int, label_config: LabelConfiguration, learn_model: bool, sequence_type: SequenceType, region_type: RegionType, encoder: DatasetEncoder = None, dim_red_method: DimRedMethod = None)[source]¶
Encode a dataset using the specified clustering setting’s encoder. Results are cached based on parameters.
- Parameters:
dataset – The dataset to encode
cl_setting – The clustering setting containing encoder configuration
number_of_processes – Number of processes for parallelization
label_config – Label configuration
learn_model – Whether to learn the encoder model or use existing
sequence_type – Sequence type for encoding
region_type – Region type for encoding
encoder – Optional pre-configured encoder
dim_red_method – Optional pre-configured dimensionality reduction method
- Returns:
Encoded dataset, encoder, and dimensionality reduction method
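The idea of caching results based on parameters can be sketched in a few lines. This is a generic illustration, not immuneML's caching mechanism: params_key, cached_call, and fake_encode are hypothetical names, and the sketch assumes that the parameters are JSON-serializable.

```python
import hashlib
import json
from typing import Any, Callable, Dict

_cache: Dict[str, Any] = {}


def params_key(params: dict) -> str:
    """Build a stable cache key from JSON-serializable parameters."""
    payload = json.dumps(params, sort_keys=True)  # key order must not matter
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_call(params: dict, compute: Callable[[dict], Any]) -> Any:
    """Return the cached result for `params`, computing it only on a miss."""
    key = params_key(params)
    if key not in _cache:
        _cache[key] = compute(params)
    return _cache[key]


calls = []


def fake_encode(params: dict) -> str:
    """Stand-in for an expensive encoding step."""
    calls.append(params)  # record each real computation
    return f"encoded:{params['dataset']}:{params['encoder']}"


first = cached_call({"dataset": "d1", "encoder": "e1"}, fake_encode)
# Same parameters in a different order: served from the cache.
second = cached_call({"encoder": "e1", "dataset": "d1"}, fake_encode)
print(first == second, len(calls))
```

Under this pattern, rerunning a clustering setting with an unchanged dataset and encoder configuration would skip the encoding step.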
- immuneML.workflows.instructions.clustering.clustering_runner.encode_dataset_internal(dataset: Dataset, cl_setting: ClusteringSetting, number_of_processes: int, label_config: LabelConfiguration, learn_model: bool, sequence_type: SequenceType, region_type: RegionType, encoder: DatasetEncoder = None, dim_red_method: DimRedMethod = None) Tuple[Dataset, DatasetEncoder, DimRedMethod][source]¶
Internal function to encode a dataset (called by encode_dataset with caching).
- Parameters:
dataset – The dataset to encode
cl_setting – The clustering setting containing encoder configuration
number_of_processes – Number of processes for parallelization
label_config – Label configuration
learn_model – Whether to learn the encoder model or use existing
sequence_type – Sequence type for encoding
region_type – Region type for encoding
encoder – Optional pre-configured encoder
dim_red_method – Optional pre-configured dimensionality reduction method
- Returns:
Encoded dataset with optional dimensionality reduction
- immuneML.workflows.instructions.clustering.clustering_runner.eval_external_metrics(predictions_df: DataFrame, cl_setting: ClusteringSetting, metrics: List[str], label_config: LabelConfiguration, predictions_col_name: str = None) Path | None[source]¶
- immuneML.workflows.instructions.clustering.clustering_runner.eval_internal_metrics(predictions_df: DataFrame, cl_setting: ClusteringSetting, features, metrics: List[str], predictions_col_name: str = None) Path[source]¶
- immuneML.workflows.instructions.clustering.clustering_runner.evaluate_clustering(predictions_df: DataFrame, cl_setting: ClusteringSetting, features, metrics: List[str], label_config: LabelConfiguration, cl_item: ClusteringItem, predictions_col_name: str = None) ClusteringItem[source]¶
Evaluate clustering results using internal and external metrics.
- Parameters:
predictions_col_name – name of the predictions column in predictions_df
predictions_df – DataFrame containing predictions and labels
cl_setting – The clustering setting used
features – Feature matrix for internal metrics
metrics – List of metric names to compute
label_config – Label configuration for external metrics
cl_item – Clustering item to evaluate and update with performance csv files
- Returns:
Updated ClusteringItem with performance CSV file paths
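As an example of an external metric, the adjusted Rand index compares cluster assignments against predefined labels and does not depend on how the clusters are named. The sketch below reimplements the standard formula by hand to show what the score measures; it is not the code path used by immuneML, and the function name adjusted_rand_index is hypothetical.

```python
from collections import Counter
from math import comb
from typing import Sequence


def adjusted_rand_index(labels_true: Sequence, labels_pred: Sequence) -> float:
    """Adjusted Rand index computed from the contingency table of two labelings."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treat as perfect agreement
    # Pairs of examples placed together in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs placed together in each labeling separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (maximum - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1]))
```

Internal metrics such as silhouette_score need no labels and are instead computed from the feature matrix and the cluster assignments.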
- immuneML.workflows.instructions.clustering.clustering_runner.fit_and_predict(dataset: Dataset, method: ClusteringMethod) ndarray[source]¶
Fit clustering method and get predictions.
- immuneML.workflows.instructions.clustering.clustering_runner.get_complementary_classifier(cl_setting: ClusteringSetting)[source]¶
Returns a complementary classifier based on the clustering method.
- Parameters:
cl_setting – ClusteringSetting object containing the clustering method configuration
- Returns:
An instance of the appropriate classifier; kNN if no matches are found
- immuneML.workflows.instructions.clustering.clustering_runner.get_features(dataset: Dataset, cl_setting: ClusteringSetting)[source]¶
Get features from encoded dataset.
- immuneML.workflows.instructions.clustering.clustering_runner.run_all_settings(dataset: Dataset, clustering_settings: List[ClusteringSetting], path: Path, predictions_df: DataFrame, metrics: List[str], label_config: LabelConfiguration, number_of_processes: int, sequence_type: SequenceType, region_type: RegionType, report_handler=None, run_id: int = None, state=None) Tuple[Dict, DataFrame][source]¶
Run all clustering settings on a dataset and collect results.
- Parameters:
dataset – The dataset to cluster
clustering_settings – List of clustering settings to evaluate
path – Output path for results
predictions_df – DataFrame to store predictions
metrics – List of metric names to compute
label_config – Label configuration for external metrics
number_of_processes – Number of processes for parallelization
sequence_type – Sequence type for encoding
region_type – Region type for encoding
report_handler – Optional report handler for running item reports
run_id – Optional run identifier
state – Optional clustering state for report handler
- Returns:
Tuple of (clustering_items dict, updated predictions_df)
- immuneML.workflows.instructions.clustering.clustering_runner.run_setting(dataset: Dataset, cl_setting: ClusteringSetting, path: Path, predictions_df: DataFrame, metrics: List[str], label_config: LabelConfiguration, number_of_processes: int, sequence_type: SequenceType, region_type: RegionType, report_handler=None, run_id: int = None, state=None, evaluate: bool = True) Tuple[ClusteringItemResult, DataFrame][source]¶
Run a single clustering setting on a dataset.
- Parameters:
dataset – The dataset to cluster
cl_setting – The clustering setting to use
path – Output path for results
predictions_df – DataFrame to store predictions
metrics – List of metric names to compute
label_config – Label configuration for external metrics
number_of_processes – Number of processes for parallelization
sequence_type – Sequence type for encoding
region_type – Region type for encoding
report_handler – Optional report handler for running item reports
run_id – Optional run identifier
state – Optional clustering state for report handler
evaluate – Whether to compute internal/external evaluation metrics
- Returns:
Tuple of (ClusteringItemResult, updated predictions_df)
- immuneML.workflows.instructions.clustering.clustering_runner.train_cluster_classifier(cl_item: ClusteringItem)[source]¶