Skip to content

Latest commit

 

History

History
249 lines (188 loc) · 16 KB

README.md

File metadata and controls

249 lines (188 loc) · 16 KB

1. About GPmix

GPmix is a clustering algorithm for functional data that are generated from Gaussian process mixtures. Although designed for Gaussian process mixtures, our experimental study demonstrated that GPmix works well even for functional data that are not from Gaussian process mixtures.

The main steps of the algorithm are:

  • Smoothing: Apply smoothing methods on the raw data to get continuous functions.
  • Projection: Project the functional data onto a few randomly generated functions.
  • Learning GMMs: For each projection function, learn a univariate Gaussian mixture model from the projection coefficients.
  • Ensemble: Extract a consensus clustering from the multiple GMMs.

If you used this package in your research, please cite it:

@InProceedings{pmlr-v235-akeweje24a,
  title = 	 {Learning Mixtures of {G}aussian Processes through Random Projection},
  author =       {Akeweje, Emmanuel and Zhang, Mimi},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {720--739},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
}

2. Code Example

This quick start guide will demonstrate how to use the package with the CBF dataset, one of the real-world datasets tested in our paper. Follow these steps to prepare the dataset for analysis:

import numpy as np
data = np.concatenate([np.loadtxt('CBF\CBF_TEST.txt'), np.loadtxt('CBF\CBF_TRAIN.txt')])
X, y = data[:,1:], data[:,0]

To use the GPmix algorithm in your project, start by importing the necessary modules. The following import statements will include all the main functionalities from the GPmix package, as well as the specific utility for estimating the number of clusters:

from GPmix import *
from GPmix.misc import estimate_nclusters

Smoothing

Begin by initializing the Smoother object, specifying the type of basis for the smoothing process. You can customize the smoothing by passing additional configurations through the basis_params argument. If not specified, the system will automatically determine the best configurations using methods like Random Grid Search and Generalized Cross Validation. After initialization, apply the fit method to the raw data to obtain the fitted functional data.

For this demonstration, we will use the Fourier basis.

sm = Smoother(basis= 'fourier')
fd = sm.fit(X)
fd.plot(group = y)

Projection

To project the fitted functional data onto specified projection functions, use the Projector object. Initialize the Projector object with the type of projection functions and the desired number of projections. The basis_type argument specifies the type of projection functions. The n_proj argument defines the number of projections. The basis_params argument allows for further configuration of the projection functions.

For this demonstration, we will use wavelets as projection functions. We will specify the family of wavelets using basis_params. After initializing, apply the fit method to the functional data object to compute the projection coefficients. Here, we will use 14 projection functions generated from the Haar wavelet family.

proj = Projector(basis_type= 'wavelet', n_proj = 14, basis_params= {'wv_name': 'haar'})
coeffs = proj.fit(fd)

Ensemble Clustering

The UniGaussianMixtureEnsemble object facilitates ensemble clustering by fitting a univariate Gaussian Mixture Model (GMM) to each set of projection coefficients. Follow these steps:

  • Initialize the UniGaussianMixtureEnsemble object by specifying the number of clusters (n_clusters) you want to identify in your dataset.
  • Use the fit_gmms method to obtain a collection of GMMs, one for each set of projection coefficients.
  • Use the get_clustering method, which aggregates the results from the individual GMMs, to form a consensus clustering.

For this demonstration, we will identify 3 clusters in the functional data.

model = UniGaussianMixtureEnsemble(n_clusters= 3)
model.fit_gmms(coeffs)
pred_labels = model.get_clustering()

To visualize the clustering result, apply the plot_clustering method to the functional data object:

model.plot_clustering(fd)

Furthermore, the UniGaussianMixtureEnsemble object supports the calculation of several clustering validation indices. For external validation (comparing generated clusters against true labels), you can calculate Adjusted Mutual Information, Adjusted Rand Index, and Correct Classification Accuracy by passing the true labels as parameters. For internal validation (assessing the internal structure of the clusters), you can calculate the Silhouette Score and Davies-Bouldin Score by passing the functional data object as parameters. These metrics help evaluate the effectiveness of the clustering process.

For this demonstration, we calculate all the clustering validation metrics.

model.adjusted_mutual_info_score(y)   # Adjusted Mutual Information

model.adjusted_rand_score(y)    # Adjusted Rand Index

model.correct_classification_accuracy(y)    # Correct Classification Accuracy

model.silhouette_score(fd)    # Silhouette Score

model.davies_bouldin_score(fd)    # Davies-Bouldin Score

Estimating the Number of Clusters

To estimate the optimal number of clusters in the functional data, our package includes the estimate_nclusters function. This function employs a systematic search to identify the number of clusters that minimize the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC), as detailed in our paper. Here’s how you can apply this function to your data:

estimate_nclusters(fd)

Replicating the Experiment Results

The simulation scenarios investigated in our paper are available in simulations.py. To reproduce the results from the paper for each specific scenario, you will need to execute the following command after cloning the repo:

python GPmix_Clustering.py data_configs/scenario_<tag>_config.yml

Replace <tag> with the appropriate scenario identifier, which ranges from A to L. Each tag corresponds to a different configuration file located in the data_configs. By executing the command with the relevant tag, the results for that particular scenario will be replicated.

3. Package Functions

GPmix.Smoother

Apply smoothing methods on the raw data to get continuous functions.

Smoother(basis = 'bspline', basis_params = {}, domain_range = None)

Parameters

  • basis {'bspline', 'fourier', 'wavelet', 'nadaraya_watson', 'knn'} : a string specifying the smoothing method to use. The default value is 'bspline'.

  • basis_params (dict) : additional parameters for the selected smoothing method. The default value is {}. Below are examples of how to specify these parameters for different smoothing methods:

    Smoother(basis = 'bspline', basis_params = {'order': 3, 'n_basis': 20})
    Smoother(basis = 'wavelet', basis_params = {'wavelet': 'db5', 'mode': 'soft'}) 
    Smoother(basis = 'knn', basis_params = {'bandwidth': 1.0})
    Smoother(basis = 'fourier', basis_params = {'n_basis': 20, 'period': 1})

    For all smoothing methods except wavelet smoothing, if basis_params is not specified, the parameter values are determined by the Generalized Cross-Validation (GCV) technique.
    For wavelet smoothing, the wavelet shrinkage denoising technique is implemented, requiring two parameters:

    • 'wavelet': The wavelet family to use for smoothing. A list of all supported discrete wavelets can be obtained by running: print(pywt.wavelist(kind='discrete')).
    • 'mode': The method and extent of denoising. The available modes are: {'soft', 'hard', 'garrote', 'greater', 'less'}.

    For wavelet smoothing, if basis_params is not specified, the default configuration basis_params = {'wavelet': 'db5', 'mode': 'soft'} will be used.

  • domain_range (tuple) : the domain of the functions. The default value is None.
    If domain_range is set to None, the domain range will default to [0,1] if the data is array-like. If the data is an FDataGrid object, it will use the domain_range of that object.

Attributes

  • fd_smooth (FDataGrid): functional data obtained via the smoothing technique.

Methods

  • fit(X, return_data = True): Apply a smoothing method on the raw data X to get continuous functions.
    • X : raw data, array-like of shape (n_samples, n_features) or FDataGrid object.
    • return_data (bool) : Return the functional data if True. The default value is True.

GPmix.Projector

Project the functional data onto a few randomly generated functions.

Projector(basis_type, n_proj = 3, basis_params = {})

Parameters

  • basis_type {'fourier', 'fpc', 'wavelet', 'bspline', 'ou', 'rl-fpc'} : a string specifying the type of projection functions. Supported basis_type options are: eigen-functions from the fPC decomposition ('fpc'), random linear combinations of eigen-functions ('rl-fpc'), B-splines, Fourier basis, wavelets, and Ornstein-Uhlenbeck ('ou') random functions.
  • n_proj (int) : number of projection functions to generate. The default value is 3.
  • basis_params (dict) : additional hyperparameters required by 'fourier', 'bspline' and 'wavelet'. The default value is {}. Below are examples of how to specify these hyperparameters:
    Projector(basis_type = 'fourier', basis_params = {'period': 2})
    Projector(basis_type = 'bspline', basis_params = {'order': 1}) 
    Projector(basis_type = 'wavelet', basis_params = {'wv_name': 'haar', 'resolution': 1})
    For fourier, the default configuration is setting the 'period' equal to the length of the domain of the functional data. This approach works well for all datasets investigated in our paper. However, we have included the period as a hyperparameter in case users want to project onto Fourier functions with lower (or higher) oscillations. For bspline, if basis_params is not specified, the default configuration basis_params = {'order': 3} will be applied. Similarly, for wavelet, if basis_params is not specified, the default configuration basis_params = {'wv_name': 'db5', 'resolution': 1} will be applied.

Attributes

  • n_features (int) : number of evaluation points for each sample curve and for the projection functions.
  • basis (FDataGrid) : generated projection functions.
  • coefficients (array-like of shape (n_proj, sample size)) : projection coefficients.

Methods

  • fit(FDataGrid) : computing the projection coefficients. Return array-like object of shape (n_proj, sample size).
  • plot_basis() : plots the projection functions.
  • plot_projection_coeffs(**kwargs) : plots the distribution of projection coefficients. Takes kwargs from seaborn.histplot.

GPmix.UniGaussianMixtureEnsemble

For each projection function, learn a univariate Gaussian mixture model from the projection coefficients. Then extract a consensus clustering from the multiple GMMs.

UniGaussianMixtureEnsemble(n_clusters, init_method = 'kmeans', n_init = 10, mom_epsilon = 5e-2)

Parameters

  • n_clusters (int) : number of mixture components in the GMMs.
  • init_method {'kmeans', 'k-means++', 'random', 'random_from_data', 'mom'} : method for initializing the EM algorithm (for estimating GMM parameters). The default value is 'kmeans'.
  • n_init (int) : number of repeats of the EM algorithm, each with a different initilization. The algorithm returns the best GMM fit. The default value is 10.
  • mom_epsilon (float) : lower bound for GMM weights, only applicable if init_method = 'mom'. The default value is 5e-2.

Attributes

  • n_projs (int) : number of base clusterings (or projections).
  • data_size (int) : sample size.
  • gmms (list) : a list of univariate GMMs, one for each projection function.
  • clustering_weights_ (array-like of shape (n_projs,)) : weights for the base clusterings.

Methods

  • fit_gmms(projs_coeffs, n_jobs = -1, **kwargs): fit GMM to projection coefficients.
    • projs_coeffs (array-like of shape (n_proj, sample size)) : projection coefficients.
    • n_jobs : number of concurrently running jobs to parrallelize fitting the gmms. The default value is -1, to use all available CPUs.
    • kwargs : any keyword argument of joblib.Parallel.
  • plot_gmms(ncol = 4, fontsize = 12, fig_kws = { }, **kwargs): visualization of GMM fits.
    • ncol (int) : number of subplot columns. The default value is 4.
    • fontsize (int) : fontsize for the plot labels. The default value is 12.
    • fig_kws : keyword arguments for the figures (subplots). The default value is {}.
    • kwargs : other keyword arguments for customizing seaborn histplot.
  • get_clustering(weighted_sum = True, precompute_gmms = None): obtain the consensus clustering. Return array-like object of shape (sample size,), the cluster labels for the sample curves.
    • weighted_sum (bool) : specifying whether the total misclassification probability, which measures the overlap among the GMM components, should be weighted by the mixing proportion. The default value is True.
    • precompute_gmms (list) : a subset of the fitted GMMs. By default, the consensus clustering is extracted from all the fitted GMMs. This parameter allows selecting a subset of the fitted GMMs for constructing the consensus clustering.
  • plot_clustering(FDataGrid) : visualize the clustered functional data.
  • adjusted_mutual_info_score(true_labels): computing the Adjusted Mutual Information.
    • true_labels (array-like of shape (sample size,)) : true cluster labels.
  • adjusted_rand_score(true_labels): computing the Adjusted Rand Index.
    • true_labels (array-like of shape (sample size,)) : true cluster labels.
  • correct_classification_accuracy(true_labels): computing the Correct Classification Accuracy.
    • true_labels (array-like of shape (sample size,)) : true cluster labels.
  • silhouette_score(FDataGrid): computing the Silhouette Score.
  • davies_bouldin_score(FDataGrid): computing the Davies-Bouldin Score.

GPmix.misc.estimate_nclusters

The function estimate_nclusters employs a systematic search to identify the number of clusters that minimize the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC).

estimate_nclusters(fdata, ncluster_grid = None)

Parameters

  • fdata (FDataGrid) : functional data object.
  • ncluster_grid (array-like) : specifies the grid within which the number of clusters is searched. The default value is [2, 3, ..., 14].

Contributing

This project is under active development. If you find a bug, or anything that needs correction, please let us know.

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate.