
HEADSS2

This is the new version of HEADSS: HiErArchical Data Splitting and Stitching Software for Non-Distributed Clustering Algorithms. HEADSS2 provides a more predictable API and enables parallelisation of the most compute-intensive steps of the algorithm.

For docs, go to https://headss2.readthedocs.io/en/latest/

HEADSS

HEADSS splits large datasets in a way that avoids introducing edge effects, and formalises the stitching process so that the clustered result covers the complete feature space.

Example split and stitch boundaries for an n = 3 implementation, where n is the number of cuts in each feature in the base layer.

https://user-images.githubusercontent.com/84581147/170474116-5f718b98-618d-4d61-a95c-c1c7a8012f57.png

https://user-images.githubusercontent.com/84581147/170474111-fe226e70-14d4-4408-b4f0-61451f06b48a.png
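The splitting idea can be illustrated with a minimal sketch. It assumes that each feature is cut into n base intervals and that additional intervals of the same width are offset by half a width, so that every point away from the outer edges falls inside more than one region. The half-width offset and the function names (``overlapping_intervals``, ``overlapping_regions``, ``regions_containing``) are assumptions made for illustration; they are not part of the HEADSS2 API and may not match its internal layout.

```python
from itertools import product


def overlapping_intervals(lower, upper, n):
    """Return 2n - 1 intervals of width (upper - lower) / n.

    Consecutive intervals are shifted by half a width (an assumption
    made for this sketch), so neighbouring intervals overlap.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    width = (upper - lower) / n
    step = width / 2
    return [(lower + i * step, lower + i * step + width) for i in range(2 * n - 1)]


def overlapping_regions(bounds, n):
    """Combine per-feature intervals into axis-aligned regions.

    bounds: list of (lower, upper) pairs, one per feature.
    Returns a list of regions, each a tuple of per-feature intervals.
    """
    per_feature = [overlapping_intervals(lo, hi, n) for lo, hi in bounds]
    return list(product(*per_feature))


def regions_containing(point, regions):
    """Return the indices of all regions whose closed bounds contain the point."""
    return [
        idx
        for idx, region in enumerate(regions)
        if all(lo <= x <= hi for x, (lo, hi) in zip(point, region))
    ]


regions = overlapping_regions([(0.0, 1.0), (0.0, 1.0)], n=3)
print(len(regions))  # 25, i.e. (2 * 3 - 1) ** 2
```

Because each point is covered by several regions, a cluster that is cut by the boundary of one region can still be recovered whole in a neighbouring region, which is what makes a later stitching step possible.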

The current version supports clustering with HDBSCAN:
McInnes L, Healy J. Accelerated Hierarchical Density Based Clustering In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42. 2017

The data can also be split and stitched independently of the clustering step if an alternative clustering method is preferred.

Installation

Currently, HEADSS2 can be installed from GitHub:

.. code-block:: bash

    pip install git+https://github.com/simonharnqvist/HEADSS2.git

Example datasets

The following example datasets are provided with HEADSS2 and are available as headss2.dataset(dataset_name):

https://raw.githubusercontent.com/simonharnqvist/HEADSS2/refs/heads/docs/datasets.png

Example usage

Simplified API

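.. code-block:: python

    from pyspark.sql import SparkSession

    from headss2 import HEADSS2

    # Create (or reuse) the Spark session that is passed to HEADSS2 below.
    spark = SparkSession.builder.getOrCreate()

    headss_obj = HEADSS2(
        n=2,  # number of cuts in each feature in the base layer
        min_cluster_size=10,
        min_samples=10,
        allow_single_cluster=False,
        clustering_method="eom",
        drop_unclustered=True,
        total_point_overlap_threshold=0.1,
        bound_region_point_overlap_threshold=0.5,
        min_n_overlap=10,
        spark_session=spark,
    )

    # t4_8k is a table of points with "x" and "y" columns, assumed to be loaded already.
    headss_obj.fit(t4_8k, ["x", "y"])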
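The overlap-related arguments in the call above (``total_point_overlap_threshold``, ``bound_region_point_overlap_threshold`` and ``min_n_overlap``) control stitching, but their exact definitions are not given here. As a rough illustration of the general idea only, the sketch below merges clusters found in different regions when they share enough points. The merge rule, the function names ``should_merge`` and ``stitch``, and the single ``overlap_threshold`` argument are all assumptions for illustration and should not be read as HEADSS2's actual behaviour.

```python
def should_merge(cluster_a, cluster_b, overlap_threshold=0.5, min_n_overlap=10):
    """Decide whether two clusters (sets of point IDs) describe the same structure.

    Hypothetical rule: merge when the clusters share at least min_n_overlap
    points and the shared points make up at least overlap_threshold of the
    smaller cluster.
    """
    shared = len(cluster_a & cluster_b)
    smaller = min(len(cluster_a), len(cluster_b))
    if smaller == 0 or shared < min_n_overlap:
        return False
    return shared / smaller >= overlap_threshold


def stitch(clusters, overlap_threshold=0.5, min_n_overlap=10):
    """Merge clusters transitively using a small union-find over cluster indices."""
    parent = list(range(len(clusters)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            if should_merge(clusters[i], clusters[j], overlap_threshold, min_n_overlap):
                parent[find(j)] = find(i)

    merged = {}
    for i, cluster in enumerate(clusters):
        merged.setdefault(find(i), set()).update(cluster)
    return list(merged.values())


# Two clusters from neighbouring regions that share 50 points, plus an unrelated one.
region_1_cluster = set(range(0, 100))
region_2_cluster = set(range(50, 120))
unrelated_cluster = set(range(500, 560))

stitched = stitch([region_1_cluster, region_2_cluster, unrelated_cluster])
print(sorted(len(c) for c in stitched))  # [60, 120]
```

In this toy rule, raising the overlap fraction or the minimum shared count makes merging stricter, so fewer clusters from neighbouring regions are joined.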

Contributors

  • Simon Harnqvist, Wide-field Astronomy Unit (WFAU), University of Edinburgh. Current maintainer and author of HEADSS2.
  • Dennis Crake, formerly of WFAU. Original author of HEADSS.

Contributing, bug reports, and feature requests

We welcome contributions in any form, particularly implementations of additional clustering algorithms. To contribute, please fork the project and submit a pull request.

If you have found a (potential) bug, or have ideas for improvements or extensions that you are not able to contribute via a PR, please open a GitHub issue.

Citation

If using HEADSS2, please cite both this repository and Dennis Crake's *Astronomy and Computing* paper below, which describes the algorithm and the original implementation:

Crake, DA, Hambly, NC & Mann, RG 2023, 'HEADSS: HiErArchical Data Splitting and Stitching software for non-distributed clustering algorithms', Astronomy and Computing, vol. 43, 100709, pp. 1-9. https://doi.org/10.1016/j.ascom.2023.100709

Licensing

The hdbscan package is 3-clause BSD licensed. Contact the authors for details on contributing to this code.
