This is the new version of HEADSS: HiErArchical Data Splitting and Stitching Software for Non-Distributed Clustering Algorithms. HEADSS2 provides a more predictable API and enables parallelisation of the most compute-intensive steps of the algorithm.
For docs, go to https://headss2.readthedocs.io/en/latest/
HEADSS splits large datasets in a way that avoids introducing edge effects, and formalises the stitching process so that the stitched result covers the complete feature space.
Example split and stitch boundaries for n = 3 implementation, where n refers to the number of cuts in each feature in the base layer.
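To make the role of ``n`` more concrete, the following is a minimal sketch of how overlapping split regions could be generated. It is not HEADSS2's implementation: the helper names ``feature_intervals`` and ``split_regions`` are hypothetical, and the layout (a base layer of ``n`` equal-width intervals per feature, plus ``n - 1`` intervals shifted by half a width) is an assumption made for illustration.

```python
from itertools import product


def feature_intervals(lo, hi, n):
    """Return 2n - 1 overlapping (min, max) intervals covering [lo, hi].

    Assumption: a base layer of n equal-width intervals plus n - 1
    intervals shifted by half a width, so that every interior base-layer
    edge falls inside a shifted interval.
    """
    width = (hi - lo) / n
    # Starts advance in half-width steps, interleaving base and shifted layers.
    return [
        (lo + k * width / 2, lo + k * width / 2 + width)
        for k in range(2 * n - 1)
    ]


def split_regions(limits, n):
    """Combine per-feature intervals into regions.

    `limits` is a list of (lo, hi) pairs, one per feature. Each region is
    a tuple of (min, max) pairs, one per feature.
    """
    per_feature = [feature_intervals(lo, hi, n) for lo, hi in limits]
    return list(product(*per_feature))


regions = split_regions([(0.0, 1.0), (0.0, 1.0)], n=3)
print(len(regions))  # 25, i.e. (2 * 3 - 1) ** 2
```

Under this assumed layout, every point away from the outer boundary lies in the interior of at least one region, which is the property that lets a stitching step discard clusters near a region's edge.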
The current version supports clustering with HDBSCAN:

- McInnes L, Healy J. Accelerated Hierarchical Density Based Clustering. In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42. 2017

Data can also be split and stitched independently of the clustering step if an alternative clustering method is preferred.
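As an illustration of clustering each region independently with an arbitrary method, here is a minimal sketch. It does not use the HEADSS2 API: ``cluster_regions`` and ``single_cluster`` are hypothetical names, and the clustering callable is assumed to return one integer label per point, with ``-1`` meaning unclustered.

```python
def cluster_regions(points, regions, cluster_fn):
    """Cluster each region on its own.

    points     : list of coordinate tuples
    regions    : list of regions, each a tuple of (min, max) pairs per feature
    cluster_fn : hypothetical callable mapping a list of points to one
                 integer label per point (-1 means unclustered)

    Returns a list of (point_index, region_index, label) rows.
    """
    rows = []
    next_label = 0
    for r_idx, region in enumerate(regions):
        # Select the points that fall inside this region (bounds inclusive).
        members = [
            i for i, p in enumerate(points)
            if all(lo <= x <= hi for x, (lo, hi) in zip(p, region))
        ]
        if not members:
            continue
        labels = cluster_fn([points[i] for i in members])
        # Offset labels so cluster ids stay unique across regions.
        for i, lab in zip(members, labels):
            rows.append((i, r_idx, lab if lab < 0 else lab + next_label))
        next_label += max(labels, default=-1) + 1
    return rows


def single_cluster(pts):
    """Toy clustering function: all points in a region form one cluster."""
    return [0] * len(pts)


points = [(0.1, 0.1), (0.2, 0.2), (0.9, 0.9)]
regions = [((0.0, 0.5), (0.0, 0.5)), ((0.5, 1.0), (0.5, 1.0))]
rows = cluster_regions(points, regions, single_cluster)
print(rows)  # [(0, 0, 0), (1, 0, 0), (2, 1, 1)]
```

Any clustering routine that follows the same labelling convention could be passed as ``cluster_fn``; the per-region labels would then still need a stitching step to reconcile clusters that appear in more than one overlapping region.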
Currently, HEADSS2 can be installed from GitHub:

.. code-block:: bash

    pip install git+https://github.com/simonharnqvist/HEADSS2.git
The following example datasets are provided with HEADSS2, available as ``headss2.dataset(dataset_name)``:
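Simplified API

.. code-block:: python

    from pyspark.sql import SparkSession

    from headss2 import HEADSS2

    # Reuse or create a Spark session for the parallelised steps.
    spark = SparkSession.builder.getOrCreate()

    headss_obj = HEADSS2(
        n=2,
        min_cluster_size=10,
        min_samples=10,
        allow_single_cluster=False,
        clustering_method="eom",
        drop_unclustered=True,
        total_point_overlap_threshold=0.1,
        bound_region_point_overlap_threshold=0.5,
        min_n_overlap=10,
        spark_session=spark,
    )

    # t4_8k is assumed to be one of the example datasets, loaded beforehand.
    headss_obj.fit(t4_8k, ["x", "y"])

- Simon Harnqvist, Wide-field Astronomy Unit (WFAU), University of Edinburgh. Current maintainer and author of HEADSS2.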
- Dennis Crake, formerly of WFAU. Original author of HEADSS.
We welcome contributions in any form, particularly implementations of additional clustering algorithms. To contribute, please fork the project and submit a pull request.
If you have found a (potential) bug, or have ideas for improvements or extensions that you are not able to contribute via a PR, please open a GitHub issue.
If using HEADSS2, please cite both this repository and Dennis Crake's *Astronomy and Computing* paper below for the algorithm and original implementation:
Crake, DA, Hambly, NC & Mann, RG 2023, 'HEADSS: HiErArchical Data Splitting and Stitching software for non-distributed clustering algorithms', Astronomy and Computing, vol. 43, 100709, pp. 1-9. https://doi.org/10.1016/j.ascom.2023.100709
The hdbscan package is 3-clause BSD licensed. HEADSS2 is distributed under a BSD licence; please contact the authors for details on contributing to this code.


