HiErArchical Data Splitting and Stitching Software for Non-Distributed Clustering Algorithms
HEADSS represents a process of splitting big data to avoid the introduction of edge effects and formalise the stitching process to provide a complete feature space. The process is shown (below) with a number of example notebooks for examples of how to use, highlighting best practices and implimentations to avoid.
Example split and stitch boundaries for N = 3 implimentation, where N refers to the number of cuts in each feature in the base layer.

The current version supports clustering with:
- HDBSCAN
McInnes L, Healy J. Accelerated Hierarchical Density Based Clustering In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42. 2017
With the ability to split and stitch data while clustering independently if alternative clustering methods are preferred.
- Train.ipynb
- quick_start.ipynb
- eval.ipynb
Demonstrates in detail the processes involved with HEADSS and how to extract the split/stitching process to use an alternative clustering algorithm to HDBSCAN.
Demonstrates a quick use implimentation to explore the algorithm. A number of example datasets are included and can be called using the provided function or the user can provide thier own.
merge = headss_merge(df = data, N = 2, split_columns = ['x', 'y'], merge = True,
cluster_columns=['x','y'], min_cluster_size = 10,
min_samples = 10, cluster_method = 'leaf', allow_single_cluster = False,
total_threshold = 0.1, overlap_threshold = 0.5, minimum_members = 10)
# clustering result
merged_df = merge.members_df
Demonstrate the performance on all example datasets, producing the plots found in the paper.
To be made available using pip or conda. For now, use the provided requirements.txt or environment.yml files.
All workbooks are .ipynb, however HEADSS itself is simply .py and can be used provided the requirements are correctly installed.
To install requirements:
pip install -r requirements.txt
To set up an environment:
conda env create -f environment.yml
In some cases hdbscan is not properly installed, if this occurs please install manually after entering the environment using:
pip install hdbscan
We welcome contributions in any form but particuarly with implementations of additional clustering algorithms. To contribute please fork the project and submit a pull request. Any assistance with formatting, syntax or anything else feel free to contact the authors.
The hdbscan package is 3-clause BSD licensed.
BSD licence and contact the authors for details on contributing to this code.
