Merged
2 changes: 1 addition & 1 deletion .github/workflows/release.yml
@@ -40,6 +40,6 @@ jobs:
run: |
TAG=${{ github.ref }}
VERSION=${TAG#refs/tags/v}
gh release create -R JelmerBot/plscan -t "Version $VERSION" -n "**Full Changelog**: https://github.com/JelmerBot/plscan/commits/$TAG" "$TAG" dist/*.whl dist/*.tar.gz
gh release create -R JelmerBot/fast_plscan -t "Version $VERSION" -n "**Full Changelog**: https://github.com/JelmerBot/fast_plscan/commits/$TAG" "$TAG" dist/*.whl dist/*.tar.gz
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
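The `VERSION=${TAG#refs/tags/v}` line strips the `refs/tags/v` prefix from the pushed ref. As a rough illustration only, the same derivation can be sketched in Python; `version_from_ref` is a hypothetical helper name that does not appear in the workflow, and the example refs are made up.

```python
def version_from_ref(ref: str, prefix: str = "refs/tags/v") -> str:
    """Mimic the shell expansion ${TAG#refs/tags/v}.

    Hypothetical helper for illustration; the workflow itself relies on
    plain shell parameter expansion.
    """
    # Like ${VAR#pattern}, leave the value untouched when the prefix is absent.
    if ref.startswith(prefix):
        return ref[len(prefix):]
    return ref


if __name__ == "__main__":
    print(version_from_ref("refs/tags/v0.1.0"))
    print(version_from_ref("refs/heads/main"))
```

Only `VERSION` is stripped; in the workflow `$TAG` keeps the full ref and is passed to `gh release create` as written.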
14 changes: 8 additions & 6 deletions README.md
@@ -1,11 +1,13 @@
[![PyPi version](https://badge.fury.io/py/plscan.svg)](https://badge.fury.io/py/plscan)
![Conda version](https://anaconda.org/conda-forge/plscan/badges/version.svg)
[![PyPi version](https://badge.fury.io/py/fast-plscan.svg)](https://badge.fury.io/py/fast-plscan)
![Conda version](https://anaconda.org/conda-forge/fast-plscan/badges/version.svg)
[![Repository DOI](https://zenodo.org/badge/xxx.svg)](https://zenodo.org/doi/xxx/zenodo.yyy)

# Persistent Leaf Spatial Clustering for Applications with Noise
# Persistent Leaves Spatial Clustering for Applications with Noise

This library provides a new clustering algorithm based on HDBSCAN*. The primary
advantages of PLSCAN over the standard ``hdbscan`` library are:
advantages of PLSCAN over the
[``hdbscan``](https://github.com/scikit-learn-contrib/hdbscan) and
[``fast_hdbscan``](https://github.com/TutteInstitute/fast_hdbscan) libraries are:

- PLSCAN automatically finds the optimal minimum cluster size.
- PLSCAN can easily use all available cores to speed up computation.
@@ -22,7 +24,7 @@ stable clusters.
import numpy as np
import matplotlib.pyplot as plt

from plscan import PLSCAN
from fast_plscan import PLSCAN

data = np.load("docs/data/data.npy")

@@ -186,4 +188,4 @@ When using this work, please cite our (upcoming) preprint:

## Licensing

The ``plscan`` package has a 3-Clause BSD license.
The ``fast-plscan`` package has a 3-Clause BSD license.
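Because the import name changes from `plscan` to `fast_plscan`, code that has to run against both the old and the new package could fall back from one name to the other. The sketch below is one possible approach, not something this change provides; `import_first` is a hypothetical helper.

```python
import importlib
from types import ModuleType
from typing import Sequence


def import_first(names: Sequence[str]) -> ModuleType:
    """Return the first importable module from `names` (hypothetical helper)."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # ModuleNotFoundError is a subclass of ImportError; try the next name.
            continue
    raise ImportError(f"none of {list(names)} could be imported")


# Assumed usage during a migration: prefer the new name, fall back to the old.
# plscan_module = import_first(["fast_plscan", "plscan"])
# PLSCAN = plscan_module.PLSCAN
```

New code can simply use `from fast_plscan import PLSCAN`, as the updated README does.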
4 changes: 2 additions & 2 deletions docs/_paper_figures.ipynb
@@ -2,19 +2,19 @@
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"id": "098248f7",
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"import numpy as np\n",
"import pandas as pd\n",
"from plscan import PLSCAN\n",
"from hdbscan import HDBSCAN\n",
"from itertools import product\n",
"from collections import defaultdict\n",
"\n",
"from fast_plscan import PLSCAN\n",
"from lib.plotting import *\n",
"from lib.drawing import regplot_lowess_ci\n",
"\n",
4 changes: 2 additions & 2 deletions docs/_trial_barcodes.ipynb
@@ -40,7 +40,7 @@
"import matplotlib.pyplot as plt\n",
"from gtda.homology import VietorisRipsPersistence\n",
"\n",
"from plscan import PLSCAN\n",
"from fast_plscan import PLSCAN\n",
"\n",
"warnings.filterwarnings(\"ignore\", category=FutureWarning)\n",
"plt.rcParams[\"figure.dpi\"] = 150\n",
@@ -82,7 +82,7 @@
"metadata": {},
"outputs": [],
"source": [
"from plscan._api import compute_cluster_labels\n",
"from fast_plscan._api import compute_cluster_labels\n",
"\n",
"l, p = compute_cluster_labels(c._leaf_tree, c._condensed_tree, np.array([18]).astype(np.uint32), n)\n",
"i = np.where(l >= 0)[0]"
16 changes: 8 additions & 8 deletions docs/_trial_persistence_pruning.ipynb

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions docs/conf.py
@@ -21,13 +21,13 @@

# -- Project information -----------------------------------------------------

project = "plscan"
project = "fast_plscan"
copyright = "2025, Jelmer Bot"
author = "Jelmer Bot"

# -- General configuration ---------------------------------------------------

release = get_version("plscan")
release = get_version("fast_plscan")
version = ".".join(release.split(".")[:2])
master_doc = "index"
templates_path = ["_templates"]
@@ -61,4 +61,4 @@
# -- Options for HTML output -------------------------------------------------

html_theme = "furo"
htmlhelp_basename = "plscan_doc"
htmlhelp_basename = "fast_plscan_doc"
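The `version` line in `conf.py` keeps only the first two dot-separated fields of `release`. A minimal sketch of that expression in isolation follows; `short_version` is a hypothetical wrapper name, and the sample string is the development version printed in the benchmark notebook of this change.

```python
def short_version(release: str) -> str:
    """Keep the first two dot-separated fields, as conf.py does for `version`."""
    return ".".join(release.split(".")[:2])


if __name__ == "__main__":
    # Development builds carry extra fields (dev counter, commit hash, date).
    print(short_version("0.1.dev33+gdfffaf8.d20251217"))
    print(short_version("1.2.3"))
```

The slice works for development builds as well, since the extra fields always come after the major and minor numbers.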
38 changes: 19 additions & 19 deletions docs/demo_computational_performance.ipynb
@@ -7,17 +7,17 @@
"source": [
"# Performance Benchmarks\n",
"\n",
"The ``plscan`` library is a re-implementation of the ``fast_hdbscan`` library,\n",
"while retaining most of the features from the classic ``hdbscan`` library. In\n",
"this notebook, we compare the computational performance of these libraries.\n",
"Since all three implementations use *space trees*, we are particularly\n",
"interested in understanding how their performance changes as the number of\n",
"dimensions increases."
"The ``fast-plscan`` library is a re-implementation of the ``fast-hdbscan``\n",
"library, while retaining most of the features from the classic ``hdbscan``\n",
"library. In this notebook, we compare the computational performance of these\n",
"libraries. Since all three implementations use *space trees*, we are\n",
"particularly interested in understanding how their performance changes as the\n",
"number of dimensions increases."
]
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 1,
"id": "32686d54",
"metadata": {},
"outputs": [],
@@ -31,10 +31,10 @@
"from tqdm import tqdm\n",
"from sklearn.datasets import make_blobs\n",
"\n",
"from plscan import PLSCAN\n",
"from hdbscan import HDBSCAN\n",
"from sklearn.cluster import KMeans\n",
"from fast_hdbscan import HDBSCAN as FastHDBSCAN\n",
"from sklearn.cluster import KMeans\n",
"from fast_plscan import PLSCAN\n",
"\n",
"plt.rcParams[\"figure.dpi\"] = 150\n",
"plt.rcParams[\"figure.figsize\"] = (2.75, 0.618 * 2.75)"
@@ -50,16 +50,16 @@
"name": "stdout",
"output_type": "stream",
"text": [
"plscan 0.1.dev28+g9eb67a7\n",
"fast-plscan 0.1.dev33+gdfffaf8.d20251217\n",
"hdbscan 0.8.40\n",
"fast_hdbscan 0.2.2\n",
"fast-hdbscan 0.2.2\n",
"scikit-learn 1.7.0\n"
]
}
],
"source": [
"from importlib.metadata import version\n",
"for name in ['plscan', 'hdbscan', 'fast_hdbscan', 'scikit-learn']:\n",
"for name in ['fast-plscan', 'hdbscan', 'fast-hdbscan', 'scikit-learn']:\n",
" print(name, version(name))"
]
},
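The version loop now uses hyphenated distribution names (`fast-plscan`, `fast-hdbscan`) where it previously used underscore spellings, and the recorded output shows `0.2.2` for both `fast_hdbscan` and `fast-hdbscan`. This is consistent with the name normalization rule from PEP 503, under which runs of `-`, `_`, and `.` are equivalent. The sketch below only illustrates that rule; `normalize_name` is a hypothetical helper, and how the lookup applies normalization internally is not shown in this diff.

```python
import re


def normalize_name(name: str) -> str:
    """Normalize a distribution name following the PEP 503 rule."""
    # Collapse runs of '-', '_' and '.' into a single '-', then lowercase.
    return re.sub(r"[-_.]+", "-", name).lower()


if __name__ == "__main__":
    for spelling in ["fast_plscan", "fast-plscan", "Fast.PLSCAN"]:
        print(spelling, "->", normalize_name(spelling))
```

The import name is a separate matter: Python identifiers cannot contain hyphens, so the module stays `fast_plscan`.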
@@ -204,8 +204,8 @@
"metadata": {},
"source": [
"Plotting the results, we observe considerable differences between ``hdbscan``\n",
"and the ``fast_hdbscan`` and ``plscan`` libraries on low-dimensional data. The\n",
"parallel implementations offer substantial improvements in scalability."
"and the ``fast-hdbscan`` and ``fast-plscan`` libraries on low-dimensional data.\n",
"The parallel implementations offer substantial improvements in scalability."
]
},
{
@@ -247,7 +247,7 @@
"At 10 dimensions, the advantage of parallelization remains, but the scaling\n",
"curves become steeper than that of ``kmeans``, which is largely unaffected by\n",
"the number of dimensions. Interestingly, the ball-tree implementation in\n",
"``plscan`` scales considerably worse than the kd-tree implementation.\n",
"``fast-plscan`` scales considerably worse than the kd-tree implementation.\n",
"\n",
"At 20 dimensions, the parallel implementations provide little benefit when\n",
"compared to ``kmeans``, which continues to perform efficiently regardless of\n",
@@ -256,10 +256,10 @@
"Plotting over the dimensions we see only ``kmeans`` does not need more time with\n",
"higher dimensional data. The scaling trends are a bit more complicated.\n",
"``hdbscan`` starts off steepest but appears to level off a bit. Ball-tree\n",
"``plscan`` quickly approaches the ``hdbscan`` curve with higher dimensions. The\n",
"curves for ``fast_hdbscan`` and kd-tree ``plscan`` remain fairly shallow, with\n",
"``plscan`` consistently being slightly quicker. ``kmeans`` is unaffected by the\n",
"number of dimensions, achieving quick times in all cases."
"``fast-plscan`` quickly approaches the ``hdbscan`` curve with higher dimensions.\n",
"The curves for ``fast_hdbscan`` and kd-tree ``fast-plscan`` remain fairly\n",
"shallow, with ``fast-plscan`` consistently being slightly quicker. ``kmeans`` is\n",
"unaffected by the number of dimensions, achieving quick times in all cases."
]
},
{
10 changes: 5 additions & 5 deletions docs/demo_parameter_sensitivity.ipynb
@@ -8,8 +8,8 @@
"# Parameter sensitivity analysis\n",
"\n",
"[The cluster selection strategies notebook](./demo_selection_strategies.ipynb)\n",
"demonstrates ``plscan`` is less sensitive to the ``min_samples`` parameter ($k$)\n",
"than ``hdbscan``. This notebook runs a more comprehensive parameter sensitivity\n",
"demonstrates PLSCAN is less sensitive to the ``min_samples`` parameter ($k$)\n",
"than HDBSCAN*. This notebook runs a more comprehensive parameter sensitivity\n",
"analysis to determine whether that pattern holds on other datasets. This\n",
"parameter sensitivity analysis tells us how the clustering quality changes as a\n",
"result of changing $k$. The resulting value is independent of $k$ itself.\n",
@@ -52,7 +52,7 @@
},
{
"cell_type": "code",
"execution_count": 18,
"execution_count": 1,
"id": "ef96a020",
"metadata": {},
"outputs": [],
@@ -72,8 +72,8 @@
"from umap import UMAP\n",
"from lensed_umap import embed_graph\n",
"\n",
"from plscan import PLSCAN\n",
"from hdbscan import HDBSCAN\n",
"from fast_plscan import PLSCAN\n",
"from sklearn.metrics import adjusted_rand_score, homogeneity_score, completeness_score\n",
"\n",
"from lib.plotting import sns, plt, mpl, lighten, frame_off\n",
@@ -240,7 +240,7 @@
"source": [
"Then, we create functions that evaluate the algorithms. These functions return\n",
"one or more records identifying the dataset, algorithm configuration, and\n",
"resulting quality scores. For `plscan`, we compute scores for its top-$n$\n",
"resulting quality scores. For PLSCAN, we compute scores for its top-$n$\n",
"layers:"
]
},
51 changes: 24 additions & 27 deletions docs/demo_selection_strategies.ipynb

Large diffs are not rendered by default.

24 changes: 13 additions & 11 deletions docs/index.rst
@@ -24,10 +24,10 @@
:maxdepth: 1
:hidden:

_autosummary/plscan
_autosummary/plscan.plots
_autosummary/plscan._api
_autosummary/plscan._helpers
_autosummary/fast_plscan
_autosummary/fast_plscan.plots
_autosummary/fast_plscan._api
_autosummary/fast_plscan._helpers

.. toctree::
:caption: Development
@@ -42,7 +42,9 @@ Persistent Leaves Spatial Clustering of Applications with Noise
===============================================================

This library provides a new clustering algorithm based on HDBSCAN*. The primary
advantages of PLSCAN over the standard ``hdbscan`` library are:
advantages of PLSCAN over the `hdbscan
<https://github.com/scikit-learn-contrib/hdbscan>`_ and `fast_hdbscan
<https://github.com/TutteInstitute/fast_hdbscan>`_ libraries are:

- PLSCAN automatically finds the optimal minimum cluster size.
- PLSCAN can easily use all available cores to speed up computation;
@@ -60,7 +62,7 @@ stable clusters.
import numpy as np
import matplotlib.pyplot as plt

from plscan import PLSCAN
from fast_plscan import PLSCAN

data = np.load("docs/data/data.npy")

@@ -144,11 +146,11 @@ When using this work, please cite our (upcoming) preprint:
Licensing
---------

The ``plscan`` package has a 3-Clause BSD license.
The ``fast-plscan`` package has a 3-Clause BSD license.

.. |PyPI version| image:: https://badge.fury.io/py/plscan.svg
:target: https://badge.fury.io/py/plscan
.. |Conda version| image:: https://anaconda.org/conda-forge/plscan/badges/version.svg
:target: https://anaconda.org/conda-forge/plscan
.. |PyPI version| image:: https://badge.fury.io/py/fast-plscan.svg
:target: https://badge.fury.io/py/fast-plscan
.. |Conda version| image:: https://anaconda.org/conda-forge/fast-plscan/badges/version.svg
:target: https://anaconda.org/conda-forge/fast-plscan
.. |DOI badge| image:: https://zenodo.org/badge/xxx.svg
:target: https://zenodo.org/doi/xxx/zenodo.yyy
4 changes: 2 additions & 2 deletions docs/reference_plscan.rst
@@ -6,5 +6,5 @@ The public API
:toctree: _autosummary
:recursive:

plscan
plscan.plots
fast_plscan
fast_plscan.plots
20 changes: 10 additions & 10 deletions docs/using_basic_api.ipynb
@@ -7,10 +7,10 @@
"source": [
"# Basic usage\n",
"\n",
"You can use ``plscan`` as a drop in replacement for ``hdbscan`` for basic usage\n",
"of ``hdbscan``. That is, ``plscan`` mostly uses the same interface as\n",
"``hdbscan`` -- so whenever you are using the parts of ``hdbscan`` that\n",
"``plscan`` supports then you can simply swap them."
"You can use PLSCAN as a drop in replacement for HDBSCAN for basic usage. That\n",
"is, ``fast-plscan`` mostly uses the same interface as ``hdbscan`` -- so whenever\n",
"you are using the parts of ``hdbscan`` that ``fast-plscan`` supports then you\n",
"can simply swap them."
]
},
{
@@ -67,10 +67,10 @@
"id": "9fcf9ff8",
"metadata": {},
"source": [
"To use ``plscan`` you can apply it exactly as you would ``hdbscan``. In this\n",
"case, no parameter values are needed to find a usable clustering because PLSCAN\n",
"will automatically find the minimum cluster size that produces \"optimal\" leaf\n",
"clusters."
"To use ``fast-plscan`` you can apply it exactly as you would ``hdbscan``. In\n",
"this case, no parameter values are needed to find a usable clustering because\n",
"PLSCAN will automatically find the minimum cluster size that produces \"optimal\"\n",
"leaf clusters."
]
},
{
@@ -91,7 +91,7 @@
}
],
"source": [
"from plscan import PLSCAN\n",
"from fast_plscan import PLSCAN\n",
"\n",
"labels = PLSCAN().fit_predict(data)\n",
"\n",
@@ -106,7 +106,7 @@
"id": "b337de5c",
"metadata": {},
"source": [
"And that's all there is to it in terms of getting started with ``plscan``."
"And that's all there is to it in terms of getting started with PLSCAN."
]
}
],
15 changes: 8 additions & 7 deletions docs/using_bi_persistences.ipynb

Large diffs are not rendered by default.

17 changes: 9 additions & 8 deletions docs/using_exploration_plots.ipynb
@@ -7,7 +7,7 @@
"source": [
"# Exploration plots\n",
"\n",
"``plscan`` contains several attributes for exploring and plotting its the cluster\n",
"PLSCAN contains several attributes for exploring and plotting its cluster\n",
"hierarchy."
]
},
@@ -20,7 +20,7 @@
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from plscan import PLSCAN\n",
"from fast_plscan import PLSCAN\n",
"\n",
"plt.rcParams[\"figure.dpi\"] = 150\n",
"plt.rcParams[\"figure.figsize\"] = (2.75, 0.618 * 2.75)\n",
@@ -35,9 +35,9 @@
"source": [
"### Condensed tree\n",
"\n",
"Like ``hdbscan``, ``plscan`` has a condensed tree showing the cluster hierarchy\n",
"along data point distances. Unlike ``hdbscan``, ``plscan`` uses distances rather\n",
"than density estimates to plot the tree."
"Like HDBSCAN, PLSCAN has a condensed tree showing the cluster hierarchy along\n",
"data point distances. Unlike HDBSCAN, PLSCAN uses distances rather than density\n",
"estimates to plot the tree."
]
},
{
@@ -103,8 +103,9 @@
"source": [
"### Leaf tree\n",
"\n",
"``plscan`` does not extract clusters from the condensed tree directly. Instead, it\n",
"first creates a cluster hierarchy describing which clusters exist at each minimum cluster size value. The `leaf_tree_` attribute contains this hierarchy:"
"PLSCAN does not extract clusters from the condensed tree directly. Instead, it\n",
"first creates a cluster hierarchy describing which clusters exist at each\n",
"minimum cluster size value. The `leaf_tree_` attribute contains this hierarchy:"
]
},
{
@@ -170,7 +171,7 @@
"source": [
"### Persistence trace\n",
"\n",
"``plscan`` extracts clusters from the leaf tree by computing each leaf cluster's\n",
"PLSCAN extracts clusters from the leaf tree by computing each leaf cluster's\n",
"persistence and finding the minimum cluster size with the highest leaf-cluster\n",
"persistence sum. The `persistence_trace_` attribute can plot this minimum\n",
"cluster size -- persistence trace."
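The persistence-trace paragraph describes selecting the minimum cluster size whose leaf clusters have the highest summed persistence. A toy sketch of that selection step follows. It assumes the per-leaf persistences are already available as a plain dictionary; the function name, the data layout, and the numbers are all hypothetical, and this is not the library's implementation.

```python
from typing import Dict, List, Tuple


def best_min_cluster_size(
    leaf_persistences: Dict[int, List[float]],
) -> Tuple[int, float]:
    """Pick the minimum cluster size with the largest total leaf persistence.

    `leaf_persistences` maps each candidate minimum cluster size to the
    persistences of the leaf clusters that exist at that size (toy layout).
    """
    # The "trace": one summed persistence value per minimum cluster size.
    trace = {size: sum(values) for size, values in leaf_persistences.items()}
    # On ties, max() keeps the first size in the dictionary's insertion order.
    best = max(trace, key=lambda size: trace[size])
    return best, trace[best]


if __name__ == "__main__":
    toy = {5: [0.25, 0.125], 10: [0.5, 0.25], 20: [0.5]}
    print(best_min_cluster_size(toy))
```

In the library, the `persistence_trace_` attribute plots this trace, according to the notebook text.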