This code implements the PNN-smoothing seeding method for k-means described in the paper "Systematically and efficiently improving k-means initialization by pairwise-nearest-neighbor smoothing" by C. Baldassi, TMLR, (2022) TMLR, arXiv.
The code is written in Julia. It requires Julia 1.6 or later.
It provides a multi-threaded implementation of Lloyd's algorithm, also known as k-means, with several seeding options:
- sample centroids uniformly at random from the dataset, without replacement
- kmeans++, including the "greedy" variant which is also used by scikit-learn
- furthest-point heuristic, also called "maxmin" from this paper
- kmeans‖, also called "scalable kmeans++"
- pairwise nearest-neighbor hierarchical clustering (note: this scales more than quadratically with the number of points)
- the PNN-smoothing meta-method
- the refine meta-method
It also implements a number of different techniques that can accelerate the iterations of Lloyd's algorithm:
- the naive standard one (no acceleration)
- the "reduced comparison" method from this paper
- methods based on Hamerly 2010
- methods based on Elkan 2003
- variants of the "yinyang" method from this paper
- the "exponion" method from this paper
- the "ball" method from this paper
The package also provides two functions to compute the centroid index as defined in this paper,
an asymmetric one called CI
and a symmetric one called CI_sym
. These are not exported.
It also provides a function to compute the variation of information metric to quantify the
distance between two partitions as defined in this paper. The function is called VI
and is
not exported.
To install the module, just clone it from GitHub into some directory. Then enter in such directory and run julia with the "project" option:
$ julia --project
(Alternatively, if you start Julia from some other directory, you can press ; to enter
in shell mode, cd
into the project's directory, enter in pkg mode with ] and use the
activate
command.)
The first time you do this, you will then need to setup the project's environment. To do that, when you're in the Julia REPL, press the ] key to enter in pkg mode, then resolve the dependencies:
(KMeansPNNSmoothing) pkg> resolve
This should download all the required packages. You can subsequently type test
to check that
everything works. After this, you can press the backspace key to get back to the standard Julia
prompt, and load the package:
julia> using KMeansPNNSmoothing
The format of the data must be a Matrix{Float64}
with the data points organized by column.
(Typically, this means that if you're reading a dataset you'll need to transpose it. See for
example the runfile.jl
script in the test
directory.)
Here is an example run, assuming we want to cluster a data
matrix into k
clusters with
the original kmeans++ algorithm (the {1}
type parameter deactivates the "greedy" version)
and using the "reduced yinyang" acceleration method:
result = kmeans(data, k; kmseeder=KMSeed.PlusPlus{1}(), accel=KMAccel.Ryy)
and here is an example running the PNN-smoothing scheme, using the non-greedy kmeans++ to seed the initial sub-sets (this is actually the default if no keyword arguments are passed):
result = kmeans(data, k; kmseeder=KMSeed.PNNS(KMSeed.PlusPlus{1}()))
and here is again PNN-smoothing but this time using "maxmin" at 2 levels of recursion:
result = kmeans(data, k; kmseeder=KMSeed.PNNS(KMSeed.MaxMin(), rlevel=2))
For the complete documentation you can use the Julia help (press the ? key in
the REPL, then type kmeans
, or KMSeed
or KMAccel
)
All codes are parallellized (in most cases over the data points) if there are threads
available: either run Julia with the -t
option or use the JULIA_NUM_THREADS
environment
variable.
The code is released under the MIT licence.
The code has originated from the code of RecombinatorKMeans.jl, see the licence information there.