Docs: section on "Clustering"
ashvardanian committed Aug 23, 2023
1 parent da771e6 commit c99d528
Showing 1 changed file with 35 additions and 1 deletion.
36 changes: 35 additions & 1 deletion README.md
@@ -53,6 +53,7 @@ Linux • MacOS • Windows • iOS • Docker • WebAssembly
- ✅ Space-efficient point-clouds with `uint40_t`, accommodating 4B+ size.
- ✅ Compatible with OpenMP and custom "executors", for fine-grained control over CPU utilization.
- ✅ Heterogeneous lookups, renaming/relabeling, and on-the-fly deletions.
- ✅ Near-real-time [clustering and sub-clustering](#clustering) for tens or millions of clusters.
- ✅ [Semantic Search](#usearch--ai--multi-modal-semantic-search) and [Joins](#joins).

[usearch-header]: https://github.com/unum-cloud/usearch/blob/main/include/usearch/index.hpp
@@ -123,7 +124,6 @@ Comparing the performance of FAISS against USearch on 1 Million 96-dimensional v
[benchmarking]: https://github.com/unum-cloud/usearch/blob/main/docs/benchmarks.md


## User-Defined Functions

While most vector search packages concentrate on just a couple of metrics - "Inner Product distance" and "Euclidean distance," USearch extends this list to include any user-defined metrics.
@@ -208,6 +208,36 @@ multi_index = Indexes(
multi_index.search(...)
```

## Clustering

Once the index is constructed, it can be used to cluster entries much faster than with conventional clustering methods.
In essence, the `Index` itself can be seen as a clustering, and it allows iterative deepening.

```py
clustering = index.cluster(
min_count=10, # Optional
max_count=15, # Optional
threads=..., # Optional
)

# Get the clusters and their sizes
centroid_keys, sizes = clustering.centroids_popularity

# Use Matplotlib to draw a histogram
clustering.plot_centroids_popularity()

# Export a NetworkX graph of the clusters
g = clustering.network

# Get members of a specific cluster
first_members = clustering.members_of(centroid_keys[0])

# Deepen into that cluster, splitting it into more parts; all the same arguments are supported
sub_clustering = clustering.subcluster(min_count=..., max_count=...)
```
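
To picture what "the index as a clustering" and "iterative deepening" could mean, here is a toy sketch in plain Python. It is not USearch's implementation: the `cluster` and `nearest` helpers, the random 2D data, and the exhaustive nearest-centroid scan are all assumptions made for illustration. The sketch samples a few entries as centroids, assigns every entry to its nearest one, and then repeats the same step on the members of a single cluster.

```py
import math
import random


def nearest(point, centroids):
    """Return the position of the centroid closest to `point` (Euclidean)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def cluster(points, keys, count, seed=0):
    """Sample `count` keys as centroids and assign every key to its nearest one.

    Hypothetical helper: a real index would locate the nearest centroid with
    an approximate search instead of this exhaustive scan.
    """
    rng = random.Random(seed)
    centroid_keys = rng.sample(keys, count)
    centroids = [points[k] for k in centroid_keys]
    members = {k: [] for k in centroid_keys}
    for key in keys:
        members[centroid_keys[nearest(points[key], centroids)]].append(key)
    return members


# Hypothetical data: 1,000 random 2D points keyed by integers
rng = random.Random(42)
points = {key: (rng.random(), rng.random()) for key in range(1000)}

# Top-level clustering into 10 groups
top = cluster(points, list(points), count=10)

# "Deepen" into the largest cluster by clustering only its members
largest = max(top, key=lambda k: len(top[k]))
sub = cluster(points, top[largest], count=3)
```

The second call only touches the members of one cluster, which is why deepening stays cheap compared with re-clustering the whole dataset.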

Using Scikit-Learn on a 1 Million point dataset, one may expect clustering to take anywhere from minutes to hours, depending on the number of clusters you want to highlight. For 50,000 clusters, the performance difference between USearch and conventional clustering methods may easily reach 100x.
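
A rough back-of-the-envelope calculation suggests why exhaustive methods struggle at this scale. The sketch below assumes a plain Lloyd-style k-means assignment pass, which compares every point with every centroid; the 96-dimensional vector size is an assumption for illustration, and real libraries may use optimizations that lower these counts.

```py
n_points = 1_000_000   # dataset size from the text above
n_clusters = 50_000    # cluster count from the text above
n_dims = 96            # assumed dimensionality, for illustration only

# One assignment pass compares every point with every centroid
distance_evals = n_points * n_clusters
multiply_adds = distance_evals * n_dims

print(f"{distance_evals:.1e} distance evaluations per pass")  # 5.0e+10
print(f"{multiply_adds:.1e} multiply-adds per pass")          # 4.8e+12
```

An approximate index avoids most of these comparisons by visiting only a small neighborhood of candidates per point.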

## Joins, One-to-One, One-to-Many, and Many-to-Many Mappings

One of the big questions these days is how will AI change the world of databases and data management.
@@ -326,6 +356,10 @@ matches = index.search(fingerprints, 10)
[smiles]: https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system
[rdkit-fingerprints]: https://www.rdkit.org/docs/RDKit_Book.html#additional-information-about-the-fingerprints

### USearch + POI Coordinates = GIS Applications... on iOS?

With Objective-C and iOS bindings, USearch can be easily used in mobile applications.


## Integrations
