Preparing for 0.6.0 diskannpy release (#407)
* Some early staging for README updates and pyproject updates for a 0.6.0 release for diskannpy.

* Trying to fix the CI badge to point toward main's latest build

* Updating documentation for pdoc generation

* Documentation updates. Tightened up the API to drop list support (there were entirely too many cases where it wouldn't work, and it's easier to just tell people to convert it themselves)

* Some module reorganization to make pdoc actually display the docstrings for variables re-exported at the top level

* A copy paste happened that shouldn't have.

* Updating the apps to use the new 0.6.0 api

* Addressing PR feedback

* Some of the documentation changes didn't get made in both from_file and the constructor
daxpryce authored Aug 2, 2023
1 parent 1eac702 commit 06fc0b7
Showing 19 changed files with 1,109 additions and 647 deletions.
26 changes: 17 additions & 9 deletions README.md
@@ -1,6 +1,12 @@
# DiskANN

-[![DiskANN Pull Request Build and Test](https://github.com/microsoft/DiskANN/actions/workflows/pr-test.yml/badge.svg)](https://github.com/microsoft/DiskANN/actions/workflows/pr-test.yml)
+[![DiskANN Paper](https://img.shields.io/badge/Paper-NeurIPS%3A_DiskANN-blue)](https://papers.nips.cc/paper/9527-rand-nsg-fast-accurate-billion-point-nearest-neighbor-search-on-a-single-node.pdf)
+[![DiskANN Paper](https://img.shields.io/badge/Paper-Arxiv%3A_Fresh--DiskANN-blue)](https://arxiv.org/abs/2105.09613)
+[![DiskANN Paper](https://img.shields.io/badge/Paper-Filtered--DiskANN-blue)](https://harsha-simhadri.org/pubs/Filtered-DiskANN23.pdf)
+[![DiskANN Main](https://github.com/microsoft/DiskANN/actions/workflows/push-test.yml/badge.svg?branch=main)](https://github.com/microsoft/DiskANN/actions/workflows/push-test.yml)
+[![PyPI version](https://img.shields.io/pypi/v/diskannpy.svg)](https://pypi.org/project/diskannpy/)
+[![Downloads shield](https://pepy.tech/badge/diskannpy)](https://pepy.tech/project/diskannpy)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

DiskANN is a suite of scalable, accurate and cost-effective approximate nearest neighbor search algorithms for large-scale vector search that support real-time changes and simple filters.
This code is based on ideas from the [DiskANN](https://papers.nips.cc/paper/9527-rand-nsg-fast-accurate-billion-point-nearest-neighbor-search-on-a-single-node.pdf), [Fresh-DiskANN](https://arxiv.org/abs/2105.09613) and the [Filtered-DiskANN](https://harsha-simhadri.org/pubs/Filtered-DiskANN23.pdf) papers with further improvements.
@@ -12,8 +18,6 @@ contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additio

See [guidelines](CONTRIBUTING.md) for contributing to this project.

-
-
## Linux build:

Install the following packages through apt-get
@@ -71,12 +75,16 @@ OR for Visual Studio 2017 and earlier:
```
<full-path-to-installed-cmake>\cmake ..
```
-* This will create a diskann.sln solution. Open it from VisualStudio and build either Release or Debug configuration.
-* Alternatively, use MSBuild:
+**This will create a diskann.sln solution**. Now you can:
+
+- Open it from VisualStudio and build either Release or Debug configuration.
+- `<full-path-to-installed-cmake>\cmake --build build`
+- Use MSBuild:
```
msbuild.exe diskann.sln /m /nologo /t:Build /p:Configuration="Release" /property:Platform="x64"
```
-* This will also build gperftools submodule for libtcmalloc_minimal dependency.
+
+* This will also build gperftools submodule for libtcmalloc_minimal dependency.
* Generated binaries are stored in the x64/Release or x64/Debug directories.

## Usage:
@@ -88,16 +96,16 @@ Please see the following pages on using the compiled code:
- [Commandline examples for using in-memory streaming indices](workflows/dynamic_index.md)
- [Commandline interface for building and search in memory indices with label data and filters](workflows/filtered_in_memory.md)
- [Commandline interface for building and search SSD based indices with label data and filters](workflows/filtered_ssd_index.md)
-- To be added: Python interfaces and docker files
+- [diskannpy - DiskANN as a python extension module](python/README.md)

Please cite this software in your work as:

```
@misc{diskann-github,
-author = {Simhadri, Harsha Vardhan and Krishnaswamy, Ravishankar and Srinivasa, Gopal and Subramanya, Suhas Jayaram and Antonijevic, Andrija and Pryce, Dax and Kaczynski, David and Williams, Shane and Gollapudi, Siddarth and Sivashankar, Varun and Karia, Neel and Singh, Aditi and Jaiswal, Shikhar and Mahapatro, Neelam and Adams, Philip and Tower, Bryan}},
+author = {Simhadri, Harsha Vardhan and Krishnaswamy, Ravishankar and Srinivasa, Gopal and Subramanya, Suhas Jayaram and Antonijevic, Andrija and Pryce, Dax and Kaczynski, David and Williams, Shane and Gollapudi, Siddarth and Sivashankar, Varun and Karia, Neel and Singh, Aditi and Jaiswal, Shikhar and Mahapatro, Neelam and Adams, Philip and Tower, Bryan and Patel, Yash}},
title = {{DiskANN: Graph-structured Indices for Scalable, Fast, Fresh and Filtered Approximate Nearest Neighbor Search}},
url = {https://github.com/Microsoft/DiskANN},
-version = {0.5},
+version = {0.6.0},
year = {2023}
}
```
13 changes: 11 additions & 2 deletions pyproject.toml
@@ -11,7 +11,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "diskannpy"
-version = "0.5.0.rc5"
+version = "0.6.0"

description = "DiskANN Python extension module"
readme = "python/README.md"
@@ -25,17 +25,26 @@ authors = [
{name = "Dax Pryce", email = "daxpryce@microsoft.com"}
]

+[project.optional-dependencies]
+dev = ["black", "isort", "mypy"]
+
[tool.setuptools]
package-dir = {"" = "python/src"}

+[tool.isort]
+profile = "black"
+multi_line_output = 3
+
+[tool.mypy]
+plugins = "numpy.typing.mypy_plugin"
+
[tool.cibuildwheel]
manylinux-x86_64-image = "manylinux_2_28"
test-requires = ["scikit-learn~=1.2"]
build-frontend = "build"
skip = ["pp*", "*-win32", "*-manylinux_i686", "*-musllinux*"]
test-command = "python -m unittest discover {project}/python/tests"
-

[tool.cibuildwheel.linux]
before-build = [
"dnf makecache --refresh",
31 changes: 23 additions & 8 deletions python/README.md
@@ -1,9 +1,17 @@
# diskannpy

+[![DiskANN Paper](https://img.shields.io/badge/Paper-NeurIPS%3A_DiskANN-blue)](https://papers.nips.cc/paper/9527-rand-nsg-fast-accurate-billion-point-nearest-neighbor-search-on-a-single-node.pdf)
+[![DiskANN Paper](https://img.shields.io/badge/Paper-Arxiv%3A_Fresh--DiskANN-blue)](https://arxiv.org/abs/2105.09613)
+[![DiskANN Paper](https://img.shields.io/badge/Paper-Filtered--DiskANN-blue)](https://harsha-simhadri.org/pubs/Filtered-DiskANN23.pdf)
+[![DiskANN Main](https://github.com/microsoft/DiskANN/actions/workflows/push-test.yml/badge.svg?branch=main)](https://github.com/microsoft/DiskANN/actions/workflows/push-test.yml)
+[![PyPI version](https://img.shields.io/pypi/v/diskannpy.svg)](https://pypi.org/project/diskannpy/)
+[![Downloads shield](https://pepy.tech/badge/diskannpy)](https://pepy.tech/project/diskannpy)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+
## Installation
Packages published to PyPI will always be built using the latest numpy major.minor release (at this time, 1.25).

-Conda distributions for versions 1.19-1.25 will be completed as a future effort. In the meantime, feel free to
+Conda distributions for versions 1.19-1.25 will be completed as a future effort. In the meantime, feel free to
clone this repository and build it yourself.

## Local Build Instructions
@@ -16,11 +24,18 @@ build `diskannpy` with these additional instructions.
In the root folder of DiskANN, there is a file `pyproject.toml`. You will need to edit the version of numpy in both the
`[build-system.requires]` section, as well as the `[project.dependencies]` section. The version numbers must match.
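
Because the two numpy pins must agree, a quick check before building can catch a mismatch early. The following is a minimal sketch using only the standard library. The sample `pyproject.toml` text and the `numpy==` pin format are hypothetical (this diff does not show the actual dependency lines), and `numpy_pins` and `pins_match` are illustrative helper names, not part of the project.

```python
import re

# Hypothetical pyproject.toml excerpt; the real dependency lines are not shown in this diff.
SAMPLE = """
[build-system]
requires = ["setuptools>=59.6", "numpy==1.25"]

[project]
dependencies = ["numpy==1.25"]
"""


def numpy_pins(pyproject_text):
    """Return every version pinned as numpy==<version>, in file order."""
    return re.findall(r"numpy==([0-9][0-9A-Za-z.]*)", pyproject_text)


def pins_match(pyproject_text):
    """True when at least two numpy pins exist and they are all identical."""
    pins = numpy_pins(pyproject_text)
    return len(pins) >= 2 and len(set(pins)) == 1


print(numpy_pins(SAMPLE))
print(pins_match(SAMPLE))
```

To check a real checkout, read the file with `open("pyproject.toml").read()` and pass the text to `pins_match`; adjust the regular expression if the pins use a different specifier.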

+#### Linux
```bash
-python3.11 -m venv venv # versions from python3.8 and up should work. on windows, you might need to use py -3.11 -m venv venv
-source venv/bin/activate # linux
-# or
-venv\Scripts\Activate.{ps1, bat} # windows
+python3.11 -m venv venv # versions from python3.9 and up should work
+source venv/bin/activate
+pip install build
+python -m build
+```
+
+#### Windows
+```powershell
+py -3.11 -m venv venv # versions from python3.9 and up should work
+venv\Scripts\Activate.ps1
pip install build
python -m build
```
@@ -31,10 +46,10 @@ The built wheel will be placed in the `dist` directory in your DiskANN root. Ins
Please cite this software in your work as:
```
@misc{diskann-github,
-author = {Simhadri, Harsha Vardhan and Krishnaswamy, Ravishankar and Srinivasa, Gopal and Subramanya, Suhas Jayaram and Antonijevic, Andrija and Pryce, Dax and Kaczynski, David and Williams, Shane and Gollapudi, Siddarth and Sivashankar, Varun and Karia, Neel and Singh, Aditi and Jaiswal, Shikhar and Mahapatro, Neelam and Adams, Philip and Tower, Bryan}},
+author = {Simhadri, Harsha Vardhan and Krishnaswamy, Ravishankar and Srinivasa, Gopal and Subramanya, Suhas Jayaram and Antonijevic, Andrija and Pryce, Dax and Kaczynski, David and Williams, Shane and Gollapudi, Siddarth and Sivashankar, Varun and Karia, Neel and Singh, Aditi and Jaiswal, Shikhar and Mahapatro, Neelam and Adams, Philip and Tower, Bryan and Patel, Yash}},
title = {{DiskANN: Graph-structured Indices for Scalable, Fast, Fresh and Filtered Approximate Nearest Neighbor Search}},
url = {https://github.com/Microsoft/DiskANN},
-version = {0.5},
+version = {0.6.0},
year = {2023}
}
-```
+```
29 changes: 14 additions & 15 deletions python/apps/in-mem-dynamic.py
@@ -40,26 +40,25 @@ def insert_and_search(
     npts, ndims = utils.get_bin_metadata(indexdata_file)

     if dtype_str == "float":
-        index = diskannpy.DynamicMemoryIndex(
-            "l2", np.float32, ndims, npts, Lb, graph_degree
-        )
-        queries = utils.bin_to_numpy(np.float32, querydata_file)
-        data = utils.bin_to_numpy(np.float32, indexdata_file)
+        dtype = np.float32
     elif dtype_str == "int8":
-        index = diskannpy.DynamicMemoryIndex(
-            "l2", np.int8, ndims, npts, Lb, graph_degree
-        )
-        queries = utils.bin_to_numpy(np.int8, querydata_file)
-        data = utils.bin_to_numpy(np.int8, indexdata_file)
+        dtype = np.int8
     elif dtype_str == "uint8":
-        index = diskannpy.DynamicMemoryIndex(
-            "l2", np.uint8, ndims, npts, Lb, graph_degree
-        )
-        queries = utils.bin_to_numpy(np.uint8, querydata_file)
-        data = utils.bin_to_numpy(np.uint8, indexdata_file)
+        dtype = np.uint8
     else:
         raise ValueError("data_type must be float, int8 or uint8")

+    index = diskannpy.DynamicMemoryIndex(
+        distance_metric="l2",
+        vector_dtype=dtype,
+        dimensions=ndims,
+        max_vectors=npts,
+        complexity=Lb,
+        graph_degree=graph_degree
+    )
+    queries = diskannpy.vectors_from_file(querydata_file, dtype)
+    data = diskannpy.vectors_from_file(indexdata_file, dtype)
+
     tags = np.zeros(npts, dtype=np.uintc)
     timer = utils.Timer()
     for i in range(npts):
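
After this change the three-way branch only selects a dtype, so the same choice can be written as a lookup table. The sketch below pairs such a table with a standard-library reader for the `.bin` header. It assumes the DiskANN binary layout of two little-endian 32-bit integers (point count, then dimension) followed by row-major vector data. `STRUCT_CODES`, `read_bin_metadata`, and `expected_file_size` are hypothetical names, not part of `diskannpy` or the apps' `utils` module.

```python
import os
import struct
import tempfile

# Hypothetical table: the apps' dtype strings mapped to struct format codes.
STRUCT_CODES = {"float": "f", "int8": "b", "uint8": "B"}


def read_bin_metadata(path):
    """Read (npts, ndims) from the 8-byte header of a .bin file (assumed layout)."""
    with open(path, "rb") as f:
        npts, ndims = struct.unpack("<ii", f.read(8))
    return npts, ndims


def expected_file_size(dtype_str, npts, ndims):
    """Header bytes plus npts * ndims elements of the chosen type."""
    try:
        code = STRUCT_CODES[dtype_str]
    except KeyError:
        raise ValueError("data_type must be float, int8 or uint8") from None
    return 8 + npts * ndims * struct.calcsize("<" + code)


# Usage: write a tiny 2 x 3 float file, then read its header back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vectors.bin")
    with open(path, "wb") as f:
        f.write(struct.pack("<ii", 2, 3))       # header: npts=2, ndims=3
        f.write(struct.pack("<6f", *range(6)))  # six float32 values
    npts, ndims = read_bin_metadata(path)
    size_ok = os.path.getsize(path) == expected_file_size("float", npts, ndims)
    print(npts, ndims, size_ok)
```

Comparing the on-disk size with `expected_file_size` is a cheap way to notice a wrong `data_type` argument before any vectors are loaded.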
31 changes: 15 additions & 16 deletions python/apps/insert-in-clustered-order.py
@@ -24,26 +24,25 @@ def insert_and_search(
     npts, ndims = utils.get_bin_metadata(indexdata_file)

     if dtype_str == "float":
-        index = diskannpy.DynamicMemoryIndex(
-            "l2", np.float32, ndims, npts, Lb, graph_degree, False
-        )
-        queries = utils.bin_to_numpy(np.float32, querydata_file)
-        data = utils.bin_to_numpy(np.float32, indexdata_file)
+        dtype = np.float32
     elif dtype_str == "int8":
-        index = diskannpy.DynamicMemoryIndex(
-            "l2", np.int8, ndims, npts, Lb, graph_degree
-        )
-        queries = utils.bin_to_numpy(np.int8, querydata_file)
-        data = utils.bin_to_numpy(np.int8, indexdata_file)
+        dtype = np.int8
     elif dtype_str == "uint8":
-        index = diskannpy.DynamicMemoryIndex(
-            "l2", np.uint8, ndims, npts, Lb, graph_degree
-        )
-        queries = utils.bin_to_numpy(np.uint8, querydata_file)
-        data = utils.bin_to_numpy(np.uint8, indexdata_file)
+        dtype = np.uint8
     else:
         raise ValueError("data_type must be float, int8 or uint8")

+    index = diskannpy.DynamicMemoryIndex(
+        distance_metric="l2",
+        vector_dtype=dtype,
+        dimensions=ndims,
+        max_vectors=npts,
+        complexity=Lb,
+        graph_degree=graph_degree
+    )
+    queries = diskannpy.vectors_from_file(querydata_file, dtype)
+    data = diskannpy.vectors_from_file(indexdata_file, dtype)
+
     offsets, permutation = utils.cluster_and_permute(
         dtype_str, npts, ndims, data, num_clusters
     )
@@ -52,7 +51,7 @@ def insert_and_search(
     timer = utils.Timer()
     for c in range(num_clusters):
         cluster_index_range = range(offsets[c], offsets[c + 1])
-        cluster_indices = np.array(permutation[cluster_index_range], dtype=np.uintc)
+        cluster_indices = np.array(permutation[cluster_index_range], dtype=np.uint32)
         cluster_data = data[cluster_indices, :]
         index.batch_insert(cluster_data, cluster_indices + 1, num_insert_threads)
         print('Inserted cluster', c, 'in', timer.elapsed(), 's')
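
The loop above slices `permutation` by consecutive `offsets` and shifts every row index by one to form its tag. Below is a pure-Python sketch of that bookkeeping, with made-up `offsets` and `permutation` values and a hypothetical helper `cluster_tags`. The real app gets both arrays from `utils.cluster_and_permute` and works on numpy arrays. The diff does not say why tags start at one; a likely reason is that tag 0 is not usable, but treat that as an assumption.

```python
def cluster_tags(offsets, permutation):
    """Yield (row_indices, tags) per cluster; tags are the row indices plus one."""
    for c in range(len(offsets) - 1):
        rows = permutation[offsets[c]:offsets[c + 1]]
        yield rows, [r + 1 for r in rows]


# Made-up example: 5 vectors grouped into 2 clusters.
offsets = [0, 3, 5]            # cluster c owns permutation[offsets[c]:offsets[c + 1]]
permutation = [4, 0, 2, 1, 3]  # row indices ordered by cluster

for rows, tags in cluster_tags(offsets, permutation):
    print(rows, tags)
```

Each `(rows, tags)` pair corresponds to one `batch_insert` call in the app: `rows` selects the vectors and `tags` labels them.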