Mini-Disk-ANN

Project for the course Software Development for Information Systems, DIT UOA, Winter 2024. Team members:

Dimitris Dimochronis, sdi2100034@di.uoa.gr,
Nikos Pentasilis, sdi2100152@di.uoa.gr,
Giorgos Ragkos, sdi2100167@di.uoa.gr.

This project consists of implementing algorithms described for DiskANN, a fast accurate billion-point nearest neighbor search on a single node. For our project, we will experiment with datasets up to 1 million points. More about the algorithms implemented can be found in Resources

Build and Run project

To run the project:

make run

The configuration file is by default config.txt. To run with other file name: ./bin/vamana <file name>

To run the tests:

make test

To clean the directory:

make clean

To check for memory leaks in tests:

make valgrind_test

Configuration file

Parameters are provided via a configuration file, whose name is given as argument to the program.
The file must have the following format:

dataset=data/SIGMOD/DUMMY/dummy-data.bin
query=data/SIGMOD/DUMMY/dummy-queries.bin
groundtruth=data/SIGMOD/DUMMY/dummy-groundtruth.bin
vamana_function=filtered
graph_name=filtered_tmp
q_idx=
k=100
R=15
L=100
a=1.2

The dataset/query/groundtruth specify the paths of these files.
The vamana_function determines the indexing & searching algorithm to run:
- vamana, for simple Vamana
- filtered, for Filtered Vamana,
- stiched, for Stitched Vamana.
graph_name is the name of the graph we want to create or use.
k, R, L, a are the well-known parameters for the functions.
With q_idx you can:
- specify the index of the query in the file you want to test,
- set -2 to test only one query, randomly chosen by the program,
- set -3 to only create the index, without running queries,
- set -4 to create the groundtruth from the dataset and queries, which will be saved with the name given in the corresponding field,
- leave it empty to run for all queries.

Performance

For a dataset of 10,000 points, the algorithms show recall@100 > 95%. Simple Vamana now requires 16 seconds for indexing, while Filtered and Stitched take 6 seconds.
For a dataset of 1M points, the algorithms show recall@100 > 97%. Simple Vamana never finishes, Filtered needs 12hr for indexing, and Stitched needs 13hr.
For unfiltered queries, initial recall@100 did not exceed 35%. This is due to the lack of edges between nodes with different filters, resulting in no access to the entire graph. To address this issue, the medoid for each filter in the search is taken as the result of Filtered Greedy Search for that filter with k = 1. Thus, the search is more targeted, starting from the closest point for that filter.

Greedy Search

Simple Greedy Search

L_m_V computes the elements in L not in V.
greedy_search initializes sets L and V. While there are elements in L not in V, it finds the closest point to p using euclidean_distance, adds the point's neighbors to L, and if L exceeds L_s in size, it resizes to keep only the L-L_s closest nodes. Finally, returns a pair of graphs: the first contains the k nearest nodes to q, and the second contains the visited nodes.

Filtered Greedy Search

filtered_greedy_search initializes sets L and V. L is initialized with nodes s whose filter matches fq. While there are elements in L not in V, it finds the closest point to p using euclidean_distance, initializes set N, adds neighbors of P with filter Fq not in V to N, adds nodes from N not already in L to L. If L exceeds L_s, it resizes to keep only the L-L_s closest nodes. Finally, returns a pair of graphs: the first with the k nearest nodes to q, and the second with the visited nodes.

Robust Pruning

Simple Robust Pruning

Argument p_node is the node of point p in the graph.
Adds all out edges of p to V, checking for duplicates.
Removes all edges of p from the graph, then while V is not empty and |N_out(p)| is not equal to R, computes the distance from p to each point in V and adds the closest one to N_out(p).

Filtered Robust Pruning

Same process as simple robust pruning, with two key differences. First, it has an extra argument, the vector C, containing the filter of each point in the graph. Second, before checking for possible nodes to remove from V (i.e., to delete an edge), it checks if p, p*, p' have the same filter and if not, it moves to the next point in V (so that it does not delete an edge between points with different filters).

Vamana Indexing

Simple Vamana indexing

Creates a graph from points in dataset P and creates up to R random edges (if enough points exist). Then, finds the graph's medoid and traverses each point in the graph randomly. Runs Greedy Search with q being the current point in that iteration, passing the vector of visited nodes as an argument to Pruning. Then, for each out neighbor, creates reverse edges and prunes out neighbors if they exceed R in number. N_out_j_p is used to convert the set of out-neighbors to a vector to pass to Robust Prune.

Filtered Vamana indexing

Creates an empty graph with points from dataset P. Finds the medoid for each filter, which is a random point with that filter, as each point in the dataset has only one filter. Traverses each point in the graph randomly. Runs Filtered Greedy Search with q being the current point in that iteration, passing the vector of visited nodes to Filtered Pruning. Then, for each out neighbor, creates reverse edges and prunes out neighbors if they exceed R in number. N_out_j is used to convert the set of out-neighbors to a vector to pass to Robust Prune.

Stitched Vamana indexing

For each filter, creates a graph from points with that filter using classic Vamana. Then, joins all graphs as follows:
   - Stores the mapping of each point in P to its index in the graph of its filter.
   - Stores the mapping of each point in each filter graph to its index in dataset P.
   - Adds all points to the final graph S_stitched.
   - Using the first two structures, converts the out neighbor indices in each filter graph to match the stitched graph and dataset P.
Filtered Pruning is skipped, and R_stitched is applied directly to Vamana since each point has only one filter, so it is not possible to create edges with points from other filters.

Datasets

Datasets for simple Vamana Indexing: Corpus.
Datasets for Filtered & Stiched Vamana Indexing: DB Group.

Resources

Suhas Jayaram Subramanya, Devvrit, Rohan Kadekodi, Ravishankar Krishaswamy, and Harsha Vardhan Simhadri. (2019).
DiskANN: fast accurate billion-point nearest neighbor search on a single node. Proceedings of the 33rd International Conference on Neural Information Processing Systems. Curran Associates Inc., Red Hook, NY, USA, Article 1233, 13766–13776.
https://dl.acm.org/doi/abs/10.5555/3454287.3455520
Siddharth Gollapudi, Neel Karia, Varun Sivashankar, Ravishankar Krishnaswamy, Nikit Begwani, Swapnil Raz, Yiyong Lin, Yin Zhang, Neelam Mahapatro, Premkumar Srinivasan, Amit Singh, and Harsha Vardhan Simhadri. (2023).
Filtered-DiskANN: Graph Algorithms for Approximate Nearest Neighbor Search with Filters. In Proceedings of the ACM Web Conference 2023 (WWW '23). Association for Computing Machinery, New York, NY, USA, 3406–3416.
https://doi.org/10.1145/3543507.3583552

Name		Name	Last commit message	Last commit date
Latest commit History 192 Commits
.github/workflows		.github/workflows
data		data
graphs		graphs
include		include
src		src
testing		testing
.gitignore		.gitignore
Project_Report.pdf		Project_Report.pdf
README.md		README.md
config.txt		config.txt
makefile		makefile

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Mini-Disk-ANN

Build and Run project

To run the project:

To run the tests:

To clean the directory:

To check for memory leaks in tests:

Configuration file

Performance

Greedy Search

Simple Greedy Search

Filtered Greedy Search

Robust Pruning

Simple Robust Pruning

Filtered Robust Pruning

Vamana Indexing

Simple Vamana indexing

Filtered Vamana indexing

Stitched Vamana indexing

Datasets

Resources

About

Uh oh!

Releases 3

Packages

Contributors 3

Uh oh!

Languages

maxidimi/Mini-Disk-ANN

Folders and files

Latest commit

History

Repository files navigation

Mini-Disk-ANN

Build and Run project

To run the project:

To run the tests:

To clean the directory:

To check for memory leaks in tests:

Configuration file

Performance

Greedy Search

Simple Greedy Search

Filtered Greedy Search

Robust Pruning

Simple Robust Pruning

Filtered Robust Pruning

Vamana Indexing

Simple Vamana indexing

Filtered Vamana indexing

Stitched Vamana indexing

Datasets

Resources

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Contributors 3

Uh oh!

Languages

Packages