VectorDB4Java

A Spring Boot application for vector similarity search with advanced text processing capabilities.

Key Features

Vector Similarity Search: Fast KNN search using Apache Lucene
Smart Text Processing:
- Proper noun and compound word handling
- Part-of-speech aware weighting (nouns: 1.2x, verbs: 1.1x, adjectives: 1.05x)
- Automatic duplicate detection (95% similarity threshold)
Document Processing: Extract and process text from various document formats
REST API: Swagger UI available at /swagger-ui.html

Quick Start

Build: mvn clean install
Run: mvn spring-boot:run
Access: http://localhost:8080

API Examples

Create Vector from Text

curl -G "http://localhost:8080/api/vectors/text" --data-urlencode "text=your text here"

Find Similar Vectors

curl -X POST "http://localhost:8080/api/vectors/search" \
  -H "Content-Type: application/json" \
  -d '{"vector": [0.1, 0.2, ...]}'

Process Document

curl -X POST "http://localhost:8080/api/vectors/document" \
  -F "file=@/path/to/document.pdf"

Configuration

Key settings in application.properties:

# Vector similarity threshold (0.0 to 1.0)
vector.similarity.threshold=0.95

# File upload limits
spring.servlet.multipart.max-file-size=10MB
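
To make the threshold concrete, here is a minimal sketch of how a 0.95 cutoff could drive the duplicate detection listed under Key Features. It assumes the comparison is cosine similarity and that vectors are plain float arrays; the class and method names are hypothetical and not taken from the project.

```java
// Hypothetical helper: names are illustrative, not the project's actual code.
final class DuplicateCheck {
    // Assumed to mirror vector.similarity.threshold=0.95 from application.properties.
    static final double THRESHOLD = 0.95;

    /** Cosine similarity of two equal-length vectors; returns 0 if either has zero norm. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Treats the candidate as a duplicate when similarity meets or exceeds the threshold. */
    static boolean isDuplicate(float[] candidate, float[] existing) {
        return cosine(candidate, existing) >= THRESHOLD;
    }
}
```

Raising the threshold toward 1.0 makes the check stricter, so fewer vectors would be treated as duplicates.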

Notes on Proper Noun Handling

The application includes logic to handle proper nouns and part-of-speech-aware weighting during text processing.
However, the effectiveness of proper noun distinction depends on the underlying embedding model (e.g., GloVe).
In some cases, the model may not produce sufficiently distinct vectors for sentences differing only in proper nouns.
The integration test for proper noun handling is currently ignored due to this limitation.
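
The part-of-speech weighting can be pictured as a weighted mean of word vectors. The sketch below uses the weights listed under Key Features (nouns 1.2, verbs 1.1, adjectives 1.05) and assumes every other tag gets a weight of 1.0 and that tagging has already happened elsewhere. The `PosWeightedAverage` class and its nested types are hypothetical names, not the project's API.

```java
import java.util.List;

// Hypothetical sketch: assumes sentence vectors are a weighted mean of word vectors.
final class PosWeightedAverage {
    enum Pos { NOUN, VERB, ADJECTIVE, OTHER }

    /** A word vector together with the part-of-speech tag of its token. */
    static final class TaggedVector {
        final Pos pos;
        final float[] vector;

        TaggedVector(Pos pos, float[] vector) {
            this.pos = pos;
            this.vector = vector;
        }
    }

    /** Weights from the README; 1.0 for all other tags is an assumption. */
    static double weight(Pos pos) {
        switch (pos) {
            case NOUN: return 1.2;
            case VERB: return 1.1;
            case ADJECTIVE: return 1.05;
            default: return 1.0;
        }
    }

    /** Weighted mean of the token vectors; returns a zero vector for empty input. */
    static float[] average(List<TaggedVector> tokens, int dim) {
        double[] sum = new double[dim];
        double totalWeight = 0;
        for (TaggedVector token : tokens) {
            double w = weight(token.pos);
            for (int i = 0; i < dim; i++) {
                sum[i] += w * token.vector[i];
            }
            totalWeight += w;
        }
        float[] out = new float[dim];
        if (totalWeight == 0) {
            return out;
        }
        for (int i = 0; i < dim; i++) {
            out[i] = (float) (sum[i] / totalWeight);
        }
        return out;
    }
}
```

Under this scheme, two sentences that differ in a single proper noun still share most of their weighted terms, which is consistent with the limitation described above.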

Embedding Model

This project uses the GloVe (Global Vectors for Word Representation) pre-trained word embeddings for generating vector representations of text.

Model Used: GloVe 6B, 100-dimensional vectors
Download Link: glove.6B.zip (822 MB)
Direct File: glove.6B.100d.txt (347 MB)

Setup Instructions

Download the GloVe model from the links above.
Unzip the archive if you downloaded glove.6B.zip.
Place the file glove.6B.100d.txt in the directory: src/main/resources/models/
- The final path should be: src/main/resources/models/glove.6B.100d.txt

The application will automatically load this file at startup.
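
For orientation, GloVe text files store one token per line followed by its space-separated vector components. The following is a minimal sketch of parsing that format into a map; `GloveLoader` is a hypothetical name and this is not necessarily how the application loads the model.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical loader: illustrates the file format, not the project's actual startup code.
final class GloveLoader {
    /**
     * Reads "token v1 v2 ... vN" lines and returns token -> vector.
     * Fails fast if a line does not have exactly expectedDim components.
     */
    static Map<String, float[]> load(Reader source, int expectedDim) throws IOException {
        Map<String, float[]> vectors = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;
                }
                String[] parts = line.split(" ");
                if (parts.length != expectedDim + 1) {
                    throw new IOException("Unexpected column count for token: " + parts[0]);
                }
                float[] vector = new float[expectedDim];
                for (int i = 0; i < expectedDim; i++) {
                    vector[i] = Float.parseFloat(parts[i + 1]);
                }
                vectors.put(parts[0], vector);
            }
        }
        return vectors;
    }
}
```

For glove.6B.100d.txt, `expectedDim` would be 100 and the reader would wrap the file placed under src/main/resources/models/.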

Indexing Strategy

The application uses Apache Lucene for vector similarity search, implementing a custom indexing strategy optimized for high-dimensional vectors.

Current Implementation

Index Structure: Uses Lucene's ByteBuffersDirectory for in-memory indexing
Vector Storage: Vectors are stored as binary fields in Lucene documents
Similarity Search: Implements K-Nearest Neighbors (KNN) search using cosine similarity
Performance Characteristics:
- Fast for small to medium-sized datasets (up to ~100K vectors)
- No disk I/O at query time, because the whole index is held in memory (heap usage grows with the dataset)
- Linear search complexity (O(n) for n vectors)
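
The linear-scan behaviour described above can be illustrated without Lucene. This is a minimal sketch of brute-force KNN with cosine similarity over vectors held in a list; `LinearKnn` and `Hit` are hypothetical names, and the project's Lucene-based code may differ in detail.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of a brute-force scan; every stored vector is compared to the query.
final class LinearKnn {
    /** Position of a stored vector and its similarity to the query. */
    static final class Hit {
        final int index;
        final double score;

        Hit(int index, double score) {
            this.index = index;
            this.score = score;
        }
    }

    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns up to k hits, best first. Visits all n vectors once. */
    static List<Hit> search(List<float[]> vectors, float[] query, int k) {
        // Min-heap on score: the weakest of the current top-k sits at the head.
        PriorityQueue<Hit> heap =
                new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        for (int i = 0; i < vectors.size(); i++) {
            heap.add(new Hit(i, cosine(query, vectors.get(i))));
            if (heap.size() > k) {
                heap.poll(); // drop the weakest candidate
            }
        }
        List<Hit> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return result;
    }
}
```

Because each query touches every stored vector, query time grows in proportion to the dataset size, which is why the approach is best suited to small and medium datasets.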

Limitations and Potential Improvements

Scalability:
- Current implementation uses in-memory indexing, limiting dataset size
- Could be improved by implementing disk-based indexing for larger datasets
- Consider using Lucene's MMapDirectory or NIOFSDirectory for persistent storage
Search Performance:
- Linear search becomes slow for large datasets
- Potential improvements:
  - Implement HNSW (Hierarchical Navigable Small World) graph for approximate nearest neighbors
  - Use Product Quantization (PQ) for vector compression
  - Implement IVF (Inverted File) index for faster approximate search
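
As an illustration of the compression idea behind Product Quantization, the sketch below encodes a vector as one byte per subspace by picking the nearest centroid in each subspace. It assumes the codebooks have already been trained (for example with k-means, not shown) and hold at most 256 centroids per subspace. `ProductQuantizer` is a hypothetical class, not part of the project or of Lucene.

```java
// Hypothetical sketch of PQ encoding and decoding with pre-trained codebooks.
final class ProductQuantizer {
    // codebooks[s][c] is centroid c of subspace s; all centroids have length subDim.
    private final float[][][] codebooks;
    private final int subDim;

    ProductQuantizer(float[][][] codebooks) {
        this.codebooks = codebooks;
        this.subDim = codebooks[0][0].length;
    }

    /** Replaces each sub-vector with the index of its nearest centroid (squared Euclidean distance). */
    byte[] encode(float[] vector) {
        int subspaces = codebooks.length;
        if (vector.length != subspaces * subDim) {
            throw new IllegalArgumentException("Vector length must equal subspaces * subDim");
        }
        byte[] codes = new byte[subspaces];
        for (int s = 0; s < subspaces; s++) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < codebooks[s].length; c++) {
                double dist = 0;
                for (int i = 0; i < subDim; i++) {
                    double diff = vector[s * subDim + i] - codebooks[s][c][i];
                    dist += diff * diff;
                }
                if (dist < bestDist) {
                    bestDist = dist;
                    best = c;
                }
            }
            codes[s] = (byte) best;
        }
        return codes;
    }

    /** Rebuilds an approximate vector by concatenating the selected centroids. */
    float[] decode(byte[] codes) {
        float[] out = new float[codes.length * subDim];
        for (int s = 0; s < codes.length; s++) {
            float[] centroid = codebooks[s][codes[s] & 0xFF];
            System.arraycopy(centroid, 0, out, s * subDim, subDim);
        }
        return out;
    }
}
```

As a rough sense of scale, a 100-dimensional float vector occupies 400 bytes, while a code with 10 subspaces would occupy 10 bytes, at the cost of approximation error.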

Experimental Implementations

An experimental branch feature/indexing-experiments explores alternative indexing strategies:

Dense vector indexing (HNSW, IVF)
Sparse vector indexing
Performance comparisons between different approaches

Relevant Resources

Vector Search Fundamentals:
- Approximate Nearest Neighbors Oh Yeah (ANNOY) - Spotify's library for approximate nearest neighbors
- HNSW: Hierarchical Navigable Small World - Paper on HNSW algorithm
- Product Quantization for Nearest Neighbor Search - Paper on PQ technique
Lucene and Vector Search:
- Apache Lucene Documentation
- Lucene Vector Search
- Elasticsearch Vector Search - Example of production-grade vector search implementation
Alternative Solutions:
- FAISS - Facebook's library for efficient similarity search
- Milvus - Open-source vector database
- Weaviate - Vector search engine with GraphQL API
Performance Optimization:
- Vector Search Performance - Guide to vector search performance
- Approximate Nearest Neighbor Search - Overview of ANN algorithms

License

MIT License
