A Spring Boot application for vector similarity search with advanced text processing capabilities.
- Vector Similarity Search: Fast KNN search using Apache Lucene
- Smart Text Processing:
- Proper noun and compound word handling
- Part-of-speech aware weighting (nouns: 1.2x, verbs: 1.1x, adjectives: 1.05x)
- Automatic duplicate detection (95% similarity threshold)
- Document Processing: Extract and process text from various document formats
- REST API: Swagger UI available at /swagger-ui.html
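
The part-of-speech weights listed above can be read as multipliers in a weighted average of word vectors. The following is a minimal sketch under that assumption, not the project's actual code: `PosWeightedAverage`, the `Pos` enum, and the `TaggedToken` record are hypothetical names, and tagging is assumed to be done by an external tagger.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: part-of-speech weighted averaging of word vectors. */
final class PosWeightedAverage {

    /** Hypothetical tag set; the multipliers are the ones listed in this README. */
    enum Pos {
        NOUN(1.2), VERB(1.1), ADJECTIVE(1.05), OTHER(1.0);

        final double weight;

        Pos(double weight) {
            this.weight = weight;
        }
    }

    /** A token paired with the tag an external tagger assigned to it. */
    record TaggedToken(String word, Pos pos) {}

    /** Returns the weighted mean of the known tokens' vectors, or a zero vector if none are known. */
    static double[] embed(List<TaggedToken> tokens, Map<String, double[]> embeddings, int dim) {
        double[] sum = new double[dim];
        double totalWeight = 0.0;
        for (TaggedToken token : tokens) {
            double[] vector = embeddings.get(token.word());
            if (vector == null) {
                continue; // out-of-vocabulary words are skipped
            }
            double w = token.pos().weight;
            for (int i = 0; i < dim; i++) {
                sum[i] += w * vector[i];
            }
            totalWeight += w;
        }
        if (totalWeight > 0.0) {
            for (int i = 0; i < dim; i++) {
                sum[i] /= totalWeight; // normalize so the result is a weighted mean
            }
        }
        return sum;
    }
}
```

Dividing by the total weight keeps the result on the same scale as the input vectors, so the multipliers only shift the balance between nouns, verbs, and adjectives rather than inflating the vector.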
- Build: mvn clean install
- Run: mvn spring-boot:run
- Access: http://localhost:8080
curl -G "http://localhost:8080/api/vectors/text" --data-urlencode "text=your text here"

curl -X POST "http://localhost:8080/api/vectors/search" \
-H "Content-Type: application/json" \
-d '{"vector": [0.1, 0.2, ...]}'

curl -X POST "http://localhost:8080/api/vectors/document" \
-F "file=@/path/to/document.pdf"

Key settings in application.properties:
# Vector similarity threshold (0.0 to 1.0)
vector.similarity.threshold=0.95
# File upload limits
spring.servlet.multipart.max-file-size=10MB

- The application includes logic to handle proper nouns and part-of-speech-aware weighting during text processing.
- However, the effectiveness of proper noun distinction depends on the underlying embedding model (e.g., GloVe).
- In some cases, the model may not produce sufficiently distinct vectors for sentences differing only in proper nouns.
- The integration test for proper noun handling is currently ignored due to this limitation.
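
One way to see the limitation described above: if a sentence vector is the average of its word vectors (an assumption about the pipeline, not something this README confirms), a single differing proper noun moves the average very little, and two sentences can land above the 0.95 duplicate threshold. The sketch below uses made-up 3-dimensional vectors rather than real GloVe values, and `DuplicateCheck` and its methods are hypothetical names.

```java
/** Hypothetical sketch: threshold-based duplicate detection on averaged word vectors. */
final class DuplicateCheck {

    /** Cosine similarity; returns 0 when either vector has zero magnitude. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Two vectors count as duplicates when their similarity reaches the threshold. */
    static boolean isDuplicate(double[] a, double[] b, double threshold) {
        return cosine(a, b) >= threshold;
    }

    /** Component-wise mean of one or more vectors of equal dimension. */
    static double[] mean(double[]... vectors) {
        double[] result = new double[vectors[0].length];
        for (double[] v : vectors) {
            for (int i = 0; i < result.length; i++) {
                result[i] += v[i] / vectors.length;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy vectors, invented for illustration only.
        double[] alice = {0.9, 0.1, 0.0};
        double[] bob = {0.7, 0.0, 0.3};
        double[] visited = {0.1, 0.9, 0.2};
        double[] paris = {0.2, 0.3, 0.9};

        double[] first = mean(alice, visited, paris);
        double[] second = mean(bob, visited, paris);

        System.out.println("names alone duplicate: " + isDuplicate(alice, bob, 0.95));
        System.out.println("sentences duplicate:   " + isDuplicate(first, second, 0.95));
    }
}
```

With these toy values the two names on their own fall below the threshold, while the averaged sentence vectors exceed it, which mirrors the behavior the ignored integration test runs into.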
This project uses the GloVe (Global Vectors for Word Representation) pre-trained word embeddings for generating vector representations of text.
- Model Used: GloVe 6B, 100-dimensional vectors
- Download Link: glove.6B.zip (822 MB)
- Direct File: glove.6B.100d.txt (347 MB)
- Download the GloVe model from the links above.
- Unzip the archive if you downloaded glove.6B.zip.
- Place the file glove.6B.100d.txt in the directory src/main/resources/models/
- The final path should be: src/main/resources/models/glove.6B.100d.txt
The application will automatically load this file at startup.
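
For orientation, the GloVe text format stores one token per line, followed by its vector components separated by spaces. A loader for that format could look roughly like the sketch below. `GloveLoader` is a hypothetical name and this is not the application's own loading code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: parse GloVe text lines of the form "word v1 v2 ... vN". */
final class GloveLoader {

    /** Reads every line from the source and returns a word-to-vector map. */
    static Map<String, float[]> load(Reader source, int dim) {
        Map<String, float[]> vectors = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isBlank()) {
                    continue; // tolerate trailing blank lines
                }
                String[] parts = line.split(" ");
                if (parts.length != dim + 1) {
                    throw new IllegalArgumentException(
                        "Expected " + dim + " values for '" + parts[0]
                            + "', found " + (parts.length - 1));
                }
                float[] vector = new float[dim];
                for (int i = 0; i < dim; i++) {
                    vector[i] = Float.parseFloat(parts[i + 1]);
                }
                vectors.put(parts[0], vector);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return vectors;
    }
}
```

For the real model, the reader could come from `Files.newBufferedReader(Path.of("src/main/resources/models/glove.6B.100d.txt"))` with `dim` set to 100. Holding the full vocabulary in a `HashMap` is the simplest option, at the cost of keeping every vector in memory.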
The application uses Apache Lucene for vector similarity search, implementing a custom indexing strategy optimized for high-dimensional vectors.
- Index Structure: Uses Lucene's ByteBuffersDirectory for in-memory indexing
- Vector Storage: Vectors are stored as binary fields in Lucene documents
- Similarity Search: Implements K-Nearest Neighbors (KNN) search using cosine similarity
- Performance Characteristics:
- Fast for small to medium-sized datasets (up to ~100K vectors)
- Low query latency, since the index is held entirely in memory
- Linear search complexity (O(n) for n vectors)
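
The linear-scan behavior described above can be illustrated in plain Java, independent of Lucene. This is a sketch of exhaustive KNN with cosine similarity, not the project's Lucene-based code; `LinearKnn` and `Hit` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: exhaustive k-nearest-neighbor search by cosine similarity. */
final class LinearKnn {

    /** Position of a stored vector and its similarity to the query. */
    record Hit(int index, double score) {}

    static double cosine(float[] a, float[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Scores every stored vector against the query and returns the k best, highest score first. */
    static List<Hit> search(List<float[]> vectors, float[] query, int k) {
        Comparator<Hit> byScore = Comparator.comparingDouble(Hit::score);
        // Min-heap: the weakest of the current top k sits at the head.
        PriorityQueue<Hit> heap = new PriorityQueue<>(byScore);
        for (int i = 0; i < vectors.size(); i++) {
            heap.add(new Hit(i, cosine(vectors.get(i), query)));
            if (heap.size() > k) {
                heap.poll(); // evict the weakest candidate
            }
        }
        List<Hit> hits = new ArrayList<>(heap);
        hits.sort(byScore.reversed());
        return hits;
    }
}
```

Every query computes one similarity per stored vector, which is where the O(n) cost comes from. The heap only bounds the bookkeeping for the top k results.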
- Scalability:
- Current implementation uses in-memory indexing, limiting dataset size
- Could be improved by implementing disk-based indexing for larger datasets
- Consider using Lucene's MMapDirectory or NIOFSDirectory for persistent storage
- Search Performance:
- Linear search becomes slow for large datasets
- Potential improvements:
- Implement HNSW (Hierarchical Navigable Small World) graph for approximate nearest neighbors
- Use Product Quantization (PQ) for vector compression
- Implement IVF (Inverted File) index for faster approximate search
An experimental branch feature/indexing-experiments explores alternative indexing strategies:
- Dense vector indexing (HNSW, IVF)
- Sparse vector indexing
- Performance comparisons between different approaches
- Vector Search Fundamentals:
- Approximate Nearest Neighbors Oh Yeah (ANNOY) - Spotify's library for approximate nearest neighbors
- HNSW: Hierarchical Navigable Small World - Paper on HNSW algorithm
- Product Quantization for Nearest Neighbor Search - Paper on PQ technique
- Lucene and Vector Search:
- Apache Lucene Documentation
- Lucene Vector Search
- Elasticsearch Vector Search - Example of production-grade vector search implementation
- Alternative Solutions:
- Performance Optimization:
- Vector Search Performance - Guide to vector search performance
- Approximate Nearest Neighbor Search - Overview of ANN algorithms
MIT License