newbee14/VectorDB4Java

VectorDB4Java

A Spring Boot application for vector similarity search with advanced text processing capabilities.

Key Features

  • Vector Similarity Search: Fast KNN search using Apache Lucene
  • Smart Text Processing:
    • Proper noun and compound word handling
    • Part-of-speech aware weighting (nouns: 1.2x, verbs: 1.1x, adjectives: 1.05x)
    • Automatic duplicate detection (95% similarity threshold)
  • Document Processing: Extract and process text from various document formats
  • REST API: Swagger UI available at /swagger-ui.html
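
One way the part-of-speech weights listed above could be applied is as a weighted average of word vectors. This is only a minimal sketch, not the project's actual code: the class `PosWeightedEmbedding`, the `Pos` enum, the `TaggedVector` record, and the choice of weighted averaging are all assumptions; only the weight values come from the feature list.

```java
import java.util.List;

/** Illustrative sketch only; names and the averaging scheme are assumptions. */
final class PosWeightedEmbedding {

    /** Coarse part-of-speech tags with the weights quoted in the feature list. */
    enum Pos {
        NOUN(1.2), VERB(1.1), ADJECTIVE(1.05), OTHER(1.0);

        final double weight;

        Pos(double weight) {
            this.weight = weight;
        }
    }

    /** A token's embedding together with its already determined POS tag. */
    record TaggedVector(float[] vector, Pos pos) {}

    /** Weighted average of the token vectors; a zero vector if there are no tokens. */
    static float[] embed(List<TaggedVector> tokens, int dimension) {
        float[] sum = new float[dimension];
        double totalWeight = 0.0;
        for (TaggedVector token : tokens) {
            double w = token.pos().weight;
            for (int i = 0; i < dimension; i++) {
                sum[i] += (float) (w * token.vector()[i]);
            }
            totalWeight += w;
        }
        if (totalWeight > 0.0) {
            for (int i = 0; i < dimension; i++) {
                sum[i] /= (float) totalWeight;
            }
        }
        return sum;
    }
}
```

With this scheme a noun pulls the sentence vector slightly further toward its own embedding than a function word does, which is the intent behind the 1.2x weight.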

Quick Start

  1. Build: mvn clean install
  2. Run: mvn spring-boot:run
  3. Access: http://localhost:8080

API Examples

Create Vector from Text

curl -G "http://localhost:8080/api/vectors/text" --data-urlencode "text=your text here"
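
The same request can be built from Java with the JDK's `java.net.http` API (Java 11+). The sketch below only constructs the request and URL-encodes the `text` parameter, since spaces are not valid in a raw query string. The class name `TextVectorRequest` and its `build` method are hypothetical; the endpoint path is the one shown above.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that builds the GET request for /api/vectors/text. */
final class TextVectorRequest {

    static HttpRequest build(String baseUrl, String text) {
        // URLEncoder applies form encoding, so spaces become '+'.
        String query = "text=" + URLEncoder.encode(text, StandardCharsets.UTF_8);
        URI uri = URI.create(baseUrl + "/api/vectors/text?" + query);
        return HttpRequest.newBuilder(uri).GET().build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. The shape of the response body is not documented here, so the sketch does not attempt to parse it.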

Find Similar Vectors

curl -X POST "http://localhost:8080/api/vectors/search" \
  -H "Content-Type: application/json" \
  -d '{"vector": [0.1, 0.2, ...]}'

Process Document

curl -X POST "http://localhost:8080/api/vectors/document" \
  -F "file=@/path/to/document.pdf"

Configuration

Key settings in application.properties:

# Vector similarity threshold (0.0 to 1.0)
vector.similarity.threshold=0.95

# File upload limits
spring.servlet.multipart.max-file-size=10MB
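
To illustrate what `vector.similarity.threshold` controls, here is a minimal sketch of threshold-based duplicate detection using cosine similarity. It is not the project's implementation: the class `DuplicateDetector`, its method names, and the use of `>=` for the comparison are assumptions.

```java
import java.util.List;

/** Illustrative duplicate check; names and the >= comparison are assumptions. */
final class DuplicateDetector {

    private final double threshold;

    DuplicateDetector(double threshold) {
        this.threshold = threshold;
    }

    /** Cosine similarity of two equal-length vectors; 0.0 if either has zero norm. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vector dimensions differ");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** True if the candidate is at least `threshold` similar to any stored vector. */
    boolean isDuplicate(float[] candidate, List<float[]> stored) {
        for (float[] existing : stored) {
            if (cosine(candidate, existing) >= threshold) {
                return true;
            }
        }
        return false;
    }
}
```

With the threshold set to 0.95, only vectors pointing in nearly the same direction as a stored vector are treated as duplicates. Lowering the value makes the check more aggressive.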

Notes on Proper Noun Handling

  • The application includes logic to handle proper nouns and part-of-speech-aware weighting during text processing.
  • However, the effectiveness of proper noun distinction depends on the underlying embedding model (e.g., GloVe).
  • In some cases, the model may not produce sufficiently distinct vectors for sentences differing only in proper nouns.
  • The integration test for proper noun handling is currently ignored due to this limitation.

Embedding Model

This project uses the GloVe (Global Vectors for Word Representation) pre-trained word embeddings for generating vector representations of text.

Setup Instructions

  1. Download the pre-trained GloVe vectors (for example the glove.6B.zip archive) from the Stanford NLP GloVe project page.
  2. Unzip the archive if you downloaded glove.6B.zip.
  3. Place the file glove.6B.100d.txt in the directory: src/main/resources/models/
    • The final path should be: src/main/resources/models/glove.6B.100d.txt

The application will automatically load this file at startup.
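
GloVe text files contain one word per line followed by its space-separated vector components. A loader for that format might look like the sketch below. The class `GloveLoader` and its `load` method are hypothetical and are not taken from the project; the project's own loader may differ.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical loader for GloVe's plain-text format: "word v1 v2 ... vN". */
final class GloveLoader {

    static Map<String, float[]> load(Reader source, int dimension) {
        Map<String, float[]> vectors = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isBlank()) {
                    continue; // tolerate empty lines
                }
                String[] parts = line.split(" ");
                if (parts.length != dimension + 1) {
                    throw new IllegalArgumentException(
                        "Expected " + dimension + " values for '" + parts[0]
                            + "' but found " + (parts.length - 1));
                }
                float[] vector = new float[dimension];
                for (int i = 0; i < dimension; i++) {
                    vector[i] = Float.parseFloat(parts[i + 1]);
                }
                vectors.put(parts[0], vector);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return vectors;
    }
}
```

For glove.6B.100d.txt the `dimension` argument would be 100, and the reader would be opened on the file placed under src/main/resources/models/.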

Indexing Strategy

The application uses Apache Lucene for vector similarity search, implementing a custom indexing strategy optimized for high-dimensional vectors.

Current Implementation

  • Index Structure: Uses Lucene's ByteBuffersDirectory for in-memory indexing
  • Vector Storage: Vectors are stored as binary fields in Lucene documents
  • Similarity Search: Implements K-Nearest Neighbors (KNN) search using cosine similarity
  • Performance Characteristics:
    • Fast for small to medium-sized datasets (up to ~100K vectors)
    • Low-latency access, since the whole index is held in memory and no disk I/O is needed
    • Linear search complexity (O(n) for n vectors)
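
The linear-scan behaviour described above can be sketched in plain Java, without Lucene, as a brute-force KNN search over cosine similarity. The class `BruteForceKnn` and the `Hit` record are hypothetical names used only for illustration; the project's Lucene-based code is not shown here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative brute-force KNN; not the project's Lucene-based implementation. */
final class BruteForceKnn {

    /** Position of a stored vector and its cosine similarity to the query. */
    record Hit(int index, double score) {}

    /** Cosine similarity of two equal-length vectors; 0.0 if either has zero norm. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vector dimensions differ");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Scores every stored vector (the O(n) scan), then keeps the k best. */
    static List<Hit> search(List<float[]> vectors, float[] query, int k) {
        List<Hit> hits = new ArrayList<>(vectors.size());
        for (int i = 0; i < vectors.size(); i++) {
            hits.add(new Hit(i, cosine(query, vectors.get(i))));
        }
        hits.sort(Comparator.comparingDouble(Hit::score).reversed());
        return new ArrayList<>(hits.subList(0, Math.min(k, hits.size())));
    }
}
```

Every query touches every stored vector, which is why latency grows in proportion to the dataset size. The final sort adds an O(n log n) step that a bounded heap could reduce, but the scan itself cannot be avoided without an index structure.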

Limitations and Potential Improvements

  1. Scalability:

    • Current implementation uses in-memory indexing, limiting dataset size
    • Could be improved by implementing disk-based indexing for larger datasets
    • Consider using Lucene's MMapDirectory or NIOFSDirectory for persistent storage
  2. Search Performance:

    • Linear search becomes slow for large datasets
    • Potential improvements:
      • Implement HNSW (Hierarchical Navigable Small World) graph for approximate nearest neighbors
      • Use Product Quantization (PQ) for vector compression
      • Implement IVF (Inverted File) index for faster approximate search
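
To make the IVF idea concrete, the sketch below assigns each vector to the cell of its most similar centroid and, at query time, scans only the `nprobe` closest cells. It is a toy illustration under stated assumptions: the class `IvfSketch` is hypothetical, centroids are supplied by the caller (in practice they are usually learned, for example with k-means), and it is not code from the experimental branch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy inverted-file (IVF) index; centroids are supplied by the caller. */
final class IvfSketch {

    private final float[][] centroids;
    private final List<List<float[]>> cells = new ArrayList<>();

    IvfSketch(float[][] centroids) {
        if (centroids.length == 0) {
            throw new IllegalArgumentException("At least one centroid is required");
        }
        this.centroids = centroids;
        for (int i = 0; i < centroids.length; i++) {
            cells.add(new ArrayList<>());
        }
    }

    /** Cosine similarity of two equal-length vectors; 0.0 if either has zero norm. */
    static double cosine(float[] a, float[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Centroid indices ordered from most to least similar to v. */
    private List<Integer> rankCentroids(float[] v) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < centroids.length; i++) {
            order.add(i);
        }
        order.sort(Comparator.comparingDouble((Integer i) -> -cosine(v, centroids[i])));
        return order;
    }

    /** Adds a vector to the cell of its most similar centroid. */
    void add(float[] vector) {
        cells.get(rankCentroids(vector).get(0)).add(vector);
    }

    /** Scans only the nprobe closest cells; returns the best match or null. */
    float[] nearest(float[] query, int nprobe) {
        float[] best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        List<Integer> order = rankCentroids(query);
        for (int c = 0; c < Math.min(nprobe, order.size()); c++) {
            for (float[] candidate : cells.get(order.get(c))) {
                double score = cosine(query, candidate);
                if (score > bestScore) {
                    bestScore = score;
                    best = candidate;
                }
            }
        }
        return best;
    }
}
```

The search is approximate: with a small `nprobe` the true nearest neighbour can be missed if it was assigned to a cell that is not probed. Raising `nprobe` trades speed for recall, and probing every cell degenerates to the linear scan.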

Experimental Implementations

An experimental branch feature/indexing-experiments explores alternative indexing strategies:

  • Dense vector indexing (HNSW, IVF)
  • Sparse vector indexing
  • Performance comparisons between different approaches

Relevant Resources

Alternative solutions:

  • FAISS - Facebook's library for efficient similarity search
  • Milvus - Open-source vector database
  • Weaviate - Vector search engine with GraphQL API

License

MIT License
