Semantica is a lightweight semantic search engine for PDF documents. It processes PDF files, converts them into vectorized text chunks, and enables intelligent retrieval using embedding-based similarity search — all without using an LLM.
- 📄 Upload any PDF file
- ✂️ Automatic chunking of document content
- 🔢 Embedding with HuggingFace (MiniLM)
- 🧠 Vector search with Qdrant
- ⚡ Fast and local — no OpenAI API required
- 📆 Built with FastAPI, LangChain, and Qdrant
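The "no OpenAI API" point is worth unpacking: retrieval here is plain vector math, not text generation. A dependency-free toy sketch of the ranking step (the three-dimensional vectors below are made-up stand-ins for real MiniLM embeddings):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Pretend these vectors came from embedding each chunk with MiniLM.
chunks = {
    "This is a simple PDF file. Fun fun fun.": [0.9, 0.1, 0.0],
    "Installation requires Docker and Python.": [0.1, 0.8, 0.3],
}
query_vec = [0.8, 0.2, 0.1]  # and this one from embedding the query

# Rank chunks by similarity to the query: no LLM anywhere.
ranked = sorted(chunks, key=lambda text: cosine(chunks[text], query_vec), reverse=True)
```

The same idea scales up unchanged; Qdrant just does this comparison efficiently over many stored vectors.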
| Layer | Tool |
|---|---|
| Backend | FastAPI |
| Parsing | pymupdf4llm |
| Chunking | LangChain MarkdownTextSplitter |
| Embedding | sentence-transformers/all-MiniLM-L6-v2 |
| Vector DB | Qdrant via Docker |
git clone https://github.com/yourname/semantica.git
cd semantica
pip install -r requirements.txt
docker run -p 6333:6333 qdrant/qdrant
fastapi dev main.py
Then open the Swagger UI at:
📍 http://localhost:8000/docs
Uploads a PDF file, parses it, splits the content into chunks, and stores the embedded chunks in Qdrant.
Send a semantic query and receive the most relevant chunks. Example request:
{
  "query": "Does this PDF mention the 'fun' keyword?"
}
Example response:
[
  {
    "score": 0.92,
    "text": "This is a simple PDF file. Fun fun fun.",
    "source_file": "sample.pdf",
    "chunk_id": 1
  }
]
- LLM-based answer generation
- Multi-document support
- Frontend interface for document search (possibly as a separate project)
Pull requests, feedback, and ideas are always welcome. If you find this project useful, feel free to ⭐️ the repo.
MIT