ClipSearch is a production-ready, asynchronous document search engine designed for high-performance indexing and AI-powered summarization of PDF and TXT files. Built with a cloud-native architecture, it leverages a distributed pipeline to process documents and provide instant, searchable insights.
- Asynchronous Processing: S3-triggered events handled via SQS and dedicated workers.
- AI Summarization: Automated generation of 2-3 bullet point summaries using local LLMs (TinyLlama/Ollama).
- Full-Text Search: High-performance indexing and retrieval powered by Elasticsearch.
- Enterprise-Ready: Fully containerized and optimized for OpenShift/Kubernetes deployment.
- Clean UI: Modern Angular-based frontend for seamless file uploads and search.
- Ingestion: Users upload PDF/TXT files via the API, which stores them in S3.
- Messaging: An event is pushed to an SQS queue to trigger background processing.
- Extraction: Workers pull the event, extract text using Apache Tika, and request a summary from the AI engine.
- Indexing: The metadata, extracted text, and AI summary are indexed into Elasticsearch for real-time searching.
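The four stages above can be sketched end to end with in-memory stand-ins. This is a minimal illustration, not the project's code: `PipelineSketch`, `IndexedDoc`, and every method name are hypothetical. Plain maps and a queue replace S3, SQS, and Elasticsearch, the Tika extraction step is reduced to UTF-8 decoding, and the summarizer is passed in as a function.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Queue;
import java.util.UUID;
import java.util.function.Function;

/** In-memory sketch of the four pipeline stages; all names are hypothetical. */
class PipelineSketch {

    /** What ends up in the index: metadata, extracted text, and summary. */
    static final class IndexedDoc {
        final String uploadId, filename, text, summary;

        IndexedDoc(String uploadId, String filename, String text, String summary) {
            this.uploadId = uploadId;
            this.filename = filename;
            this.text = text;
            this.summary = summary;
        }
    }

    private final Map<String, byte[]> objectStore = new HashMap<>();  // stands in for S3
    private final Map<String, String> filenames = new HashMap<>();    // upload metadata
    private final Queue<String> queue = new ArrayDeque<>();           // stands in for SQS
    private final Map<String, IndexedDoc> index = new HashMap<>();    // stands in for Elasticsearch
    private final Function<String, String> summarizer;                // stands in for the LLM call

    PipelineSketch(Function<String, String> summarizer) {
        this.summarizer = summarizer;
    }

    /** Ingestion and messaging: store the file, then enqueue an event with its ID. */
    String upload(String filename, byte[] content) {
        String uploadId = UUID.randomUUID().toString();
        objectStore.put(uploadId, content);
        filenames.put(uploadId, filename);
        queue.add(uploadId);
        return uploadId;
    }

    /** Extraction and indexing: handle one queued event; returns false if the queue is empty. */
    boolean processNext() {
        String uploadId = queue.poll();
        if (uploadId == null) {
            return false;
        }
        // A real worker would extract text with Apache Tika; this sketch decodes UTF-8 bytes.
        String text = new String(objectStore.get(uploadId), StandardCharsets.UTF_8);
        String summary = summarizer.apply(text);
        index.put(uploadId, new IndexedDoc(uploadId, filenames.get(uploadId), text, summary));
        return true;
    }

    /** Case-insensitive substring match over the indexed text. */
    List<IndexedDoc> search(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<IndexedDoc> hits = new ArrayList<>();
        for (IndexedDoc doc : index.values()) {
            if (doc.text.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        PipelineSketch pipeline = new PipelineSketch(text -> "- " + text.length() + " characters");
        pipeline.upload("notes.txt", "Quarterly report draft".getBytes(StandardCharsets.UTF_8));
        System.out.println("before worker: " + pipeline.search("report").size());
        pipeline.processNext();
        System.out.println("after worker: " + pipeline.search("report").size());
    }
}
```

The point of the sketch is the decoupling: `upload` returns as soon as the file is stored and the event is queued, and a document only becomes searchable once a worker has run `processNext`.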
graph LR
User((User)) -->|Upload/Search| FE[Angular Frontend]
FE -->|REST API| API[Quarkus API]
API -->|1. Store| S3[(S3 Storage)]
API -->|2. Notify| SQS[SQS Queue]
SQS -->|3. Trigger| Worker[Quarkus Worker]
Worker -->|4. Summarize| LLM[Ollama/TinyLlama]
Worker -->|5. Index| ES[(Elasticsearch)]
API -->|Search| ES
- Backend: Java 17, Quarkus, LangChain4j, Apache Tika.
- Frontend: Angular 19, Tailwind CSS.
- Infrastructure: Elasticsearch, LocalStack (S3/SQS).
- AI Engine: Ollama / Red Hat OpenShift AI.
- Deployment: Docker Compose, OpenShift (Kustomize).
- Prerequisites: Docker and Docker Compose installed.
- Start Services:
docker-compose up -d
- Access UI: Open http://localhost:4200 in your browser.
# Backend
mvn -f backend/pom.xml clean package -pl api,worker -am
# Frontend
cd frontend && npm install && npm run build
- Knowledge Management: Quick indexing of internal documentation and research papers.
- Automated Summarization: Fast-tracking document review with AI-generated snippets.
- Searchable Archives: Converting large volumes of static files into a searchable database.
The project uses GitHub Actions (.github/workflows/ci.yml) to:
- Build Java components with Maven.
- Build Docker images for API, Worker, and Frontend.
- Push images to ghcr.io/dawidbera/clipsearch-*.
- 405 on Uploads: Consistently use UploadResource for all upload-related logic.
- Single Search Result: Ensure uploadId is used as the Elasticsearch _id to avoid duplicates or overwrites.
- S3 Connectivity: Use path-style access and ensure endpoints are correctly resolved (internal vs external).
- Ollama Connection: Ensure Ollama is running and accessible (check llm-standalone logs on OpenShift).
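
The document ID advice above can be illustrated with a small sketch. It assumes the worker may see the same event more than once (queues with at-least-once delivery can redeliver) and uses a plain map in place of an Elasticsearch index, since indexing under an existing _id replaces the stored document much as a map put does. `IndexIdSketch` and `indexAll` are hypothetical names, not part of the project.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
import java.util.function.Function;

/** Compares document ID strategies; the map stands in for an Elasticsearch index. */
class IndexIdSketch {

    /** Indexes one document per event under the ID chosen by idFor; returns the document count. */
    static int indexAll(String[] uploadIds, Function<String, String> idFor) {
        Map<String, String> index = new HashMap<>();
        for (String uploadId : uploadIds) {
            // Writing to an existing key replaces the earlier document.
            index.put(idFor.apply(uploadId), "doc for " + uploadId);
        }
        return index.size();
    }

    public static void main(String[] args) {
        // Two uploads; the event for "u1" arrives twice.
        String[] events = {"u1", "u2", "u1"};
        System.out.println("constant id: " + indexAll(events, u -> "doc"));
        System.out.println("random id:   " + indexAll(events, u -> UUID.randomUUID().toString()));
        System.out.println("uploadId:    " + indexAll(events, u -> u));
    }
}
```

With a constant ID, every upload overwrites the previous one and a search can only ever return a single document. With a fresh random ID per event, the redelivered event produces a duplicate. Keying on uploadId leaves exactly one document per upload.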
