thisKB is a personal knowledge base application that allows you to manage documents, parse their contents, and search through them efficiently using both traditional text search and semantic vector search.
thisKB lets users upload and store documents, parses them for meaningful content, and then supports search and other tasks over that content. It pairs a simple document-management interface with powerful search capabilities under the hood.
- Document Upload: Upload documents through UI and API
- Document Parsing: Automatically extract meaningful content from various document formats
- Semantic Search: Search documents using both text-based and semantic vector search
- Multi-lingual Support: Process and search documents in multiple languages
- Chat Interface: Interact with your knowledge base conversationally
- Backend: Django with Pydantic AI
- Frontend: HTMX + UIkit for a responsive interface
- Databases:
  - PostgreSQL/ParadeDB for data storage with pg_search and pgvector
- Processing:
  - Celery + Redis for task management and caching
  - Extractous (Rust-based), Marker (PDF), and Magika for file processing
- Embeddings:
  - Jina Embeddings v3 (1024 dimensions, multilingual)
  - OpenAI text-embedding-3-small (1536 dimensions)
- Chunking: Chonkie for semantic chunking, including double-pass merging
- Web Scraping: crawl4ai, Jina Reader API/LLM
- Storage: S3/MinIO for document storage
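
The feature list mentions combining text-based and semantic vector search, but does not say how the two result sets are merged. As an illustration only, the sketch below uses reciprocal rank fusion (RRF), a common general-purpose technique; it is not confirmed to be what thisKB does. The function name, the document IDs, and the constant `k=60` are all assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, not a thisKB setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists: one from full-text search, one from vector search.
text_hits = ["doc-a", "doc-b", "doc-c"]
vector_hits = ["doc-b", "doc-d", "doc-a"]
fused = reciprocal_rank_fusion([text_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, while documents found by only one search method are still kept further down.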
- Clone the repository:

      git clone https://github.com/yourusername/thiskb.git
      cd thiskb
- Set up a virtual environment and install dependencies with uv:

      uv venv
      source .venv/bin/activate  # On Windows: .venv\Scripts\activate
      uv pip install -e .
- Set up environment variables in .env:

      DATABASE_URL=postgres://user:password@localhost/thiskb
      REDIS_URL=redis://localhost:6379
      S3_BUCKET=thiskb
      S3_ENDPOINT=http://localhost:9000
      S3_ACCESS_KEY=minioadmin
      S3_SECRET_KEY=minioadmin
- Run migrations:

      python manage.py migrate
- Start the development server:

      python manage.py runserver
- In a separate terminal, start the Celery worker:

      celery -A thiskb worker --loglevel=info
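
Before starting the server or the worker, it can help to confirm that the variables from the .env step are actually set. The sketch below is a minimal check, assuming the values have been exported into the process environment; how thisKB itself loads .env is not described here, and `missing_settings` is a hypothetical helper, not part of the project.

```python
import os

# Variable names taken from the .env example in the setup steps.
REQUIRED_VARS = (
    "DATABASE_URL",
    "REDIS_URL",
    "S3_BUCKET",
    "S3_ENDPOINT",
    "S3_ACCESS_KEY",
    "S3_SECRET_KEY",
)


def missing_settings(environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with a partial configuration: only the database and Redis are set.
example = {
    "DATABASE_URL": "postgres://user:password@localhost/thiskb",
    "REDIS_URL": "redis://localhost:6379",
}
print(missing_settings(example))
```

Calling `missing_settings()` with no argument checks the real environment, so an empty list means every variable from the example is present.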
[To be added]
Run the tests:

    python manage.py test

Format the code with Ruff:

    ruff format .
API documentation is available at /api/docs when the server is running.
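
A quick way to confirm the documentation page is being served is to request it from a script. The sketch below assumes the development server is listening on `http://127.0.0.1:8000`, which is Django's default for `runserver`; the helper names `docs_url` and `docs_available` are hypothetical and exist only for this example.

```python
from urllib.error import URLError
from urllib.parse import urljoin
from urllib.request import urlopen

# Assumed base URL: Django's runserver default. Adjust if your server differs.
BASE_URL = "http://127.0.0.1:8000"


def docs_url(base_url=BASE_URL):
    """Build the full URL of the API documentation page."""
    return urljoin(base_url.rstrip("/") + "/", "api/docs")


def docs_available(base_url=BASE_URL, timeout=5):
    """Return True if the docs page answers with HTTP 200, False otherwise."""
    try:
        with urlopen(docs_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        # Covers connection failures, timeouts, and HTTP error responses.
        return False
```

With the server running, `docs_available()` should return `True`; if it returns `False`, check that `python manage.py runserver` is still active and that the port matches.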
[To be added]
[To be added]
This project is licensed under the Apache License 2.0. See the LICENSE file for more information.