A lightweight, Retrieval-Augmented Generation (RAG) chatbot that answers only from your local knowledge base (and optional URL sources). HarpyBot runs as a Gradio web app, cites the sources it used, and clearly says “I don’t know” when the answer isn’t supported by the knowledge base.
➡️ HarpyBot on Hugging Face: https://programmeratlarge-harpybot.hf.space
Tip: For fastest startup on Spaces, deploy only the prebuilt index cache (see “Cache-only deploy”) and set `TRUST_CACHE=1`.
- 🔎 RAG over your files: PDFs, DOCX, TXT/MD/HTML, CSV/XLSX (+ optional URLs via `knowledge_base/url_list.txt`)
- 🧩 Fast startup: chunking + embeddings cached under `knowledge_base/.index_cache/`
- 🧠 Session memory: keeps a short recent chat history while staying grounded to retrieved context
- 📝 Prompt overrides: drop a `prompts.yaml` with `system_prompt:` to customize tone/behavior
- 🖼️ Custom header: optional `header.md` “hero” box (dark/light aware)
- 🚀 Cache-only deploy: ship just `.index_cache/` to Hugging Face (no raw documents) with `TRUST_CACHE=1`
HarpyBot scans a folder (default: `./knowledge_base/`) and optionally a list of URLs (`knowledge_base/url_list.txt`), chunks and embeds the text, and serves a chatbot UI. When you ask a question, it retrieves the most relevant chunks and generates an answer only from those chunks. If it can’t support an answer with retrieved text, it responds:

> I don't know — I can only answer questions about information in my knowledge base.
Every answer includes source numbers (file path and page/sheet when available).
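
Conceptually, retrieval is a nearest-neighbor search over the cached chunk embeddings, with a refusal fallback when nothing scores high enough. A minimal sketch of that logic, assuming cosine similarity (the function names, top-k of 4, and 0.25 cutoff are illustrative assumptions, not HarpyBot’s actual internals):

```python
import numpy as np

REFUSAL = ("I don't know — I can only answer questions about "
           "information in my knowledge base.")

def retrieve(query_vec, embeddings, k=4, min_score=0.25):
    """Indices of the k chunks most similar to the query embedding.

    `embeddings` is the (n_chunks, dim) matrix from the index cache;
    k=4 and the 0.25 cutoff are illustrative, not HarpyBot's settings.
    """
    # Cosine similarity via dot products of L2-normalized vectors.
    q = query_vec / np.linalg.norm(query_vec)
    m = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = m @ q
    top = np.argsort(scores)[::-1][:k]
    return [int(i) for i in top if scores[i] >= min_score]

def grounded_context(query_vec, embeddings, texts):
    """Context to hand the chat model, or the refusal if nothing matches."""
    hits = retrieve(query_vec, embeddings)
    if not hits:
        return REFUSAL  # nothing in the KB supports an answer
    return "\n\n".join(texts[i] for i in hits)
```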
1. Put your documents under:

   ```
   knowledge_base/
   ├─ subfolder1/
   ├─ subfolder2/
   └─ ...
   ```

   Supported: `.pdf`, `.docx`, `.txt`, `.md`, `.html`, `.htm`, `.csv`, `.xlsx`, `.xls`

2. (Optional) Add URLs (one per line) to `knowledge_base/url_list.txt`:

   ```
   https://example.com/whitepaper.html
   https://example.com/notes.pdf
   ```

3. (Optional) Override the system prompt with `prompts.yaml` (next to the app):

   ```yaml
   system_prompt: "You are Harpy Bot, a cautious research assistant. Answer ONLY using the provided knowledge base context. If the answer is not clearly supported by the context, say: I don't know - I can only answer questions about information in my knowledge base. Be concise. Always include source numbers at the end if any are used."
   ```

4. (Optional) Add a header hero via `header.md` (same folder as the app). Reference images with relative paths (e.g., ``).

5. Build the index (first time only or when docs change), then launch (what “building the index” actually does is sketched after this list):

   ```bash
   # 1) Set your OpenAI key
   export OPENAI_API_KEY=sk-...
   # Windows PowerShell: $env:OPENAI_API_KEY="sk-..."

   # 2) Install deps
   pip install -r requirements.txt

   # 3) Run (rebuild index on first run)
   python app.py --kb ./knowledge_base --rebuild --host 0.0.0.0 --port 7860
   ```

6. Open the URL shown in the terminal (default `http://127.0.0.1:7860`), type a question, and click Send.
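
Under the hood, “building the index” amounts to chunking each document, embedding the chunks, and writing the results to `knowledge_base/.index_cache/`. A rough sketch of that flow, assuming token-based chunking and OpenAI embeddings (the model name, chunk size, and helper names are assumptions; only the cache file names come from the layout shown later):

```python
import pickle
from pathlib import Path

import numpy as np
import tiktoken
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
enc = tiktoken.get_encoding("cl100k_base")

def chunk(text: str, max_tokens: int = 500) -> list[str]:
    """Split text into ~max_tokens-token pieces (illustrative sizing)."""
    ids = enc.encode(text)
    return [enc.decode(ids[i:i + max_tokens])
            for i in range(0, len(ids), max_tokens)]

def build_index(docs: dict[str, str],
                cache_dir: str = "knowledge_base/.index_cache"):
    """`docs` maps a source path to its extracted text (hypothetical shape)."""
    texts, metadatas = [], []
    for path, raw in docs.items():
        for piece in chunk(raw):
            texts.append(piece)
            metadatas.append({"source": path})
    # Embed every chunk in one call (real code would batch large inputs).
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    embeddings = np.array([d.embedding for d in resp.data], dtype=np.float32)
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    np.save(cache / "embeddings.npy", embeddings)
    (cache / "texts.pkl").write_bytes(pickle.dumps(texts))
    (cache / "metadatas.pkl").write_bytes(pickle.dumps(metadatas))
```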
**You:**

> What does the pipeline recommend for demultiplexing, and where is that described?

**HarpyBot:**

> The pipeline recommends using X for demultiplexing and provides a step-by-step guide.
>
> Sources:
> [1] knowledge_base/pipeline_guide.pdf (page 7)
> [2] knowledge_base/notes/demux.md
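
The numbered sources in an answer are assembled from each retrieved chunk’s metadata (file path plus page or sheet when the loader recorded one). A small illustration of that formatting, with hypothetical metadata keys:

```python
def format_sources(metadatas: list[dict]) -> str:
    """Render "[n] path (page p)" lines from retrieved-chunk metadata.

    The "source"/"page"/"sheet" keys are assumed names for illustration.
    """
    lines = []
    for n, meta in enumerate(metadatas, start=1):
        loc = meta["source"]
        if "page" in meta:
            loc += f" (page {meta['page']})"
        elif "sheet" in meta:
            loc += f" (sheet {meta['sheet']})"
        lines.append(f"[{n}] {loc}")
    return "Sources:\n" + "\n".join(lines)
```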
- Python 3.10+ (3.11/3.12 fine)
- pip
- OpenAI API key in `OPENAI_API_KEY`
```text
gradio>=4
openai>=1
python-dotenv
pyyaml
tiktoken
numpy

# Only needed if you'll (re)build the index on this machine:
pypdf
python-docx
beautifulsoup4
pandas
openpyxl
```
```bash
export OPENAI_API_KEY=sk-...   # Windows PS: $env:OPENAI_API_KEY="sk-..."
pip install -r requirements.txt

# First run with rebuild to create the index
python app.py --kb ./knowledge_base --rebuild

# Later runs can skip rebuild:
python app.py --kb ./knowledge_base
```

Useful flags

- `--kb PATH` – path to knowledge base (default `knowledge_base`)
- `--rebuild` – force re-indexing
- `--host` / `--port` – server options
- `--trust-cache` – cache-only mode: load `.index_cache` and skip fingerprint/rebuild

Environment variables

- `OPENAI_API_KEY` – required for chat (and embeddings if rebuilding)
- `TRUST_CACHE=1` – same as `--trust-cache`; see the sketch below for how the two can be wired together
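
One plausible way to pair the flag and the variable is to let the environment supply the default for the CLI switch; a sketch (the argument names mirror the flags above, the parsing code itself is assumed):

```python
import argparse
import os

parser = argparse.ArgumentParser(description="HarpyBot")
parser.add_argument("--kb", default="knowledge_base", help="path to knowledge base")
parser.add_argument("--rebuild", action="store_true", help="force re-indexing")
parser.add_argument("--host", default="127.0.0.1")
parser.add_argument("--port", type=int, default=7860)
parser.add_argument(
    "--trust-cache",
    action="store_true",
    # TRUST_CACHE=1 in the environment acts as the default for the flag
    default=os.getenv("TRUST_CACHE") == "1",
    help="cache-only mode: load .index_cache, skip fingerprint/rebuild",
)
args = parser.parse_args()  # args.trust_cache is True via flag or env var
```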
Deploy with no raw documents—just the prebuilt cache:
```
knowledge_base/.index_cache/
├─ embeddings.npy
├─ texts.pkl
├─ metadatas.pkl
└─ fingerprint.json   # optional in trust-cache mode
```
Run with:
```bash
export OPENAI_API_KEY=sk-...
export TRUST_CACHE=1
python app.py --kb ./knowledge_base
```

On Hugging Face Spaces, set:

- Secret: `OPENAI_API_KEY`
- Variable: `TRUST_CACHE=1`
If `embeddings.npy` exceeds 100 MB, use Git LFS:

```bash
git lfs install
git lfs track "knowledge_base/.index_cache/embeddings.npy"
git add .gitattributes knowledge_base/.index_cache/*
git commit -m "Add prebuilt index cache"
git push
```
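
In trust-cache mode, the app can skip every document loader and simply deserialize the three cache files. A hedged sketch of such a loader (file names match the tree above; the function itself is illustrative):

```python
import pickle
from pathlib import Path

import numpy as np

def load_cache(kb: str = "knowledge_base"):
    """Load the prebuilt index without touching any raw documents."""
    cache = Path(kb) / ".index_cache"
    embeddings = np.load(cache / "embeddings.npy")
    texts = pickle.loads((cache / "texts.pkl").read_bytes())
    metadatas = pickle.loads((cache / "metadatas.pkl").read_bytes())
    # Sanity check: one embedding per stored chunk.
    assert embeddings.shape[0] == len(texts) == len(metadatas)
    return embeddings, texts, metadatas
```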
- v1.2.0 — 2025-09-18
  - `prompts.yaml` override for system prompt
  - Dark/Light header styling + customizable hero card
  - Footer controls (Index Status + “Refresh Knowledge Base”), compact status field
  - “Send” emphasized (`variant="primary"`)
  - Cache-only deploy option (`--trust-cache` / `TRUST_CACHE=1`)
- v1.1.0 — 2025-09-11
  - URL ingestion via `knowledge_base/url_list.txt` (HTML/PDF/text)
  - Robust header image handling
- v1.0.0 — 2025-09-10
  - Initial RAG chatbot with local KB indexing, citations, session memory
HarpyBot is maintained by the Cornell Genomics Innovation Hub (GIH). © 2025 Cornell Genomics Innovation Hub. All rights reserved.
```
.
├─ app.py             # main app
├─ prompts.yaml       # optional prompt override
├─ header.md          # optional hero header (images in same dir)
├─ requirements.txt
└─ knowledge_base/
   ├─ .index_cache/   # embeddings + chunk store (generated or shipped)
   ├─ url_list.txt    # optional URL list (one per line)
   ├─ docs/...        # your documents (optional if cache-only deploy)
   └─ ...
```
Provide `prompts.yaml` with a `system_prompt` key:

```yaml
system_prompt: "You are Harpy Bot, a cautious research assistant. Answer ONLY using the provided knowledge base context. If the answer is not clearly supported by the context, say: I don't know - I can only answer questions about information in my knowledge base. Be concise. Always include source numbers at the end if any are used."
```

If `prompts.yaml` is missing, a generic, safe prompt is used.
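
Loading the override takes only a few lines with pyyaml. A sketch of the fallback behavior described above (the default prompt string here is a placeholder, not the app’s actual generic prompt):

```python
from pathlib import Path

import yaml

# Placeholder: the app ships its own generic, safe default prompt.
DEFAULT_PROMPT = "Answer only from the provided knowledge base context."

def load_system_prompt(path: str = "prompts.yaml") -> str:
    p = Path(path)
    if not p.exists():
        return DEFAULT_PROMPT
    data = yaml.safe_load(p.read_text(encoding="utf-8")) or {}
    return data.get("system_prompt", DEFAULT_PROMPT)
```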
Create `header.md` next to `app.py`. Reference images with relative paths:

```markdown


# Harpy — Knowledge-Base Research Assistant
```

The app’s CSS targets a `.hero-box`/`#hero-box` wrapper to look good in both light and dark modes.
