Echo - Local RAG Chatbot

A local retrieval-augmented generation (RAG) chatbot implementation using quantized language models and CPU-based inference. Echo provides conversational AI capabilities with persistent memory and knowledge retrieval, running entirely offline.

Features
- Local Execution: Runs completely offline using GGUF quantized models via llama-cpp-python
- Retrieval-Augmented Generation: TF-IDF based document retrieval for context-aware responses
- Persistent Memory: Conversation history with automatic summarization
- CPU Optimized: Designed for efficient CPU inference without GPU requirements
- Web Interface: Streamlit-based UI with configurable parameters
- Anti-Hallucination Measures: Prompt engineering and response validation to reduce model hallucination
Requirements
- Python 3.8+
- 16GB RAM recommended (minimum 8GB)
- CPU with 4+ cores recommended
- Approximately 5GB disk space for model storage
A privacy-first conversational AI that runs entirely on your local machine using GGUF quantized models. No cloud services, no API costs, no data collection.
This project demonstrates practical ML engineering skills:
- RAG Implementation: Custom retrieval system using TF-IDF and cosine similarity for document-grounded responses
- CPU-Optimized Inference: Uses llama-cpp-python for efficient CPU inference with GGUF quantized models
- Modern UI: Streamlit interface with proper caching, session state management, and real-time updates
- Conversation Memory: Persistent chat history with automatic compression to manage context windows
- Error Handling: Robust retry logic with exponential backoff for model loading and generation
- Performance Optimization: Memory-efficient caching prevents model reloading across sessions
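The retry behaviour listed above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in echo_rag.py: the name `retry_with_backoff`, its parameters, and the injectable `sleep` argument are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying on any exception with doubling delays.

    Hypothetical helper: the real project may catch narrower exception
    types or use different delay values.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

Model loading or generation would be passed in as a zero-argument callable, for example `retry_with_backoff(lambda: load_model(path))`, where `load_model` stands in for whatever loader the project uses.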
- Completely local processing - your conversations stay private
- RAG-powered responses using custom knowledge base
- Persistent conversation memory with automatic summarization
- CPU-optimized with configurable threading
- Clean Streamlit web interface
- CLI mode for testing and automation
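As a rough illustration of the automatic summarization feature, the sketch below folds older messages into a running summary once the history grows past a limit. The memory layout (a dict with `summary` and `messages` keys), the function names, and the truncating `naive_summary` placeholder are assumptions; the actual structure of echo_memory.json and the summarization method may differ.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def naive_summary(messages: List[Message], max_chars: int = 200) -> str:
    """Hypothetical placeholder: join the messages and truncate the result."""
    text = " ".join(f"{m['role']}: {m['content']}" for m in messages)
    return text[:max_chars]


def compress_history(
    memory: dict,
    keep_recent: int = 6,
    summarize: Callable[[List[Message]], str] = naive_summary,
) -> dict:
    """Fold all but the newest keep_recent messages into memory['summary']."""
    messages = memory.get("messages", [])
    if len(messages) <= keep_recent:
        return memory  # nothing to compress yet
    split = len(messages) - keep_recent
    old, recent = messages[:split], messages[split:]
    prior = memory.get("summary", "")
    if prior:
        # Carry the earlier summary forward so older context is not lost.
        old = [{"role": "summary", "content": prior}] + old
    return {"summary": summarize(old), "messages": recent}
```

Keeping the summarizer pluggable means the truncation placeholder can be swapped for a model-generated summary without changing the compression logic.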
- Python 3.10+
- 8GB RAM minimum (16GB recommended)
- GGUF model file (Phi-2 Q4_K_M or similar)
- Windows, Linux, or macOS
git clone https://github.com/MckAnissa/rag-echo-v2.git
cd rag-echo-v2
python -m venv venv
Windows (PowerShell):
.\venv\Scripts\Activate.ps1
Linux/Mac:
source venv/bin/activate
pip install -r requirements.txt
Download Phi-2 Q4_K_M from HuggingFace: https://huggingface.co/TheBloke/phi-2-GGUF
Place the .gguf file in the project root directory.
CLI mode:
python echo_rag.py --model-path phi-2.Q4_K_M.gguf
Streamlit UI:
streamlit run echo_streamlit.py
CLI options:
python echo_rag.py --model-path <path> --n-threads 4 --n-ctx 2048
- --model-path: Path to GGUF model file
- --n-threads: Number of CPU threads (default: 4)
- --n-ctx: Context window size (default: 2048)
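For reference, the three options could be declared with argparse roughly as follows. This is a sketch: the flag names and defaults come from this README, while the `build_parser` function name and the description string are assumptions and the parser in echo_rag.py may define more arguments.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options documented in this README."""
    parser = argparse.ArgumentParser(description="Echo local RAG chatbot (CLI sketch)")
    parser.add_argument("--model-path", type=str, default="phi-2.Q4_K_M.gguf",
                        help="Path to GGUF model file")
    parser.add_argument("--n-threads", type=int, default=4,
                        help="Number of CPU threads")
    parser.add_argument("--n-ctx", type=int, default=2048,
                        help="Context window size")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # argparse converts dashes to underscores in attribute names.
    print(args.model_path, args.n_threads, args.n_ctx)
```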
To change the default model path, edit line 618 in echo_rag.py:
parser.add_argument("--model-path", type=str, default="phi-2.Q4_K_M.gguf", ...)
Project structure:
rag-echo-v2/
├── echo_rag.py # Core bot logic and CLI interface
├── echo_streamlit.py # Streamlit web UI with caching
├── requirements.txt # Python dependencies
├── README.md # This file
├── .gitignore # Git ignore rules
└── echo_memory.json # Auto-generated conversation history
- Document Retrieval: TF-IDF vectorization finds relevant knowledge base entries
- Context Building: Combines retrieved documents with conversation history
- Generation: llama-cpp-python performs efficient CPU inference on GGUF models
- Memory Management: Automatically compresses old conversations into summaries
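The retrieval and context-building steps above can be sketched with nothing but the standard library. This is a minimal illustration of TF-IDF ranking with cosine similarity, not the project's implementation: the class name `TfidfRetriever`, the smoothed IDF formula, the tokenizer, and the prompt layout in `build_prompt` are all assumptions.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TfidfRetriever:
    """Hypothetical retriever: ranks documents by TF-IDF cosine similarity."""

    def __init__(self, documents: List[str]) -> None:
        self.documents = documents
        tokenized = [tokenize(d) for d in documents]
        n_docs = len(documents)
        doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
        # Smoothed IDF so terms found in every document keep a positive weight.
        self.idf: Dict[str, float] = {
            t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()
        }
        self.doc_vectors = [self._vectorize(tokens) for tokens in tokenized]

    def _vectorize(self, tokens: List[str]) -> Dict[str, float]:
        """Build an L2-normalized TF-IDF vector, ignoring unknown terms."""
        counts = Counter(t for t in tokens if t in self.idf)
        vec = {t: c * self.idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def search(self, query: str, top_k: int = 2) -> List[Tuple[float, str]]:
        """Return up to top_k (score, document) pairs with a nonzero score."""
        q = self._vectorize(tokenize(query))
        scored = []
        for doc, vec in zip(self.documents, self.doc_vectors):
            # Vectors are unit length, so the dot product is the cosine similarity.
            score = sum(w * vec.get(t, 0.0) for t, w in q.items())
            if score > 0:
                scored.append((score, doc))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


def build_prompt(question: str, retrieved: List[Tuple[float, str]], history: List[str]) -> str:
    """Combine retrieved entries and recent history into one prompt (assumed layout)."""
    context = "\n".join(f"- {doc}" for _, doc in retrieved) or "- (no relevant entries)"
    past = "\n".join(history)
    return f"Context:\n{context}\n\nConversation:\n{past}\n\nUser: {question}\nEcho:"
```

The resulting prompt string is what would be handed to the model for generation. Returning no results for a query with no matching terms lets the caller fall back to an ungrounded answer or a refusal.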
Typical response times on a modern CPU:
- Model loading: 5-10 seconds (first run only)
- Per message: 10-30 seconds depending on length
- Memory usage: 4-6GB RAM
Echo includes curated knowledge on:
- Ethics and moral philosophy
- Human rights and political systems
- Animal welfare and rights
- Technology and AI ethics
- Personal identity and consciousness
- Religion and spirituality
- Environmental issues
Verify the .gguf file path:
dir *.gguf # Windows
ls *.gguf # Linux/Mac
Use the exact filename (case-sensitive):
python echo_rag.py --model-path "phi-2.Q4_K_M.gguf"
On Windows, ensure Visual Studio Build Tools are installed.
Try upgrading:
pip install --upgrade llama-cpp-python
- Reduce context window: --n-ctx 1024
- Use smaller quantization (Q4_0 instead of Q4_K_M)
- Close other applications
- Use a machine with more RAM
This is normal on CPU. To improve:
- Reduce max_tokens in generation settings
- Use fewer CPU threads if the system is overloaded
- Consider using a GPU-enabled machine for real-time responses
- Add embeddings-based retrieval (FAISS/ChromaDB)
- Implement streaming responses
- Add PDF/DOCX document upload
- GPU support with automatic CUDA detection
- Docker container for easy deployment
- REST API wrapper
- Conversation export functionality
Contributions, issues, and feature requests are welcome.
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Open a Pull Request
MIT License - see LICENSE file for details.
- Built with Streamlit for the web interface
- Uses llama-cpp-python for efficient CPU inference
- Models from HuggingFace (Phi-2 and others)
- Inspired by the need for private, local AI assistants
Built by Anissa McKnight as a personal learning project exploring RAG systems, local LLM deployment, and conversational AI.
Contact:
- Email: MckAnissa@proton.me
- GitHub: @MckAnissa
If you found this project interesting, consider giving it a star.