GitHub - MckAnissa/echo-rag-chatbot: Personal project. Local RAG chatbot using Mistralv0.2/TinyLlama with TF-IDF retrieval. Streamlit interface for CPU-optimized inference without GPU requirements.

Echo - Local RAG Chatbot A local retrieval-augmented generation (RAG) chatbot implementation using quantized language models and CPU-based inference. Echo provides conversational AI capabilities with persistent memory and knowledge retrieval, running entirely offline. Features

Local Execution: Runs completely offline using GGUF quantized models via llama-cpp-python Retrieval-Augmented Generation: TF-IDF based document retrieval for context-aware responses Persistent Memory: Conversation history with automatic summarization CPU Optimized: Designed for efficient CPU inference without GPU requirements Web Interface: Streamlit-based UI with configurable parameters Anti-Hallucination Measures: Prompt engineering and response validation to reduce model hallucination

Requirements

Python 3.8+ 16GB RAM recommended (minimum 8GB) CPU with 4+ cores recommended Approximately 5GB disk space for model storage

A privacy-first conversational AI that runs entirely on your local machine using GGUF quantized models. No cloud services, no API costs, no data collection.

Technical Highlights

This project demonstrates practical ML engineering skills:

RAG Implementation: Custom retrieval system using TF-IDF and cosine similarity for document-grounded responses
CPU-Optimized Inference: Uses llama-cpp-python for efficient CPU inference with GGUF quantized models
Modern UI: Streamlit interface with proper caching, session state management, and real-time updates
Conversation Memory: Persistent chat history with automatic compression to manage context windows
Error Handling: Robust retry logic with exponential backoff for model loading and generation
Performance Optimization: Memory-efficient caching prevents model reloading across sessions

Features

Completely local processing - your conversations stay private
RAG-powered responses using custom knowledge base
Persistent conversation memory with automatic summarization
CPU-optimized with configurable threading
Clean Streamlit web interface
CLI mode for testing and automation

Requirements

Python 3.10+
8GB RAM minimum (16GB recommended)
GGUF model file (Phi-2 Q4_K_M or similar)
Windows, Linux, or macOS

Quick Start

1. Clone and setup

git clone https://github.com/MckAnissa/rag-echo-v2.git
cd rag-echo-v2
python -m venv venv

2. Activate virtual environment

Windows (PowerShell):

.\venv\Scripts\Activate.ps1

Linux/Mac:

source venv/bin/activate

3. Install dependencies

pip install -r requirements.txt

4. Download a GGUF model

Download Phi-2 Q4_K_M from HuggingFace: https://huggingface.co/TheBloke/phi-2-GGUF

Place the .gguf file in the project root directory.

5. Run

CLI mode:

python echo_rag.py --model-path phi-2.Q4_K_M.gguf

Streamlit UI:

streamlit run echo_streamlit.py

Configuration

CLI Arguments

python echo_rag.py --model-path <path> --n-threads 4 --n-ctx 2048

--model-path: Path to GGUF model file
--n-threads: Number of CPU threads (default: 4)
--n-ctx: Context window size (default: 2048)

Setting Default Model Path

Edit line 618 in echo_rag.py:

parser.add_argument("--model-path", type=str, default="phi-2.Q4_K_M.gguf", ...)

Project Structure

rag-echo-v2/
├── echo_rag.py          # Core bot logic and CLI interface
├── echo_streamlit.py    # Streamlit web UI with caching
├── requirements.txt     # Python dependencies
├── README.md           # This file
├── .gitignore          # Git ignore rules
└── echo_memory.json    # Auto-generated conversation history

How It Works

Document Retrieval: TF-IDF vectorization finds relevant knowledge base entries
Context Building: Combines retrieved documents with conversation history
Generation: llama-cpp-python performs efficient CPU inference on GGUF models
Memory Management: Automatically compresses old conversations into summaries

Performance

Typical response times on modern CPU:

Model loading: 5-10 seconds (first run only)
Per message: 10-30 seconds depending on length
Memory usage: 4-6GB RAM

Built-in Knowledge Base

Echo includes curated knowledge on:

Ethics and moral philosophy
Human rights and political systems
Animal welfare and rights
Technology and AI ethics
Personal identity and consciousness
Religion and spirituality
Environmental issues

Troubleshooting

Model not found

Verify the .gguf file path:

dir *.gguf  # Windows
ls *.gguf   # Linux/Mac

Use the exact filename (case-sensitive):

python echo_rag.py --model-path "phi-2.Q4_K_M.gguf"

llama-cpp-python installation issues

On Windows, ensure Visual Studio Build Tools are installed.

Try upgrading:

pip install --upgrade llama-cpp-python

Out of memory

Reduce context window: --n-ctx 1024
Use smaller quantization (Q4_0 instead of Q4_K_M)
Close other applications
Use a machine with more RAM

Slow responses

This is normal on CPU. To improve:

Reduce max_tokens in generation settings
Use fewer CPU threads if system is overloaded
Consider using a GPU-enabled machine for real-time responses

Future Improvements

Add embeddings-based retrieval (FAISS/ChromaDB)
Implement streaming responses
Add PDF/DOCX document upload
GPU support with automatic CUDA detection
Docker container for easy deployment
REST API wrapper
Conversation export functionality

Contributing

Contributions, issues, and feature requests are welcome.

Fork the repository
Create a feature branch
Commit your changes
Push to the branch
Open a Pull Request

License

MIT License - see LICENSE file for details.

Acknowledgments

Built with Streamlit for the web interface
Uses llama-cpp-python for efficient CPU inference
Models from HuggingFace (Phi-2 and others)
Inspired by the need for private, local AI assistants

About

Built by Anissa McKnight as a personal learning project exploring RAG systems, local LLM deployment, and conversational AI.

Contact:

Email: MckAnissa@proton.me
GitHub: @MckAnissa

If you found this project interesting, consider giving it a star.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Technical Highlights

Features

Requirements

Quick Start

1. Clone and setup

2. Activate virtual environment

3. Install dependencies

4. Download a GGUF model

5. Run

Configuration

CLI Arguments

Setting Default Model Path

Project Structure

How It Works

Performance

Built-in Knowledge Base

Troubleshooting

Model not found

llama-cpp-python installation issues

Out of memory

Slow responses

Future Improvements

Contributing

License

Acknowledgments

About

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.gitignore		.gitignore
License		License
README.md		README.md
echo_rag.py		echo_rag.py
echo_streamlit.py		echo_streamlit.py
echo_streamlit.py.backup		echo_streamlit.py.backup
model_server.py		model_server.py
profile.txt		profile.txt
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Technical Highlights

Features

Requirements

Quick Start

1. Clone and setup

2. Activate virtual environment

3. Install dependencies

4. Download a GGUF model

5. Run

Configuration

CLI Arguments

Setting Default Model Path

Project Structure

How It Works

Performance

Built-in Knowledge Base

Troubleshooting

Model not found

llama-cpp-python installation issues

Out of memory

Slow responses

Future Improvements

Contributing

License

Acknowledgments

About

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages