This digital assistant, inspired by Dr. Michael Greger & his team at NutritionFacts.org, was created to answer user questions about healthy eating and lifestyle choices. Drawing from over 1,200 well-researched blog posts since 2011, it provides science-backed insights to help users live a healthier, more informed life.
Start chatting with Dr. Greger's Digital Twin here.
- What problems does the chatbot try to mitigate
- How you can run & test the chatbot yourself
- How I built and evaluated this chatbot
- Personal project evaluation based on the criteria of the LLM-zoomcamp course
- Dataset used to build the chatbot
- Technologies used
The raw data used to build the RAG knowledge base is stored in `data/blog_posts/json`. It consists of all blog posts from https://nutritionfacts.org/blog/ (as of 28.08.2024). See the `notebooks/web_scraping.ipynb` notebook for more technical details on the web scraping process.
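As a rough illustration of reading the raw data back in, the sketch below parses every `.json` file in a directory. The helper name `load_posts` is hypothetical, and the field layout of the files is not described here, so the sketch only parses each file and makes no assumption about its keys.

```python
import json
from pathlib import Path


def load_posts(json_dir):
    """Parse every .json file in json_dir and return {file stem: parsed content}.

    Hypothetical helper: the schema of the blog post files is not assumed,
    so each file is simply loaded as generic JSON.
    """
    posts = {}
    for path in sorted(Path(json_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            posts[path.stem] = json.load(f)
    return posts
```

Calling `load_posts("data/blog_posts/json")` from the repository root would then give one entry per file, keyed by file name without the extension.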
The chatbot was built with the following technologies:
- Web Scraping: Beautiful Soup Library
- Text embeddings: pre-trained model `multi-qa-MiniLM-L6-cos-v1` from the Sentence Transformers Library, built with PyTorch and Hugging Face's Transformers Library
  - It was "tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs."
- Vector Store (aka Knowledge Base of RAG): LanceDB Library
- Information Retrieval (IR):
  - Full-text search (aka keyword search): Tantivy Library (based on BM25) (LanceDB Doc).
  - Vector search (aka nearest-neighbor search), metric: cosine similarity (LanceDB Doc).
  - Reranker: Linear Combination Reranker with a 30% weight for vector search (LanceDB Doc).
- LLM API: Groq Cloud (free tier)
- Web App: Streamlit Library
- Deployment: Streamlit Cloud (free tier)
- Database for User Data: MongoDB
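To make the retrieval setup above more concrete, here is a minimal plain-Python sketch of the idea behind hybrid search: score documents by cosine similarity, then blend that score with a keyword score using a 30% weight for the vector side. This is not LanceDB's implementation and does not call its API. The function names, the min-max normalization step, and the toy scores are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def min_max_normalize(scores):
    """Rescale scores to [0, 1] so both signals are comparable (assumed step)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(vector_scores, keyword_scores, vector_weight=0.3):
    """Blend both signals linearly and return (doc_id, score), best first."""
    v = min_max_normalize(vector_scores)
    k = min_max_normalize(keyword_scores)
    combined = {
        doc_id: vector_weight * v.get(doc_id, 0.0)
        + (1 - vector_weight) * k.get(doc_id, 0.0)
        for doc_id in set(v) | set(k)
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Toy data: 2-dimensional "embeddings" and made-up BM25-like keyword scores.
query = [1.0, 0.0]
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
vector_scores = {d: cosine_similarity(query, emb) for d, emb in docs.items()}
keyword_scores = {"a": 1.0, "b": 5.0, "c": 3.0}

ranking = hybrid_rank(vector_scores, keyword_scores)
print([doc_id for doc_id, _ in ranking])  # ['b', 'c', 'a']
```

With only 30% of the weight on vector search, document `b` ranks first on the strength of its keyword score even though its embedding is orthogonal to the query. That is the trade-off a low vector weight implies.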