Retrieval-Augmented Generation (RAG) for Local Data Processing

RAG Pipeline

Project Overview

This repository focuses on creating a local Retrieval-Augmented Generation (RAG) system designed to understand your queries through a Large Language Model (LLM), search your local database, and semantically retrieve relevant information. By integrating the Ollama platform, this project enables an advanced AI system similar to ChatGPT, but it operates entirely on your local machine without requiring an internet connection.

The local RAG system leverages the LLM's ability to comprehend the nuances of your query, enabling it to search through your dataset and extract semantically related information. Unlike a traditional database query, which requires precise keywords or structured queries (such as SQL), the RAG system allows for natural language input. This means that instead of returning exact keyword matches, the RAG system can interpret the context and intent behind your query, pulling information that is semantically relevant even if it does not match the exact keywords.

For example, in a regular database query, asking for "customer complaints about delivery delays" might only return entries explicitly tagged with "delivery delay." In contrast, a RAG-based system can understand synonyms, variations, or implicit mentions of the issue, such as "shipment took too long" or "late delivery." This makes it significantly more powerful when searching through unstructured or loosely structured data like customer reviews, patient records, or emails.

By performing all tasks locally, the system guarantees data privacy and ensures compliance with confidentiality regulations, making it ideal for handling sensitive information such as patient records or proprietary customer feedback. Additionally, the combination of retrieval and generation offers robust data analysis, report generation, and decision support capabilities.

Integrating a RAG system into your database infrastructure can dramatically enhance the ability to extract actionable insights from large datasets. Instead of relying on rigid, predefined query patterns, a RAG system empowers users to engage with their data in a more intuitive and context-aware manner. This enables more flexible exploration of data, more accurate information retrieval, and the ability to generate coherent explanations for complex queries—all while maintaining full control over data workflows and privacy.

This project not only provides a powerful tool for managing and interpreting your own data but also serves as a learning resource for building customized RAG systems. Users can explore how the system semantically searches their databases and generates context-aware responses, enhancing the capability of traditional database management.

Key Concepts

1. Retrieval-Augmented Generation (RAG)

RAG enhances large language models by retrieving relevant information from external data sources, such as documents, databases, or a vector database. The system represents local documents as embeddings (vectors), which allows the model to find information that is semantically related to the input query.

RAG is particularly effective when:

  • The language model lacks sufficient context or knowledge about a particular subject.
  • The data being queried is large, but only certain relevant sections are needed.
  • The queries should be answered using specific or recent data from local files.
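To make the retrieve-then-generate flow concrete, here is a minimal end-to-end sketch. It assumes the `ollama` Python package, a running Ollama server, and that the mxbai-embed-large and llama3 models have been pulled; it illustrates the idea and is not the project's actual code.

```python
import numpy as np
import ollama  # assumes the ollama Python package is installed

documents = [
    "Shipment took too long and arrived damaged.",
    "The patient reported improved sleep after therapy.",
]

def embed(text: str) -> np.ndarray:
    """Turn text into a dense vector using the local embedding model."""
    return np.array(ollama.embeddings(model="mxbai-embed-large", prompt=text)["embedding"])

doc_vectors = np.stack([embed(doc) for doc in documents])

query = "customer complaints about delivery delays"
query_vector = embed(query)

# Retrieve the most semantically similar document via cosine similarity.
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
context = documents[int(scores.argmax())]

# Generate an answer grounded in the retrieved context.
reply = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": f"Context: {context}\n\nQuestion: {query}"}],
)
print(reply["message"]["content"])
```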

2. Vector Database (Using FAISS)

A vector database is used to store and retrieve high-dimensional vectors, or embeddings, that represent the semantic meaning of documents. In this project, we use FAISS (Facebook AI Similarity Search), which allows us to store document embeddings and retrieve the most semantically similar documents based on the query embedding.

After you upload documents (PDF, text, or JSON), the system converts these documents into embeddings using the Ollama mxbai-embed-large model. These embeddings are then indexed using FAISS, allowing for efficient semantic search based on the meaning of the user's query.

When a user asks a question, the system converts the query into an embedding and searches for the most similar document embeddings using FAISS. This allows the system to retrieve relevant information, even if the words in the query do not match the exact words in the documents, as the search is based on meaning rather than keywords.
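As a rough illustration of this indexing-and-search step (the vectors below are placeholders; in the real pipeline they come from mxbai-embed-large, assumed here to produce 1024-dimensional embeddings):

```python
import faiss
import numpy as np

EMBED_DIM = 1024  # assumed dimensionality of mxbai-embed-large embeddings

# Placeholder document embeddings; one row per chunk from vault.txt.
vault_embeddings = np.random.rand(100, EMBED_DIM).astype("float32")

index = faiss.IndexFlatL2(EMBED_DIM)  # exact (brute-force) L2 search
index.add(vault_embeddings)

# Embed the user's query with the same model, then fetch the 3 nearest chunks.
query_embedding = np.random.rand(1, EMBED_DIM).astype("float32")
distances, chunk_ids = index.search(query_embedding, 3)
print(chunk_ids)  # row indices of the most semantically similar chunks
```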

Word Embeddings Diagram

Persistent Storage of Embeddings and FAISS Index

To improve efficiency, the embeddings generated from the uploaded documents and the FAISS index are saved to disk. This means that if you re-run the program later, the embeddings and index will be loaded instead of re-generated, saving time.

  • FAISS Index: Saved as faiss_index.bin.
  • Embeddings: Saved as vault_embeddings.npy.

If new documents are uploaded, the system will regenerate the embeddings and re-index the entire set of documents.
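A minimal sketch of this caching behavior, using the file names above (the helper below is illustrative, not the exact code in the repository):

```python
import os

import faiss
import numpy as np

INDEX_PATH = "faiss_index.bin"
EMB_PATH = "vault_embeddings.npy"

def load_or_build_index(vault_embeddings: np.ndarray) -> faiss.Index:
    """Reuse the saved FAISS index if one exists; otherwise build and persist it."""
    if os.path.exists(INDEX_PATH) and os.path.exists(EMB_PATH):
        return faiss.read_index(INDEX_PATH)
    index = faiss.IndexFlatL2(vault_embeddings.shape[1])
    index.add(vault_embeddings.astype("float32"))
    faiss.write_index(index, INDEX_PATH)   # persist the index for later runs
    np.save(EMB_PATH, vault_embeddings)    # persist the raw embeddings as well
    return index
```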

3. Embeddings

Embeddings are vector representations of text that capture the semantic meaning of words or documents. The system uses Ollama's mxbai-embed-large model to create high-quality embeddings for the local documents. These embeddings are stored in a vector database, enabling fast and accurate retrieval based on user queries.

Embeddings serve as the backbone of the RAG system. By transforming text into a numerical format, the system can quickly compare and retrieve the most relevant documents or portions of data.
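Generating an embedding for a single chunk with Ollama looks roughly like this (assumes the `ollama` Python package and that mxbai-embed-large has been pulled):

```python
import ollama

chunk = "Patient reported increased anxiety during the follow-up visit."
response = ollama.embeddings(model="mxbai-embed-large", prompt=chunk)
vector = response["embedding"]  # a list of floats encoding the chunk's meaning
print(len(vector))              # the embedding's dimensionality
```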

4. Local LLMs with Ollama

Ollama is an open-source platform that lets you run large language models (LLMs) directly on your machine, without needing cloud-based services. This local execution ensures data privacy and minimizes latency while still providing the powerful capabilities of models such as Llama 3.

Setup

The following steps will help you set up the Local RAG system on your laptop. This setup is designed to be lightweight and straightforward, making it ideal for educational purposes or prototyping.

Step 1: Clone the Repository

First, clone this repository to your local machine:

git clone https://github.com/memari-majid/local_rag.git

Then navigate into the cloned repository:

cd local_rag

Step 2: Install Package Manager

To manage Python environments more effectively, we recommend using Miniconda. Follow these steps to install Miniconda:

  1. Download Miniconda: Go to the Miniconda Installation page and download the installer suitable for your operating system (Windows, macOS, Linux).

  2. Run the Installer:

    • On Windows, open the installer and follow the installation instructions.
    • On macOS and Linux, run the following commands in your terminal (replace the installer name with the appropriate one for your OS):

    For Linux

    bash Miniconda3-latest-Linux-x86_64.sh

    For macOS

    bash Miniconda3-latest-MacOSX-x86_64.sh

    Follow the Prompts: During the installation process, follow the prompts to install Miniconda and allow the installer to initialize the conda environment.

  3. Create a New Conda Environment with Python 3.9: Once Miniconda is installed, create a new environment using the following commands:

Create a new conda environment with Python 3.9

conda create --name rag python=3.9

Activate the environment

conda activate rag

Step 3: Install Python Dependencies

Install the required Python packages by running the following command:

pip install -r requirements.txt

These dependencies include tools for managing language models, handling documents, and generating embeddings.

Step 4: Install Ollama

Ollama is the core platform for running the local language models.

macOS and Windows

Download the installer from the Ollama website: https://ollama.com/download

Linux Install

To install Ollama, run the following command:

curl -fsSL https://ollama.com/install.sh | sh

Linux Manual Install

Download and extract the package:

curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz
sudo tar -C /usr -xzf ollama-linux-amd64.tgz

Start Ollama:

ollama serve

Docker

The official Ollama Docker image ollama/ollama is available on Docker Hub.

Once installed, Ollama will allow access to a variety of optimized language models for local use.

Step 5: Pull the Required Models

You will need to download the specific models that power the RAG system.

Models and Memory Requirements

The following table outlines various Large Language Models (LLMs) with their corresponding number of parameters and memory requirements. Llama models, especially the Llama 3 series developed by Meta, are highly advanced models optimized for Natural Language Processing (NLP) tasks. Llama 3 provides improvements in performance, speed, and efficiency compared to earlier versions. Different versions of the Llama model (with varying numbers of parameters) require different amounts of memory (RAM) to run effectively.

LLM leaderboard

Quality: Evaluation of the model’s overall language understanding and generation capabilities.

| Model | Creator | Context Window | Quality (Index) |
|---|---|---|---|
| o1-preview | OpenAI | 128k | 85 |
| o1-mini | OpenAI | 128k | 82 |
| GPT-4o | OpenAI | 128k | 77 |
| GPT-4o mini | OpenAI | 128k | 71 |
| Llama 3.1 405B | Meta | 128k | 72 |
| Llama 3.1 70B | Meta | 128k | 65 |
| Llama 3.1 8B | Meta | 128k | 53 |
| Gemini 1.5 Pro | Google | 2m | 72 |
| Gemini 1.5 Flash | Google | 1m | 60 |
| Gemma 2 27B | Google | 8k | 49 |
| Gemma 2 9B | Google | 8k | 47 |
| Claude 3.5 Sonnet | Anthropic | 200k | 77 |

For example, the Llama 3.1 8B model requires 4.7 GB of memory, which makes it a suitable candidate to run locally on your laptop if you have sufficient RAM. This model offers a balance between performance and memory footprint, making it practical for local use on systems with moderate hardware capabilities.

| Model | Parameters | Memory Size | Ollama Command |
|---|---|---|---|
| Moondream 2 | 1.4B | 829 MB | ollama run moondream |
| Gemma 2 | 2B | 1.6 GB | ollama run gemma2:2b |
| Phi 3 Mini | 3.8B | 2.3 GB | ollama run phi3 |
| Code Llama | 7B | 3.8 GB | ollama run codellama |
| Llama 2 Uncensored | 7B | 3.8 GB | ollama run llama2-uncensored |
| Mistral | 7B | 4.1 GB | ollama run mistral |
| Neural Chat | 7B | 4.1 GB | ollama run neural-chat |
| Starling | 7B | 4.1 GB | ollama run starling-lm |
| LLaVA | 7B | 4.5 GB | ollama run llava |
| Llama 3.1 | 8B | 4.7 GB | ollama run llama3.1 |
| Solar | 10.7B | 6.1 GB | ollama run solar |
| Gemma 2 | 9B | 5.5 GB | ollama run gemma2 |
| Phi 3 Medium | 14B | 7.9 GB | ollama run phi3:medium |
| Gemma 2 | 27B | 16 GB | ollama run gemma2:27b |
| Llama 3.1 | 70B | 40 GB | ollama run llama3.1:70b |
| Llama 3.1 | 405B | 231 GB | ollama run llama3.1:405b |

When selecting a Llama model, it is important to consider the available memory and hardware capabilities, as larger models require more memory but deliver better performance for complex tasks.

Use the following command to pull the main language model:

ollama pull llama3.1

Generating High-Quality Embeddings from Documents

The mxbai-embed-large model is designed to generate high-quality embeddings from textual documents. Embeddings are vector representations of text that capture the semantic meaning of words or sentences. These embeddings are essential for tasks like similarity search, document clustering, and retrieval-augmented generation (RAG) systems.

By leveraging a large embedding model like mxbai-embed-large, you can ensure that your documents are encoded into dense vectors that preserve their semantic richness and contextual information, which improves the performance of downstream tasks such as search, recommendation, or summarization.

To pull the mxbai-embed-large embedding model with Ollama, run the following command:

ollama pull mxbai-embed-large

These models will be stored locally, allowing the entire system to run offline.

Step 6: Upload Your Documents

The upload.py script allows users to upload documents in PDF, TXT, and JSON formats and append their contents to a file called vault.txt. This tool is designed to handle documents by converting them into chunks of text, making them ready for further processing.
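The chunking behavior can be sketched as follows; the function name is hypothetical, and the actual upload.py may differ in details:

```python
import re

def append_to_vault(raw_text: str, vault_path: str = "vault.txt", max_chars: int = 1000) -> None:
    """Normalize whitespace, split into chunks of up to max_chars, and append one chunk per line."""
    text = re.sub(r"\s+", " ", raw_text).strip()  # collapse whitespace
    chunks = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    with open(vault_path, "a", encoding="utf-8") as vault:
        for chunk in chunks:
            vault.write(chunk + "\n")
```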

Key Features of upload.py:

  • PDF to Text Conversion: Converts PDF files to text and normalizes the text by removing unnecessary whitespace. The text is split into chunks (up to 1000 characters) and appended to vault.txt, with each chunk on a new line.

  • TXT File Upload: Reads text from .txt files, normalizes whitespace, and splits the content into 1000-character chunks. The text is then appended to vault.txt, similar to the PDF functionality.

  • JSON File Upload: Parses JSON files and flattens the data into a single string. It then processes the content by normalizing whitespace and splitting it into chunks before appending it to vault.txt.

How the Script Works:

  1. User Interface: A simple GUI is provided using tkinter, where buttons allow users to select and upload their files.

  2. PDF Handling: When a user selects a PDF file, the script extracts the text from each page, normalizes it, and splits it into chunks of up to 1000 characters.

  3. Text File Handling: When a .txt file is uploaded, the text is similarly cleaned and split into chunks of up to 1000 characters before being added to vault.txt.

  4. JSON File Handling: For JSON files, the script flattens the entire structure into a single text string, normalizes whitespace, and then splits the content into chunks before appending to vault.txt.

Example Usage:

To upload files using the script, simply run the script and select the file format you wish to upload:

python upload.py

This will open a GUI where you can choose a PDF, TXT, or JSON file. The selected file will be processed, and its content will be appended to vault.txt in chunks.

Adding New Document Types: If you want to extend the functionality of upload.py to support additional file types, you can modify the script by adding new functions similar to upload_txtfile or upload_jsonfile.

This script makes it easy to gather documents from various formats into a central text file, ready for processing or analysis.

Step 7: Query Your Documents

The localrag.py script provides a local environment for running Retrieval-Augmented Generation (RAG) using a vault of documents. This script allows you to query a set of local documents and retrieve relevant context for answering questions. It integrates with Ollama to generate embeddings and retrieve context based on user input. Additionally, it utilizes PyTorch to handle similarity calculations and embeddings.
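The retrieval step can be pictured with a short PyTorch sketch (variable names and shapes are assumptions, not the script's exact code):

```python
import torch

def top_k_chunks(query_embedding: torch.Tensor,
                 vault_embeddings: torch.Tensor,
                 vault_lines: list[str],
                 k: int = 3) -> list[str]:
    """Return the k vault lines whose embeddings are most similar to the query."""
    # query_embedding: shape (dim,); vault_embeddings: shape (num_chunks, dim)
    scores = torch.cosine_similarity(query_embedding.unsqueeze(0), vault_embeddings)
    k = min(k, len(vault_lines))
    best = torch.topk(scores, k=k).indices.tolist()
    return [vault_lines[i] for i in best]
```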

Key Features of localrag.py:

  • Document Retrieval: The script reads from a local text file (vault.txt), generates embeddings for the contents, and uses cosine similarity to retrieve relevant text based on the user's query.

  • Ollama Integration: The script uses Ollama's embedding and completion models, such as mxbai-embed-large for generating embeddings and llama3 for answering user queries based on the context.

  • Query Rewriting: When a user asks a question, the script can rewrite the query by referencing recent conversation history, improving the query to retrieve better results without changing its original intent.

How the Script Works:

  1. Loading Documents: The script reads a text file named vault.txt containing the documents you want to use for context retrieval. Each line of the file is treated as a separate chunk of text, and embeddings are generated for these chunks using the mxbai-embed-large model.

  2. Generating Embeddings: Once the vault is loaded, the script creates embeddings for each chunk using Ollama's embedding model. These embeddings are stored as tensors and used for similarity searches.

  3. User Interaction:

    • The user is prompted to input a query. If context from the vault is relevant, the script retrieves the top matching chunks based on cosine similarity.
    • If there is sufficient conversation history, the script rewrites the user's query using the most recent context.
    • The query, along with the relevant context, is passed to Ollama's language model (e.g., llama3) for generating a response.
  4. Conversation Loop: The script enters an interactive loop where the user can continually input queries and receive responses. The conversation history is updated with each interaction to improve context.

Example Usage:

To run the localrag.py script, use the following command in your terminal:

python localrag.py

This will initiate the conversation loop. The system will use the llama3 model by default for processing your queries. You can specify a different model by passing the --model argument.

Vault File: Ensure that your vault.txt is properly formatted, with each document or chunk of text on a new line. The script will generate embeddings for each line and use them to find the most relevant context for your queries.

Example Commands:

  • Ask a Query: Type a question and the script will retrieve the most relevant content from vault.txt, rewriting your query if needed, and provide a response.

  • Quit the Script: To exit the conversation, simply type quit.

Configuration:

  • Ollama API: The script is configured to interact with an Ollama API running locally on http://localhost:11434/v1. Ensure that the API is running before executing the script.

  • Embedding Model: By default, the script uses the mxbai-embed-large model to generate document embeddings.

This script is highly flexible for building local RAG systems, enabling users to query a custom set of documents and receive intelligent, context-aware responses.
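Because Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1, a client can be pointed at it roughly as follows (assumes the `openai` Python package; the api_key value is a placeholder, since the local server does not validate it):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # placeholder key
reply = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Summarize the retrieved context."}],
)
print(reply.choices[0].message.content)
```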

Using localrag_no_rewrite.py for Retrieval-Augmented Generation (RAG)

The localrag_no_rewrite.py script is a simplified version of a Retrieval-Augmented Generation (RAG) system that allows you to query a set of local documents and retrieve relevant context without rewriting the user's query. This script uses Ollama for generating embeddings and answering user queries by pulling relevant context from a local "vault" of documents.

Key Features of localrag_no_rewrite.py:

  • No Query Rewriting: Unlike localrag.py, this script does not rewrite the user's query before retrieving relevant context. Instead, it uses the user's input directly for context retrieval.

  • Document Retrieval: The script reads from a local text file (vault.txt), generates embeddings for the contents using Ollama's embedding model, and retrieves the most relevant chunks based on the user's query.

  • Ollama Integration: The script utilizes Ollama's API for both embedding generation (e.g., mxbai-embed-large) and text generation (e.g., llama3 or other specified models).

  • Cosine Similarity Matching: The system uses cosine similarity to match the user's input with the most relevant chunks of text from the vault, providing highly relevant context for answering queries.

How the Script Works:

  1. Loading Documents: The script reads a file named vault.txt containing chunks of documents or text lines. Each line in the file is treated as a document, and embeddings are generated for each chunk using the mxbai-embed-large model.

  2. Generating Embeddings: Embeddings are generated for the vault content using Ollama's embedding model. These embeddings are stored as tensors and used to compute cosine similarity for relevance matching.

  3. User Interaction:

    • The user provides a query. The system retrieves the top relevant chunks of text from the vault based on cosine similarity.
    • The relevant context is concatenated with the user’s input and sent to Ollama's language model (e.g., llama3) to generate a response.
  4. Conversation Loop: The script continuously prompts the user for input, retrieves context, and provides responses in a loop. The conversation history is maintained to allow continuity in the dialogue.

Example Usage:

To run the localrag_no_rewrite.py script, use the following command in your terminal:

python localrag_no_rewrite.py

This will initiate the conversation loop. By default, the system uses the llama3 model, but you can specify another model using the --model argument.

Vault File: Ensure that your vault.txt is properly formatted, with each document or chunk of text on a new line. The script will generate embeddings for each line, which are used to match the user's query with relevant content.

Example Commands:

  • Ask a Query: Type your question and the system will retrieve relevant content from vault.txt and provide a response.

  • Quit the Script: Type quit to exit the conversation loop.

Configuration:

  • Ollama API: The script interacts with an Ollama API running locally on http://localhost:11434/v1. Ensure that the API is running before executing the script.

  • Embedding Model: The script uses the mxbai-embed-large model for generating document embeddings by default.

This script provides a simple yet powerful tool for querying local documents and generating responses using RAG techniques without modifying the user's original input.

Step 8: Test Description for Querying Complex Patient Records with RAG

This test demonstrates how the local Retrieval-Augmented Generation (RAG) system retrieves nuanced information from a complex JSON file of patient records. In this case, the query involves retrieving the treatments for Michael Lee's anxiety and understanding how his anxiety might be related to his hypertension.

Query Example:

What treatments are being used for Michael Lee's anxiety, and how is his condition related to his hypertension?

Context Retrieved from the sample_complex.json File:

The system pulls the relevant context from the JSON file, which includes:

  • Michael Lee's medical history, with conditions like Hypertension and Anxiety.
  • Treatments:
    • For hypertension, he is on Losartan 50mg daily.
    • For anxiety, he undergoes Cognitive Behavioral Therapy (CBT) and takes Escitalopram 10mg daily.
  • Recent visits that link his increased anxiety to a hypertension follow-up visit.

RAG System's Response: Based on the retrieved context, the RAG system generated the following detailed response:

Anxiety Treatment: Michael Lee's anxiety is being treated with CBT and Escitalopram 10mg daily, a treatment regimen that began after his diagnosis in 2016.
Relation to Hypertension: The system highlighted a recent visit on May 5th, 2023, where Michael reported increased anxiety during a hypertension follow-up, suggesting a link between his anxiety and blood pressure issues. It also noted improvement in anxiety symptoms on September 15th, 2023, following continued CBT sessions.

The system also suggested potential additional factors outside the given context that might impact both conditions:

  • Exercise: Michael engages in light physical activity (occasional walks), but increasing exercise could help manage both conditions.
  • Sleep: Since Michael has sleep apnea, improving sleep quality could positively affect both his anxiety and hypertension.
  • Stress Management: The system recommended incorporating stress-reduction techniques like meditation or breathing exercises to better manage his anxiety and its effects on blood pressure.

Why This Test Demonstrates the Power of RAG:

This test showcases how RAG goes beyond traditional SQL queries by:

  • Understanding the relationships between seemingly separate conditions (anxiety and hypertension).
  • Pulling detailed contextual information from multiple areas of the patient’s record (treatments, recent visits, and lifestyle factors).
  • Generating actionable insights about how to manage the patient’s overall health, which would be difficult to achieve with a simple keyword search or SQL query.

By using the RAG system, healthcare providers can gain a more comprehensive understanding of a patient’s health conditions, treatments, and how different factors are interconnected, allowing for more personalized and effective care.

Test Output: Querying Patient Distress Using RAG with Integrated Systems

Query:

Which patients have shown signs of distress in recent facial recognition scans or feedback, and do they require urgent intervention?

Response:

Based on the provided context and patient data, I've identified two patients who have shown signs of distress:

Patient 006 (Sarah Miller):

  • Recent visit on 2023-09-10 for "Increase in depressive symptoms," with Sertraline dosage increased to 100mg daily.
  • A facial recognition scan on 2023-09-10 at 15:35:00 detected sadness, and her emergency contact was alerted.

Patient 007 (David Brown):

  • Recent visit on 2023-08-22 related to a manic episode, with Lithium dosage adjusted to 1200mg daily.
  • No facial recognition alert, but a customer review analysis flagged high anxiety in feedback on 2023-09-08, recommending escalation.

Additional Insights and Recommendations:

  • Patient 006 (Sarah Miller) may need urgent intervention due to her worsening depressive symptoms and the triggered facial recognition alert. Addressing her lifestyle factors, such as improving sleep quality and increasing physical activity, could aid in managing her condition.

  • Patient 007 (David Brown) should also be closely monitored, as his recent feedback indicated high levels of anxiety. Although no facial recognition alert was triggered, further support may be necessary.

Urgent Intervention Recommendations:

  1. Sarah Miller: Address the increasing depressive symptoms and facial recognition-triggered alert.
  2. David Brown: Provide additional support based on his flagged customer review analysis and anxiety escalation.

Please consult with the patient's care team and relevant healthcare professionals to determine the best course of action.
