Pinecone Vectorstore Updater

This script updates a Pinecone vectorstore with new PDF documents from a Google Cloud Storage bucket. It keeps track of processed files to avoid duplicates and only processes new documents.

Components

Vectorstore Updater: Automatically processes and indexes new PDF documents from a Google Cloud Storage bucket into a Pinecone vectorstore.
Retriever: Efficiently queries the Pinecone vectorstore to retrieve relevant information from indexed PDF documents.

Prerequisites

Python 3.7+
Google Cloud account with a storage bucket
Pinecone account with an index created
OpenAI API key

Setup

Clone this repository:

git clone https://github.com/your-username/pinecone-vectorstore-updater.git
cd pinecone-vectorstore-updater

Set up a virtual environment:
```
python -m venv venv
```
Activate the virtual environment:
- On Windows:
```
venv\Scripts\activate
```
- On macOS and Linux:
```
source venv/bin/activate
```
Install the required packages:
```
pip install -r requirements.txt
```

Set up your .env file with the following variables (see .env.example for reference):

OPENAI_API_KEY=your_openai_api_key
PINECONE_API_KEY=your_pinecone_api_key
PINECONE_INDEX=your_pinecone_index_name
CLOUD_STORAGE_BUCKET=your_google_cloud_storage_bucket_name
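
A missing or empty variable tends to surface later as a confusing authentication error, so it can help to check the settings up front. The sketch below is an illustration and is not taken from import.py or retrieve.py; `missing_settings` is a hypothetical helper. It only reads the process environment, so it assumes the .env file has already been loaded by whatever mechanism the scripts use.

```python
import os

# The four variables listed in the .env section above.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "PINECONE_API_KEY",
    "PINECONE_INDEX",
    "CLOUD_STORAGE_BUCKET",
)


def missing_settings(env=None):
    """Return the names of required variables that are unset or blank.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_settings()` at the top of a script and stopping if the returned list is non-empty gives a clear message naming the variables that still need values.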

Authentication with Google Cloud:

Option 1: Using gcloud CLI (recommended)
- Install the Google Cloud SDK
- Run gcloud auth application-default login and follow the prompts
Option 2: Using a service account JSON key
- Obtain a service account JSON key from the Google Cloud Console
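- Update the storage_client initialization in import.py so that it loads credentials from that key file (for example with storage.Client.from_service_account_json)

Clear out or delete processed_files.json before your first run, unless by some very odd chance you are also collecting beekeeping PDFs. Any file name already listed there is treated as processed and will be skipped.

Update the prompt in retrieve.py to suit your needs.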

Usage

Vectorstore Updater

Run the script with:

python import.py

The script will:

Connect to your Google Cloud Storage bucket
Check for new PDF files
Process new files and add them to your Pinecone vectorstore
Keep track of processed files to avoid duplicates in future runs

How it works:

The script loads a list of previously processed files from processed_files.json.
It compares this list with the files in your Google Cloud Storage bucket.
New files are downloaded, processed, and added to the Pinecone vectorstore.
The list of processed files is updated and saved for the next run.

Retriever

Run the script with:

python retrieve.py

Interact with the chatbot by typing your questions. The chatbot will retrieve relevant information from the indexed documents and provide answers. To exit the conversation, type "exit" or "quit".
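
The interaction loop can be pictured roughly as follows. This is a simplified sketch and not the code in retrieve.py: `chat_loop` and the `answer` callable are hypothetical stand-ins for the real retrieval chain, and matching the exit words case-insensitively is an assumption.

```python
EXIT_WORDS = {"exit", "quit"}


def chat_loop(answer, read=input, write=print):
    """Read questions until the user types an exit word.

    `answer` maps a question string to a reply string. `read` and `write`
    default to the console but can be swapped out for testing.
    """
    while True:
        question = read("You: ").strip()
        if question.lower() in EXIT_WORDS:
            write("Goodbye.")
            break
        if not question:
            continue  # ignore empty input instead of querying the index
        write("Bot: " + answer(question))
```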

Customization

Within import.py, update the CharacterTextSplitter parameters in the process_pdf function to change how PDFs are split into chunks.
Modify the embeddings initialization if you want to use a different embedding model. The model must be the same one used in retrieve.py, and its output dimension must match the dimension of your Pinecone index.
In the retriever script, you can modify the llm initialization to use different OpenAI models or adjust the temperature for varying levels of creativity in responses.
The retriever currently uses an in-memory chat history store. For persistence across sessions, you can implement a database-backed history store.
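
As one possible starting point for a database-backed history store, the sketch below keeps messages in SQLite using only the standard library. It is an assumption about how persistence could be added, not part of this repository: the `SQLiteHistory` class, the table layout, and the default file name are all hypothetical, and it is not wired into the history interface that retrieve.py uses.

```python
import sqlite3


class SQLiteHistory:
    """Store chat messages per session in a SQLite database."""

    def __init__(self, path="chat_history.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "session_id TEXT NOT NULL, "
            "role TEXT NOT NULL, "
            "content TEXT NOT NULL)"
        )
        self.conn.commit()

    def add(self, session_id, role, content):
        """Append one message; the connection context manager commits it."""
        with self.conn:
            self.conn.execute(
                "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
                (session_id, role, content),
            )

    def messages(self, session_id):
        """Return (role, content) pairs for a session in insertion order."""
        rows = self.conn.execute(
            "SELECT role, content FROM messages WHERE session_id = ? ORDER BY id",
            (session_id,),
        )
        return rows.fetchall()
```

Passing ":memory:" as the path gives a throwaway database for experiments, while a file path keeps the conversation across restarts.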

Troubleshooting

If you encounter authentication issues, ensure you've set up Google Cloud authentication correctly.
For Pinecone-related issues, check your API key and index name in the .env file.
Make sure your Google Cloud Storage bucket contains PDF files.
Make sure the embedding model (and vector dimension) your Pinecone index was set up for matches the model used in OpenAIEmbeddings().
If you're having issues with package installations, ensure your virtual environment is activated and your requirements.txt file is up to date.
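
A model and index mismatch usually shows up as a dimension error when vectors are written or queried. The sketch below compares the dimension of your index (as shown in the Pinecone console) with the default output dimension of a few OpenAI embedding models. `check_dimension` is a hypothetical helper and not part of this repository. The text-embedding-3 models can also be asked for shorter vectors, which this sketch ignores.

```python
# Default output dimensions of common OpenAI embedding models.
KNOWN_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def check_dimension(model_name, index_dimension):
    """Raise ValueError if the model's default dimension differs from the index's."""
    expected = KNOWN_DIMENSIONS.get(model_name)
    if expected is None:
        raise ValueError(f"Unknown embedding model: {model_name}")
    if expected != index_dimension:
        raise ValueError(
            f"{model_name} produces {expected}-dimensional vectors, "
            f"but the index expects {index_dimension}"
        )
    return expected
```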

Contributing

Feel free to submit issues or pull requests if you have suggestions for improvements or encounter any problems. If anyone knows what this log chatter means, let me know lol: 'Ignoring wrong pointing object 11 0 (offset 0)'. I think it comes from pypdf, but I haven't dug into it yet.

TODO

Implement database convo memory
Create FastAPI route
Offer option to include sources

License

MIT License
