This repository contains two Python scripts that process documents, generate embeddings with OpenAI's API, and store the results in an AstraDB collection. Together they cover loading documents in several formats, normalizing the text, splitting it into chunks, generating embeddings, and writing the processed data to a vector database.
```bash
git clone https://github.com/yourusername/document-embedding-astradb.git
cd document-embedding-astradb
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
Then download the NLTK resources used for text preprocessing:

```python
import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
```
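These resources support the normalization step. As a rough sketch of what that preprocessing typically looks like (the function name and exact steps are illustrative, not necessarily the scripts' implementation):

```python
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def normalize_text(text: str) -> str:
    # Lowercase, tokenize, drop stopwords and non-alphabetic tokens, then lemmatize.
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))
    tokens = word_tokenize(text.lower())
    kept = [lemmatizer.lemmatize(tok) for tok in tokens if tok.isalpha() and tok not in stop_words]
    return " ".join(kept)

print(normalize_text("The documents were processed and embedded."))
# -> "document processed embedded"
```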
Script 1: process_documents_v1.py

This script is designed to:
- Load documents (PDF, DOCX, TXT).
- Normalize and preprocess text.
- Split text into manageable chunks.
- Generate embeddings using OpenAI's API.
- Store the embeddings and metadata in an AstraDB collection.

To run the script, update the file_paths list in the script with the paths to your documents, then run:
```bash
python process_documents_v1.py
```
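For reference, the file_paths list inside the script might look like the following; the paths shown are placeholders, so point them at your own files:

```python
# Placeholder paths; replace with the documents you want to process.
file_paths = [
    "docs/annual_report.pdf",
    "docs/meeting_notes.docx",
    "docs/overview.txt",
]
```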
Script 2: process_documents_v2.py

This script offers similar functionality to the first, with slight variations such as:
- Adjusted error handling.
- Additional emphasis on document references.
- Slightly different chunk size and overlap settings during text splitting (see the splitter sketch below).

To run the script, update the file_paths list in the script with the paths to your documents, then run:
```bash
python process_documents_v2.py
```
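To illustrate how chunk size and overlap are usually configured, here is a minimal sketch using LangChain's RecursiveCharacterTextSplitter. The values are illustrative rather than either script's actual settings, and depending on your LangChain version the import may live in langchain_text_splitters instead:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

long_text = "example text " * 500  # stand-in for a loaded document

# Illustrative values only; the two scripts use their own settings.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum characters per chunk
    chunk_overlap=100,  # characters shared between consecutive chunks to preserve context
)
chunks = splitter.split_text(long_text)
print(f"{len(chunks)} chunks")
```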
You need to set up environment variables for your API key and database credentials. These can be stored securely using google.colab.userdata (when running in Colab) or loaded from a .env file. The following variables are required:

- ASTRA_DB_APPLICATION_TOKEN
- ASTRA_DB_API_ENDPOINT
- OPENAI_API_KEY
Example configuration:
```python
import os
from google.colab import userdata

ASTRA_DB_APPLICATION_TOKEN = userdata.get("ASTRA_DB_APPLICATION_TOKEN")
ASTRA_DB_API_ENDPOINT = userdata.get("ASTRA_DB_API_ENDPOINT")
OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")
```
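If you prefer a .env file over Colab secrets, here is a minimal sketch using python-dotenv; the values shown in the comments are placeholders:

```python
# .env file placed next to the scripts (placeholder values):
#   ASTRA_DB_APPLICATION_TOKEN=AstraCS:...
#   ASTRA_DB_API_ENDPOINT=https://<db-id>-<region>.apps.astra.datastax.com
#   OPENAI_API_KEY=sk-...

import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file into the process environment

ASTRA_DB_APPLICATION_TOKEN = os.getenv("ASTRA_DB_APPLICATION_TOKEN")
ASTRA_DB_API_ENDPOINT = os.getenv("ASTRA_DB_API_ENDPOINT")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
```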
Document Processing

- Document loading: handles documents in various formats (PDF, DOCX, TXT).
- RecursiveCharacterTextSplitterComponent: splits large text into smaller chunks for efficient processing.
- OpenAIEmbeddingsComponent: generates embeddings for text chunks using OpenAI's API.
- AstraDBManager: manages the interaction with AstraDB, including creating collections and storing documents.
The main() function orchestrates the loading, processing, embedding, and storing of documents.
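As a rough illustration of the embedding and storage steps these components wrap, the sketch below calls OpenAI's embeddings endpoint and writes one chunk to an AstraDB collection. The model name, collection name, dimension, and metadata field are assumptions, and it targets the openai >= 1.0 and astrapy 1.x interfaces, which may differ from what the scripts actually use:

```python
import os
from openai import OpenAI
from astrapy import DataAPIClient

openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
astra_client = DataAPIClient(os.getenv("ASTRA_DB_APPLICATION_TOKEN"))
database = astra_client.get_database(os.getenv("ASTRA_DB_API_ENDPOINT"))

# Collection name and dimension are assumptions; text-embedding-3-small returns 1536-dim vectors.
collection = database.create_collection("document_chunks", dimension=1536, metric="cosine")

chunk = "Example chunk of normalized document text."
embedding = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=chunk,
).data[0].embedding

collection.insert_one({
    "text": chunk,
    "source": "docs/overview.txt",  # hypothetical metadata field
    "$vector": embedding,
})
```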
The scripts rely on the following packages:

- langchain
- nltk
- astrapy
- openai
- python-dotenv

Install them with:

```bash
pip install langchain nltk astrapy openai python-dotenv
```
Contributions are welcome! Please open an issue or submit a pull request with any enhancements or bug fixes.