This is an MVP of an LLM Document Search RAG application.
Requirements Doc:
- Scan PDFs (pypdf / AWS Textract)
- Create pages
- Chunk pages (langchain)
- Embeddings (OpenAI)
- Store in Vector DB (Chroma)
- Test our embeddings (pytest)
- Retrieve with search query (Mistral)
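The pinned dependencies live in requirements.txt; as a rough sketch, the pipeline above implies packages along these lines (the package names below are assumptions, not the project's actual pin list):

```
# requirements.txt (illustrative sketch, unpinned)
pypdf
langchain
langchain-community
langchain-openai
langchain-mistralai
langchain-chroma
chromadb
pytest
```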
Run the following commands to install the dependencies from the requirements.txt file:

```
pip install -r requirements.txt
pip install pytest
pip install pypdf
```

To scan all the PDF files in the `data` folder and load them into the RAG, run:

```
python load_pdf.py
```

This scans the PDFs with pypdf through the langchain document loader, splits the documents into pages, and then chunks them. The chunks are embedded and stored in Chroma.
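As a minimal sketch of what load_pdf.py does (the `data` and `chroma` paths, chunk size, and overlap are assumptions, and the imports assume the split langchain-community / langchain-openai / langchain-chroma packages):

```python
# load_pdf.py -- minimal sketch, not the project's exact implementation.
# Requires OPENAI_API_KEY in the environment for the embeddings call.
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

DATA_PATH = "data"      # folder containing the source PDFs
CHROMA_PATH = "chroma"  # where the vector DB is persisted (assumed path)

def main():
    # Load every PDF in the data folder; pypdf runs under the hood and
    # each page comes back as a separate langchain Document.
    pages = PyPDFDirectoryLoader(DATA_PATH).load()

    # Chunk the pages; the size/overlap values here are illustrative.
    splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=80)
    chunks = splitter.split_documents(pages)

    # Embed the chunks with OpenAI and store them in a persistent Chroma DB.
    Chroma.from_documents(
        documents=chunks,
        embedding=OpenAIEmbeddings(),
        persist_directory=CHROMA_PATH,
    )
    print(f"Stored {len(chunks)} chunks in {CHROMA_PATH}")

if __name__ == "__main__":
    main()
```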
Query the Chroma DB and use Mistral to generate an answer:

```
python query_data.py "Your question relevant to the context of the application"
```
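A minimal sketch of what query_data.py might look like, assuming the langchain-mistralai chat wrapper for Mistral and the same Chroma path and OpenAI embeddings as the loading step (the model name, prompt wording, and `k` are illustrative assumptions):

```python
# query_data.py -- minimal sketch; assumes MISTRAL_API_KEY and OPENAI_API_KEY
# are set in the environment.
import sys
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_mistralai import ChatMistralAI

CHROMA_PATH = "chroma"  # assumed persist directory

def answer(question: str) -> str:
    # Retrieve the chunks most similar to the question from the vector DB.
    db = Chroma(persist_directory=CHROMA_PATH, embedding_function=OpenAIEmbeddings())
    docs = db.similarity_search(question, k=4)
    context = "\n\n".join(doc.page_content for doc in docs)

    # Ask Mistral to answer using only the retrieved context.
    llm = ChatMistralAI(model="mistral-small-latest")  # model name is an assumption
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.invoke(prompt).content

if __name__ == "__main__":
    print(answer(sys.argv[1]))
```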
Test Mistral's answers using pytest:

```
pytest test_cases.py
```
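As a sketch of the kind of check test_cases.py could run (the question/expected-substring pair is a placeholder, and it assumes query_data.py exposes the `answer()` helper from the sketch above):

```python
# test_cases.py -- minimal pytest sketch; the test data is illustrative only.
import pytest
from query_data import answer  # assumes query_data exposes an answer() helper

@pytest.mark.parametrize(
    "question, expected_substring",
    [
        ("What is this application about?", "document"),  # placeholder pair
    ],
)
def test_answer_mentions_expected_text(question, expected_substring):
    # The retrieved-and-generated answer should mention the expected text.
    assert expected_substring.lower() in answer(question).lower()
```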