We are enhancing this organization's LLM Extractor Project with Retrieval-Augmented Generation (RAG) to reduce input tokens and improve efficiency.
This project implements a Retrieval-Augmented Generation (RAG) pipeline, using Apache Airflow to orchestrate the data pipelines and FastAPI with Streamlit for the interactive user interface.
The goal is to build a scalable, modular system that extracts insights from NVIDIA's quarterly reports for the past five years, using several parsing, chunking, and retrieval techniques.
- Data Source: NVIDIA quarterly reports (last 5 years)
- Assignment 1's Parser
- Docling
- Mistral OCR
- Manual Embeddings + Cosine Similarity
- Pinecone Integration
- ChromaDB Integration
- Fixed-size chunks
- Semantic chunks
- Sliding window chunks
- Query by quarter to fetch context-specific information
- Upload PDFs
- Select parser, chunking strategy, and RAG method
- Query quarter-specific data
- Dockerized Airflow pipeline
- Dockerized Streamlit + FastAPI interface
- Docker & Docker Compose
- Python 3.9+
- Conda (optional for local development)
- NVIDIA reports downloaded into /data/raw_reports
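
To make the fixed-size and sliding window chunking strategies listed above more concrete, here is a minimal sketch in plain Python. It is an illustration only, not the project's actual code: the function names `fixed_size_chunks` and `sliding_window_chunks`, the `size` and `overlap` parameters, and the choice to measure chunks in characters are all assumptions. Semantic chunking is left out because it depends on how the project defines semantic boundaries.

```python
from typing import List


def fixed_size_chunks(text: str, size: int) -> List[str]:
    """Split text into consecutive, non-overlapping chunks of at most `size` characters.

    Hypothetical helper for illustration; chunking by characters is an assumption.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def sliding_window_chunks(text: str, size: int, overlap: int) -> List[str]:
    """Split text into chunks of at most `size` characters.

    Each chunk starts `size - overlap` characters after the previous one,
    so neighbouring chunks share `overlap` characters.
    Hypothetical helper for illustration.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once the current window already reaches the end of the text.
        if start + size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "abcdefghij"
    print(fixed_size_chunks(sample, 4))
    print(sliding_window_chunks(sample, 4, 2))
```

The overlap in the sliding window variant trades extra storage for context that would otherwise be cut at a chunk boundary, which can matter when a sentence in a report straddles two chunks.
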
📦 Building-a-RAG-Pipeline-with-Airflow/
├── 📂 Airflow/
│ ├── 📂 dags/
│ ├── 📂 logs/
│ ├── 📂 config/
│ ├── 📂 plugins/
│ └── 📄 Dockerfile
├── 📂 Backend/
│ ├── 📄 __init__.py
│ ├── 📄 api.py
│ ├── 📄 logger.py
│ ├── 📄 litellm_query_generator.py
│ ├── 📂 parsing_methods/
│ │ ├── 📄 __init__.py
│ │ ├── 📄 doclingparsing.py
│ │ ├── 📄 mistralparsing.py
│ │ └── 📄 mistralparsing_userpdf.py
│ └── 📄 Dockerfile
├── 📂 Rag_modelings/
│ ├── 📄 __init__.py
│ ├── 📄 chromadb_pipeline.py
│ ├── 📄 rag_pinecone.py
│ └── 📄 rag_manual.py
├── 📂 uploads/
├── 📂 user_markdowns/
├── 📂 chroma_db/
├── 📂 chunk_storage/
├── 📂 local_vector.db/
├── 📄 docker-compose.yml
├── 📄 pyproject.toml
├── 📄 poetry.lock
└── 📄 .env
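
The manual embeddings with cosine similarity option and the query-by-quarter feature listed earlier can be sketched as follows. This is a hedged illustration rather than the contents of any file in the tree above: the record layout (`quarter`, `text`, `embedding` keys), the quarter label format, and the function names `cosine_similarity` and `top_k_for_quarter` are assumptions, and the tiny vectors stand in for real embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_for_quarter(
    query_vec: Sequence[float],
    records: List[Dict],
    quarter: str,
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Rank the chunks of one quarter by similarity to the query.

    `records` is a hypothetical in-memory store: each item is a dict with
    'quarter', 'text', and 'embedding' keys (an assumed layout).
    """
    # Filter by quarter first so only context-specific chunks are scored.
    candidates = [r for r in records if r["quarter"] == quarter]
    scored = [
        (cosine_similarity(query_vec, r["embedding"]), r["text"])
        for r in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy data for illustration only; not taken from any report.
    store = [
        {"quarter": "2023-Q1", "text": "chunk a", "embedding": [1.0, 0.0]},
        {"quarter": "2023-Q1", "text": "chunk b", "embedding": [0.0, 1.0]},
        {"quarter": "2023-Q2", "text": "chunk c", "embedding": [1.0, 0.0]},
    ]
    print(top_k_for_quarter([1.0, 0.0], store, "2023-Q1", k=1))
```

Filtering by quarter before scoring keeps the retrieved context small, which is in line with the stated aim of reducing input tokens. A vector database such as Pinecone or ChromaDB would typically replace the linear scan with an indexed search.
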
- Airflow pipeline = http://35.224.2.133:8080/docs
- FastAPI = http://35.224.2.133:8000/docs
- Streamlit = https://rag-llm-pipelines.streamlit.app/
- Google Codelabs = https://codelabs-preview.appspot.com/?file_id=1FQ1l6-_82iIJGELg1N5bqvxTEUn3tmiW7c4I_f8xYzo#0