BigDataTeam5 is an organization focused on developing cutting-edge data engineering, AI, and analytics solutions across various domains. Our team specializes in building data pipelines, multi-agent AI architectures, retrieval-augmented generation (RAG) systems, and advanced analytics platforms using modern cloud technologies.
- Shushil Girish
- Yash Khavnekar
- Riya Mate
A multi-agent research assistant for NVIDIA financial and corporate analysis using:
- RAG Agent: Retrieves historical NVIDIA quarterly reports from Pinecone with metadata filtering
- Web Search Agent: Uses SerpAPI for real-time web search related to NVIDIA
- Snowflake Agent: Queries structured NVIDIA valuation metrics from Snowflake
- Architecture: Multi-node LangGraph orchestration with state management
- Deployment: Dockerized FastAPI backend and Streamlit frontend
- Demo: https://nvidia-research-assistant.streamlit.app/
An intelligent analytics platform for understanding healthcare market segments using advanced AI and NLP:
- Features: Market segmentation analysis, strategic query optimization, product comparison, sales & marketing analysis
- Architecture: FastAPI backend with MCP servers, Streamlit frontend
- Healthcare Segments: Diagnostic, Supplement, OTC Pharmaceutical, Fitness Wearable, Skin Care
- RAG System: Provides access to marketing knowledge from Philip Kotler's Marketing Management book
- Deployment: Multi-server architecture with unified and specialized microservices
- Demo : https://marketscope-ai-powered-industry-segment-intelligence-platform.streamlit.app/
A comprehensive data pipeline for processing, analyzing, and visualizing CO2 measurement data using Snowflake:
- Architecture: Multi-layer data architecture (Raw, Harmonized, Analytics, External)
- Key Components: Raw data ingestion, change data capture, harmonization, analytics processing, UDFs
- Technologies: Snowflake, Python, Snowpark, GitHub Actions, AWS S3, AWS Lambda
- CI/CD Pipeline: Automated testing and deployment with RSA key-based authentication
- License: MIT License
An end-to-end solution for building a master financial statement database for US public companies:
- Features: Data scraping, storage design, validation, Airflow ETL pipelines
- Technologies: Snowflake, Airflow, dbt, AWS S3, Streamlit, FastAPI
- Deployment: Google Cloud Run (FastAPI), Streamlit built-in deployment
- Demo Application: Financial analytics dashboard with fundamental analysis capabilities
- Demo Link(Streamlit): http://34.56.233.252:8501/
Enhanced the LLM Extractor Project with RAG concepts for token reduction and improved efficiency:
- Data Source: NVIDIA quarterly reports (past 5 years)
- PDF Parsing Options: Various parsers including Docling and Mistral OCR
- RAG Pipeline Options: Manual embeddings, Pinecone, ChromaDB
- Chunking Strategies: Fixed-size, semantic, sliding window
- Interactive UI: Streamlit interface with upload, parser selection, query capabilities
- Deployment: Dockerized Airflow pipeline and Streamlit + FastAPI interface
- Demo: https://rag-llm-pipelines.streamlit.app/
A Streamlit web application that leverages LLMs for document summarization and question answering:
- Features: PDF content selection/upload, LLM-based summarization, question answering
- Technologies: Streamlit (frontend), FastAPI (backend), LiteLLM (LLM integrations)
- Supported LLMs: GPT-4o, Gemini Flash, DeepSeek, Claude, Grok
- Architecture: Redis streams for communication between services
- Deployment: Docker Compose, cloud deployment
- Demo: https://llm-interactor-gpt.streamlit.app/
A multi-agent research assistant for crime data analysis:
- Agents: RAG agent (Pinecone), Web Search agent (SerpAPI), Snowflake agent, ImageGenerator agent, Comparison agent
- Architecture: State management with CrimeGPTState, multi-node structure
- Data Processing: Input processing, parallel data collection, analysis integration, report generation
- Technologies: Python, Pinecone, Snowflake, SerpAPI, Claude 3, Graphviz
- Deployment: FastAPI and Streamlit deployments
- Demo: https://crime-analysis-report.streamlit.app/
- AWS: S3, Lambda
- Google Cloud: Cloud Run
- Snowflake: Data warehousing, analytics
- Vector Databases: Pinecone, ChromaDB
- LLMs: Claude 3, GPT-4o, Gemini, DeepSeek, Grok
- RAG Frameworks: Custom implementations with various embedding strategies
- Languages: Python
- API Frameworks: FastAPI
- UI Frameworks: Streamlit
- Orchestration: Apache Airflow, LangGraph
- Containerization: Docker, Docker Compose
- CI/CD: GitHub Actions
- Web Search: SerpAPI
- PDF Processing: Docling, Mistral OCR, various parsers
Each repository has detailed setup instructions in its respective README file. Generally, the process involves:
- Cloning the desired repository
- Setting up environment variables
- Installing dependencies via pip or Poetry
- Running the application (typically involves a FastAPI backend and Streamlit frontend)
For more information about our projects, please feel free to explore the individual repositories and their documentation.
Most repositories are under MIT License unless otherwise specified.