A multimodal video indexing and retrieval system that lets you query any video and get back the most relevant video segments with timestamps, matched on semantic meaning rather than keywords alone.
This system leverages:
- 📷 Visual Understanding via BLIP image captioning
- 🔊 Audio Transcription via OpenAI Whisper
- 🔍 Fast Semantic Search using Sentence Transformers and FAISS
✨ Features

- Automatically extracts frames from videos at fixed intervals (see the sketch after this list)
- Generates captions for frames using BLIP
- Transcribes speech from video audio using Whisper
- Embeds both modalities (image + audio) using Sentence Transformers
- Stores embeddings in FAISS for fast nearest-neighbor search
- Returns the top-k relevant results with video name, modality, and timestamp
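
To make the fixed-interval sampling concrete, here is a minimal sketch using OpenCV. The 5-second interval matches the preprocessing step described below; the `extract_frames` helper is illustrative, not the repo's actual code.

```python
import cv2

def extract_frames(video_path: str, interval_s: float = 5.0):
    """Yield (timestamp_seconds, frame) pairs sampled every interval_s seconds."""
    cap = cv2.VideoCapture(video_path)
    try:
        fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
        step = max(1, round(fps * interval_s))   # frames to advance between samples
        frame_idx = 0
        while True:
            cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
            ok, frame = cap.read()
            if not ok:                           # end of video (or read failure)
                break
            yield frame_idx / fps, frame         # timestamp in seconds, BGR frame
            frame_idx += step
    finally:
        cap.release()
```

Seeking directly to each target frame index, rather than decoding every intermediate frame, keeps sampling cheap on long videos.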
🧰 Tech Stack

- Python
- OpenCV, ffmpeg
- BLIP
- Whisper
- SentenceTransformers
- FAISS (Facebook AI Similarity Search)
🚀 Setup

1. Clone the repository

   ```bash
   git clone https://github.com/joybratasarkar/video-query-rag.git
   cd video-query-rag
   ```

2. Add pre-downloaded video files

   This project does not download videos automatically. You must download or collect videos yourself (e.g., from YouTube, a file share, or local disk).

   Create a folder called `videos/` in the project root:

   ```bash
   mkdir videos
   ```

   Place all your pre-downloaded `.mp4`, `.mov`, `.avi`, or `.mkv` files inside the `videos/` folder. Example:

   ```
   video-query-rag/
   ├── videos/
   │   ├── wildlife_episode1.mp4
   │   ├── lecture_on_ai.mov
   │   └── man_vs_wild.mkv
   ```

   These videos will be processed for frame extraction and audio transcription during indexing.

3. Create and activate a virtual environment

   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```

4. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```
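
For reference, the tech stack above implies dependencies along these lines; these package names are an assumption for illustration, so defer to the repo's actual `requirements.txt`:

```
opencv-python
openai-whisper
transformers
torch
sentence-transformers
faiss-cpu
pillow
```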
🔨 Run Preprocessing to Build the FAISS Index
```bash
python preprocess.py
```

This will:

- Extract frames every 5 seconds
- Caption each frame using BLIP
- Transcribe audio using Whisper
- Embed all text data using SentenceTransformer
- Build a FAISS index for fast similarity search
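
For orientation, here is a minimal sketch of that indexing flow, reusing the `extract_frames` helper from the earlier sketch. The model checkpoints (`Salesforce/blip-image-captioning-base`, Whisper `base`, `all-MiniLM-L6-v2`) and the output file names (`video_index.faiss`, `metadata.json`) are illustrative assumptions; `preprocess.py` may make different choices.

```python
import json

import cv2
import faiss
import whisper
from PIL import Image
from sentence_transformers import SentenceTransformer
from transformers import BlipProcessor, BlipForConditionalGeneration

blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
asr = whisper.load_model("base")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

texts, metadata = [], []
video = "videos/wildlife_episode1.mp4"  # example file name

# 1. Caption frames sampled every 5 seconds (extract_frames: see earlier sketch).
for ts, frame in extract_frames(video):
    image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # BGR -> RGB
    inputs = blip_proc(image, return_tensors="pt")
    caption = blip_proc.decode(blip.generate(**inputs)[0], skip_special_tokens=True)
    texts.append(caption)
    metadata.append({"video": video, "modality": "image", "timestamp": ts})

# 2. Transcribe speech; Whisper returns timestamped segments.
for seg in asr.transcribe(video)["segments"]:
    texts.append(seg["text"])
    metadata.append({"video": video, "modality": "audio", "timestamp": seg["start"]})

# 3. Embed every caption/transcript line and build a flat L2 index over the vectors.
embeddings = embedder.encode(texts, convert_to_numpy=True).astype("float32")
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

faiss.write_index(index, "video_index.faiss")
with open("metadata.json", "w") as f:
    json.dump(metadata, f)
```

Keeping one metadata entry per embedded text, in insertion order, is what lets a FAISS hit be mapped back to a video, modality, and timestamp at query time.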
🔍 Run the Query Tool
```bash
python query.py
```
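
At query time, the query text is embedded with the same SentenceTransformer model and matched against the FAISS index. Below is a minimal sketch; the file names (`video_index.faiss`, `metadata.json`) and the example query carry over from the preprocessing sketch above, and `query.py`'s actual interface may differ.

```python
import json

import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # must match the indexing model
index = faiss.read_index("video_index.faiss")
with open("metadata.json") as f:
    metadata = json.load(f)

query = "a lion chasing prey across a grassland"  # example query
q = embedder.encode([query], convert_to_numpy=True).astype("float32")
distances, ids = index.search(q, 5)  # top-5 nearest neighbors

for dist, i in zip(distances[0], ids[0]):
    hit = metadata[i]
    print(f"{hit['video']} [{hit['modality']}] @ {hit['timestamp']:.1f}s (L2={dist:.3f})")
```

Because frame captions and transcript segments share one embedding space, a single text query retrieves hits from both modalities, each tagged with its source video and timestamp.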