🎥 Video Query RAG

A multimodal video indexing and retrieval system that lets you query any video and returns the most relevant video segments with timestamps, based on semantic meaning—not just keywords.

This system leverages:

  • 📷 Visual Understanding via BLIP image captioning
  • 🔊 Audio Transcription via OpenAI Whisper
  • 🔍 Fast Semantic Search using Sentence Transformers and FAISS

🚀 Features

  • Automatically extracts frames from videos at fixed intervals
  • Generates captions for frames using BLIP
  • Transcribes speech from video audio using Whisper
  • Embeds both modalities (image + audio) using Sentence Transformers
  • Stores embeddings in FAISS for fast nearest neighbor search
  • Returns top-k relevant results with video name, modality, and timestamp
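
The minimal sketch below illustrates the first two features above (sampling frames at a fixed interval and captioning them with BLIP). The OpenCV-based extraction, the `Salesforce/blip-image-captioning-base` checkpoint, and the `caption_frames` helper are assumptions for illustration, not this repository's actual code.

```python
# Illustrative sketch only: sample one frame every `interval_sec` seconds and caption it with BLIP.
import cv2
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_frames(video_path, interval_sec=5):
    """Yield (timestamp_seconds, caption) for frames sampled every interval_sec."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(fps * interval_sec))
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % step == 0:
            # BLIP expects an RGB PIL image; OpenCV returns BGR arrays.
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            inputs = processor(images=image, return_tensors="pt")
            out = model.generate(**inputs, max_new_tokens=30)
            yield frame_idx / fps, processor.decode(out[0], skip_special_tokens=True)
        frame_idx += 1
    cap.release()
```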

🛠️ Tech Stack

  • Python
  • BLIP (image captioning for extracted frames)
  • OpenAI Whisper (speech-to-text transcription)
  • Sentence Transformers (text embeddings)
  • FAISS (vector similarity search)

📦 Setup

  1. Clone the repository

    git clone https://github.com/joybratasarkar/video-query-rag.git
    cd video-query-rag
    
  2. Add Pre-Downloaded Video Files

    • This project does not download videos automatically.

    • You must manually download or collect videos (e.g., from YouTube, a file share, or local storage).

    • Then, create a folder called videos/ in the project root:

      mkdir videos
    • Place all your pre-downloaded .mp4, .mov, .avi, or .mkv files inside the videos/ folder.

      Example:

      video-query-rag/
      ├── videos/
      │   ├── wildlife_episode1.mp4
      │   ├── lecture_on_ai.mov
      │   └── man_vs_wild.mkv
      

    These videos will be processed for frame extraction and audio transcription during indexing.

  3. Create and activate a virtual environment

    python3 -m venv venv
    source venv/bin/activate

  4. Install dependencies

    pip install -r requirements.txt
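
The contents of `requirements.txt` are not reproduced here; based on the libraries named in this README, it likely includes packages along these lines (an assumption, not the repository's actual file):

```
transformers            # BLIP captioning
openai-whisper          # audio transcription
sentence-transformers   # text embeddings
faiss-cpu               # vector index
torch
opencv-python           # frame extraction (assumed)
```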

🔨 Run Preprocessing to Build the FAISS Index

    python preprocess.py

This will:

  • Extract frames every 5 seconds
  • Caption each frame using BLIP
  • Transcribe audio using Whisper
  • Embed all text data using SentenceTransformer
  • Build a FAISS index for fast similarity search

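The sketch below illustrates the transcription, embedding, and indexing steps listed above. The Whisper model size, the `all-MiniLM-L6-v2` embedding model, the `index_video` helper, and the metadata layout are assumptions for illustration; the repository's `preprocess.py` may differ.

```python
# Illustrative sketch only: transcribe audio, embed text, and build a FAISS index.
import faiss
import whisper
from sentence_transformers import SentenceTransformer

asr = whisper.load_model("base")                      # speech-to-text
embedder = SentenceTransformer("all-MiniLM-L6-v2")    # text embeddings

def index_video(video_path):
    # 1. Transcribe audio; Whisper returns segments with start timestamps.
    result = asr.transcribe(video_path)
    texts, metadata = [], []
    for seg in result["segments"]:
        texts.append(seg["text"])
        metadata.append({"video": video_path, "modality": "audio", "timestamp": seg["start"]})

    # (BLIP frame captions would be appended to texts/metadata in the same way,
    #  with modality "visual" and the frame timestamp.)

    # 2. Embed every text snippet with a SentenceTransformer.
    vectors = embedder.encode(texts, convert_to_numpy=True).astype("float32")

    # 3. Add the vectors to a FAISS index for fast nearest-neighbour search.
    index = faiss.IndexFlatL2(vectors.shape[1])
    index.add(vectors)
    return index, metadata
```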

🔍 Run the Query Tool

    python query.py

Enter a natural-language query and the tool returns the top-k matching segments with video name, modality, and timestamp.
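
A minimal sketch of the query step, assuming the `index`, `metadata`, and `embedder` objects produced during preprocessing above (the `search` function and its signature are illustrative, not the repository's actual API):

```python
# Illustrative sketch only: embed the query and search the FAISS index.
def search(query, index, metadata, embedder, top_k=5):
    q = embedder.encode([query], convert_to_numpy=True).astype("float32")
    distances, ids = index.search(q, top_k)
    # Map FAISS row ids back to (video, modality, timestamp) metadata.
    return [
        (metadata[i]["video"], metadata[i]["modality"], metadata[i]["timestamp"], float(d))
        for d, i in zip(distances[0], ids[0])
    ]

# Example usage:
# search("a lion hunting at night", index, metadata, embedder)
```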
