End-to-End Multimedia RAG Framework (Retrieval, SFS, QA, and Meta-Aggregation)#18

Open
aravind-3105 wants to merge 3 commits into main from video_rag

Conversation


@aravind-3105 aravind-3105 commented Mar 2, 2026

Summary

This pull request introduces a reference implementation of a multimedia Retrieval-Augmented Generation (RAG) pipeline for long-form video understanding. It also adds structured environment management and dataset preprocessing utilities to support reproducible experimentation.

The implementation integrates multimodal retrieval (ImageBind) with multimodal reasoning (Qwen Omni), enabling segment-level audiovisual retrieval and QA over temporally segmented video corpora.
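To make the retrieval stage concrete: segment-level retrieval over a joint embedding space typically reduces to a cosine-similarity nearest-neighbour search between a query embedding and pre-computed segment embeddings. The sketch below assumes ImageBind-style embeddings are already available as arrays; the function and variable names are illustrative, not taken from this PR's `src/` code:

```python
import numpy as np

def retrieve_top_k(query_emb: np.ndarray, segment_embs: np.ndarray, k: int = 3) -> list[int]:
    """Return indices of the k segments most similar to the query.

    Embeddings are L2-normalized first, so the dot product equals
    cosine similarity. segment_embs has shape (num_segments, dim).
    """
    q = query_emb / np.linalg.norm(query_emb)
    s = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    scores = s @ q                      # cosine similarity per segment
    return np.argsort(scores)[::-1][:k].tolist()  # highest-scoring first
```

The retrieved indices would then select the temporally segmented clips handed to the QA stage.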

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📝 Documentation update
  • 🔧 Refactoring (no functional changes)
  • ⚡ Performance improvement
  • 🧪 Test improvements
  • 🔒 Security fix

Changes Made

1. Project Documentation

  • Added a comprehensive README.md describing:

    • The multimedia RAG architecture and pipeline stages
    • Supported models (ImageBind, PyTorchVideo, Qwen Omni)
    • Dataset download and preprocessing instructions
    • Environment setup workflow
    • References to relevant benchmarks and datasets

This provides a complete entry point for setup and experimentation.

2. Environment & Dependency Management

  • Added pyproject.toml with two isolated dependency groups:

    • ref5-multimedia-rag-vlm (retrieval + embedding pipeline)
    • ref5-multimedia-rag-vlm-qa (QA + multimodal reasoning)
  • Explicit CUDA and package version specification for reproducibility.

  • Designed for clean environment separation between retrieval and QA stages.
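For orientation, isolated dependency groups in `pyproject.toml` follow the PEP 735 `[dependency-groups]` table. The fragment below is only a structural sketch — the actual package pins and CUDA specifiers live in the PR itself, and the listed packages are illustrative guesses based on the models named above:

```toml
[dependency-groups]
# Retrieval + embedding pipeline (illustrative entries only)
ref5-multimedia-rag-vlm = [
    "torch",
    "pytorchvideo",
]
# QA + multimodal reasoning stage (illustrative entries only)
ref5-multimedia-rag-vlm-qa = [
    "transformers",
]
```

Each group can then be synced in isolation (e.g. `uv sync --group ref5-multimedia-rag-vlm`), which is what gives the retrieval and QA stages separate environments.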

3. Source Code (src/)

  • Added a modular src/ package containing the core retrieval, segmentation, inference, meta-aggregation, and model components (AV-RAG, SFS, Qwen Omni), enabling a clean and extensible implementation of the multimedia RAG pipeline.

4. Notebook (multimedia_rag.ipynb)

  • Added an end-to-end experimental notebook demonstrating dataset preprocessing, multimodal retrieval, SFS frame selection, segment-level QA, and meta-agent aggregation within a reproducible research workflow.
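The notebook's stages chain together in the order listed above. A toy driver illustrating that control flow might look like the following — every identifier is hypothetical, and trivial text-based stand-ins replace the real ImageBind, SFS, and Qwen Omni components:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # segment start time (seconds)
    end: float     # segment end time (seconds)
    text: str      # toy stand-in for the segment's audiovisual content

def retrieve(segments: list[Segment], question: str, k: int) -> list[Segment]:
    """Toy retrieval: rank segments by word overlap with the question.
    (The real pipeline scores ImageBind embeddings instead.)"""
    q_words = set(question.lower().split())
    return sorted(segments,
                  key=lambda s: -len(q_words & set(s.text.lower().split())))[:k]

def select_frames(seg: Segment) -> str:
    """Toy SFS stand-in: pass the segment content through unchanged."""
    return seg.text

def segment_qa(content: str, question: str) -> str:
    """Toy QA stand-in (real pipeline: Qwen Omni)."""
    return "yes" if any(w in content.lower() for w in question.lower().split()) else "no"

def answer_question(segments: list[Segment], question: str, k: int = 2) -> str:
    """Retrieval -> SFS -> segment-level QA -> meta-aggregation (majority vote)."""
    top = retrieve(segments, question, k)
    per_segment = [segment_qa(select_frames(s), question) for s in top]
    return Counter(per_segment).most_common(1)[0][0]
```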

Testing

  • Tests pass locally (uv run pytest tests/)
  • Type checking passes (uv run mypy <src_dir>)
  • Linting passes (uv run ruff check src_dir/)
  • Manual testing performed (describe below)

Manual testing details:

Screenshots/Recordings

Related Issues

Deployment Notes

Checklist

  • Code follows the project's style guidelines
  • Self-review of code completed
  • Documentation updated (if applicable)
  • No sensitive information (API keys, credentials) exposed

@aravind-3105 aravind-3105 added the enhancement New feature or request label Mar 2, 2026