[GSoC 2026] Project #39: Automated Video and Audio Redacting/Censorship Pipeline #34561
Replies: 4 comments 1 reply
-
Thanks for your input, Siddhant. A couple of notes:
Apologies for the delay in responding. We will reply much faster going forward.
-
Thank you for the feedback, @bharagha. I appreciate the guidance; you're right that I jumped ahead. Let me take a step back and present my understanding of the problem statement first, followed by how that understanding shapes my design thinking.
1.1 Core Problem
1.2 What Constitutes "Undesired" Content
Each category has different detection characteristics:
1.3 Three Tiers of the Application
T1 - Manual/Easier Version:
T2 - LLM-Based Automated Detection (Main Version):
T3 - Visual Understanding (Advanced Version):
1.4 The Role of LLM Reasoning
1.5 Fine-Tuning Consideration
1.6 Input/Output Constraints
1.7 OEP Ecosystem Requirement
Based on the above understanding, here is how I see the system being structured, without committing to specific models yet:
2.1 Pipeline Stages
The system naturally decomposes into four stages:
Stage 1 - Decode & Segment:
Stage 2 - Transcribe / Perceive:
Stage 3 - Reason & Flag:
Stage 4 - Redact & Encode:
2.2 Key Design Decisions to Discuss
Before moving to implementation specifics, I'd like your input on a few design questions:
I want to make sure we are aligned on the problem and design approach before I dive into specific component choices, model selections, and implementation timelines. Looking forward to your thoughts.
Siddhant Shekhar
-
Thanks @sshekhar563 for the detailed note.
I would prefer directly working with multimodal input data and automation. Human in the loop is for ground truth accuracy.
Recommend not targeting it in the current scope, as it has its own complexity and effort requirements which I don't think are in the purview of this proposal.
Correct. There are more collaterals in OEP in addition to OpenVINO that you can use.
I would recommend checking out Video Search and Summary sample-application in edge-ai-libraries to see if you get some ideas here.
CLI is fine.
Batch is fine.
Provide the output in a way that a human-in-the-loop (HIL) reviewer can verify. The output pipeline need not wait for HIL, but it should be verifiable subsequently.
Provide more details here and we will help you make a decision :).
Suggest taking multimodal as the default requirement (i.e., video).
-
Hi @bharagha,
-
Dear @krish918, @14pankaj, @bharagha,
I am Siddhant Shekhar, a 3rd-year B.Tech (AI&ML) student. I have previously contributed to OpenVINO; here are some of my PRs: PR #32302 (merged), PR #32913 and PR #33450 (under review).
I am very interested in working on GSoC 2026 Project 39: "Automated Video and Audio Redacting/Censorship Pipeline" and would love to contribute to building a robust multimodal content moderation system using the OpenVINO ecosystem.
My Proposed Approach:
Audio Transcription: Using openai/whisper-base via Optimum Intel to generate word-level timestamps, essential for precise FFmpeg segment trimming.
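To illustrate why word-level timestamps matter for trimming, here is a minimal sketch (the helper names and the padding/gap values are my own, not from the proposal): flagged word spans are merged into redaction intervals, and a standard FFmpeg `volume` filter with a timeline `enable` expression silences each interval.

```python
# Sketch only: merge flagged (start, end) word spans, then build an
# FFmpeg audio-mute filter string. Pad/gap values are illustrative.

def merge_intervals(spans, pad=0.1, gap=0.25):
    """Merge flagged word spans into redaction intervals.

    pad widens each span slightly so cut-offs are not audible; gap joins
    near-adjacent spans so one mute covers a whole phrase.
    """
    widened = sorted((max(0.0, s - pad), e + pad) for s, e in spans)
    merged = []
    for s, e in widened:
        if merged and s - merged[-1][1] <= gap:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return [tuple(iv) for iv in merged]

def mute_filter(intervals):
    """Build an FFmpeg -af expression muting each (start, end) interval."""
    enable = "+".join(f"between(t,{s:.2f},{e:.2f})" for s, e in intervals)
    return f"volume=enable='{enable}':volume=0"

intervals = merge_intervals([(3.10, 3.45), (3.50, 3.90), (12.0, 12.4)])
print(mute_filter(intervals))
```

The `enable='between(t,a,b)'` timeline syntax is standard FFmpeg; an actual pipeline might bleep or cut instead of muting, but the interval-merging step is the same.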
Visual Detection: Using YOLOv11 for person/face localization, passing spatial metadata downstream to an LLM for contextual violence/abuse judgment rather than direct classification.
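As a sketch of the "spatial metadata downstream" idea, detector boxes could be serialized into one compact text line that a text-only LLM can reason over; the schema, position buckets, and thresholds here are hypothetical assumptions, not project decisions.

```python
# Hypothetical schema: render per-frame detection boxes as a single
# prompt line for the contextual-reasoning LLM, instead of asking the
# detector itself to classify violence/abuse.

def describe_detections(frame_ts, detections, frame_w=1920, frame_h=1080):
    """Summarize boxes as 'label (position, size, confidence)' fragments."""
    parts = []
    for d in detections:
        x1, y1, x2, y2 = d["box"]
        cx = (x1 + x2) / 2 / frame_w                        # normalized center x
        area = (x2 - x1) * (y2 - y1) / (frame_w * frame_h)  # fraction of frame
        pos = "left" if cx < 0.33 else "right" if cx > 0.66 else "center"
        parts.append(f"{d['label']} ({pos}, {area:.0%} of frame, conf {d['conf']:.2f})")
    return f"[t={frame_ts:.1f}s] " + "; ".join(parts)

line = describe_detections(12.5, [
    {"label": "person", "box": (100, 200, 700, 1000), "conf": 0.91},
    {"label": "person", "box": (1400, 300, 1800, 1000), "conf": 0.84},
])
print(line)
```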
Contextual Reasoning: Using Phi-3-mini-4k-instruct-ov to analyze transcript windows and determine whether a segment requires censorship based on defined criteria (violent, abusive, confidential).
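One common way to feed a long transcript to an LLM with bounded context is overlapping windows; a minimal sketch follows (window and stride sizes are illustrative guesses). The overlap ensures a phrase split across a window border is still seen whole at least once.

```python
# Sketch: slide overlapping windows over timestamped transcript words,
# yielding (start, end, text) chunks for the LLM reasoning stage.

def transcript_windows(words, window_s=30.0, stride_s=15.0):
    """Yield (start, end, text) windows over (start, end, word) triples."""
    if not words:
        return
    t_end = max(e for _, e, _ in words)
    t = 0.0
    while t < t_end:
        lo, hi = t, t + window_s
        chunk = [w for s, e, w in words if s < hi and e > lo]
        if chunk:
            yield lo, min(hi, t_end), " ".join(chunk)
        t += stride_s

words = [(0.0, 0.4, "hello"), (10.0, 10.5, "there"), (40.0, 40.6, "world")]
for lo, hi, text in transcript_windows(words):
    print(f"{lo:.0f}-{hi:.0f}s: {text}")
```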
Human-in-the-Loop Redaction: Generating a structured JSON redaction map after each pipeline run, displayed in a React frontend where users can review, approve, or reject AI suggestions before FFmpeg executes any permanent changes.
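A possible shape for that structured JSON redaction map is sketched below (all field names are hypothetical). The key property is the review gate: only segments a human has explicitly approved ever reach the FFmpeg encode step.

```python
# Hypothetical redaction-map schema with an approve/reject review gate.

import json

redaction_map = {
    "source": "input.mp4",
    "models": {"asr": "whisper-base", "llm": "Phi-3-mini-4k-instruct"},
    "segments": [
        {"id": 1, "start": 3.0, "end": 4.0, "kind": "audio",
         "reason": "abusive language", "confidence": 0.93, "status": "pending"},
        {"id": 2, "start": 11.9, "end": 12.5, "kind": "video",
         "reason": "violent scene", "confidence": 0.71, "status": "pending"},
    ],
}

def approved_segments(rmap):
    """Filter: only reviewer-approved segments reach the encode stage."""
    return [s for s in rmap["segments"] if s["status"] == "approved"]

# Simulate the review step in the frontend: approve 1, reject 2.
redaction_map["segments"][0]["status"] = "approved"
redaction_map["segments"][1]["status"] = "rejected"
print(json.dumps([s["id"] for s in approved_segments(redaction_map)]))
```

Keeping model versions in the map also makes each run auditable after the fact, which fits the mentors' note that output should be verifiable subsequently.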
Deployment: FastAPI + Docker containerization, with async task processing.
All models will be converted to OpenVINO IR format using openvino.convert_model() and quantized to INT8 via NNCF.
Questions:
Does this direction align with your vision for the project?
Do you have a preference for a specific VLM for visual reasoning, or is the YOLO + LLM combination a good approach?
Looking forward to your feedback! Thank you for your time and consideration.
Siddhant Shekhar