[GSoC 2026] Project #39: Automated Video and Audio Redacting/Censorship Pipeline #34561
Replies: 4 comments 1 reply
-
Thanks for your input, Siddhant. A couple of notes:
Apologies for the delay in responding. We will reply much faster going forward.
-
Thank you for the feedback, @bharagha. I appreciate the guidance; you're right that I jumped ahead. Let me take a step back and present my understanding of the problem statement first, followed by how that understanding shapes my design thinking.
1.1 Core Problem
1.2 What Constitutes "Undesired" Content
Each category has different detection characteristics:
1.3 Three Tiers of the Application
T1 - Manual/Easier Version:
T2 - LLM-Based Automated Detection (Main Version):
T3 - Visual Understanding (Advanced Version):
1.4 The Role of LLM Reasoning
1.5 Fine-Tuning Consideration
1.6 Input/Output Constraints
1.7 OEP Ecosystem Requirement
Based on the above understanding, here is how I see the system being structured, without committing to specific models yet:
2.1 Pipeline Stages
The system naturally decomposes into four stages:
Stage 1 - Decode & Segment:
Stage 2 - Transcribe / Perceive:
Stage 3 - Reason & Flag:
Stage 4 - Redact & Encode:
2.2 Key Design Decisions to Discuss
Before moving to implementation specifics, I'd like your input on a few design questions:
I want to make sure we are aligned on the problem and design approach before I dive into specific component choices, model selections, and implementation timelines. Looking forward to your thoughts.
Siddhant Shekhar
-
Thanks @sshekhar563 for the detailed note.
I would prefer directly working with multimodal input data and automation. Human in the loop is for ground truth accuracy.
Recommend not targeting it in the current scope, as it has its own complexity and effort requirements which I don't think are in the purview of this proposal.
Correct. There are more collaterals in OEP in addition to OpenVINO that you can use.
I would recommend checking out Video Search and Summary sample-application in edge-ai-libraries to see if you get some ideas here.
CLI is fine.
Batch is fine.
Provide the output in a way that a human-in-the-loop (HIL) reviewer can verify. The output pipeline need not wait for HIL, but it should be verifiable subsequently.
Provide more details here and we will help you make a decision :).
Suggest taking multimodal as the default requirement (i.e., video).
-
Hi @bharagha,
-
Dear @krish918, @14pankaj, @bharagha,
I am Siddhant Shekhar, a 3rd-year B.Tech (AI&ML) student. I have previously contributed to OpenVINO; here are some of my PRs: PR #32302 (merged), PR #32913 and PR #33450 (under review).
I am very interested in working on GSoC 2026 Project 39: "Automated Video and Audio Redacting/Censorship Pipeline" and would love to contribute to building a robust multimodal content moderation system using the OpenVINO ecosystem.
My Proposed Approach:
Audio Transcription: Using openai/whisper-base via Optimum Intel to generate word-level timestamps, essential for precise FFmpeg segment trimming.
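To illustrate why word-level timestamps matter for trimming, here is a minimal sketch (the helper names and the padding/gap values are my own, not from the proposal): flagged word spans are merged into redaction intervals, and a standard FFmpeg `volume` filter with a timeline `enable` expression silences each interval.

```python
# Sketch only: merge flagged (start, end) word spans, then build an
# FFmpeg audio-mute filter string. Pad/gap values are illustrative.

def merge_intervals(spans, pad=0.1, gap=0.25):
    """Merge flagged word spans into redaction intervals.

    pad widens each span slightly so cut-offs are not audible; gap joins
    near-adjacent spans so one mute covers a whole phrase.
    """
    widened = sorted((max(0.0, s - pad), e + pad) for s, e in spans)
    merged = []
    for s, e in widened:
        if merged and s - merged[-1][1] <= gap:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return [tuple(iv) for iv in merged]

def mute_filter(intervals):
    """Build an FFmpeg -af expression muting each (start, end) interval."""
    enable = "+".join(f"between(t,{s:.2f},{e:.2f})" for s, e in intervals)
    return f"volume=enable='{enable}':volume=0"

intervals = merge_intervals([(3.10, 3.45), (3.50, 3.90), (12.0, 12.4)])
print(mute_filter(intervals))
```

The `enable='between(t,a,b)'` timeline syntax is standard FFmpeg; an actual pipeline might bleep or cut instead of muting, but the interval-merging step is the same.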
Visual Detection: Using YOLOv11 for person/face localization, passing spatial metadata downstream to an LLM for contextual violence/abuse judgment rather than direct classification.
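As a sketch of the "spatial metadata downstream" idea, detector boxes could be serialized into one compact text line that a text-only LLM can reason over; the schema, position buckets, and thresholds here are hypothetical assumptions, not project decisions.

```python
# Hypothetical schema: render per-frame detection boxes as a single
# prompt line for the contextual-reasoning LLM, instead of asking the
# detector itself to classify violence/abuse.

def describe_detections(frame_ts, detections, frame_w=1920, frame_h=1080):
    """Summarize boxes as 'label (position, size, confidence)' fragments."""
    parts = []
    for d in detections:
        x1, y1, x2, y2 = d["box"]
        cx = (x1 + x2) / 2 / frame_w                        # normalized center x
        area = (x2 - x1) * (y2 - y1) / (frame_w * frame_h)  # fraction of frame
        pos = "left" if cx < 0.33 else "right" if cx > 0.66 else "center"
        parts.append(f"{d['label']} ({pos}, {area:.0%} of frame, conf {d['conf']:.2f})")
    return f"[t={frame_ts:.1f}s] " + "; ".join(parts)

line = describe_detections(12.5, [
    {"label": "person", "box": (100, 200, 700, 1000), "conf": 0.91},
    {"label": "person", "box": (1400, 300, 1800, 1000), "conf": 0.84},
])
print(line)
```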
Contextual Reasoning: Using Phi-3-mini-4k-instruct-ov to analyze transcript windows and determine whether a segment requires censorship based on defined criteria (violent, abusive, confidential).
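One common way to feed a long transcript to an LLM with bounded context is overlapping windows; a minimal sketch follows (window and stride sizes are illustrative guesses). The overlap ensures a phrase split across a window border is still seen whole at least once.

```python
# Sketch: slide overlapping windows over timestamped transcript words,
# yielding (start, end, text) chunks for the LLM reasoning stage.

def transcript_windows(words, window_s=30.0, stride_s=15.0):
    """Yield (start, end, text) windows over (start, end, word) triples."""
    if not words:
        return
    t_end = max(e for _, e, _ in words)
    t = 0.0
    while t < t_end:
        lo, hi = t, t + window_s
        chunk = [w for s, e, w in words if s < hi and e > lo]
        if chunk:
            yield lo, min(hi, t_end), " ".join(chunk)
        t += stride_s

words = [(0.0, 0.4, "hello"), (10.0, 10.5, "there"), (40.0, 40.6, "world")]
for lo, hi, text in transcript_windows(words):
    print(f"{lo:.0f}-{hi:.0f}s: {text}")
```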
Human-in-the-Loop Redaction: Generating a structured JSON redaction map after each pipeline run, displayed in a React frontend where users can review, approve, or reject AI suggestions before FFmpeg executes any permanent changes.
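A possible shape for that structured JSON redaction map is sketched below (all field names are hypothetical). The key property is the review gate: only segments a human has explicitly approved ever reach the FFmpeg encode step.

```python
# Hypothetical redaction-map schema with an approve/reject review gate.

import json

redaction_map = {
    "source": "input.mp4",
    "models": {"asr": "whisper-base", "llm": "Phi-3-mini-4k-instruct"},
    "segments": [
        {"id": 1, "start": 3.0, "end": 4.0, "kind": "audio",
         "reason": "abusive language", "confidence": 0.93, "status": "pending"},
        {"id": 2, "start": 11.9, "end": 12.5, "kind": "video",
         "reason": "violent scene", "confidence": 0.71, "status": "pending"},
    ],
}

def approved_segments(rmap):
    """Filter: only reviewer-approved segments reach the encode stage."""
    return [s for s in rmap["segments"] if s["status"] == "approved"]

# Simulate the review step in the frontend: approve 1, reject 2.
redaction_map["segments"][0]["status"] = "approved"
redaction_map["segments"][1]["status"] = "rejected"
print(json.dumps([s["id"] for s in approved_segments(redaction_map)]))
```

Keeping model versions in the map also makes each run auditable after the fact, which fits the mentors' note that output should be verifiable subsequently.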
Deployment: FastAPI + Docker containerization, with async task processing.
All models will be converted to OpenVINO IR format using openvino.convert_model() and quantized to INT8 via NNCF.
Questions:
Does this direction align with your vision for the project?
Do you have a preference for a specific VLM for visual reasoning, or is the YOLO + LLM combination a good approach?
Looking forward to your feedback! Thank you for your time and consideration.
Siddhant Shekhar