Conversation

@ceberam
Member

@ceberam ceberam commented Dec 15, 2025

Refactors the WebVTT backend parser and the ASR pipeline to align with the latest changes in docling-core.

  • The WebVTT backend parser reads a WebVTT file and populates the new source field in DoclingDocument (type TrackSource).
  • Audio files are parsed with the ASR pipeline, which also uses the source field, so the text is kept separate from the metadata (timings and speaker).
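
To illustrate the separation of text from metadata described above, here is a minimal, self-contained sketch. It does not use docling or docling-core: `CueSketch` and `parse_cues` are hypothetical names introduced only for this example, and the actual TrackSource model may carry different fields. The sketch assumes well-formed cue blocks separated by blank lines.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CueSketch:
    """Hypothetical container; NOT the docling-core TrackSource model."""
    start: float            # cue start, in seconds
    end: float              # cue end, in seconds
    speaker: Optional[str]  # taken from a <v Speaker> tag, if present
    text: str               # cue payload with all tags stripped


# Timestamps are "hh:mm:ss.ttt" or "mm:ss.ttt"; the hours part is optional.
_TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+"
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)
_VOICE = re.compile(r"<v(?:\.[^\s>]+)*\s+([^>]+)>")
_TAG = re.compile(r"<[^>]+>")


def _seconds(h, m, s, ms) -> float:
    return int(h or 0) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_cues(vtt: str) -> List[CueSketch]:
    """Split a WebVTT string into cues, keeping metadata apart from text."""
    cues = []
    for block in re.split(r"\n\s*\n", vtt.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            m = _TIMING.search(line)
            if m is None:
                continue  # header, cue identifier, or note line
            payload = "\n".join(lines[i + 1:])
            voice = _VOICE.search(payload)
            cues.append(CueSketch(
                start=_seconds(*m.groups()[:4]),
                end=_seconds(*m.groups()[4:]),
                speaker=voice.group(1).strip() if voice else None,
                text=_TAG.sub("", payload).strip(),
            ))
            break  # one timing line per block
    return cues


sample = (
    "WEBVTT\n\n"
    "00:00:01.000 --> 00:00:04.500\n"
    "<v Alice>Hello <b>world</b></v>\n\n"
    "1\n"
    "00:01:05.250 --> 00:01:07.000\n"
    "No speaker here"
)
cues = parse_cues(sample)
```

With the metadata held in its own fields, a consumer can export only `text` when timings and speaker labels are not wanted.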

Resolves #2564

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

@github-actions
Contributor

github-actions bot commented Dec 15, 2025

DCO Check Passed

Thanks @ceberam, all your commits are properly signed off. 🎉

@mergify

mergify bot commented Dec 15, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
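
As a rough local check of the title rule above, the pattern can be tried with Python's `re` module. This is a sketch that assumes the merge-protection condition behaves like a Python regular expression searched against the pull request title; `is_conventional` is a hypothetical helper, not part of any tool named here.

```python
import re

# Pattern copied from the rule above.
TITLE_RE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return TITLE_RE.search(title) is not None


# Titles this pull request carried at different times.
titles = [
    "refactor: webvtt and provenance tracker (WIP)",
    "refactor: webvtt and provenance tracker",
    "feat: webvtt and source tracker",
]
results = [is_conventional(t) for t in titles]
```

The optional `(...)` group allows a scope such as `fix(asr):`, and the optional `!` marks a breaking change.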

🟢 Require two reviewers for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

…em and ProvenanceTrack

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
…classes

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
@ceberam ceberam force-pushed the dev/webvtt-refactor branch from f0e493d to 58e4da9 Compare January 30, 2026 15:20
@ceberam ceberam marked this pull request as ready for review January 30, 2026 15:26
@dosubot

dosubot bot commented Jan 30, 2026

Related Documentation

Checked 7 published document(s) in 0 knowledge base(s). No updates required.

@codecov

codecov bot commented Jan 30, 2026

Codecov Report

❌ Patch coverage is 98.55072% with 1 line in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
docling/backend/webvtt_backend.py | 98.38% | 1 Missing ⚠️

Drop WebVTT formatting features not covered by Docling across formats.
Only the 'u', 'b', 'i', and 'v' tags are supported, and only without classes.
Align with docling-core v2.62.0

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
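
The tag restriction in the commit message above can be sketched as a small filter over cue text. This is an illustration only, not the code in `docling/backend/webvtt_backend.py`: `simplify_cue_text` is a hypothetical name, and keeping the inner text of dropped tags is an assumption of this sketch.

```python
import re

SUPPORTED = {"u", "b", "i", "v"}

# Groups: optional "/", tag name, ".class" chain, optional annotation
# (for example the speaker name in "<v Alice>").
_TAG = re.compile(r"<(/?)([^\s.>/]*)((?:\.[^\s.>]+)*)(\s[^>]*)?>")


def _keep_or_drop(match: "re.Match[str]") -> str:
    slash, name, _classes, annotation = match.groups()
    if name not in SUPPORTED:
        return ""  # unsupported tag (c, lang, ruby, timestamps, ...): drop it
    if slash:
        return f"</{name}>"
    # Classes are always discarded; only <v> keeps its annotation.
    if name == "v" and annotation and annotation.strip():
        return f"<v {annotation.strip()}>"
    return f"<{name}>"


def simplify_cue_text(text: str) -> str:
    """Keep only u, b, i and v tags, without classes; keep all inner text."""
    return _TAG.sub(_keep_or_drop, text)


raw = "<v.loud Alice>Hi <b.big>there</b> <c.yellow>you</c><00:00:02.000> all</v>"
clean = simplify_cue_text(raw)
```

A regular expression is enough for this illustration; a real parser would also need to handle escaped characters and malformed markup.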
@ceberam ceberam force-pushed the dev/webvtt-refactor branch from 58e4da9 to 350594b Compare January 30, 2026 15:34
@ceberam ceberam changed the title refactor: webvtt and provenance tracker (WIP) refactor: webvtt and provenance tracker Jan 30, 2026
@ceberam ceberam changed the title refactor: webvtt and provenance tracker feat: webvtt and source tracker Jan 30, 2026
cau-git previously approved these changes Jan 30, 2026
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Member

@dolfim-ibm dolfim-ibm left a comment

lgtm

@ceberam ceberam merged commit 0602a7c into main Jan 30, 2026
27 checks passed
@ceberam ceberam deleted the dev/webvtt-refactor branch January 30, 2026 16:44
Development

Successfully merging this pull request may close these issues.

Feature request: Option to turn off timing metadata during ASR

3 participants