feat(docx): add track changes (w:ins/w:del) support #3579
❌ DCO Check Failed

Hi @a-huk, your pull request has failed the Developer Certificate of Origin (DCO) check. This repository supports remediation commits, so you can fix this without rewriting history, but you must follow the required message format.

🛠 Quick Fix: Add a remediation commit

Run this command:

    git commit --allow-empty -s -m "DCO Remediation Commit for Adam Huk <huk.adam.g@gmail.com>
    I, Adam Huk <huk.adam.g@gmail.com>, hereby add my Signed-off-by to this commit: 83b2b51eccf18478607f9914558138f14639dcd3
    I, Adam Huk <huk.adam.g@gmail.com>, hereby add my Signed-off-by to this commit: acaa92f7caf15ad12dbb82f30babf60293caf483
    I, a-huk <huk.adam.g@gmail.com>, hereby add my Signed-off-by to this commit: 4c62fdaa576317765ebabedf589cc0ce435f19c3
    I, a-huk <huk.adam.g@gmail.com>, hereby add my Signed-off-by to this commit: 03bc900250e93cccb870ec33eb401d7a91b09301
    I, a-huk <huk.adam.g@gmail.com>, hereby add my Signed-off-by to this commit: f32bea3a13f36ed0de3ef6e297fc33b97f4df6ba"
    git push

🔧 Advanced: Sign off each commit directly

For the latest commit:

    git commit --amend --signoff
    git push --force-with-lease

For multiple commits:

    git rebase --signoff origin/main
    git push --force-with-lease

More info: DCO check report
|
Merge Protections
🔴 1 of 2 protections blocking · waiting on 👀 reviews
🔴 Require two reviewer for test updates
This rule is failing. When test data is updated, we require two reviewers.
🟢 Enforce conventional commit
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
Codecov Report
❌ Patch coverage is
|
ceberam left a comment:
Thanks @a-huk for your interest in Docling and your contribution!
Before I dive in, should we nail down a few high-level details? I have the impression that we should first agree on the feature we want to support before the code can be reviewed, since this is a completely new feature in Docling. By front-loading the design discussion we can save time and effort (yours and the reviewers') and ensure that any code changes are intentional, scoped, and aligned with the feature goals. Please see the thread on the original issue and feel free to add your comments.
Thanks again for your work! 🎉
|
The change_type field needs to be added to TextItem in docling-core. I've temporarily patched the local venv's site-packages to add it (for testing), but this requires a separate docling-core PR. |
Word's Track Changes feature (also called Suggestions in newer Word versions) wraps inserted text in <w:ins> and deleted text in <w:del> elements. Previously both were silently dropped, causing content loss.

New MsWordBackendOptions.track_changes field controls behaviour:
- "accept" (default): include insertions, drop deletions (final document)
- "reject": drop insertions, include deletions (original document)
- "raw": include both; insertions get underline formatting, deletions get strikethrough so they are visually distinguishable

Exposed via --docx-track-changes CLI flag (default: accept).

Fixes: docling-project#3152, docling-project#745
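To make the three modes concrete, here is a minimal sketch that applies them to a single WordprocessingML paragraph using only the standard library. It is not the Docling backend code: `paragraph_text` and `SAMPLE` are hypothetical names, and only direct children of the paragraph are handled. One detail worth noting is that runs inside `<w:del>` carry their text in `<w:delText>` rather than `<w:t>`.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_text(p_xml: str, mode: str = "accept") -> str:
    """Return the text of one <w:p> under a track-changes mode (illustrative only)."""
    if mode not in ("accept", "reject", "raw"):
        raise ValueError(f"unknown mode: {mode!r}")
    parts = []
    for child in ET.fromstring(p_xml):
        name = child.tag.split("}")[-1]  # local name without the namespace
        if name == "r":
            # Untracked run: always kept.
            parts.extend(t.text or "" for t in child.iter(f"{{{W}}}t"))
        elif name == "ins" and mode in ("accept", "raw"):
            # Tracked insertion: kept when accepting or showing everything.
            parts.extend(t.text or "" for t in child.iter(f"{{{W}}}t"))
        elif name == "del" and mode in ("reject", "raw"):
            # Tracked deletion: its text lives in <w:delText>, not <w:t>.
            parts.extend(t.text or "" for t in child.iter(f"{{{W}}}delText"))
    return "".join(parts)


# Hypothetical paragraph: "old " was deleted and "new " was inserted.
SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    "<w:r><w:t>The </w:t></w:r>"
    '<w:del w:id="1" w:author="A"><w:r><w:delText>old </w:delText></w:r></w:del>'
    '<w:ins w:id="2" w:author="A"><w:r><w:t>new </w:t></w:r></w:ins>'
    "<w:r><w:t>text</w:t></w:r>"
    "</w:p>"
)

print(paragraph_text(SAMPLE, "accept"))  # The new text
print(paragraph_text(SAMPLE, "reject"))  # The old text
print(paragraph_text(SAMPLE, "raw"))     # The old new text
```

In this sketch, raw mode simply concatenates both variants; distinguishing them in the output is a separate concern.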
… TextItem

In raw mode, tracked insertions/deletions now set change_type='inserted' or change_type='deleted' on TextItem rather than injecting underline/strikethrough formatting, keeping semantic meaning separate from visual presentation.

Requires a matching docling-core change to add change_type to TextItem.
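The separation described in this commit can be sketched roughly as follows. `Span` is a hypothetical stand-in, not docling-core's `TextItem`, and both functions are illustrative consumers rather than existing serializers; the point is only that a semantic `change_type` lets each consumer decide how, or whether, to show a change.

```python
from dataclasses import dataclass
from typing import List, Literal, Optional

ChangeType = Literal["inserted", "deleted"]


@dataclass
class Span:
    """Hypothetical stand-in for a text item; NOT docling-core's TextItem."""

    text: str
    change_type: Optional[ChangeType] = None  # None means untracked text


def render_marked(spans: List[Span]) -> str:
    """One possible presentation: strike deletions, tag insertions."""
    out = []
    for s in spans:
        if s.change_type == "deleted":
            out.append(f"~~{s.text}~~")
        elif s.change_type == "inserted":
            out.append(f"<ins>{s.text}</ins>")
        else:
            out.append(s.text)
    return "".join(out)


def accepted_text(spans: List[Span]) -> str:
    """Another consumer: ignore presentation and keep only the accepted text."""
    return "".join(s.text for s in spans if s.change_type != "deleted")


spans = [Span("The "), Span("old", "deleted"), Span("new", "inserted"), Span(" text")]
print(render_marked(spans))   # The ~~old~~<ins>new</ins> text
print(accepted_text(spans))   # The new text
```

Had the deletion been stored only as strikethrough formatting, `accepted_text` could not tell a tracked deletion apart from text the author deliberately struck through.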
The upstream added a recursive child-expander that treated w:ins as a transparent container, causing its runs to bypass the track-changes handler and always appear in the output regardless of mode. Remove "ins" from the transparent-container set so the existing w:ins / w:del logic sees the element and can filter or annotate it.
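The bug described here can be reproduced in miniature. The sketch below is not the upstream code: `expand`, `visible_text`, and the contents of the transparent set are assumptions for illustration. It shows that once `ins` is flattened away as a transparent wrapper, the handler never sees a `w:ins` element and so cannot drop it in reject mode.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def local(el: ET.Element) -> str:
    """Local tag name without the namespace."""
    return el.tag.split("}")[-1]


def expand(parent: ET.Element, transparent: set):
    """Yield children, recursively flattening wrappers listed in `transparent`."""
    for child in parent:
        if local(child) in transparent:
            yield from expand(child, transparent)
        else:
            yield child


def visible_text(p: ET.Element, transparent: set, mode: str = "reject") -> str:
    parts = []
    for el in expand(p, transparent):
        if local(el) == "ins" and mode == "reject":
            continue  # track-changes handler: drop insertions when rejecting
        parts.extend(t.text or "" for t in el.iter(f"{{{W}}}t"))
    return "".join(parts)


P = ET.fromstring(
    f'<w:p xmlns:w="{W}">'
    "<w:r><w:t>kept </w:t></w:r>"
    "<w:ins><w:r><w:t>inserted</w:t></w:r></w:ins>"
    "</w:p>"
)

# Hypothetical transparent sets, before and after the fix.
print(repr(visible_text(P, {"ins", "smartTag"})))  # 'kept inserted' (insertion leaks)
print(repr(visible_text(P, {"smartTag"})))         # 'kept ' (handler filters it)
```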
21fd102 to f32bea3
|
@PeterStaar-IBM let me know what you think of it now |
Add support for Word's Track Changes feature (also called Suggestions in newer Word/Office 365 versions). Previously, inserted text (`w:ins`) and deleted text (`w:del`) were silently dropped during DOCX conversion, causing content loss.

New `MsWordBackendOptions.track_changes` field controls the behaviour:
- `"accept"` (default): include insertions, drop deletions (the final accepted document)
- `"reject"`: drop insertions, include deletions (the original document)
- `"raw"`: include both; insertions get underline formatting, deletions get strikethrough

Also exposed as `--docx-track-changes` CLI flag.

Issue resolved by this Pull Request:
Resolves #3152
Resolves #745
Checklist: