@emerose
Contributor

@emerose emerose commented Jan 24, 2026

Summary

Documents with custom styles like "Heading 0" cause a validation error because SectionHeaderItem requires level >= 1. This PR fixes heading level extraction to use the authoritative OOXML outlineLvl property and adds defense-in-depth clamping.

Changes

  • Add _get_outline_level_from_style() to extract outlineLvl from the style definition. OOXML outlineLvl is 0-indexed (0-8 for levels 1-9), so we convert to 1-indexed heading levels (outlineLvl + 1)
  • In _get_label_and_level: first try to get the level from outlineLvl (the authoritative source), then fall back to parsing from style name
  • In _get_heading_and_level: clamp the extracted level to a minimum of 1 as defense in depth (handles custom styles like "Heading 0")
  • In _add_heading: clamp the level to a minimum of 1 once more before using it, as a further defense-in-depth measure
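
The resolution order described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names, signatures, and the final default of 1 are assumptions, and the real methods in msword_backend.py operate on python-docx paragraph and style objects rather than plain values.

```python
import re
from typing import Optional


def level_from_outline(outline_lvl: Optional[int]) -> Optional[int]:
    """Convert a 0-indexed outlineLvl (0-8) to a 1-indexed heading level (1-9).

    Hypothetical helper: values outside 0-8 are treated as "no level".
    """
    if outline_lvl is None or not 0 <= outline_lvl <= 8:
        return None
    return outline_lvl + 1


def level_from_style_name(style_name: str) -> Optional[int]:
    """Fallback: parse a trailing number from a name such as 'Heading 2'."""
    match = re.search(r"(\d+)\s*$", style_name)
    return int(match.group(1)) if match else None


def resolve_heading_level(style_name: str, outline_lvl: Optional[int]) -> int:
    """Prefer outlineLvl, fall back to the style name, then clamp to >= 1."""
    level = level_from_outline(outline_lvl)
    if level is None:
        level = level_from_style_name(style_name)
    if level is None:
        level = 1  # assumed default when neither source yields a level
    return max(1, level)  # defense in depth: never return a level below 1
```

With this sketch, a "Heading 0" style resolves to level 1 whether or not outlineLvl is present: the outline level gives 0 + 1, and the name-based fallback gives 0, which the clamp raises to 1.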

Root Cause

The error occurred when processing a Word document with a custom style named "Heading 0":

pydantic_core._pydantic_core.ValidationError: 1 validation error for SectionHeaderItem
level
  Input should be greater than or equal to 1 [type=greater_than_equal, input_value=0, input_type=int]

The style had <w:outlineLvl w:val="0"/>, which in OOXML correctly indicates a top-level heading (equivalent to Heading 1). However, docling was parsing the level from the style name ("Heading 0" → level 0) rather than using the outlineLvl property.
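
To make the mismatch concrete, here is a small standalone sketch that reads outlineLvl from a style definition using only the standard library. The XML fragment is a hand-written approximation of what such a style might look like in styles.xml (the styleId is invented), and outline_level is a hypothetical helper, not the _get_outline_level_from_style method added in this PR.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# WordprocessingML main namespace.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Assumed shape of the offending style; the styleId is illustrative only.
STYLE_XML = f"""
<w:style xmlns:w="{W_NS}" w:type="paragraph" w:styleId="Heading0">
  <w:name w:val="Heading 0"/>
  <w:pPr>
    <w:outlineLvl w:val="0"/>
  </w:pPr>
</w:style>
"""


def outline_level(style_xml: str) -> Optional[int]:
    """Return the raw 0-indexed outlineLvl of a style, or None if absent."""
    root = ET.fromstring(style_xml.strip())
    node = root.find(f"{{{W_NS}}}pPr/{{{W_NS}}}outlineLvl")
    if node is None:
        return None
    raw = node.get(f"{{{W_NS}}}val")
    return int(raw) if raw is not None and raw.isdigit() else None


raw_level = outline_level(STYLE_XML)           # 0-indexed value from the XML
heading_level = raw_level + 1                  # 1-indexed heading level
name_level = int("Heading 0".rsplit(" ", 1)[1])  # what name parsing yields
```

Here heading_level is 1, which satisfies the level >= 1 constraint, while name_level is 0, the value that triggered the validation error.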

Test plan

  • Added parametrized tests for _get_heading_and_level edge cases (Heading 0, Heading 1, etc.)
  • Added test for _get_outline_level_from_style to verify correct extraction and conversion
  • All existing msword tests pass

🤖 Generated with Claude Code

@github-actions
Contributor

github-actions bot commented Jan 24, 2026

DCO Check Failed

Hi @emerose, your pull request has failed the Developer Certificate of Origin (DCO) check.

This repository supports remediation commits, so you can fix this without rewriting history — but you must follow the required message format.


🛠 Quick Fix: Add a remediation commit

Run this command:

git commit --allow-empty -s -m "DCO Remediation Commit for Sam Quigley <sam@quigley.com>

I, Sam Quigley <sam@quigley.com>, hereby add my Signed-off-by to this commit: 59cdad8bd42507ce33c3479137d05c2d345b051b"
git push

🔧 Advanced: Sign off each commit directly

For the latest commit:

git commit --amend --signoff
git push --force-with-lease

For multiple commits:

git rebase --signoff origin/main
git push --force-with-lease

More info: DCO check report

@dosubot

dosubot bot commented Jan 24, 2026

Related Documentation

Checked 7 published document(s) in 1 knowledge base(s). No updates required.


@mergify

mergify bot commented Jan 24, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
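
For a quick local check of a title against this rule, the pattern can be tried in Python. This is only a sketch: it assumes the ~= operator behaves like a regex search, and the example titles in the usage note are hypothetical rather than taken from this pull request.

```python
import re

# Pattern copied from the merge protection rule above.
TITLE_RE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return TITLE_RE.search(title) is not None
```

A title such as "fix(docx): clamp heading level" passes, while "Fix heading level" fails because the type is capitalized and the colon is missing.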

@codecov

codecov bot commented Jan 26, 2026

Codecov Report

❌ Patch coverage is 62.50000% with 12 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| docling/backend/msword_backend.py | 62.50% | 12 Missing ⚠️ |


emerose and others added 2 commits January 31, 2026 15:26
Documents with custom styles like "Heading 0" cause a validation error
because SectionHeaderItem requires level >= 1. This fix:

1. Add _get_outline_level_from_style() to extract outlineLvl from the
   style definition. OOXML outlineLvl is 0-indexed (0-8 for levels 1-9),
   so we convert to 1-indexed heading levels (outlineLvl + 1).

2. In _get_label_and_level: first try to get the level from outlineLvl
   (the authoritative source), then fall back to parsing from style name.

3. In _get_heading_and_level: clamp extracted level to minimum of 1 for
   defense in depth (handles custom styles like "Heading 0").

4. In _add_heading: additional defense in depth clamping before using
   the level.

Added tests for _get_heading_and_level edge cases and outlineLvl extraction.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Sam Quigley <quigley@emerose.com>

Improves patch coverage for _get_heading_and_level method.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@emerose emerose force-pushed the fix/heading-level-off-by-one branch from 1739b1b to 59cdad8 Compare January 31, 2026 15:27