Description
Bug
List items that appear immediately after numbered headings (Heading 2-6) are being completely ignored during DOCX to markdown conversion, leaving sections empty in the output.
This appears to be a regression introduced by PR #2665 (fix for issue #2250). The PR added validation at lines 1150-1154 of msword_backend.py that rejects list items when the parent element is not a ListGroup:
if not isinstance(self.parents[level], ListGroup):
    # Ignore the list item if parent is not a ListGroup
    logger.warning(f"Parent element of the list item is not a ListGroup. The list item will be ignored.")
    return elem_ref

Problem: When lists immediately follow headings, the parent element is SectionHeaderGroup (not ListGroup), causing all list items to be rejected despite being valid content.
Impact: Entire sections become empty in the output, causing data loss. In our test document, a "Glosario" (Glossary) section with 5 valid definition list items is completely missing from the converted markdown.
Steps to reproduce
- Create a DOCX file with this structure:
## 3 Glosario ← Heading 2
- **Term 1**: Definition 1
- **Term 2**: Definition 2
- **Term 3**: Definition 3
- **Term 4**: Definition 4
- **Term 5**: Definition 5
- Convert with Docling using SimplePipeline:
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("test.docx")
markdown = result.document.export_to_markdown()
- Observe warnings in logs:
WARNING - Parent element of the list item is not a ListGroup. The list item will be ignored.
WARNING - Parent element of the list item is not a ListGroup. The list item will be ignored.
WARNING - Parent element of the list item is not a ListGroup. The list item will be ignored.
WARNING - Parent element of the list item is not a ListGroup. The list item will be ignored.
WARNING - Parent element of the list item is not a ListGroup. The list item will be ignored.
- Check output markdown - the entire "Glosario" section is empty:
## 3 Glosario
## 4 Next Section

Expected behavior: All 5 list items should be included in the markdown output under the "Glosario" heading.
Actual behavior: Section is completely empty, 5 list items are lost.
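
A quick way to spot this kind of data loss in converted output is to scan the markdown for headings with nothing beneath them. The helper below is a minimal sketch, not part of Docling; `find_empty_sections` is a hypothetical name, and it assumes ATX-style (`#`) headings and ignores fenced code blocks.

```python
import re

# Matches ATX headings such as "## 3 Glosario" (1 to 6 leading '#').
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def find_empty_sections(markdown: str) -> list[str]:
    """Return titles of headings with no non-blank line before the next heading."""
    empty = []
    current = None        # title of the heading currently being scanned
    has_content = False   # whether any non-blank line followed that heading
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            if current is not None and not has_content:
                empty.append(current)
            current = match.group(2)
            has_content = False
        elif line.strip():
            has_content = True
    # The last heading in the document has no following heading to close it.
    if current is not None and not has_content:
        empty.append(current)
    return empty
```

Running it on the `markdown` string from the reproduction step should report "3 Glosario" when the bug is present. A parent heading followed directly by a subheading is also reported, so results need a manual look.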
Docling version
docling, version 2.72.0
Also tested with version 2.68.0 - same behavior.
Python version
Python 3.10.12
Additional Context
Affected code location: docling/backend/msword_backend.py, lines 1150-1154
Suggested fix: When encountering a list item whose parent is not a ListGroup, instead of immediately rejecting it:
- Check if parent is a SectionHeaderGroup (heading)
- If yes, create a temporary ListGroup context or process the list independently
- Only reject if the parent context is truly invalid (not a heading or list container)
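
The suggested logic can be sketched in isolation as follows. This is not a patch against msword_backend.py: the classes below are simplified stand-ins that only borrow the names `SectionHeaderGroup` and `ListGroup`, and `Node`, `ListItem`, `OtherContainer` and `add_list_item` are hypothetical. The real backend's parent tracking is likely more involved.

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins for illustration only, not Docling's real classes.
@dataclass
class Node:
    children: list = field(default_factory=list)


class SectionHeaderGroup(Node):
    pass


class ListGroup(Node):
    pass


class OtherContainer(Node):
    """Example of a parent that is neither a heading nor a list container."""


@dataclass
class ListItem:
    text: str


def add_list_item(parents: dict, level: int, text: str):
    """Attach a list item, creating a ListGroup under a heading when needed."""
    parent = parents.get(level)
    if isinstance(parent, ListGroup):
        group = parent
    elif isinstance(parent, SectionHeaderGroup):
        # Reuse the list group opened by the previous item, if there is one,
        # so consecutive items end up in the same list.
        last = parent.children[-1] if parent.children else None
        if isinstance(last, ListGroup):
            group = last
        else:
            group = ListGroup()
            parent.children.append(group)
    else:
        # Only reject when the parent is neither a heading nor a list.
        return None
    item = ListItem(text)
    group.children.append(item)
    return item
```

With a heading as the parent, five consecutive calls produce one `ListGroup` holding five items, which is the outcome expected for the "Glosario" section.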
Workaround: We currently extract content directly from word/document.xml to recover the lost list items, but this is fragile and requires maintaining parallel parsing logic.
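
For reference, a workaround along these lines can be sketched with the standard library alone. This is an illustrative reconstruction, not the exact code we run: `extract_list_paragraphs` is a hypothetical name, and it assumes list items carry a direct `w:numPr` element in their paragraph properties. Lists whose numbering comes only from a paragraph style would be missed, and headings numbered via a direct `w:numPr` would be included.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def extract_list_paragraphs(docx_path) -> list[str]:
    """Return the text of every paragraph that has direct numbering properties."""
    # A .docx file is a ZIP archive; the body lives in word/document.xml.
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    items = []
    for para in root.iterfind(".//w:p", NS):
        # Paragraphs without <w:pPr><w:numPr> are treated as non-list content.
        if para.find("w:pPr/w:numPr", NS) is None:
            continue
        # A paragraph's text is split across runs; join all <w:t> fragments.
        text = "".join(t.text or "" for t in para.iterfind(".//w:t", NS))
        if text.strip():
            items.append(text)
    return items
```

The function accepts a path or a file-like object, since `zipfile.ZipFile` handles both. It recovers plain text only, so bold markers and nesting levels still have to be rebuilt separately, which is part of why this approach is fragile.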
Test document: Available upon request (contains proprietary content, can provide sanitized version).
