List items after numbered headings are ignored (regression from PR #2665) #2967

@serboor

Bug

List items that appear immediately after numbered headings (Heading 2-6) are being completely ignored during DOCX to markdown conversion, leaving sections empty in the output.

This appears to be a regression introduced by PR #2665 (fix for issue #2250). The PR added validation at lines 1150-1154 of msword_backend.py that rejects list items when the parent element is not a ListGroup:

if not isinstance(self.parents[level], ListGroup):
    # Ignore the list item if parent is not a ListGroup
    logger.warning(f"Parent element of the list item is not a ListGroup. The list item will be ignored.")
    return elem_ref

Problem: When lists immediately follow headings, the parent element is SectionHeaderGroup (not ListGroup), causing all list items to be rejected despite being valid content.

Impact: Entire sections become empty in the output, causing data loss. In our test document, a "Glosario" (Glossary) section with 5 valid definition list items is completely missing from the converted markdown.

Steps to reproduce

  1. Create a DOCX file with this structure:
## 3 Glosario               ← Heading 2
   - **Term 1**: Definition 1
   - **Term 2**: Definition 2
   - **Term 3**: Definition 3
   - **Term 4**: Definition 4
   - **Term 5**: Definition 5
  2. Convert with Docling using SimplePipeline:
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("test.docx")
markdown = result.document.export_to_markdown()
  3. Observe warnings in logs:
WARNING - Parent element of the list item is not a ListGroup. The list item will be ignored.
WARNING - Parent element of the list item is not a ListGroup. The list item will be ignored.
WARNING - Parent element of the list item is not a ListGroup. The list item will be ignored.
WARNING - Parent element of the list item is not a ListGroup. The list item will be ignored.
WARNING - Parent element of the list item is not a ListGroup. The list item will be ignored.
  4. Check output markdown - the entire "Glosario" section is empty:
## 3 Glosario

## 4 Next Section

Expected behavior: All 5 list items should be included in the markdown output under the "Glosario" heading.

Actual behavior: Section is completely empty, 5 list items are lost.
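
The expected/actual comparison can be turned into an automated check. The following is a minimal sketch using only the standard library; `WarningCounter` and `missing_terms` are hypothetical helper names introduced for illustration, and attaching the handler to the root logger assumes the backend's warnings propagate there.

```python
import logging


class WarningCounter(logging.Handler):
    """Counts log records whose message contains a given substring."""

    def __init__(self, needle: str) -> None:
        super().__init__(level=logging.WARNING)
        self.needle = needle
        self.count = 0

    def emit(self, record: logging.LogRecord) -> None:
        if self.needle in record.getMessage():
            self.count += 1


def missing_terms(markdown: str, expected_terms: list[str]) -> list[str]:
    """Return the expected terms that do not appear anywhere in the markdown."""
    return [term for term in expected_terms if term not in markdown]


# Applied to the empty section shown in step 4 of the reproduction:
broken_output = "## 3 Glosario\n\n## 4 Next Section\n"
expected = [f"Term {i}" for i in range(1, 6)]
print(missing_terms(broken_output, expected))
# ['Term 1', 'Term 2', 'Term 3', 'Term 4', 'Term 5']
```

To count the dropped items during a real run, one option is to call `logging.getLogger().addHandler(WarningCounter("not a ListGroup"))` before the conversion and inspect `.count` afterwards.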

Docling version

docling, version 2.72.0

Also tested with version 2.68.0 - same behavior.

Python version

Python 3.10.12

Additional Context

Affected code location: docling/backend/msword_backend.py, lines 1150-1154

Suggested fix: When encountering a list item whose parent is not a ListGroup, instead of immediately rejecting it:

  1. Check if parent is a SectionHeaderGroup (heading)
  2. If yes, create a temporary ListGroup context or process the list independently
  3. Only reject if the parent context is truly invalid (not a heading or list container)
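
The three steps above can be sketched as follows. This is an illustration only: `Node`, `ListGroup`, `SectionHeaderGroup`, and `attach_list_item` here are simplified stand-ins written for this example, not the actual classes or methods in msword_backend.py, and the real fix may need to handle nesting levels and numbering that this sketch ignores.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a document tree node (not docling's class)."""
    label: str
    children: list["Node"] = field(default_factory=list)


class ListGroup(Node):
    """Stand-in for a list container."""


class SectionHeaderGroup(Node):
    """Stand-in for the heading parent described in this report."""


def attach_list_item(parent: Node, text: str) -> Optional[Node]:
    """Attach a list item under `parent`, or return None if it is rejected."""
    if isinstance(parent, ListGroup):
        group = parent
    elif isinstance(parent, SectionHeaderGroup):
        # Steps 1 and 2: the parent is a heading, so open a ListGroup under it.
        # Reuse a trailing ListGroup so consecutive items share one list.
        if parent.children and isinstance(parent.children[-1], ListGroup):
            group = parent.children[-1]
        else:
            group = ListGroup(label="list")
            parent.children.append(group)
    else:
        # Step 3: reject only when the parent is neither a list nor a heading.
        return None
    item = Node(label=text)
    group.children.append(item)
    return item


heading = SectionHeaderGroup(label="3 Glosario")
for i in range(1, 6):
    attach_list_item(heading, f"Term {i}: Definition {i}")
print(len(heading.children[0].children))  # 5
```

Reusing the trailing group keeps the five glossary entries in a single list rather than producing five one-item lists.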

Workaround: We currently extract content directly from word/document.xml to recover the lost list items, but this is fragile and requires maintaining parallel parsing logic.
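
For readers who need a similar stopgap, a minimal sketch of this kind of extraction is shown below. It is not our exact code: `extract_list_items` is a hypothetical name, and the sketch assumes list paragraphs carry a direct `w:numPr` element, so lists whose numbering comes only from a paragraph style would be missed.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def extract_list_items(docx_path) -> list[str]:
    """Return the text of paragraphs that have direct numbering properties."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    items = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Skip paragraphs without <w:pPr><w:numPr>, i.e. non-list paragraphs
        if para.find("w:pPr/w:numPr", NS) is None:
            continue
        # A paragraph's text is split across runs; join all <w:t> pieces
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        items.append(text)
    return items
```

Calling `extract_list_items("test.docx")` returns the raw item text in document order, without the bold formatting or the heading each item belongs to, which is part of why this approach is fragile.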

Test document: Available upon request (contains proprietary content, can provide sanitized version).

Labels

bug (Something isn't working), docx (issue related to docx backend)
