Subdocument merge with Pandoc-generated DOCX causes XML namespace errors when opening the resulting file #620

@alexgoryushkin

Describe the bug

When a subdocument generated by pandoc is used in a DocxTemplate, the final document is invalid: prefixed elements such as <a:graphic> end up without a declaration for their namespace prefix. The resulting .docx file cannot be opened in Word or loaded with python-docx. The docxtpl rendering process does not appear to preserve or propagate the XML namespace declarations from the subdocument into the main document.

To Reproduce

Here is a minimal standalone example to reproduce the issue:

External library required: pip install pypandoc_binary

from docx import Document
from docxtpl import DocxTemplate
import io
import pypandoc

# Create a main template with a placeholder
main_template = Document()
main_template.add_paragraph("{{p sd }}")
main_template_stream = io.BytesIO()
main_template.save(main_template_stream)

# Convert HTML to DOCX using pandoc
html = '<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAACAAAAAgCAYAAABzenr0AAAALklEQVR42u3OMQEAAAgDoJnP/nk0xh5IwOzlUjQCAgICAgICAgICAgICAgLtwANzwElBO5ixlAAAAABJRU5ErkJggg==">'
pypandoc.convert_text(html, 'docx', format='html', outputfile='1.docx')
Document('1.docx')  # No error

# Load and use the subdoc in the template
template = DocxTemplate(main_template_stream)
template.render(context={"sd": template.new_subdoc('1.docx')})
final_stream = io.BytesIO()
template.save(final_stream)

# The final document will fail to load due to XML namespace errors
Document(final_stream)  # Raises lxml.etree.XMLSyntaxError

Error Message:

lxml.etree.XMLSyntaxError: Namespace prefix a on graphic is not defined, line 2, column 1414
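
To see which prefixes are affected besides a, a rough check can be run on the word/document.xml part of the output. This is only a sketch: the helper name undeclared_prefixes is made up for illustration, it looks at element tags only (not attributes), and it treats a prefix as declared if an xmlns:prefix attribute appears anywhere in the text, ignoring scope.

```python
import re


def undeclared_prefixes(xml: bytes) -> set[str]:
    """Return prefixes used on element tags that have no xmlns:prefix declaration.

    Coarse heuristic: declarations are searched in the whole text, so scoping
    is ignored, and prefixed attributes are not checked.
    """
    used = set(re.findall(rb"</?([A-Za-z_][\w.-]*):", xml))
    declared = set(re.findall(rb"\bxmlns:([A-Za-z_][\w.-]*)\s*=", xml))
    return {prefix.decode() for prefix in used - declared}
```

For the reproduction above, the input could be read with zipfile.ZipFile(final_stream).read("word/document.xml"); judging by the error message, the result should include at least a.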

Expected behavior

When rendering a subdocument created by pandoc, the final .docx file should be valid and load without XML namespace errors. The XML elements in the merged document should retain the necessary xmlns declarations, either from the root or explicitly attached to the elements, as expected by Word and the python-docx library.

Screenshots

N/A

Additional context

  • Pandoc behavior: The .docx file generated by pandoc declares its XML namespaces on the root element (e.g., xmlns:a="http://..."). However, when this file is used as a subdocument in DocxTemplate, those declarations are not carried over into the final document.
  • Workaround: Manually adding xmlns attributes to affected elements in word/document.xml (e.g., <a:graphic xmlns:a="http://...">) resolves the error, but this is a fragile "monkey patch" solution.
# unpack docx
docx_unpacked["word/document.xml"] = docx_unpacked["word/document.xml"].replace(b"<a:graphic>", b'<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">')
docx_unpacked["word/document.xml"] = docx_unpacked["word/document.xml"].replace(b"<pic:pic>", b'<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">')
# repack docx
# the document is now valid
  • Root cause: The DocxTemplate rendering process likely strips or fails to propagate XML namespace declarations from the subdocument during the merge, leading to invalid XML in the final document.
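
The workaround above leaves out the unpack and repack steps. A minimal sketch of the whole round trip with the standard library's zipfile module could look like the following. The function name patch_docx_namespaces and the NS_FIXES table are hypothetical names for illustration, and the byte replacement is just as fragile as described: it only matches tags written exactly as <a:graphic> and <pic:pic>.

```python
import io
import zipfile

# Replacements taken from the workaround above.
NS_FIXES = {
    b"<a:graphic>": b'<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">',
    b"<pic:pic>": b'<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">',
}


def patch_docx_namespaces(docx_bytes: bytes) -> bytes:
    """Return a copy of the archive with xmlns attributes added in word/document.xml."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            # Only the main document part is patched; all other parts are copied as-is.
            if item.filename == "word/document.xml":
                for old, new in NS_FIXES.items():
                    data = data.replace(old, new)
            dst.writestr(item, data)
    return out.getvalue()
```

With the reproduction script, this would be called as patch_docx_namespaces(final_stream.getvalue()), and the returned bytes wrapped in io.BytesIO before being passed to Document.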

Suggested Fix

Ensure that the DocxTemplate renderer properly handles XML namespaces when merging subdocuments. Specifically, when inserting content from pandoc-generated .docx files, the necessary xmlns declarations should either:

  1. Be preserved in the root element of the final document, or
  2. Be explicitly attached to the relevant XML elements (e.g., <a:graphic>).

This will align the output with standards expected by Word and python-docx, avoiding syntax errors.
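
To illustrate option 2, here is a small self-contained sketch of the failure mode and of one way around it. It is not docxtpl's actual code. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies, and SUBDOC, MAIN and the function names are stand-ins invented for the example.

```python
import re
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"

# Keep the familiar prefixes on output; otherwise ElementTree invents ns0, ns1, ...
ET.register_namespace("w", W)
ET.register_namespace("a", A)

# Stand-in for a subdocument: prefixes are declared only on the root element.
SUBDOC = (
    f'<w:document xmlns:w="{W}" xmlns:a="{A}">'
    "<w:body><w:p><a:graphic/></w:p></w:body>"
    "</w:document>"
)
# Stand-in for a main document that declares w: but not a:.
MAIN = f'<w:document xmlns:w="{W}"><w:body>%s</w:body></w:document>'


def naive_fragment(subdoc_xml: str) -> str:
    """Cut the body content out as text; the root-level xmlns declarations stay behind."""
    return re.search(r"<w:body>(.*)</w:body>", subdoc_xml, re.S).group(1)


def self_contained_fragment(subdoc_xml: str) -> str:
    """Serialize each body child separately so it carries the xmlns declarations it needs."""
    body = ET.fromstring(subdoc_xml).find(f"{{{W}}}body")
    return "".join(ET.tostring(child, encoding="unicode") for child in body)


def is_well_formed(xml: str) -> bool:
    """Report whether the text parses as namespace-aware XML."""
    try:
        ET.fromstring(xml)
        return True
    except ET.ParseError:
        return False
```

Here is_well_formed(MAIN % naive_fragment(SUBDOC)) is False because the a prefix is unbound in the merged text, while is_well_formed(MAIN % self_contained_fragment(SUBDOC)) is True. Option 1 would instead copy the missing xmlns attributes onto the root element of the main document.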
