Description
Describe the bug
When using a subdocument generated by pandoc in a DocxTemplate, the final document becomes invalid due to missing XML namespace prefixes on elements like <a:graphic>. The resulting .docx file fails to open in Word or when loaded with python-docx. The issue occurs because the docxtpl rendering process does not correctly preserve or propagate XML namespace declarations from the subdocument into the main document.
To Reproduce
Here is a minimal standalone example to reproduce the issue:
External library required: pip install pypandoc_binary
from docx import Document
from docxtpl import DocxTemplate
import io
import pypandoc
# Create a main template with a placeholder
main_template = Document()
main_template.add_paragraph("{{p sd }}")
main_template_stream = io.BytesIO()
main_template.save(main_template_stream)
# Convert HTML to DOCX using pandoc
html = '<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAACAAAAAgCAYAAABzenr0AAAALklEQVR42u3OMQEAAAgDoJnP/nk0xh5IwOzlUjQCAgICAgICAgICAgICAgLtwANzwElBO5ixlAAAAABJRU5ErkJggg==">'
pypandoc.convert_text(html, 'docx', format='html', outputfile='1.docx')
Document('1.docx') # No error
# Load and use the subdoc in the template
template = DocxTemplate(main_template_stream)
template.render(context={"sd": template.new_subdoc('1.docx')})
final_stream = io.BytesIO()
template.save(final_stream)
# The final document will fail to load due to XML namespace errors
Document(final_stream)  # Raises lxml.etree.XMLSyntaxError
Error Message:
lxml.etree.XMLSyntaxError: Namespace prefix a on graphic is not defined, line 2, column 1414
Expected behavior
When rendering a subdocument created by pandoc, the final .docx file should be valid and load without XML namespace errors. The XML elements in the merged document should retain the necessary xmlns declarations, either from the root or explicitly attached to the elements, as expected by Word and the python-docx library.
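To check the expected behavior without python-docx, a minimal sketch like the one below can parse word/document.xml with the standard library and report whether it is well-formed. The helper name document_xml_error is hypothetical and not part of docxtpl or python-docx. It only tests XML well-formedness, not whether Word accepts the file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import Optional


def document_xml_error(docx_bytes: bytes) -> Optional[str]:
    """Return the parser error for word/document.xml, or None if it parses.

    Hypothetical helper for illustration; it checks well-formedness only.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        xml_bytes = zf.read("word/document.xml")
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        # An undeclared prefix such as "a:" on <a:graphic> ends up here.
        return str(exc)
    return None
```

With the reproduction above, document_xml_error(final_stream.getvalue()) would be expected to return an error string for the broken output and None once the namespaces are declared.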
Screenshots
N/A
Additional context
- Pandoc behavior: The .docx file generated by pandoc includes XML namespaces in the root element (e.g., xmlns:a="http://..."). However, when this file is used as a subdocument in DocxTemplate, the namespaces are not preserved in the final document.
- Workaround: Manually adding xmlns attributes to affected elements in word/document.xml (e.g., <a:graphic xmlns:a="http://...">) resolves the error, but this is a fragile "monkey patch" solution.
# unpack docx
docx_unpacked["word/document.xml"] = docx_unpacked["word/document.xml"].replace(b"<a:graphic>", b'<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">')
docx_unpacked["word/document.xml"] = docx_unpacked["word/document.xml"].replace(b"<pic:pic>", b'<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">')
# repack docx
# now it is valid
- Root cause: The DocxTemplate rendering process likely strips or fails to propagate XML namespace declarations from the subdocument during the merge, leading to invalid XML in the final document.
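The "unpack" and "repack" steps of the workaround above are elided. One possible way to fill them in, using only the standard library, is sketched below. The function name patch_docx_namespaces and the NS_FIXES table are hypothetical. The two replacements are the same ones shown in the workaround, so this stays just as fragile: it only matches tags written exactly as <a:graphic> and <pic:pic>.

```python
import io
import zipfile

# Same byte replacements as in the workaround above.
NS_FIXES = {
    b"<a:graphic>": b'<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">',
    b"<pic:pic>": b'<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">',
}


def patch_docx_namespaces(docx_bytes: bytes) -> bytes:
    """Copy every archive member, patching only word/document.xml."""
    src = zipfile.ZipFile(io.BytesIO(docx_bytes))
    out = io.BytesIO()
    with src, zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == "word/document.xml":
                for old, new in NS_FIXES.items():
                    data = data.replace(old, new)
            # Reuse the original ZipInfo so names and metadata are kept.
            dst.writestr(info, data)
    return out.getvalue()
```

Applied to the reproduction, this would be called as patch_docx_namespaces(final_stream.getvalue()), and the returned bytes can be wrapped in io.BytesIO before being passed to Document.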
Suggested Fix
Ensure that the DocxTemplate renderer properly handles XML namespaces when merging subdocuments. Specifically, when inserting content from pandoc-generated .docx files, the necessary xmlns declarations should either:
- Be preserved in the root element of the final document, or
- Be explicitly attached to the relevant XML elements (e.g., <a:graphic>).
This will align the output with standards expected by Word and python-docx, avoiding syntax errors.
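The first option (declaring the namespaces on the root element) could look roughly like the sketch below. It is not docxtpl's code and does not reflect how its renderer works internally. The name declare_missing_prefixes and the KNOWN_NAMESPACES table are hypothetical, and the regular expressions make simplifying assumptions: only element prefixes are considered (not attribute prefixes), and a prefix declared anywhere in the text is treated as declared everywhere.

```python
import re

# Prefix-to-URI table using the two namespaces named in this report.
KNOWN_NAMESPACES = {
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "pic": "http://schemas.openxmlformats.org/drawingml/2006/picture",
}


def declare_missing_prefixes(xml: str, known=KNOWN_NAMESPACES) -> str:
    """Add xmlns declarations for known but undeclared element prefixes."""
    used = set(re.findall(r"</?([A-Za-z_][\w.-]*):", xml))
    declared = set(re.findall(r"\sxmlns:([A-Za-z_][\w.-]*)=", xml))
    missing = sorted(p for p in used - declared if p in known)
    if not missing:
        return xml
    decls = "".join(f' xmlns:{p}="{known[p]}"' for p in missing)
    # Locate the root start tag's name, skipping <?xml ...?> and <!...>.
    root = re.search(r"<(?!\?|!)[^\s/>]+", xml)
    return xml[:root.end()] + decls + xml[root.end():]
```

A real fix inside the renderer would more likely collect the namespace map from the subdocument's parsed root element than rely on a fixed table, but the effect on the output should be the same: every prefix used in the merged body is declared on an ancestor.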