Description
When converting Markdown to DOCX with Pandoc, East Asian text (including Chinese, Japanese, and Korean) does not receive the XML tags that indicate its language. This causes typographical problems, especially with punctuation such as quotation marks, which are not rendered full-width as expected in East Asian text. For instance, Simplified Chinese quotation marks (“ ” ‘ ’) share the same Unicode code points as their Western counterparts but must be displayed as full-width characters to align properly with Chinese text, as the screenshot below shows.
MS Word marks East Asian text with specific run properties in the XML, as shown below:
<w:r>
  <w:rPr>
    <w:rFonts w:hint="eastAsia"/>
    <w:lang w:eastAsia="zh-CN"/>
  </w:rPr>
  <w:t>这是“中文”</w:t>
</w:r>
English text, by contrast, is generally written as a plain run:
<w:r>
  <w:t>This is an English sentence</w:t>
</w:r>
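The East Asian run shown above can be produced by a small string helper. The following is a minimal sketch in plain Lua, not part of Pandoc: the names xml_escape and east_asian_run are hypothetical, and escaping the text content is an assumption added here so that characters such as & or < do not break the generated XML. The lang argument is assumed to be a trusted language tag and is not escaped.

```lua
-- Hypothetical helpers (not part of Pandoc): build a WordprocessingML run
-- that marks `text` as East Asian, mirroring the XML shown above.

-- Escape the characters that are special in XML text content.
function xml_escape(text)
  local escaped = text:gsub("[&<>]", {
    ["&"] = "&amp;",
    ["<"] = "&lt;",
    [">"] = "&gt;",
  })
  return escaped
end

-- `lang` is a language tag such as "zh-CN", "ja-JP", or "ko-KR".
-- It is inserted as-is, so it must come from a trusted source.
function east_asian_run(text, lang)
  return string.format(
    '<w:r><w:rPr><w:rFonts w:hint="eastAsia"/><w:lang w:eastAsia="%s"/></w:rPr><w:t>%s</w:t></w:r>',
    lang, xml_escape(text))
end

print(east_asian_run("这是“中文”", "zh-CN"))
```

Inside a Lua filter, the returned string could be wrapped in pandoc.RawInline("openxml", ...), which is the same constructor the filter in the next section uses.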
Current Workaround
To address the quotation-mark issue specifically, I have written a Lua filter that converts Chinese corner-bracket quotes (「」 and 『』) to Pandoc's Quoted elements (DoubleQuote and SingleQuote) and then emits the necessary XML tags for those Quoted elements in the DOCX output whenever the quoted text contains Chinese characters.
Here's the Lua filter:
-- Lua Filter to Apply XML Tags to Chinese Quotes in DOCX Output

-- Check if the text contains Chinese characters
function is_chinese(text)
  return text:find("[\228-\233][\128-\191][\128-\191]")
end

-- Parse quotes in the text, handling nested quotes
function parse_quotes(text)
  local elements = {}
  local pos = 1
  while pos <= #text do
    local double_start, double_end, double_quoted = text:find("「(.-)」", pos)
    local single_start, single_end, single_quoted = text:find("『(.-)』", pos)
    if double_start and (not single_start or double_start < single_start) then
      if double_start > pos then
        table.insert(elements, pandoc.Str(text:sub(pos, double_start - 1)))
      end
      table.insert(elements, pandoc.Quoted(pandoc.DoubleQuote, parse_quotes(double_quoted)))
      pos = double_end + 1
    elseif single_start then
      if single_start > pos then
        table.insert(elements, pandoc.Str(text:sub(pos, single_start - 1)))
      end
      table.insert(elements, pandoc.Quoted(pandoc.SingleQuote, parse_quotes(single_quoted)))
      pos = single_end + 1
    else
      table.insert(elements, pandoc.Str(text:sub(pos)))
      break
    end
  end
  return elements
end

-- Apply custom XML tags to quotes, including nested quotes
function apply_custom_tags(element)
  if element.t == "Quoted" then
    local has_chinese = false
    for _, inner_element in ipairs(element.content) do
      if inner_element.t == "Str" and is_chinese(inner_element.text) then
        has_chinese = true
        break
      end
    end
    if has_chinese then
      local quote_type = element.quotetype == pandoc.DoubleQuote and "“" or "‘"
      local closing_quote = element.quotetype == pandoc.DoubleQuote and "”" or "’"
      local result = pandoc.List({
        pandoc.RawInline("openxml",
          string.format(
            '<w:r><w:rPr><w:rFonts w:hint="eastAsia"/><w:lang w:eastAsia="zh-CN"/></w:rPr><w:t>%s</w:t></w:r>',
            quote_type))
      })
      for _, inner_element in ipairs(element.content) do
        local nested_elements = apply_custom_tags(inner_element)
        for _, nested_element in ipairs(nested_elements) do
          result:insert(nested_element)
        end
      end
      result:insert(pandoc.RawInline("openxml",
        string.format(
          '<w:r><w:rPr><w:rFonts w:hint="eastAsia"/><w:lang w:eastAsia="zh-CN"/></w:rPr><w:t>%s</w:t></w:r>',
          closing_quote)))
      return result
    end
  end
  return pandoc.List({ element })
end

-- Process each string to convert quotes and apply custom XML tags
function Str(str)
  local parsed_elements = parse_quotes(str.text)
  local new_elements = pandoc.List({})
  for _, parsed_element in ipairs(parsed_elements) do
    new_elements:insert(parsed_element)
  end
  local result = pandoc.List({})
  for _, element in ipairs(new_elements) do
    local processed_elements = apply_custom_tags(element)
    for _, processed_element in ipairs(processed_elements) do
      result:insert(processed_element)
    end
  end
  return result
end
Proposed Solution
I propose that Pandoc automatically add the appropriate XML tags for East Asian languages when converting documents to DOCX, regardless of the lang option. This could be based on detecting East Asian characters in the text. Additionally, support for bidirectional (Bidi) languages could be included to ensure proper formatting.
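To illustrate what character-based detection might look like, here is a minimal sketch in plain Lua. It is not how Pandoc works today and is not a proposed implementation: the function names and the list of Unicode ranges are assumptions chosen for illustration, and the list is not exhaustive. It relies on the utf8 library, so it assumes Lua 5.3 or later and valid UTF-8 input.

```lua
-- Hypothetical sketch: detect East Asian characters by Unicode code point.
-- Requires the `utf8` standard library (Lua 5.3 or later).

-- Inclusive code point ranges treated as East Asian. This list is an
-- assumption for illustration and is not exhaustive.
local EAST_ASIAN_RANGES = {
  { 0x3000, 0x303F },  -- CJK Symbols and Punctuation
  { 0x3040, 0x309F },  -- Hiragana
  { 0x30A0, 0x30FF },  -- Katakana
  { 0x3400, 0x4DBF },  -- CJK Unified Ideographs Extension A
  { 0x4E00, 0x9FFF },  -- CJK Unified Ideographs
  { 0xAC00, 0xD7AF },  -- Hangul Syllables
  { 0xFF00, 0xFFEF },  -- Halfwidth and Fullwidth Forms
}

-- Return true if the code point falls inside any listed range.
function is_east_asian_codepoint(cp)
  for _, range in ipairs(EAST_ASIAN_RANGES) do
    if cp >= range[1] and cp <= range[2] then
      return true
    end
  end
  return false
end

-- Return true if any character of the UTF-8 string `text` is East Asian.
-- utf8.codes raises an error on invalid UTF-8, so `text` must be valid.
function contains_east_asian(text)
  for _, cp in utf8.codes(text) do
    if is_east_asian_codepoint(cp) then
      return true
    end
  end
  return false
end
```

Because the curly quotation marks share their code points with Western quotes, a check like this cannot classify the quotes themselves. Any real implementation would have to decide based on the surrounding text, much as the filter above inspects the content of each Quoted element.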
Benefits
- Ensures correct typographical display of East Asian texts in DOCX documents.
- Improves the user experience for documents containing a mix of Western and East Asian texts.
- Removes the need for custom Lua filters for basic functionality.
Thank you for considering this feature request. I believe it will significantly enhance Pandoc's functionality and usability for users dealing with multilingual documents.
Related: #7022