DOCX table caption to Markdown has extra "Table" word #9002

Closed
rgaiacs opened this issue Aug 14, 2023 · 1 comment

@rgaiacs
Contributor

rgaiacs commented Aug 14, 2023

Given a minimal DOCX working example that contains the following:

[Screenshot of minimal DOCX working example]

pandoc --from docx --to markdown caption.docx produces

-----------------------------------------------------------------------
Foo                                 Bar
----------------------------------- -----------------------------------
1                                   2

-----------------------------------------------------------------------

: Table Caption here.

Note the extra word Table after the colon. This differs from what the documentation for table_captions describes:

A caption is a paragraph beginning with the string Table: (or table: or just :), which will be stripped off.
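
The rule quoted above can be sketched as a small Python function. This is only an illustration of the documented behaviour, not pandoc's own code (pandoc is written in Haskell); the names CAPTION_PREFIX and strip_caption_prefix are hypothetical.

```python
import re

# Prefixes named by the quoted rule: "Table:", "table:", or a bare ":".
CAPTION_PREFIX = re.compile(r"^(?:[Tt]able)?:\s*")


def strip_caption_prefix(paragraph):
    """Return the caption text, or None if the paragraph is not a caption."""
    match = CAPTION_PREFIX.match(paragraph)
    if match is None:
        return None
    return paragraph[match.end():]


print(strip_caption_prefix(": Table Caption here."))  # Table Caption here.
print(strip_caption_prefix("Table: Caption here."))   # Caption here.
```

Under this reading only the leading marker is removed, so a word Table that follows the colon remains part of the caption text.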

The expected output is

-----------------------------------------------------------------------
Foo                                 Bar
----------------------------------- -----------------------------------
1                                   2

-----------------------------------------------------------------------

: Caption here.

The caption XML generated by Pandoc is very different from the caption XML generated by Microsoft Office Word.

Pandoc XML:

<w:p>
    <w:pPr>
      <w:pStyle w:val="TableCaption"/>
    </w:pPr>
    <w:r>
      <w:t xml:space="preserve">Caption here.</w:t>
    </w:r>
</w:p>

Microsoft Office Word XML:

<w:p w14:paraId="5AD17E4F" w14:textId="3DEE5416" w:rsidR="00C10DAE" w:rsidRDefault="00C10DAE" w:rsidP="00C10DAE">
    <w:pPr>
      <w:pStyle w:val="Caption"/>
      <w:keepNext/>
    </w:pPr>
    <w:r>
      <w:t xml:space="preserve">Table </w:t>
    </w:r>
    <w:fldSimple w:instr=" SEQ Table \* ARABIC ">
      <w:r>
        <w:rPr>
          <w:noProof/>
        </w:rPr>
        <w:t>1</w:t>
      </w:r>
    </w:fldSimple>
    <w:r>
      <w:t xml:space="preserve"> </w:t>
    </w:r>
    <w:r w:rsidR="00D62831">
      <w:t>C</w:t>
    </w:r>
    <w:r>
      <w:t>aption here.</w:t>
    </w:r>
</w:p>

Environment:

pandoc 3.1.5
Features: +server +lua
Scripting engine: Lua 5.4
@rgaiacs rgaiacs added the bug label Aug 14, 2023
@jgm
Owner

jgm commented Aug 14, 2023

We should presumably strip off everything before <w:fldSimple w:instr=" SEQ Table ... in the caption.
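
As a rough illustration of that idea, here is a minimal Python sketch that drops everything up to and including the SEQ Table field and keeps the text that follows. It is not the fix that went into pandoc (the reader is written in Haskell and the actual change is not reproduced here); the function name caption_text and the trimmed sample XML are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"

# Trimmed copy of the Word caption paragraph from the report above.
CAPTION_XML = r"""<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:pPr><w:pStyle w:val="Caption"/><w:keepNext/></w:pPr>
  <w:r><w:t xml:space="preserve">Table </w:t></w:r>
  <w:fldSimple w:instr=" SEQ Table \* ARABIC ">
    <w:r><w:t>1</w:t></w:r>
  </w:fldSimple>
  <w:r><w:t xml:space="preserve"> </w:t></w:r>
  <w:r><w:t>C</w:t></w:r>
  <w:r><w:t>aption here.</w:t></w:r>
</w:p>"""


def caption_text(paragraph):
    """Return the caption text that follows the SEQ Table field, if any."""
    children = list(paragraph)
    start = 0  # with no SEQ Table field, keep every run
    for index, child in enumerate(children):
        instr = child.get(W + "instr", "")
        if child.tag == W + "fldSimple" and instr.split()[:2] == ["SEQ", "Table"]:
            start = index + 1  # skip the label run(s) and the field itself
            break
    parts = []
    for child in children[start:]:
        if child.tag == W + "r":
            parts.extend(t.text or "" for t in child.iter(W + "t"))
    # Drop the separator space Word inserts after the number.
    return "".join(parts).lstrip()


print(caption_text(ET.fromstring(CAPTION_XML)))  # Caption here.
```

A paragraph without a SEQ Table field, such as the one Pandoc writes, passes through unchanged apart from the leading-whitespace trim.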

@jgm jgm closed this as completed in 068fce4 Aug 19, 2023