Converting captioned figures to DOCX and back produces raw HTML, or doubled captions #10755

Closed
@rrthomas


Using pandoc 3.6.4.

With the following Markdown input as foo.md:

![An image.](media/image.jpg)

I convert to DOCX:

pandoc --to=docx foo.md --output=foo.docx

This works fine. Converting foo.md to HTML and PDF also produces good results.

I then convert the DOCX back to Markdown:

pandoc --to=markdown --output=bar.md --extract-media=. foo.docx

This produces the following Markdown:

<figure>
<img src="./media/rId20.jpg" style="width:5.83333in;height:3.91413in"
alt="An image." />
<figcaption aria-hidden="true"><p>An image.</p></figcaption>
</figure>

This looks OK in principle, but while converting it to HTML produces a good result, converting it to PDF omits the image.

I also tried:

pandoc --to=markdown-raw_html --output=bar.md --extract-media=. foo.docx

This produces:

:::: figure
![An image.](./media/rId20.jpg){width="5.833333333333333in"
height="3.9141338582677165in"}

::: caption
An image.
:::
::::

Here, the PDF output is fine, but the HTML has two copies of the figure caption.

Ideally, it would be possible to produce the same Markdown from the DOCX as the original, but I'd be quite happy with an equivalent that worked as well. I'm hoping to be able to do round-trip conversion, so that I and others can do (disciplined!) edits to the DOCX and then re-convert to Markdown.
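
As a stopgap, the fenced-div output above could in principle be post-processed back into a plain image paragraph, which pandoc's `implicit_figures` extension treats as a captioned figure. The following is only a rough sketch under stated assumptions: the function name `collapse_figures` and the regular expression are hypothetical, not part of pandoc, and they match only the exact output shape shown above (one image plus one caption div), collapsing it only when the caption text equals the alt text.

```python
import re

# Hypothetical pattern for the figure shape shown above: a ":::: figure" div
# holding one image (optionally with a wrapped {...} attribute block),
# a blank line, and a "::: caption" div.
FIGURE_RE = re.compile(
    r"^:{3,} figure\n"
    r"!\[(?P<alt>[^\]]*)\]\((?P<src>[^)\s]+)\)(?:\{(?P<attrs>[^}]*)\})?\n"
    r"\n"
    r"^:{3,} caption\n"
    r"(?P<caption>.*?)\n"
    r"^:{3,}\n"
    r"^:{3,}$",
    re.MULTILINE | re.DOTALL,
)


def collapse_figures(markdown: str) -> str:
    """Rewrite figure divs whose caption duplicates the alt text as plain images."""

    def replace(match: re.Match) -> str:
        alt = match.group("alt")
        # Compare caption and alt text with whitespace normalised.
        caption = " ".join(match.group("caption").split())
        if caption != " ".join(alt.split()):
            return match.group(0)  # distinct caption: leave the div untouched
        attrs = match.group("attrs")
        # Join a wrapped attribute block back onto a single line.
        suffix = "{" + " ".join(attrs.split()) + "}" if attrs else ""
        return f"![{alt}]({match.group('src')}){suffix}"

    return FIGURE_RE.sub(replace, markdown)
```

Run over the contents of bar.md, this would turn the figure div back into a single `![An image.](./media/rId20.jpg){...}` paragraph, which is close to the original foo.md apart from the path and the size attributes. It is a text-level workaround and does not address the underlying round-trip behaviour.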
