Detect KaTeX math in HTML input and extract only LaTeX source #9971

Closed
@napaalm

Description

Currently, pandoc emits a large amount of spurious markup when converting an HTML page that contains KaTeX math to a LaTeX file.

Here's an example converting the page https://katex.org/ to a LaTeX document:

chromium-browser --headless=new --disable-gpu --dump-dom https://katex.org/ > katex.html
pandoc -s katex.html -o katex.tex

In katex.html there's the following formula:

<div class="example tex" data-expr="\displaystyle \frac{1}{\Bigl(\sqrt{\phi \sqrt{5}}-\phi\Bigr) e^{\frac25 \pi}} = 1+\frac{e^{-2\pi}} {1+\frac{e^{-4\pi}} {1+\frac{e^{-6\pi}} {1+\frac{e^{-8\pi}} {1+\cdots} } } }"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><!-- [...] --><annotation encoding="application/x-tex">\displaystyle \frac{1}{\Bigl(\sqrt{\phi \sqrt{5}}-\phi\Bigr) e^{\frac25 \pi}} = 1+\frac{e^{-2\pi}} {1+\frac{e^{-4\pi}} {1+\frac{e^{-6\pi}} {1+\frac{e^{-8\pi}} {1+\cdots} } } }</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><!-- [...] -->

And in katex.tex pandoc renders it as {{\(\frac{1}{\left( \sqrt{\phi\sqrt{5}} - \phi \right)e^{\frac{2}{5}\pi}} = 1 + \frac{e^{- 2\pi}}{1 + \frac{e^{- 4\pi}}{1 + \frac{e^{- 6\pi}}{1 + \frac{e^{- 8\pi}}{1 + \cdots}}}}\)}{{{}{{}{{{{{{}{{{(}}{{{{{{}{{ϕ}{{{{{{}{{5}}}{{}{}}}{\hspace{0pt}}}{{{}}}}}}}{{}{}}}{\hspace{0pt}}}{{{}}}}}{}{−}{}{ϕ}{{)}}{{e}{{{{{{}{{{{}{{{{{{}{{{5}}}}{{}{}}{{}{{{2}}}}}{\hspace{0pt}}}{{{}}}}}{}}{π}}}}}}}}}}}{{}{}}{{}{{1}}}}{\hspace{0pt}}}{{{}}}}}{}}{}{=}{}}{{}{1}{}{+}{}}{{}{{}{{{{{{}{{1}{}{+}{}{{}{{{{{{}{{{1}{+}{{}{{{{{{}{{{1}{+}{{}{{{{{{}{{1}{+}{⋯}}}{{}{}}{{}{{{e}{{{{{{}{{−}{8}{π}}}}}}}}}}}{\hspace{0pt}}}{{{}}}}}{}}}}}{{}{}}{{}{{{{e}{{{{{{}{{−}{6}{π}}}}}}}}}}}}{\hspace{0pt}}}{{{}}}}}{}}}}}{{}{}}{{}{{{{e}{{{{{{}{{{−}{4}{π}}}}}}}}}}}}}{\hspace{0pt}}}{{{}}}}}{}}}}{{}{}}{{}{{{e}{{{{{{}{{{−}{2}{π}}}}}}}}}}}}{\hspace{0pt}}}{{{}}}}}{}}}}}
The same applies to all other KaTeX formulas in the document.

Pandoc should be able to extract only the original LaTeX source and ignore all the other HTML tags. As the answer in KaTeX/KaTeX#3729 (comment), quoted below, explains:

Any KaTeX output contains (1) MathML and (2) the original LaTeX, so you can get both.
[...]
The MathML is contained in the span with class "katex-mathml" and the original LaTeX is in the <annotation> node, with encoding "application/x-tex".
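
To illustrate the extraction described in the quote, here is a minimal sketch in Python using only the standard library. It is not how pandoc's HTML reader works or should work; the class name `KatexAnnotationExtractor` and the function `extract_katex_sources` are hypothetical names chosen for this example. It assumes the LaTeX source always sits in an `<annotation>` element with `encoding="application/x-tex"`, as the quote states.

```python
from html.parser import HTMLParser


class KatexAnnotationExtractor(HTMLParser):
    """Hypothetical helper: collect the text of every
    <annotation encoding="application/x-tex"> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.sources = []    # extracted LaTeX strings, in document order
        self._buffer = None  # text chunks while inside a matching annotation

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "annotation" and dict(attrs).get("encoding") == "application/x-tex":
            self._buffer = []

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "annotation" and self._buffer is not None:
            self.sources.append("".join(self._buffer).strip())
            self._buffer = None


def extract_katex_sources(html):
    """Return the original LaTeX source of each KaTeX formula in `html`."""
    parser = KatexAnnotationExtractor()
    parser.feed(html)
    parser.close()
    return parser.sources


if __name__ == "__main__":
    # Reads the file produced by the chromium command above.
    with open("katex.html", encoding="utf-8") as f:
        for source in extract_katex_sources(f.read()):
            print(source)
```

Everything inside the "katex-html" span is ignored here simply because it contains no matching annotation; a real reader would also need to drop that span so its contents are not converted a second time.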

Labels: enhancement, status:resolved? (Feedback requested: please either close the issue or describe why the solution is insufficient.)