Detect KaTeX math in HTML input and extract only LaTeX source #9971

Closed
napaalm opened this issue Jul 11, 2024 · 10 comments
Labels
enhancement status:resolved? Feedback requested: please either close the issue or describe why the solution is insufficient.

Comments

@napaalm

napaalm commented Jul 11, 2024

Right now pandoc generates a lot of garbage code when converting an HTML page with KaTeX math to a LaTeX file.

Here's an example converting the page https://katex.org/ to a LaTeX document:

chromium-browser --headless=new --disable-gpu --dump-dom https://katex.org/ > katex.html
pandoc -s katex.html -o katex.tex

In katex.html there's the following formula:

<div class="example tex" data-expr="\displaystyle \frac{1}{\Bigl(\sqrt{\phi \sqrt{5}}-\phi\Bigr) e^{\frac25 \pi}} = 1+\frac{e^{-2\pi}} {1+\frac{e^{-4\pi}} {1+\frac{e^{-6\pi}} {1+\frac{e^{-8\pi}} {1+\cdots} } } }"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><!-- [...] --><annotation encoding="application/x-tex">\displaystyle \frac{1}{\Bigl(\sqrt{\phi \sqrt{5}}-\phi\Bigr) e^{\frac25 \pi}} = 1+\frac{e^{-2\pi}} {1+\frac{e^{-4\pi}} {1+\frac{e^{-6\pi}} {1+\frac{e^{-8\pi}} {1+\cdots} } } }</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><!-- [...] -->

And in katex.tex pandoc renders it as {{\(\frac{1}{\left( \sqrt{\phi\sqrt{5}} - \phi \right)e^{\frac{2}{5}\pi}} = 1 + \frac{e^{- 2\pi}}{1 + \frac{e^{- 4\pi}}{1 + \frac{e^{- 6\pi}}{1 + \frac{e^{- 8\pi}}{1 + \cdots}}}}\)}{{{}{{}{{{{{{}{{{(}}{{{{{{}{{ϕ}{{{{{{}{{5}}}{{}{}}}{\hspace{0pt}}}{{{}}}}}}}{{}{}}}{\hspace{0pt}}}{{{}}}}}{}{−}{}{ϕ}{{)}}{{e}{{{{{{}{{{{}{{{{{{}{{{5}}}}{{}{}}{{}{{{2}}}}}{\hspace{0pt}}}{{{}}}}}{}}{π}}}}}}}}}}}{{}{}}{{}{{1}}}}{\hspace{0pt}}}{{{}}}}}{}}{}{=}{}}{{}{1}{}{+}{}}{{}{{}{{{{{{}{{1}{}{+}{}{{}{{{{{{}{{{1}{+}{{}{{{{{{}{{{1}{+}{{}{{{{{{}{{1}{+}{⋯}}}{{}{}}{{}{{{e}{{{{{{}{{−}{8}{π}}}}}}}}}}}{\hspace{0pt}}}{{{}}}}}{}}}}}{{}{}}{{}{{{{e}{{{{{{}{{−}{6}{π}}}}}}}}}}}}{\hspace{0pt}}}{{{}}}}}{}}}}}{{}{}}{{}{{{{e}{{{{{{}{{{−}{4}{π}}}}}}}}}}}}}{\hspace{0pt}}}{{{}}}}}{}}}}{{}{}}{{}{{{e}{{{{{{}{{{−}{2}{π}}}}}}}}}}}}{\hspace{0pt}}}{{{}}}}}{}}}}}
The same applies to all other KaTeX formulas in the document.

Pandoc should be able to extract only the original LaTeX source and ignore all the other HTML tags, as described in the answer at KaTeX/KaTeX#3729 (comment), quoted here:

Any KaTeX output contains (1) MathML and (2) the original LaTeX, so you can get both.
[...]
The MathML is contained in the span with class "katex-mathml" and the original LaTeX is in the <annotation> node, with encoding "application/x-tex".
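
To make the quoted description concrete, here is a minimal sketch (not pandoc's implementation) that pulls the original TeX out of KaTeX-generated HTML using only the Python standard library. The names `TexAnnotationExtractor` and `extract_tex` are hypothetical, chosen for illustration; the only thing taken from the thread is that the source lives in an `<annotation>` element with `encoding="application/x-tex"`.

```python
from html.parser import HTMLParser


class TexAnnotationExtractor(HTMLParser):
    """Collect the text of every <annotation encoding="application/x-tex">."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.sources = []    # one TeX string per formula found
        self._buffer = None  # text chunks while inside a matching annotation

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "annotation" and dict(attrs).get("encoding") == "application/x-tex":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "annotation" and self._buffer is not None:
            self.sources.append("".join(self._buffer))
            self._buffer = None


def extract_tex(html):
    """Return the TeX source of each KaTeX formula in `html`, in document order."""
    parser = TexAnnotationExtractor()
    parser.feed(html)
    parser.close()
    return parser.sources
```

Running `extract_tex(open("katex.html", encoding="utf-8").read())` on the dumped page should yield one string per formula. Everything inside `katex-html` is ignored simply because it contains no `<annotation>` element.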

@jgm
Owner

jgm commented Aug 30, 2024

Here's what I'm seeing:

% pandoc -f html -t latex
<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>f</mi><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mo>=</mo><msubsup><mo>∫</mo><mrow><mo>−</mo><mi mathvariant="normal">∞</mi></mrow><mi mathvariant="normal">∞</mi></msubsup><mover accent="true"><mi>f</mi><mo>^</mo></mover><mo stretchy="false">(</mo><mi>ξ</mi><mo stretchy="false">)</mo> <msup><mi>e</mi><mrow><mn>2</mn><mi>π</mi><mi>i</mi><mi>ξ</mi><mi>x</mi></mrow></msup> <mi>d</mi><mi>ξ</mi></mrow><annotation encoding="application/x-tex">% \f is defined as #1f(#2) using the macro
\f\relax{x} = \int_{-\infty}^\infty
    \f\hat\xi\,e^{2 \pi i \xi x}
    \,d\xi</annotation></semantics></math>
^D
\[f(x) = \int_{- \infty}^{\infty}\hat{f}(\xi)e^{2\pi i\xi x}d\xi\]

This doesn't look garbled like the output you pasted above...

@jgm
Owner

jgm commented Aug 30, 2024

This is with pandoc 3.3.

@napaalm
Author

napaalm commented Aug 30, 2024

@jgm how did you generate the HTML code in your example? Maybe you have to input the complete page in order to trigger the issue.

By the way, I'm seeing the same garbled output on pandoc 3.3. Here are my test commands:

~ $ chromium-browser --headless=new --disable-gpu --dump-dom https://katex.org/ > katex.html
~ $ pandoc-3.3/bin/pandoc -v
pandoc 3.3
Features: +server +lua
Scripting engine: Lua 5.4
User data directory: /data/data/com.termux/files/home/.local/share/pandoc
Copyright (C) 2006-2024 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.
~ $ pandoc-3.3/bin/pandoc -s katex.html -o katex.tex

And finally here are my input and output files so you can confirm the problem: katex.zip.

@jgm
Owner

jgm commented Aug 30, 2024

Yes, I do get the garbled output with the broader context:

% pandoc -f html -t latex
<div id="demo-output"><span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>f</mi><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mo>=</mo><msubsup><mo>∫</mo><mrow><mo>−</mo><mi mathvariant="normal">∞</mi></mrow><mi mathvariant="normal">∞</mi></msubsup><mover accent="true"><mi>f</mi><mo>^</mo></mover><mo stretchy="false">(</mo><mi>ξ</mi><mo stretchy="false">)</mo> <msup><mi>e</mi><mrow><mn>2</mn><mi>π</mi><mi>i</mi><mi>ξ</mi><mi>x</mi></mrow></msup> <mi>d</mi><mi>ξ</mi></mrow><annotation encoding="application/x-tex">% \f is defined as #1f(#2) using the macro
\f\relax{x} = \int_{-\infty}^\infty
    \f\hat\xi\,e^{2 \pi i \xi x}
    \,d\xi</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height: 1em; vertical-align: -0.25em;"></span><span class="mord mathnormal" style="margin-right: 0.1076em;">f</span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span><span class="mspace" style="margin-right: 0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right: 0.2778em;"></span></span><span class="base"><span class="strut" style="height: 2.3846em; vertical-align: -0.9703em;"></span><span class="mop"><span class="mop op-symbol large-op" style="margin-right: 0.4445em; position: relative; top: -0.0011em;">∫</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height: 1.4143em;"><span class="" style="top: -1.7881em; margin-left: -0.4445em; margin-right: 0.05em;"><span class="pstrut" style="height: 2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">−</span><span class="mord mtight">∞</span></span></span></span><span class="" style="top: -3.8129em; margin-right: 0.05em;"><span class="pstrut" style="height: 2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">∞</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height: 0.9703em;"><span class=""></span></span></span></span></span></span><span class="mspace" style="margin-right: 0.1667em;"></span><span class="mord accent"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height: 0.9579em;"><span class="" style="top: -3em;"><span class="pstrut" style="height: 3em;"></span><span class="mord mathnormal" style="margin-right: 0.1076em;">f</span></span><span class="" style="top: -3.2634em;"><span class="pstrut" style="height: 3em;"></span><span class="accent-body" style="left: 
-0.0833em;"><span class="mord">^</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height: 0.1944em;"><span class=""></span></span></span></span></span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right: 0.046em;">ξ</span><span class="mclose">)</span><span class="mspace" style="margin-right: 0.1667em;"></span><span class="mord"><span class="mord mathnormal">e</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height: 0.8991em;"><span class="" style="top: -3.113em; margin-right: 0.05em;"><span class="pstrut" style="height: 2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">2</span><span class="mord mathnormal mtight">πi</span><span class="mord mathnormal mtight" style="margin-right: 0.046em;">ξ</span><span class="mord mathnormal mtight">x</span></span></span></span></span></span></span></span></span><span class="mspace" style="margin-right: 0.1667em;"></span><span class="mord mathnormal">d</span><span class="mord mathnormal" style="margin-right: 0.046em;">ξ</span></span></span></span></span></div>
^D
\phantomsection\label{demo-output}
{{{\[f(x) = \int_{- \infty}^{\infty}\hat{f}(\xi)e^{2\pi i\xi x}d\xi\]}{{{}{f}{(}{x}{)}{}{=}{}}{{}{{∫}{{{{{{}{{{−}{∞}}}}{{}{{∞}}}}{\hspace{0pt}}}{{{}}}}}}{}{{{{{{}{f}}{{}{{\^{}}}}}{\hspace{0pt}}}{{{}}}}}{(}{ξ}{)}{}{{e}{{{{{{}{{{2}{πi}{ξ}{x}}}}}}}}}{}{d}{ξ}}}}}

The garbled stuff at the end is not actually from the <math> element; it is pandoc's attempt to convert the part that comes after it, namely

<span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height: 1em; vertical-align: -0.25em;"></span><span class="mord mathnormal" style="margin-right: 0.1076em;">f</span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span><span class="mspace" style="margin-right: 0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right: 0.2778em;"></span></span><span class="base"><span class="strut" style="height: 2.3846em; vertical-align: -0.9703em;"></span><span class="mop"><span class="mop op-symbol large-op" style="margin-right: 0.4445em; position: relative; top: -0.0011em;"></span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height: 1.4143em;"><span class="" style="top: -1.7881em; margin-left: -0.4445em; margin-right: 0.05em;"><span class="pstrut" style="height: 2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"></span><span class="mord mtight"></span></span></span></span><span class="" style="top: -3.8129em; margin-right: 0.05em;"><span class="pstrut" style="height: 2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height: 0.9703em;"><span class=""></span></span></span></span></span></span><span class="mspace" style="margin-right: 0.1667em;"></span><span class="mord accent"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height: 0.9579em;"><span class="" style="top: -3em;"><span class="pstrut" style="height: 3em;"></span><span class="mord mathnormal" style="margin-right: 0.1076em;">f</span></span><span class="" style="top: -3.2634em;"><span class="pstrut" style="height: 3em;"></span><span class="accent-body" style="left: -0.0833em;"><span 
class="mord">^</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height: 0.1944em;"><span class=""></span></span></span></span></span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right: 0.046em;">ξ</span><span class="mclose">)</span><span class="mspace" style="margin-right: 0.1667em;"></span><span class="mord"><span class="mord mathnormal">e</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height: 0.8991em;"><span class="" style="top: -3.113em; margin-right: 0.05em;"><span class="pstrut" style="height: 2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">2</span><span class="mord mathnormal mtight">πi</span><span class="mord mathnormal mtight" style="margin-right: 0.046em;">ξ</span><span class="mord mathnormal mtight">x</span></span></span></span></span></span></span></span></span><span class="mspace" style="margin-right: 0.1667em;"></span><span class="mord mathnormal">d</span><span class="mord mathnormal" style="margin-right: 0.046em;">ξ</span></span></span>

I'm not sure what the point of this is on the katex website -- it's marked "hidden" -- but pandoc is doing a decent job of translating it.

@jgm
Owner

jgm commented Aug 30, 2024

Note that you won't have anything like this in a normal page with MathML.

@jgm jgm added the status:resolved? Feedback requested: please either close the issue or describe why the solution is insufficient. label Aug 30, 2024
@jgm
Owner

jgm commented Aug 30, 2024

One possible improvement would be to have the HTML reader ignore anything with aria-hidden="true". I'm not sure if that's a good idea, though.

@napaalm
Copy link
Author

napaalm commented Aug 31, 2024

I'm not sure what the point of this is on the katex website -- it's marked "hidden" -- but pandoc is doing a decent job of translating it.

@jgm It's not just on katex.org: every website that uses KaTeX math rendering is full of these span class="katex-html" tags, as they are automatically generated by the KaTeX library. See also KaTeX/KaTeX#3729 (comment).

One possible improvement would be to have the HTML reader ignore anything with aria-hidden="true".

You could further filter for class="katex-html" so it applies only to this particular situation.

@jgm
Owner

jgm commented Aug 31, 2024

For now I'll do the minimal thing and ignore spans with class katex-html.
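
For anyone on a pandoc build that predates this change, the same idea can be approximated as a preprocessing step. The following is a minimal sketch under stated assumptions, not the code from the commit: it drops every `<span>` whose class list contains `katex-html`, together with its nested spans, and re-emits the rest of the markup unchanged. `KatexHtmlStripper` and `strip_katex_html` are hypothetical names, and the sketch assumes the skipped subtree has balanced `<span>` tags, as KaTeX output does.

```python
from html.parser import HTMLParser


class KatexHtmlStripper(HTMLParser):
    """Re-emit HTML while dropping <span class="katex-html"> subtrees."""

    def __init__(self, skip_class="katex-html"):
        # convert_charrefs=False keeps entities such as &amp; intact on output.
        super().__init__(convert_charrefs=False)
        self.skip_class = skip_class
        self.out = []
        self._skip_depth = 0  # number of open <span>s inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if self._skip_depth:
            if tag == "span":  # only spans matter for finding the matching close
                self._skip_depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and self.skip_class in classes:
            self._skip_depth = 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <path .../> never change the span depth.
        if not self._skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._skip_depth:
            if tag == "span":
                self._skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def _emit(self, text):
        if not self._skip_depth:
            self.out.append(text)

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._emit(f"<!{decl}>")


def strip_katex_html(html, skip_class="katex-html"):
    """Return `html` without the visual KaTeX rendering spans."""
    parser = KatexHtmlStripper(skip_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The cleaned string can then be written to a file and passed to `pandoc -f html -t latex` as in the commands earlier in the thread. Matching on the class rather than on `aria-hidden="true"` keeps the filter narrow, which is the trade-off discussed above.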

@jgm jgm closed this as completed in 49e82f9 Aug 31, 2024
@napaalm
Author

napaalm commented Sep 1, 2024

@jgm One last question: does pandoc currently extract the original TeX source from the <annotation> tag, or does it translate the MathML representation? If the latter, does such a translation produce exactly the same TeX source code, or could it be inaccurate?

@jgm
Owner

jgm commented Sep 1, 2024

It parses the MathML. It is possible that it will produce different TeX. I could check the annotation first and use that if present; that might be a small improvement.

jgm added a commit that referenced this issue Sep 1, 2024
KaTeX (and probably other tools that produce MathML from TeX)
includes an annotation tag with the original TeX; we extract this
if present instead of converting the MathML.

See #9971.