Detect KaTeX math in HTML input and extract only LaTeX source #9971
Comments
Here's what I'm seeing:
This doesn't look garbled like the output you pasted above...
This is with pandoc 3.3.
@jgm how did you generate the HTML code in your example? Maybe you have to input the complete page in order to trigger the issue. By the way, I'm seeing the same garbled output on pandoc 3.3. Here are my test commands:
~ $ chromium-browser --headless=new --disable-gpu --dump-dom https://katex.org/ > katex.html
~ $ pandoc-3.3/bin/pandoc -v
pandoc 3.3
Features: +server +lua
Scripting engine: Lua 5.4
User data directory: /data/data/com.termux/files/home/.local/share/pandoc
Copyright (C) 2006-2024 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.
~ $ pandoc-3.3/bin/pandoc -s katex.html -o katex.tex
And finally here are my input and output files so you can confirm the problem: katex.zip.
Yes, I do get the garbled output with the broader context
The garbled stuff at the end is not actually from the <span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height: 1em; vertical-align: -0.25em;"></span><span class="mord mathnormal" style="margin-right: 0.1076em;">f</span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span><span class="mspace" style="margin-right: 0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right: 0.2778em;"></span></span><span class="base"><span class="strut" style="height: 2.3846em; vertical-align: -0.9703em;"></span><span class="mop"><span class="mop op-symbol large-op" style="margin-right: 0.4445em; position: relative; top: -0.0011em;">∫</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height: 1.4143em;"><span class="" style="top: -1.7881em; margin-left: -0.4445em; margin-right: 0.05em;"><span class="pstrut" style="height: 2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">−</span><span class="mord mtight">∞</span></span></span></span><span class="" style="top: -3.8129em; margin-right: 0.05em;"><span class="pstrut" style="height: 2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">∞</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height: 0.9703em;"><span class=""></span></span></span></span></span></span><span class="mspace" style="margin-right: 0.1667em;"></span><span class="mord accent"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height: 0.9579em;"><span class="" style="top: -3em;"><span class="pstrut" style="height: 3em;"></span><span class="mord mathnormal" style="margin-right: 0.1076em;">f</span></span><span class="" style="top: -3.2634em;"><span class="pstrut" style="height: 3em;"></span><span class="accent-body" 
style="left: -0.0833em;"><span class="mord">^</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height: 0.1944em;"><span class=""></span></span></span></span></span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right: 0.046em;">ξ</span><span class="mclose">)</span><span class="mspace" style="margin-right: 0.1667em;"></span><span class="mord"><span class="mord mathnormal">e</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height: 0.8991em;"><span class="" style="top: -3.113em; margin-right: 0.05em;"><span class="pstrut" style="height: 2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">2</span><span class="mord mathnormal mtight">πi</span><span class="mord mathnormal mtight" style="margin-right: 0.046em;">ξ</span><span class="mord mathnormal mtight">x</span></span></span></span></span></span></span></span></span><span class="mspace" style="margin-right: 0.1667em;"></span><span class="mord mathnormal">d</span><span class="mord mathnormal" style="margin-right: 0.046em;">ξ</span></span></span> I'm not sure what the point of this is on the katex website -- it's marked "hidden" -- but pandoc is doing a decent job of translating it. |
Note that you won't have anything like this in a normal page with mathml.
One possible improvement would be to have the HTML reader ignore anything with
@jgm It's not just on katex.org: every website that uses KaTeX math rendering is full of these
You could further filter for
For now I'll do the minimal thing and ignore spans with class
@jgm One last question: does pandoc currently extract the original TeX source from the
It parses the mathml. It is possible that it will produce different TeX. I could check the annotation first and use that if present; that might be a small improvement.
KaTeX (and probably other tools that produce MathML from TeX) includes an annotation tag with the original TeX; we extract this if present instead of converting the MathML. See #9971.
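To illustrate the idea in that commit message outside of pandoc, here is a minimal Python sketch. It is not pandoc's implementation (pandoc is written in Haskell). It assumes KaTeX's usual markup, where the MathML contains an `<annotation encoding="application/x-tex">` element holding the original TeX. The names `TexAnnotationExtractor`, `extract_tex` and `sample` are made up for this example, and the sample HTML is a hand-shortened stand-in, not real katex.org output.

```python
from html.parser import HTMLParser


class TexAnnotationExtractor(HTMLParser):
    """Collect the text of <annotation encoding="application/x-tex"> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.sources = []     # TeX strings found so far
        self._buffer = None   # text chunks of the annotation being read, if any

    def handle_starttag(self, tag, attrs):
        # Assumption: the original TeX is stored under this encoding value.
        if tag == "annotation" and dict(attrs).get("encoding") == "application/x-tex":
            self._buffer = []

    def handle_data(self, data):
        # Only keep text while inside a TeX annotation; everything else is ignored.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "annotation" and self._buffer is not None:
            self.sources.append("".join(self._buffer).strip())
            self._buffer = None


def extract_tex(html):
    """Return the TeX source of every annotated formula in an HTML string."""
    parser = TexAnnotationExtractor()
    parser.feed(html)
    parser.close()
    return parser.sources


# Hand-written, heavily shortened stand-in for one KaTeX-rendered formula.
sample = (
    '<span class="katex"><span class="katex-mathml"><math><semantics>'
    '<mrow><mi>f</mi><mo>(</mo><mi>x</mi><mo>)</mo></mrow>'
    '<annotation encoding="application/x-tex">f(x) = x^2</annotation>'
    '</semantics></math></span>'
    '<span class="katex-html" aria-hidden="true"><span class="mord">f</span></span>'
    '</span>'
)

print(extract_tex(sample))  # prints ['f(x) = x^2']
```

Because only the annotation text is collected, the presentational spans contribute nothing to the result, which is the behaviour the issue asks for.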
Right now pandoc generates a lot of garbage code when converting an HTML page with KaTeX math to a LaTeX file.
Here's an example converting the page https://katex.org/ to a LaTeX document:
chromium-browser --headless=new --disable-gpu --dump-dom https://katex.org/ > katex.html
pandoc -s katex.html -o katex.tex
In katex.html there's the following formula:
And in katex.tex
pandoc renders it as{{\(\frac{1}{\left( \sqrt{\phi\sqrt{5}} - \phi \right)e^{\frac{2}{5}\pi}} = 1 + \frac{e^{- 2\pi}}{1 + \frac{e^{- 4\pi}}{1 + \frac{e^{- 6\pi}}{1 + \frac{e^{- 8\pi}}{1 + \cdots}}}}\)}{{{}{{}{{{{{{}{{{(}}{{{{{{}{{ϕ}{{{{{{}{{5}}}{{}{}}}{\hspace{0pt}}}{{{}}}}}}}{{}{}}}{\hspace{0pt}}}{{{}}}}}{}{−}{}{ϕ}{{)}}{{e}{{{{{{}{{{{}{{{{{{}{{{5}}}}{{}{}}{{}{{{2}}}}}{\hspace{0pt}}}{{{}}}}}{}}{π}}}}}}}}}}}{{}{}}{{}{{1}}}}{\hspace{0pt}}}{{{}}}}}{}}{}{=}{}}{{}{1}{}{+}{}}{{}{{}{{{{{{}{{1}{}{+}{}{{}{{{{{{}{{{1}{+}{{}{{{{{{}{{{1}{+}{{}{{{{{{}{{1}{+}{⋯}}}{{}{}}{{}{{{e}{{{{{{}{{−}{8}{π}}}}}}}}}}}{\hspace{0pt}}}{{{}}}}}{}}}}}{{}{}}{{}{{{{e}{{{{{{}{{−}{6}{π}}}}}}}}}}}}{\hspace{0pt}}}{{{}}}}}{}}}}}{{}{}}{{}{{{{e}{{{{{{}{{{−}{4}{π}}}}}}}}}}}}}{\hspace{0pt}}}{{{}}}}}{}}}}{{}{}}{{}{{{e}{{{{{{}{{{−}{2}{π}}}}}}}}}}}}{\hspace{0pt}}}{{{}}}}}{}}}}}
The same applies for all other KaTeX formulas in the document.
Pandoc should be able to extract only the original LaTeX source and ignore all the other HTML tags, as reported in the answer KaTeX/KaTeX#3729 (comment) below: