CJK chars encoding error with HTML output #2301

Closed
lewisleedev opened this issue Dec 13, 2024 · 23 comments
Labels
html The html output format. i18n Internationalisation/localisation-related.

Comments

@lewisleedev

HTML output has some problems with CJK characters, and the simplest solution is to add <meta charset="utf8" /> in the <head> tag. By default hledger does not include this encoding tag, so CJK characters (Korean, in my case) appear broken when rendered unless the snippet is added.

The problem is that this snippet must be placed within the <head> tag, and the proper character encoding declaration should be set by the HTML document itself.

I'm still creating an issue for this problem since it affects the proper display of CJK characters in the generated HTML output but I also think that adding HTML meta tags in hledger's output might not be the right approach as a snippet of a document shouldn't really change the encoding for the whole document.

@simonmichael simonmichael added A-BUG Something wrong, confusing or sub-standard in the software, docs, or user experience. i18n Internationalisation/localisation-related. html The html output format. labels Dec 13, 2024
@simonmichael
Owner

Thanks for reporting, I'll wait for a small example of the problem and workaround that you mentioned, since you understand it better.

@lewisleedev
Author

Sorry for the belated reply.

Example journal file follows:

2024-12-16 미드나잇롤러코스터클럽
    expenses:cafe                             4,500. KRW
    liabilities:samsung card                 -4,500. KRW

2024-12-16 里門蔘鷄湯
    expenses:restaurant                       9,000. KRW
    liabilities:samsung card                 -9,000. KRW

I also tried with hanja (kanji, Chinese characters) input and the same thing happened.

HTML output:
Screenshot_20241216_205124

Rendered output, unmodified:
Screenshot_20241216_205053

Adding <meta charset="utf8" /> to the output HTML file solves the issue.

Code added:
Screenshot_20241216_205131

Rendered output w/ properly rendered CJK char:
Screenshot_20241216_205148
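
The manual fix above can also be scripted. The following is a minimal sketch, assuming the report was saved as a UTF-8 HTML fragment; `wrap_fragment` and `wrap_file` are hypothetical helpers written for illustration and are not part of hledger.

```python
import html


def wrap_fragment(fragment: str, title: str = "hledger report") -> str:
    """Return a complete HTML document that declares UTF-8 and embeds the fragment."""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head>\n"
        '<meta charset="utf-8">\n'  # the declaration that fixes the rendering
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n"
        "<body>\n"
        f"{fragment}\n"
        "</body>\n"
        "</html>\n"
    )


def wrap_file(src: str, dst: str) -> None:
    """Read a UTF-8 fragment from src and write the wrapped document to dst."""
    with open(src, encoding="utf-8") as f:
        fragment = f.read()
    with open(dst, "w", encoding="utf-8") as f:
        f.write(wrap_fragment(fragment))
```

For example, `wrap_file("index.html", "report.html")` would leave hledger's output untouched and write a standalone page next to it (both file names are assumptions). Keeping the declaration in a wrapper means the fragment itself can still be embedded elsewhere unchanged.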

@lewisleedev
Author

Also figured out just now: apparently enabling -t (tree) with HTML output shows this weird indentation character:

Screenshot_20241216_213122-1
Sorry for the censored image, this was my real report with real data

This was also solved by adding the same meta tag.

It was created with the following command on Fedora Linux (English locale, if that matters):

hledger is -MATt -O html > index.html
hledger is -MATt -O html -o index.html

Using -o instead of shell redirection (the second command) did not solve the issue either.

I tried a bunch of browsers, but to no avail.

@simonmichael
Owner

On my system (macos 15.1), all of the above display properly in safari, brave and firefox.

I probably have a system locale that supports UTF-8 decoding.

Can you tell us more about your OS and system locale/language setting?

@simonmichael
Owner

Ah, you said: Fedora GNU/Linux. And what do echo $LANG and locale -a look like in the terminal where you run hledger?

@lewisleedev
Author

lewisleedev commented Dec 20, 2024

echo $LANG result:

en_US.UTF-8

locale -a result:

<SNIP>
en_US
en_US.iso88591
en_US.iso885915
en_US.utf8
en_ZA
<SNIP>

@lewisleedev
Author

To be browser specific, I used Zen browser v1.0.1-a17 (Firefox 132.0) and Brave browser v1.73.97 (Chromium 131.0.6778.108), both on the same Fedora machine.

I also tested this on a Windows 10 machine (set to use both English and Korean; display language is English), again with Zen browser and Brave browser, and the exact same encoding error happens.

@simonmichael
Owner

You seem to have everything set up correctly for command line use at least. All I can think of is that when you start web browsers from the GUI they are not seeing the same system locale. In that case starting the browser from your terminal might make a difference. But the Windows test makes this seem unlikely.

So I can't reproduce at the moment. Ideas, anyone?

@simonmichael
Owner

@thielema, have you noticed this in any of your HTML reports?

@simonmichael
Owner

@lewisleedev, when I trimmed your comment just now I noticed that the $LANG value and the installed locale have a different spelling. I remember that causing problems in my past testing. (Related: https://hledger.org/dev/hledger.html#troubleshooting)

@simonmichael
Owner

(en_US.UTF-8 vs en_US.utf8)

@lewisleedev
Author

@lewisleedev, when I trimmed your comment just now I noticed that the $LANG value and the installed locale have a different spelling. I remember that causing problems in my past testing. (Related: https://hledger.org/dev/hledger.html#troubleshooting)

Wouldn't that (also) affect terminal output? I have no problem with terminal output (using WezTerm, if it matters). Besides, considering this also happens on my Windows machine, I don't think the locale setup is the issue here, at least not the $LANG part.

I also tested with an Android device (en-US, Firefox 133.0.3) and the same thing happens. w3m works fine, so system-wide encoding may not be the issue...

It's strange, really. Perhaps it's only macOS that's working properly?

@Aankhen

Aankhen commented Dec 21, 2024

It works correctly for me in Firefox and Chrome on Windows, too. Nevertheless, I agree with @lewisleedev that hledger should add meta charset: without that, a browser opening the document from the filesystem (where there are no HTTP headers to glean the encoding from) has to guess what the encoding is. The fact that a lot of our browsers arrive at UTF-8 as the answer is fortunate, but it isn’t something hledger should rely on.
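
To illustrate the guessing problem described above: the same bytes read back correctly only when the decoder picks UTF-8. This is a minimal sketch that uses Python's latin-1 codec as a stand-in for a legacy single-byte fallback; which fallback a given browser actually picks is an assumption here and varies by browser and locale.

```python
# The payee from the example journal, encoded the way hledger writes it.
text = "미드나잇롤러코스터클럽"
raw = text.encode("utf-8")

# A decoder that is told (or correctly guesses) UTF-8 recovers the text.
as_utf8 = raw.decode("utf-8")

# A decoder that falls back to a single-byte encoding turns every byte
# into its own character, which is what the "broken" rendering looks like.
as_latin1 = raw.decode("latin-1")

print(as_utf8 == text)      # the round trip is lossless with UTF-8
print(as_latin1 == text)    # the single-byte reading does not match
print(len(text), len(as_latin1))  # character counts differ: one per byte
```

The bytes are never damaged in either case; only the interpretation differs, which is why declaring the encoding (in an HTTP header or a meta tag) is enough to fix the display.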

@simonmichael
Owner

simonmichael commented Dec 22, 2024 via email

@lewisleedev
Author

lewisleedev commented Dec 23, 2024

where there are no HTTP headers to glean the encoding from

I actually figured it out thanks to that note. All this time I was using python -m http.server to essentially host my .html file when you were talking about opening it directly with the browser. I think I was giving it the wrong header.
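
For anyone serving the report the same way, one option is to have the local server label HTML as UTF-8. This is a sketch under the assumption that the header in question was a Content-Type without a charset parameter; `Utf8Handler` and `make_server` are hypothetical names, built on the standard library's `http.server` module.

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class Utf8Handler(SimpleHTTPRequestHandler):
    """Serve files from the current directory, labelling HTML as UTF-8."""

    def guess_type(self, path):
        # The base class picks the MIME type from the file extension;
        # append an explicit charset only for HTML documents.
        ctype = super().guess_type(path)
        if ctype == "text/html":
            return "text/html; charset=utf-8"
        return ctype


def make_server(port: int = 8000) -> ThreadingHTTPServer:
    """Build (but do not start) a local server using Utf8Handler."""
    return ThreadingHTTPServer(("127.0.0.1", port), Utf8Handler)
```

Calling `make_server().serve_forever()` in the directory that holds index.html would then serve it with the charset stated in the header, so the browser does not have to guess.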

@simonmichael
Owner

@lewisleedev great! Does that mean we don't need to do anything in hledger?

@Aankhen

Aankhen commented Dec 28, 2024

I would still suggest adding meta charset, because it is required to remove any ambiguity when the file is being opened directly from disk. It would be fair to say this isn’t a concern since most environments use UTF-8 anyway, but adding it wouldn’t hurt hledger and would deal with the edge cases.

simonmichael added a commit that referenced this issue Dec 28, 2024
For general correctness of reports' HTML output.
@lewisleedev
Author

I would personally have to disagree with @Aankhen and say that a meta tag in a snippet isn't the best idea. It's a snippet. It can either go in a file unaltered or be placed inside another HTML page. Either way, the user should be able to make their own decisions regarding the encoding without having to remove the meta tag.

I think browsers having UTF-8 as their default is definitely something that we can rely on. Besides, if for some reason the user needs an encoding other than UTF-8, the tag will be an issue.

@simonmichael
Owner

simonmichael commented Dec 28, 2024

Well now I'm glad I had all these local IT hassles because I was just about to push the charset UTF-8 meta tag in printHtml.

I see the point that our reports' HTML output is an HTML fragment, not a full HTML document. And "if it ain't broke don't fix it".

@thielema, I'm guessing you'll agree with @lewisleedev here?

(Unrelated: I didn't find obvious users of Hledger.Write.Html.Blaze.printHtml, should it be removed?)

@simonmichael
Owner

simonmichael commented Dec 28, 2024

Example of our HTML output with the meta tag added, just to make this concrete:

$ cat a.html
<meta charset="UTF-8"><link rel="stylesheet" href="hledger.css"><style>
table {border-collapse:collapse}
th, td {padding-left:1em}
th.account, td.account {padding-left:0;}
</style><table><tr><th style="border-bottom:double black">account</th><th style="border-bottom:double black">balance</th></tr><tr><td>ß</td><td align="right" class="amount">10 ß</td></tr><tr><td>проверка</td><td align="right" class="amount">10 проверка</td></tr><tr><td style="border-top:double black"><b>Total:</b></td><td style="border-top:double black" align="right" class="amount coltotal"><b>10 ß, 10 проверка</b></td></tr></table>

@thielema
Contributor

thielema commented Dec 28, 2024 via email

@simonmichael simonmichael removed the A-BUG Something wrong, confusing or sub-standard in the software, docs, or user experience. label Dec 29, 2024
@Aankhen

Aankhen commented Dec 29, 2024

Either way, user should be able to make their own decisions regarding the encoding without having to remove the meta tag.

I understand what you’re saying, but hledger specifies that journal files are in UTF-8 and can only ever produce UTF-8 (modulo bugs or errors). Putting the HTML output in a non–UTF-8 document verbatim doesn’t make sense, which is why I’d say meta charset is both correct and advisable.

@thielema
Contributor

thielema commented Dec 29, 2024 via email
