Issue with Displaying Chinese Characters #7
Comments
I don't think your file is encoded in GB2312. I've tried opening your file with that encoding in a text editor and got this:
Not only is this complete gibberish, this is not even valid SGF (in the last line, it's missing a |
Couldn't find a valid encoding for this, it may have been corrupted somehow before you got it. |
Folks, I am grateful for your prompt reply/help. I was a bit hasty in the previous post. As you will see, the Chinese characters in the first file are scrambled. The second file does display Chinese characters correctly in Sabaki. This is good news. The third file was produced by the following process. First, I used BBEdit to create a new, empty So, the question is what might be a "painless" solution? For example, can Sabaki be made to recognize and properly display GB18030 characters? This would be highly desirable because Your comments and help are again greatly appreciated. Best, |
This probably stems from the fact that we only consider the first 100 bytes for character encoding detection which in this case does not contain enough Chinese characters. When applying @fohristiwhirl I believe you introduced the buffer limit. Can you explain your rationale behind it? |
I forget. I think the point might have been that SGF naturally contains a bunch of UTF-8 looking stuff like B[cc]; W[dd] etc., but the start of the file is more likely to contain names and such. I seem to recall this was more of an issue for other file formats, e.g. NGF. If possible, maybe detect the charset using some aggregated comments and metadata, e.g. the tags C, PW, PB, joined together into a single string? |
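To make the suggestion above concrete, here is a minimal sketch of joining the C, PB, and PW values into one buffer before running a charset detector on it. It is not Sabaki's code: the helper name `collect_text_bytes` and the regular expression are assumptions for illustration, and the pattern does not handle multi-byte encodings whose trailing byte can equal `]` or `\`.

```python
import re

# Hypothetical pattern: values of C[...], PB[...], PW[...] in the raw bytes.
# "\]" is treated as an escaped bracket; the lookbehind skips longer
# property names that merely end in C, PB, or PW.
TEXT_PROP = re.compile(rb"(?<![A-Z])(?:C|PB|PW)\[((?:\\.|[^\]\\])*)\]", re.DOTALL)


def collect_text_bytes(sgf: bytes) -> bytes:
    """Join the raw values of the text-bearing properties with spaces."""
    return b" ".join(m.group(1) for m in TEXT_PROP.finditer(sgf))


sample = "(;GM[1]PB[张三]PW[李四]C[你好];B[cc];W[dd])".encode("gb2312")
joined = collect_text_bytes(sample)
print(joined.decode("gb2312"))
```

The joined buffer contains only the fields where the encoding matters, so move coordinates such as B[cc] no longer take part in the guess.
|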
@yishn I have checked a few other files, and your assessment seems valid. |
Hello, Just downloaded and installed the new version, and this problem has not been resolved. I have attached two files, one is original, which won't display properly for either 4.4.3 or 5.0, Thanks, |
Weird, the file with the added GB2312 declaration loads fine for me. |
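For readers following along, this is roughly what honoring a CA declaration could look like. It is a minimal sketch, not what Sabaki does internally: `decode_sgf`, `CA_PROP`, and the `SUPERSETS` table are hypothetical names, and mapping GB2312 and GBK to GB18030 is an assumption based on GB18030 being a superset of both.

```python
import re

# Hypothetical pattern for the SGF charset property, e.g. CA[GB2312].
CA_PROP = re.compile(rb"(?<![A-Z])CA\[([^\]]*)\]")

# Assumption: decode declared GB2312/GBK files with the GB18030 superset.
SUPERSETS = {"gb2312": "gb18030", "gbk": "gb18030"}


def decode_sgf(data: bytes, fallback: str = "utf-8") -> str:
    """Decode SGF bytes using the declared CA charset, else the fallback."""
    match = CA_PROP.search(data)
    declared = match.group(1).decode("ascii", "ignore").strip().lower() if match else ""
    encoding = SUPERSETS.get(declared, declared) or fallback
    try:
        return data.decode(encoding)
    except (LookupError, ValueError):
        # Unknown charset name or undecodable bytes: fall back leniently.
        return data.decode(fallback, errors="replace")


raw = "(;GM[1]CA[GB2312]PB[张三])".encode("gb2312")
print(decode_sgf(raw))
```

A file without a CA property, or with an unrecognized charset name, falls through to the fallback encoding.
|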
@yishn I have tested several other files with added CA[GB2312], and they all do not Also, how do I test your new commit with an increased buffer size? Do I need to compile |
@yishn I have compiled Sabaki myself, and the issue persists. git clone https://github.com/SabakiHQ/Sabaki The compilation seemed to have worked fine. The executable and a screen dump are here: https://www.dropbox.com/s/uqtyg9p0uwmos4i/Sabaki%20Compile.zip?dl=0 Thanks for your help, |
After investigation, it seems we're accidentally excluding the decoding library from our bundle. This should be fixed on Sabaki master now. Can you pull, rebuild, and see if the problem is now fixed? |
Just compiled again, and it now works fine with and WITHOUT the GB2312
declaration! Great detective work and many thanks.
|
@yishn Sorry to bother you again, but the new version still seems to have issues. So, there appears to be a discrepancy between the two Sabaki versions. The attached file Best, |
Hmm... it seems like detecting encoding on spliced test buffers didn't really work. Now we're just falling back to detecting encoding on the first 1000 bytes of the buffer. |
Just compiled after the new commit. The file Original.sgf still does not display properly.
|
In trying to figure out what might have gone wrong, I have inspected lots of files, using the Shun |
Let me add that both files in Samples.zip load properly in v0.43.3. |
In v0.43.3, we guess the encoding based on the first 100 bytes of the file. After extending the guess to the first 1000 bytes, it no longer detects GB2312 because, as @fohristiwhirl pointed out, "SGF naturally contains a bunch of UTF-8 looking stuff like B[cc]; W[dd]". If we restrict ourselves to the first 100 bytes again, your original file would have issues, because its first 100 bytes don't contain any Chinese. For the short term, we can probably just pick something between 100 and 1000 bytes and guess the encoding based on that. For the long term, we should let the user pick their own encoding. |
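The trade-off described above can be illustrated with a small sketch that measures how much of a sample window is non-ASCII. This is not how the actual detector scores encodings; `high_byte_ratio` is a hypothetical helper and the synthetic file below is made up purely to show the dilution effect.

```python
def high_byte_ratio(data: bytes, limit: int) -> float:
    """Fraction of bytes >= 0x80 within the first `limit` bytes."""
    sample = data[:limit]
    if not sample:
        return 0.0
    return sum(b >= 0x80 for b in sample) / len(sample)


# Synthetic file: a long ASCII header, one short Chinese comment, many moves.
header = b"(;GM[1]FF[4]SZ[19]" + b"AP[x]" * 30
comment = "C[你好]".encode("gb2312")
moves = b";B[cc];W[dd]" * 200
data = header + comment + moves

for limit in (100, 200, 1000):
    print(limit, high_byte_ratio(data, limit))
```

A 100-byte window sees no Chinese bytes at all, while a 1000-byte window sees them heavily outnumbered by ASCII move coordinates, which is the tension between the two buffer sizes.
|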
Does it make sense to let the detection scheme focus on the C[], PB[], and PW[] fields? These are areas where different encoding might make a difference (especially C[]). |
Yes, that was what we were doing before, using spliced test buffers. But that doesn't work as evidenced by your previous samples. The detected encodings on the spliced test buffers were completely wrong. |
@yishn I have tested some more files, and I am attaching four of them. These are all original BTW, I compiled the latest version, but noticed that the new option on user encoding selection Best, |
This has nothing to do with encoding, so please open a new issue on Sabaki's repository about the hanging. FYI, the new option for user encoding selection is not implemented yet; it's an open issue, so please subscribe to it for updates. |
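As a rough idea of what a user-selectable encoding could look like, here is a minimal sketch in which an explicit choice always wins and automatic handling is only a fallback. It is not the planned Sabaki feature: `decode_with_override` and the candidate list are assumptions, and trying strict decodes in order is a simple technique swapped in for statistical detection.

```python
from typing import Optional, Sequence, Tuple


def decode_with_override(
    data: bytes,
    user_encoding: Optional[str] = None,
    candidates: Sequence[str] = ("utf-8", "gb18030"),
) -> Tuple[str, str]:
    """Return (text, encoding used). A user-chosen encoding takes priority."""
    if user_encoding is not None:
        return data.decode(user_encoding, errors="replace"), user_encoding
    for encoding in candidates:
        try:
            # Strict decode: the first candidate that accepts every byte wins.
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this never fails.
    return data.decode("latin-1"), "latin-1"


print(decode_with_override("你好".encode("gb2312")))
print(decode_with_override("你好".encode("gb2312"), user_encoding="gb2312"))
```

Putting UTF-8 first matters here: most GB-encoded Chinese text is not valid UTF-8, so the strict UTF-8 attempt usually fails and the next candidate is tried.
|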
Thanks, just posted there. |
Hello,
I am using an iMac, and Sabaki seems to have difficulty displaying Chinese characters properly.
This often occurs with SGF files that I have downloaded from the internet. Following comments from
another thread, I have tried adding CA[GB2312] to the file, but it did not work.
A sample file is given below. Can someone point me to a solution?
Many thanks in advance,
Shun
The actual file is attached below, with added .txt file extension:
__Vs__9.sgf.txt