EncodingDetect #1

Closed
wizos opened this issue Jul 6, 2018 · 5 comments


wizos commented Jul 6, 2018

This is great! But could you improve the automatic detection of a web page's charset? If the encoding is GB2312 or GBK, it causes errors.

Owner

dankito commented Jul 11, 2018

Do you have any suggestions on how to detect the correct encoding? Pull requests are welcome.

As a proposal, I could check the charset declared in the HTML header and use that.
But give me some time, I moved recently and in my new apartment there's still a lot to do.
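
For illustration, here is a minimal sketch of what "check the HTML header" could look like, assuming it means reading the `<meta>` charset declaration from the raw page bytes before decoding them. `MetaCharsetSniffer`, `sniff` and `decode` are hypothetical names and not part of Readability4J; the 4096-byte window and the UTF-8 fallback are assumptions as well.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Readability4J. */
final class MetaCharsetSniffer {

    // Matches both <meta charset="GBK"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?\\s*([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared near the start of the page, if any. */
    static Optional<String> sniff(byte[] rawHtml) {
        // ISO-8859-1 maps every byte to exactly one char, so the ASCII markup
        // stays readable even when the real encoding is GBK or GB2312.
        int length = Math.min(rawHtml.length, 4096); // assumed scan window
        String head = new String(rawHtml, 0, length, StandardCharsets.ISO_8859_1);
        Matcher matcher = META_CHARSET.matcher(head);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    /** Decodes the page with the declared charset, falling back to UTF-8. */
    static String decode(byte[] rawHtml) {
        Charset charset = sniff(rawHtml)
                .filter(MetaCharsetSniffer::isSupported)
                .map(Charset::forName)
                .orElse(StandardCharsets.UTF_8);
        return new String(rawHtml, charset);
    }

    private static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown when the declared name is not a legal charset name.
            return false;
        }
    }
}
```

The decoded string could then be handed to the extractor instead of a string that was decoded with the wrong charset. A real implementation would probably also want to look at the HTTP `Content-Type` header and a byte order mark, which this sketch ignores.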

Owner

dankito commented Jul 17, 2018

Could you provide some test data with the source HTML, the actual output and the expected output?

I checked some sites like http://www.sina.com.cn and http://www.huanqiu.com, and both declare their charset as utf-8.

For example, for http://news.sina.com.cn/gov/xlxw/2018-07-17/doc-ihfkffam3728018.shtml Readability4J generates this output: https://dankito.net/test/sina-output.html.

Maybe you only have to wrap the output in

<html>
 <head>
  <meta charset="GBK" /> 
 </head>
 <body>
 <!-- output here -->
 </body>
</html>

Does it then work as you expect?

Author

wizos commented Jul 18, 2018

The website I subscribe to is: http://www.shgjj.com/html/zyxw/index.html.
Its output charset is .

Owner

dankito commented Jul 22, 2018

Sorry for keeping you waiting so long!

I just tried it with this URL
http://www.shgjj.com/html/zyxw/101770.html
and it produced this output
https://dankito.net/test/shgjj-output.html.

I just wrapped the Readability4J output in

<html>
 <head>
  <meta charset="utf-8" /> 
 </head>
 <body>
 <!-- output here -->
 </body>
</html>

(not charset="GBK" as suggested in my last post) so that a browser shows the characters correctly.
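
A minimal sketch of that wrapping step, assuming the extracted article is available as a Java `String` and is written out with the same charset the `<meta>` tag declares. `HtmlWrapper`, `wrap` and `toBytes` are hypothetical names, not part of Readability4J.

```java
import java.nio.charset.Charset;

/** Hypothetical helper for illustration; not part of Readability4J. */
final class HtmlWrapper {

    /** Wraps an extracted article fragment in a minimal HTML document. */
    static String wrap(String articleHtml, Charset charset) {
        return "<html>\n"
                + " <head>\n"
                + "  <meta charset=\"" + charset.name() + "\" />\n"
                + " </head>\n"
                + " <body>\n"
                + articleHtml + "\n"
                + " </body>\n"
                + "</html>\n";
    }

    /** Encodes the wrapped document with the same charset it declares. */
    static byte[] toBytes(String articleHtml, Charset charset) {
        return wrap(articleHtml, charset).getBytes(charset);
    }
}
```

The point of passing one `Charset` to both steps is that the declaration and the actual bytes cannot drift apart: a page that declares `GBK` but is saved as UTF-8 (or the other way round) is what makes a browser show garbled characters.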

As I don't understand Chinese that well: would you say the output is OK?

Author

wizos commented Jul 23, 2018

Thank you, this output looks correct!
