-
Notifications
You must be signed in to change notification settings - Fork 272
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support CJK string annotation; print readably CJK string in scrapely.tool's output #45
base: master
Are you sure you want to change the base?
Conversation
A doctest is reasonable. Actually I had tried adding a doctest on this but failed:
It's a copy of python shell output, can be used as document. But if your run it as doctest, you will get this strange result:
|
In Python 2.x doctests just can't handle non-ascii text. There are some bugs about that in Python bug tracker, but as I recall they are all closed because the issue is fixed for Python 3.x. In 2.x it won't work. |
return unichr(int(str(repr_char.group())[2:], base=16)) | ||
|
||
repr_string = repr(obj) | ||
return REPR_UNICODE_CHAR.sub(replace_unicode_char, repr_string) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it different from repr_bytesting.decode('unicode-escape')
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is repr_bytesting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is repr_string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kmike, it's different. The decode('unicode-escape')
restore whole string; readable_repr
restore CJK characters(all four bytes characters actually) only, not include '\n'
, '\t'
, '\\'
, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @xyb,
Thanks for the fix and an explanation. This approach makes sense; it is basically undoing what Python 2.x repr is doing for unicode strings (if we don't want to print new lines, etc.)
A couple of notes:
- Your regex doesn't catch all symbols that can be safely decoded, e.g. ² (
\xb2
) or £ (\xa3
) could be nice to see in the output; - 'readable_repr' name is a bit confusing because in Python 2.x repr must be a bytestring, and readable_repr returns unicode. What do you think about calling it e.g. 'unicode_repr'?
The best fix for this issue would be to port scrapely to Python 3 - it doesn't escape non-ascii letters and symbols in repr of unicode strings, but w3lib must be ported before that :)
Maybe just add a unittest if doctests don't handle non-ascii text in Python 2.x? |
@pablohoffman, @kmike, Sorry for the delay replying, I have added unittests for the |
func = best_match(criteria.text) if criteria.text else lambda x, y: False | ||
text = criteria.text | ||
if text and isinstance(text, str): | ||
text = text.decode(tm.get_template().encoding or 'utf-8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is a wrong place to decode criteria.text, and the encoding it is decoded from is incorrect - it should be decoded from IblTool.stdin.encoding, and so it makes sense to decode it in IblTool itself. See #46.
Any updates? |
@akkatracker if you use latest scrapely master in Python 3 it should print all characters correctly. Fixing it for Python 2.x could be ugly. Unicode input issues are fixed by #46, both for Python 2.x and 3.x. The issue from the PR description should be fixed in scrapely master if you use Python 3.x. This PR provides some nice unit tests, fixes similar to #56 and an attempt to fix unicode output for Python 2.x (not finished), that's why it is not closed. |
scrapely.tool will crash when using CJK string as annotation in scrapely.tool:
I fixed it, and add improved the usability of scrapely.tool's output that including CJK unicode characters: