
Support CJK string annotations; print CJK strings readably in scrapely.tool's output #45

Open
xyb wants to merge 3 commits into master

Conversation

xyb commented Sep 10, 2013

scrapely.tool crashes when a CJK string is used as an annotation:

$ python -m scrapely.tool blog.json
scrapely> ta http://blog.douban.com/douban/2013/07/04/2630/
[0] http://blog.douban.com/douban/2013/07/04/2630/
scrapely> t 0 算法工程师如何改进豆瓣电影 TOP250
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/Users/xyb/scrapely.xyb/scrapely/tool.py", line 189, in <module>
    main()
  File "/Users/xyb/scrapely.xyb/scrapely/tool.py", line 186, in main
    t.cmdloop()
  File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/cmd.py", line 142, in cmdloop
    stop = self.onecmd(line)
  File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/cmd.py", line 221, in onecmd
    return func(arg)
  File "/Users/xyb/scrapely.xyb/scrapely/tool.py", line 48, in do_t
    selection = apply_criteria(criteria, tm)
  File "/Users/xyb/scrapely.xyb/scrapely/tool.py", line 147, in apply_criteria
    sel = tm.select(func)
  File "scrapely/template.py", line 48, in select
    score = score_func(fragment, htmlpage)
  File "scrapely/template.py", line 95, in func
    if text in fdata:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 0: ordinal not in range(128)

I fixed it and also improved the readability of scrapely.tool's output when it contains CJK unicode characters:

$ python -m scrapely.tool blog.json
scrapely> t 0 算法工程师如何改进豆瓣电影 TOP250
[0] u'<h1>算法工程师如何改进豆瓣电影 TOP250</h1>'
[1] u'<title>豆瓣blog  &raquo; Blog Archive   &raquo; 算法工程师如何改进豆瓣电影 TOP250</title>'
[2] u'<link rel="alternate" type="application/rss+xml" title="豆瓣blog &raquo; 算法工程师如何改进豆瓣电影 TOP250 评论 Feed" href="http://blog.douban.com/douban/2013/07/04/2630/feed/" />'
scrapely> 
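
The root cause is Python 2's implicit ASCII decoding when a byte string is compared against a unicode string. A minimal reproduction outside scrapely (illustrative names only, not scrapely code):

    # -*- coding: utf-8 -*-
    # Python 2.7 -- minimal reproduction of the UnicodeDecodeError above.
    fragment = u'<h1>算法工程师如何改进豆瓣电影 TOP250</h1>'   # page data is already unicode
    annotation = '算法工程师'   # text typed at the prompt arrives as UTF-8 bytes

    # Python 2 decodes the byte string with the ASCII codec in order to compare,
    # and the first byte of 算 (0xe7) is not ASCII:
    annotation in fragment   # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 ...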

xyb commented Sep 13, 2013

A doctest would be reasonable. I actually tried adding a doctest for this, but it failed:

    >>> u = u'cjk 中日韩 \\u535a'
    >>> u
    u'cjk \u4e2d\u65e5\u97e9 \\u535a'
    >>> repr(u)
    "u'cjk \\u4e2d\\u65e5\\u97e9 \\\\u535a'"
    >>> print repr(u)
    u'cjk \u4e2d\u65e5\u97e9 \\u535a'
    >>> readable_repr(u)
    u"u'cjk \u4e2d\u65e5\u97e9 \\\\u535a'"
    >>> print readable_repr(u)
    u'cjk 中日韩 \\u535a'

This is a copy of Python shell output and could serve as documentation. But if you run it as a doctest, you get this strange result:

**********************************************************************
File "readable_repr.py", line 12, in __main__.readable_repr
Failed example:
    u
Expected:
    u'cjk \u4e2d\u65e5\u97e9 \u535a'
Got:
    u'cjk \xe4\xb8\xad\xe6\x97\xa5\xe9\x9f\xa9 \u535a'
**********************************************************************
File "readable_repr.py", line 14, in __main__.readable_repr
Failed example:
    repr(u)
Expected:
    "u'cjk \u4e2d\u65e5\u97e9 \\u535a'"
Got:
    "u'cjk \\xe4\\xb8\\xad\\xe6\\x97\\xa5\\xe9\\x9f\\xa9 \\u535a'"
**********************************************************************
File "readable_repr.py", line 16, in __main__.readable_repr
Failed example:
    print repr(u)
Expected:
    u'cjk \u4e2d\u65e5\u97e9 \u535a'
Got:
    u'cjk \xe4\xb8\xad\xe6\x97\xa5\xe9\x9f\xa9 \u535a'
**********************************************************************
File "readable_repr.py", line 18, in __main__.readable_repr
Failed example:
    readable_repr(u)
Expected:
    u"u'cjk \u4e2d\u65e5\u97e9 \\u535a'"
Got:
    u"u'cjk \\xe4\\xb8\\xad\\xe6\\x97\\xa5\\xe9\\x9f\\xa9 \u535a'"
/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/doctest.py:1531: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if got == want:
/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/doctest.py:1551: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if got == want:
**********************************************************************
File "readable_repr.py", line 20, in __main__.readable_repr
Failed example:
    print readable_repr(u)
Expected:
    u'cjk 中日韩 \u535a'
Got:
    u'cjk \xe4\xb8\xad\xe6\x97\xa5\xe9\x9f\xa9 博'
**********************************************************************
1 items had failures:
   5 of   6 in __main__.readable_repr
***Test Failed*** 5 failures.

kmike commented Sep 13, 2013

In Python 2.x, doctests just can't handle non-ASCII text. There are some bugs about that in the Python bug tracker, but as I recall they were all closed because the issue is fixed in Python 3.x. In 2.x it won't work.

        return unichr(int(str(repr_char.group())[2:], base=16))

    repr_string = repr(obj)
    return REPR_UNICODE_CHAR.sub(replace_unicode_char, repr_string)
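
For readers without the full diff, the helper presumably looks roughly like the sketch below. Only the three code lines quoted above are from the PR; the rest, including the regex, is an assumption based on the behaviour shown in the doctest earlier in the thread (literal backslash sequences such as \\u535a are left escaped):

    import re

    # Assumed pattern: match \uXXXX escapes that are not themselves preceded by a
    # backslash, so escaped backslashes (e.g. \\u535a) are left alone.
    REPR_UNICODE_CHAR = re.compile(r'(?<!\\)\\u[0-9a-fA-F]{4}')

    def readable_repr(obj):
        def replace_unicode_char(repr_char):
            # '\u4e2d' -> u'中': parse the four hex digits and build the character
            return unichr(int(str(repr_char.group())[2:], base=16))
        repr_string = repr(obj)
        return REPR_UNICODE_CHAR.sub(replace_unicode_char, repr_string)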
Member

Is it different from repr_bytesting.decode('unicode-escape') ?

Member

what is repr_bytesting?

Member

it is repr_string

Author

@kmike, it's different. decode('unicode-escape') restores the whole string; readable_repr restores only the CJK characters (any character written as a four-hex-digit \uXXXX escape, actually) and leaves '\n', '\t', '\\', etc. untouched.
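
The difference in one example, reusing the readable_repr sketch above (Python 2.7, illustration only):

    s = u'CJK \u4e2d, tab\there'
    r = repr(s)   # -> the ASCII byte string u'CJK \u4e2d, tab\there'

    # unicode-escape undoes *every* escape: \u4e2d becomes 中, but \t also turns
    # into a real tab character, so the result no longer reads as a one-line repr.
    print r.decode('unicode-escape')

    # readable_repr undoes only the \uXXXX escapes; \t, \n and \\ stay escaped,
    # so the output is still a single-line repr, just with readable CJK text.
    print readable_repr(s)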

Member

Hey @xyb,

Thanks for the fix and the explanation. This approach makes sense; it is basically undoing what Python 2.x repr does for unicode strings (given that we don't want to print raw newlines, etc.).

A couple of notes:

  1. Your regex doesn't catch all symbols that can be safely decoded; e.g. ² (\xb2) or £ (\xa3) would be nice to see in the output.
  2. The 'readable_repr' name is a bit confusing, because in Python 2.x repr must return a bytestring while readable_repr returns unicode. What do you think about calling it e.g. 'unicode_repr'?

The best fix for this issue would be to port scrapely to Python 3, which doesn't escape non-ASCII letters and symbols in the repr of unicode strings, but w3lib must be ported first :)
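
For reference, a quick interactive check of the Python 3 behaviour mentioned above (not part of this PR):

    >>> repr('算法 ² £')     # Python 3 keeps non-ASCII text readable in repr
    "'算法 ² £'"
    >>> ascii('算法')        # Python 2-style escaping is still available when needed
    "'\\u7b97\\u6cd5'"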

pablohoffman commented

Maybe just add a unittest if doctests don't handle non-ascii text in Python 2.x?
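
A minimal sketch of what such a test could look like, assuming readable_repr behaves as in the doctest above and is importable from scrapely.tool (the actual tests added to the PR may differ):

    # -*- coding: utf-8 -*-
    import unittest

    from scrapely.tool import readable_repr   # assumed import path

    class ReadableReprTest(unittest.TestCase):
        def test_cjk_unescaped_but_literal_backslash_kept(self):
            u = u'cjk 中日韩 \\u535a'   # ends with a literal backslash + 'u535a'
            # CJK escapes become readable; the literal backslash stays escaped.
            self.assertEqual(readable_repr(u), u"u'cjk 中日韩 \\\\u535a'")

    if __name__ == '__main__':
        unittest.main()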

xyb commented Sep 29, 2013

@pablohoffman, @kmike, sorry for the delayed reply. I have added unit tests for the readable_repr function and for the best_match text-encoding fix (already moved to scrapely.tool).

    func = best_match(criteria.text) if criteria.text else lambda x, y: False
    text = criteria.text
    if text and isinstance(text, str):
        text = text.decode(tm.get_template().encoding or 'utf-8')
Member

I believe this is the wrong place to decode criteria.text, and the encoding it is decoded from is incorrect: it should be decoded using IblTool.stdin.encoding, so it makes sense to decode it in IblTool itself. See #46.
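
A sketch of that alternative, i.e. decoding once inside IblTool using the encoding of the stream the input came from (assumption only; the actual change is in #46):

    import cmd
    import sys

    class IblTool(cmd.Cmd):

        def precmd(self, line):
            # On Python 2 the typed command arrives as bytes; decode it once here
            # with the input stream's encoding, so every do_* handler gets unicode.
            if isinstance(line, str):
                encoding = (getattr(self.stdin, 'encoding', None)
                            or getattr(sys.stdin, 'encoding', None)
                            or 'utf-8')
                line = line.decode(encoding)
            return line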

mattdbr commented Mar 8, 2015

Any updates?

kmike commented Mar 8, 2015

@akkatracker if you use the latest scrapely master on Python 3, it should print all characters correctly. Fixing it for Python 2.x could be ugly.

Unicode input issues are fixed by #46, both for Python 2.x and 3.x.

The issue from the PR description should be fixed in scrapely master if you use Python 3.x. This PR provides some nice unit tests, fixes similar to #56, and an attempt to fix unicode output for Python 2.x (not finished); that's why it is not closed.
