[MRG] scrapely.tool: add support for non-ascii <text> and <data> arguments #46

Merged (2 commits) on Oct 10, 2013

@kmike kmike commented Oct 2, 2013

I think that there are two separate issues in #45:

  1. non-ascii input in scrapely.tool leads to UnicodeDecodeError;
  2. non-ascii data is not readable when printed.

This PR addresses (1).

<text> and <data> arguments are parsed by the parse_criteria function
(it uses shlex and optparse for parsing). The data passed to parse_criteria
is extracted from the "line" argument of the do_<…> methods.
cmd.Cmd reads this "line" argument from self.stdin and
passes it to the do_ methods. In Python 2.x sys.stdin (the
default for cmd.Cmd.stdin) is binary, so "line" is a bytestring;
its encoding is self.stdin.encoding. That's why <text> and <data>
argument values were previously bytestrings; when passed to
other scrapely functions they eventually got implicitly decoded
using sys.getdefaultencoding(), which usually leads to
UnicodeDecodeError if the input text is non-ascii.
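
The failure mode described above can be reproduced in isolation. This is only a sketch, not scrapely code: `implicit_decode` is a hypothetical helper that emulates Python 2's implicit bytestring-to-unicode coercion, and it assumes the default encoding is 'ascii' (the usual value of sys.getdefaultencoding() on Python 2) and that the terminal sends UTF-8.

```python
# Bytes as they might arrive from a UTF-8 terminal (assumption: UTF-8 input).
raw = "café".encode("utf-8")


def implicit_decode(bytestring, default_encoding="ascii"):
    """Hypothetical stand-in for Python 2's implicit coercion.

    Python 2 decoded bytestrings with sys.getdefaultencoding() whenever
    they were mixed with unicode objects; 'ascii' is assumed here.
    """
    return bytestring.decode(default_encoding)


try:
    implicit_decode(raw)
    failed = False
except UnicodeDecodeError:
    # The non-ascii bytes cannot be decoded as ascii.
    failed = True

# Decoding explicitly with the encoding the bytes were written in works.
explicit = raw.decode("utf-8")
```

The point of the sketch is that the bytes themselves are fine; the error comes only from decoding them with the wrong (default) encoding.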

The fix is to decode these arguments using self.stdin.encoding
before passing them to scrapely. This is done after the shlex call
because shlex doesn't support unicode. Non-ascii "field" arguments
are still unsupported.
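
A rough sketch of the "split first, decode afterwards" ordering follows. It is not the code from this PR: `split_then_decode` is a hypothetical helper, and its `encoding` parameter stands in for self.stdin.encoding. On Python 2, shlex.split accepts a bytestring directly; the latin-1 round trip below is an assumption added only so that the same ordering can be run on Python 3, where shlex works on str.

```python
import shlex


def split_then_decode(line, encoding):
    """Tokenize a raw bytestring line, then decode each argument.

    Hypothetical helper; `line` is a bytestring such as cmd.Cmd would
    deliver on Python 2, `encoding` plays the role of self.stdin.encoding.
    """
    # latin-1 maps every byte to exactly one code point, so shlex can
    # tokenize the raw bytes without interpreting them as text.
    tokens = shlex.split(line.decode("latin-1"))
    # Recover the original bytes of each token and decode them properly.
    return [tok.encode("latin-1").decode(encoding) for tok in tokens]


raw_line = '"café au lait" prix'.encode("utf-8")
args = split_then_decode(raw_line, "utf-8")
```

Decoding after tokenizing keeps quoting and whitespace handling in shlex's hands while the arguments that reach the rest of the program are already unicode.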
pablohoffman added a commit that referenced this pull request Oct 10, 2013
[MRG] scrapely.tool: add support for non-ascii <text> and <data> arguments
@pablohoffman pablohoffman merged commit 576d3db into scrapy:master Oct 10, 2013