Add encoding functions #1

Merged
merged 3 commits into from
Feb 14, 2012

Conversation

shaneaevans
Member

These are loosely based on the encoding in scrapy.

Main differences:

  • tweaks to the regular expressions for encoding detection in HTML; one regexp handles both HTML and XML
  • handles byte order marks
  • better handling of character encoding overrides, with an updated list
  • does not fall back to BeautifulSoup; instead, auto-detection is customizable and disabled by default
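
The byte order mark handling listed above can be sketched roughly as follows. This is a minimal illustration and not the code in this pull request: `detect_bom` and `_BOM_TABLE` are hypothetical names, and only the `codecs` BOM constants come from the standard library.

```python
import codecs

# Hypothetical lookup table. The 4-byte UTF-32 marks come first because the
# UTF-32 LE mark (FF FE 00 00) starts with the UTF-16 LE mark (FF FE).
_BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(data: bytes):
    """Return (encoding, bom) for a leading byte order mark, else (None, None)."""
    for bom, encoding in _BOM_TABLE:
        if data.startswith(bom):
            return encoding, bom
    return None, None
```

Returning the mark itself, rather than the data with the mark removed, leaves it to the caller to decide whether stripping it is needed at all.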

This is based on the encoding detection in scrapy
returns a tuple of (encoding used, unicode)
"""
enc = http_content_type_encoding(content_type_header)
bom_enc, rest_of_data = read_bom(html_body_str)
Member

Are Python byte slices cheap?

Just pointing out that rest_of_data is copied and then dropped when bom_enc and enc don't match.
It's a rare case, I know, but it wastes memory for big responses, or for sites that send a BOM and a different transport encoding on all their pages.

I don't think this is a merge blocker, but it's worth pointing out; once this is in the wild we can optimize if it affects any real case.

Member Author

It should be pretty rare that we have BOMs that differ from the encoding in the headers, and it's not that expensive. But still - why take the risk? I'll change it.
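
For context on the concern in this thread: slicing a `bytes` object in Python creates a new object holding a copy of the sliced data. One way to avoid the wasted copy, sketched here under stated assumptions, is to defer the slice until the BOM encoding is actually chosen. `decode_body` is a hypothetical helper, not the function from this pull request; it checks only the UTF-8 BOM, assumes encoding names are already normalized to lowercase, and assumes that a header encoding that disagrees with the BOM takes precedence.

```python
import codecs


def decode_body(body: bytes, header_encoding=None):
    """Return (encoding used, text). Hypothetical sketch, UTF-8 BOM only."""
    bom_encoding, bom = None, b""
    if body.startswith(codecs.BOM_UTF8):
        bom_encoding, bom = "utf-8", codecs.BOM_UTF8

    # Assumption: a header encoding that differs from the BOM encoding wins.
    # The body is decoded whole, so no BOM-stripped copy is ever created.
    if header_encoding is not None and header_encoding != bom_encoding:
        return header_encoding, body.decode(header_encoding, errors="replace")

    if bom_encoding is not None:
        # Slice (and therefore copy) only now that the BOM encoding is used.
        return bom_encoding, body[len(bom):].decode(bom_encoding, errors="replace")

    # Assumed fallback when neither a header encoding nor a BOM is present.
    return "utf-8", body.decode("utf-8", errors="replace")
```

In the mismatch case, the BOM bytes are decoded as ordinary characters in the header encoding, which is the trade-off this sketch accepts.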

The content type header parameter to html_to_unicode has been documented
more clearly.
shaneaevans added a commit that referenced this pull request Feb 14, 2012
Add encoding functions for converting html to unicode
@shaneaevans shaneaevans merged commit 9f39f99 into scrapy:master Feb 14, 2012
wRAR pushed a commit that referenced this pull request Aug 24, 2021
Improve ParseDataURIResult documentation
kmike pushed a commit that referenced this pull request Jun 16, 2022
For issue #162 Add different regex pattern to search for meta tags
2 participants