[From Scrapy]: Incorrect XPath processing #160

Closed
chemiron opened this issue Oct 18, 2019 · 3 comments

@chemiron

The XPath //h1 does not extract the data correctly from https://www.imdb.com/title/tt6757474/

@BurnzZ
Member

BurnzZ commented Oct 19, 2019

The website might change, so here's a quick reproducible example from the current state of the page:

>>> import parsel
>>> html = '<h1 class="">LA Galaxy <@ San Jose Earthquakes&nbsp;            </h1>'
>>> parsel.Selector(text=html).xpath('//h1').get()  
'<h1 class="">LA Galaxy </h1>'

Though IMHO this was invalid HTML to begin with. Had the webpage encoded the < character as &lt;, it would work correctly:

>>> import parsel
>>> html = '<h1 class="">LA Galaxy &lt;@ San Jose Earthquakes&nbsp;            </h1>'
>>> parsel.Selector(text=html).xpath('//h1').get()
'<h1 class="">LA Galaxy &lt;@ San Jose Earthquakes\xa0            </h1>'
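If the page cannot be fixed at the source, one possible workaround is to escape stray < characters before handing the text to the parser. The sketch below is only an illustration of that idea: escape_stray_lt is a hypothetical helper, not part of parsel or Scrapy, and it assumes that any < not followed by a letter, /, ! or ? is literal text rather than markup. That heuristic will not be right for every page (for example, it would also touch < inside scripts).

```python
import re

# Hypothetical helper, not part of parsel: treat "<" as literal text unless
# it is followed by something that can start a tag, closing tag, comment,
# doctype, or processing instruction.
_STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(html: str) -> str:
    """Replace stray "<" characters with the &lt; entity."""
    return _STRAY_LT.sub("&lt;", html)


raw = '<h1 class="">LA Galaxy <@ San Jose Earthquakes&nbsp;            </h1>'
fixed = escape_stray_lt(raw)

# The cleaned string can then be parsed as usual, e.g.:
#   parsel.Selector(text=fixed).xpath('//h1').get()
print(fixed)
```

This only changes the input text; the parser's own handling of malformed markup is left untouched.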

@Gallaecio
Member

@chemiron May we close this issue in favor of #126?

@chemiron
Author

Closed in favor of #126.
