[From Scrapy]: Incorrect XPath processing #160

chemiron · 2019-10-18T15:07:44Z

The XPath //h1 doesn't extract the data correctly from https://www.imdb.com/title/tt6757474/

BurnzZ · 2019-10-19T13:27:15Z

The website might change, so here's a quick reproducible example from the current state of the page:

>>> import parsel
>>> html = '<h1 class="">LA Galaxy <@ San Jose Earthquakes&nbsp;            </h1>'
>>> parsel.Selector(text=html).xpath('//h1').get()  
'<h1 class="">LA Galaxy </h1>'

Though IMHO this was invalid HTML to begin with. Had the webpage encoded the < character as &lt;, it would work correctly:

>>> import parsel
>>> html = '<h1 class="">LA Galaxy &lt;@ San Jose Earthquakes&nbsp;            </h1>'
>>> parsel.Selector(text=html).xpath('//h1').get()
'<h1 class="">LA Galaxy &lt;@ San Jose Earthquakes\xa0            </h1>'
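
Since the page itself can't be fixed from the scraper side, one possible workaround is to escape stray < characters before handing the markup to the parser. The following is only a minimal sketch using the standard library: escape_stray_lt is a hypothetical helper (not part of parsel or Scrapy), and the heuristic assumes that a < not followed by a letter, "/", "!" or "?" is literal text. It will misfire on content such as inline scripts, so treat it as an illustration rather than a general fix.

```python
import re

# Assumption: a "<" that is not followed by a letter, "/", "!" or "?" cannot
# start a tag, comment, doctype or processing instruction, so it is text.
STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(html: str) -> str:
    """Replace stray "<" characters with "&lt;" (hypothetical helper)."""
    return STRAY_LT.sub("&lt;", html)


html = '<h1 class="">LA Galaxy <@ San Jose Earthquakes&nbsp;            </h1>'
fixed = escape_stray_lt(html)
print(fixed)
```

The resulting string matches the encoded input from the second example above, so it could then be passed to parsel.Selector(text=fixed) in the same way.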

Gallaecio · 2019-10-21T15:32:26Z

@chemiron May we close this issue in favor of #126?

chemiron · 2019-10-25T08:43:51Z

closed in favor of #126

Gallaecio added the enhancement label Oct 21, 2019

chemiron closed this as completed Oct 25, 2019
