SelectorList.drop() removing elements doesn't work as expected #297

dream2333 · 2024-06-08T17:25:18Z

def parse_detail(self, response: HtmlResponse, item: DetailDataItem):
    selectors = response.jmespath("news.body")
    selectors.xpath(".//script|.//style").drop()
    item.content = selectors.xpath("string(.)").get().strip()
    yield item

I'm trying to remove the 'script' and 'style' elements from the selected content using selectors.xpath(".//script|.//style").drop(). However, even after this line runs, the 'style' element still exists in the DOM.

Here is the URL:
https://newsinfo.eastmoney.com/kuaixun/v2/api/content/getnews?newsid=202406083099747443&newstype=1

dream2333 · 2024-06-08T17:27:48Z

Could someone help me understand why this is happening?

dream2333 · 2024-06-09T08:42:22Z

I've figured out why this is happening. Calling drop() on a Selector that was created from JSON in Scrapy does not modify the HTML tree. However, if you extract the HTML text from the JSON and build a new Selector from it, drop() works as expected. This looks like a bug in Parsel's Selector implementation.

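from parsel import Selector

# Extract the HTML string from the JSON, then parse it into its own HTML selector.
content = response.jmespath("news.body").get()
selector = Selector(text=content, type="html")
selector.xpath(".//script|.//style").drop()
item.content = selector.xpath("string(.)").get().strip()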

dream2333 · 2024-06-09T13:56:57Z

When .xpath() is called on a text-type selector, the resulting nodes appear to be copies parsed from the text rather than nodes of the original root. As a result, calling .drop() on them does not affect the content of the original HTML tree. This happens mostly when jmespath and xpath are used in combination.

This behavior is quite subtle. To make .drop() effective, we first need to call .xpath(".") to get a new HTML selector; only then does .drop() work as expected on it. This is not intuitive and could lead to confusion or unexpected results. I believe it would be beneficial to either change this behavior or clarify it in the documentation.

# .xpath(".") turns the text-type result into HTML selectors backed by one tree.
selectors = json_selector.jmespath("news.body").xpath(".")
selectors.xpath(".//script|.//style").drop()
item.content = selectors.xpath("string(.)").get().strip()
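
For illustration, the copy-versus-original behavior described above can be modelled with the standard library alone. This is a toy sketch, not Parsel's code: it assumes xml.etree.ElementTree as a stand-in parser, and MARKUP, drop_all and text_of are hypothetical names made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup standing in for the HTML string stored in the JSON.
MARKUP = "<div><style>p {color: red}</style><p>Hello</p></div>"


def drop_all(root, tag):
    """Remove every descendant element named `tag` from the tree under `root`."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)


def text_of(root):
    """Concatenate all text in the tree, similar in spirit to XPath string(.)."""
    return "".join(root.itertext())


# Case 1: every query parses the text again, so the drop only changes a
# throwaway tree and the next query sees the <style> element again.
drop_all(ET.fromstring(MARKUP), "style")
text_after_copy_drop = text_of(ET.fromstring(MARKUP))

# Case 2: parse once, then drop from and read from that same tree.
root = ET.fromstring(MARKUP)
drop_all(root, "style")
text_after_shared_drop = text_of(root)

print(repr(text_after_copy_drop))
print(repr(text_after_shared_drop))
```

In this model, the first case corresponds to querying a text-type selector directly, and the second to calling .xpath(".") first and then working on the selectors it returns.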

dream2333 · 2024-06-09T19:06:55Z

Refs #298

dream2333 linked a pull request Jun 9, 2024 that will close this issue

Fix the issue where HTML elements cannot be dropped from the text selector returned by Selector.jmespath() #298

Open

Gallaecio linked a pull request Jun 14, 2024 that will close this issue

Support forcing a selector type into a subselector #299

Open
