Hello world!
I am building a generic Spider that crawls sites and captures the requests they make.
I use scrapy-playwright to load each website first and record the requests that are sent.
I noticed that when I parse URLs whose response body has no content, execution freezes and Playwright's browser shows an empty tab.
To be clear, the problem reproduces when parsing a URL for which the following condition is true:

response_body_text = await response.text()
response_body_text == ''

For URLs where this condition is false, the spider works perfectly!
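As a workaround sketch for the condition above, the callback can bail out early on an empty body instead of continuing to process the page. This is a minimal, self-contained illustration: `FakeResponse` and `extract_body` are hypothetical names standing in for a Playwright network `Response` (only its async `.text()` is assumed) and the handler logic, not code from this report.

```python
import asyncio


class FakeResponse:
    """Stand-in for a network response; only an async .text(),
    as on Playwright's Response objects, is assumed here."""

    def __init__(self, body):
        self._body = body

    async def text(self):
        return self._body


async def extract_body(response):
    # Guard: return early when the body is empty (the case that
    # freezes the crawl above) instead of processing further.
    body = await response.text()
    if body == "":
        return None
    return body


print(asyncio.run(extract_body(FakeResponse(""))))        # prints: None
print(asyncio.run(extract_body(FakeResponse("<html/>")))) # prints: <html/>
```

The guard keeps empty-body responses from reaching any downstream parsing step, which at least isolates the hang to scrapy-playwright's own handling rather than the spider's callbacks.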
For reproduction, I have a fairly common configuration:
CrawlerProcess({
    ...
    # Playwright settings
    'DOWNLOAD_HANDLERS': {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    'TWISTED_REACTOR': "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    'PLAYWRIGHT_BROWSER_TYPE': 'chromium',
    'PLAYWRIGHT_MAX_PAGES_PER_CONTEXT': 10,
    'PLAYWRIGHT_LAUNCH_OPTIONS': {
        'headless': True,
    },
})
and on each scrapy.Request() I pass the following meta:
{
    "playwright": True,
}
Has anybody else run into this issue?
Thank you all!