Description
Hello,
While trying to upgrade my "normal" Scrapy CrawlSpider to use Playwright through this plug-in, I came across the handling of binary files. With classic Scrapy, every file is processed the same way: the Scrapy response body contains the bytes of the requested file (HTML page, PDF, Office document, etc.).
I understand that with Playwright, as it is literally a real browser, every document the browser cannot render should trigger the `download` event (except PDFs, which are opened with the built-in viewer, but this is a known issue).
My concern is more about how to handle this: with the plug-in, even if I try to catch the Playwright exception or play with the `download` event handler, the plug-in still raises an error and never triggers the spider's callback, such as `parse_page`. Therefore, these "binary" files are never processed.
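For reference, here is a minimal sketch of the kind of spider that produces the log below (the spider name and callback are illustrative; my real spider is a CrawlSpider, but a plain Spider behaves the same):

```python
import scrapy


class SampleSpider(scrapy.Spider):
    name = 'sample'

    def start_requests(self):
        # Requesting a binary file (here an Excel document) via Playwright
        yield scrapy.Request(
            'https://dornsife.usc.edu/assets/sites/298/docs/ir211wk12sample.xls',
            meta={'playwright': True},
            callback=self.parse_page,
        )

    def parse_page(self, response):
        # Never reached for binary files: navigation aborts with
        # net::ERR_ABORTED before a response is handed back to Scrapy
        self.logger.info('Callback reached: %s', response.url)
```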
```
Browser chromium launched
Browser context started: 'custom' (persistent=False)
New page created, page count is 1 (1 for all contexts)
Request: <GET https://dornsife.usc.edu/assets/sites/298/docs/ir211wk12sample.xls> (resource type: document, referrer: https://dornsife.usc.edu/)
***Caught exception: <class 'playwright._impl._api_types.Error'>***
Response: <200 https://dornsife.usc.edu/assets/sites/298/docs/ir211wk12sample.xls> (referrer: None)
***Document triggered a download***
<twisted.python.failure.Failure playwright._impl._api_types.Error: net::ERR_ABORTED at https://dornsife.usc.edu/assets/sites/298/docs/ir211wk12sample.xls
=========================== logs ===========================
navigating to "https://dornsife.usc.edu/assets/sites/298/docs/ir211wk12sample.xls", waiting until "load"
============================================================>
Closing spider (finished)
Spider closed (finished)
Closing download handler
Closing download handler
Browser context closed: 'custom' (persistent=False)
```
Here is what I tried, through a dedicated downloader middleware for Playwright:
Add a `download` event handler
```python
[...]

async def handle_download(self, download):
    logger.info('Document triggered a download')
    # Save the file locally under the name suggested by the browser
    await download.save_as(download.suggested_filename)
    self.binary_files[download.url] = download.suggested_filename

async def process_request(
    self,
    request,
    spider,
) -> None:
    request.meta['playwright'] = True
    request.meta['playwright_include_page'] = True
    # Register the handler for the Playwright page's 'download' event
    request.meta['playwright_page_event_handlers'] = {
        'download': self.handle_download,
    }
    return None

async def process_exception(
    self,
    request,
    spider,
    exception,
):
    logger.info('Caught exception: %s', exception.__class__)
    if request.url in self.binary_files:
        logger.info('File found locally for this URL')
```
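For completeness, the middleware is enabled the usual way in the project settings (the dotted path and priority are illustrative):

```python
# settings.py -- the module path is hypothetical
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.PlaywrightBinaryMiddleware': 543,
}
```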
My idea was to:
- Download the file locally with the `download` Playwright event handler;
- Store the URL as key and the filename as value in a middleware attribute, to use it in `process_exception`;
- Return a custom Scrapy `Response` object in `process_exception` to get the file processed as normal.
But as we can see in the log above, the `download` event is triggered after the `process_exception` method runs, so this is not possible. And I don't think I can return a custom Scrapy response from the `download` event handler either?
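For illustration, a minimal sketch of the `process_exception` I had in mind, assuming the file had already been saved by the time it runs (which, per the log above, it has not):

```python
from pathlib import Path

from scrapy.http import Response


async def process_exception(self, request, spider, exception):
    # Hypothetical: relies on handle_download having run first,
    # which the event ordering above prevents
    filename = self.binary_files.get(request.url)
    if filename is None:
        return None  # not a tracked download; let Scrapy handle the error
    body = Path(filename).read_bytes()
    # A Response returned from process_exception is passed on to the
    # response middleware chain and then to the request callback
    return Response(url=request.url, body=body, request=request)
```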
Detect and redirect binary files to a classic Scrapy spider
I figured I could:
- Detect the extension of a binary file in `process_request` and redirect it to a non-Playwright version of my spider;
- Detect the MIME type of a binary file in `process_response` (sometimes files don't have an extension) and do the same.
Like so:
```python
from urllib.parse import urlparse

from scrapy.exceptions import IgnoreRequest

[...]

def _is_request_for_binary_file(self, url: str) -> bool:
    lowercase_path = urlparse(url).path.lower()
    return any(lowercase_path.endswith(ext) for ext in self.binary_extensions)

async def process_request(
    self,
    request,
    spider,
) -> None:
    if self._is_request_for_binary_file(request.url):
        logger.warning('[Playwright] Binary file requested')
        logger.warning('[Playwright] Transferring to normal Scrapy for crawling')
        # send crawling task to another scrapy instance here
        raise IgnoreRequest()

async def process_response(
    self,
    request,
    response,
    spider,
):
    # Fallback for extension-less files: inspect the Content-Type header
    mimetype, *_ = response.headers['content-type'].decode().split(';')
    if mimetype in self.binary_mimetypes:
        logger.warning('Binary file detected')
        logger.warning('Sending file to normal Scrapy for crawling')
        # send crawling task to another scrapy instance here
        raise IgnoreRequest()
    return response
```
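As an illustration of the hand-off, one variation would be to stay inside the same crawl and simply re-issue the request without the Playwright meta keys; scrapy-playwright only routes requests whose meta sets `playwright` to `True` through the browser, so the copy goes through the default HTTP download handler (a sketch, not tested):

```python
async def process_response(self, request, response, spider):
    content_type = response.headers.get('content-type', b'').decode()
    mimetype = content_type.split(';')[0].strip()
    if mimetype in self.binary_mimetypes:
        logger.warning('Binary file detected, retrying without Playwright')
        # Strip every playwright-related meta key so the default HTTP
        # handler fetches the file; dont_filter skips the dupefilter
        meta = {
            k: v for k, v in request.meta.items()
            if not k.startswith('playwright')
        }
        return request.replace(meta=meta, dont_filter=True)
    return response
```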
But I'm pretty sure there must be a better way. Any idea how we could avoid these exceptions and turn binary files into valid Scrapy responses, which would trigger the request callback?
I could not reproduce it, but in my early tests the spider's traceback when trying to process a binary file pointed at the `download_request` method in the plug-in's `handler.py` file.
Thank you very much in advance for any tips!