Handling of binary files #184

Closed
@kinoute

Description

Hello,

While trying to upgrade my "normal" Scrapy CrawlSpider to use Playwright through this plug-in, I came across the handling of binary files. With classic Scrapy, every file is processed the same way: the Scrapy response body contains the bytes of the requested file – HTML page, PDF, Office document, etc.

I understand that with Playwright, since it is literally a real browser, every document the browser cannot render should trigger the download event – except PDFs, which are displayed with the built-in viewer, but this is a known issue.
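
To illustrate the behaviour, here is a minimal sketch with plain Playwright (outside the plug-in; the URL is the sample file from the logs below): navigating to a document Chromium cannot render aborts the navigation with net::ERR_ABORTED and emits the download event instead.

    import asyncio

    from playwright.async_api import Error, async_playwright

    async def main():
        async with async_playwright() as p:
            browser = await p.chromium.launch()
            page = await browser.new_page()
            async with page.expect_download() as download_info:
                try:
                    await page.goto('https://dornsife.usc.edu/assets/sites/298/docs/ir211wk12sample.xls')
                except Error as exc:
                    # Chromium aborts the navigation for non-renderable documents
                    print('Navigation failed:', exc)
            # The file is still available through the emitted download event
            download = await download_info.value
            await download.save_as(download.suggested_filename)
            await browser.close()

    asyncio.run(main())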

My concern is more about how to handle this: with the plug-in, even if I try to catch the Playwright exception or play with the download event handler, the plug-in still raises an error and never triggers the spider callback, such as parse_page. As a result, these "binary" files are never processed.

 Browser chromium launched
 Browser context started: 'custom' (persistent=False)
 New page created, page count is 1 (1 for all contexts)
 Request: <GET https://dornsife.usc.edu/assets/sites/298/docs/ir211wk12sample.xls> (resource type: document, referrer: https://dornsife.usc.edu/)
 ***Caught exception: <class 'playwright._impl._api_types.Error'>***
 Response: <200 https://dornsife.usc.edu/assets/sites/298/docs/ir211wk12sample.xls> (referrer: None)
 ***Document triggered a download***
 <twisted.python.failure.Failure playwright._impl._api_types.Error: net::ERR_ABORTED at https://dornsife.usc.edu/assets/sites/298/docs/ir211wk12sample.xls
=========================== logs ===========================
navigating to "https://dornsife.usc.edu/assets/sites/298/docs/ir211wk12sample.xls", waiting until "load"
============================================================>
 Closing spider (finished)
 Spider closed (finished)
 Closing download handler
 Closing download handler
Browser context closed: 'custom' (persistent=False)

Through a dedicated middleware for Playwright, here is what I tried:

Add a download event handler

   [...]
    async def handle_download(self, download):
        logger.info('Document triggered a download')
        await download.save_as(download.suggested_filename)
        self.binary_files[download.url] = download.suggested_filename

    async def process_request(
        self,
        request,
        spider,
    ) -> None:
        request.meta['playwright'] = True
        request.meta['playwright_include_page'] = True

        request.meta['playwright_page_event_handlers'] = {
            'download': self.handle_download,
        }

        return None

    async def process_exception(
        self,
        request,
        spider,
        exception,
    ):
        logger.info('Caught exception: %s', exception.__class__)
        if request.url in self.binary_files:
            logger.info('File found locally for this URL')

My idea was to:

  1. Download the file locally with the Playwright download event handler
  2. Store the URL as key and the filename as value in a middleware attribute, to reuse in process_exception
  3. Return a custom Scrapy Response object from process_exception so the file gets processed as normal (see the sketch after this list)
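
A sketch of that missing piece in step 3, reusing the names from the snippet above and the stock scrapy.http.Response class:

    from pathlib import Path

    from scrapy.http import Response

    async def process_exception(self, request, spider, exception):
        logger.info('Caught exception: %s', exception.__class__)
        if request.url in self.binary_files:
            logger.info('File found locally for this URL')
            # Feed the locally saved file back to Scrapy so the spider
            # callback receives its bytes, as with classic Scrapy
            body = Path(self.binary_files[request.url]).read_bytes()
            return Response(url=request.url, body=body, request=request)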

But as we can see in the log above, the download event is triggered after the process_exception method, so this is not possible.

I don't think I can return a custom Scrapy response from the download event handler either?

Detect binary files and redirect them to classic Scrapy

I figured I could:

  • Detect the extension of a binary file in process_request and redirect it to a non-Playwright version of my spider;
  • Detect the MIME type of a binary file in process_response – sometimes files don't have an extension – and do the same.

Like so:

    from urllib.parse import urlparse

    from scrapy.exceptions import IgnoreRequest

    def _is_request_for_binary_file(self, url: str) -> bool:
        lowercase_path = urlparse(url).path.lower()
        return any(lowercase_path.endswith(ext) for ext in self.binary_extensions)

    async def process_request(
        self,
        request,
        spider,
    ) -> None:
        if self._is_request_for_binary_file(request.url):
            logger.warning('[Playwright] Binary file requested')
            logger.warning('[Playwright] Transferring to normal Scrapy for crawling')
            # send crawling task to another scrapy instance here
            raise IgnoreRequest()

    async def process_response(
        self,
        request,
        response,
        spider,
    ):
        mimetype, *_ = response.headers.get('Content-Type', b'').decode().split(';')

        if mimetype in self.binary_mimetypes:
            logger.warning('Binary file detected')
            logger.warning('Sending file to normal Scrapy for crawling')
            # send crawling task to another scrapy instance here
            raise IgnoreRequest()

        return response
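
For completeness, the snippet assumes the middleware defines the two lookup collections somewhere, plus the binary_files mapping from the first attempt; the class name and the values below are only illustrative:

    class PlaywrightBinaryFilesMiddleware:  # hypothetical name
        # Illustrative values only; extend them to whatever the crawl encounters
        binary_extensions = {'.pdf', '.doc', '.docx', '.xls', '.xlsx', '.zip'}
        binary_mimetypes = {
            'application/pdf',
            'application/msword',
            'application/vnd.ms-excel',
            'application/zip',
            'application/octet-stream',
        }

        def __init__(self):
            # URL -> local filename, filled by the download event handler
            self.binary_files = {}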

But I'm pretty sure there must be a better way. Any idea how we could avoid these exceptions and turn binary files into valid Scrapy responses that trigger the Request callback?

I could not reproduce it, but in my early tests, the spider's traceback when trying to process a binary file pointed to the download_request method in the plug-in's handler.py file.

Thank you very much in advance for any tips!

Metadata

Labels: enhancement (New feature or request)