Handling of binary files #184

Closed
@kinoute

Description

Hello,

While trying to upgrade my "normal" Scrapy CrawlSpider to use Playwright through this plug-in, I came across the handling of binary files. With classic Scrapy, every file is processed the same way: the Scrapy response body contains the bytes of the requested file – HTML page, PDF, Office document, etc.

I understand that with Playwright, since it is literally a real browser, every document the browser cannot render should trigger the download event – except PDFs, which are displayed with the built-in viewer, but this is a known issue.
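
To illustrate the behaviour, here is a minimal sketch with plain Playwright (outside the plug-in; the URL is the sample file from the logs below): navigating to a document Chromium cannot render aborts the navigation with net::ERR_ABORTED and emits the download event instead.

    import asyncio

    from playwright.async_api import Error, async_playwright

    async def main():
        async with async_playwright() as p:
            browser = await p.chromium.launch()
            page = await browser.new_page()
            async with page.expect_download() as download_info:
                try:
                    await page.goto('https://dornsife.usc.edu/assets/sites/298/docs/ir211wk12sample.xls')
                except Error as exc:
                    # Chromium aborts the navigation for non-renderable documents
                    print('Navigation failed:', exc)
            # The file is still available through the emitted download event
            download = await download_info.value
            await download.save_as(download.suggested_filename)
            await browser.close()

    asyncio.run(main())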

My concern is more about how to handle this: with the plug-in, even if I try to catch the Playwright exception or play with the download event handler, the plug-in still raises an error and never triggers the spider callback, such as parse_page. As a result, these "binary" files are never processed.

 Browser chromium launched
 Browser context started: 'custom' (persistent=False)
 New page created, page count is 1 (1 for all contexts)
 Request: <GET https://dornsife.usc.edu/assets/sites/298/docs/ir211wk12sample.xls> (resource type: document, referrer: https://dornsife.usc.edu/)
 ***Caught exception: <class 'playwright._impl._api_types.Error'>***
 Response: <200 https://dornsife.usc.edu/assets/sites/298/docs/ir211wk12sample.xls> (referrer: None)
 ***Document triggered a download***
 <twisted.python.failure.Failure playwright._impl._api_types.Error: net::ERR_ABORTED at https://dornsife.usc.edu/assets/sites/298/docs/ir211wk12sample.xls
=========================== logs ===========================
navigating to "https://dornsife.usc.edu/assets/sites/298/docs/ir211wk12sample.xls", waiting until "load"
============================================================>
 Closing spider (finished)
 Spider closed (finished)
 Closing download handler
 Closing download handler
Browser context closed: 'custom' (persistent=False)

Through a dedicated middleware for Playwright, here is what I tried:

Add a download event handler

   [...]
    async def handle_download(self, download):
        logger.info('Document triggered a download')
        await download.save_as(download.suggested_filename)
        self.binary_files[download.url] = download.suggested_filename

    async def process_request(
        self,
        request,
        spider,
    ) -> None:
        request.meta['playwright'] = True
        request.meta['playwright_include_page'] = True

        request.meta['playwright_page_event_handlers'] = {
            'download': self.handle_download,
        }

        return None

    async def process_exception(
        self,
        request,
        spider,
        exception,
    ):
        logger.info('Caught exception: %s', exception.__class__)
        if request.url in self.binary_files:
            logger.info('File found locally for this URL')

My idea was to:

  1. Download the file locally with the Playwright download event handler
  2. Store the URL as key and the filename as value in a middleware attribute, to reuse in process_exception
  3. Return a custom Scrapy Response object from process_exception so the file gets processed as normal (see the sketch after this list)
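
A sketch of that missing piece in step 3, reusing the names from the snippet above and the stock scrapy.http.Response class:

    from pathlib import Path

    from scrapy.http import Response

    async def process_exception(self, request, spider, exception):
        logger.info('Caught exception: %s', exception.__class__)
        if request.url in self.binary_files:
            logger.info('File found locally for this URL')
            # Feed the locally saved file back to Scrapy so the spider
            # callback receives its bytes, as with classic Scrapy
            body = Path(self.binary_files[request.url]).read_bytes()
            return Response(url=request.url, body=body, request=request)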

But as we can see in the log above, the download event is triggered after the process_exception method, so this is not possible.

I don't think I can return a custom Scrapy response from the download event handler either?

Detect binary files and redirect them to classic Scrapy

I figured I could:

  • Detect the extension of a binary file in process_request and redirect it to a non-Playwright version of my spider;
  • Detect the MIME type of a binary file in process_response – sometimes files don't have an extension – and do the same.

Like so:

    from urllib.parse import urlparse

    from scrapy.exceptions import IgnoreRequest

    def _is_request_for_binary_file(self, url: str) -> bool:
        lowercase_path = urlparse(url).path.lower()
        return any(lowercase_path.endswith(ext) for ext in self.binary_extensions)

    async def process_request(
        self,
        request,
        spider,
    ) -> None:
        if self._is_request_for_binary_file(request.url):
            logger.warning('[Playwright] Binary file requested')
            logger.warning('[Playwright] Transferring to normal Scrapy for crawling')
            # send crawling task to another scrapy instance here
            raise IgnoreRequest()

    async def process_response(
        self,
        request,
        response,
        spider,
    ):
        mimetype, *_ = response.headers.get('Content-Type', b'').decode().split(';')

        if mimetype in self.binary_mimetypes:
            logger.warning('Binary file detected')
            logger.warning('Sending file to normal Scrapy for crawling')
            # send crawling task to another scrapy instance here
            raise IgnoreRequest()

        return response
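
For completeness, the snippet assumes the middleware defines the two lookup collections somewhere, plus the binary_files mapping from the first attempt; the class name and the values below are only illustrative:

    class PlaywrightBinaryFilesMiddleware:  # hypothetical name
        # Illustrative values only; extend them to whatever the crawl encounters
        binary_extensions = {'.pdf', '.doc', '.docx', '.xls', '.xlsx', '.zip'}
        binary_mimetypes = {
            'application/pdf',
            'application/msword',
            'application/vnd.ms-excel',
            'application/zip',
            'application/octet-stream',
        }

        def __init__(self):
            # URL -> local filename, filled by the download event handler
            self.binary_files = {}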

But I'm pretty sure there must be a better way. Any idea how we could avoid these exceptions and turn binary files into valid Scrapy responses that trigger the Request callback?

I could not reproduce it, but in my early tests, the spider's traceback when trying to process a binary file pointed to the download_request method in the plug-in's handler.py file.

Thank you very much in advance for any tips!

Metadata

Labels: enhancement (New feature or request)