Windows support (scrapy-plugins#276)
* Supporting for Windows

* Supporting for Windows

* Supporting for Windows

* Upload coverage report

* Upload coverage report

* up

* No need for error details

* Revert "No need for error details"

This reverts commit a6b9f6e.

* Restore original __all__

* Use platform.system(), remove Python version check

* Make black happy

* Black & typing adjustments

* _WindowsAdapter class

* Remove test markers

* Decorator to adapt tests for Windows

* Move _WindowsAdapter to _utils module

* Adapt all tests for Windows

* Update readme about Windows

* Placeholder changelog entry for upcoming release

* Rename coverage report CI step

* Add pull request id to changelog

* CI: add CODECOV_TOKEN to env (Windows)

* Run twisted test on Windows too

* Readme adjustments

* Remove unused check for Deferred

* asyncio reactor is not a requirement on Windows

---------

Co-authored-by: sanzenwin <sanzenwin@gmail.com>
elacuesta and sanzenwin authored Jun 24, 2024
1 parent ff06d5c commit c12e56b
Showing 15 changed files with 210 additions and 44 deletions.
11 changes: 11 additions & 0 deletions .github/workflows/tests.yml
@@ -13,6 +13,8 @@ jobs:
      include:
        - os: macos-latest
          python-version: "3.12"
        - os: windows-latest
          python-version: "3.12"

    steps:
      - uses: actions/checkout@v4
@@ -48,3 +50,12 @@ jobs:
          curl -Os https://uploader.codecov.io/latest/macos/codecov
          chmod +x codecov
          ./codecov
      - name: Upload coverage report (Windows)
        if: runner.os == 'Windows'
        env:
          CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
        run: |
          $ProgressPreference = 'SilentlyContinue'
          Invoke-WebRequest -Uri https://uploader.codecov.io/latest/windows/codecov.exe -Outfile codecov.exe
          .\codecov.exe
49 changes: 31 additions & 18 deletions README.md
@@ -56,10 +56,13 @@ See the [changelog](docs/changelog.md) document.

## Activation

### Download handler

Replace the default `http` and/or `https` Download Handlers through
[`DOWNLOAD_HANDLERS`](https://docs.scrapy.org/en/latest/topics/settings.html):

```python
# settings.py
DOWNLOAD_HANDLERS = {
"http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
"https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
@@ -70,12 +73,19 @@ Note that the `ScrapyPlaywrightDownloadHandler` class inherits from the default
`http/https` handler. Unless explicitly marked (see [Basic usage](#basic-usage)),
requests will be processed by the regular Scrapy download handler.

Also, be sure to [install the `asyncio`-based Twisted reactor](https://docs.scrapy.org/en/latest/topics/asyncio.html#installing-the-asyncio-reactor):

### Twisted reactor

When running on GNU/Linux or macOS you'll need to
[install the `asyncio`-based Twisted reactor](https://docs.scrapy.org/en/latest/topics/asyncio.html#installing-the-asyncio-reactor):

```python
# settings.py
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

This is not a requirement on Windows (see [Windows support](#windows-support)).


## Basic usage

@@ -112,6 +122,20 @@ does not match the running Browser. If you prefer the `User-Agent` sent by
default by the specific browser you're using, set the Scrapy user agent to `None`.


## Windows support

Windows support is possible by running Playwright in a `ProactorEventLoop` in a separate thread.
This is necessary because it's not possible to run Playwright in the same
asyncio event loop as the Scrapy crawler:
* Playwright runs the driver in a subprocess. Source:
[Playwright repository](https://github.com/microsoft/playwright-python/blob/v1.44.0/playwright/_impl/_transport.py#L120-L130).
* "On Windows, the default event loop `ProactorEventLoop` supports subprocesses,
whereas `SelectorEventLoop` does not". Source:
[Python docs](https://docs.python.org/3/library/asyncio-platforms.html#asyncio-windows-subprocess).
* Twisted's `asyncio` reactor requires the `SelectorEventLoop`. Source:
[Twisted repository](https://github.com/twisted/twisted/blob/twisted-24.3.0/src/twisted/internet/asyncioreactor.py#L31)
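The workaround described above can be sketched with plain asyncio, no Playwright required. `start_background_loop` and `run_blocking` are illustrative names, not part of the scrapy-playwright API:

```python
import asyncio
import platform
import threading


def start_background_loop() -> asyncio.AbstractEventLoop:
    # On Windows a ProactorEventLoop is needed because the Playwright
    # driver runs as a subprocess; elsewhere the default loop suffices.
    if platform.system() == "Windows":
        loop = asyncio.ProactorEventLoop()  # type: ignore[attr-defined]
    else:
        loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop


def run_blocking(coro, loop: asyncio.AbstractEventLoop):
    # Submit the coroutine to the loop running in the other thread
    # and block until it produces a result.
    return asyncio.run_coroutine_threadsafe(coro, loop).result()


async def fake_playwright_task() -> str:
    # Stand-in for an actual Playwright call.
    await asyncio.sleep(0)
    return "page content"
```

With this split, Twisted's reactor (and its `SelectorEventLoop`) stays on the main thread, while subprocess-capable work happens on the background loop.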


## Supported [settings](https://docs.scrapy.org/en/latest/topics/settings.html)

### `PLAYWRIGHT_BROWSER_TYPE`
@@ -851,6 +875,12 @@ Refer to the
[upstream docs](https://docs.scrapy.org/en/latest/topics/extensions.html#module-scrapy.extensions.memusage)
for more information about supported settings.

### Windows support

Just like the [upstream Scrapy extension](https://docs.scrapy.org/en/latest/topics/extensions.html#module-scrapy.extensions.memusage), this custom memory extension does not work
on Windows. This is because the stdlib [`resource`](https://docs.python.org/3/library/resource.html)
module is not available.
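For illustration, a guard like the following (a hypothetical helper, not part of this package) degrades gracefully where `resource` is missing:

```python
import platform

try:
    import resource  # POSIX-only stdlib module; absent on Windows
except ImportError:
    resource = None


def max_rss_bytes():
    """Peak resident set size in bytes, or None where it cannot be measured."""
    if resource is None:
        return None
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux but in bytes on macOS.
    return rss * 1024 if platform.system() == "Linux" else rss
```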


## Examples

@@ -912,23 +942,6 @@ See the [examples](examples) directory for more.

## Known issues

### Lack of native support for Windows

This package does not work natively on Windows. This is because:

* Playwright runs the driver in a subprocess. Source:
[Playwright repository](https://github.com/microsoft/playwright-python/blob/v1.28.0/playwright/_impl/_transport.py#L120-L129).
* "On Windows, the default event loop `ProactorEventLoop` supports subprocesses,
whereas `SelectorEventLoop` does not". Source:
[Python docs](https://docs.python.org/3/library/asyncio-platforms.html#asyncio-windows-subprocess).
* Twisted's `asyncio` reactor requires the `SelectorEventLoop`. Source:
[Twisted repository](https://github.com/twisted/twisted/blob/twisted-22.4.0/src/twisted/internet/asyncioreactor.py#L31).

Some users have reported having success
[running under WSL](https://github.com/scrapy-plugins/scrapy-playwright/issues/7#issuecomment-817394494).
See also [#78](https://github.com/scrapy-plugins/scrapy-playwright/issues/78)
for information about working in headful mode under WSL.

### No per-request proxy support
Specifying a proxy via the `proxy` Request meta key is not supported.
Refer to the [Proxy support](#proxy-support) section for more information.
5 changes: 5 additions & 0 deletions docs/changelog.md
@@ -1,5 +1,10 @@
# scrapy-playwright changelog

### [v0.0.36](https://github.com/scrapy-plugins/scrapy-playwright/releases/tag/v0.0.36) (2024-MM-DD)

* Windows support (#276)


### [v0.0.35](https://github.com/scrapy-plugins/scrapy-playwright/releases/tag/v0.0.35) (2024-06-01)

* Update exception message check
44 changes: 42 additions & 2 deletions scrapy_playwright/_utils.py
@@ -1,10 +1,15 @@
import asyncio
import concurrent
import logging
import platform
import threading
from typing import Awaitable, Iterator, Optional, Tuple, Union

import scrapy
from playwright.async_api import Error, Page, Request, Response
from scrapy import Spider
from scrapy.http.headers import Headers
from scrapy.utils.python import to_unicode
from twisted.internet.defer import Deferred
from w3lib.encoding import html_body_declared_encoding, http_content_type_encoding


@@ -53,7 +58,7 @@ def _is_safe_close_error(error: Error) -> bool:

async def _get_page_content(
page: Page,
spider: Spider,
spider: scrapy.Spider,
context_name: str,
scrapy_request_url: str,
scrapy_request_method: str,
@@ -89,3 +94,38 @@ async def _get_header_value(
return await resource.header_value(header_name)
except Exception:
return None


if platform.system() == "Windows":

    class _WindowsAdapter:
        """Utility class to redirect coroutines to an asyncio event loop running
        in a different thread. This makes it possible to use a ProactorEventLoop,
        which is supported by Playwright on Windows.
        """

        loop = None
        thread = None

        @classmethod
        def get_event_loop(cls) -> asyncio.AbstractEventLoop:
            if cls.thread is None:
                if cls.loop is None:
                    policy = asyncio.WindowsProactorEventLoopPolicy()  # type: ignore
                    cls.loop = policy.new_event_loop()
                    asyncio.set_event_loop(cls.loop)
                if not cls.loop.is_running():
                    cls.thread = threading.Thread(target=cls.loop.run_forever, daemon=True)
                    cls.thread.start()
                    logger.info("Started loop on separate thread: %s", cls.loop)
            return cls.loop

        @classmethod
        async def get_result(cls, coro) -> concurrent.futures.Future:
            return asyncio.run_coroutine_threadsafe(coro=coro, loop=cls.get_event_loop()).result()

    def _deferred_from_coro(coro) -> Deferred:
        return scrapy.utils.defer.deferred_from_coro(_WindowsAdapter.get_result(coro))

else:
    _deferred_from_coro = scrapy.utils.defer.deferred_from_coro
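The module above selects the coroutine-to-Deferred bridge once at import time rather than checking the platform on every call. A minimal sketch of that dispatch pattern (names here are illustrative, not the package's API):

```python
import platform


def _bridge_posix(value):
    # Direct path: used where the reactor's own event loop can run the coroutine.
    return ("direct", value)


def _bridge_windows(value):
    # Indirect path: stands in for routing through a background-loop adapter.
    return ("via-adapter", value)


# Resolved exactly once at import time; callers only ever see `bridge`.
bridge = _bridge_windows if platform.system() == "Windows" else _bridge_posix
```

Binding the function at import time keeps the per-request hot path free of platform checks.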
12 changes: 7 additions & 5 deletions scrapy_playwright/handler.py
@@ -1,5 +1,6 @@
import asyncio
import logging
import platform
from contextlib import suppress
from dataclasses import dataclass
from ipaddress import ip_address
@@ -25,7 +26,6 @@
from scrapy.http.headers import Headers
from scrapy.responsetypes import responsetypes
from scrapy.settings import Settings
from scrapy.utils.defer import deferred_from_coro
from scrapy.utils.misc import load_object
from scrapy.utils.reactor import verify_installed_reactor
from twisted.internet.defer import Deferred, inlineCallbacks
@@ -38,6 +38,7 @@
_get_page_content,
_is_safe_close_error,
_maybe_await,
_deferred_from_coro,
)


@@ -101,7 +102,8 @@ class ScrapyPlaywrightDownloadHandler(HTTPDownloadHandler):

def __init__(self, crawler: Crawler) -> None:
super().__init__(settings=crawler.settings, crawler=crawler)
verify_installed_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")
if platform.system() != "Windows":
verify_installed_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")
crawler.signals.connect(self._engine_started, signals.engine_started)
self.stats = crawler.stats

@@ -134,7 +136,7 @@ def from_crawler(cls: Type[PlaywrightHandler], crawler: Crawler) -> PlaywrightHa

def _engine_started(self) -> Deferred:
"""Launch the browser. Use the engine_started signal as it supports returning deferreds."""
return deferred_from_coro(self._launch())
return _deferred_from_coro(self._launch())

async def _launch(self) -> None:
"""Launch Playwright manager and configured startup context(s)."""
@@ -290,7 +292,7 @@ def _set_max_concurrent_context_count(self):
def close(self) -> Deferred:
logger.info("Closing download handler")
yield super().close()
yield deferred_from_coro(self._close())
yield _deferred_from_coro(self._close())

async def _close(self) -> None:
await asyncio.gather(*[ctx.context.close() for ctx in self.context_wrappers.values()])
@@ -305,7 +307,7 @@ async def _close(self) -> None:

def download_request(self, request: Request, spider: Spider) -> Deferred:
if request.meta.get("playwright"):
return deferred_from_coro(self._download_request(request, spider))
return _deferred_from_coro(self._download_request(request, spider))
return super().download_request(request, spider)

async def _download_request(self, request: Request, spider: Spider) -> Response:
28 changes: 28 additions & 0 deletions tests/__init__.py
@@ -1,10 +1,38 @@
import inspect
import logging
import platform
from contextlib import asynccontextmanager
from functools import wraps

from scrapy import Request
from scrapy.http.response.html import HtmlResponse
from scrapy.utils.test import get_crawler


logger = logging.getLogger("scrapy-playwright-tests")


if platform.system() == "Windows":
    from scrapy_playwright._utils import _WindowsAdapter

    def allow_windows(test_method):
        """Wrap tests with the _WindowsAdapter class on Windows."""
        if not inspect.iscoroutinefunction(test_method):
            raise RuntimeError(f"{test_method} must be an async def method")

        @wraps(test_method)
        async def wrapped(self, *args, **kwargs):
            logger.debug("Calling _WindowsAdapter.get_result for %r", self)
            await _WindowsAdapter.get_result(test_method(self, *args, **kwargs))

        return wrapped

else:

    def allow_windows(test_method):
        return test_method


@asynccontextmanager
async def make_handler(settings_dict: dict):
"""Convenience function to obtain an initialized handler and close it gracefully"""
17 changes: 17 additions & 0 deletions tests/conftest.py
@@ -1,3 +1,20 @@
import platform

import pytest


@pytest.hookimpl(tryfirst=True)
def pytest_configure(config):
    # https://twistedmatrix.com/trac/ticket/9766
    # https://github.com/pytest-dev/pytest-twisted/issues/80
    if config.getoption("reactor", "default") == "asyncio" and platform.system() == "Windows":
        import asyncio

        selector_policy = asyncio.WindowsSelectorEventLoopPolicy()
        asyncio.set_event_loop_policy(selector_policy)


def pytest_sessionstart(session): # pylint: disable=unused-argument
"""
Called after the Session object has been created and before performing
9 changes: 8 additions & 1 deletion tests/tests_asyncio/test_browser_contexts.py
@@ -10,11 +10,12 @@
from scrapy import Spider, Request
from scrapy_playwright.page import PageMethod

from tests import make_handler
from tests import allow_windows, make_handler
from tests.mockserver import StaticMockServer


class MixinTestCaseMultipleContexts:
@allow_windows
async def test_context_kwargs(self):
settings_dict = {
"PLAYWRIGHT_BROWSER_TYPE": self.browser_type,
@@ -37,6 +38,7 @@ async def test_context_kwargs(self):
with pytest.raises(PlaywrightTimeoutError):
await handler._download_request(req, Spider("foo"))

@allow_windows
async def test_contexts_max_pages(self):
settings = {
"PLAYWRIGHT_BROWSER_TYPE": self.browser_type,
@@ -71,6 +73,7 @@ async def test_contexts_max_pages(self):

assert handler.stats.get_value("playwright/page_count/max_concurrent") == 4

@allow_windows
async def test_max_contexts(self):
def cb_close_context(task):
response = task.result()
@@ -105,6 +108,7 @@ def cb_close_context(task):

assert handler.stats.get_value("playwright/context_count/max_concurrent") == 4

@allow_windows
async def test_contexts_startup(self):
settings = {
"PLAYWRIGHT_BROWSER_TYPE": self.browser_type,
@@ -143,6 +147,7 @@ async def test_contexts_startup(self):
assert cookie["value"] == "bar"
assert cookie["domain"] == "example.org"

@allow_windows
async def test_persistent_context(self):
temp_dir = f"{tempfile.gettempdir()}/{uuid4()}"
settings = {
@@ -161,6 +166,7 @@ async def test_persistent_context(self):
assert handler.context_wrappers["persistent"].persistent
assert not hasattr(handler, "browser")

@allow_windows
async def test_mixed_persistent_contexts(self):
temp_dir = f"{tempfile.gettempdir()}/{uuid4()}"
settings = {
@@ -183,6 +189,7 @@ async def test_mixed_persistent_contexts(self):
assert not handler.context_wrappers["non-persistent"].persistent
assert isinstance(handler.browser, Browser)

@allow_windows
async def test_contexts_dynamic(self):
async with make_handler({"PLAYWRIGHT_BROWSER_TYPE": self.browser_type}) as handler:
assert len(handler.context_wrappers) == 0
5 changes: 5 additions & 0 deletions tests/tests_asyncio/test_extensions.py
@@ -1,3 +1,4 @@
import platform
from asyncio.subprocess import Process as AsyncioProcess
from unittest import IsolatedAsyncioTestCase
from unittest.mock import MagicMock, patch
@@ -34,6 +35,10 @@ class MockMemoryInfo:
rss = 999


@pytest.mark.skipif(
    platform.system() == "Windows",
    reason="resource stdlib module is not available on Windows",
)
@patch("scrapy.extensions.memusage.MailSender")
class TestMemoryUsageExtension(IsolatedAsyncioTestCase):
async def test_process_availability(self, _MailSender):
