Commit 669e3b4: "to 004"

1 parent 89d4f71

7 files changed: +200 -14 lines


CHANGELOG.md

Lines changed: 12 additions & 0 deletions

```diff
@@ -0,0 +1,12 @@
+# Gerapy Pyppeteer Changelog
+
+## 0.0.4 (2020-07-15)
+
+### Bug Fixes
+
+* Fix bug of Pyppeteer not being closed when a page failed to load
+
+### Features
+
+* Add support for `GERAPY_IGNORE_RESOURCE_TYPES`
+* Add support for retrying
```
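As a quick illustration of the retrying feature (a sketch, not part of this commit): the middleware reuses Scrapy's standard retry settings, which `from_crawler` now reads (see the `downloadermiddlewares.py` diff below), so a project would enable it roughly like this; the values are only examples:

```python
# settings.py of a Scrapy project using gerapy-pyppeteer (illustrative sketch,
# not part of this commit). These are Scrapy's standard retry settings, now
# read by the middleware in from_crawler().
RETRY_ENABLED = True           # master switch checked by the new _retry() helper
RETRY_TIMES = 2                # default retry budget per request
RETRY_PRIORITY_ADJUST = -1     # retried requests get a lower scheduling priority
```

With these in place, a Pyppeteer `PageError` or `TimeoutError` during rendering yields a retried copy of the request instead of a lost page.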

README.md

Lines changed: 92 additions & 1 deletion

At the end of the "Example" section, after the existing line "For more detail, please see [example](./example).", the README gains a note on running the example with Docker along with the expected output. The added content:

You can also run it directly with Docker:

```
docker run germey/gerapy-pyppeteer-example
```

Outputs:

```shell script
2020-07-13 01:49:13 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: example)
2020-07-13 01:49:13 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.7 (default, May 6 2020, 04:59:01) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform Darwin-19.4.0-x86_64-i386-64bit
2020-07-13 01:49:13 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2020-07-13 01:49:13 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'example',
'CONCURRENT_REQUESTS': 3,
'NEWSPIDER_MODULE': 'example.spiders',
'RETRY_HTTP_CODES': [403, 500, 502, 503, 504],
'SPIDER_MODULES': ['example.spiders']}
2020-07-13 01:49:13 [scrapy.extensions.telnet] INFO: Telnet Password: 83c276fb41754bd0
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'gerapy_pyppeteer.downloadermiddlewares.PyppeteerMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-07-13 01:49:13 [scrapy.core.engine] INFO: Spider opened
2020-07-13 01:49:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-13 01:49:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-13 01:49:13 [example.spiders.book] INFO: crawling https://dynamic5.scrape.center/page/1
2020-07-13 01:49:13 [gerapy.pyppeteer] DEBUG: processing request <GET https://dynamic5.scrape.center/page/1>
2020-07-13 01:49:13 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:14 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/page/1
2020-07-13 01:49:19 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: close pyppeteer
2020-07-13 01:49:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dynamic5.scrape.center/page/1> (referer: None)
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: processing request <GET https://dynamic5.scrape.center/detail/26898909>
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: processing request <GET https://dynamic5.scrape.center/detail/26861389>
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: processing request <GET https://dynamic5.scrape.center/detail/26855315>
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/26855315
2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/26861389
2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/26898909
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: close pyppeteer
2020-07-13 01:49:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dynamic5.scrape.center/detail/26861389> (referer: https://dynamic5.scrape.center/page/1)
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: processing request <GET https://dynamic5.scrape.center/page/2>
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://dynamic5.scrape.center/detail/26861389>
{'name': '壁穴ヘブンホール',
'score': '5.6',
'tags': ['BL漫画', '小基漫', 'BL', '『又腐又基』', 'BLコミック']}
2020-07-13 01:49:25 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
2020-07-13 01:49:25 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/page/2
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: close pyppeteer
2020-07-13 01:49:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dynamic5.scrape.center/detail/26855315> (referer: https://dynamic5.scrape.center/page/1)
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: processing request <GET https://dynamic5.scrape.center/detail/27047626>
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://dynamic5.scrape.center/detail/26855315>
{'name': '冒险小虎队', 'score': '9.4', 'tags': ['冒险小虎队', '童年', '冒险', '推理', '小时候读的']}
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/27047626
2020-07-13 01:49:27 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
2020-07-13 01:49:27 [gerapy.pyppeteer] DEBUG: close pyppeteer
...
```

gerapy_pyppeteer/__version__.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,3 +1,3 @@
-VERSION = (0, 0, '3')
+VERSION = (0, 0, '4')
 
 version = __version__ = '.'.join(map(str, VERSION))
```

gerapy_pyppeteer/downloadermiddlewares.py

Lines changed: 85 additions & 9 deletions

```diff
@@ -1,7 +1,11 @@
 import sys
 import asyncio
+
+from pyppeteer.errors import PageError, TimeoutError
 from scrapy.http import HtmlResponse
 import twisted.internet
+from scrapy.utils.python import global_object_name
+from scrapy.utils.response import response_status_message
 from twisted.internet.asyncioreactor import AsyncioSelectorReactor
 from twisted.internet.defer import Deferred
 from gerapy_pyppeteer.request import PyppeteerRequest
@@ -32,6 +36,45 @@ class PyppeteerMiddleware(object):
     Downloader middleware handling the requests with Puppeteer
     """

+    def _retry(self, request, reason, spider):
+        """
+        get retry request
+        :param request:
+        :param reason:
+        :param spider:
+        :return:
+        """
+        if not self.retry_enabled:
+            return
+
+        retries = request.meta.get('retry_times', 0) + 1
+        retry_times = self.max_retry_times
+
+        if 'max_retry_times' in request.meta:
+            retry_times = request.meta['max_retry_times']
+
+        stats = spider.crawler.stats
+        if retries <= retry_times:
+            logger.debug("Retrying %(request)s (failed %(retries)d times): %(reason)s",
+                         {'request': request, 'retries': retries, 'reason': reason},
+                         extra={'spider': spider})
+            retryreq = request.copy()
+            retryreq.meta['retry_times'] = retries
+            retryreq.dont_filter = True
+            retryreq.priority = request.priority + self.priority_adjust
+
+            if isinstance(reason, Exception):
+                reason = global_object_name(reason.__class__)
+
+            stats.inc_value('retry/count')
+            stats.inc_value('retry/reason_count/%s' % reason)
+            return retryreq
+        else:
+            stats.inc_value('retry/max_reached')
+            logger.error("Gave up retrying %(request)s (failed %(retries)d times): %(reason)s",
+                         {'request': request, 'retries': retries, 'reason': reason},
+                         extra={'spider': spider})
+
     @classmethod
     def from_crawler(cls, crawler):
         """
@@ -61,6 +104,13 @@ def from_crawler(cls, crawler):
         cls.disable_gpu = settings.get('GERAPY_PYPPETEER_DISABLE_GPU', GERAPY_PYPPETEER_DISABLE_GPU)
         cls.download_timeout = settings.get('GERAPY_PYPPETEER_DOWNLOAD_TIMEOUT',
                                             settings.get('DOWNLOAD_TIMEOUT', GERAPY_PYPPETEER_DOWNLOAD_TIMEOUT))
+        cls.ignore_resource_types = settings.get('GERAPY_IGNORE_RESOURCE_TYPES', GERAPY_IGNORE_RESOURCE_TYPES)
+
+        cls.retry_enabled = settings.getbool('RETRY_ENABLED')
+        cls.max_retry_times = settings.getint('RETRY_TIMES')
+        cls.retry_http_codes = set(int(x) for x in settings.getlist('RETRY_HTTP_CODES'))
+        cls.priority_adjust = settings.getint('RETRY_PRIORITY_ADJUST')
+
         return cls()

     async def _process_request(self, request: PyppeteerRequest, spider):
@@ -111,32 +161,55 @@ async def _process_request(self, request: PyppeteerRequest, spider):
         await page.setRequestInterception(True)

         @page.on('request')
-        async def _handle_headers(pu_request):
+        async def _handle_interception(pu_request):
+            # handle headers
             overrides = {
                 'headers': {
                     k.decode(): ','.join(map(lambda v: v.decode(), v))
                     for k, v in request.headers.items()
                 }
             }
-            await pu_request.continue_(overrides=overrides)
+            # handle resource types
+            _ignore_resource_types = self.ignore_resource_types
+            if request.ignore_resource_types is not None:
+                _ignore_resource_types = request.ignore_resource_types
+            if pu_request.resourceType in _ignore_resource_types:
+                await pu_request.abort()
+            else:
+                await pu_request.continue_(overrides)

         timeout = self.download_timeout
         if request.timeout is not None:
             timeout = request.timeout

         logger.debug('crawling %s', request.url)
-        response = await page.goto(
-            request.url,
-            options={
+
+        response = None
+        try:
+            options = {
                 'timeout': 1000 * timeout,
                 'waitUntil': request.wait_until
             }
-        )
+            logger.debug('request %s with options %s', request.url, options)
+            response = await page.goto(
+                request.url,
+                options=options
+            )
+        except (PageError, TimeoutError):
+            logger.error('error rendering url %s using pyppeteer', request.url)
+            await page.close()
+            await browser.close()
+            return self._retry(request, 504, spider)

         if request.wait_for:
-            logger.debug('waiting for %s finished', request.wait_for)
-            await page.waitFor(request.wait_for)
-            logger.debug('wait for %s finished', request.wait_for)
+            try:
+                logger.debug('waiting for %s finished', request.wait_for)
+                await page.waitFor(request.wait_for)
+            except TimeoutError:
+                logger.error('error waiting for %s of %s', request.wait_for, request.url)
+                await page.close()
+                await browser.close()
+                return self._retry(request, 504, spider)

         # evaluate script
         if request.script:
@@ -156,6 +229,9 @@ async def _handle_headers(pu_request):
         await page.close()
         await browser.close()

+        if not response:
+            logger.error('get null response by pyppeteer of url %s', request.url)
+
         # Necessary to bypass the compression middleware (?)
         response.headers.pop('content-encoding', None)
         response.headers.pop('Content-Encoding', None)
```
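A sketch, not part of this commit: the new `_retry` helper also honours a per-request `max_retry_times` value in `request.meta`, mirroring Scrapy's own `RetryMiddleware` convention, so a spider could raise the retry budget for one flaky page only. The spider name, URL, and selector below are borrowed from the README example; the callback name is hypothetical.

```python
import scrapy

from gerapy_pyppeteer.request import PyppeteerRequest


class BookSpider(scrapy.Spider):
    name = 'book'  # matches the README example spider

    def start_requests(self):
        yield PyppeteerRequest(
            'https://dynamic5.scrape.center/page/1',
            callback=self.parse_index,      # hypothetical callback
            wait_for='.item .name',
            meta={'max_retry_times': 5},    # checked by _retry() before falling back to RETRY_TIMES
        )

    def parse_index(self, response):
        pass  # parsing omitted in this sketch
```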

gerapy_pyppeteer/request.py

Lines changed: 3 additions & 2 deletions

```diff
@@ -6,8 +6,8 @@ class PyppeteerRequest(Request):
     Scrapy ``Request`` subclass providing additional arguments
     """

-    def __init__(self, url, callback=None, wait_until=None, wait_for=None, script=None, sleep=None, timeout=10,
-                 proxy=None, *args,
+    def __init__(self, url, callback=None, wait_until=None, wait_for=None, script=None, sleep=None, timeout=None,
+                 proxy=None, ignore_resource_types=None, *args,
                  **kwargs):
         """
         :param url: request url
@@ -26,5 +26,6 @@ def __init__(self, url, callback=None, wait_until=None, wait_for=None, script=No
         self.sleep = sleep
         self.proxy = proxy
         self.timeout = timeout
+        self.ignore_resource_types = ignore_resource_types

         super().__init__(url, callback, *args, **kwargs)
```
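To illustrate the new constructor argument (a sketch, not part of this commit): `ignore_resource_types` set on a single request overrides the project-wide `GERAPY_IGNORE_RESOURCE_TYPES` inside the interception handler shown above, and `timeout` is given in seconds; leaving it as `None` now defers to the middleware's download timeout instead of a hard-coded 10. The URL and selector are borrowed from the README example logs.

```python
from gerapy_pyppeteer.request import PyppeteerRequest

# Illustrative per-request resource filtering (assumed values, not from this commit).
request = PyppeteerRequest(
    'https://dynamic5.scrape.center/detail/26898909',
    wait_for='.item .name',
    ignore_resource_types=['image', 'font'],  # skip these downloads for this request only
    timeout=30,                               # seconds; None falls back to the middleware default
)
```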

gerapy_pyppeteer/settings.py

Lines changed: 6 additions & 0 deletions

```diff
@@ -21,3 +21,9 @@
 GERAPY_PYPPETEER_NO_SANDBOX = True
 GERAPY_PYPPETEER_DISABLE_SETUID_SANDBOX = True
 GERAPY_PYPPETEER_DISABLE_GPU = True
+
+# ignore resource types, ResourceType will be one of the following: ``document``,
+# ``stylesheet``, ``image``, ``media``, ``font``, ``script``,
+# ``texttrack``, ``xhr``, ``fetch``, ``eventsource``, ``websocket``,
+# ``manifest``, ``other``.
+GERAPY_IGNORE_RESOURCE_TYPES = []
```
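For example (a sketch, not part of this commit), a project that wants the headless browser to skip downloading images, media, and fonts would override the default in its own Scrapy settings; the chosen values here are only an assumption:

```python
# Scrapy project settings.py (illustrative). Values must be Pyppeteer resource
# type names from the list documented in the comment above.
GERAPY_IGNORE_RESOURCE_TYPES = ['image', 'media', 'font']
```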

requirements.txt

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,2 +1,2 @@
 scrapy>=2.0.0
-pyppeteer
+pyppeteer>=0.2.2
```

0 commit comments