Commit 0c8c58b

scrapy role, overview update
1 parent 246575a commit 0c8c58b

2 files changed: +68, -37 lines

docs/source/topics/overview.rst

Lines changed: 3 additions & 26 deletions
@@ -28,7 +28,7 @@ Here are few cases, external crawl frontier can be suitable for:
 
 * URL ordering/queueing isolation from the spider (e.g. distributed cluster of spiders, need of remote management of
   ordering/queueing),
-* URL (meta)data storage is needed (e.g. to demonstrate it's contents somewhere),
+* URL (meta)data storage is needed (e.g. to be able to pause and resume the crawl),
 * advanced URL ordering logic is needed, when it's hard to maintain code within spider/fetcher.
 
 
@@ -48,31 +48,8 @@ If website is big, and it's expensive to crawl the whole website, Frontera can b
 the most important documents.
 
 
-Distributed load, few websites
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-If website needs to be crawled faster than single spider one could use distributed spiders mode. In this mode Frontera
-is distributing spider processes and using one instance of backend worker. Requests are distributed using
-:term:`message bus` of your choice and distribution logic can be adjusted using custom partitioning. By default requests
-are distributed to spiders randomly, and desired request rate can be set in spiders.
-
-Consider also using proxy services, such as `Crawlera`_.
-
-
-Revisiting
-^^^^^^^^^^
-
-There is a set of websites and one need to re-crawl them on timely (or other) manner. Frontera provides simple
-revisiting backend, scheduling already visited documents for next visit using time interval set by option. This
-backend is using general relational database for persistence and can be used in single process or distributed
-spiders modes.
-
-Watchdog use case - when one needs to be notified about document changes, also could be addressed with such a backend
-and minor customization.
-
-
-Broad crawling
-^^^^^^^^^^^^^^
+Broad crawling of many websites
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 This use case requires full distribution: spiders and backend. In addition to spiders process one should be running
 :term:`strategy worker` (s) and :term:`db worker` (s), depending on chosen partitioning scheme.
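The fully distributed setup above spreads requests across spider instances according to a partitioning scheme. As an illustrative, stdlib-only sketch of host-based partitioning (this is not Frontera's actual partitioner API; the `partition` function and its signature are hypothetical):

```python
from urllib.parse import urlparse
from zlib import crc32

def partition(url: str, partitions: int) -> int:
    """Route a request to a spider partition by hashing its hostname,
    so all requests for one host are handled by the same spider instance."""
    host = urlparse(url).netloc
    return crc32(host.encode("utf-8")) % partitions

# All example.com requests land in the same partition.
assert partition("http://example.com/a", 4) == partition("http://example.com/b", 4)
```

Keeping one host per partition preserves per-host politeness limits; a random scheme (the default mentioned in the deleted section above) trades that for a more even load.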

docs/source/topics/scrapy-integration.rst

Lines changed: 65 additions & 11 deletions
@@ -6,6 +6,20 @@ To use Frontera with Scrapy, you will need to add `Scrapy middlewares`_ and rede
 custom Frontera scheduler. Both can be done by modifying `Scrapy settings`_.
 
 
+The purpose
+===========
+
+Scrapy is expected to be used as a fetching, HTML parsing and link extracting component. Your spider code has
+to produce responses and requests from the extracted links, and nothing more. Frontera's business is to store the
+links, queue them and schedule them when needed.
+
+Please make sure all middlewares affecting the crawl, such as DepthMiddleware, OffsiteMiddleware or
+RobotsTxtMiddleware, are disabled.
+
+All other use cases, where Scrapy is busy generating items, scraping data from HTML, or scheduling links directly
+to bypass Frontera, are doomed to cause countless hours of maintenance. Please don't use Frontera with Scrapy that way.
+
 Activating the frontier
 =======================
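The "Activating the frontier" section referenced above boils down to a few Scrapy settings. As a hedged sketch of what that typically looks like (the scheduler and middleware class paths follow Frontera's documentation, but verify them against your installed version; the `frontera_settings` module name is hypothetical):

```python
# settings.py fragment -- sketch, not a verbatim copy of Frontera's docs.
# Replace Scrapy's scheduler with the Frontera one.
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

# Hook Frontera into the request/response flow on both sides.
SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
}

# Module holding Frontera-specific settings (hypothetical project name).
FRONTERA_SETTINGS = 'myproject.frontera_settings'
```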
@@ -98,25 +112,65 @@ Writing Scrapy spider
 
 Spider logic
 ------------
-Creation of basic Scrapy spider is described at `Quick start single process`_ page.
 
-It's also a good practice to prevent spider from closing because of insufficiency of queued requests transport:::
+Creation of a new Scrapy project is described at the `Quick start single process`_ page. Again, your spider code has
+to produce responses and requests from the extracted links. Also, make sure exceptions caused by request processing
+are not intercepted by any of the middlewares, otherwise error delivery to the :term:`crawling strategy` will be broken.
+
+Here is an example code to start with::
 
-    @classmethod
-    def from_crawler(cls, crawler, *args, **kwargs):
-        spider = cls(*args, **kwargs)
-        spider._set_crawler(crawler)
-        spider.crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
-        return spider
+    from scrapy import Spider
+    from scrapy.linkextractors import LinkExtractor
+    from scrapy.http import Request
+    from scrapy.http.response.html import HtmlResponse
+
+    class CommonPageSpider(Spider):
+
+        name = "commonpage"
+
+        def __init__(self, *args, **kwargs):
+            super(CommonPageSpider, self).__init__(*args, **kwargs)
+            self.le = LinkExtractor()
+
+        def parse(self, response):
+            # Emit follow-up requests only for HTML responses.
+            if not isinstance(response, HtmlResponse):
+                return
+            for link in self.le.extract_links(response):
+                r = Request(url=link.url)
+                r.meta.update(link_text=link.text)
+                yield r
 
-    def spider_idle(self):
-        self.log("Spider idle signal caught.")
-        raise DontCloseSpider
 
 
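The `CommonPageSpider` above relies on Scrapy's `LinkExtractor` to do the actual work. To make the "extract links, emit follow-ups" loop concrete without a Scrapy installation, here is a stdlib-only sketch of the same idea (the `LinkParser` class is purely illustrative, not part of Scrapy or Frontera):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect absolute URLs from <a href> attributes -- similar in spirit
    to what scrapy.linkextractors.LinkExtractor does for the spider above."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))

html = '<a href="/about">About</a> <a href="http://example.org/">Ext</a>'
parser = LinkParser("http://example.com/")
parser.feed(html)
print(parser.links)  # ['http://example.com/about', 'http://example.org/']
```

Each collected URL would then be wrapped in a `Request` and yielded, exactly as the spider's `parse` method does.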
 Configuration guidelines
 ------------------------
 
+Please specify a correct user agent string to disclose yourself to webmasters::
+
+    USER_AGENT = 'Some-Bot (+http://url/to-the-page-describing-the-purpose-of-crawling)'
+
+
+When using Frontera, robots.txt obeying has to be implemented in the :term:`crawling strategy`::
+
+    ROBOTSTXT_OBEY = False
+
+Disable some of the spider and downloader middlewares which may affect the crawl::
+
+    SPIDER_MIDDLEWARES.update({
+        'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
+        'scrapy.spidermiddlewares.referer.RefererMiddleware': None,
+        'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': None,
+        'scrapy.spidermiddlewares.depth.DepthMiddleware': None,
+        'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': None
+    })
+
+    DOWNLOADER_MIDDLEWARES.update({
+        'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': None,
+    })
+
+    del DOWNLOADER_MIDDLEWARES_BASE['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware']
+
+
 There are several tunings you can make for efficient broad crawling.
 
 Various settings suitable for broad crawling::