@@ -6,6 +6,20 @@ To use Frontera with Scrapy, you will need to add `Scrapy middlewares`_ and rede
 custom Frontera scheduler. Both can be done by modifying `Scrapy settings`_.
 
 
+The purpose
+===========
+
+Scrapy is expected to be used as the fetching, HTML parsing and link extraction component. Your spider code has
+to produce requests from the extracted links. That's all. Frontera's job is to store the links, queue them
+and schedule them when needed.
+
+Please make sure all the middlewares affecting the crawling, such as DepthMiddleware, OffsiteMiddleware or
+RobotsTxtMiddleware, are disabled.
+
+All other use cases, where Scrapy is busy generating items, scraping data from HTML, or scheduling links directly to
+bypass Frontera, are doomed to cause countless hours of maintenance. Please don't use Frontera with Scrapy that way.
+
+
 Activating the frontier
 =======================
 
@@ -98,25 +112,65 @@ Writing Scrapy spider
 
 Spider logic
 ------------
-Creation of basic Scrapy spider is described at `Quick start single process`_ page.
 
-It's also a good practice to prevent spider from closing because of insufficiency of queued requests transport::
+Creation of a new Scrapy project is described on the `Quick start single process`_ page. Again, your spider code has
+to produce requests from the extracted links. Also, make sure exceptions raised during request processing are
+not intercepted by any of the middlewares; otherwise delivery of errors to the :term:`crawling strategy` will be broken.
+
+Here is example code to start with::
 
-    @classmethod
-    def from_crawler(cls, crawler, *args, **kwargs):
-        spider = cls(*args, **kwargs)
-        spider._set_crawler(crawler)
-        spider.crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
-        return spider
+    from scrapy import Spider
+    from scrapy.linkextractors import LinkExtractor
+    from scrapy.http import Request
+    from scrapy.http.response.html import HtmlResponse
+
+    class CommonPageSpider(Spider):
+
+        name = "commonpage"
+
+        def __init__(self, *args, **kwargs):
+            super(CommonPageSpider, self).__init__(*args, **kwargs)
+            self.le = LinkExtractor()
+
+        def parse(self, response):
+            if not isinstance(response, HtmlResponse):
+                return
+            for link in self.le.extract_links(response):
+                r = Request(url=link.url)
+                r.meta.update(link_text=link.text)
+                yield r
 
-    def spider_idle(self):
-        self.log("Spider idle signal caught.")
-        raise DontCloseSpider
 
 
 Configuration guidelines
 ------------------------
 
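For readers unfamiliar with ``LinkExtractor``, the extraction step the spider above delegates to it can be approximated with the standard library alone. This is purely illustrative (``AnchorCollector`` is a made-up name, not Scrapy code); the real ``LinkExtractor`` additionally canonicalizes URLs and applies allow/deny filters:

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs from <a> tags, roughly what
    LinkExtractor yields as Link objects (url, text)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Accumulate text only while inside an anchor
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

collector = AnchorCollector()
collector.feed('<a href="/about">About us</a> <a href="/jobs">Jobs</a>')
print(collector.links)  # [('/about', 'About us'), ('/jobs', 'Jobs')]
```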
+Please specify a correct user agent string to disclose yourself to webmasters::
+
+    USER_AGENT = 'Some-Bot (+http://url/to-the-page-describing-the-purpose-of-crawling)'
+
+
+When using Frontera, obeying robots.txt has to be implemented in the :term:`crawling strategy`, so disable it on the Scrapy side::
+
+    ROBOTSTXT_OBEY = False
+
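Since Scrapy no longer enforces robots.txt, the crawling strategy has to do the check before scheduling a link. A minimal, Frontera-agnostic sketch using the standard library's ``urllib.robotparser`` (the ``is_allowed`` helper and the hardcoded rules are illustrative assumptions; a real strategy would fetch and cache robots.txt per host):

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules; in a real crawling strategy the robots.txt body
# would be fetched from the target host and cached.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def make_parser(robots_body):
    """Build a parser from an already-fetched robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser

def is_allowed(parser, user_agent, url):
    """Hypothetical helper a crawling strategy could call before scheduling."""
    return parser.can_fetch(user_agent, url)

parser = make_parser(ROBOTS_TXT)
print(is_allowed(parser, "Some-Bot", "http://example.com/index.html"))  # True
print(is_allowed(parser, "Some-Bot", "http://example.com/private/x"))   # False
```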
+Disable some of the spider and downloader middlewares which may affect the crawling::
+
+    SPIDER_MIDDLEWARES.update({
+        'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
+        'scrapy.spidermiddlewares.referer.RefererMiddleware': None,
+        'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': None,
+        'scrapy.spidermiddlewares.depth.DepthMiddleware': None,
+        'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': None,
+    })
+
+    DOWNLOADER_MIDDLEWARES.update({
+        'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': None,
+    })
+
+    del DOWNLOADER_MIDDLEWARES_BASE['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware']
+
+
 There are several tunings you can make for efficient broad crawling.
 
 Various settings suitable for broad crawling::