Commit 0c8c58b

scrapy role, overview update
1 parent 246575a commit 0c8c58b

2 files changed: +68, -37 lines

docs/source/topics/overview.rst

Lines changed: 3 additions & 26 deletions
@@ -28,7 +28,7 @@ Here are few cases, external crawl frontier can be suitable for:
 
 * URL ordering/queueing isolation from the spider (e.g. distributed cluster of spiders, need of remote management of
   ordering/queueing),
-* URL (meta)data storage is needed (e.g. to demonstrate it's contents somewhere),
+* URL (meta)data storage is needed (e.g. to be able to pause and resume the crawl),
 * advanced URL ordering logic is needed, when it's hard to maintain code within spider/fetcher.
 
 
@@ -48,31 +48,8 @@ If website is big, and it's expensive to crawl the whole website, Frontera can b
 the most important documents.
 
 
-Distributed load, few websites
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-If website needs to be crawled faster than single spider one could use distributed spiders mode. In this mode Frontera
-is distributing spider processes and using one instance of backend worker. Requests are distributed using
-:term:`message bus` of your choice and distribution logic can be adjusted using custom partitioning. By default requests
-are distributed to spiders randomly, and desired request rate can be set in spiders.
-
-Consider also using proxy services, such as `Crawlera`_.
-
-
-Revisiting
-^^^^^^^^^^
-
-There is a set of websites and one need to re-crawl them on timely (or other) manner. Frontera provides simple
-revisiting backend, scheduling already visited documents for next visit using time interval set by option. This
-backend is using general relational database for persistence and can be used in single process or distributed
-spiders modes.
-
-Watchdog use case - when one needs to be notified about document changes, also could be addressed with such a backend
-and minor customization.
-
-
-Broad crawling
-^^^^^^^^^^^^^^
+Broad crawling of many websites
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 This use case requires full distribution: spiders and backend. In addition to spiders process one should be running
 :term:`strategy worker` (s) and :term:`db worker` (s), depending on chosen partitioning scheme.
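The fully distributed setup above spreads requests across spider instances according to a partitioning scheme. As an illustrative, stdlib-only sketch of host-based partitioning (this is not Frontera's actual partitioner API; the `partition` function and its signature are hypothetical):

```python
from urllib.parse import urlparse
from zlib import crc32

def partition(url: str, partitions: int) -> int:
    """Route a request to a spider partition by hashing its hostname,
    so all requests for one host are handled by the same spider instance."""
    host = urlparse(url).netloc
    return crc32(host.encode("utf-8")) % partitions

# All example.com requests land in the same partition.
assert partition("http://example.com/a", 4) == partition("http://example.com/b", 4)
```

Keeping one host per partition preserves per-host politeness limits; a random scheme (the default mentioned in the deleted section above) trades that for a more even load.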

docs/source/topics/scrapy-integration.rst

Lines changed: 65 additions & 11 deletions
@@ -6,6 +6,20 @@ To use Frontera with Scrapy, you will need to add `Scrapy middlewares`_ and rede
 custom Frontera scheduler. Both can be done by modifying `Scrapy settings`_.
 
 
+The purpose
+===========
+
+Scrapy is expected to be used as a fetching, HTML parsing and link extracting component. Your spider code has
+to produce responses and requests from the extracted links, and nothing more. Frontera's business is to store the
+links, queue them and schedule them when needed.
+
+Please make sure all middlewares affecting the crawl, such as DepthMiddleware, OffsiteMiddleware or
+RobotsTxtMiddleware, are disabled.
+
+All other use cases, where Scrapy is busy generating items, scraping data from HTML, or scheduling links directly
+to bypass Frontera, are doomed to cause countless hours of maintenance. Please don't use Frontera with Scrapy that way.
+
 Activating the frontier
 =======================
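The "Activating the frontier" section referenced above boils down to a few Scrapy settings. As a hedged sketch of what that typically looks like (the scheduler and middleware class paths follow Frontera's documentation, but verify them against your installed version; the `frontera_settings` module name is hypothetical):

```python
# settings.py fragment -- sketch, not a verbatim copy of Frontera's docs.
# Replace Scrapy's scheduler with the Frontera one.
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

# Hook Frontera into the request/response flow on both sides.
SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
}

# Module holding Frontera-specific settings (hypothetical project name).
FRONTERA_SETTINGS = 'myproject.frontera_settings'
```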
@@ -98,25 +112,65 @@ Writing Scrapy spider
 
 Spider logic
 ------------
-Creation of basic Scrapy spider is described at `Quick start single process`_ page.
 
-It's also a good practice to prevent spider from closing because of insufficiency of queued requests transport:::
+Creation of a new Scrapy project is described at the `Quick start single process`_ page. Again, your spider code has
+to produce responses and requests from the extracted links. Also, make sure exceptions caused by request processing
+are not intercepted by any of the middlewares, otherwise error delivery to the :term:`crawling strategy` will be broken.
+
+Here is an example code to start with::
 
-    @classmethod
-    def from_crawler(cls, crawler, *args, **kwargs):
-        spider = cls(*args, **kwargs)
-        spider._set_crawler(crawler)
-        spider.crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
-        return spider
+    from scrapy import Spider
+    from scrapy.linkextractors import LinkExtractor
+    from scrapy.http import Request
+    from scrapy.http.response.html import HtmlResponse
+
+    class CommonPageSpider(Spider):
+
+        name = "commonpage"
+
+        def __init__(self, *args, **kwargs):
+            super(CommonPageSpider, self).__init__(*args, **kwargs)
+            self.le = LinkExtractor()
+
+        def parse(self, response):
+            # Emit follow-up requests only for HTML responses.
+            if not isinstance(response, HtmlResponse):
+                return
+            for link in self.le.extract_links(response):
+                r = Request(url=link.url)
+                r.meta.update(link_text=link.text)
+                yield r
 
-    def spider_idle(self):
-        self.log("Spider idle signal caught.")
-        raise DontCloseSpider
 
 
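The `CommonPageSpider` above relies on Scrapy's `LinkExtractor` to do the actual work. To make the "extract links, emit follow-ups" loop concrete without a Scrapy installation, here is a stdlib-only sketch of the same idea (the `LinkParser` class is purely illustrative, not part of Scrapy or Frontera):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect absolute URLs from <a href> attributes -- similar in spirit
    to what scrapy.linkextractors.LinkExtractor does for the spider above."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))

html = '<a href="/about">About</a> <a href="http://example.org/">Ext</a>'
parser = LinkParser("http://example.com/")
parser.feed(html)
print(parser.links)  # ['http://example.com/about', 'http://example.org/']
```

Each collected URL would then be wrapped in a `Request` and yielded, exactly as the spider's `parse` method does.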
 Configuration guidelines
 ------------------------
 
+Please specify a correct user agent string to disclose yourself to webmasters::
+
+    USER_AGENT = 'Some-Bot (+http://url/to-the-page-describing-the-purpose-of-crawling)'
+
+
+When using Frontera, robots.txt obeying has to be implemented in the :term:`crawling strategy`::
+
+    ROBOTSTXT_OBEY = False
+
+Disable some of the spider and downloader middlewares which may affect the crawl::
+
+    SPIDER_MIDDLEWARES.update({
+        'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
+        'scrapy.spidermiddlewares.referer.RefererMiddleware': None,
+        'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': None,
+        'scrapy.spidermiddlewares.depth.DepthMiddleware': None,
+        'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': None
+    })
+
+    DOWNLOADER_MIDDLEWARES.update({
+        'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': None,
+    })
+
+    del DOWNLOADER_MIDDLEWARES_BASE['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware']
+
+
 There are several tunings you can make for efficient broad crawling.
 
 Various settings suitable for broad crawling::