From bc0f481a7355713978ee206d36a9356ab4be9d61 Mon Sep 17 00:00:00 2001
From: Mikhail Korobov
Date: Sun, 21 Sep 2014 07:12:01 +0600
Subject: [PATCH] DOC bring back notes about multiple spiders per process
 because it is now documented how to do that

---
 docs/topics/leaks.rst     | 30 ++++++++++++++++++++++++------
 docs/topics/practices.rst |  2 ++
 2 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/docs/topics/leaks.rst b/docs/topics/leaks.rst
index c838b3c3031..95bb882e93d 100644
--- a/docs/topics/leaks.rst
+++ b/docs/topics/leaks.rst
@@ -32,13 +32,16 @@ and that effectively bounds the lifetime of those referenced objects to the
 lifetime of the Request. This is, by far, the most common cause of memory
 leaks in Scrapy projects, and a quite difficult one to debug for newcomers.
 
+In big projects, the spiders are typically written by different people, and
+some of those spiders could be "leaking", affecting the rest of the
+(well-written) spiders when they run concurrently, which in turn affects the
+whole crawling process.
+
 The leak could also come from a custom middleware, pipeline or extension that
 you have written, if you are not releasing the (previously allocated) resources
-properly.
-
-It's hard to avoid the reasons that cause these leaks
-without restricting the power of the framework, so we have decided not to
-restrict the functionally but provide useful tools for debugging these leaks.
+properly. For example, allocating resources on :signal:`spider_opened`
+but not releasing them on :signal:`spider_closed` may cause problems if
+you're running :ref:`multiple spiders per process <run-multiple-spiders>`.
 
 .. _topics-leaks-trackrefs:
 
@@ -64,7 +67,10 @@ alias to the :func:`~scrapy.utils.trackref.print_live_refs` function::
   FormRequest                       878   oldest: 7s ago
 
 As you can see, that report also shows the "age" of the oldest object in each
-class.
+class. If you're running multiple spiders per process, chances are you can
+figure out which spider is leaking by looking at the oldest request or response.
+You can get the oldest object of each class using the
+:func:`~scrapy.utils.trackref.get_oldest` function (from the telnet console).
 
 Which objects are tracked?
 --------------------------
@@ -130,6 +136,18 @@ can use the :func:`scrapy.utils.trackref.iter_all` function::
    'http://www.somenastyspider.com/product.php?pid=584',
   ...
 
+Too many spiders?
+-----------------
+
+If your project has too many spiders running in parallel,
+the output of :func:`prefs()` can be difficult to read.
+For this reason, that function has an ``ignore`` argument which can be used to
+ignore a particular class (and all its subclasses). For example, this won't
+show any live references to spiders::
+
+    >>> from scrapy.spider import Spider
+    >>> prefs(ignore=Spider)
+
 .. module:: scrapy.utils.trackref
    :synopsis: Track references of live objects
 
diff --git a/docs/topics/practices.rst b/docs/topics/practices.rst
index b188ee56259..e9c7a94bfaf 100644
--- a/docs/topics/practices.rst
+++ b/docs/topics/practices.rst
@@ -69,6 +69,8 @@ the spider class as first argument in the :meth:`CrawlerRunner.crawl
 
 .. seealso:: `Twisted Reactor Overview`_.
 
+.. _run-multiple-spiders:
+
 Running multiple spiders in the same process
 ============================================
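
The warning added above about allocating resources on :signal:`spider_opened`
without releasing them on :signal:`spider_closed` can be pictured with a
minimal extension sketch; the class, attribute and file names below are
illustrative assumptions, not code from Scrapy or from this patch::

    from scrapy import signals

    class PerSpiderLogFiles(object):
        """Illustrative extension: keeps one open file per running spider."""

        def __init__(self):
            self.files = {}  # spider name -> open file handle

        @classmethod
        def from_crawler(cls, crawler):
            ext = cls()
            crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
            crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
            return ext

        def spider_opened(self, spider):
            # resource allocated when the spider starts
            self.files[spider.name] = open('%s.log' % spider.name, 'w')

        def spider_closed(self, spider):
            # releasing it here is what prevents the leak when many
            # spiders run in the same process
            self.files.pop(spider.name).close()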
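
As a quick illustration of the :func:`~scrapy.utils.trackref.get_oldest`
helper referenced above, a telnet-console sketch; the class name and the URL
shown are illustrative (the URL is reused from the ``iter_all`` example in
the docs)::

    >>> from scrapy.utils.trackref import get_oldest
    >>> # oldest live tracked object of that class, or None if none are alive
    >>> response = get_oldest('HtmlResponse')
    >>> response.url
    'http://www.somenastyspider.com/product.php?pid=123'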
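
Finally, a minimal sketch of the ``run-multiple-spiders`` scenario that the
new label cross-references, using the :class:`~scrapy.crawler.CrawlerProcess`
API from recent Scrapy versions; spider names and URLs are placeholders::

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class FirstSpider(scrapy.Spider):
        name = 'first'
        start_urls = ['http://example.com/']

        def parse(self, response):
            yield {'url': response.url}

    class SecondSpider(scrapy.Spider):
        name = 'second'
        start_urls = ['http://example.org/']

        def parse(self, response):
            yield {'url': response.url}

    process = CrawlerProcess()
    process.crawl(FirstSpider)
    process.crawl(SecondSpider)
    process.start()  # blocks here until both spiders finish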