DOC bring back notes about multiple spiders per process because it is now documented how to do that
kmike committed Sep 21, 2014
1 parent a122fdb commit bc0f481
Showing 2 changed files with 26 additions and 6 deletions.
30 changes: 24 additions & 6 deletions docs/topics/leaks.rst
@@ -32,13 +32,16 @@ and that effectively bounds the lifetime of those referenced objects to the
lifetime of the Request. This is, by far, the most common cause of memory leaks
in Scrapy projects, and a quite difficult one to debug for newcomers.

In big projects, the spiders are typically written by different people, and some
of those spiders could be "leaking", thus affecting the rest of the
(well-written) spiders when they run concurrently, which, in turn,
affects the whole crawling process.

The leak could also come from a custom middleware, pipeline or extension that
you have written, if you are not releasing the (previously allocated) resources
properly.

It's hard to avoid the causes of these leaks
without restricting the power of the framework, so we have decided not to
restrict the functionality but to provide useful tools for debugging these leaks.
properly. For example, allocating resources on :signal:`spider_opened`
but not releasing them on :signal:`spider_closed` may cause problems if
you're running :ref:`multiple spiders per process <run-multiple-spiders>`.
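The pairing described above can be sketched as a small, self-contained class in the style of an extension (the class name, the per-spider buffer, and the plain-string spiders are illustrative assumptions, not Scrapy's API): whatever is allocated when a spider opens is released, under the same key, when it closes.

```python
import io

class PerSpiderResources:
    """Illustrative sketch (not Scrapy's API): pair every allocation made
    on spider_opened with a release on spider_closed, keyed by spider,
    so nothing accumulates when several spiders share one process."""

    def __init__(self):
        self.buffers = {}  # spider -> per-spider resource (a buffer here)

    def spider_opened(self, spider):
        # allocate when the spider starts...
        self.buffers[spider] = io.StringIO()

    def spider_closed(self, spider):
        # ...and always release when it finishes; pop() also drops our
        # reference to the spider so it can be garbage collected
        self.buffers.pop(spider).close()

ext = PerSpiderResources()
ext.spider_opened('spider-a')
ext.spider_opened('spider-b')
ext.spider_closed('spider-a')  # 'spider-a' entry is released and removed
```

Keeping the mapping keyed by spider (rather than in module-level state) is what makes the pattern safe under multiple concurrent spiders.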

.. _topics-leaks-trackrefs:

@@ -64,7 +67,10 @@ alias to the :func:`~scrapy.utils.trackref.print_live_refs` function::
FormRequest 878 oldest: 7s ago

As you can see, that report also shows the "age" of the oldest object in each
class.
class. If you're running multiple spiders per process, chances are you can
figure out which spider is leaking by looking at the oldest request or response.
You can get the oldest object of each class using the
:func:`~scrapy.utils.trackref.get_oldest` function (from the telnet console).
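The mechanism behind ``trackref`` can be approximated with the standard library alone. The sketch below (all names are illustrative, not Scrapy's actual implementation) uses a ``WeakKeyDictionary`` so tracked objects disappear from the registry as soon as they are garbage collected, and finds the oldest live instance much as ``get_oldest`` does:

```python
import itertools
import weakref

# Illustrative sketch of trackref-style tracking (not Scrapy's code).
# Weak keys mean entries vanish once the tracked object is collected.
live_refs = weakref.WeakKeyDictionary()  # object -> creation order
_counter = itertools.count()

class TrackedRequest:
    def __init__(self, url):
        self.url = url
        live_refs[self] = next(_counter)

def get_oldest(cls):
    """Return the oldest live instance of cls, or None."""
    candidates = [(order, obj) for obj, order in live_refs.items()
                  if isinstance(obj, cls)]
    return min(candidates, key=lambda pair: pair[0])[1] if candidates else None

first = TrackedRequest('http://example.com/1')
second = TrackedRequest('http://example.com/2')
oldest = get_oldest(TrackedRequest)  # the earliest-created live object
```

Using a creation counter rather than timestamps keeps the "oldest" comparison deterministic; Scrapy's real ``get_oldest`` is available from the telnet console as noted above.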

Which objects are tracked?
--------------------------
@@ -130,6 +136,18 @@ can use the :func:`scrapy.utils.trackref.iter_all` function::
'http://www.somenastyspider.com/product.php?pid=584',
...

Too many spiders?
-----------------

If your project has too many spiders executed in parallel,
the output of :func:`prefs()` can be difficult to read.
For this reason, that function has an ``ignore`` argument which can be used to
ignore a particular class (and all its subclasses). For
example, this won't show any live references to spiders::

>>> from scrapy.spider import Spider
>>> prefs(ignore=Spider)

.. module:: scrapy.utils.trackref
:synopsis: Track references of live objects

2 changes: 2 additions & 0 deletions docs/topics/practices.rst
@@ -69,6 +69,8 @@ the spider class as first argument in the :meth:`CrawlerRunner.crawl

.. seealso:: `Twisted Reactor Overview`_.

.. _run-multiple-spiders:

Running multiple spiders in the same process
============================================

