A simple concurrent web spider for Python, built on gevent
- Concurrency built on gevent
- Highly configurable crawl strategy:
  - Max depth
  - Max total number of URLs
  - Max number of concurrent HTTP requests (to avoid DoS-ing the target site)
  - Request headers and cookies
  - Same-host strategy
  - Same-domain strategy
  - Max running time
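
To make the strategy options above concrete, here is a minimal sketch of how the depth, URL-count, same-host and same-domain checks could be combined into a single filter. It is not taken from spider.py: the `CrawlPolicy` class, its parameter names, and the rule that "same domain" means the root host with a leading `www.` stripped are all assumptions made for illustration.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7


class CrawlPolicy(object):
    """Hypothetical holder for the crawl limits (not the Spider's real API)."""

    def __init__(self, root_url, max_depth=3, max_urls=100,
                 same_host=True, same_domain=False):
        self.root_host = urlparse(root_url).hostname or ""
        self.max_depth = max_depth
        self.max_urls = max_urls
        self.same_host = same_host
        self.same_domain = same_domain
        self.seen = set()  # URLs already accepted

    def _in_domain(self, host):
        # Assumption: the "domain" is the root host without a leading "www.".
        base = self.root_host
        if base.startswith("www."):
            base = base[4:]
        return host == base or host.endswith("." + base)

    def allows(self, url, depth):
        """Return True and record the URL if it passes every enabled check."""
        host = urlparse(url).hostname or ""
        if depth > self.max_depth:
            return False
        if url in self.seen or len(self.seen) >= self.max_urls:
            return False
        if self.same_host and host != self.root_host:
            return False
        if self.same_domain and not self._in_domain(host):
            return False
        self.seen.add(url)
        return True
```

With `CrawlPolicy("http://www.sina.com.cn", same_host=False, same_domain=True)`, a link to `http://news.sina.com.cn/a` would be accepted while a link to `http://example.com/` would be rejected. The concurrency and running-time limits are left out here, since they belong to the scheduler rather than to URL filtering.
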
- Python 2.7
- gevent 1.0dev - gevent 1.0 final
- requests 1.0.3
- pyquery 1.2.4
python spider.py -v
import logging
from spider import Spider

# Log at DEBUG level to follow the crawl as it runs.
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s %(levelname)s %(message)s')

spider = Spider()
spider.setRootUrl("http://www.sina.com.cn")  # URL the crawl starts from
spider.run()
- Support distributed crawling: replace gevent.Queue with redis.Queue
- Configurable storage backend
- Support Ajax URLs (WebKit, etc.)
Copyright © 2013 by kenshin
Released under the MIT license: rem.mit-license.org