kenshinx/second-spider

second-spider

A simple concurrent web spider for Python, built on gevent.

Features

  1. Concurrency is built on gevent
  2. The crawl strategy is highly configurable:
  • Maximum crawl depth
  • Maximum total number of URLs
  • Maximum number of concurrent HTTP requests, to avoid DoS-ing the target
  • Request headers and cookies
  • Same-host strategy
  • Same-domain strategy
  • Maximum running time
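
To illustrate how the depth, total-URL, same-host, and same-domain limits listed above might interact, here is a minimal standard-library sketch. It is not the code in spider.py: the names `same_host`, `same_domain`, `crawl`, and `fetch_links` are hypothetical, the crawl is sequential rather than gevent-based, and "same domain" is assumed to mean "the given base domain or any of its subdomains".

```python
from collections import deque
from urllib.parse import urlparse


def same_host(url, root_url):
    """True when both URLs have exactly the same hostname."""
    host = urlparse(url).hostname
    return host is not None and host == urlparse(root_url).hostname


def same_domain(url, base_domain):
    """True when the URL's host is base_domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    base = base_domain.lower()
    return host == base or host.endswith("." + base)


def crawl(root_url, fetch_links, max_depth=2, max_total=100, in_scope=None):
    """Breadth-first crawl bounded by max_depth and max_total URLs.

    fetch_links(url) is a caller-supplied (hypothetical) function that
    returns the list of links found on the page at url.
    """
    if in_scope is None:
        # Default to the same-host strategy (an assumption of this sketch).
        in_scope = lambda url: same_host(url, root_url)
    seen = {root_url}
    queue = deque([(root_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in fetch_links(url):
            if len(seen) >= max_total:
                break  # total-URL budget exhausted
            if link not in seen and in_scope(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Passing `in_scope=lambda u: same_domain(u, "sina.com.cn")` would switch the sketch from the same-host to the same-domain strategy. A concurrent version would additionally need to cap in-flight requests and enforce a running-time limit, which this sequential sketch leaves out.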

Dependencies

  • gevent
  • requests
  • pyquery

Test

python spider.py -v

Example

import logging
from spider import Spider

# Log every message at DEBUG level and above, with a timestamp
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s %(levelname)s %(message)s')

# Crawl starting from the root URL
spider = Spider()
spider.setRootUrl("http://www.sina.com.cn")
spider.run()

TODO

  • Support distributed crawling: replace gevent.Queue with a Redis-backed queue
  • Make the storage backend highly configurable
  • Support Ajax URLs (via WebKit, etc.)

LICENSE

Copyright © 2013 by kenshin

Released under the MIT license: rem.mit-license.org

About

One more spider, based on gevent, requests, and pyquery.
