BlogForever crawler

Install for Python 2.6:

pip install scrapy==0.18.4
pip install lxml httplib2 feedparser selenium python-Levenshtein
download PhantomJS from http://phantomjs.org/download.html and install the binary to /opt/phantomjs/bin/phantomjs

Run:

scrapy crawl newcrawl -a startat=http://www.quantumdiaries.org/
scrapy crawl updatecrawl -a startat=http://www.quantumdiaries.org/ -a since=1388593000
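
The since argument is the date of the last crawl as a Unix timestamp (see "date of last crawl" in the TODO below). A minimal way to compute one, assuming updatecrawl interprets it as UTC seconds since the epoch:

import calendar
import time

# 2014-01-01 00:00:00 UTC as a Unix timestamp (assumption: updatecrawl
# treats `since` as UTC seconds since the epoch)
since = calendar.timegm(time.strptime("2014-01-01", "%Y-%m-%d"))
print(since)  # 1388534400; pass it as -a since=1388534400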

Test:

pip install pytest pytest-incremental
py.test

Source tree with module docstrings:

bibcrawl
├── model
│   ├── commentitem.py: Blog comment Item
│   ├── objectitem.py: Super class of comment and post item
│   └── postitem.py: Blog post Item
├── pipelines
│   ├── backendpropagate.py: Saves the item in the back-end
│   ├── downloadfeeds.py: Downloads the comments web feed
│   ├── downloadimages.py: Downloads images
│   ├── extractcomments.py: Extracts all comments from the HTML using the comment feed
│   ├── files.py: Files pipeline back-ported to Python 2.6
│   └── processhtml.py: Processes HTML to extract article, title, and author
│   └── renderjavascript.py: Renders the original page with PhantomJS and takes a screenshot
├── spiders
│   ├── newcrawl.py: Entirely crawls a new blog
│   ├── rsscrawl.py: Super class of new and update crawl
│   └── updatecrawl.py: Partially crawls a blog for new content from the web feed
├── utils
│   ├── contentextractor.py: Extracts the content of blog posts using an RSS feed
│   ├── ohpython.py: Essential functions that should have been part of the Python core
│   ├── parsing.py: Parsing functions
│   ├── priorityheuristic.py: Priority heuristic for page download; favors pages with links to posts
│   └── stringsimilarity.py: Dice's coefficient similarity function (see the sketch after this tree)
│   └── webdriverpool.py: Pool of PhantomJS processes to parallelize page rendering
├── blogmonitor.py: Queries the database and starts new and update crawls when needed
└── settings.py: Scrapy settings
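
stringsimilarity.py provides Dice's coefficient, D(X, Y) = 2|X ∩ Y| / (|X| + |Y|) over the bigram sets of two strings. A minimal sketch of such a function (the dice name and the character-bigram choice here are illustrative; the actual implementation may differ):

def dice(a, b):
    # Character bigram sets of both strings
    bigrams = lambda s: set(s[i:i+2] for i in range(len(s) - 1))
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0  # by convention, two empty bigram sets count as identical
    return 2.0 * len(x & y) / (len(x) + len(y))

print(dice("night", "nacht"))  # 0.25: only the "ht" bigram is shared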

TODO:

Add to the DB, per blog:

  • link to the web feed
  • latest ETag of this feed
  • date of last crawl (Unix timestamp)

Blog monitor algorithm:

If the feed is fresh (there is new content since the last crawl), start an updatecrawl with the date of the last crawl; otherwise we are fine for this blog.
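
A minimal sketch of that check, assuming the per-blog DB record carries the fields listed in the TODO above and that freshness is detected with a conditional GET on the web feed (httplib2 is already a dependency). check_blog and the field names are hypothetical, not the actual blogmonitor.py code:

import subprocess
import httplib2

def check_blog(blog):
    # blog is the per-blog DB record, e.g. {"url": ..., "feed_url": ...,
    # "etag": ..., "last_crawl": ...} (field names are assumptions)
    response, _ = httplib2.Http().request(
        blog["feed_url"], "GET",
        headers={"If-None-Match": blog["etag"]})
    if response.status == 304:
        return  # feed unchanged: we are fine for this blog
    # Fresh content: partially re-crawl since the last crawl date,
    # then write the new ETag and crawl date back to the DB.
    subprocess.call(["scrapy", "crawl", "updatecrawl",
                     "-a", "startat=" + blog["url"],
                     "-a", "since=" + str(blog["last_crawl"])])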