An open source example of the Count Love crawler.

Count Love Crawler

Installation

To isolate the crawler and its dependencies, it is recommended that you install them inside a Python virtual environment.

Tested with Python 3.9 (but should be compatible with a range of versions).
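
For example, you can create and activate a virtual environment with Python's built-in venv module (the "env" directory name below is just an example):

python3 -m venv env
source env/bin/activate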

Install dependencies

To install dependencies, run:

pip install -r requirements.txt

Set up the SQLite database

The SQLite3 database stores the source list, the crawler queue, and the content extracted from pages. To create the database, run:

sqlite3 data.db < schema.sql
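
To confirm that the schema was applied, you can list the tables in the new database:

sqlite3 data.db ".tables"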

Running the crawl

To start the crawl, run:

python crawler.py

While the crawl is running, details and diagnostic information are logged to "crawl.log". Because the Sources table is initially empty, running python crawler.py has no effect until a source is added. Here's an example of how to add a source by interacting directly with the database:

sqlite3 data.db
INSERT INTO Sources VALUES (NULL, 'https://nytimes.com', 'New York, NY', 1, datetime('now'), NULL);
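
The same row can also be added from Python using the standard library's sqlite3 module. This is only a sketch: the column meanings (an autoincrementing id, source URL, location, an enabled flag, the date added, and a final NULL column) are inferred from the statement above, so check schema.sql for the authoritative definitions.

import sqlite3

# Minimal sketch: insert a source row with the same values as the SQL example above.
# The column layout is an assumption based on that statement; see schema.sql.
conn = sqlite3.connect("data.db")
conn.execute(
    "INSERT INTO Sources VALUES (NULL, ?, ?, 1, datetime('now'), NULL)",
    ("https://nytimes.com", "New York, NY"),
)
conn.commit()
conn.close()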

Rerunning python crawler.py will now print a list of potential articles with protest keywords to the console.
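
To follow the log output while the crawl runs, you can tail it from another terminal:

tail -f crawl.log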
