Kenya Law gazette scraper built on Scrapy
- Clone the repo and cd into it
- Create and activate a virtual environment
- Run `pip install -r requirements.txt`
- Set the following environment variables (a combined setup example follows this list):
  - `SCRAPY_AWS_ACCESS_KEY_ID` - Get this from AWS
  - `SCRAPY_AWS_SECRET_ACCESS_KEY` - Get this from AWS
  - `SCRAPY_FEED_URI=s3://name-of-bucket-here/gazettes/data.jsonlines` - Where you want the jsonlines output for crawls to be saved. This can also be a local location
  - `SCRAPY_FILES_STORE=s3://name-of-bucket-here/gazettes` - Where you want scraped gazettes to be stored. This can also be a local location
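Putting the steps together, a minimal local setup might look like this (the repository URL, credentials, and bucket name are placeholders, not real values):

```sh
# Placeholder URL - substitute the actual repository
git clone https://github.com/example/kenya-gazette-scraper.git
cd kenya-gazette-scraper

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Credentials come from AWS IAM; the bucket name is a placeholder
export SCRAPY_AWS_ACCESS_KEY_ID=AKIA...
export SCRAPY_AWS_SECRET_ACCESS_KEY=...
export SCRAPY_FEED_URI=s3://name-of-bucket-here/gazettes/data.jsonlines
export SCRAPY_FILES_STORE=s3://name-of-bucket-here/gazettes
```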
Deploying to Scraping Hub
It is recommended that you deploy your crawler to Scrapinghub for easy management. Follow these steps:
- Sign up for a free Scrapinghub account here
- Install shub locally using `pip install shub`. Further instructions here
- Run `shub login`
- Run `shub deploy` (the full sequence is shown below)
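The whole deploy sequence, assuming defaults, is roughly:

```sh
pip install shub
shub login     # prompts for your Scrapinghub API key
shub deploy    # prompts for a target project ID on first run
```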
Note that on Scrapinghub, environment variables don't need the `SCRAPY_` prefix; for example, set `AWS_ACCESS_KEY_ID` rather than `SCRAPY_AWS_ACCESS_KEY_ID`.
Avoiding re-crawls with scrapy-deltafetch
scrapy-deltafetch needs the bsddb3 package, which in turn needs Berkeley DB. On macOS:
- `brew install berkeley-db`
- `export YES_I_HAVE_THE_RIGHT_TO_USE_THIS_BERKELEY_DB_VERSION=1`
- `BERKELEYDB_DIR=$(brew --cellar)/berkeley-db/6.2.23 pip install bsddb3`. Replace `6.2.23` with the version of berkeley-db that you installed
- `pip install scrapy-deltafetch`
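Once installed, DeltaFetch is switched on in the project settings. A minimal sketch of the relevant `settings.py` lines, following the scrapy-deltafetch README:

```python
# settings.py - enable DeltaFetch so pages already scraped in earlier
# crawls are skipped on later runs
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}
DELTAFETCH_ENABLED = True
```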