A collection of Scrapy spiders that transform government media releases into RSS feeds. The aim is to make these media releases more readily available to members of the public, so it's easier to keep up to date with state governments.

The feeds are available through the website and via gov-rss/rss.
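As a rough illustration of how such a spider is shaped, the sketch below fetches a listing page and yields one item per media release. The spider name, start URL and CSS selectors are hypothetical placeholders, not code from this repository.

```python
# Minimal sketch only: the spider name, URL and selectors are hypothetical
# and do not correspond to an actual spider in this repository.
import scrapy


class ExamplePremierSpider(scrapy.Spider):
    name = "example-prem"
    start_urls = ["https://www.example.gov.au/media-releases"]

    def parse(self, response):
        # Yield one item per media release listed on the page.
        for release in response.css("article.media-release"):
            yield {
                "title": release.css("h2 a::text").get(),
                "link": response.urljoin(release.css("h2 a::attr(href)").get()),
                "date": release.css("time::attr(datetime)").get(),
            }
```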
Install the dependencies with pip:

```sh
$ pip install -r requirements.txt
```

or with conda:

```sh
$ conda create --file=environment.yaml
$ conda activate gov-scrape
```

A Docker image is also available:

```sh
$ docker pull callumskeet/gov-scrape
# or
$ docker build -f splash.Dockerfile -t gov-scrape .
```

Run the spiders with:

```sh
$ scrapy crawl <spider-name>  # one spider
$ ./crawl.sh                  # all spiders
```

or with Docker:

```sh
$ docker-compose up -d  # runs crawl.sh then exits

# Volumes: feeds/ stores the rss files, logs/ holds scrapy's log files,
# and .scrapy/httpcache caches content from crawled pages.
$ docker run \
    --name gov-scrape \
    --rm \
    -v $FEED_DIR:/gov-scrape/feeds \
    -v $LOG_DIR:/gov-scrape/logs \
    -v $CACHE_DIR:/gov-scrape/.scrapy/httpcache \
    -it gov-scrape  # crawls with all spiders
```

The regular shell commands also work with Docker, e.g. `scrapy crawl vic-prem` can be passed to the container.
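For context, the sketch below shows one way scraped items could be serialised into an RSS file. It assumes the third-party feedgen package and is not the repository's actual feed-export code; the feed metadata and output path are placeholders.

```python
# Sketch only: assumes the third-party `feedgen` package (pip install feedgen);
# this is not the repository's actual feed-export code.
from feedgen.feed import FeedGenerator


def write_rss(items, path="feeds/example-prem.xml"):
    """Serialise scraped media-release items into an RSS 2.0 file."""
    fg = FeedGenerator()
    fg.title("Example Premier - Media Releases")
    fg.link(href="https://www.example.gov.au/media-releases", rel="alternate")
    fg.description("Media releases scraped from a hypothetical government site.")

    for item in items:
        entry = fg.add_entry()
        entry.title(item["title"])
        entry.link(href=item["link"])
        entry.description(item.get("date") or item["title"])

    fg.rss_file(path)  # write the XML document, e.g. into the mounted feeds/ volume
```

Run once per spider, a step like this would produce one feed file per source in the feeds/ directory (the same directory mounted as a volume in the Docker example above).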
* In the process of getting permission from Facebook to scrape the SA Labor page
A few sources already had RSS feeds available; these are listed below:
Copyright (c) 2021 Callum Skeet under the MIT License