A collection of Scrapy spiders that transform government media releases into RSS feeds. The aim is to make these media releases more readily available to members of the public, so it's easier to keep up to date with state governments.

The feeds are available through the website and via gov-rss/rss.
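As a rough illustration of how such a spider is shaped, the sketch below fetches a listing page and yields one item per media release. The spider name, start URL and CSS selectors are hypothetical placeholders, not code from this repository.

```python
# Minimal sketch only: the spider name, URL and selectors are hypothetical
# and do not correspond to an actual spider in this repository.
import scrapy


class ExamplePremierSpider(scrapy.Spider):
    name = "example-prem"
    start_urls = ["https://www.example.gov.au/media-releases"]

    def parse(self, response):
        # Yield one item per media release listed on the page.
        for release in response.css("article.media-release"):
            yield {
                "title": release.css("h2 a::text").get(),
                "link": response.urljoin(release.css("h2 a::attr(href)").get()),
                "date": release.css("time::attr(datetime)").get(),
            }
```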
Install the dependencies with pip:

```sh
$ pip install -r requirements.txt
```

or with conda:

```sh
$ conda create --file=environment.yaml
$ conda activate gov-scrape
```

A Docker image is also available:

```sh
$ docker pull callumskeet/gov-scrape
# or
$ docker build -f splash.Dockerfile -t gov-scrape .
```

Run the spiders with:

```sh
$ scrapy crawl <spider-name>  # one spider
$ ./crawl.sh                  # all spiders
```

or with Docker:

```sh
$ docker-compose up -d  # runs crawl.sh then exits

# Volumes: feeds/ stores the rss files, logs/ holds scrapy's log files,
# and .scrapy/httpcache caches content from crawled pages.
$ docker run \
    --name gov-scrape \
    --rm \
    -v $FEED_DIR:/gov-scrape/feeds \
    -v $LOG_DIR:/gov-scrape/logs \
    -v $CACHE_DIR:/gov-scrape/.scrapy/httpcache \
    -it gov-scrape  # crawls with all spiders
```

The regular shell commands also work with Docker, e.g. `scrapy crawl vic-prem` can be passed to the container.
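For context, the sketch below shows one way scraped items could be serialised into an RSS file. It assumes the third-party feedgen package and is not the repository's actual feed-export code; the feed metadata and output path are placeholders.

```python
# Sketch only: assumes the third-party `feedgen` package (pip install feedgen);
# this is not the repository's actual feed-export code.
from feedgen.feed import FeedGenerator


def write_rss(items, path="feeds/example-prem.xml"):
    """Serialise scraped media-release items into an RSS 2.0 file."""
    fg = FeedGenerator()
    fg.title("Example Premier - Media Releases")
    fg.link(href="https://www.example.gov.au/media-releases", rel="alternate")
    fg.description("Media releases scraped from a hypothetical government site.")

    for item in items:
        entry = fg.add_entry()
        entry.title(item["title"])
        entry.link(href=item["link"])
        entry.description(item.get("date") or item["title"])

    fg.rss_file(path)  # write the XML document, e.g. into the mounted feeds/ volume
```

Run once per spider, a step like this would produce one feed file per source in the feeds/ directory (the same directory mounted as a volume in the Docker example above).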
* In the process of getting permission from Facebook to scrape the SA Labor page
A few sources already had RSS feeds available; these are listed below:
Copyright (c) 2021 Callum Skeet under the MIT License