Demonstration of a distributed web scraper for https://www.techinasia.com/jobs and its analytics data pipeline.
This guide assumes you have a working installation of:
- Docker Compose
- Postgres
- Redis
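
As a quick sanity check, you can confirm these tools are reachable from your shell. This is a minimal sketch assuming the standard CLI clients are installed:

```sh
# Verify the prerequisites are installed and on your PATH
docker-compose --version
psql --version        # Postgres client
redis-cli --version   # Redis client
```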
- Create 2 databases in your Postgres instance named `scraper` and `etl` (a `psql` sketch follows this list).
- Clone this repo by running `git clone https://github.com/pythonjokeun/web-scraper-data-pipeline`.
- Enter the cloned repo.
- Open up `docker-compose.yml` in your favorite editor.
- Configure all the environment variable values accordingly.
- Run `docker-compose up --scale scraper-consumer=2`.
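
To create the two databases from the first step, a minimal `psql` sketch is shown below. It assumes a Postgres instance reachable as the default `postgres` superuser on `localhost`; adjust `-U` and `-h` for your setup:

```sh
# Create the databases used by the scraper and the ETL pipeline
psql -U postgres -h localhost -c "CREATE DATABASE scraper;"
psql -U postgres -h localhost -c "CREATE DATABASE etl;"
```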
Notes:
- If your working installation of Postgres or Redis is running on your local machine, you can use `host.docker.internal` as the `[host]` value.
- The scraper and ETL data pipeline execution schedules are defined in the `CRON_SCHEDULE` variables using cron expressions.
- The `--scale scraper-consumer=2` argument defines the number of `scraper-consumer` instances (see the sketch after these notes).
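
For example, a `CRON_SCHEDULE` value of `*/30 * * * *` would run the corresponding job every 30 minutes. The snippet below is a usage sketch of the `--scale` flag; the instance count of 4 is arbitrary:

```sh
# Run the stack with four scraper-consumer instances instead of two
docker-compose up --scale scraper-consumer=4
```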