ETL

This repository contains data pipelines for rescuing my data from various web services.

These are implemented in Python and Airflow.

Goals:

  • I have a backup of my data in case the service goes away or I stop using it
  • I can easily query my data without writing code
  • I can join and aggregate data from different sources
  • I can filter data by the date it was generated

Local development

Python setup

Install Python 3.8.

python -m venv env
. env/bin/activate
pip install -r requirements.txt --use-deprecated legacy-resolver

Airflow setup

(Optional) Configure Airflow. A default config file, ~/airflow/airflow.cfg, is generated the first time you run an airflow command.

By default Airflow uses the SequentialExecutor with SQLite, which is not recommended for production use. For a more production-like setup using Postgres, change the following settings:

sql_alchemy_conn = postgresql://etl:etl@localhost/airflow
executor = LocalExecutor

By default, dags_folder is set to $AIRFLOW_HOME/dags. To run these DAGs, point it at the src/dags directory of this repository (either in the Airflow config or via environment variables).
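Airflow also reads configuration overrides from environment variables named AIRFLOW__{SECTION}__{KEY}, so the settings above can be supplied without editing the config file. For example (the dags_folder path is illustrative; use wherever you checked out this repository):

AIRFLOW__CORE__DAGS_FOLDER=/path/to/etl/src/dags
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql://etl:etl@localhost/airflow
AIRFLOW__CORE__EXECUTOR=LocalExecutor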

Run airflow initdb to initialise the Airflow metadata database.

Database setup

Create a Postgres database for all the data to go into. E.g. on Ubuntu:

sudo -u postgres createdb <dbname>

Create a database user as well:

sudo -u postgres psql
> create user <dbusername> with password '<dbpassword>';
> grant all privileges on database <dbname> to <dbusername>;
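To sanity-check the new credentials before wiring them into Airflow, a quick connection test works; this is a minimal sketch assuming psycopg2 is installed, with the placeholders being whatever values you chose above:

import psycopg2

# Use the same credentials that will later go into AIRFLOW_CONN_POSTGRES_MOVIES
conn = psycopg2.connect("postgresql://<dbusername>:<dbpassword>@localhost/<dbname>")
with conn.cursor() as cur:
    cur.execute("select 1")
    print(cur.fetchone())  # (1,) means the connection works
conn.close()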

Generate API keys

Generate an Airtable API key on the account page.
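The key is used as a bearer token against the Airtable REST API. As a rough illustration only (the base ID and table name here are hypothetical placeholders; the real ones live in the pipeline code):

import os
import requests

# Hypothetical base and table; substitute the real ones from your Airtable account
url = "https://api.airtable.com/v0/<base_id>/Ratings"
response = requests.get(
    url,
    headers={"Authorization": f"Bearer {os.environ['AIRTABLE_API_KEY']}"},
)
response.raise_for_status()
for record in response.json()["records"]:
    print(record["fields"])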

Configure the local environment

Create a .env file in the root of the project. Set the following environment variables:

AIRTABLE_API_KEY=####
AIRFLOW_CONN_POSTGRES_MOVIES=postgresql://<dbusername>:<dbpassword>@localhost/<dbname>
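Airflow treats environment variables named AIRFLOW_CONN_<CONN_ID> as connection definitions, so the second line defines a connection with the id postgres_movies. A task can then reach the database through a hook; a minimal sketch, assuming Airflow 1.x (which the initdb/test/backfill commands below suggest):

from airflow.hooks.postgres_hook import PostgresHook

# The conn_id is the lowercased suffix of AIRFLOW_CONN_POSTGRES_MOVIES
hook = PostgresHook(postgres_conn_id="postgres_movies")
print(hook.get_records("select 1"))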

Debugging commands

# Run a task in isolation
airflow test syncing_movie_and_tv_data create_my_ratings 2015-06-01

# Backfill
airflow backfill syncing_movie_and_tv_data -s 2020-12-06

Automated tests

There is a CI build defined in .github/workflows/ci.yaml.

You can run this build locally using act.

# Workaround for ubuntu-latest not working: build and tag the runner image locally
cd .github/workflows/ubuntu2004
docker build -t github-ubuntu-20.04 .

# Then, from the repository root:
act -P ubuntu-latest=github-ubuntu-20.04

Pipelines

TV and Movies

This builds a database of TV shows and movies I've watched.

How this should work (see the sketch after this list):

  • Identify shows and movies by IMDb ID and/or their ID on The Movie Database (https://www.themoviedb.org/)
  • Get my ratings from Airtable and/or Letterboxd
  • Store images in S3
  • Copy movie metadata to Airtable
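As a rough sketch of how such a pipeline might be wired up, using the DAG and task ids from the debugging commands above; the operator choice and the body of the callable are illustrative assumptions, not the actual implementation in src/dags:

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def create_my_ratings(**context):
    # Placeholder: pull my ratings from Airtable/Letterboxd and load
    # them into the postgres_movies database.
    pass


dag = DAG(
    "syncing_movie_and_tv_data",
    start_date=datetime(2020, 12, 6),
    schedule_interval="@daily",
)

create_ratings_task = PythonOperator(
    task_id="create_my_ratings",
    python_callable=create_my_ratings,
    provide_context=True,
    dag=dag,
)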

Licence

All code is licensed under the MIT licence.
