This repository contains data pipelines for rescuing my data from various web services.
These are implemented in Python and Airflow.
Goals:
- I have a backup of my data in case the service goes away or I stop using it
- I can easily query my data without writing code
- I can join and aggregate data from different sources
- I can filter data by the date it was generated
Install Python 3.8.
python -m venv env
. env/bin/activate
pip install -r requirements.txt --use-deprecated legacy-resolver
(Optional) configure airflow. By default, ~/airflow/airflow.cfg
will be generated when you install airflow.
By default this will use the SequentialExecutor with SQLite, which is not recommended for production usage. For a more production-like setup using Postgres, change the following settings:
sql_alchemy_conn = postgres://etl:etl@localhost/airflow
executor = LocalExecutor
By default dags_folder
will be set to $AIRFLOW_HOME/dags
. To run these dags, configure it to point to the src/dags
directory (either in the airflow config or via environment variables).
Run airflow initdb
Create a postgres database for all the data to go into. E.g. on ubuntu:
sudo -u postgres createdb <dbname>
Create a database user as well:
sudo -u postgres psql
> create user <dbusername> with password '<dbpassword>';
> grant all privileges on database dbname to <dbusername>;
Generate an Airtable API key on the account page.
Create a .env
file in the root of the project. Set the following environment variables:
AIRTABLE_API_KEY=####
AIRFLOW_CONN_POSTGRES_MOVIES=postgresql://<dbusername>:<dbpassword>@localhost/<dbname>
# Run a task in isolation
airflow test syncing_movie_and_tv_data create_my_ratings 2015-06-01
# Backfill
airflow backfill syncing_movie_and_tv_data -s 2020-12-06
There is a CI build defined in .github/workflows/ci.yaml
.
You can run this build locally using act.
# Workaround for ubuntu-latest not working
cd .github/workflows/ubuntu2004
docker build .
docker image tag sha256:<image hash> github-ubuntu-20.04
act -P ubuntu-latest=github-ubuntu-20.04
This builds a database of TV shows and movies I've watched.
How this should work:
- Identify shows and movies by IMDB ID and/or https://www.themoviedb.org/
- Get my ratings from Airtable and/or Letterboxd
- Store images in S3
- Copy movie metadata to Airtable
All code is licenced under MIT.