A Dagster pipeline that pulls survey responses from Google Drive, processes them, and publishes cleaned outputs to both Google Drive and Amazon S3. The project powers downstream reporting for The Marshall Project's surveys of incarcerated people.
- Python 3.11
- Git and Git LFS (`brew install git-lfs` on macOS)
- Access to a Google Cloud project with the Drive API enabled
After cloning the repository, pull large files managed by Git LFS:
```
git lfs pull
```

Create a virtual environment and install the project in editable mode:

```
python -m venv .venv
source .venv/bin/activate  # .\.venv\Scripts\activate on Windows
pip install --upgrade pip
pip install -e .
```

Install the optional developer tools if you plan to use the Dagster webserver or run tests:
pip install -e ".[dev]"Copy the example environment file and fill in the Google Drive IDs that match your project:
```
cp .env.example .env
```

| Variable | Description |
|---|---|
| `GDRIVE_DEFAULT_FOLDER_ID` | Destination folder for most processed outputs. |
| `GDRIVE_STATE_FOLDER_ID` | Destination folder for per-state exports. |
| `GDRIVE_PRISON_RESPONSES_FILE_ID` | Source CSV with the base prison survey responses. |
| `GDRIVE_JAIL_RESPONSES_FILE_ID` | Source CSV with jail responses. |
| `GDRIVE_KAMALA_RESPONSES_FILE_ID` | Source CSV with the Harris-era supplement responses. |
| `GDRIVE_QUESTIONS_DICT_FILE_ID` | Lookup table mapping question IDs to labels. |
| `GDRIVE_QUESTION_IDS_DICT_FILE_ID` | Lookup table of question identifiers. |
| `S3_BUCKET_NAME` | S3 bucket used for processed exports (set to a placeholder if you are not using the S3 assets). |
The pipeline loads `.env` automatically on import using `python-dotenv`. Without these values, Dagster will raise a clear error message explaining what is missing.
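For reference, a minimal sketch of this loading pattern, assuming `python-dotenv` and an illustrative subset of the required variables (the actual check lives in the package's import-time code):

```python
# Sketch only: the variable list and error wording here are illustrative.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

REQUIRED = ["GDRIVE_DEFAULT_FOLDER_ID", "GDRIVE_PRISON_RESPONSES_FILE_ID", "S3_BUCKET_NAME"]
missing = [name for name in REQUIRED if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
```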
Place a Google OAuth client secret JSON (named `client_secrets.json`) in the project root. On the first run Dagster prompts for authorization and writes a refresh token to `google_api_creds.txt`. Re-run authentication whenever the credentials expire or you rotate the client secret.
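The exact client library is not documented here; as an assumption, a PyDrive2-style flow that matches the file names above would look roughly like this:

```python
# Hedged sketch: assumes PyDrive2; file names follow the conventions above.
from pydrive2.auth import GoogleAuth
from pydrive2.drive import GoogleDrive

gauth = GoogleAuth()  # picks up client_secrets.json from the working directory
gauth.LoadCredentialsFile("google_api_creds.txt")
if gauth.credentials is None:
    gauth.LocalWebserverAuth()   # first run: opens a browser authorization prompt
elif gauth.access_token_expired:
    gauth.Refresh()              # expired: refresh with the stored token
else:
    gauth.Authorize()
gauth.SaveCredentialsFile("google_api_creds.txt")

drive = GoogleDrive(gauth)
```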
The S3 assets rely on the `dagster_aws` resource. Provide AWS credentials via the usual environment variables or AWS config/credential files, and set `S3_BUCKET_NAME` in `.env`. Use any bucket you control. If you are only interested in the Google Drive outputs, keep the value as a harmless placeholder and skip the S3 assets.
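As an illustration of the wiring (the asset name, resource key, and region below are hypothetical, not the project's actual definitions), an S3-backed asset using `dagster_aws` might look like:

```python
# Hypothetical sketch: asset name, object key, and region are illustrative.
import os

from dagster import Definitions, asset
from dagster_aws.s3 import S3Resource


@asset
def processed_export_to_s3(s3: S3Resource) -> None:
    # Upload a locally written artifact to the bucket named in .env.
    s3.get_client().upload_file(
        "data/processed.csv",
        os.environ["S3_BUCKET_NAME"],
        "processed.csv",
    )


defs = Definitions(
    assets=[processed_export_to_s3],
    resources={"s3": S3Resource(region_name="us-east-1")},
)
```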
Materialize the assets through Dagster. For example, execute the job that downloads the raw files:
```
dagster job execute -f survey_processing/jobs.py -j download_files_job
```

To explore and run assets interactively:

```
dagster dev
```

This command launches the Dagster UI pointing at the definitions in `survey_processing/definitions.py`.
Processed artifacts are written under `data/` locally and uploaded to the configured Google Drive folders and S3 bucket.
The pipeline standardizes a couple of critical fields:
- `vintage`: captures the survey wave. Current values are `2024-001` (pre-Biden withdrawal) and `2024-002` (post-Harris nomination supplement).
- `source`: identifies the original source file (`prison_responses`, `jail_responses`, or `kamala_responses`).
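As a hedged illustration (the helper function and call sites below are hypothetical, not the project's actual code), the tagging step amounts to something like:

```python
# Sketch only: function name, paths, and call sites are illustrative.
import pandas as pd


def tag_responses(df: pd.DataFrame, source: str, vintage: str) -> pd.DataFrame:
    out = df.copy()
    out["source"] = source    # e.g. "prison_responses"
    out["vintage"] = vintage  # e.g. "2024-001" or "2024-002"
    return out


prison = tag_responses(pd.read_csv("data/prison_responses.csv"), "prison_responses", "2024-001")
kamala = tag_responses(pd.read_csv("data/kamala_responses.csv"), "kamala_responses", "2024-002")
```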
Many downstream assets create derived aggregations, per-state exports, and redacted public releases. Review `survey_processing/assets.py` for the full list.
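For a sense of the shape of these assets (the upstream asset name and `state` column below are assumptions, not taken from `survey_processing/assets.py`), a per-state export could be written as:

```python
# Hypothetical sketch: the upstream "cleaned_responses" asset and "state" column are assumptions.
import os

import pandas as pd
from dagster import asset


@asset
def per_state_exports(cleaned_responses: pd.DataFrame) -> None:
    # Write one CSV per state under data/states/.
    os.makedirs("data/states", exist_ok=True)
    for state, group in cleaned_responses.groupby("state"):
        group.to_csv(f"data/states/{state}.csv", index=False)
```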
Jupyter or JupyterLab is required to work with the notebooks in the `notebooks/` directory. Make sure you install the project in editable mode inside the same virtual environment so the notebooks can import the package.
The contents of `data/` are ignored by default and tracked with Git LFS. Add new files with `git add -f path/to/file` if they should be versioned.
Issues and pull requests are welcome. Please open an issue if you spot data anomalies, have questions about configuration, or have ideas for additional assets.