
Incarcerated Survey Processing

A Dagster pipeline that pulls survey responses from Google Drive, processes them, and publishes cleaned outputs to both Google Drive and Amazon S3. It powers downstream reporting for The Marshall Project's surveys of incarcerated people.

Getting Started

1. Install prerequisites

  • Python 3.11
  • Git and Git LFS (brew install git-lfs on macOS)
  • Access to a Google Cloud project with Drive API enabled

After cloning the repository, pull large files managed by Git LFS:

git lfs pull

2. Create a virtual environment

python -m venv .venv
source .venv/bin/activate  # .\.venv\Scripts\activate on Windows
pip install --upgrade pip
pip install -e .

Install the optional developer tools if you plan to use the Dagster webserver or run tests:

pip install -e ".[dev]"

3. Configure environment variables

Copy the example environment file and fill in the values that match your project:

cp .env.example .env

  • GDRIVE_DEFAULT_FOLDER_ID: destination folder for most processed outputs.
  • GDRIVE_STATE_FOLDER_ID: destination folder for per-state exports.
  • GDRIVE_PRISON_RESPONSES_FILE_ID: source CSV with the base prison survey responses.
  • GDRIVE_JAIL_RESPONSES_FILE_ID: source CSV with jail responses.
  • GDRIVE_KAMALA_RESPONSES_FILE_ID: source CSV with the Harris-era supplement responses.
  • GDRIVE_QUESTIONS_DICT_FILE_ID: lookup table mapping question IDs to labels.
  • GDRIVE_QUESTION_IDS_DICT_FILE_ID: lookup table of question identifiers.
  • S3_BUCKET_NAME: S3 bucket used for processed exports (set to a placeholder if you are not using the S3 assets).

The pipeline loads .env automatically on import using python-dotenv. Without these values Dagster will raise a clear error message explaining what is missing.
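
For reference, a minimal sketch of that load-and-validate step (the variable names match the list above; the fail-fast check itself is illustrative, not copied from the repository):

import os
from dotenv import load_dotenv

# Read .env from the project root into os.environ (python-dotenv).
load_dotenv()

# Illustrative fail-fast check; the real pipeline may validate differently.
required = ["GDRIVE_DEFAULT_FOLDER_ID", "GDRIVE_PRISON_RESPONSES_FILE_ID", "S3_BUCKET_NAME"]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")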

4. Authenticate with Google Drive

Place a Google OAuth client secret JSON (named client_secrets.json) in the project root. On the first run Dagster prompts for authorization and writes a refresh token to google_api_creds.txt. Re-run authentication whenever the credentials expire or you rotate the client secret.
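
If the pipeline uses PyDrive2 for Drive access (an assumption; check the source for the actual client), the credential flow behind this step looks roughly like the following, using the file names above:

from pydrive2.auth import GoogleAuth

# GoogleAuth looks for client_secrets.json in the working directory by default.
gauth = GoogleAuth()
gauth.LoadCredentialsFile("google_api_creds.txt")
if gauth.credentials is None:
    gauth.LocalWebserverAuth()   # first run: opens a browser to authorize
elif gauth.access_token_expired:
    gauth.Refresh()              # refresh an expired token
else:
    gauth.Authorize()
gauth.SaveCredentialsFile("google_api_creds.txt")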

5. (Optional) Configure AWS

The S3 assets rely on the dagster_aws resource. Provide AWS credentials via the usual environment variables or AWS config/credential files, and set S3_BUCKET_NAME in .env. Use any bucket you control. If you are only interested in the Google Drive outputs, keep the value as a harmless placeholder and skip the S3 assets.
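
As a sketch of how the S3 resource is typically wired in Dagster (the asset name, local path, and object key below are hypothetical), an asset might upload an export like this:

import os
from dagster import asset
from dagster_aws.s3 import S3Resource

@asset
def processed_export_s3(s3: S3Resource) -> None:
    # Upload a local artifact to the bucket configured in .env.
    s3.get_client().upload_file(
        "data/processed_export.csv",       # hypothetical local artifact
        os.environ["S3_BUCKET_NAME"],
        "exports/processed_export.csv",    # hypothetical object key
    )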

Running the pipeline

Materialize the assets through Dagster. For example, execute the job that downloads the raw files:

dagster job execute -f survey_processing/jobs.py -j download_files_job
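
You can also materialize assets programmatically, which is handy in tests; the asset name below is hypothetical:

from dagster import materialize
from survey_processing import assets

# Materialize a single asset in-process (replace with a real asset from assets.py).
result = materialize([assets.prison_responses_raw])
assert result.success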

To explore and run assets interactively:

dagster dev

This command launches the Dagster UI pointing at the definitions in survey_processing/definitions.py.
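
The exact contents of survey_processing/definitions.py belong to the project, but a Definitions object generally has this shape (the jobs import and commented-out resource key are assumptions):

from dagster import Definitions, load_assets_from_modules
from survey_processing import assets, jobs

defs = Definitions(
    assets=load_assets_from_modules([assets]),
    jobs=[jobs.download_files_job],
    # resources={"s3": ...},  # see the AWS step above
)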

Processed artifacts are written under data/ locally and uploaded to the configured Google Drive folders and S3 bucket.

Data notes

The pipeline standardizes two critical fields:

  • vintage: captures the survey wave. Current values are 2024-001 (pre-Biden withdrawal) and 2024-002 (post-Harris nomination supplement).
  • source: identifies the original source file (prison_responses, jail_responses, or kamala_responses).

Many downstream assets create derived aggregations, per-state exports, and redacted public releases. Review survey_processing/assets.py for the full list.
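
A rough illustration of that standardization in pandas (the file paths and the source-to-vintage mapping are assumptions, not taken from assets.py):

import pandas as pd

sources = {
    "prison_responses": "data/prison_responses.csv",   # hypothetical paths
    "jail_responses": "data/jail_responses.csv",
    "kamala_responses": "data/kamala_responses.csv",
}

frames = []
for source_name, path in sources.items():
    df = pd.read_csv(path)
    df["source"] = source_name
    # Assumption: the Harris-era supplement is wave 2024-002, everything else 2024-001.
    df["vintage"] = "2024-002" if source_name == "kamala_responses" else "2024-001"
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)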

Working with notebooks

Jupyter or JupyterLab is required to work with the notebooks in the notebooks/ directory. Make sure you install the project in editable mode inside the same virtual environment so the notebooks can import the package.

Large file storage

The contents of data/ are git-ignored by default; files that do get committed there are tracked with Git LFS. Add new files with git add -f path/to/file if they should be versioned.

Contributing

Issues and pull requests are welcome. Please open an issue if you spot data anomalies, have questions about configuration, or have ideas for additional assets.
