Conversation

@chosak (Member) commented Aug 13, 2025

This PR adds a Dockerfile to this project, allowing its data pipelines to run in Docker.

The Dockerfile uses Python 3.13 instead of the 3.8 used by consumerfinance.gov. Python 3.8 is EOL and has several known vulnerabilities, and the indexing doesn't have to run on the same Python version as the API code. To support this upgrade, boto has been updated from 1.11.7 to 1.37.38, the last version that still supports Python 3.8.

The make elasticsearch logic has been modified so that checking whether we need to reindex (based on sentinel files containing last dataset/index times) doesn't rely on the local presence of the dataset files. Currently, this pipeline is assumed to run somewhere with a persistent Jenkins workspace, meaning that the (large) dataset files will be present locally. In an ephemeral Jenkins setup, we don't have such a persistent workspace. This change in logic is intended to support a workflow where we (1) download only the (small) sentinel files from a persistent place (S3); (2) check those file timestamps to see if we need to reindex; and (3) only download the large source data if we need to reindex (or if the FORCE_REINDEX environment variable is set). A rough sketch of this check appears below.
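For illustration, a minimal Python sketch of that check, assuming hypothetical sentinel and state file names (the real pipeline's file names and S3 layout may differ):

```python
import os

import boto3
from botocore.exceptions import ClientError

# Hypothetical names for illustration; the real pipeline's sentinel files
# and S3 keys may differ.
BUCKET = os.environ["INPUT_S3_BUCKET"]
DATASET_KEY = os.environ["INPUT_S3_KEY"]
SENTINEL_KEY = "sentinels/last_indexed"
STATE_DIR = os.environ.get("STATE_DIR", "/state")

s3 = boto3.client("s3")


def need_reindex():
    """Decide whether to reindex without downloading the large dataset."""
    if os.environ.get("FORCE_REINDEX"):
        return True

    # (1) Download only the small sentinel file from persistent storage (S3).
    sentinel_path = os.path.join(STATE_DIR, "last_indexed")
    try:
        s3.download_file(BUCKET, SENTINEL_KEY, sentinel_path)
    except ClientError:
        return True  # No sentinel yet; treat this as a first run.

    # (2) Compare timestamps: has the dataset changed since we last indexed?
    with open(sentinel_path) as f:
        last_indexed = float(f.read().strip())
    dataset = s3.head_object(Bucket=BUCKET, Key=DATASET_KEY)
    return dataset["LastModified"].timestamp() > last_indexed


if need_reindex():
    # (3) Only now download the large source data and run the indexing.
    s3.download_file(BUCKET, DATASET_KEY, os.path.join(STATE_DIR, "dataset.csv"))
```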

For additional context, see internal DEVPLAT-1569.

Testing

First, build the Docker image:

docker build -t ccdb-data-pipeline:latest .

Then, run the indexing pipeline, making sure to pass valid AWS credentials that can read the source S3 bucket:

docker run \
    -e ENV=local \
    -e INPUT_S3_BUCKET=<name of your bucket> \
    -e INPUT_S3_KEY=path/to/consumer_complaint_datashare.csv \
    -e INPUT_S3_KEY_METADATA=path/to/consumer_complaint_datashare_metadata.json \
    -e AWS_ACCESS_KEY_ID \
    -e AWS_SECRET_ACCESS_KEY \
    -e AWS_SESSION_TOKEN \
    -e ES_HOST=host.docker.internal \
    -e FORCE_REINDEX=1 \
    -v /tmp:/state \
    ccdb-data-pipeline:latest \
    make STATE_DIR=/state elasticsearch

The above example uses host.docker.internal to connect to an Elasticsearch instance running on your local machine; alternatively, pass ES_PORT, ES_USERNAME, and ES_PASSWORD to connect to a remote Elasticsearch. It also mounts your local /tmp directory as the state directory where files are downloaded and checked.

Todos

  • So far I've only tested the indexing pipeline; there are other pipelines that may need to be migrated as well.
  • This Dockerfile works properly for indexing into an Elasticsearch 7.x index. For this project to index into OpenSearch 2.x (our target environment), it must be modified to use a modern OpenSearch client library instead of the current Elasticsearch one. I'm planning to open an additional PR to add that support but wanted to keep this piece separate.

@chosak requested review from higs4281 and imuchnik August 13, 2025 20:54

@chosak (Member, Author) commented Aug 15, 2025

A note about the most recent commit (693d46f): the current code tries to pass HTTP basic auth params via an http_auth param to the Elasticsearch client, but version 7.x doesn't actually support that param. See the client source code, which documents that the username/password need to be passed as part of the host.

Additionally, the existing code was passing a user_ssl=True param, which appears to be a typo: the correct param is use_ssl, and it should only be set if the connection actually needs TLS.
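For illustration, a sketch of what the corrected client construction might look like under that constraint (host, port, and credentials here are placeholders):

```python
from elasticsearch import Elasticsearch

# Per the note above: in elasticsearch-py 7.x, pass basic auth and TLS
# settings as part of each host dict rather than as client-level kwargs.
# Host, port, and credentials below are placeholders.
es = Elasticsearch(
    hosts=[
        {
            "host": "example.com",
            "port": 443,
            "http_auth": "username:password",
            "use_ssl": True,  # use_ssl, not user_ssl; set only when TLS is needed
        }
    ]
)
```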

@higs4281 (Member) left a comment

A great step forward!

@chosak merged commit d7d52fb into main Sep 9, 2025
1 check passed
@chosak deleted the feature/docker branch September 9, 2025 20:47