Conversation

@chosak (Member) commented Aug 13, 2025

This PR adds a Dockerfile to this project, allowing its data pipelines to run in Docker.

The Dockerfile uses Python 3.13 instead of the 3.8 used by consumerfinance.gov. Python 3.8 is EOL and has several known vulnerabilities, and the indexing doesn't have to run on the same Python version as the API code. To support this upgrade, boto has been updated from 1.11.7 to 1.37.38, the last version that still supports Python 3.8.

The make elasticsearch logic has been modified so that checking whether we need to reindex (based on sentinel files containing last dataset/index times) doesn't rely on the local presence of the dataset files. Currently, this pipeline is assumed to run somewhere with a persistent Jenkins workspace, meaning that the (large) dataset files will be present locally. In an ephemeral Jenkins setup, we don't have such a persistent workspace. This change in logic is intended to support a workflow where we (1) download only the (small) sentinel files from a persistent place (S3); (2) check those file timestamps to see if we need to reindex; and (3) only download the large source data if we need to reindex (or if the FORCE_REINDEX environment variable is set). A rough sketch of this check appears below.
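For illustration, a minimal Python sketch of that check, assuming hypothetical sentinel and state file names (the real pipeline's file names and S3 layout may differ):

```python
import os

import boto3
from botocore.exceptions import ClientError

# Hypothetical names for illustration; the real pipeline's sentinel files
# and S3 keys may differ.
BUCKET = os.environ["INPUT_S3_BUCKET"]
DATASET_KEY = os.environ["INPUT_S3_KEY"]
SENTINEL_KEY = "sentinels/last_indexed"
STATE_DIR = os.environ.get("STATE_DIR", "/state")

s3 = boto3.client("s3")


def need_reindex():
    """Decide whether to reindex without downloading the large dataset."""
    if os.environ.get("FORCE_REINDEX"):
        return True

    # (1) Download only the small sentinel file from persistent storage (S3).
    sentinel_path = os.path.join(STATE_DIR, "last_indexed")
    try:
        s3.download_file(BUCKET, SENTINEL_KEY, sentinel_path)
    except ClientError:
        return True  # No sentinel yet; treat this as a first run.

    # (2) Compare timestamps: has the dataset changed since we last indexed?
    with open(sentinel_path) as f:
        last_indexed = float(f.read().strip())
    dataset = s3.head_object(Bucket=BUCKET, Key=DATASET_KEY)
    return dataset["LastModified"].timestamp() > last_indexed


if need_reindex():
    # (3) Only now download the large source data and run the indexing.
    s3.download_file(BUCKET, DATASET_KEY, os.path.join(STATE_DIR, "dataset.csv"))
```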

For additional context, see internal DEVPLAT-1569.

Testing

First, build the Docker image:

docker build -t ccdb-data-pipeline:latest .

Then, run the indexing pipeline, making sure to pass valid AWS credentials that can read the source S3 bucket:

docker run \
    -e ENV=local \
    -e INPUT_S3_BUCKET=<name of your bucket> \
    -e INPUT_S3_KEY=path/to/consumer_complaint_datashare.csv \
    -e INPUT_S3_KEY_METADATA=path/to/consumer_complaint_datashare_metadata.json \
    -e AWS_ACCESS_KEY_ID \
    -e AWS_SECRET_ACCESS_KEY \
    -e AWS_SESSION_TOKEN \
    -e ES_HOST=host.docker.internal \
    -e FORCE_REINDEX=1 \
    -v /tmp:/state \
    ccdb-data-pipeline:latest \
    make STATE_DIR=/state elasticsearch

The above example uses host.docker.internal to connect to an Elasticsearch instance running on your local machine; alternatively, pass ES_PORT, ES_USERNAME, and ES_PASSWORD to connect to a remote Elasticsearch. It also mounts your local /tmp directory as the state directory where files are downloaded and checked.

Todos

  • So far I've only tested the indexing pipeline; there are other pipelines that may need to be migrated as well.
  • This Dockerfile works properly for indexing into an Elasticsearch 7.x index. For this project to index into OpenSearch 2.x (our target environment), it must be modified to use a modern OpenSearch client library instead of the current Elasticsearch one. I'm planning to open an additional PR to add that support but wanted to keep this piece separate.

@chosak requested review from higs4281 and imuchnik August 13, 2025 20:54

@chosak (Member, Author) commented Aug 15, 2025

A note about the most recent commit (693d46f): the current code tries to pass HTTP basic auth params via an http_auth param to the Elasticsearch client, but version 7.x doesn't actually support that param. See the client source code, which documents that the username/password need to be passed as part of the host.

Additionally, the existing code was passing a user_ssl=True param, which appears to be a typo: the correct param is use_ssl, and it should only be set if the connection actually needs TLS.
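For illustration, a sketch of what the corrected client construction might look like under that constraint (host, port, and credentials here are placeholders):

```python
from elasticsearch import Elasticsearch

# Per the note above: in elasticsearch-py 7.x, pass basic auth and TLS
# settings as part of each host dict rather than as client-level kwargs.
# Host, port, and credentials below are placeholders.
es = Elasticsearch(
    hosts=[
        {
            "host": "example.com",
            "port": 443,
            "http_auth": "username:password",
            "use_ssl": True,  # use_ssl, not user_ssl; set only when TLS is needed
        }
    ]
)
```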

@higs4281 (Member) left a comment

A great step forward!

@chosak merged commit d7d52fb into main Sep 9, 2025
1 check passed
@chosak deleted the feature/docker branch September 9, 2025 20:47