Add support for running indexing pipeline in Docker #75
This PR adds a Dockerfile to this project, which allows for running its data pipelines in Docker.
The Dockerfile uses Python 3.13 instead of the Python 3.8 used by consumerfinance.gov. Python 3.8 is EOL and has several known vulnerabilities, and the indexing doesn't have to run on the same Python version as the API code. To support this upgrade, boto3 has been updated from 1.11.7 to 1.37.38, the last version that still supports Python 3.8.
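For orientation, here is a minimal sketch of what such a Dockerfile could look like. This is not the actual file added in this PR; the base image tag, system packages, and file layout are assumptions:

```dockerfile
# Sketch only -- see the Dockerfile added in this PR for the real version.
FROM python:3.13-slim

# make is needed so the container can run the existing pipeline targets.
RUN apt-get update \
    && apt-get install -y --no-install-recommends make \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install pinned dependencies first so this layer is cached between builds.
# Assumes requirements.txt pins boto3==1.37.38 as described above.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the pipeline source, including the Makefile.
COPY . .

# Callers supply the actual command, e.g. `make STATE_DIR=/state elasticsearch`.
CMD ["make"]
```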
The `make elasticsearch` logic has been modified so that checking if we need to reindex (based on sentinel files containing last dataset/index times) doesn't rely on the local presence of the dataset files. Currently, this pipeline is assumed to run somewhere with a persistent Jenkins workspace, meaning that the (large) dataset files will be present locally. In an ephemeral Jenkins setup, we don't have such a persistent workspace. This change in logic is intended to support a workflow where we (1) download only the (small) sentinel files from a persistent place (S3); (2) check those file timestamps to see if we need to reindex; (3) only download the large source data if we need to reindex (or if the `FORCE_REINDEX` environment variable is set). For additional context, see internal DEVPLAT-1569.
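To make that flow concrete, here is a rough, hypothetical shell sketch of the decision. It is not the actual `make elasticsearch` recipe; the sentinel object names, paths, and use of the `aws` CLI are illustrative assumptions:

```bash
#!/usr/bin/env bash
# Illustrative only -- not the real make elasticsearch recipe.
# Sentinel names ("dataset_updated", "last_indexed") and the aws CLI
# calls are assumptions made for this sketch.
set -euo pipefail

STATE_DIR="${STATE_DIR:-/state}"

# (1) Download only the small sentinel file from S3.
aws s3 cp "s3://${INPUT_S3_BUCKET}/sentinels/dataset_updated" "${STATE_DIR}/dataset_updated"

# (2) Reindex if forced, or if the dataset sentinel is newer than the last index run.
if [[ -n "${FORCE_REINDEX:-}" || "${STATE_DIR}/dataset_updated" -nt "${STATE_DIR}/last_indexed" ]]; then
  # (3) Only now pull the large source files and run the indexer.
  aws s3 cp "s3://${INPUT_S3_BUCKET}/${INPUT_S3_KEY}" "${STATE_DIR}/"
  aws s3 cp "s3://${INPUT_S3_BUCKET}/${INPUT_S3_KEY_METADATA}" "${STATE_DIR}/"
  # ... index into Elasticsearch at ES_HOST, then record the run ...
  touch "${STATE_DIR}/last_indexed"
else
  echo "Dataset unchanged since last index; skipping reindex."
fi
```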
Testing
First, build the Docker image:
```sh
docker build -t ccdb-data-pipeline:latest .
```

Then, run the indexing pipeline, making sure to pass valid AWS credentials that can read the source S3 bucket:
```sh
docker run \
  -e ENV=local \
  -e INPUT_S3_BUCKET=<name of your bucket> \
  -e INPUT_S3_KEY=path/to/consumer_complaint_datashare.csv \
  -e INPUT_S3_KEY_METADATA=path/to/consumer_complaint_datashare_metadata.json \
  -e AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY \
  -e AWS_SESSION_TOKEN \
  -e ES_HOST=host.docker.internal \
  -e FORCE_REINDEX=1 \
  -v /tmp:/state \
  ccdb-data-pipeline:latest \
  make STATE_DIR=/state elasticsearch
```

The above example uses `host.docker.internal` to connect to an Elasticsearch running on your local machine. Alternatively, pass `ES_PORT`, `ES_USERNAME`, and `ES_PASSWORD` to connect to a remote Elasticsearch. It also mounts your local `/tmp` directory as the place where files are downloaded/checked.

Todos