Change S3 output from CSV to Parquet with Snappy compression #8
Summary
Updates the ETL pipeline to output data in Parquet format instead of CSV. The changes include:

- Output partitioned by region (`region=<value>/`)
- Snappy-compressed files with timestamped names (`YYYYMMDD_HHMMSS.parquet`)

The new S3 path structure will be:

`auto_oem/etl/vehicle_sales_deduped/region=West/20241203_150000.parquet`
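As a rough sketch, the new writer might look something like the following. The function name `df_to_s3_parquet` comes from the Notes section below, but the signature, the bucket handling, and the use of pandas with pyarrow are assumptions, not code from this PR:

```python
import io
from datetime import datetime, timezone

import boto3
import pandas as pd

PREFIX = "auto_oem/etl/vehicle_sales_deduped"


def df_to_s3_parquet(df: pd.DataFrame, bucket: str, region: str) -> str:
    """Upload df to S3 as Snappy-compressed Parquet under a region partition.

    Returns the S3 key that was written. Hypothetical signature; the real
    function's parameters are not shown in this PR description.
    """
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    key = f"{PREFIX}/region={region}/{timestamp}.parquet"

    buffer = io.BytesIO()
    # Snappy is pyarrow's default codec; passed explicitly for clarity.
    df.to_parquet(buffer, engine="pyarrow", compression="snappy", index=False)

    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buffer.getvalue())
    return key
```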
Updates since last revision

Added a comprehensive test suite (`tests/test_load_data_to_s3.py`) with 7 test cases verifying, among other things, that output filenames follow the `YYYYMMDD_HHMMSS.parquet` timestamp pattern. Run the tests locally with `pytest tests/test_load_data_to_s3.py -v`.
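For a sense of what one of those cases might check, here is a hypothetical test of the filename pattern. It assumes the `df_to_s3_parquet` sketch above (returning the written key), guesses at the module path, and mocks out boto3 so no AWS credentials are needed:

```python
import re
from unittest.mock import patch

import pandas as pd

from load_data_to_s3 import df_to_s3_parquet  # hypothetical module path


def test_key_follows_region_partition_and_timestamp_pattern():
    df = pd.DataFrame({"region": ["West"], "units_sold": [42]})
    with patch("boto3.client"):  # avoid real S3 calls
        key = df_to_s3_parquet(df, bucket="test-bucket", region="West")
    assert re.fullmatch(
        r"auto_oem/etl/vehicle_sales_deduped/region=West/\d{8}_\d{6}\.parquet",
        key,
    )
```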
Review & Testing Checklist for Human

Recommended test plan: run `docker-compose up -d postgres`, then `python main.py` with valid AWS credentials, and verify the S3 bucket contains the expected partitioned Parquet files.

Notes
- The `df_to_s3` function was replaced with `df_to_s3_parquet`; this is a breaking change if any other code references the old function (one possible migration shim is sketched below)
- Added `pytest` to `requirements.txt` for running the tests
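Since the rename is breaking, one hypothetical way to stage the migration (not part of this PR) is a deprecation shim that keeps old call sites working while they are updated:

```python
import warnings

import pandas as pd

from load_data_to_s3 import df_to_s3_parquet  # hypothetical module path


def df_to_s3(df: pd.DataFrame, bucket: str, region: str) -> str:
    """Deprecated: forwards to df_to_s3_parquet, which writes Parquet, not CSV."""
    warnings.warn(
        "df_to_s3 is deprecated; call df_to_s3_parquet instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return df_to_s3_parquet(df, bucket, region)
```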
Link to Devin run: https://app.devin.ai/sessions/7ed04133f4b0417497cab5462536443c
Requested by: Abhay Aggarwal (@abhay-codeium)