Convert ETL output from CSV to Parquet format with Snappy compression and region partitioning #2
## Summary
Updated the ETL pipeline to write its output in Parquet format instead of CSV, with Snappy compression, partitioning by region, and timestamp-based filenames, as requested.
## Changes Made
- Added `pyarrow` to `requirements.txt` for Parquet format support
- Rewrote the `df_to_s3()` function in `src/load_data_to_s3.py` to write Parquet with Snappy compression, partition output by region, and name files as `vehicle_sales_{region}_{timestamp}.parquet` (see the sketch after this list)
- Updated `main.py` to use the `key_prefix` parameter instead of a hardcoded filename
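The PR does not reproduce the new function body here; a minimal sketch of what the rewritten `df_to_s3()` could look like follows. The signature, the `region` column name, and the boto3 upload style are assumptions, not taken from the repository.

```python
# Hypothetical sketch of the rewritten df_to_s3(); the signature, column name,
# and bucket handling are assumptions rather than the repository's code.
import io
from datetime import datetime, timezone

import boto3
import pandas as pd


def df_to_s3(df: pd.DataFrame, bucket: str, key_prefix: str) -> list:
    """Upload one Snappy-compressed Parquet file per region under key_prefix."""
    s3 = boto3.client("s3")
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    keys = []
    for region, region_df in df.groupby("region"):
        buffer = io.BytesIO()
        # pandas delegates to pyarrow here; Snappy is pyarrow's default codec,
        # but it is passed explicitly to match the PR description.
        region_df.to_parquet(buffer, engine="pyarrow",
                             compression="snappy", index=False)
        key = f"{key_prefix}/vehicle_sales_{str(region).lower()}_{timestamp}.parquet"
        s3.put_object(Bucket=bucket, Key=key, Body=buffer.getvalue())
        keys.append(key)
    return keys
```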
## Technical Details
- Output files follow the pattern `vehicle_sales_{region_lowercase}_{YYYYMMDD_HHMMSS}.parquet`

## Verification Results
✅ Parquet conversion tested successfully via an isolated test script.
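The test script itself is not part of the diff; a round-trip check along these lines would exercise the conversion (the column names and sample values below are placeholders, not the actual script):

```python
# Illustrative round-trip check for the Parquet conversion; the sample data
# is made up and does not come from the actual test script.
import io

import pandas as pd

original = pd.DataFrame({"region": ["East", "East", "West"],
                         "units": [120, 80, 95]})

buffer = io.BytesIO()
original.to_parquet(buffer, engine="pyarrow", compression="snappy", index=False)
buffer.seek(0)

roundtrip = pd.read_parquet(buffer, engine="pyarrow")
pd.testing.assert_frame_equal(original, roundtrip)  # raises on any mismatch
print("Parquet round-trip OK")
```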
## Files Modified
- `requirements.txt` - Added `pyarrow` dependency
- `src/load_data_to_s3.py` - Complete rewrite of the `df_to_s3` function for Parquet support
- `main.py` - Updated the S3 upload call to use the `key_prefix` parameter (example after this list)
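For context, the updated call site in `main.py` might look roughly like this; the DataFrame variable, bucket name, and prefix value are assumptions:

```python
# Hypothetical call site in main.py; variable and bucket names are assumed.
import pandas as pd

from src.load_data_to_s3 import df_to_s3

sales_df = pd.DataFrame({"region": ["East", "West"],
                         "units": [120, 95]})  # stand-in for real ETL output

# key_prefix replaces the previously hardcoded filename; df_to_s3 now builds
# per-region, timestamped object keys itself.
df_to_s3(sales_df, bucket="vehicle-sales-etl",
         key_prefix="processed/vehicle_sales")
```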
## Link to Devin run
https://app.devin.ai/sessions/34874ff5549a4fdc8c3c9dd60335112e
Requested by: Shawn Azman (shawn@cognition.ai)