Convert ETL output from CSV to Parquet format

Summary

Updated the ETL pipeline to output data in Parquet format instead of CSV, with Snappy compression, partitioned by region, and using timestamp-based filenames as requested.

Changes Made

  • Added pyarrow dependency to requirements.txt for Parquet format support
  • Updated the df_to_s3() function in src/load_data_to_s3.py (see the sketch after this list) to:
    • Output Parquet format with Snappy compression
    • Partition data by region (creates separate files per region)
    • Generate timestamp-based filenames (format: vehicle_sales_{region}_{timestamp}.parquet)
    • Use BytesIO instead of StringIO for binary data handling
  • Updated main orchestrator in main.py to use key_prefix parameter instead of hardcoded filename
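For reviewers, here is a minimal sketch of what the updated df_to_s3() does under these changes; the exact signature, bucket handling, error handling, and boto3 client setup in src/load_data_to_s3.py are assumptions here, not the committed code.

```python
from datetime import datetime
from io import BytesIO

import boto3
import pandas as pd


def df_to_s3(df: pd.DataFrame, bucket: str, key_prefix: str) -> list:
    """Write one Snappy-compressed Parquet file per region to S3."""
    s3 = boto3.client("s3")
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    uploaded_keys = []

    for region, region_df in df.groupby("region"):
        # Timestamp-based filename: vehicle_sales_{region}_{timestamp}.parquet
        filename = f"vehicle_sales_{region.lower()}_{timestamp}.parquet"
        key = f"{key_prefix}/{filename}"

        # Parquet is a binary format, so buffer with BytesIO rather than StringIO.
        buffer = BytesIO()
        region_df.to_parquet(buffer, engine="pyarrow", compression="snappy", index=False)
        buffer.seek(0)

        s3.upload_fileobj(buffer, bucket, key)
        uploaded_keys.append(key)

    return uploaded_keys
```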

Technical Details

  • Partitioning Strategy: Creates separate Parquet files per region (West, Central, East) rather than using Parquet's built-in directory partitioning, for better S3 compatibility (see the contrast sketch after this list)
  • Compression: Uses Snappy compression via pyarrow's built-in compression parameter
  • Filename Format: vehicle_sales_{region_lowercase}_{YYYYMMDD_HHMMSS}.parquet
  • Error Handling: Added comprehensive error handling for both S3 upload failures and Parquet creation failures
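For contrast with the per-region-file approach chosen here, this is roughly what Parquet's built-in (Hive-style) partitioning would produce instead; the sample frame and local output path are illustrative only.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Tiny illustrative frame; the real pipeline's columns are not shown in this PR.
df = pd.DataFrame({"region": ["West", "West", "Central", "East"], "units": [3, 1, 2, 5]})

pq.write_to_dataset(
    pa.Table.from_pandas(df),
    root_path="vehicle_sales_dataset",  # local path for illustration
    partition_cols=["region"],          # yields region=West/, region=Central/, region=East/
    compression="snappy",               # Snappy is also pyarrow's default
)
```

The built-in approach nests files under region=... directories, whereas the committed change keeps a flat set of predictably named keys per region.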

Verification Results

Parquet conversion tested successfully. An isolated test script (along the lines of the sketch after this list) confirmed:

  • All 3 regions (West, Central, East) processed correctly
  • Snappy compression applied successfully
  • Timestamp-based naming working as expected
  • Proper row distribution per region (2 for West, 1 each for Central and East)
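The isolated test script itself is not committed in this PR; a check along these lines, writing locally instead of to S3 and inspecting the Parquet metadata, is the kind of verification referred to above.

```python
import glob

import pandas as pd
import pyarrow.parquet as pq

sample = pd.DataFrame(
    {"region": ["West", "West", "Central", "East"], "units_sold": [10, 5, 7, 3]}
)

# Write one Snappy-compressed file per region, mirroring df_to_s3().
for region, region_df in sample.groupby("region"):
    region_df.to_parquet(
        f"vehicle_sales_{region.lower()}_20240101_000000.parquet",
        engine="pyarrow",
        compression="snappy",
        index=False,
    )

# Confirm compression codec and per-region row counts from the file metadata.
for path in sorted(glob.glob("vehicle_sales_*.parquet")):
    meta = pq.ParquetFile(path).metadata
    print(path, meta.row_group(0).column(0).compression, meta.num_rows)
```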

⚠️ Full end-to-end testing limited: the complete pipeline could not be run because no PostgreSQL database was available in the environment, but the core Parquet functionality was verified through the isolated test.

Files Modified

  • requirements.txt - Added pyarrow dependency
  • src/load_data_to_s3.py - Complete rewrite of df_to_s3 function for Parquet support
  • main.py - Updated the S3 upload call to use the key_prefix parameter (see the call sketch below)
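A hypothetical call site in main.py after this change; the import path, DataFrame variable, bucket name, and prefix are assumptions rather than the repo's actual values.

```python
from src.load_data_to_s3 import df_to_s3  # assumed module layout

# Upload one Parquet file per region under the given key prefix.
uploaded = df_to_s3(sales_df, bucket="vehicle-sales-data", key_prefix="vehicle_sales")
```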

Link to Devin run

https://app.devin.ai/sessions/34874ff5549a4fdc8c3c9dd60335112e

Requested by: Shawn Azman (shawn@cognition.ai)

- Add pyarrow dependency for Parquet support
- Implement Snappy compression
- Partition output by region (separate files per region)
- Use timestamp-based filenames
- Update S3 upload logic to handle multiple Parquet files

Co-Authored-By: Shawn Azman <shawn.d.azman@gmail.com>
@devin-ai-integration (Author)

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring
