Convert ETL output from CSV to Parquet format

Summary

Updated the ETL pipeline to output data in Parquet format instead of CSV, with Snappy compression, partitioned by region, and using timestamp-based filenames as requested.

Changes Made

  • Added pyarrow dependency to requirements.txt for Parquet format support
  • Updated the df_to_s3() function in src/load_data_to_s3.py (see the sketch after this list) to:
    • Output Parquet format with Snappy compression
    • Partition data by region (creates separate files per region)
    • Generate timestamp-based filenames (format: vehicle_sales_{region}_{timestamp}.parquet)
    • Use BytesIO instead of StringIO for binary data handling
  • Updated main orchestrator in main.py to use key_prefix parameter instead of hardcoded filename
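For reviewers, here is a minimal sketch of what the updated df_to_s3() does under these changes; the exact signature, bucket handling, error handling, and boto3 client setup in src/load_data_to_s3.py are assumptions here, not the committed code.

```python
from datetime import datetime
from io import BytesIO

import boto3
import pandas as pd


def df_to_s3(df: pd.DataFrame, bucket: str, key_prefix: str) -> list:
    """Write one Snappy-compressed Parquet file per region to S3."""
    s3 = boto3.client("s3")
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    uploaded_keys = []

    for region, region_df in df.groupby("region"):
        # Timestamp-based filename: vehicle_sales_{region}_{timestamp}.parquet
        filename = f"vehicle_sales_{region.lower()}_{timestamp}.parquet"
        key = f"{key_prefix}/{filename}"

        # Parquet is a binary format, so buffer with BytesIO rather than StringIO.
        buffer = BytesIO()
        region_df.to_parquet(buffer, engine="pyarrow", compression="snappy", index=False)
        buffer.seek(0)

        s3.upload_fileobj(buffer, bucket, key)
        uploaded_keys.append(key)

    return uploaded_keys
```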

Technical Details

  • Partitioning Strategy: Creates separate Parquet files per region (West, Central, East) rather than using Parquet's built-in directory partitioning, for better S3 compatibility (see the contrast sketch after this list)
  • Compression: Uses Snappy compression via pyarrow's built-in compression parameter
  • Filename Format: vehicle_sales_{region_lowercase}_{YYYYMMDD_HHMMSS}.parquet
  • Error Handling: Added comprehensive error handling for both S3 upload failures and Parquet creation failures
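For contrast with the per-region-file approach chosen here, this is roughly what Parquet's built-in (Hive-style) partitioning would produce instead; the sample frame and local output path are illustrative only.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Tiny illustrative frame; the real pipeline's columns are not shown in this PR.
df = pd.DataFrame({"region": ["West", "West", "Central", "East"], "units": [3, 1, 2, 5]})

pq.write_to_dataset(
    pa.Table.from_pandas(df),
    root_path="vehicle_sales_dataset",  # local path for illustration
    partition_cols=["region"],          # yields region=West/, region=Central/, region=East/
    compression="snappy",               # Snappy is also pyarrow's default
)
```

The built-in approach nests files under region=... directories, whereas the committed change keeps a flat set of predictably named keys per region.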

Verification Results

Parquet conversion tested successfully. An isolated test script (along the lines of the sketch after this list) confirmed:

  • All 3 regions (West, Central, East) processed correctly
  • Snappy compression applied successfully
  • Timestamp-based naming working as expected
  • Proper row distribution per region (2 for West, 1 each for Central and East)
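The isolated test script itself is not committed in this PR; a check along these lines, writing locally instead of to S3 and inspecting the Parquet metadata, is the kind of verification referred to above.

```python
import glob

import pandas as pd
import pyarrow.parquet as pq

sample = pd.DataFrame(
    {"region": ["West", "West", "Central", "East"], "units_sold": [10, 5, 7, 3]}
)

# Write one Snappy-compressed file per region, mirroring df_to_s3().
for region, region_df in sample.groupby("region"):
    region_df.to_parquet(
        f"vehicle_sales_{region.lower()}_20240101_000000.parquet",
        engine="pyarrow",
        compression="snappy",
        index=False,
    )

# Confirm compression codec and per-region row counts from the file metadata.
for path in sorted(glob.glob("vehicle_sales_*.parquet")):
    meta = pq.ParquetFile(path).metadata
    print(path, meta.row_group(0).column(0).compression, meta.num_rows)
```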

⚠️ Full end-to-end testing limited: the complete pipeline could not be run because no PostgreSQL database was available in the environment, but the core Parquet functionality was verified through the isolated test.

Files Modified

  • requirements.txt - Added pyarrow dependency
  • src/load_data_to_s3.py - Complete rewrite of df_to_s3 function for Parquet support
  • main.py - Updated the S3 upload call to use the key_prefix parameter (see the call sketch below)
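A hypothetical call site in main.py after this change; the import path, DataFrame variable, bucket name, and prefix are assumptions rather than the repo's actual values.

```python
from src.load_data_to_s3 import df_to_s3  # assumed module layout

# Upload one Parquet file per region under the given key prefix.
uploaded = df_to_s3(sales_df, bucket="vehicle-sales-data", key_prefix="vehicle_sales")
```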

Link to Devin run

https://app.devin.ai/sessions/34874ff5549a4fdc8c3c9dd60335112e

Requested by: Shawn Azman (shawn@cognition.ai)

- Add pyarrow dependency for Parquet support
- Implement Snappy compression
- Partition output by region (separate files per region)
- Use timestamp-based filenames
- Update S3 upload logic to handle multiple Parquet files

Co-Authored-By: Shawn Azman <shawn.d.azman@gmail.com>
@devin-ai-integration (Author)

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring
