Add schema validation before S3 loading #9

devin-ai-integration · 2025-12-03T20:08:15Z

Summary

Adds schema validation to the ETL pipeline that runs before loading data to S3. The validation checks that all rows match the expected schema:

VIN: must be a 17-character string
Year: must be between 1990 and current year (inclusive)
Sale_price: must be non-negative (NULL values are allowed)

If any row fails validation, a SchemaValidationError exception is raised with details about all failing rows, preventing bad data from being uploaded to S3.

Review & Testing Checklist for Human

Verify NULL/NaN handling: The code uses is not None checks, but pandas DataFrames use NaN which behaves differently. Test with actual data containing NULL values from the database to ensure validation doesn't incorrectly flag valid rows.
Test with production-like data: Run the pipeline locally with docker-compose up -d postgres and python main.py to verify validation passes on valid data and correctly rejects invalid data.
Consider performance: The implementation uses df.iterrows() which is slow for large DataFrames. If the dataset is large, this may need optimization using vectorized pandas operations.

Notes

Link to Devin run: https://app.devin.ai/sessions/5e8e09daffc945a19e6c05e812e85d07
Requested by: Abhay Aggarwal (@abhay-codeium)

- Add validate_schema function to transform.py that validates: - VIN is a 17-character string - Year is between 1990 and current year - Sale_price is non-negative - Add SchemaValidationError exception class - Integrate validation step in main.py before S3 upload - Raises exception if any row fails validation Co-Authored-By: Abhay Aggarwal <abhay.aggarwal@codeium.com>

devin-ai-integration · 2025-12-03T20:08:17Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

Disable automatic comment and CI monitoring

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add schema validation before S3 loading #9

Add schema validation before S3 loading #9

Uh oh!

devin-ai-integration bot commented Dec 3, 2025 •

edited

Loading

Uh oh!

devin-ai-integration bot commented Dec 3, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Add schema validation before S3 loading #9

Are you sure you want to change the base?

Add schema validation before S3 loading #9

Uh oh!

Conversation

devin-ai-integration bot commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Review & Testing Checklist for Human

Notes

Uh oh!

devin-ai-integration bot commented Dec 3, 2025

🤖 Devin AI Engineer

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

devin-ai-integration bot commented Dec 3, 2025 •

edited

Loading