Optimize PyArrow Parquet outputs #3246

Open
3 tasks
zaneselvans opened this issue Jan 16, 2024 · 0 comments
Labels
  • output: Exporting data from PUDL into other platforms or interchange formats.
  • parquet: Issues related to the Apache Parquet file format which we use for long tables.
  • performance: Make PUDL run faster!

Comments

@zaneselvans
Member

With a little additional work, we can improve the PyArrow schemas and further reduce the size of the Parquet files we generate.

  • Remarkably, all 190 Resource definitions combined add up to only ~1.2GB of snappy-compressed Parquet output (excluding the ~5GB, billion-row epacems table).
  • We currently have no way to define row groups deliberately within the Parquet output, but only 2 tables are larger than 100MB and only 20 are larger than 10MB, so this isn't a huge issue.
  • Most columns with constrained values are not ENUMs; instead they have FK constraints that point to one of the small tables of codes. We could potentially reduce the output Parquet file size significantly by automatically converting columns constrained by FK relations to code tables into dictionary-encoded string columns.
  • The out_ferc714__hourly_predicted_state_demand Parquet file is 100MB with 6.7M records and 4 columns, while the core_ferc714__hourly_demand_pa Parquet file is 60MB with 15M records and 5 columns. Something seems off with the state demand table: it should probably be closer to 25MB than 100MB.
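
To make the size comparison in the last bullet concrete, here is a rough back-of-the-envelope sketch. It assumes "MB" means 10^6 bytes and that the two tables would compress similarly per row, which may not hold given their different column counts and value distributions. The function and variable names are illustrative only.

```python
def bytes_per_row(file_size_mb: float, n_rows: float) -> float:
    """Average on-disk bytes per record, treating 1 MB as 10**6 bytes (an assumption)."""
    return file_size_mb * 1e6 / n_rows


# Figures quoted in the issue text.
state_demand = bytes_per_row(100, 6.7e6)  # out_ferc714__hourly_predicted_state_demand
demand_pa = bytes_per_row(60, 15e6)       # core_ferc714__hourly_demand_pa

# Size the state demand file would have if it compressed as well per row
# as the planning-area demand table.
expected_state_mb = 6.7e6 * demand_pa / 1e6

print(f"state demand: {state_demand:.1f} bytes/row")
print(f"demand_pa:    {demand_pa:.1f} bytes/row")
print(f"expected state demand size: {expected_state_mb:.1f} MB")
```

At the per-row density of the planning-area table, the state demand file would come out near the ~25MB estimate rather than 100MB.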

Parquet Schema Optimizations

zaneselvans added the output, performance, and parquet labels on Jan 16, 2024
Projects
Status: New