Optimize PyArrow Parquet outputs #3246
Labels

- `output`: Exporting data from PUDL into other platforms or interchange formats.
- `parquet`: Issues related to the Apache Parquet file format which we use for long tables.
- `performance`: Make PUDL run faster!
With a little additional work, we can improve the PyArrow schemas and further reduce the size of the Parquet files we generate. All of the tables defined by our `Resource` definitions combined only add up to ~1.2GB of snappy compressed Parquet output (excluding the ~5GB, billion-row `epacems` table).

Many of our coded categorical columns are not `ENUM`s, and instead have FK constraints that point to one of the small tables of codes. We could potentially reduce the output Parquet file size significantly by automatically converting columns constrained by FK relations to code tables into dictionary-encoded string columns (see the sketch below).
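A rough sketch of what that conversion could look like in PyArrow; the table and column names here are made up for illustration, and in practice the set of columns to convert would be derived from the FK relationships in our metadata rather than hard-coded:

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Hypothetical table with a low-cardinality, FK-constrained code column
# that is currently stored as a plain string column.
table = pa.table({
    "plant_id": pa.array([1, 2, 3, 4], type=pa.int32()),
    "fuel_type_code": ["coal", "gas", "coal", "wind"],
})

# Dictionary-encode the code column so each distinct string is stored
# once, with compact integer indices per row.
idx = table.schema.get_field_index("fuel_type_code")
table = table.set_column(
    idx, "fuel_type_code", pc.dictionary_encode(table["fuel_type_code"])
)

pq.write_table(table, "example.parquet", compression="snappy")
```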
The `out_ferc714__hourly_predicted_state_demand` Parquet file is 100MB with 6.7M records and 4 columns, while the `core_ferc714__hourly_demand_pa` Parquet file is 60MB with 15M records and 5 columns. Something seems fishy with the state demand: it should probably be more like 25MB rather than 100MB.

Parquet Schema Optimizations
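As a first diagnostic for the oversized state demand table, a sketch like this could report compressed bytes per column (the file path here is an assumption about where the output lands):

```python
from collections import defaultdict

import pyarrow.parquet as pq

# Assumed path; point this at the real PUDL Parquet output.
pf = pq.ParquetFile("out_ferc714__hourly_predicted_state_demand.parquet")

# Sum compressed bytes per column across all row groups to see which
# column is responsible for most of the 100MB.
sizes: dict[str, int] = defaultdict(int)
for rg in range(pf.metadata.num_row_groups):
    row_group = pf.metadata.row_group(rg)
    for col in range(row_group.num_columns):
        chunk = row_group.column(col)
        sizes[chunk.path_in_schema] += chunk.total_compressed_size

for name, nbytes in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {nbytes / 1e6:.1f} MB")
```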