Optimize PyArrow Parquet outputs #3246

zaneselvans · 2024-01-16T23:12:40Z

With a little additional work, we can improve the PyArrow schemas and further reduce the size of the Parquet files we generate.

Incredibly, all 190 Resource definitions combined add up to only ~1.2GB of snappy-compressed Parquet output (excluding the ~5GB, billion-row epacems table).
We currently have no way to deliberately define row groups within the Parquet output, but only 2 tables are larger than 100MB and only 20 are larger than 10MB, so this isn't a huge issue.
Most columns whose values are constrained are not ENUMs; instead they have FK constraints that point to one of the small tables of codes. We could potentially reduce the output Parquet file size significantly by automatically converting columns constrained by FK relations to codes into dictionary-encoded string columns.
The out_ferc714__hourly_predicted_state_demand Parquet file is 100MB with 6.7M records and 4 columns, while the core_ferc714__hourly_demand_pa Parquet file is 60MB with 15M records and 5 columns. Something seems fishy with the state demand table. It should probably be more like 25MB rather than 100MB.

Parquet Schema Optimizations

Improve Resource.to_pyarrow() to use categorical types when a field has a foreign key constraint that ties it to a set of codes in one of the coding tables.
Modify the Resource class to allow customization of the row-groups that are created when we're writing Parquet files. This information would be accessed within the Parquet IO Manager.
Figure out why the out_ferc714__hourly_predicted_state_demand table is ~4x as large as we expect it to be, and fix it.

zaneselvans added the output, performance, and parquet labels Jan 16, 2024

zaneselvans added this to Catalyst Megaproject Jan 16, 2024

github-project-automation bot moved this to New in Catalyst Megaproject Jan 16, 2024

zaneselvans mentioned this issue Jan 16, 2024

Output PUDL as Parquet as well as SQLite #3102

Closed
