Plant part updates to fix RMI CI memory issues #1865

katie-lamb · 2022-08-25T12:56:29Z

Currently my RMI branch (update-ci) references this PUDL branch. This branch should get merged first before the RMI CI branch.

There are two changes that make the PPL more memory efficient:

- Clears the pudl_out._dfs cache after all the plant parts list inputs are compiled and before the plant parts list compilation starts.
- Converts columns in the plant parts list to strings or categoricals to be more memory efficient.
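
As a rough illustration of why the categorical conversion helps, here is a minimal sketch comparing the memory footprint of an object column and a categorical column. The column values and row count are made up for the example and are not taken from the real plant parts list.

```python
import pandas as pd

# Hypothetical stand-in for one low-cardinality column of the plant parts list.
values = ["plant", "plant_unit", "plant_gen"] * 10_000

as_object = pd.Series(values, dtype="object")
as_category = as_object.astype("category")

# deep=True counts the Python string objects, not just the pointers to them.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The saving depends on how few distinct values the column holds relative to its length, so a high-cardinality column may not benefit.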

Is this type of update worthy of including in the release notes?

src/pudl/analysis/plant_parts_eia.py

katie-lamb · 2022-08-25T13:03:33Z

src/pudl/metadata/fields.py

@@ -2120,6 +2152,17 @@

 FIELD_METADATA_BY_RESOURCE: dict[str, dict[str, Any]] = {
    "sector_consolidated_eia": {"code": {"type": "integer"}},
+    "plant_parts_eia": {


Making all the category columns part of the resource-specific metadata isn't the nicest approach, but when I included categorical columns in the general FIELD_METADATA, some string columns became categories and caused errors. So I decided it's safest to do all the categorical dtype conversions at the end, at the resource-specific level.

codecov · 2022-08-25T13:22:37Z

Codecov Report

Base: 83.2% // Head: 83.3% // Increases project coverage by +0.0% 🎉

Coverage data is based on head (eb32d7b) compared to base (c734d05).
Patch coverage: 96.4% of modified lines in pull request are covered.

Additional details and impacted files

@@          Coverage Diff           @@
##             dev   #1865    +/-   ##
======================================
  Coverage   83.2%   83.3%            
======================================
  Files         65      67     +2     
  Lines       7398    7773   +375     
======================================
+ Hits        6158    6476   +318     
- Misses      1240    1297    +57

Impacted Files	Coverage Δ
src/pudl/metadata/resources/eia.py	`100.0% <ø> (ø)`
src/pudl/metadata/classes.py	`82.0% <66.6%> (-0.1%)`	⬇️
src/pudl/analysis/plant_parts_eia.py	`96.6% <100.0%> (+<0.1%)`	⬆️
src/pudl/helpers.py	`89.9% <100.0%> (+2.2%)`	⬆️
src/pudl/metadata/enums.py	`100.0% <100.0%> (ø)`
src/pudl/metadata/fields.py	`100.0% <100.0%> (ø)`
src/pudl/output/epacems.py	`80.3% <0.0%> (-5.8%)`	⬇️
src/pudl/output/pudltabl.py	`88.3% <0.0%> (-0.7%)`	⬇️
src/pudl/extract/epacems.py	`97.1% <0.0%> (-0.2%)`	⬇️
src/pudl/etl.py	`96.1% <0.0%> (-0.1%)`	⬇️
... and 12 more

src/pudl/helpers.py

zaneselvans · 2022-08-25T13:35:29Z

src/pudl/metadata/fields.py

@@ -2120,6 +2152,17 @@

 FIELD_METADATA_BY_RESOURCE: dict[str, dict[str, Any]] = {
    "sector_consolidated_eia": {"code": {"type": "integer"}},
+    "plant_parts_eia": {


It's probably more correct for a bunch of our coded columns to be categoricals in the database, but they're almost all strings right now, and I'm sure we rely on that typing in a ton of places, so this seems fine for now. I made an issue to remember this in, if you have thoughts to add: #1866

src/pudl/metadata/fields.py

zaneselvans · 2022-08-25T13:54:48Z

src/pudl/metadata/constants.py

@@ -14,6 +14,7 @@
    "date": "datetime64[ns]",
    "datetime": "datetime64[ns]",
    "year": "datetime64[ns]",
+    "category": "category",


We might want to think about this a little bit in the context of how our resource definition & typing system works.

Right now we can set an enum constraint on a column, which results in a CHECK VALUE IN () OR NULL constraint being applied to the column when it's loaded into the database. How should that kind of constraint relate to the specification that a column is a category when it's loaded into a dataframe?
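
To make the database side of this question concrete, here is a minimal sketch of an enum rendered as a CHECK constraint, using only the standard-library sqlite3 module. The table name, column name, and the exact SQL that PUDL generates are assumptions for illustration.

```python
import sqlite3

# Hypothetical enum for illustration; the real values live in the PUDL metadata.
allowed = ["owned", "total"]
quoted = ", ".join(f"'{value}'" for value in allowed)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo (record_type TEXT "
    f"CHECK (record_type IN ({quoted}) OR record_type IS NULL))"
)

# Values in the enum, and NULL, are accepted.
conn.execute("INSERT INTO demo VALUES ('owned')")
conn.execute("INSERT INTO demo VALUES (NULL)")

# A value outside the enum violates the CHECK constraint.
try:
    conn.execute("INSERT INTO demo VALUES ('leased')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
```

The database rejects the bad value loudly, which is a stricter behavior than a dataframe categorical cast would give by default.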

katie-lamb · 2022-09-09T14:27:11Z

The latest iteration adds the plant parts list as a Resource and adds the attributes whose type I want to explicitly set as a Field. Then, Resource.to_pandas_dtypes retrieves the type of these attributes and they are set in MakePlantParts.

A few notes:

Right now one can either:
- Set an enum constraint on a column and set it to a Python type. Later, Field.to_pandas_dtype will set this column to a pandas Categorical.
- Set a column type to a pandas category with no enum constraint.

I thought there might be cases where you want to set a column to category to save memory but don't care about constraining the values, or about updating the constraints when new values are introduced.

If this discrepancy is too weird, I could probably add the functionality to set value constraints on category types. Or it could be handled in #1866 (Update our schemas to use categoricals where appropriate).

Finally, I didn't add all the plant part list fields to fields.py, only the fields that needed their type explicitly set. Should I add the remaining 10 or so fields? I thought I wouldn't require adding all the fields to the Resource because I've been continuously adding columns while experimenting with Panda. But maybe it's better practice to add all the fields in this output version of the plant part list.

katie-lamb · 2022-09-09T14:33:46Z

src/pudl/metadata/fields.py

@@ -2138,6 +2171,15 @@
            },
        }
    },
+    "plant_parts_eia": {
+        "energy_source_code_1": {


The energy_source_code_1 and prime_movers_eia resource override enum constraints could maybe just be moved to the general entry in FIELD_METADATA

In the database tables these are both constrained by virtue of their FK relationships to the energy_sources_eia and prime_movers_eia coding tables, so adding the enums to the field definitions would be duplicative. But for the purpose of imposing those constraints on this free floating output table it seems like the ENUM constraint makes sense.

Ah right, makes sense.

zaneselvans

I agree this dynamically applied categorical type would be useful, but let's see if we can find an easy way to implement it without violating the TableSchema spec.

zaneselvans · 2022-09-09T17:13:48Z

src/pudl/metadata/classes.py

@@ -576,7 +576,7 @@ class Field(Base):

    name: SnakeCase
    type: Literal[  # noqa: A003
-        "string", "number", "integer", "boolean", "date", "datetime", "year"
+        "string", "number", "integer", "boolean", "date", "datetime", "year", "category"


category is not an allowed field type under the TableSchema definition, so this violates the referenced standard.

https://specs.frictionlessdata.io/table-schema/#types-and-formats

Under the TableSchema specification, this functionality is implemented with a type (e.g. string) plus an enum constraint that enumerates the values the field is allowed to take on, so we should probably figure out how to implement the desired functionality within that framework.

Got it. I'll take out all the category types and instead add an enum constraint. These fields can be cast to categoricals later down the line, with Field.to_pandas_dtype

Yeah, but somehow we have to convey to to_pandas_dtype() that it should be treated as a categorical, not just a vanilla string. I think that that's what happens if you've got an enum constrained string, so if you add the enum in the PPE overrides, it should work maybe?
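
The idea discussed in this thread, a TableSchema string plus an enum constraint becoming a pandas categorical, could look roughly like the sketch below. The function sketch_to_pandas_dtype and the TYPE_MAP dictionary are hypothetical stand-ins, not the actual Field.to_pandas_dtype implementation.

```python
import pandas as pd

# Assumed mapping from TableSchema types to pandas dtypes (illustrative only).
TYPE_MAP = {
    "string": "string",
    "integer": "Int64",
    "number": "float64",
    "boolean": "boolean",
}


def sketch_to_pandas_dtype(field: dict):
    """Return a categorical dtype for enum-constrained strings, else a plain dtype."""
    enum = field.get("constraints", {}).get("enum")
    if enum and field["type"] == "string":
        return pd.CategoricalDtype(categories=enum)
    return TYPE_MAP[field["type"]]


constrained = {"type": "string", "constraints": {"enum": ["owned", "total"]}}
plain = {"type": "string"}

dtype = sketch_to_pandas_dtype(constrained)
col = pd.Series(["owned", "total", "owned"]).astype(dtype)

# Casting silently turns values outside the categories into missing values.
stray = pd.Series(["owned", "leased"]).astype(dtype)
```

Because the cast does not raise on unexpected values, a separate check that all values fall within the categories is worth having before conversion.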

zaneselvans · 2022-09-09T17:24:56Z

docs/release_notes.rst


 Metadata
 ^^^^^^^^
 * Used the data source metadata class added in release 0.6.0 to dynamically generate
  the data source documentation (See :doc:`data_sources/index`). :pr:`1532`
+* Column attributes may now be the pandas ``Categorical`` data type. Using categorical


I think this will break the system as it's implemented right now.

Column types are independent of Pandas (and SQLite, and Parquet, etc) and are defined by the TableSchema specification. We then define mappings from that generic TableSchema type to the specific types for the various outputs (Pandas, SQLite, pyarrow, etc.).

It seems like the desired functionality is an unenumerated categorical column? Where we're telling the system "Please treat this as a categorical value, but figure out what the categories are dynamically based on the contents of the column, without imposing any constraints on the values."

This seems like it should not be implemented in the column type (since there is no categorical type under the standard, just enum constrained values of a given type), but rather in the process of translating to the various output formats. Somehow we need to communicate to the output that some columns should be treated as categoricals when they're put into a dataframe or parquet file, even if they aren't constrained.
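
One possible shape for this, sketched under assumptions: the schema type stays a plain string, and a separate hint passed at output time says which columns to store as categoricals. The function apply_output_dtypes, its categorical_hints parameter, and the column names are all hypothetical and do not exist in PUDL.

```python
import pandas as pd


def apply_output_dtypes(df: pd.DataFrame, categorical_hints: set[str]) -> pd.DataFrame:
    """Cast hinted columns to categoricals, inferring categories from the data."""
    out = df.copy()
    for col in categorical_hints:
        # No categories are supplied, so pandas derives them from the column
        # contents and no value can be rejected.
        out[col] = out[col].astype("category")
    return out


df = pd.DataFrame(
    {
        "free_text_code": ["x2", "x1", "x2"],  # hypothetical unconstrained column
        "capacity_mw": [1.0, 2.0, 3.0],
    }
)
formatted = apply_output_dtypes(df, categorical_hints={"free_text_code"})
```

Keeping the hint outside the field type means the schema itself still conforms to the TableSchema list of allowed types.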

zaneselvans · 2022-09-09T17:44:55Z

I think if you're defining a PPE Resource all of the fields that are part of the Resource have to be defined, don't they? Or are you asking it to process dataframes that have columns (which you want to retain) that aren't part of the Resource?

I think the high level function that we eventually really want to be using here is Resource.format_df() -- rather than all of the lower level components. format_df() ensures that all and only the columns defined in the resource exist in the dataframe, and that they conform to the types as specified. Basically it's the final preparatory step for loading the dataframe into a database table.

If a column is useful and pretty stable -- assumed to exist in analyses and functions that we have in the repo -- then it should probably be in the schema with a type, etc.
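
The "all and only the columns defined in the resource" behavior described above can be sketched as follows. This is an illustration of the idea, not the real Resource.format_df(); the function name sketch_format_df, the scratch_col column, and the sample rows are assumptions.

```python
import pandas as pd


def sketch_format_df(df: pd.DataFrame, dtypes: dict) -> pd.DataFrame:
    """Keep exactly the schema's columns, in schema order, cast to schema dtypes."""
    # reindex adds any missing schema columns (filled with NA) and drops extras.
    out = df.reindex(columns=list(dtypes))
    return out.astype(dtypes)


# Assumed schema: two enum-constrained columns and one plain string column.
schema_dtypes = {
    "plant_part": pd.CategoricalDtype(["plant", "plant_unit", "plant_gen"]),
    "ownership_record_type": pd.CategoricalDtype(["owned", "total"]),
    "plant_name_ppe": "string",
}

raw = pd.DataFrame(
    {
        "plant_name_ppe": ["Example Plant", "Example Plant 1"],
        "plant_part": ["plant", "plant_gen"],
        "scratch_col": [1, 2],  # experimental column that is not in the schema
    }
)
formatted = sketch_format_df(raw, schema_dtypes)
```

With this shape, an experimental column simply disappears from the output unless it is added to the schema, which is the trade-off raised in the question above.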

katie-lamb · 2022-09-12T18:22:27Z

I've now taken out the "type": "category" option from the field metadata and added an enum constraint for every PPE column that should be a categorical. In MakePlantParts I use Resource.format_df() which casts these enum constrained columns to pandas categorical types.

All columns from the PPE are now included in the field metadata.

docs/release_notes.rst

src/pudl/analysis/plant_parts_eia.py

zaneselvans · 2022-09-13T16:35:50Z

src/pudl/metadata/enums.py

@@ -213,3 +213,35 @@
    "Unknown Code",  # Should be replaced with NA
 ]
 """Valid emissions measurement codes for the EPA CEMS hourly data."""
+
+TECH_DESCRIPTIONS: list[str] = [


If the ordering isn't informational and duplicates should be prohibited, it might make sense for this to be a set rather than a list. I guess this goes for all of the enums. I wonder if there's some reason that doesn't work.

I tried converting all the list enums to sets, which broke a bunch of stuff because some of the enums are used for concatenation with a list. Just updated the enums I added (TECH_DESCRIPTIONS and PLANT_PARTS) to be sets.
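
The breakage mentioned here comes from the fact that Python lists support + concatenation and sets do not. A minimal sketch with placeholder values (not the real enums):

```python
# Placeholder enums for illustration only.
ENUM_AS_LIST = ["a", "b"]
ENUM_AS_SET = {"a", "b"}

# Lists concatenate with other lists.
combined = ENUM_AS_LIST + ["c"]

# Sets do not support + with a list, so existing call sites would raise.
try:
    ENUM_AS_SET + ["c"]
    set_concat_failed = False
except TypeError:
    set_concat_failed = True

# With sets, each call site would need an explicit conversion, and an explicit
# sort if a stable order matters.
combined_from_set = sorted(ENUM_AS_SET) + ["c"]
```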

zaneselvans · 2022-09-13T16:37:45Z

src/pudl/metadata/fields.py

+                "plant",
+                "plant_unit",
+                "plant_prime_mover",
+                "plant_technology",
+                "plant_prime_fuel",
+                "plant_ferc_acct",
+                "plant_operating_year",
+                "plant_gen",


Is this list of values stored somewhere else independently, that it can be referenced in case it were to change? I think the home definition should probably be one of:

- pudl.analysis.plant_parts_eia.PLANT_PARTS_ORDERED
- pudl.analysis.plant_parts_eia.PLANT_PARTS.keys()

(side note: if PLANT_PARTS were an OrderedDict I think we could get rid of PLANT_PARTS_ORDERED and just use PLANT_PARTS.keys() directly)

I got a circular import error when I tried to import pudl.analysis.plant_parts_eia.PLANT_PARTS into pudl.metadata.fields.py or pudl.metadata.enums.py. I added PLANT_PARTS as an enum to at least take out the duplication of this list in fields.py.

I changed the PLANT_PARTS dictionary to an OrderedDict and removed PLANT_PARTS_ORDERED.
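
A small sketch of why an ordered mapping makes a separate ordered list redundant. The name PLANT_PARTS_SKETCH is hypothetical, only three of the plant part names are shown, and the empty dict values are placeholders for whatever each entry really holds.

```python
from collections import OrderedDict

# Abbreviated, hypothetical version of the mapping; values are placeholders.
PLANT_PARTS_SKETCH = OrderedDict(
    [
        ("plant", {}),
        ("plant_unit", {}),
        ("plant_gen", {}),
    ]
)

# The keys come back in insertion order, so no parallel ordered list is needed.
ordered_names = list(PLANT_PARTS_SKETCH.keys())
```

Plain dicts also preserve insertion order since Python 3.7, so OrderedDict mainly serves to make the reliance on ordering explicit to readers.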

src/pudl/metadata/fields.py

zaneselvans · 2022-09-13T16:43:38Z

src/pudl/metadata/fields.py

+    "ownership": {
+        "type": "string",
+        "description": "Whether each generator record is for one owner or represents a total of all ownerships.",
+        "constraints": {"enum": ["owned", "total"]},
+    },


Having very generic column names like this in the global namespace can get confusing. Could it be renamed to be more descriptive / clear about its usage context? Maybe something like ownership_record_type?

Changed to ownership_record_type.

zaneselvans · 2022-09-13T16:44:40Z

src/pudl/metadata/fields.py

+        "description": "The part of the plant a record corresponds to.",
+        "constraints": {
+            "enum": [
+                "plant",


As with appro_part_label above it would be better if we could refer to the single source of truth for this list directly.

See my comment about circular import error.

zaneselvans · 2022-09-13T16:56:54Z

src/pudl/metadata/resources/eia.py

@@ -285,6 +285,54 @@
        "etl_group": "entity_eia",
        "field_namespace": "eia",
    },
+    "plant_parts_eia": {


For completeness I think it might be good to add the sources (what is the PPE built out of?) and field_namespace of ppe, and maybe an etl_group of outputs or something?

Filled out these resource fields.

src/pudl/metadata/fields.py

This reverts commit cef65be.

zaneselvans

Just one mistake I think in the release notes, referring to plant_name_eia rather than plant_name_ppe.

src/pudl/analysis/plant_parts_eia.py

zaneselvans · 2022-09-19T19:07:20Z

docs/release_notes.rst

-  metadata is now included in the PUDL metadata.
+  metadata is now included in the PUDL metadata. See :pr:`1865`
+* For clarity and specificity, the ``plant_name_new`` column was renamed
+  ``plant_name_eia`` and the ``ownership`` column was renamed ``ownership_record_type``.


I think you mean plant_name_ppe here, not plant_name_eia

zaneselvans · 2022-09-19T19:14:23Z

src/pudl/metadata/fields.py

+    "ferc_acct_name": {
+        "type": "string",
+        "description": "Name of FERC account, derived from technology description and prime mover code.",
+        "constraints": {"enum": ["Hydraulic", "Nuclear", "Steam", "Other"]},


Is there a reason to go with the Title Case values here rather than standardizing to lower case like we do almost everywhere else? Are we not normalizing strings in this column? Where does it come from?

These values are read in by pudl.helpers.get_eia_ferc_acct_map(), which reads pudl.package_data.glue.ferc_acct_to_pm_tech_map.csv. In this CSV all fields (technology_description, prime_mover_code, ferc_acct_name) use Title Case. This table is then merged with the MCOE table to create the mega generators table. Title Case is maintained in the merge keys and merged-on fields in pudl.analysis.plant_parts_eia.get_gens_mega_table(). Since all of these fields (and a few others in the plant parts list/MCOE table) use Title Case, I'm leaving this as is for now.
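
The reason the casing can't be changed on one side only is that merges match keys exactly. A minimal sketch with made-up rows (the frames below are not the real mapping CSV or generators table):

```python
import pandas as pd

# Hypothetical rows standing in for the account mapping and the generators table.
acct_map = pd.DataFrame(
    {"technology_description": ["Nuclear"], "ferc_acct_name": ["Nuclear"]}
)
gens = pd.DataFrame(
    {"generator_id": ["1"], "technology_description": ["Nuclear"]}
)

# Matching case on both sides: the merge finds the account name.
matched = gens.merge(acct_map, on="technology_description", how="left")

# Lower-casing only one side: the key no longer matches and the result is NA.
gens_lower = gens.assign(
    technology_description=gens["technology_description"].str.lower()
)
unmatched = gens_lower.merge(acct_map, on="technology_description", how="left")
```

Standardizing to lower case would therefore mean normalizing the CSV, the merge keys, and the enum values together.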

katie-lamb added 5 commits July 26, 2022 22:56

clear cache

2fb6948

actually clear cache

a9e1cf1

Merge branch 'dev' into rmi-ci-fixes

870b45f

change ppl dtypes in module

e766d8b

Merge branch 'dev' into rmi-ci-fixes

81ea97d

katie-lamb added the metadata (Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages.), ppe (Plant Parts EIA, formerly the EIA plant parts list), data-types (Dtype conversions, standardization and implications of data types), and rmi labels Aug 25, 2022

katie-lamb requested review from zaneselvans and cmgosnell August 25, 2022 12:56

katie-lamb self-assigned this Aug 25, 2022

katie-lamb commented Aug 25, 2022

katie-lamb commented Aug 25, 2022

katie-lamb mentioned this pull request Aug 25, 2022

Improve CI workflows, add autoformatters / linters catalyst-cooperative/rmi-ferc1-eia#236

Merged

katie-lamb added the ccai (Tasks related to CCAI grant for entity matching) label Aug 25, 2022

zaneselvans mentioned this pull request Aug 25, 2022

Update our schemas to use categoricals where appropriate #1866

Open

zaneselvans reviewed Aug 25, 2022

katie-lamb added 7 commits August 30, 2022 13:43

Merge branch 'dev' into rmi-ci-fixes

6780779

take out some category fields from metadata

eac2521

take out old docstring

bc5257a

Merge branch 'dev' into rmi-ci-fixes

049ee55

added ppl to resources metadata

083f491

took out format df so not all fields must be included

aa67d3b

Merge branch 'dev' into rmi-ci-fixes

ba47985

katie-lamb commented Sep 9, 2022

updated release notes

f614860

zaneselvans requested changes Sep 9, 2022

katie-lamb added 5 commits September 12, 2022 12:35

took out all category types and put in enum

5262aa2

added all fields to field metadata

e50bbd3

changed to format df

a654c60

updated release notes

61516db

oops take out category from Field

3d3483d

take out BU

caf44b1

zaneselvans requested changes Sep 13, 2022

katie-lamb added 9 commits September 13, 2022 22:38

Merge branch 'dev' into rmi-ci-fixes

20fe508

check values are in categories

08778c8

update release notes to include pr ref

c625623

update to plant name eia, clean up enums, fill out resource metadata

6fb269f

changed ownership to onwership_record_type

b9197e0

udpate release notes

6db9a3c

change enum constraints to sets

cef65be

Revert "change enum constraints to sets"

19cd02f

This reverts commit cef65be.

take out plant parts ordered replace with ordered dict

4cc67c1

zaneselvans approved these changes Sep 19, 2022

zaneselvans reviewed Sep 19, 2022

fix typo in release notes

eb32d7b

katie-lamb merged commit 2ddd197 into dev Sep 20, 2022

katie-lamb deleted the rmi-ci-fixes branch September 20, 2022 19:37
