Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Plant part updates to fix RMI CI memory issues #1865

Merged
merged 29 commits into from
Sep 20, 2022
Merged

Plant part updates to fix RMI CI memory issues #1865

merged 29 commits into from
Sep 20, 2022

Conversation

katie-lamb
Copy link
Member

@katie-lamb katie-lamb commented Aug 25, 2022

Currently my RMI branch (update-ci) references this PUDL branch. This branch should get merged first before the RMI CI branch.

There are two changes that make the PPL more memory efficient:

  • clears the pudl_out._dfs cache after all the plant parts list inputs are compiled and before the start of the plant parts list compilation
  • converts columns in the plant parts list to strings or categoricals to be more memory efficient

Is this type of update worthy of including in the release notes??

@katie-lamb katie-lamb added metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. ppe Plant Parts EIA (formerly the EIA plant parts list) data-types Dtype conversions, standardization and implications of data types rmi labels Aug 25, 2022
@katie-lamb katie-lamb self-assigned this Aug 25, 2022
@@ -2120,6 +2152,17 @@

FIELD_METADATA_BY_RESOURCE: dict[str, dict[str, Any]] = {
"sector_consolidated_eia": {"code": {"type": "integer"}},
"plant_parts_eia": {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't the nicest to make all the category columns part of the resource specific metadata but when I included categorical columns in the general FIELD_METADATA there were some errors from string columns becoming category so I decided it's safest to do all the categorical dtype conversions at the end on a resource specific level.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably more correct for a bunch of our coded columns to be categoricals in the database, but they're almost all strings right now, and I'm sure we rely on that typing in a ton of places, so this seems fine for now. I made an issue to remember this in, if you have thoughts to add: #1866

@katie-lamb katie-lamb added the ccai Tasks related to CCAI grant for entity matching label Aug 25, 2022
@codecov
Copy link

codecov bot commented Aug 25, 2022

Codecov Report

Base: 83.2% // Head: 83.3% // Increases project coverage by +0.0% 🎉

Coverage data is based on head (eb32d7b) compared to base (c734d05).
Patch coverage: 96.4% of modified lines in pull request are covered.

Additional details and impacted files
@@          Coverage Diff           @@
##             dev   #1865    +/-   ##
======================================
  Coverage   83.2%   83.3%            
======================================
  Files         65      67     +2     
  Lines       7398    7773   +375     
======================================
+ Hits        6158    6476   +318     
- Misses      1240    1297    +57     
Impacted Files Coverage Δ
src/pudl/metadata/resources/eia.py 100.0% <ø> (ø)
src/pudl/metadata/classes.py 82.0% <66.6%> (-0.1%) ⬇️
src/pudl/analysis/plant_parts_eia.py 96.6% <100.0%> (+<0.1%) ⬆️
src/pudl/helpers.py 89.9% <100.0%> (+2.2%) ⬆️
src/pudl/metadata/enums.py 100.0% <100.0%> (ø)
src/pudl/metadata/fields.py 100.0% <100.0%> (ø)
src/pudl/output/epacems.py 80.3% <0.0%> (-5.8%) ⬇️
src/pudl/output/pudltabl.py 88.3% <0.0%> (-0.7%) ⬇️
src/pudl/extract/epacems.py 97.1% <0.0%> (-0.2%) ⬇️
src/pudl/etl.py 96.1% <0.0%> (-0.1%) ⬇️
... and 12 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

src/pudl/helpers.py Outdated Show resolved Hide resolved
@@ -2120,6 +2152,17 @@

FIELD_METADATA_BY_RESOURCE: dict[str, dict[str, Any]] = {
"sector_consolidated_eia": {"code": {"type": "integer"}},
"plant_parts_eia": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably more correct for a bunch of our coded columns to be categoricals in the database, but they're almost all strings right now, and I'm sure we rely on that typing in a ton of places, so this seems fine for now. I made an issue to remember this in, if you have thoughts to add: #1866

src/pudl/metadata/fields.py Outdated Show resolved Hide resolved
@@ -14,6 +14,7 @@
"date": "datetime64[ns]",
"datetime": "datetime64[ns]",
"year": "datetime64[ns]",
"category": "category",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to think about this a little bit in the context of how our resource definition & typing system works.

Right now we can set an enum constraint on a column, which results in a CHECK VALUE IN () OR NULL constraint being applied to the column when it's loaded into the database. How should that kind of constraint relate to the specification that a column is a category when it's loaded into a dataframe?

@katie-lamb
Copy link
Member Author

The latest iteration adds the plant parts list as a Resource and adds the attributes whose type I want to explicitly set as a Field. Then, Resource.to_pandas_dtypes retrieves the type of these attributes and they are set in MakePlantParts.

A few notes:

  • Right now one can either:
    • Set an enum constraint on a column and set it to a python type. Later, Field.to_pandas_dtype will set this column to Pandas Categorical
    • Set a column type to a Pandas category with no enum constraint
  • I thought there might be an instance where you want to set a column to category to save memory but don't actually care about constraining the values or updating the constraints when new values are introduced
  • If this discrepancy is too weird, I could probably add the functionality to set value constraints on category types. Or it could be handled with Update our schemas to use categoricals where appropriate #1866
  • Finally, I didn't add all the plant part list fields to fields.py, only the fields that needed their type explicitly set. Should I add the remaining 10 or so fields? I thought that I wouldn't require adding all the fields to the Resource because I've been continuously adding columns for experimenting with Panda. But maybe it's better practice to add all the fields in this output version of the plant part list.

@@ -2138,6 +2171,15 @@
},
}
},
"plant_parts_eia": {
"energy_source_code_1": {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The energy_source_code_1 and prime_movers_eia resource override enum constraints could maybe just be moved to the general entry in FIELD_METADATA

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the database tables these are both constrained by virtue of their FK relationships to the energy_sources_eia and prime_movers_eia coding tables, so adding the enums to the field definitions would be duplicative. But for the purpose of imposing those constraints on this free floating output table it seems like the ENUM constraint makes sense.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, makes sense.

Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this dynamically applied categorical type would be useful, but let's see if we can find an easy way to implement it without violating the TableSchema spec.

@@ -576,7 +576,7 @@ class Field(Base):

name: SnakeCase
type: Literal[ # noqa: A003
"string", "number", "integer", "boolean", "date", "datetime", "year"
"string", "number", "integer", "boolean", "date", "datetime", "year", "category"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

category is not an allowed field type under the TableSchema definition, so this violates the referenced standard.

https://specs.frictionlessdata.io/table-schema/#types-and-formats

Under the TableSchema specification, this functionality is implemented with a type (e.g. string) plus an enum constraint that enumerates the values the field is allowed to take on, so we should probably figure out how to implement the desired functionality within that framework.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I'll take out all the category types and instead add an enum constraint. These fields can be cast to categoricals later down the line, with Field.to_pandas_dtype

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but somehow we have to convey to to_pandas_dtype() that it should be treated as a categorical, not just a vanilla string. I think that that's what happens if you've got an enum constrained string, so if you add the enum in the PPE overrides, it should work maybe?


Metadata
^^^^^^^^
* Used the data source metadata class added in release 0.6.0 to dynamically generate
the data source documentation (See :doc:`data_sources/index`). :pr:`1532`
* Column attributes may now be the pandas ``Categorical`` data type. Using categorical
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will break the system as it's implemented right now.

Column types are independent of Pandas (and SQLite, and Parquet, etc) and are defined by the TableSchema specificaton. We then define mappings from that generic TableSchema type to the specific types for the various outputs (Pandas, SQLite, pyarrow, etc.).

It seems like the desired functionality is an unenumerated categorical column? Where we're telling the system "Please treat this as a categorical value, but figure out what the categories are dynamically based on the contents of the column, without imposing any constraints on the values."

This seems like it should not be implemented in the column type (since there is no categorical type under the standard, just enum constrained values of a given type), but rather in the process of translating to the various output formats. Somehow we need to communicate to the output that some columns should be treated as categoricals when they're put into a dataframe or parquet file, even if they aren't constrained.

@zaneselvans
Copy link
Member

I think if you're defining a PPE Resource all of the fields that are part of the Resource have to be defined, don't they? Or are you asking it to process dataframes that have columns (which you want to retain) that aren't part of the Resource?

I think the high level function that we eventually really want to be using here is Resource.format_df() -- rather than all of the lower level components. format_df() ensures that all and only the columns defined in the resource exist in the dataframe, and that they conform to the types as specified. Basically it's the final preparatory step for loading the dataframe into a database table.

If a column is useful and pretty stable -- assumed to exist in analyses and functions that we have in the repo -- then it should probably be in the schema with a type, etc.

@katie-lamb
Copy link
Member Author

katie-lamb commented Sep 12, 2022

I've now taken out the "type": "category" option from the field metadata and added an enum constraint for every PPE column that should be a categorical. In MakePlantParts I use Resource.format_df() which casts these enum constrained columns to pandas categorical types.

All columns from the PPE are now included in the field metadata.

docs/release_notes.rst Outdated Show resolved Hide resolved
docs/release_notes.rst Outdated Show resolved Hide resolved
src/pudl/analysis/plant_parts_eia.py Show resolved Hide resolved
@@ -213,3 +213,35 @@
"Unknown Code", # Should be replaced with NA
]
"""Valid emissions measurement codes for the EPA CEMS hourly data."""

TECH_DESCRIPTIONS: list[str] = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the ordering isn't informational and duplicates should be prohibited, it might make sense for this to be a set rather than a list. I guess this goes for all of the enums. I wonder if there's some reason that doesn't work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried converting all the list enums to set which broke a bunch of stuff because some of the enums are used in for concatenation with a list. Just updated the enums I added (TECH_DESCRIPTIONS and PLANT_PARTS) to be sets.

Comment on lines 62 to 69
"plant",
"plant_unit",
"plant_prime_mover",
"plant_technology",
"plant_prime_fuel",
"plant_ferc_acct",
"plant_operating_year",
"plant_gen",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this list of values stored somewhere else independently, that it can be referenced in case it were to change? I think the home definition should probably be one of:

pudl.analysis.plant_parts_eia.PLANT_PARTS_ORDERED
pudl.analysis.plant_parts_eia.PLANT_PARTS.keys()

(side note: if PLANT_PARTS were an OrderedDict I think we could get rid of PLANT_PARTS_ORDERED and just use PLANT_PARTS.keys() directly)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got a circular import error when I tried to import pudl.analysis.plant_parts_eia.PLANT_PARTS into pudl.metadata.fields.py or pudl.metadata.enums.py. I added PLANT_PARTS as an enum to at least take out the duplication of this list in fields.py.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the PLANT_PARTS dictionary to an OrderedDict and removed PLANT_PARTS_ORDERED.

src/pudl/metadata/fields.py Outdated Show resolved Hide resolved
Comment on lines 1416 to 1420
"ownership": {
"type": "string",
"description": "Whether each generator record is for one owner or represents a total of all ownerships.",
"constraints": {"enum": ["owned", "total"]},
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having very generic column names like this in the global namespace can get confusing. Could it be renamed to be more descriptive / clear about its usage context? Maybe something like ownership_record_type ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to ownership_record_type.

"description": "The part of the plant a record corresponds to.",
"constraints": {
"enum": [
"plant",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As with appro_part_label above it would be better if we could refer to the single source of truth for this list directly.

Copy link
Member Author

@katie-lamb katie-lamb Sep 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment about circular import error.

@@ -285,6 +285,54 @@
"etl_group": "entity_eia",
"field_namespace": "eia",
},
"plant_parts_eia": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For completeness I think it might be good to add the sources (what is the PPE built out of?) and field_namespace of ppe, and maybe an etl_group of outputs or something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filled out these resource fields.

src/pudl/metadata/fields.py Outdated Show resolved Hide resolved
Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one mistake I think in the release notes, referring to plant_name_eia rather than plant_name_ppe.

src/pudl/analysis/plant_parts_eia.py Show resolved Hide resolved
metadata is now included in the PUDL metadata.
metadata is now included in the PUDL metadata. See :pr:`1865`
* For clarity and specificity, the ``plant_name_new`` column was renamed
``plant_name_eia`` and the ``ownership`` column was renamed ``ownership_record_type``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean plant_name_ppe here, not plant_name_eia

"ferc_acct_name": {
"type": "string",
"description": "Name of FERC account, derived from technology description and prime mover code.",
"constraints": {"enum": ["Hydraulic", "Nuclear", "Steam", "Other"]},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to go with the Title Case values here rather than standardizing to lower case like we do almost everywhere else? Are we not normalizing strings in this column? Where does it come from?

Copy link
Member Author

@katie-lamb katie-lamb Sep 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These values are read in in pudl.helpers.get_eia_ferc_acct_map(), which reads in pudl.package_data.glue.ferc_acct_to_pm_tech_map.csv. In this CSV all fields (technology_description, prime_mover_code, ferc_acct_name) use Title Case. This table is then merged with the MCOE table to create the mega generators table. TitleCase is maintained in the merge keys and merged on fields in pudl.analysis.plant_parts_eia.get_gens_mega_table(). Since all of the fields use TitleCase (and a few others in the plant parts list/MCOE table), I'm leaving this as is for now.

@katie-lamb katie-lamb merged commit 2ddd197 into dev Sep 20, 2022
@katie-lamb katie-lamb deleted the rmi-ci-fixes branch September 20, 2022 19:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ccai Tasks related to CCAI grant for entity matching data-types Dtype conversions, standardization and implications of data types metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. ppe Plant Parts EIA (formerly the EIA plant parts list) rmi
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants