-
-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Plant part updates to fix RMI CI memory issues #1865
Conversation
src/pudl/metadata/fields.py
Outdated
@@ -2120,6 +2152,17 @@ | |||
|
|||
FIELD_METADATA_BY_RESOURCE: dict[str, dict[str, Any]] = { | |||
"sector_consolidated_eia": {"code": {"type": "integer"}}, | |||
"plant_parts_eia": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't the nicest to make all the category
columns part of the resource specific metadata but when I included categorical columns in the general FIELD_METADATA
there were some errors from string
columns becoming category
so I decided it's safest to do all the categorical dtype conversions at the end on a resource specific level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably more correct for a bunch of our coded columns to be categoricals in the database, but they're almost all strings right now, and I'm sure we rely on that typing in a ton of places, so this seems fine for now. I made an issue to remember this in, if you have thoughts to add: #1866
Codecov ReportBase: 83.2% // Head: 83.3% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## dev #1865 +/- ##
======================================
Coverage 83.2% 83.3%
======================================
Files 65 67 +2
Lines 7398 7773 +375
======================================
+ Hits 6158 6476 +318
- Misses 1240 1297 +57
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
src/pudl/metadata/fields.py
Outdated
@@ -2120,6 +2152,17 @@ | |||
|
|||
FIELD_METADATA_BY_RESOURCE: dict[str, dict[str, Any]] = { | |||
"sector_consolidated_eia": {"code": {"type": "integer"}}, | |||
"plant_parts_eia": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably more correct for a bunch of our coded columns to be categoricals in the database, but they're almost all strings right now, and I'm sure we rely on that typing in a ton of places, so this seems fine for now. I made an issue to remember this in, if you have thoughts to add: #1866
src/pudl/metadata/constants.py
Outdated
@@ -14,6 +14,7 @@ | |||
"date": "datetime64[ns]", | |||
"datetime": "datetime64[ns]", | |||
"year": "datetime64[ns]", | |||
"category": "category", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might want to think about this a little bit in the context of how our resource definition & typing system works.
Right now we can set an enum
constraint on a column, which results in a CHECK VALUE IN () OR NULL
constraint being applied to the column when it's loaded into the database. How should that kind of constraint relate to the specification that a column is a category
when it's loaded into a dataframe?
The latest iteration adds the plant parts list as a A few notes:
|
@@ -2138,6 +2171,15 @@ | |||
}, | |||
} | |||
}, | |||
"plant_parts_eia": { | |||
"energy_source_code_1": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The energy_source_code_1
and prime_movers_eia
resource override enum constraints could maybe just be moved to the general entry in FIELD_METADATA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the database tables these are both constrained by virtue of their FK relationships to the energy_sources_eia
and prime_movers_eia
coding tables, so adding the enums to the field definitions would be duplicative. But for the purpose of imposing those constraints on this free floating output table it seems like the ENUM constraint makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right, makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this dynamically applied categorical type would be useful, but let's see if we can find an easy way to implement it without violating the TableSchema spec.
src/pudl/metadata/classes.py
Outdated
@@ -576,7 +576,7 @@ class Field(Base): | |||
|
|||
name: SnakeCase | |||
type: Literal[ # noqa: A003 | |||
"string", "number", "integer", "boolean", "date", "datetime", "year" | |||
"string", "number", "integer", "boolean", "date", "datetime", "year", "category" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
category
is not an allowed field type under the TableSchema definition, so this violates the referenced standard.
https://specs.frictionlessdata.io/table-schema/#types-and-formats
Under the TableSchema specification, this functionality is implemented with a type (e.g. string
) plus an enum
constraint that enumerates the values the field is allowed to take on, so we should probably figure out how to implement the desired functionality within that framework.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I'll take out all the category
types and instead add an enum
constraint. These fields can be cast to categoricals later down the line, with Field.to_pandas_dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but somehow we have to convey to to_pandas_dtype()
that it should be treated as a categorical, not just a vanilla string. I think that that's what happens if you've got an enum constrained string, so if you add the enum in the PPE overrides, it should work maybe?
docs/release_notes.rst
Outdated
|
||
Metadata | ||
^^^^^^^^ | ||
* Used the data source metadata class added in release 0.6.0 to dynamically generate | ||
the data source documentation (See :doc:`data_sources/index`). :pr:`1532` | ||
* Column attributes may now be the pandas ``Categorical`` data type. Using categorical |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will break the system as it's implemented right now.
Column types are independent of Pandas (and SQLite, and Parquet, etc) and are defined by the TableSchema specificaton. We then define mappings from that generic TableSchema type to the specific types for the various outputs (Pandas, SQLite, pyarrow, etc.).
It seems like the desired functionality is an unenumerated categorical column? Where we're telling the system "Please treat this as a categorical value, but figure out what the categories are dynamically based on the contents of the column, without imposing any constraints on the values."
This seems like it should not be implemented in the column type (since there is no categorical type under the standard, just enum constrained values of a given type), but rather in the process of translating to the various output formats. Somehow we need to communicate to the output that some columns should be treated as categoricals when they're put into a dataframe or parquet file, even if they aren't constrained.
I think if you're defining a PPE I think the high level function that we eventually really want to be using here is If a column is useful and pretty stable -- assumed to exist in analyses and functions that we have in the repo -- then it should probably be in the schema with a type, etc. |
I've now taken out the All columns from the PPE are now included in the field metadata. |
src/pudl/metadata/enums.py
Outdated
@@ -213,3 +213,35 @@ | |||
"Unknown Code", # Should be replaced with NA | |||
] | |||
"""Valid emissions measurement codes for the EPA CEMS hourly data.""" | |||
|
|||
TECH_DESCRIPTIONS: list[str] = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the ordering isn't informational and duplicates should be prohibited, it might make sense for this to be a set
rather than a list
. I guess this goes for all of the enums. I wonder if there's some reason that doesn't work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried converting all the list
enums to set
which broke a bunch of stuff because some of the enums are used in for concatenation with a list. Just updated the enums I added (TECH_DESCRIPTIONS
and PLANT_PARTS
) to be sets.
src/pudl/metadata/fields.py
Outdated
"plant", | ||
"plant_unit", | ||
"plant_prime_mover", | ||
"plant_technology", | ||
"plant_prime_fuel", | ||
"plant_ferc_acct", | ||
"plant_operating_year", | ||
"plant_gen", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this list of values stored somewhere else independently, that it can be referenced in case it were to change? I think the home definition should probably be one of:
pudl.analysis.plant_parts_eia.PLANT_PARTS_ORDERED
pudl.analysis.plant_parts_eia.PLANT_PARTS.keys()
(side note: if PLANT_PARTS
were an OrderedDict
I think we could get rid of PLANT_PARTS_ORDERED
and just use PLANT_PARTS.keys()
directly)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got a circular import error when I tried to import pudl.analysis.plant_parts_eia.PLANT_PARTS
into pudl.metadata.fields.py
or pudl.metadata.enums.py
. I added PLANT_PARTS as an enum to at least take out the duplication of this list in fields.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the PLANT_PARTS
dictionary to an OrderedDict
and removed PLANT_PARTS_ORDERED
.
src/pudl/metadata/fields.py
Outdated
"ownership": { | ||
"type": "string", | ||
"description": "Whether each generator record is for one owner or represents a total of all ownerships.", | ||
"constraints": {"enum": ["owned", "total"]}, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having very generic column names like this in the global namespace can get confusing. Could it be renamed to be more descriptive / clear about its usage context? Maybe something like ownership_record_type
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to ownership_record_type
.
src/pudl/metadata/fields.py
Outdated
"description": "The part of the plant a record corresponds to.", | ||
"constraints": { | ||
"enum": [ | ||
"plant", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As with appro_part_label
above it would be better if we could refer to the single source of truth for this list directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment about circular import error.
@@ -285,6 +285,54 @@ | |||
"etl_group": "entity_eia", | |||
"field_namespace": "eia", | |||
}, | |||
"plant_parts_eia": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For completeness I think it might be good to add the sources
(what is the PPE built out of?) and field_namespace
of ppe
, and maybe an etl_group
of outputs
or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filled out these resource fields.
This reverts commit cef65be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one mistake I think in the release notes, referring to plant_name_eia
rather than plant_name_ppe
.
docs/release_notes.rst
Outdated
metadata is now included in the PUDL metadata. | ||
metadata is now included in the PUDL metadata. See :pr:`1865` | ||
* For clarity and specificity, the ``plant_name_new`` column was renamed | ||
``plant_name_eia`` and the ``ownership`` column was renamed ``ownership_record_type``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you mean plant_name_ppe
here, not plant_name_eia
"ferc_acct_name": { | ||
"type": "string", | ||
"description": "Name of FERC account, derived from technology description and prime mover code.", | ||
"constraints": {"enum": ["Hydraulic", "Nuclear", "Steam", "Other"]}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to go with the Title Case values here rather than standardizing to lower case like we do almost everywhere else? Are we not normalizing strings in this column? Where does it come from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These values are read in in pudl.helpers.get_eia_ferc_acct_map()
, which reads in pudl.package_data.glue.ferc_acct_to_pm_tech_map.csv
. In this CSV all fields (technology_description
, prime_mover_code
, ferc_acct_name
) use Title Case. This table is then merged with the MCOE table to create the mega generators table. TitleCase is maintained in the merge keys and merged on fields in pudl.analysis.plant_parts_eia.get_gens_mega_table()
. Since all of the fields use TitleCase (and a few others in the plant parts list/MCOE table), I'm leaving this as is for now.
Currently my RMI branch (
update-ci
) references this PUDL branch. This branch should get merged first before the RMI CI branch.There are two changes that make the PPL more memory efficient:
pudl_out._dfs
cache after all the plant parts list inputs are compiled and before the start of the plant parts list compilationIs this type of update worthy of including in the release notes??