-
-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use pd.NA where appropriate for ENUM and categorical fields #1376
Conversation
Codecov Report
@@ Coverage Diff @@
## dev #1376 +/- ##
==========================================
+ Coverage 83.66% 83.66% +0.01%
==========================================
Files 62 62
Lines 6883 6892 +9
==========================================
+ Hits 5758 5766 +8
- Misses 1125 1126 +1
Continue to review full report at Codecov.
|
Forgot to address this in above comments - I could also turn "unknown" into NAs for these columns if wanted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly just some questions about whether there are additional missing-ish values that we should turn into NAs.
src/pudl/transform/ferc1.py
Outdated
[CONSTRUCTION_TYPE_STRINGS, PLANT_KIND_STRINGS], | ||
unmapped='') | ||
.pipe(pudl.helpers.cleanstrings, ['construction_type'], [CONSTRUCTION_TYPE_STRINGS], unmapped=pd.NA) | ||
.pipe(pudl.helpers.cleanstrings, ['plant_type'], [PLANT_KIND_STRINGS], unmapped="") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this one retain the empty string for unmapped values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
plant_type
wasn't listed as a column that needed to remove empty strings in the issue, so when I encountered this I skipped plant_type
. Also a few tests failed when I changed it to pd.NA (I forget what the error is but could regenerate), maybe the column is treated as a primary key? I could go back and change this to NAs though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fields.py
there's this comment:
TODO Disambiguate column name and apply table specific ENUM constraints. There are different allowable values in different tables.
So it seems like maybe this is a good candidate to remove the empty string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay. I don't think it's ever used as a primary key. There's a mechanism for overriding the generic purely field-name based attributes if we need to. It looks like plant_type
shows up in in the plants_hydro_ferc1
, plants_small_ferc1
, and plants_steam_ferc1
tables, but currently only the plants_steam_ferc1
version is being cleaned and set to some canonical values defined in the PLANT_KIND_STRINGS
, so to do what the TODO asks would require categorizing all the (different) nonsense strings in the hydro and small plants tables. I don't know if this is something that might be getting done already in @aesharpe's FERC 1 cleaning PR but we should ask.
I think the thing to do for now if we want to apply a table specific ENUM is to override the constraints which are defined for this column only for the plants_steam_ferc1
table, using the FIELD_METADATA_BY_RESOURCE
defined at the bottom of the fields.py
module (currently there's only one override). And then maybe create an issue to remind us to create ENUM constraints for the plant_type
field in the hydro and small plants tables in concert with Austen's cleanup of those tables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now, the plant_type
column gets cleaned and standardized in the steam table, manually mapped in the small generators table, left alone in the hydro table, and is non-existent (and irrelevant) in the pumped storage table.
Part of my FERC 1 cleaning module extracts the headers from the small generators table, standardizes them, and puts them in the plant_type
column so that we don't have to rely on manual mapping. This is still a draft PR, however, so none of this is integrated yet.
The plant_type
extracted from the headers is standardized according to a dictionary that I created and which we can change to reflect whatever final plant type string we want (or specify in an ENUM constraint).
Current small plant fuel types:
None, 'hydro', 'internal_combustion', 'solar_pv', 'cogen',
'steam_heat', 'wind', 'gas_turbine', 'diesel_turbine',
'pumped_hydro', 'coal', 'waste_heat', 'fuel_cell'
Current steam fuel types:
'steam', 'nuclear', 'unknown', 'combustion_turbine',
'combined_cycle', 'internal_combustion', 'geothermal', 'wind',
'photovoltaic', 'solar_thermal', ''
At the very least, we could standardize the NA values for both of these tables. Ideally we will pull values for plant_type
from the same set of values, whether they are constrained or not. I think it's a good idea to create an issue to remind us to do this (and for the hydro table and maybe even an arbitrary one for the pumped storage table so that it has a plant_type
when it gets merged with the others into the mega FERC table). Ultimately, the goal of all this RMI manual mapping we're doing is to connect, among other things, the EIA technology_description
field with individual FERC records so we won't need to use the janky FERC plant_type
or header extraction at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now I've moved plant_type
to the FIELD_METADATA_BY_RESOURCE and added an assert so that there are no uncaught strings in the plant_type
column of plants_steam_ferc1
. Seems like there should be a new issue that adds ENUM constraints for the FERC 1 hydro and small plants tables? Or is the goal to remove plant_type
altogether and have it agree with EIA technology_description
?
I think what's happening here is that the null values are getting loaded into SQLite just fine, and stored as appropriate NULL values as far as the database is concerned, but then pandas is doing it's standard datatype conversions when we call With the new metadata system we should be able to give pandas an explicit schema that it can use to type the columns it reads out of the DB appropriately, but that's a later issue. |
I made a list of the fields with FERC1construction_type'', 'automatic operation', 'comb. turb. installn', 'comb. turb. instaln', plant_kind'', 'n/a', 'see pgs 402.1-402.3', 'see pgs 403.1-403.9', "respondent's share", fuel_unit'', '1265', 'mwh units', 'composite', 'therms', 'n/a', 'mbtu/kg', 'uranium 235',
EIAentity_typeenum includes "unknown" and "other"
EPA CEMS
|
FERC 1The FERC 1 EIAWe should look at what kinds of utilities are being assigned "unknown" and "other" entity types, and whether they are consistent across years, or if how they're reported has changed. Like did they used to be "unknown" and now they're "other"? Or did they eventually get categorized? What proportion of the overall population has these mystery types? how often does a utility entity type change? Can we fill in the missing values confidently? What does the EIA documentation say about this field (if anything)? Often this is in one of the non-data tabs of the raw input spreadsheets, or in the instructions (PDFs which we've tried to archive in our docs) EPA CEMSSimilarly we should see what EPA is saying these fields / values mean, and try to understand if "undetermined" and "unknown code" and "other" mean different things, or if they've just changed the coding scheme over the years, and also what proportion of the dataset is affected. There's probably some documentation on web from EPA, but I'm not 100% sure where. If we can't find it, we've been in contact with some of the EPA people in the past and might be able to ask for help. |
Summary of some changes: FERC 1I find some strings in EIA 860
EPA CEMSI couldn't find much documentation about the nox and co2 measurement codes, but it seems like the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made a few comments, if you want to chat. I suspect we should look at the broad dropna() from the fuel_ferc1
table more closely.
The following columns are filled with
pd.NA
in their respective datasets when going through the extract and transform steps of the pipeline.FERC 1
construction_type
EIA
natural_gas_delivery_contract_type_code
natural_gas_transport_code
The call
df.to_sql()
insqlite.py
converts thepd.NA
values to NoneType since the column dtypes are not preserved. This problem is fixed after the fact by recasting the column to a string type once pudl.sqlite is read into a pandas dataframe. Preserving the column dtype is maybe something that should be addressed at some point.The following columns are filled with
np.nan
in the EPA CEMS datasets, notpd.NA
. This is because they are categorical type columns and thus do not allow forpd.NA
.EPA CEMS
co2_mass_measurement_code
nox_mass_measurement_code
nox_rate_measurement_code
so2_mass_measurement_code
Note
The column
purchase_type
is listed in the issue as an EIA column with an empty string in the enum. However this appears to be a FERC 1 column and the enum is based off the dictionaryPOWER_PURCHASE_TYPES_FERC1
inlabels.py
. There is no empty string listed in thePOWER_PURCHASE_TYPES_FERC1
dictionary, so I didn't do anything about thepurchase_type
column.