Use pd.NA where appropriate for ENUM and categorical fields #1376

katie-lamb · 2021-12-16T22:27:55Z

The following columns are filled with pd.NA in their respective datasets when going through the extract and transform steps of the pipeline.

FERC 1

construction_type

EIA

natural_gas_delivery_contract_type_code
natural_gas_transport_code

The call df.to_sql() in sqlite.py converts the pd.NA values to NoneType since the column dtypes are not preserved. This problem is fixed after the fact by recasting the column to a string type once pudl.sqlite is read into a pandas dataframe. Preserving the column dtype is maybe something that should be addressed at some point.

The following columns are filled with np.nan in the EPA CEMS datasets, not pd.NA. This is because they are categorical type columns and thus do not allow for pd.NA.

EPA CEMS

co2_mass_measurement_code
nox_mass_measurement_code
nox_rate_measurement_code
so2_mass_measurement_code

Note

The column purchase_type is listed in the issue as an EIA column with an empty string in the enum. However this appears to be a FERC 1 column and the enum is based off the dictionary POWER_PURCHASE_TYPES_FERC1 in labels.py. There is no empty string listed in the POWER_PURCHASE_TYPES_FERC1 dictionary, so I didn't do anything about the purchase_type column.

…n call to cleanstrings function

…s enum

codecov · 2021-12-16T22:32:15Z

Codecov Report

Merging #1376 (2524d10) into dev (eccddc9) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev    #1376      +/-   ##
==========================================
+ Coverage   83.66%   83.66%   +0.01%     
==========================================
  Files          62       62              
  Lines        6883     6892       +9     
==========================================
+ Hits         5758     5766       +8     
- Misses       1125     1126       +1

Impacted Files	Coverage Δ
src/pudl/constants.py	`100.00% <ø> (ø)`
src/pudl/metadata/enums.py	`100.00% <ø> (ø)`
src/pudl/metadata/fields.py	`100.00% <ø> (ø)`
src/pudl/transform/eia860.py	`96.52% <ø> (ø)`
src/pudl/transform/eia923.py	`93.99% <ø> (ø)`
src/pudl/transform/ferc1.py	`92.04% <100.00%> (+0.26%)`	⬆️
src/pudl/analysis/timeseries_cleaning.py	`88.45% <0.00%> (-0.22%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eccddc9...2524d10. Read the comment docs.

katie-lamb · 2021-12-17T02:44:01Z

Forgot to address this in above comments - I could also turn "unknown" into NAs for these columns if wanted

zaneselvans

Mostly just some questions about whether there are additional missing-ish values that we should turn into NAs.

zaneselvans · 2021-12-17T00:51:19Z

src/pudl/transform/ferc1.py

-              [CONSTRUCTION_TYPE_STRINGS, PLANT_KIND_STRINGS],
-              unmapped='')
+        .pipe(pudl.helpers.cleanstrings, ['construction_type'], [CONSTRUCTION_TYPE_STRINGS], unmapped=pd.NA)
+        .pipe(pudl.helpers.cleanstrings, ['plant_type'], [PLANT_KIND_STRINGS], unmapped="")


Why does this one retain the empty string for unmapped values?

plant_type wasn't listed as a column that needed to remove empty strings in the issue, so when I encountered this I skipped plant_type. Also a few tests failed when I changed it to pd.NA (I forget what the error is but could regenerate), maybe the column is treated as a primary key? I could go back and change this to NAs though

In fields.py there's this comment:

TODO Disambiguate column name and apply table specific ENUM constraints. There are different allowable values in different tables.

So it seems like maybe this is a good candidate to remove the empty string?

Ah okay. I don't think it's ever used as a primary key. There's a mechanism for overriding the generic purely field-name based attributes if we need to. It looks like plant_type shows up in in the plants_hydro_ferc1, plants_small_ferc1, and plants_steam_ferc1 tables, but currently only the plants_steam_ferc1 version is being cleaned and set to some canonical values defined in the PLANT_KIND_STRINGS , so to do what the TODO asks would require categorizing all the (different) nonsense strings in the hydro and small plants tables. I don't know if this is something that might be getting done already in @aesharpe's FERC 1 cleaning PR but we should ask.

I think the thing to do for now if we want to apply a table specific ENUM is to override the constraints which are defined for this column only for the plants_steam_ferc1 table, using the FIELD_METADATA_BY_RESOURCE defined at the bottom of the fields.py module (currently there's only one override). And then maybe create an issue to remind us to create ENUM constraints for the plant_type field in the hydro and small plants tables in concert with Austen's cleanup of those tables.

Right now, the plant_type column gets cleaned and standardized in the steam table, manually mapped in the small generators table, left alone in the hydro table, and is non-existent (and irrelevant) in the pumped storage table.

Part of my FERC 1 cleaning module extracts the headers from the small generators table, standardizes them, and puts them in the plant_type column so that we don't have to rely on manual mapping. This is still a draft PR, however, so none of this is integrated yet.

The plant_type extracted from the headers is standardized according to a dictionary that I created and which we can change to reflect whatever final plant type string we want (or specify in an ENUM constraint).

Current small plant fuel types:

None, 'hydro', 'internal_combustion', 'solar_pv', 'cogen', 'steam_heat', 'wind', 'gas_turbine', 'diesel_turbine', 'pumped_hydro', 'coal', 'waste_heat', 'fuel_cell'

Current steam fuel types:

'steam', 'nuclear', 'unknown', 'combustion_turbine', 'combined_cycle', 'internal_combustion', 'geothermal', 'wind', 'photovoltaic', 'solar_thermal', ''

At the very least, we could standardize the NA values for both of these tables. Ideally we will pull values for plant_type from the same set of values, whether they are constrained or not. I think it's a good idea to create an issue to remind us to do this (and for the hydro table and maybe even an arbitrary one for the pumped storage table so that it has a plant_type when it gets merged with the others into the mega FERC table). Ultimately, the goal of all this RMI manual mapping we're doing is to connect, among other things, the EIA technology_description field with individual FERC records so we won't need to use the janky FERC plant_type or header extraction at all.

For now I've moved plant_type to the FIELD_METADATA_BY_RESOURCE and added an assert so that there are no uncaught strings in the plant_type column of plants_steam_ferc1. Seems like there should be a new issue that adds ENUM constraints for the FERC 1 hydro and small plants tables? Or is the goal to remove plant_type altogether and have it agree with EIA technology_description?

src/pudl/metadata/enums.py

src/pudl/metadata/fields.py

zaneselvans · 2021-12-17T04:37:13Z

The call df.to_sql() in sqlite.py converts the pd.NA values to NoneType since the column dtypes are not preserved. This problem is fixed after the fact by recasting the column to a string type once pudl.sqlite is read into a pandas dataframe. Preserving the column dtype is maybe something that should be addressed at some point.

I think what's happening here is that the null values are getting loaded into SQLite just fine, and stored as appropriate NULL values as far as the database is concerned, but then pandas is doing it's standard datatype conversions when we call pd.read_sql() and it incorrectly converts the SQLite NULL values into None (which is often treated as a "null" object).

With the new metadata system we should be able to give pandas an explicit schema that it can use to type the columns it reads out of the DB appropriately, but that's a later issue.

katie-lamb · 2021-12-17T17:10:29Z

I made a list of the fields with unknown or undetermined as well as some of the strings that map to these values:

FERC1

construction_type

'', 'automatic operation', 'comb. turb. installn', 'comb. turb. instaln',
'com. turb. installn', 'n/a', 'for detailed info.', 'for detailed info',

plant_kind

'', 'n/a', 'see pgs 402.1-402.3', 'see pgs 403.1-403.9', "respondent's share",
'--', '(see note 7)', 'other', 'not applicable', 'peach bottom', 'none.',
'fuel facilities', '0', 'not in service', 'none', 'common expenses', ..... many more

fuel_unit

'', '1265', 'mwh units', 'composite', 'therms', 'n/a', 'mbtu/kg', 'uranium 235',
'oil', 'ccf', '2261', 'uo2', '(7)', 'oil #2', 'oil #6', '\x99å\x83\x90?"', .... many more

construction_type and fuel_unit don't seem like primary keys and maybe are good candidates to make NAs?

EIA

entity_type

enum includes "unknown" and "other"

seems like "unknown" could be made into an NA

EPA CEMS

co2_mass_measurement_code
nox_mass_measurement_code
nox_rate_measurement_code
so2_mass_measurement_code
All accept "other", "undetermined", or "unknown code" - these could probably be made into NAs?

zaneselvans · 2021-12-17T17:31:27Z

FERC 1

The FERC 1 construction_type, plant_kind and fuel_unit "cleaned strings" are things we assigned, so we won't mess up anyone else if we make them NA, but we might mess up our own code. There's an unmapped argument in pudl.helpers.cleanstrings() that we could use to set any unmapped strings to pd.NA but I don't think we want to do that, because we want it to be clear whether we've reviewed and attempted to categorize every one of these nonsense strings, and they add new ones every year. So we may want to go ahead and assign them unknown assert that every value in the column is one of our simplified values (so we catch anything new and unexpected) and then replace unknown with pd.NA.

EIA

We should look at what kinds of utilities are being assigned "unknown" and "other" entity types, and whether they are consistent across years, or if how they're reported has changed. Like did they used to be "unknown" and now they're "other"? Or did they eventually get categorized? What proportion of the overall population has these mystery types? how often does a utility entity type change? Can we fill in the missing values confidently? What does the EIA documentation say about this field (if anything)? Often this is in one of the non-data tabs of the raw input spreadsheets, or in the instructions (PDFs which we've tried to archive in our docs)

EPA CEMS

Similarly we should see what EPA is saying these fields / values mean, and try to understand if "undetermined" and "unknown code" and "other" mean different things, or if they've just changed the coding scheme over the years, and also what proportion of the dataset is affected. There's probably some documentation on web from EPA, but I'm not 100% sure where. If we can't find it, we've been in contact with some of the EPA people in the past and might be able to ask for help.

into na-values

src/pudl/transform/eia860.py

src/pudl/transform/ferc1.py

katie-lamb · 2021-12-21T22:14:32Z

Summary of some changes:

FERC 1

I find some strings in construction_type, plant_type, and fuel_unit that hadn't been caught yet by the dictionary or strings and put them in the unknown category of this dictionary. I also added asserts so that we know if there are more uncaught strings. Then I converted all the unknown to pd.NA after this assert is made.

EIA 860

entity_type has unknown as a valid value, however there aren't actually any data points that use this unknown field. I couldn't find much in the documentation about entity_type but other seems to be an EIA label, so I left all the other values in the dataframes. There are some data points (58,000 out of around 109,000 rows for all years) that have np.nan so I converted these to pd.NA

EPA CEMS

I couldn't find much documentation about the nox and co2 measurement codes, but it seems like the Undetermined and Unknown Code are rarely used, so I left them.

zaneselvans

Made a few comments, if you want to chat. I suspect we should look at the broad dropna() from the fuel_ferc1 table more closely.

src/pudl/metadata/fields.py

katherinelamb added 3 commits December 13, 2021 17:29

fixed construction_type field for FERC data, changed unmapped value i…

198d795

…n call to cleanstrings function

fixed both natural_gas EIA 923 fields

b8c96c4

changed column types to string type, removed empty string from epacem…

2cde664

…s enum

katie-lamb added data-cleaning Tasks related to cleaning & regularizing data during ETL. sqlite Issues related to interacting with sqlite databases metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. labels Dec 16, 2021

katie-lamb requested a review from zaneselvans December 16, 2021 22:27

katie-lamb self-assigned this Dec 16, 2021

zaneselvans and others added 2 commits December 16, 2021 19:06

Merge branch 'dev' into na-values

452abd0

Merge remote-tracking branch 'origin/dev' into na-values

4734813

zaneselvans requested changes Dec 17, 2021

View reviewed changes

katie-lamb added 6 commits December 17, 2021 14:09

tackled more strings that were categorized as na and added asserts

99cf8f2

changed a few strings to unknowns

ca1ee71

added to strings caught in unknown and then changed unknown to pd.na

aa6bffc

Merge branch 'dev' into na-values

3e48edc

Merge branch 'na-values' of https://github.com/catalyst-cooperative/pudl

ca2d111

into na-values

small syntax fix

2524d10

katie-lamb commented Dec 21, 2021

View reviewed changes

src/pudl/transform/eia860.py Show resolved Hide resolved

katie-lamb commented Dec 21, 2021

View reviewed changes

src/pudl/transform/ferc1.py Show resolved Hide resolved

zaneselvans linked an issue Dec 28, 2021 that may be closed by this pull request

Ensure categorical columns use appropriate NA values #1210

Closed

zaneselvans reviewed Dec 29, 2021

View reviewed changes

src/pudl/metadata/fields.py Show resolved Hide resolved

zaneselvans approved these changes Dec 30, 2021

View reviewed changes

katie-lamb merged commit 432a592 into dev Dec 30, 2021

katie-lamb deleted the na-values branch December 30, 2021 21:10

zaneselvans mentioned this pull request Mar 10, 2022

Release PUDL v0.6.0 #1497

Closed

22 tasks

zaneselvans changed the title ~~Na values~~ Use pd.NA where appropriate for ENUM and categorical fields Mar 10, 2022

zaneselvans mentioned this pull request Mar 11, 2022

Update release notes for v0.6.0 #1525

Closed

28 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use pd.NA where appropriate for ENUM and categorical fields #1376

Use pd.NA where appropriate for ENUM and categorical fields #1376

katie-lamb commented Dec 16, 2021

codecov bot commented Dec 16, 2021 •

edited

Loading

katie-lamb commented Dec 17, 2021

zaneselvans left a comment

zaneselvans Dec 17, 2021

katie-lamb Dec 17, 2021

katie-lamb Dec 17, 2021

zaneselvans Dec 17, 2021

aesharpe Dec 17, 2021

katie-lamb Dec 21, 2021

zaneselvans commented Dec 17, 2021

katie-lamb commented Dec 17, 2021 •

edited

Loading

zaneselvans commented Dec 17, 2021

katie-lamb commented Dec 21, 2021

zaneselvans left a comment

Use pd.NA where appropriate for ENUM and categorical fields #1376

Use pd.NA where appropriate for ENUM and categorical fields #1376

Conversation

katie-lamb commented Dec 16, 2021

FERC 1

EIA

EPA CEMS

Note

codecov bot commented Dec 16, 2021 • edited Loading

Codecov Report

katie-lamb commented Dec 17, 2021

zaneselvans left a comment

Choose a reason for hiding this comment

zaneselvans Dec 17, 2021

Choose a reason for hiding this comment

katie-lamb Dec 17, 2021

Choose a reason for hiding this comment

katie-lamb Dec 17, 2021

Choose a reason for hiding this comment

zaneselvans Dec 17, 2021

Choose a reason for hiding this comment

aesharpe Dec 17, 2021

Choose a reason for hiding this comment

katie-lamb Dec 21, 2021

Choose a reason for hiding this comment

zaneselvans commented Dec 17, 2021

katie-lamb commented Dec 17, 2021 • edited Loading

FERC1

construction_type

plant_kind

fuel_unit

EIA

entity_type

EPA CEMS

zaneselvans commented Dec 17, 2021

FERC 1

EIA

EPA CEMS

katie-lamb commented Dec 21, 2021

FERC 1

EIA 860

EPA CEMS

zaneselvans left a comment

Choose a reason for hiding this comment

codecov bot commented Dec 16, 2021 •

edited

Loading

katie-lamb commented Dec 17, 2021 •

edited

Loading