Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use pd.NA where appropriate for ENUM and categorical fields #1376

Merged
merged 11 commits into from
Dec 30, 2021
Merged

Conversation

katie-lamb
Copy link
Member

The following columns are filled with pd.NA in their respective datasets when going through the extract and transform steps of the pipeline.

FERC 1

  • construction_type

EIA

  • natural_gas_delivery_contract_type_code
  • natural_gas_transport_code

The call df.to_sql() in sqlite.py converts the pd.NA values to NoneType since the column dtypes are not preserved. This problem is fixed after the fact by recasting the column to a string type once pudl.sqlite is read into a pandas dataframe. Preserving the column dtype is maybe something that should be addressed at some point.

The following columns are filled with np.nan in the EPA CEMS datasets, not pd.NA. This is because they are categorical type columns and thus do not allow for pd.NA.

EPA CEMS

  • co2_mass_measurement_code
  • nox_mass_measurement_code
  • nox_rate_measurement_code
  • so2_mass_measurement_code

Note

The column purchase_type is listed in the issue as an EIA column with an empty string in the enum. However this appears to be a FERC 1 column and the enum is based off the dictionary POWER_PURCHASE_TYPES_FERC1 in labels.py. There is no empty string listed in the POWER_PURCHASE_TYPES_FERC1 dictionary, so I didn't do anything about the purchase_type column.

@katie-lamb katie-lamb added data-cleaning Tasks related to cleaning & regularizing data during ETL. sqlite Issues related to interacting with sqlite databases metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. labels Dec 16, 2021
@katie-lamb katie-lamb self-assigned this Dec 16, 2021
@codecov
Copy link

codecov bot commented Dec 16, 2021

Codecov Report

Merging #1376 (2524d10) into dev (eccddc9) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##              dev    #1376      +/-   ##
==========================================
+ Coverage   83.66%   83.66%   +0.01%     
==========================================
  Files          62       62              
  Lines        6883     6892       +9     
==========================================
+ Hits         5758     5766       +8     
- Misses       1125     1126       +1     
Impacted Files Coverage Δ
src/pudl/constants.py 100.00% <ø> (ø)
src/pudl/metadata/enums.py 100.00% <ø> (ø)
src/pudl/metadata/fields.py 100.00% <ø> (ø)
src/pudl/transform/eia860.py 96.52% <ø> (ø)
src/pudl/transform/eia923.py 93.99% <ø> (ø)
src/pudl/transform/ferc1.py 92.04% <100.00%> (+0.26%) ⬆️
src/pudl/analysis/timeseries_cleaning.py 88.45% <0.00%> (-0.22%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eccddc9...2524d10. Read the comment docs.

@katie-lamb
Copy link
Member Author

Forgot to address this in above comments - I could also turn "unknown" into NAs for these columns if wanted

Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly just some questions about whether there are additional missing-ish values that we should turn into NAs.

[CONSTRUCTION_TYPE_STRINGS, PLANT_KIND_STRINGS],
unmapped='')
.pipe(pudl.helpers.cleanstrings, ['construction_type'], [CONSTRUCTION_TYPE_STRINGS], unmapped=pd.NA)
.pipe(pudl.helpers.cleanstrings, ['plant_type'], [PLANT_KIND_STRINGS], unmapped="")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this one retain the empty string for unmapped values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plant_type wasn't listed as a column that needed to remove empty strings in the issue, so when I encountered this I skipped plant_type. Also a few tests failed when I changed it to pd.NA (I forget what the error is but could regenerate), maybe the column is treated as a primary key? I could go back and change this to NAs though

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fields.py there's this comment:

TODO Disambiguate column name and apply table specific ENUM constraints. There are different allowable values in different tables.

So it seems like maybe this is a good candidate to remove the empty string?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay. I don't think it's ever used as a primary key. There's a mechanism for overriding the generic purely field-name based attributes if we need to. It looks like plant_type shows up in in the plants_hydro_ferc1, plants_small_ferc1, and plants_steam_ferc1 tables, but currently only the plants_steam_ferc1 version is being cleaned and set to some canonical values defined in the PLANT_KIND_STRINGS , so to do what the TODO asks would require categorizing all the (different) nonsense strings in the hydro and small plants tables. I don't know if this is something that might be getting done already in @aesharpe's FERC 1 cleaning PR but we should ask.

I think the thing to do for now if we want to apply a table specific ENUM is to override the constraints which are defined for this column only for the plants_steam_ferc1 table, using the FIELD_METADATA_BY_RESOURCE defined at the bottom of the fields.py module (currently there's only one override). And then maybe create an issue to remind us to create ENUM constraints for the plant_type field in the hydro and small plants tables in concert with Austen's cleanup of those tables.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, the plant_type column gets cleaned and standardized in the steam table, manually mapped in the small generators table, left alone in the hydro table, and is non-existent (and irrelevant) in the pumped storage table.

Part of my FERC 1 cleaning module extracts the headers from the small generators table, standardizes them, and puts them in the plant_type column so that we don't have to rely on manual mapping. This is still a draft PR, however, so none of this is integrated yet.

The plant_type extracted from the headers is standardized according to a dictionary that I created and which we can change to reflect whatever final plant type string we want (or specify in an ENUM constraint).

Current small plant fuel types:

None, 'hydro', 'internal_combustion', 'solar_pv', 'cogen',
       'steam_heat', 'wind', 'gas_turbine', 'diesel_turbine',
       'pumped_hydro', 'coal', 'waste_heat', 'fuel_cell'

Current steam fuel types:

'steam', 'nuclear', 'unknown', 'combustion_turbine',
       'combined_cycle', 'internal_combustion', 'geothermal', 'wind',
       'photovoltaic', 'solar_thermal', ''

At the very least, we could standardize the NA values for both of these tables. Ideally we will pull values for plant_type from the same set of values, whether they are constrained or not. I think it's a good idea to create an issue to remind us to do this (and for the hydro table and maybe even an arbitrary one for the pumped storage table so that it has a plant_type when it gets merged with the others into the mega FERC table). Ultimately, the goal of all this RMI manual mapping we're doing is to connect, among other things, the EIA technology_description field with individual FERC records so we won't need to use the janky FERC plant_type or header extraction at all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I've moved plant_type to the FIELD_METADATA_BY_RESOURCE and added an assert so that there are no uncaught strings in the plant_type column of plants_steam_ferc1. Seems like there should be a new issue that adds ENUM constraints for the FERC 1 hydro and small plants tables? Or is the goal to remove plant_type altogether and have it agree with EIA technology_description?

src/pudl/metadata/enums.py Show resolved Hide resolved
src/pudl/metadata/fields.py Outdated Show resolved Hide resolved
@zaneselvans
Copy link
Member

The call df.to_sql() in sqlite.py converts the pd.NA values to NoneType since the column dtypes are not preserved. This problem is fixed after the fact by recasting the column to a string type once pudl.sqlite is read into a pandas dataframe. Preserving the column dtype is maybe something that should be addressed at some point.

I think what's happening here is that the null values are getting loaded into SQLite just fine, and stored as appropriate NULL values as far as the database is concerned, but then pandas is doing it's standard datatype conversions when we call pd.read_sql() and it incorrectly converts the SQLite NULL values into None (which is often treated as a "null" object).

With the new metadata system we should be able to give pandas an explicit schema that it can use to type the columns it reads out of the DB appropriately, but that's a later issue.

@katie-lamb
Copy link
Member Author

katie-lamb commented Dec 17, 2021

I made a list of the fields with unknown or undetermined as well as some of the strings that map to these values:

FERC1

construction_type

'', 'automatic operation', 'comb. turb. installn', 'comb. turb. instaln',
'com. turb. installn', 'n/a', 'for detailed info.', 'for detailed info',

plant_kind

'', 'n/a', 'see pgs 402.1-402.3', 'see pgs 403.1-403.9', "respondent's share",
'--', '(see note 7)', 'other', 'not applicable', 'peach bottom', 'none.',
'fuel facilities', '0', 'not in service', 'none', 'common expenses', ..... many more

fuel_unit

'', '1265', 'mwh units', 'composite', 'therms', 'n/a', 'mbtu/kg', 'uranium 235',
'oil', 'ccf', '2261', 'uo2', '(7)', 'oil #2', 'oil #6', '\x99å\x83\x90?"', .... many more

  • construction_type and fuel_unit don't seem like primary keys and maybe are good candidates to make NAs?

EIA

entity_type

enum includes "unknown" and "other"

  • seems like "unknown" could be made into an NA

EPA CEMS

co2_mass_measurement_code
nox_mass_measurement_code
nox_rate_measurement_code
so2_mass_measurement_code
All accept "other", "undetermined", or "unknown code" - these could probably be made into NAs?

@zaneselvans
Copy link
Member

FERC 1

The FERC 1 construction_type, plant_kind and fuel_unit "cleaned strings" are things we assigned, so we won't mess up anyone else if we make them NA, but we might mess up our own code. There's an unmapped argument in pudl.helpers.cleanstrings() that we could use to set any unmapped strings to pd.NA but I don't think we want to do that, because we want it to be clear whether we've reviewed and attempted to categorize every one of these nonsense strings, and they add new ones every year. So we may want to go ahead and assign them unknown assert that every value in the column is one of our simplified values (so we catch anything new and unexpected) and then replace unknown with pd.NA.

EIA

We should look at what kinds of utilities are being assigned "unknown" and "other" entity types, and whether they are consistent across years, or if how they're reported has changed. Like did they used to be "unknown" and now they're "other"? Or did they eventually get categorized? What proportion of the overall population has these mystery types? how often does a utility entity type change? Can we fill in the missing values confidently? What does the EIA documentation say about this field (if anything)? Often this is in one of the non-data tabs of the raw input spreadsheets, or in the instructions (PDFs which we've tried to archive in our docs)

EPA CEMS

Similarly we should see what EPA is saying these fields / values mean, and try to understand if "undetermined" and "unknown code" and "other" mean different things, or if they've just changed the coding scheme over the years, and also what proportion of the dataset is affected. There's probably some documentation on web from EPA, but I'm not 100% sure where. If we can't find it, we've been in contact with some of the EPA people in the past and might be able to ask for help.

@katie-lamb
Copy link
Member Author

Summary of some changes:

FERC 1

I find some strings in construction_type, plant_type, and fuel_unit that hadn't been caught yet by the dictionary or strings and put them in the unknown category of this dictionary. I also added asserts so that we know if there are more uncaught strings. Then I converted all the unknown to pd.NA after this assert is made.

EIA 860

entity_type has unknown as a valid value, however there aren't actually any data points that use this unknown field. I couldn't find much in the documentation about entity_type but other seems to be an EIA label, so I left all the other values in the dataframes. There are some data points (58,000 out of around 109,000 rows for all years) that have np.nan so I converted these to pd.NA

EPA CEMS

I couldn't find much documentation about the nox and co2 measurement codes, but it seems like the Undetermined and Unknown Code are rarely used, so I left them.

@zaneselvans zaneselvans linked an issue Dec 28, 2021 that may be closed by this pull request
Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a few comments, if you want to chat. I suspect we should look at the broad dropna() from the fuel_ferc1 table more closely.

src/pudl/metadata/fields.py Show resolved Hide resolved
@katie-lamb katie-lamb merged commit 432a592 into dev Dec 30, 2021
@katie-lamb katie-lamb deleted the na-values branch December 30, 2021 21:10
@zaneselvans zaneselvans mentioned this pull request Mar 10, 2022
22 tasks
@zaneselvans zaneselvans changed the title Na values Use pd.NA where appropriate for ENUM and categorical fields Mar 10, 2022
@zaneselvans zaneselvans mentioned this pull request Mar 11, 2022
28 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data-cleaning Tasks related to cleaning & regularizing data during ETL. metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. sqlite Issues related to interacting with sqlite databases
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ensure categorical columns use appropriate NA values
4 participants