Create a data_maturity label for EIA data #1855

Merged
merged 14 commits into from
Aug 24, 2022

Conversation

cmgosnell

@cmgosnell cmgosnell commented Aug 22, 2022

We added the early release EIA data in #1834. This PR adds a data_maturity column to the EIA data tables to tell users where the data comes from and, in turn, how much trust they should place in its permanence.

Things that happened in here:

  • Add an add_data_maturity method to the standard Excel extractor that 🥁 adds a data_maturity column. This is called in process_raw (Question: the generic version of process_raw is customized in all of the datasets... should add_data_maturity be called directly in GenericExtractor.extract instead?)
  • Edit the harvesting process a bit! wahoo. The overall goal for this label is that it is kept in the "data" tables (think frc) and is present in the annual entity tables (think generators tbl). This way the label is front and center for users! To implement this, we needed to both harvest the column for the relevant entities AND keep the column in the tables themselves. We did this before in a hacky way for utility_id_eia. This PR sets it up in a non-hacky way (imo). For each entity in ENTITIES we add a not_to_drop_cols list, which the harvesting process uses to avoid dropping those columns.
  • Ensure the column actually propagates through the transform process... this looked mostly like finding all of the gazillion places we enumerate/restrict columns.
  • Add this column into the metadata for all the tables!
  • Added the new eia860m columns into the eia860 generators metadata map. This is really a holdover from Update eia923 raw inputs to include revisions made by EIA on 2022-08-11 #1846, from the few new 860m columns that got added. With these new columns, if there is no eia860m data the ETL will fail. (Question/suggestion: We could add them as null columns in the extract step if the current empty CSV rows feel bad. That might be better because we'll be able to document it at least.)
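
To make the first two bullets concrete, here is a minimal sketch, not the code in this PR. It assumes a much simpler harvesting step than PUDL's real one (no consistency checks), and the function `harvest_entity`, the sample table, and the `"provisional"` label are all hypothetical stand-ins. Only the names `add_data_maturity`, `data_maturity`, and `not_to_drop_cols` come from the description above.

```python
import pandas as pd


def add_data_maturity(df: pd.DataFrame, maturity: str) -> pd.DataFrame:
    """Return a copy of df with a constant data_maturity column."""
    return df.assign(data_maturity=maturity)


def harvest_entity(df, id_cols, entity_cols, not_to_drop_cols):
    """Split df into an entity table and the remaining data table.

    Simplified assumption: harvested columns are normally dropped from the
    data table, but anything in not_to_drop_cols is harvested AND kept.
    """
    entity = df[id_cols + entity_cols].drop_duplicates().reset_index(drop=True)
    to_drop = [col for col in entity_cols if col not in not_to_drop_cols]
    return entity, df.drop(columns=to_drop)


# Hypothetical raw table; the column values are made up for illustration.
raw = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2],
        "plant_name": ["A", "A", "B"],
        "fuel_mmbtu": [10.0, 12.0, 7.0],
    }
)
labeled = add_data_maturity(raw, "provisional")
entity, data = harvest_entity(
    labeled,
    id_cols=["plant_id_eia"],
    entity_cols=["plant_name", "data_maturity"],
    not_to_drop_cols=["data_maturity"],
)
```

In this sketch `plant_name` ends up only in the entity table, while `data_maturity` appears in both the entity table and the data table, which is the behavior the not_to_drop_cols list is meant to give.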

@cmgosnell cmgosnell linked an issue Aug 22, 2022 that may be closed by this pull request
src/pudl/extract/excel.py Outdated Show resolved Hide resolved
these columns being anywhere in the non-m eia860 enables the columns
to exist in the tables without the eia860m being loaded. Moving them
from the generators -> generators_existing is slightly aspirational.
We hope they'll add them in their next non-monthly data updates so
this just adds the empties here where we hope they'll show up later.
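
The alternative raised in the PR description, adding the columns as nulls during extraction, could look roughly like the sketch below. The helper `add_missing_columns` and the column name `hypothetical_860m_col` are assumptions for illustration only; neither exists in PUDL.

```python
import pandas as pd


def add_missing_columns(df: pd.DataFrame, expected_cols: list) -> pd.DataFrame:
    """Append any expected columns that are absent from df, filled with NA."""
    missing = [col for col in expected_cols if col not in df.columns]
    return df.assign(**{col: pd.NA for col in missing})


# Hypothetical extracted page that lacks a column only found in eia860m.
extracted = pd.DataFrame({"generator_id": ["a", "b"]})
padded = add_missing_columns(extracted, ["generator_id", "hypothetical_860m_col"])
```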

codecov bot commented Aug 22, 2022

Codecov Report

Merging #1855 (7ab88ba) into dev (79308d7) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##             dev   #1855     +/-   ##
=======================================
+ Coverage   83.0%   83.2%   +0.1%     
=======================================
  Files         65      65             
  Lines       7327    7518    +191     
=======================================
+ Hits        6088    6255    +167     
- Misses      1239    1263     +24     
Impacted Files Coverage Δ
src/pudl/metadata/codes.py 100.0% <ø> (ø)
src/pudl/metadata/fields.py 100.0% <ø> (ø)
src/pudl/metadata/resources/eia.py 100.0% <ø> (ø)
src/pudl/metadata/resources/eia860.py 100.0% <ø> (ø)
src/pudl/metadata/resources/eia923.py 100.0% <ø> (ø)
src/pudl/extract/eia860.py 100.0% <100.0%> (ø)
src/pudl/extract/eia860m.py 100.0% <100.0%> (ø)
src/pudl/extract/eia923.py 100.0% <100.0%> (ø)
src/pudl/extract/excel.py 95.4% <100.0%> (+0.4%) ⬆️
src/pudl/transform/eia.py 94.9% <100.0%> (-0.1%) ⬇️
... and 3 more


@zaneselvans

This looks like some hackish stuff happening. Can you explain more in the PR comment what all you had to do to make it work?


@zaneselvans zaneselvans left a comment

I think we should have a coding table that explains what the different levels of data_maturity mean with some examples (maybe even exhaustive lists since there's not much right now).

Don't forget to update release notes!

src/pudl/metadata/fields.py Outdated Show resolved Hide resolved
src/pudl/transform/eia.py Outdated Show resolved Hide resolved
src/pudl/transform/eia.py Show resolved Hide resolved
annual release and should be used with caution. :pr:`1834`
annual release and should be used with caution. We also integrated a ``data_maturity``
column and related ``data_maturities`` table into most of the EIA data tables in
order to alter users to the level of finality of the data. :pr:`1834` :pr:`1855`
alter => alert?


@zaneselvans zaneselvans left a comment

Since you've set up the foreign key generation rule for the data_maturity columns, the ENUM constraint on that column is duplicative.
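
A small standalone illustration of why the two constraints overlap, using SQLite from the Python standard library. This is a sketch under assumptions: the `generation` table and the code values `final` and `provisional` are hypothetical, and only the `data_maturities` table name and `data_maturity` column come from this PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Coding table: one row per allowed code (values here are made up).
conn.execute(
    "CREATE TABLE data_maturities (code TEXT PRIMARY KEY, description TEXT)"
)
conn.executemany(
    "INSERT INTO data_maturities VALUES (?, ?)",
    [("final", "hypothetical description"), ("provisional", "hypothetical description")],
)

# Data table whose data_maturity column references the coding table.
conn.execute(
    "CREATE TABLE generation ("
    "plant_id_eia INTEGER, "
    "data_maturity TEXT REFERENCES data_maturities (code))"
)
conn.execute("INSERT INTO generation VALUES (1, 'final')")

try:
    conn.execute("INSERT INTO generation VALUES (2, 'not_a_code')")
    rejected = False
except sqlite3.IntegrityError:
    # The foreign key alone rejects values outside the coding table.
    rejected = True
```

Because the foreign key already limits the column to the codes listed in the coding table, an ENUM on the same column would repeat the same check.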

src/pudl/extract/excel.py Outdated Show resolved Hide resolved
src/pudl/extract/excel.py Outdated Show resolved Hide resolved
src/pudl/metadata/fields.py Outdated Show resolved Hide resolved
@cmgosnell cmgosnell merged commit ddd2583 into dev Aug 24, 2022
@cmgosnell cmgosnell deleted the er-label branch August 24, 2022 15:42
@cmgosnell cmgosnell added eia923 Anything having to do with EIA Form 923 eia860 Anything having to do with EIA Form 860 new-data Requests for integration of new data. rmi labels Aug 24, 2022
Development

Successfully merging this pull request may close these issues.

integrate warning/notice of eia 923 early release data
2 participants