
Move the EIA plant-parts list into PUDL outputs #1157

Merged 45 commits on Mar 10, 2022
Conversation

Member

@cmgosnell cmgosnell commented Aug 27, 2021

Hookay. This PR is an attempt to move the EIA plant-parts list into PUDL. The bulk of the code is in pudl.analysis.plant_parts_eia.py. I added the little standard wrappers for the three main outputs of the plant-parts list into the pudl_out object, a small collection of helper functions, one "run-it" CI test and one data validation test.

This is a biggie. There is a fair amount of documentation in pudl.analysis.plant_parts_eia.py. I highly recommend reading the module doc string before looking at any of the code.

There are two classes in there that still feel like problem children to me because they are conceptually challenging and a bit slow (LabelTrueGranularities and AddQualifier).

Desires from this PR:

  • Fix PerformanceWarnings
  • Get advice on LabelTrueGranularities (this is my least fav part... it is conceptually challenging and slow)
  • Get advice on AddQualifier (this part is just slow bc it is doing a ton of groupby().count()'s)
  • Advice on making the record_id_eia more useful/readable

The EIA plant-parts list is now integrated into pudl_out. Three tables/methods were added, one for each major step along the way to the plant-parts list.

I added one annual run of the plant-parts list to the CI tests, and one run of one of the test functions that checks the aggregations to the validation tests.

I removed the caching from the plant_parts_eia.py objects because this is now taken care of in the pudl_out object.

Working on Issue #952


codecov bot commented Aug 27, 2021

Codecov Report

Merging #1157 (26950da) into dev (9d27550) will increase coverage by 0.9%.
The diff coverage is 97.7%.


@@           Coverage Diff           @@
##             dev   #1157     +/-   ##
=======================================
+ Coverage   83.2%   84.2%   +0.9%     
=======================================
  Files         65      66      +1     
  Lines       6886    7307    +421     
=======================================
+ Hits        5733    6155    +422     
+ Misses      1153    1152      -1     
Impacted Files Coverage Δ
src/pudl/analysis/plant_parts_eia.py 97.3% <97.3%> (ø)
src/pudl/analysis/mcoe.py 92.4% <100.0%> (-0.3%) ⬇️
src/pudl/helpers.py 87.4% <100.0%> (+2.6%) ⬆️
src/pudl/output/pudltabl.py 88.6% <100.0%> (+0.4%) ⬆️
src/pudl/metadata/dfs.py 100.0% <0.0%> (ø)
src/pudl/metadata/enums.py 100.0% <0.0%> (ø)
src/pudl/extract/ferc714.py 100.0% <0.0%> (ø)
src/pudl/metadata/fields.py 100.0% <0.0%> (ø)
src/pudl/metadata/labels.py 100.0% <0.0%> (ø)
src/pudl/metadata/constants.py 100.0% <0.0%> (ø)
... and 17 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

cmgosnell added a commit to catalyst-cooperative/rmi-ferc1-eia that referenced this pull request Sep 7, 2021
As of 9/7/21 this integration is in a PR, 
catalyst-cooperative/pudl#1157.
@bendnorman bendnorman left a comment

PR looks good! The code is stable and the outputs seem to be providing good record linkage results.

Running pudl_out.plant_parts_eia() took about 16 mins on my computer which I think is fast enough for our purposes. I did profile the code and found that loading the MCOE data took 25% of the total runtime and finding qualifying records took 31% of the runtime.

It took me a while to wrap my head around what each step in the process was doing and why. The code does a bunch of transformations and it is difficult to understand what a single record represents at the end of each transformation. As someone who is less familiar with energy systems and raw FERC data, I think providing some examples of inputs and expected outputs in unit tests would be helpful.

I would like to know more about the desired outcome of the plant parts process. Are the entities in PLANT_PARTS real entities in power plants or are they just strange slices that exist in FERC? If they are real entities, how are they related to one another? If they are just weird slices from FERC, how did we identify plant_technology, plant_prime_mover, etc as important slices to try to create from EIA data? I think answering these questions could help us potentially simplify the plant parts logic.

Comment on lines 28 to 31
infrastructure as other plant-part list records. For example, if we have a
"plant_prime_mover" plant part record and a "plant_unit" plant part record
which were both cobbled together from the same two generators. We want to be
able to reduce the master unit list to only unique collections of generators,
Member

Is it possible to provide some sample data for an example of a true granularity? As someone without much energy expertise, having specific examples of how these plant parts interact would be super helpful.

Comment on lines 421 to 426
In order to accumulate every possible version of how a generator could
be reported, this method generates two records for each generator's
reported owners: one of the portion of the plant part they own and one
for the plant-part as a whole. The portion records are labeled in the
`ownership` column as "owned" and the total records are labeled as
"total".
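The two-records-per-owner logic described in the quoted docstring could be sketched in pandas like this (the column names and data here are illustrative stand-ins, not the module's actual schema):

```python
import pandas as pd

# Hypothetical generator-ownership table: one row per (generator, owner),
# with fraction_owned summing to 1.0 for each generator.
gens = pd.DataFrame({
    "generator_id": ["a", "a"],
    "owner_utility_id": [1, 2],
    "fraction_owned": [0.7, 0.3],
    "capacity_mw": [100.0, 100.0],
})

# "owned" records: scale the data columns by each owner's share.
owned = gens.assign(
    ownership="owned",
    capacity_mw=gens["capacity_mw"] * gens["fraction_owned"],
)

# "total" records: each owner also gets a row for the plant part as a
# whole, unscaled.
total = gens.assign(ownership="total", fraction_owned=1.0)

slices = pd.concat([owned, total], ignore_index=True)
```

Each generator-owner pair thus yields two rows, so downstream matching can land on either the owned slice or the whole plant part.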
Member

Is it possible to add an example of a FERC record that splits a plant part into ownership slices?

Member Author

great idea!

Comment on lines 970 to 972
Note: This table is entirely intended for use within
``PudlTabl.plant_parts_eia`` and would be difficult to work with
outside of that context.
Member

Does each step of the plant parts process need to live in pudltabl.py? plant_parts_eia() could create each plant part object.

Member Author

Yeah, I was wondering this myself. I mostly set it up this way for debugging, because I have very often needed to generate the final output but not these two interim outputs. I think with the deep_update argument of plant_parts_eia this workflow could stay enabled without these tables having their own output methods.

This enables checking of consistency across time. I think that the record_id_eia is not the cleanest column as it is; hoping to get some guidance on that in general.
@zaneselvans zaneselvans left a comment

This has to get merged in because it's being used and the logistics of having it live in a separate repo are unworkable, but I don't really understand what's going on. Most of the comments I've made are small localized issues, but architecturally I think we have to find a way for this to be much more understandable and ideally also faster. I don't see any fundamental reason why these operations should have to take tens of minutes to run.

The identification of distinct record groups ("true granularities") feels particularly opaque.

I was also completely lost in the process of aggregating scaled generator+ownership slices.

I was unable to keep track of the nature of the tables being manipulated at various points in the process -- what to expect their natural keys to be, whether they represented individual plant parts for a single plant, or the whole dataset. I don't understand what the record IDs are.

I suspect some of the confusion could be helped by abstracting out operations that work on a given "plant part" or column that's used for scaling (capacity and ownership fraction) and separating the ownership-related operations from the plant-attribute related operations, and also from the operational status operations. Maybe setting up several different dataframes each of which is simpler to think about, which are then combined at the end into the output that's actually used for matching with FERC 1.

I also wasn't able to follow the flow of the program. Eventually it just became a bunch of different freefloating functions, with notes about where they were expected to be called, but I lost track of the high-level order of operations.

If we do get the CCAI grant and move to using a data structure like this as a fundamental building block of many parts of our overall data linkages, I think we need to take what you've learned from this process and do a new iteration.

return result.to_frame(name=data_col).reset_index()


def agg_cols(df_in, id_cols, sum_cols, wtavg_dict):
Member

agg_cols isn't descriptive of what this function does -- it's a very specific kind of aggregation of columns. The one line docstring also doesn't really explain what's happening, other than it involves dataframes, summing and weighted averages. Those operations could be combined in lots of different ways to do different things.

Member Author

Very fair point. I've renamed it to sum_and_weighted_average_agg... does that sound okay?
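For illustration, a helper with that name might do something along these lines (a simplified sketch, not the actual pudl.helpers implementation; here wtavg_dict maps each data column to the column used as its weight):

```python
import pandas as pd

def sum_and_weighted_average_agg(df_in, id_cols, sum_cols, wtavg_dict):
    """Group by id_cols, summing sum_cols and weight-averaging wtavg_dict keys.

    wtavg_dict maps each data column to the column used as its weight.
    """
    out = df_in.groupby(id_cols, as_index=False)[sum_cols].sum()
    for data_col, weight_col in wtavg_dict.items():
        wtavg = (
            df_in.assign(_weighted=df_in[data_col] * df_in[weight_col])
            .groupby(id_cols, as_index=False)[["_weighted", weight_col]]
            .sum()
        )
        # Weighted average = sum(value * weight) / sum(weight).
        wtavg[data_col] = wtavg["_weighted"] / wtavg[weight_col]
        out = out.merge(wtavg[id_cols + [data_col]], on=id_cols)
    return out

# Toy usage: sum capacity and capacity-weight the heat rate per plant.
plants = pd.DataFrame({
    "plant_id_eia": [1, 1],
    "capacity_mw": [100.0, 50.0],
    "heat_rate_mmbtu_mwh": [10.0, 8.0],
})
agged = sum_and_weighted_average_agg(
    plants,
    id_cols=["plant_id_eia"],
    sum_cols=["capacity_mw"],
    wtavg_dict={"heat_rate_mmbtu_mwh": "capacity_mw"},
)
```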


"""
plant_parts_eia = (
pudl.helpers.calc_capacity_factor(
Member

Helpers seems like a weird place to have this analytical function.

@cmgosnell cmgosnell Nov 4, 2021

I put it here because it is now used in both analysis.mcoe and this plant_parts_eia model. I suppose I could move it back into mcoe?
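For reference, the underlying calculation is conceptually simple; a minimal sketch of a capacity factor computation (not the actual pudl.helpers.calc_capacity_factor, which is more involved):

```python
import pandas as pd

def capacity_factor(df, hours_in_period=8760):
    """Capacity factor = net generation / (capacity * hours in the period).

    Assumes annual records with net_generation_mwh and capacity_mw columns;
    the real helper also handles dates and partial years.
    """
    return df.assign(
        capacity_factor=df["net_generation_mwh"]
        / (df["capacity_mw"] * hours_in_period)
    )

# A 100 MW plant generating 438,000 MWh in a year ran at 50% capacity.
gens = pd.DataFrame({"net_generation_mwh": [438000.0], "capacity_mw": [100.0]})
out = capacity_factor(gens)
```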

@zaneselvans
Member

zaneselvans commented Jan 25, 2022

Hey @ezwelty we're having a doctest issue here that makes absolutely no sense to us. We've spent hours trying to debug it and wonder if it might be clear to you what's happening.

On @cmgosnell's computer the doctests for this new module src/pudl/analysis/plant_parts_eia.py pass just fine, either using:

pytest --doctest-modules src/pudl test/unit

or

tox -re unit

But on my (Linux) machine, and in the CI, some of the doctests are failing, claiming that the python code embedded in the docstrings is producing no output, resulting in an "Expected [dataframe] got nothing" error. Examples of the build failure can be found here.

This is happening in the doctests embedded within the class-level docstrings for:

  • pudl.analysis.plant_parts_eia.PlantPart
  • pudl.analysis.plant_parts_eia.MakeMegaGenTbl

The methods that are being exercised clearly produce output when used in other contexts, and the tests are passing in her macOS environment. Other doctests that construct dataframes by hand using pd.DataFrame() seem to produce output as expected and pass just fine.

It seems bizarre and we are stumped.
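For context on what "Expected ... got nothing" means in doctest terms: doctest echoes and compares the repr of bare expressions, while statements (like assignments) produce no output at all. A minimal, self-contained illustration of that distinction (unrelated to the PUDL code, and not an explanation of the environment-dependent behavior seen here):

```python
import doctest

def example():
    """
    >>> 1 + 1
    2
    >>> x = 1 + 1
    """

# Run the two examples above: the bare expression's repr is compared
# against "2"; the assignment is expected to produce no output. If the
# docstring instead expected output after the assignment line, doctest
# would report "Expected ... Got nothing".
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner()
for test in finder.find(example, "example"):
    runner.run(test)
```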

@ezwelty
Contributor

ezwelty commented Jan 27, 2022

@zaneselvans Strange indeed. I see no way, reading through the code, for the functions being tested to return None (without also raising an exception, since a dataframe is expected throughout). So I wonder if the printing of dataframes is being suppressed in some environments? Honestly, I have no idea, but things I might try:

Making the failing tests more like the working ones:

>>> df = PlantPart(part_name='plant').ag_part_by_own_slice(gens_mega, sum_cols=['capacity_mw'], wtavg_dict={})
>>> df
...

Or even explicitly calling print:

>>> print(PlantPart(part_name='plant').ag_part_by_own_slice(gens_mega, sum_cols=['capacity_mw'], wtavg_dict={}))
...

@katie-lamb katie-lamb self-assigned this Mar 1, 2022
@katie-lamb katie-lamb added the ccai Tasks related to CCAI grant for entity matching label Mar 1, 2022
@katie-lamb
Member

katie-lamb commented Mar 8, 2022

Task list to move doctests over to unit tests:

  • module doc string
  • MakeMegaGenTbl
  • 4 in PlantPart
  • maybe these should be broken up into separate tests
  • move doctests out and update comments

@katie-lamb
Member

katie-lamb commented Mar 8, 2022

In the latest commits I merged in dev and migrated the doctests over to unit tests. Here's how it broke down into unit tests:

test_plant_part:

  • doctest from module doc string showing aggregation by plant
  • doctest of aggregation by plant_prime_fuel, plant_prime_mover, and plant_gen from PlantPart class

test_make_mega_gen_tbl:

  • doctest from MakeMegaGenTbl that tests creation of mega_gen table

test_slice_by_ownership:

  • already included in unit tests, tests slice_by_ownership method from MakeMegaGenTbl class

The examples @cmgosnell included in the doctests made sense to me so I just migrated them over directly.

Questions:

  • Should test_plant_part be broken up into separate tests for each aggregation?
  • Is the test_slice_by_ownership test redundant since slice_by_ownership is already called when the mega_gens table is made in MakeMegaGenTbl().execute() in test_make_mega_gen_tbl?
  • I kept all the example dataframes that aren't run through PPL code in the docstrings and added some text to explain the methods one would call to generate various output tables. Maybe there's a better way or the docstrings should be further simplified

@zaneselvans zaneselvans left a comment

I only looked at the doctest to unit test migration portion of the PR for this go-around.

Seemed like there was a leftover formatting fixture (from the too-wide doctests), and like it might be preferable to break the one large test function that's checking several different things into several smaller more specific tests, unless I missed some cross-dependency between them.

test/conftest.py
Comment on lines 250 to 254
@pytest.fixture(autouse=True, scope='session')
def pandas_terminal_width():
"""Set pandas displays for doctests."""
pd.options.display.width = 1000
pd.options.display.max_columns = 1000
Member

Should this still be here, or was it only for the doctests, which have now been removed?

import pudl.analysis.plant_parts_eia


def test_plant_part():
Member

Is there a reason that all of these individual unit tests need to be combined in a single test function? Or can they easily be broken out into several more specific unit tests, which would make it easier to identify exactly what test has failed when something goes wrong. I see gens_mega shows up in many of the checks, but that could be pulled out into a GENS_MEGA constant that's used in many of the individual tests.
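The suggested refactor pattern -- a shared module-level constant feeding several small, focused tests -- might look like this sketch (the data and assertions are purely illustrative; the real GENS_MEGA in the PR is built from EIA generator records with many more columns):

```python
import pandas as pd

# Shared input pulled out of the monolithic test function, so each
# smaller test can reuse it without rebuilding the frame.
GENS_MEGA = pd.DataFrame({
    "plant_id_eia": [1, 1, 2],
    "generator_id": ["a", "b", "a"],
    "capacity_mw": [100.0, 50.0, 75.0],
})

def test_plant_aggregation():
    # One focused check per former doctest scenario, so a failure
    # immediately names the aggregation that broke.
    out = GENS_MEGA.groupby("plant_id_eia", as_index=False)["capacity_mw"].sum()
    assert out["capacity_mw"].tolist() == [150.0, 75.0]

def test_generator_count():
    counts = GENS_MEGA.groupby("plant_id_eia")["generator_id"].count()
    assert counts.tolist() == [2, 1]
```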

@zaneselvans
Member

Also now there's some conflicts with dev because I changed the output testing a bit in the nuclear generation fuel PR.

# I tried adding this into __init__.py.
# I tried adding this into the module docstring.
pd.options.display.width = 1000
pd.options.display.max_columns = 1000
Member

Was this the final decision on where to put this?

@katie-lamb
Member

katie-lamb commented Mar 10, 2022

I only looked at the doctest to unit test migration portion of the PR for this go-around.

Seemed like there was a leftover formatting fixture (from the too-wide doctests), and like it might be preferable to break the one large test function that's checking several different things into several smaller more specific tests, unless I missed some cross-dependency between them.

I took out that formatting fixture but the pandas formatting (pd.options.display.width = 1000 and pd.options.display.max_columns = 1000) at the top of plant_parts_eia.py is still necessary to pass the doctests with the pandas dataframe examples that are left in there.

I broke up the different plant part aggregations into smaller more specific tests and made the global GENS_MEGA.

I also merged in dev.

@zaneselvans
Member

I think having those display options set as a side-effect of importing the pudl package will be unexpected, but also cosmetic, and something users can override, so I'm not gonna worry about it too much.


@cmgosnell cmgosnell merged commit 1f88c6f into dev Mar 10, 2022
@zaneselvans zaneselvans deleted the mul_eia branch March 10, 2022 22:38