Move the EIA plant-parts list into PUDL outputs #1157
Conversation
Pulling over the relevant module from https://github.com/catalyst-cooperative/rmi-ferc1-eia/. Working on issue #952.
The EIA plant-parts list is now integrated into pudl_out. Three tables/methods were added, one for each major step along the way to the plant-parts list. I added one annual run of the plant-parts list to the CI tests, and one run of one of the test functions that checks the aggregations to the validation tests. I removed the caching from the plant_parts_eia.py objects because this is now taken care of in the pudl_out object.
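For orientation, a minimal usage sketch of the new output method, assuming the usual PudlTabl setup (the database path and constructor arguments here are illustrative, not prescribed by this PR):

```python
import sqlalchemy as sa
import pudl

# Hypothetical connection details; adjust to your local PUDL database.
pudl_engine = sa.create_engine("sqlite:////path/to/pudl.sqlite")
pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine, freq="AS")  # annual aggregation

# The new output method added in this PR builds the full plant-parts list.
plant_parts = pudl_out.plant_parts_eia()
```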
Codecov Report

@@            Coverage Diff            @@
##              dev    #1157     +/-   ##
=========================================
+ Coverage    83.2%    84.2%    +0.9%
=========================================
  Files          65       66       +1
  Lines        6886     7307     +421
=========================================
+ Hits         5733     6155     +422
+ Misses       1153     1152       -1

Continue to review full report at Codecov.
As of 9/7/21 this integration is in a PR, catalyst-cooperative/pudl#1157.
PR looks good! The code is stable and the outputs seem to be providing good record linkage results.
Running pudl_out.plant_parts_eia() took about 16 minutes on my computer, which I think is fast enough for our purposes. I did profile the code and found that loading the MCOE data took 25% of the total runtime and finding qualifying records took 31%.
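For reference, a hedged sketch of how such a profile might be collected (not the exact commands behind the numbers above; pudl_out setup omitted):

```python
import cProfile
import pstats

# Profile one full run of the plant-parts list build.
with cProfile.Profile() as profiler:
    pudl_out.plant_parts_eia()

# Show the 20 most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```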
It took me a while to wrap my head around what each step in the process was doing and why. The code does a bunch of transformations and it is difficult to understand what a single record represents at the end of each transformation. As someone who is less familiar with energy systems and raw FERC data, I think providing some examples of inputs and expected outputs in unit tests would be helpful.
I would like to know more about the desired outcome of the plant parts process. Are the entities in PLANT_PARTS real entities in power plants, or are they just strange slices that exist in FERC? If they are real entities, how are they related to one another? If they are just weird slices from FERC, how did we identify plant_technology, plant_prime_mover, etc. as important slices to try to create from EIA data? I think answering these questions could help us potentially simplify the plant parts logic.
src/pudl/analysis/plant_parts_eia.py
Outdated
infrastructure as other plant-part list records. For example, if we have a
"plant_prime_mover" plant part record and a "plant_unit" plant part record
which were both cobbled together from the same two generators. We want to be
able to reduce the master unit list to only unique collections of generators,
Is it possible to provide some sample data for an example of a true granularity? As someone without much energy expertise, having specific examples of how these plant parts interact would be super helpful.
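As a possible starting point, a hand-built toy illustration of the overlap described in the excerpt above (column names and values invented, not actual PUDL data):

```python
import pandas as pd

# Hypothetical case: a "plant_unit" record and a "plant_prime_mover" record
# that were both assembled from the same two generators. Only one of them
# represents a distinct ("true") granularity.
records = pd.DataFrame({
    "plant_id_eia": [1, 1],
    "plant_part": ["plant_unit", "plant_prime_mover"],
    "generator_ids": [("gen_a", "gen_b"), ("gen_a", "gen_b")],
    "capacity_mw": [100.0, 100.0],
})

# Reducing to unique collections of generators keeps a single record
# per distinct set of underlying generators.
unique = records.drop_duplicates(subset=["plant_id_eia", "generator_ids"])
```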
src/pudl/analysis/plant_parts_eia.py
Outdated
In order to accumulate every possible version of how a generator could
be reported, this method generates two records for each generator's
reported owners: one of the portion of the plant part they own and one
for the plant-part as a whole. The portion records are labeled in the
`ownership` column as "owned" and the total records are labeled as
"total".
Is it possible to add an example of a FERC record that splits a plant part into ownership slices?
great idea!
src/pudl/output/pudltabl.py
Outdated
Note: This table is entirely intended for use within
``PudlTabl.plant_parts_eia`` and would be difficult to work with
outside of that context.
Does each step of the plant parts process need to live in pudltabl.py? plant_parts_eia() could create each plant part object.
Yeah, I was wondering this myself. I mostly set it up this way for debugging, because I have very often needed to generate the final output but not these two interim outputs. I think with the deep_update argument of plant_parts_eia this workflow could stay enabled without the interim tables having their own output methods.
This enables checking of consistency across time. I think the record_id_eia column is not the cleanest as it is; hoping to get some guidance on that in general.
This has to get merged in because it's being used and the logistics of having it live in a separate repo are unworkable, but I don't really understand what's going on. Most of the comments I've made are small localized issues, but architecturally I think we have to find a way for this to be much more understandable and ideally also faster. I don't see any fundamental reason why these operations should have to take tens of minutes to run.
The identification of distinct record groups ("true granularities") feels particularly opaque.
I was also completely lost in the process of aggregating scaled generator+ownership slices.
I was unable to keep track of the nature of the tables being manipulated at various points in the process -- what to expect their natural keys to be, whether they represented individual plant parts for a single plant, or the whole dataset. I don't understand what the record IDs are.
I suspect some of the confusion could be helped by abstracting out operations that work on a given "plant part" or column that's used for scaling (capacity and ownership fraction) and separating the ownership-related operations from the plant-attribute related operations, and also from the operational status operations. Maybe setting up several different dataframes each of which is simpler to think about, which are then combined at the end into the output that's actually used for matching with FERC 1.
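To make the suggestion concrete, a structural sketch of that kind of decomposition (every function name here is hypothetical, operational-status handling is omitted, and this is not the PR's actual code):

```python
import pandas as pd

def scale_by_ownership(gens):
    """Only the ownership expansion: an 'owned' and a 'total' record per owner."""
    owned = gens.assign(
        ownership="owned", capacity_mw=gens.capacity_mw * gens.fraction_owned
    )
    total = gens.assign(ownership="total")
    return pd.concat([owned, total], ignore_index=True)

def aggregate_plant_part(gens, part_cols):
    """Only the per-plant-part attribute aggregation, keyed by ``part_cols``."""
    return gens.groupby(part_cols + ["ownership"], as_index=False)["capacity_mw"].sum()

def build_plant_parts(gens):
    """Combine the simpler pieces into the output used for FERC 1 matching."""
    scaled = scale_by_ownership(gens)
    parts = {
        "plant": ["plant_id_eia"],
        "plant_prime_mover": ["plant_id_eia", "prime_mover_code"],
    }
    return pd.concat(
        [
            aggregate_plant_part(scaled, cols).assign(plant_part=name)
            for name, cols in parts.items()
        ],
        ignore_index=True,
    )
```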
I also wasn't able to follow the flow of the program. Eventually it just became a bunch of different free-floating functions, with notes about where they were expected to be called, but I lost track of the high-level order of operations.
If we do get the CCAI grant and move to using a data structure like this as a fundamental building block of many parts of our overall data linkages, I think we need to take what you've learned from this process and do a new iteration.
src/pudl/helpers.py
Outdated
return result.to_frame(name=data_col).reset_index()


def agg_cols(df_in, id_cols, sum_cols, wtavg_dict):
agg_cols isn't descriptive of what this function does -- it's a very specific kind of aggregation of columns. The one-line docstring also doesn't really explain what's happening, other than that it involves dataframes, summing, and weighted averages. Those operations could be combined in lots of different ways to do different things.
Very fair point. I've renamed it to sum_and_weighted_average_agg. Does that sound okay?
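Based on the signature above, a hedged sketch of what a helper like this presumably does (not the actual PUDL implementation; assumes the weighted-average data columns are distinct from sum_cols):

```python
import pandas as pd

def sum_and_weighted_average_agg(df_in, id_cols, sum_cols, wtavg_dict):
    """Group by ``id_cols``; sum ``sum_cols``; weighted-average ``wtavg_dict`` keys.

    ``wtavg_dict`` maps each data column to the column used as its weight.
    """
    out = df_in.groupby(id_cols, as_index=False)[sum_cols].sum()
    for data_col, weight_col in wtavg_dict.items():
        weighted = df_in.assign(_wtd=df_in[data_col] * df_in[weight_col])
        wtavg = (
            weighted.groupby(id_cols)["_wtd"].sum()
            / weighted.groupby(id_cols)[weight_col].sum()
        ).rename(data_col).reset_index()
        out = out.merge(wtavg, on=id_cols, how="left")
    return out
```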
""" | ||
plant_parts_eia = ( | ||
pudl.helpers.calc_capacity_factor( |
Helpers seems like a weird place to have this analytical function.
I put it here because it is now being used in both analysis.mcoe and this plant_parts_eia model. I suppose I could move it back into mcoe, but it is now used in both places.
Hey @ezwelty, we're having a doctest issue here that makes absolutely no sense to us. We've spent hours trying to debug it, and wonder if it might be clear to you what's happening. On @cmgosnell's computer the doctests for this new module pass.

But on my (Linux) machine, and in the CI, some of the doctests are failing and claiming that the Python code embedded in the docstrings is producing no output, resulting in an "Expected [dataframe] got nothing" error. Examples of the build failure can be found here. This is happening in the doctests embedded within some of the class-level docstrings.

The methods that are being exercised clearly produce output when used in other contexts, and the tests are passing in her MacOS environment. Other doctests that construct dataframes by hand pass fine. It seems bizarre and we are stumped.
@zaneselvans Strange indeed. I see no way, reading through the code, for the functions being tested to return nothing. Some things to try: making the failing tests more like the working ones:

>>> df = PlantPart(part_name='plant').ag_part_by_own_slice(gens_mega, sum_cols=['capacity_mw'], wtavg_dict={})
>>> df
...

Or even explicitly calling print():

>>> print(PlantPart(part_name='plant').ag_part_by_own_slice(gens_mega, sum_cols=['capacity_mw'], wtavg_dict={}))
...
Task list to move doctests over to unit tests:
In the latest commits I merged in dev. The examples @cmgosnell included in the doctests made sense to me, so I just migrated them over directly. Questions:
I only looked at the doctest-to-unit-test migration portion of the PR for this go-around.
Seemed like there was a leftover formatting fixture (from the too-wide doctests), and like it might be preferable to break the one large test function that's checking several different things into several smaller more specific tests, unless I missed some cross-dependency between them.
test/conftest.py
Outdated
@pytest.fixture(autouse=True, scope='session')
def pandas_terminal_width():
    """Set pandas displays for doctests."""
    pd.options.display.width = 1000
    pd.options.display.max_columns = 1000
Should this still be here, or was it only for the doctests, which have now been removed?
import pudl.analysis.plant_parts_eia


def test_plant_part():
Is there a reason that all of these individual unit tests need to be combined in a single test function? Or can they easily be broken out into several more specific unit tests, which would make it easier to identify exactly which test has failed when something goes wrong? I see gens_mega shows up in many of the checks, but that could be pulled out into a GENS_MEGA constant that's used in many of the individual tests.
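A hedged sketch of that structure (test bodies and data invented purely for illustration):

```python
import pandas as pd

# Hypothetical shared input, pulled up to module scope so each small test
# can reuse it (columns invented for illustration):
GENS_MEGA = pd.DataFrame({
    "plant_id_eia": [1, 1],
    "generator_id": ["gen_a", "gen_b"],
    "capacity_mw": [50.0, 50.0],
})

def test_plant_capacity_aggregation():
    """Check only the plant-level capacity sum."""
    out = GENS_MEGA.groupby("plant_id_eia", as_index=False)["capacity_mw"].sum()
    assert out.loc[0, "capacity_mw"] == 100.0

def test_generator_records_preserved():
    """Check only that the generator-level records are intact."""
    assert len(GENS_MEGA) == 2
```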
Also, there are now some merge conflicts.
# I tried adding this into __init__.py.
# I tried adding this into the module docstring.
pd.options.display.width = 1000
pd.options.display.max_columns = 1000
Was this the final decision on where to put this?
I took out that formatting fixture, but the pandas formatting (the pd.options.display settings shown above) is still set at the top of the test module. I broke up the different plant part aggregations into smaller, more specific tests and made the global GENS_MEGA constant. I also merged in dev.
Hookay. This PR is an attempt to move the EIA plant-parts list into PUDL. The bulk of the code is in pudl.analysis.plant_parts_eia.py. I added the little standard wrappers for the three main outputs of the plant-parts list into the pudl_out object, a small collection of helper functions, one "run-it" CI test, and one data validation test.

This is a biggie. There is a fair amount of documentation in pudl.analysis.plant_parts_eia.py. I highly recommend reading the module docstring before looking at any of the code.

There are two classes in there that still feel like problem children to me because they are conceptually challenging and a bit slow (LabelTrueGranularities and AddQualifier).

Desires from this PR:

- LabelTrueGranularities (this is my least fav part... it is conceptually challenging and slow)
- AddQualifier (this part is just slow because it is doing a ton of groupby().count()'s)