Update to 2022 data #322

grgmiller · 2023-12-09T01:02:39Z

Purpose

This PR updates the OGE pipeline to work with 2022 data, and also updates the manual tables.

What the code is doing

Updates default years to 2022 (Fixes CAR-3399)
Updates source of eGRID2020 data to the v2 file (not used in the pipeline, only for comparison)

Updates reference tables (see #260) (Fixes CAR-3349)

load_data.py

Although much of the EPA-to-EIA plant ID mapping has been integrated into the pudl version of the CEMS data, our manual plant mapping from epa_eia_crosswalk_manual is not reflected there. Now, whenever CEMS data is loaded, we run update_epa_to_eia_map() to update the plant_id_eia codes.
When loading the EPA-EIA crosswalk, we now load in the manual plant map from eGRID to make sure this is considered in the mapping.
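
As an illustration of the idea behind re-applying the manual mapping after loading CEMS, here is a minimal sketch. It is not the actual update_epa_to_eia_map() implementation: the helper name, the plant_id_epa column, and the plant-level (rather than unit-level) mapping are assumptions made for the example.

```python
import pandas as pd


def apply_manual_plant_map(cems: pd.DataFrame, manual_map: pd.DataFrame) -> pd.DataFrame:
    """Override plant_id_eia using a manual EPA-to-EIA plant map.

    Hypothetical helper: assumes both frames have `plant_id_epa` and
    `plant_id_eia` columns and that the manual map has one row per EPA plant.
    """
    merged = cems.merge(
        manual_map.rename(columns={"plant_id_eia": "plant_id_eia_manual"}),
        how="left",
        on="plant_id_epa",
        validate="m:1",  # fail loudly if the manual map has duplicate EPA IDs
    )
    # Prefer the manual ID where one exists, otherwise keep the existing ID
    merged["plant_id_eia"] = (
        merged["plant_id_eia_manual"].fillna(merged["plant_id_eia"]).astype("Int64")
    )
    return merged.drop(columns="plant_id_eia_manual")


cems_example = pd.DataFrame({"plant_id_epa": [1, 2, 3], "plant_id_eia": [1, 2, 3]})
manual_example = pd.DataFrame({"plant_id_epa": [2], "plant_id_eia": [20]})
updated = apply_manual_plant_map(cems_example, manual_example)
```

Only the plant listed in the manual map changes; every other row keeps the ID that came with the loaded data.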

data_cleaning.py

No longer removes steam units: Previously, we dropped from the CEMS data any units that reported only steam load and no gross generation (see Interpretation of steam data from CEMS #103 for context), mostly to save memory, since we were not including this steam data in the outputs. However, based on our updated understanding in Convert CEMS steam load to gross generation #216, we do not necessarily want to drop this data. We are not yet implementing the steam-to-generation conversion, but continuing to drop these units would have required updating the steam_units_to_remove manual table, which did not seem worthwhile if we are not going to drop these units in the future.
Drops negative fuel consumption: Plant 10613 reports negative fuel consumption for its WDS fuel in May 2022. This is an error in the input EIA-923 data but our pipeline was not previously set up to handle such cases. For now, we replace these negative fuel consumption values with missing values, and print a warning in the logger.
Switches back to running pudl.analysis.allocate_gen_fuel_by_generator_energy_source() instead of loading the table from pudl. This lets us see the data quality warnings generated while running that part of the pipeline, which will help with data quality checks.
Deletes the code that dropped all EIA-923 data that was missing or zero. I cannot remember why I implemented this, but it was resulting in a lot of missing records in the data outputs. Even if fuel consumption and generation are zero for a plant in a month, that is still useful information and we should not drop it. Note: some data is actually missing in the input EIA-923 data.
Fixes a bug when identifying partial cems plants. We did not want to assign the "partial cems" methodology to a subplant if the primary fuel of the plant was not fossil based, since that could mean that we were assigning data from a fossil backup generator to shape nuclear generation, for example. This should have been identified using the "plant_primary_fuel" column but it was mistakenly using the "energy_source_code" column, which is the generator-specific energy source. This was leading to a mixed method error at line 1765.
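
For reference, the negative fuel consumption handling described above could look roughly like the sketch below. The function name and the fuel_consumed_mmbtu column are assumptions for illustration, not the names used in the pipeline.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def replace_negative_fuel_consumption(
    df: pd.DataFrame, fuel_columns: list[str]
) -> pd.DataFrame:
    """Replace negative fuel consumption values with NaN and log a warning.

    Hypothetical helper: column names are supplied by the caller.
    """
    df = df.copy()
    for col in fuel_columns:
        negative = df[col] < 0
        if negative.any():
            logger.warning(
                "Replacing %d negative values in %s with missing values:\n%s",
                int(negative.sum()),
                col,
                df.loc[negative].to_string(),
            )
            # mask() sets the flagged rows to NaN and leaves the rest unchanged
            df[col] = df[col].mask(negative)
    return df


fuel_example = pd.DataFrame(
    {"plant_id_eia": [10613, 10613, 3], "fuel_consumed_mmbtu": [10.0, -5.0, 0.0]}
)
cleaned = replace_negative_fuel_consumption(fuel_example, ["fuel_consumed_mmbtu"])
```

Zero values are kept as data; only strictly negative values become missing.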

eia930.py

Adds an end date to the TEPC timestamp issue in EIA-930 (Fixes CAR-2506)

emissions.py

For some reason in 2022, none of the JF-fueled plants reported fuel sulfur content data. This means that several plants were missing SO2 emission factors. To fix this, I added a further backstop to the existing annual_avg_fuel_sulfur_content. Now, in the case that there is no sulfur content data available for a fuel in a year, the pipeline will check the reported sulfur contents from the previous year to see if it can fill in the annual average value. This means for the 2022 pipeline, the JF generators without a specified sulfur content will use the average JF sulfur content from 2021 (looking at multiple years, this sulfur content does not seem to change from year to year so this seems like a reasonable backstop). In implementing this fix, I split an existing function into two components.
When checking for missing NOx and SO2 emission factors, we now only raise this error if the factors are missing for fuel-PMs where there is non-zero fuel consumption. Previously we had been checking in cases of missing fuel consumption, but we do not need emission factors for fuel-PMs with zero fuel consumption.
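
The previous-year backstop for sulfur content can be sketched as follows. This is an illustration under assumed column names (report_year, energy_source_code, sulfur_content_pct) and a hypothetical function name; it is not the code in emissions.py.

```python
import numpy as np
import pandas as pd


def annual_avg_sulfur_with_backstop(sulfur: pd.DataFrame, year: int) -> pd.Series:
    """Average sulfur content by fuel for `year`, falling back to `year - 1`.

    Hypothetical helper: expects one row per reported sulfur content value.
    """
    wide = (
        sulfur.groupby(["energy_source_code", "report_year"])["sulfur_content_pct"]
        .mean()  # NaN if a fuel has no reported values in a year
        .unstack("report_year")
        .reindex(columns=[year, year - 1])  # guarantees both columns exist
    )
    # Use the current-year average where available, else the prior-year average
    return wide[year].fillna(wide[year - 1]).rename("sulfur_content_pct")


sulfur_example = pd.DataFrame(
    {
        "report_year": [2021, 2021, 2022, 2021, 2022],
        "energy_source_code": ["JF", "JF", "JF", "DFO", "DFO"],
        "sulfur_content_pct": [0.2, 0.4, np.nan, 0.5, 0.4],
    }
)
avg_2022 = annual_avg_sulfur_with_backstop(sulfur_example, 2022)
```

In this toy data JF has no reported value in 2022, so it picks up the 2021 average, while DFO keeps its 2022 value.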

validation.py

Now prints a sample of the bad data whenever the test_for_negative_values check does not pass.
Fixes a bug where the incorrect variable was being referenced when testing emissions adjustments.
Renames the check_for_complete_timeseries check to check_for_complete_hourly_timeseries, and adds a check_for_complete_monthly_timeseries check. This new check ensures that monthly-resolution data contains all 12 months of data.
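
A minimal sketch of what a 12-month completeness check can look like is below. It assumes a single year of data, a report_date column holding month-start dates, and a hypothetical function name; the actual check_for_complete_monthly_timeseries may differ.

```python
import pandas as pd


def find_incomplete_monthly_timeseries(
    df: pd.DataFrame, group_cols: list[str], month_col: str = "report_date"
) -> pd.Series:
    """Return the number of distinct months for groups with fewer than 12.

    Hypothetical helper: assumes `df` covers a single calendar year.
    """
    months_per_group = df.groupby(group_cols)[month_col].nunique()
    return months_per_group[months_per_group < 12]


months = pd.date_range("2022-01-01", periods=12, freq="MS")
monthly_example = pd.DataFrame(
    {
        "plant_id_eia": [1] * 12 + [2] * 11,
        "report_date": list(months) + list(months[:11]),  # plant 2 lacks December
    }
)
incomplete = find_incomplete_monthly_timeseries(monthly_example, ["plant_id_eia"])
```

An empty result means every group has all 12 months.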

Testing

Running the pipeline for 2022 (not yet complete)

Usage Example/Visuals

How the code can be used and/or images of any graphs, tables or other visuals (not always applicable).

Review estimate

How long will it take for reviewers and observers to understand this code change?

Future work

The following warnings were raised when running the pipeline:

"oge.oge.validation:249 There are 259 subplants that only contain one part of a combined cycle system. Subplants that represent combined cycle generation should contain both CA and CT parts.": This warning is the same as in previous years. It is on our backlog, but we have not gotten around to fixing it yet.
"oge.oge.validation:78 Allocated EIA-923 doesn't match input data for plants": This is a high-priority fix, but it requires changes to the pudl code. See pudl.analysis.allocate_gen_fuel is dropping/adding data catalyst-cooperative/pudl#3165
"oge.oge.validation:140 There are 173 plants where the assigned primary fuel doesn't match the capacity-based primary fuel. It is possible that these plants will categorized as a different fuel in EIA-930": This is expected/same as previous years. Nothing to "fix" here but may influence some future design choices about how primary fuels are assigned. See Validate primary fuel methodology that should be used for assigning a plant's fuel category #281
We get warnings about missing NOx and SO2 emission factors for fuel cell (FC) prime movers. This is expected. See Emissions from fuel cells #70
"oge.oge.validation:334 Some data being removed has non-zero data associated with it:" We drop CEMS data where there is zero generation or fuel consumption data. However, some of these records have non-zero steam load and NOx mass. This is a known issue and is tied to Integrate CEMS data for units that only report NOx #134
"oge.oge.validation:374 There are 212 subplants at 142 plants for which there is zero gross generation associated with positive net generation.", "oge.oge.validation:411 There are 53 subplants at 41 plants for which there is zero net generation associated with positive gross generation.", "oge.oge.validation:450 The following plants have annual net generation that is >125% of annual gross generation:" This is raised when EIA data is linked to CEMS data, and the generation numbers are inconsistent between the two sources. This happens with all years of data and is not unique to 2022. This may be due to inconsistent reporting between sources, or potentially bad crosswalking. This will require future investigation in our backlog of issues.
"oge.oge.gross_to_net_generation:93 The following subplants are missing default GTN ratios. Using a default value of 0.97": For some reason these plants do not have prime-mover related default factors from EIA. We should look into the prime mover code of these plants.
"oge.oge.validation:1606 Potentially anomalous co2 factors detected for the following plants" - there are some plants that have abnormally high or low EFs. These need to be manually looked into to understand the source of these anomalies. Some might just be input data quality issues, or there is an issue with our processing pipeline, or they are just abnormal plants and the data is accurate.

Checklist

Update the documentation to reflect changes made in this PR
Format all updated python files using ruff
Clear outputs from all notebooks modified
Add docstrings and type hints to any new functions created

rouille · 2023-12-22T19:05:59Z

src/oge/data_cleaning.py

@@ -1002,7 +1018,8 @@ def clean_cems(year: int, small: bool, primary_fuel_table, subplant_emission_fac
    )

    # manually remove steam-only units
-    cems = manually_remove_steam_units(cems)
+    # NOTE(greg): disabling this for the 2022 data release
+    # cems = manually_remove_steam_units(cems)


Why do we stop removing these units in 2022?

Sorry this is unclear! This is not just for 2022, we are stopping using this method altogether. See note about this in the main PR description. I'll delete this whole line.

I understand it will apply to all years from now on. My question is why we stop removing steam-only units. Do we feel confident estimating emissions from these generators moving forward?

rouille · 2023-12-22T19:19:59Z

src/oge/consumed.py

-                # Cut off emissions at 9 hours after UTC year
-                emissions = emissions[: f"{self.year+1}-01-01 09:00:00+00:00"]
+                # Cut off emissions at 8 hours after UTC year
+                emissions = emissions[: f"{self.year+1}-01-01 08:00:00+00:00"]


How does it work? I thought the emissions would be hourly over a full year in UTC time. Are we doing this in Pacific Time instead? And why was it nine hours before? Did we have Alaska in 2021 and not 2022?

So this is just saying that we only want to do the calculations through midnight Pacific Time on 12/31, which would be 8am UTC on Jan 1. Not sure why this was previously set to 9 hours, since we do not do the consumed calculations for AK or HI. I'm also not sure why this didn't raise a KeyError in previous years, but it does now, since 2023-01-01 09:00 does not exist in the data.

Should we also set the lower bound, so that it starts on 2022-01-01 Pacific Time? That is, do:

emissions = emissions[f"{self.year}-01-01 08:00:00+00:00": f"{self.year+1}-01-01 08:00:00+00:00"]

Right now, we probably include the last 8 hours of 2021-12-31 (Pacific Time).
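
To make the two-sided bound concrete, here is a small self-contained sketch on synthetic hourly data. The series and variable names are made up for the example; only the 08:00 UTC offsets come from the discussion above.

```python
import pandas as pd

year = 2022

# Synthetic hourly UTC series that extends past both ends of the target year
index = pd.date_range(
    "2021-12-31 00:00", "2023-01-01 23:00", freq=pd.Timedelta(hours=1), tz="UTC"
)
emissions = pd.Series(1.0, index=index)

# Midnight Pacific Standard Time is 08:00 UTC
start = pd.Timestamp(f"{year}-01-01 08:00:00+00:00")
end = pd.Timestamp(f"{year + 1}-01-01 08:00:00+00:00")

# Label-based slicing on a DatetimeIndex includes both endpoints
bounded = emissions.loc[start:end]
```

Because both endpoints are included, this keeps 8,761 hourly values for a non-leap year: the 8,760 hours of the Pacific-time year plus the 08:00 UTC hour on Jan 1 of the following year, which matches how the existing upper bound behaves.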

grgmiller · 2023-12-23T19:23:55Z

Updates since last review: (in general, these are aimed at reducing the number of missing values in the output data)

data_cleaning.py

Adds a new function, inventory_input_data_sources, to help with validating missing data later in the pipeline. When we identify missing months of data, we want to know whether we introduced that missing data or whether it was already missing in the input data sources. For each plant-month, this function identifies whether there was any data in CEMS or the EIA-923 generation and fuel table (the two definitive sources of generation and fuel consumption data). It also indicates whether there was any non-zero input data for that plant-month. This table is written to the outputs folder.
We had previously been removing data from the CEMS table if there were all zeros reported for a specific unit month. However, all zeros are still data, and by removing this, we were introducing missing data further down in the pipeline. We stop using remove_cems_with_zero_monthly_data().
Sometimes in CEMS, there is a month with less than a single day of data (and typically only a single hour of data). The function that removed these unit months was previously removing data with fewer than 600 monthly observations, but the intent was only to remove data that is mostly incomplete except for a few hours. We fix this bug.
Fixes a bug when identifying partial CEMS subplants. We only want to flag the subplant as having partial data if both the data from CEMS and EIA is non-zero.
Fixes a bug when identifying balancing areas where some plants were not being assigned a BA. We now use the most recent BA assignment available, even if there is a missing BA assignment for the current year.
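
The input data inventory described above can be sketched roughly as follows. This is a simplified illustration, not the inventory_input_data_sources implementation: the function name, the shared value column, and the flag column names are all assumptions.

```python
import numpy as np
import pandas as pd

KEY = ["plant_id_eia", "report_date"]


def summarize_source(df: pd.DataFrame, source: str, value_cols: list[str]) -> pd.DataFrame:
    """Flag, per plant-month, whether a source has any data and any non-zero data."""
    has_nonzero = df[value_cols].fillna(0).ne(0).any(axis=1)
    flags = df[KEY].assign(
        **{f"in_{source}": True, f"nonzero_{source}": has_nonzero}
    )
    return flags.groupby(KEY, as_index=False).any()


def inventory_plant_months(
    cems: pd.DataFrame, eia923: pd.DataFrame, value_cols: list[str]
) -> pd.DataFrame:
    """Hypothetical inventory of input data availability by plant-month."""
    inventory = summarize_source(cems, "cems", value_cols).merge(
        summarize_source(eia923, "eia", value_cols), how="outer", on=KEY
    )
    flag_cols = [c for c in inventory.columns if c not in KEY]
    # Plant-months absent from one source have NaN flags; treat those as False
    inventory[flag_cols] = inventory[flag_cols].eq(True)
    return inventory


jan, feb = pd.Timestamp("2022-01-01"), pd.Timestamp("2022-02-01")
cems_inputs = pd.DataFrame(
    {"plant_id_eia": [1, 1], "report_date": [jan, feb], "fuel_consumed_mmbtu": [0.0, 10.0]}
)
eia_inputs = pd.DataFrame(
    {"plant_id_eia": [1, 2], "report_date": [jan, jan], "fuel_consumed_mmbtu": [3.0, np.nan]}
)
inventory = inventory_plant_months(
    cems_inputs, eia_inputs, ["fuel_consumed_mmbtu"]
).set_index(KEY)
```

A downstream check can then compare missing output months against this table to tell apart gaps that were already present in the inputs from gaps introduced by the pipeline.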

output_data.py

When outputting plant metadata, we fix an issue where incorrect metadata for shaped EIA plants was being output, and refactor the code a bit to make it easier to follow.

validation.py

We update the check for complete monthly timeseries to make use of the input data inventory. This check now flags whether there are fewer than 12 months of output data for each grouping, as well as where there are more missing data values than expected based on missing inputs.

rouille

Thanks

update references to 2021

0f8d2fd

grgmiller changed the base branch from main to development December 15, 2023 23:24

grgmiller added 10 commits December 15, 2023 15:35

Merge branch 'development' into greg/2022_data

efc789b

update egrid2020 url

69976b7

update 930 cleaning for TEPC

0d9aeca

update manual tables

b4cc46c

manual epa eia mapping

be1aece

fix missing nox factors for JF

fcaa139

fix futurewarning

c58f42f

update consumed calcs for 2022

2eaa023

fix allocate_gen_fuel warning

fb49d6a

add check for monthly data

b1d9327

grgmiller marked this pull request as ready for review December 22, 2023 01:36

grgmiller requested a review from rouille December 22, 2023 01:36

rouille reviewed Dec 22, 2023

grgmiller added 4 commits December 22, 2023 11:28

fix issue with incorrect plant metadata

c43bb1b

stop dropping missing data

b37dc7e

fix bug with to_string

c33ea9b

fix eia-923 validation

ea9ee2a

rouille approved these changes Dec 27, 2023

update documentation

de10b0a

grgmiller merged commit 5ed7596 into development Dec 27, 2023
2 checks passed

grgmiller deleted the greg/2022_data branch December 27, 2023 22:50

grgmiller restored the greg/2022_data branch December 29, 2023 18:47

grgmiller deleted the greg/2022_data branch April 13, 2024 22:10
