Update to 2022 data #322

Merged
merged 16 commits into from
Dec 27, 2023

Conversation

grgmiller
Collaborator

@grgmiller grgmiller commented Dec 9, 2023

Purpose

This PR updates the OGE pipeline to work with 2022 data, and also updates the manual tables.

What the code is doing

Updates default years to 2022 (Fixes CAR-3399)
Updates source of eGRID2020 data to the v2 file (not used in the pipeline, only for comparison)

Updates reference tables (see #260) (Fixes CAR-3349)

  • ba_reference: no update to the FERC table, new retirements of GLHB and GRIF according to EIA
  • default_gross_to_net_ratios.csv: No updates
  • eGRID2020_crosswalk_of_EIA_ID_to_EPA_ID.csv: updated based on eGRID2021. One new plant added to list
  • emission_factors_for_co2_ch4_n2o: No updates to AP-42 or IPCC
  • emission_factors_for_nox: No updates to AP-42 or IPCC. Added several new factors for boiler configurations that were not previously added, but are in the 2022 data.
  • emission_factors_for_so2: No updates to AP-42 or IPCC. Added several new factors for boiler configurations that were not previously added, but are in the 2022 data.
  • energy_source_groups: no changes based on pudl metadata
  • epa_eia_crosswalk_manual: used notebook to identify new additions to table
  • geothermal_emission_factors: no changes to source data
  • ipcc_gwp: most current report is still AR6
  • physical_ba
  • plants_not_connected_to_grid: no changes based on eGRID2021
  • steam_units_to_remove: Not updated since steam units no longer being removed.
  • updated_oth_energy_source_codes: ran notebook, no new matches needed
  • utility_name_ba_code_map: ran notebook, added several new maps, sorted alphabetically.

load_data.py

  • Although much of the EPA plant ID to EIA plant ID mapping has been integrated into the pudl version of the CEMS data, our manual plant mapping from epa_eia_crosswalk_manual would not be reflected there. Now, whenever CEMS data is loaded, we run update_epa_to_eia_map() to update the plant_id_eia codes.
  • When loading the EPA-EIA crosswalk, we now load in the manual plant map from eGRID to make sure this is considered in the mapping.
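
As a rough illustration of the idea (not the actual update_epa_to_eia_map() implementation), a manual crosswalk can be applied on top of the IDs that ship with the CEMS data. The function name `apply_manual_plant_map`, the plant-level-only matching, and the sample IDs are assumptions made for this sketch.

```python
import pandas as pd


def apply_manual_plant_map(cems: pd.DataFrame, manual_map: pd.DataFrame) -> pd.DataFrame:
    """Override plant_id_eia wherever the manual crosswalk has an entry.

    Hypothetical helper: matches on plant_id_epa only, which is simpler
    than a unit-level crosswalk would be.
    """
    overrides = manual_map.set_index("plant_id_epa")["plant_id_eia"]
    out = cems.copy()
    # Rows without a manual entry keep the plant_id_eia they arrived with.
    out["plant_id_eia"] = (
        out["plant_id_epa"].map(overrides).fillna(out["plant_id_eia"]).astype("int64")
    )
    return out


# Synthetic IDs for illustration only.
cems = pd.DataFrame({"plant_id_epa": [10, 20], "plant_id_eia": [10, 20]})
manual_map = pd.DataFrame({"plant_id_epa": [20], "plant_id_eia": [99]})
remapped = apply_manual_plant_map(cems, manual_map)
```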

data_cleaning.py

  • No longer remove steam units: Previously, we dropped from the CEMS data any units that reported only steam load and no gross generation (see "Interpretation of steam data from CEMS" (#103) for context), mostly to save memory, since we were not including this steam data in the outputs. However, based on our updated understanding in "Convert CEMS steam load to gross generation" (#216), we do not necessarily want to drop this data. We are not yet implementing the steam-to-generation conversion, but continuing to drop these units would have required updating the steam_units_to_remove manual table, which did not seem worthwhile if we will not be dropping these units in the future.
  • Drops negative fuel consumption: Plant 10613 reports negative fuel consumption for its WDS fuel in May 2022. This is an error in the input EIA-923 data but our pipeline was not previously set up to handle such cases. For now, we replace these negative fuel consumption values with missing values, and print a warning in the logger.
  • Switches back to running pudl.analysis.allocate_gen_fuel_by_generator_energy_source() instead of loading the table from pudl. This allows us to see data quality warnings generated during running that part of the pipeline which will help with data quality checks.
  • Deletes the code that dropped all EIA-923 data that was missing or zero. I cannot remember why I implemented this, but it was resulting in a lot of missing records in the data outputs. Even if fuel consumption and generation are zero for a plant in a month, that is still useful information and we should not drop it. Note: some data is actually missing in the input EIA-923 data.
  • Fixes a bug when identifying partial cems plants. We did not want to assign the "partial cems" methodology to a subplant if the primary fuel of the plant was not fossil based, since that could mean that we were assigning data from a fossil backup generator to shape nuclear generation, for example. This should have been identified using the "plant_primary_fuel" column but it was mistakenly using the "energy_source_code" column, which is the generator-specific energy source. This was leading to a mixed method error at line 1765.
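
The negative fuel consumption handling described above could look roughly like the following. This is a minimal sketch, not the code in this PR: the function name, the `fuel_consumed_mmbtu` column name, and the sample values are assumptions.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def replace_negative_fuel_with_missing(
    df: pd.DataFrame, fuel_col: str = "fuel_consumed_mmbtu"
) -> pd.DataFrame:
    """Replace negative fuel consumption with NaN and log the affected rows."""
    negative = df[fuel_col] < 0
    if negative.any():
        logger.warning(
            "Replacing %d negative %s value(s) with missing values:\n%s",
            int(negative.sum()),
            fuel_col,
            df.loc[negative].to_string(),
        )
    out = df.copy()
    # Series.mask sets entries to NaN wherever the condition is True.
    out[fuel_col] = out[fuel_col].mask(negative)
    return out


# Synthetic values for illustration only.
example = pd.DataFrame(
    {
        "plant_id_eia": [10613, 10613, 1],
        "energy_source_code": ["WDS", "WDS", "NG"],
        "report_date": pd.to_datetime(["2022-04-01", "2022-05-01", "2022-05-01"]),
        "fuel_consumed_mmbtu": [120.0, -35.0, 400.0],
    }
)
cleaned = replace_negative_fuel_with_missing(example)
```

Returning a copy keeps the original input intact, which makes it easier to compare before and after when debugging a data quality warning.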

eia930.py

  • Adds an end date to the TEPC timestamp issue in EIA-930 (Fixes CAR-2506)

emissions.py

  • For some reason in 2022, none of the JF-fueled plants reported fuel sulfur content data. This means that several plants were missing SO2 emission factors. To fix this, I added a further backstop to the existing annual_avg_fuel_sulfur_content. Now, in the case that there is no sulfur content data available for a fuel in a year, the pipeline will check the reported sulfur contents from the previous year to see if it can fill in the annual average value. This means for the 2022 pipeline, the JF generators without a specified sulfur content will use the average JF sulfur content from 2021 (looking at multiple years, this sulfur content does not seem to change from year to year so this seems like a reasonable backstop). In implementing this fix, I split an existing function into two components.
  • When checking for missing NOx and SO2 emission factors, we now raise this error only if the factors are missing for prime mover (PM)-fuel combinations with non-zero fuel consumption. Previously, the check also covered cases of missing fuel consumption, but we do not need emission factors for PM-fuel combinations with zero fuel consumption.
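
The previous-year backstop for annual average sulfur content can be sketched as follows. This is an illustration under assumptions, not the function added in this PR: the function name, the `report_year` and `sulfur_content_pct` column names, and the sample values are all hypothetical.

```python
import pandas as pd


def annual_avg_sulfur_with_backstop(
    sulfur: pd.DataFrame, year: int, fuels: list[str]
) -> pd.Series:
    """Average sulfur content per fuel for `year`, falling back to `year - 1`."""
    by_year = (
        sulfur.groupby(["energy_source_code", "report_year"])["sulfur_content_pct"]
        .mean()  # NaN values are skipped; an all-NaN group yields NaN
        .unstack("report_year")
        .reindex(index=fuels, columns=[year - 1, year])
    )
    # Use the current-year average where it exists, else last year's average.
    return by_year[year].fillna(by_year[year - 1])


# Synthetic values for illustration only: JF has no usable 2022 data.
records = pd.DataFrame(
    {
        "report_year": [2021, 2021, 2021, 2022, 2022],
        "energy_source_code": ["JF", "JF", "DFO", "DFO", "JF"],
        "sulfur_content_pct": [0.05, 0.07, 0.20, 0.30, float("nan")],
    }
)
avg_2022 = annual_avg_sulfur_with_backstop(records, 2022, ["JF", "DFO"])
```

In this sketch a fuel with no data in either year stays NaN, so a downstream missing-factor check would still catch it.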

validation.py

  • Now prints a sample of the bad data whenever the test_for_negative_values check does not pass.
  • Fixes a bug where the incorrect variable was being referenced when testing emissions adjustments.
  • Renames the check_for_complete_timeseries check to check_for_complete_hourly_timeseries, and adds a check_for_complete_monthly_timeseries check. This new check ensures that monthly-resolution data contains all 12 months of data.
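
For readers unfamiliar with the monthly check, the core idea can be sketched as counting distinct months per plant. The function below is a hypothetical stand-in for check_for_complete_monthly_timeseries, with assumed column names and synthetic data.

```python
import pandas as pd


def find_incomplete_monthly_timeseries(
    df: pd.DataFrame, year: int, date_col: str = "report_date"
) -> pd.Series:
    """Return the month count for every plant with fewer than 12 months in `year`."""
    in_year = df[df[date_col].dt.year == year]
    months_present = (
        in_year[date_col].dt.month.groupby(in_year["plant_id_eia"]).nunique()
    )
    return months_present[months_present < 12]


# Synthetic data: plant 1 is complete, plant 2 is missing June.
full_year = pd.date_range("2022-01-01", periods=12, freq="MS")
data = pd.DataFrame(
    {
        "plant_id_eia": [1] * 12 + [2] * 11,
        "report_date": list(full_year) + list(full_year.delete(5)),
    }
)
incomplete = find_incomplete_monthly_timeseries(data, 2022)
```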

Testing

Running the pipeline for 2022 (not yet complete)

Usage Example/Visuals

How the code can be used and/or images of any graphs, tables or other visuals (not always applicable).

Review estimate

How long will it take for reviewers and observers to understand this code change?

Future work

The following warnings were raised when running the pipeline:

  • "oge.oge.validation:249 There are 259 subplants that only contain one part of a combined cycle system. Subplants that represent combined cycle generation should contain both CA and CT parts.": This warning is the same as in previous years. It is on our backlog, and we have not gotten around to fixing it yet.
  • "oge.oge.validation:78 Allocated EIA-923 doesn't match input data for plants": This is a high priority to fix, but it requires changes to the pudl code. See "pudl.analysis.allocate_gen_fuel is dropping/adding data" (catalyst-cooperative/pudl#3165).
  • "oge.oge.validation:140 There are 173 plants where the assigned primary fuel doesn't match the capacity-based primary fuel. It is possible that these plants will categorized as a different fuel in EIA-930": This is expected and the same as in previous years. There is nothing to "fix" here, but it may influence some future design choices about how primary fuels are assigned. See "Validate primary fuel methodology that should be used for assigning a plant's fuel category" (#281).
  • We get warnings about missing NOx and SO2 emission factors for fuel cell (FC) prime movers. This is expected. See "Emissions from fuel cells" (#70).
  • "oge.oge.validation:334 Some data being removed has non-zero data associated with it:" We drop CEMS data where there is zero generation or fuel consumption data. However, some of these records have non-zero steam load and NOx mass. This is a known issue and is tied to "Integrate CEMS data for units that only report NOx" (#134).
  • "oge.oge.validation:374 There are 212 subplants at 142 plants for which there is zero gross generation associated with positive net generation.", "oge.oge.validation:411 There are 53 subplants at 41 plants for which there is zero net generation associated with positive gross generation.", "oge.oge.validation:450 The following plants have annual net generation that is >125% of annual gross generation:" This is raised when EIA data is linked to CEMS data, and the generation numbers are inconsistent between the two sources. This happens with all years of data and is not unique to 2022. This may be due to inconsistent reporting between sources, or potentially bad crosswalking. This will require future investigation in our backlog of issues.
  • "oge.oge.gross_to_net_generation:93 The following subplants are missing default GTN ratios. Using a default value of 0.97": For some reason these plants do not have prime-mover related default factors from EIA. We should look into the prime mover code of these plants.
  • "oge.oge.validation:1606 Potentially anomalous co2 factors detected for the following plants" - there are some plants that have abnormally high or low EFs. These need to be manually looked into to understand the source of these anomalies. Some might just be input data quality issues, or there is an issue with our processing pipeline, or they are just abnormal plants and the data is accurate.

Checklist

  • Update the documentation to reflect changes made in this PR
  • Format all updated python files using ruff
  • Clear outputs from all notebooks modified
  • Add docstrings and type hints to any new functions created

@grgmiller grgmiller changed the base branch from main to development December 15, 2023 23:24
@grgmiller grgmiller marked this pull request as ready for review December 22, 2023 01:36
@@ -1002,7 +1018,8 @@ def clean_cems(year: int, small: bool, primary_fuel_table, subplant_emission_fac
)

# manually remove steam-only units
-cems = manually_remove_steam_units(cems)
+# NOTE(greg): disabling this for the 2022 data release
+# cems = manually_remove_steam_units(cems)
Collaborator

Why do we stop removing these units in 2022?

Collaborator Author

Sorry this is unclear! This is not just for 2022, we are stopping using this method altogether. See note about this in the main PR description. I'll delete this whole line.

Collaborator

I understand it will apply to all years from now on. My question is why do we stop removing steam-only units. Do we feel confident estimating emissions from these generators moving forward?

-# Cut off emissions at 9 hours after UTC year
-emissions = emissions[: f"{self.year+1}-01-01 09:00:00+00:00"]
+# Cut off emissions at 8 hours after UTC year
+emissions = emissions[: f"{self.year+1}-01-01 08:00:00+00:00"]
Collaborator

How does it work? I thought the emissions would be hourly over a full year in UTC time. Are we doing this in Pacific Time instead? And why was it nine hours before? Did we have Alaska in 2021 and not 2022?

Collaborator Author

So this is just saying that we only want to do the calculations through midnight pacific time on 12/31, which would be 8am UTC time on Jan 1. Not sure why this was previously set to 9 hours, since we do not do the consumed calculations for AK or HI. I'm also not sure why this didn't raise a key error in previous years, but it does now, since 2023-01-01 9:00 does not exist in the data.

Collaborator

Should we also set the lower bound, so that it starts on 2022-01-01 Pacific Time? I.e., do:

emissions = emissions[f"{self.year}-01-01 08:00:00+00:00": f"{self.year+1}-01-01 08:00:00+00:00"]

Right now, we probably include the last 8 hours of 2021-12-31 (Pacific Time).
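
To make the windowing concrete, here is a small sketch on synthetic hourly data. It assumes a tz-aware UTC DatetimeIndex and a fixed UTC-8 offset for Pacific Standard Time; the `co2_mass_lb` column and the variable names are hypothetical. One detail worth keeping in mind is that pandas label slices include both endpoints, so a two-sided label slice keeps one more hour than a half-open window.

```python
import pandas as pd

year = 2022

# Synthetic hourly data in UTC, padded beyond the window of interest.
index = pd.date_range(
    f"{year}-01-01 00:00", f"{year + 1}-01-01 12:00", freq=pd.Timedelta(hours=1), tz="UTC"
)
emissions = pd.DataFrame({"co2_mass_lb": 1.0}, index=index)

# Midnight Pacific Standard Time on Jan 1 is 08:00 UTC.
start = pd.Timestamp(f"{year}-01-01 08:00", tz="UTC")
end = pd.Timestamp(f"{year + 1}-01-01 08:00", tz="UTC")

# Label-based slicing is inclusive on both ends, so this keeps the 08:00 hour
# of the following year as well.
both_bounds = emissions.loc[start:end]

# A half-open window [start, end) covers exactly one Pacific-time year.
pacific_year = emissions[(emissions.index >= start) & (emissions.index < end)]
```

Whether the final 08:00 row should be kept depends on how downstream code interprets hourly timestamps (hour beginning versus hour ending), which this sketch does not try to settle.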

@grgmiller
Collaborator Author

Updates since last review: (in general, these are aimed at reducing the number of missing values in the output data)

data_cleaning.py

  • Adds a new function, inventory_input_data_sources, to help with validating missing data later in the pipeline. When we identify missing months of data, we want to know whether we introduced that missing data or whether it was already missing in the input data sources. For each plant-month, this function identifies whether there was any data in CEMS or the EIA-923 generation and fuel table (the two definitive sources of generation and fuel consumption data), and whether any of that input data was non-zero. This table is written to the outputs folder.
  • We had previously been removing data from the CEMS table if there were all zeros reported for a specific unit month. However, all zeros are still data, and by removing this, we were introducing missing data further down in the pipeline. We stop using remove_cems_with_zero_monthly_data().
  • Sometimes in CEMS, there is a month with less than a single day of data (and typically only a single hour of data). The function that removed these unit months was previously removing data with fewer than 600 monthly observations, but the intent was only to remove data that is mostly incomplete except for a few hours. We fix this bug.
  • Fixes a bug when identifying partial CEMS subplants. We only want to flag a subplant as having partial data if the data from both CEMS and EIA is non-zero.
  • Fixes a bug when identifying balancing areas where some plants were not being assigned a BA. We now use the most recent BA assignment available, even if the BA assignment is missing for the current year.
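
The balancing area fallback in the last bullet can be illustrated with a short sketch. This is not the pipeline's implementation: the function name, the `report_year` and `ba_code` column names, and the sample assignments are assumptions.

```python
import pandas as pd


def most_recent_ba(assignments: pd.DataFrame, year: int) -> pd.Series:
    """Latest non-missing BA code per plant, using data up to and including `year`."""
    known = assignments[assignments["report_year"] <= year].dropna(subset=["ba_code"])
    return known.sort_values("report_year").groupby("plant_id_eia")["ba_code"].last()


# Synthetic assignments: plant 1 has no BA reported for 2022.
assignments = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2, 2],
        "report_year": [2021, 2022, 2022, 2023],
        "ba_code": ["CISO", None, "PJM", "MISO"],
    }
)
ba_2022 = most_recent_ba(assignments, 2022)
```

Plant 1 falls back to its 2021 assignment, and the 2023 row for plant 2 is ignored because it is later than the data year.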

output_data.py

  • When outputting plant metadata, we fix an issue where incorrect metadata for shaped EIA plants was getting outputted, and re-factored the code a bit to make it easier to follow.

validation.py

  • We update the check for complete monthly timeseries to make use of the input data inventory. This check now flags groupings with fewer than 12 months of output data, as well as cases where there are more missing data values than expected based on missing inputs.

Collaborator

@rouille rouille left a comment

Thanks

@grgmiller grgmiller merged commit 5ed7596 into development Dec 27, 2023
2 checks passed
@grgmiller grgmiller deleted the greg/2022_data branch December 27, 2023 22:50
@grgmiller grgmiller restored the greg/2022_data branch December 29, 2023 18:47
@grgmiller grgmiller deleted the greg/2022_data branch April 13, 2024 22:10