Plant part updates to fix RMI CI memory issues #1865

Merged: 29 commits, Sep 20, 2022
Changes from 19 commits

Commits
2fb6948
clear cache
katie-lamb Jul 26, 2022
a9e1cf1
actually clear cache
katie-lamb Jul 29, 2022
870b45f
Merge branch 'dev' into rmi-ci-fixes
katie-lamb Aug 14, 2022
e766d8b
change ppl dtypes in module
katie-lamb Aug 25, 2022
81ea97d
Merge branch 'dev' into rmi-ci-fixes
katie-lamb Aug 25, 2022
6780779
Merge branch 'dev' into rmi-ci-fixes
katie-lamb Aug 30, 2022
eac2521
take out some category fields from metadata
katie-lamb Sep 7, 2022
bc5257a
take out old docstring
katie-lamb Sep 7, 2022
049ee55
Merge branch 'dev' into rmi-ci-fixes
katie-lamb Sep 7, 2022
083f491
added ppl to resources metadata
katie-lamb Sep 8, 2022
aa67d3b
took out format df so not all fields must be included
katie-lamb Sep 8, 2022
ba47985
Merge branch 'dev' into rmi-ci-fixes
katie-lamb Sep 8, 2022
f614860
updated release notes
katie-lamb Sep 9, 2022
5262aa2
took out all category types and put in enum
katie-lamb Sep 12, 2022
e50bbd3
added all fields to field metadata
katie-lamb Sep 12, 2022
a654c60
changed to format df
katie-lamb Sep 12, 2022
61516db
updated release notes
katie-lamb Sep 12, 2022
3d3483d
oops take out category from Field
katie-lamb Sep 12, 2022
caf44b1
take out BU
katie-lamb Sep 13, 2022
20fe508
Merge branch 'dev' into rmi-ci-fixes
katie-lamb Sep 14, 2022
08778c8
check values are in categories
katie-lamb Sep 16, 2022
c625623
update release notes to include pr ref
katie-lamb Sep 16, 2022
6fb269f
update to plant name eia, clean up enums, fill out resource metadata
katie-lamb Sep 18, 2022
b9197e0
changed ownership to onwership_record_type
katie-lamb Sep 18, 2022
6db9a3c
udpate release notes
katie-lamb Sep 18, 2022
cef65be
change enum constraints to sets
katie-lamb Sep 18, 2022
19cd02f
Revert "change enum constraints to sets"
katie-lamb Sep 18, 2022
4cc67c1
take out plant parts ordered replace with ordered dict
katie-lamb Sep 18, 2022
eb32d7b
fix typo in release notes
katie-lamb Sep 20, 2022
5 changes: 5 additions & 0 deletions docs/release_notes.rst
@@ -116,11 +116,16 @@ Plant Parts List Module Changes
:mod:`pudl.analysis.mcoe.DEFAULT_GENS_COLS`. If additional columns that are not part
of the default list are needed from the EIA 860 generators table, these columns can be
passed in with the ``gens_cols`` argument. See :pr:`1550`
* For greater memory efficiency, appropriate columns are now cast to string and
categorical types when the full plant parts list is created. The resource and field
metadata is now included in the PUDL metadata.
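
The memory saving described in this release note can be illustrated with a small sketch. It is not the PUDL implementation: the column values are a made-up stand-in for a low-cardinality column such as `plant_part`, and only standard pandas calls (`astype`, `memory_usage`) are used.

```python
import pandas as pd

# Hypothetical stand-in for a low-cardinality plant parts column.
values = ["plant", "plant_unit", "plant_gen"] * 10_000

as_object = pd.Series(values, dtype="object")
as_category = as_object.astype("category")

# deep=True counts the Python string objects, not just the pointers to them.
object_bytes = int(as_object.memory_usage(deep=True))
category_bytes = int(as_category.memory_usage(deep=True))

print(f"object: {object_bytes} bytes, category: {category_bytes} bytes")
```

A categorical column stores each distinct string once plus a small integer code per row, so the saving grows with the number of repeated values.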

Metadata
^^^^^^^^
* Used the data source metadata class added in release 0.6.0 to dynamically generate
the data source documentation (See :doc:`data_sources/index`). :pr:`1532`
* The EIA plant parts list was added to the resource and field metadata. This is the
first output table to be included in the metadata.

Bug Fixes
^^^^^^^^^
6 changes: 6 additions & 0 deletions src/pudl/analysis/plant_parts_eia.py
@@ -188,6 +188,7 @@
import pandas as pd

import pudl
from pudl.metadata.classes import Resource

logger = logging.getLogger(__name__)

@@ -649,6 +650,9 @@ def execute(self, gens_mega):

"""
# aggregate everything by each plant part
df_keys = list(self.pudl_out._dfs.keys())
for k in df_keys:
del self.pudl_out._dfs[k]
part_dfs = []
for part_name in PLANT_PARTS_ORDERED:
part_df = PlantPart(part_name).execute(gens_mega)
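
The cache-clearing loop added in this hunk copies the keys into a list before deleting. A minimal sketch of why that copy matters, using a plain dict as a hypothetical stand-in for `self.pudl_out._dfs` (the function names below are illustrative, not part of PUDL):

```python
def clear_by_key_snapshot(cache: dict) -> None:
    """Delete every entry, iterating over a copy of the keys."""
    for key in list(cache.keys()):
        del cache[key]


def delete_while_iterating_fails(cache: dict) -> bool:
    """Return True if deleting during live iteration raises RuntimeError."""
    try:
        for key in cache:
            del cache[key]
    except RuntimeError:
        return True
    return False


# Hypothetical stand-in for the cached output dataframes.
cache = {"gens_eia860": [1, 2], "own_eia860": [3], "mcoe": [4, 5]}
clear_by_key_snapshot(cache)
print(cache)

other = {"a": 1, "b": 2}
print(delete_while_iterating_fails(other))
```

For a plain dict, `cache.clear()` would have the same effect as the snapshot loop. Whether `_dfs` is a plain dict is not shown in this diff, so that is an assumption.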
@@ -686,7 +690,9 @@ def execute(self, gens_mega):
self.add_additonal_cols(plant_parts_eia)
.pipe(pudl.helpers.organize_cols, FIRST_COLS)
.pipe(self._clean_plant_parts)
.pipe(Resource.from_id("plant_parts_eia").format_df)
)
self.plant_parts_eia.index = self.plant_parts_eia.index.astype("string")
self.validate_ownership_for_owned_records(self.plant_parts_eia)
validate_run_aggregations(self.plant_parts_eia, gens_mega)
return self.plant_parts_eia
4 changes: 3 additions & 1 deletion src/pudl/helpers.py
@@ -1063,7 +1063,9 @@ def merge_dicts(list_of_dicts):


def convert_cols_dtypes(
- df: pd.DataFrame, data_source: str | None = None, name: str | None = None
+ df: pd.DataFrame,
+ data_source: str | None = None,
+ name: str | None = None,
) -> pd.DataFrame:
"""Convert a PUDL dataframe's columns to the correct data type.

8 changes: 7 additions & 1 deletion src/pudl/metadata/classes.py
@@ -576,7 +576,13 @@ class Field(Base):

name: SnakeCase
type: Literal[ # noqa: A003
- "string", "number", "integer", "boolean", "date", "datetime", "year"
+ "string",
+ "number",
+ "integer",
+ "boolean",
+ "date",
+ "datetime",
+ "year",
]
format: Literal["default"] = "default" # noqa: A003
description: String = None
32 changes: 32 additions & 0 deletions src/pudl/metadata/enums.py
@@ -213,3 +213,35 @@
"Unknown Code", # Should be replaced with NA
]
"""Valid emissions measurement codes for the EPA CEMS hourly data."""

TECH_DESCRIPTIONS: list[str] = [
Review comment (Member):
If the ordering isn't informational and duplicates should be prohibited, it might make sense for this to be a set rather than a list. I guess this goes for all of the enums. I wonder if there's some reason that doesn't work.

Reply (Member Author):
I tried converting all the list enums to sets, which broke a bunch of stuff because some of the enums are concatenated with lists. I just updated the enums I added (TECH_DESCRIPTIONS and PLANT_PARTS) to be sets.

"Conventional Hydroelectric",
"Conventional Steam Coal",
"Natural Gas Steam Turbine",
"Natural Gas Fired Combustion Turbine",
"Natural Gas Internal Combustion Engine",
"Nuclear",
"Natural Gas Fired Combined Cycle",
"Petroleum Liquids",
"Hydroelectric Pumped Storage",
"Solar Photovoltaic",
"Batteries",
"Geothermal",
"Municipal Solid Waste",
"Wood/Wood Waste Biomass",
"Onshore Wind Turbine",
"Coal Integrated Gasification Combined Cycle",
"Other Gases",
"Landfill Gas",
"All Other",
"Other Waste Biomass",
"Petroleum Coke",
"Solar Thermal without Energy Storage",
"Solar Thermal with Energy Storage",
"Other Natural Gas",
"Flywheels",
"Offshore Wind Turbine",
"Natural Gas with Compressed Air Storage",
"Hydrokinetic",
]
"""Valid technology descriptions from the EIA plant parts list."""
136 changes: 127 additions & 9 deletions src/pudl/metadata/fields.py
@@ -18,6 +18,7 @@
REVENUE_CLASSES,
RTO_CLASSES,
TECH_CLASSES,
TECH_DESCRIPTIONS,
US_STATES_TERRITORIES,
)
from pudl.metadata.labels import (
@@ -53,6 +54,26 @@
},
"annual_indirect_program_cost": {"type": "number", "unit": "USD"},
"annual_total_cost": {"type": "number", "unit": "USD"},
"appro_part_label": {
"type": "string",
"description": "Plant part of the associated true granularity record.",
"constraints": {
"enum": [
"plant",
"plant_unit",
"plant_prime_mover",
"plant_technology",
"plant_prime_fuel",
"plant_ferc_acct",
"plant_operating_year",
"plant_gen",
Review comment (Member):
Is this list of values stored somewhere else independently, that it can be referenced in case it were to change? I think the home definition should probably be one of:

pudl.analysis.plant_parts_eia.PLANT_PARTS_ORDERED
pudl.analysis.plant_parts_eia.PLANT_PARTS.keys()

(side note: if PLANT_PARTS were an OrderedDict I think we could get rid of PLANT_PARTS_ORDERED and just use PLANT_PARTS.keys() directly)

Reply (Member Author):
I got a circular import error when I tried to import pudl.analysis.plant_parts_eia.PLANT_PARTS into pudl.metadata.fields.py or pudl.metadata.enums.py. I added PLANT_PARTS as an enum to at least take out the duplication of this list in fields.py.

Reply (Member Author):
I changed the PLANT_PARTS dictionary to an OrderedDict and removed PLANT_PARTS_ORDERED.

]
},
},
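
The review thread on this field suggests that an ordered mapping could replace a separate ordered list of part names. A minimal sketch of that idea, assuming an abbreviated, hypothetical `PLANT_PARTS` whose values are not taken from this diff:

```python
from collections import OrderedDict

# Abbreviated, hypothetical stand-in for PLANT_PARTS; the "id_cols" values
# are illustrative and not taken from this diff.
PLANT_PARTS = OrderedDict(
    [
        ("plant", {"id_cols": ["plant_id_eia"]}),
        ("plant_unit", {"id_cols": ["plant_id_eia", "unit_id_pudl"]}),
        ("plant_gen", {"id_cols": ["plant_id_eia", "generator_id"]}),
    ]
)

# Keys come back in insertion order, so no separate ordered list is needed.
part_names = list(PLANT_PARTS.keys())
print(part_names)
```

Plain dicts have preserved insertion order since Python 3.7. `OrderedDict` makes the reliance on order explicit, and comparisons between two `OrderedDict` objects are order-sensitive.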
"appro_record_id_eia": {
"type": "string",
"description": "EIA record ID of the associated true granularity record.",
},
"ash_content_pct": {
"type": "number",
"description": "Ash content percentage by weight to the nearest 0.1 percent.",
@@ -105,6 +126,13 @@
"description": "Monthly average billing demand (for requirements purchases, and any transactions involving demand charges). In megawatts.",
"unit": "MW",
},
"boiler_generator_assn_type_code": {
"type": "string",
"description": (
"Indicates whether boiler associations with generator during the year were "
"actual or theoretical. Only available before 2013."
),
},
"boiler_id": {
"type": "string",
"description": "Alphanumeric boiler ID.",
@@ -126,6 +154,15 @@
"unit": "min",
},
"caidi_wo_major_event_days_minutes": {"type": "number", "unit": "min"},
"capacity_eoy_mw": {
"type": "number",
"description": "Total end of year installed (nameplate) capacity for a plant part, in megawatts.",
"unit": "MW",
},
"capacity_factor": {
"type": "number",
"description": "Fraction of potential generation that was actually reported for a plant part.",
},
"capacity_mw": {
"type": "number",
"description": "Total installed (nameplate) capacity, in megawatts.",
@@ -535,6 +572,13 @@
"type": "string",
"description": "Account number, from FERC's Uniform System of Accounts for Electric Plant. Also includes higher level labeled categories.",
},
"ferc_acct_name": {
"type": "string",
"description": "Name of FERC account, derived from technology description and prime mover code.",
"constraints": {
"enum": ["other", "hydro", "steam", "nuclear", "Other", "Steam"]
},
},
"ferc_cogen_docket_no": {
"type": "string",
"description": "The docket number relating to the FERC cogenerator status. See FERC Form 556.",
@@ -609,6 +653,11 @@
"description": "Average fuel cost per mmBTU of heat content in nominal USD.",
"unit": "USD_per_MMBtu",
},
"fuel_cost_per_mwh": {
"type": "number",
"description": "Derived from MCOE, a unit level value. Average fuel cost per MWh of heat content in nominal USD.",
"unit": "USD_per_MWh",
},
"fuel_cost_per_unit_burned": {
"type": "number",
"description": "Average cost of fuel consumed in the report year per reported fuel unit (USD).",
@@ -749,13 +798,6 @@
"description": "General Plant Total (FERC Accounts 389-399.1).",
},
"generation_activity": {"type": "boolean"},
- "boiler_generator_assn_type_code": {
- "type": "string",
- "description": (
- "Indicates whether boiler associations with generator during the year were "
- "actual or theoretical. Only available before 2013."
- ),
- },
"generator_id": {
"type": "string",
"description": (
@@ -791,6 +833,11 @@
"description": "The energy contained in fuel burned, measured in million BTU.",
"unit": "MMBtu",
},
"heat_rate_mmbtu_mwh": {
"type": "number",
"description": "Fuel content per unit of electricity generated. Coming from MCOE calculation.",
"unit": "MMBtu_MWh",
},
"highest_distribution_voltage_kv": {"type": "number", "unit": "kV"},
"home_area_network": {"type": "integer"},
"hydro_acct330_land": {
@@ -1126,6 +1173,10 @@
"description": "Length of time interval measured.",
"unit": "hr",
},
"operating_year": {
"type": "integer",
"description": "Year a generator went into service.",
},
"operational_status": {
"type": "string",
"description": "The operating status of the generator. This is based on which tab the generator was listed in in EIA 860.",
@@ -1134,6 +1185,11 @@
"type": "string",
"description": "The operating status of the generator.",
},
"operational_status_pudl": {
"type": "string",
"description": "The operating status of the generator using PUDL categories.",
"constraints": {"enum": ["operating", "retired", "proposed"]},
},
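
The enum constraints added in this diff, such as the one on `operational_status_pudl` above, list the values a column may take. A minimal sketch of checking values against such a list, assuming a hypothetical helper `values_outside_enum` that is not part of PUDL:

```python
from collections.abc import Iterable

# Enum taken from the operational_status_pudl field definition above.
OPERATIONAL_STATUS_PUDL = ["operating", "retired", "proposed"]


def values_outside_enum(values: Iterable, allowed: Iterable) -> list:
    """Return the sorted distinct non-null values that are not allowed."""
    allowed_set = set(allowed)
    return sorted({v for v in values if v is not None and v not in allowed_set})


# Hypothetical column values; None stands in for a missing value.
column = ["operating", "retired", None, "Operating", "proposed"]
bad_values = values_outside_enum(column, OPERATIONAL_STATUS_PUDL)
print(bad_values)
```

The check is case-sensitive, so a value such as "Operating" is reported even though "operating" is allowed.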
"opex_allowances": {"type": "number", "description": "Allowances.", "unit": "USD"},
"opex_boiler": {
"type": "number",
@@ -1357,10 +1413,19 @@
"pattern": r"^\d{5}$",
},
},
"ownership": {
"type": "string",
"description": "Whether each generator record is for one owner or represents a total of all ownerships.",
"constraints": {"enum": ["owned", "total"]},
},
Review comment (Member):
Having very generic column names like this in the global namespace can get confusing. Could it be renamed to be more descriptive / clear about its usage context? Maybe something like ownership_record_type ?

Reply (Member Author):
Changed to ownership_record_type.

"ownership_code": {
"type": "string",
"description": "Identifies the ownership for each generator.",
},
"ownership_dupe": {
"type": "boolean",
"description": "Whether a plant part record has a duplicate record with different ownership status.",
},
"peak_demand_mw": {
"type": "number",
"unit": "MW",
@@ -1470,19 +1535,47 @@
"type": "integer",
"description": "A manually assigned PUDL plant ID. May not be constant over time.",
},
"plant_id_report_year": {
"type": "string",
"description": "PUDL plant ID and report year of the record.",
},
"plant_name_clean": {
"type": "string",
"description": "A semi-manually cleaned version of the freeform FERC 1 plant name.",
},
"plant_name_eia": {"type": "string", "description": "Plant name."},
"plant_name_ferc1": {
"type": "string",
"description": "Name of the plant, as reported to FERC. This is a freeform string, not guaranteed to be consistent across references to the same plant.",
},
- "plant_name_clean": {
+ "plant_name_new": {
"type": "string",
- "description": "A semi-manually cleaned version of the freeform FERC 1 plant name.",
+ "description": "Derived plant name that includes EIA plant name and other strings associated with ID and PK columns of the plant part.",
},
"plant_name_pudl": {
"type": "string",
"description": "Plant name, chosen arbitrarily from the several possible plant names available in the plant matching process. Included for human readability only.",
},
"plant_part": {
"type": "string",
"description": "The part of the plant a record corresponds to.",
"constraints": {
"enum": [
"plant",
Review comment (Member):
As with appro_part_label above it would be better if we could refer to the single source of truth for this list directly.

Reply (Member Author, katie-lamb, Sep 18, 2022):
See my comment about circular import error.

"plant_unit",
"plant_prime_mover",
"plant_technology",
"plant_prime_fuel",
"plant_ferc_acct",
"plant_operating_year",
"plant_gen",
]
},
},
"plant_part_id_eia": {
"type": "string",
"description": "Contains EIA plant ID, plant part, ownership, and EIA utility id",
},
"plant_type": {
"type": "string"
# TODO Disambiguate column name and apply table specific ENUM constraints. There
@@ -1559,6 +1652,10 @@
"description": "Gross megawatt-hours received in power exchanges and used as the basis for settlement.",
"unit": "MWh",
},
"record_count": {
"type": "integer",
"description": "Number of distinct generator IDs that participated in the aggregation for a plant part list record.",
},
"record_id": {
"type": "string",
"description": "Identifier indicating original FERC Form 1 source record. format: {table_name}_{report_year}_{report_prd}_{respondent_id}_{spplmnt_num}_{row_number}. Unique within FERC Form 1 DB tables which are not row-mapped.", # noqa: FS003
@@ -1864,7 +1961,15 @@
},
"total_disposition_mwh": {"type": "number", "unit": "MWh"},
"total_energy_losses_mwh": {"type": "number", "unit": "MWh"},
"total_fuel_cost": {
"type": "number",
"description": "Total annual reported fuel costs for the plant part. Includes costs from all fuels.",
},
"total_meters": {"type": "integer", "unit": "m"},
"total_mmbtu": {
"type": "number",
"description": "Total annual heat content of fuel consumed by a plant part record in the plant parts list.",
},
"total_settlement": {
"type": "number",
"description": "Sum of demand, energy, and other charges (USD). For power exchanges, the settlement amount for the net receipt of energy. If more energy was delivered than received, this amount is negative.",
@@ -1931,6 +2036,10 @@
"type": "number",
"description": "Total Transmission Plant (FERC Accounts 350-359.1)",
},
"true_gran": {
"type": "boolean",
"description": "Indicates whether a plant part list record is associated with the highest priority plant part for all identical records.",
},
"turbines_inverters_hydrokinetics": {
"type": "integer",
"description": "Number of wind turbines, or hydrokinetic buoys.",
@@ -2138,6 +2247,15 @@
},
}
},
"plant_parts_eia": {
"energy_source_code_1": {
Review comment (Member Author):
The energy_source_code_1 and prime_movers_eia resource override enum constraints could maybe just be moved to the general entry in FIELD_METADATA

Reply (Member):
In the database tables these are both constrained by virtue of their FK relationships to the energy_sources_eia and prime_movers_eia coding tables, so adding the enums to the field definitions would be duplicative. But for the purpose of imposing those constraints on this free floating output table it seems like the ENUM constraint makes sense.

Reply (Member Author):
Ah right, makes sense.

"constraints": {"enum": set(CODE_METADATA["energy_sources_eia"]["df"].code)}
},
"prime_movers_eia": {
"constraints": {"enum": set(CODE_METADATA["prime_movers_eia"]["df"].code)}
},
"technology_description": {"constraints": {"enum": TECH_DESCRIPTIONS}},
},
}
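
The override above mixes set-valued and list-valued enum constraints, and the review thread on `TECH_DESCRIPTIONS` reports that converting list enums to sets broke code that concatenates them with lists. A minimal sketch of that failure mode, using an abbreviated copy of the enum and a hypothetical extra value:

```python
# Abbreviated copy of TECH_DESCRIPTIONS; "Unknown" is a hypothetical extra value.
TECH_DESCRIPTIONS = ["Nuclear", "Geothermal", "Batteries"]
extra = ["Unknown"]

# A list concatenates with another list.
combined = TECH_DESCRIPTIONS + extra

# A set does not support "+", which is how list-based callers break.
try:
    set(TECH_DESCRIPTIONS) + extra
    set_concat_failed = False
except TypeError:
    set_concat_failed = True

# Keeping the list but checking for duplicates gets part of the benefit.
has_duplicates = len(TECH_DESCRIPTIONS) != len(set(TECH_DESCRIPTIONS))

# Unpacking works for lists and sets alike (set order is arbitrary).
merged = sorted([*set(TECH_DESCRIPTIONS), *extra])
print(combined, set_concat_failed, has_duplicates, merged)
```

Which approach suits PUDL depends on how each enum is consumed elsewhere in the codebase, which this diff does not show.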

