Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make CEMS extraction handle new listed year_quarter partitions #3187

Merged
merged 30 commits into from
Jan 8, 2024

Conversation

e-belfer
Copy link
Member

@e-belfer e-belfer commented Dec 22, 2023

Overview

Closes #3185.

What problem does this address?

We've changed the CEMS, PHMSA and 860M partitions to use lists instead of values, in order to accommodate multi-file zips and to meet Zenodo's new file count restrictions for the API. This necessitates changes for both datastore and the extraction of CEMS data in order for us to be able to successfully read in new data.

What did you change?

  • Update CEMS and 860M doi
  • Update datastore's _match method to handle both strings/integers and lists of strings/integers
  • Updates the get_partitions() method to handle lists for the docs build
  • Change file name format expected in pudl.extract.epacems

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list

@e-belfer e-belfer added wontfix and removed wontfix labels Dec 22, 2023
Base automatically changed from dev to main January 5, 2024 04:14
Copy link

codecov bot commented Jan 5, 2024

Codecov Report

Attention: 31 lines in your changes are missing coverage. Please review.

Comparison is base (b8fa2b5) 92.6% compared to head (fdffa26) 92.6%.
Report is 141 commits behind head on main.

Files Patch % Lines
...rc/pudl/analysis/record_linkage/link_cross_year.py 88.7% 12 Missing ⚠️
src/pudl/analysis/record_linkage/name_cleaner.py 85.9% 12 Missing ⚠️
test/integration/record_linkage_test.py 94.4% 4 Missing ⚠️
...l/analysis/record_linkage/classify_plants_ferc1.py 92.9% 2 Missing ⚠️
...rc/pudl/analysis/record_linkage/embed_dataframe.py 98.6% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #3187     +/-   ##
=======================================
- Coverage   92.6%   92.6%   -0.0%     
=======================================
  Files        134     140      +6     
  Lines      12611   12861    +250     
=======================================
+ Hits       11682   11912    +230     
- Misses       929     949     +20     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@zaneselvans zaneselvans added datastore Managing the acquisition and organization of external raw data. epacems Integration and analysis of the EPA CEMS dataset. zenodo Issues having to do with Zenodo data archiving and retrieval. performance Make PUDL run faster! labels Jan 5, 2024
@zaneselvans zaneselvans marked this pull request as draft January 5, 2024 16:23
"Facility ID": pd.Int16Dtype(), # unique facility id for internal EPA database management (ORIS code)
"Facility ID": pd.Int32Dtype(), # unique facility id for internal EPA database management (ORIS code)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

16 bit integers aren't actually large enough to hold these IDs, so some of them were wrapping around and becoming negative numbers.

Comment on lines +112 to +114
"Gross Load (MW)": pd.Float32Dtype(),
"Steam Load (1000 lb/hr)": pd.Float32Dtype(),
"SO2 Mass (lbs)": pd.Float32Dtype(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if switching these to 32bit floats was necessary.

low_memory=False,
).rename(columns=rename_dict)
chunksize=chunksize,
low_memory=True,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what it does on the inside, but low_memory=True tries to be more memory efficient, and chunksize reads in batches of records and processes them one at a time, returning an iterator of dataframes (one per chunk) rather than a single dataframe.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless, use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser)."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If all of the CategoricalDtype() columns are well-behaved and we actually do know what the values will be ahead of time, probably the right thing to do in here is define the categories ahead of time, and then this chunking will use low-memory dtypes that can also be concatenated without being objectified.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would make sense, but to be consistent with how we're handling other datasets I'd probably want to map all the column dtypes in fields.py and codes.py, as there are currently no column by column type enforcements on EPA CEMS data in the same way that we have for EIA/FERC data. This seems out of scope of this current PR and I'm tempted to move it into a separate issue and handle it there.

Comment on lines +217 to +219
df = pd.concat(chunk_iter)
dtypes = {k: v for k, v in dtype_dict.items() if k in df.columns}
return df.astype(dtypes).rename(columns=rename_dict)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This apparently worked, but I'm not sure why it worked. I didn't expect it to for several reasons:

  • The CSV is in a zipfile, and we're asking pandas to read directly from the zipfile. Does that mean the entire zipfile has to be decompressed before you can get at any of the data? Or can you just read the first 100,000 rows of a zipfile (seems unlikely).
  • When you concatenate together dataframes with automatically generated categorical columns, the categoricals become objects, because each dataframe is using a different dictionary mapping for the categories internally, so there's no peak memory savings here from using categorical columns. You could squeeze that savings out (and I thought we would have to) by explicitly stating the categories in the CategoricalDtype() ensuring that all the dataframes have the same categories, or by iteratively concatenating the many per-chunk dataframes together one at a time, and explicitly converting the categorical in both the new and already concatenated dataframes dynamically to match, using union categoricals

assert ratio_correct > 0.85, "Percent of correctly matched FERC records below 85%."
assert ratio_correct > 0.80, "Percent of correctly matched FERC records below 80%."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea what changed here to break this test and was feeling a little desperate. It was juuuuuust under 85% But also we had already reduced the threshold form 95% in the FERC-FERC PR merge. I thought it might have been stochastic variation, but saw that @zschira was already using a fixed seed for the random number generator so I was very confused.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure why this is behaving non-deterministically, although there's a few different types of randomness being used, so I'm guessing there's something weird going on there. I think with some slightly better tuning though, we can get the ratio high enough that slight variations won't bring this below the 85%, so I'll see if I can put together a PR to improve that. For the time being, however, I think lowering this threshold so it doesn't cause CI runs to fail is a good patch.

@e-belfer e-belfer marked this pull request as ready for review January 8, 2024 16:03
Copy link
Member

@cmgosnell cmgosnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it would be good to use the full categories on the initial csv read but also agree that that feels oos for this issue/it'd be pretty different as compared to what we do for other encoder categories. But it would be good to make a translation between the existing encoder infrastructure and the initial extraction! I suspect not a lot of other datasets will be that clean to enable this type of upfront category setting though.

Nonetheless, all of these changes look good and i'm glad this pushes this new cems archive over the finish line!

@e-belfer e-belfer enabled auto-merge (squash) January 8, 2024 21:25
@e-belfer e-belfer merged commit 0525092 into main Jan 8, 2024
14 checks passed
@e-belfer e-belfer deleted the cems-extraction branch January 8, 2024 23:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datastore Managing the acquisition and organization of external raw data. epacems Integration and analysis of the EPA CEMS dataset. performance Make PUDL run faster! zenodo Issues having to do with Zenodo data archiving and retrieval.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

CEMS: Retool extraction to handle listed partitions
4 participants