Make CEMS extraction handle new listed year_quarter partitions #3187
Conversation
it seems to all just work, which is très fun, but makes sense after looking at it
update the 860m doi
Codecov Report

Additional details and impacted files

@@           Coverage Diff           @@
##            main   #3187     +/-   ##
========================================
- Coverage    92.6%   92.6%    -0.0%
========================================
  Files         134     140       +6
  Lines       12611   12861     +250
========================================
+ Hits        11682   11912     +230
- Misses        929     949      +20

☔ View full report in Codecov by Sentry.
src/pudl/extract/epacems.py (outdated)
"Facility ID": pd.Int16Dtype(), # unique facility id for internal EPA database management (ORIS code) | ||
"Facility ID": pd.Int32Dtype(), # unique facility id for internal EPA database management (ORIS code) |
16-bit integers aren't actually large enough to hold these IDs, so some of them were wrapping around and becoming negative numbers.
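As a quick illustration (not from the PR itself) of why the narrower dtype causes trouble, here is the wrap-around behavior with made-up ID values; exact behavior may vary across pandas/numpy versions:

```python
import numpy as np
import pandas as pd

# Made-up facility IDs; anything above 32767 can't be represented in 16 bits.
ids = pd.Series([3, 55000, 60000], dtype="int64")

print(ids.astype(np.int16).tolist())         # [3, -10536, -5536] -- silent wrap-around
print(ids.astype(pd.Int32Dtype()).tolist())  # [3, 55000, 60000]  -- fits comfortably
```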
"Gross Load (MW)": pd.Float32Dtype(), | ||
"Steam Load (1000 lb/hr)": pd.Float32Dtype(), | ||
"SO2 Mass (lbs)": pd.Float32Dtype(), |
Not sure if switching these to 32-bit floats was necessary.
- low_memory=False,
- ).rename(columns=rename_dict)
+ chunksize=chunksize,
+ low_memory=True,
I don't know what it does on the inside, but low_memory=True tries to be more memory efficient, and chunksize reads in batches of records and processes them one at a time, returning an iterator of dataframes (one per chunk) rather than a single dataframe.
"Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless, use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser)."
If all of the CategoricalDtype() columns are well-behaved and we actually do know what the values will be ahead of time, probably the right thing to do in here is define the categories ahead of time, and then this chunking will use low-memory dtypes that can also be concatenated without being object-ified.
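A small sketch of that idea, with made-up column and category values: if every chunk is read with the same explicit CategoricalDtype, the concatenated result stays categorical.

```python
import pandas as pd

# Hypothetical column and category values, just to show the mechanics.
state_dtype = pd.CategoricalDtype(categories=["CO", "TX", "WY"])

chunk_a = pd.DataFrame({"State": pd.Series(["CO", "TX"], dtype=state_dtype)})
chunk_b = pd.DataFrame({"State": pd.Series(["WY", "CO"], dtype=state_dtype)})

combined = pd.concat([chunk_a, chunk_b], ignore_index=True)
print(combined["State"].dtype)  # category -- not object, since the dtypes match
```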
That would make sense, but to be consistent with how we're handling other datasets I'd probably want to map all the column dtypes in fields.py and codes.py, as there are currently no column-by-column type enforcements on EPA CEMS data in the same way that we have for EIA/FERC data. This seems out of scope for this PR, so I'm tempted to move it into a separate issue and handle it there.
df = pd.concat(chunk_iter)
dtypes = {k: v for k, v in dtype_dict.items() if k in df.columns}
return df.astype(dtypes).rename(columns=rename_dict)
This apparently worked, but I'm not sure why it worked. I didn't expect it to for several reasons:
- The CSV is in a zipfile, and we're asking pandas to read directly from the zipfile. Does that mean the entire zipfile has to be decompressed before you can get at any of the data? Or can you just read the first 100,000 rows of a zipfile (which seems unlikely)?
- When you concatenate dataframes with automatically generated categorical columns, the categoricals become objects, because each dataframe uses a different internal dictionary mapping for its categories, so there's no peak memory savings here from using categorical columns. You could squeeze that savings out (and I thought we would have to) either by explicitly stating the categories in the CategoricalDtype(), ensuring that all the dataframes have the same categories, or by iteratively concatenating the many per-chunk dataframes one at a time, explicitly converting the categoricals in both the new and already-concatenated dataframes to match using union_categoricals (see the sketch below).
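A rough sketch of the union_categoricals approach described above; the column name and values are made up:

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.DataFrame({"unit_id": pd.Categorical(["1", "2"])})
b = pd.DataFrame({"unit_id": pd.Categorical(["2", "3"])})

# Align both frames on the union of their categories before concatenating,
# so the result keeps a categorical dtype instead of falling back to object.
shared = union_categoricals([a["unit_id"], b["unit_id"]]).categories
a["unit_id"] = a["unit_id"].cat.set_categories(shared)
b["unit_id"] = b["unit_id"].cat.set_categories(shared)

combined = pd.concat([a, b], ignore_index=True)
print(combined["unit_id"].dtype)  # category
```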
- assert ratio_correct > 0.85, "Percent of correctly matched FERC records below 85%."
+ assert ratio_correct > 0.80, "Percent of correctly matched FERC records below 80%."
I have no idea what changed here to break this test and was feeling a little desperate. It was juuuuuust under 85%. But we had already reduced the threshold from 95% in the FERC-FERC PR merge. I thought it might have been stochastic variation, but saw that @zschira was already using a fixed seed for the random number generator, so I was very confused.
I'm not 100% sure why this is behaving non-deterministically, although there are a few different types of randomness being used, so I'm guessing something weird is going on there. I think with some slightly better tuning, though, we can get the ratio high enough that slight variations won't bring it below 85%, so I'll see if I can put together a PR to improve that. For the time being, however, I think lowering this threshold so it doesn't cause CI runs to fail is a good patch.
I agree that it would be good to use the full categories on the initial CSV read, but I also agree that it feels out of scope for this issue, and it'd be pretty different from what we do for other encoder categories. It would be good, though, to make a translation between the existing encoder infrastructure and the initial extraction! I suspect not a lot of other datasets will be clean enough to enable this type of upfront category setting.
Nonetheless, all of these changes look good and I'm glad this pushes the new CEMS archive over the finish line!
Overview
Closes #3185.
What problem does this address?
We've changed the CEMS, PHMSA, and 860M partitions to use lists instead of single values, in order to accommodate multi-file zips and to meet Zenodo's new file-count restrictions for the API. This requires changes to both the datastore and the CEMS extraction so that we can successfully read in the new data.
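To illustrate the shape change (values here are hypothetical, not the actual archive metadata): a partition that used to be a single value is now a list of values covering every file in a multi-file zip.

```python
# Hypothetical partition metadata, before and after the change.
old_partition = {"year_quarter": "2021q1"}
new_partition = {"year_quarter": ["2021q1", "2021q2", "2021q3", "2021q4"]}
```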
What did you change?
- datastore's _match method, to handle both strings/integers and lists of strings/integers (a rough sketch follows this list)
- the get_partitions() method, to handle lists for the docs build
- pudl.extract.epacems
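A hedged sketch (not the actual datastore code) of the kind of check the _match method now has to support, where an archived partition value may be either a scalar or a list:

```python
def partition_matches(archived_value, requested_value) -> bool:
    """Return True if the requested partition value is covered by the archive.

    archived_value may be a single string/integer or a list of them.
    """
    if isinstance(archived_value, list):
        return requested_value in archived_value
    return archived_value == requested_value


print(partition_matches("2021q1", "2021q1"))              # True
print(partition_matches(["2021q1", "2021q2"], "2021q2"))  # True
print(partition_matches(["2021q1", "2021q2"], "2020q4"))  # False
```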
Testing
How did you make sure this worked? How can a reviewer verify this?
To-do list