Make CEMS extraction handle new listed year_quarter partitions #3187

Changes from 27 commits
Not sure if switching these to 32-bit floats was necessary.
I don't know what it does on the inside, but `low_memory=True` tries to be more memory efficient, and `chunksize` reads in batches of records and processes them one at a time, returning an iterator of dataframes (one per chunk) rather than a single dataframe.
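A minimal sketch of the distinction, using a hypothetical file path, chunk size, and per-chunk processing step:

```python
import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames (one per chunk)
# instead of a single DataFrame, so only one chunk is in memory at a time.
# low_memory=True asks the parser to process the file in internal chunks
# while parsing (see the docs excerpt quoted below).
chunks = pd.read_csv("epacems_2022q1.csv", chunksize=100_000, low_memory=True)

for chunk in chunks:
    process(chunk)  # hypothetical per-chunk processing step
```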
"Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless, use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser)."
If all of the `CategoricalDtype()` columns are well-behaved and we actually do know what the values will be ahead of time, probably the right thing to do here is define the categories ahead of time, and then this chunking will use low-memory dtypes that can also be concatenated without being `object`-ified.
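For example, something along these lines (the column name and category values are invented for illustration):

```python
import pandas as pd

# With the categories defined up front, every chunk is parsed with an
# identical CategoricalDtype, so concatenation keeps the column categorical
# instead of falling back to object.
state_dtype = pd.CategoricalDtype(categories=["CO", "NM", "TX", "WY"])

chunks = pd.read_csv(
    "epacems_2022q1.csv",  # hypothetical path
    chunksize=100_000,
    dtype={"state": state_dtype},
)
df = pd.concat(chunks, ignore_index=True)
assert isinstance(df["state"].dtype, pd.CategoricalDtype)
```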
That would make sense, but to be consistent with how we're handling other datasets I'd probably want to map all the column dtypes in `fields.py` and `codes.py`, as there are currently no column-by-column type enforcements on EPA CEMS data in the same way that we have for EIA/FERC data. This seems out of scope for this current PR, and I'm tempted to move it into a separate issue and handle it there.
This apparently worked, but I'm not sure why it worked. I didn't expect it to, since I thought we would need either a `CategoricalDtype()` ensuring that all the dataframes have the same categories, or to iteratively concatenate the many per-chunk dataframes together one at a time, explicitly converting the categoricals in both the new and the already-concatenated dataframes to match, using union categoricals.
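A rough sketch of that union-categoricals fallback (the helper function is hypothetical, not code from this PR):

```python
import pandas as pd
from pandas.api.types import union_categoricals

def concat_with_union_categories(dfs: list[pd.DataFrame], col: str) -> pd.DataFrame:
    """Align one categorical column's categories across frames, then concat."""
    # Union of every category seen in any chunk...
    union = union_categoricals([df[col] for df in dfs])
    # ...then recode each chunk to that shared category set, so pd.concat
    # keeps the column categorical rather than upcasting it to object.
    for df in dfs:
        df[col] = df[col].cat.set_categories(union.categories)
    return pd.concat(dfs, ignore_index=True)
```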
I have no idea what changed here to break this test and was feeling a little desperate. It was juuuuuust under 85%. But also, we had already reduced the threshold from 95% in the FERC-FERC PR merge. I thought it might have been stochastic variation, but saw that @zschira was already using a fixed seed for the random number generator, so I was very confused.
I'm not 100% sure why this is behaving non-deterministically, although there are a few different types of randomness being used, so I'm guessing there's something weird going on there. I think with some slightly better tuning, though, we can get the ratio high enough that slight variations won't bring it below 85%, so I'll see if I can put together a PR to improve that. For the time being, however, I think lowering this threshold so it doesn't cause CI runs to fail is a good patch.
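As a hedged illustration of why one fixed seed may not cover everything (none of this is the actual record-linkage code), each RNG in play needs its own seeding:

```python
import random
import numpy as np

SEED = 42

random.seed(SEED)                  # Python's stdlib RNG
np.random.seed(SEED)               # numpy's legacy global RNG
rng = np.random.default_rng(SEED)  # numpy's newer Generator API

# A pipeline that mixes these sources, or pulls in a library with its own
# internal unseeded RNG, can still vary run to run even though one of the
# generators above has a fixed seed.
```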