-
Notifications
You must be signed in to change notification settings - Fork 147
Make the automatic pooled=>string column promotion more efficient #676
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #676 +/- ##
==========================================
+ Coverage 84.29% 86.41% +2.11%
==========================================
Files 10 10
Lines 1757 1619 -138
==========================================
- Hits 1481 1399 -82
+ Misses 276 220 -56
Continue to review full report at Codecov.
|
|
Thank you! (it is hard do a proper review of this change for me) |
|
I can provide a little more explanation on the changes here for those interested.
The one thing I'm still noodling on here is if there's a way we can get multithreaded parsing to "sync" their 100-row checks. Currently, each thread will parse 100 rows and do sampling check independently; obviously it'd be better to take 100 rows from each thread and do an overall unique % check. Regardless, the approach in this PR wastes much less memory sampling values from string columns, then does a single sampled unique % test on whether a column should be pooled or not. |
|
This is a very nice approach. Thank you for it. Two thoughts:
|
Make the automatic pooled=>string column promotion more efficient
cc: @bkamins, this seems to drastically improve the sample generated file you shared on slack the other day.