Conversation

@quinnj
Member

@quinnj quinnj commented Jul 3, 2020

cc: @bkamins, this seems to drastically improve things on the sample generated file you shared on Slack the other day.

@codecov

codecov bot commented Jul 3, 2020

Codecov Report

Merging #676 into master will increase coverage by 2.11%.
The diff coverage is 97.43%.


@@            Coverage Diff             @@
##           master     #676      +/-   ##
==========================================
+ Coverage   84.29%   86.41%   +2.11%     
==========================================
  Files          10       10              
  Lines        1757     1619     -138     
==========================================
- Hits         1481     1399      -82     
+ Misses        276      220      -56     
Impacted Files Coverage Δ
src/utils.jl 89.61% <90.90%> (+8.11%) ⬆️
src/file.jl 96.33% <100.00%> (+0.13%) ⬆️
src/rows.jl 91.22% <100.00%> (-1.94%) ⬇️
src/tables.jl 0.00% <0.00%> (ø)
src/header.jl 95.79% <0.00%> (+0.83%) ⬆️
src/detection.jl 96.67% <0.00%> (+2.11%) ⬆️
src/write.jl 87.34% <0.00%> (+4.13%) ⬆️


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ed48ad2...88cf907.

@bkamins
Member

bkamins commented Jul 3, 2020

Thank you! (it is hard for me to do a proper review of this change)

@quinnj
Member Author

quinnj commented Jul 3, 2020

I can provide a little more explanation on the changes here for those interested.

  • The issue on current master is that we default to pool=0.1, which means that if a column has fewer than 10% unique values, we "pool" the column values and you get a PooledArray back.
  • In the current implementation, we start by assuming any string column should be pooled, which means allocating an entire Vector{UInt32} for the number of rows we estimate the file to have. Then, for each value parsed in the column, we hash it and do the "ref check": if we've seen the value before, we reuse the previously generated code; otherwise we add a new code for the new unique value. On each value, we also check whether the number of unique values "trips the trigger" of going over 10% unique values for the column. If so, we immediately promote the column to a plain Vector{String} (and never go back, regardless of whether the percentage later drops below 10%).
  • The problem with this approach is that for large files with lots of string columns, we start out by allocating lots of Vector{UInt32}s and may need to throw them all away when promoting to string columns.
  • At first, I thought we could just optimize the pooled column => string column promotion, but the memory waste quickly outweighs the cost of promoting, so I realized we needed a better way to check columns for "pooled-ness" without wasting so much memory.
  • The approach in this PR allocates only 100 rows for "maybe pooled" columns while we sample; once we've filled the 100 rows, we do a single check of whether the number of unique values exceeds the pool threshold. 100 rows was chosen because it's roughly the sample size implied by a 95% confidence level and around a 10% margin of error; it's rough statistics-wise, but I think it will work well in practice.
  • Once we do the 100-row check, we decide whether to promote the column to PooledArray permanently, or to promote it to string permanently.
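The sampling approach described above can be sketched as follows (an illustrative Python sketch of the logic only, not the actual Julia implementation inside CSV.jl; the function name `decide_pooling` is made up for illustration):

```python
# Illustrative sketch of the single 100-row pooling check described above.
# (Hypothetical Python; the real implementation is Julia inside CSV.jl.)

SAMPLE_SIZE = 100      # rows sampled before deciding
POOL_THRESHOLD = 0.1   # default pool=0.1: pool if <= 10% unique values

# Rough sample-size rationale: for a 95% confidence level (z ~ 1.96),
# worst-case proportion p = 0.5, and ~10% margin of error E, the usual
# formula n = z^2 * p * (1 - p) / E^2 gives ~96, rounded up to 100.

def decide_pooling(values, sample_size=SAMPLE_SIZE, threshold=POOL_THRESHOLD):
    """Do one unique-percentage check on the first `sample_size` values
    and permanently decide "pooled" vs "string" for the column."""
    sample = values[:sample_size]
    unique_ratio = len(set(sample)) / len(sample)
    return "pooled" if unique_ratio <= threshold else "string"

# A low-cardinality column pools; a high-cardinality one stays string:
print(decide_pooling(["a", "b"] * 200))              # -> pooled
print(decide_pooling([str(i) for i in range(200)]))  # -> string
```

The key difference from master is that only `sample_size` slots are ever allocated up front, and the unique-percentage check runs once rather than on every parsed value.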

The one thing I'm still noodling on here is whether there's a way to get multithreaded parsing to "sync" its 100-row checks. Currently, each thread parses 100 rows and does its sampling check independently; it would obviously be better to take 100 rows from each thread and do an overall unique-percentage check.
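One way the cross-thread sync could work is to merge each thread's 100-row sample into a single overall check before deciding (a hypothetical sketch of the idea being discussed, not something the PR implements):

```python
# Hypothetical sketch of the cross-thread "sync" idea: instead of each
# thread deciding from its own 100-row sample, combine the per-thread
# samples into one overall unique-percentage check.
# (Illustrative Python; not the actual CSV.jl implementation.)

def decide_per_thread(sample, threshold=0.1):
    return "pooled" if len(set(sample)) / len(sample) <= threshold else "string"

def decide_global(per_thread_samples, threshold=0.1):
    combined = [v for sample in per_thread_samples for v in sample]
    return "pooled" if len(set(combined)) / len(combined) <= threshold else "string"

# Two threads each sample 100 rows drawn from the same 15 repeated values.
# Per-thread, 15/100 = 15% unique fails the 10% threshold, but globally
# 15/200 = 7.5% unique passes -- the combined check is more robust.
sample = ([f"v{i}" for i in range(15)] * 7)[:100]
print(decide_per_thread(sample))        # -> string
print(decide_global([sample, sample]))  # -> pooled
```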

Regardless, the approach in this PR wastes much less memory sampling values from string columns, and then does a single sampled unique-percentage test to decide whether a column should be pooled or not.

@quinnj quinnj merged commit e21beb3 into master Jul 3, 2020
@quinnj quinnj deleted the jq/pool branch July 3, 2020 15:38
@bkamins
Member

bkamins commented Jul 4, 2020

This is a very nice approach. Thank you for it. Two thoughts:

  1. The documentation of the pool kwarg should be updated to reflect this behavior.
  2. Maybe allow the user to specify the 100-row threshold via a kwarg, e.g. poolsample? Then some sentinel (e.g. 0) could be chosen to force checking of the whole column (this is desirable when the file is relatively small and the user has no problem with checking the pooling threshold exactly). The general rationale for this proposed change is that the current implementation assumes "randomly shuffled values", while in practice data can be ordered, e.g. by some key, and looking at just the first few values can be misleading.
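Point 2 could look something like this (a hypothetical sketch only: `poolsample` is the suggested kwarg name, not an existing CSV.jl keyword, and the real API is Julia, not Python):

```python
# Hypothetical sketch of the proposed `poolsample` kwarg: 0 means
# "check the entire column"; any positive n samples the first n rows.
# (Illustrative Python; not an actual CSV.jl API.)

def decide_pooling(values, poolsample=100, threshold=0.1):
    sample = values if poolsample == 0 else values[:poolsample]
    unique_ratio = len(set(sample)) / len(sample)
    return "pooled" if unique_ratio <= threshold else "string"

# Data sorted by key: the first 100 rows look constant, but the full
# column is high-cardinality -- exactly the ordering pitfall described.
sorted_col = ["key0"] * 100 + [f"key{i}" for i in range(1, 1000)]
print(decide_pooling(sorted_col, poolsample=100))  # -> pooled (misleading)
print(decide_pooling(sorted_col, poolsample=0))    # -> string (exact check)
```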

quinnj added a commit that referenced this pull request Jan 12, 2026
Make the automatic pooled=>string column promotion more efficient