Conversation

@quinnj
Member

@quinnj quinnj commented Jul 3, 2020

cc: @bkamins, this seems to drastically improve things on the sample generated file you shared on Slack the other day.

@codecov

codecov bot commented Jul 3, 2020

Codecov Report

Merging #676 into master will increase coverage by 2.11%.
The diff coverage is 97.43%.


@@            Coverage Diff             @@
##           master     #676      +/-   ##
==========================================
+ Coverage   84.29%   86.41%   +2.11%     
==========================================
  Files          10       10              
  Lines        1757     1619     -138     
==========================================
- Hits         1481     1399      -82     
+ Misses        276      220      -56     
Impacted Files Coverage Δ
src/utils.jl 89.61% <90.90%> (+8.11%) ⬆️
src/file.jl 96.33% <100.00%> (+0.13%) ⬆️
src/rows.jl 91.22% <100.00%> (-1.94%) ⬇️
src/tables.jl 0.00% <0.00%> (ø)
src/header.jl 95.79% <0.00%> (+0.83%) ⬆️
src/detection.jl 96.67% <0.00%> (+2.11%) ⬆️
src/write.jl 87.34% <0.00%> (+4.13%) ⬆️


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ed48ad2...88cf907.

@bkamins
Member

bkamins commented Jul 3, 2020

Thank you! (it is hard for me to do a proper review of this change)

@quinnj
Member Author

quinnj commented Jul 3, 2020

I can provide a little more explanation on the changes here for those interested.

  • The issue on current master is that we default to pool=0.1, which means that if a column has fewer than 10% unique values, we "pool" the column values and you get a PooledArray back.
  • In the current implementation, we start by assuming any string column should be pooled, which means allocating an entire Vector{UInt32} for the number of rows we estimate the file to have. Then, for each value parsed in the column, we hash it and do the "ref check": if we've seen the value before, we reuse the previously generated code; otherwise we add a new code for the new unique value. On each value, we also check whether the number of unique values "trips the trigger" of going over 10% unique values for the column. If so, we immediately promote the column to a plain Vector{String} (and never go back, regardless of whether the percentage later drops below 10%).
  • The problem with this approach is that for large files with lots of string columns, we start out by allocating lots of Vector{UInt32}s and may need to throw them all away when promoting to string columns.
  • At first, I thought we could just optimize the pooled column => string column promotion, but the memory waste quickly outweighs the cost of promoting, so I realized we needed a better way to check columns for "pooled-ness" without wasting so much memory.
  • The approach in this PR allocates only 100 rows for "maybe pooled" columns while we sample; once we've filled the 100 rows, we do a single check of whether the number of unique values exceeds the pool threshold. 100 rows was chosen because it's roughly the sample size implied by a 95% confidence level and around a 10% margin of error; it's rough statistics-wise, but I think it will work well in practice.
  • Once we do the 100-row check, we decide whether to promote the column to PooledArray permanently, or to promote it to string permanently.
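The sampling approach described above can be sketched as follows (an illustrative Python sketch of the logic only, not the actual Julia implementation inside CSV.jl; the function name `decide_pooling` is made up for illustration):

```python
# Illustrative sketch of the single 100-row pooling check described above.
# (Hypothetical Python; the real implementation is Julia inside CSV.jl.)

SAMPLE_SIZE = 100      # rows sampled before deciding
POOL_THRESHOLD = 0.1   # default pool=0.1: pool if <= 10% unique values

# Rough sample-size rationale: for a 95% confidence level (z ~ 1.96),
# worst-case proportion p = 0.5, and ~10% margin of error E, the usual
# formula n = z^2 * p * (1 - p) / E^2 gives ~96, rounded up to 100.

def decide_pooling(values, sample_size=SAMPLE_SIZE, threshold=POOL_THRESHOLD):
    """Do one unique-percentage check on the first `sample_size` values
    and permanently decide "pooled" vs "string" for the column."""
    sample = values[:sample_size]
    unique_ratio = len(set(sample)) / len(sample)
    return "pooled" if unique_ratio <= threshold else "string"

# A low-cardinality column pools; a high-cardinality one stays string:
print(decide_pooling(["a", "b"] * 200))              # -> pooled
print(decide_pooling([str(i) for i in range(200)]))  # -> string
```

The key difference from master is that only `sample_size` slots are ever allocated up front, and the unique-percentage check runs once rather than on every parsed value.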

The one thing I'm still noodling on here is whether there's a way to get multithreaded parsing to "sync" its 100-row checks. Currently, each thread parses 100 rows and does its sampling check independently; it would obviously be better to take 100 rows from each thread and do an overall unique-percentage check.
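One way the cross-thread sync could work is to merge each thread's 100-row sample into a single overall check before deciding (a hypothetical sketch of the idea being discussed, not something the PR implements):

```python
# Hypothetical sketch of the cross-thread "sync" idea: instead of each
# thread deciding from its own 100-row sample, combine the per-thread
# samples into one overall unique-percentage check.
# (Illustrative Python; not the actual CSV.jl implementation.)

def decide_per_thread(sample, threshold=0.1):
    return "pooled" if len(set(sample)) / len(sample) <= threshold else "string"

def decide_global(per_thread_samples, threshold=0.1):
    combined = [v for sample in per_thread_samples for v in sample]
    return "pooled" if len(set(combined)) / len(combined) <= threshold else "string"

# Two threads each sample 100 rows drawn from the same 15 repeated values.
# Per-thread, 15/100 = 15% unique fails the 10% threshold, but globally
# 15/200 = 7.5% unique passes -- the combined check is more robust.
sample = ([f"v{i}" for i in range(15)] * 7)[:100]
print(decide_per_thread(sample))        # -> string
print(decide_global([sample, sample]))  # -> pooled
```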

Regardless, the approach in this PR wastes much less memory sampling values from string columns, and then does a single sampled unique-percentage test to decide whether a column should be pooled or not.

@quinnj quinnj merged commit e21beb3 into master Jul 3, 2020
@quinnj quinnj deleted the jq/pool branch July 3, 2020 15:38
@bkamins
Member

bkamins commented Jul 4, 2020

This is a very nice approach. Thank you for it. Two thoughts:

  1. The documentation of the pool kwarg should be updated to reflect this behavior.
  2. Maybe allow the user to specify the 100-row threshold via a kwarg, e.g. poolsample? Then some sentinel (e.g. 0) could be chosen to force checking of the whole column (this is desirable when the file is relatively small and the user has no problem with checking the pooling threshold exactly). The general rationale for this proposed change is that the current implementation assumes "randomly shuffled values", while in practice data can be ordered, e.g. by some key, and looking at just the first few values can be misleading.
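Point 2 could look something like this (a hypothetical sketch only: `poolsample` is the suggested kwarg name, not an existing CSV.jl keyword, and the real API is Julia, not Python):

```python
# Hypothetical sketch of the proposed `poolsample` kwarg: 0 means
# "check the entire column"; any positive n samples the first n rows.
# (Illustrative Python; not an actual CSV.jl API.)

def decide_pooling(values, poolsample=100, threshold=0.1):
    sample = values if poolsample == 0 else values[:poolsample]
    unique_ratio = len(set(sample)) / len(sample)
    return "pooled" if unique_ratio <= threshold else "string"

# Data sorted by key: the first 100 rows look constant, but the full
# column is high-cardinality -- exactly the ordering pitfall described.
sorted_col = ["key0"] * 100 + [f"key{i}" for i in range(1, 1000)]
print(decide_pooling(sorted_col, poolsample=100))  # -> pooled (misleading)
print(decide_pooling(sorted_col, poolsample=0))    # -> string (exact check)
```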

quinnj added a commit that referenced this pull request Jan 12, 2026
Make the automatic pooled=>string column promotion more efficient