Better skip= and nrow= #2623

Merged: 17 commits, Feb 13, 2018
Conversation

@mattdowle (Member) commented Feb 12, 2018

Closes #1267
Closes #2518
Closes #2515 (fixed previously, actually, but including its good test in this PR)
Closes #1671

Default for skip= is now "__auto__" rather than 0, to more correctly represent what actually happens. The double underscores are so that skip="auto" still searches for a line containing the string "auto". When skip= is used, it now determines the first row (either column names or data) and automatic detection of the first consistent line is turned off (before, detection still ran).
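As a minimal sketch of the skip= resolution rules described above (illustrative Python, not fread's actual C implementation; the function names and the auto-detection heuristic are invented for this example):

```python
# Hypothetical sketch of resolving skip=: an integer is an explicit offset
# (auto-detection off), "__auto__" triggers detection, and any other string
# -- including "auto" -- searches for a line containing it.
def resolve_skip(lines, skip="__auto__"):
    """Return the index of the first row to read."""
    if isinstance(skip, int):
        return skip                      # explicit offset; auto-detection off
    if skip == "__auto__":
        return auto_detect_start(lines)  # find the first consistent line
    for i, line in enumerate(lines):     # any other string, even "auto"
        if skip in line:
            return i
    raise ValueError(f"skip={skip!r} not found in input")

def auto_detect_start(lines, sep=","):
    # Toy heuristic: first non-empty line whose field count matches the
    # following line's (fread's real detection is more thorough).
    for i in range(len(lines) - 1):
        if lines[i].strip() and lines[i].count(sep) == lines[i + 1].count(sep):
            return i
    return 0
```

For example, with a banner line before the header, `resolve_skip` lands on the header, while passing the banner text as skip= finds the banner itself.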

autostart=NA has been moved to the end of the argument list since it is deprecated. That argument was about automatically searching upwards within a consistent block when skip= pointed inside one of many tables; that behaviour was removed a while back in dev.

The nrow= limit can no longer be 0; it must be >=1. If it is 1, then 1 row is used for type sampling. Before, sampling proceeded over the whole file regardless of nrow=. What I had in mind there was consistency of column types when a user extracts batches from a valid file. However, invalid files are more common and cause more pain, and are more often the reason skip= and nrow= are used. Invalid lines outside this range no longer cause errors; i.e. it works as the user expects.
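A toy sketch of restricting type sampling to the first nrow= rows, so that lines beyond that range cannot influence (or break) the result (the two-type model here is a deliberate simplification of fread's type ladder):

```python
# Illustrative only: sample column types from at most nrow rows.
# Rows beyond nrow are never inspected, so invalid lines there
# can no longer cause errors or type changes.
def sample_types(rows, nrow=None):
    if nrow is not None:
        nrow = max(nrow, 1)   # nrow must be >= 1; nrow=1 samples one row
        rows = rows[:nrow]
    types = ["int"] * len(rows[0])
    for row in rows:
        for j, field in enumerate(row):
            if types[j] == "int" and not field.lstrip("-").isdigit():
                types[j] = "string"   # bump to a wider type
    return types
```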

If a line with too few or too many fields occurs, the result is returned up to that line with a detailed warning suggesting fill=TRUE. If that occurs in-sample, the affected jump is skipped and the warning is deferred until after reading, when we're sure previous jumps have processed correctly. Before, spurious invalid lines were tripping up sampling just because the jump point wasn't good. Sampling is simpler now.
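The stop-and-warn behaviour can be sketched as follows (a simplified stand-in for fread's reader; the warning text is illustrative):

```python
import warnings

# Illustrative sketch: on a line with the wrong field count, return the
# rows read so far and warn, suggesting fill=TRUE, instead of erroring.
def read_rows(lines, sep=","):
    out, ncol = [], None
    for i, line in enumerate(lines):
        fields = line.split(sep)
        if ncol is None:
            ncol = len(fields)
        if len(fields) != ncol:
            warnings.warn(
                f"Stopped early on line {i + 1}: expected {ncol} fields "
                f"but found {len(fields)}. Consider fill=TRUE."
            )
            break
        out.append(fields)
    return out
```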

If nrow= is supplied, multi-threading is now turned off, because if an out-of-sample bump occurred in a jump after the jump that reaches nrow=, the bump from the later thread would still occur (and trigger a reread) even though it happened after the nrow= rows were read. Accommodating that while keeping multi-threading would require each thread to have its own copy of type[], or a stack of bumps to be applied in the ordered clause. Both would complicate the code, and a copy of type[] per thread would hurt memory usage in the 10,000-column case too. Further, when nrow= is supplied, just the first jump is now sampled: similarly, we don't want sampling problems or bumps after the nrow= row to affect the result.
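The ordering concern can be illustrated with a sequential sketch: when chunks ("jumps") are processed in order, any would-be type bump in a chunk beyond the nrow= cutoff is simply never seen, whereas a parallel reader would have to undo or defer it (names and the bump condition are invented for this example):

```python
# Illustrative sketch: sequential chunk processing naturally ignores
# bumps that occur after the nrow= row has been reached, avoiding the
# per-thread type[] copies or bump stacks a parallel reader would need.
def read_chunks(chunks, nrow):
    rows_read, bumps = 0, []
    for chunk in chunks:              # sequential: bumps arrive in row order
        for row in chunk:
            if rows_read == nrow:
                return rows_read, bumps
            if "x" in row:            # stand-in for a field forcing a type bump
                bumps.append(rows_read)
            rows_read += 1
    return rows_read, bumps
```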

Using nrows= also now turns off auto skip; i.e. skip is set to 0 and column names are expected on line 1 (since auto skip relies on testing 100 rows to find the biggest contiguous consistent set of rows). If nrow= is provided it could be small (say 1), so for consistency auto skip is then off.

Sampling no longer attempts to find lastRowEnd. That relied on the last jump finding a good nextGoodLine() and could be incorrect in edge cases. The data read step now always goes up to eof and checks for a footer to discard afterwards, once all jumps have completed successfully and we are therefore sure we're positioned correctly at the end.
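The read-to-eof-then-trim step can be sketched as (illustrative; fread does this on byte positions in C, not on parsed rows):

```python
# Illustrative sketch: read everything to eof first, then walk back from
# the end discarding trailing footer lines with the wrong field count.
def discard_footer(rows, ncol):
    end = len(rows)
    while end > 0 and len(rows[end - 1]) != ncol:
        end -= 1
    return rows[:end]
```

Trimming only after all jumps have completed is what guarantees the end position is trustworthy, which guessing lastRowEnd during sampling could not.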

An out-of-sample type bump now checks that the line has the correct number of fields before applying the bump. Before, a line with too few or too many fields was an error so this didn't matter; now that it's a warning and the result is returned up to that point, bumps from the invalid line should not affect the result.
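That guard can be sketched as (names are illustrative, not fread's internals):

```python
# Illustrative sketch: apply an out-of-sample type bump only when the
# offending line has the expected field count; an invalid line is handled
# by the warn-and-truncate path and must not widen column types.
def maybe_bump(types, fields, col, new_type):
    if len(fields) != len(types):
        return types              # invalid line: leave types untouched
    bumped = list(types)
    bumped[col] = new_type
    return bumped
```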

@mattdowle mattdowle added this to the v1.10.6 milestone Feb 12, 2018
@codecov-io commented Feb 13, 2018

Codecov Report

Merging #2623 into master will increase coverage by 0.01%.
The diff coverage is 95.23%.


@@            Coverage Diff             @@
##           master    #2623      +/-   ##
==========================================
+ Coverage   92.94%   92.95%   +0.01%     
==========================================
  Files          61       61              
  Lines       12109    12130      +21     
==========================================
+ Hits        11255    11276      +21     
  Misses        854      854
Impacted Files Coverage Δ
R/fread.R 95.68% <100%> (+0.03%) ⬆️
src/freadR.c 89.64% <75%> (+0.25%) ⬆️
src/fread.c 96.38% <95.91%> (-0.02%) ⬇️

Last update dcaf004...7f48c74.

@st-pasha (Contributor) left a comment:
Great changes, happy to see so many issues resolved.
