Better skip= and nrow= #2623

Merged: 17 commits, Feb 13, 2018
Conversation

@mattdowle (Member) commented Feb 12, 2018

Closes #1267
Closes #2518
Closes #2515 (fixed previously, actually, but including its good test in this PR)
Closes #1671

Default for skip= is now "__auto__" rather than 0, to more correctly represent what actually happens. The double underscores are so that skip="auto" still searches for a line containing the string "auto". When skip= is used, it now determines the first row (either column names or data) and automatic detection of the first consistent line is turned off (before, detection still ran).
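As a minimal sketch of the skip= resolution rules described above (illustrative Python, not fread's actual C implementation; the function names and the auto-detection heuristic are invented for this example):

```python
# Hypothetical sketch of resolving skip=: an integer is an explicit offset
# (auto-detection off), "__auto__" triggers detection, and any other string
# -- including "auto" -- searches for a line containing it.
def resolve_skip(lines, skip="__auto__"):
    """Return the index of the first row to read."""
    if isinstance(skip, int):
        return skip                      # explicit offset; auto-detection off
    if skip == "__auto__":
        return auto_detect_start(lines)  # find the first consistent line
    for i, line in enumerate(lines):     # any other string, even "auto"
        if skip in line:
            return i
    raise ValueError(f"skip={skip!r} not found in input")

def auto_detect_start(lines, sep=","):
    # Toy heuristic: first non-empty line whose field count matches the
    # following line's (fread's real detection is more thorough).
    for i in range(len(lines) - 1):
        if lines[i].strip() and lines[i].count(sep) == lines[i + 1].count(sep):
            return i
    return 0
```

For example, with a banner line before the header, `resolve_skip` lands on the header, while passing the banner text as skip= finds the banner itself.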

autostart=NA has been moved to the end of the argument list since it is deprecated. That argument was about automatically searching upwards within a consistent block when skip= pointed inside one of many tables; that behaviour was removed a while back in dev.

The nrow= limit can no longer be 0; it must be >=1. If it is 1, then 1 row is used for type sampling. Before, sampling proceeded over the whole file regardless of nrow=. What I had in mind there was consistency of column types when a user extracts batches from a valid file. However, invalid files are more common and cause more pain, and are more often the reason skip= and nrow= are used. Invalid lines outside this range no longer cause errors; i.e. it works as the user expects.
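A toy sketch of restricting type sampling to the first nrow= rows, so that lines beyond that range cannot influence (or break) the result (the two-type model here is a deliberate simplification of fread's type ladder):

```python
# Illustrative only: sample column types from at most nrow rows.
# Rows beyond nrow are never inspected, so invalid lines there
# can no longer cause errors or type changes.
def sample_types(rows, nrow=None):
    if nrow is not None:
        nrow = max(nrow, 1)   # nrow must be >= 1; nrow=1 samples one row
        rows = rows[:nrow]
    types = ["int"] * len(rows[0])
    for row in rows:
        for j, field in enumerate(row):
            if types[j] == "int" and not field.lstrip("-").isdigit():
                types[j] = "string"   # bump to a wider type
    return types
```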

If a line with too few or too many fields occurs, the result is returned up to that line with a detailed warning suggesting fill=TRUE. If that occurs in-sample, the affected jump is skipped and the warning is deferred until after reading, when we're sure previous jumps have processed correctly. Before, spurious invalid lines were tripping up sampling just because the jump point wasn't good. Sampling is simpler now.
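The stop-and-warn behaviour can be sketched as follows (a simplified stand-in for fread's reader; the warning text is illustrative):

```python
import warnings

# Illustrative sketch: on a line with the wrong field count, return the
# rows read so far and warn, suggesting fill=TRUE, instead of erroring.
def read_rows(lines, sep=","):
    out, ncol = [], None
    for i, line in enumerate(lines):
        fields = line.split(sep)
        if ncol is None:
            ncol = len(fields)
        if len(fields) != ncol:
            warnings.warn(
                f"Stopped early on line {i + 1}: expected {ncol} fields "
                f"but found {len(fields)}. Consider fill=TRUE."
            )
            break
        out.append(fields)
    return out
```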

If nrow= is supplied, multi-threading is now turned off, because if an out-of-sample bump occurred in a jump after the jump that reaches nrow=, the bump from the later thread would still occur (and trigger a reread) even though it happened after the nrow= rows were read. Accommodating that while keeping multi-threading would require each thread to have its own copy of type[], or a stack of bumps to be applied in the ordered clause. Both would complicate the code, and a copy of type[] per thread would hurt memory usage in the 10,000-column case too. Further, when nrow= is supplied, just the first jump is now sampled: similarly, we don't want sampling problems or bumps after the nrow= row to affect the result.
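The ordering concern can be illustrated with a sequential sketch: when chunks ("jumps") are processed in order, any would-be type bump in a chunk beyond the nrow= cutoff is simply never seen, whereas a parallel reader would have to undo or defer it (names and the bump condition are invented for this example):

```python
# Illustrative sketch: sequential chunk processing naturally ignores
# bumps that occur after the nrow= row has been reached, avoiding the
# per-thread type[] copies or bump stacks a parallel reader would need.
def read_chunks(chunks, nrow):
    rows_read, bumps = 0, []
    for chunk in chunks:              # sequential: bumps arrive in row order
        for row in chunk:
            if rows_read == nrow:
                return rows_read, bumps
            if "x" in row:            # stand-in for a field forcing a type bump
                bumps.append(rows_read)
            rows_read += 1
    return rows_read, bumps
```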

Using nrows= also now turns off auto skip; i.e. skip is set to 0 and column names are expected on line 1 (since auto skip relies on testing 100 rows to find the biggest contiguous consistent set of rows). If nrow= is provided it could be small (say 1), so for consistency auto skip is then off.

Sampling no longer attempts to find lastRowEnd. That relied on the last jump finding a good nextGoodLine() and could be incorrect in edge cases. The data read step now always goes up to eof and checks for a footer to discard afterwards, once all jumps have completed successfully and we are therefore sure we're positioned correctly at the end.
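The read-to-eof-then-trim step can be sketched as (illustrative; fread does this on byte positions in C, not on parsed rows):

```python
# Illustrative sketch: read everything to eof first, then walk back from
# the end discarding trailing footer lines with the wrong field count.
def discard_footer(rows, ncol):
    end = len(rows)
    while end > 0 and len(rows[end - 1]) != ncol:
        end -= 1
    return rows[:end]
```

Trimming only after all jumps have completed is what guarantees the end position is trustworthy, which guessing lastRowEnd during sampling could not.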

An out-of-sample type bump now checks that the line has the correct number of fields before applying the bump. Before, a line with too few or too many fields was an error so this didn't matter; now that it's a warning and the result is returned up to that point, bumps from the invalid line should not affect the result.
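That guard can be sketched as (names are illustrative, not fread's internals):

```python
# Illustrative sketch: apply an out-of-sample type bump only when the
# offending line has the expected field count; an invalid line is handled
# by the warn-and-truncate path and must not widen column types.
def maybe_bump(types, fields, col, new_type):
    if len(fields) != len(types):
        return types              # invalid line: leave types untouched
    bumped = list(types)
    bumped[col] = new_type
    return bumped
```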

@mattdowle mattdowle added this to the v1.10.6 milestone Feb 12, 2018
@codecov-io commented Feb 13, 2018

Codecov Report

Merging #2623 into master will increase coverage by 0.01%.
The diff coverage is 95.23%.


@@            Coverage Diff             @@
##           master    #2623      +/-   ##
==========================================
+ Coverage   92.94%   92.95%   +0.01%     
==========================================
  Files          61       61              
  Lines       12109    12130      +21     
==========================================
+ Hits        11255    11276      +21     
  Misses        854      854
Impacted Files Coverage Δ
R/fread.R 95.68% <100%> (+0.03%) ⬆️
src/freadR.c 89.64% <75%> (+0.25%) ⬆️
src/fread.c 96.38% <95.91%> (-0.02%) ⬇️

Last update dcaf004...7f48c74.

@st-pasha (Contributor) left a comment:
Great changes, happy to see so many issues resolved.
