-
Notifications
You must be signed in to change notification settings - Fork 982
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better skip= and nrow= #2623
Better skip= and nrow= #2623
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2623 +/- ##
==========================================
+ Coverage 92.94% 92.95% +0.01%
==========================================
Files 61 61
Lines 12109 12130 +21
==========================================
+ Hits 11255 11276 +21
Misses 854 854
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great changes, happy to see so many issues resolved.
Closes #1267
Closes #2518
Closes #2515 (fixed previously, actually, but including its good test in this PR)
Closes #1671
Default for
skip=
now"__auto__"
rather than 0. To more correctly represent what actually happens. The__
are so thatskip="auto"
still looks for a line containing the string"auto"
. Whenskip=
is used, this now determines the first row (either column names or data) and automatic detection of the first consistent line is now turned off (before, it used to still run).autostart=NA
moved to the end of the argument list as it's deprecated. That was about automatically searching up from within a consistent block whenskip=
was pointing inside one of many tables. That was removed a while back in dev.nrow=
limit can no longer be 0. Must be>=1
. If 1, then 1 row is used for type sampling. Before, the sampling proceeded on the whole file regardless ofnrow=
. What I had in mind there is consistency of column types when user is extracting batches from a valid file. However, invalid files are more common and causing more pain, which is more often whyskip=
andnrow=
is used. Invalid lines outside this range no longer cause errors; i.e. works as user expects.If a too-few or too-many field line occurs, the result is returned up to that line with a detailed warning suggesting
fill=TRUE
. If that occurs in-sample, that jump is skipped and the warning is left to after reading when we're sure previous jumps have processed correctly. Before, spurious invalid lines were tripping up sampling just because the jump point wasn't good. Sampling is simplified now.If
nrow=
is supplied, multi-threading is now turned off. Since if an out-of-sample bump occurred in a jump after the jump which reachesnrow=
, the bump from the later thread would still occur (and trigger reread) when that was not intended because the bump occurred after thenrow=
was read. Accommodating that while keeping MT would require each thread to have its own copy oftype[]
, or maintain a stack of bumps to be applied in the ordered clause. Both would complicate the code. A copy oftype[]
for each thread would bite memory usage in the case of 10,000 columns, too. Further, whennrow=
is supplied, just the first jump is now sampled. Similarly, we don't want sampling problems or bumps after thenrow=
row to affect the result.Using
nrows=
also now turns off auto skip, i.e. skip is set to 0 and column names are expected on line 1 (since auto skip relies on testing 100 rows to find the biggest contiguous consistent set of rows). Ifnrow=
is provided, it could be small (say 1), so for consistency, auto skip is then off.Sampling no longer attempts to find the
lastRowEnd
. This relied on the last jump finding a goodnextGoodLine()
and could be incorrect in edge cases. The data read step now always goes up toeof
and checks for discarding footer afterwards, once all jumps have been successfully completed and therefore we're sure we're positioned correctly at the end.An out-of-sample type bump now checks that line has the correct number of fields, before applying the bump. Before, a too-few or too-many line was an error so this didn't matter, but now that it's a warning and the result is returned up to that point, bumps from the invalid line should not affect the result.