Ability to export lines fread thinks are incorrectly classed #2163

bg49ag · 2017-05-13T20:56:20Z

I've been really impressed with data.table: not just the speed, but the fact that the error messages are actually useful.

I was working with ~350 FDA files amounting to ~127 million lines and continually encountering coercion issues with some of the columns. It turns out numerous lines within the originals break across two rows. I can only imagine this is some form of encoding issue from when they were originally generated; I've tried re-downloading and unpacking them multiple times and it's been a persistent issue.

There only seem to be five or ten per file, but that's potentially 1,500 to 3,500 lines that would need to be manually edited. I actually set out to do this with readLines: opening the result in Notepad++ to check it, trying the import again, fixing the next ones, checking again, and so on. I rapidly began to lose the will to live and had to abandon ship. It also significantly reduces the otherwise excellent performance of a fully fread'ed section of code.

At present, fread will report the first one it identifies. It'd be really useful if there was an option to export all of the lines it thinks are incorrectly classified with the associated line numbers for reference. It'd also be good if that option allowed users to specify how many previous and subsequent lines to include in the warning report; e.g. if fread finds one, get the line before and the line after as well to see what's going on with it.
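Until something like this exists in fread, the report described above can be approximated in base R. The following is only a minimal sketch, not a data.table feature: `find_suspect_lines` and its `before`/`after` arguments are hypothetical names, and it assumes the header line has the correct number of fields and that no field contains a quoted separator.

```r
# Hypothetical helper, not part of data.table.
# Flags every line whose field count differs from the header's and
# returns each one together with `before`/`after` lines of context.
find_suspect_lines <- function(path, sep = ",", before = 1L, after = 1L) {
  lines <- readLines(path, warn = FALSE)

  # Count separators per line; fields = separators + 1.
  # Assumption: separators never appear inside quoted fields.
  n_sep <- lengths(regmatches(lines, gregexpr(sep, lines, fixed = TRUE)))
  n_fields <- n_sep + 1L

  expected <- n_fields[1L]            # header defines the expected width
  bad <- which(n_fields != expected)  # line numbers of suspect lines

  if (length(bad) == 0L) {
    return(data.frame(suspect = integer(0), line_no = integer(0),
                      text = character(0), stringsAsFactors = FALSE))
  }

  # One block of rows per suspect line, including its neighbours.
  ctx <- lapply(bad, function(i) {
    idx <- seq(max(1L, i - before), min(length(lines), i + after))
    data.frame(suspect = i, line_no = idx, text = lines[idx],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, ctx)
}
```

The returned data frame has one row per context line, so it can be written out with `write.csv()` and inspected in an editor instead of fixing one reported line at a time.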

Perhaps it'd be possible to implement some form of error correction. In this case, the only issue with the broken lines is that they need a backspace applied at the start, which would join each one back onto the end of the previous line.
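As a rough illustration of that kind of correction, the sketch below glues a line onto the following one whenever it has too few fields. It is a heuristic under stated assumptions, not how fread works: `join_broken_lines` is a hypothetical name, the first line is assumed to have the correct field count, separators are assumed never to appear inside quotes, and a break that falls inside the last field of a record would not be detected.

```r
# Hypothetical helper, not part of data.table.
# Joins consecutive lines (with no separator in between) until the
# accumulated record has at least as many fields as the first line.
join_broken_lines <- function(lines, sep = ",") {
  if (length(lines) == 0L) return(character(0))

  count_fields <- function(x) {
    lengths(regmatches(x, gregexpr(sep, x, fixed = TRUE))) + 1L
  }

  expected <- count_fields(lines[1L])
  out <- character(length(lines))  # preallocate; trimmed at the end
  k <- 0L
  buf <- NULL                      # partial record being assembled

  for (ln in lines) {
    buf <- if (is.null(buf)) ln else paste0(buf, ln)
    if (count_fields(buf) >= expected) {
      k <- k + 1L
      out[k] <- buf
      buf <- NULL
    }
  }

  # Keep an unfinished tail rather than silently dropping it.
  if (!is.null(buf)) {
    k <- k + 1L
    out[k] <- buf
  }
  out[seq_len(k)]
}
```

The repaired lines could then be written to a temporary file with `writeLines()` and read with fread as usual, keeping the slow line-by-line pass limited to the files that actually need it.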

I've also noticed with the same files in data.table 1.10.4 that 'strip.white = TRUE' seems to produce empty line errors for almost every one of the FDA files I try to import, on entirely different lines, with blank.lines.skip = TRUE.

I've manually checked many of those lines and there aren't any empty lines present. I'm not sure whether it might be worth including a suggestion to disable strip.white in the warning for them. As with the broken line check, maybe it would be useful if fread printed the previous and next few lines after an empty line warning, so you can quickly check at the console whether there is actually a blank there.

bg49ag changed the title ~~Wish list - ability to export lines fread thinks are incorrectly classed.~~ [Request] Ability to export lines fread thinks are incorrectly classed. Jun 3, 2017

bg49ag changed the title ~~[Request] Ability to export lines fread thinks are incorrectly classed.~~ [Request] Ability to export lines fread thinks are incorrectly classed Jun 3, 2017

st-pasha added feature request fread labels Jun 28, 2017

st-pasha mentioned this issue Jul 6, 2017

Master task for fread bugs / proposals #2247

Closed

jangorecki changed the title ~~[Request] Ability to export lines fread thinks are incorrectly classed~~ Ability to export lines fread thinks are incorrectly classed Apr 6, 2020
