I've been really impressed with data.table, not just the speed but the fact the error messages are actually useful.
I was working with ~350 FDA files amounting to ~127 million lines and continually encountering coercion issues with some of the columns. It turns out numerous lines within the originals break across two rows. I can only imagine this is some form of encoding issue from when they were originally generated; I've tried re-downloading and unpacking them multiple times and it's been a persistent issue.
There only seem to be five or ten per file, but that's potentially 1,500 to 3,500 lines that would need to be manually edited. I actually set out to do this with readLines: opening the result in Notepad++ to check it, trying the import again, fixing the next ones, checking again, etc. I rapidly began to lose the will to live and had to abandon ship. It also significantly reduces the otherwise excellent performance of a fully fread'ed section of code.
At present, fread will report the first one it identifies. It'd be really useful if there was an option to export all of the lines it thinks are incorrectly classified with the associated line numbers for reference. It'd also be good if that option allowed users to specify how many previous and subsequent lines to include in the warning report; e.g. if fread finds one, get the line before and the line after as well to see what's going on with it.
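As a stopgap, the kind of report described above can be roughly sketched in base R. This is only an illustration under stated assumptions: `find_suspect_lines` is a hypothetical helper (not a data.table function), the separator is assumed never to occur inside quoted fields, and the first line is assumed to be a correct header that defines the expected field count.

```r
# Hypothetical helper, not part of data.table: flag lines whose separator
# count differs from the first line's, and return them with context lines.
find_suspect_lines <- function(lines, sep = ",", context = 1L) {
  # Count separators per line (assumes sep never appears inside quotes).
  n_sep <- lengths(regmatches(lines, gregexpr(sep, lines, fixed = TRUE)))
  bad <- which(n_sep != n_sep[1L])
  if (length(bad) == 0L) {
    return(data.frame(line_no = integer(0), flagged = logical(0),
                      text = character(0), stringsAsFactors = FALSE))
  }
  # Expand each flagged line number into a window of +/- `context` lines.
  idx <- sort(unique(unlist(lapply(bad, function(i) {
    seq(max(1L, i - context), min(length(lines), i + context))
  }))))
  data.frame(line_no = idx, flagged = idx %in% bad, text = lines[idx],
             stringsAsFactors = FALSE)
}

# Toy input: the record "2,beta,20" is broken across lines 3 and 4.
suspect_input <- c("id,name,value", "1,alpha,10", "2,be", "ta,20", "3,gamma,30")
suspect_report <- find_suspect_lines(suspect_input, sep = ",", context = 1L)
print(suspect_report)
```

For a real file the input would come from `readLines(path)`, and the report could be written out with `write.csv` for inspection rather than printed.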
Perhaps it'd be possible to implement some form of error correction. In this case, the only issue with the lines is they need a backspace applying to the start of them.
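To make the idea concrete, here is a minimal sketch of that kind of correction done outside fread. It assumes the only defect is a record split across consecutive lines, that the separator never appears inside quoted fields, and that the first line carries the expected number of separators; `repair_split_lines` is a hypothetical name, not an existing function.

```r
# Hypothetical helper: re-join records that were broken across lines.
repair_split_lines <- function(lines, sep = ",") {
  count_sep <- function(x) {
    lengths(regmatches(x, gregexpr(sep, x, fixed = TRUE)))
  }
  expected <- count_sep(lines[1L])  # header defines the expected count
  out <- character(0)
  i <- 1L
  while (i <= length(lines)) {
    cur <- lines[i]
    # Append following lines while the record is short and the join
    # would not exceed the expected number of separators.
    while (count_sep(cur) < expected && i < length(lines) &&
           count_sep(cur) + count_sep(lines[i + 1L]) <= expected) {
      i <- i + 1L
      cur <- paste0(cur, lines[i])
    }
    out <- c(out, cur)
    i <- i + 1L
  }
  out
}

broken_input <- c("id,name,value", "1,alpha,10", "2,be", "ta,20", "3,gamma,30")
repaired <- repair_split_lines(broken_input)
print(repaired)
```

The repaired lines could then be written back with `writeLines` and read with fread as usual. Growing `out` with `c()` is slow on very large inputs, so this is meant to show the logic rather than to be run as-is on 127 million lines.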
I've also noticed with the same files in data.table 1.10.4 that 'strip.white = TRUE' seems to produce empty line errors for almost every one of the FDA files I try to import, on entirely different lines, with blank.lines.skip = TRUE.
I've manually checked many of those lines and there aren't any empty lines present. I'm not sure if it might be worth including a suggestion to disable strip.white in the warning for them. As with the broken line check, maybe it would be useful if fread printed the previous and next few lines after an empty line warning, to quickly check at the console whether there is actually a blank there.
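For reference, the manual check can be approximated in base R. The sketch below lists lines that are truly empty versus lines that only become empty once whitespace is trimmed, which is one possible (unconfirmed) way strip.white could interact with blank line detection. `blank_line_report` is a hypothetical helper name.

```r
# Hypothetical helper: list lines that are empty or whitespace-only.
blank_line_report <- function(lines) {
  is_blank <- !nzchar(trimws(lines))  # blank after trimming whitespace
  data.frame(
    line_no = which(is_blank),
    empty = !nzchar(lines[is_blank]),            # zero characters
    whitespace_only = nzchar(lines[is_blank]),   # only spaces or tabs
    stringsAsFactors = FALSE
  )
}

blank_input <- c("a,b", "", "   ", "c,d")
blank_report <- blank_line_report(blank_input)
print(blank_report)
```

If this returns no rows for the line numbers fread warns about, the warnings are not caused by genuinely blank or whitespace-only lines.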
bg49ag changed the title from "Wish list - ability to export lines fread thinks are incorrectly classed." to "[Request] Ability to export lines fread thinks are incorrectly classed." on Jun 3, 2017
bg49ag changed the title from "[Request] Ability to export lines fread thinks are incorrectly classed." to "[Request] Ability to export lines fread thinks are incorrectly classed" on Jun 3, 2017
jangorecki changed the title from "[Request] Ability to export lines fread thinks are incorrectly classed" to "Ability to export lines fread thinks are incorrectly classed" on Apr 6, 2020