Ability to export lines fread thinks are incorrectly classed #2163

Open
bg49ag opened this issue May 13, 2017 · 0 comments

bg49ag commented May 13, 2017

I've been really impressed with data.table: not just the speed, but the fact that the error messages are actually useful.

I was working with ~350 FDA files amounting to ~127 million lines and continually encountering coercion issues with some of the columns. It turns out numerous lines within the originals break across two rows. I can only imagine this is some form of encoding issue from when they were originally generated; I've tried re-downloading and unpacking them multiple times and it's been a persistent issue.

There only seem to be five or ten per file, but that's potentially 1,500 to 3,500 lines that would need to be manually edited. I actually set out to do this with readLines: opening the result in Notepad++ to check it, trying the import again, fixing the next ones, checking again, etc., but I rapidly began to lose the will to live and had to abandon ship. It also significantly reduces the otherwise excellent performance of a fully fread'ed section of code.

At present, fread will report the first one it identifies. It'd be really useful if there was an option to export all of the lines it thinks are incorrectly classified with the associated line numbers for reference. It'd also be good if that option allowed users to specify how many previous and subsequent lines to include in the warning report; e.g. if fread finds one, get the line before and the line after as well to see what's going on with it.
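To illustrate the kind of report meant here, below is a rough base R sketch, not anything fread provides. `suspect_lines` and its `context` argument are hypothetical names, and the check is deliberately naive: it assumes the first line is a header with the correct number of separators and that no field contains a quoted separator.

```r
# Hypothetical helper (not part of data.table): flag lines whose separator
# count differs from the first line's, and return them with context lines.
suspect_lines <- function(path, sep = ",", context = 1L) {
  lines <- readLines(path, warn = FALSE)
  # Count literal separators per line; quoted separators are NOT handled.
  n_sep <- lengths(regmatches(lines, gregexpr(sep, lines, fixed = TRUE)))
  bad <- which(n_sep != n_sep[1L])
  # Expand every flagged line into a window of neighbouring line numbers.
  windows <- lapply(bad, function(i) {
    seq(max(1L, i - context), min(length(lines), i + context))
  })
  idx <- sort(unique(as.integer(unlist(windows))))
  data.frame(line_no = idx, flagged = idx %in% bad, text = lines[idx],
             stringsAsFactors = FALSE)
}

# Made-up example file in which one record is broken across two rows.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,name,value", "1,alpha,10", "2,be", "ta,20", "3,gamma,30"), tmp)
report <- suspect_lines(tmp, sep = ",", context = 1L)
print(report)
```

The returned data frame keeps the original line numbers, so it could be written out with `write.csv` and checked against the source file in an editor.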

Perhaps it'd be possible to implement some form of error correction. In this case, the only fix the lines need is a backspace at their start, joining each one back onto the previous line.
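As a sketch of what such a correction might look like when done by hand before the import, the base R function below glues a short line onto the lines that follow it until the record has the expected number of separators. `rejoin_broken_lines` is a hypothetical name, and the same assumptions apply: the first line is a well-formed header, breaks fall inside a field, and separators never appear inside quotes.

```r
# Hypothetical pre-processing step (not part of data.table): merge a record
# that was split across rows back into a single line.
rejoin_broken_lines <- function(lines, sep = ",") {
  n_sep <- lengths(regmatches(lines, gregexpr(sep, lines, fixed = TRUE)))
  expected <- n_sep[1L]               # separators in the header line
  out <- character(length(lines))     # preallocate; trimmed at the end
  k <- 0L
  i <- 1L
  while (i <= length(lines)) {
    current <- lines[i]
    count <- n_sep[i]
    # Absorb following lines while the record is short and the merge
    # would not exceed the expected separator count.
    while (count < expected && i < length(lines) &&
           count + n_sep[i + 1L] <= expected) {
      i <- i + 1L
      current <- paste0(current, lines[i])
      count <- count + n_sep[i]
    }
    k <- k + 1L
    out[k] <- current
    i <- i + 1L
  }
  out[seq_len(k)]
}

# Made-up example: the record "2,beta,20" is broken after "be".
broken <- c("id,name,value", "1,alpha,10", "2,be", "ta,20", "3,gamma,30")
fixed <- rejoin_broken_lines(broken)
print(fixed)
```

The repaired vector could then be written back with `writeLines` and read with fread as usual. Lines that cannot be repaired this way are left untouched, so they would still surface in fread's own error message.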

I've also noticed with the same files in data.table 1.10.4 that 'strip.white = TRUE' seems to produce empty line errors for almost every one of the FDA files I try to import, on entirely different lines, with blank.lines.skip = TRUE.

I've manually checked many of those lines and there aren't any empty lines present. I'm not sure if it might be worth including a suggestion to disable strip.white in the warning for them. As with the broken-line check, maybe it would be useful if fread printed the previous and next few lines after an empty-line warning, to quickly check at the console whether there is actually a blank there.
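For reference, this is roughly the manual check involved, written as a small base R sketch. `blank_line_report` is a hypothetical helper name; it treats a line as blank if it is empty or contains only whitespace, which is an assumption about what fread counts as an empty line rather than a description of its actual rule.

```r
# Hypothetical helper (not part of data.table): list lines that are empty or
# whitespace-only, each with a window of surrounding lines for inspection.
blank_line_report <- function(lines, context = 2L) {
  blank <- which(!nzchar(trimws(lines)))
  lapply(blank, function(i) {
    idx <- seq(max(1L, i - context), min(length(lines), i + context))
    data.frame(line_no = idx, text = lines[idx], stringsAsFactors = FALSE)
  })
}

# Made-up example with one whitespace-only line at position 3.
example_lines <- c("a,b", "1,2", "   ", "3,4", "5,6")
hits <- blank_line_report(example_lines, context = 1L)
print(hits)
```

An empty list from this check on a file where fread reports empty lines would support the suspicion that the warning comes from the stripping step rather than from the data.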

@bg49ag bg49ag changed the title Wish list - ability to export lines fread thinks are incorrectly classed. [Request] Ability to export lines fread thinks are incorrectly classed. Jun 3, 2017
@bg49ag bg49ag changed the title [Request] Ability to export lines fread thinks are incorrectly classed. [Request] Ability to export lines fread thinks are incorrectly classed Jun 3, 2017
@jangorecki jangorecki changed the title [Request] Ability to export lines fread thinks are incorrectly classed Ability to export lines fread thinks are incorrectly classed Apr 6, 2020