Parse failure when crossing chunk boundaries #74
Comments
Hi @pjones, thanks for reporting this! Is it possible you can share your tool's source code with us? If not, because of secrecy reasons for example, could you strip out the sensitive parts? It would also be helpful if you could share the CSV file and the error message you are getting. Thanks!
Here's a small program and CSV file which reproduces the bug: http://www.pmade.com/static/tmp/boundary-bug.tar.bz2 I'll delete that tarball after a couple of days. The error message is "parse error (endOfInput)", which I suspect is partially constructed by cassava.
Thanks for the repro, @pjones. The error is being reported from inside attoparsec. I've been able to cut your CSV file down to 29% of its original size while preserving the behaviour. No idea yet what's actually going on, but a smaller test case has to be strictly better than a larger one ... right? (Oh, and cc @tibbe for good measure.)
I had a wild notion that the problem must have something to do with the sizes of the chunks of data being fed to the parser, and amazingly this turned out to be a helpful direction to pursue. I've now got the failing input trimmed down to just two lines: the header and a row of actual data. This gives a fighting chance at figuring out what's wrong. |
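The chunk-size hunch above can be turned into a brute-force check. Below is a minimal sketch (not the actual test from this issue; the parser and input are illustrative) that feeds the same input to an attoparsec parser split at every possible position and verifies that every split agrees with the single-chunk result. A chunk-boundary bug like this one shows up as a mismatch at some split point.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch: a parser's result must not depend on where chunk boundaries fall.
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.ByteString.Char8 as B

-- Parse the whole input as a single chunk.
parseWhole :: A.Parser a -> B.ByteString -> Either String a
parseWhole = A.parseOnly

-- Parse the input split into two chunks at position n, then signal
-- end of input by feeding an empty chunk.
parseSplit :: A.Parser a -> Int -> B.ByteString -> Either String a
parseSplit p n input =
  let (front, back) = B.splitAt n input
  in A.eitherResult (A.feed (A.feed (A.parse p front) back) B.empty)

main :: IO ()
main = do
  let input  = "name,age\nalice,30\n"
      field  = A.takeWhile (\c -> c /= ',' && c /= '\n')
      parser = (field `A.sepBy` A.char ',') `A.sepBy` A.endOfLine
      whole  = parseWhole parser input
      splits = [ parseSplit parser n input | n <- [0 .. B.length input] ]
  -- Every chunking of the same input must produce the same result.
  mapM_ (\r -> if r == whole
                 then pure ()
                 else error ("mismatch: " ++ show r)) splits
  putStrLn "all splits agree"
```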
Awesome work @bos, thanks!
I should mention that I released a new version of attoparsec already that has this fixed. Sorry for the inconvenience!
changelog:

0.12.1.2

* Fixed the incorrect tracking of capacity if the initial buffer was empty (haskell/attoparsec#75)

0.12.1.1

* Fixed a data corruption bug that occurred under some circumstances if a buffer grew after prompting for more input (haskell/attoparsec#74)

0.12.1.0

* Now compatible with GHC 7.9
* Reintroduced the Chunk class, used by the parsers package

0.12.0.0

* A new internal representation makes almost all real-world parsers faster, sometimes by big margins. For example, parsing JSON data with aeson is now up to 70% faster. These performance improvements also come with reduced memory consumption and some new capabilities.
* The new match combinator gives both the result of a parse and the input that it matched.
* The test suite has doubled in size. This made it possible to switch to the new internal representation with a decent degree of confidence that everything was more or less working.
* The benchmark suite now contains a small family of benchmarks taken from real-world uses of attoparsec.
* A few types that ought to have been private now are.
* A few obsolete modules and functions have been marked as deprecated. They will be removed from the next major release.

0.11.3.0

* New function scientific is compatible with rational, but parses integers more efficiently (haskell/aeson#198)

0.11.2.0

* The new Chunk typeclass allows for some code sharing with Ed Kmett's parsers package: http://hackage.haskell.org/package/parsers
* New function runScanner generalises scan to return the final state of the scanner as well as the input consumed.
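Of the new combinators mentioned in the 0.12.0.0 notes, `match` is the one most relevant to debugging issues like this: it returns both a parser's result and the exact bytes that parser consumed, which makes it easy to see where in the input a parse went wrong. A small sketch (the parser and input here are illustrative, not from this issue):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Attoparsec.ByteString.Char8 as A

main :: IO ()
main =
  -- `A.match A.double` yields the consumed bytes alongside the number;
  -- `parseOnly` does not require the parser to consume all of its input,
  -- so the trailing "xyz" is simply left unread.
  print (A.parseOnly (A.match A.double) "3.14xyz")
```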
I believe I've found a bug in attoparsec and I need some help writing a failing test. Some background:
I have a tool that processes CSV files using Cassava. After upgrading attoparsec I encountered a regression in my tool while processing production files. The problem only happens in some of the CSV files, and only when the files are big enough to be fed into attoparsec in chunks. Feeding only whole lines into attoparsec works fine.
After some experimenting and some help from `git bisect`, I've tracked the issue down to commit 791e046c526710dce7b87e308ee48f2fb6811d7b. Before that commit all my regression tests pass; after that commit they fail every time.

I haven't yet been able to shrink this down to a small, reproducible test case. The tool reads chunks of `ByteString.Char8` values using `io-streams` and feeds them into Cassava, which feeds them into attoparsec. The smallest CSV file that produces the problem is about 2 MB.

Due to time constraints I don't have the spare cycles to track this down further right now. Since I can lock cabal down to the last working version of attoparsec, it's not a show stopper for me. That said, if someone else wants to chase this down I'd be happy to try out patches or provide more detail.