
Parse failure when crossing chunk boundaries #74

Closed · pjones opened this issue Aug 4, 2014 · 6 comments

pjones commented Aug 4, 2014

I believe I've found a bug in attoparsec and I need some help writing a failing test. Some background:

I have a tool that processes CSV files using Cassava. After upgrading attoparsec I encountered a regression in my tool while processing production files. The problem only happens in some of the CSV files, and only when the files are big enough to be fed into attoparsec in chunks. Feeding only whole lines into attoparsec works fine.

After some experimenting and some help from git bisect, I've tracked the issue down to commit 791e046c526710dce7b87e308ee48f2fb6811d7b. Before that commit all my regression tests pass. After that commit they fail every time.

I haven't yet been able to shrink this down to a small, reproducible test case. The tool reads chunks of ByteString.Char8 values using io-streams and feeds them into Cassava, which feeds them into attoparsec. The smallest CSV file that produces the problem is about 2MB.
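For context, attoparsec's incremental interface works roughly like this. Here is a minimal sketch, not my actual tool; the parser and the chunk boundary are invented, but it shows how the same input can reach attoparsec whole or split across a chunk boundary:

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Minimal sketch of attoparsec's incremental interface; the parser and
-- chunk boundary are invented for illustration.
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.ByteString.Char8 as B

-- Toy parser for one CSV-ish field: characters up to a comma or newline.
field :: A.Parser B.ByteString
field = A.takeWhile (\c -> c /= ',' && c /= '\n')

main :: IO ()
main = do
  -- Parse the same input whole and split in two; a correct parser
  -- must give the same answer either way.
  let whole   = A.parse field "hello,world"
      chunked = A.feed (A.parse field "hel") "lo,world"
  print whole    -- Done ",world" "hello"
  print chunked  -- Done ",world" "hello"
```

A parser should never be able to observe where the chunk boundaries fall; my regression suggests that invariant broke.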

Due to time constraints I don't have the spare cycles to track this down further right now. Since I can lock cabal down to the last working version of attoparsec, it's not a show-stopper for me. That said, if someone else wants to chase this down, I'd be happy to try out patches or provide more detail.

basvandijk (Member) commented

Hi @pjones, thanks for reporting this!

Would it be possible to share your tool's source code with us? If not, for confidentiality reasons for example, could you share a version with the sensitive parts stripped out?

It would also be helpful if you could share the CSV file and the error message you are getting.

Thanks!

pjones (Author) commented Aug 5, 2014

Here's a small program and CSV file which reproduces the bug:

http://www.pmade.com/static/tmp/boundary-bug.tar.bz2

I'll delete that tarball after a couple of days.

The error message is "parse error (endOfInput)", which I suspect is constructed partly by the cassava Parser and partly by attoparsec.

bos (Collaborator) commented Aug 7, 2014

Thanks for the repro, @pjones. The error is being reported from Data.Csv.Incremental.decodeWithP. In fact, merely finding that location took me a while, as there were several packages that could have been reporting that exact error message.

I've been able to cut your CSV file down to 29% of its original size while preserving the behaviour. No idea yet what's actually going on, but a smaller test case has to be strictly better than a larger one ... right?

(Oh, and cc @tibbe for good measure.)

bos (Collaborator) commented Aug 7, 2014

I had a wild notion that the problem must have something to do with the sizes of the chunks of data being fed to the parser, and amazingly this turned out to be a helpful direction to pursue.

I've now got the failing input trimmed down to just two lines: the header and a row of actual data. This gives a fighting chance at figuring out what's wrong.
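The harness I'm describing looks roughly like this. It's a sketch rather than the actual test, with a stand-in parser and input, and it compares results via show since Result has no Eq instance:

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch of a chunk-boundary sweep: parse the input whole, then parse
-- it split at every possible position, reporting any split whose
-- result differs from the whole-input result.
import Control.Monad (forM_, unless)
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.ByteString.Char8 as B

-- Feed the entire input at once, then signal end of input.
runWhole :: Show a => A.Parser a -> B.ByteString -> String
runWhole p input = show (A.feed (A.parse p input) B.empty)

-- Feed the input as two chunks split at position i.
runSplit :: Show a => A.Parser a -> Int -> B.ByteString -> String
runSplit p i input =
  let (front, back) = B.splitAt i input
  in show (A.feed (A.feed (A.parse p front) back) B.empty)

main :: IO ()
main = do
  let input    = "name,count\nwidget,42\n"
      -- Toy line-oriented parser standing in for the real CSV parser.
      p        = A.many' (A.takeWhile1 (/= '\n') <* A.char '\n')
      expected = runWhole p input
  forM_ [0 .. B.length input] $ \i ->
    unless (runSplit p i input == expected) $
      putStrLn ("mismatch at split " ++ show i)
```

On a correct attoparsec this prints nothing; a chunk-boundary bug shows up as mismatches at particular split positions.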

bos closed this as completed in 381ad6a on Aug 7, 2014
pjones (Author) commented Aug 7, 2014

Awesome work @bos, thanks!

bos (Collaborator) commented Aug 7, 2014

I should mention that I've already released a new version of attoparsec with this fix. Sorry for the inconvenience!

mamash pushed a commit to TritonDataCenter/pkgsrc-wip that referenced this issue Sep 6, 2014
changelog:
0.12.1.2

* Fixed the incorrect tracking of capacity if the initial buffer was
  empty (haskell/attoparsec#75)

0.12.1.1

* Fixed a data corruption bug that occurred under some circumstances
  if a buffer grew after prompting for more input
  (haskell/attoparsec#74)

0.12.1.0

* Now compatible with GHC 7.9

* Reintroduced the Chunk class, used by the parsers package

0.12.0.0

* A new internal representation makes almost all real-world parsers
  faster, sometimes by big margins.  For example, parsing JSON data
  with aeson is now up to 70% faster.  These performance improvements
  also come with reduced memory consumption and some new capabilities.

* The new match combinator gives both the result of a parse and the
  input that it matched.

* The test suite has doubled in size.  This made it possible to switch
  to the new internal representation with a decent degree of
  confidence that everything was more or less working.

* The benchmark suite now contains a small family of benchmarks taken
  from real-world uses of attoparsec.

* A few types that ought to have been private now are.

* A few obsolete modules and functions have been marked as deprecated.
  They will be removed from the next major release.

0.11.3.0

* New function scientific is compatible with rational, but parses
  integers more efficiently (haskell/aeson#198)

0.11.2.0

* The new Chunk typeclass allows for some code sharing with Ed
  Kmett's parsers package: http://hackage.haskell.org/package/parsers

* New function runScanner generalises scan to return the final state
  of the scanner as well as the input consumed.
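For reference, here are minimal sketches of a few of the functions named in the changelog above, written against their documented types; the inputs are invented:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.ByteString.Char8 as B

main :: IO ()
main = do
  -- match pairs a parser's result with the exact input it consumed.
  print (A.parseOnly (A.match A.decimal) "12345 rest"
           :: Either String (B.ByteString, Int))
  -- runScanner generalises scan: it returns the input consumed
  -- together with the scanner's final state, here a digit count.
  let countDigits n c = if A.isDigit c then Just (n + 1 :: Int) else Nothing
  print (A.parseOnly (A.runScanner 0 countDigits) "123abc")
  -- scientific parses numbers into a Scientific value, like rational
  -- but handling integers more efficiently.
  print (A.parseOnly A.scientific "2.5e3")
```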