
Parse failure when crossing chunk boundaries #74

Closed · pjones opened this issue Aug 4, 2014 · 6 comments

pjones commented Aug 4, 2014

I believe I've found a bug in attoparsec and I need some help writing a failing test. Some background:

I have a tool that processes CSV files using Cassava. After upgrading attoparsec I encountered a regression in my tool while processing production files. The problem only happens in some of the CSV files, and only when the files are big enough to be fed into attoparsec in chunks. Feeding only whole lines into attoparsec works fine.

After some experimenting and some help from git bisect, I've tracked the issue down to commit 791e046c526710dce7b87e308ee48f2fb6811d7b. Before that commit all my regression tests pass. After that commit they fail every time.

I haven't yet been able to shrink this down to a small, reproducible test case. The tool reads chunks of ByteString.Char8 values using io-streams and feeds them into Cassava, which feeds them into attoparsec. The smallest CSV file that produces the problem is about 2MB.
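For context, attoparsec's incremental interface works roughly like this. Here is a minimal sketch, not my actual tool; the parser and the chunk boundary are invented, but it shows how the same input can reach attoparsec whole or split across a chunk boundary:

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Minimal sketch of attoparsec's incremental interface; the parser and
-- chunk boundary are invented for illustration.
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.ByteString.Char8 as B

-- Toy parser for one CSV-ish field: characters up to a comma or newline.
field :: A.Parser B.ByteString
field = A.takeWhile (\c -> c /= ',' && c /= '\n')

main :: IO ()
main = do
  -- Parse the same input whole and split in two; a correct parser
  -- must give the same answer either way.
  let whole   = A.parse field "hello,world"
      chunked = A.feed (A.parse field "hel") "lo,world"
  print whole    -- Done ",world" "hello"
  print chunked  -- Done ",world" "hello"
```

A parser should never be able to observe where the chunk boundaries fall; my regression suggests that invariant broke.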

Due to time constraints I don't have the spare cycles to track this down further right now. Since I can lock cabal down to the last working version of attoparsec, it's not a show-stopper for me. That said, if someone else wants to chase this down, I'd be happy to try out patches or provide more detail.

basvandijk (Member) commented

Hi @pjones, thanks for reporting this!

Would it be possible to share your tool's source code with us? If not, for confidentiality reasons for example, could you share a version with the sensitive parts stripped out?

It would also be helpful if you could share the CSV file and the error message you are getting.

Thanks!

pjones (Author) commented Aug 5, 2014

Here's a small program and CSV file which reproduces the bug:

http://www.pmade.com/static/tmp/boundary-bug.tar.bz2

I'll delete that tarball after a couple of days.

The error message is "parse error (endOfInput)", which I suspect is constructed partly by the cassava Parser and partly by attoparsec.

bos (Collaborator) commented Aug 7, 2014

Thanks for the repro, @pjones. The error is being reported from Data.Csv.Incremental.decodeWithP. In fact, merely finding that location took me a while, as there were several packages that could have been reporting that exact error message.

I've been able to cut your CSV file down to 29% of its original size while preserving the behaviour. No idea yet what's actually going on, but a smaller test case has to be strictly better than a larger one ... right?

(Oh, and cc @tibbe for good measure.)

bos (Collaborator) commented Aug 7, 2014

I had a wild notion that the problem must have something to do with the sizes of the chunks of data being fed to the parser, and amazingly this turned out to be a helpful direction to pursue.

I've now got the failing input trimmed down to just two lines: the header and a row of actual data. This gives a fighting chance at figuring out what's wrong.
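The harness I'm describing looks roughly like this. It's a sketch rather than the actual test, with a stand-in parser and input, and it compares results via show since Result has no Eq instance:

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch of a chunk-boundary sweep: parse the input whole, then parse
-- it split at every possible position, reporting any split whose
-- result differs from the whole-input result.
import Control.Monad (forM_, unless)
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.ByteString.Char8 as B

-- Feed the entire input at once, then signal end of input.
runWhole :: Show a => A.Parser a -> B.ByteString -> String
runWhole p input = show (A.feed (A.parse p input) B.empty)

-- Feed the input as two chunks split at position i.
runSplit :: Show a => A.Parser a -> Int -> B.ByteString -> String
runSplit p i input =
  let (front, back) = B.splitAt i input
  in show (A.feed (A.feed (A.parse p front) back) B.empty)

main :: IO ()
main = do
  let input    = "name,count\nwidget,42\n"
      -- Toy line-oriented parser standing in for the real CSV parser.
      p        = A.many' (A.takeWhile1 (/= '\n') <* A.char '\n')
      expected = runWhole p input
  forM_ [0 .. B.length input] $ \i ->
    unless (runSplit p i input == expected) $
      putStrLn ("mismatch at split " ++ show i)
```

On a correct attoparsec this prints nothing; a chunk-boundary bug shows up as mismatches at particular split positions.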

bos closed this as completed in 381ad6a on Aug 7, 2014
pjones (Author) commented Aug 7, 2014

Awesome work @bos, thanks!

bos (Collaborator) commented Aug 7, 2014

I should mention that I've already released a new version of attoparsec with this fix. Sorry for the inconvenience!

mamash pushed a commit to TritonDataCenter/pkgsrc-wip that referenced this issue Sep 6, 2014
changelog:
0.12.1.2

* Fixed the incorrect tracking of capacity if the initial buffer was
  empty (haskell/attoparsec#75)

0.12.1.1

* Fixed a data corruption bug that occurred under some circumstances
  if a buffer grew after prompting for more input
  (haskell/attoparsec#74)

0.12.1.0

* Now compatible with GHC 7.9

* Reintroduced the Chunk class, used by the parsers package

0.12.0.0

* A new internal representation makes almost all real-world parsers
  faster, sometimes by big margins.  For example, parsing JSON data
  with aeson is now up to 70% faster.  These performance improvements
  also come with reduced memory consumption and some new capabilities.

* The new match combinator gives both the result of a parse and the
  input that it matched.

* The test suite has doubled in size.  This made it possible to switch
  to the new internal representation with a decent degree of
  confidence that everything was more or less working.

* The benchmark suite now contains a small family of benchmarks taken
  from real-world uses of attoparsec.

* A few types that ought to have been private now are.

* A few obsolete modules and functions have been marked as deprecated.
  They will be removed from the next major release.

0.11.3.0

* New function scientific is compatible with rational, but parses
  integers more efficiently (haskell/aeson#198)

0.11.2.0

* The new Chunk typeclass allows for some code sharing with Ed
  Kmett's parsers package: http://hackage.haskell.org/package/parsers

* New function runScanner generalises scan to return the final state
  of the scanner as well as the input consumed.
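For reference, here are minimal sketches of a few of the functions named in the changelog above, written against their documented types; the inputs are invented:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.ByteString.Char8 as B

main :: IO ()
main = do
  -- match pairs a parser's result with the exact input it consumed.
  print (A.parseOnly (A.match A.decimal) "12345 rest"
           :: Either String (B.ByteString, Int))
  -- runScanner generalises scan: it returns the input consumed
  -- together with the scanner's final state, here a digit count.
  let countDigits n c = if A.isDigit c then Just (n + 1 :: Int) else Nothing
  print (A.parseOnly (A.runScanner 0 countDigits) "123abc")
  -- scientific parses numbers into a Scientific value, like rational
  -- but handling integers more efficiently.
  print (A.parseOnly A.scientific "2.5e3")
```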