
Complete rewrite #17

Closed

BurningWitness wants to merge 19 commits into GaloisInc:master from BurningWitness:apotheosis

Conversation


@BurningWitness BurningWitness commented Aug 4, 2023

What led me here

  1. I've been using aeson at work for the past four years and I thoroughly dislike it;
  2. I wrote a small pretty-printer a month ago and it suddenly turned out aeson can't stream requests. The expected way of implementing streaming is, I guess, json-stream, a library that has a big ol' "actually we cheat a lot" disclaimer at the bottom;
  3. There's probably lots of cool things that can be done here;
  4. There's a library called json that has been dead for three years (I started on July 6th).

What this rewrite does not have

  • No unifying typeclass (ToJSON/FromJSON) and hence no GHC.Generics.

    I enjoy handrolling instances, and the idea of "the one true instance" gets in the way constantly. Generics also have an incredibly lame issue of inheriting pair names from record fields, which are rather stringent in what they allow. Both of these are things a higher-level library can provide, hopefully in a more generalized way.

  • No unifying datatype (Value).

    Adding something like this is completely optional and highly ambiguous depending on how much information about the JSON we wish to convey (do we remove duplicates from container values? Is whitespace critical? Should numbers be stored denormalized if they were delivered that way?). Also raw JSON already conveys all of this info by itself.

Corners explored while rewriting

  • attoparsec's API is... weird:

    • fail prepends extra text to the input; satisfy alters the parser context when failing.
    • There's peekWord8, but there's no advance. I have to anyWord8 the character to consume it, so I have to hope the compiler is diligent enough to compile peekWord8 >> anyWord8 into peekWord8 >> advance 1.
    • take and match return a StrictByteString, meaning that if the parser ever needs to backtrack ten thousand bytes, those ten thousand bytes have to be reallocated into one contiguous block of memory.

    The solutions taken to address each of these are:

    • err is defined to replace fail. This pushes attoparsec's lower bound up to 0.13, since that's when the library exposed Data.Attoparsec.Internal.Types;
    • Every single parser in the rewrite consumes one byte at a time and never backtracks. The only functions used are peekWord8, peekWord8' and anyWord8;
    • Since matching is off the table, copying is performed by accumulating a chain of poke operations that are forced on chunk overflow.
  • attoparsec-iso8601 explicitly uses Text parsers. I rewrote the functions from scratch to fit the "one byte at a time" rule described above, so it's not a big issue;

  • empty (from Alternative) does not apply to any decoders. I use the semigroupoids package for the Alt typeclass instead.
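The "one byte at a time, never backtrack" discipline can be sketched without attoparsec at all. Here is a minimal hand-rolled parser over a byte list, base only; all names are illustrative and not the rewrite's actual API:

```haskell
import Data.Word (Word8)

-- A minimal parser in the same spirit: peek without consuming,
-- consume exactly one byte, never back up.
newtype P a = P { runP :: [Word8] -> Either String (a, [Word8]) }

-- Look at the next byte without consuming it (Nothing at end of input).
peekByte :: P (Maybe Word8)
peekByte = P $ \bs -> Right (case bs of { b:_ -> Just b; [] -> Nothing }, bs)

-- Consume exactly one byte.
anyByte :: P Word8
anyByte = P $ \bs -> case bs of
  b:rest -> Right (b, rest)
  []     -> Left "unexpected end of input"

-- Skip ASCII whitespace one byte at a time.
skipSpace :: P ()
skipSpace = P go
  where
    go (b:rest) | b == 0x20 || b == 0x09 || b == 0x0A || b == 0x0D = go rest
    go bs = Right ((), bs)
```

Because every primitive advances at most one byte and nothing is ever un-consumed, no chunk of input ever has to be retained for backtracking.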

What this rewrite provides

  • Basic JSON decoding and encoding. Decoding consumes input lazily, encoding produces output lazily.

  • Streaming JSON decoding out of the box. For the overwhelming majority of users this will just be arrays, however streaming can be nested (see test/conformance/Test/JSON/Decode/Stream.hs for two-dimensional arrays and objects);

  • Four distinct ways to slice up an object:

    • Read the entire object, shove every key into a radix tree, then resolve the object parser. This object decoder is referred to as Plain, it's the most powerful and also the most wasteful;
    • Analyze the object parser for keys, read the entire object, shove every key we know we need into a radix tree, then resolve the object parser. This object decoder is referred to as Bounded and it's not a Monad. Branching is supported through Selective;
    • If the object parser is known to never branch: analyze the object parser for keys and, so long as they're all unique, all the decoding can be run inline. This object decoder is referred to as Linear; it's strictly Applicative.
    • Fold left-to-right (both a basic and a streaming variant);
  • Two distinct ways to slice up an array:

    • Consume elements left to right. This array decoder is referred to as Elements and it mostly exists for weird encodings à la aeson's handling of tuples;
    • Fold left-to-right (both a basic and a streaming variant);
  • String parsing with inlined UTF-8 decoding. This caused quite a lot of code duplication, but it's also relatively easy to test rigorously, which I did;

  • O(n) complexity for sized number parsing. Since we know the size beforehand, integers can fail early and floats can skip extra precision;

  • Time and UUID decoding and encoding. Both are straightforward to implement and show up often enough to be included;

  • Every primitive can be decoded with a skip (discarding the result) and decoded raw (input copied verbatim). For testing sanity on verbatim copies all the whitespace is preserved.
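The early-failure idea behind the sized-number parsing can be illustrated with a plain decimal accumulator (a sketch on Word8, not the library's actual code): before folding in each digit, compare against maxBound so overflow is caught without ever leaving the target width.

```haskell
import Data.Word (Word8)

-- Fold one decimal digit into the accumulator, failing as soon as the
-- digit would overflow Word8 (maxBound = 255). Only comparisons on the
-- target type are needed.
stepDigit :: Word8 -> Word8 -> Maybe Word8
stepDigit acc d
  | acc >  maxBound `div` 10 = Nothing
  | acc == maxBound `div` 10
  , d   >  maxBound `mod` 10 = Nothing
  | otherwise                = Just (acc * 10 + d)

-- Parse ASCII decimal digits into a Word8, failing on overflow or on
-- any non-digit byte.
parseWord8 :: [Word8] -> Maybe Word8
parseWord8 = foldl step (Just 0)
  where
    step macc b
      | 0x30 <= b, b <= 0x39 = macc >>= \acc -> stepDigit acc (b - 0x30)
      | otherwise            = Nothing
```

Since the check runs per digit, a too-large input fails on the first offending digit rather than after consuming the whole number.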

Extra things I did (or didn't)

  • Every declaration is documented.

  • None of this has been properly optimized or benchmarked. I assume it's relatively fast, simply because there's not much to do in a byte-by-byte decoder, but some algorithms are probably suboptimal;

  • Tests, quite a few of them;

  • Dependency boundaries correspond to GHC 8.8 (GHC 8.6 was the last version before the MonadFail proposal was fully implemented, and it also didn't have NumericUnderscores), with a hard requirement of text at version 2.0 or higher (that's when it switched to UTF-8 internally). With the text bound this high, the effective lower bound is most probably going to be GHC 9.4+ on average;

  • Travis was replaced with CircleCI. This project compiles on GHC 8.8.4 and the latest release of every major version since. I don't think there are any platform-dependent issues with this rewrite, as both JSON and UTF-8 are byte-order-independent.

What could probably be added

  • Custom parser that stores all backtrackable chunks lazily as a difference list, perhaps even using the fresh new delimited continuations. This would have to be a separate library for both sanity and performance, and thankfully attoparsec is good enough for the job as is.

  • Non-UTF8 decoding/encoding. This shouldn't be hard to implement with GHC.IO.Encoding, but it's a rare side case and a generic solution needs recursive copying, which is verbose and needs non-trivial testing;

  • The radix tree probably belongs in its own library, however there is no proper "radix tree containers" library I know of;

  • time decoding probably belongs in attoparsec-iso8601. I implement a few more functions than that library does, and our two implementations only align on dependencies, so I assume it would take more than just a PR with my changes;

  • Number parsing ideally should use the carry flag instead of maxBound shenanigans; however, GHC.Exts only exposes that functionality for Int# and Word#, not for any sized types. While doing this through the C FFI would be trivial, I don't want to pollute the library with it. Oh, and apparently even C doesn't provide direct access to the carry flag; the correct way to handle this is through lame comparisons. Color me surprised.
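For reference, the comparison approach is short: for fixed-width unsigned types, a wrapped sum is smaller than either operand exactly when a carry occurred. A sketch on Word16 (one of the sized types with no carry primop), base only:

```haskell
import Data.Word (Word16)

-- Portable carry detection by comparison: s = a + b wraps modulo 2^16,
-- and wraparound happened iff the wrapped sum is smaller than an operand.
addCarry :: Word16 -> Word16 -> (Word16, Bool)
addCarry a b = let s = a + b in (s, s < a)

-- The multiplication analogue checks against maxBound via division
-- before multiplying, so the product never has to leave 16 bits.
mulOverflows :: Word16 -> Word16 -> Bool
mulOverflows _ 0 = False
mulOverflows a b = a > maxBound `div` b
```

These are exactly the "lame comparisons" a C compiler would also have to emit absent an intrinsic.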

@BurningWitness

Succeeded by #18.

@BurningWitness BurningWitness deleted the apotheosis branch October 25, 2024 20:06