
Complete rewrite #17

Closed

BurningWitness wants to merge 19 commits into GaloisInc:master from BurningWitness:apotheosis

Conversation


@BurningWitness BurningWitness commented Aug 4, 2023

What led me here

  1. I've been using aeson at work for the past four years and I thoroughly dislike it;
  2. I wrote a small pretty-printer a month ago and it suddenly turned out aeson can't stream requests. The expected way of implementing streaming is, I guess, json-stream, a library that has a big ol' "actually we cheat a lot" disclaimer at the bottom;
  3. There's probably lots of cool things that can be done here;
  4. There's a library called json that has been dead for three years (I started on July 6th).

What this rewrite does not have

  • No unifying typeclass (ToJSON/FromJSON) and hence no GHC.Generics.

    I enjoy handrolling instances, and the idea of "the one true instance" gets in the way constantly. Generics also have an incredibly lame issue of inheriting pair names from record fields, which are rather stringent in what they allow. Both of these are things a higher-level library can provide, hopefully in a more generalized way.

  • No unifying datatype (Value).

    Adding something like this is completely optional and highly ambiguous depending on how much information about the JSON we wish to convey (do we remove duplicates from container values? Is whitespace critical? Should numbers be stored denormalized if they were delivered that way?). Also raw JSON already conveys all of this info by itself.

Corners explored while rewriting

  • attoparsec's API is... weird:

    • fail prepends extra text to the input; satisfy alters the parser context when failing.
    • There's peekWord8, but there's no advance. I have to anyWord8 the character to consume it, so I have to hope the compiler is diligent enough to compile peekWord8 >> anyWord8 into peekWord8 >> advance 1.
    • take and match return a StrictByteString, meaning that if the parser ever needs to backtrack ten thousand bytes, those ten thousand bytes have to be reallocated into one contiguous block of memory.

    The solutions taken to address each of these are:

    • err is defined to replace fail. This pushes attoparsec's lower bound up to 0.13, since that's when the library exposed Data.Attoparsec.Internal.Types;
    • Every single parser in the rewrite consumes one byte at a time and never backtracks. The only functions used are peekWord8, peekWord8' and anyWord8;
    • Since matching is off the table, copying is performed by accumulating a chain of poke operations that are forced on chunk overflow.
  • attoparsec-iso8601 explicitly uses Text parsers. I rewrote the functions from scratch to fit the "one byte at a time" rule described above, so it's not a big issue;

  • empty (from Alternative) does not apply to any decoders. I use the semigroupoids package for the Alt typeclass instead.
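The "one byte at a time, never backtrack" discipline can be sketched without attoparsec at all. Here is a minimal hand-rolled parser over a byte list, base only; all names are illustrative and not the rewrite's actual API:

```haskell
import Data.Word (Word8)

-- A minimal parser in the same spirit: peek without consuming,
-- consume exactly one byte, never back up.
newtype P a = P { runP :: [Word8] -> Either String (a, [Word8]) }

-- Look at the next byte without consuming it (Nothing at end of input).
peekByte :: P (Maybe Word8)
peekByte = P $ \bs -> Right (case bs of { b:_ -> Just b; [] -> Nothing }, bs)

-- Consume exactly one byte.
anyByte :: P Word8
anyByte = P $ \bs -> case bs of
  b:rest -> Right (b, rest)
  []     -> Left "unexpected end of input"

-- Skip ASCII whitespace one byte at a time.
skipSpace :: P ()
skipSpace = P go
  where
    go (b:rest) | b == 0x20 || b == 0x09 || b == 0x0A || b == 0x0D = go rest
    go bs = Right ((), bs)
```

Because every primitive advances at most one byte and nothing is ever un-consumed, no chunk of input ever has to be retained for backtracking.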

What this rewrite provides

  • Basic JSON decoding and encoding. Decoding consumes input lazily, encoding produces output lazily.

  • Streaming JSON decoding out of the box. For the overwhelming majority of users this will just be arrays, however streaming can be nested (see test/conformance/Test/JSON/Decode/Stream.hs for two-dimensional arrays and objects);

  • Four distinct ways to slice up an object:

    • Read the entire object, shove every key into a radix tree, then resolve the object parser. This object decoder is referred to as Plain, it's the most powerful and also the most wasteful;
    • Analyze the object parser for keys, read the entire object, shove every key we know we need into a radix tree, then resolve the object parser. This object decoder is referred to as Bounded and it's not a Monad. Branching is supported through Selective;
    • If the object parser is known to never branch: analyze the object parser for keys and, so long as they're all unique, all the decoding can be run inline. This object decoder is referred to as Linear; it's strictly Applicative.
    • Fold left-to-right (both a basic and a streaming variant);
  • Two distinct ways to slice up an array:

    • Consume elements left to right. This array decoder is referred to as Elements and it mostly exists for weird encodings à la aeson's handling of tuples;
    • Fold left-to-right (both a basic and a streaming variant);
  • String parsing with inlined UTF-8 decoding. This caused quite a lot of code duplication, but it's also relatively easy to test rigorously, which I did;

  • O(n) complexity for sized number parsing. Since we know the size beforehand, integers can fail early and floats can skip extra precision;

  • Time and UUID decoding and encoding. Both are straightforward to implement and show up often enough to be included;

  • Every primitive can be decoded with a skip (discarding the result) and decoded raw (input copied verbatim). For testing sanity on verbatim copies all the whitespace is preserved.
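The early-failure idea behind the sized-number parsing can be illustrated with a plain decimal accumulator (a sketch on Word8, not the library's actual code): before folding in each digit, compare against maxBound so overflow is caught without ever leaving the target width.

```haskell
import Data.Word (Word8)

-- Fold one decimal digit into the accumulator, failing as soon as the
-- digit would overflow Word8 (maxBound = 255). Only comparisons on the
-- target type are needed.
stepDigit :: Word8 -> Word8 -> Maybe Word8
stepDigit acc d
  | acc >  maxBound `div` 10 = Nothing
  | acc == maxBound `div` 10
  , d   >  maxBound `mod` 10 = Nothing
  | otherwise                = Just (acc * 10 + d)

-- Parse ASCII decimal digits into a Word8, failing on overflow or on
-- any non-digit byte.
parseWord8 :: [Word8] -> Maybe Word8
parseWord8 = foldl step (Just 0)
  where
    step macc b
      | 0x30 <= b, b <= 0x39 = macc >>= \acc -> stepDigit acc (b - 0x30)
      | otherwise            = Nothing
```

Since the check runs per digit, a too-large input fails on the first offending digit rather than after consuming the whole number.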

Extra things I did (or didn't)

  • Every declaration is documented.

  • None of this has been properly optimized or benchmarked. I assume it's relatively fast, simply because there's not much to do in a byte-by-byte decoder, but some algorithms are probably suboptimal;

  • Tests, quite a few of them;

  • Dependency boundaries correspond to GHC 8.8 (GHC 8.6 was the last version before the MonadFail proposal was fully implemented, and it also didn't have NumericUnderscores), with a hard requirement of text at version 2.0 or higher (that's when it switched to UTF-8 internally). With the text bound this high, the effective lower bound is most probably going to be GHC 9.4+ on average;

  • Travis was replaced with CircleCI. This project compiles on GHC 8.8.4 and the latest release of every major version since. I don't think there are any platform-dependent issues with this rewrite, as both JSON and UTF-8 are byte-order-independent.

What could probably be added

  • Custom parser that stores all backtrackable chunks lazily as a difference list, perhaps even using the fresh new delimited continuations. This would have to be a separate library for both sanity and performance, and thankfully attoparsec is good enough for the job as is.

  • Non-UTF8 decoding/encoding. This shouldn't be hard to implement with GHC.IO.Encoding, but it's a rare side case and a generic solution needs recursive copying, which is verbose and needs non-trivial testing;

  • The radix tree probably belongs in its own library, however there is no proper "radix tree containers" library I know of;

  • time decoding probably belongs in attoparsec-iso8601. I implement a few more functions than that library does, and our two implementations only align on dependencies, so I assume it would take more than just a PR with my changes;

  • Number parsing ideally should use the carry flag instead of maxBound shenanigans; however, GHC.Exts only exposes that functionality for Int# and Word#, not for any sized types. While doing this through the C FFI would be trivial, I don't want to pollute the library with it. Oh, and apparently even C doesn't provide direct access to the carry flag; the correct way to handle this is through lame comparisons. Color me surprised.
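For reference, the comparison approach is short: for fixed-width unsigned types, a wrapped sum is smaller than either operand exactly when a carry occurred. A sketch on Word16 (one of the sized types with no carry primop), base only:

```haskell
import Data.Word (Word16)

-- Portable carry detection by comparison: s = a + b wraps modulo 2^16,
-- and wraparound happened iff the wrapped sum is smaller than an operand.
addCarry :: Word16 -> Word16 -> (Word16, Bool)
addCarry a b = let s = a + b in (s, s < a)

-- The multiplication analogue checks against maxBound via division
-- before multiplying, so the product never has to leave 16 bits.
mulOverflows :: Word16 -> Word16 -> Bool
mulOverflows _ 0 = False
mulOverflows a b = a > maxBound `div` b
```

These are exactly the "lame comparisons" a C compiler would also have to emit absent an intrinsic.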

@BurningWitness

Succeeded by #18.

@BurningWitness BurningWitness deleted the apotheosis branch October 25, 2024 20:06