Succeeded by #18.
## What led me here
- I've been using `aeson` at work for the past four years and I thoroughly dislike it;
- `aeson` can't stream requests. The expected way of implementing streaming is, I guess, `json-stream`, a library that has a big ol' "actually we cheat a lot" disclaimer at the bottom of it;
- there are probably lots of cool things that can be done here;
- the `json` package has been dead for three years (I started on July 6th).

## What this rewrite does not have
- No unifying typeclass (`ToJSON`/`FromJSON`) and hence no `GHC.Generics`. I enjoy handrolling instances, and the idea of "the one true instance" gets in the way constantly. `Generics` also have an incredibly lame issue of inheriting pair names from record fields, which are rather stringent in what they allow. Both of these are something a higher-level library can provide, hopefully in a more generalized way.
- No unifying datatype (`Value`). Adding something like this is completely optional and highly ambiguous depending on how much information about the JSON we wish to convey (do we remove duplicates from container values? Is whitespace critical? Should numbers be stored denormalized if they were delivered that way?). Also, raw JSON already conveys all of this info by itself.
## Corners explored while rewriting
`attoparsec`'s API is... weird:

- `fail` prepends extra text to the input, and `satisfy` alters the parser context when failing;
- there's `peekWord8`, but there's no `advance`. I have to `anyWord8` the character to consume it, so I have to hope the compiler is diligent enough to compile `peekWord8 >> anyWord8` into `peekWord8 >> advance 1`;
- `take` and `match` return a `StrictByteString`, meaning if the parser ever needs to backtrack ten thousand bytes, those ten thousand bytes have to be reallocated into one continuous block of memory.

The solutions taken to address each of these are:

- `err` is defined to replace `fail`. This pushes `attoparsec`'s lower boundary up to `0.13`, since that's when that library revealed `Data.Attoparsec.Internal.Types`;
- input is consumed one byte at a time, using only `peekWord8`, `peekWord8'` and `anyWord8`;
- `match`ing is off the table; copying is performed by accumulating a chain of `poke` operations that are forced on chunk overflow.

Two more corners:

- `attoparsec-iso8601` uses explicitly `Text` parsers. I rewrote the functions from scratch to fit the "one byte at a time" rule described above, so it's not a big issue;
- `empty` (from `Alternative`) does not apply to any decoders. I use the `semigroupoids` package for the `Alt` typeclass instead.

## What this rewrite provides
- Basic JSON decoding and encoding. Decoding consumes input lazily; encoding produces output lazily;
- Streaming JSON decoding out of the box. For the overwhelming majority of users this will just be arrays, however streaming can be nested (see `test/conformance/Test/JSON/Decode/Stream.hs` for two-dimensional arrays and objects);
- Four distinct ways to slice up an object:
  - `Plain`, the most powerful and also the most wasteful;
  - `Bounded`, which is not a `Monad`. Branching is supported through `Selective`;
  - `Linear`, which is strictly `Applicative`;
- Two distinct ways to slice up an array:
  - `Elements`, which mostly exists for weird encodings a la `aeson`'s handling of tuples;
- String parsing with inlined UTF-8 decoding. This caused quite a lot of code duplication, but it's also relatively easy to test rigorously, which I did;
- `O(n)` complexity for sized number parsing. Since we know the size beforehand, integers can fail early and floats can skip extra precision;
- Time and `UUID` decoding and encoding. Both are straightforward to implement and show up often enough to be included;
- Every primitive can be decoded with a skip (discarding the result) and decoded raw (input copied verbatim). For testing sanity, all the whitespace is preserved on verbatim copies.
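The early-failure idea for sized numbers can be sketched with a base-only toy decoder (the names below are illustrative, not this library's actual API): accumulate digits and bail out as soon as the value would exceed the target type's `maxBound`, instead of reading the whole number first.

```haskell
module Main where

import Data.Word (Word8)

-- Toy decoder: read a Word8 from leading ASCII digits, failing the moment
-- the accumulator would exceed maxBound rather than after consuming the
-- entire number. Illustrative sketch only, not this library's API.
decodeWord8 :: String -> Maybe Word8
decodeWord8 = go 0
  where
    go acc []                        = Just acc
    go acc (c:cs)
      | c < '0' || c > '9'           = Nothing
      | otherwise =
          let d    = fromIntegral (fromEnum c - fromEnum '0') :: Word
              acc' = fromIntegral acc * 10 + d
          in if acc' > fromIntegral (maxBound :: Word8)
               then Nothing          -- early failure: too large for Word8
               else go (fromIntegral acc') cs

main :: IO ()
main = do
  print (decodeWord8 "200")   -- Just 200
  print (decodeWord8 "255")   -- Just 255
  print (decodeWord8 "256")   -- Nothing: overflow caught on the last digit
```

Because the accumulator is checked on every digit, a pathological input like a ten-thousand-digit number is rejected after at most four digits here, which is where the early-failure part of the `O(n)` claim comes from.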
## Extra things I did (or didn't)
- Every declaration is documented;
- None of this has been properly optimized or benchmarked. I assume it's relatively fast, simply because there's not much to do in a byte-by-byte decoder, but some algorithms are probably suboptimal;
- Tests, quite a few of them;
- Dependency boundaries correspond to GHC 8.8 (GHC 8.6 was the last version before the `MonadFail` proposal implementation, and it also didn't have `NumericUnderscores`), with a hard requirement of `text` being at version `2.0` or higher (that's when they switched to UTF-8). With the `text` bound this high, the effective bottom version is most probably going to be GHC 9.4+ on average;
- Travis was replaced with CircleCI. This project compiles on GHC 8.8.4 and every latest major version since. I don't think there are any platform-dependent issues with this rewrite, as both JSON and UTF-8 are byte-order-independent.
## What could probably be added
- A custom parser that stores all backtrackable chunks lazily as a difference list, perhaps even using the fresh new delimited continuations. This would have to be a separate library for both sanity and performance, and thankfully `attoparsec` is good enough for the job as is;
- Non-UTF-8 decoding/encoding. This shouldn't be hard to implement with `GHC.IO.Encoding`, but it's a rare side case, and a generic solution needs recursive copying, which is verbose and needs non-trivial testing;
- The radix tree probably belongs in its own library, however there is no proper "radix tree containers" library I know of;
- `time` decoding probably belongs in `attoparsec-iso8601`. I implement a few more functions than that library, and our two implementations only align on dependencies, so I assume it requires more than just a PR with my changes;
- Number parsing ideally should use the carry flag instead of `maxBound` shenanigans, however `GHC.Exts` only exposes that functionality for `Int#` and `Word#`, not for any sized types. While doing this with C FFI would be trivial, I don't want to pollute the library with it. Oh, apparently even C doesn't implement direct access to the carry flag; the correct way to handle this is through lame comparisons. Color me surprised.
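For illustration, the "lame comparisons" approach to overflow detection can look like the following in plain Haskell. `mulAddWouldOverflow` is a hypothetical helper, not code from this PR; the trick is rearranging the inequality so the check itself cannot overflow.

```haskell
import Data.Word (Word64)

-- Would acc * 10 + d wrap around a Word64? Without carry-flag access,
-- rewrite "acc * 10 + d > maxBound" as a comparison that stays in range:
-- acc * 10 + d <= maxBound  <=>  acc <= (maxBound - d) `div` 10.
-- (Hypothetical helper, not part of this PR.)
mulAddWouldOverflow :: Word64 -> Word64 -> Bool
mulAddWouldOverflow acc d = acc > (maxBound - d) `div` 10

main :: IO ()
main = do
  print (mulAddWouldOverflow 1 9)                       -- False: plenty of room
  print (mulAddWouldOverflow (maxBound `div` 10) 5)     -- False: lands exactly on maxBound
  print (mulAddWouldOverflow (maxBound `div` 10 + 1) 0) -- True: one step too far
```

The division is the price paid for not having a carry flag; a compiler will usually turn the division by the constant 10 into a multiply-and-shift, so in practice it is cheap, just inelegant.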