title |
---|
A Category for Correct-By-Construction Serializers and Deserializers |
Frequently in computer programming we need to work with data in different representations, and we need to work with the data on both sides of said representation. For example, we might have some Haskell data types in memory, which we later serialize to disk. When the user restarts our application, we need to reload this data back into Haskell data types, to allow them to resume work.
Haskell provides us with machinery for doing this serialization via the
binary
library, which gives us
the Binary
type class with two methods:
class Binary t where
put :: t -> Put
get :: Get t
get
deserializes a sequence of bytes into Haskell values, while put
operates
in the reverse direction - transforming Haskell values to a sequence of bytes.
Usually, we want the implementations of these methods to be mutual inverses -
the computation in get
should restore data serialized with put
, and
vice versa. Unfortunately, nothing in the type system nor the structure of these
methods gives us this guarantee - it's all down to the programmer. I don't trust
myself, so I set out to investigate a more reliable approach.
Ideally, we would like to build up serializers and deserializers from smaller
pieces, such that each piece carries its own inverse. For example, we could pair
up serialization for a String
with its own inverse:
type Serializer a = (Get a, a -> Put)
string :: Serializer String
string = (get, put)
As long as String
has a Binary
instance where get
and put
correctly
specified, we know that string
is going to function as we expect in both
directions.
We're on to something here, but currently this only works for single
String
s. What if I have a pair of String
s that I want to serialize? From
what we've seen so far, there's no way to combine our bidirectional
serializers. Earlier I mentioned that we would like to work with small pieces
and compose them - lets see if we can solve this problem for just serialization
first.
Serialization consumes data. If we have data to serialize, the application of one serializer should consume some of this data, leaving us with slightly less data that we have to continue serializing. By repeated application of serializers, we will eventually have considered the entire structure and will have nothing left to do. This consumption of a structure bit-by-bit suggests that serialization will be a type changing operation, as a record with a field removed is certainly not the same type as its larger record. So let's try and incorporate that:
data Serializing a b = Serializing (a -> PutM b)
pString2 :: Serializing (String, String) String
pString2 = Serializing $ \(a, b) -> do put a; return b
pString1 :: Serializing String ()
pString1 = Serializing $ \a -> put a
Composition of Serializing
should now be clear - we just need to compose them
one after another:
(.) :: Serializing b c -> Serializing a b -> Serializing a c
(Serializing g) . (Serializing f) = Serializing (f >=> g)
putTwoStrings :: Serializing (String, String) ()
putTwoStrings = pString1 . pString2
We've built a serializer that can serialize a tuple of two strings, and we did so piece-by-piece. While what we have so far is not entirely satisfactory, it seems like we're heading in the right direction. We'll come back to this later, but first let's see if the same ideas translate to deserializers.
Our serializer consumed data, so deserialization is naturally the opposite of
consuming data - that is, deserialization produces data. When we deserialize
we'll start with nothing, and we'll deserialize a few bytes into part of our
structure one step at a time. Each step of deserialization should take the
smaller structure and expand it into a larger structure - eventually leading us
to the desired structure. Again, this will be a type changing operation, and we
can encode all of this just as we did with Serializing
:
data Deserializing a b = Deserializing (a -> Get b)
getString1 :: Deserializing () String
getString1 = Deserializing $ \() -> get
getString2 :: Deserializing String (String, String)
getString2 = Deserializing $ \s -> do { s' <- get; return (s, s') }
(.) :: Deserializing b c -> Deserializing a b -> Deserializing a c
(Deserializing g) . (Deserializing f) = Deserializing (f >=> g)
getTwoStrings :: Deserializing () (String, String)
getTwoStrings = getString2 . getString1
As you can see, it's pretty much exactly the same idea. The only difference is that now each of our deserializers return a slightly bigger structure, whereas our serializers would move our structure to something smaller.
Just to prove that what we have so far works, we can try this in GHCI:
> let bytes = case putTwoStrings of Serializing p -> runPut (p ("Hello", "World!"))
> case getTwoStrings of Deserializing g -> runGet (g ()) (LBS.pack bytes)
("Hello","World!")
To carry on working towards our goal, we need to pair the Serializer
up with
its Deserializer
. Unfortunately, what we have so far won't work:
data Serializer a b = Serializer (a -> Get b) (a -> Put b)
Notice here how the types both move from a
to b
- that's certainly not
going to work, as the shape of the data is changing in opposite directions! In
Get
, a
is "smaller" than b
, whereas for Put
a
is "larger" then b
. In
order to work around this, we just need to swap the order of types in one of
these functions - I've swapped the order for Put
:
data Serializer a b = Serializer (a -> Get b) (b -> PutM a)
This makes sense - if put
will shrink our structure, then get
can move from
this smaller structure back to the original structure. We can express our
string1
and string2
serializers now:
string2 :: Serializer (String, String) String
string2 = Serializer (\(a, b) -> do put a; return b)
(\s -> do { s' <- get; return (s', s) })
string1 :: Serializer String ()
string1 = Serializer put (\() -> get)
We were able to compose things before, and we can certainly compose things here...
(.) :: Serializer b c -> Serializer a b -> Serializer a c
(Serializer g g') . (Serializer f f') = Serializer (f >=> g) (g' >=> f')
twoStrings :: Serializer (String, String) ()
twoStrings = string1 . string2
However, this has a rather significant problem - can you spot it? Take time to think about this and see if you can work out what's going wrong.
Did you find it? If we fire up GHCI and have a play with our twoStrings
serializer, lets see what we get...
> let bytes = case twoStrings of Serializer _ p -> runPut (p ("A", "B"))
> case twoStrings of Serializer g _ -> runGet (g ()) bytes
("B","A")
Oh no - that's not what we wanted at all! The problem is that the order of
effects are being reversed. When we put
data, we put the first tuple element
first, and then the second. However, we're reading data in the opposite order -
expecting the second element to be first in the stream, which is clearly not
correct. For (String, String)
the deserializer works but the tuple is in the
wrong order - for other data types this would lead to a runtime exception.
With the current definition of Serializer
, there's simply no way around this -
the types won't let us run effects in different orders. The reason for this is
that we can only access the underlying Get
computation by having the smaller
structure around first. However, we can be sneaky and changes things around just
enough to let us run Get
in a different order. Now the Get
computation is no
longer a function, but is a computation that returns a function:
data Serializer a b = Serializer (Get (a -> b)) (b -> PutM a)
With this change we do have access to any Get
computation we like, and we
are free to run them in a different order:
(.) :: Serializer b c -> Serializer a b -> Serializer a c
(Serializer g g') . (Serializer f f') =
Serializer (do buildF <- g
buildG <- f
return (buildF . buildG))
(g' >=> f')
Now it's clear that both our Put
and our Get
computations are sequenced in
the same order - nice! It turns out that our composition comes with a sane
definition of identity too, which means our Serializer
can be used with
Category
:
instance Category Serializer where
(Serializer g g') . (Serializer f f') =
Serializer (do buildF <- g
buildG <- f
return (buildF . buildG))
(g' >=> f')
id = Serializer (return id) return
We have finally reached a nice core to our solution, but the surface API isn't
really working out. We had to write different Serializer
s for both (String, String)
and String
, which is certainly not desirable. Ultimately, we would
like to be able to work with just one Serializer
for String
, and compose
them however we please.
Unfortunately, working with tuples is causing us the real pain here. The reason for this is that tuples don't really have any structure that would allow us to work with them in any sort of principled manner. Instead, what we can do is use a heterogeneous list, which we can recurse on just like an ordinary linked list. So, we introduce a type for heterogeneous lists:
data List :: [*] -> * where
Nil :: List '[]
Cons :: a -> List as -> List (a ': as)
And now we can use the new poly-kinded Category
to upgrade Serializer
to
work with these lists:
data Serializer :: [*] -> [*] -> * where
Serializer :: (Get (List a -> List b))
-> (List b -> PutM (List a))
-> Serializer a b
instance Category Serializer where
(Serializer g g') . (Serializer f f') =
Serializer (do mkB <- g
mkA <- f
return (mkB . mkA))
(g' >=> f')
id = Serializer (return id) return
This was quite a detour, and has this really helped us? Indeed it has, as we can
now we can write a much more general Serializer String
:
string :: Serializer as (String ': as)
string = Serializer (do a <- get; return (Cons a))
(\(Cons a as) -> do put a; return as)
The type of string
now indicates that this Serializer
can serialize anything
that starts with a String
, and likewise when deserializing it expects a
String
to be the first element. This composes exactly as we'd expect:
twoStrings :: Serializer as (String ': String ': as)
twoStrings = string . string
All we need to do is unwrap the List
resulting from a Get
or wrap up data in
a List
for Put
and we're good to go:
> let bytes = case twoStrings of
Serializer _ p -> runPut (void $ p ("A" `Cons` ("B" `Cons` Nil)))
> case twoStrings of
Serializer g _ -> runGet (($ Nil) <$> g) bytes
Cons "A" (Cons "B" Nil)
The API we've built works really well if we already have data decomposed into a
List
, but we don't normally have this luxury. This means we need a way to
convert from a data type to it's constituent parts, and this is exactly the
functionality that Prism
s in the
lens
library provide us with. While
Prism
s can be a little hard to get your head around, it can be illuminating to
experiment with them in GHCI:
> review _Cons (10, [])
[10]
> review _Cons (10, [20])
[10,20]
> review _Cons (1, [2..5])
[1,2,3,4,5]
> preview _Cons [10, 20, 30]
Just (10,[20,30])
> preview _Cons [10]
Just (10,[])
> preview _Cons []
Nothing
Prisms
have two main operations: review
and preview
. review
lets us
construct some data out of its parts - above we use _Cons
with (10, [20])
,
which is the same as (10 : [20])
- resulting in the list [10, 20]
. preview
lets us go the other way, which is the same idea as pattern matching on a
constructor. If we preview _Cons
on non-empty lists, then the pattern matching
succeeds and the list is separated into its head and tail. However, we can't
pattern match with _Cons
on an empty list, so preview
returns Nothing
-
which corresponds to a pattern match failure.
Armed with Prism
, we're almost entirely ready to go! The only problem is that
Prism
normally works with tuples, which we've already seen aren't a great data
for our needs. It's entirely mechanical to convert between tuples and List
, so
we simply move between them with a type class. Combining this all together, we
have the following:
class ListIso a b | a -> b, b -> a where
_HList :: Iso' b (List a)
usePrism :: ListIso a b => Prism' d b -> Serializer a '[d]
usePrism p = Serializer get put
where
put (Cons d Nil) = do
Just tuple <- return (preview p d)
return (tuple ^. _HList)
get =
return $ \hlist -> Cons (review p (hlist ^. from _HList)) Nil
Now we are free to use this on our data types, just as we'd expect:
instance ListIso '[a, b] (a, b) where
_HList = iso (\(a, b) -> Cons a (Cons b Nil)) (\(Cons a (Cons b Nil)) -> (a, b))
data PairOfStrings = PairOfStrings String String
makePrisms ''PairOfStrings
pairOfStrings :: Serializer '[] '[PairOfStrings]
pairOfStrings = usePrism _PairOfStrings . string . string
If you look closely at our definition of usePrism
you might have seen
something suspicious. Here's the relevant code:
usePrism = ...
where
put (Cons d Nil) = do
Just tuple <- return (Lens.preview p d)
return (tuple ^. _HList)
In our put
definition, we are assuming that Lens.preview
is always returning
a Just
value. However, we saw earlier that this isn't necessarily the case -
the Prism corresponds to one of potentially many constructors. If we try and use
usePrism
with a prism that doesn't match our expectations, then things go
horribly wrong:
data Strings = PairOfStrings String String | ThreeStrings String String String
makePrims ''Strings
> case pairOfStrings of
Serializer _ p -> runPut (void $ p (ThreeStrings "Uh" "Oh" "!" `Cons` Nil))
"*** Exception: Pattern match failure in do expression at ...
What we need to do is to allow for choice - if we have multiple possible prisms, then we need to consider each one. This corresponds to exhaustive pattern matching in case analysis.
It turns out choice is relatively straight forward to add in. Get
is already
an instance of MonadPlus
, so we get choice there for free. Put
however is a
little more involved, as it doesn't have an instance of MonadPlus
. The best
solution I've found thus far is to wrap up our Put
computation inside Maybe
,
but this isn't entirely satisfactory. Unfortunately binary
doesn't quite
export enough to have a less expensive solution (PairS
doesn't have its
oconstructor exported).
Monoid
is a sensible type class to use for alternatives - choice is
associative, and there is a sane identity (an always-failing Serializer
). Thus
the final definition of Serializer
and its associated type classes are:
data Serializer :: [*] -> [*] -> * where
Serializer :: (Get (List a -> List b)) -> (List b -> Maybe (PutM (List a))) -> Serializer a b
instance Category Serializer where
(Serializer g g') . (Serializer f f') =
Serializer (g >>= \b -> f >>= \a -> return (b . a))
(\a -> do putF <- g' a
let (b, lbs) = runPutM putF
putG <- f' b
return (putLazyByteString lbs >> putG))
id = Serializer (return id) (return . return)
instance Monoid (Serializer a b) where
mempty = Serializer mzero (const mzero)
(Serializer g p) `mappend` (Serializer g' p') =
Serializer (g `mplus` g') (\i -> p i `mplus` p' i)
Armed with this final definition of Serializer
, we're almost ready to provide
a complete definition of serializing our Strings
type. We need to provide a
little extra information however, which allows us to disambiguate
constructors. This is because if we are deserializing, if I read two strings I
don't necessarily know constructor to choose (yes, if we considered EOF this
could be done, I'm going for brevity). You can find the definition of
disambiguate
in the full code listing.
Thus the final user-facing code is just:
strings :: Serializer '[] '[Strings]
strings = mconcat
[ usePrism _PairOfStrings . disambiguate 1 . string . string
, usePrism _ThreeStrings . disambiguate 2 . string . string . string
]
And just to prove it all works...
> let Just putter = case strings of
Serializer _ p -> p (ThreeStrings "A" "B" "C" `Cons` Nil)
bytes = runPut (void putter)
> case strings of Serializer g _ -> runGet (($ Nil) <$> g) bytes
Cons (ThreeStrings "A" "B" "C") Nil
> let Just putter = case strings of
Serializer _ p -> p (PairOfStrings "Hello" "World!" `Cons` Nil)
bytes = runPut (void putter)
> case strings of Serializer g _ -> runGet (($ Nil) <$> g) bytes
Cons (PairOfStrings "Hello" "World!") Nil
We've seen that it's possible to build correct-by-construction serializers and deserializers, and we got there by breaking down our problem into small parts and finding a good way to combine the parts together. Hopefully, I've illustrated some of the problems that arise from a naive solution, and how these problems guided us towards an implementation that is both more correct and more flexible.
Serializer
is still not perfect however. With the idea of choice above,
there's no way to indicate exhaustive pattern matching. For example, in
strings
we are considering both constructors, yet put strings
returns Maybe Put
. This isn't particularly satisfactory, because it should now always be
possible to serialize this data type! On a similar note, it becomes harder to
get compile time checks about exhaustive pattern matching, because we're no
longer doing case
analysis explicitly. This is an interesting problem to me,
and one that I would still like to solve.
There is also a bit more work that we might want to consider doing with Get
and Put
, which is to use a different concept of choice. There are other
options than using Maybe
- for example we could use lists which would inform
us of all possible serializations for a data type, which might provide better
debugging information than simply using the first one that matches.
I'd like to conclude by mentioning that the ideas here aren't particularly
new. In 2010
Rendel and Ostermann
presented a solution using a category of partial isomorphisms and product
functors from this category to Hask, which lead to various libraries on Hackage
such as
invertible-syntax
, and
boomerang
. At ZuriHack,
Martijn van Steenbergen presented the
latest version of
JsonGrammar
, which uses a
free category to describe operations on a JSON AST, and also illustrated how one
can use prisms to provide a modern vocabulary for partial
isomorphisms. json-grammar
uses a stack-prism
data type, which achieves the
same goal as using heterogeneous lists, but does require another Template
Haskell call (makeStackPrisms
).
While I'm happy with the solution so far, I haven't finished playing around with
this. It's unclear to me how this plays with recursive data types (for example,
does this work for lists? What about trees?), and I need to learn more about
stack-prism
to see if using heterogenous lists impedes composition (as
Sjoerd Visscher has warned
me!). Hopefully I'll be able to start using what I have so far in production,
iron out the last problems, and release this to Hackage in the near future.
Thanks for reading, a full code listing can be found on Github