Yesterday I finished porting a MusicBrainz production service from Perl to Haskell. The port was done for no real reason other than giving me something to do during my commute, but as work developed I became interested in how the two applications might compare. In this post I will discuss my experience, hopefully providing one more data point for people who are considering using Haskell for production work.
The project I ported was the caa-indexer, a long-running
service that achieves consistency between data stored in the MusicBrainz
database, and data stored by the Cover Art Archive itself. If you're not
familiar with the Cover Art Archive, here's a brief
technical overview of how it works.
The Cover Art Archive (CAA) stores artwork for music releases. Every release in
MusicBrainz has a MusicBrainz identifier, and the CAA is a separate database
associating images with these identifiers. Images are stored by the Internet
Archive, and can be managed via an S3-like protocol. The metadata about the
image - such as tracking which MusicBrainz release it belongs to, or what type
of image it is - is stored in the main MusicBrainz database. As artwork is
added, removed, or otherwise modified, events are propagated from the
MusicBrainz database over RabbitMQ to the caa-indexer, which responds to each
event accordingly. There are three events - indexing, moving and deleting:
- Indexing refreshes the index.json file, which is a JSON serialization of metadata that MusicBrainz stores. This is used by the http://coverartarchive.org web service. It also builds the mb_metadata.xml file, which is a snapshot of the release according to the MusicBrainz web service. This is used by the Internet Archive to enhance their own search results.
- Move events are used to move an image from one bucket to another. For example, when two releases are merged, all the images are moved to the same bucket.
- Delete events delete data from buckets - be that images or the index.json file.
Thus any caa-indexer-like project needs the ability to:
- Communicate with RabbitMQ to listen for events.
- Communicate with PostgreSQL to fetch metadata from the MusicBrainz database.
- Communicate with the Internet Archive's S3 store.
- Produce JSON documents.
- Make web requests to musicbrainz.org.
My methodology for the port was ad hoc, and not particularly coordinated. I was happy with the existing architecture, so the Haskell rewrite takes the Perl code almost verbatim and translates it where necessary. The general architecture remained the same.
Everything the caa-indexer needs to do is supported by Haskell libraries on
Hackage, which is the central location of Haskell
libraries (much like CPAN for Perl). I used the following libraries to meet the
requirements:
- postgresql-simple for talking to PostgreSQL.
- amqp for talking to RabbitMQ.
- aws for talking to the Internet Archive.
- aeson for JSON serialization.
- http-conduit for making web requests. I initially wanted to use http-streams for this, but as aws already used http-conduit, it was easier to use the same library.
- xml for trivial XML processing of responses.
- configurator to configure the application outside of the code itself.
The following were also used to produce more idiomatic Haskell code:
- uuid provides a UUID data type, allowing me to achieve a little more type safety.
- async was used to process events in a separate thread, in order to provide a somewhat ultimate try/catch. We'll see more on this later.
- errors, which is part of my standard toolkit for dealing with things going wrong.
- text, which is the canonical way to deal with text in Haskell, along with encoding and decoding bytestreams.
Together, these libraries make a formidable toolkit for solving problems in the
caa-indexer's domain. Of these libraries, the aws library was new to me, and
I hadn't used http-conduit for a long time. Documentation on these libraries
was particularly helpful - and I didn't find it difficult to learn the new APIs.
The fact that Haskell provides type signatures means that Haddock (Haskell's
documentation tool) is already able to provide documentation even if the library
is otherwise undocumented. The xml library falls into this category, and while
it could have done with better documentation, being able to find functions
purely on their type signature was sufficient.
The Haskell codebase measures 389 lines of code, while the Perl codebase measures 399. I'm somewhat amazed at how similar the two codebases are in size. Haskell is a tiny bit more compact, though this is likely because all of the Haskell code is in a single file, whereas the Perl code is split over multiple files. Either way, it's clear that the difference in code size is negligible. What's more interesting is that the Haskell code provides significantly more guarantees due to its type system, in the same amount of code as Perl. I believe this is a testament to just how succinct Haskell code can be.
That said, there are places in the Haskell code that are more verbose than Perl,
due to being more type safe. One example is dealing with UUIDs from the
MusicBrainz database. In Perl, almost everything coming out of the database is
just a string, so UUIDs are dealt with like any other bit of text.
postgresql-simple, however, checks at runtime the types of columns against the
Haskell types they are being represented with. Thus if you want to deal with
UUIDs in Haskell, you need to either cast them to plain text in the query, or
write (de)serializers for this data type. I chose the latter, and had to write
the following code, which has very little to do with the caa-indexer:
    instance Pg.ToField UUID.UUID where
      toField = Pg.Plain . Pg.inQuotes . Builder.fromString . UUID.toString

    instance Pg.FromField UUID.UUID where
      fromField f Nothing = Pg.returnError Pg.UnexpectedNull f "UUID cannot be null"
      fromField f (Just v) = do
          t <- Pg.typename f
          if t /= "uuid" then incompatible else tryParse
        where
          incompatible = Pg.returnError Pg.Incompatible f "UUIDs must be PG type 'uuid'"
          tryParse = case UUID.fromString (Char8.unpack v) of
            Just uuid -> return uuid
            Nothing -> Pg.returnError Pg.ConversionFailed f "Not a valid UUID"
Hardly the end of the world, but it's code that otherwise detracts from the main
purpose of the caa-indexer. I will likely wrap this code up in a small library
and upload it to Hackage in the future.
The only other piece of code that feels a little more verbose in Haskell is the JSON serialization itself. In Perl, I can write the following:
    $json->objToJson({
        images => [
            map +{
                types => $_->{types},
                front => $_->{is_front} ? JSON::Any->true : JSON::Any->false,
                back => $_->{is_back} ? JSON::Any->true : JSON::Any->false,
                ...
            }, $self->dbh->query(
                'SELECT * FROM cover_art_archive.index_listing
                 WHERE release = ?
                 ORDER BY ordering',
                $release->{id}
            )->hashes
        ],
        release => 'http://musicbrainz.org/release/' . $release->{gid}
    });
This maps over the results of a database query directly into JSON. In Haskell, I chose to first convert each row into a Haskell data type, and then write a serializer for this data type. Thus, the code is more along the lines of:
    data IndexListing = IndexListing
      { indexRelease :: UUID.UUID
      , indexImages :: [ReleaseImage]
      }

    data ReleaseImage = ReleaseImage
      { imageTypes :: Vector.Vector Text.Text
      , imageIsFront :: Bool
      ...
      }

    instance Aeson.ToJSON IndexListing where
      toJSON IndexListing{..} = Aeson.object
        [ "images" .= indexImages
        , "release" .= ("http://musicbrainz.org/release/" ++ UUID.toString indexRelease)
        ]

    instance Aeson.ToJSON ReleaseImage where
      toJSON ReleaseImage{..} = Aeson.object
        [ "types" .= imageTypes
        , "front" .= imageIsFront
        ...
        ]

    -- later:
    images <- Pg.query pg [sql| SELECT ... |]
    Aeson.encode (IndexListing release images)
As you can see, this is significantly more verbose, and has required me to
write the data declaration with selector names, only to later write them out
again in the ToJSON instances. Haskell does have generics support, but for JSON
I find this is usually suboptimal when you have a specific schema in mind. I
also needed to include properties in the JSON document that are a function of
other values (such as expanding the image ID into a full URL), so automatically
deriving JSON serialization is not sufficient, unless I want to store duplicate
state in ReleaseImage itself (which is certainly worse).
There is a price to pay for being more type safe, and while often this is superficial, there are places where it leads to a little more friction. One could argue that introducing the data type leads to better testability however, as it's now possible for me to check the JSON serialization in a pure manner without touching a database at all.
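To make the testability point concrete, here is a minimal sketch of checking a serializer purely. It is not code from the caa-indexer: the Image type, the imageUrl helper and the example.org URL layout are all made up for illustration, and it assumes the aeson and text packages are available.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Data.Aeson (ToJSON (..), object, (.=))
import           Data.Text  (Text)
import qualified Data.Text  as Text

-- Hypothetical, trimmed-down image record (not the real ReleaseImage).
data Image = Image
  { imageId      :: Int
  , imageIsFront :: Bool
  }

-- Derived property: the URL is computed from the ID rather than stored.
-- The URL layout is an assumption made up for this example.
imageUrl :: Image -> Text
imageUrl img =
  "http://example.org/images/" <> Text.pack (show (imageId img)) <> ".jpg"

instance ToJSON Image where
  toJSON img = object
    [ "front" .= imageIsFront img
    , "image" .= imageUrl img
    ]

-- A pure check: no database or network is involved. Comparing Values
-- (rather than encoded bytes) avoids depending on key ordering.
serializesDerivedUrl :: Bool
serializesDerivedUrl =
  toJSON (Image 42 True) == object
    [ "front" .= True
    , "image" .= ("http://example.org/images/42.jpg" :: Text)
    ]
```

Because the check is an ordinary Bool, it can be evaluated in ghci or dropped into whichever test framework you prefer.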
A friend at a recent Haskell meetup once described Python development to me as "backtrace driven development", and unfortunately Perl often feels very much the same. I generally write code in Perl by sketching out an idea, and then trying to run it. If it crashes, I refine my code and try again. It's a very haphazard approach to building software. It can be improved by working in a test driven manner, but ultimately you are still working with assurances over a subset of the codebase.
In Haskell I find there are two main approaches, depending on how you are
solving problems. If you have a solution in mind, you can take a loosely "holes
driven" approach to programming, where you write top level types and leave
implementations blank; or you can do more bottom up programming by working in
the ghci REPL. Having already written this code once in Perl, I opted for the
former - first getting my types right, and then filling in the blanks later.
Structuring my types took a while - in fact the entire project didn't compile for about
4 hours. This may seem very strange to some people, but I find it very natural
in Haskell. The lack of compilation doesn't imply I'm not making progress - in
those 4 hours I learnt more about my program, developing an understanding of
how components fit together. Once it compiled, it was then a case of replacing
all calls to undefined with a real implementation. Again, this inevitably
exposed false assumptions and produced compiler errors, so the cycle repeated.
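As a rough illustration of this style (not code from the caa-indexer; the Event type and the "index <mbid>" message format are invented for the example), the snapshot below shows a program part-way through the process: one hole has been filled in, one is still undefined, and the plumbing between them already type checks.

```haskell
-- Hypothetical events, loosely modelled on the ones described earlier.
data Event = Index String | Delete String
  deriving (Eq, Show)

-- A hole that has since been filled in. The message format is made up.
parseEvent :: String -> Maybe Event
parseEvent body = case words body of
  ["index", mbid]  -> Just (Index mbid)
  ["delete", mbid] -> Just (Delete mbid)
  _                -> Nothing

-- Still a hole: it type checks, but crashes if it is ever evaluated.
handleEvent :: Event -> IO ()
handleEvent = undefined

-- The plumbing compiles as soon as the signatures agree, long before
-- every piece has a real implementation.
processMessage :: String -> IO ()
processMessage body =
  maybe (putStrLn "could not parse message") handleEvent (parseEvent body)
```

Only the paths that reach handleEvent will fail at runtime, so the finished pieces can already be exercised on their own.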
Remarkably, once I had replaced every undefined and had things compiling, the
majority of the battle was won. Running the program connected to RabbitMQ, set
up the queues I expected, and listened for events as I expected. This isn't to
say the job was done - there were still places where runtime behaviour was
wrong, or error handling was insufficient.
The problems at runtime were genuine problems, and the kind of bugs that I find
interesting to fix. I wasn't spending time working out why something was null,
or if a method call had the wrong arguments, because the compiler had
already held my hand through that stage of programming. Overall the development
experience was fun.
I wrote this port almost entirely on my own, and didn't have to interact much
with the Haskell community. The amqp package didn't initially have support for
message headers, which are needed for some of the retry logic, so I raised an
issue on the amqp project's GitHub issue tracker. The author was responsive and
released a new version of the library with this support.
While this interaction was extremely limited, it is consistent with my experience of the Haskell community outside this work: a small group of people who are very passionate about their work, and who embrace working with others.
There are a few things in the Haskell port that I really liked. This section is a place to express my enjoyment of writing Haskell, and show off some of the things I found cool. While these points are highly specific and won't help you decide whether to write your projects in Haskell, I hope that some of my enthusiasm will reach you, and convey my feeling that Haskell is both fun and a constant learning experience.
The data type of an EventHandler uses RankNTypes-style quantification, here in the form of an existential type:
    data EventHandler = forall payload. EventHandler
      { eventRoutingKey :: Text.Text
      , eventParser :: String -> Maybe payload
      , eventAction :: HandlerEnv -> payload -> IO ()
      }
This lets me have lists of EventHandlers, each of which is able to parse a
String into a handler specific payload of a more refined type. Thus, I have a
few top level definitions such as index :: EventHandler and I can later map
over all EventHandlers to register them with RabbitMQ:
    mapM_ registerEventHandler [ index, move, delete ]
This is a simple way to achieve a somewhat heterogeneous list. The alternative
approach of parameterizing an EventHandler by its payload type would make this
list harder to construct, because every element would need the same payload
type. By moving the choice of the payload into the EventHandler value itself,
the code is simplified.
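For readers who want to play with the idea, here is a small self-contained sketch of the same pattern. It is not the caa-indexer code: the Handler type, its fields and the two example handlers are invented, the actions return a String instead of running in IO so they are easy to inspect, and I enable ExistentialQuantification, which is the extension GHC documents for this syntax.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

import Text.Read (readMaybe)

-- Each handler hides its own payload type behind the forall.
data Handler = forall payload. Handler
  { handlerKey    :: String
  , handlerParse  :: String -> Maybe payload
  , handlerAction :: payload -> String
  }

-- Pattern matching brings the hidden payload type into scope, so the
-- parser's output can be fed straight to the matching action.
runHandler :: Handler -> String -> Maybe String
runHandler (Handler _ parse action) body = action <$> parse body

-- Two handlers with different payload types (Int and String).
double :: Handler
double = Handler "double" readMaybe (\n -> show (n * 2 :: Int))

shout :: Handler
shout = Handler "shout" Just (\s -> s ++ "!")

-- Despite the different payloads, both fit in one ordinary list.
handlers :: [Handler]
handlers = [double, shout]

-- Find the handler registered for a key and run it on the message body.
dispatch :: String -> String -> Maybe String
dispatch key body = case filter ((== key) . handlerKey) handlers of
  (h : _) -> runHandler h body
  []      -> Nothing
```

The trade-off is that nothing outside a handler can learn what its payload type is; all you can do with a parsed payload is hand it to that handler's own action, which is exactly what is wanted here.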
I use a few different monads inside my caa-indexer. For example, I use the
Maybe monad to simplify XML traversals:
    XML.parseXMLDoc (HTTP.responseBody response) >>=
      XML.findChild (XML.unqual "Error") >>=
      XML.findChild (XML.unqual "Resource") >>=
      pure . XML.strContent
This parses an HTTP response as XML, and finds the text content inside the
<Error><Resource> element. If any of this fails, the whole computation aborts
and Nothing is returned. I find it particularly elegant that the monadic
sequencing took care of this for me.
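The same short-circuiting can be seen without any XML library. The sketch below uses only base; the nested Config type and the "rabbitmq"/"port" keys are made up for the example.

```haskell
import Text.Read (readMaybe)

-- Hypothetical nested configuration: section -> key -> raw value.
type Config = [(String, [(String, String)])]

-- Each step may fail; >>= stops at the first Nothing, so no explicit
-- checks are needed between the steps.
lookupPort :: Config -> Maybe Int
lookupPort cfg =
  lookup "rabbitmq" cfg >>=
  lookup "port" >>=
  readMaybe
```

A missing section, a missing key and an unparseable value all collapse to the same Nothing, which is the price paid for the brevity: the caller cannot tell which step failed.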
Another monad that I really like is EitherT, which lets me sequence a series
of actions, aborting much like Maybe if anything fails, but with the ability
to provide a reason for failure. EitherT transforms a base monad and I use it
with IO to provide a try/catch like block:
    EitherT.eitherT
      (retry msg handlerChan eventRoutingKey)
      (const $ return ()) $
      (do payload <-
            eventParser messageBody ?? (toException MessageBodyParseFailure)
          EitherT.EitherT $
            Async.withAsync (eventAction env payload) Async.waitCatch)
The final do block is the code to execute. This code can either result in an
exception, at which point retry is invoked to retry the message in 4 hours
(this is done by dead lettering in RabbitMQ), or it can succeed by running to
completion.
I use ?? to lift the Maybe result of the eventParser into EitherT, and I
use withAsync to run a computation on a separate thread, and if this dies
because of an exception I lift the exception into EitherT. Thus we have a nice
try/catch-like piece of code, that required no language support - we just built
it ourselves.
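If you want to experiment with this shape without pulling in the same libraries, here is a minimal sketch that swaps in ExceptT from the transformers package for EitherT, and Control.Exception.try on the current thread for withAsync/waitCatch. The Failure type and the orFail and guarded helpers are invented for the example; orFail plays the role that ?? plays above.

```haskell
import Control.Exception (SomeException, evaluate, try)
import Control.Monad.Trans.Except (ExceptT (..), runExceptT)
import Text.Read (readMaybe)

-- Hypothetical failure reasons for this example.
data Failure = ParseFailure | ActionFailed String
  deriving (Eq, Show)

-- Lift a Maybe into ExceptT, naming the reason for failure.
orFail :: Monad m => Maybe a -> e -> ExceptT e m a
orFail m e = ExceptT (return (maybe (Left e) Right m))

-- Run an IO action, turning any exception it throws into a Left.
-- Unlike withAsync, this runs on the calling thread.
guarded :: IO a -> ExceptT Failure IO a
guarded action = ExceptT $ do
  result <- try action
  return $ case result of
    Left err -> Left (ActionFailed (show (err :: SomeException)))
    Right a  -> Right a

-- Parse the body, then run an action that may throw (division by zero).
process :: String -> IO (Either Failure Int)
process body = runExceptT $ do
  n <- readMaybe body `orFail` ParseFailure
  guarded (evaluate (100 `div` n))
```

The caller then decides what a Left means, for example scheduling a retry, in the same way that eitherT is handed retry above.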
We all know that it's important to choose the right tool for the job, but I am honestly yet to find problems where Haskell isn't a suitable tool. Perhaps a while ago when the library coverage was sparse, Haskell wasn't a great choice - but now that is patently not the case. I showed that it's easy to write software that interacts with relational databases, message brokers and Amazon web services, all without having to write any of this code myself.
Haskell is often claimed to drastically reduce code size, but I didn't find this claim held up against other high level languages. On the other hand, I also didn't find that Haskell was any more verbose either - so I feel that claims about having to "work to please the type system" don't really hold up to much, though there are cases where it does add a little more friction.
Code legibility is another important issue, and I can't provide much feedback around this. Of course, I feel the Haskell is readable because I wrote it! The only useful exercise here is for you to see for yourself:
More important to me though, is that the compiler is always reading the code with me. While I feel Perl code (in general) is readable, I have frequently misinterpreted the meaning of some code and carried false assumptions forward later in my project. Haskell isn't free of this either, but a lot of assumptions are spelled out in the types. The more you use types (such as phantom types or GADTs), the more you encode about your data and the harder it is to misinterpret code.
The only remaining thing that makes me uncomfortable is getting the rest of my team to write some Haskell. We have so much going on at MusicBrainz, there hasn't been a good time to accept that things will have to slow down while people learn a new language, and the idioms that go with it. I don't feel that Haskell is particularly deserving of the scary/academic label it sometimes gets. Yes, it will be alien, but if something doesn't require a little bit of rewiring in order to show you a different perspective on things, why bother learning it? Moving between Python and Ruby is easy, because there is very little between them, but I think that also results in very little net gain in moving between them as well.
With an extremely powerful type system, a vibrant community, a comprehensive set of libraries, a new programming paradigm (purity and laziness) and a beautifully succinct language, Haskell is still the ultimate language to me. If you haven't yet picked up Haskell, I encourage you to go and Learn You a Haskell for Great Good today.