A New Database Abstraction Layer
The New Edit System, NES, requires a fairly hefty new schema because of the extra tables needed to actually do the versioning. While the bulk of this schema is done, it's fairly useless if the website can't talk to it!
The next step for me is to get the website to talk to the new database. This isn't quite as straightforward as changing the implementation of existing API methods - the whole API needs to change to match the main ideas of NES (trees, revisions, etc). Even if it were a case of just changing the API, this means I have to write more Perl code, fix hundreds of tests, etc.
It doesn't fill me with great joy to do this rewrite, partly because it's just not fun work, but also because I don't believe I'll do as good a job as I could, simply because it's Perl.
In this post, I've outlined the 2 major ideas going round - PostgreSQL as an API, and database as a service.
One option is to move a lot of the heavy lifting to PostgreSQL through stored procedures. Before we look at the pros and cons of this, let's have a look at what this would look like:

    CREATE TYPE loaded_artist AS (
        artist_id uuid,
        revision_id integer,
        artist_tree_id integer,
        name text,
        sort_name text,
        artist_type_id integer,
        -- etc
    );

    -- This gets the latest 'master' version of an artist and joins in basic data
    -- from the artist tree
    CREATE FUNCTION get_latest_artist_by_mbid(in_mbid UUID)
    RETURNS SETOF loaded_artist AS $$
        SELECT
            artist_id, revision_id, artist_tree_id, name.name,
            sort_name.name AS sort_name, artist_type_id
        FROM artist
        JOIN artist_revision USING (artist_id)
        JOIN artist_tree USING (artist_tree_id)
        JOIN artist_data USING (artist_data_id)
        -- (the joins that define the name and sort_name aliases are not shown)
        WHERE artist.master_revision_id = revision_id
          AND artist_id = $1
    $$ LANGUAGE SQL;

This is a real-world example of something that we need to do in NES - loading artists by their MBID. The first part of the above code snippet declares a new data type, loaded_artist, which we can use to enforce a little more type safety throughout the API. By defining functions in terms of loaded_artist, we are bound to a contract that they /must/ return data that matches that structure.
For some clients, direct queries into the database might be enough, but most people are going to want to talk to the database through some language-specific bindings. In Perl, it might look something like:

    sub get_latest_by_mbid {
        my ($self, $mbid) = @_;
        return $self->new_from_row(
            $self->sql->select_single_row_hash(
                'SELECT * FROM get_latest_artist_by_mbid(?)',
                $mbid));
    }

Where new_from_row knows how to turn a loaded_artist into some Perl data type. The Perl code is now minimized to being a very light wrapper around the database.
PostgreSQL has a pretty nice type system. Functions can only be called if you specify the correct number of arguments, of the right types. Return types allow function composition to be type safe.
The type system also creates some great cohesion. In the above example, if I changed the specification of what a loaded_artist is but didn't update get_latest_artist_by_mbid, I wouldn't be able to make those changes.
No matter what programming language you're working in, if you can talk to the database you immediately have a very rich API to work with.
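To make that concrete, here is a minimal sketch of the same thin wrapper in Python. It assumes a DB-API 2.0 cursor that accepts %s placeholders (as psycopg2 does). The LoadedArtist class is a hypothetical mirror of loaded_artist, covering only the fields shown above.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class LoadedArtist:
    """Hypothetical Python-side mirror of the loaded_artist composite type."""
    artist_id: str
    revision_id: int
    artist_tree_id: int
    name: str
    sort_name: str
    artist_type_id: Optional[int]  # assumption: the type may be NULL


def get_latest_artist_by_mbid(cursor: Any, mbid: str) -> Optional[LoadedArtist]:
    """Call the stored function and wrap the first row, if there is one.

    `cursor` is assumed to be a DB-API 2.0 cursor that accepts %s placeholders.
    """
    cursor.execute(
        "SELECT artist_id, revision_id, artist_tree_id, name, sort_name, "
        "artist_type_id FROM get_latest_artist_by_mbid(%s)",
        (mbid,),
    )
    row = cursor.fetchone()
    # Columns are selected in the same order as the dataclass fields.
    return None if row is None else LoadedArtist(*row)
```

As with the Perl version, all the real work happens in the database; the binding only passes a parameter in and names the columns that come back.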
Documentation in PostgreSQL is sometimes overlooked, but the COMMENT declaration lets you assign comments to various objects in the database. This requires the object actual exis
This does not change our application stack at all. We already require PostgreSQL, we're just using it for higher-level operations.
We only get the aforementioned type of cohesion for things that are first class, and more complicated functions (such as those written in PL/Python, PL/Perl and even PL/pgSQL) will not be type checked at all. This means they may crash at runtime.
Also, this type safety is not great over ALTERing things - if I later run ALTER TYPE on loaded_artist and drop an attribute, the existing functions will only break at runtime. This problem can be avoided by always doing 'full' installations rather than migrations (think loading snapshots rather than deltas).
Releasing new versions of the server now involves having to do database updates as well. There are tools to help with this, such as Versioning, Sqitch and more.
Tool support is weaker for PostgreSQL. There is no real way to debug things other than print tracing. EXPLAIN and log analysis are about as far as you get with a code profiler (which is not awful by any means).
SQL is not the most flexible of languages, and this shows when you start to want to do things like abstract functions out. Take the example of relationships - fetching relationships between x and y is always the same, but the l_a_b table changes, as do the tables you link to. This either leads to massive amounts of copying and pasting, or dynamic SQL. I usually go with the latter these days, and create install_ functions that are then immediately selected with various parameters, which gives a poor man's compile-time phase.
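The install_ functions themselves live in SQL and aren't reproduced here. As a rough illustration of the same "expand one template per table" idea, here is a sketch that generates the function definitions from Python instead. The function names, the entity0 column and the integer parameter are all assumptions made for the example, not the real schema.

```python
from itertools import combinations_with_replacement
from typing import Iterable, List

# Hypothetical template: one lookup function per l_a_b relationship table.
# The entity0 column and the in_id parameter are illustrative assumptions.
TEMPLATE = """\
CREATE FUNCTION get_{a}_{b}_relationships(in_id integer)
RETURNS SETOF l_{a}_{b} AS $$
    SELECT * FROM l_{a}_{b} WHERE entity0 = $1
$$ LANGUAGE SQL;"""


def relationship_functions(entity_types: Iterable[str]) -> List[str]:
    """Expand the template once for every unordered pair of entity types.

    Entity names are interpolated directly into SQL, so they must come from
    a trusted, fixed list and never from user input.
    """
    ordered = sorted(set(entity_types))
    return [
        TEMPLATE.format(a=a, b=b)
        for a, b in combinations_with_replacement(ordered, 2)
    ]
```

Calling relationship_functions(["artist", "label"]) yields three statements, one each for l_artist_artist, l_artist_label and l_label_label, which would then be run at install time.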
As you're talking directly to the database, you can't really make use of memcached in between. Client libraries would have to do this.
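A minimal sketch of what that client-side caching could look like. A plain dict stands in for memcached, and fetch is a hypothetical callable standing in for whatever actually queries the database. The class name and the "artist:" key prefix are made up for the example.

```python
from typing import Callable, Dict, Optional


class CachingArtistClient:
    """Cache-aside wrapper: check the cache first, fall back to the database."""

    def __init__(self, fetch: Callable[[str], Optional[dict]]):
        self._fetch = fetch                 # hypothetical database lookup
        self._cache: Dict[str, dict] = {}   # stand-in for a memcached client

    def get_latest_by_mbid(self, mbid: str) -> Optional[dict]:
        key = "artist:" + mbid
        if key in self._cache:
            return self._cache[key]
        row = self._fetch(mbid)
        if row is not None:
            # Only cache hits; misses are looked up again next time.
            self._cache[key] = row
        return row

    def invalidate(self, mbid: str) -> None:
        """Drop a cached artist, for example after an edit is applied."""
        self._cache.pop("artist:" + mbid, None)
```

The catch is that every client library would need its own copy of this logic, including the invalidation rules.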
I like the increased safety, but it just feels a bit weird, though I can't quite put my finger on why. Not very helpful, I know! I've been somewhat prototyping this API in my nes-playground project, if you want to have a look.
Rather than having applications talk to PostgreSQL, they actually talk to a service somewhere. This could be a REST server, a SOAP server or a Thrift server; that isn't really important to this discussion. But the idea is that you make the database an implementation detail, so when handling website requests you consult this service, rather than talk to the database directly.
This is essentially what we have right now; it's just that the service happens to run in the same process as the web server itself. However, I think if we were going with this idea we'd be looking at moving this service out of the web server process and into its own process, giving us the ability to implement this in a different language.
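Here is a small sketch of the "database as an implementation detail" idea. The request-handling code depends only on an interface, and the backend behind it can be in-process, PostgreSQL, or a remote server. All names here are hypothetical, and the in-memory backend exists only to keep the example self-contained.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class ArtistService(ABC):
    """Everything the website is allowed to know about artist storage."""

    @abstractmethod
    def get_latest_by_mbid(self, mbid: str) -> Optional[dict]:
        ...


class InMemoryArtistService(ArtistService):
    """Stand-in backend; a real one would query PostgreSQL or call a server."""

    def __init__(self, artists: Dict[str, dict]):
        self._artists = artists

    def get_latest_by_mbid(self, mbid: str) -> Optional[dict]:
        return self._artists.get(mbid)


def artist_page_title(service: ArtistService, mbid: str) -> str:
    """Request-handling code: it never sees SQL or a database connection."""
    artist = service.get_latest_by_mbid(mbid)
    return "Artist not found" if artist is None else artist["name"]
```

Swapping InMemoryArtistService for a class that makes a network call would not change artist_page_title at all, which is the property that makes moving the service into its own process possible.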
Rewriting the service layer doesn't have to be done in Perl; we could write it in Java, Python, C, whatever. This will let us do a separate round of language evaluation, and hopefully we can choose something that doesn't have quite the same amount of accidental complexity as Perl.
Now that we have a service we can aggregate data from other services, or otherwise communicate with them. For example, memcached, message queues, and so on. However, as good as this sounds, it does mean that the 'service' itself starts to grow and have more dependencies, or the code becomes more complicated in order to deal with missing services.
If a client wants to use our API they now have to run a PostgreSQL server for their queries to run against, but they also have to run this service somewhere as well. And as alluded to earlier, if the service also requires other services, then they might need those as well.
Depending on the transport layer for the service, there is more overhead. For example, most people will probably want REST+JSON, which means you have the overhead of an HTTP server, and the overhead of serializing and then deserializing. With a lot of roundtrips, this can be expensive. PostgreSQL doesn't really suffer from this as much as it's a binary protocol with persistent connections.
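To get a feel for the serialization part of that cost, here is a small sketch that encodes and decodes one row as JSON and times a batch of round trips. The row contents are made up, and HTTP overhead is not included, so this only covers the encode/decode step.

```python
import json
import time
from typing import Any, Dict


def encode_response(row: Dict[str, Any]) -> bytes:
    """What the service does before sending a row over the wire."""
    return json.dumps(row).encode("utf-8")


def decode_response(body: bytes) -> Dict[str, Any]:
    """What the client does after receiving the response body."""
    return json.loads(body.decode("utf-8"))


def time_round_trips(row: Dict[str, Any], count: int) -> float:
    """Seconds spent purely on JSON encoding and decoding `count` times."""
    start = time.perf_counter()
    for _ in range(count):
        decode_response(encode_response(row))
    return time.perf_counter() - start


# Made-up row shaped like loaded_artist; the values are placeholders.
example_row = {
    "artist_id": "00000000-0000-0000-0000-000000000000",
    "revision_id": 1,
    "artist_tree_id": 1,
    "name": "Example",
    "sort_name": "Example",
    "artist_type_id": 1,
}
```

Running time_round_trips(example_row, 100000) on your own machine gives a rough lower bound on the per-request cost, before any network or HTTP handling is added.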
My personal feeling for this is that moving things to PostgreSQL is going to be enough, though I do like the added type checking that we get. However, I'm not sure that justifies the increased costs in deployment, the context switching, and a development environment that's likely new to a lot of people.
With that said, Perl is becoming a maintenance nightmare. We're at least tightening some parts with my drive for more tests and faster tests, but it's becoming hard to trust the code that we all write. Though I'm sure I won't be popular with my choice of language :)
Being practical, we have 2 full-time developers and a whole stack of bugs, with more to come as we go forward. We need something that is going to allow us to get results progressively, rather than some gold-plated perfect system 4 years later. And my NES work is somewhat blocked by this, so we need to figure it out fairly soon!