

All about Megalon

Megalon is a clone of Google's Megastore transaction manager. Its goal is to be a highly scalable, fault-tolerant data storage system that spans multiple data centers.

There's nothing very novel in Megalon so far, since it's a pretty straightforward ripoff of Megastore. So all the credit for the interesting protocols and architecture goes to the Googlers who wrote the Megastore research paper: Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson, Jean-Michel Leon, Yawei Li, Alexander Lloyd, and Vadim Yushprakh.

Current status

Megalon isn't even prototype-quality yet, so you probably shouldn't download it unless you feel like writing a lot of Java. I hope to have it running as a prototype by the end of June 2011. At that point I will invite people to contribute code and ideas.

Megalon started as a school project and is now being written in my spare time. The code quality is not yet great, and some parts need to be rewritten.

Megalon basic ideas

Megalon (like Megastore) uses the Paxos distributed agreement protocol between all replicas to agree on a single total ordering of transactions. This means that all replicas have the same view of the data at all times. This guarantee makes it easy to write client applications: the ACID transaction is a programmer-friendly building block for data storage. The programmer doesn't have to wonder about questions like "what will be the state of the data if a batched write partially succeeded during a network partition followed by a server reboot?" Transactions either atomically commit, or don't.

In Megalon (as in Megastore) the full database is broken up into small entity groups. An entity group is a transactional domain: your transaction can only touch one entity group, and gets full ACID guarantees within that entity group. For example, your application could store all the data for a single user within one entity group, allowing you to run ACID transactions that touch only that user's data.
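
To make the entity-group scoping concrete, here is a minimal sketch of what a single-entity-group transaction could look like from client code. The API shown (MegalonClient, EntityGroup, beginTransaction, put, commit) is hypothetical, invented for illustration; it is not Megalon's actual interface.

```java
// Hypothetical client API, for illustration only -- these class and
// method names are not part of Megalon's real codebase.
MegalonClient client = new MegalonClient("replica-host:port");

// A transaction is scoped to exactly one entity group, e.g. one user's data.
EntityGroup alice = client.entityGroup("user:alice");

Transaction txn = alice.beginTransaction();
txn.put("profile/displayName".getBytes(), "Alice".getBytes());
txn.put("settings/locale".getBytes(), "en_US".getBytes());
txn.commit();  // all of these writes become visible atomically, or none do
```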

Megalon's support for transactions distinguishes it from many of the current generation of scalable non-SQL data storage systems. Transactionality has a cost in latency and availability, though (see below).

For example, suppose you're writing a Gmail-like application on top of Megalon. You might design your application so that there is one entity group per user that contains their data. When a user receives a message, you could run a transaction that:

  • Writes the message to the inbox
  • Increments the "unread messages" and "inbox messages" counters
  • Updates the user's storage quota information

All this can happen in a fully atomic way that's isolated from other concurrent transactions. Without transactions, the application programmer has to worry about failure scenarios such as "how can we guarantee that the quota information is always updated whenever a message is stored, even despite server failures halfway through?"
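
As a rough sketch, that delivery transaction might look like the following, reusing the hypothetical API from the sketch above. The key layout and the encodeLong/decodeLong helpers are also assumptions made for illustration.

```java
// Deliver one message to Bob's inbox inside a single entity group.
String messageId = "msg-0001";
byte[] rawMessage = "From: carol@example.com ...".getBytes();

EntityGroup bob = client.entityGroup("user:bob");
Transaction txn = bob.beginTransaction();

// 1. Write the message to the inbox.
txn.put(("inbox/" + messageId).getBytes(), rawMessage);

// 2. Increment the "unread messages" and "inbox messages" counters.
long unread = decodeLong(txn.get("counters/unreadMessages".getBytes()));
txn.put("counters/unreadMessages".getBytes(), encodeLong(unread + 1));
long inbox = decodeLong(txn.get("counters/inboxMessages".getBytes()));
txn.put("counters/inboxMessages".getBytes(), encodeLong(inbox + 1));

// 3. Update the user's storage quota information.
long bytesUsed = decodeLong(txn.get("quota/bytesUsed".getBytes()));
txn.put("quota/bytesUsed".getBytes(), encodeLong(bytesUsed + rawMessage.length));

// The commit either applies all of these writes atomically or none of them.
txn.commit();
```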

Of course, many applications don't need transactions. These applications might benefit from a simpler data storage system than Megalon.

Megalon tradeoffs

It's in the nature of distributed systems to suck. You can't have low latency, perfect fault tolerance, and perfect consistency in the same system. The fact that all replicas have the same consistent view of the data is very nice, but we should be clear about the cost that comes with this.

  • High latency: every transaction commit operation must communicate with a quorum of replicas. Cross-country latency in the US is about 70ms round-trip. If your replicas are distributed across multiple continents, you have even bigger problems. Megastore does offer fast local reads, though. If you prefer low write latency over strict consistency, go check out Cassandra or Riak. Applications can sometimes use tricks to hide latency from users (e.g. ack the write now, and actually do it later) but that's not always a good idea.
  • Partitions cause some unavailability: In order for a write to succeed, the writer must be able to communicate with a majority of replicas. So if your European data centers lose their connection to the North American data centers, at least one continent will contain annoyed users. Applications might be able to use tricks to hide failures from users, but of course that's not ideal.
  • Limited throughput per entity group: This design assumes that client applications will not need high write throughput to any one entity group. Because a round-trip between replicas is required to commit a transaction, no single entity group will be able to commit more than 1/X transactions per second if your inter-replica latency is X seconds (e.g. 50ms RTT => max theoretical throughput of 20 commits/sec per entity group; see the sketch after this list). Aggregate throughput across all entity groups is high, but not within a single entity group. If you can't meet your application's needs with a large number of low-throughput entity groups, Megalon isn't the tool for you.
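
The per-entity-group bound in the last point is just arithmetic. Here's a tiny self-contained example of the calculation, assuming a 50ms inter-replica round trip:

```java
public class EntityGroupThroughput {
    public static void main(String[] args) {
        // If every commit needs at least one inter-replica round trip of
        // rttSeconds, a single entity group can commit at most
        // 1/rttSeconds transactions per second.
        double rttSeconds = 0.050;                   // 50 ms round trip
        double maxCommitsPerSec = 1.0 / rttSeconds;  // => 20 commits/sec
        System.out.printf("Theoretical max: %.0f commits/sec per entity group%n",
                maxCommitsPerSec);
    }
}
```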