This repository has been archived by the owner on Sep 21, 2022. It is now read-only.

Commit

Adding Sharding.md reference, fixing list markup syntax.
alainjobart committed May 11, 2015
1 parent dcd87a5 commit fea04f1
Showing 9 changed files with 155 additions and 147 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -18,6 +18,7 @@ and a more [detailed presentation from @Scale '14](http://youtu.be/5yDO-tmIoXY).
### Intro
* [Helicopter overview](http://vitess.io):
high-level overview of Vitess that should tell you whether Vitess is for you.
* [Sharding in Vitess](http://vitess.io/doc/Sharding)
* [Frequently Asked Questions](http://vitess.io/doc/FAQ).

### Using Vitess
28 changes: 14 additions & 14 deletions doc/Concepts.md
Expand Up @@ -30,19 +30,19 @@ eventual consistency guarantees), run data analysis tools that take a long time
### Tablet

A tablet is a single server that runs:
- a MySQL instance
- a vttablet instance
- a local row cache instance
- any other per-db processes that are necessary for operational purposes
* a MySQL instance
* a vttablet instance
* a local row cache instance
* any other per-db processes that are necessary for operational purposes

It can be idle (not assigned to any keyspace) or assigned to a keyspace/shard. If it becomes unhealthy, it is usually changed to the scrap type.

It has a type. The commonly used types are:
- master: for the mysql master, RW database.
- replica: for a mysql slave that serves read-only traffic, with guaranteed low replication latency.
- rdonly: for a mysql slave that serves read-only traffic for backend processing jobs (like map-reduce type jobs). Replication latency is not guaranteed.
- spare: for a mysql slave not used at the moment (hot spare).
- experimental, schema, lag, backup, restore, checker, ... : various types for specific purposes.
* master: for the mysql master, RW database.
* replica: for a mysql slave that serves read-only traffic, with guaranteed low replication latency.
* rdonly: for a mysql slave that serves read-only traffic for backend processing jobs (like map-reduce type jobs). Replication latency is not guaranteed.
* spare: for a mysql slave not used at the moment (hot spare).
* experimental, schema, lag, backup, restore, checker, ... : various types for specific purposes.

Only master, replica and rdonly are advertised in the Serving Graph.
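
As a concrete illustration of how these types are used, an operator normally moves a tablet between serving types with vtctl. The sketch below is hedged: the tablet alias is made up, and the exact command name and flags can differ between Vitess versions.

```sh
# Park a replica as a hot spare (alias is a placeholder; substitute one from
# your deployment).
vtctl ChangeSlaveType test_cell-0000000101 spare

# Later, return it to serving low-latency read-only traffic.
vtctl ChangeSlaveType test_cell-0000000101 replica
```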

@@ -107,11 +107,11 @@ There is one local instance of that service per Cell (Data Center). The goal is
using the remaining Cells. (A Zookeeper instance running on 3 or 5 hosts locally is a good configuration.)

The data is partitioned as follows:
- Keyspaces: global instance
- Shards: global instance
- Tablets: local instances
- Serving Graph: local instances
- Replication Graph: the master alias is in the global instance, the master-slave map is in the local cells.
* Keyspaces: global instance
* Shards: global instance
* Tablets: local instances
* Serving Graph: local instances
* Replication Graph: the master alias is in the global instance, the master-slave map is in the local cells.

Clients are designed to read only the local Serving Graph, so they only need the local instance to be up.
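
For example, assuming a keyspace named test_keyspace and a cell named test_cell (both placeholders), the global/local split can be explored with the vtctl read commands sketched below; treat the exact command names as a sketch, since they vary by Vitess version.

```sh
# Global topology: keyspace and shard records.
vtctl GetKeyspace test_keyspace
vtctl GetShard test_keyspace/0

# Local (per-cell) topology: tablet records and the serving graph for one cell.
vtctl ListAllTablets test_cell
vtctl GetTablet test_cell-0000000100
vtctl GetSrvKeyspace test_cell test_keyspace
```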

10 changes: 5 additions & 5 deletions doc/HelicopterOverview.md
@@ -6,30 +6,30 @@ and what kind of effort it would require for you to start using Vitess.

## When do you need Vitess?

- You store all your data in a MySQL database, and have a significant number of
* You store all your data in a MySQL database, and have a significant number of
clients. At some point, you start getting "Too many connections" errors from
MySQL, so you have to change the max_connections system variable. Every MySQL
connection has a memory overhead, which is just below 3 MB in the default
configuration. If you want 1500 additional connections, you will need over 4 GB
of additional RAM – and this is not going to contribute to faster queries.

- From time to time, your developers make mistakes. For example, they make your
* From time to time, your developers make mistakes. For example, they make your
app issue a query without setting a LIMIT, which makes the database slow for
all users. Or maybe they issue updates that break statement-based replication.
Whenever you see such a query, you react, but it usually takes some time and
effort to get the story straight.

- You store your data in a MySQL database, and your database has grown
* You store your data in a MySQL database, and your database has grown
uncomfortably big. You are planning to do some horizontal sharding. MySQL
doesn’t support sharding, so you will have to write the code to perform the
sharding, and then bake all the sharding logic into your app.

- You run a MySQL cluster, and use replication for availability: you have a master
* You run a MySQL cluster, and use replication for availability: you have a master
database and a few replicas, and in the case of a master failure some replica
should become the new master. You have to manage the lifecycle of the databases,
and communicate the current state of the system to the application.

- You run a MySQL cluster, and have custom database configurations for different
* You run a MySQL cluster, and have custom database configurations for different
workloads. There’s the master where all the writes go, fast read-only replicas
for web clients, slower read-only replicas for batch jobs, and another kind of
slower replicas for backups. If you have horizontal sharding, this setup is
8 changes: 4 additions & 4 deletions doc/HorizontalReshardingGuide.md
@@ -122,7 +122,7 @@ vtctl MigrateServedTypes -reverse test_keyspace/0 replica
## Scrap the source shard

If all the above steps were successful, it’s safe to remove the source shard (which should no longer be in use):
- For each tablet in the source shard: `vtctl ScrapTablet <source tablet alias>`
- For each tablet in the source shard: `vtctl DeleteTablet <source tablet alias>`
- Rebuild the serving graph: `vtctl RebuildKeyspaceGraph test_keyspace`
- Delete the source shard: `vtctl DeleteShard test_keyspace/0`
* For each tablet in the source shard: `vtctl ScrapTablet <source tablet alias>`
* For each tablet in the source shard: `vtctl DeleteTablet <source tablet alias>`
* Rebuild the serving graph: `vtctl RebuildKeyspaceGraph test_keyspace`
* Delete the source shard: `vtctl DeleteShard test_keyspace/0`
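
Put together, the cleanup above is a short sequence of vtctl calls. The tablet aliases below are placeholders; note that the older revision of this step used ScrapTablet where the current one uses DeleteTablet.

```sh
# Remove each tablet that belonged to the now-unused source shard.
for tablet in test_cell-0000000100 test_cell-0000000101 test_cell-0000000102; do
  vtctl DeleteTablet "$tablet"
done

# Rebuild the serving graph, then drop the empty source shard.
vtctl RebuildKeyspaceGraph test_keyspace
vtctl DeleteShard test_keyspace/0
```
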
58 changes: 29 additions & 29 deletions doc/Reparenting.md
@@ -54,29 +54,29 @@ live system, it errs on the side of safety, and will abort if any
tablet is not responding properly.

The actions performed are:
- any existing tablet replication is stopped. If any tablet fails
* any existing tablet replication is stopped. If any tablet fails
(because it is not available or not succeeding), we abort.
- the master-elect is initialized as a master.
- in parallel for each tablet, we do:
- on the master-elect, we insert an entry in a test table.
- on the slaves, we set the master, and wait for the entry in the test table.
- if any tablet fails, we error out.
- we then rebuild the serving graph for the shard.
* the master-elect is initialized as a master.
* in parallel for each tablet, we do:
* on the master-elect, we insert an entry in a test table.
* on the slaves, we set the master, and wait for the entry in the test table.
* if any tablet fails, we error out.
* we then rebuild the serving graph for the shard.
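
The test-table step in the flow above is what proves that each slave is actually replicating from the master-elect. A rough sketch of the idea follows; the database, table, and marker names are made up (the table Vitess actually uses is internal and version-specific), and connection flags are omitted.

```sh
# On the master-elect: write a marker row for this reparent operation
# (database and table names are placeholders).
mysql -e "CREATE DATABASE IF NOT EXISTS vt_test"
mysql -e "CREATE TABLE IF NOT EXISTS vt_test.reparent_marker (op_id VARCHAR(64) PRIMARY KEY)"
mysql -e "INSERT INTO vt_test.reparent_marker (op_id) VALUES ('reparent-20150511-001')"

# On each slave, after pointing replication at the master-elect: the marker
# row showing up locally proves the slave is replicating from the new master.
until mysql -N -e "SELECT 1 FROM vt_test.reparent_marker WHERE op_id='reparent-20150511-001'" | grep -q 1; do
  sleep 1
done
```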

### Planned Reparents: vtctl PlannedReparentShard

This command is used when both the current master and the new master
are alive and functioning properly.

The actions performed are:
- we tell the old master to go read-only. It then shuts down its query
* we tell the old master to go read-only. It then shuts down its query
service. We get its replication position back.
- we tell the master-elect to wait for that replication data, and then
* we tell the master-elect to wait for that replication data, and then
start being the master.
- in parallel for each tablet, we do:
- on the master-elect, we insert an entry in a test table. If that
* in parallel for each tablet, we do:
* on the master-elect, we insert an entry in a test table. If that
works, we update the MasterAlias record of the global Shard object.
- on the slaves (including the old master), we set the master, and
* on the slaves (including the old master), we set the master, and
wait for the entry in the test table. (if a slave wasn't
replicating, we don't change its state and don't start replication
after reparent)
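
A typical invocation is sketched below. The keyspace/shard and tablet alias are placeholders, and the argument syntax has changed across vtctl releases, so check your build's usage output before relying on it.

```sh
# Promote test_cell-0000000102 to master of shard test_keyspace/0 while the
# old master is still alive and cooperating (placeholders throughout; the
# argument syntax varies by Vitess version).
vtctl PlannedReparentShard test_keyspace/0 test_cell-0000000102
```
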
@@ -95,15 +95,15 @@ just make sure the master-elect is the most advanced in replication
within all the available slaves, and reparent everybody.

The actions performed are:
- if the current master is still alive, we scrap it. That will make it
* if the current master is still alive, we scrap it. That will make it
stop what it's doing, stop its query service, and be unusable.
- we gather the current replication position on all slaves.
- we make sure the master-elect has the most advanced position.
- we promote the master-elect.
- in parallel for each tablet, we do:
- on the master-elect, we insert an entry in a test table. If that
* we gather the current replication position on all slaves.
* we make sure the master-elect has the most advanced position.
* we promote the master-elect.
* in parallel for each tablet, we do:
* on the master-elect, we insert an entry in a test table. If that
works, we update the MasterAlias record of the global Shard object.
- on the slaves (excluding the old master), we set the master, and
* on the slaves (excluding the old master), we set the master, and
wait for the entry in the test table. (if a slave wasn't
replicating, we don't change its state and don't start replication
after reparent)
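
Assuming this emergency flow is exposed as vtctl EmergencyReparentShard (the emergency counterpart to PlannedReparentShard above; treat the command name and arguments as an assumption), an invocation would look roughly like this.

```sh
# Force-promote test_cell-0000000103 after the current master has died.
# Placeholders throughout; the command name and argument syntax are assumed
# here and vary by Vitess version.
vtctl EmergencyReparentShard test_keyspace/0 test_cell-0000000103
```
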
@@ -121,33 +121,33 @@ servers. We then trigger the 'vtctl TabletExternallyReparented'
command.

The flow for that command is as follows:
- the shard is locked in the global topology server.
- we read the Shard object from the global topology server.
- we read all the tablets in the replication graph for the shard. Note
* the shard is locked in the global topology server.
* we read the Shard object from the global topology server.
* we read all the tablets in the replication graph for the shard. Note
we allow partial reads here, so if a data center is down, as long as
the data center containing the new master is up, we keep going.
- the new master performs a 'SlaveWasPromoted' action. This remote
* the new master performs a 'SlaveWasPromoted' action. This remote
action makes sure the new master is not a MySQL slave of another
server (the 'show slave status' command should not return anything,
meaning 'reset slave' should have been called).
- for every host in the replication graph, we call the
* for every host in the replication graph, we call the
'SlaveWasRestarted' action. It takes as parameter the address of the
new master. On each slave, we update the topology server record for
that tablet with the new master, and the replication graph for that
tablet as well.
- for the old master, if it doesn't successfully return from
* for the old master, if it doesn't successfully return from
'SlaveWasRestarted', we change its type to 'spare' (so a dead old
master doesn't interfere).
- we then update the Shard object with the new master.
- we rebuild the serving graph for that shard. This will update the
* we then update the Shard object with the new master.
* we rebuild the serving graph for that shard. This will update the
'master' record for sure, and also keep all the tablets that have
successfully reparented.
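
In other words, once an external tool has already repointed MySQL replication, Vitess only needs to be told which tablet ended up as the master so the topology and serving graph can catch up. A sketch with a placeholder alias:

```sh
# The external failover tool has already promoted the MySQL instance behind
# this tablet; tell Vitess about the result (alias is a placeholder).
vtctl TabletExternallyReparented test_cell-0000000104
```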

Failure cases:
- The global topology server has to be available for locking and
* The global topology server has to be available for locking and
modification during this operation. If not, the operation will just
fail.
- If a single topology server is down in one data center (and it's not
* If a single topology server is down in one data center (and it's not
the master data center), the tablets in that data center will be
ignored by the reparent. When the topology server comes back up,
just re-run 'vtctl InitTablet' on the tablets, and that will fix
61 changes: 33 additions & 28 deletions doc/Resharding.md
@@ -1,35 +1,40 @@
# Resharding

In Vitess, resharding describes the process of re-organizing data dynamically, with very minimal downtime (we manage to
completely perform most data transitions with less than 5 seconds of read-only downtime - new data cannot be written,
existing data can still be read).
In Vitess, resharding describes the process of re-organizing data
dynamically, with very minimal downtime (we manage to completely
perform most data transitions with less than 5 seconds of read-only
downtime - new data cannot be written, existing data can still be
read).

See the description of [how Sharding works in Vitess](Sharding.md) for
higher level concepts on Sharding.

## Process

For a step-by-step guide on how to shard a keyspace, see [this page](HorizontalReshardingGuide.md).

In general, the process to achieve this goal is composed of the following steps:
- pick the original shard(s)
- pick the destination shard(s) coverage
- create the destination shard(s) tablets (in a mode where they are not used to serve traffic yet)
- bring up the destination shard(s) tablets, with read-only masters.
- backup and split the data from the original shard(s)
- merge and import the data on the destination shard(s)
- start and run filtered replication from original to destination shard(s), catch up
- move the read-only traffic to the destination shard(s), stop serving read-only traffic from original shard(s). This transition can take a few hours. We might want to move rdonly separately from replica traffic.
- in quick succession:
- make original master(s) read-only
- flush filtered replication on all filtered replication source servers (after making sure they were caught up with their masters)
- wait until replication is caught up on all destination shard(s) masters
- move the write traffic to the destination shard(s)
- make destination master(s) read-write
- scrap the original shard(s)
* pick the original shard(s)
* pick the destination shard(s) coverage
* create the destination shard(s) tablets (in a mode where they are not used to serve traffic yet)
* bring up the destination shard(s) tablets, with read-only masters.
* backup and split the data from the original shard(s)
* merge and import the data on the destination shard(s)
* start and run filtered replication from original to destination shard(s), catch up
* move the read-only traffic to the destination shard(s), stop serving read-only traffic from original shard(s). This transition can take a few hours. We might want to move rdonly separately from replica traffic.
* in quick succession:
* make original master(s) read-only
* flush filtered replication on all filtered replication source servers (after making sure they were caught up with their masters)
* wait until replication is caught up on all destination shard(s) masters
* move the write traffic to the destination shard(s)
* make destination master(s) read-write
* scrap the original shard(s)
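
The traffic-move steps above map onto vtctl MigrateServedTypes calls against the source shard, one served type at a time, in the same spirit as the command shown in the horizontal resharding guide. Shard and keyspace names below are placeholders.

```sh
# Move batch-job traffic first, then low-latency replica reads, and finally
# writes; the master migration is the step that needs the brief read-only
# window described above (test_keyspace/0 is the source shard, a placeholder).
vtctl MigrateServedTypes test_keyspace/0 rdonly
vtctl MigrateServedTypes test_keyspace/0 replica
vtctl MigrateServedTypes test_keyspace/0 master
```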

## Applications

The main applications we currently support:
- in a sharded keyspace, split or merge shards (horizontal sharding)
- in a non-sharded keyspace, break out some tables into a different keyspace (vertical sharding)
* in a sharded keyspace, split or merge shards (horizontal sharding)
* in a non-sharded keyspace, break out some tables into a different keyspace (vertical sharding)

With these supported features, it is very easy to start with a single keyspace containing all the data (multiple tables),
and then as the data grows, move tables to different keyspaces, start sharding some keyspaces, ... without any real
@@ -38,17 +43,17 @@ downtime for the application.
## Scaling Up and Down

Here is a quick guide to what to do with Vitess when a change is required:
- uniformly increase read capacity: add replicas, or split shards
- uniformly increase write capacity: split shards
- reclaim free space: merge shards / keyspaces
- increase geo-diversity: add new cells and new replicas
- cool a hot tablet: for read access, add replicas or split shards; for write access, split shards.
* uniformly increase read capacity: add replicas, or split shards
* uniformly increase write capacity: split shards
* reclaim free space: merge shards / keyspaces
* increase geo-diversity: add new cells and new replicas
* cool a hot tablet: for read access, add replicas or split shards; for write access, split shards.

## Filtered Replication

The cornerstone of Resharding is being able to replicate the right data. MySQL doesn't support any filtering, so the
Vitess project implements it entirely:
- the tablet server tags transactions with comments that describe the scope of the statements (which keyspace_id,
* the tablet server tags transactions with comments that describe the scope of the statements (which keyspace_id,
which table, ...). That way the MySQL binlogs contain all filtering data.
- a server process can filter and stream the MySQL binlogs (using the comments).
- a client process can apply the filtered logs locally (they are just regular SQL statements at this point).
* a server process can filter and stream the MySQL binlogs (using the comments).
* a client process can apply the filtered logs locally (they are just regular SQL statements at this point).
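
To make the first point concrete: the idea is that the DML statements the tablet server sends to MySQL carry their routing scope in comments, so a binlog filter can decide what to forward without any extra metadata. The comment format and database name below are illustrative only, not the actual syntax Vitess emits.

```sh
# Illustrative only: the real comment format is internal to Vitess and
# version-specific. The point is that table and keyspace_id scope travel
# through the binlog together with the statement itself.
mysql test_db -e "INSERT INTO messages (user_id, msg) VALUES (12345, 'hi') /* filter data: table=messages keyspace_id=0x80a1b2c3 */"
```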
