From dcd87a541d891b690daa971985014e3eb5a1312c Mon Sep 17 00:00:00 2001
From: Alain Jobart
Date: Mon, 11 May 2015 10:40:26 -0700
Subject: [PATCH 1/2] Adding sharding doc.

---
 doc/Sharding.md | 165 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 165 insertions(+)
 create mode 100644 doc/Sharding.md

diff --git a/doc/Sharding.md b/doc/Sharding.md
new file mode 100644
index 00000000000..d33c914f270
--- /dev/null
+++ b/doc/Sharding.md
@@ -0,0 +1,165 @@
+# Sharding in Vitess
+
+This document describes the various options for sharding in Vitess.
+
+## Range-based Sharding
+
+This is the out-of-the-box sharding solution for Vitess. Each record
+in the database has a Sharding Key value associated with it (and
+stored in that row). Two records with the same Sharding Key are always
+collocated on the same shard (allowing joins, for instance). Two
+records with different Sharding Keys are not necessarily on the same
+shard.
+
+In a Keyspace, each Shard then contains a range of Sharding Key
+values. The full set of Shards covers the entire range.
+
+In this environment, query routing needs to figure out the Sharding
+Key or Keys for each query, and then route the query properly to the
+appropriate shard(s). We achieve this by providing either a sharding
+key value (known as KeyspaceID in the API), or a sharding key range
+(KeyRange). We are also developing more ways to route queries
+automatically with [version 3 of our API](VTGateV3.md), where we store
+more metadata in our topology to understand where to route queries.
+
+### Sharding Key
+
+Sharding Keys need to be compared to each other to enable range-based
+sharding, as we need to figure out if a value is within a Shard's
+range.
+
+Vitess was designed to allow two types of sharding keys:
+* Binary data: just an array of bytes. We use regular byte array
+  comparison here. This type can be used for strings. The MySQL
+  representation is a VARBINARY field.
+* 64-bit unsigned integer: we first convert the 64-bit integer into
+  a byte array (by just copying the bytes, most significant byte
+  first, into 8 bytes). Then we apply the same byte array
+  comparison. The MySQL representation is a bigint(20) UNSIGNED field.
+
+A sharded keyspace contains information about the type of sharding key
+(binary data or 64-bit unsigned integer), and the column name for the
+Sharding Key (that is in every table).
+
+To guarantee a balanced use of the shards, the Sharding Keys should be
+evenly distributed in their space (if half the records of a Keyspace
+map to a single sharding key, all these records will always be on the
+same shard, making it impossible to split).
+
+A common example of a sharding key in a web site that serves millions
+of users is the 64-bit hash of the User ID. By using a hashing
+function, the Sharding Keys are evenly distributed in the space. By
+using a very granular sharding key, one could shard all the way down
+to one user per shard.
+
+Comparison of Sharding Keys can be a bit tricky, but if one remembers
+that they are always converted to byte arrays, and then compared, it
+becomes easier. For instance, the value [ 0x80 ] is the middle value
+for Sharding Keys. All binary sharding keys that start with a byte
+strictly lower than 0x80 are lower, and all binary sharding keys that
+start with a byte equal to or greater than 0x80 are higher. Note that
+[ 0x80 ] can also be compared to 64-bit unsigned integer values: any
+value smaller than 0x8000000000000000 is smaller, and any value equal
+to or greater is greater.
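+
+To make the comparison rule concrete, here is a small standalone Go
+sketch (an illustration only, not the actual Vitess key package): it
+converts a 64-bit unsigned integer Sharding Key into its 8-byte,
+most-significant-byte-first form and then uses plain byte array
+comparison, which is how both key types end up ordered against the
+[ 0x80 ] middle value:
+
+```go
+package main
+
+import (
+    "bytes"
+    "encoding/binary"
+    "fmt"
+)
+
+// uint64Key converts a 64-bit unsigned integer sharding key into the
+// 8-byte, most significant byte first, representation described above.
+func uint64Key(v uint64) []byte {
+    b := make([]byte, 8)
+    binary.BigEndian.PutUint64(b, v)
+    return b
+}
+
+func main() {
+    mid := []byte{0x80} // the middle value for Sharding Keys
+
+    // A binary key starting with a byte strictly lower than 0x80 is lower.
+    fmt.Println(bytes.Compare([]byte{0x7f, 0xff}, mid)) // -1
+
+    // Integer keys are compared through their byte representation:
+    // values below 0x8000000000000000 sort before [ 0x80 ], values at
+    // or above it sort after.
+    fmt.Println(bytes.Compare(uint64Key(0x7fffffffffffffff), mid)) // -1
+    fmt.Println(bytes.Compare(uint64Key(0x8000000000000000), mid)) // 1
+}
+```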
+
+### Key Range
+
+A Key Range has a Start and an End value. A value is inside the Key
+Range if it is greater than or equal to the Start, and strictly less
+than the End.
+
+Two Key Ranges are consecutive if the End of the first one is equal to
+the Start of the next one.
+
+Two special values exist:
+* if a Start is empty, it represents the lowest value, and all values
+  are greater than it.
+* if an End is empty, it represents the highest value, and all values
+  are strictly lower than it.
+
+Examples:
+* Start=[], End=[]: full Key Range
+* Start=[], End=[0x80]: lower half of the Key Range.
+* Start=[0x80], End=[]: upper half of the Key Range.
+* Start=[0x40], End=[0x80]: second quarter of the Key Range.
+
+As noted previously, this can be used for both binary and uint64
+Sharding Keys. For uint64 Sharding Keys, the single byte number
+represents the most significant 8 bits of the value.
+
+### Range-based Shard Name
+
+The Name of a Shard, when it is part of a Range-based sharded
+keyspace, is the Start and End of its Key Range, printed in
+hexadecimal, and separated by a hyphen.
+
+For instance, if Start is the array of bytes [ 0x80 ] and End is the
+array of bytes [ 0xc0 ], then the name of the Shard will be: 80-c0
+
+We will use this convention in the rest of this document.
+
+### Sharding Key Partition
+
+A partition represents a set of Key Ranges that covers the entire
+space. For instance, the following four shards are a valid full
+partition:
+* -40
+* 40-80
+* 80-c0
+* c0-
+
+When we build the serving graph for a given Range-based Sharded
+Keyspace, we ensure the Shards are valid and cover the full space.
+
+During resharding, we can split or merge consecutive Shards with very
+minimal downtime.
+
+### Resharding
+
+Vitess provides a set of tools and processes to deal with Range-Based
+Shards:
+* [Dynamic resharding](Resharding.md) allows splitting or merging of shards with no
+  read downtime, and very minimal master unavailability (<5s).
+* Client APIs are designed to take sharding into account.
+* [Map-reduce framework](https://github.com/youtube/vitess/blob/master/java/vtgate-client/src/main/java/com/youtube/vitess/vtgate/hadoop/README.md) fully utilizes the Key Ranges to read data as
+  fast as possible, concurrently from all shards and all replicas.
+
+### Cross-Shard Indexes
+
+The 'primary key' for sharded data is the Sharding Key value. In order
+to look up data with another index, it is very straightforward to
+create a lookup table in a different Keyspace that maps that other
+index to the Sharding Key.
+
+For instance, if the User ID is hashed into the Sharding Key in a User
+keyspace, adding a User Name to User ID lookup table in a different
+Lookup keyspace makes it possible to route queries by User Name the
+right way as well.
+
+With the current version of the API, Cross-Shard indexes have to be
+handled at the application layer. However, [version 3 of our API](VTGateV3.md)
+will provide multiple ways to solve this without application layer changes.
+
+## Custom Sharding
+
+This is designed to be used if your application already has support
+for sharding, or if you want to control exactly which shard the data
+goes to. In that use case, each Keyspace just has a collection of
+shards, and the client code always specifies which shard to talk to.
+
+The shards can use any name, and they are always addressed by name.
+The API calls to vtgate are ExecuteShard, ExecuteBatchShard and
+StreamExecuteShard. None of the *KeyspaceIds, *KeyRanges or *EntityIds
+API calls can be used.
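+
+As a rough sketch of what this looks like from the application's side,
+the Go snippet below picks a shard name itself and then issues a
+shard-addressed call. The ShardExecutor interface and the modulo
+routing scheme are hypothetical placeholders for whatever vtgate
+client wrapper and routing logic the application already has; they are
+not part of the Vitess client API:
+
+```go
+package main
+
+import "fmt"
+
+// ShardExecutor is a hypothetical wrapper around a vtgate client that
+// exposes shard-addressed execution (ExecuteShard and friends).
+type ShardExecutor interface {
+    ExecuteShard(keyspace, shard, sql string, bindVars map[string]interface{}) error
+}
+
+// pickShard is the application's own routing logic: with Custom
+// Sharding, Vitess does not choose the shard, the client code does.
+func pickShard(shards []string, userID uint64) string {
+    return shards[userID%uint64(len(shards))]
+}
+
+// loggingExecutor only prints the call; a real implementation would
+// forward it to vtgate.
+type loggingExecutor struct{}
+
+func (loggingExecutor) ExecuteShard(keyspace, shard, sql string, bindVars map[string]interface{}) error {
+    fmt.Printf("ExecuteShard %s/%s: %s %v\n", keyspace, shard, sql, bindVars)
+    return nil
+}
+
+func main() {
+    shards := []string{"shard_a", "shard_b", "shard_c"} // any names work
+    var exec ShardExecutor = loggingExecutor{}
+    shard := pickShard(shards, 12345)
+    exec.ExecuteShard("user_keyspace", shard,
+        "INSERT INTO users (id, name) VALUES (:id, :name)",
+        map[string]interface{}{"id": uint64(12345), "name": "alice"})
+}
+```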
+ +The Map-Reduce framework can still iterate over the data across multiple shards. + +Also, none of the automated resharding tools and processes that Vitess +provides for Range-Based sharding can be used here. + +Note: the *Shard API calls are not exposed by all clients at the +moment. This is going to be fixed soon. + +### Custom Sharding Example: Lookup-Based Sharding + +One example of Custom Sharding is Lookup based: one keyspace is used +as a lookup keyspace, and contains the mapping between the identifying +key of a record, and the shard name it is on. When accessing the +records, the client first needs to find the shard name by looking it +up in the lookup table, and then knows where to route queries. From fea04f15037f5eebd9f4f9fe1c781d9d015d24ee Mon Sep 17 00:00:00 2001 From: Alain Jobart Date: Mon, 11 May 2015 10:46:44 -0700 Subject: [PATCH 2/2] Adding Sharding.md reference, fixing list markup syntax. --- README.md | 1 + doc/Concepts.md | 28 ++++++------ doc/HelicopterOverview.md | 10 ++-- doc/HorizontalReshardingGuide.md | 8 ++-- doc/Reparenting.md | 58 ++++++++++++------------ doc/Resharding.md | 61 +++++++++++++------------ doc/SchemaManagement.md | 54 +++++++++++----------- doc/Tools.md | 4 +- doc/TopologyService.md | 78 ++++++++++++++++---------------- 9 files changed, 155 insertions(+), 147 deletions(-) diff --git a/README.md b/README.md index 8fcbe816576..bc1b9995090 100644 --- a/README.md +++ b/README.md @@ -18,6 +18,7 @@ and a more [detailed presentation from @Scale '14](http://youtu.be/5yDO-tmIoXY). ### Intro * [Helicopter overview](http://vitess.io): high level overview of Vitess that should tell you whether Vitess is for you. + * [Sharding in Vitess](http://vitess.io/doc/Sharding) * [Frequently Asked Questions](http://vitess.io/doc/FAQ). ### Using Vitess diff --git a/doc/Concepts.md b/doc/Concepts.md index 52fbe723eda..b6c87ce9338 100644 --- a/doc/Concepts.md +++ b/doc/Concepts.md @@ -30,19 +30,19 @@ eventual consistency guarantees), run data analysis tools that take a long time ### Tablet A tablet is a single server that runs: -- a MySQL instance -- a vttablet instance -- a local row cache instance -- an other per-db process that is necessary for operational purposes +* a MySQL instance +* a vttablet instance +* a local row cache instance +* an other per-db process that is necessary for operational purposes It can be idle (not assigned to any keyspace), or assigned to a keyspace/shard. If it becomes unhealthy, it is usually changed to scrap. It has a type. The commonly used types are: -- master: for the mysql master, RW database. -- replica: for a mysql slave that serves read-only traffic, with guaranteed low replication latency. -- rdonly: for a mysql slave that serves read-only traffic for backend processing jobs (like map-reduce type jobs). It has no real guaranteed replication latency. -- spare: for a mysql slave not used at the moment (hot spare). -- experimental, schema, lag, backup, restore, checker, ... : various types for specific purposes. +* master: for the mysql master, RW database. +* replica: for a mysql slave that serves read-only traffic, with guaranteed low replication latency. +* rdonly: for a mysql slave that serves read-only traffic for backend processing jobs (like map-reduce type jobs). It has no real guaranteed replication latency. +* spare: for a mysql slave not used at the moment (hot spare). +* experimental, schema, lag, backup, restore, checker, ... : various types for specific purposes. 
Only master, replica and rdonly are advertised in the Serving Graph. @@ -107,11 +107,11 @@ There is one local instance of that service per Cell (Data Center). The goal is using the remaining Cells. (a Zookeeper instance running on 3 or 5 hosts locally is a good configuration). The data is partitioned as follows: -- Keyspaces: global instance -- Shards: global instance -- Tablets: local instances -- Serving Graph: local instances -- Replication Graph: the master alias is in the global instance, the master-slave map is in the local cells. +* Keyspaces: global instance +* Shards: global instance +* Tablets: local instances +* Serving Graph: local instances +* Replication Graph: the master alias is in the global instance, the master-slave map is in the local cells. Clients are designed to just read the local Serving Graph, therefore they only need the local instance to be up. diff --git a/doc/HelicopterOverview.md b/doc/HelicopterOverview.md index 83aef79945a..0da572c5e4c 100644 --- a/doc/HelicopterOverview.md +++ b/doc/HelicopterOverview.md @@ -6,30 +6,30 @@ and what kind of effort it would require for you to start using Vitess. ## When do you need Vitess? - - You store all your data in a MySQL database, and have a significant number of + * You store all your data in a MySQL database, and have a significant number of clients. At some point, you start getting Too many connections errors from MySQL, so you have to change the max_connections system variable. Every MySQL connection has a memory overhead, which is just below 3 MB in the default configuration. If you want 1500 additional connections, you will need over 4 GB of additional RAM – and this is not going to be contributing to faster queries. - - From time to time, your developers make mistakes. For example, they make your + * From time to time, your developers make mistakes. For example, they make your app issue a query without setting a LIMIT, which makes the database slow for all users. Or maybe they issue updates that break statement based replication. Whenever you see such a query, you react, but it usually takes some time and effort to get the story straight. - - You store your data in a MySQL database, and your database has grown + * You store your data in a MySQL database, and your database has grown uncomfortably big. You are planning to do some horizontal sharding. MySQL doesn’t support sharding, so you will have write the code to perform the sharding, and then bake all the sharding logic into your app. - - You run a MySQL cluster, and use replication for availability: you have a master + * You run a MySQL cluster, and use replication for availability: you have a master database and a few replicas, and in the case of a master failure some replica should become the new master. You have to manage the lifecycle of the databases, and communicate the current state of the system to the application. - - You run a MySQL cluster, and have custom database configurations for different + * You run a MySQL cluster, and have custom database configurations for different workloads. There’s the master where all the writes go, fast read-only replicas for web clients, slower read-only replicas for batch jobs, and another kind of slower replicas for backups. 
If you have horizontal sharding, this setup is diff --git a/doc/HorizontalReshardingGuide.md b/doc/HorizontalReshardingGuide.md index cee8397b43e..9b0215a656f 100644 --- a/doc/HorizontalReshardingGuide.md +++ b/doc/HorizontalReshardingGuide.md @@ -122,7 +122,7 @@ vtctl MigrateServedTypes -reverse test_keyspace/0 replica ## Scrap the source shard If all the above steps were successful, it’s safe to remove the source shard (which should no longer be in use): -- For each tablet in the source shard: `vtctl ScrapTablet ` -- For each tablet in the source shard: `vtctl DeleteTablet ` -- Rebuild the serving graph: `vtctl RebuildKeyspaceGraph test_keyspace` -- Delete the source shard: `vtctl DeleteShard test_keyspace/0` +* For each tablet in the source shard: `vtctl ScrapTablet ` +* For each tablet in the source shard: `vtctl DeleteTablet ` +* Rebuild the serving graph: `vtctl RebuildKeyspaceGraph test_keyspace` +* Delete the source shard: `vtctl DeleteShard test_keyspace/0` diff --git a/doc/Reparenting.md b/doc/Reparenting.md index 0be0a427a9b..341447986db 100644 --- a/doc/Reparenting.md +++ b/doc/Reparenting.md @@ -54,14 +54,14 @@ live system, it errs on the side of safety, and will abort if any tablet is not responding right. The actions performed are: -- any existing tablet replication is stopped. If any tablet fails +* any existing tablet replication is stopped. If any tablet fails (because it is not available or not succeeding), we abort. -- the master-elect is initialized as a master. -- in parallel for each tablet, we do: - - on the master-elect, we insert an entry in a test table. - - on the slaves, we set the master, and wait for the entry in the test table. -- if any tablet fails, we error out. -- we then rebuild the serving graph for the shard. +* the master-elect is initialized as a master. +* in parallel for each tablet, we do: + * on the master-elect, we insert an entry in a test table. + * on the slaves, we set the master, and wait for the entry in the test table. +* if any tablet fails, we error out. +* we then rebuild the serving graph for the shard. ### Planned Reparents: vtctl PlannedReparentShard @@ -69,14 +69,14 @@ This command is used when both the current master and the new master are alive and functioning properly. The actions performed are: -- we tell the old master to go read-only. It then shuts down its query +* we tell the old master to go read-only. It then shuts down its query service. We get its replication position back. -- we tell the master-elect to wait for that replication data, and then +* we tell the master-elect to wait for that replication data, and then start being the master. -- in parallel for each tablet, we do: - - on the master-elect, we insert an entry in a test table. If that +* in parallel for each tablet, we do: + * on the master-elect, we insert an entry in a test table. If that works, we update the MasterAlias record of the global Shard object. - - on the slaves (including the old master), we set the master, and + * on the slaves (including the old master), we set the master, and wait for the entry in the test table. (if a slave wasn't replicating, we don't change its state and don't start replication after reparent) @@ -95,15 +95,15 @@ just make sure the master-elect is the most advanced in replication within all the available slaves, and reparent everybody. The actions performed are: -- if the current master is still alive, we scrap it. That will make it +* if the current master is still alive, we scrap it. 
That will make it stop what it's doing, stop its query service, and be unusable. -- we gather the current replication position on all slaves. -- we make sure the master-elect has the most advanced position. -- we promote the master-elect. -- in parallel for each tablet, we do: - - on the master-elect, we insert an entry in a test table. If that +* we gather the current replication position on all slaves. +* we make sure the master-elect has the most advanced position. +* we promote the master-elect. +* in parallel for each tablet, we do: + * on the master-elect, we insert an entry in a test table. If that works, we update the MasterAlias record of the global Shard object. - - on the slaves (excluding the old master), we set the master, and + * on the slaves (excluding the old master), we set the master, and wait for the entry in the test table. (if a slave wasn't replicating, we don't change its state and don't start replication after reparent) @@ -121,33 +121,33 @@ servers. We then trigger the 'vtctl TabletExternallyReparented' command. The flow for that command is as follows: -- the shard is locked in the global topology server. -- we read the Shard object from the global topology server. -- we read all the tablets in the replication graph for the shard. Note +* the shard is locked in the global topology server. +* we read the Shard object from the global topology server. +* we read all the tablets in the replication graph for the shard. Note we allow partial reads here, so if a data center is down, as long as the data center containing the new master is up, we keep going. -- the new master performs a 'SlaveWasPromoted' action. This remote +* the new master performs a 'SlaveWasPromoted' action. This remote action makes sure the new master is not a MySQL slave of another server (the 'show slave status' command should not return anything, meaning 'reset slave' should have been called). -- for every host in the replication graph, we call the +* for every host in the replication graph, we call the 'SlaveWasRestarted' action. It takes as parameter the address of the new master. On each slave, we update the topology server record for that tablet with the new master, and the replication graph for that tablet as well. -- for the old master, if it doesn't successfully return from +* for the old master, if it doesn't successfully return from 'SlaveWasRestarted', we change its type to 'spare' (so a dead old master doesn't interfere). -- we then update the Shard object with the new master. -- we rebuild the serving graph for that shard. This will update the +* we then update the Shard object with the new master. +* we rebuild the serving graph for that shard. This will update the 'master' record for sure, and also keep all the tablets that have successfully reparented. Failure cases: -- The global topology server has to be available for locking and +* The global topology server has to be available for locking and modification during this operation. If not, the operation will just fail. -- If a single topology server is down in one data center (and it's not +* If a single topology server is down in one data center (and it's not the master data center), the tablets in that data center will be ignored by the reparent. 
When the topology server comes back up, just re-run 'vtctl InitTablet' on the tablets, and that will fix diff --git a/doc/Resharding.md b/doc/Resharding.md index edbc83fa410..2d80da249bb 100644 --- a/doc/Resharding.md +++ b/doc/Resharding.md @@ -1,35 +1,40 @@ # Resharding -In Vitess, resharding describes the process of re-organizing data dynamically, with very minimal downtime (we manage to -completely perform most data transitions with less than 5 seconds of read-only downtime - new data cannot be written, -existing data can still be read). +In Vitess, resharding describes the process of re-organizing data +dynamically, with very minimal downtime (we manage to completely +perform most data transitions with less than 5 seconds of read-only +downtime - new data cannot be written, existing data can still be +read). + +See the description of [how Sharding works in Vitess](Sharding.md) for +higher level concepts on Sharding. ## Process To follow a step-by-step guide for how to shard a keyspace, you can see [this page](HorizontalReshardingGuide.md). In general, the process to achieve this goal is composed of the following steps: -- pick the original shard(s) -- pick the destination shard(s) coverage -- create the destination shard(s) tablets (in a mode where they are not used to serve traffic yet) -- bring up the destination shard(s) tablets, with read-only masters. -- backup and split the data from the original shard(s) -- merge and import the data on the destination shard(s) -- start and run filtered replication from original to destination shard(s), catch up -- move the read-only traffic to the destination shard(s), stop serving read-only traffic from original shard(s). This transition can take a few hours. We might want to move rdonly separately from replica traffic. -- in quick succession: - - make original master(s) read-only - - flush filtered replication on all filtered replication source servers (after making sure they were caught up with their masters) - - wait until replication is caught up on all destination shard(s) masters - - move the write traffic to the destination shard(s) - - make destination master(s) read-write -- scrap the original shard(s) +* pick the original shard(s) +* pick the destination shard(s) coverage +* create the destination shard(s) tablets (in a mode where they are not used to serve traffic yet) +* bring up the destination shard(s) tablets, with read-only masters. +* backup and split the data from the original shard(s) +* merge and import the data on the destination shard(s) +* start and run filtered replication from original to destination shard(s), catch up +* move the read-only traffic to the destination shard(s), stop serving read-only traffic from original shard(s). This transition can take a few hours. We might want to move rdonly separately from replica traffic. 
+* in quick succession: + * make original master(s) read-only + * flush filtered replication on all filtered replication source servers (after making sure they were caught up with their masters) + * wait until replication is caught up on all destination shard(s) masters + * move the write traffic to the destination shard(s) + * make destination master(s) read-write +* scrap the original shard(s) ## Applications The main application we currently support: -- in a sharded keyspace, split or merge shards (horizontal sharding) -- in a non-sharded keyspace, break out some tables into a different keyspace (vertical sharding) +* in a sharded keyspace, split or merge shards (horizontal sharding) +* in a non-sharded keyspace, break out some tables into a different keyspace (vertical sharding) With these supported features, it is very easy to start with a single keyspace containing all the data (multiple tables), and then as the data grows, move tables to different keyspaces, start sharding some keyspaces, ... without any real @@ -38,17 +43,17 @@ downtime for the application. ## Scaling Up and Down Here is a quick table of what to do with Vitess when a change is required: -- uniformly increase read capacity: add replicas, or split shards -- uniformly increase write capacity: split shards -- reclaim free space: merge shards / keyspaces -- increase geo-diversity: add new cells and new replicas -- cool a hot tablet: if read access, add replicas or split shards, if write access, split shards. +* uniformly increase read capacity: add replicas, or split shards +* uniformly increase write capacity: split shards +* reclaim free space: merge shards / keyspaces +* increase geo-diversity: add new cells and new replicas +* cool a hot tablet: if read access, add replicas or split shards, if write access, split shards. ## Filtered Replication The cornerstone of Resharding is being able to replicate the right data. Mysql doesn't support any filtering, so the Vitess project implements it entirely: -- the tablet server tags transactions with comments that describe what the scope of the statements are (which keyspace_id, +* the tablet server tags transactions with comments that describe what the scope of the statements are (which keyspace_id, which table, ...). That way the MySQL binlogs contain all filtering data. -- a server process can filter and stream the MySQL binlogs (using the comments). -- a client process can apply the filtered logs locally (they are just regular SQL statements at this point). +* a server process can filter and stream the MySQL binlogs (using the comments). +* a client process can apply the filtered logs locally (they are just regular SQL statements at this point). diff --git a/doc/SchemaManagement.md b/doc/SchemaManagement.md index 5f6f3c6111f..37c3970228c 100644 --- a/doc/SchemaManagement.md +++ b/doc/SchemaManagement.md @@ -42,11 +42,11 @@ $ vtctl -wait-time=30s ValidateSchemaKeyspace user ## Changing the Schema Goals: -- simplify schema updates on the fleet -- minimize human actions / errors -- guarantee no or very little downtime for most schema updates -- do not store any permanent schema data in Topology Server, just use it for actions. 
-- only look at tables for now (not stored procedures or grants for instance, although they both could be added fairly easily in the same manner) +* simplify schema updates on the fleet +* minimize human actions / errors +* guarantee no or very little downtime for most schema updates +* do not store any permanent schema data in Topology Server, just use it for actions. +* only look at tables for now (not stored procedures or grants for instance, although they both could be added fairly easily in the same manner) We’re trying to get reasonable confidence that a schema update is going to work before applying it. Since we cannot really apply a change to live tables without potentially causing trouble, we have implemented a Preflight operation: it copies the current schema into a temporary database, applies the change there to validate it, and gathers the resulting schema. After this Preflight, we have a good idea of what to expect, and we can apply the change to any database and make sure it worked. @@ -71,35 +71,37 @@ type SchemaChange struct { ``` And the associated ApplySchema remote action for a tablet. Then the performed steps are: -- The database to use is either derived from the tablet dbName if UseVt is false, or is the _vt database. A ‘use dbname’ is prepended to the Sql. -- (if BeforeSchema is not nil) read the schema, make sure it is equal to BeforeSchema. If not equal: if Force is not set, we will abort, if Force is set, we’ll issue a warning and keep going. -- if AllowReplication is false, we’ll disable replication (adding SET sql_log_bin=0 before the Sql). -- We will then apply the Sql command. -- (if AfterSchema is not nil) read the schema again, make sure it is equal to AfterSchema. If not equal: if Force is not set, we will issue an error, if Force is set, we’ll issue a warning. +* The database to use is either derived from the tablet dbName if UseVt is false, or is the _vt database. A ‘use dbname’ is prepended to the Sql. +* (if BeforeSchema is not nil) read the schema, make sure it is equal to BeforeSchema. If not equal: if Force is not set, we will abort, if Force is set, we’ll issue a warning and keep going. +* if AllowReplication is false, we’ll disable replication (adding SET sql_log_bin=0 before the Sql). +* We will then apply the Sql command. +* (if AfterSchema is not nil) read the schema again, make sure it is equal to AfterSchema. If not equal: if Force is not set, we will issue an error, if Force is set, we’ll issue a warning. We will return the following information: -- whether it worked or not (doh!) -- BeforeSchema -- AfterSchema +* whether it worked or not (doh!) +* BeforeSchema +* AfterSchema ### Use case 1: Single tablet update: -- we first do a Preflight (to know what BeforeSchema and AfterSchema will be). This can be disabled, but is not recommended. -- we then do the schema upgrade. We will check BeforeSchema before the upgrade, and AfterSchema after the upgrade. +* we first do a Preflight (to know what BeforeSchema and AfterSchema will be). This can be disabled, but is not recommended. +* we then do the schema upgrade. We will check BeforeSchema before the upgrade, and AfterSchema after the upgrade. ### Use case 2: Single Shard update: -- need to figure out (or be told) if it’s a simple or complex schema update (does it require the shell game?). For now we'll use a command line flag. -- in any case, do a Preflight on the master, to get the BeforeSchema and AfterSchema values. 
-- in any case, gather the schema on all databases, to see which ones have been upgraded already or not. This guarantees we can interrupt and restart a schema change. Also, this makes sure no action is currently running on the databases we're about to change. -- if simple: - - nobody has it: apply to master, very similar to a single tablet update. - - some tablets have it but not others: error out -- if complex: do the shell game while disabling replication. Skip the tablets that already have it. Have an option to re-parent at the end. - - Note the Backup, and Lag servers won't apply a complex schema change. Only the servers actively in the replication graph will. - - the process can be interrupted at any time, restarting it as a complex schema upgrade should just work. + +* need to figure out (or be told) if it’s a simple or complex schema update (does it require the shell game?). For now we'll use a command line flag. +* in any case, do a Preflight on the master, to get the BeforeSchema and AfterSchema values. +* in any case, gather the schema on all databases, to see which ones have been upgraded already or not. This guarantees we can interrupt and restart a schema change. Also, this makes sure no action is currently running on the databases we're about to change. +* if simple: + * nobody has it: apply to master, very similar to a single tablet update. + * some tablets have it but not others: error out +* if complex: do the shell game while disabling replication. Skip the tablets that already have it. Have an option to re-parent at the end. + * Note the Backup, and Lag servers won't apply a complex schema change. Only the servers actively in the replication graph will. + * the process can be interrupted at any time, restarting it as a complex schema upgrade should just work. ### Use case 3: Keyspace update: -- Similar to Single Shard, but the BeforeSchema and AfterSchema values are taken from the first shard, and used in all shards after that. -- We don't know the new masters to use on each shard, so just skip re-parenting all together. + +* Similar to Single Shard, but the BeforeSchema and AfterSchema values are taken from the first shard, and used in all shards after that. +* We don't know the new masters to use on each shard, so just skip re-parenting all together. This translates into the following vtctl commands: diff --git a/doc/Tools.md b/doc/Tools.md index 00f765ce174..8ff0661a5c8 100644 --- a/doc/Tools.md +++ b/doc/Tools.md @@ -57,8 +57,8 @@ level picture of all the servers and their current state. ### vtworker vtworker is meant to host long-running processes. It supports a plugin infrastructure, and offers libraries to easily pick tablets to use. We have developed: -- resharding differ jobs: meant to check data integrity during shard splits and joins. -- vertical split differ jobs: meant to check data integrity during vertical splits and joins. +* resharding differ jobs: meant to check data integrity during shard splits and joins. +* vertical split differ jobs: meant to check data integrity during vertical splits and joins. It is very easy to add other checker processes for in-tablet integrity checks (verifying foreign key-like relationships), and cross shard data integrity (for instance, if a keyspace contains an index table referencing data in another keyspace). diff --git a/doc/TopologyService.md b/doc/TopologyService.md index 09bf8d7ce5c..b79384ae685 100644 --- a/doc/TopologyService.md +++ b/doc/TopologyService.md @@ -41,12 +41,12 @@ An entire Keyspace can be locked. 
We use this during resharding for instance, wh ### Shard A Shard contains a subset of the data for a Keyspace. The Shard record in the global topology contains: -- the MySQL Master tablet alias for this shard -- the sharding key range covered by this Shard inside the Keyspace -- the tablet types this Shard is serving (master, replica, batch, …), per cell if necessary. -- if during filtered replication, the source shards this shard is replicating from -- the list of cells that have tablets in this shard -- shard-global tablet controls, like blacklisted tables no tablet should serve in this shard +* the MySQL Master tablet alias for this shard +* the sharding key range covered by this Shard inside the Keyspace +* the tablet types this Shard is serving (master, replica, batch, …), per cell if necessary. +* if during filtered replication, the source shards this shard is replicating from +* the list of cells that have tablets in this shard +* shard-global tablet controls, like blacklisted tables no tablet should serve in this shard A Shard can be locked. We use this during operations that affect either the Shard record, or multiple tablets within a Shard (like reparenting), so multiple jobs don’t concurrently alter the data. @@ -61,19 +61,19 @@ This section describes the data structures stored in the local instance (per cel ### Tablets The Tablet record has a lot of information about a single vttablet process running inside a tablet (along with the MySQL process): -- the Tablet Alias (cell+unique id) that uniquely identifies the Tablet -- the Hostname, IP address and port map of the Tablet -- the current Tablet type (master, replica, batch, spare, …) -- which Keyspace / Shard the tablet is part of -- the health map for the Tablet (if in degraded mode) -- the sharding Key Range served by this Tablet -- user-specified tag map (to store per installation data for instance) +* the Tablet Alias (cell+unique id) that uniquely identifies the Tablet +* the Hostname, IP address and port map of the Tablet +* the current Tablet type (master, replica, batch, spare, …) +* which Keyspace / Shard the tablet is part of +* the health map for the Tablet (if in degraded mode) +* the sharding Key Range served by this Tablet +* user-specified tag map (to store per installation data for instance) A Tablet record is created before a tablet can be running (either by `vtctl InitTablet` or by passing the `init_*` parameters to vttablet). The only way a Tablet record will be updated is one of: -- The vttablet process itself owns the record while it is running, and can change it. -- At init time, before the tablet starts -- After shutdown, when the tablet gets scrapped or deleted. -- If a tablet becomes unresponsive, it may be forced to spare to remove it from the serving graph (such as when reparenting away from a dead master, by the `vtctl ReparentShard` action). +* The vttablet process itself owns the record while it is running, and can change it. +* At init time, before the tablet starts +* After shutdown, when the tablet gets scrapped or deleted. +* If a tablet becomes unresponsive, it may be forced to spare to remove it from the serving graph (such as when reparenting away from a dead master, by the `vtctl ReparentShard` action). ### Replication Graph @@ -86,16 +86,16 @@ The Serving Graph is what the clients use to find which EndPoints to send querie #### SrvKeyspace It is the local representation of a Keyspace. 
It contains information on what shard to use for getting to the data (but not information about each individual shard): -- the partitions map is keyed by the tablet type (master, replica, batch, …) and the values are list of shards to use for serving. -- it also contains the global Keyspace fields, copied for fast access. +* the partitions map is keyed by the tablet type (master, replica, batch, …) and the values are list of shards to use for serving. +* it also contains the global Keyspace fields, copied for fast access. It can be rebuilt by running `vtctl RebuildKeyspaceGraph`. It is not automatically rebuilt when adding new tablets in a cell, as this would cause too much overhead and is only needed once per cell/keyspace. It may also be changed during horizontal and vertical splits. #### SrvShard It is the local representation of a Shard. It contains information on details internal to this Shard only, but not to any tablet running in this shard: -- the name and sharding Key Range for this Shard. -- the cell that has the master for this Shard. +* the name and sharding Key Range for this Shard. +* the cell that has the master for this Shard. It is possible to lock a SrvShard object, to massively update all EndPoints in it. @@ -104,10 +104,10 @@ It can be rebuilt (along with all the EndPoints in this Shard) by running `vtctl #### EndPoints For each possible serving type (master, replica, batch), in each Cell / Keyspace / Shard, we maintain a rolled-up EndPoint list. Each entry in the list has information about one Tablet: -- the Tablet Uid -- the Host on which the Tablet resides -- the port map for that Tablet -- the health map for that Tablet +* the Tablet Uid +* the Host on which the Tablet resides +* the port map for that Tablet +* the health map for that Tablet ## Workflows Involving the Topology Server @@ -144,13 +144,13 @@ For locking, we use an auto-incrementing file name in the `/action` subdirectory Note the paths used to store global and per-cell data do not overlap, so a single ZK can be used for both global and local ZKs. This is however not recommended, for reliability reasons. -- Keyspace: `/zk/global/vt/keyspaces/` -- Shard: `/zk/global/vt/keyspaces//shards/` -- Tablet: `/zk//vt/tablets/` -- Replication Graph: `/zk//vt/replication//` -- SrvKeyspace: `/zk//vt/ns/` -- SrvShard: `/zk//vt/ns//` -- EndPoints: `/zk//vt/ns///` +* Keyspace: `/zk/global/vt/keyspaces/` +* Shard: `/zk/global/vt/keyspaces//shards/` +* Tablet: `/zk//vt/tablets/` +* Replication Graph: `/zk//vt/replication//` +* SrvKeyspace: `/zk//vt/ns/` +* SrvShard: `/zk//vt/ns//` +* EndPoints: `/zk//vt/ns///` We provide the 'zk' utility for easy access to the topology data in ZooKeeper. For instance: ``` @@ -171,11 +171,11 @@ We use the `_Data` filename to store the data, JSON encoded. For locking, we store a `_Lock` file with various contents in the directory that contains the object to lock. We use the following paths: -- Keyspace: `/vt/keyspaces//_Data` -- Shard: `/vt/keyspaces///_Data` -- Tablet: `/vt/tablets/-/_Data` -- Replication Graph: `/vt/replication///_Data` -- SrvKeyspace: `/vt/ns//_Data` -- SrvShard: `/vt/ns///_Data` -- EndPoints: `/vt/ns///` +* Keyspace: `/vt/keyspaces//_Data` +* Shard: `/vt/keyspaces///_Data` +* Tablet: `/vt/tablets/-/_Data` +* Replication Graph: `/vt/replication///_Data` +* SrvKeyspace: `/vt/ns//_Data` +* SrvShard: `/vt/ns///_Data` +* EndPoints: `/vt/ns///`