---
name: Shard Splitting
about: Introduce Shard Splitting to CouchDB
title: 'Shard Splitting'
labels: rfc, discussion
assignees: '@nickva'

---

# Introduction

This RFC proposes adding the capability to split shards to CouchDB. The API and
the internals will also allow for other operations on shards in the future such
as merging or rebalancing.

## Abstract

Since CouchDB 2.0, clustered databases have had a fixed Q value defined at
creation. This often requires users to predict database usage ahead of time,
which can be hard to do. Too low a value might result in large shards, slower
performance, and needing more disk space to do compactions.

It would be nice to start with a low Q initially, for example Q=1, and as
usage grows be able to split the shards that grow too big. Especially with
partitioned queries being available, there is a higher chance of unevenly
sized shards, so it would be beneficial to split the larger ones to even out
the size distribution across the cluster.

## Requirements Language

[NOTE]: # ( Do not alter the section below. Follow its instructions. )

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in
[RFC 2119](https://www.rfc-editor.org/rfc/rfc2119.txt).

## Terminology

*resharding*: Manipulating CouchDB shards; this could be splitting, merging,
rebalancing, or other operations. This will be used as the top-level API
endpoint name, with the idea that in the future different types of shard
manipulation jobs would be added.

---

# Detailed Description

From the user's perspective there would be a new HTTP API endpoint -
`_reshard/*`. A POST request to `_reshard/jobs/` would start resharding jobs.
Initially these will be of just one "type": "split", but in the future other
types could be added.

Users would then be able to monitor the state of these jobs to inspect their
progress and see when they complete or fail.

The API should be designed to be consistent with `_scheduler/jobs` as much as
possible, since that is another recent CouchDB API exposing an internal jobs
list.
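
As a rough illustration of that workflow from a client's perspective, here is a
minimal Erlang sketch that creates a split job and polls its state until it
finishes. It is not part of the proposal itself: the cluster URL, the
admin:pass credentials, and the use of `httpc` and `jiffy` are assumptions made
for the example, and the request and response shapes follow the examples in the
HTTP API section below.

```
%% Sketch only. Assumes `inets` is started, a node is reachable at
%% http://127.0.0.1:15984 with admin:pass credentials, and jiffy is
%% available for JSON. Shard and Node are binaries, for example
%% <<"shards/00000000-ffffffff/db1.1549492084">> and <<"node1@127.0.0.1">>.
-module(reshard_example).
-export([split_and_wait/2]).

split_and_wait(Shard, Node) ->
    Base = "http://admin:pass@127.0.0.1:15984",
    Body = jiffy:encode(#{<<"type">> => <<"split">>,
                          <<"shard">> => Shard,
                          <<"node">> => Node}),
    {ok, {_, _, Resp}} =
        httpc:request(post, {Base ++ "/_reshard/jobs", [],
                             "application/json", Body},
                      [], [{body_format, binary}]),
    %% The response is a list of created jobs; grab the first job id.
    [#{<<"id">> := JobId} | _] = jiffy:decode(Resp, [return_maps]),
    wait(Base, binary_to_list(JobId)).

wait(Base, JobId) ->
    Url = Base ++ "/_reshard/jobs/" ++ JobId ++ "/state",
    {ok, {_, _, Resp}} = httpc:request(get, {Url, []}, [],
                                       [{body_format, binary}]),
    case jiffy:decode(Resp, [return_maps]) of
        #{<<"state">> := <<"completed">>} -> ok;
        #{<<"state">> := <<"failed">>} = Failed -> {error, Failed};
        _StillRunning -> timer:sleep(1000), wait(Base, JobId)
    end.
```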

Most of the code implementing this would live in the mem3 application with some
lower level components in the *couch* application. There will be a new child in
the *mem3_sup* supervisor responsible for resharding called *mem3_reshard_sup*.
It will have a *mem3_reshard* manager process which should have an Erlang API
for starting jobs, stopping jobs, removing them, and inspecting their state.
Individual jobs would be instances of a gen_server defined in the
*mem3_reshard_job* module. There will be a simple_one_for_one supervisor under
*mem3_reshard_sup* named *mem3_reshard_job_sup* to keep track of
*mem3_reshard_job* children.
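
To make the proposed process structure concrete, below is a minimal sketch of
what *mem3_reshard_sup* could look like. The child ordering, start functions,
and restart strategy are illustrative assumptions rather than the final
implementation.

```
%% Illustrative sketch of the proposed supervision tree.
-module(mem3_reshard_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    Children = [
        %% simple_one_for_one supervisor holding individual split jobs
        #{id => mem3_reshard_job_sup,
          start => {mem3_reshard_job_sup, start_link, []},
          type => supervisor},
        %% manager process exposing the Erlang API for starting, stopping,
        %% removing, and inspecting jobs
        #{id => mem3_reshard,
          start => {mem3_reshard, start_link, []},
          type => worker}
    ],
    {ok, {#{strategy => one_for_all, intensity => 3, period => 10}, Children}}.
```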

An individual shard splitting job will follow roughly these steps, in order:

- **Create targets**. Targets are created. Some target properties should match
  the source. This means matching the PSE engine if the source uses a custom
  one, and if the source is partitioned, the targets should be partitioned as
  well, etc.

- **Initial bulk copy.** After the targets are created, copy all the documents
  in the source shard to the targets. This operation should be as optimized as
  possible since it could potentially copy tens of GBs of data. For this reason
  this piece of code will be closer to what the compactor does.

- **Build indices**. The source shard might have had up-to-date indices and so
  it is beneficial for the split version to have them as well. Here we'd
  inspect all `_design` docs and rebuild all the known indices. After this step
  there will be a "topoff" step to replicate any changes that might have
  occurred on the source while the indices were built.

- **Update shard map**. Here the global shard map is updated to remove the old
  source shard and replace it with the targets. A corresponding entry is added
  to the shard document's changelog indicating that a split happened. To avoid
  conflicts when multiple copies of a range finish splitting and race to update
  the shard map, all shard map updates will be routed through one consistently
  picked node (the lowest in the sorted list of connected nodes; see the sketch
  after this list). After the shard map is updated, there will be another
  topoff replication job to bring in changes from the source shard to the
  targets that might have occurred while the shard map was updating.

- **Delete source shard**
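
The node-picking rule mentioned in the shard map update step above can be
sketched in a few lines of Erlang. This only illustrates the idea; it is not
the proposed implementation:

```
%% Every node sorts the same set of connected cluster nodes and picks the
%% lowest, so they all route shard map updates to the same place.
pick_shard_map_update_node() ->
    lists:min([node() | nodes()]).
```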

This progression of split states will be visible when inspecting a job's
status, as well as in the job's history in the `detail` field of each event.
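
For illustration, the split states could be represented as an ordered list of
atoms roughly mirroring the steps above. The exact state names and the number
of topoff phases are assumptions here, not the final set:

```
%% Illustrative only; the final implementation may use different names.
-define(SPLIT_STATES, [
    new,
    initial_copy,
    build_indices,
    topoff1,
    update_shardmap,
    topoff2,
    source_delete,
    completed
]).
```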

# Advantages and Disadvantages

The main advantage is the ability to dynamically change the shard size
distribution on a cluster in response to changing user requirements, without
having to delete and recreate databases.

One disadvantage is that it might break the basic constraint that all copies of
a shard range are the same size. A user could, for example, choose to split a
shard copy `00-ff` on node1 only, so on node2 and node3 the copy would stay
`00-ff` while on node1 there would now be `00-7f` and `80-ff`. External tooling
inspecting the `$db/_shards` endpoint might need to be updated to handle this
scenario. A mitigating factor here is that resharding in the current proposal
is not automatic; it is an operation triggered manually by the users.

# Key Changes

The main change is the ability to split shards via the `_reshard/*` HTTP API.

## Applications and Modules affected

Most of the changes will be in the *mem3* application, with some changes in the
*couch* application as well.

## HTTP API additions

* `GET /_reshard`

Top level summary of resharding on the cluster: the global resharding state,
its `state_reason`, and job count statistics.

Returns

```
{
    "completed": 3,
    "failed": 4,
    "running": 0,
    "state": "stopped",
    "state_reason": "Manual rebalancing",
    "stopped": 0,
    "total": 7
}
```

* `PUT /_reshard/state`

Start or stop resharding globally.

Body

```
{
    "state": "stopped",
    "reason": "Manual rebalancing"
}
```

Returns

```
{
    "ok": true
}
```

* `GET /_reshard/state`

Return global resharding state and reason.

```
{
    "reason": "Manual rebalancing",
    "state": "stopped"
}
```

* `GET /_reshard/jobs`

Get the state of all the resharding jobs on the cluster. The response includes
a detailed state transition history, similar to what `_scheduler/jobs` has.

```
{
    "jobs": [
        {
            "history": [
                {
                    "detail": null,
                    "timestamp": "2019-02-06T22:28:06Z",
                    "type": "new"
                },
                ...
                {
                    "detail": null,
                    "timestamp": "2019-02-06T22:28:10Z",
                    "type": "completed"
                }
            ],
            "id": "001-0a308ef9f7bd24bd4887d6e619682a6d3bb3d0fd94625866c5216ec1167b4e23",
            "job_state": "completed",
            "node": "node1@127.0.0.1",
            "source": "shards/00000000-ffffffff/db1.1549492084",
            "split_state": "completed",
            "start_time": "2019-02-06T22:28:06Z",
            "state_info": {},
            "target": [
                "shards/00000000-7fffffff/db1.1549492084",
                "shards/80000000-ffffffff/db1.1549492084"
            ],
            "type": "split",
            "update_time": "2019-02-06T22:28:10Z"
        },
        {
            ...
        }
    ],
    "offset": 0,
    "total_rows": 7
}
```

* `POST /_reshard/jobs`

Create a new resharding job. The request can take several parameter
combinations, and a single request may split multiple ranges.

To split one shard on a particular node:

```
{
    "type": "split",
    "shard": "shards/80000000-bfffffff/db1.1549492084",
    "node": "node1@127.0.0.1"
}
```

To split a particular range on all nodes:

```
{
    "type": "split",
    "db": "db1",
    "range": "80000000-bfffffff"
}
```

To split a range on just one node:

```
{
    "type": "split",
    "db": "db1",
    "range": "80000000-bfffffff",
    "node": "node1@127.0.0.1"
}
```

To split all ranges of a db on one node:

```
{
    "type": "split",
    "db": "db1",
    "node": "node1@127.0.0.1"
}
```

The result may contain multiple job IDs:

```
[
    {
        "id": "001-d457a4ea82877a26abbcbcc0e01c4b0070027e72b5bf0c4ff9c89eec2da9e790",
        "node": "node1@127.0.0.1",
        "ok": true,
        "shard": "shards/80000000-bfffffff/db1.1549986514"
    },
    {
        "id": "001-7c1d20d2f7ef89f6416448379696a2cc98420e3e7855fdb21537d394dbc9b35f",
        "node": "node1@127.0.0.1",
        "ok": true,
        "shard": "shards/c0000000-ffffffff/db1.1549986514"
    }
]
```

* `GET /_reshard/jobs/$jobid`

Get just one job by its ID.

```
{
    "history": [
        {
            "detail": null,
            "timestamp": "2019-02-12T16:55:41Z",
            "type": "new"
        },
        {
            "detail": "Shard splitting disabled",
            "timestamp": "2019-02-12T16:55:41Z",
            "type": "stopped"
        }
    ],
    "id": "001-d457a4ea82877a26abbcbcc0e01c4b0070027e72b5bf0c4ff9c89eec2da9e790",
    "job_state": "stopped",
    "node": "node1@127.0.0.1",
    "source": "shards/80000000-bfffffff/db1.1549986514",
    "split_state": "new",
    "start_time": "1970-01-01T00:00:00Z",
    "state_info": {
        "reason": "Shard splitting disabled"
    },
    "target": [
        "shards/80000000-9fffffff/db1.1549986514",
        "shards/a0000000-bfffffff/db1.1549986514"
    ],
    "type": "split",
    "update_time": "2019-02-12T16:55:41Z"
}
```

* `GET /_reshard/jobs/$jobid/state`

Get the running state of a particular job only.

```
{
    "reason": "Shard splitting disabled",
    "state": "stopped"
}
```

* `PUT /_reshard/jobs/$jobid/state`

Stop or resume a particular job.

Request body

```
{
    "state": "stopped",
    "reason": "Pause this job for now"
}
```


## HTTP API deprecations

None.

# Security Considerations

None.

# References

Original RFC-as-an-issue:

https://github.com/apache/couchdb/issues/1920

Most of the discussion regarding this has happened on the `@dev` mailing list:

https://mail-archives.apache.org/mod_mbox/couchdb-dev/201901.mbox/%3CCAJd%3D5Hbs%2BNwrt0%3Dz%2BGN68JPU5yHUea0xGRFtyow79TmjGN-_Sg%40mail.gmail.com%3E

https://mail-archives.apache.org/mod_mbox/couchdb-dev/201902.mbox/%3CCAJd%3D5HaX12-fk2Lo8OgddQryZaj5KRa1GLN3P9LdYBQ5MT0Xew%40mail.gmail.com%3E


# Acknowledgments

@davisp, @kocolosk: Collaborated on the initial idea and design

@mikerhodes, @wohali, @janl, @iilyak: Additionally collaborated on API design