
Commit 8eeebc2

Merge branch 'master' into feat/config
2 parents 3aa3978 + 26848c1

File tree

10 files changed (+1241 -15 lines)


rfcs/002-shard-splitting.md

Lines changed: 373 additions & 0 deletions

---
name: Shard Splitting
about: Introduce Shard Splitting to CouchDB
title: 'Shard Splitting'
labels: rfc, discussion
assignees: '@nickva'
---

# Introduction

This RFC proposes adding the capability to split shards to CouchDB. The API and
the internals will also allow for other operations on shards in the future such
as merging or rebalancing.

## Abstract

Since CouchDB 2.0, clustered databases have had a fixed Q value defined at
creation. This often requires users to predict database usage ahead of time,
which can be hard to do. Too low a value might result in large shards, slower
performance, and needing more disk space to do compactions.

It would be nice to start with a low Q initially, for example Q=1, and as usage
grows be able to split the shards that grow too big. Especially with
partitioned queries being available, there will be a higher chance of unevenly
sized shards, so it would be beneficial to split the larger ones to even out
the size distribution across the cluster.

## Requirements Language

[NOTE]: # ( Do not alter the section below. Follow its instructions. )

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in
[RFC 2119](https://www.rfc-editor.org/rfc/rfc2119.txt).

## Terminology

*resharding* : Manipulating CouchDB shards. Could be splitting, merging,
rebalancing or other operations. This will be used as the top-level API
endpoint name with the idea that in the future different types of shard
manipulation jobs would be added.

---

# Detailed Description

From the user's perspective there would be a new HTTP API endpoint -
`_reshard/*`. A POST request to `_reshard/jobs/` would start resharding jobs.
Initially these will be of just one `"type":"split"`, but in the future other
types could be added.

Users would then be able to monitor the state of these jobs to inspect their
progress and see when they completed or failed.

The API should be designed to be consistent with `_scheduler/jobs` as much as
possible, since that is another recent CouchDB API exposing an internal jobs
list.
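
To illustrate the intended end-to-end workflow, here is a minimal sketch (not
part of the proposed API surface) that creates a split job for one shard copy
and polls its state until it finishes. The cluster address, credentials, and
shard name are placeholders, and the terminal state values are assumed from
the examples later in this document.

```
# Sketch of the intended client workflow. The cluster address, credentials
# and shard name below are placeholders; only the _reshard endpoints
# described in this RFC are assumed.
import time
import requests

BASE = "http://127.0.0.1:5984"   # hypothetical cluster address
AUTH = ("admin", "password")     # placeholder credentials

# Start a split job for one shard copy on one node.
resp = requests.post(BASE + "/_reshard/jobs", auth=AUTH, json={
    "type": "split",
    "shard": "shards/00000000-ffffffff/db1.1549492084",
    "node": "node1@127.0.0.1",
})
job_id = resp.json()[0]["id"]

# Poll the job's state until it reaches a terminal state.
while True:
    state = requests.get(f"{BASE}/_reshard/jobs/{job_id}/state", auth=AUTH).json()["state"]
    if state in ("completed", "failed"):
        break
    time.sleep(5)

print("Job", job_id, "finished with state:", state)
```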

Most of the code implementing this would live in the *mem3* application, with
some lower level components in the *couch* application. There will be a new
child in the *mem3_sup* supervisor responsible for resharding called
*mem3_reshard_sup*. It will have a *mem3_reshard* manager process which should
have an Erlang API for starting jobs, stopping jobs, removing them, and
inspecting their state. Individual jobs would be instances of a gen_server
defined in the *mem3_reshard_job* module. There will be a simple-one-for-one
supervisor under *mem3_reshard_sup* named *mem3_reshard_job_sup* to keep track
of *mem3_reshard_job* children.

An individual shard splitting job will follow roughly these steps in order:

- **Create targets**. Targets are created. Some target properties should match
  the source. This means matching the PSE engine if the source uses a custom
  one. If the source is partitioned, the targets should be partitioned as
  well, etc.

- **Initial bulk copy.** After the targets are created, copy all the documents
  in the source shard to the targets. This operation should be as optimized as
  possible as it could potentially copy tens of GBs of data. For this reason
  this piece of code will be closer to what the compactor does.

- **Build indices**. The source shard might have had up-to-date indices and so
  it is beneficial for the split version to have them as well. Here we'd
  inspect all `_design` docs and rebuild all the known indices. After this
  step there will be a "topoff" step to replicate any changes that might have
  occurred on the source while the indices were built.

- **Update shard map**. Here the global shard map is updated to remove the old
  source shard and replace it with the targets. There will be a corresponding
  entry added to the shard document's changelog indicating that a split
  happened. To avoid conflicts being generated when multiple copies of a range
  finish splitting and race to update the shard map, all shard map updates
  will be routed through one consistently picked node (the lowest in the list
  of connected nodes when they are sorted). After the shard map is updated,
  there will be another topoff replication job to bring in changes from the
  source shard to the targets that might have occurred while the shard map was
  updating.

- **Delete source shard**

This progression of split states will be visible when inspecting a job's
status, as well as in the history in the `detail` field of each event.
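
Purely as an illustration, the linear progression of split states and the
coordinator-node rule described above might be modeled as follows. The state
names here are hypothetical and not prescribed by this RFC; the implementation
may use different ones.

```
# Illustrative sketch only: hypothetical state names for the linear split
# progression described above; the implementation may use different names.
SPLIT_STATES = [
    "new",
    "create_targets",
    "initial_copy",
    "build_indices",
    "topoff1",           # catch up changes made while indices were building
    "update_shardmap",
    "topoff2",           # catch up changes made while the shard map updated
    "source_delete",
    "completed",
]

def next_state(current):
    """Return the state that follows `current`; the terminal state maps to itself."""
    i = SPLIT_STATES.index(current)
    return SPLIT_STATES[min(i + 1, len(SPLIT_STATES) - 1)]

def shard_map_updater(connected_nodes):
    """Pick the single node that applies shard map updates: the lowest
    node name when the connected nodes are sorted."""
    return sorted(connected_nodes)[0]
```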

# Advantages and Disadvantages

The main advantage is the ability to dynamically change the shard size
distribution on a cluster in response to changing user requirements, without
having to delete and recreate databases.

One disadvantage is that it might break some basic constraints about all
copies of a shard range being the same size. A user could choose to split, for
example, a shard copy 00...-ff... on node1 only, so on node2 and node3 the
copy will stay 00...-ff... while on node1 there will now be 00...-7f... and
80...-ff.... External tooling inspecting the `$db/_shards` endpoint might need
to be updated to handle this scenario. A mitigating factor here is that
resharding in the current proposal is not automatic; it is an operation
triggered manually by users.

# Key Changes

The main change is the ability to split shards via the `_reshard/*` HTTP API.

## Applications and Modules affected

Most of the changes will be in the *mem3* application, with some changes in
the *couch* application as well.

## HTTP API additions

* `GET /_reshard`

Top-level summary. Besides being under the new `_reshard` endpoint, there is a
`reason` field and the stats are more detailed.

Returns

```
{
    "completed": 3,
    "failed": 4,
    "running": 0,
    "state": "stopped",
    "state_reason": "Manual rebalancing",
    "stopped": 0,
    "total": 7
}
```

* `PUT /_reshard/state`

Start or stop global resharding.

Body

```
{
    "state": "stopped",
    "reason": "Manual rebalancing"
}
```

Returns

```
{
    "ok": true
}
```

* `GET /_reshard/state`

Return global resharding state and reason.

```
{
    "reason": "Manual rebalancing",
    "state": "stopped"
}
```

* `GET /_reshard/jobs`

Get the state of all the resharding jobs on the cluster. Now we have a
detailed state transition history which looks similar to what
`_scheduler/jobs` has.

```
{
    "jobs": [
        {
            "history": [
                {
                    "detail": null,
                    "timestamp": "2019-02-06T22:28:06Z",
                    "type": "new"
                },
                ...
                {
                    "detail": null,
                    "timestamp": "2019-02-06T22:28:10Z",
                    "type": "completed"
                }
            ],
            "id": "001-0a308ef9f7bd24bd4887d6e619682a6d3bb3d0fd94625866c5216ec1167b4e23",
            "job_state": "completed",
            "node": "node1@127.0.0.1",
            "source": "shards/00000000-ffffffff/db1.1549492084",
            "split_state": "completed",
            "start_time": "2019-02-06T22:28:06Z",
            "state_info": {},
            "target": [
                "shards/00000000-7fffffff/db1.1549492084",
                "shards/80000000-ffffffff/db1.1549492084"
            ],
            "type": "split",
            "update_time": "2019-02-06T22:28:10Z"
        },
        {
            ...
        }
    ],
    "offset": 0,
    "total_rows": 7
}
```

* `POST /_reshard/jobs`

Create a new resharding job. This can now take other parameters and can split
multiple ranges.

To split one shard on a particular node:

```
{
    "type": "split",
    "shard": "shards/80000000-bfffffff/db1.1549492084",
    "node": "node1@127.0.0.1"
}
```

To split a particular range on all nodes:

```
{
    "type": "split",
    "db": "db1",
    "range": "80000000-bfffffff"
}
```

To split a range on just one node:

```
{
    "type": "split",
    "db": "db1",
    "range": "80000000-bfffffff",
    "node": "node1@127.0.0.1"
}
```

To split all ranges of a db on one node:

```
{
    "type": "split",
    "db": "db1",
    "node": "node1@127.0.0.1"
}
```

The result may now contain multiple job IDs:

```
[
    {
        "id": "001-d457a4ea82877a26abbcbcc0e01c4b0070027e72b5bf0c4ff9c89eec2da9e790",
        "node": "node1@127.0.0.1",
        "ok": true,
        "shard": "shards/80000000-bfffffff/db1.1549986514"
    },
    {
        "id": "001-7c1d20d2f7ef89f6416448379696a2cc98420e3e7855fdb21537d394dbc9b35f",
        "node": "node1@127.0.0.1",
        "ok": true,
        "shard": "shards/c0000000-ffffffff/db1.1549986514"
    }
]
```
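
As a usage sketch (placeholder URL and credentials), splitting one range on all
nodes returns one job per shard copy, which a client could group by node:

```
# Sketch: split a range on all nodes and group the returned job IDs by node.
# The URL and credentials are placeholders.
import requests

resp = requests.post(
    "http://127.0.0.1:5984/_reshard/jobs",
    auth=("admin", "password"),
    json={"type": "split", "db": "db1", "range": "80000000-bfffffff"},
)
jobs_by_node = {}
for job in resp.json():
    jobs_by_node.setdefault(job["node"], []).append(job["id"])
print(jobs_by_node)
```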

* `GET /_reshard/jobs/$jobid`

Get just one job by its ID

```
{
    "history": [
        {
            "detail": null,
            "timestamp": "2019-02-12T16:55:41Z",
            "type": "new"
        },
        {
            "detail": "Shard splitting disabled",
            "timestamp": "2019-02-12T16:55:41Z",
            "type": "stopped"
        }
    ],
    "id": "001-d457a4ea82877a26abbcbcc0e01c4b0070027e72b5bf0c4ff9c89eec2da9e790",
    "job_state": "stopped",
    "node": "node1@127.0.0.1",
    "source": "shards/80000000-bfffffff/db1.1549986514",
    "split_state": "new",
    "start_time": "1970-01-01T00:00:00Z",
    "state_info": {
        "reason": "Shard splitting disabled"
    },
    "target": [
        "shards/80000000-9fffffff/db1.1549986514",
        "shards/a0000000-bfffffff/db1.1549986514"
    ],
    "type": "split",
    "update_time": "2019-02-12T16:55:41Z"
}
```

* `GET /_reshard/jobs/$jobid/state`

Get the running state of a particular job only

```
{
    "reason": "Shard splitting disabled",
    "state": "stopped"
}
```

* `PUT /_reshard/jobs/$jobid/state`

Stop or resume a particular job

Request body

```
{
    "state": "stopped",
    "reason": "Pause this job for now"
}
```
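
For example, a client could pause a job and later resume it with two PUT
requests. This is a sketch with a placeholder URL, credentials, and job ID,
and it assumes `"running"` is the state value that resumes a stopped job.

```
# Sketch: pause a job, then resume it. URL, credentials and job ID are
# placeholders; "running" is assumed to be the value that resumes a job.
import requests

BASE = "http://127.0.0.1:5984"
AUTH = ("admin", "password")
job_id = "001-d457a4ea82877a26abbcbcc0e01c4b0070027e72b5bf0c4ff9c89eec2da9e790"

# Pause the job; the reason shows up in the job's state info and history.
requests.put(f"{BASE}/_reshard/jobs/{job_id}/state", auth=AUTH,
             json={"state": "stopped", "reason": "Pause this job for now"})

# Resume the job later.
requests.put(f"{BASE}/_reshard/jobs/{job_id}/state", auth=AUTH,
             json={"state": "running"})
```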

## HTTP API deprecations

None

# Security Considerations

None.

# References

Original RFC-as-an-issue:

https://github.com/apache/couchdb/issues/1920

Most of the discussion regarding this has happened on the `@dev` mailing list:

https://mail-archives.apache.org/mod_mbox/couchdb-dev/201901.mbox/%3CCAJd%3D5Hbs%2BNwrt0%3Dz%2BGN68JPU5yHUea0xGRFtyow79TmjGN-_Sg%40mail.gmail.com%3E

https://mail-archives.apache.org/mod_mbox/couchdb-dev/201902.mbox/%3CCAJd%3D5HaX12-fk2Lo8OgddQryZaj5KRa1GLN3P9LdYBQ5MT0Xew%40mail.gmail.com%3E

# Acknowledgments

@davisp @kocolosk : Collaborated on the initial idea and design

@mikerhodes @wohali @janl @iilyak : Additionally collaborated on API design
