[doc]improve storage design.md English version #1721
Conversation
@@ -79,8 +84,9 @@

### Multi Raft Group
There are two keys to implementing Multi Raft Group. **The first is sharing the transport layer**, because every Raft Group needs to send messages to its corresponding peers, and without a shared transport layer the connection overhead is huge. **The second is the threading model**: Multi Raft Groups must share one set of thread pools, otherwise the system ends up with too many threads and a large context-switch overhead.

Since Raft logs do not allow holes, almost all implementations adopt Multi Raft Group to mitigate this problem, so the number of partitions almost determines the performance of the whole Raft Group. But this does not mean that the more partitions the better: every Raft Group stores a series of state information internally and has its own WAL file, so too many partitions increase the overhead. Besides, when there are too many partitions and the load is not high enough, batch operations become meaningless. For example, for an online system with 10,000 TPS, even if the partition count per machine exceeds 10,000, each partition is likely to see only 1 TPS, so batching loses its meaning while adding CPU overhead.
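The batching argument above is plain arithmetic; a small sketch makes it concrete (the `min_batch` threshold is illustrative, not a Nebula Graph parameter):

```python
def per_partition_tps(total_tps: float, partitions: int) -> float:
    """Average number of log entries each Raft Group commits per second."""
    return total_tps / partitions

def batching_worthwhile(tps_per_part: float, min_batch: int = 2) -> bool:
    # If a partition sees fewer operations per second than a minimal
    # batch size, grouping writes buys nothing and only adds CPU cost.
    return tps_per_part >= min_batch
```

With 10,000 TPS spread over 10,000 partitions, each partition commits roughly one entry per second, so there is nothing to batch; with 100 partitions, each sees ~100 TPS and batching pays off.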
Why abandon the unordered list here? I think it's clearer.
- `Tag ID`: four bytes, used to identify the tag
- `Timestamp`: eight bytes, not exposed to users, reserved for MVCC in the future
- `Type`: one byte, indicating the key type, e.g. data, index, system, etc.
- `Part ID`: three bytes, indicating the (sharding) partition ID. It's designed to support data migration/balance operations by **prefix-scanning all the data in a partition.**
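The field list above can be sketched as a key encoder. This is a hedged illustration of the idea, not the real Nebula Graph key format: field order and endianness are assumptions, and the point shown is only that packing the one-byte type and three-byte partition ID into a fixed prefix makes per-partition prefix scans possible:

```python
import struct

def encode_vertex_key(key_type: int, part_id: int, vertex_id: int,
                      tag_id: int, timestamp: int) -> bytes:
    """Pack the fields listed above into one byte string (illustrative).

    The first four bytes combine the one-byte type with the three-byte
    partition ID, so every key in a partition shares the same 4-byte
    prefix and can be found with a single prefix scan.
    """
    assert 0 <= part_id < 1 << 24, "Part ID is only three bytes"
    prefix = struct.pack(">I", (key_type << 24) | part_id)
    return prefix + struct.pack(">qiq", vertex_id, tag_id, timestamp)

def partition_prefix(key_type: int, part_id: int) -> bytes:
    """The 4-byte prefix shared by every key in a partition."""
    return struct.pack(">I", (key_type << 24) | part_id)
```

A migration job can then iterate the kv store from `partition_prefix(...)` and stop at the first key that no longer starts with it.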
id > ID
**Nebula Graph** shards data through a `modulo operation` on the `vertex ID`. All the _out-keys_, _in-keys_ and _tag data_ of a vertex are placed in the same partition, which improves query efficiency because data access stays local. Breadth-First-Search (BFS) expansion starting from a given vertex is a very common ad-hoc graph exploration, and during BFS, filtering on edge and vertex properties is time-consuming. **Nebula Graph** guarantees the efficiency of this operation by placing the properties of a vertex and of its edges near each other. It is worth noting that most graph database vendors run their benchmarks with the Graph 500 or Twitter data sets, which are not convincing because properties are not considered in that kind of graph exploration, while most production cases are not that simple.
please use uppercase ID
And During > And during
their benchmark > their benchmarks
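The sharding rule the paragraph describes is a plain modulo; a minimal sketch (the partition count is illustrative — in practice it is fixed per deployment):

```python
NUM_PARTS = 100  # hypothetical partition count, for illustration only

def partition_of(vertex_id: int, num_parts: int = NUM_PARTS) -> int:
    """All out-keys, in-keys and tag data of a vertex map to the same
    partition, so one BFS step from that vertex is a local scan."""
    return vertex_id % num_parts
```

Because the partition depends only on the vertex ID, every key derived from a vertex (its tags, its out-edges keyed by source, its in-edges keyed by destination) lands on the same shard.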
- **Written in C++**
- **High performance**, a pure high-performance key-value store.
- **Provided as a library**; as a strongly typed database, the performance of the storage layer is key to **Nebula Graph**.
- **Strong data consistency**, since **Nebula Graph** is a distributed system.
Strong data consistency, benefiting from Nebula Graph's distribution system.
For users who are not sensitive to performance or unwilling to migrate data from other storage systems, such as HBase or MySQL, **Nebula Graph** also provides a plugin over the kv store to replace its default RocksDB. Currently, the HBase plugin is yet to be released.
from other storage systems
Currently, HBase plugin is yet to be released.
As RocksDB is the local storage engine, **Nebula Graph** can manage multiple hard disks to make full use of parallel IO access. All a user needs to do is configure multiple data directories.
As RocksDB is the local storage engine,
What a user needs ...
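One way to picture "configure multiple data directories and get parallel IO" is a round-robin placement of partitions over disks. This is a toy model of the idea only — the engine's actual placement policy is not specified here, and the paths are made up:

```python
import itertools

def assign_parts_to_disks(part_ids, data_dirs):
    """Spread partitions across the configured data directories
    round-robin, so that reads and writes hit all disks in parallel."""
    dirs = itertools.cycle(data_dirs)
    return {p: next(dirs) for p in part_ids}
```

With two directories and four partitions, each disk ends up serving two partitions, so compactions and WAL writes for different partitions can proceed on different spindles.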
### Learner
When a new machine is added to a cluster, it has to catch up on data for quite a long time, and there may be accidents during this process. If it directly joins the raft group as a follower, it will dramatically reduce the availability of the entire cluster. **Nebula Graph** introduces the learner role, implemented with the command WAL mentioned above. When a leader writing the WAL meets an `add learner` command, it adds the incoming learner to its peer list and marks it as a learner. Logs are sent to all the peers, both followers and learners, but learners cannot vote in the leader election.
If this one directly joins the raft group as a follower...
Nebula Graph introduces ...
...meets an add learner command
...
I think `learners` is better than learners here
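The learner bookkeeping described in the paragraph can be sketched with a toy model (class and method names here are illustrative, not the real `RaftPart` API): learners receive logs like everyone else, but they are excluded from the quorum count.

```python
class Peer:
    def __init__(self, addr, is_learner=False):
        self.addr = addr
        self.is_learner = is_learner

class ToyRaftPart:
    """Toy model of learner handling: quorum counts voters only."""
    def __init__(self, peer_addrs):
        self.peers = [Peer(a) for a in peer_addrs]

    def apply_add_learner(self, addr):
        # Seen in the WAL as an `add learner` command: the leader adds
        # the peer but marks it so it never affects the majority count.
        self.peers.append(Peer(addr, is_learner=True))

    def quorum_size(self):
        voters = [p for p in self.peers if not p.is_learner]
        return len(voters) // 2 + 1

    def replication_targets(self):
        # Logs are still shipped to learners as normal.
        return [p.addr for p in self.peers]
```

Adding a learner therefore never raises the number of acknowledgements a commit needs, which is exactly why a catching-up machine does not hurt availability.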
### Transfer Leadership
Transfer leadership is extremely important during data balance. When migrating a partition from one machine to another, **Nebula Graph** first checks whether the replica being moved is the leader. If so, another follower should be elected leader before the migration; otherwise, the cluster service is affected while the leader is being migrated. After the migration is done, a `BALANCE LEADER` command is invoked so that the workload on each machine is balanced.
Transfer leadership is extremely important during data balance.
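The ordering constraint in that paragraph — transfer leadership first, migrate, then rebalance leaders — can be written down as a small planner. This is a sketch of the described procedure only; the step names are made up, not real commands of the balancer (except `BALANCE LEADER`, which the text itself mentions):

```python
def plan_partition_move(part, src, dst, leader_of):
    """Return the ordered steps to move `part` from `src` to `dst`
    without serving traffic from a leader that is mid-migration."""
    steps = []
    if leader_of(part) == src:
        # A follower must take over leadership before data moves.
        steps.append(("TRANSFER_LEADER", part, src))
    steps.append(("MIGRATE_DATA", part, src, dst))
    # Re-spread leaders once the data is in place.
    steps.append(("BALANCE_LEADER",))
    return steps
```

If the source replica is not the leader, the transfer step is simply skipped and the migration starts immediately.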
### Membership Change
To avoid split-brain, when the members of a Raft Group change, an intermediate state is required. In this state, the majority of the old group and that of the new group always overlap, which prevents either group from making decisions unilaterally. This is the `joint consensus` described in the Raft thesis. To make it even simpler, Diego Ongaro suggests in his doctoral thesis **adding or removing only one peer at a time to ensure that the majorities of the old and new groups overlap**. **Nebula Graph**'s implementation also uses this approach, although the way members are added or removed differs. For details, please refer to `addPeer`/`removePeer` in the Raft Part class.
How to refer to `addPeer`/`removePeer` in the Raft Part class — is there a link here?
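The single-peer rule can be checked mechanically: enumerate every majority of the old and new groups and verify they always intersect. The sketch below demonstrates why adding one node preserves the overlap while jumping from 3 to 5 nodes at once does not (member names are arbitrary):

```python
from itertools import combinations

def majorities(group):
    """Every subset of `group` large enough to form a quorum."""
    need = len(group) // 2 + 1
    return [set(c) for c in combinations(sorted(group), need)]

def always_overlap(old, new):
    """True iff every majority of `old` intersects every majority of
    `new` -- the property that rules out split-brain decisions."""
    return all(a & b for a in majorities(old) for b in majorities(new))
```

Going from {a, b, c} to {a, b, c, d} keeps the property: any 3-node majority of the new group must reuse at least two old members. Going straight to {a, b, c, d, e} breaks it, because {a, b} and {c, d, e} are disjoint majorities that could commit conflicting decisions.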
The interfaces of the Storage Service layer are:

- `Insert vertex/edge`: insert a vertex or an edge and its properties.
- `getNeighbors`: get the in-edges or out-edges of a set of vertices and return the edges and their properties; condition filtering is also supported.
- `getProps`: get the properties of a vertex or an edge.
`getProps`: get the properties of a vertex or an edge.
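The three interfaces listed above can be sketched as an in-memory stub. This is an illustrative signature sketch, not the real Thrift API — the class name, argument shapes, and the dictionary-backed storage are all assumptions:

```python
class StorageServiceSketch:
    """Toy in-memory stand-in for the Storage Service interfaces."""

    def __init__(self):
        self.vertices = {}   # (vertex_id, tag_id) -> property dict
        self.out_edges = {}  # vertex_id -> list of (dst, property dict)

    def insert_vertex(self, vertex_id, tag_id, props):
        self.vertices[(vertex_id, tag_id)] = props

    def insert_edge(self, src, dst, props):
        self.out_edges.setdefault(src, []).append((dst, props))

    def get_neighbors(self, vertex_ids, prop_filter=lambda p: True):
        # Return out-edges whose properties pass the condition filter.
        return {v: [(d, p) for d, p in self.out_edges.get(v, [])
                    if prop_filter(p)]
                for v in vertex_ids}

    def get_props(self, vertex_id, tag_id):
        return self.vertices.get((vertex_id, tag_id))
```

Note how `get_neighbors` takes a filter: this is the BFS-with-property-filtering pattern the sharding section argues must be fast, which is why a vertex and its edges live in the same partition.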
@@ -90,7 +96,7 @@

### Learner
The learner role mainly exists to `handle scaling out`: a new machine needs to "catch up" on data for quite a long time, and accidents may happen during this period. If it starts catching up directly as a follower, the HA capacity of the whole cluster drops. The learner in Nebula Graph is implemented with the command WAL mentioned above. When writing the WAL, if the leader meets an `add learner` command, it adds the learner to its peers and marks it as a learner, so learners are not counted toward the majority, but logs are still sent to them as usual. Naturally, learners do not initiate elections either.
Nebula Graph
There are two key challenges in implementing Multi Raft Group. **The first is how to share the transport layer.** Each Raft Group sends messages to its corresponding peers, and if the transport layer cannot be shared, the connection cost is very high. **The second is how to design the multi-threading model.** Raft Groups must share the same thread pool to avoid starting too many threads and incurring a high context-switch cost.
staring?
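Both points above — one connection per peer host and one thread pool for all groups — can be sketched in a few lines. This is a toy model under stated assumptions (class and method names are invented; it ignores serialization, retries, and everything else a real transport does):

```python
from concurrent.futures import ThreadPoolExecutor

class SharedRaftRuntime:
    """All Raft Groups on a host submit work to one pool and reuse
    one connection object per peer host (toy model of the two points)."""

    def __init__(self, num_threads=4):
        self.pool = ThreadPoolExecutor(max_workers=num_threads)
        self.connections = {}  # peer host -> connection handle

    def connection_to(self, host):
        # One connection per host, shared by every Raft Group,
        # instead of one connection per (group, host) pair.
        return self.connections.setdefault(host, object())

    def submit(self, group_id, fn, *args):
        # Every group shares the same bounded pool, so the thread
        # count stays fixed no matter how many partitions exist.
        return self.pool.submit(fn, *args)
```

With per-group connections, N partitions and M peers would need N×M sockets; sharing the transport layer caps this at M, and sharing the pool caps thread count regardless of partition count.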
* improve storage design
* minor
* rebase upstream

Co-authored-by: dutor <440396+dutor@users.noreply.github.com>
What changes were proposed in this pull request?
Improve storage design markdown file
Why are the changes needed?
Some errors.
Does this PR introduce any user-facing change?
No
How was this patch tested?
No need