Commit 4e61d95

HBASE-27516 Document the table based replication queue storage in ref guide
1 parent 772acaa commit 4e61d95

File tree

1 file changed: +61 −28 lines

src/main/asciidoc/_chapters/ops_mgt.adoc

Lines changed: 61 additions & 28 deletions
@@ -2429,17 +2429,13 @@ This option was introduced in link:https://issues.apache.org/jira/browse/HBASE-1
 ==== Replication Internals
 
 Replication State Storage::
-In HBASE-15867, we abstract two interfaces for storing replication state,
-`ReplicationPeerStorage` and `ReplicationQueueStorage`. The former one is for storing the
-replication peer related states, and the latter one is for storing the replication queue related
-states.
-HBASE-15867 is only half done, as although we have abstract these two interfaces, we still only
-have zookeeper based implementations.
+In HBASE-27110, we implemented a file system based replication peer storage, which stores replication peer state on the file system. You can still use the ZooKeeper based replication peer storage.
+In HBASE-27109, we changed the replication queue storage from ZooKeeper based to hbase table based. See the <<hbase:replication,hbase:replication>> section below for more details.
 
 Replication State in ZooKeeper::
 By default, the state is contained in the base node _/hbase/replication_.
-Usually this nodes contains two child nodes, the `peers` znode is for storing replication peer
-state, and the `rs` znodes is for storing replication queue state.
+Usually this node contains two child nodes: the `peers` znode stores replication peer state, and the `rs` znode stores replication queue state. If you choose the file system based replication peer storage, you will not see the `peers` znode.
+Starting from 3.0.0, we have moved the replication queue state to the <<hbase:replication,hbase:replication>> table, so you will not see the `rs` znode either.
 
 The `Peers` Znode::
 The `peers` znode is stored in _/hbase/replication/peers_ by default.
@@ -2454,6 +2450,12 @@ The `RS` Znode::
 The child znode name is the region server's hostname, client port, and start code.
 This list includes both live and dead region servers.
 
+[[hbase:replication]]
+hbase:replication::
+After 3.0.0, the queue state is stored in the hbase:replication table, where the row key is <PeerId>-<ServerName>[/<SourceServerName>], the WAL group name is the qualifier, and the serialized ReplicationGroupOffset is the value.
+The ReplicationGroupOffset includes the WAL file of the corresponding queue (<PeerId>-<ServerName>[/<SourceServerName>]) and its offset.
+Because we track the replication offset per queue instead of per file, we only need to store one replication offset per queue.
+
 Other implementations for `ReplicationPeerStorage`::
 Starting from 2.6.0, we introduce a file system based `ReplicationPeerStorage`, which stores
 the replication peer state with files on HFile file system, instead of znodes on ZooKeeper.
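The <PeerId>-<ServerName>[/<SourceServerName>] row key format above can be illustrated with a short sketch. This is plain Python, not the HBase API; the helper names are hypothetical and exist only to show how the optional source-server suffix distinguishes a recovered queue from a normal one.

```python
# Illustrative sketch only (not the HBase API): building and parsing a
# hbase:replication row key of the form <PeerId>-<ServerName>[/<SourceServerName>].
def make_queue_row_key(peer_id, server_name, source_server_name=None):
    """Compose the row key for a replication queue."""
    key = f"{peer_id}-{server_name}"
    if source_server_name is not None:
        # A recovered queue also records the original (dead) server name.
        key += f"/{source_server_name}"
    return key

def parse_queue_row_key(row_key):
    """Split a row key back into (peer_id, server_name, source_server_name)."""
    peer_id, rest = row_key.split("-", 1)
    server_name, _, source = rest.partition("/")
    return peer_id, server_name, source or None
```

A recovered queue such as `2-1.1.1.3,60020,123456630/1.1.1.2,60020,123456790` thus reads as "peer 2's queue, originally on 1.1.1.2, now claimed by 1.1.1.3".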
@@ -2473,7 +2475,7 @@ A ZooKeeper watcher is placed on the _${zookeeper.znode.parent}/rs_ node of the
 This watch is used to monitor changes in the composition of the slave cluster.
 When nodes are removed from the slave cluster, or if nodes go down or come back up, the master cluster's region servers will respond by selecting a new pool of slave region servers to replicate to.
 
-==== Keeping Track of Logs
+==== Keeping Track of Logs (based on ZooKeeper)
 
 Each master cluster region server has its own znode in the replication znodes hierarchy.
 It contains one znode per peer cluster (if 5 slave clusters, 5 znodes are created), and each of these contains a queue of WALs to process.
@@ -2494,6 +2496,18 @@ If the log is in the queue, the path will be updated in memory.
 If the log is currently being replicated, the change will be done atomically so that the reader doesn't attempt to open the file when it has already been moved.
 Because moving a file is a NameNode operation, if the reader is currently reading the log, it won't generate any exception.
 
+==== Keeping Track of Logs (based on hbase table)
+
+After 3.0.0, for the table based implementation, the server name is part of the row key, which means there will be many rows for a given peer.
+
+For a normal replication queue, where the region server owning the WAL files is still alive, all the WAL files are kept in memory, so we do not need to get the WAL files from the replication queue storage.
+For a recovered replication queue, we can get the WAL files of the dead region server by listing the old WAL directory on HDFS, so in theory we do not need to store every WAL file in the replication queue storage.
+Moreover, the WAL file name usually contains its creation time, so all the WAL files in a WAL group can be sorted (the current replication framework does sort them), which means we only need to store one replication offset per queue.
+When starting a recovered replication queue, we skip all the files before this offset, and start replicating from this offset.
+
+For ReplicationLogCleaner, all the files before this offset can be deleted; the others cannot.
+
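The sort-and-skip logic described above can be sketched as follows. This is an illustrative sketch, not HBase code: it assumes WAL names carry a numeric creation-time suffix after the last dot (as in the `1.1.1.2,60020.1214` examples used elsewhere in this chapter).

```python
# Illustrative sketch only: order a WAL group's files by the timestamp
# embedded in the file name, then skip everything before the single
# stored offset of a recovered queue.
def wal_timestamp(wal_name):
    # Assumption: the creation time is the numeric suffix after the last '.'.
    return int(wal_name.rsplit(".", 1)[1])

def files_to_replicate(wal_files, offset_wal):
    """Return the WALs at or after the stored offset, in creation order."""
    ordered = sorted(wal_files, key=wal_timestamp)
    idx = ordered.index(offset_wal)
    # Files before `idx` are eligible for deletion by the log cleaner.
    return ordered[idx:]
```

Because the files sort deterministically, a single (file, offset) pair per queue is enough to reconstruct where replication should resume.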
 ==== Reading, Filtering and Sending Edits
 
 By default, a source attempts to read from a WAL and ship log entries to a sink as quickly as possible.
@@ -2519,12 +2533,12 @@ The next time the cleaning process needs to look for a log, it starts by using i
 NOTE: WALs are saved when replication is enabled or disabled as long as peers exist.
 
 [[rs.failover.details]]
-==== Region Server Failover
+==== Region Server Failover (based on ZooKeeper)
 
 When no region servers are failing, keeping track of the logs in ZooKeeper adds no value.
 Unfortunately, region servers do fail, and since ZooKeeper is highly available, it is useful for managing the transfer of the queues in the event of a failure.
-
-Each of the master cluster region servers keeps a watcher on every other region server, in order to be notified when one dies (just as the master does). When a failure happens, they all race to create a znode called `lock` inside the dead region server's znode that contains its queues.
+Each of the master cluster region servers keeps a watcher on every other region server, in order to be notified when one dies (just as the master does). When a failure happens, they all race to create a
+znode called `lock` inside the dead region server's znode that contains its queues.
 The region server that creates it successfully then transfers all the queues to its own znode, one at a time since ZooKeeper does not support renaming queues.
 After queues are all transferred, they are deleted from the old location.
 The znodes that were recovered are renamed with the ID of the slave cluster appended with the name of the dead server.
@@ -2587,27 +2601,38 @@ Some new logs were also created in the normal queues.
 The last region server will then try to lock 1.1.1.3's znode and will begin transferring all the queues.
 The new layout will be:
 
+==== Region Server Failover (based on hbase table)
+
+When a region server fails, the HMaster of the master cluster will trigger the SCP, and all replication queues on the failed region server will be claimed in the SCP.
+The claim queue operation simply removes the row of a replication queue and inserts a new row.
+Here, we use the multi row mutate endpoint so this happens atomically, which requires the data for a single peer to be in the same region.
+
+Next, the master cluster region server creates one new source thread per claimed queue, and each of the source threads follows the read/filter/ship pattern.
+The main difference is that those queues will never receive new data, since they do not belong to their new region server.
+When the reader hits the end of the last log, the queue's record is deleted and the master cluster region server closes that replication source.
+
+Given a master cluster with 3 region servers replicating to a single slave with id `2`, the following info represents the storage layout of the queues in hbase:replication at some point in time.
+The row key is <PeerId>-<ServerName>[/<SourceServerName>], and the value is WAL && Offset.
+
 ----
-/hbase/replication/rs/
-  1.1.1.1,60020,123456780/
-    2/
-      1.1.1.1,60020.1378 (Contains a position)
+<PeerId>-<ServerName>[/<SourceServerName>]            WAL && Offset
+2-1.1.1.1,60020,123456780                             1.1.1.1,60020.1234 (Contains a position)
+2-1.1.1.2,60020,123456790                             1.1.1.2,60020.1214 (Contains a position)
+2-1.1.1.3,60020,123456630                             1.1.1.3,60020.1280 (Contains a position)
+----
 
-      2-1.1.1.3,60020,123456630/
-        1.1.1.3,60020.1325 (Contains a position)
-        1.1.1.3,60020.1401
+Assume that 1.1.1.2 failed.
+The survivors will race to claim its queues, and, arbitrarily, 1.1.1.3 wins.
+It will claim all the queues of 1.1.1.2, removing the row of each replication queue and inserting a new row (where the server name is changed to the region server which claims the queue).
+Finally, the layout will look like the following:
 
-      2-1.1.1.2,60020,123456790-1.1.1.3,60020,123456630/
-        1.1.1.2,60020.1312 (Contains a position)
-  1.1.1.3,60020,123456630/
-    lock
-    2/
-      1.1.1.3,60020.1325 (Contains a position)
-      1.1.1.3,60020.1401
+----
 
-      2-1.1.1.2,60020,123456790/
-        1.1.1.2,60020.1312 (Contains a position)
+<PeerId>-<ServerName>[/<SourceServerName>]            WAL && Offset
+2-1.1.1.1,60020,123456780                             1.1.1.1,60020.1234 (Contains a position)
+2-1.1.1.3,60020,123456630                             1.1.1.3,60020.1280 (Contains a position)
+2-1.1.1.3,60020,123456630/1.1.1.2,60020,123456790     1.1.1.2,60020.1214 (Contains a position)
 ----
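The claim operation described above can be sketched as a remove-row-plus-insert-row step. This is a minimal illustration, not the HBase implementation: the queue table is modeled as a plain dict keyed by row key, and the single function body stands in for the atomic multi row mutation.

```python
# Minimal sketch (not the HBase implementation): claiming a dead server's
# replication queues by deleting each old row and inserting a new row whose
# key appends the dead server as the source server of the claiming server.
def claim_queues(queue_table, dead_server, claiming_server):
    """Move every queue row owned by dead_server to claiming_server."""
    for row_key in [k for k in queue_table if f"-{dead_server}" in k]:
        peer_id, _rest = row_key.split("-", 1)
        # In the real storage this delete + put is one atomic multi row
        # mutation, which is why all rows of a peer must be in one region.
        offsets = queue_table.pop(row_key)
        new_key = f"{peer_id}-{claiming_server}/{dead_server}"
        queue_table[new_key] = offsets
```

Applied to the layout above, the row `2-1.1.1.2,60020,123456790` disappears and `2-1.1.1.3,60020,123456630/1.1.1.2,60020,123456790` appears, carrying the same WAL offset value.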
 
 === Replication Metrics
@@ -2694,6 +2719,14 @@ The following metrics are exposed at the global region server level and at the p
 | The directory for storing replication peer state, when filesystem replication
 peer storage is specified
 | peers
+
+| hbase.replication.queue.table.name
+| The table for storing replication queue state
+| hbase:replication
+
+| hbase.replication.queue.storage.impl
+| The replication queue storage implementation
+| TableReplicationQueueStorage
 |===
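The two queue storage properties added above could be set explicitly in hbase-site.xml. The fragment below is a sketch that simply restates the default values from the table; it changes nothing unless you substitute your own values.

```xml
<!-- Sketch: replication queue storage settings, shown with the
     default values from the configuration table above. -->
<property>
  <name>hbase.replication.queue.table.name</name>
  <value>hbase:replication</value>
</property>
<property>
  <name>hbase.replication.queue.storage.impl</name>
  <value>TableReplicationQueueStorage</value>
</property>
```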
 
 === Monitoring Replication Status
