Commit 521b6e2

HBASE-27516 Document the table based replication queue storage in ref guide
1 parent b1a7069 commit 521b6e2


src/main/asciidoc/_chapters/ops_mgt.adoc

Lines changed: 35 additions & 88 deletions
Original file line number | Diff line number | Diff line change
@@ -2433,26 +2433,22 @@ Replication State Storage::
24332433
`ReplicationPeerStorage` and `ReplicationQueueStorage`. The former one is for storing the
24342434
replication peer related states, and the latter one is for storing the replication queue related
24352435
states.
2436-
HBASE-15867 is only half done, as although we have abstract these two interfaces, we still only
2437-
have zookeeper based implementations.
2436+
In HBASE-27109, we implemented the `ReplicationQueueStorage` interface to store the replication queue state in the hbase:replication table.
24382437

24392438
Replication State in ZooKeeper::
24402439
By default, the state is contained in the base node _/hbase/replication_.
2441-
Usually this nodes contains two child nodes, the `peers` znode is for storing replication peer
2442-
state, and the `rs` znodes is for storing replication queue state.
2440+
Since 3.0.0, it contains only one child node, the `peers` znode; before 3.0.0, ZooKeeper was also used to store the replication queue data.
24432441

24442442
The `Peers` Znode::
24452443
The `peers` znode is stored in _/hbase/replication/peers_ by default.
24462444
It consists of a list of all peer replication clusters, along with the status of each of them.
24472445
The value of each peer is its cluster key, which is provided in the HBase Shell.
24482446
The cluster key contains a list of ZooKeeper nodes in the cluster's quorum, the client port for the ZooKeeper quorum, and the base znode for HBase in HDFS on that cluster.
24492447

2450-
The `RS` Znode::
2451-
The `rs` znode contains a list of WAL logs which need to be replicated.
2452-
This list is divided into a set of queues organized by region server and the peer cluster the region server is shipping the logs to.
2453-
The rs znode has one child znode for each region server in the cluster.
2454-
The child znode name is the region server's hostname, client port, and start code.
2455-
This list includes both live and dead region servers.
2448+
The `Queue` State::
2449+
Since 3.0.0, the queue state has been stored in the hbase:replication table, where the row key is <PeerId>-<ServerName>[/<SourceServerName>], the qualifier is the WAL group, and the value is the serialized ReplicationGroupOffset.
2450+
The ReplicationGroupOffset includes the WAL file of the corresponding queue (<PeerId>-<ServerName>[/<SourceServerName>]) and its offset.
2451+
Because we track the replication offset per queue instead of per file, we only need to store one replication offset per queue.
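To make the layout above concrete, here is a minimal, purely illustrative sketch; it is not HBase API and not the real ReplicationGroupOffset serialization format, and the WAL group name and offset are invented values. It only restates the schema: one row per queue, one cell per WAL group, and a value carrying a WAL file plus the replicated offset.

[source,java]
----
// Illustrative only: a hypothetical value class standing in for the serialized
// ReplicationGroupOffset described above.
public final class QueueRowSketch {
  record GroupOffset(String walFile, long offset) {}

  public static void main(String[] args) {
    String peerId = "2";
    String serverName = "1.1.1.1,60020,123456780";
    // Row key: <PeerId>-<ServerName> (no source server for a live queue).
    String rowKey = peerId + "-" + serverName;
    // Qualifier: the WAL group (name chosen for illustration).
    String walGroup = "1.1.1.1,60020";
    // Value: conceptually, the current WAL file of the queue and its offset.
    GroupOffset value = new GroupOffset("1.1.1.1,60020.1234", 2048L);
    System.out.println(rowKey + " : " + walGroup + " = " + value);
  }
}
----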
24562452

24572453
Other implementations for `ReplicationPeerStorage`::
24582454
Starting from 2.6.0, we introduce a file system based `ReplicationPeerStorage`, which stores
@@ -2475,14 +2471,14 @@ When nodes are removed from the slave cluster, or if nodes go down or come back
24752471

24762472
==== Keeping Track of Logs
24772473

2478-
Each master cluster region server has its own znode in the replication znodes hierarchy.
2479-
It contains one znode per peer cluster (if 5 slave clusters, 5 znodes are created), and each of these contain a queue of WALs to process.
2474+
Before 3.0.0, with the ZooKeeper based implementation, the layout is a tree: there is one znode per peer cluster, and under that znode there are many WAL file znodes.
2475+
Since 3.0.0, with the table based implementation, the server name is part of the row key, which means there will be many rows for a given peer.
24802476
Each of these queues will track the WALs created by that region server, but they can differ in size.
24812477
For example, if one slave cluster becomes unavailable for some time, the WALs should not be deleted, so they need to stay in the queue while the others are processed.
24822478
See <<rs.failover.details,rs.failover.details>> for an example.
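Since the server name is embedded in the row key, all queues of one peer share the `<PeerId>-` prefix, so they can be enumerated with a simple prefix scan over hbase:replication. The sketch below uses only the public HBase client API, but it is an illustration rather than code from the HBase source tree.

[source,java]
----
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public final class ListPeerQueuesSketch {
  public static void listQueues(Connection conn, String peerId) throws IOException {
    TableName tn = TableName.valueOf("hbase", "replication");
    // All queues of this peer share the "<PeerId>-" row key prefix.
    Scan scan = new Scan().setRowPrefixFilter(Bytes.toBytes(peerId + "-"));
    try (Table table = conn.getTable(tn);
         ResultScanner scanner = table.getScanner(scan)) {
      for (Result r : scanner) {
        // Each row is one queue: <PeerId>-<ServerName>[/<SourceServerName>]
        System.out.println(Bytes.toString(r.getRow()));
      }
    }
  }
}
----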
24832479

24842480
When a source is instantiated, it contains the current WAL that the region server is writing to.
2485-
During log rolling, the new file is added to the queue of each slave cluster's znode just before it is made available.
2481+
During log rolling, the new file is added to the queue record of each slave cluster just before it is made available.
24862482
This ensures that all the sources are aware that a new log exists before the region server is able to append edits into it, but this operation is now more expensive.
24872483
The queue items are discarded when the replication thread cannot read more entries from a file (because it reached the end of the last block) and there are other files in the queue.
24882484
This means that if a source is up to date and replicates from the log that the region server writes to, reading up to the "end" of the current file will not delete the item in the queue.
@@ -2521,93 +2517,36 @@ NOTE: WALs are saved when replication is enabled or disabled as long as peers ex
25212517
[[rs.failover.details]]
25222518
==== Region Server Failover
25232519

2524-
When no region servers are failing, keeping track of the logs in ZooKeeper adds no value.
2525-
Unfortunately, region servers do fail, and since ZooKeeper is highly available, it is useful for managing the transfer of the queues in the event of a failure.
2526-
2527-
Each of the master cluster region servers keeps a watcher on every other region server, in order to be notified when one dies (just as the master does). When a failure happens, they all race to create a znode called `lock` inside the dead region server's znode that contains its queues.
2528-
The region server that creates it successfully then transfers all the queues to its own znode, one at a time since ZooKeeper does not support renaming queues.
2529-
After queues are all transferred, they are deleted from the old location.
2530-
The znodes that were recovered are renamed with the ID of the slave cluster appended with the name of the dead server.
2520+
When a region server fails, the HMaster of the master cluster will trigger a ServerCrashProcedure (SCP), and all replication queues on the failed region server will be claimed as part of the SCP.
2521+
The claim queue operation simply removes the row of a replication queue and inserts a new row.
2522+
Here, we use the multi-row mutate endpoint to do this atomically, so the data for a single peer must be in the same region.
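As a rough illustration of what claiming a queue means at the table level, the following sketch is written against the public client API; it is not the actual HBase implementation. In particular it issues the put and the delete as two separate calls, while the real code wraps both mutations in the multi-row mutation coprocessor endpoint so the claim is atomic, which is exactly why all rows of a single peer must live in the same region.

[source,java]
----
import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Simplified sketch of the claim-queue idea, NOT the real implementation:
// move the queue row of the dead server under the claiming server, keeping the
// dead server as <SourceServerName> in the new row key.
public final class ClaimQueueSketch {
  public static void claimQueue(Connection conn, String peerId, String deadServer,
      String claimingServer) throws IOException {
    TableName tn = TableName.valueOf("hbase", "replication");
    byte[] oldRow = Bytes.toBytes(peerId + "-" + deadServer);
    byte[] newRow = Bytes.toBytes(peerId + "-" + claimingServer + "/" + deadServer);
    try (Table table = conn.getTable(tn)) {
      Result queue = table.get(new Get(oldRow));
      if (queue.isEmpty()) {
        return; // nothing to claim
      }
      Put put = new Put(newRow);
      for (Cell cell : queue.rawCells()) {
        // Copy every WAL-group -> offset cell onto the new row.
        put.addColumn(CellUtil.cloneFamily(cell), CellUtil.cloneQualifier(cell),
          CellUtil.cloneValue(cell));
      }
      // The real code applies both mutations atomically through the multi-row
      // mutation endpoint; two plain calls are used here only to keep the
      // sketch short.
      table.put(put);
      table.delete(new Delete(oldRow));
    }
  }
}
----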
25312523

25322524
Next, the master cluster region server creates one new source thread per copied queue, and each of the source threads follows the read/filter/ship pattern.
25332525
The main difference is that those queues will never receive new data, since they do not belong to their new region server.
2534-
When the reader hits the end of the last log, the queue's znode is deleted and the master cluster region server closes that replication source.
2535-
2536-
Given a master cluster with 3 region servers replicating to a single slave with id `2`, the following hierarchy represents what the znodes layout could be at some point in time.
2537-
The region servers' znodes all contain a `peers` znode which contains a single queue.
2538-
The znode names in the queues represent the actual file names on HDFS in the form `address,port.timestamp`.
2539-
2540-
----
2541-
2542-
/hbase/replication/rs/
2543-
1.1.1.1,60020,123456780/
2544-
2/
2545-
1.1.1.1,60020.1234 (Contains a position)
2546-
1.1.1.1,60020.1265
2547-
1.1.1.2,60020,123456790/
2548-
2/
2549-
1.1.1.2,60020.1214 (Contains a position)
2550-
1.1.1.2,60020.1248
2551-
1.1.1.2,60020.1312
2552-
1.1.1.3,60020, 123456630/
2553-
2/
2554-
1.1.1.3,60020.1280 (Contains a position)
2555-
----
2526+
When the reader hits the end of the last log, the queue's record is deleted and the master cluster region server closes that replication source.
25562527

2557-
Assume that 1.1.1.2 loses its ZooKeeper session.
2558-
The survivors will race to create a lock, and, arbitrarily, 1.1.1.3 wins.
2559-
It will then start transferring all the queues to its local peers znode by appending the name of the dead server.
2560-
Right before 1.1.1.3 is able to clean up the old znodes, the layout will look like the following:
2528+
Given a master cluster with 3 region servers replicating to a single slave with id `2`, the following shows what the queue storage layout in the hbase:replication table could be at some point in time.
2529+
The row key is <PeerId>-<ServerName>[/<SourceServerName>], and the value is the WAL and its offset.
25612530

25622531
----
25632532
2564-
/hbase/replication/rs/
2565-
1.1.1.1,60020,123456780/
2566-
2/
2567-
1.1.1.1,60020.1234 (Contains a position)
2568-
1.1.1.1,60020.1265
2569-
1.1.1.2,60020,123456790/
2570-
lock
2571-
2/
2572-
1.1.1.2,60020.1214 (Contains a position)
2573-
1.1.1.2,60020.1248
2574-
1.1.1.2,60020.1312
2575-
1.1.1.3,60020,123456630/
2576-
2/
2577-
1.1.1.3,60020.1280 (Contains a position)
2578-
2579-
2-1.1.1.2,60020,123456790/
2580-
1.1.1.2,60020.1214 (Contains a position)
2581-
1.1.1.2,60020.1248
2582-
1.1.1.2,60020.1312
2533+
<PeerId>-<ServerName>[/<SourceServerName>] WAL && Offset
2534+
2-1.1.1.1,60020,123456780 1.1.1.1,60020.1234 (Contains a position)
2535+
2-1.1.1.2,60020,123456790 1.1.1.2,60020.1214 (Contains a position)
2536+
2-1.1.1.3,60020,123456630 1.1.1.3,60020.1280 (Contains a position)
25832537
----
25842538

2585-
Some time later, but before 1.1.1.3 is able to finish replicating the last WAL from 1.1.1.2, it dies too.
2586-
Some new logs were also created in the normal queues.
2587-
The last region server will then try to lock 1.1.1.3's znode and will begin transferring all the queues.
2588-
The new layout will be:
2539+
Assume that 1.1.1.2 fails.
2540+
The survivors will claim its queues, and, arbitrarily, 1.1.1.3 wins.
2541+
It will claim all the queues of 1.1.1.2 by removing the old row of each replication queue and inserting a new row (where the server name is changed to the region server which claims the queue).
2542+
Finally, the layout will look like the following:
25892543

25902544
----
25912545
2592-
/hbase/replication/rs/
2593-
1.1.1.1,60020,123456780/
2594-
2/
2595-
1.1.1.1,60020.1378 (Contains a position)
2596-
2597-
2-1.1.1.3,60020,123456630/
2598-
1.1.1.3,60020.1325 (Contains a position)
2599-
1.1.1.3,60020.1401
2600-
2601-
2-1.1.1.2,60020,123456790-1.1.1.3,60020,123456630/
2602-
1.1.1.2,60020.1312 (Contains a position)
2603-
1.1.1.3,60020,123456630/
2604-
lock
2605-
2/
2606-
1.1.1.3,60020.1325 (Contains a position)
2607-
1.1.1.3,60020.1401
2608-
2609-
2-1.1.1.2,60020,123456790/
2610-
1.1.1.2,60020.1312 (Contains a position)
2546+
<PeerId>-<ServerName>[/<SourceServerName>] WAL && Offset
2547+
2-1.1.1.1,60020,123456780 1.1.1.1,60020.1234 (Contains a position)
2548+
2-1.1.1.3,60020,123456630 1.1.1.3,60020.1280 (Contains a position)
2549+
2-1.1.1.3,60020,123456630/1.1.1.2,60020,123456790 1.1.1.2,60020.1214 (Contains a position)
26112550
----
26122551

26132552
=== Replication Metrics
@@ -2694,6 +2633,14 @@ The following metrics are exposed at the global region server level and at the p
26942633
| The directory for storing replication peer state, when filesystem replication
26952634
peer storage is specified
26962635
| peers
2636+
2637+
| hbase.replication.queue.table.name
2638+
| The table for storing replication queue state
2639+
| hbase:replication
2640+
2641+
| hbase.replication.queue.storage.impl
2642+
| The replication queue storage implementation
2643+
| TableReplicationQueueStorage
26972644
|===
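The two properties above can also be set on a `Configuration` programmatically. The snippet below is only an illustration; the property names come from the table above, while the alternative table name used as a value is invented.

[source,java]
----
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public final class QueueStorageConfigSketch {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    // Store replication queue state in a custom table (illustrative value only;
    // the default is hbase:replication).
    conf.set("hbase.replication.queue.table.name", "hbase:my_replication_queue");
    return conf;
  }
}
----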
26982645

26992646
=== Monitoring Replication Status
