Commit b1bd80a

pizhenweisigemptyFujiZzhuojiang123zhangyiming1201
committed
Introduce Valkey Over RDMA protocol
RDMA stands for remote direct memory access. It is a technology that
lets computers in a network exchange data in main memory without
involving the processor, cache, or operating system of either computer.
As a result, RDMA performs better than TCP: test results show that
Valkey Over RDMA achieves ~2.5X QPS and lower latency.

In recent years RDMA has become popular in the data center, and the
RoCE (RDMA over Converged Ethernet) architecture in particular is widely
used. Cloud vendors have also started to offer RDMA instances to
accelerate networking performance, so end users can benefit from the
improvement easily.

Introduce the Valkey Over RDMA protocol as a new transport for Valkey.
For now, 4 commands are defined:
- GetServerFeature & SetClientFeature: these two commands negotiate
  features for further extension. There is no feature definition in
  this version. Flow control and multi-buffer may be supported in the
  future, which requires feature negotiation.
- Keepalive
- RegisterXferMemory: the heart of transferring the real payload.

The 'TX buffer' and 'RX buffer' are built on RDMA remote memory with
RDMA write/write with imm. This is similar, but not identical, to
several mechanisms introduced by papers:
- Socksdirect: datacenter sockets can be fast and compatible
  <https://dl.acm.org/doi/10.1145/3341302.3342071>
- LITE Kernel RDMA Support for Datacenter Applications
  <https://dl.acm.org/doi/abs/10.1145/3132747.3132762>
- FaRM: Fast Remote Memory
  <https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-dragojevic.pdf>

Link: valkey-io/valkey#477
Co-authored-by: Xinhao Kong <xinhao.kong@duke.edu>
Co-authored-by: Huaping Zhou <zhouhuaping.san@bytedance.com>
Co-authored-by: zhuo jiang <jiangzhuo.cs@bytedance.com>
Co-authored-by: Yiming Zhang <zhangyiming1201@bytedance.com>
Co-authored-by: Jianxi Ye <jianxi.ye@bytedance.com>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
1 parent f4ce160 commit b1bd80a

3 files changed

+176
-0
lines changed

topics/RDMA.md

Lines changed: 170 additions & 0 deletions
@@ -0,0 +1,170 @@
---
title: "RDMA support"
linkTitle: "RDMA support"
description: Valkey Over RDMA support
---

Valkey supports the Remote Direct Memory Access (RDMA) connection type via a
Valkey module that can be dynamically loaded on demand.

## Getting Started

RDMA enables direct data exchange between networked computers' main memory,
bypassing processors and operating systems.

As a result, RDMA offers better performance compared to TCP. Test results indicate that
Valkey Over RDMA achieves approximately 2 times higher QPS and lower latency.

Please note that Valkey Over RDMA is currently supported only on Linux.

## Running manually

To run a Valkey server in RDMA mode:

    ./src/valkey-server --protected-mode no \
         --loadmodule src/valkey-rdma.so bind=192.168.122.100 port=6379

The RDMA bind address and port can be changed at runtime with `CONFIG SET`.
For example, to change the port:

    192.168.122.100:6379> CONFIG SET rdma-port 6380

Valkey can serve both RDMA and TCP concurrently on the same port:

    ./src/valkey-server --protected-mode no \
         --loadmodule src/valkey-rdma.so bind=192.168.122.100 port=6379 \
         --port 6379

Note that the network interface (192.168.122.100 in this example) must support
RDMA. To check whether a host has an RDMA-capable device, run the following
(requires a recent version of the iproute2 package):

    ~# rdma dev show

Or, using the ibverbs-utils package on Debian/Ubuntu:

    ~# ibv_devices

## Protocol

The protocol uses the RC (Reliable Connection) QP type, which is connection-oriented like TCP,
and defines the communication commands and the payload exchange mechanism.
The protocol depends solely on the RDMA (aka Infiniband) specification
and is independent of both software (including the OS and user libraries)
and hardware (including vendors and low-level transports).

Valkey Over RDMA has a control plane (control messages) and a data plane (payload transfer).

### Control message

Control messages use fixed 32-byte big-endian message structures:
```C
typedef struct ValkeyRdmaFeature {
    /* one of the values listed under Opcodes */
    uint16_t opcode;
    /* select features */
    uint16_t select;
    uint8_t rsvd[20];
    /* feature bits */
    uint64_t features;
} ValkeyRdmaFeature;

typedef struct ValkeyRdmaKeepalive {
    /* one of the values listed under Opcodes */
    uint16_t opcode;
    uint8_t rsvd[30];
} ValkeyRdmaKeepalive;

typedef struct ValkeyRdmaMemory {
    /* one of the values listed under Opcodes */
    uint16_t opcode;
    uint8_t rsvd[14];
    /* address of a transfer buffer which is used to receive remote streaming data,
     * aka 'RX buffer address'. The remote side should use this as 'TX buffer address' */
    uint64_t addr;
    /* length of the 'RX buffer' */
    uint32_t length;
    /* the RDMA remote key of 'RX buffer' */
    uint32_t key;
} ValkeyRdmaMemory;

typedef union ValkeyRdmaCmd {
    ValkeyRdmaFeature feature;
    ValkeyRdmaKeepalive keepalive;
    ValkeyRdmaMemory memory;
} ValkeyRdmaCmd;
```
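
As an illustration of the big-endian layout, the sketch below serializes a `RegisterXferMemory` message into a 32-byte buffer by hand. It is not taken from the Valkey sources: the helper names (`put_be16`, `put_be32`, `put_be64`, `encode_register_xfer_memory`) are hypothetical, the byte offsets are derived from the `ValkeyRdmaMemory` field order shown above, the opcode value 3 comes from the Opcodes table, and zeroing the reserved bytes is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VALKEY_RDMA_CMD_SIZE 32         /* fixed control message size */
#define VALKEY_RDMA_OP_REGISTER_XFER 3  /* RegisterXferMemory opcode */

/* Hypothetical helpers: store an integer most-significant byte first. */
static void put_be16(uint8_t *p, uint16_t v) {
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)v;
}

static void put_be32(uint8_t *p, uint32_t v) {
    for (int i = 0; i < 4; i++) p[i] = (uint8_t)(v >> (24 - 8 * i));
}

static void put_be64(uint8_t *p, uint64_t v) {
    for (int i = 0; i < 8; i++) p[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* Byte offsets follow the ValkeyRdmaMemory field order:
 * opcode @0 (2 bytes), rsvd @2 (14), addr @16 (8), length @24 (4), key @28 (4). */
static void encode_register_xfer_memory(uint8_t out[VALKEY_RDMA_CMD_SIZE],
                                        uint64_t addr, uint32_t length,
                                        uint32_t key) {
    assert(out != NULL);
    memset(out, 0, VALKEY_RDMA_CMD_SIZE); /* assumption: reserved bytes are zero */
    put_be16(out + 0, VALKEY_RDMA_OP_REGISTER_XFER);
    put_be64(out + 16, addr);
    put_be32(out + 24, length);
    put_be32(out + 28, key);
}
```

Writing the bytes explicitly avoids depending on the host byte order or on compiler padding, at the cost of repeating the offsets that the struct definition already implies.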

### Opcodes
| Command | Value | Description |
| :----: | :----: | :----: |
| `GetServerFeature` | 0 | required, gets the features offered by the Valkey server |
| `SetClientFeature` | 1 | required, negotiates features and sets them on the Valkey server |
| `Keepalive` | 2 | required, detects unexpectedly orphaned connections |
| `RegisterXferMemory` | 3 | required, announces the 'RX transfer buffer' to the remote side, which uses it as its 'TX transfer buffer' |
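
A receiver has to read the opcode before it knows which member of `ValkeyRdmaCmd` applies. The following is a minimal sketch of that step, assuming the message arrives as a raw 32-byte buffer; `cmd_opcode` and `opcode_name` are hypothetical helpers, not part of Valkey, and only the enum values are taken from the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opcode values copied from the Opcodes table. */
enum ValkeyRdmaOpcode {
    GetServerFeature = 0,
    SetClientFeature = 1,
    Keepalive = 2,
    RegisterXferMemory = 3,
};

/* Hypothetical helper: read the big-endian opcode from the first two bytes. */
static uint16_t cmd_opcode(const uint8_t *msg) {
    assert(msg != NULL);
    return (uint16_t)((msg[0] << 8) | msg[1]);
}

/* Hypothetical helper: map an opcode to its name, or NULL if it is unknown. */
static const char *opcode_name(uint16_t opcode) {
    switch (opcode) {
    case GetServerFeature:   return "GetServerFeature";
    case SetClientFeature:   return "SetClientFeature";
    case Keepalive:          return "Keepalive";
    case RegisterXferMemory: return "RegisterXferMemory";
    default:                 return NULL;
    }
}
```

Returning NULL for unknown values lets the caller decide how to treat an opcode it does not recognize; the document does not specify that behavior.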

### RDMA Operations
- Send a control message with RDMA '**`ibv_post_send`**' using opcode '**`IBV_WR_SEND`**'; the payload is a
  'ValkeyRdmaCmd' structure.
- Receive a control message with RDMA '**`ibv_post_recv`**'; the receive buffer
  should be the size of 'ValkeyRdmaCmd'.
- Transfer stream data with RDMA '**`ibv_post_send`**' using opcode '**`IBV_WR_RDMA_WRITE`**' (optional) and
  '**`IBV_WR_RDMA_WRITE_WITH_IMM`**' (required). Data segments are written into a connection as
  RDMA [WRITE][WRITE][WRITE]...[WRITE WITH IMM], and the total length written is carried in the
  immediate data (an unsigned 32-bit integer).
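
The [WRITE]...[WRITE WITH IMM] pattern can be sketched without RDMA hardware as a planning step that splits one payload into segments. This is an illustration under assumptions, not the Valkey implementation: `enum xfer_op` stands in for `IBV_WR_RDMA_WRITE` and `IBV_WR_RDMA_WRITE_WITH_IMM` so the code builds without libibverbs, and `struct xfer_step`, `plan_stream_write`, and the segment size parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for IBV_WR_RDMA_WRITE and IBV_WR_RDMA_WRITE_WITH_IMM. */
enum xfer_op { XFER_WRITE, XFER_WRITE_WITH_IMM };

struct xfer_step {
    enum xfer_op op;
    uint32_t offset; /* offset of this segment within the payload */
    uint32_t length; /* bytes in this segment */
    uint32_t imm;    /* total payload length; set only on the final step */
};

/* Split `total` bytes into segments of at most `seg` bytes. Every segment
 * except the last is a plain write; the last is a write-with-immediate whose
 * immediate data carries `total`. Returns the number of steps produced, or 0
 * if the arguments are invalid or `steps` has fewer than the needed slots. */
static size_t plan_stream_write(uint32_t total, uint32_t seg,
                                struct xfer_step *steps, size_t max_steps) {
    if (total == 0 || seg == 0 || steps == NULL) return 0;
    size_t needed = (size_t)((total - 1) / seg) + 1;
    if (needed > max_steps) return 0;

    uint32_t offset = 0;
    for (size_t i = 0; i < needed; i++) {
        uint32_t left = total - offset;
        int last = (i + 1 == needed);
        steps[i].op = last ? XFER_WRITE_WITH_IMM : XFER_WRITE;
        steps[i].offset = offset;
        steps[i].length = left < seg ? left : seg;
        steps[i].imm = last ? total : 0;
        offset += steps[i].length;
    }
    assert(offset == total); /* every byte is covered exactly once */
    return needed;
}
```

Only the final work request generates a completion with immediate data on the receiving side, which is why the total length is attached there rather than to each segment.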

### Maximum WQEs of RDMA
No specific limit is defined; 1024 WQEs are recommended.
Flow control for WQEs MAY be defined/implemented in the future.

### The workflow of this protocol
```
                                                            valkey-server
                                                            listen RDMA port
valkey-client
      -------------------RDMA connect-------------------->
                                                            accept connection
      <--------------- Establish RDMA --------------------

      --------Get server feature [@IBV_WR_SEND] --------->

      --------Set client feature [@IBV_WR_SEND] --------->
                                                            setup RX buffer
      <---- Register transfer memory [@IBV_WR_SEND] ------
[@ibv_post_recv]
setup TX buffer
      ----- Register transfer memory [@IBV_WR_SEND] ----->
                                                            [@ibv_post_recv]
                                                            setup TX buffer
      -- Valkey commands [@IBV_WR_RDMA_WRITE_WITH_IMM] -->
      <- Valkey response [@IBV_WR_RDMA_WRITE_WITH_IMM] ---
                        .......
      -- Valkey commands [@IBV_WR_RDMA_WRITE_WITH_IMM] -->
      <- Valkey response [@IBV_WR_RDMA_WRITE_WITH_IMM] ---
                        .......


RX is full
      ----- Register transfer memory [@IBV_WR_SEND] ----->
                                                            [@ibv_post_recv]
                                                            setup TX buffer
      <- Valkey response [@IBV_WR_RDMA_WRITE_WITH_IMM] ---
                        .......

                                                            RX is full
      <---- Register transfer memory [@IBV_WR_SEND] ------
[@ibv_post_recv]
setup TX buffer
      -- Valkey commands [@IBV_WR_RDMA_WRITE_WITH_IMM] -->
      <- Valkey response [@IBV_WR_RDMA_WRITE_WITH_IMM] ---
                        .......

      -------------------RDMA disconnect----------------->
      <------------------RDMA disconnect------------------
```
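
The "setup TX buffer" and "RX is full" steps in the workflow suggest that the sender keeps a cursor into the peer's registered buffer and starts over when a new `RegisterXferMemory` arrives. The sketch below models only that bookkeeping. `struct tx_window` and both functions are hypothetical names, and granting a partial reservation when space runs short is an assumption; the workflow itself only shows that a full RX buffer is followed by re-registration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sender-side view of the peer's 'RX buffer', used locally as 'TX buffer'. */
struct tx_window {
    uint64_t addr;   /* remote address from RegisterXferMemory */
    uint32_t length; /* remote buffer length */
    uint32_t key;    /* RDMA remote key */
    uint32_t offset; /* bytes already written into the remote buffer */
};

/* Apply a RegisterXferMemory message: writing restarts at the buffer start. */
static void tx_window_register(struct tx_window *w, uint64_t addr,
                               uint32_t length, uint32_t key) {
    assert(w != NULL);
    w->addr = addr;
    w->length = length;
    w->key = key;
    w->offset = 0;
}

/* Reserve up to `want` bytes for the next RDMA write. Stores the remote
 * address to write to in *remote_addr and returns the bytes granted.
 * A return value of 0 means the window is full and the sender has to wait
 * for the peer to register transfer memory again. */
static uint32_t tx_window_reserve(struct tx_window *w, uint32_t want,
                                  uint64_t *remote_addr) {
    assert(w != NULL && remote_addr != NULL);
    uint32_t room = w->length - w->offset;
    uint32_t granted = want < room ? want : room;
    *remote_addr = w->addr + w->offset;
    w->offset += granted;
    return granted;
}
```

In this model a caller would invoke `tx_window_register` each time a `RegisterXferMemory` control message is received and `tx_window_reserve` before each write.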

The Valkey Over RDMA protocol is designed to transfer stream data efficiently. It is
similar, though not identical, to mechanisms introduced in the following academic papers:

* [Socksdirect: datacenter sockets can be fast and compatible](https://dl.acm.org/doi/10.1145/3341302.3342071)
* [LITE Kernel RDMA Support for Datacenter Applications](https://dl.acm.org/doi/abs/10.1145/3132747.3132762)
* [FaRM: Fast Remote Memory](https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-dragojevic.pdf)

topics/index.md

Lines changed: 1 addition & 0 deletions
@@ -50,6 +50,7 @@ Administration
 * [Persistence](persistence.md): Options for configuring durability using disk backups.
 * [Administration](admin.md): Various administration topics.
 * [Security](security.md): An overview of Valkey's security.
+* [RDMA](RDMA.md): An overview of RDMA support.
 * [Access Control Lists](acl.md): ACLs make it possible to allow users to run only selected commands and access only specific key patterns.
 * [Encryption](encryption.md): How to use TLS for communication.
 * [Signals Handling](signals.md): How Valkey handles signals.

wordlist

Lines changed: 5 additions & 0 deletions
@@ -157,6 +157,8 @@ ctx
 daemonize
 daemonized
 daemontools
+Datacenter
+datacenter
 dataset
 datastore
 dbid
@@ -346,6 +348,7 @@ incr
 incrby
 incrby_get_mget
 indexable
+Infiniband
 ing
 init
 int_vals
@@ -915,6 +918,8 @@ wherefrom
 whitespace
 whitespaces
 whos-using-redis
+WQE
+WQEs
 WSL2
 xack
 xadd
