Commit 08349a3

RFD 100: Proxy gRPC transport (gravitational#19439)
1 parent f942a4e commit 08349a3

3 files changed

+239
-0
lines changed

rfd/0100-proxy-ssh-grpc.md

Lines changed: 239 additions & 0 deletions
@@ -0,0 +1,239 @@
---
authors: Tim Ross (tim.ross@goteleport.com)
state: draft
---

# RFD 0100 - Use gRPC to proxy SSH connections to Nodes

# Required Approvers

* Engineering @zmb3 && (fspmarshall || espadolini)

## What

Add an alternate transport mechanism to the Proxy for proxying connections
to Nodes.

## Why

One of the primary contributors to `tsh ssh` connection latency is the
time it takes to perform an SSH handshake. All connections to a Node via
`tsh` are proxied via an SSH session established with the Proxy, which means
that to connect to a Node `tsh` must perform at least two SSH handshakes:
one with the Proxy to set up the connection transport and another with the
target Node over the transport to establish the user's SSH connection.

## Details

`tsh ssh` needs to connect to the target Node via the Proxy, but it
doesn't have to use SSH for that communication. A new gRPC service exposed
by the Proxy could perform the same operations as the existing SSH server
with less overhead to establish the session. To minimize
changes for both `tsh` and cluster admins, the existing SSH port can be multiplexed
to accept both SSH and gRPC by leveraging the TLS ALPN protocol `teleport-proxy-grpc-ssh`.
Any incoming request on the SSH listener with that ALPN protocol will be routed
to the gRPC server, and all other requests will be routed to the SSH server.

Note: a gRPC server is already exposed via the Proxy web address and uses the ALPN protocol
`teleport-proxy-grpc`. To avoid a conflict, a new ALPN protocol is required. Reusing the
existing gRPC server is not an option since it has aggressive keep-alive
parameters and is only enabled when TLS Routing is enabled.

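The routing rule described above can be sketched in Go as follows. This is a minimal illustration, not the Proxy's actual multiplexer: the `backend` type, the `route` function, and the idea of peeking at the first bytes of the connection are assumptions made for the example. Only the ALPN protocol name comes from this RFD.

```go
package main

import "fmt"

// alpnProxyGRPCSSH is the ALPN protocol name proposed in this RFD.
const alpnProxyGRPCSSH = "teleport-proxy-grpc-ssh"

// backend names the server that should handle a connection.
// The type and its values are hypothetical, for illustration only.
type backend string

const (
	backendSSH  backend = "ssh"
	backendGRPC backend = "grpc"
)

// looksLikeTLS reports whether the first bytes of a connection resemble a
// TLS handshake record: content type 0x16 followed by major version 3.
func looksLikeTLS(prefix []byte) bool {
	return len(prefix) >= 2 && prefix[0] == 0x16 && prefix[1] == 0x03
}

// route picks a backend from the peeked connection prefix and the ALPN
// protocol negotiated during the TLS handshake (empty if there was none).
// Everything that is not TLS with the new ALPN protocol goes to SSH.
func route(prefix []byte, negotiatedALPN string) backend {
	if looksLikeTLS(prefix) && negotiatedALPN == alpnProxyGRPCSSH {
		return backendGRPC
	}
	return backendSSH
}

func main() {
	// A plain SSH client starts with its identification string.
	fmt.Println(route([]byte("SSH-2.0-Go\r\n"), "")) // prints "ssh"
	// A TLS client that negotiated the new ALPN protocol.
	fmt.Println(route([]byte{0x16, 0x03, 0x01}, alpnProxyGRPCSSH)) // prints "grpc"
}
```

Falling back to the SSH server by default mirrors the rule stated above and keeps existing clients working without any configuration change.
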
### Proto Definition

The specification is modeled after the [ProxyService](https://github.com/gravitational/teleport/blob/master/api/proto/teleport/legacy/client/proto/proxyservice.proto)
which is a similar transport mechanism leveraged for Proxy Peering.

```proto
service ProxyConnectionService {
  // GetClusterDetails provides cluster information that may affect how transport
  // should occur.
  rpc GetClusterDetails(GetClusterDetailsRequest) returns (GetClusterDetailsResponse);

  // ProxySSH establishes an SSH connection to the target host over a bidirectional stream.
  //
  // The client must first send a request with dial_target populated before the
  // connection is established. Agent frames will be populated if SSH Agent
  // forwarding is enabled for the connection.
  rpc ProxySSH(stream ProxySSHRequest) returns (stream ProxySSHResponse);

  // ProxyCluster establishes a connection to the target cluster.
  //
  // The client must first send a ProxyClusterRequest with the desired cluster before the
  // connection is established.
  rpc ProxyCluster(stream ProxyClusterRequest) returns (stream ProxyClusterResponse);
}

// Request for ProxySSH
//
// The client must send a request with the Target
// populated before the transport is established
message ProxySSHRequest {
  // Contains the information about the connection target. Must
  // be sent first so the SSH connection can be established.
  Target dial_target = 1;
  // Raw SSH payload
  Frame ssh_frame = 2;
  // Raw SSH Agent payload, populated for agent forwarding
  Frame agent_frame = 3;
}

// Response for ProxySSH
message ProxySSHResponse {
  // Cluster information returned *ONLY* with the first frame
  ClusterDetails details = 1;
  // SSH payload
  Frame ssh_frame = 2;
  // SSH Agent payload, populated for agent forwarding
  Frame agent_frame = 3;
}

// Request for ProxyCluster
//
// The client must send a request with the cluster
// populated before the transport is established
message ProxyClusterRequest {
  // Name of the cluster to connect to. Must
  // be sent first so the connection can be established.
  string cluster = 1;
  // Raw payload
  Frame frame = 2;
}

// Response for ProxyCluster
message ProxyClusterResponse {
  // Raw payload
  Frame frame = 1;
}

// Encapsulates protocol specific payloads
message Frame {
  // The raw packet of data
  bytes payload = 1;
}

// Target indicates which server the connection is for
message Target {
  // The hostname/ip/uuid of the remote host
  string host = 1;
  // The port to connect to on the remote host
  int32 port = 2;
  // The cluster the server is a member of
  string cluster = 3;
}

// Request for GetClusterDetails.
message GetClusterDetailsRequest { }

// Response for GetClusterDetails.
message GetClusterDetailsResponse {
  // Cluster configuration details
  ClusterDetails details = 1;
}

// ClusterDetails contains details about the cluster configuration
message ClusterDetails {
  // If proxy recording mode is enabled
  bool recording_proxy = 1;
  // If the cluster is running in FIPS mode
  bool fips_enabled = 2;
}
```

The `ProxySSH` RPC establishes a connection to a Node on behalf of the user.
The client must first send a `Target` message, which declares the server that
the connection is for. If the target exists and session control allows it, the server
will establish the connection and respond with a first message carrying the
`ClusterDetails`. Each side may then send `Frame`s until the connection is terminated.

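The message ordering described above can be sketched as a small server-side state machine. The Go types below are hand-written stand-ins for the generated protobuf types, and `session`, `handle`, and `errTargetRequired` are hypothetical names. The sketch shows only the "target first, then frames" rule and leaves out dialing, RBAC, and session control.

```go
package main

import (
	"errors"
	"fmt"
)

// Target, Frame and ProxySSHRequest are simplified stand-ins for the
// generated protobuf types; they are not the real generated code.
type Target struct {
	Host    string
	Port    int32
	Cluster string
}

type Frame struct{ Payload []byte }

type ProxySSHRequest struct {
	DialTarget *Target
	SSHFrame   *Frame
	AgentFrame *Frame
}

// errTargetRequired is returned when frames arrive before a target.
var errTargetRequired = errors.New("first message must carry dial_target")

// session tracks the state of one ProxySSH stream on the server side.
type session struct {
	target *Target
	ssh    [][]byte // SSH payloads received so far
	agent  [][]byte // SSH Agent payloads received so far
}

// handle applies one client message to the session. The first message must
// name the target; later messages carry SSH and/or agent frames.
func (s *session) handle(req *ProxySSHRequest) error {
	if s.target == nil {
		if req.DialTarget == nil {
			return errTargetRequired
		}
		s.target = req.DialTarget
		return nil
	}
	if req.SSHFrame != nil {
		s.ssh = append(s.ssh, req.SSHFrame.Payload)
	}
	if req.AgentFrame != nil {
		s.agent = append(s.agent, req.AgentFrame.Payload)
	}
	return nil
}

func main() {
	s := &session{}
	// Sending a frame before the target is rejected.
	fmt.Println(s.handle(&ProxySSHRequest{SSHFrame: &Frame{Payload: []byte("x")}}))
	// Target first, then frames.
	fmt.Println(s.handle(&ProxySSHRequest{DialTarget: &Target{Host: "foo", Port: 22, Cluster: "root"}}))
	fmt.Println(s.handle(&ProxySSHRequest{SSHFrame: &Frame{Payload: []byte("SSH-2.0-demo")}}))
	fmt.Println(len(s.ssh), len(s.agent))
}
```
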
Since the Proxy creates an SSH connection to the Node on behalf of the user in proxy
recording mode, the user *must* forward their agent to facilitate the connection.
Currently, when `tsh` determines that the Proxy is performing the session recording, it
forwards the user's agent over an SSH channel. The Proxy then speaks the SSH Agent protocol
over that channel to sign requests. `tsh` uses `agent.ForwardToAgent` and
`agent.RequestAgentForwarding` from `x/crypto/ssh/agent` to set up the channel and serve
the agent over the channel to the Proxy.

To achieve the same functionality using the gRPC stream proposed above, the SSH Agent
protocol can be multiplexed over the stream in addition to the SSH protocol. When `tsh`
determines proxy recording is in effect, it can leverage `agent.ServeAgent` directly, passing
in an `io.ReadWriter` that sends agent `Frame`s when it is written to and receives them when
it is read from. The server side can communicate with the user's local agent by using
`agent.NewClient` on a similar `io.ReadWriter`.

The end result is that both the SSH and SSH Agent protocols are transported across the same stream,
which enables the SSH connection to the target Node and allows the Proxy to communicate
with the user's local SSH agent in a manner similar to the way it works today.

## Performance

Below are two traces, one captured with each Proxy transport mechanism, that illustrate the latency
reduction.

#### SSH
![SSH Transport](assets/0100-ssh-transport.png)
#### gRPC
![gRPC Transport](assets/0100-grpc-transport.png)


The existing SSH transport took 6.73s to execute `tsh ssh user@foo uptime`, while the same
command via the gRPC transport took 5.36s, a ~20% reduction in latency.

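For reference, the reduction follows directly from the two measurements quoted above:

$$
\frac{6.73\,\text{s} - 5.36\,\text{s}}{6.73\,\text{s}} = \frac{1.37}{6.73} \approx 0.204
$$
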
## Future Considerations

### Session Resumption

The proposed transport mechanism can be extended to support session resumption by altering
the `Target` and `Frame` messages to include a connection id and sequence number:


```proto
// Encapsulates protocol specific payloads
message Frame {
  // The raw packet of data
  bytes payload = 1;
  // A unique identifier for connection
  uint64 connection_id = 2;
  // The position of the frame in relation to others
  // for this connection
  uint64 sequence_number = 3;
}

// Target indicates which server to connect to
message Target {
  // The hostname/ip/uuid of the remote host
  string host = 1;
  // The port to connect to on the remote host
  int32 port = 2;
  // The cluster the server is a member of
  string cluster = 3;
  // The unique identifier for the connection. When
  // populated it indicates the session is being resumed.
  uint64 connection_id = 4;
  // The frame to resume the connection from. Both the
  // connection_id and sequence_number must be provided for
  // resumption.
  uint64 sequence_number = 5;
}
```

The `connection_id` and `sequence_number` identify which connection a `Frame` belongs to and
the position of the `Frame` relative to others for that connection. To resume a session, the
`Target` must populate both the `connection_id` and `sequence_number`. If the `connection_id` is
unknown to the Node, the connection is aborted. Otherwise, all frames with a `sequence_number`
greater than or equal to the provided value are resent after the SSH connection is established.

The Node must maintain a mapping of `connection_id` to `Frame`s that keeps a backlog of the most
recent `Frame`s in the correct order.

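One way to picture the backlog is the Go sketch below. It is an assumption-laden illustration rather than a proposed implementation: the `backlog` type, its fixed per-connection `limit`, and the `record` and `resume` methods are hypothetical names chosen for the example.

```go
package main

import "fmt"

// frame is a simplified stand-in for the Frame message with a sequence number.
type frame struct {
	seq     uint64
	payload []byte
}

// backlog keeps the most recent frames per connection_id, oldest first.
type backlog struct {
	limit  int // maximum frames retained per connection (assumed policy)
	frames map[uint64][]frame
}

func newBacklog(limit int) *backlog {
	return &backlog{limit: limit, frames: make(map[uint64][]frame)}
}

// record appends a frame for the connection and evicts the oldest frames
// once the backlog grows past the limit.
func (b *backlog) record(connID uint64, f frame) {
	fs := append(b.frames[connID], f)
	if len(fs) > b.limit {
		fs = fs[len(fs)-b.limit:]
	}
	b.frames[connID] = fs
}

// resume returns, in order, the retained frames whose sequence number is
// greater than or equal to fromSeq. ok is false if the connection is
// unknown, in which case the caller should abort the connection.
func (b *backlog) resume(connID, fromSeq uint64) (replay []frame, ok bool) {
	fs, ok := b.frames[connID]
	if !ok {
		return nil, false
	}
	for _, f := range fs {
		if f.seq >= fromSeq {
			replay = append(replay, f)
		}
	}
	return replay, true
}

func main() {
	b := newBacklog(3)
	for seq := uint64(1); seq <= 5; seq++ {
		b.record(7, frame{seq: seq, payload: []byte{byte(seq)}})
	}
	replay, ok := b.resume(7, 4)
	fmt.Println(len(replay), ok) // prints "2 true"
	_, ok = b.resume(9, 1)
	fmt.Println(ok) // prints "false"
}
```

The sketch does not decide what should happen when the requested `sequence_number` has already been evicted from the backlog. The text above does not specify that case either.
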
## Security

The gRPC server will require mTLS for authentication and perform the same RBAC
and session control checks as the current SSH server does. Agent forwarding will
occur as it does today with the exception that the SSH Agent Protocol will use a
gRPC stream instead of an SSH channel for transport.

## UX

The behavior of `tsh ssh` should remain the same regardless of the configured
session recording mode. The time it takes to establish a session may be noticeably
faster depending on proximity of the client and the Proxy.

rfd/assets/0100-grpc-transport.png

411 KB

rfd/assets/0100-ssh-transport.png

645 KB
