| 1 | +--- |
| 2 | +authors: Tim Ross (tim.ross@goteleport.com) |
| 3 | +state: draft |
| 4 | +--- |
| 5 | + |
| 6 | +# RFD 0100 - Use gRPC to proxy SSH connections to Nodes |
| 7 | + |
| 8 | +# Required Approvers |
| 9 | + |
| 10 | +* Engineering @zmb3 && (@fspmarshall || @espadolini) |
| 11 | + |
| 12 | +## What |
| 13 | + |
| 14 | +Add an alternate transport mechanism to the Proxy for proxying connections |
| 15 | +to Nodes. |
| 16 | + |
| 17 | +## Why |
| 18 | + |
| 19 | +One of the primary contributors to `tsh ssh` connection latency is the |
| 20 | +time it takes to perform an SSH handshake. All connections to a Node via |
| 21 | +`tsh` are proxied via an SSH session established with the Proxy, which means |
| 22 | +that in order to connect to a Node `tsh` must perform at least two SSH handshakes: |
| 23 | +one with the Proxy to set up the connection transport and another with the |
| 24 | +target Node over the transport to establish the user's SSH connection. |
| 25 | + |
| 26 | +## Details |
| 27 | + |
| 28 | +`tsh ssh` needs to connect to the target Node via the Proxy, but it |
| 29 | +doesn't have to use SSH for that communication. A new gRPC service exposed |
| 30 | +by the Proxy could perform the same operations as the existing SSH server |
| 31 | +but without as much overhead required to establish the session. To minimize |
| 32 | +changes for both `tsh` and cluster admins, the existing SSH port can be multiplexed |
| 33 | +to accept both SSH and gRPC by leveraging the TLS ALPN protocol `teleport-proxy-grpc-ssh`. |
| 34 | +Any incoming requests on the SSH listener with said ALPN protocol will be routed |
| 35 | +to the gRPC server and all other requests to the SSH server. |
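
The routing decision described above can be sketched in a few lines of Go. This is a minimal illustration, not the Proxy's actual multiplexer: `routeByALPN` and the handler names are hypothetical, and the sketch assumes the negotiated protocol has already been read from the TLS handshake (for example from `tls.ConnectionState.NegotiatedProtocol`).

```go
package main

// alpnProxyGRPCSSH is the ALPN protocol proposed in this RFD.
const alpnProxyGRPCSSH = "teleport-proxy-grpc-ssh"

// Hypothetical handler names used only in this sketch.
const (
	handlerGRPC = "grpc"
	handlerSSH  = "ssh"
)

// routeByALPN picks a handler for a connection accepted on the SSH listener.
// Only the new ALPN protocol is sent to the gRPC server; an empty or
// unrecognized value falls through to the existing SSH server.
func routeByALPN(negotiated string) string {
	if negotiated == alpnProxyGRPCSSH {
		return handlerGRPC
	}
	return handlerSSH
}
```

Defaulting to the SSH server keeps existing clients working unchanged, since they never negotiate the new protocol.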
| 36 | + |
| 37 | +Note: a gRPC server is already exposed via the Proxy web address that uses the ALPN protocol |
| 38 | +`teleport-proxy-grpc`. To avoid a conflict, a new ALPN protocol is required. Reusing the |
| 39 | +existing gRPC server is not an option since it has aggressive keep alive |
| 40 | +parameters and is only enabled when TLS Routing is enabled. |
| 41 | + |
| 42 | +### Proto Definition |
| 43 | + |
| 44 | +The specification is modeled after the [ProxyService](https://github.com/gravitational/teleport/blob/master/api/proto/teleport/legacy/client/proto/proxyservice.proto) |
| 45 | +which is a similar transport mechanism leveraged for Proxy Peering. |
| 46 | + |
| 47 | +```proto |
| 48 | +service ProxyConnectionService { |
| 49 | + // GetClusterDetails provides cluster information that may affect how transport |
| 50 | + // should occur. |
| 51 | + rpc GetClusterDetails(GetClusterDetailsRequest) returns (GetClusterDetailsResponse); |
| 52 | +
|
| 53 | + // ProxySSH establishes an SSH connection to the target host over a bidirectional stream. |
| 54 | + // |
| 55 | + // The client must first send a DialTarget before the connection is established. Agent frames |
| 56 | + // will be populated if SSH Agent forwarding is enabled for the connection. |
| 57 | + rpc ProxySSH(stream ProxySSHRequest) returns (stream ProxySSHResponse); |
| 58 | + |
| 59 | + // ProxyCluster establishes a connection to the target cluster. |
| 60 | + // |
| 61 | + // The client must first send a ProxyClusterRequest with the desired cluster before the |
| 62 | + // connection is established. |
| 63 | + rpc ProxyCluster(stream ProxyClusterRequest) returns (stream ProxyClusterResponse); |
| 64 | +} |
| 65 | +
|
| 66 | +// Request for ProxySSH |
| 67 | +// |
| 68 | +// The client must send a request with the Target |
| 69 | +// populated before the transport is established |
| 70 | +message ProxySSHRequest { |
| 71 | + // Contains the information about the connection target. Must |
| 72 | + // be sent first so the SSH connection can be established. |
| 73 | + Target dial_target = 1; |
| 74 | + // Raw SSH payload |
| 75 | + Frame ssh_frame = 2; |
| 76 | + // Raw SSH Agent payload, populated for agent forwarding |
| 77 | + Frame agent_frame = 3; |
| 78 | +} |
| 79 | +
|
| 80 | +// Response for ProxySSH |
| 81 | +message ProxySSHResponse { |
| 82 | + // Cluster information returned *ONLY* with the first frame |
| 83 | + ClusterDetails details = 1; |
| 84 | + // SSH payload |
| 85 | + Frame ssh_frame = 2; |
| 86 | + // SSH Agent payload, populated for agent forwarding |
| 87 | + Frame agent_frame = 3; |
| 88 | +} |
| 89 | +
|
| 90 | +// Request for ProxyCluster |
| 91 | +// |
| 92 | +// The client must send a request with the cluster |
| 93 | +// populated before the transport is established |
| 94 | +message ProxyClusterRequest { |
| 95 | + // Name of the cluster to connect to. Must |
| 96 | + // be sent first so the connection can be established. |
| 97 | + string cluster = 1; |
| 98 | + // Raw payload |
| 99 | + Frame frame = 2; |
| 100 | +} |
| 101 | +
|
| 102 | +// Response for ProxyCluster |
| 103 | +message ProxyClusterResponse { |
| 104 | + // Raw payload |
| 105 | + Frame frame = 1; |
| 106 | +} |
| 107 | +
|
| 108 | +// Encapsulates protocol specific payloads |
| 109 | +message Frame { |
| 110 | + // The raw packet of data |
| 111 | + bytes payload = 1; |
| 112 | +} |
| 113 | +
|
| 114 | +// Target indicates which server the connection is for |
| 115 | +message Target { |
| 116 | + // The hostname/ip/uuid of the remote host |
| 117 | + string host = 1; |
| 118 | + // The port to connect to on the remote host |
| 119 | + uint32 port = 2; |
| 120 | + // The cluster the server is a member of |
| 121 | + string cluster = 3; |
| 122 | +} |
| 123 | +
|
| 124 | +// Request for GetClusterDetails. |
| 125 | +message GetClusterDetailsRequest { } |
| 126 | +
|
| 127 | +// Response for GetClusterDetails. |
| 128 | +message GetClusterDetailsResponse { |
| 129 | + // Cluster configuration details |
| 130 | + ClusterDetails details = 1; |
| 131 | +} |
| 132 | +
|
| 133 | +// ClusterDetails contains details about the cluster configuration |
| 134 | +message ClusterDetails { |
| 135 | + // If proxy recording mode is enabled |
| 136 | + bool recording_proxy = 1; |
| 137 | + // If the cluster is running in FIPS mode |
| 138 | + bool fips_enabled = 2; |
| 139 | +} |
| 140 | +``` |
| 141 | + |
| 142 | +The `ProxySSH` RPC establishes a connection to a Node on behalf of the user. |
| 143 | +The client must first send a `Target` message which declares the target server that |
| 144 | +the connection is for. If the target exists and session control allows, the server |
| 145 | +will establish the connection and respond with a message. Each side may then send |
| 146 | +`Frame`s until the connection is terminated. |
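
The ordering rule for `ProxySSH` (target first, frames afterwards) can be sketched as a check on the first request of the stream. The Go types below are hand-written stand-ins for the generated proto types, and `firstTarget` is a hypothetical helper, so treat this as an illustration of the rule rather than the server's implementation.

```go
package main

import "errors"

// target and proxySSHRequest are simplified stand-ins for the generated
// Target and ProxySSHRequest messages.
type target struct {
	host    string
	port    uint32
	cluster string
}

type proxySSHRequest struct {
	dialTarget *target
	sshFrame   []byte
	agentFrame []byte
}

var errMissingTarget = errors.New("first request must carry a dial target")

// firstTarget validates the first request received on the stream. The
// connection to the Node is only attempted once a target with a host is known.
func firstTarget(req proxySSHRequest) (target, error) {
	if req.dialTarget == nil || req.dialTarget.host == "" {
		return target{}, errMissingTarget
	}
	return *req.dialTarget, nil
}
```

A server following this sketch would reject a stream whose first message contains only an SSH or agent frame.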
| 147 | + |
| 148 | +Since the Proxy creates an SSH connection to the Node on behalf of the user in proxy |
| 149 | +recording mode, the user *must* forward their agent to facilitate the connection. |
| 150 | +Currently, when `tsh` determines the Proxy is performing the session recording, it will |
| 151 | +forward the user's agent over an SSH channel. The Proxy then communicates SSH Agent protocol |
| 152 | +over that channel to sign requests. `tsh` utilizes `agent.ForwardToAgent` and |
| 153 | +`agent.RequestAgentForwarding` from `x/crypto/ssh/agent` to set up the channel and serve |
| 154 | +the agent over the channel to the Proxy. |
| 155 | + |
| 156 | +To achieve the same functionality using the gRPC stream proposed above, the SSH Agent |
| 157 | +protocol can be multiplexed over the stream in addition to the SSH protocol. When `tsh` |
| 158 | +determines proxy recording is in effect it can leverage `agent.ServeAgent` directly, passing |
| 159 | +in an `io.ReadWriter` which sends and receives agent `Frame`s when it is written to and |
| 160 | +read from. The server side can communicate with the local agent by using `agent.NewClient` |
| 161 | +on a similar `io.ReadWriter`. |
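
One way to picture the `io.ReadWriter` mentioned above is a small adapter over the stream. In this sketch `frameStream` is a hypothetical stand-in for the generated gRPC stream (its `Send` and `Recv` carry the payload of one agent `Frame`), and `memStream` exists only so the adapter can be exercised without a network. The adapter could then be handed to `agent.ServeAgent` on the client or `agent.NewClient` on the server.

```go
package main

import "io"

// frameStream is a hypothetical stand-in for the generated gRPC stream.
type frameStream interface {
	Send(payload []byte) error
	Recv() ([]byte, error)
}

// frameReadWriter adapts a frameStream to io.ReadWriter.
type frameReadWriter struct {
	stream  frameStream
	pending []byte // bytes of the last received frame not yet consumed
}

var _ io.ReadWriter = (*frameReadWriter)(nil)

// Read drains the current frame before asking the stream for the next one,
// so callers may read with buffers smaller than a frame.
func (f *frameReadWriter) Read(p []byte) (int, error) {
	for len(f.pending) == 0 {
		payload, err := f.stream.Recv()
		if err != nil {
			return 0, err
		}
		f.pending = payload
	}
	n := copy(p, f.pending)
	f.pending = f.pending[n:]
	return n, nil
}

// Write sends p as a single frame. The slice is copied because the caller
// may reuse p after Write returns.
func (f *frameReadWriter) Write(p []byte) (int, error) {
	if err := f.stream.Send(append([]byte(nil), p...)); err != nil {
		return 0, err
	}
	return len(p), nil
}

// memStream is an in-memory frameStream used to try the adapter.
type memStream struct {
	in  [][]byte
	out [][]byte
}

func (m *memStream) Send(b []byte) error {
	m.out = append(m.out, b)
	return nil
}

func (m *memStream) Recv() ([]byte, error) {
	if len(m.in) == 0 {
		return nil, io.EOF
	}
	b := m.in[0]
	m.in = m.in[1:]
	return b, nil
}
```

Buffering the unread remainder of a frame matters because the agent protocol reads a length prefix and then a body, which rarely lines up with frame boundaries.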
| 162 | + |
| 163 | +The end result is both SSH and SSH Agent protocol being transported across the same stream, |
| 164 | +which enables the SSH connection to the target Node and allows the Proxy to communicate |
| 165 | +with the user's local SSH agent in a manner similar to the way it works today. |
| 166 | + |
| 167 | +## Performance |
| 168 | + |
| 169 | +Below are two traces captured with both Proxy transport mechanisms that illustrate the latency |
| 170 | +reduction. |
| 171 | + |
| 172 | +#### SSH |
| 173 | + |
| 174 | +#### gRPC |
| 175 | + |
| 176 | + |
| 177 | + |
| 178 | +The existing SSH transport took 6.73s to execute `tsh ssh user@foo uptime`, while the same |
| 179 | +command via the gRPC transport took 5.36s, resulting in a ~20% reduction in latency. |
| 180 | + |
| 181 | +## Future Considerations |
| 182 | + |
| 183 | +### Session Resumption |
| 184 | + |
| 185 | +The proposed transport mechanism can be extended to support session resumption by altering |
| 186 | +the `Target` and `Frame` messages to include a connection id and sequence number: |
| 187 | + |
| 188 | + |
| 189 | +```proto |
| 190 | +// Encapsulates protocol specific payloads |
| 191 | +message Frame { |
| 192 | + // The raw packet of data |
| 193 | + bytes payload = 1; |
| 194 | + // A unique identifier for connection |
| 195 | + uint64 connection_id = 2; |
| 196 | + // The position of the frame in relation to others |
| 197 | + // for this connection |
| 198 | + uint64 sequence_number = 3; |
| 199 | +} |
| 200 | +
|
| 201 | +// Target indicates which server to connect to |
| 202 | +message Target { |
| 203 | + // The hostname/ip/uuid of the remote host |
| 204 | + string host = 1; |
| 205 | + // The port to connect to on the remote host |
| 206 | + uint32 port = 2; |
| 207 | + // The cluster the server is a member of |
| 208 | + string cluster = 3; |
| 209 | + // The unique identifier for the connection. When |
| 210 | + // populated it indicates the session is being resumed. |
| 211 | + uint64 connection_id = 4; |
| 212 | + // The frame to resume the connection from. Both the |
| 213 | + // connection_id and sequence_number must be provided for |
| 214 | + // resumption. |
| 215 | + uint64 sequence_number = 5; |
| 216 | +} |
| 217 | +``` |
| 218 | + |
| 219 | +The `connection_id` and `sequence_number` identify which connection a `Frame` is for and |
| 220 | +the position of the `Frame` relative to others for that connection. To resume a session the |
| 221 | +`Target` must populate both the `connection_id` and `sequence_number`. If `connection_id` is |
| 222 | +unknown to the Node then the connection is aborted. All frames with a `sequence_number` greater than |
| 223 | +or equal to the provided value will be resent after the SSH connection is established. |
| 224 | + |
| 225 | +The Node must maintain a mapping of `connection_id` to `Frame`s which keeps a backlog of the most |
| 226 | +recent `Frame`s in the correct order. |
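
The backlog described above could look roughly like the following Go sketch. All names here (`frame`, `backlog`, `record`, `resume`) and the fixed per-connection limit are assumptions made for illustration. The RFD does not specify how many frames are retained or how they are evicted.

```go
package main

// frame mirrors the extended Frame message: a payload plus the connection
// it belongs to and its position within that connection.
type frame struct {
	connectionID   uint64
	sequenceNumber uint64
	payload        []byte
}

// backlog keeps the most recent frames per connection, oldest first.
type backlog struct {
	limit  int // assumed cap on retained frames per connection
	frames map[uint64][]frame
}

func newBacklog(limit int) *backlog {
	return &backlog{limit: limit, frames: make(map[uint64][]frame)}
}

// record appends f to its connection's backlog and drops the oldest
// frames once the limit is exceeded.
func (b *backlog) record(f frame) {
	fs := append(b.frames[f.connectionID], f)
	if len(fs) > b.limit {
		fs = fs[len(fs)-b.limit:]
	}
	b.frames[f.connectionID] = fs
}

// resume returns, in order, every retained frame whose sequence number is
// greater than or equal to from. ok is false when the connection is unknown,
// in which case the caller would abort the connection.
func (b *backlog) resume(connectionID, from uint64) (out []frame, ok bool) {
	fs, ok := b.frames[connectionID]
	if !ok {
		return nil, false
	}
	for _, f := range fs {
		if f.sequenceNumber >= from {
			out = append(out, f)
		}
	}
	return out, true
}
```

A bounded backlog means a client may ask for frames that have already been evicted. How to handle that case is left open by this sketch.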
| 227 | + |
| 228 | +## Security |
| 229 | + |
| 230 | +The gRPC server will require mTLS for authentication and perform the same RBAC |
| 231 | +and session control checks as the current SSH server does. Agent forwarding will |
| 232 | +occur as it does today with the exception that the SSH Agent Protocol will use a |
| 233 | +gRPC stream instead of an SSH channel for transport. |
| 234 | + |
| 235 | +## UX |
| 236 | + |
| 237 | +The behavior of `tsh ssh` should remain the same regardless of the configured |
| 238 | +session recording mode. The time it takes to establish a session may be noticeably |
| 239 | +faster depending on the proximity of the client to the Proxy. |