Release 0.18.0

What's New

- ziti#253 ziti-tunnel enroll should set non-zero exit status if an error occurs
- Rewrite of Xgress with the following goals
  - Fix deadlocks at high throughput
  - Fix stalls when some endpoints are slower than others
  - Improve windowing/retransmission by pulling forward some concepts from Michael Quigley's
    transwarp work
  - Split xgress links into two separate connections, one for data and one for acks
- Allow hosting applications to mark incoming connections as failed. Update go tunneler so when a
  dial fails for hosted services, the failure gets propagated back to the controller
- Streamline edge hosting protocol by allowing router to assign connection ids
- Edge REST query failures should now result in 4xx errors instead of 500 internal server errors
- Fixed bug where listing terminators via ziti edge would fail when terminators referenced pure
  fabric services

Xgress Rewrite

Overview

This rewrite fixed several deadlocks observed at high throughput. It also tries to ensure that slow
clients attached to a router can't block traffic/processing for faster clients. It does this by
dropping data for a client if the client isn't handling incoming traffic quickly enough. Dropped
payloads will be retransmitted. The new xgress implementation uses similar windowing and
retransmission strategies to the upcoming transwarp work.
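
The drop-instead-of-block behavior described above can be illustrated with a non-blocking send on a
bounded queue. This is only a minimal sketch, not the fabric implementation: the `payload` type and
the `offer` function are hypothetical names made up for the example.

```go
package main

import "fmt"

// payload is a hypothetical stand-in for an xgress payload.
type payload struct {
	seq  int32
	data []byte
}

// offer tries to queue p for a client without blocking. It returns false when
// the client's queue is full; the payload is then dropped, and the sender is
// expected to retransmit it later.
func offer(queue chan payload, p payload) bool {
	select {
	case queue <- p:
		return true
	default:
		return false
	}
}

func main() {
	// A queue with room for a single payload, so the second offer finds it full.
	slowClient := make(chan payload, 1)

	fmt.Println("queued:", offer(slowClient, payload{seq: 1}))
	fmt.Println("queued:", offer(slowClient, payload{seq: 2}))

	// Once the client drains its queue, a retransmission of seq 2 can be queued.
	<-slowClient
	fmt.Println("queued:", offer(slowClient, payload{seq: 2}))
}
```

Because the send never blocks, one slow client cannot hold up the goroutine that is also serving
faster clients; the cost is that dropped payloads have to be retransmitted.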

Backwards Compatibility

0.18+ routers will probably interoperate with older router versions, but probably not well. 0.18+
xgress instances expect ack messages to carry round trip times and receive buffer sizes. If that
information is missing, retransmission will likely be either too aggressive or not aggressive enough.

Mixing 0.18+ routers with older router versions is not recommended without doing more testing first.

Xgress Options Changes

Added

- txQueueSize - Number of payloads that can be queued for processing per client. Default value: 1
- txPortalStartSize - Initial size of the send window. Default value: 16KB
- txPortalMinSize - Smallest allowed send window size. Default value: 16KB
- txPortalMaxSize - Largest allowed send window size. Default value: 4MB
- txPortalIncreaseThresh - Number of successful acks after which to increase the send window size.
  Default value: 224
- txPortalIncreaseScale - The send window will be increased by the amount of data sent since the
  last retransmission. This controls how much to scale that amount by. Default value: 1.0
- txPortalRetxThresh - Number of retransmits after which to scale the send window. Default value: 64
- txPortalRetxScale - Amount by which to scale the send window after the retransmission threshold
  is hit. Default value: 0.75
- txPortalDupAckThresh - Number of duplicate acks after which to scale the send window. Default
  value: 64
- txPortalDupAckScale - Amount by which to scale the send window after the duplicate ack threshold
  is hit. Default value: 0.9
- rxBufferSize - Receive buffer size. Default value: 4MB
- retxStartMs - Time after which, if no ack has been received, a payload should be queued for
  retransmission. Default value: 200ms
- retxScale - Amount by which to scale the retransmission timeout, which is calculated from the
  round trip time. Default value: 2.0
- retxAddMs - Amount to add to the retransmission timeout after it has been scaled. Default value: 0
- maxCloseWaitMs - Maximum amount of time to wait for queued payloads to be
  acknowledged/retransmitted after an xgress session has been closed. If queued payloads are all
  acknowledged before this timeout is hit, the xgress session will be closed sooner. Default value:
  30s

REMOVED: The retransmission option is no longer available. Retransmission can't be toggled off
anymore as that would lead to lossy connections.

Xgress Metrics Changes

New metrics were introduced as part of the rewrite.

NOTE: Some of these metrics were introduced to try and find places where tuning was required.
They may not be interesting or useful in the long term and may be removed in a future release.

The new metrics include:

New Meters

xgress.dropped_payloads
- The count and rates of payloads being dropped
xgress.retransmissions
- The count and rates of payloads being retransmitted
xgress.retransmission_failures
- The count and rates of payloads being retransmitted where the send fails
xgress.rx.acks
- The count and rates of acks being received
xgress.tx.acks
- The count and rates of acks being sent
xgress.ack_failures
- The count and rates of acks being sent where the send fails
xgress.ack_duplicates
- The count and rates of duplicate acks received

New Histograms

xgress.rtt
- Round trip time statistics aggregated across all xgress instances
xgress.tx_window_size
- Local window size statistics aggregated across all xgress instances
xgress.tx_buffer_size
- Local send buffer size statistics aggregated across all xgress instances
xgress.local.rx_buffer_bytes_size
- Receive buffer size statistics in bytes aggregated across all xgress instances
xgress.local.rx_buffer_msgs_size
- Receive buffer size statistics in number of messages aggregated across all xgress instances
xgress.remote.rx_buffer_size
- Receive buffer size from remote systems statistics aggregated across all xgress instances

New Timers

xgress.tx_write_time
- Times how long it takes to write xgress payloads from xgress to the endpoint
xgress.ack_write_time
- Times how long it takes to write acks to the link
xgress.payload_buffer_time
- Times how long it takes to process xgress payloads coming off the link (mostly getting them
  into the receive buffer)
xgress.payload_relay_time
- Times how long it takes to get xgress payloads out of the receive buffer and queued to be sent

New Gauges

xgress.blocked_by_local_window
- Count of how many xgress instances are blocked because the local transmit buffer size equals or
  exceeds the window size
xgress.blocked_by_remote_window
- Count of how many xgress instances are blocked because the remote receive buffer size equals
  or exceeds the window size
xgress.tx_unacked_payloads
- Count of payloads in the transmit buffer
xgress.tx_unacked_payload_bytes
- Size in bytes of the transmit buffer

Split Links

The fabric will now create two channels for each link, one for data and the other for acks. When
establishing links the dialing side will attach headers indicating the channel type and a shared
link ID. If the receiving side doesn't support split links then it will treat both channels as
regular links and send both data and acks over both.

If an older router dials a router expecting split links it won't have the link type and will be
treated as a regular, non-split link.
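
The fallback behavior can be pictured as a simple header check on the accepting side. The sketch
below is illustrative only: the header keys, the header values and the `classify` function are all
hypothetical, since these notes don't give the actual header ids used by the fabric.

```go
package main

import "fmt"

// Hypothetical header keys. The real header ids are not given in these notes.
const (
	headerLinkID   = "linkId"
	headerLinkType = "linkType"
)

type channelRole int

const (
	roleRegular channelRole = iota // non-split link: carries both data and acks
	roleData                       // split link: data channel
	roleAck                        // split link: ack channel
)

// classify decides how an incoming channel should be used. A dialer that
// doesn't know about split links sends no link type header, so its channel
// falls through to the regular, non-split case.
func classify(headers map[string]string) channelRole {
	switch headers[headerLinkType] {
	case "data":
		return roleData
	case "ack":
		return roleAck
	default:
		return roleRegular
	}
}

func main() {
	older := map[string]string{}
	data := map[string]string{headerLinkID: "link-1", headerLinkType: "data"}
	ack := map[string]string{headerLinkID: "link-1", headerLinkType: "ack"}

	fmt.Println(classify(older) == roleRegular)
	fmt.Println(classify(data) == roleData)
	fmt.Println(classify(ack) == roleAck)
}
```

The shared link ID is what would let the accepting side pair the data and ack channels that belong
to the same link.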

Allow SDK Hosting Applications to propagate Dial Failures

The service terminator strategies use dial failures to adjust terminator weights and/or mark
terminators as failed. Previously SDK applications didn't have a way to mark a dial as failed. When
the SDK is embedded directly in the hosting application, this was generally not a problem: if the
application can be reached at all, it has no reason to mark an incoming connection as failed. The
tunneler, however, is just proxying connections. When the service is dialed, it needs to reach out
to another application and proxy data. If that dial fails, it wants to be able to notify the
controller that the application wasn't reachable. The golang SDK now has this capability.

There is a new API on edge.Listener.

	AcceptEdge() (Conn, error)

The Conn returned here is an edge.Conn (which extends net.Conn). edge.Conn has two new APIs.

	CompleteAcceptSuccess() error
	CompleteAcceptFailed(err error)

If ListenWithOptions is called with ManualStart: true in the provided options, the connection
won't be established until CompleteAcceptSuccess is called. Writing to or reading from the
connection before calling that method will have undefined results.

The ziti-tunnel has been updated to use this API, and so should now work correctly with the various
terminator strategies.

Edge Hosting Dial Protocol Enhancement

When establishing a new virtual connection to a hosted SDK application, the router previously had
to execute the following steps:

1. Send a Dial message to the sdk application
2. Receive the dial response, which included the sdk generated connection id
3. Create the router side virtual connection with the new id and register it
4. Create the xgress instance tied to the new connection
5. Now that the xgress is created, send a message to the sdk application letting it know that it
   can start sending traffic

If the connection id could be established on the router, we could simplify things as follows:

1. Create the router side virtual connection with the new id and register it
2. Create the xgress instance tied to the new connection
3. Send the dial message to the sdk with the connection id
4. Receive the response and return the result to the controller

We didn't do this previously because the sdk controls ids for outbound connections. To enable this
we have split the 32 bit id range in half. The top half is now reserved for hosted connection ids.
This behavior is controlled by the SDK, which requests it when it binds using a boolean flag. The
new flag is:

    RouterProvidedConnId = 1012

If the bind result from the router has the same flag set to true, then the sdk will expect Dial
messages from the router to have a connection id provided in the header keyed with the same 1012.

This means that this feature should be both backwards and forward compatible.
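
The id range split can be sketched as below. This is a minimal illustration that assumes ids are
treated as unsigned 32 bit values, so "the top half" means ids at or above 2^31. The names
`routerIdBase`, `isRouterAssigned` and `nextRouterId` are hypothetical and not taken from the router
or SDK code.

```go
package main

import "fmt"

// routerProvidedConnId is the flag and header key quoted above.
const routerProvidedConnId = 1012

// routerIdBase is the first id in the top half of the 32 bit range
// (assuming ids are interpreted as unsigned values).
const routerIdBase uint32 = 1 << 31

// isRouterAssigned reports whether id falls in the half of the range
// reserved for hosted connection ids assigned by the router.
func isRouterAssigned(id uint32) bool {
	return id >= routerIdBase
}

// nextRouterId is a hypothetical allocator that returns the id following
// prev and wraps around so that it never leaves the top half of the range.
func nextRouterId(prev uint32) uint32 {
	next := prev + 1
	if next < routerIdBase {
		// Either prev was below the reserved range or prev+1 overflowed.
		return routerIdBase
	}
	return next
}

func main() {
	fmt.Println("header key:", routerProvidedConnId)
	fmt.Println("sdk id 42 router assigned:", isRouterAssigned(42))

	id := nextRouterId(0)
	fmt.Println("first router id:", id, "router assigned:", isRouterAssigned(id))
	fmt.Println("after max id:", nextRouterId(^uint32(0)))
}
```

Keeping the two halves disjoint means a router assigned id can never collide with an id the sdk
picked for one of its own outbound connections.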
