New and improved kestrel.connection.duration tags: error.type and protocol error #56164

Open
@JamesNK

Description

Background and Motivation

See #53358. We want error.type to provide a low-cardinality reason for why a connection was closed in error scenarios.

Proposed API

kestrel.connection.duration is a histogram. A timer starts when a connection starts, and the elapsed time is recorded when the connection ends. Because the measurement is recorded at the end of the connection, it can carry information about why the connection ended.

I propose two changes:

  • Record the protocol error code for HTTP/2 and HTTP/3 on the connection duration metric.
    • Tag is named http.connection.protocol_error_code. There is an issue to discuss this with OTEL semantic conventions: open-telemetry/semantic-conventions#1135 (Add http.connection.protocol_error_code to connection duration metric). If the tag likely won't be standardized, then it could be prefixed with kestrel. instead.
    • The protocol error code is an unsigned integer.
    • The tag is omitted if the code is NO_ERROR (HTTP/2) or H3_NO_ERROR (HTTP/3).
    • The tag is present even if the server didn't physically send the error code to the client. For example, the transport for an HTTP connection closes unexpectedly and the server ends the connection with an error. The server internally records that the connection ended with a specific error code, even if it doesn't have the opportunity to send it to the client.
  • Kestrel keeps track of why a connection closes and sets the error.type tag to the close reason.
    • error.type isn't set for non-error reasons, e.g. the transport closing.
    • The error.type reason will mostly come from the HTTP layer, but HTTPS middleware can set a status if the connection failed the HTTPS handshake.
    • If there is already an error.type value then it takes priority over the reason (basically, the first value set to error.type is what's used).
    • The connection end reasons that are set to error.type follow the OTEL naming convention for enums: snake case, e.g. app_shutdown.
    • Errors that don't fall into one of the known connection end reasons have an error.type value of _OTHER (similar to other OTEL enums that have a set range of values).
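The tag rules above can be sketched roughly as follows. This is a language-neutral illustration in Python, not Kestrel's implementation: the class, its method names, and the contents of KNOWN_END_REASONS are hypothetical (only app_shutdown is named in this proposal). The no-error values are the NO_ERROR (0x0) and H3_NO_ERROR (0x100) codes from the HTTP/2 and HTTP/3 specifications.

```python
# Protocol "no error" codes: NO_ERROR for HTTP/2, H3_NO_ERROR for HTTP/3.
NO_ERROR_CODES = {2: 0x0, 3: 0x100}

# Hypothetical subset of known end reasons; only app_shutdown appears in the proposal.
KNOWN_END_REASONS = {"app_shutdown"}


class ConnectionMetricsContext:
    """Hypothetical per-connection state used to build the duration metric tags."""

    def __init__(self, http_major_version):
        self.http_major_version = http_major_version
        self.error_type = None           # may already be set by other code
        self.protocol_error_code = None  # unsigned integer, HTTP/2 and HTTP/3 only

    def try_set_end_reason(self, reason):
        # The first value set to error.type wins; later reasons are ignored.
        if self.error_type is not None:
            return False
        # Reasons outside the known set are reported as _OTHER.
        self.error_type = reason if reason in KNOWN_END_REASONS else "_OTHER"
        return True

    def build_tags(self):
        tags = {}
        if self.error_type is not None:
            tags["error.type"] = self.error_type
        no_error = NO_ERROR_CODES.get(self.http_major_version)
        # Omit the tag for HTTP/1.x, when no code was recorded, or for the no-error code.
        if (no_error is not None
                and self.protocol_error_code is not None
                and self.protocol_error_code != no_error):
            tags["http.connection.protocol_error_code"] = self.protocol_error_code
        return tags
```

In this sketch, an HTTP/2 connection that ends with NO_ERROR during app shutdown would be recorded with only error.type set to app_shutdown, and any later end reason would be ignored.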

Usage Examples

Someone wants to monitor incoming HTTP connections to the server.

  • Enable OTEL metrics collection for the Kestrel meter name.
  • Export telemetry to a telemetry store.
  • Query kestrel.connection.duration to see error.type values.
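As a rough illustration of the last step, the sketch below groups exported kestrel.connection.duration data points by their error.type tag. The shape of the data points and the sample values are hypothetical. A real telemetry store would expose this through its own query language.

```python
from collections import Counter


def count_by_error_type(points):
    """Count connection duration data points per error.type value."""
    counts = Counter()
    for point in points:
        # Connections that ended without an error have no error.type tag.
        counts[point["tags"].get("error.type", "(no error)")] += 1
    return counts


# Hypothetical exported data points, for illustration only.
sample_points = [
    {"duration_s": 12.5, "tags": {}},
    {"duration_s": 0.3, "tags": {"error.type": "app_shutdown"}},
    {"duration_s": 1.1, "tags": {"error.type": "_OTHER"}},
    {"duration_s": 4.0, "tags": {"error.type": "app_shutdown"}},
]

counts = count_by_error_type(sample_points)
```

Because error.type is low-cardinality, a grouping like this stays small even for a busy server.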

Alternative Designs

The connection end reason could instead always be set on its own tag, e.g. kestrel.connection.end_reason. In that case, it could include non-error values. However, there are few non-error connection end reasons, and error reasons would also be set to error.type. It doesn't seem valuable to me to have both tags when people are likely focused on connection errors.

kestrel.connection.end_reason could be added in the future if there is demand for it.

Risks

Kestrel tags might not match future OTEL semantic conventions around connections. We need to ensure that they do.

Labels

  • api-approved: API was approved in API review, it can be implemented
  • area-networking: Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions
