Skip to content

feat: add specific request tracing hooks #1816

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

SkArchon
Copy link
Contributor

@SkArchon SkArchon commented Apr 28, 2025

Motivation and Context

Sometimes subgraph requests can fail for users, and it might not be obvious why, by providing a way to log and add more span details, this can help users to figure out what is actually happening at the request level.

In order to do this the PR introduces the following

  • Add a traceclient wrapper: We create a custom roundtripper that uses the http.ClientTrace wrapper. This wrapper stores information from its hook into the request's context, which can be accessed outside the wrapper. (This is not tied to expressions and uses its own separate structs, but as an optimization we only wrap the roundTripper if subgraph.request.clientTrace used in any expression.)

  • Add subgraph to the expression context: This holds information about the currently executing subgraph. If clientTrace is used in an expression, we convert the struct used by the traceclient to the expression struct.

  • Enable expressions for subgraphs: At the moment we only allow defining expressions on access logs at the router level, this PR adds the capability to use expressions for access logs at the subgraph level.

  • Add tracingAttributes: Right now we have telemetryAttributes and metricAttributes, if a user defines telemetryAttributes we will attach all values to every span in the request AS WELL as metrics. We add tracingAttributes, which does the same but only attaches attributes to spans.

  • Evaluation of expressions that use subgraph.*, any expression that uses subgraph is evaluated in the engine_hooks_loader, because subgraph.* is empty outside of this. In order to do this we separate out expressions during the compilation step for both telemetryAttributes and tracingAttributes if they only should run in the subgraph. This prevents expressions focused on subgraphs being attached to unrelated spans.

Caveats

  • All expected output values for expressions are strings, which means if the output value is time, int, bool, the user will need to cast it to string. This can be done using the string function in exprlang. This follows the current behaviour of telemetryAttributes, which only allows strings.

  • Users should be aware of nils, if a hook wasnt executed for some reason the equivalent expr struct field would be nil.

Figuring out issues using this PR
Since the goal of this PR is to help users find out subgraph timeout, not connecting, etc issues, they can use the following to figure out (note that non string expressions need to be wrapped with string, but we will be ignoring using string to make it easy to read)

  • Assume that they want all values from the traceclient logged. For dialStart and dialDone, users will need to just log the slice
    Key: "connCreateTime", expr: `subgraph.request.clientTrace.connCreate?.time`
    Key: "connCreateHost", expr: `subgraph.request.clientTrace.connCreate?.hostPort`   
    Key: "connAcquired", expr: `subgraph.request.clientTrace.connAcquired?.reused`
    ...   
  • This is going to litter the subgraph Engine - Fetch span with information, if users are interested in Spans which error out, users can add the following
    subgraph.request.error != nil ? subgraph.request.clientTrace.connCreate?.time : nil

  • Users can get more specific information, if they are interested in duration used to aquire a connection, users could
    subgraph.request.clientTrace.connAcquired.time - subgraph.request.clientTrace.connCreate.time

  • Use the GetGroupedDials() helper, to get Start and Done dials and process durations

let groupedDials = subgraph.request.clientTrace.GetGroupedDials();
let dialDurations = map(groupedDials, #.doneTime - #.startTime);
dialDurations

More examples can be found at: https://github.com/wundergraph/cosmo/pull/1816/files#diff-a76256094262e1d7bf2857b5e8be79e45bcfcd298f918a42fb8db6e267e4c984R178-R179

ExprStructure

Subgraph:
  id: string
  name: string
  request:
    error: error
    clientTrace:
      ...
      # clientTraceHooks

Example Configuration for access logs (we only added expressions)

access_logs:
  subgraphs:
    enabled: true
    fields:
      - key: "dnsHost"
        value_from:
          expression: "subgraph.error ?? subgraph.operation.trace.dnsStart.host"

Example Configuration for tracing (We added tracing.attributes)

telemetry:
  tracing:
    attributes:
      - key: "wg.tracing.dnsHost"
        value_from:
           expression: "subgraph.error ?? subgraph.operation.trace.dnsStart.host"

Checklist

  • I have discussed my proposed changes in an issue and have received approval to proceed.
  • I have followed the coding standards of the project.
  • Tests or benchmarks have been added or updated.
  • Documentation has been updated on https://github.com/wundergraph/cosmo-docs.
  • I have read the Contributors Guide.

Copy link

github-actions bot commented Apr 28, 2025

Router image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-713fdd616f3cab545f985cf9301c687e5df04c5e

@SkArchon SkArchon force-pushed the milinda/eng-6928-otel-attach-httptraceclienttrace-connection-information-to branch 4 times, most recently from 097c904 to 79e48b8 Compare April 29, 2025 15:00
@SkArchon SkArchon force-pushed the milinda/eng-6928-otel-attach-httptraceclienttrace-connection-information-to branch from eb9d0f0 to 24c4182 Compare April 29, 2025 22:02
@@ -44,7 +45,7 @@ func NewExpressionLogField(val any, key string, defaultValue any) zap.Field {
val = &assertVal
}

if v := val; v != "" {
if v := val; v != "" && v != nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we validate no explicit nil can be returned, on case we use the optional chaining operator we can still get nil, which will result in no default value being set.
(This could be considered a bit of a breaking change, if nil is returned somehow we don't set the default value right now.)

@SkArchon
Copy link
Contributor Author

SkArchon commented May 5, 2025

Because we also introduce custom telemetry attributes, we don't have a way right now to allow custom attributes in studio. We will need to (initial thoughts at least) add a UI in studio which can allow users to specify what custom telemetry attributes they want to view in studio traces. (cc: @StarpTech)

@SkArchon SkArchon marked this pull request as ready for review May 5, 2025 10:53
@SkArchon SkArchon requested a review from StarpTech May 5, 2025 11:21
@SkArchon SkArchon force-pushed the milinda/eng-6928-otel-attach-httptraceclienttrace-connection-information-to branch from def1303 to dbe2f0b Compare May 6, 2025 20:11
import "github.com/wundergraph/cosmo/router/internal/httpclient"

// ConvertToExprTrace converts an OperationTrace to an expr.ClientTrace
func ConvertToExprTrace(trace *httpclient.ClientTrace) *ClientTrace {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a separate (but similar) structure for Expressions, which we need to transform to from the ClientTrace structure.


dialStartMu.Lock()
defer dialStartMu.Unlock()
eC.DialStart = append(eC.DialStart, start)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my testing I verified that multiple connection starts and done's can occur, I believe its important to record them all. Since we don't want a data race I use a lock.

// expressionsWithAuth is a map of expressions that can be used to resolve dynamic attributes and acces the auth
// argument
expressionsWithAuth map[string]*vm.Program
}

type VisitorCheckForRequestAuthAccess struct {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just moved this to the expr package where other visitors are stored.

metricsEnabled: opts.metricsEnabled,
traceEnabled: opts.traceEnabled,
mapper: opts.mapper,
traceAttributeExpressions: opts.telemetryAttributeExpressions,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a naming confusion here, which made telemetry -> trace, but applied metric attrs, thus renamed it to telemetry.

- "localhost:9092"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to revert

@SkArchon SkArchon closed this May 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants