feat (spike): Package opentelemetry forwarding to xray #115

Draft
cfraz89 wants to merge 12 commits into main

Conversation

@cfraz89 (Contributor) commented Dec 14, 2022

Packages the AWS OpenTelemetry distro for Lambda into the functions, enabling OpenTelemetry instrumentation to be captured and, for now, forwarded to X-Ray. This is also an experiment on the impact on cold start times, currently clocking in at 1.5s.

I've attempted to use my own build of the collector, which would enable region-agnostic access and let us trim unnecessary modules from it; however, so far I haven't been able to get my build to forward to X-Ray.

The next step will be to add some light instrumentation around activities and events and see how it turns out.

Comment on lines 17 to 34
export(
spans: ReadableSpan[],
resultCallback: (result: ExportResult) => void
): void {
console.log("Exporting!!!", spans);
const sym = Symbol();
const promise = client
.send(
new PutLogEventsCommand({
logGroupName: this.logGroupName,
logStreamName: this.logStreamName,
logEvents: spans.map((s) => ({
message: this.serializeSpan(s),
timestamp: new Date().getTime(),
})),
})
)
.then(() => {
Owner
Does this run in a sidecar or within the lambda execution?

Contributor Author
Within the lambda, but asynchronously.

Owner
I don't think we should do this. It couples the critical path to telemetry. Should we instead run it as a sidecar?

Contributor Author
We could, though it'd add some complexity and a bit of overhead. What's the concern with it being on the critical path? We could just wrap the whole thing in a try/catch if failure is a worry.
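For illustration only, a minimal sketch of that try/catch idea (hypothetical wrapper, not code from this PR): the exporter swallows PutLogEvents failures so the handler never fails because of telemetry. It doesn't address the latency concern raised below, only the failure one.

import {
  CloudWatchLogsClient,
  PutLogEventsCommand,
} from "@aws-sdk/client-cloudwatch-logs";

const client = new CloudWatchLogsClient({});

// Hypothetical helper: telemetry is best-effort, so a failed export is
// logged and ignored rather than propagated to the handler.
async function safeExport(
  logGroupName: string,
  logStreamName: string,
  messages: string[]
): Promise<void> {
  try {
    await client.send(
      new PutLogEventsCommand({
        logGroupName,
        logStreamName,
        logEvents: messages.map((message) => ({
          message,
          timestamp: Date.now(),
        })),
      })
    );
  } catch (err) {
    console.error("span export failed", err);
  }
}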

Owner
The request won't complete until the (slow) put logs request succeeds.

Owner
Agree on not polluting logs.

Contributor Author · Dec 16, 2022
Yep, but won't the lambda execution be incomplete until the sidecar has finished sending to CloudWatch and terminated anyway?

Contributor Author
Or are their lifetimes independent?

Contributor Author · Dec 16, 2022
OK, I'm having issues sequencing the logs to CloudWatch with this model anyway. Did some reading up: Lambda extensions get lifetimes independent of the function, i.e. the function can return to the caller while the extension gets to clean up. So yeah, I'll try making a cut-down build of the collector that just exports to CloudWatch.
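A rough sketch of that extension lifecycle, assuming Node 18's global fetch; the endpoints are the documented Lambda Extensions API, while the extension name and flushTelemetry are placeholders:

// Extensions API base URL, as documented by AWS.
const EXT_API = `http://${process.env.AWS_LAMBDA_RUNTIME_API}/2020-01-01/extension`;

export async function runExtension(flushTelemetry: () => Promise<void>) {
  // Register for INVOKE and SHUTDOWN events.
  const register = await fetch(`${EXT_API}/register`, {
    method: "POST",
    headers: { "Lambda-Extension-Name": "otel-forwarder" },
    body: JSON.stringify({ events: ["INVOKE", "SHUTDOWN"] }),
  });
  const extensionId = register.headers.get("lambda-extension-identifier")!;

  // The extension keeps polling after the handler has returned its response,
  // which is what gives it a lifetime independent of the function.
  for (;;) {
    const next = await fetch(`${EXT_API}/event/next`, {
      headers: { "Lambda-Extension-Identifier": extensionId },
    });
    const event = (await next.json()) as { eventType: string };
    await flushTelemetry();
    if (event.eventType === "SHUTDOWN") {
      return;
    }
  }
}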

@thantos (Contributor) left a comment
Have any sample of what value this change brings?

this.workflows.orchestrator,
"orchestrator"
);
this.telemetry.attachToFunction(this.scheduler.forwarder, "forwarder");
Contributor
Not sure the scheduler forwarder needs this? What is it logging? If it does need it, then the scheduler.handler also needs it.

Contributor Author
Sure, can take it off.

Comment on lines 47 to 50
const logStream = new LogStream(this, `LogStream${componentName}`, {
logGroup: this.logGroup,
logStreamName: componentName,
});
Contributor
Is this what we want? A log stream per function for all time? Or is this just an experiment?

There is a limit to the number of writes to a log stream.

5 requests per second per log stream. Additional requests are throttled. This quota can't be changed.

The orchestrator, for all workflow executions, would be limited to 5TPS.

Contributor Author
Hrm, OK, I didn't realise there was a throttle. I originally had it creating a new log stream every execution, but realised that without static streams it would be difficult to attach event listeners to the streams to forward logs to the real collector. With static streams we can just set that up in CDK.
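As a sketch of the "set that up in CDK" part, one option would be a CloudWatch Logs subscription filter on the static log group (construct names here are placeholders, not from this PR):

import { aws_logs as logs } from "aws-cdk-lib";
import { LambdaDestination } from "aws-cdk-lib/aws-logs-destinations";
import { IFunction } from "aws-cdk-lib/aws-lambda";

// Placeholders for constructs defined elsewhere in the telemetry construct.
declare const logGroup: logs.LogGroup;
declare const collectorFn: IFunction;

// Forward everything written to the static log group on to the collector.
logGroup.addSubscriptionFilter("ForwardToCollector", {
  destination: new LambdaDestination(collectorFn),
  filterPattern: logs.FilterPattern.allEvents(),
});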

Contributor Author · Dec 24, 2022
An option I can think of instead, actually, is to just skip the logging-to-CloudWatch part. Instead, the extension sends the data over HTTP to the otel collector running in a different lambda.
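A hedged sketch of what that could look like on the sending side: point the OTLP HTTP exporter at the collector lambda's endpoint. COLLECTOR_URL is a placeholder, not something defined in this PR.

import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const provider = new NodeTracerProvider();
provider.addSpanProcessor(
  new BatchSpanProcessor(
    // Hypothetical: COLLECTOR_URL would be the collector lambda's HTTP
    // endpoint (e.g. a Function URL), rather than the local collector.
    new OTLPTraceExporter({ url: `${process.env.COLLECTOR_URL}/v1/traces` })
  )
);
provider.register();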

Comment on lines +54 to +59
const tracer = trace.getTracer(executionId, "0.0.0");
await tracer.startActiveSpan(
"startWorkflow",
{
attributes: { workflowName, input },
kind: SpanKind.PRODUCER,
Contributor
Does it make sense to trace in the client, or should we trace in the orchestrator (i.e. those who call the client)? Not all of the callers of the client will have tracing on.

Contributor Author
Yeah orchestrator probably makes more sense.

Comment on lines 35 to 38
provider.addSpanProcessor(
new BatchSpanProcessor(new OTLPTraceExporter({ hostname: "127.0.0.1" }))
);
provider.addSpanProcessor(new SimpleSpanProcessor(new ConsoleSpanExporter()));
Contributor
docs?

Contributor Author
Will do.


/**
* Creates an entrypoint function for orchestrating a workflow
* from within an AWS Lambda Function attached to a SQS FIFO queue.
*/
const traceProvider = registerTelemetryApi();
Contributor
These functions compile to MJS; would it help to make the AWS clients and telemetry start in parallel (top-level await)?

Contributor Author · Dec 24, 2022
I initially tried that somewhere else, but the functions also compile to CJS, and it breaks in that form; that's our lowest common denominator. That being said, registerTelemetryApi isn't an async function, so I don't think top-level await is going to help us here.

Comment on lines 72 to 74
await tracer.startActiveSpan(
"createActivityWorker",
{ attributes: { command: request.command.name } },
Contributor
Why have a span for createActivityWorker, but not createOrchestrator? Does this one-time operation need a span?

Contributor Author
I've only created a couple of spans for testing purposes. I'll leave it up to you guys to create more once everything's in place.

Comment on lines 117 to 120
const orchestrateSpan = tracer.startSpan("orchestrate");
const ret = await orchestrateExecution(workflow, executionId, records);
orchestrateSpan.end();
return ret;
Contributor
Would the scope function be better here?

Contributor Author
It would be; for some reason that one isn't working right now though. Probably missing something in the setup of the SDK.
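For reference, a sketch of the scoped form, assuming the same tracer and orchestrateExecution as in the snippet above (once the SDK setup issue is sorted out):

return await tracer.startActiveSpan("orchestrate", async (span) => {
  try {
    return await orchestrateExecution(workflow, executionId, records);
  } finally {
    // Ending in `finally` keeps the span closed even if orchestration throws.
    span.end();
  }
});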

@cfraz89 (Contributor Author) commented Dec 24, 2022

Have any sample of what value this change brings?

Before a sample, the obvious answer is threefold:

  • It enables us to push traces/metrics/logs to the entire otel ecosystem, which is adopted by many popular services like Datadog and Prometheus.
  • The extension can be used for any language we build support for, not just TypeScript.
  • It provides an agnostic API for handling traces/metrics/logs across the codebase, including the shared parts. This should be useful when deploying to Cloudflare etc.

Aside from that, I suspect that the traces sent to e.g. Grafana could do a good job of visualising workflow runs, with plenty of customisation available, better than our own visualiser will be able to provide for some time.

As for a sample, I'll put one together once everything works.

@netlify netlify bot commented Jan 1, 2023

Deploy Preview for preeminent-heliotrope-380b2a failed.

Latest commit: 9b62c5b
Latest deploy log: https://app.netlify.com/sites/preeminent-heliotrope-380b2a/deploys/63b18311fcde380008d27c25
