add conventions for log correlation #114

Closed
wants to merge 6 commits into from
149 changes: 149 additions & 0 deletions text/logs/0114-log-correlation.md
@@ -0,0 +1,149 @@
# Conventions for Trace and Resource Association in Logs

Create standards for correlating traces and resources in existing text logs.

## Motivation

Traces and logs present two separate perspectives on what was occurring when an
operation was executed. To tie the two perspectives together, certain
information needs to be added to logs that are generated within an open span.
This document lays out the information that needs to be emitted in order to
correlate the two data types, and how that data can be added to existing text
log formats so that it can be recognized when the logs are parsed.

## Explanation

Two types of correlation need to happen to tie a log record into its full
execution context. The first is Request Correlation, which ties the record to
operations that were occurring when the record was created. The second is
Resource Correlation, which associates the entry with where the event occurred
such as a host, a pod, or a virtual machine.

### Request Correlation

Request correlation is achieved primarily with two values, a trace identifier
and a span identifier. A trace may contain multiple spans, arranged as a tree,
and may also contain links to other related spans. The combination of a trace
identifier and a span identifier corresponds to a specific scope of work. In the
tracing API, that scope can also contain attributes and events that describe
what work the program being traced was currently performing.

In most cases when traces and logs are used in tandem, the attributes of the
current span do not need to be added to the log entry, because doing so would
duplicate the transmission of those values. As one span is likely associated
with several (or many) log entries, it is more efficient to transmit span
attributes once with the span rather than repeatedly with each log entry. As a
result, for most purposes the goal is to set the three values `traceid`,
`spanid`, and `traceflags` in each log entry that should be associated with
that span.

Request correlation fields only make sense when the log event occurred within
an open span, so the fields in the table below are only required when the log
event is to be correlated.

| Field | Required | Format
| :--------- | :------- | :--------------------------------------------------
| traceid | Yes | 16-byte numeric value as Base16-encoded string

Member

This refers to a base-16 encoded string. However, what if my logging medium is binary and allows proper representation of arbitrary byte sequences? Do I still have to use base-16 strings, or can I just emit the bytes? If this recommendation is for a text medium (such as text log files), it may be worth calling out specifically.

Author

I think that the underlying presumption with this document is that if certain fields are set in certain ways, we will be able to recognize them and use them when the log is processed. I don't think that's going to be true one way or another with a binary representation, so I'll add the text-based caveat to the introduction.

Member

The semantics of Required are unclear here. Some logs may be emitted when there is no trace context present, so they cannot contain trace-id. And the whole feature of trace/log correlation is technically optional.

Author

I'll clarify, thanks.

At least in Python, the log format requires all fields to be present on the log record object or it throws an exception. Also, the log format cannot be changed dynamically depending on whether an active span (or fields on the log record) is present or not. So in this case we can either leave the fields empty so that the formatted logs look like `trace_id= span_id=` or we can set them to zero so it looks like `trace_id=0 span_id=0`. Can we clarify what values the fields should be set to in case no active span is found in the context?

Member

@owais what would you prefer? Also, it seems like a fairly specific use case, could it be left to whoever configures the logger on how they want to represent the lack of trace context?

I have a slight preference for 0.

> could it be left to whoever configures the logger on how they want to represent the lack of trace context?

It could but then backends will have to deal with multiple ways of representing nil values (0 vs '' vs null in JSON, etc). It would be nicer if the spec recommended one thing when the field cannot be omitted completely so analysis tools can have a standard way of detecting it.

| spanid | Yes | 8-byte value represented as Base16-encoded string
| traceflags | No | 8-bit numeric value as a Base16-encoded string

`traceflags` is a numeric field that corresponds to the W3C trace flags
[definition](https://www.w3.org/TR/trace-context/#trace-flags), and the
definition for OpenTelemetry logs should track updates to that definition
as they are made.
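
As a rough illustration of the encodings in the table above, the sketch below renders the three fields from integer values. The function name `format_trace_fields` is hypothetical, and the fixed-width, lowercase, zero-padded output is an assumption borrowed from the W3C Trace Context format rather than something this document mandates.

```python
def format_trace_fields(trace_id: int, span_id: int, trace_flags: int = 0) -> dict:
    """Render correlation fields as Base16 strings.

    Assumption: values are zero-padded to their full width
    (16 bytes -> 32 hex chars, 8 bytes -> 16 hex chars, 8 bits -> 2 hex chars)
    and written in lowercase.
    """
    if not 0 <= trace_id < 1 << 128:
        raise ValueError("trace_id must fit in 16 bytes")
    if not 0 <= span_id < 1 << 64:
        raise ValueError("span_id must fit in 8 bytes")
    if not 0 <= trace_flags < 1 << 8:
        raise ValueError("trace_flags must fit in 8 bits")
    return {
        "traceid": format(trace_id, "032x"),
        "spanid": format(span_id, "016x"),
        "traceflags": format(trace_flags, "02x"),
    }
```

A logger integration could call this once per span and attach the resulting strings to every log entry emitted while that span is open.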

### Resource Correlation

Where a program was executing is just as important as what it was doing.
Resource correlation allows a log entry to be associated with an infrastructure
resource and, in turn, with the system and program metrics that describe the
wider program state. The form that the resource takes is also more diverse than
the tracing scope: the resource may be a pod running in a Kubernetes cluster, a
virtual machine running in a cloud, a serverless lambda, or an old-school
server sitting in a data center. An application environment may be an
orchestration of multiple types of resource working together.

From a logging standpoint, a resource is also almost always constant within an
application process: a container may not have the same identifier on every run,
but it does keep that identifier while it’s running, and that resource
identifier is the same for every log entry created on that resource. As a
result, resource correlation does not have to happen at the log entry level, and
the resource correlation information may or may not be put in the log entry
itself.

Resource information may be managed as part of the log ingestion process; for
example, a Docker logging driver will know which container the logs came from.
Full resource information may also not be available, as when logs are
aggregated by syslog or a similar system that has less context available to it.
As a result, resource correlation information in the log entries themselves
should be considered optional. If included, the resource information should
follow the [semantic conventions for resources](https://github.com/open-telemetry/opentelemetry-specification/tree/master/specification/resource/semantic_conventions).

If the log entry is expressed in key-value pairs, any resource keys should be
prefixed with `resource.`, for example `resource.service.name="shoppingcart"`.
If the entry is expressed in JSON, the resource key-values should be placed in
an object named “resource” at the top level of the object.
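
The two placements described above can be sketched as follows. The helper names `resource_kv_pairs` and `resource_json` are hypothetical, and the sketch assumes attribute values contain no whitespace, so it does not quote them.

```python
import json


def resource_kv_pairs(resource: dict) -> str:
    """Render resource attributes as `resource.`-prefixed key=value pairs."""
    # Sorting is only for stable output; the convention does not require it.
    return " ".join(
        f"resource.{key}={value}" for key, value in sorted(resource.items())
    )


def resource_json(entry: dict, resource: dict) -> str:
    """Place resource attributes in a top-level "resource" object."""
    record = dict(entry)  # copy so the caller's entry is not modified
    record["resource"] = dict(resource)
    return json.dumps(record)
```

Because the resource is normally constant for the lifetime of a process, either rendering could be computed once at startup and reused for every entry.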

### Correlation Context

A Correlation Context is a set of key-value pairs that is shared amongst the
spans of a distributed trace. Like span attributes, the correlation context may
already be carried with span information, so duplicating this information may be
redundant. In certain cases it may be important to associate this context with
log entries. When the context is embedded in a log entry, the key-value
pairs should be placed in a 'ctx' namespace. Where key-value pairs are
supported, embed the correlation key as “correlation.key_name”. In JSON or

Member

Suggested change:
- supported, embed the correlation key as “correlation.key_name”. In JSON or
+ supported, embed the correlation key as “ctx.key_name”. In JSON or

other formats that allow nested structures, the key-value pairs should be
placed in an object named ‘ctx’ at the top level of the object.
yurishkuro marked this conversation as resolved.

## Internal details

### Examples

#### Key-Value Pairs

```text
2020-05-20 20:13:31 INFO Message logged. resource.hostname=myhost
ctx.user=djones traceid=0354af75138b12921 spanid=14c902d73a traceflags=01
```
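
On the consuming side, a parser could recover the correlation fields from a line like the one above. This is a minimal sketch: `extract_correlation` and its regular expression are hypothetical, and it assumes values are either unquoted tokens without whitespace or double-quoted strings.

```python
import re

# key=value, where the value is a double-quoted string or a run of non-space characters
KV_PATTERN = re.compile(r'(?P<key>[A-Za-z_][\w.]*)=(?P<value>"[^"]*"|\S+)')

TRACE_FIELDS = ("traceid", "spanid", "traceflags")


def extract_correlation(line: str) -> dict:
    """Collect trace fields plus `resource.` and `ctx.` namespaced keys."""
    fields = {"resource": {}, "ctx": {}}
    for match in KV_PATTERN.finditer(line):
        key = match.group("key")
        value = match.group("value").strip('"')
        namespace, _, name = key.partition(".")
        if namespace in ("resource", "ctx") and name:
            fields[namespace][name] = value
        elif key in TRACE_FIELDS:
            fields[key] = value
    return fields
```

Keys outside the three trace fields and the two namespaces are ignored here; a real pipeline would likely keep them as ordinary log attributes.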

#### JSON

```json
{
  "time": "2020-05-20 20:13:31",
  "msg": "Message logged.",
  "level": "INFO",
  "ctx": {
    "user": "djones"
  },
  "resource": {
    "hostname": "myhost"
  },
  "traceid": "0354af75138b12921",
  "spanid": "14c902d73a",
  "traceflags": "01"
}
```
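
One way an application might produce entries shaped like the JSON example is a custom formatter for Python's standard `logging` module. The class name `CorrelatedJsonFormatter` is hypothetical, as is the choice to read `traceid`, `spanid`, `traceflags`, and `ctx` from attributes on the log record. The sketch also assumes that fields are simply omitted when no span is active, which is only one of the options and not something this document settles.

```python
import json
import logging


class CorrelatedJsonFormatter(logging.Formatter):
    """Format records as JSON with optional correlation fields."""

    def __init__(self, resource=None):
        super().__init__()
        # The resource is constant for the process, so it is captured once.
        self.resource = dict(resource or {})

    def format(self, record):
        entry = {
            "time": self.formatTime(record, "%Y-%m-%d %H:%M:%S"),
            "msg": record.getMessage(),
            "level": record.levelname,
        }
        ctx = getattr(record, "ctx", None)
        if ctx:
            entry["ctx"] = dict(ctx)
        if self.resource:
            entry["resource"] = self.resource
        # Assumption: correlation fields are left out when they are not set.
        for field in ("traceid", "spanid", "traceflags"):
            value = getattr(record, field, None)
            if value is not None:
                entry[field] = value
        return json.dumps(entry)
```

With this formatter installed on a handler, a call such as `logger.info("Message logged.", extra={"traceid": "...", "spanid": "...", "traceflags": "01", "ctx": {"user": "djones"}})` would attach the fields to the record through the standard `extra` argument.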

### Custom Format

Custom formats that don’t allow for automatic parsing of key-value pairs can be
used, but they require the output format and the extraction mechanism to be
kept in sync. These types of extractions may have the advantage of being
less verbose, but they have the disadvantage of requiring a custom extraction
process, and may be more fragile. Since this approach is vendor-dependent,
there is little guidance that can be provided by OpenTelemetry.
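
To make the synchronization requirement concrete, the sketch below pairs an invented bracketed layout with the pattern needed to read it back. The layout, `CUSTOM_TEMPLATE`, `CUSTOM_PATTERN`, `emit`, and `extract` are all hypothetical and are not a format recommended by this document.

```python
import re

# Hypothetical compact layout: "<date> <time> <level> [traceid:spanid:traceflags] <msg>"
CUSTOM_TEMPLATE = "{time} {level} [{traceid}:{spanid}:{traceflags}] {msg}"

# The extraction pattern must change whenever the template changes.
CUSTOM_PATTERN = re.compile(
    r"^(?P<time>\S+ \S+) (?P<level>\S+) "
    r"\[(?P<traceid>[0-9a-f]+):(?P<spanid>[0-9a-f]+):(?P<traceflags>[0-9a-f]{2})\] "
    r"(?P<msg>.*)$"
)


def emit(**fields) -> str:
    """Render one log line in the custom layout."""
    return CUSTOM_TEMPLATE.format(**fields)


def extract(line: str):
    """Return the parsed fields, or None if the line does not match."""
    match = CUSTOM_PATTERN.match(line)
    return match.groupdict() if match else None
```

The line is shorter than the key-value form, but any change to the template that is not mirrored in the pattern silently drops the correlation fields, which is the fragility noted above.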

## Prior art and alternatives

Elastic Common Schema [has standards](https://www.elastic.co/guide/en/ecs/current/ecs-tracing.html#ecs-tracing)
for adding trace information to JSON log formats, but they do not support the
full OpenTelemetry correlation model.

## Future possibilities

A stronger specification should be created for logs that are generated by
OpenTelemetry instrumentation and adapters that support conversion from and
to OpenTelemetry's internal logging models. As that specification is created,
it should be kept in sync with these conventions.