
set() in Transform processor appears to be getting values from other logs #32080

Open
mut3 opened this issue Apr 1, 2024 · 21 comments
Labels: bug · discussion needed · never stale · pkg/ottl · processor/transform

Comments


mut3 commented Apr 1, 2024

Component(s)

pkg/ottl, pkg/stanza, processor/transform, receiver/filelog

What happened?

Description

We have a pipeline that uses a filelog receiver with some operators that set certain fields in attributes. We then have a transform processor that reads those fields to set() some resource attributes. The values that get set in the resource attributes of the log appear to come from a different log, as if the transform is reading from one log and writing to another.

Steps to Reproduce

I do not currently have a minimal reproduction for this issue, but below is a stripped and sanitized config that contains the core logic/entities around the bug we are seeing.

Configuration
receivers:
      filelog:
        start_at: beginning
        operators:
          # route and parse log based on formatting
          - id: get-format
            type: router
            routes:
              - expr: body matches "^\\{"
                output: parser-docker
              - expr: body matches "^[^ Z]+ "
                output: parser-crio
              - expr: body matches "^[^ Z]+Z"
                output: parser-containerd
          - id: parser-crio
            type: regex_parser
            regex: ^(?P<time>[^ Z]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$
            timestamp:
              layout: 2006-01-02T15:04:05.999999999Z07:00
              layout_type: gotime
              parse_from: attributes.time
          - id: crio-recombine
            type: recombine
            combine_field: attributes.log
            combine_with: ""
            is_last_entry: attributes.logtag == 'F'
            output: extract_metadata_from_filepath
            source_identifier: attributes["log.file.path"]
          - id: parser-containerd
            type: regex_parser
            regex: ^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$
            timestamp:
              layout: "%Y-%m-%dT%H:%M:%S.%LZ"
              parse_from: attributes.time
          - id: containerd-recombine
            type: recombine
            combine_field: attributes.log
            combine_with: ""
            is_last_entry: attributes.logtag == 'F'
            output: extract_metadata_from_filepath
            source_identifier: attributes["log.file.path"]
          - id: parser-docker
            type: json_parser
            output: extract_metadata_from_filepath
            timestamp:
              layout: "%Y-%m-%dT%H:%M:%S.%LZ"
              parse_from: attributes.time
          - id: extract_metadata_from_filepath
            parse_from: attributes["log.file.path"]
            regex: ^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[a-f0-9\-]+)\/(?P<container_name>[^\._]+)\/(?P<restart_count>\d+)\.log$
            type: regex_parser
          # move log to body, parse to attributes
          - type: move
            from: attributes.log
            to: body
          - type: json_parser
            parse_from: body
            parse_to: attributes

processors:
      transform/attribute_moves:
        error_mode: ignore
        log_statements:
          - context: log
            statements:
              - set(attributes["log.iostream"], attributes["stream"])
              - set(resource.attributes["k8s.container.name"], attributes["container_name"])
              - set(resource.attributes["k8s.namespace.name"], attributes["namespace"])
              - set(resource.attributes["k8s.pod.name"], attributes["pod_name"])
              - set(resource.attributes["k8s.container.restart_count"], attributes["restart_count"])
              - set(resource.attributes["k8s.pod.uid"], attributes["uid"])

Expected Result

Logs where attributes["pod_name"] == resource.attributes["k8s.pod.name"] and resource.attributes["k8s.namespace.name"] == attributes["namespace"].

This is the case for most of the logs emitted from our pipeline.

Actual Result

For some small percentage of our logs, the values of

  • resource.attributes["k8s.container.name"]
  • resource.attributes["k8s.namespace.name"]
  • resource.attributes["k8s.pod.name"]
  • resource.attributes["k8s.container.restart_count"]
  • resource.attributes["k8s.pod.uid"]

do not match their attribute counterparts; instead they appear to come from a different log event. The fields appear to be internally consistent, as if they all came from the same other log event. For example, if resource.attributes["k8s.pod.name"] is wrong and holds the name of some other pod, then resource.attributes["k8s.namespace.name"] will hold the namespace of that other pod.

Here is a sanitized example log JSON pulled from our Elasticsearch:

Sanitized incorrect log
{
  "_index": ".ds-otel-logs-stg-my-app-2024.03.07-000022",
  "_id": "Pz1km44B8ceE1n7Gbbpq",
  "_version": 1,
  "_score": 0,
  "_source": {
    "@timestamp": "2024-04-01T20:40:10.983479227Z",
    "Attributes": {
      "OTEL_collector_type": "daemonset",
      "canonical": {
        "owner": {
          "team_name": "unknown"
        }
      },
      "cloud": "aws",
      "cluster": "ci-stg",
      "container_name": "cilium-agent",
      "elasticsearch": {
        "index": {
          "suffix": "my-app"
        }
      },
      "environment": "stg",
      "log": {
        "file": {
          "path": "/var/log/pods/kube-system_cilium-hlkmw_fb7fae7c-d3ac-4308-885d-5d1426c96bcc/cilium-agent/0.log"
        },
        "iostream": "stderr"
      },
      "logtag": "F",
      "namespace": "kube-system",
      "pod_name": "cilium-hlkmw",
      "restart_count": "0",
      "stream": "stderr",
      "time": "2024-04-01T20:40:10.983479227Z",
      "uid": "fb7fae7c-d3ac-4308-885d-5d1426c96bcc"
    },
    "Body": "level=info msg=\"Delete endpoint request\" containerID=593b7f5fe7 subsys=daemon",
    "Resource": {
      "k8s": {
        "container": {
          "name": "my-app",
          "restart_count": "0"
        },
        "namespace": {
          "name": "my-app-namespace"
        },
        "pod": {
          "name": "my-app-65b4cf8769-bb66g",
          "uid": "33b38675-15a7-4f64-a3f7-f38a97455ce6"
        }
      }
    },
    "Scope": {
      "name": "",
      "version": ""
    },
    "SeverityNumber": 0,
    "TraceFlags": 0
  },
  "fields": {
    "Attributes.uid": [
      "fb7fae7c-d3ac-4308-885d-5d1426c96bcc"
    ],
    "Attributes.stream": [
      "stderr"
    ],
    "Attributes.environment": [
      "stg"
    ],
    "Attributes.canonical.owner.team_name": [
      "unknown"
    ],
    "Resource.k8s.container.restart_count": [
      "0"
    ],
    "Attributes.log.iostream": [
      "stderr"
    ],
    "Attributes.restart_count": [
      "0"
    ],
    "Attributes.pod_name": [
      "cilium-hlkmw"
    ],
    "Resource.k8s.namespace.name": [
      "my-app-namespace"
    ],
    "TraceFlags": [
      0
    ],
    "Resource.k8s.pod.name": [
      "my-app-65b4cf8769-bb66g"
    ],
    "Scope.name": [
      ""
    ],
    "Attributes.logtag": [
      "F"
    ],
    "Resource.k8s.pod.uid": [
      "33b38675-15a7-4f64-a3f7-f38a97455ce6"
    ],
    "Attributes.elasticsearch.index.suffix": [
      "my-app"
    ],
    "Attributes.cloud": [
      "aws"
    ],
    "SeverityNumber": [
      0
    ],
    "Attributes.log.file.path": [
      "/var/log/pods/kube-system_cilium-hlkmw_fb7fae7c-d3ac-4308-885d-5d1426c96bcc/cilium-agent/0.log"
    ],
    "Body": [
      "level=info msg=\"Delete endpoint request\" containerID=593b7f5fe7 subsys=daemon"
    ],
    "Attributes.time": [
      "2024-04-01T20:40:10.983Z"
    ],
    "Attributes.namespace": [
      "kube-system"
    ],
    "Resource.k8s.container.name": [
      "my-app"
    ],
    "@timestamp": [
      "2024-04-01T20:40:10.983Z"
    ],
    "Attributes.container_name": [
      "cilium-agent"
    ]
  }
}

Collector version

v0.96.0

Environment information

Environment

OS: Amazon Linux 2023 (Amazon's v1.25.16-eks ami)
Compiler (if manually compiled): Collector Helm Chart v0.84.0

OpenTelemetry Collector configuration

See Steps to Reproduce

Log output

No response

Additional context

We never experience this issue when these same moves to resource attributes are done via operators on the filelog receiver:

All operator solution
receivers:
      filelog:
        start_at: beginning
        operators:
          - id: get-format
            routes:
              - expr: body matches "^\\{"
                output: parser-docker
              - expr: body matches "^[^ Z]+ "
                output: parser-crio
              - expr: body matches "^[^ Z]+Z"
                output: parser-containerd
            type: router
          - id: parser-crio
            regex: ^(?P<time>[^ Z]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$
            timestamp:
              layout: 2006-01-02T15:04:05.999999999Z07:00
              layout_type: gotime
              parse_from: attributes.time
            type: regex_parser
          - combine_field: attributes.log
            combine_with: ""
            id: crio-recombine
            is_last_entry: attributes.logtag == 'F'
            output: extract_metadata_from_filepath
            source_identifier: attributes["log.file.path"]
            type: recombine
          - id: parser-containerd
            regex: ^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$
            timestamp:
              layout: "%Y-%m-%dT%H:%M:%S.%LZ"
              parse_from: attributes.time
            type: regex_parser
          - combine_field: attributes.log
            combine_with: ""
            id: containerd-recombine
            is_last_entry: attributes.logtag == 'F'
            output: extract_metadata_from_filepath
            source_identifier: attributes["log.file.path"]
            type: recombine
          - id: parser-docker
            output: extract_metadata_from_filepath
            timestamp:
              layout: "%Y-%m-%dT%H:%M:%S.%LZ"
              parse_from: attributes.time
            type: json_parser
          - id: extract_metadata_from_filepath
            parse_from: attributes["log.file.path"]
            regex: ^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[a-f0-9\-]+)\/(?P<container_name>[^\._]+)\/(?P<restart_count>\d+)\.log$
            type: regex_parser
          - from: attributes.stream
            to: attributes["log.iostream"]
            type: move
          - from: attributes.container_name
            to: resource["k8s.container.name"]
            type: move
          - from: attributes.namespace
            to: resource["k8s.namespace.name"]
            type: move
          - from: attributes.pod_name
            to: resource["k8s.pod.name"]
            type: move
          - from: attributes.restart_count
            to: resource["k8s.container.restart_count"]
            type: move
          - from: attributes.uid
            to: resource["k8s.pod.uid"]
            type: move
          - from: attributes.log
            to: body
            type: move
          - type: json_parser
            parse_from: body
            parse_to: attributes
mut3 added the bug and needs triage labels on Apr 1, 2024

@djaglowski (Member)

At a glance, I believe the issue is caused by the fact that OTTL operates on a hierarchical data model, which groups multiple log records under their shared resource attributes. In my opinion, when setting a value on the resource from an individual log, this should produce a new resource with the updated value and move the log record into that resource.

An example with much unnecessary detail removed:

# input
logs:
  - resource:
      attributes:
        foo: bar
    records:
      - attributes:
          some.key: left
        body: Hello one
      - attributes:
          some.key: right
        body: Hello two

# Current result if we run set(resource.attributes["num"], attributes["some.key"])
logs:
  - resource:
      attributes:
        foo: bar
        num: right # was "left" for one iteration, then was overwritten
    records:
      - attributes:
          some.key: left
        body: Hello one
      - attributes:
          some.key: right
        body: Hello two

# Correct result (in my opinion)
logs:
  - resource:
      attributes:
        foo: bar
        num: left
    records:
      - attributes:
          some.key: left
        body: Hello one
  - resource:
      attributes:
        foo: bar
        num: right
    records:
      - attributes:
          some.key: right
        body: Hello two

In other words, the fact that our pdata packages represent data in a hierarchical manner should be an implementation detail rather than a guiding principle. User-directed modifications should generally work as if the data model were completely flat (one log record per scope and resource) and split hierarchical structures accordingly.

@mut3 (Author) commented Apr 2, 2024

Thank you for the explanation!

Based on my understanding, then, we should see many fewer issues like this if we do not use OTTL to modify resource attributes in processors. Would you expect that to work around this behavior caused by the pdata structure grouping logs by resource?

@TylerHelmuth (Member) commented Apr 3, 2024

@djaglowski is correct.

When using the log context in the transformprocessor, it cycles through each log record, executing each statement in order. When you use a value from a log to set a property on that log's scope or resource, all the logs for that scope/resource will try to do the same unless you add some clever where clauses.

As @djaglowski showed, if two logs in a resource have different values, the last (or first, depending on where clauses) will win; you can't get both values in the same resource attribute key unless you concatenate them.
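For illustration, a minimal sketch of such a guard, reusing the keys from this issue's config (note this only makes the outcome deterministic, i.e. the first matching log wins; it does not give each log its own resource):

processors:
  transform/guarded:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          # only the first log record in each resource writes the value;
          # later records see a non-nil attribute and skip the statement
          - set(resource.attributes["k8s.pod.name"], attributes["pod_name"]) where resource.attributes["k8s.pod.name"] == nil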

> …resource that comes from an individual log, this should produce a new resource with the updated value and move the log record into that resource.

I'm not sure we want this to be the default behavior either. It is totally valid for all the spans/datapoints/log records to share an attribute that the user wants set on the resource. If we always created a new resource each time we set a value from a "lower" context, we could end up with duplicate resources across the slice.

I also believe that if set had the capability of creating new Resource*Slice entries in the payload, it would be overloaded. It would no longer only be setting an attribute; it would also be making the decision to create a new Resource*Slice entry. Having the function make that decision unilaterally feels incorrect for an OTTL function.

This has been known, but confusing/limiting, behavior since OTTL's inception. I think we'll need a more complex solution than always splitting the data out into its own Resource/Scope/LogRecord combo. I see a couple of options:

  • Use the groupbyattrs processor to organize the data before sending it to the transformprocessor. I think the processor would need a new feature where the original resource attributes may be included in the new resource.
  • Add an optional param to set that allows creating new resources in this situation.
  • Add a function that allows creating a new ResourceLogsSlice from an existing ResourceLogsSlice.
  • A new language feature that lets you "break" out of a list of statements. Something like "if this statement executed, then don't execute any more statements for this transform context".

@mut3 since you're using the filelog receiver to collect this data, we highly recommend doing as much work during collection as you can. I recommend using the move operator to set the resource attributes instead of the transformprocessor, since you have the opportunity to do it via the receiver (see the excerpt below). In general, the fewer components in your pipeline, the better.
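For reference, this is the pattern already shown in the "All operator solution" above; a minimal excerpt:

receivers:
  filelog:
    operators:
      # at this stage each entry still carries its own values, so per-log
      # moves into resource fields behave as expected
      - type: move
        from: attributes.pod_name
        to: resource["k8s.pod.name"]
      - type: move
        from: attributes.namespace
        to: resource["k8s.namespace.name"]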

If you must use the transformprocessor for this data, I believe you could hack together a solution like this if you know the exact keys of all your resource attributes:

  1. Use a transformprocessor to move all the resource attributes onto the log records.
  2. Use a groupbyattrs processor to rearrange the resource/scope/log relationship using ALL the resource keys you merged in step 1 and ALL the attribute keys you were trying to set.
  3. Use another transformprocessor to delete the copied resource keys from the log records (sketched below).
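A rough, hedged sketch of that three-step pipeline (host.name stands in for a known resource key and is purely hypothetical; pod_name and namespace come from this issue's config):

processors:
  # step 1: copy the known resource attributes down onto each log record
  transform/merge_down:
    log_statements:
      - context: log
        statements:
          - set(attributes["host.name"], resource.attributes["host.name"])
  # step 2: regroup so each unique combination of keys gets its own resource;
  # groupbyattrs promotes the listed record attributes to the new resource
  groupbyattrs:
    keys:
      - host.name
      - pod_name
      - namespace
  # step 3: drop any leftover record-level copies of the resource keys
  transform/cleanup:
    log_statements:
      - context: log
        statements:
          - delete_key(attributes, "host.name")
service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [transform/merge_down, groupbyattrs, transform/cleanup]
      exporters: [debug]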

I highly recommend using the filelog receiver instead of doing that approach lol

TylerHelmuth added the discussion needed label and removed the needs triage label on Apr 3, 2024
@TylerHelmuth (Member)

/cc @evan-bradley

@mut3 (Author) commented Apr 3, 2024

Thank you!

We moved all our resource.attributes setters back to filelog operators and that appears to be working as expected.

Maybe the transformprocessor docs should include a warning about modifying resource.attributes from a log context. As a new user, I was not at all aware of this danger from my reading of the transformprocessor and OTTL context documentation.

@djaglowski (Member)

> …resource that comes from an individual log, this should produce a new resource with the updated value and move the log record into that resource.

> I'm not sure we want this to be the default behavior either. It is totally valid for all the spans/datapoints/log records to share an attribute that the user wants set on the resource. If we always created a new resource each time we set a value from a "lower" context, we could end up with duplicate resources across the slice.

That's fair. What I should have said is "move the log to a resource with the updated value", meaning that it either finds or creates the appropriate resource.

> This has been known, but confusing/limiting, behavior since OTTL's inception. I think we'll need a more complex solution than always splitting the data out into its own Resource/Scope/LogRecord combo.

It's a known issue but I think it's fundamentally a valid one which should eventually be fixed. The hierarchical representation (grouping logs/metrics/traces by resource and scope) is a direct reflection of pdata but not necessarily prescribed by the project's data model. In my opinion OTTL should not be designed for pdata but rather around it. That is, when data is moved or copied between contexts, it should behave logically as if the lower level context has its own copy of the higher level context(s). Representing the end result is an implementation detail.

Put another way, OTTL is treating resources and scopes as references, but users need to work with them as values.

@cdanis commented May 15, 2024

> It's a known issue but I think it's fundamentally a valid one which should eventually be fixed. The hierarchical representation (grouping logs/metrics/traces by resource and scope) is a direct reflection of pdata but not necessarily prescribed by the project's data model. In my opinion OTTL should not be designed for pdata but rather around it. That is, when data is moved or copied between contexts, it should behave logically as if the lower level context has its own copy of the higher level context(s). Representing the end result is an implementation detail.

> Put another way, OTTL is treating resources and scopes as references, but users need to work with them as values.

Hi. I am pretty sure I just got bit by this issue when trying to copy Span attributes into Resource attributes.

There's more context in the Wikimedia bug tracker but the short version is:

  • We are running Envoy v1.23.12, which is missing support for setting service.name in its OTel exporter. (Yes, it is end of life, and yes, we will be upgrading soon.)
  • We do have a number of fields on each span that give good indications of the service name. I wrote a somewhat-complicated transformprocessor invocation to mangle them appropriately.
  • However, a good fraction of the time, we see "impossible" service names in resources that seem to originate from another span entirely.

I'll see if I can use the groupbyattrs processor to work around this, along the lines of the sketch below.
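The regrouping config itself is small; a sketch, assuming the transform statements leave the computed name in a span attribute called service.name:

processors:
  groupbyattrs:
    # spans are moved under a resource matching their service.name value;
    # a new resource is created when no matching one exists yet
    keys:
      - service.name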

But, this is really confusing, and entirely undocumented. The transformprocessor documentation even suggests that you can do something like this, in the Contexts section of its README.md:

> This means that statements associated to a span/metric/log have access to the telemetry's instrumentation scope, and resource.

From an OTTL end user's perspective, that sentence is only true for very specific meanings of the words "associated" and "the" 😅

@TylerHelmuth (Member)

If all the spans in the resource don't have the same upstream_cluster.name and _k8s_namespace, you'll experience the behavior from this issue. We are open to better documentation around this behavior when writing to "higher" fields in the TransformContext.

As for a change to OTTL's behavior, @djaglowski's idea of moving the log to a resource with the updated value, either finding or creating the appropriate resource, would solve the issue, but it would be a large breaking change to OTTL, which is currently designed to act only within the TransformContext and not interact with anything outside of it. I am open to solutions to this problem, but I don't have time to solve it myself right now.

I'm also open to documenting this behavior. Since it is a write-only problem, I believe it only affects the transformprocessor.

wmfgerrit pushed a commit to wikimedia/operations-deployment-charts that referenced this issue May 15, 2024
Here we attempt to use the groupbyattrs processor to re-group traces by
a span-level attribute `service.name`.  This processor is designed to
create a new top-level Resource when it is necessary, so hopefully
this works around the issue if we are indeed encountering
open-telemetry/opentelemetry-collector-contrib#32080

Bug: T363407
Change-Id: I43ab98cf02ed712fc087335315a57638019a15a0
@djaglowski (Member)

> which is currently designed to act only within the TransformContext and not interact with anything outside of it.

I think there is a contradiction here. The design cannot consider modifications to a resource or scope to be "only within the TransformContext" of a log record when we know those changes are directly modifying the TransformContexts of other log records as well.

@TylerHelmuth (Member)

> I think there is a contradiction here. The design cannot consider modifications to a resource or scope to be "only within the TransformContext" of a log record when we know those changes are directly modifying the TransformContexts of other log records as well.

I was implying that the framework is made to act on a singular TransformContext, inclusive of a resource/scope, so if a transformation needed to go find a different resource to act upon, it wouldn't have access to it.

I'm sure it is solvable, but I'd want to make sure that we don't lose:

  1. OTTL's performance
  2. OTLP hierarchical efficiency (I don't want to make a bunch of different ResourceLogs that have the exact same Resource)
  3. OTTL's reasonably simple syntax/language (we aren't trying to make OTTL Turing complete or anything)

If we can keep those things intact then we get the best of both sides: users can correctly transform fields from "higher" contexts in a fast and efficient way.

It is possible some ideas from #30800 would help here.

@djaglowski (Member)

Here's a possible solution which would leave the data corruption issue in place but provide a decent workaround for it.

In short, we could add a new TransformContext for the top level struct of each data type (e.g. plog.Logs) and within this context add Flatten and Unflatten functions. The user can then Flatten, transform with a lower level context, and then Unflatten.

Alternatively, perhaps Flatten / Unflatten can be automatically wrapped around the transformprocessor based on a flag in the config.

Example

Input

Say we have the following plog.Logs. I'm omitting scopes for the sake of simplicity, but I think the principles described below apply similarly.

resources:
  - attributes: { name: AAA }
    records:
      - attributes: { host: Red }
        body: First record
      - attributes: { host: Blue }
        body: Second record
      - attributes: { host: Red }
        body: Third record
      - attributes: { host: Blue }
        body: Fourth record
  - attributes:  { name: BBBBBB }
    records:
      - attributes: { host: Red }
        body: First record
      - attributes: { host: Blue }
        body: Second record
      - attributes: { host: Red }
        body: Third record
      - attributes: { host: Blue }
        body: Fourth record

Flatten

A Flatten function would produce a plog.ResourceLogs per plog.LogRecord, so that each plog.ResourceLogs represents a non-overlapping TransformContext. Therefore, modifications can now be applied without concern for data corruption between records.

resources:
  - attributes: { name: AAA }
    records:
      - attributes: { host: Red }
        body: First record
  - attributes: { name: AAA }
    records:
      - attributes: { host: Blue }
        body: Second record
  - attributes: { name: AAA }
    records:
      - attributes: { host: Red }
        body: Third record
  - attributes: { name: AAA }
    records:
      - attributes: { host: Blue }
        body: Fourth record
  - attributes:  { name: BBBBBB }
    records:
      - attributes: { host: Red }
        body: First record
  - attributes:  { name: BBBBBB }
    records:
      - attributes: { host: Blue }
        body: Second record
  - attributes:  { name: BBBBBB }
    records:
      - attributes: { host: Red }
        body: Third record
  - attributes:  { name: BBBBBB }
    records:
      - attributes: { host: Blue }
        body: Fourth record

Transform

The user can then apply a transformation, e.g. moving a log record attribute to a resource attribute.

resources:
  - attributes: { name: AAA, host: Red }
    records:
      - attributes: {}
        body: First record
  - attributes: { name: AAA, host: Blue }
    records:
      - attributes: {}
        body: Second record
  - attributes: { name: AAA, host: Red }
    records:
      - attributes: {}
        body: Third record
  - attributes: { name: AAA, host: Blue }
    records:
      - attributes: {}
        body: Fourth record
  - attributes: { name: BBBBBB, host: Red }
    records:
      - attributes: {}
        body: First record
  - attributes: { name: BBBBBB, host: Blue }
    records:
      - attributes: {}
        body: Second record
  - attributes: { name: BBBBBB, host: Red }
    records:
      - attributes: {}
        body: Third record
  - attributes: { name: BBBBBB, host: Blue }
    records:
      - attributes: {}
        body: Fourth record

Unflatten

Then an Unflatten function regroups the logs by resource.

resources:
  - attributes: { name: AAA, host: Red }
    records:
      - attributes: {}
        body: First record
      - attributes: {}
        body: Third record
  - attributes: { name: AAA, host: Blue }
    records:
      - attributes: {}
        body: Second record
      - attributes: {}
        body: Fourth record
  - attributes: { name: BBBBBB, host: Red }
    records:
      - attributes: {}
        body: First record
      - attributes: {}
        body: Third record
  - attributes: { name: BBBBBB, host: Blue }
    records:
      - attributes: {}
        body: Second record
      - attributes: {}
        body: Fourth record

Order & Reversibility

Since plog manages resources and records as elements within slices, e.g. plog.ResourceLogsSlice, Flatten and Unflatten can both preserve the order of log records. This means that Flatten and Unflatten are both reversible operations, i.e. X == Unflatten( Flatten( X ) ).

Additionally, for any transformation T we can say that Unflatten( T( Flatten( X ) ) ) is always at least deterministic, and often preserves order in limited ways. Finally, if a transformation T is itself reversible then the entire process is reversible as well.

@evan-bradley (Contributor)

> In short, we could add a new TransformContext for the top level struct of each data type (e.g. plog.Logs) and within this context add Flatten and Unflatten functions. The user can then Flatten, transform with a lower level context, and then Unflatten.

> Alternatively, perhaps Flatten / Unflatten can be automatically wrapped around the transformprocessor based on a flag in the config.

I see how this would solve the issue at hand, but I'd like to avoid the extra steps if possible, unless flattening/unflattening achieves a goal beyond moving log records to a new resource object when modifying the resource from the log context.

If we're looking at creating new contexts, could we create contexts that handle the flattening/unflattening/copying/etc. operations for us? We could split the current resource path parser for each signal into separate parsers for each model, e.g. resource_rw for editing the resource without copies, and resource_ro for producing a new resource upon modification (happy to discuss other names like orig and copy or whatever is most obvious).

The ro context could have access to the parent signal objects on its interface and in its GetSetters could modify those objects when they are written to. So calling set(resource_ro["my_log_attr"], "attrvalue") inside the log context would move the log record to a new resource that has the new attribute/attribute value if a copy hasn't already been created, or would move it to an existing resource if one already exists with the same identity.

There would need to be a lot of internal machinery to support these operations, and we would need to decide upon and document what comprises resource object's identity, but I think this would solve this issue in a way that's both backwards-compatible and feels native to OTTL.

djaglowski added a commit that referenced this issue May 31, 2024
This is a first step towards a Flatten/Unflatten workaround (#32080) for OTTL's data corruption issue; specifically, this would support Unflatten. The workaround was discussed in a recent collector SIG and it sounded like it would be acceptable if available behind a feature gate and only for the transform processor. If this is accepted I'll work on a Flatten utility next, then integrate them into the transform processor to prove it end-to-end. Finally, I'll implement similar features for metrics and traces.

This PR adds an internal utility package which simplifies grouping logs
by resource and scope. I'm proposing this initially in
`internal/pdatautil` but the functionality could eventually be merged
into `pkg/pdatautil` if we find it useful elsewhere.
evan-bradley added a commit that referenced this issue Jun 13, 2024
This PR proposes a feature gate which would enable the Flatten/Unflatten behavior described above (#32080 (comment)). This was discussed in the Collector SIG recently and it was mentioned that a feature gate would be a reasonable way to implement this.

One immediate question: Should this be purely a feature gate, or should
there be a config option on the processor which fails `Validate` if the
feature gate is not set?

---------

Co-authored-by: Curtis Robert <crobert@splunk.com>
Co-authored-by: Evan Bradley <11745660+evan-bradley@users.noreply.github.com>
Co-authored-by: Tyler Helmuth <12352919+TylerHelmuth@users.noreply.github.com>

github-actions bot added the Stale label on Jul 22, 2024
TylerHelmuth added the never stale label and removed the Stale label on Jul 22, 2024
@TylerHelmuth (Member)

The flatten_data config option and associated feature gate have been in place for a while now and I've seen it solve end-user problems on multiple occasions. I would be ok moving forward with this solution for all signals.

For the sake of configuration simplicity and compatibility with the current config, it would be nice if flatten_data could apply to all signals. For the sake of performance, though, would it be better to have signal-specific configuration? The counter-argument would be to use different transformprocessor instances if a particular signal does not need flattening.
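For reference, a minimal sketch of the current logs-only setup (the gate name below is my recollection and may differ):

processors:
  transform:
    # gives each log record its own copy of the resource while the statements
    # run, then regroups the records afterwards
    flatten_data: true
    log_statements:
      - context: log
        statements:
          - set(resource.attributes["k8s.pod.name"], attributes["pod_name"])

The collector must be started with --feature-gates=transform.flatten.logs; if I recall the implementation correctly, the option otherwise fails config validation.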

@djaglowski (Member)

A related question here is exactly how flattening behaves for traces and metrics.

For traces, I could see a need for every span to have its own copy of resource & scope. Or, you may need every span event to have its own copy of resource, scope, and span.

For metrics, would it ever be required to give each metric its own copy of resource & scope? Is it ever needed for each data point?

Maybe there's a general rule that it is possible to flatten down to any context?

@TylerHelmuth (Member)

> Maybe there's a general rule that it is possible to flatten down to any context?

Yeah, this is what I had in mind. Given flatten_data is enabled, the data is flattened for the specified context. So for the span context, each span gets its own resource and scope. For the datapoint context, each datapoint gets its own resource, scope, and metric.

And then like the current implementation, after transformations occur the data would be regrouped.

For spanevent and datapoint, this flattening would be pretty inefficient. But it would also be an advanced, opt-in situation.

@djaglowski (Member)

@TylerHelmuth, I have run into a few use cases lately that reminded me that Flatten and Unflatten as OTTL functions would be useful:

  • The routing connector could route on a per-log, per-data-point, or per-span basis by using Flatten first.
  • Encoding extensions operate only on ptrace.Traces, pmetric.Metrics, or plog.Logs, but some encodings may need to emit each item separately. If the data could be flattened by the transform processor, it could be passed one item at a time to the encoding extension.

@TylerHelmuth (Member) commented Aug 20, 2024

@djaglowski I think there have been enough requests for these features that I am open to more fine-grained control via functions. We also have the proper testing framework in place to ensure they work as expected.

I am curious if Flatten and Unflatten should instead be editors that work directly on the data. The idea of using Flatten to set something in the cache or an attribute doesn't feel quite right.

@djaglowski (Member)

> I am curious if Flatten and Unflatten should instead be editors that work directly on the data. The idea of using Flatten to set something in the cache or an attribute doesn't feel quite right.

I agree with this.

@andrzej-stencel (Member)

Removing pkg/stanza and receiver/filelog labels as this issue seems to only refer to OTTL.
