
[processor/tailsampling] Data race during evaluation #24283

Closed
larsn777 opened this issue Jul 16, 2023 · 5 comments
Labels: bug, processor/tailsampling, Stale

Comments

@larsn777
Contributor

Component(s)

processor/tailsampling

What happened?

Description

There is a data race during the policy evaluation process. It occurs when the following two events happen concurrently:

  • a policy evaluation for a given trace is in progress
  • new spans arrive for the same trace

The problem is that when some evaluation policies "copy" the slice of spans, they actually copy a pointer (1,2) to the reflect.SliceHeader, not the underlying data. As a result, the same slice is read concurrently (evaluation) and written (appending new spans), which can lead to a panic.
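As an illustration, here is a minimal, self-contained sketch (not the processor's actual code): copying a ptrace.Traces value only copies the thin wrapper, so both copies share the same underlying ResourceSpans slice, and running this with `go run -race` should report the same kind of read/write race as in the log below.

```go
package main

import (
	"sync"

	"go.opentelemetry.io/collector/pdata/ptrace"
)

func main() {
	td := ptrace.NewTraces()
	shared := td // copies the wrapper only; the backing slice is shared

	var wg sync.WaitGroup
	wg.Add(2)

	// Writer: stands in for new spans arriving for the trace.
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			td.ResourceSpans().AppendEmpty()
		}
	}()

	// Reader: stands in for a policy iterating over the spans during evaluation.
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			_ = shared.ResourceSpans().Len()
		}
	}()

	wg.Wait()
}
```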

The data race appears for the following evaluation policies:

  • latency
  • boolean_attribute
  • ottl_condition
  • numeric_attribute
  • status_code
  • string_attribute
  • trace_state

Steps to Reproduce

  • Add one of the evaluation policies listed above to the tail sampling config
  • Build the collector with the Go race detector enabled (the -race flag)
  • Start the collector
  • Continuously send spans with the same trace ID to the collector. For better results, send as many spans per second as possible

Expected Result

There are no panics or data race messages in the collector console output

Actual Result

There are panics or data race messages in the collector output

Collector version

v0.81.0

Environment information

any

OpenTelemetry Collector configuration

receivers:
  # any convenient receiver could be used
  jaeger:
    protocols:
      thrift_compact:
        endpoint: 0.0.0.0:6831
        max_packet_size: 65000
        queue_size: 100
        workers: 10

processors:
  tail_sampling:
    num_traces: 100500
    decision_wait: 2s
    # duplicated policies are used to increase evaluation duration
    policies: [
      {
        name: latency-composite-policy,
        type: latency,
        latency: { threshold_ms: 100 }
      },
      {
        name: latency-composite-policy,
        type: latency,
        latency: { threshold_ms: 100 }
      },
      {
        name: latency-composite-policy,
        type: latency,
        latency: { threshold_ms: 100 }
      },
      {
        name: latency-composite-policy,
        type: latency,
        latency: { threshold_ms: 100 }
      },
    ]

exporters:
  logging:
    loglevel: error

service:
  telemetry:
    logs:
      level: info
  pipelines:
    traces:
      receivers: [jaeger]
      processors: [tail_sampling]
      exporters: [logging]

Log output

2023-07-16T17:15:26.398+0300	info	service/telemetry.go:81	Setting up own telemetry...
2023-07-16T17:15:26.398+0300	info	service/telemetry.go:104	Serving Prometheus metrics	{"address": ":8888", "level": "Detailed"}
2023-07-16T17:15:26.399+0300	info	exporter@v0.81.0/exporter.go:275	Development component. May change in the future.	{"kind": "exporter", "data_type": "traces", "name": "logging"}
2023-07-16T17:15:26.399+0300	warn	loggingexporter@v0.81.0/factory.go:98	'loglevel' option is deprecated in favor of 'verbosity'. Set 'verbosity' to equivalent value to preserve behavior.	{"kind": "exporter", "data_type": "traces", "name": "logging", "loglevel": "error", "equivalent verbosity level": "Basic"}
2023-07-16T17:15:26.401+0300	info	service/service.go:131	Starting otelcol-avito...	{"Version": "0.81.0", "NumCPU": 10}
2023-07-16T17:15:26.401+0300	info	extensions/extensions.go:30	Starting extensions...
2023-07-16T17:15:26.401+0300	info	service/service.go:148	Everything is ready. Begin running and processing data.
==================
WARNING: DATA RACE
Write at 0x00c000412198 by goroutine 43:
  go.opentelemetry.io/collector/pdata/ptrace.ResourceSpansSlice.AppendEmpty()
      /Users/salarionenko/go/pkg/mod/go.opentelemetry.io/collector/pdata@v1.0.0-rcv0013/ptrace/generated_resourcespansslice.go:82 +0x12c
  github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.appendToTraces()
      /Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor@v0.81.0/processor.go:425 +0x44
  github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.(*tailSamplingSpanProcessor).processTraces()
      /Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor@v0.81.0/processor.go:364 +0x484
  github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.(*tailSamplingSpanProcessor).ConsumeTraces()
      /Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor@v0.81.0/processor.go:296 +0x60
  go.opentelemetry.io/collector/consumer.Traces.ConsumeTraces-fm()
      <autogenerated>:1 +0x60
  go.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces()
      /Users/salarionenko/go/pkg/mod/go.opentelemetry.io/collector/consumer@v0.81.0/traces.go:25 +0x64
  go.opentelemetry.io/collector/service/internal/graph.(*capabilitiesNode).ConsumeTraces()
      <autogenerated>:1 +0x20
  github.com/open-telemetry/opentelemetry-collector-contrib/receiver/jaegerreceiver.consumeTraces()
      /Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/receiver/jaegerreceiver@v0.81.0/trace_receiver.go:169 +0xa0
  github.com/open-telemetry/opentelemetry-collector-contrib/receiver/jaegerreceiver.(*agentHandler).EmitBatch()
      /Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/receiver/jaegerreceiver@v0.81.0/trace_receiver.go:202 +0xa8
  github.com/jaegertracing/jaeger/thrift-gen/agent.(*agentProcessorEmitBatch).Process()
      /Users/salarionenko/go/pkg/mod/github.com/jaegertracing/jaeger@v1.41.0/thrift-gen/agent/agent.go:178 +0xec
  github.com/jaegertracing/jaeger/thrift-gen/agent.(*AgentProcessor).Process()
      /Users/salarionenko/go/pkg/mod/github.com/jaegertracing/jaeger@v1.41.0/thrift-gen/agent/agent.go:125 +0x264
  github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer()
      /Users/salarionenko/go/pkg/mod/github.com/jaegertracing/jaeger@v1.41.0/cmd/agent/app/processors/thrift_processor.go:122 +0x27c
  github.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2()
      /Users/salarionenko/go/pkg/mod/github.com/jaegertracing/jaeger@v1.41.0/cmd/agent/app/processors/thrift_processor.go:87 +0x2c

Previous read at 0x00c000412198 by goroutine 34:
  go.opentelemetry.io/collector/pdata/ptrace.ResourceSpansSlice.Len()
      /Users/salarionenko/go/pkg/mod/go.opentelemetry.io/collector/pdata@v1.0.0-rcv0013/ptrace/generated_resourcespansslice.go:41 +0x4c
  github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor/internal/sampling.hasSpanWithCondition()
      /Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor@v0.81.0/internal/sampling/util.go:57 +0x2c
  github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor/internal/sampling.(*latency).Evaluate()
      /Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor@v0.81.0/internal/sampling/latency.go:41 +0xbc
  github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.(*tailSamplingSpanProcessor).makeDecision()
      /Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor@v0.81.0/processor.go:222 +0x220
  github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.(*tailSamplingSpanProcessor).samplingPolicyOnTick()
      /Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor@v0.81.0/processor.go:179 +0x7ec
  github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.(*tailSamplingSpanProcessor).samplingPolicyOnTick-fm()
      <autogenerated>:1 +0x34
  github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal/timeutils.(*PolicyTicker).OnTick()
      /Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal@v0.81.0/timeutils/ticker_helper.go:45 +0x4c
  github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal/timeutils.(*PolicyTicker).Start.func1()
      /Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal@v0.81.0/timeutils/ticker_helper.go:36 +0x30

Additional context

There is no cheap way to copy the underlying slice of ptrace.Traces. Therefore, one possible fix is to extend the spans lock so that it covers the evaluation.
In my fork I have already fixed the problem and written a unit test. After applying this solution in production, I haven't noticed any side effects: CPU consumption has remained the same and the processing pipeline hasn't slowed down.
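A rough sketch of that idea, using illustrative names rather than the processor's actual types: hold the per-trace mutex for the whole evaluation so the ingestion path and the decision tick cannot touch the slice concurrently.

```go
// Hypothetical sketch only; traceData, addSpans, and evaluate are illustrative
// names, not the processor's actual API.
package sketch

import (
	"sync"

	"go.opentelemetry.io/collector/pdata/ptrace"
)

type traceData struct {
	sync.Mutex
	received ptrace.Traces // spans accumulated for a single trace ID
}

// addSpans runs on the ingestion path when new spans for the trace arrive.
func (t *traceData) addSpans(incoming ptrace.Traces) {
	t.Lock()
	defer t.Unlock()
	incoming.ResourceSpans().MoveAndAppendTo(t.received.ResourceSpans())
}

// evaluate runs on the decision tick; evaluateFn stands in for a policy's
// Evaluate call. Keeping the lock held while the policy reads the spans is
// what prevents the concurrent read/write.
func (t *traceData) evaluate(evaluateFn func(ptrace.Traces) bool) bool {
	t.Lock()
	defer t.Unlock()
	return evaluateFn(t.received)
}
```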

@larsn777 larsn777 added bug Something isn't working needs triage New item requiring triage labels Jul 16, 2023
@github-actions github-actions bot added the processor/tailsampling Tail sampling processor label Jul 16, 2023
@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jpkrohling
Member

Thanks for the report, I'll take a look at the PRs.

@larsn777
Contributor Author

> Thanks for the report, I'll take a look at the PRs.

You're welcome

jpkrohling added a commit that referenced this issue Jul 17, 2023
**Description:**
This PR contains a fix for a race condition that occurs during policy
evaluation in the tail sampling processor. The data race occurs because of
concurrent reads and writes to the ptrace.Traces underlying slice
(*[]*v1.ResourceSpans). One possible solution is to extend the mutex lock
to the whole Evaluate method.

**Link to tracking Issue:** #24283 

**Testing:** 
Added a new ConcurrentArrivalAndEvaluation unit test covering the data race case.

**Documentation:** <Describe the documentation added.>

---------

Co-authored-by: Juraci Paixão Kröhling <juraci@kroehling.de>
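The testing note above mentions a ConcurrentArrivalAndEvaluation unit test. As a rough illustration only (this is not the PR's actual test), a concurrency test over the hypothetical traceData sketch from the earlier comment could look like the following, run with `go test -race`:

```go
package sketch

import (
	"sync"
	"testing"

	"go.opentelemetry.io/collector/pdata/ptrace"
)

// Hypothetical test sketch, not the PR's actual ConcurrentArrivalAndEvaluation test.
func TestConcurrentArrivalAndEvaluation(t *testing.T) {
	td := &traceData{received: ptrace.NewTraces()}

	var wg sync.WaitGroup
	wg.Add(2)

	// Simulates spans continuously arriving for the same trace ID.
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			incoming := ptrace.NewTraces()
			incoming.ResourceSpans().AppendEmpty()
			td.addSpans(incoming)
		}
	}()

	// Simulates the decision tick evaluating the trace concurrently.
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			_ = td.evaluate(func(tr ptrace.Traces) bool {
				return tr.ResourceSpans().Len() > 0
			})
		}
	}()

	wg.Wait()
}
```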
@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Sep 18, 2023
@jpkrohling
Member

I think this can be closed, as the PR has been merged.
