Tweak threshold on flaky profiler test
This test failed in a flaky way in CI
(<https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/9211/workflows/8431a38e-fa1d-4a93-a937-b8488b960a89/jobs/342577/tests#failed-test-0>).

Interesting observation in the output:

```
Failure/Error: expect(sample_count).to be >= 8, "sample_count: #{sample_count}, stats: #{stats}, debug_failures: #{debug_failures}"
  sample_count: 7, stats: {:trigger_sample_attempts=>8, :trigger_simulated_signal_delivery_attempts=>8, :simulated_signal_delivery=>8, :signal_handler_enqueued_sample=>8, :signal_handler_wrong_thread=>0, :sampled=>7, :sampling_time_ns_min=>122586, :sampling_time_ns_max=>2849779, :sampling_time_ns_total=>3664094, :sampling_time_ns_avg=>523442.0}, debug_failures: {:thread_list=>[#<Thread:0x000055df58fcafc0@Thread.main run>, #<Thread:0x000055df5b3d1f90 /app/spec/spec_helper.rb:241 sleep>, #<Thread:0x000055df5d235c48 /usr/local/bundle/gems/webrick-1.7.0/lib/webrick/utils.rb:158 sleep_forever>],
```

There was at least one sample that was super slow
(`sampling_time_ns_max` -- almost 3ms).
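
For reference, here's the same data restated in milliseconds (all numbers are copied from the failure output above; the 0.2s figure is the sampling window this spec uses, per the comment added in the diff below):

```ruby
# Numbers copied verbatim from the failure output above, restated in ms.
sampling_time_ns_max   = 2_849_779  # slowest single sample
sampling_time_ns_total = 3_664_094  # all samples combined
sampled                = 7

sampling_time_ns_max / 1_000_000.0             # => ~2.85 ms for the slowest sample
sampling_time_ns_total / sampled / 1_000_000.0 # => ~0.52 ms average per sample
# Over the spec's ~0.2 s sampling window, ~3.66 ms of sampling works out to
# roughly 1.8% of wall time spent inside the sampler.
sampling_time_ns_total / 200_000_000.0 * 100   # => ~1.8
```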

Because of that, the profiler throttled sampling down (using the dynamic
sampling rate component), and so did not take as many samples as we
were expecting. This is by design, although it's really hard to know
_why_ the profiler slowed down -- maybe the CI machine was busy?
Maybe the leftover webrick thread that leaked from another test was
the culprit?
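
The actual dynamic sampling rate component lives in the profiler's native
code; the sketch below is only a simplified Ruby illustration of the general
idea, with made-up names and a made-up overhead target: after a slow sample,
the next sample gets pushed further into the future so that time spent
sampling stays bounded.

```ruby
# Simplified illustration only -- NOT the actual dd-trace-rb implementation
# (which lives in native code). Class name, method name, and the overhead
# target below are made up for this sketch.
class ToyDynamicSamplingRate
  def initialize(base_interval_s: 0.01, overhead_target: 0.02)
    @base_interval_s = base_interval_s # nominal 100 samples per second
    @overhead_target = overhead_target # spend at most ~2% of wall time sampling
  end

  # Slow samples stretch the delay until the next sample is allowed.
  def delay_until_next_sample(last_sample_duration_s)
    [@base_interval_s, last_sample_duration_s / @overhead_target].max
  end
end

rate = ToyDynamicSamplingRate.new
rate.delay_until_next_sample(0.0001)  # fast sample     => 0.01 s (nominal rate)
rate.delay_until_next_sample(0.00285) # ~2.85 ms sample => ~0.14 s until the next one
```

Under a toy model like this, a single ~2.85ms sample pushes the next sample
out by a big chunk of the spec's 0.2s window, which lines up with the run
ending at 7 samples instead of 8.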

Right now the dynamic sampling rate is by design not very configurable,
so I've chosen to tweak the expected number of samples down yet again.

If this last adjustment isn't enough, I'll bite the bullet and make it
possible to disable the dynamic sampling rate (or tweak it), so as not
to cause issues for this spec.
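
If it comes to that, the escape hatch could hypothetically look something
like the below. To be clear, `dynamic_sampling_rate_enabled` (and the
`worker_settings` splat) are made-up names for illustration only, not an
existing option as of this commit:

```ruby
# Hypothetical sketch only: a spec-only override to turn off throttling.
# Neither this keyword argument nor `worker_settings` exist as of this commit.
cpu_and_wall_time_worker = described_class.new(
  **worker_settings,                    # whatever options the spec already passes
  dynamic_sampling_rate_enabled: false, # made-up escape hatch, tests only
)
```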

(As I mentioned in the comment -- I still think this test is worth the
trouble, as it validates an important behavior of the profiler --
profiling while idle -- that otherwise may not get exercised by most
of the validations we run.)
ivoanjo committed Mar 3, 2023
1 parent 1a3caef commit 7d22e54
Showing 1 changed file with 13 additions and 5 deletions.
```diff
--- a/spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb
+++ b/spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb
@@ -310,11 +310,12 @@
 
       cpu_and_wall_time_worker.stop
 
-      result = samples_for_thread(samples_from_pprof_without_gc_and_overhead(recorder.serialize!), Thread.current)
+      all_samples = samples_from_pprof_without_gc_and_overhead(recorder.serialize!)
+      result = samples_for_thread(all_samples, Thread.current)
       sample_count = result.map { |it| it.values.fetch(:'cpu-samples') }.reduce(:+)
 
       stats = cpu_and_wall_time_worker.stats
-      debug_failures = { thread_list: Thread.list, result: result }
+      debug_failures = { thread_list: Thread.list, all_samples: all_samples }
 
      trigger_sample_attempts = stats.fetch(:trigger_sample_attempts)
      simulated_signal_delivery = stats.fetch(:simulated_signal_delivery)
@@ -324,9 +325,16 @@
 
       # Sanity checking
 
-      # We're currently targeting 100 samples per second, so this is a conservative approximation that hopefully
-      # will not cause flakiness
-      expect(sample_count).to be >= 8, "sample_count: #{sample_count}, stats: #{stats}, debug_failures: #{debug_failures}"
+      # We're currently targeting 100 samples per second (aka 20 in the 0.2 period above), so expecting 5 samples is a
+      # conservative choice that hopefully will not cause flakiness.
+      #
+      # @ivoanjo: One source of flakiness here in the past has been the dynamic sampling rate component -- if for some
+      # reason the profiler takes longer to sample (maybe the CI machine is busy today...), it will by design slow down
+      # and take less samples. For now, I tried to tune down this value, and am hesitating on adding a way to disable
+      # the dynamic sampling rate while we're running this test (because this is something we'd never do in production).
+      # But if spec proves to be flaky again, then I think we'll need to go with that solution (as I think this test
+      # is still quite valuable).
+      expect(sample_count).to be >= 5, "sample_count: #{sample_count}, stats: #{stats}, debug_failures: #{debug_failures}"
       expect(trigger_sample_attempts).to be >= sample_count
     end
   end
```
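
For context, the headroom the new threshold leaves (the 100 samples/second
and 0.2s figures come from the comment in the diff; the percentages are just
arithmetic):

```ruby
target_rate_hz  = 100                        # profiler targets ~100 samples/second
window_s        = 0.2                        # the spec samples for ~0.2 seconds
nominal_samples = target_rate_hz * window_s  # => 20.0 samples if nothing throttles

8 / nominal_samples # => 0.4  -- old threshold required 40% of nominal, still flaked
5 / nominal_samples # => 0.25 -- new threshold requires only 25% of nominal
```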
