Tweak threshold on flaky profiler test
This test failed in a flaky way in CI
(<https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/9211/workflows/8431a38e-fa1d-4a93-a937-b8488b960a89/jobs/342577/tests#failed-test-0>).

Interesting observation in the output:

```
Failure/Error: expect(sample_count).to be >= 8, "sample_count: #{sample_count}, stats: #{stats}, debug_failures: #{debug_failures}"
  sample_count: 7, stats: {:trigger_sample_attempts=>8, :trigger_simulated_signal_delivery_attempts=>8, :simulated_signal_delivery=>8, :signal_handler_enqueued_sample=>8, :signal_handler_wrong_thread=>0, :sampled=>7, :sampling_time_ns_min=>122586, :sampling_time_ns_max=>2849779, :sampling_time_ns_total=>3664094, :sampling_time_ns_avg=>523442.0}, debug_failures: {:thread_list=>[#<Thread:0x000055df58fcafc0@Thread.main run>, #<Thread:0x000055df5b3d1f90 /app/spec/spec_helper.rb:241 sleep>, #<Thread:0x000055df5d235c48 /usr/local/bundle/gems/webrick-1.7.0/lib/webrick/utils.rb:158 sleep_forever>],
```

There was at least one sample that was super slow
(`sampling_time_ns_max` -- almost 3ms).
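
For reference, here's the same data restated in milliseconds (all numbers are copied from the failure output above; the 0.2s figure is the sampling window this spec uses, per the comment added in the diff below):

```ruby
# Numbers copied verbatim from the failure output above, restated in ms.
sampling_time_ns_max   = 2_849_779  # slowest single sample
sampling_time_ns_total = 3_664_094  # all samples combined
sampled                = 7

sampling_time_ns_max / 1_000_000.0             # => ~2.85 ms for the slowest sample
sampling_time_ns_total / sampled / 1_000_000.0 # => ~0.52 ms average per sample
# Over the spec's ~0.2 s sampling window, ~3.66 ms of sampling works out to
# roughly 1.8% of wall time spent inside the sampler.
sampling_time_ns_total / 200_000_000.0 * 100   # => ~1.8
```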

Because of that, the profiler throttled sampling down (using the dynamic
sampling rate component), and so did not take as many samples as we
were expecting. This is by design, although it's really hard to know
_why_ the profiler slowed down -- maybe the CI machine was busy?
Maybe the leftover webrick thread that leaked from another test was
the culprit?
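
The actual dynamic sampling rate component lives in the profiler's native
code; the sketch below is only a simplified Ruby illustration of the general
idea, with made-up names and a made-up overhead target: after a slow sample,
the next sample gets pushed further into the future so that time spent
sampling stays bounded.

```ruby
# Simplified illustration only -- NOT the actual dd-trace-rb implementation
# (which lives in native code). Class name, method name, and the overhead
# target below are made up for this sketch.
class ToyDynamicSamplingRate
  def initialize(base_interval_s: 0.01, overhead_target: 0.02)
    @base_interval_s = base_interval_s # nominal 100 samples per second
    @overhead_target = overhead_target # spend at most ~2% of wall time sampling
  end

  # Slow samples stretch the delay until the next sample is allowed.
  def delay_until_next_sample(last_sample_duration_s)
    [@base_interval_s, last_sample_duration_s / @overhead_target].max
  end
end

rate = ToyDynamicSamplingRate.new
rate.delay_until_next_sample(0.0001)  # fast sample     => 0.01 s (nominal rate)
rate.delay_until_next_sample(0.00285) # ~2.85 ms sample => ~0.14 s until the next one
```

Under a toy model like this, a single ~2.85ms sample pushes the next sample
out by a big chunk of the spec's 0.2s window, which lines up with the run
ending at 7 samples instead of 8.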

Right now the dynamic sampling rate is by design not very configurable,
so I've chosen to tweak the expected number of samples down yet again.

If this last adjustment isn't enough, I'll bite the bullet and make it
possible to disable the dynamic sampling rate (or tweak it), so as not
to cause issues for this spec.
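
If it comes to that, the escape hatch could hypothetically look something
like the below. To be clear, `dynamic_sampling_rate_enabled` (and the
`worker_settings` splat) are made-up names for illustration only, not an
existing option as of this commit:

```ruby
# Hypothetical sketch only: a spec-only override to turn off throttling.
# Neither this keyword argument nor `worker_settings` exist as of this commit.
cpu_and_wall_time_worker = described_class.new(
  **worker_settings,                    # whatever options the spec already passes
  dynamic_sampling_rate_enabled: false, # made-up escape hatch, tests only
)
```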

(As I mentioned in the comment -- I still think this test is worth the
trouble, as it validates an important behavior of the profiler --
profiling while idle -- that otherwise may not get exercised by most
of the validations we run.)
ivoanjo committed Mar 3, 2023
1 parent 1a3caef commit 7d22e54
Showing 1 changed file with 13 additions and 5 deletions.
```diff
--- a/spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb
+++ b/spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb
@@ -310,11 +310,12 @@
 
       cpu_and_wall_time_worker.stop
 
-      result = samples_for_thread(samples_from_pprof_without_gc_and_overhead(recorder.serialize!), Thread.current)
+      all_samples = samples_from_pprof_without_gc_and_overhead(recorder.serialize!)
+      result = samples_for_thread(all_samples, Thread.current)
       sample_count = result.map { |it| it.values.fetch(:'cpu-samples') }.reduce(:+)
 
       stats = cpu_and_wall_time_worker.stats
-      debug_failures = { thread_list: Thread.list, result: result }
+      debug_failures = { thread_list: Thread.list, all_samples: all_samples }
 
      trigger_sample_attempts = stats.fetch(:trigger_sample_attempts)
      simulated_signal_delivery = stats.fetch(:simulated_signal_delivery)
@@ -324,9 +325,16 @@
 
       # Sanity checking
 
-      # We're currently targeting 100 samples per second, so this is a conservative approximation that hopefully
-      # will not cause flakiness
-      expect(sample_count).to be >= 8, "sample_count: #{sample_count}, stats: #{stats}, debug_failures: #{debug_failures}"
+      # We're currently targeting 100 samples per second (aka 20 in the 0.2 period above), so expecting 5 samples is a
+      # conservative choice that hopefully will not cause flakiness.
+      #
+      # @ivoanjo: One source of flakiness here in the past has been the dynamic sampling rate component -- if for some
+      # reason the profiler takes longer to sample (maybe the CI machine is busy today...), it will by design slow down
+      # and take less samples. For now, I tried to tune down this value, and am hesitating on adding a way to disable
+      # the dynamic sampling rate while we're running this test (because this is something we'd never do in production).
+      # But if spec proves to be flaky again, then I think we'll need to go with that solution (as I think this test
+      # is still quite valuable).
+      expect(sample_count).to be >= 5, "sample_count: #{sample_count}, stats: #{stats}, debug_failures: #{debug_failures}"
       expect(trigger_sample_attempts).to be >= sample_count
     end
   end
```
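
For context, the headroom the new threshold leaves (the 100 samples/second
and 0.2s figures come from the comment in the diff; the percentages are just
arithmetic):

```ruby
target_rate_hz  = 100                        # profiler targets ~100 samples/second
window_s        = 0.2                        # the spec samples for ~0.2 seconds
nominal_samples = target_rate_hz * window_s  # => 20.0 samples if nothing throttles

8 / nominal_samples # => 0.4  -- old threshold required 40% of nominal, still flaked
5 / nominal_samples # => 0.25 -- new threshold requires only 25% of nominal
```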
