[PROF-10422] Fix flaky GVL profiling integration spec
**What does this PR do?**

This PR fixes a flaky spec seen in
https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/16487/workflows/df69cfde-1fd5-40dc-bff6-ac8a9d0bc713/jobs/593534 :

```
  1) Datadog::Profiling::Collectors::CpuAndWallTimeWorker#start when main thread is sleeping but a background thread is working when GVL profiling is enabled records Waiting for GVL samples
     Failure/Error: expect(waiting_for_gvl_time).to be_within(5).percent_of(total_time)
       expected 402832467 to be within 5% of 492191168
```

I was able to reproduce this failure on my local machine in around 1 in
20 runs.

This flakiness is caused by the profiler starting after a Waiting for
GVL period has already begun: the profiler misses the start of that period
and categorizes it as "unknown", which blows our tight assertion.

It's not a coincidence that the difference between 492191168 and
402832467 is close to 100ms: that's the scheduler latency caused
by the background thread that's burning CPU.
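As a quick sanity check (plain arithmetic on the two wall-time values from the failure message above, not code from the spec):

```ruby
# Wall-time values (in nanoseconds) from the failure message above
total_time = 492_191_168
waiting_for_gvl_time = 402_832_467

# The gap is the wall time the profiler attributed to "unknown"
missing_ns = total_time - waiting_for_gvl_time
puts "unaccounted wall time: #{missing_ns / 1_000_000.0}ms" # ~89.4ms
```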

To fix this issue, we subtract the duration of this initial period
from the `total_time`, leaving only the period we care about.

To avoid accidentally passing this test by having no Waiting for GVL
at all, I've added an additional assertion for a sane `total_time`.
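The fix can be illustrated with a minimal sketch (a toy `Sample` struct with hand-made values, not the real pprof-decoded objects the spec uses): sort samples by timestamp, measure the leading run of "unknown" samples, and subtract that from the total before comparing.

```ruby
# Toy stand-in for profiler samples; the real spec works on pprof-decoded objects
Sample = Struct.new(:state, :wall_time_ns, :end_timestamp_ns, keyword_init: true)

samples = [
  Sample.new(state: "unknown", wall_time_ns: 89_000_000, end_timestamp_ns: 1),
  Sample.new(state: "waiting for gvl", wall_time_ns: 200_000_000, end_timestamp_ns: 2),
  Sample.new(state: "waiting for gvl", wall_time_ns: 202_000_000, end_timestamp_ns: 3),
].sort_by(&:end_timestamp_ns)

# Wall time the profiler missed: the leading "unknown" samples only
missed_by_profiler_time =
  samples.take_while { |s| s.state == "unknown" }.sum(&:wall_time_ns)

total_time = samples.sum(&:wall_time_ns) - missed_by_profiler_time
waiting_for_gvl_time =
  samples.select { |s| s.state == "waiting for gvl" }.sum(&:wall_time_ns)

# With the "unknown" prefix excluded, waiting-for-GVL time matches the total
puts waiting_for_gvl_time == total_time # => true
```

Note that `take_while` only drops the leading run of "unknown" samples, so any "unknown" samples appearing later (for other reasons) still count toward `total_time`.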

**Motivation:**

Make sure the profiler has zero flaky specs.

**Additional Notes:**

N/A

**How to test the change?**

Validate that CI is green and that the flaky test does not show up
again.
ivoanjo committed Sep 20, 2024
1 parent c8d5351 commit be8882a
Showing 1 changed file with 11 additions and 2 deletions.
13 changes: 11 additions & 2 deletions spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb
```diff
@@ -449,16 +449,25 @@
       samples = samples_for_thread(
         samples_from_pprof_without_gc_and_overhead(recorder.serialize!),
         background_thread_affected_by_gvl_contention
-      )
+      ).sort_by { |s| s.labels.fetch(:end_timestamp_ns) }

+      # Because the background_thread_affected_by_gvl_contention starts BEFORE the profiler, the first few samples
+      # will have an unknown state because the profiler may have missed the beginning of the Waiting for GVL
+      #
+      # So that the below assertions make sense (and are not flaky), we drop these first few samples from our
+      # consideration
+      missed_by_profiler_time =
+        samples.take_while { |s| s.labels[:state] == "unknown" }.sum { |sample| sample.values.fetch(:"wall-time") }

       waiting_for_gvl_samples = samples.select { |sample| sample.labels[:state] == "waiting for gvl" }
-      total_time = samples.sum { |sample| sample.values.fetch(:"wall-time") }
+      total_time = samples.sum { |sample| sample.values.fetch(:"wall-time") } - missed_by_profiler_time
       waiting_for_gvl_time = waiting_for_gvl_samples.sum { |sample| sample.values.fetch(:"wall-time") }

+      expect(waiting_for_gvl_samples.size).to be > 0
+
       # The background thread should spend almost all of its time waiting to run (since when it gets to run
       # it just passes and starts waiting)
+      expect(total_time).to be >= 200_000_000 # This test should run for at least 200ms, which is how long we sleep for
       expect(waiting_for_gvl_time).to be < total_time
       expect(waiting_for_gvl_time).to be_within(5).percent_of(total_time)
```
