[PROF-10422] Fix flaky GVL profiling integration spec
**What does this PR do?**

This PR fixes a flaky spec seen in
https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/16487/workflows/df69cfde-1fd5-40dc-bff6-ac8a9d0bc713/jobs/593534 :

```
  1) Datadog::Profiling::Collectors::CpuAndWallTimeWorker#start when main thread is sleeping but a background thread is working when GVL profiling is enabled records Waiting for GVL samples
     Failure/Error: expect(waiting_for_gvl_time).to be_within(5).percent_of(total_time)
       expected 402832467 to be within 5% of 492191168
```

I was able to reproduce this failure on my local machine in around 1 in
20 runs.

This flakiness is caused by the profiler starting after a Waiting for
GVL period has already begun: the profiler misses the start of that period
and categorizes it as "unknown", which blows our tight assertion.

It's not a coincidence that the difference between 492191168 and
402832467 is close to 100ms: that's the scheduler latency caused
by the background thread that's burning CPU.
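As a quick sanity check (plain arithmetic on the two wall-time values from the failure message above, not code from the spec):

```ruby
# Wall-time values (in nanoseconds) from the failure message above
total_time = 492_191_168
waiting_for_gvl_time = 402_832_467

# The gap is the wall time the profiler attributed to "unknown"
missing_ns = total_time - waiting_for_gvl_time
puts "unaccounted wall time: #{missing_ns / 1_000_000.0}ms" # ~89.4ms
```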

To fix this issue, we subtract the duration of this initial period
from the `total_time`, leaving only the period we care about.

To avoid accidentally passing this test by having no Waiting for GVL
at all, I've added an additional assertion for a sane `total_time`.
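The fix can be illustrated with a minimal sketch (a toy `Sample` struct with hand-made values, not the real pprof-decoded objects the spec uses): sort samples by timestamp, measure the leading run of "unknown" samples, and subtract that from the total before comparing.

```ruby
# Toy stand-in for profiler samples; the real spec works on pprof-decoded objects
Sample = Struct.new(:state, :wall_time_ns, :end_timestamp_ns, keyword_init: true)

samples = [
  Sample.new(state: "unknown", wall_time_ns: 89_000_000, end_timestamp_ns: 1),
  Sample.new(state: "waiting for gvl", wall_time_ns: 200_000_000, end_timestamp_ns: 2),
  Sample.new(state: "waiting for gvl", wall_time_ns: 202_000_000, end_timestamp_ns: 3),
].sort_by(&:end_timestamp_ns)

# Wall time the profiler missed: the leading "unknown" samples only
missed_by_profiler_time =
  samples.take_while { |s| s.state == "unknown" }.sum(&:wall_time_ns)

total_time = samples.sum(&:wall_time_ns) - missed_by_profiler_time
waiting_for_gvl_time =
  samples.select { |s| s.state == "waiting for gvl" }.sum(&:wall_time_ns)

# With the "unknown" prefix excluded, waiting-for-GVL time matches the total
puts waiting_for_gvl_time == total_time # => true
```

Note that `take_while` only drops the leading run of "unknown" samples, so any "unknown" samples appearing later (for other reasons) still count toward `total_time`.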

**Motivation:**

Make sure the profiler has zero flaky specs.

**Additional Notes:**

N/A

**How to test the change?**

Validate that CI is green and that the flaky test does not show up
again.
ivoanjo committed Sep 20, 2024
1 parent c8d5351 commit be8882a
Showing 1 changed file with 11 additions and 2 deletions.
13 changes: 11 additions & 2 deletions spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb
```diff
@@ -449,16 +449,25 @@
       samples = samples_for_thread(
         samples_from_pprof_without_gc_and_overhead(recorder.serialize!),
         background_thread_affected_by_gvl_contention
-      )
+      ).sort_by { |s| s.labels.fetch(:end_timestamp_ns) }

+      # Because the background_thread_affected_by_gvl_contention starts BEFORE the profiler, the first few samples
+      # will have an unknown state because the profiler may have missed the beginning of the Waiting for GVL
+      #
+      # So that the below assertions make sense (and are not flaky), we drop these first few samples from our
+      # consideration
+      missed_by_profiler_time =
+        samples.take_while { |s| s.labels[:state] == "unknown" }.sum { |sample| sample.values.fetch(:"wall-time") }

       waiting_for_gvl_samples = samples.select { |sample| sample.labels[:state] == "waiting for gvl" }
-      total_time = samples.sum { |sample| sample.values.fetch(:"wall-time") }
+      total_time = samples.sum { |sample| sample.values.fetch(:"wall-time") } - missed_by_profiler_time
       waiting_for_gvl_time = waiting_for_gvl_samples.sum { |sample| sample.values.fetch(:"wall-time") }

+      expect(waiting_for_gvl_samples.size).to be > 0
+
       # The background thread should spend almost all of its time waiting to run (since when it gets to run
       # it just passes and starts waiting)
+      expect(total_time).to be >= 200_000_000 # This test should run for at least 200ms, which is how long we sleep for
       expect(waiting_for_gvl_time).to be < total_time
       expect(waiting_for_gvl_time).to be_within(5).percent_of(total_time)
```
