perf(profiling): unwind only one Frame per Task #15789
Conversation
Performance SLOs – comparing candidate kowalski/perf-profiling-unwind-only-one-frame-per-task (04aa20b) with baseline main (21d5a9b).

📈 Performance Regressions (3 suites)
- iastaspects – 118/118 within SLO. Largest time regressions vs baseline: add_aspect +20.7%, encode_aspect +23.2%, title_aspect +19.6%; the remaining benchmarks stay close to baseline (within ~6%) and well under their time SLOs. Memory is consistently ~+5% vs baseline but still under SLO.
- iastaspectsospath – 24/24 within SLO. ospathbasename_aspect +20.9% vs baseline; the rest are within ~2% of baseline. Memory ~+5% vs baseline, under SLO.
- telemetryaddmetric – 30/30 within SLO. 1-count-metric-1-times +15.2% vs baseline; the rest are within ~2% of baseline. Memory ~+5% vs baseline, under SLO.

🟡 Near SLO Breach (16 suites)
- coreapiscenario – 10/10 (1 unstable)
## Description

Related PRs
- Related: #15712
- Dependent: #15789
- Research PR: https://github.com/DataDog/dd-trace-py/pull/15675/commits (if needed for code archeology...)

### What is this about?

This PR updates the Task unwinding logic in the Profiler to handle race conditions around running/"on CPU" Tasks more properly. A Task can be either in a _running_ state (i.e. actively _computing_ something itself, like executing a regular Python function) or in a _sleeping_ state (i.e. waiting for something else to happen before waking up).

<img width="1076" height="434" alt="image" src="https://github.com/user-attachments/assets/be6759eb-0255-43ef-b3ce-d47486bb653c" />

After these changes, this problem no longer appears: only Frames that actually belong to the same Stack appear within a given Stack.

<img width="1387" height="445" alt="image" src="https://github.com/user-attachments/assets/31287863-f918-47a8-a39b-b3a0d27dce8f" />

### Why do we need it?

Because we don't take a snapshot of the whole Python process at once, there is a race condition in our Sampler. We first capture the Thread Stack (i.e. for the current Thread, if it is running, what Python code the interpreter is running), then, for each Task in the Thread's Event Loop (if it exists), we look at the Task's own Stack. (Since Tasks/Coroutines are pausable, they have their own Stack that is kept in memory while they're paused and re-loaded into context when they're resumed. Walking each Task's Stack lets us know, for example, what code they're "running", even when they aren't actually running code at that moment.)

Going back to the race condition: we may see a discrepancy between what the Python Thread Stack tells us (what the interpreter is running) and what the Task objects themselves tell us, because a small amount of time elapses between the moment we capture the Thread Stack and the moment we inspect the Task objects, so _what is happening_ may have changed in the meantime. I've already gone into more detail in the past about what buggy/unexpected behaviour can result from that race condition; this PR improves the situation.

Note that there is a pretty obvious tradeoff here. When we detect a discrepancy, we can:
- Ignore the fact that we know something bad is going to happen – I'd rather not do that because it can look terrible for customers (and we don't want to look _obviously wrong_ to them). That would mean _quantity over quality_.
- Try to recover with clever tricks (this can be somewhat costly because we have to interleave the various Stacks we have... I think we may want to do that at some point, but not without putting more thought into it; plus those clever tricks can also be brittle tricks). That would mean _quality and quantity over cost_ (which in practice probably also means _quality over quantity_, because increasing costs will most likely lead to more adaptive sampling).
- Give up and pretend this never happened – skip that Sample (for the current Task, and in certain cases for the current Thread altogether). That would mean _quality over quantity_.

For the time being, things can only get better, because we're currently in a state where we don't deal with the problem at all. This PR biases towards a mix: we detect the depth of the pure Python Stack more reliably (which means we don't have to rely on un-unwinding Task Stacks), and then we skip Samples that we know will be bogus. If the latter happens sufficiently rarely [a claim I still need numbers to back], then this is OK.

### How does it work?

The main problem we are trying to avoid is some of Task A's Frames appearing as part of Task B's Stack. Working around this requires properly splitting the Python Stack when it says it is running a Task, such that we only push the `asyncio` runtime Frames on top of each Task other than Task A. Walking the Python Stack allows us to do that properly.

We thus walk the Python Stack (once per Thread) to detect whether we see `Handle.run` Frames – those indicate that the Event Loop is currently _stepping_ a Coroutine, in other words executing code. (When that happens, we expect at least one Task to be marked as _running_; there could be more – that's also a race condition, but it's OK, at least as long as CPU Time is not concerned.) As soon as we see a `run` Frame, we know the depth of the "pure Python Stack" and we can push it on top of every Task's Stack (see the conceptual sketch after this description).

### What does this cost us?

This is not completely free – we're doing more work (namely, walking the stack at each Sample). Looking at Full Host Profiles on a high-CPU `asyncio`-based Python script, I'm getting the difference shown below. The total Profiler overhead is about 360ms/minute, so the additional ~20ms we're spending here represents an extra ~5% overhead. Given the importance of getting Stacks right (or at least not completely wrong), I'd say it's worth it, but it's still noticeable.

I tried to do otherwise, but as far as I can tell, as long as the race condition between unwinding the Python Stack and unwinding Task Stacks exists (and it cannot be eliminated), we will not be able to tell for sure how many pure Python Stack Frames we need to push. There are heuristics that could get us there in theoretically better time (e.g. only walk the Python Stack if all Tasks are reported as non-running), but they come at a correctness and code-readability cost, and it's not even certain their overhead would be lower. I also have another PR that should reduce the cost of unwinding Tasks (it uses the fact that we now walk the Python Stack only once): #15789 – so hopefully it evens things out.

<img width="1926" height="921" alt="image" src="https://github.com/user-attachments/assets/e21a6698-2b18-43bb-aff9-4e8d59354332" />
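To make the `Handle.run` detection described above more concrete, here is a minimal, pure-Python sketch of the idea. This is not the actual implementation (the real unwinder walks frames in native code); the helper name, the use of `sys._current_frames()`, and the exact frame-name check are illustrative assumptions only.

```python
# Conceptual sketch only: locate the event loop's Handle.run frame in a
# thread's stack to learn the depth of the "pure Python Stack" (the frames
# shared by every Task). The real profiler does this in native code; the
# helper name and the exact frame check below are hypothetical.
import sys


def pure_python_stack_depth(thread_id):
    """Number of frames from the bottom of the thread's stack up to and
    including the asyncio Handle.run frame, or None if it isn't found."""
    frame = sys._current_frames().get(thread_id)
    frames = []
    while frame is not None:  # walk from the leaf towards the bottom
        frames.append(frame)
        frame = frame.f_back

    for depth_from_bottom, f in enumerate(reversed(frames), start=1):
        code = f.f_code
        # CPython names this method Handle._run; the description above calls
        # it Handle.run – accept either spelling here.
        if code.co_name in ("run", "_run") and code.co_filename.endswith("events.py"):
            # This frame and everything below it form the "pure Python Stack"
            # that gets pushed on top of every Task's Stack.
            return depth_from_bottom
    return None
```

In the real Sampler this walk happens once per Thread per Sample, which is where the extra ~20ms/minute mentioned above comes from.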
r1viollet left a comment:
LGTM
## Description

This is a small (but still worth it) performance improvement in the context of `asyncio` Task unwinding.

Previously, we would use `unwind_frame` to get the current Frame for an `asyncio` Task. For a running Task, this would also yield all the Python `asyncio` runtime Frames that sit "on top" of that Task Frame (and that we would later have to remove from the Stack).

Building that Python Stack we don't care about takes some time (we need to walk the Frame chain), so we should only do it when we need it. Here we clearly don't: we only care about the Task Frame. I therefore added a new argument to `unwind_frame` that allows an early exit once a given Stack depth has been reached, and set it to `1` when unwinding a Task Frame. A rough sketch of the idea follows.
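As a rough illustration of the depth-limit idea only – the actual change is in the native unwinder, so the Python signature, parameter name, and usage below are assumptions, not the real API:

```python
# Conceptual sketch only: a Python analogue of giving an unwind_frame-style
# helper an early-exit depth limit. The real change is in native code;
# the names and signature here are hypothetical.
def unwind_frame(frame, max_depth=None):
    """Collect (filename, function, lineno) entries by walking f_back,
    stopping as soon as max_depth entries have been collected."""
    stack = []
    while frame is not None:
        stack.append((frame.f_code.co_filename, frame.f_code.co_name, frame.f_lineno))
        if max_depth is not None and len(stack) >= max_depth:
            break  # early exit: skip walking the rest of the Frame chain
        frame = frame.f_back
    return stack


# When unwinding an asyncio Task we only care about the Task's own current
# Frame, so a depth of 1 avoids materialising the asyncio runtime Frames
# that sit on top of it in the chain:
# task_top = unwind_frame(task_coroutine_frame, max_depth=1)
```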