fix(core): use unpatched lock from forksafe module #15151
Conversation
**Bootstrap import analysis**

Comparison of import times between this PR and base.

**Summary**

- The average import time from this PR is: 238 ± 2 ms.
- The average import time from base is: 240 ± 2 ms.
- The import time difference between this PR and base is: -2.2 ± 0.1 ms.

**Import time breakdown**

The following import paths have shrunk:
**Performance SLOs**

Comparing candidate taegyunkim/forksafe-lock (f7a9167) with baseline main (7bc20af)

🟡 Near SLO Breach (6 suites)

**🟡 djangosimple - 30/30**
- ✅ appsec: Time ✅ 20.516ms (SLO <22.300ms, -8.0%), vs baseline ~same; Memory ✅ 66.208MB (SLO <67.000MB, 🟡 -1.2%), vs baseline +4.8%
- ✅ exception-replay-enabled: Time ✅ 1.334ms (SLO <1.450ms, -8.0%), vs baseline -0.4%; Memory ✅ 63.947MB (SLO <67.000MB, -4.6%), vs baseline +4.4%
- ✅ iast: Time ✅ 20.436ms (SLO <22.250ms, -8.2%), vs baseline -0.2%; Memory ✅ 66.159MB (SLO <67.000MB, 🟡 -1.3%), vs baseline +5.0%
- ✅ profiler: Time ✅ 15.518ms (SLO <16.550ms, -6.2%), vs baseline ~same; Memory ✅ 53.873MB (SLO <54.500MB, 🟡 -1.2%), vs baseline +4.6%
- ✅ resource-renaming: Time ✅ 20.561ms (SLO <21.750ms, -5.5%), vs baseline +0.2%; Memory ✅ 66.249MB (SLO <67.000MB, 🟡 -1.1%), vs baseline +5.2%
- ✅ span-code-origin: Time ✅ 25.425ms (SLO <28.200ms, -9.8%), vs baseline +0.1%; Memory ✅ 67.986MB (SLO <69.500MB, -2.2%), vs baseline +5.7%
- ✅ tracer: Time ✅ 20.507ms (SLO <21.750ms, -5.7%), vs baseline +0.6%; Memory ✅ 66.047MB (SLO <67.000MB, 🟡 -1.4%), vs baseline +4.8%
- ✅ tracer-and-profiler: Time ✅ 22.757ms (SLO <23.500ms, -3.2%), vs baseline ~same; Memory ✅ 67.830MB (SLO <68.000MB, 🟡 -0.3%), vs baseline +5.0%
- ✅ tracer-dont-create-db-spans: Time ✅ 19.340ms (SLO <21.500ms, 📉 -10.0%), vs baseline ~same; Memory ✅ 66.073MB (SLO <67.000MB, 🟡 -1.4%), vs baseline +4.8%
- ✅ tracer-minimal: Time ✅ 16.615ms (SLO <17.500ms, -5.1%), vs baseline +0.1%; Memory ✅ 66.106MB (SLO <67.000MB, 🟡 -1.3%), vs baseline +5.0%
- ✅ tracer-native: Time ✅ 20.464ms (SLO <21.750ms, -5.9%), vs baseline +0.2%; Memory ✅ 67.712MB (SLO <72.500MB, -6.6%), vs baseline +4.8%
- ✅ tracer-no-caches: Time ✅ 18.538ms (SLO <19.650ms, -5.7%), vs baseline +0.5%; Memory ✅ 66.028MB (SLO <67.000MB, 🟡 -1.5%), vs baseline +4.6%
- ✅ tracer-no-databases: Time ✅ 18.774ms (SLO <20.100ms, -6.6%), vs baseline +0.2%; Memory ✅ 65.887MB (SLO <67.000MB, 🟡 -1.7%), vs baseline +4.9%
- ✅ tracer-no-middleware: Time ✅ 20.171ms (SLO <21.500ms, -6.2%), vs baseline +0.4%; Memory ✅ 66.109MB (SLO <67.000MB, 🟡 -1.3%), vs baseline +5.0%
- ✅ tracer-no-templates: Time ✅ 20.289ms (SLO <22.000ms, -7.8%), vs baseline +0.3%; Memory ✅ 66.171MB (SLO <67.000MB, 🟡 -1.2%), vs baseline +5.0%

**🟡 errortrackingdjangosimple - 6/6**
- ✅ errortracking-enabled-all: Time ✅ 17.983ms (SLO <19.850ms, -9.4%), vs baseline -0.2%; Memory ✅ 66.135MB (SLO <66.500MB, 🟡 -0.5%), vs baseline +4.7%
- ✅ errortracking-enabled-user: Time ✅ 18.067ms (SLO <19.400ms, -6.9%), vs baseline +0.2%; Memory ✅ 66.021MB (SLO <66.500MB, 🟡 -0.7%), vs baseline +4.6%
- ✅ tracer-enabled: Time ✅ 18.084ms (SLO <19.450ms, -7.0%), vs baseline ~same; Memory ✅ 65.966MB (SLO <66.500MB, 🟡 -0.8%), vs baseline +5.1%

**🟡 errortrackingflasksqli - 6/6**
- ✅ errortracking-enabled-all: Time ✅ 2.073ms (SLO <2.300ms, -9.9%), vs baseline +0.3%; Memory ✅ 52.612MB (SLO <53.500MB, 🟡 -1.7%), vs baseline +5.1%
- ✅ errortracking-enabled-user: Time ✅ 2.067ms (SLO <2.250ms, -8.1%), vs baseline -0.1%; Memory ✅ 52.514MB (SLO <53.500MB, 🟡 -1.8%), vs baseline +4.9%
- ✅ tracer-enabled: Time ✅ 2.064ms (SLO <2.300ms, 📉 -10.3%), vs baseline -0.4%; Memory ✅ 52.494MB (SLO <53.500MB, 🟡 -1.9%), vs baseline +4.8%

**🟡 flasksimple - 18/18**
- ✅ appsec-get: Time ✅ 4.590ms (SLO <4.750ms, -3.4%), vs baseline -0.4%; Memory ✅ 62.340MB (SLO <65.000MB, -4.1%), vs baseline +4.7%
- ✅ appsec-post: Time ✅ 6.636ms (SLO <6.750ms, 🟡 -1.7%), vs baseline -0.1%; Memory ✅ 62.382MB (SLO <65.000MB, -4.0%), vs baseline +4.9%
- ✅ appsec-telemetry: Time ✅ 4.588ms (SLO <4.750ms, -3.4%), vs baseline -0.7%; Memory ✅ 62.322MB (SLO <65.000MB, -4.1%), vs baseline +4.8%
- ✅ debugger: Time ✅ 1.856ms (SLO <2.000ms, -7.2%), vs baseline +0.2%; Memory ✅ 45.367MB (SLO <47.000MB, -3.5%), vs baseline +4.9%
- ✅ iast-get: Time ✅ 1.881ms (SLO <2.000ms, -5.9%), vs baseline +0.9%; Memory ✅ 42.132MB (SLO <49.000MB, 📉 -14.0%), vs baseline +4.9%
- ✅ profiler: Time ✅ 1.911ms (SLO <2.100ms, -9.0%), vs baseline -0.4%; Memory ✅ 46.708MB (SLO <47.000MB, 🟡 -0.6%), vs baseline +4.9%
- ✅ resource-renaming: Time ✅ 3.368ms (SLO <3.650ms, -7.7%), vs baseline -0.2%; Memory ✅ 52.613MB (SLO <53.500MB, 🟡 -1.7%), vs baseline +4.8%
- ✅ tracer: Time ✅ 3.352ms (SLO <3.650ms, -8.2%), vs baseline -0.3%; Memory ✅ 52.563MB (SLO <53.500MB, 🟡 -1.8%), vs baseline +4.7%
- ✅ tracer-native: Time ✅ 3.352ms (SLO <3.650ms, -8.2%), vs baseline ~same; Memory ✅ 54.118MB (SLO <60.000MB, -9.8%), vs baseline +4.8%

**🟡 flasksqli - 6/6**
- ✅ appsec-enabled: Time ✅ 3.970ms (SLO <4.200ms, -5.5%), vs baseline -0.2%; Memory ✅ 62.344MB (SLO <66.000MB, -5.5%), vs baseline +4.7%
- ✅ iast-enabled: Time ✅ 2.436ms (SLO <2.800ms, 📉 -13.0%), vs baseline +0.3%; Memory ✅ 59.297MB (SLO <60.000MB, 🟡 -1.2%), vs baseline +5.0%
- ✅ tracer-enabled: Time ✅ 2.061ms (SLO <2.250ms, -8.4%), vs baseline +0.4%; Memory ✅ 52.632MB (SLO <54.500MB, -3.4%), vs baseline +4.8%

**🟡 recursivecomputation - 8/8**
- ✅ deep: Time ✅ 309.054ms (SLO <320.950ms, -3.7%), vs baseline +0.3%; Memory ✅ 32.755MB (SLO <34.500MB, -5.1%), vs baseline +4.6%
- ✅ deep-profiled: Time ✅ 327.978ms (SLO <359.150ms, -8.7%), vs baseline +0.2%; Memory ✅ 38.586MB (SLO <39.000MB, 🟡 -1.1%), vs baseline +5.0%
- ✅ medium: Time ✅ 7.052ms (SLO <7.400ms, -4.7%), vs baseline +0.3%; Memory ✅ 31.949MB (SLO <34.000MB, -6.0%), vs baseline +4.5%
- ✅ shallow: Time ✅ 0.943ms (SLO <1.050ms, 📉 -10.2%), vs baseline +0.3%; Memory ✅ 31.988MB (SLO <34.000MB, -5.9%), vs baseline +5.1%
I'll add a test to catch scenarios like this
vlad-scherbich left a comment:
Can we add a new test that shows this scenario is fixed with the change?
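For illustration only, a regression test for this scenario could look roughly like the sketch below. It is not the test added in this PR; it assumes the profiler's patching can be simulated by swapping `threading.Lock` at runtime, and that `forksafe.Lock()` is usable as a context manager (as it is in `service.py`).

```python
import threading

from ddtrace.internal import forksafe


def test_forksafe_lock_avoids_patched_threading_lock(monkeypatch):
    # Hypothetical sketch: stand in for the profiler replacing threading.Lock
    # with an instrumented wrapper, and record every call to it.
    calls = []
    original_lock = threading.Lock

    def instrumented_lock(*args, **kwargs):
        calls.append(1)
        return original_lock(*args, **kwargs)

    monkeypatch.setattr(threading, "Lock", instrumented_lock)

    # After the fix, forksafe builds its lock from the unpatched constructor,
    # so creating and acquiring it should not route through the wrapper.
    lock = forksafe.Lock()
    with lock:
        pass

    assert calls == []
```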
## Description

Temporarily skip IAST multiprocessing tests that are failing in CI due to fork + multithreading deadlocks. Despite extensive investigation and multiple attempted fixes, these tests remain unstable in the CI environment while working perfectly locally.

## Problem Statement

Since merging commit e9582f2 (profiling test fix), several IAST multiprocessing tests began failing exclusively in CI environments, while continuing to pass reliably in local development.

### Affected Tests

- `test_subprocess_has_tracer_running_and_iast_env`
- `test_multiprocessing_with_iast_no_segfault`
- `test_multiple_fork_operations`
- `test_eval_in_forked_process`
- `test_uvicorn_style_worker_with_eval`
- `test_sequential_workers_stress_test`
- `test_direct_fork_with_eval_no_crash`

### Symptoms

**In CI:**
- Child processes hang indefinitely or crash with `exitcode=None`
- Tests that do complete are extremely slow (30-50+ seconds vs <1 second locally)
- Error: `AssertionError: child process did not exit in time`
- Telemetry recursion errors in logs: `maximum recursion depth exceeded while calling a Python object`

**Locally:**
- All tests pass reliably
- Normal execution times (<1 second per test)
- No deadlocks or hangs

**Timeline:**
- Branch 3.19: All tests work perfectly ✅
- After 4.0 merge (commit 89d69bd): Tests slow and failing ❌

## Root Cause Analysis

The issue is a **fork + multithreading deadlock**. When pytest loads ddtrace, several background services start threads:

- Remote Configuration poller
- Telemetry writer
- Profiling collectors
- Symbol Database uploader

When tests call `fork()` or create `multiprocessing.Process()` while these threads are running, child processes inherit locks in unknown states. If any background thread held a lock during fork, that lock remains permanently locked in the child, causing deadlocks.
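To make the failure mode concrete, here is a minimal, self-contained sketch of how a lock held by a background thread at fork time can hang the child. It is not taken from the test suite, the names are illustrative, and whether a given run actually deadlocks depends on timing.

```python
import os
import threading
import time

# Stand-in for a lock owned by a ddtrace background service
# (telemetry writer, remote config poller, ...).
background_lock = threading.Lock()


def background_worker():
    # Repeatedly holds the lock, like a busy background thread.
    while True:
        with background_lock:
            time.sleep(0.001)


threading.Thread(target=background_worker, daemon=True).start()
time.sleep(0.01)  # give the worker a chance to be inside its critical section

pid = os.fork()
if pid == 0:
    # Only the forking thread exists in the child, but the lock's state was
    # copied. If the worker happened to hold it at fork time, this acquire
    # never returns: the owner thread is gone and can never release it.
    background_lock.acquire()
    os._exit(0)
else:
    # The parent then sees the same symptom as CI: the child never exits.
    os.waitpid(pid, 0)
```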
**Why it fails in CI but not locally:**
- CI has more services active (coverage, CI visibility, full telemetry)
- More background threads running means a higher chance of a fork occurring while a lock is held
- Different timing characteristics in the CI environment

## Attempted Fixes

### Experiment 1: Environment Variables

```python
env={
    "DD_REMOTE_CONFIGURATION_ENABLED": "0",
    "DD_TELEMETRY_ENABLED": "0",
    "DD_PROFILING_ENABLED": "0",
    "DD_SYMBOL_DATABASE_UPLOAD_ENABLED": "0",
    "DD_TRACE_AGENT_URL": "http://localhost:9126",
    "DD_CIVISIBILITY_ITR_ENABLED": "0",
    "DD_CIVISIBILITY_FLAKY_RETRY_ENABLED": "0",
}
```

Result: ❌ Tests still hang in CI

### Experiment 2: Fixture to Disable Services

```python
@pytest.fixture(scope="module", autouse=True)
def disable_threads():
    """Disable remote config poller to prevent background threads that cause fork() deadlocks."""
    remoteconfig_poller.disable()
    telemetry_writer.disable()
    yield
```

Result: ❌ Tests still hang in CI

### Experiment 3: Combined Approach (Env Vars + Fixtures)

Applied both environment variables in riotfile.py and fixtures in conftest.py:

```python
# conftest.py
@pytest.fixture(scope="module", autouse=True)
def disable_remoteconfig_poller():
    """Disable remote config poller to prevent background threads that cause fork() deadlocks."""
    remoteconfig_poller.disable()
    yield


@pytest.fixture(autouse=True)
def clear_iast_env_vars():
    os.environ["DD_REMOTE_CONFIGURATION_ENABLED"] = "0"
    os.environ["DD_TELEMETRY_ENABLED"] = "0"
    os.environ["DD_PROFILING_ENABLED"] = "0"
    os.environ["DD_SYMBOL_DATABASE_UPLOAD_ENABLED"] = "0"
    yield
```

Result: ❌ Tests still hang in CI

### Experiment 4: Using --no-ddtrace Flag

```
command="pytest -vv --no-ddtrace --no-cov {cmdargs} tests/appsec/iast/"
```

Result: ❌ Tests still hang, telemetry recursion errors persist

### CI Error Logs

```
FAILED tests/appsec/iast/taint_tracking/test_multiprocessing_tracer_iast_env.py::test_subprocess_has_tracer_running_and_iast_env[py3.13]
AssertionError: child process did not exit in time
assert not True
 +  where True = is_alive()
 +    where is_alive = <Process name='Process-2' pid=2231 parent=2126 started daemon>.is_alive
------------------------------ Captured log call -------------------------------
DEBUG ddtrace.internal.telemetry.writer:writer.py:109 Failed to send Instrumentation Telemetry to http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry. Error: maximum recursion depth exceeded while calling a Python object
```

https://gitlab.ddbuild.io/DataDog/apm-reliability/dd-trace-py/-/jobs/1235039604

## Performance Impact

Tests that do complete in CI are dramatically slower:

| Test | Local Time | CI Time | Slowdown |
|------|------------|---------|----------|
| test_fork_with_os_fork_no_segfault | ~0.5s | 51.48s | 100x |
| test_direct_fork_with_eval_no_crash | ~0.5s | 30.75s | 60x |
| test_osspawn_variants | ~1s | 27.48s | 27x |

## Decision: Skip Tests Temporarily

After extensive investigation and multiple attempted fixes, we cannot reliably resolve this CI-specific issue. The tests work perfectly locally and in the 3.19 branch, indicating this is an environment-specific interaction introduced during the 4.0 merge.

Next Steps:
1. File an issue to track the investigation with full context
2. Consider bisecting the 4.0 merge to find the specific change
3. Investigate differences between the 3.19 and 4.0 threading models
4. Explore alternative test strategies (spawn vs fork, subprocess isolation); a rough sketch of the spawn approach follows below

## Related Issues

- Commit that triggered issues: e9582f2
- 4.0 merge commit: 89d69bd
- Related fix: #15151 (forksafe lock improvements)
- Related fix: #15140 (symdb uploader spawn limiting)
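As a rough sketch of the spawn-based strategy mentioned in next step 4 (an assumption about how such tests could be restructured, not something this PR implements): the "spawn" start method launches the child in a fresh interpreter, so it inherits no threads or lock states from the parent.

```python
import multiprocessing as mp


def child_task():
    # Runs in a brand-new interpreter under "spawn"; nothing is inherited
    # from the parent's background threads or held locks.
    print("child ok")


if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # instead of the default fork on Linux
    p = ctx.Process(target=child_task)
    p.start()
    p.join(timeout=30)
    # Trade-off: slower startup and picklable arguments required, but no
    # inherited-lock deadlocks.
    assert p.exitcode == 0
```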
Description
From a recent profiling escalation, we've seen a flamegraph where the forksafe lock calls into the patched profiling lock.

The library version in the above flamegraph is 3.17.0.
The leaf frame is in `dd-trace-py/ddtrace/internal/service.py`, lines 57 to 61 at 7451e84.
`self._service_lock` is declared as a `forksafe.Lock()` at `dd-trace-py/ddtrace/internal/service.py`, line 33 in 7451e84.
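The gist of the fix, sketched below under the assumption that the profiler swaps out `threading.Lock` at runtime (this is an illustration, not the exact diff in this PR): the forksafe module keeps a reference to the original lock constructor and builds its own locks from it, so internal synchronization such as `self._service_lock` stays off the profiler-instrumented code path.

```python
import threading

# Captured at import time, before any profiler patching of threading.Lock,
# so locks built here bypass the lock-profiling wrapper.
_original_lock = threading.Lock


def unpatched_lock():
    # Illustrative helper: what forksafe.Lock() conceptually hands back after
    # this change, an ordinary, uninstrumented threading lock.
    return _original_lock()
```

With this in place, `self._service_lock = forksafe.Lock()` in `service.py` should return an uninstrumented lock even while the profiler's lock collector is active.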
Testing
Risks
Additional Notes