chore: limit symdb uploaders under spawn #15140
Conversation
**Bootstrap import analysis**

Comparison of import times between this PR and base.

**Summary**

- The average import time from this PR is: 245 ± 5 ms.
- The average import time from base is: 250 ± 10 ms.
- The import time difference between this PR and base is: -9.3 ± 0.5 ms.

**Import time breakdown**

The following import paths have appeared:
**Performance SLOs**

Comparing candidate refactor/symdb-cap-uploads-via-ipc (2030061) with baseline main (81acf6d).

🟡 Near SLO Breach (5 suites)

**🟡 djangosimple - 30/30**

| Scenario | Time (vs SLO) | Time vs baseline | Memory (vs SLO) | Memory vs baseline |
|---|---|---|---|---|
| ✅ appsec | ✅ 20.457ms (SLO <22.300ms, -8.3%) | +0.1% | ✅ 66.032MB (SLO <67.000MB, 🟡 -1.4%) | +4.9% |
| ✅ exception-replay-enabled | ✅ 1.341ms (SLO <1.450ms, -7.5%) | +0.2% | ✅ 64.295MB (SLO <67.000MB, -4.0%) | +5.0% |
| ✅ iast | ✅ 20.453ms (SLO <22.250ms, -8.1%) | -0.1% | ✅ 65.998MB (SLO <67.000MB, 🟡 -1.5%) | +4.8% |
| ✅ profiler | ✅ 15.556ms (SLO <16.550ms, -6.0%) | +0.2% | ✅ 54.009MB (SLO <54.500MB, 🟡 -0.9%) | +4.7% |
| ✅ resource-renaming | ✅ 20.538ms (SLO <21.750ms, -5.6%) | -0.1% | ✅ 66.034MB (SLO <67.000MB, 🟡 -1.4%) | +4.8% |
| ✅ span-code-origin | ✅ 25.378ms (SLO <28.200ms, 📉 -10.0%) | -0.3% | ✅ 67.583MB (SLO <69.500MB, -2.8%) | +5.1% |
| ✅ tracer | ✅ 20.492ms (SLO <21.750ms, -5.8%) | ~same | ✅ 65.953MB (SLO <67.000MB, 🟡 -1.6%) | +4.7% |
| ✅ tracer-and-profiler | ✅ 22.684ms (SLO <23.500ms, -3.5%) | -0.3% | ✅ 67.850MB (SLO <68.000MB, 🟡 -0.2%) | +4.6% |
| ✅ tracer-dont-create-db-spans | ✅ 19.287ms (SLO <21.500ms, 📉 -10.3%) | -0.7% | ✅ 66.081MB (SLO <67.000MB, 🟡 -1.4%) | +4.8% |
| ✅ tracer-minimal | ✅ 16.668ms (SLO <17.500ms, -4.8%) | +0.4% | ✅ 66.031MB (SLO <67.000MB, 🟡 -1.4%) | +4.9% |
| ✅ tracer-native | ✅ 20.497ms (SLO <21.750ms, -5.8%) | +0.4% | ✅ 67.987MB (SLO <72.500MB, -6.2%) | +4.8% |
| ✅ tracer-no-caches | ✅ 18.440ms (SLO <19.650ms, -6.2%) | -0.5% | ✅ 66.176MB (SLO <67.000MB, 🟡 -1.2%) | +5.1% |
| ✅ tracer-no-databases | ✅ 18.756ms (SLO <20.100ms, -6.7%) | -0.3% | ✅ 65.814MB (SLO <67.000MB, 🟡 -1.8%) | +4.7% |
| ✅ tracer-no-middleware | ✅ 20.139ms (SLO <21.500ms, -6.3%) | +0.1% | ✅ 66.124MB (SLO <67.000MB, 🟡 -1.3%) | +4.8% |
| ✅ tracer-no-templates | ✅ 20.281ms (SLO <22.000ms, -7.8%) | ~same | ✅ 66.163MB (SLO <67.000MB, 🟡 -1.2%) | +4.9% |

**🟡 errortrackingdjangosimple - 6/6**

| Scenario | Time (vs SLO) | Time vs baseline | Memory (vs SLO) | Memory vs baseline |
|---|---|---|---|---|
| ✅ errortracking-enabled-all | ✅ 18.062ms (SLO <19.850ms, -9.0%) | -0.2% | ✅ 66.071MB (SLO <66.500MB, 🟡 -0.6%) | +4.9% |
| ✅ errortracking-enabled-user | ✅ 18.097ms (SLO <19.400ms, -6.7%) | +0.1% | ✅ 66.037MB (SLO <66.500MB, 🟡 -0.7%) | +4.9% |
| ✅ tracer-enabled | ✅ 18.092ms (SLO <19.450ms, -7.0%) | +0.1% | ✅ 66.023MB (SLO <66.500MB, 🟡 -0.7%) | +5.0% |

**🟡 errortrackingflasksqli - 6/6**

| Scenario | Time (vs SLO) | Time vs baseline | Memory (vs SLO) | Memory vs baseline |
|---|---|---|---|---|
| ✅ errortracking-enabled-all | ✅ 2.068ms (SLO <2.300ms, 📉 -10.1%) | +0.4% | ✅ 52.534MB (SLO <53.500MB, 🟡 -1.8%) | +4.8% |
| ✅ errortracking-enabled-user | ✅ 2.067ms (SLO <2.250ms, -8.2%) | -0.5% | ✅ 52.534MB (SLO <53.500MB, 🟡 -1.8%) | +4.7% |
| ✅ tracer-enabled | ✅ 2.063ms (SLO <2.300ms, 📉 -10.3%) | -0.3% | ✅ 52.671MB (SLO <53.500MB, 🟡 -1.5%) | +4.9% |

**🟡 flasksimple - 18/18**

| Scenario | Time (vs SLO) | Time vs baseline | Memory (vs SLO) | Memory vs baseline |
|---|---|---|---|---|
| ✅ appsec-get | ✅ 4.599ms (SLO <4.750ms, -3.2%) | +0.2% | ✅ 62.345MB (SLO <65.000MB, -4.1%) | +4.8% |
| ✅ appsec-post | ✅ 6.640ms (SLO <6.750ms, 🟡 -1.6%) | ~same | ✅ 62.322MB (SLO <65.000MB, -4.1%) | +4.9% |
| ✅ appsec-telemetry | ✅ 4.594ms (SLO <4.750ms, -3.3%) | ~same | ✅ 62.347MB (SLO <65.000MB, -4.1%) | +4.8% |
| ✅ debugger | ✅ 1.862ms (SLO <2.000ms, -6.9%) | -0.2% | ✅ 45.349MB (SLO <47.000MB, -3.5%) | +5.2% |
| ✅ iast-get | ✅ 1.858ms (SLO <2.000ms, -7.1%) | ~same | ✅ 42.064MB (SLO <49.000MB, 📉 -14.2%) | +4.6% |
| ✅ profiler | ✅ 1.915ms (SLO <2.100ms, -8.8%) | +0.3% | ✅ 46.714MB (SLO <47.000MB, 🟡 -0.6%) | +4.9% |
| ✅ resource-renaming | ✅ 3.362ms (SLO <3.650ms, -7.9%) | -0.2% | ✅ 52.635MB (SLO <53.500MB, 🟡 -1.6%) | +4.9% |
| ✅ tracer | ✅ 3.353ms (SLO <3.650ms, -8.1%) | ~same | ✅ 52.567MB (SLO <53.500MB, 🟡 -1.7%) | +4.8% |
| ✅ tracer-native | ✅ 3.353ms (SLO <3.650ms, -8.1%) | +0.1% | ✅ 54.360MB (SLO <60.000MB, -9.4%) | +4.9% |

**🟡 flasksqli - 6/6**

| Scenario | Time (vs SLO) | Time vs baseline | Memory (vs SLO) | Memory vs baseline |
|---|---|---|---|---|
| ✅ appsec-enabled | ✅ 3.961ms (SLO <4.200ms, -5.7%) | -0.3% | ✅ 62.344MB (SLO <66.000MB, -5.5%) | +4.9% |
| ✅ iast-enabled | ✅ 2.449ms (SLO <2.800ms, 📉 -12.5%) | ~same | ✅ 58.825MB (SLO <60.000MB, 🟡 -2.0%) | +4.9% |
| ✅ tracer-enabled | ✅ 2.059ms (SLO <2.250ms, -8.5%) | -0.7% | ✅ 52.632MB (SLO <54.500MB, -3.4%) | +4.9% |
release note?

This is just an implementation detail.
## Description

Temporarily skip IAST multiprocessing tests that are failing in CI due to fork + multithreading deadlocks. Despite extensive investigation and multiple attempted fixes, these tests remain unstable in the CI environment while working perfectly locally.

## Problem Statement

Since merging commit e9582f2 (profiling test fix), several IAST multiprocessing tests began failing exclusively in CI environments, while continuing to pass reliably in local development.

### Affected Tests

- `test_subprocess_has_tracer_running_and_iast_env`
- `test_multiprocessing_with_iast_no_segfault`
- `test_multiple_fork_operations`
- `test_eval_in_forked_process`
- `test_uvicorn_style_worker_with_eval`
- `test_sequential_workers_stress_test`
- `test_direct_fork_with_eval_no_crash`

### Symptoms

**In CI:**
- Child processes hang indefinitely or crash with `exitcode=None`
- Tests that do complete are extremely slow (30-50+ seconds vs <1 second locally)
- Error: `AssertionError: child process did not exit in time`
- Telemetry recursion errors in logs: `maximum recursion depth exceeded while calling a Python object`

**Locally:**
- All tests pass reliably
- Normal execution times (<1 second per test)
- No deadlocks or hangs

**Timeline:**
- Branch 3.19: All tests work perfectly ✅
- After 4.0 merge (commit 89d69bd): Tests slow and failing ❌

## Root Cause Analysis

The issue is a **fork + multithreading deadlock**. When pytest loads ddtrace, several background services start threads:

- Remote Configuration poller
- Telemetry writer
- Profiling collectors
- Symbol Database uploader

When tests call `fork()` or create `multiprocessing.Process()` while these threads are running, child processes inherit locks in unknown states. If any background thread held a lock during fork, that lock remains permanently locked in the child, causing deadlocks.

**Why it fails in CI but not locally:**
- CI has more services active (coverage, CI visibility, full telemetry)
- More background threads running means a higher chance of fork occurring while a lock is held
- Different timing characteristics in the CI environment
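The failure mode is easy to reproduce outside ddtrace. Below is a minimal, POSIX-only toy reproducer, not taken from the test suite: a daemon thread cycles a lock, and a `fork()` that lands while the lock is held leaves the child's copy locked forever (CPython 3.12+ even emits a `DeprecationWarning` for forking a multi-threaded process for exactly this reason).

```python
import os
import threading
import time

lock = threading.Lock()


def background_worker():
    # Stand-in for a ddtrace background service (telemetry writer,
    # RC poller, ...): it holds a lock most of the time while working.
    while True:
        with lock:
            time.sleep(0.01)  # fork() is very likely to land in this window


threading.Thread(target=background_worker, daemon=True).start()
time.sleep(0.1)  # let the worker thread spin up

pid = os.fork()
if pid == 0:
    # Only the forking thread survives in the child. If the worker held
    # `lock` at fork time, it stays locked forever here: deadlock.
    with lock:
        pass  # usually never reached
    os._exit(0)
os.waitpid(pid, 0)  # the parent hangs whenever the child deadlocked
```

The same mechanics apply to any lock owned by the tracer's background threads at fork time; which lock gets caught is a matter of timing, which is why the failures are environment-dependent.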
## Attempted Fixes

### Experiment 1: Environment Variables

```python
env={
    "DD_REMOTE_CONFIGURATION_ENABLED": "0",
    "DD_TELEMETRY_ENABLED": "0",
    "DD_PROFILING_ENABLED": "0",
    "DD_SYMBOL_DATABASE_UPLOAD_ENABLED": "0",
    "DD_TRACE_AGENT_URL": "http://localhost:9126",
    "DD_CIVISIBILITY_ITR_ENABLED": "0",
    "DD_CIVISIBILITY_FLAKY_RETRY_ENABLED": "0",
}
```

Result: ❌ Tests still hang in CI

### Experiment 2: Fixture to Disable Services

```python
@pytest.fixture(scope="module", autouse=True)
def disable_threads():
    """Disable remote config poller to prevent background threads that cause fork() deadlocks."""
    remoteconfig_poller.disable()
    telemetry_writer.disable()
    yield
```

Result: ❌ Tests still hang in CI

### Experiment 3: Combined Approach (Env Vars + Fixtures)

Applied both environment variables in riotfile.py and fixtures in conftest.py:

```python
# conftest.py
@pytest.fixture(scope="module", autouse=True)
def disable_remoteconfig_poller():
    """Disable remote config poller to prevent background threads that cause fork() deadlocks."""
    remoteconfig_poller.disable()
    yield


@pytest.fixture(autouse=True)
def clear_iast_env_vars():
    os.environ["DD_REMOTE_CONFIGURATION_ENABLED"] = "0"
    os.environ["DD_TELEMETRY_ENABLED"] = "0"
    os.environ["DD_PROFILING_ENABLED"] = "0"
    os.environ["DD_SYMBOL_DATABASE_UPLOAD_ENABLED"] = "0"
    yield
```

Result: ❌ Tests still hang in CI

### Experiment 4: Using --no-ddtrace Flag

```
command="pytest -vv --no-ddtrace --no-cov {cmdargs} tests/appsec/iast/"
```

Result: ❌ Tests still hang, telemetry recursion errors persist

## CI Error Logs

```
FAILED tests/appsec/iast/taint_tracking/test_multiprocessing_tracer_iast_env.py::test_subprocess_has_tracer_running_and_iast_env[py3.13]
AssertionError: child process did not exit in time
assert not True
 +  where True = is_alive()
 +    where is_alive = <Process name='Process-2' pid=2231 parent=2126 started daemon>.is_alive
------------------------------ Captured log call -------------------------------
DEBUG ddtrace.internal.telemetry.writer:writer.py:109 Failed to send Instrumentation Telemetry to http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry. Error: maximum recursion depth exceeded while calling a Python object
```

https://gitlab.ddbuild.io/DataDog/apm-reliability/dd-trace-py/-/jobs/1235039604

## Performance Impact

Tests that do complete in CI are dramatically slower:

| Test | Local Time | CI Time | Slowdown |
|-------------------------------------|------------|---------|----------|
| test_fork_with_os_fork_no_segfault | ~0.5s | 51.48s | 100x |
| test_direct_fork_with_eval_no_crash | ~0.5s | 30.75s | 60x |
| test_osspawn_variants | ~1s | 27.48s | 27x |

## Decision: Skip Tests Temporarily

After extensive investigation and multiple attempted fixes, we cannot reliably resolve this CI-specific issue. The tests work perfectly locally and in the 3.19 branch, indicating this is an environment-specific interaction introduced during the 4.0 merge.

**Next Steps:**
1. File an issue to track the investigation with full context
2. Consider bisecting the 4.0 merge to find the specific change
3. Investigate differences between the 3.19 and 4.0 threading models
4. Explore alternative test strategies (spawn vs fork, subprocess isolation); see the sketch after this list
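As a concrete illustration of the spawn strategy in item 4, here is a hedged sketch, assuming the tests can tolerate spawn's fresh-interpreter semantics (module-level state is re-imported rather than inherited). The final assertion mirrors the failing check quoted in the CI logs above.

```python
import multiprocessing as mp


def child_task():
    # A "spawn" child is a fresh interpreter: it re-imports modules
    # instead of inheriting the parent's threads and lock states,
    # so a lock held by a parent background thread cannot deadlock it.
    print("child ran cleanly")


if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # instead of the default "fork" on older Linux Pythons
    p = ctx.Process(target=child_task)
    p.start()
    p.join(timeout=30)
    assert not p.is_alive(), "child process did not exit in time"
```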
## Related Issues

- Commit that triggered issues: e9582f2
- 4.0 merge commit: 89d69bd
- Related fix: #15151 (forksafe lock improvements)
- Related fix: #15140 (symdb uploader spawn limiting)
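#15151 is only named here, so for context: the standard CPython mechanism behind fork-safe locks is `os.register_at_fork` (Python 3.7+). The sketch below is hypothetical and is not ddtrace's actual forksafe code; `_service_lock` and the hook name are illustrative.

```python
import os
import threading

_service_lock = threading.Lock()


def _reset_locks_after_fork_in_child():
    # The child cannot know which parent thread held the lock, so the
    # safe reset is to replace it with a fresh, unlocked instance.
    global _service_lock
    _service_lock = threading.Lock()


# CPython runs this hook in the child immediately after fork(), before
# user code resumes, so the lock is usable again by the time it matters.
os.register_at_fork(after_in_child=_reset_locks_after_fork_in_child)
```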
Description
We use file-based IPC to ensure that Symbol DB has at most 2 active uploader processes under more general circumstances than fork, such as spawn.
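The description names the mechanism (file-based IPC) but not its shape. One common way to cap concurrent processes across both fork and spawn is a fixed set of slot files guarded by advisory `flock()` locks; the sketch below is a hypothetical, POSIX-only illustration, not the code from this PR, and `try_claim_uploader_slot`, `MAX_UPLOADERS`, and the slot paths are invented names.

```python
import fcntl
import os
import tempfile

MAX_UPLOADERS = 2  # illustrative cap, matching the "at most 2" in the description


def try_claim_uploader_slot():
    """Try to claim one of MAX_UPLOADERS slot files via a non-blocking
    flock(). Returns the open slot file (keep it referenced: closing it
    releases the slot) or None if all slots are busy."""
    slot_dir = os.path.join(tempfile.gettempdir(), "symdb-uploader-slots")
    os.makedirs(slot_dir, exist_ok=True)
    for i in range(MAX_UPLOADERS):
        f = open(os.path.join(slot_dir, "slot-%d" % i), "a")
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return f  # slot i is ours until this process exits or closes it
        except OSError:
            f.close()  # slot busy; try the next one
    return None


if __name__ == "__main__":
    slot = try_claim_uploader_slot()
    if slot is None:
        print("2 uploaders already active; not starting another")
```

An advisory file lock has exactly the property the description needs: it is keyed by the filesystem rather than by inherited process state, so it constrains spawned children just as well as forked ones, and the kernel releases it when the holding process exits, so a crashed uploader never leaks a slot.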