Skip to content

fix(grpc): leaking spans when using gRPC future interface #13847

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

samschlegel
Copy link

@samschlegel samschlegel commented Jul 1, 2025

We've tracked down some issues related to broken span hierarchy down to the way that the gRPC instrumentation deals with entering spans, which causes the grpc span to leak outside it's call when using the .future() interface. This PR changes the instrumentation to explicitly only enter the span when calling the continuations instead of just entering immediately.

This has been broken for a while so will likely need to be backported to a bunch of versions. It doesn't seem like I have permissions to add the labels so I'll defer to someone who can there.

I've added tests, but haven't run them for a while and working on getting it running locally. Will update if I run into any issues

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

Copy link
Member

@brettlangdon brettlangdon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you will need to remove ddtrace/internal/datadog/profiling/_ddup.cpp from the PR.

@brettlangdon
Copy link
Member

oh, also, are you able to sign your commits? we have new merge requirements that commits are signed.

if not, once the PR is ready, I can always squash and sign them myself so we can get it merged (I'd prefer for you to get the contribution credit though)

@samschlegel
Copy link
Author

you will need to remove ddtrace/internal/datadog/profiling/_ddup.cpp from the PR.

Oh interesting. Should that be gitignored and I missed it or maybe it was previously from an old branch?

oh, also, are you able to sign your commits? we have new merge requirements that commits are signed.

if not, once the PR is ready, I can always squash and sign them myself so we can get it merged (I'd prefer for you to get the contribution credit though)

Can do!

@brettlangdon
Copy link
Member

Hey, since I don't have permissions to push to your remote, this patch should fix the precheck/linting issues:

From 46ccd644e4b06bc223e41b81117d1876002b171b Mon Sep 17 00:00:00 2001
From: brettlangdon <brett.langdon@datadoghq.com>
Date: Thu, 3 Jul 2025 08:50:43 -0400
Subject: [PATCH] fix linting

---
 tests/contrib/grpc/test_grpc.py | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/tests/contrib/grpc/test_grpc.py b/tests/contrib/grpc/test_grpc.py
index fa02d4300..5404059ee 100644
--- a/tests/contrib/grpc/test_grpc.py
+++ b/tests/contrib/grpc/test_grpc.py
@@ -612,12 +612,12 @@ class GrpcTestCase(GrpcBaseTestCase):
         client_span, server_span = spans
         assert client_span.name == "grpc.client.request", "Expected 'grpc.client.request', got %s" % client_span.name
         assert server_span.name == "grpc.server.request", "Expected 'grpc.server.request', got %s" % server_span.name
-    
+
     def test_span_parent_is_maintained(self):
-        with self.tracer.trace('root') as span:
+        with self.tracer.trace("root") as _:
             with grpc.insecure_channel("localhost:%d" % (_GRPC_PORT)) as channel:
                 stub = HelloStub(channel)
-                result = stub.SayHello(HelloRequest(name="test"))
+                _ = stub.SayHello(HelloRequest(name="test"))
 
         spans = self.get_spans_with_sync_and_assert(size=3, should_retry=True)
         root = spans[0]
@@ -651,7 +651,6 @@ class GrpcTestCase(GrpcBaseTestCase):
         assert client_span.get_tag("custom_interceptor_worked")
 
 
-
 class _CustomException(Exception):
     pass
 
@@ -726,6 +725,7 @@ def test_method_service(patch_grpc):
     finally:
         server.stop(None)
 
+
 class _SpanActivatationClientInterceptor(grpc.UnaryUnaryClientInterceptor):
     def __init__(self, tracer) -> None:
         super().__init__()
@@ -741,4 +741,4 @@ class _SpanActivatationClientInterceptor(grpc.UnaryUnaryClientInterceptor):
         return continuation(client_call_details, request_or_iterator)
 
     def intercept_unary_unary(self, continuation, client_call_details, request):
-        return self._intercept_call(continuation, client_call_details, request)
\ No newline at end of file
+        return self._intercept_call(continuation, client_call_details, request)
-- 
2.48.1

@brettlangdon
Copy link
Member

@samschlegel if you are able to apply the patch I provided, we can make sure this change keeps moving along.

@samschlegel
Copy link
Author

ah woops, lost track of this!

@samschlegel samschlegel force-pushed the samschlegel/grpc-future-span-leak-fix branch from 2c8d91c to 21e6d96 Compare July 8, 2025 15:16
@samschlegel
Copy link
Author

@brettlangdon I see there were are also failures in several tests related to the agent not being able to start up. I'm assuming that's probably not related to my change?

Also in regards to the relnote, would you have a better way to refer to this besides "span leaking out to the caller"? I can't find any notes about similar issues but would like to be consistent with language here.

@brettlangdon
Copy link
Member

@brettlangdon I see there were are also failures in several tests related to the agent not being able to start up. I'm assuming that's probably not related to my change?

Also in regards to the relnote, would you have a better way to refer to this besides "span leaking out to the caller"? I can't find any notes about similar issues but would like to be consistent with language here.

Thanks for making the change. I'll keep an eye on the CI now and see what is related to your change vs not, and I can make a suggestion for a release note as well

Copy link
Member

@brettlangdon brettlangdon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few tests are failing in CI for gRPC tests.

If you want to run them locally:

> docker compose up -d testagent
> export DD_TRACE_AGENT_URL=http://localhost:9126
> ./scripts/ddtest
$ riot -v run --pass-env 10bb064

After the first run of a given hash, you can do riot -v run -s --pass-env 10bb064 which will skip trying to install the dev package.

That random hash thing is from riot list grpc, grabbing just a single venv/scenario to run instead of doing riot run grpc which will run all of them.

@samschlegel
Copy link
Author

Ah yep, some things left over from a bad refactor. Thanks for the notes re: testing. I had tried getting it to run before putting it up this time but ended up getting stuck on something. Will take a look

Co-authored-by: Brett Langdon <me@brett.is>
@samschlegel
Copy link
Author

samschlegel commented Jul 8, 2025

ah I remember what it was. got hit with moby/moby#45757 when trying to run the testrunner image

failed to register layer: failed to Lchown "/usr/local/lib/node_modules/@datadog/datadog-ci/node_modules/buffer-equal-constant-time/.npmignore" for UID 718322462, GID 454177323 (try increasing the number of sub
ordinate IDs in /etc/subuid and /etc/subgid): lchown /usr/local/lib/node_modules/@datadog/datadog-ci/node_modules/buffer-equal-constant-time/.npmignore: invalid argument

I'm doing this dev inside a container, so it's probably that I'd have to adjust my namespace remapping on my host 🙃 Seems like if I build the testrunnner image myself it might work so crossing my fingies in hopes I don't need to make subuid/subgid changes on this otherwise beautiful day

@samschlegel
Copy link
Author

samschlegel commented Jul 8, 2025

There's one failing test remaining that I can't seem to track down the cause for.

______________________________________________________________________________________ GrpcTestCase.test_priority_sampling _______________________________________________________________________________________                                                    
self = <tests.contrib.grpc.test_grpc.GrpcTestCase testMethod=test_priority_sampling>
                                                                                                                                                                                                                  
    def test_priority_sampling(self):    
        # DEV: Priority sampling is enabled by default
        # Setting priority sampling reset the writer, we need to re-override it
     
        with grpc.insecure_channel("127.0.0.1:%d" % (_GRPC_PORT)) as channel:
            stub = HelloStub(channel)                                                                    
            response = stub.SayHello(HelloRequest(name="propogator"))  
                                                                                                         
        spans = self.get_spans_with_sync_and_assert(size=2)
        client_span, server_span = spans
        assert f"x-datadog-trace-id={str(client_span._trace_id_64bits)}" in response.message
        assert f"_dd.p.tid={_get_64_highest_order_bits_as_hex(client_span.trace_id)}" in response.message 
        assert "x-datadog-parent-id={}".format(client_span.span_id) in response.message
>       assert "x-datadog-sampling-priority=1" in response.message
E       assert 'x-datadog-sampling-priority=1' in 'x-datadog-trace-id=13107238161608788184;x-datadog-parent-id=1499394788139886997;x-datadog-tags=_dd.p.tid=686d65f600000000'
E        +  where 'x-datadog-trace-id=13107238161608788184;x-datadog-parent-id=1499394788139886997;x-datadog-tags=_dd.p.tid=686d65f600000000' = message: "x-datadog-trace-id=13107238161608788184;x-datadog-parent
-id=1499394788139886997;x-datadog-tags=_dd.p.tid=686d65f600000000"\n.message
                                                    
tests/contrib/grpc/test_grpc.py:356: AssertionError 

As far as I can tell, even if I create a span in the beginning of the test and check it's context, the sampling priority is set to None. Not sure how my changes would have led to this unless it happened to uncover some other bug 🤔

EDIT: perhaps something isn't being finished now that was before?

@samschlegel
Copy link
Author

samschlegel commented Jul 8, 2025

Aha! The span needed to be passed into HTTPPropagator.inject since it's not active. Kind of a weird API but 🤷 Does seem like there should be a warning if the propagator is called on a context with no active span, and it also seems like you could call it on a span context with an unrelated active span with no warning 🤔

Things should hopefully all be good now, at least the tests pass locally

@samschlegel samschlegel requested a review from brettlangdon July 8, 2025 19:50
@samschlegel
Copy link
Author

maybe something like this could cover that issue. can pull out into another issue/PR later

diff --git a/ddtrace/propagation/http.py b/ddtrace/propagation/http.py
index 7f2a48fad..c43466e8d 100644
--- a/ddtrace/propagation/http.py
+++ b/ddtrace/propagation/http.py
@@ -1067,6 +1067,16 @@ class HTTPPropagator(object):
             else:
                 root_span = core.tracer.current_root_span()
 
+                if root_span is not None and root_span.context is not span_context:
+                    log.error(
+                        "span_context and current_root_span.context are not the same, but should be if non_active_span is "
+                        "not passed in. current_root_span.context will be used to generate distributed tracing headers. "
+                        "span_context: {}, current active context: {}",
+                        span_context,
+                        root_span.context,
+                    )
+                    span_context = root_span.context
+
             if root_span is not None and root_span.context.sampling_priority is None:
                 core.tracer.sample(root_span)
         else:

Copy link
Member

@brettlangdon brettlangdon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @samschlegel since your latest commit isn't signed I wasn't able to clone as-is to #13870 to get the CI to run, so I just squash everything to run there (meaning the workflow statuses won't update on this PR since the commit is different). once the PR is all green we can just squash all the commits on this branch and make sure it is signed, then should be ok.

I'll keep you posted on the outcome of the CI run.

these changes look good, other than we need a release note

we have docs here on how to generate/write a release note: https://ddtrace.readthedocs.io/en/stable/releasenotes.html

Co-authored-by: Brett Langdon <me@brett.is>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants