Add reader/writer dynamic instrumentation configuration manager, and end-to-end tests #29454
Conversation
You need to add DI paths to this list so that the CI detects the changes:
https://github.com/DataDog/datadog-agent/blob/main/.gitlab-ci.yml#L740
Gitlab CI Configuration Changes

Removed | Modified | Added | Renamed
---|---|---|---
0 | 25 | 0 | 0

ℹ️ Diff available in the job log.
Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv create-vm --pipeline-id=45884912 --os-family=ubuntu

Note: This applies to commit 63e1573
Are the tests that I added running in KMT? Some of the tests are designed to fail, but I don't see anything.
My tests are being triggered to run now, but the job can't find the Go binary for compiling the testing sample service. How does this work for USM or other places where Go is invoked?
Go Package Import Differences

Baseline: 70ded07
/trigger-ci --variable RUN_ALL_BUILDS=true --variable RUN_KITCHEN_TESTS=true --variable RUN_E2E_TESTS=on --variable RUN_UNIT_TESTS=on --variable RUN_KMT_TESTS=on
🚨 Devflow error: Post "https://gitlab.ddbuild.io/api/v4/projects/DataDog%2Fdatadog-agent/pipeline": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

If you need support, contact us on Slack #devflow with those details!
/trigger-ci --variable RUN_ALL_BUILDS=true --variable RUN_KITCHEN_TESTS=true --variable RUN_E2E_TESTS=on --variable RUN_UNIT_TESTS=on --variable RUN_KMT_TESTS=on

🚂 Gitlab pipeline started: pipeline #45217774
[Fast Unit Tests Report] On pipeline 45884912 (CI Visibility). The following jobs did not run any unit tests.

If you modified Go files and expected unit tests to run in these jobs, please double-check the job logs. If you think tests should have been executed, reach out to #agent-devx-help.
I have one failure due to rlimit memlock. Is it safe to remove the memory lock for this case?
Regression Detector Results

Run ID: f9b4ece5-48ba-43b9-b66e-be90fcde77d5
Baseline: 70ded07
Links: Metrics dashboard, Target profiles

Performance changes are noted in the perf column of each table.

No significant changes in experiment optimization goals

Confidence level: 90.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.
perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links
---|---|---|---|---|---|---
➖ | otel_to_otel_logs | ingress throughput | +1.77 | [+0.96, +2.59] | 1 | Logs
➖ | pycheck_lots_of_tags | % cpu utilization | +0.77 | [-1.77, +3.30] | 1 | Logs
➖ | tcp_syslog_to_blackhole | ingress throughput | +0.20 | [+0.16, +0.24] | 1 | Logs
➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.00 | [-0.01, +0.01] | 1 | Logs
➖ | uds_dogstatsd_to_api | ingress throughput | -0.01 | [-0.11, +0.08] | 1 | Logs
➖ | idle | memory utilization | -0.35 | [-0.39, -0.30] | 1 | Logs
➖ | idle_all_features | memory utilization | -0.37 | [-0.46, -0.27] | 1 | Logs
➖ | basic_py_check | % cpu utilization | -1.03 | [-3.72, +1.67] | 1 | Logs
➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | -1.09 | [-1.82, -0.36] | 1 | Logs
➖ | file_tree | memory utilization | -1.39 | [-1.51, -1.27] | 1 | Logs
Bounds Checks

perf | experiment | bounds_check_name | replicates_passed
---|---|---|---
✅ | idle | memory_usage | 10/10
Explanation
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we flag a change in performance as a "regression" -- a change worth investigating further -- if all of the following criteria are true:

- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
```go
func TestGoDI(t *testing.T) {
	if err := rlimit.RemoveMemlock(); err != nil {
```
If the test will fail without this, I'd suggest require.NoError(t, rlimit.RemoveMemlock())
You are calling it twice now? You just need the single line require.NoError(t, rlimit.RemoveMemlock())
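For illustration, a minimal sketch of the suggested pattern, assuming the testify require package and cilium/ebpf's rlimit helper (the rest of the test body is elided):

```go
package dynamicinstrumentation_test

import (
	"testing"

	"github.com/cilium/ebpf/rlimit"
	"github.com/stretchr/testify/require"
)

func TestGoDI(t *testing.T) {
	// Fail fast: the eBPF programs loaded later in this test cannot be
	// attached if the memlock rlimit is still in place, so there is no
	// point continuing when removal fails.
	require.NoError(t, rlimit.RemoveMemlock())

	// ... rest of the test ...
}
```

require.NoError both asserts and aborts the test on failure, replacing the manual if err != nil branch with a single call.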
```go
// Read the configuration via the config manager
_, err := cm.ConfigReader.Read(buf.Bytes())
time.Sleep(time.Second * 5)
```
I'm curious why this sleep is here?
When we call ConfigReader.Read(), this kicks off the process of DI reading the configuration, then reading the DWARF information of the binary, generating the bpf code, compiling it, and attaching. The events are then read from the ringbuffer asynchronously, so the sleep gives it time to do all that. Five seconds is likely overly generous, but we do need a sleep.
Not necessary for this PR, but I'd suggest trying to find something to wait on instead. Perhaps something in eventOutputWriter?
Makes sense. I started playing with this and realized two things:

- We don't need to sleep while waiting for DI to start up; the DI entry point function returns once things are set up.
- There is, however, a sort of race condition when a new config is read: the expected result is updated, but there could still be events in the ringbuffer from the previously instrumented function.

So the diagnostics system seems like the perfect place for creating statuses that can be waited on instead of sleeping (see the sketch below). When a new probe gets updated, and when a previous one gets uninstalled, we can change what the expected result is. We can also use it for confirming that events are received.

That said, I think let's do that in a subsequent PR. I changed the sleep logic to fix it for now.
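As an illustration of waiting on a signal instead of sleeping, here is a hypothetical helper; the eventReceived channel and its wiring into the event output writer are assumptions for the sketch, not the PR's actual API:

```go
package dynamicinstrumentation_test

import (
	"testing"
	"time"
)

// waitForEvent blocks until the event writer signals that an event was
// consumed, or fails the test after a timeout. eventReceived stands in
// for a hypothetical chan struct{} that the writer would send on (or
// close) once a sample for the newly installed probe has been read.
func waitForEvent(t *testing.T, eventReceived <-chan struct{}) {
	t.Helper()
	select {
	case <-eventReceived:
		// Safe to assert on the captured output now.
	case <-time.After(10 * time.Second):
		t.Fatal("timed out waiting for dynamic instrumentation event")
	}
}
```

Unlike a fixed sleep, this fails fast when no event ever arrives and returns immediately when one does.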
Force-pushed from e970a78 to 099089c
Approving the dynamic instrumentation changes: the new config managers are only used in tests and when running in offline mode. I'm not familiar with the KMT test setup, but I've seen that the new tests pass in KMT runs.
```go
opts := &dynamicinstrumentation.DIOptions{
	RateLimitPerProbePerSecond: 0.0,
```
Why is this zero?
There's no need to rate limit for testing; this means we'll collect events whenever they're produced and can check that they're as expected.
The rate limit shouldn't affect the tests, though; they should still pass. I guess resources are not a problem, and we test one probe at a time, so it's fine to leave rate limiting turned off, but we might want to use the same values in tests as in the expected user configuration.
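For comparison, a sketch of what a more user-like configuration might look like; the 1.0 events-per-second figure is purely illustrative, not a documented default:

```go
opts := &dynamicinstrumentation.DIOptions{
	// A positive per-probe budget mirrors what a real deployment would
	// set; 0.0 (as in the test above) disables rate limiting entirely so
	// every produced event can be asserted on.
	RateLimitPerProbePerSecond: 1.0, // illustrative value
}
```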
Force-pushed from b96b2b7 to 60bbf38
```go
go func(updateChan <-chan []byte) {
configUpdateLoop:
	c := make(chan os.Signal, 1)
	signal.Notify(c, os.Interrupt)
```
It is a bit strange to handle the signal this deep. Also, a signal isn't the only way you want graceful termination: within a test, you will want to stop everything gracefully, but there is no signal at test end.

A couple of ways you could solve this (see the sketch after this list):

- Return a function that closes the channel, which will exit the goroutine. Call the function in the parent Close function.
- Create a wrapping type here that provides a Stop/Close function, and call that in the parent.
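A minimal sketch of the first option, under assumed names (startConfigUpdateLoop is hypothetical and the update-handling body is elided):

```go
// startConfigUpdateLoop consumes raw config updates until the returned
// stop function is called, at which point the goroutine exits cleanly.
// The parent's Close can invoke stop, which covers both signal-driven
// shutdown and test teardown.
func startConfigUpdateLoop(updateChan <-chan []byte) (stop func()) {
	done := make(chan struct{})
	go func() {
		for {
			select {
			case update, ok := <-updateChan:
				if !ok {
					return // producer closed the channel
				}
				_ = update // handle the configuration update here
			case <-done:
				return // graceful termination requested by the parent
			}
		}
	}()
	return func() { close(done) }
}
```

Closing done lets the parent trigger shutdown from its own Close method, which works the same whether the caller is reacting to os.Interrupt or tearing down a test.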
```go
// from (goDI *GoDI) startRingbufferConsumer() (func(), error)
closeFunc := func() {
	closed = true
	r.Close()
	r.Flush()
```
This doesn't really do anything since you already closed it.
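If the intent was to drain pending samples before shutdown, a reordered sketch might look like this (assuming r is a cilium/ebpf ringbuf.Reader, whose Flush unblocks pending reads so queued samples can still be consumed):

```go
closeFunc := func() {
	closed = true
	// Flush first so any samples still in the ring buffer can be
	// handed to the consumer, then close the reader.
	r.Flush()
	r.Close()
}
```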
/merge
🚂 MergeQueue: waiting for PR to be ready. This merge request is not mergeable yet because of pending checks/missing approvals. It will be added to the queue as soon as checks pass and/or it gets approvals.

🚂 MergeQueue: pull request added to the queue.
What does this PR do?
- Adds a configuration manager that allows callers to read configurations via Read(), as well as pass custom io.Writers for uploading events/diagnostic information.
- Adds sample-service, a simple program with lots of functions to test GoDI against.
, a simple program with lots of functions to test GoDI againstMotivation
Dynamic instrumentation needs to be tested in KMT to ensure stability, expose all features that are not yet supported, and track regressions. Since some tests are expected to fail, this shouldn't block CI.
Describe how to test/QA your changes
Tests should run in KMT.
Possible Drawbacks / Trade-offs
These tests are fully end-to-end. More fine-grained tests will be added soon.