-
Notifications
You must be signed in to change notification settings - Fork 1.3k
[dyninst/ebpf] Port sideeye ebpf program implementation and compatible go framing #37087
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Uncompressed package size comparisonComparison with ancestor Diff per package
Decision |
Regression DetectorRegression Detector ResultsMetrics dashboard Baseline: e133641 Optimization Goals: ✅ No significant changes detected
|
perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
---|---|---|---|---|---|---|
➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | +2.67 | [+1.75, +3.59] | 1 | Logs |
➖ | file_tree | memory utilization | +0.64 | [+0.42, +0.86] | 1 | Logs |
➖ | tcp_syslog_to_blackhole | ingress throughput | +0.54 | [+0.47, +0.62] | 1 | Logs |
➖ | docker_containers_memory | memory utilization | +0.41 | [+0.34, +0.48] | 1 | Logs |
➖ | quality_gate_logs | % cpu utilization | +0.21 | [-2.57, +2.99] | 1 | Logs bounds checks dashboard |
➖ | ddot_logs | memory utilization | +0.10 | [-0.02, +0.22] | 1 | Logs |
➖ | file_to_blackhole_100ms_latency | egress throughput | +0.08 | [-0.54, +0.70] | 1 | Logs |
➖ | file_to_blackhole_500ms_latency | egress throughput | +0.07 | [-0.56, +0.69] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency | egress throughput | +0.01 | [-0.58, +0.61] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency_http2 | egress throughput | +0.01 | [-0.61, +0.63] | 1 | Logs |
➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.03, +0.02] | 1 | Logs |
➖ | uds_dogstatsd_to_api | ingress throughput | -0.00 | [-0.29, +0.28] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency_http1 | egress throughput | -0.02 | [-0.66, +0.63] | 1 | Logs |
➖ | ddot_metrics | memory utilization | -0.02 | [-0.12, +0.07] | 1 | Logs |
➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.05 | [-0.67, +0.58] | 1 | Logs |
➖ | file_to_blackhole_1000ms_latency_linear_load | egress throughput | -0.06 | [-0.28, +0.17] | 1 | Logs |
➖ | file_to_blackhole_300ms_latency | egress throughput | -0.07 | [-0.72, +0.58] | 1 | Logs |
➖ | quality_gate_idle | memory utilization | -0.23 | [-0.30, -0.15] | 1 | Logs bounds checks dashboard |
➖ | otlp_ingest_metrics | memory utilization | -0.30 | [-0.46, -0.15] | 1 | Logs |
➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.66 | [-0.72, -0.59] | 1 | Logs |
➖ | otlp_ingest_logs | memory utilization | -0.77 | [-0.88, -0.65] | 1 | Logs |
➖ | docker_containers_cpu | % cpu utilization | -0.78 | [-4.58, +3.01] | 1 | Logs |
➖ | quality_gate_idle_all_features | memory utilization | -1.66 | [-1.79, -1.52] | 1 | Logs bounds checks dashboard |
Bounds Checks: ❌ Failed
perf | experiment | bounds_check_name | replicates_passed | links |
---|---|---|---|---|
❌ | docker_containers_memory | memory_usage | 9/10 | |
✅ | docker_containers_cpu | simple_check_run | 10/10 | |
✅ | docker_containers_memory | simple_check_run | 10/10 | |
✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency_linear_load | memory_usage | 10/10 | |
✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_300ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_300ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
✅ | quality_gate_idle | intake_connections | 10/10 | bounds checks dashboard |
✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
✅ | quality_gate_idle_all_features | intake_connections | 10/10 | bounds checks dashboard |
✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
✅ | quality_gate_logs | intake_connections | 10/10 | bounds checks dashboard |
✅ | quality_gate_logs | lost_bytes | 10/10 | bounds checks dashboard |
✅ | quality_gate_logs | memory_usage | 10/10 | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
e5cb050
to
4073e5d
Compare
980a9ce
to
35d88bb
Compare
// The ID of the program that produced this event. | ||
uint32_t prog_id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ID of the bpf program or the service being instrumented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ID of the ebpf program, corresponding to Program.ID in the ir/program.go.
I am not 100% this is enough to match what we need to match, but this is easy to change.
pkg/dyninst/ebpf/framing.h
Outdated
// The number of bytes for a stack trace that follows this header. | ||
uint16_t stack_bytes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stack_trace_len ? stack_trace_byte_len?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pkg/dyninst/ebpf/framing.h
Outdated
// Bytes of the frame pointer from the stack root. | ||
uint32_t frame_byte_depth; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Bytes of the frame pointer from the stack root. | |
uint32_t frame_byte_depth; | |
// Length in bytes of the frame pointer from the stack root. | |
uint32_t frame_byte_depth; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed, this is obsolete
pkg/dyninst/ebpf/framing.h
Outdated
// If the stack_bytes field is 0, then this field is the program counter of | ||
// at the moment of the event. | ||
uint64_t stack_hash_or_pc; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and otherwise what is it a hash of?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simplified and improved comment (it's a hash of the stack pcs that follow this entry).
__attribute((aligned(8))) event_header_t; | ||
|
||
// The maximum number of pcs in a captured stack trace. | ||
#define STACK_DEPTH 511 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason this isn't 512?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So that whole structure, including the length field, is 4KiB.
typedef struct data_item_header { | ||
// The type of the data item. | ||
uint32_t type; | ||
// The length of the data item. | ||
uint32_t length; | ||
// The address of the data item in the user process's address space. | ||
uint64_t address; | ||
} __attribute__((aligned(8))) data_item_header_t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the purpose of type to reference the types full definition in the saved probe IR?
For non-basic types, will the length be the whole type (i.e. structs will be the sum of all fields)?
Also what's the intention of address?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, exactly, type references types in ir, length is the total size of this data item in the output buffer, which must be equal to the byte_size() property of the type in ir (structs will be the sum of their field lenghts),
When some structure contains a pointer, data item for that structure will contain just the value of that pointer. We will emit another data item for the object that pointer was pointing too. Address can then be used to figure out that this another data item is what the pointer in the first data item pointed to.
17bcb1c
to
40538a0
Compare
…m/DataExMachina-dev/side-eye Drop parts that we won't be using anytime soon: - capturing "thread profile" (known as "snapshots" in sideeye land) - capturing data from functions up the stack (only injected function is processed) Small refactor around file organization and naming.
40538a0
to
59eba7b
Compare
59eba7b
to
a74b80a
Compare
Static quality checks✅ Please find below the results from static quality gates Successful checksInfo
|
// LOG is a macro that prints a message if the level is less than or equal to | ||
// the DEBUG level. | ||
#define LOG(level, fmt, ...) \ | ||
if (level <= DEBUG) { \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think DEBUG
can both be a define and an integer level?
// the DEBUG level. | ||
#define LOG(level, fmt, ...) \ | ||
if (level <= DEBUG) { \ | ||
bpf_printk(fmt, ##__VA_ARGS__); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use log_debug
we will automatically handle the newline change that was made in 5.9, as long as the program patcher is included on the userspace side.
__attribute__((unused)) int process_steps = 0; | ||
__attribute__((unused)) int chase_steps = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__attribute__((unused)) int process_steps = 0; | |
__attribute__((unused)) int chase_steps = 0; | |
__maybe_unused int process_steps = 0; | |
__maybe_unused int chase_steps = 0; |
#include "bpf_helpers.h" | ||
#include "vmlinux.h" | ||
|
||
#define DEFINE_QUEUE(prefix, elem_t, max_length) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this compare to BPF_MAP_TYPE_QUEUE
?
|
||
package output | ||
|
||
// This file must be kept in sync with the ../ebpf/framing.h file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a way to handle this by auto-generating the corresponding Go type definitions. See https://github.com/DataDog/datadog-agent/blob/main/pkg/collector/corechecks/ebpf/probe/ebpfcheck/c_types.go (source) and https://github.com/DataDog/datadog-agent/blob/main/pkg/collector/corechecks/ebpf/probe/ebpfcheck/c_types_linux.go (generated) for an example. You will need to add the file to
datadog-agent/tasks/system_probe.py
Line 487 in 175931d
def_files = { |
What does this PR do?
Introduce ebpf program implementation ported from side-eye.
Strip parts that won't be relevant for foreseeable future (thread profile support, collecting data from multiple frames, etc).
Minor code cleanup, definition reorganiation.
Introduce framing.go for unframing ebpf program output data, with compatibility test.
Motivation
https://datadoghq.atlassian.net/browse/DEBUG-3887