Add batching support for high-volume runtime monitoring #12168
Conversation
The remote sink was attempting to batch multiple messages in a single writev() call, which violates SOCK_SEQPACKET semantics: SOCK_SEQPACKET requires each message to be sent in a separate write to maintain message boundaries.

Changes:
- Modified flushBatchLocked() to send messages individually
- Added writeSingleMessage() helper for consistent retry logic
- Added brief spacing for large batches to avoid socket buffer overflow
- Fixed binary marshaling to use explicit LittleEndian encoding

This ensures proper message boundary preservation required by the seccheck protocol while maintaining batching efficiency through grouped flushes.
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up-to-date status, view the checks section at the bottom of the pull request.
cc @fvoznika. @vivaan-gupta-databricks, could you also sign the CLA?
Thanks for the PR. Batching is something that we expected to require for high-performance workloads, so it's great to see this happening. That said, the PR seems to mix two approaches: a single message carrying all batched messages (see the changes in the proto files) and a change that holds points in memory and writes them one at a time (to preserve compatibility with clients).
There doesn't seem to be much performance difference between the existing code and the PR, given that the same number of writes needs to happen at about the same frequency, aside from better cache reuse from batching.
Do you have benchmarks showing the performance difference? I would prefer to see batching implemented as a single write that contains multiple points/messages. That requires bigger changes in the way the messages are written and processed, but I suspect it will give much bigger performance gains in the end.
  // CurrentVersion is the current wire and protocol version.
- const CurrentVersion = 1
+ const CurrentVersion = 2
Please add version history to the comment, e.g. Version=1 doesn't support batching.
+ // batchedMessage represents a message to be sent in batch.
+ type batchedMessage struct {
Could you use BatchEntry here instead of defining a new type?
+ // Batching fields
+ batchInterval time.Duration
+ remoteVersion uint32
remoteVersion doesn't seem to be used.
+ // Version 2+ supports batching
+ const batchingSupportedVersion = 2
Not used?
+ // setupWithVersion returns the file and the remote version.
+ func setupWithVersion(path string) (*os.File, uint32, error) {
+ 	return setup(path)
Empty implementation. I guess this is where you would want to process the remote's version and disable batching if the remote doesn't support it. Please add unit test to check the fallback behavior.
  	return nil, fmt.Errorf("endpoint %q is not a string", addrOpaque)
  }
- return setup(addr)
+ file, _, err := setupWithVersion(addr)
See comment below: if setupWithVersion() handles the remote's version, it doesn't need to return the version (which is ignored here anyway).
Summary
Adds configurable batching support to the remote sink to handle high-volume runtime monitoring scenarios efficiently.
Problem
Runtime monitoring of syscalls was too slow when sending each event individually to the monitoring system. This created performance bottlenecks when monitoring high-frequency syscalls in production environments.
Solution
Added batching functionality: a configurable batch_interval parameter allows collecting syscalls over a specified time period (e.g., "5500ms") before flushing them to the sink.
Configuration Example
Impact
This significantly improves performance for runtime monitoring systems that need to process high volumes of syscall events.
Tested with syscall monitoring including write syscalls and other high-frequency events.