
Conversation

@vivaan-gupta-databricks commented Sep 25, 2025

Summary

Adds configurable batching support to the remote sink to handle high-volume runtime monitoring scenarios efficiently.

Problem

Runtime monitoring of syscalls was too slow when sending each event individually to the monitoring system. This created performance bottlenecks when monitoring high-frequency syscalls in production environments.

Solution

Added batching functionality with a configurable batch_interval parameter:

  • New batch_interval config parameter allows collecting syscalls over a specified time period (e.g., "5500ms")
  • Messages are collected in memory and flushed together at the interval boundary
  • Maintains SOCK_SEQPACKET message boundaries by sending each batched message individually during flush
  • Reduces syscall overhead while preserving the message segmentation required by monitoring receivers (see the sketch after this list)
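
A minimal sketch of this mechanism, assuming hypothetical names (`remote`, `addMessage`, `flushLoop`); the PR's actual types and locking differ:

```go
package remotesink

import (
	"sync"
	"time"
)

type remote struct {
	mu    sync.Mutex
	batch [][]byte // messages collected since the last flush
}

// addMessage buffers a message instead of writing it to the socket immediately.
func (r *remote) addMessage(msg []byte) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.batch = append(r.batch, msg)
}

// flushLoop wakes up once per batchInterval and flushes everything collected
// since the previous tick. Each message is written individually so that
// SOCK_SEQPACKET message boundaries are preserved for the receiver.
func (r *remote) flushLoop(batchInterval time.Duration, write func([]byte) error) {
	ticker := time.NewTicker(batchInterval)
	defer ticker.Stop()
	for range ticker.C {
		r.mu.Lock()
		pending := r.batch
		r.batch = nil
		r.mu.Unlock()
		for _, msg := range pending {
			_ = write(msg) // retry/error handling elided
		}
	}
}
```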

Configuration Example

{
  "name": "remote",
  "config": {
    "endpoint": "/tmp/gvisor_events.sock",
    "retries": 3,
    "batch_interval": "5500ms"
  }
}
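
The batch_interval value uses Go duration syntax, so a string like "5500ms" can be parsed with time.ParseDuration. The helper below is illustrative, not necessarily how the PR parses the config:

```go
package remotesink

import (
	"fmt"
	"time"
)

// parseBatchInterval is a hypothetical helper showing how a batch_interval
// value such as "5500ms" could be validated and converted.
func parseBatchInterval(s string) (time.Duration, error) {
	d, err := time.ParseDuration(s)
	if err != nil {
		return 0, fmt.Errorf("invalid batch_interval %q: %w", s, err)
	}
	if d <= 0 {
		return 0, fmt.Errorf("batch_interval must be positive, got %q", s)
	}
	return d, nil
}
```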

Impact

This significantly improves performance for runtime monitoring systems that need to process high volumes of syscall events by:

  • Reducing the number of flush operations from one-per-event to one-per-interval
  • Maintaining compatibility with monitoring tools that require discrete messages
  • Allowing tunable performance via the batch_interval parameter

Tested with syscall monitoring, including write syscalls and other high-frequency events.

vivaan1304 and others added 2 commits September 9, 2025 15:07
The remote sink was attempting to batch multiple messages in a single writev()
call, which violates SOCK_SEQPACKET semantics. SOCK_SEQPACKET requires each
message to be sent in a separate write to maintain message boundaries.

Changes:
- Modified flushBatchLocked() to send messages individually
- Added writeSingleMessage() helper for consistent retry logic
- Added brief spacing for large batches to avoid socket buffer overflow
- Fixed binary marshaling to use explicit LittleEndian encoding

This ensures proper message boundary preservation required by the seccheck
protocol while maintaining batching efficiency through grouped flushes.
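
A hedged reconstruction of what this commit describes; the struct fields, pacing threshold, and retry loop are assumptions, not the PR's actual code:

```go
package remotesink

import (
	"time"

	"golang.org/x/sys/unix"
)

type remote struct {
	fd      int      // connected SOCK_SEQPACKET socket
	retries int      // from the sink's "retries" config
	batch   [][]byte // messages awaiting flush; guarded by the caller's lock
}

const largeBatchThreshold = 64 // hypothetical cutoff for pacing writes

// flushBatchLocked sends each buffered message in its own write so that
// SOCK_SEQPACKET message boundaries are preserved. The batch lock must be held.
func (r *remote) flushBatchLocked() error {
	pace := len(r.batch) > largeBatchThreshold
	for _, msg := range r.batch {
		if err := r.writeSingleMessage(msg); err != nil {
			return err
		}
		if pace {
			// Brief pause so a large batch doesn't overrun the
			// receiver's socket buffer.
			time.Sleep(10 * time.Microsecond)
		}
	}
	r.batch = r.batch[:0]
	return nil
}

// writeSingleMessage writes one message, retrying transient socket errors.
func (r *remote) writeSingleMessage(msg []byte) error {
	var err error
	for attempt := 0; attempt <= r.retries; attempt++ {
		if _, err = unix.Write(r.fd, msg); err == nil {
			return nil
		}
		if err != unix.EAGAIN && err != unix.EINTR {
			return err
		}
	}
	return err
}
```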

google-cla bot commented Sep 25, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@vivaan-gupta-databricks vivaan-gupta-databricks changed the title Fix SOCK_SEQPACKET message boundary preservation in remote sink Add batching support for high-volume runtime monitoring Sep 25, 2025
@ayushr2 (Collaborator) commented Sep 25, 2025

cc @fvoznika

@vivaan-gupta-databricks could you also sign the CLA?

@ayushr2 ayushr2 requested a review from fvoznika September 25, 2025 16:48
@fvoznika (Member) left a comment

Thanks for the PR. Batching is something that we expected to require for high-performance workloads, so it's great to see this happening. The PR seems to mix two approaches: a single message containing all batched messages (see the changes in the proto files) and a change that holds points in memory and writes them one at a time (to preserve compatibility with clients).

There doesn't seem to be much performance difference between the existing code and the PR, given that the same number of writes needs to happen at about the same frequency; the main gain is better cache reuse from batching.

Do you have benchmarks showing the performance difference? I would prefer to see batching implemented as a single write that contains multiple points/messages. It requires bigger changes in the way the messages are written and processed, but I suspect it will give much bigger performance gains in the end.
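
To make the suggested alternative concrete, here is one possible framing for single-write batching: a message count followed by length-prefixed payloads. This is purely illustrative and is not the seccheck wire format:

```go
package remotesink

import "encoding/binary"

// encodeBatch packs many messages into one buffer that a single write can
// send; the receiver splits it back apart using the length prefixes. This
// replaces N per-message writes with one write per flush.
func encodeBatch(msgs [][]byte) []byte {
	buf := binary.LittleEndian.AppendUint32(nil, uint32(len(msgs)))
	for _, m := range msgs {
		buf = binary.LittleEndian.AppendUint32(buf, uint32(len(m)))
		buf = append(buf, m...)
	}
	return buf
}
```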


 // CurrentVersion is the current wire and protocol version.
-const CurrentVersion = 1
+const CurrentVersion = 2
@fvoznika (Member):

Please add version history to the comment, e.g. Version=1 doesn't support batching.
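
For example, the comment could read (wording illustrative):

```go
// CurrentVersion is the current wire and protocol version.
//
// Version history:
//   - 1: initial version; does not support batching.
//   - 2: adds batching support.
const CurrentVersion = 2
```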

 }

+// batchedMessage represents a message to be sent in batch.
+type batchedMessage struct {
@fvoznika (Member):

Could you use BatchEntry here instead of defining a new type?


+	// Batching fields
+	batchInterval time.Duration
+	remoteVersion uint32
@fvoznika (Member):

remoteVersion doesn't seem to be used.

 }

+// Version 2+ supports batching
+const batchingSupportedVersion = 2
@fvoznika (Member):

Not used?


+// setupWithVersion returns the file and the remote version.
+func setupWithVersion(path string) (*os.File, uint32, error) {
+	return setup(path)
@fvoznika (Member):

Empty implementation. I guess this is where you would want to process the remote's version and disable batching if the remote doesn't support it. Please add unit test to check the fallback behavior.
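
A sketch of the fallback being asked for, under the assumption that setup() performs the version handshake; a unit test would then assert that batching is disabled when the remote reports version 1:

```go
package remotesink

import "os"

// batchingSupportedVersion is the first protocol version that understands
// batched writes.
const batchingSupportedVersion = 2

// setup is a stand-in for the existing connect-and-handshake helper; its
// signature here is an assumption.
func setup(path string) (*os.File, uint32, error) {
	// ... dial the Unix socket at path and exchange versions ...
	return nil, 0, nil
}

// setupWithVersion connects to the remote and reports whether batching can
// be used, falling back to per-message writes for old remotes.
func setupWithVersion(path string) (*os.File, bool, error) {
	file, remoteVersion, err := setup(path)
	if err != nil {
		return nil, false, err
	}
	// Remotes older than version 2 don't understand batching.
	return file, remoteVersion >= batchingSupportedVersion, nil
}
```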

 	return nil, fmt.Errorf("endpoint %q is not a string", addrOpaque)
 }
-	return setup(addr)
+	file, _, err := setupWithVersion(addr)
@fvoznika (Member):

See the comment below: if setupWithVersion() handles the remote's version, it doesn't need to return the version (which is ignored here anyway).
