Conversation

@Molter73 (Collaborator) commented Sep 23, 2025

Description

This change greatly reduces the amount of data sent through the ringbuffer and the effort the BPF hooks spend retrieving the cgroup of the current process. In exchange for these benefits, we now need to look up and keep track of the cgroups and container IDs that exist on the system ourselves by iterating over the cgroupfs.
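
As a rough illustration of the userspace side of this approach (not the PR's actual code), the sketch below walks the cgroupfs hierarchy and builds a map from cgroup ID to container ID. It assumes cgroup v2, where the cgroup ID reported by the BPF helpers is the inode number of the cgroup directory, and a runtime that names container cgroups with a 64-character hex ID; all names are illustrative, and the PR's real types (e.g. `Arc<String>` container IDs) are simplified away.

```rust
use std::collections::HashMap;
use std::fs;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// Treat a 64-character hex directory name (optionally wrapped in a
/// `<runtime>-<id>.scope` systemd unit name) as a container ID.
fn container_id_from_name(name: &str) -> Option<String> {
    let stem = name.trim_end_matches(".scope");
    let stem = stem.rsplit('-').next().unwrap_or(stem);
    if stem.len() == 64 && stem.chars().all(|c| c.is_ascii_hexdigit()) {
        Some(stem.to_string())
    } else {
        None
    }
}

/// Recursively walk a cgroupfs subtree, recording a cgroup ID -> container ID
/// mapping for every cgroup directory. On cgroup v2 the cgroup ID seen by BPF
/// helpers is the inode number of the cgroup directory, so the inode is used
/// as the key. Children without an ID of their own inherit their parent's.
fn walk(path: &Path, inherited: Option<String>, map: &mut HashMap<u64, Option<String>>) {
    let Ok(entries) = fs::read_dir(path) else { return };
    for entry in entries.flatten() {
        let Ok(meta) = entry.metadata() else { continue };
        if !meta.is_dir() {
            continue;
        }
        let name = entry.file_name().to_string_lossy().into_owned();
        let id = container_id_from_name(&name).or_else(|| inherited.clone());
        map.insert(meta.ino(), id.clone());
        walk(&entry.path(), id, map);
    }
}

fn main() {
    let mut map = HashMap::new();
    walk(Path::new("/sys/fs/cgroup"), None, &mut map);
    println!("tracked {} cgroups", map.len());
}
```

The BPF hooks then only need to emit the cgroup ID with each event, and userspace resolves it against this map.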

Checklist

  • Investigated and inspected CI test results
  • Updated documentation accordingly

Automated testing

  • Added unit tests
  • Added integration tests
  • Added regression tests

If any of these don't apply, please comment below.

Testing Performed

Manually verified that generated events have the expected container ID, and added an integration test.

@Molter73 force-pushed the mauro/ROX-30714-use-cgroup-id branch from 97f35d7 to 8c3fb0c on September 24, 2025 08:31
@Molter73 marked this pull request as ready for review on September 25, 2025 09:44
@Stringy (Contributor) commented Sep 26, 2025

Do we have some numbers on the performance benefits of doing this? E.g. how much memory is saved per event, how many fewer events we're dropping, etc.? Have we encountered throughput issues already?

I'm a little skeptical of the file system approach because of the race condition between an event happening and our ability to look up the cgroup. Presumably it means that if we're unable to resolve the cgroup for the event, then activity might not be appropriately correlated with a deployment and, as a result, it might not be flagged as a violation?

@Molter73 (Collaborator, Author) commented:

> Do we have some numbers on the performance benefits of doing this? E.g. how much memory is saved per event, how many fewer events we're dropping, etc.? Have we encountered throughput issues already?

@erthalion is looking into performance testing; we may have some numbers in the near future.

> I'm a little skeptical of the file system approach because of the race condition between an event happening and our ability to look up the cgroup. Presumably it means that if we're unable to resolve the cgroup for the event, then activity might not be appropriately correlated with a deployment and, as a result, it might not be flagged as a violation?

Yeah, I'm not super keen on this approach for the same reason. In theory, creating and destroying containers takes some time, and because cgroupfs is a virtual file system (i.e. everything is computed from memory when read, like /proc), we should be relatively fast at finding new container ID mappings, but I don't know if we can guarantee a 100% hit rate.

On the other hand, the current approach only supports 16 levels of cgroup nesting, anything beyond that has a 100% miss ratio, and it is heavier on the processes that attempt to write or create files, so 🤷🏻‍♂️.

Maybe we can keep both implementations and make the choice toggleable via configuration?
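
One possible shape for that toggle, purely as a sketch: an enum selected at startup. The environment variable name, default, and wiring into the existing configuration are placeholders, not anything the PR defines.

```rust
/// Hypothetical toggle between the two resolution strategies.
#[derive(Clone, Copy, Debug)]
enum CgroupResolution {
    /// Resolve the container ID inside the BPF hooks (current approach,
    /// limited to 16 levels of cgroup nesting).
    InKernel,
    /// Send only the cgroup ID and resolve it in userspace via cgroupfs.
    Userspace,
}

fn resolution_from_env() -> CgroupResolution {
    // Placeholder option name; the real configuration mechanism may differ.
    match std::env::var("COLLECTOR_CGROUP_RESOLUTION").as_deref() {
        Ok("userspace") => CgroupResolution::Userspace,
        _ => CgroupResolution::InKernel,
    }
}

fn main() {
    println!("using {:?} cgroup resolution", resolution_from_env());
}
```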

@Molter73 force-pushed the mauro/ROX-30714-use-cgroup-id branch 2 times, most recently from ae53ab4 to 5df4d9e on September 30, 2025 10:43
This change greatly reduces the amount of data sent through the
ringbuffer and the effort the BPF hooks spend retrieving the cgroup of
the current process. In exchange for these benefits, we now need to
look up and keep track of the cgroups and container IDs that exist on
the system ourselves by iterating over the cgroupfs.

TODO: Add integration tests with containers.
Because capturing information for a process running in a container is
not fully possible from `/proc`, some information is passed to the
Process constructor directly to overwrite it. This should be fine
though, because we are using a specific container that should not have
major changes in how it executes the command we are expecting.
@Molter73 force-pushed the mauro/ROX-30714-use-cgroup-id branch from 5df4d9e to 13456b9 on October 2, 2025 15:29
```rust
#[derive(Debug)]
struct ContainerIdEntry {
    container_id: Option<Arc<String>>,
    pub last_seen: SystemTime,
}
```

Why make this public?
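
A minimal sketch of what the suggestion could look like, with the field kept private behind small accessors; the method names are illustrative and not part of the PR.

```rust
use std::sync::Arc;
use std::time::SystemTime;

#[derive(Debug)]
struct ContainerIdEntry {
    container_id: Option<Arc<String>>,
    last_seen: SystemTime,
}

impl ContainerIdEntry {
    fn new(container_id: Option<Arc<String>>) -> Self {
        Self {
            container_id,
            last_seen: SystemTime::now(),
        }
    }

    /// Refresh the timestamp when the entry is seen again during a walk.
    fn touch(&mut self) {
        self.last_seen = SystemTime::now();
    }

    fn last_seen(&self) -> SystemTime {
        self.last_seen
    }
}
```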

```rust
    })
}

fn walk_cgroupfs(path: &PathBuf, map: &mut ContainerIdMap, parent_id: Option<Arc<String>>) {
```

Maybe rename to something like walk_cgroupfs_and_add_to_map. The current name doesn't explain what is done while we walk through the cgroupfs.

```rust
    ContainerIdCache(Arc::new(Mutex::new(map)))
}

fn update_unlocked(map: &mut ContainerIdMap) {
```

Maybe rename to update_no_lock or update_without_lock. The name implies that this updates something that is unlocked, rather than doing an update without locking.

```rust
    async fn prune(&mut self) {
        let now = SystemTime::now();
        self.0.lock().await.retain(|_, value| {
            now.duration_since(value.last_seen).unwrap() < Duration::from_secs(30)
```

@JoukoVirtanen commented Nov 8, 2025

Maybe set a variable to 30 seconds and pass it to this method to make it more configurable. Also maybe rename the method to something like prune_old.
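
A sketch of that suggestion, with the age passed in and the default named as a constant. The surrounding types mirror the shapes visible in the diff but are reconstructed here (the map key type, for instance, is assumed to be the cgroup ID), and the `.unwrap()` is swapped for a fallback so a backwards clock jump keeps the entry rather than panicking.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::time::{Duration, SystemTime};
use tokio::sync::Mutex;

// Reconstructed shapes; the real definitions live elsewhere in the PR.
struct ContainerIdEntry {
    container_id: Option<Arc<String>>,
    last_seen: SystemTime,
}
type ContainerIdMap = HashMap<u64, ContainerIdEntry>;
struct ContainerIdCache(Arc<Mutex<ContainerIdMap>>);

/// How long an entry may go unseen before it is pruned.
const DEFAULT_MAX_ENTRY_AGE: Duration = Duration::from_secs(30);

impl ContainerIdCache {
    /// Drop entries older than `max_age`; callers would typically pass
    /// `DEFAULT_MAX_ENTRY_AGE` or a configured value.
    async fn prune_old(&self, max_age: Duration) {
        let now = SystemTime::now();
        self.0.lock().await.retain(|_, value| {
            // Keep the entry if the clock went backwards instead of panicking.
            now.duration_since(value.last_seen)
                .map(|age| age < max_age)
                .unwrap_or(true)
        });
    }
}
```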


```rust
    pub fn start_worker(mut self, mut running: Receiver<bool>) -> JoinHandle<()> {
        tokio::spawn(async move {
            let mut update_interval = time::interval(time::Duration::from_secs(30));
```

Maybe don't hard-code 30 here.
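
Along the same lines as the pruning comment above, the interval could come from a shared, named constant (or a `start_worker` parameter) rather than a literal; the constant name below is illustrative.

```rust
use tokio::time::{self, Duration, Interval};

/// Single source of truth for the cache refresh cadence; reusing it for the
/// prune threshold keeps the two values in sync.
const CACHE_REFRESH_INTERVAL: Duration = Duration::from_secs(30);

fn cache_update_interval() -> Interval {
    // Must be called from within a Tokio runtime.
    time::interval(CACHE_REFRESH_INTERVAL)
}
```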

@Molter73 marked this pull request as draft on November 10, 2025 09:05