bcc: Allow specify wakeup_events for perf buffer #3801

Merged
merged 1 commit into iovisor:master on Jan 9, 2022

Conversation

chenhengqi
Collaborator

This commit adds a new option `wakeup_events` to the open_perf_buffer API.
This provides an alternative to solve #3793.

Signed-off-by: Hengqi Chen <chenhengqi@outlook.com>
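For reference, here is a minimal sketch of how a tool might use the new option from the bcc Python side. This is an editor's illustration, not code from the PR: the BPF program, event name, and attach point are placeholders; only the wakeup_events keyword itself comes from this change.

import ctypes as ct
from bcc import BPF

prog = r"""
BPF_PERF_OUTPUT(events);
int do_trace(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    events.perf_submit(ctx, &pid, sizeof(pid));
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="do_trace")

def handle_event(cpu, data, size):
    # Still called once per sample, even when wakeup_events > 1.
    pid = ct.cast(data, ct.POINTER(ct.c_uint32)).contents.value
    print("clone() called by pid %d" % pid)

# wakeup_events is the option added by this PR: with 10, the perf buffer fd
# is not reported ready by epoll until roughly 10 samples have accumulated.
b["events"].open_perf_buffer(handle_event, wakeup_events=10)

while True:
    try:
        b.perf_buffer_poll(timeout=1000)  # timeout in ms
    except KeyboardInterrupt:
        break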
@davemarchevsky
Collaborator

Is my understanding correct here?

If wakeup_events is set to something like 10, epoll won't trigger on the fd until 10 events have been collected (ignoring epoll_wait's timeout for now). When 10 events have been collected, perf_buffer__poll will end up calling the handler callback once for each sample. So the handler cb doesn't need to be updated to handle multiple events if wakeup_events > 1.

Do we need to do anything special to ensure the 'remainder' events are handled? e.g. wakeup_events = 10, 9 events are collected but the 10th never happens.

@yonghong-song
Collaborator

Do we need to do anything special to ensure the 'remainder' events are handled? e.g. wakeup_events = 10, 9 events are collected but the 10th never happens.

Does the epoll timeout serve this purpose? If the 10th event never comes up, the timeout eventually happens and user space should process it, right?

@chenhengqi
Collaborator Author

The callchain is:

bpf_perf_event_output
    __bpf_perf_event_output
        perf_event_output
            __perf_event_output
                perf_output_sample
                perf_output_end
                    perf_output_put_handle

In perf_output_sample, rb->wakeup is updated only when events >= wakeup_events:

	if (!event->attr.watermark) {
		int wakeup_events = event->attr.wakeup_events;

		if (wakeup_events) {
			struct perf_buffer *rb = handle->rb;
			int events = local_inc_return(&rb->events);

			if (events >= wakeup_events) {
				local_sub(wakeup_events, &rb->events);
				local_inc(&rb->wakeup);
			}
		}
	}

In perf_output_put_handle, perf_output_wakeup is called once rb->wakeup has changed:

	if (handle->wakeup != local_read(&rb->wakeup))
		perf_output_wakeup(handle);

And here is perf_output_wakeup:

static void perf_output_wakeup(struct perf_output_handle *handle)
{
	atomic_set(&handle->rb->poll, EPOLLIN);

	handle->event->pending_wakeup = 1;
	irq_work_queue(&handle->event->pending);
}
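To make this bookkeeping concrete, here is a toy model of the counter logic above in plain Python (an editor's illustration of the quoted kernel snippet, not kernel code):

wakeup_events = 10
events = 0   # models rb->events
wakeup = 0   # models rb->wakeup; epoll only sees the fd as readable after this bumps

for sample in range(1, 15):
    events += 1                      # local_inc_return(&rb->events)
    if events >= wakeup_events:
        events -= wakeup_events      # local_sub(wakeup_events, &rb->events)
        wakeup += 1                  # local_inc(&rb->wakeup): wakeup fires in perf_output_put_handle

# After 14 samples: wakeup == 1 (fired at sample 10) and events == 4,
# i.e. 4 'remainder' samples sit in the buffer without making the fd readable.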

IIRC, the timeout won't take effect, and users have to handle the remainder events themselves.

According to my test, each sample triggers the callback handler once, so we don't have to change the handler.

@davemarchevsky
Collaborator

@yonghong-song

If the 10th event never comes up, the timeout eventually happens and user space should process it, right?

I think we're on the same page. I didn't have a very clear idea of the scenarios I was worried about. After some more thought I think there's nothing to be worried about. For posterity, though:

  • Programs that use perf_events and run for a specified amount of time, like examples/PyPerf, already do an explicit poll of the perf buffer after detaching the event to catch 'remainder' events, so the wakeup_events = 10 scenario is handled (a sketch of this pattern follows this list).
  • Tools that collect data until killed by terminal input, like libbpf-tools/execsnoop, don't do an explicit extra poll, but they are also not attempting to collect a precise time period's worth of events, so it's not a problem if remainder events aren't handled. Ostensibly, if someone is setting wakeup_events > 1 in their tool, they're tracing an event that occurs often.
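A rough sketch of that end-of-run pattern, reusing the placeholder names from the earlier sketch (the detach target is illustrative):

# End of tracing window: detach, then do one last zero-timeout poll, as
# examples/PyPerf does. Note that this only drains buffers whose fds epoll
# already reports as readable; with wakeup_events > 1, up to wakeup_events - 1
# 'remainder' samples can still sit below the watermark.
b.detach_kprobe(event=b.get_syscall_fnname("clone"))
b.perf_buffer_poll(timeout=0)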

So no problem from my end.

@chenhengqi

IIRC, the timeout won't take effect, and users have to handle the remainder events themselves.

When you say "timeout won't take effect", are you referring to the case where >= wakeup_events events are collected, or more generally? Just want to confirm that it's the former, because some programs (like PyPerf in first bullet above) do a poll w/ 0 timeout to grab remainder events.

@chenhengqi
Collaborator Author

@davemarchevsky

Yes, you are right.

By "timeout won't take effect", I mean that the timeout does not affect the behavior of the perf buffer itself.

@davemarchevsky
Collaborator

LGTM

davemarchevsky merged commit c756b92 into iovisor:master Jan 9, 2022
@yzhao1012
Contributor

yzhao1012 commented Jan 10, 2022

Context: I filed #3793

IIUC, the updated API needs to be exposed to ebpf::BPFPerfBuffer as well, for users of that API to be able to change the wakeup_events.

But there is one point where my understanding does not seem to match reality:

  • It seems there is no change to the callback function signature?
  • I assumed that the callback would need to change to handle multiple events. Today the callback looks like: typedef void (*perf_reader_raw_cb)(void *cb_cookie, void *raw, int raw_size); I'd imagine it would turn into something like typedef void (*perf_reader_raw_cb)(void *cb_cookie, void **raw, int *raw_sizes, int len); which gives the number of events and the size of each individual event.

So here I assumed that, by waking up every (for example) 10 events, the 10 events are handled by 1 invocation of the callback on all events in the perf ring buffer (perfbuffer, not the new bpf ringbuf).

But it seems that this actually only invokes the callback on the 10th event? If that's the case, then we cannot use it, because we need to handle all events; missing any of them would not be acceptable.

Instead, we only want to minimize the overhead of repeatedly invoking the callback for each event. Thus #3793.

Edit: I plan to try this out to better understand the situation. Don't take the above as suggestions, as I probably have a poor understanding of the details anyway.

@davemarchevsky
Collaborator

But it seems that this actually only invokes the callback on the 10th event?

In the wakeup_events=10 case, the callback will be invoked on every event, but the perf buffer's fd will not indicate to epoll that it has data ready to read until it contains 10 events. When it does indicate that it's ready to be read, all of the data in the buffer will be read and handled.

I see what you mean with the "handle many events per callback instead of 1" point, but I don't think this would improve performance significantly relative to exposing wakeup_events. Bumping wakeup_events from 1 to 10 would reduce context switches due to poll by 10x, which would dwarf the perf win from reducing callback function call overhead.

Also, note that the samples in the perf buffer are not of fixed size and may wrap around from the 'end' to the 'beginning' of the ring buffer, so when pulling many samples out it's necessary to poke at each one to find its size and possibly do some memcpying to deal with wrapped samples (see perf_event_read_simple in libbpf). You could probably create a modified perf_event_read_simple that creates a batch of events for a callback to deal with, but I'm not sure it would be generally useful.

IIUC, the updated API needs to be exposed to ebpf::BPFPerfBuffer as well, for users of that API to be able to change the wakeup_events.

Yep. This should be added, as well as a test validating this wakeup_events stuff.

While messing around with this PR's changes today I noticed that we're missing a perf_buffer__consume equivalent, which is necessary to handle the 'remainder' events I mentioned in previous comments, as well as some other opportunities to simplify. So thanks for bumping this!
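For completeness, once a consume-style call is exposed, the remainder flush becomes a single call. The follow-up commit below adds perf_reader_consume at the C level; the Python-level name used here, perf_buffer_consume, is an assumption for illustration, not something introduced by this PR:

# Drain all per-CPU perf buffers regardless of whether their fds are
# epoll-ready. perf_buffer_consume is an assumed wrapper name; the C API
# added in the follow-up commit is perf_reader_consume, modeled after
# libbpf's perf_buffer__consume.
b.perf_buffer_consume()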

yonghong-song pushed a commit that referenced this pull request Jan 11, 2022
In order to read 'remainder' events from perf buffers w/ 'wakeup_events > 1', an API to force reading from a program's perf buffers regardless
of whether their fds are in "ready to read" state is necessary. This
commit introduces such an API, perf_reader_consume, modeled after
libbpf's perf_buffer__consume.

Future work will refactor bcc to use libbpf's perf buffer
support as much as possible instead of duplicating functionality. For
now, since #3801 was committed, let's add this piece of missing
functionality.
chenhengqi added a commit to chenhengqi/bcc that referenced this pull request Jan 16, 2022
Like in iovisor#3801 and iovisor#3805, this patch adds a consume API
which allows users to force reading from the perf buffer and
adds a wake_events parameter to the open_all_cpu API.

Signed-off-by: Hengqi Chen <chenhengqi@outlook.com>