bcc: Allow specify wakeup_events for perf buffer #3801

Merged
merged 1 commit into iovisor:master on Jan 9, 2022

Conversation

chenhengqi
Collaborator

This commit adds a new option `wakeup_events` to the open_perf_buffer API.
This provides an alternative to solve #3793.

Signed-off-by: Hengqi Chen <chenhengqi@outlook.com>
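For reference, here is a minimal sketch of how a tool might use the new option from the bcc Python side. This is an editor's illustration, not code from the PR: the BPF program, event name, and attach point are placeholders; only the wakeup_events keyword itself comes from this change.

import ctypes as ct
from bcc import BPF

prog = r"""
BPF_PERF_OUTPUT(events);
int do_trace(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    events.perf_submit(ctx, &pid, sizeof(pid));
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="do_trace")

def handle_event(cpu, data, size):
    # Still called once per sample, even when wakeup_events > 1.
    pid = ct.cast(data, ct.POINTER(ct.c_uint32)).contents.value
    print("clone() called by pid %d" % pid)

# wakeup_events is the option added by this PR: with 10, the perf buffer fd
# is not reported ready by epoll until roughly 10 samples have accumulated.
b["events"].open_perf_buffer(handle_event, wakeup_events=10)

while True:
    try:
        b.perf_buffer_poll(timeout=1000)  # timeout in ms
    except KeyboardInterrupt:
        break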
@davemarchevsky
Collaborator

Is my understanding correct here?

If wakeup_events is set to something like 10, epoll won't trigger on the fd until 10 events have been collected (ignoring epoll_wait's timeout for now). When 10 events have been collected, perf_buffer__poll will end up calling the handler callback once for each sample. So the handler cb doesn't need to be updated to handle multiple events if wakeup_events > 1.

Do we need to do anything special to ensure the 'remainder' events are handled? e.g. wakeup_events = 10, 9 events are collected but the 10th never happens.

@yonghong-song
Collaborator

Do we need to do anything special to ensure the 'remainder' events are handled? e.g. wakeup_events = 10, 9 events are collected but the 10th never happens.

Does the epoll timeout serve this purpose? If the 10th event never comes up, the timeout eventually happens and user space should process it, right?

@chenhengqi
Collaborator Author

The callchain is:

bpf_perf_event_output
    __bpf_perf_event_output
        perf_event_output
            __perf_event_output
                perf_output_sample
                perf_output_end
                    perf_output_put_handle

In perf_output_sample, rb->wakeup is updated only when events >= wakeup_events:

	if (!event->attr.watermark) {
		int wakeup_events = event->attr.wakeup_events;

		if (wakeup_events) {
			struct perf_buffer *rb = handle->rb;
			int events = local_inc_return(&rb->events);

			if (events >= wakeup_events) {
				local_sub(wakeup_events, &rb->events);
				local_inc(&rb->wakeup);
			}
		}
	}

In perf_output_put_handle, perf_output_wakeup is called once rb->wakeup has changed:

	if (handle->wakeup != local_read(&rb->wakeup))
		perf_output_wakeup(handle);

And here is perf_output_wakeup:

static void perf_output_wakeup(struct perf_output_handle *handle)
{
	atomic_set(&handle->rb->poll, EPOLLIN);

	handle->event->pending_wakeup = 1;
	irq_work_queue(&handle->event->pending);
}
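To make this bookkeeping concrete, here is a toy model of the counter logic above in plain Python (an editor's illustration of the quoted kernel snippet, not kernel code):

wakeup_events = 10
events = 0   # models rb->events
wakeup = 0   # models rb->wakeup; epoll only sees the fd as readable after this bumps

for sample in range(1, 15):
    events += 1                      # local_inc_return(&rb->events)
    if events >= wakeup_events:
        events -= wakeup_events      # local_sub(wakeup_events, &rb->events)
        wakeup += 1                  # local_inc(&rb->wakeup): wakeup fires in perf_output_put_handle

# After 14 samples: wakeup == 1 (fired at sample 10) and events == 4,
# i.e. 4 'remainder' samples sit in the buffer without making the fd readable.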

IIRC, the timeout won't take effect, and users have to handle the remainder events themselves.

According to my test, each sample triggers the callback handler once, so we don't have to change the handler.

@davemarchevsky
Collaborator

@yonghong-song

If the 10th event never comes up, the timeout eventually happens and user space should process it, right?

I think we're on the same page. I didn't have a very clear idea of the scenarios I was worried about. After some more thought I think there's nothing to be worried about. For posterity, though:

  • Programs that use perf_events and run for a specified amount of time, like examples/PyPerf, already do an explicit poll of the perf buffer after detaching the event to catch 'remainder' events, so the wakeup_events = 10 scenario is handled (a sketch of this pattern follows this list).
  • Tools that collect data until killed by terminal input, like libbpf-tools/execsnoop, don't do an explicit extra poll, but they are also not attempting to collect a precise time period's worth of events, so it's not a problem if remainder events aren't handled. Ostensibly, if someone is setting wakeup_events > 1 in their tool, they're tracing an event that occurs often.
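A rough sketch of that end-of-run pattern, reusing the placeholder names from the earlier sketch (the detach target is illustrative):

# End of tracing window: detach, then do one last zero-timeout poll, as
# examples/PyPerf does. Note that this only drains buffers whose fds epoll
# already reports as readable; with wakeup_events > 1, up to wakeup_events - 1
# 'remainder' samples can still sit below the watermark.
b.detach_kprobe(event=b.get_syscall_fnname("clone"))
b.perf_buffer_poll(timeout=0)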

So no problem from my end.

@chenhengqi

IIRC, the timeout won't take effect, and users have to handle the remainder events themselves.

When you say "timeout won't take effect", are you referring to the case where >= wakeup_events events are collected, or more generally? Just want to confirm that it's the former, because some programs (like PyPerf in first bullet above) do a poll w/ 0 timeout to grab remainder events.

@chenhengqi
Collaborator Author

@davemarchevsky

Yes, you are right.

By "timeout won't take effect", I mean that the timeout does not affect the behavior of the perf buffer itself.

@davemarchevsky
Collaborator

LGTM

davemarchevsky merged commit c756b92 into iovisor:master Jan 9, 2022
@yzhao1012
Contributor

yzhao1012 commented Jan 10, 2022

Context: I filed #3793

IIUC, the updated API needs to be exposed to ebpf::BPFPerfBuffer as well, for users of that API to be able to change the wakeup_events.

But there is one point where my understanding does not seem to match reality:

  • It seems there is no change to the callback function signature?
  • I assumed that the callback would need to change to handle multiple events. Today the callback looks like: typedef void (*perf_reader_raw_cb)(void *cb_cookie, void *raw, int raw_size); I'd imagine it would turn into something like typedef void (*perf_reader_raw_cb)(void *cb_cookie, void **raw, int *raw_sizes, int len); which gives the number of events and the size of each individual event.

So here I assumed that, by waking up every (for example) 10 events, the 10 events are handled by 1 invocation of the callback on all events in the perf ring buffer (perfbuffer, not the new bpf ringbuf).

But it seems that this actually only invokes the callback on the 10th event? If that's the case, then we cannot use it, because we need to handle all events; missing any of them would not be acceptable.

Instead, we only want to minimize the overhead of repeatedly invoking the callback for each event. Thus #3793.

Edit: I plan to try this out to better understand the situation. Don't take the above as suggestions, as I probably have a poor understanding of the details anyway.

@davemarchevsky
Collaborator

But it seems that this actually only invokes the callback on the 10th event?

In the wakeup_events=10 case, the callback will be invoked on every event, but the perf buffer's fd will not indicate to epoll that it has data ready to read until it contains 10 events. When it does indicate that it's ready to be read, all of the data in the buffer will be read and handled.

I see what you mean with the "handle many events per callback instead of 1" point, but I don't think this would improve performance significantly relative to exposing wakeup_events. Bumping wakeup_events from 1 to 10 would reduce context switches due to poll by 10x, which would dwarf the perf win from reducing callback function call overhead.

Also, note that the samples in the perf buffer are not of fixed size and may wrap around from the 'end' to the 'beginning' of the ring buffer, so when pulling many samples out it's necessary to poke at each one to find its size and possibly do some memcpying to deal with wrapped samples (see perf_event_read_simple in libbpf). You could probably create a modified perf_event_read_simple that creates a batch of events for a callback to deal with, but I'm not sure it would be generally useful.

IIUC, the updated API needs to be exposed to ebpf::BPFPerfBuffer as well, for users of that API to be able to change the wakeup_events.

Yep. This should be added, as well as a test validating this wakeup_events stuff.

While messing around with this PR's changes today I noticed that we're missing a perf_buffer__consume equivalent, which is necessary to handle the 'remainder' events I mentioned in previous comments, as well as some other opportunities to simplify. So thanks for bumping this!
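For completeness, once a consume-style call is exposed, the remainder flush becomes a single call. The follow-up commit below adds perf_reader_consume at the C level; the Python-level name used here, perf_buffer_consume, is an assumption for illustration, not something introduced by this PR:

# Drain all per-CPU perf buffers regardless of whether their fds are
# epoll-ready. perf_buffer_consume is an assumed wrapper name; the C API
# added in the follow-up commit is perf_reader_consume, modeled after
# libbpf's perf_buffer__consume.
b.perf_buffer_consume()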

yonghong-song pushed a commit that referenced this pull request Jan 11, 2022
In order to read 'remainder' events from perf buffers w/ 'wakeup_events > 1', an API to force reading from a program's perf buffers regardless
of whether their fds are in "ready to read" state is necessary. This
commit introduces such an API, perf_reader_consume, modeled after
libbpf's perf_buffer__consume.

Future work will refactor bcc to use libbpf's perf buffer
support as much as possible instead of duplicating functionality. For
now, since #3801 was committed, let's add this piece of missing
functionality.
chenhengqi added a commit to chenhengqi/bcc that referenced this pull request Jan 16, 2022
Like in iovisor#3801 and iovisor#3805, this patch adds a consume API
which allows users to force reading from the perf buffer and
adds a wake_events parameter to the open_all_cpu API.

Signed-off-by: Hengqi Chen <chenhengqi@outlook.com>