proposal: runtime/pprof: extend profile configuration to support perf events #53286
I have a few minor questions:
|
Yes, still pprof. I didn't change the output format, just replaced the original signal source (the OS timer) with the PMU counter. The signal handler is also basically unchanged, so pprof fits well. The only change is to change the original sample type from
Yes, my plan is to support only Linux; if we find an appropriate method to support the PMU on Windows and Darwin in the future, we will add support for them then. It may be appropriate to panic in Start before it is implemented. |
Would it be possible to measure multiple events simultaneously by creating a |
My prototype implementation doesn't account for multiple events, but it's not hard to support them. I can do it; however, I'm not a proposal approver. Thanks. |
I did some research today on other OSes to get a sense of the capabilities.

Windows

The Windows Performance Recorder CLI supports collecting some PMU events. There is an API to control profile collection, though I believe this communicates with a centralized service that actually does profile collection. As far as I can tell, there is no direct kernel interface. Additionally, it is not possible to limit collection to a specific process as far as I can tell. Unanswered questions:

Open source libraries like PCM use a custom Windows kernel driver to access PMUs. Visual Studio has a command capable of collecting PMU events for a specific process. It is unclear to me how this works. Does it use the WPR API?

My takeaway is that there isn't really anything feasible for us to programmatically integrate with, unless I've missed something.

Darwin

Xcode Instruments supports collecting counts from PMU events. How it does so is unclear. PCM requires a custom kernel driver, like on Windows. The Darwin kernel has a PMC subsystem. I suspect that there is nothing feasible here either, but I am less confident than with Windows.

FreeBSD

FreeBSD has a complete |
Thanks @prattmic for the research. I don't know much about other OSes; I'll look into how to access the PMU programmatically on them. I wonder if you @golang/windows folks can provide some hints for Windows? Also, if it is really difficult to access the PMU on an OS, can we simply not support it there? Just like the per-thread OS timer, which we also only support on Linux. |
I should have added more context to my previous comment. The purpose of my research was to get a sense of how hard we should try to design a platform-independent API (even if only Linux is supported initially). My initial take-away is that this matters less if Windows and Darwin turn out to be impossible to support. |
In this proposal, you are proposing a new type.

I actually like that approach better than adding a new top-level perf-specific profile type, for a few reasons:
I also think that using Set methods rather than exported fields provides more flexibility on the struct internals plus the ability to return errors, though I don’t feel too strongly about this as it does make the API below more verbose. One such possible API:

type CPUProfile struct { ... }

func (*CPUProfile) SetRate(int) error
func (*CPUProfile) SetSource(CPUProfileSource) error

type CPUProfileSource int

const (
    CPUTimer CPUProfileSource = iota
    CPUCycles
    CPUCyclesPrecise
    Custom // Configure via Sys().
)

// Returns *LinuxCPUProfile on Linux.
func (*CPUProfile) Sys() any

// LinuxCPUProfile contains Linux perf specific advanced configuration options.
type LinuxCPUProfile struct{ ... }

// Use one of the generic PERF_TYPE_HARDWARE events.
func (*LinuxCPUProfile) SetHardwareSource(LinuxCPUProfileHardwareSource) error

// Use a raw hardware event (PERF_TYPE_RAW). Format of config, config1, and config2 are microarchitecture dependent.
func (*LinuxCPUProfile) SetRawSource(config, config1, config2 uint64) error

// Additional options. e.g., setting the “precise” flag.

The idea of hiding OS-dependent details behind a Sys() method follows existing standard library precedent (e.g., os.ProcessState.Sys).

Another big unanswered question is how events should be described. The proposed API describes events with perf-style name strings. That would be great, but presents a big problem: these names aren’t understood directly by the kernel interface. The simple generic events like cycles map directly to PERF_TYPE_HARDWARE, but most events are microarchitecture-specific, with encodings that the perf tool resolves from its own large event tables. There is such a huge number of possible events here that I don’t think it is feasible for the Go standard library to be in the business of understanding the names of every single event. In the proposed API above, I’ve addressed this by directly supporting the generic PERF_TYPE_HARDWARE events and leaving everything else to the raw interface. |
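For illustration, a sketch of how a caller might reach the raw-event escape hatch through Sys(), assuming the proposed names above existed in runtime/pprof (NewCPUProfile is the constructor shape from #42502; the raw config value is a made-up placeholder, not a real event encoding):

package main

import (
	"log"
	"runtime/pprof"
)

func configureRawEvent() {
	// Sketch only: CPUProfile, Custom, Sys, and LinuxCPUProfile are the
	// proposed names above; none of them exist in runtime/pprof today.
	p := pprof.NewCPUProfile()
	if err := p.SetSource(pprof.Custom); err != nil {
		log.Fatal(err)
	}
	lp, ok := p.Sys().(*pprof.LinuxCPUProfile)
	if !ok {
		log.Fatal("raw perf events are only configurable on Linux")
	}
	// 0x01c2 is a placeholder PERF_TYPE_RAW encoding for this sketch.
	if err := lp.SetRawSource(0x01c2, 0, 0); err != nil {
		log.Fatal(err)
	}
}

func main() { configureRawEvent() }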
I will also note that as a first-step proposal, we could just take the simple part of the hardware CPU profile API above. i.e.,

type CPUProfile struct { ... }

func (*CPUProfile) SetRate(int) error
func (*CPUProfile) SetSource(CPUProfileSource) error

type CPUProfileSource int

const (
    CPUTimer CPUProfileSource = iota
    CPUCycles
    CPUCyclesPrecise
)

That is, we add support for using a hardware cycle counter for CPU profiles (perhaps exposing a ‘precise’ option, perhaps not). Nothing more. Under the hood this uses the perf interface, but we don’t expose an API for using any other event types. |
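A minimal sketch of the intended call pattern, assuming this first-step API landed in runtime/pprof together with the NewCPUProfile/Start/Stop shape from #42502 (all hypothetical; the exact signatures were still under discussion):

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("cycles.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Hypothetical API: prefer hardware cycle sampling, fall back to the
	// portable OS timer where the PMU is unavailable (non-Linux, permissions).
	p := pprof.NewCPUProfile()
	if err := p.SetSource(pprof.CPUCycles); err != nil {
		if err := p.SetSource(pprof.CPUTimer); err != nil {
			log.Fatal(err)
		}
	}
	if err := p.Start(f); err != nil {
		log.Fatal(err)
	}
	defer p.Stop()

	workload()
}

func workload() {
	// CPU-bound work to be profiled.
	x := 0
	for i := 0; i < 1e8; i++ {
		x += i
	}
	_ = x
}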
I don't have a particularly strong opinion on how to define this API, so I think your proposal is also good; not exposing the internal fields of CPUProfile gives more flexibility to our implementation. And I don't think it's necessary to support too many events; do we really need raw hardware events?
Yes, I think we will only support a few common
Yes, they are. In contrast to exported constants, we don't need to export these strings, since we won't support a lot of perf events.
I think at least
So this is the basis of the above discussion. Does it still make sense if we can't find a proper way to access the PMU on Darwin and Windows? As far as I know, there is currently no way to access the Arm PMU on Windows; Arm has some work in progress to provide low-level profiling infrastructure and tooling for Windows, but it also seems to be struggling to meet our needs. So can we support the above API only on Linux? If so, then what's the behavior of |
I completely agree. I was trying to be flexible since I've seen desire for arbitrary events (such as in #36821), but I think focusing just on
I think it is hard to decide what is important. There was a lot of discussion on #36821 about paring down the initial proposal to focus narrowly on improving CPU profiles without adding tons of extra options, and I think that is pertinent here. If there is general agreement that just
Yes, it would only be supported on Linux. The way I envisioned this API is that
P.S. it's possible that (1) will prevent us from making

P.P.S. since we are introducing the first implementation of |
I will point out that https://pkg.go.dev/runtime/pprof#StartCPUProfile does not promise that profiles are measured in |
One concern that I believe @rhysh raised on #36821 (though I can't find the comment now), and that I share, is whether or not the PMU will continue counting while executing in the kernel, even if we are not allowed access to kernel stacks. It appears that the answer is no. I tested with an application that does some spinning in userspace as well as performing an expensive system call. Here are the results:

My

Perhaps there is a way to keep counting in kernel mode, but only receive user stacks, which would match the

(Edit: I dug through the kernel code a bit and it doesn't look like this is possible. In fact, it turns out that at least for Intel chips the PMU hardware itself has config bits for ring 0 and ring 3 counting, so the kernel doesn't even have to turn counting off/on on entry/exit.)

I think not counting in kernel mode is perhaps OK for an opt-in API, but it may be too big of a change from pprof to automatically select a PMU profile over a CPU timer. |
I think that comment is #36821 (comment). The design of the perf_event support at the time used a file descriptor per thread, and to prevent the profiler from using many more FDs than GOMAXPROCS it took explicit action to detach the perf_event FD from the thread when making a syscall. So my comment wasn't about a limitation of the perf_event interface (though from your update today, @prattmic, that limitation seems to exist!); it was about how the proposed Go runtime change chose to spend file descriptors.

Yes, a default profiler that reports syscalls as taking no time at all seems likely to confuse.
For what it's worth, the feedback on one of my early attempts to change the default profiler (https://go.dev/cl/204279) included:
and
But, there's a much better reason this time. |
Ok, this can be a good start. If people feel that other events are needed in the future, adding them will be much simpler.
Yes, as you mentioned before, we can keep
perf_event_open has an exclude_kernel option.

If exclude_kernel is not set and the PMU keeps counting in kernel mode, then this is feasible. As for the call chain, we collect it by stack walking in Go, so I don't think there will be any difference. Unless we want to read the fd directly to get the call chain, but there also seem to be options (

In my latest code, if it's pure Go code, I set this option because it helps the precision a little bit. To be honest, there is no right or wrong about counting kernel execution or not. For example, if a kernel operation affects multiple threads (such as allocating a new heap arena), which thread should that time be assigned to? But to be consistent with the behavior of the OS timer, it may be reasonable to let the PMU count kernel events.
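For reference, a sketch of what the exclude_kernel knob looks like at the raw Linux interface, using golang.org/x/sys/unix (this is the kernel API the runtime would wrap; it is not code from any of the CLs discussed here):

package main

import (
	"encoding/binary"
	"fmt"
	"log"
	"runtime"
	"unsafe"

	"golang.org/x/sys/unix"
)

// openCycleCounter opens a PERF_TYPE_HARDWARE cycles counter for the calling
// thread. With excludeKernel set, only user-mode (ring 3) cycles are counted.
func openCycleCounter(excludeKernel bool) (int, error) {
	attr := unix.PerfEventAttr{
		Type:   unix.PERF_TYPE_HARDWARE,
		Size:   uint32(unsafe.Sizeof(unix.PerfEventAttr{})),
		Config: unix.PERF_COUNT_HW_CPU_CYCLES,
		Bits:   unix.PerfBitDisabled,
	}
	if excludeKernel {
		attr.Bits |= unix.PerfBitExcludeKernel
	}
	// pid=0, cpu=-1: measure this thread on whatever CPU it runs on.
	return unix.PerfEventOpen(&attr, 0, -1, -1, unix.PERF_FLAG_FD_CLOEXEC)
}

func main() {
	runtime.LockOSThread() // the counter is attached to this OS thread
	fd, err := openCycleCounter(true)
	if err != nil {
		log.Fatal(err) // often EACCES under a restrictive perf_event_paranoid
	}
	defer unix.Close(fd)

	unix.IoctlSetInt(fd, unix.PERF_EVENT_IOC_ENABLE, 0)
	x := 0
	for i := 0; i < 1e8; i++ { // burn user-mode cycles; kernel time would be invisible
		x += i
	}
	unix.IoctlSetInt(fd, unix.PERF_EVENT_IOC_DISABLE, 0)

	var buf [8]byte
	if _, err := unix.Read(fd, buf[:]); err != nil {
		log.Fatal(err)
	}
	fmt.Println(x, "user-mode cycles:", binary.LittleEndian.Uint64(buf[:]))
}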
Thanks @rhysh for raising this problem. I think we had some discussions in CL 410798 and #36821, and there seems to be no good solution. I didn't deal with this in CL 410798 because I think such cases are really rare, especially for Go, where an OS thread is shared by multiple goroutines. I'd like to hear more people's thoughts on whether this is an issue that needs to be dealt with. |
@rhysh you asked this question in #36821 (comment), and it really stumped me. With the current framework, non-Go threads can only receive profiling signals from the process signal source created by setitimer, so I installed a PMU counter for the process. However, the non-Go threads rarely receive the profiling signals. I thought a process-directed signal would be delivered to a randomly chosen thread, but it doesn't seem to be so random. I looked at the kernel code (complete_signal); it seems that the main thread always gets the signal first. Since our main thread does not block the SIGPROF signal, most of the signals are received by the main thread, and because the main thread has a per-thread PMU counter, it ignores them. I don't know how to fix this problem. |
@rsc Would you mind adding this proposal to the active queue? Thanks. |
Yes, that is correct. However, if perf_event_paranoid is 2 or higher, unprivileged users can only open events that set exclude_kernel.

(N.B., the perf tool automatically detects this case and sets exclude_kernel.)

As far as I know, most Linux distributions set perf_event_paranoid to 2 or higher by default.

I was brainstorming with @cherrymui earlier and one idea that came up is that we could collect both the OS timer and PMU profiles by default. The pprof format supports multiple different sample types in one file, so putting both in one file would not be a problem. I think the main downside is that we'd approximately double the overhead because we'd have SIGPROFs for both profiles. |
Well, that's a bit of a hassle. I wonder if we can do a check before starting the PMU: if the environment has the appropriate permissions we allow PMU profiling, otherwise not. I think development and debugging environments usually meet these conditions. Or we don't count kernel execution at all, which behaves a little differently from the OS timer source, but is actually more accurate for user programs. This is also the default mode of perf.
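The kind of pre-flight check being suggested here could read the kernel's setting directly; a sketch (the fallback policy shown is illustrative, not from the proposal):

package main

import (
	"bytes"
	"fmt"
	"os"
	"strconv"
)

// perfEventParanoid reads /proc/sys/kernel/perf_event_paranoid. Per
// perf_event_open(2): 2 allows only user-space measurements (the default
// since Linux 4.6), 1 allows both kernel and user measurements, 0
// additionally allows CPU-specific data, and -1 removes all restrictions.
func perfEventParanoid() (int, error) {
	data, err := os.ReadFile("/proc/sys/kernel/perf_event_paranoid")
	if err != nil {
		return 0, err // not Linux, or perf_event support is unavailable
	}
	return strconv.Atoi(string(bytes.TrimSpace(data)))
}

func main() {
	level, err := perfEventParanoid()
	switch {
	case err != nil:
		fmt.Println("PMU profiling unavailable:", err)
	case level >= 2:
		fmt.Println("PMU profiling would need exclude_kernel (or privileges)")
	default:
		fmt.Println("PMU profiling can count kernel events, paranoid =", level)
	}
}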
This is a bit tricky because different types of samples have different units, so we need to differentiate between them. |
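The differentiation has a natural home in the format: a pprof file declares a list of sample types, and every sample carries one value per type, each with its own unit. A sketch using github.com/google/pprof/profile (the "cycles" sample type is hypothetical):

package main

import (
	"log"
	"os"

	"github.com/google/pprof/profile"
)

func main() {
	// One profile, three value columns: sample counts, OS-timer CPU time in
	// nanoseconds, and a hypothetical PMU-derived cycle count.
	p := &profile.Profile{
		SampleType: []*profile.ValueType{
			{Type: "samples", Unit: "count"},
			{Type: "cpu", Unit: "nanoseconds"},
			{Type: "cycles", Unit: "count"},
		},
	}
	f, err := os.Create("combined.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := p.Write(f); err != nil { // gzip-compressed proto encoding
		log.Fatal(err)
	}
}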
Certainly we can check permissions ahead of time, but I am skeptical that most development environments will have
Yes, the
Note that this is displaying
Here are all the included types. Select |
I think you are right that the default
But I think what we want to have is one accurate event rather than two less accurate events. |
Based on the discussion above, here is a summary.

There are three problems:

Second, for the PMU to count kernel events,

How to handle it:

Third, it is difficult for non-Go threads to receive signals from the PMU.

Everyone is welcome to brainstorm to move this proposal forward, thanks.

Edit 1: Add Start method. |
This proposal has been added to the active column of the proposals project |
I posted a more fleshed-out and documented version of the base (non-PMU) API in the CPUProfile issue. |
Yes, it is.
This is for consistency with SetCPUProfileRate. If we don't care about this, then I think it's fine to report an error for a negative period.
Ok, makes sense. Updated #53286 (comment) |
Here is the updated API per @erifan's comment:
Are there any remaining concerns about this API? |
Thank you for re-sharing the current proposal, it's been hard to track the changes and to know when to comment. I have a concern about the "cgo" ban, and some (I think) minor clarifications.
On

Now that there's an explicit |
@aclements also leans toward this behavior, and I'm ok with it, so let's change it to return an error in this case.
Yes. Because we can't install a PMU event for threads created in cgo, we can only rely on the process-wide PMU signal source to send them signals. But I found in the implementation that threads created in cgo can hardly receive signals from the process PMU signal source. signal(7) claims that delivery of process-directed signals is random, but through actual testing I found that PMU process signals almost always favor threads with small TIDs. Since the threads created in cgo have larger TIDs, they hardly receive signals (in my test, the process PMU signal source sent 100,000 signals, but cgo threads received only one). Threads created in cgo can, however, receive signals sent by the os-timer process signal source normally. From the kernel code, I think this may be related to the different mechanisms behind PMU signals and os-timer signals, but I don't quite understand the specific details. This problem means we can only choose the os-timer signal source for cgo profiling; otherwise the profile is biased.

As for cgo profiling, it seems to me that it is not a common scenario, and currently only a few architectures support this feature. To enable cgo profiling, you also need to first set functions such as runtime.cgoTraceback. PMU support is likewise Linux-only, so I think it may not be a big problem that the PMU does not support cgo profiling.
Sorry, I don't quite understand this; could you elaborate? Thanks.
Yes, thanks for pointing this out. The updated proposal:
|
I'm not an expert on cgo. Most of what I know about it I learned from working on runtime/pprof's per-thread timers for Go 1.18. There are a few different flavors of "using cgo".
Before Go 1.17, profiling on Linux used setitimer to track CPU usage of the whole process and to use process-targeted signals for determining which code was using CPU time. That had some limitations, #35057. As of Go 1.18, profiling on Linux still uses setitimer for whole-process usage, but also uses timer_create to track usage on each thread that the Go runtime owns, with thread-targeted deliveries. Each SIGPROF says whether it was triggered by setitimer or timer_create, so the runtime can ignore the lower-quality setitimer signals if it knows the thread is also registered to receive signals from timer_create. This leads to good profiling (able to report more than "250%") of the work done by threads that Go created (whether running Go code or C code), and decent profiling (limited to "250%", depending on CONFIG_HZ) of work done by threads created in C code which are running C code. Because the

I think that a perf_events-based profiler that refuses to even try when a program is built with cgo enabled, or in a process where threads that the Go runtime created make calls into C code, will be less useful than one that makes a best effort. I think a caveat like "this only reports on events triggered by threads that the Go runtime created" would be more appropriate. That's case 1 in my list above. |
I should have said that "when using cgo" refers to user code using cgo, or to

I know there are many scenarios where Go interacts with C, but I don't know how to solve the above problem; if you have any solution, please let me know so we don't have to add this limitation. The OS-timer based profiler is still available; users can choose whichever profiler best suits their needs. I think the PMU profiler is only suitable for applicable scenarios; it has many limitations: Linux-only, requiring elevated permissions, and this cgo issue. timer_create is likewise only available on Linux, and one could also argue that an improvement unavailable on Windows and Darwin is less useful; but it improves Linux without affecting other systems, and I think that's valuable. By the same token, the PMU profiler is an improvement for pure Go code on Linux, so why is it less useful? I believe writing Go code in the Go language is the common case.

If a thread created in cgo cannot receive the profiling signal, the profile is biased. But with the OS-timer based profiler we can get a normal profiling result for cgo code. I think a normal result is more appropriate than a biased one. This is why I think profiling cgo programs with the PMU should be disabled instead of merely adding a caveat. |
I worry that it's quite common for Go applications to have just a small amount of cgo use in them, and this would turn the local effects of introducing cgo usage into a global constraint on the process. I'm not actually positive that we can't solve this limitation. This is Linux-specific anyway, so we can use Linux-specific techniques to enumerate all threads in the process when we start profiling. And I think we can set up inheritance so it automatically starts profiling any new threads created later. I'm not sure we can do that in a way that propagates into new threads, but not into new processes. There are various exec-related flags that might have this effect (recalling that Go processes never fork without an exec). I'm also not sure about the signal issues you mention. I assume you're using an overflow signal that triggers on any event in the mmap buffer, so we can do our own stack walk from the signal handler? |
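The Linux-specific enumeration mentioned here could look roughly like this; a sketch only (real runtime code would not go through the os package, and must cope with threads appearing while the listing is in progress):

package main

import (
	"fmt"
	"os"
	"strconv"
)

// listThreadIDs returns the TID of every thread in the current process by
// listing /proc/self/task. A profiler would open one perf_event fd per TID,
// covering C-created threads too, and re-scan (or rely on inheritance) to
// catch threads created after the scan.
func listThreadIDs() ([]int, error) {
	entries, err := os.ReadDir("/proc/self/task")
	if err != nil {
		return nil, err
	}
	tids := make([]int, 0, len(entries))
	for _, e := range entries {
		if tid, err := strconv.Atoi(e.Name()); err == nil {
			tids = append(tids, tid)
		}
	}
	return tids, nil
}

func main() {
	tids, err := listThreadIDs()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("threads:", tids)
}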
Yes, I agree it would be best if we could fix this.
I have tried this method. With this option, threads created in cgo can occasionally receive the profiling signal; without it, they cannot receive it at all. I have tried many different options, but didn't find a good way.
That's what I'm doing now; I haven't changed the logic of the stack walk. |
Let me describe the problem in detail. Test case: https://go.dev/play/p/hImHumVUxFx (derived from TestCgoPprofThread; I found this problem because this test always fails in PMU mode). In the implementation, we install an os-timer or PMU counter for each thread maintained by the Go runtime, and at the same time install an os-timer or PMU counter for the entire Go process to provide signals for non-Go threads. The phenomenon: if the os-timer is used as the process signal source, the threads created in cgo always receive the signal first; if the PMU counter is used as the process signal source, the threads created in cgo rarely receive the profiling signal, causing the test to fail, and almost all signals are received by Go threads.

Analysis: the man page says that a process-directed signal will be delivered to a randomly selected thread within the process, but that does not seem to be the case. Looking at the kernel code, I found that the signal delivery flow is like this, see complete_signal:
To wake a thread, the kernel has to either schedule it or send an IPI to another CPU to interrupt it if it's running. During that time another thread can come in and "steal" the signal from the queue (e.g., if it does a syscall while a signal is pending), so it's not necessarily the case that the thread that was woken always handles the signal. In effect, the first thread to transition from kernel to user space pops the signal from the queue.

For the PMU mode: since the PMU signal is generated from an interrupt handler, there's no user-to-kernel transition that checks the signal queue (execution returns directly from the interrupt handler to user space without going through the kernel scheduler), so according to the above algorithm it's more likely that the main thread will wake up and dequeue the signal instead of the thread that was executing when the interrupt occurred. Therefore, in the above test, most of the process PMU signals are delivered to the main thread (a Go thread), and the threads created in cgo rarely receive signals. Test results:
For the os-timer mode: in the above test, the thread created in cgo is a CPU-bound thread, and during its execution the main thread is sleeping. When the CPU-bound thread is descheduled by the kernel because its time slice has run out or for other reasons, the kernel will update the process timer. Note that the timers are updated just before a thread is about to be scheduled (not when it is descheduled). So the time between the signal being generated by the timer update and the CPU-bound thread returning to user space is very short, and the signal is likely to still be in the queue (i.e., it's rare for the main thread to wake up and start running in that window, because that takes more time). So cgo threads are always more likely to get the signal. Test results:
I'm not sure if this is correct, but it seems to explain the evidence. Also, the cgo thread in the above test seems to be running in a new process? Hope this helps in understanding and resolving the problem. |
Hmm. I'm still wrapping my head around what you wrote, but one thing jumped out at me: if the signal is delivered to the whole process, then doing our own stack walk from the arbitrary thread that receives the signal is meaningless. E.g., suppose we're sampling branch misses and your process has two CPU-bound threads, but one has no branch misses and the other has many. If the PMU overflow signal can be delivered to either thread, then the signal context is unrelated to what caused the event, and doing the trace back from the signal context will result in an entirely incorrect profile. If that's all true, then our only option is to use information from the perf-generated event. We can get stacks from that using frame-pointer unwinding. They won't be exactly the same as Go stack walks, but it would be better than nothing. (Also thanks for digging into this so deeply. :) ) |
Yeah, and I think this also applies to the os-timer based profiler. For example, a process has two threads: one takes all of the CPU time, and the other is sleeping, taking no CPU time, but the process signal is delivered to a random thread. If the signal is received by the sleeping thread, then the generated profile is incorrect. Though it may not be right to say "incorrect": this process signal source (os-timer or PMU counter) measures an indicator of the entire process, so maybe it is simply not appropriate to use it for profiling? I think the "meaningless" you mentioned above is very accurate.
Are you referring to the PC and SP information recorded at the time the overflow occurred? If so, then I feel this is actually only slightly more accurate, since the thread that causes the overflow is also random. However, this method can avoid the extreme situations where a thread has no branch misses at all or occupies no CPU time at all. |
I want to clarify that the current "os-timer" is, on Linux, a combination of two separate profiling mechanisms running concurrently. Some of the shortcomings and behaviors you've described, @erifan, are applicable to one, but not (as I understand it) to the other.

The first is based on setitimer, which is process-wide. The second is based on the per-thread timer_create timers.

When programs that don't use cgo to create threads run on GOOS=linux and profile themselves, they will end up with one process-wide setitimer timer plus one timer_create timer per Go-owned thread.

For programs that use cgo to create threads, the work done in those cgo-owned threads will only earn signals from the process-wide timer. But code running on Go-owned threads will continue to get the higher quality profiling from those threads' individual timers.
@aclements does that mean not collecting goroutine labels / tags, or is there a way to store those that could make them appear in the perf-generated events? |
I mean specifically the PC and SP (and possibly stack) recorded by perf itself in the perf event. This should be extremely accurate (possibly with a few instructions of skew, depending on the exact PMU configuration, but it's not going to be attributed to entirely the wrong PC).
Oh that's a good point. We might be able to still handle that. We can ask perf to record registers including the G register, which would let us peek into the goroutine's labels regardless of which thread the signal lands on. There are definitely some subtleties, like we have to make sure we can safely read another goroutine's labels (we already have to guard against nearly this, so this might already be okay), and we have to be careful to only follow the G register if it's definitely pointing to a G, so not if the sample landed in any C or non-ABIInternal Go code. |
Since the signal source is for the entire process, I don't know what kind of stack trace will be collected by the perf event. There are three possibilities:
I think what we want to get is the stack trace of a single goroutine, but if it is one of the first two, it is a bit troublesome. |
We might be talking past each other here. I don't mean the PC/SP in the signal frame. If the signal is delivered to an arbitrary thread in the process, that is indeed pretty meaningless. The PC/SP gathered by perf comes from the PMU interrupt itself and is thus very representative of the specific instruction that caused the PMU overflow. If you tell perf to unwind stacks, it will unwind from that PC/SP, so if the PMU overflow happened while executing a goroutine, it will collect the stack trace of the goroutine. If it happens while executing kernel code, then it depends on exactly what options you have set, but in general it unwinds the OS thread stack, then leaves a sentinel value in the stack trace, then continues with the user-space stack. I added platform-compatible frame pointers to Go on GOARCH=amd64 years ago almost entirely so that perf could correctly unwind Go stacks. :) There are some weird things in Go stacks that perf doesn't know about, but this works correctly the vast majority of the time. Nowadays, we have platform-compatible frame pointers on more GOARCHes, though I don't remember exactly which. |
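At the perf_event_open level, collecting those stacks means requesting call-chain data in each sample. A sketch of the relevant attr fields with golang.org/x/sys/unix, extending the counter setup shown earlier (the package name and the sampling frequency are placeholders):

package pmuprofile // hypothetical package for this sketch

import "golang.org/x/sys/unix"

// enableCallchainSampling asks the kernel to record the interrupted PC plus a
// frame-pointer-walked call chain in every sample. Samples are then consumed
// from the event fd's mmap'd ring buffer instead of being rebuilt by a Go
// stack walk in the signal handler.
func enableCallchainSampling(attr *unix.PerfEventAttr) {
	attr.Sample_type |= unix.PERF_SAMPLE_IP | unix.PERF_SAMPLE_CALLCHAIN
	attr.Sample_max_stack = 128 // cap call-chain depth
	attr.Bits |= unix.PerfBitFreq
	attr.Sample = 100 // with PerfBitFreq set, this is a frequency in Hz, not a period
}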
@erifan (or anyone else) , do you see any deal-breakers with using stacks gathered directly by perf_events, rather than doing Go tracebacks in the signal handler? |
@aclements Sorry for the late reply due to the Chinese New Year holiday. I haven't tried this before, but it seems to be the only option. I have two concerns; hopefully neither of them is a problem:
No worries. I didn't mean to rush you. :)
I'm not quite sure what you mean. perf_events will just give a list of PCs in the sample. We'll have to copy that into the proto format (via our internal profiling ring buffer). We might also have to do inline expansion on that; I forget if that happens before or after the internal ring buffer. I think for a first pass we can ignore the inline expansion problem. I'm planning to rewrite how we do inline expansion in general, at which point this will become easy to do.
Great! This is probably the only way to be sure. :) |
@erifan, thanks for doing a draft implementation. We will hold the final decision until that is done. |
Placed on hold. |
Change https://go.dev/cl/410797 mentions this issue: |
Change https://go.dev/cl/410796 mentions this issue: |
Change https://go.dev/cl/410798 mentions this issue: |
Change https://go.dev/cl/410799 mentions this issue: |
Hi, sorry for the late reply.

1. I found that the above-mentioned problem, that C threads created in cgo cannot receive the signal from the PMU, may be related to the specific kernel version. Here's what I found:

I haven't figured out exactly what the kernel changes are or which version they started from, because I only happen to have the above environments and haven't tested others. But the test results show that 6.1.12-060112 is indeed good, on both x86 and arm64.

2. Reading samples directly from the perf event is feasible, but my implementation ran into some random crashes. I haven't figured out where the problem is. It seems to be related to GC, but I am not familiar with it, so progress is very slow. So here I want to reconfirm whether this is an issue that will block this proposal going forward. And I also want to clarify that:

The draft implementation based on the new APIs: https://go-review.googlesource.com/c/go/+/410798/12

In addition, I wonder if we can change the parameter of SetPeriod from int64 to int32. |
#42502 proposes configurable CPU profiling via NewCPUProfile + cpuProfile.Start, and that proposal has been accepted. But it does not appear to approve any exported fields, nor does it support any perf event extensions. So I propose to extend it to support configuring perf events.
I propose to add the following exported struct and method to the pprof package.
It would be better if a corresponding Stop method were exported, but it is not strictly necessary; we can correctly stop the profiling using the cached configuration.
If we want to use PMU profiling in the http/pprof and testing packages, then we need to make a few more changes.

For http/pprof, we don't need to add a new interface; we can add some parameters to the profiling URL and then do the corresponding profiling according to those parameters. For example:

go tool pprof http://localhost:6060/debug/pprof/profile?event=cycles&rate=10000

For the testing package, we need to add two options to the go test command: -perfprofile perf.out to specify the output file, and -perfprofileconfig "conf" to specify the configuration. These two options need to be present at the same time to work. The value of -perfprofileconfig must meet the specified format: "{event=EEE rate=RRR}"

Example:

go test -perfprofile perf.out -perfprofileconfig="{event=EEE rate=RRR}" -run .
The introduction of perf events helps to improve the precision of profiling, which is also beneficial for PGO in the future. I have made a preliminary implementation; see the series of patches at https://go-review.googlesource.com/c/go/+/410796/5. The implementation does not make any changes to the http/pprof and testing packages.

This proposal is related to #42502 and #36821.