Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: runtime/pprof: extend profile configuration to support perf events #53286

Open
erifan opened this issue Jun 8, 2022 · 86 comments
Open

Comments

@erifan
Copy link

erifan commented Jun 8, 2022

#42502 Proposes configurable CPU profiling via NewCPUProfile + cpuProfile.Start and the proposal has been accepted. But the proposal does not appear to approve any exported fields, nor does it support any perf event extensions. So I propose to extend this proposal to support configuring perf events.

I propose to add the following exported struct and method to the pprof package.

type PerfProfile {  // We already have struct Profile, name it as PerfProfile
    Event string  // perf event names, such as cycles, instructions
    Rate int  // perf event period or os-timer frequency
}

func (p *PerfProfile) Start(w io.Writer) error {}

It would be better if a corresponding Stop method was exported, but it is not necessary, we can correctly stop the profiling by the cached configuration.

If we want to use PMU profiling in http/pprof and testing packages, then we need to make a few more changes.

For http/pprof, we are not necessary to add a new interface, we can add some parameters to the profiling URL, and then do the corresponding profiling according to the parameters. for example:
go tool pprof http://localhost:6060/debug/pprof/profile?event=cycles&&rate=10000

For the testing package, we need to add two options to the go test command, one is -perfprofile perf.out to specify the output file, and the other is -perfprofileconfig "conf" to specify the configuration. These two options need to exist at the same time to work. The value of -perfprofileconfig must meet the specified format: "{event=EEE rate=RRR}" Example:
go test -perfprofile perf.out -perfprofileconfig="{event=EEE rate=RRR}" -run .

The introduction of perf event helps to improve the precision of profiling, which is also beneficial for PGO in the future. I have made a preliminary implementation, see the series of patches https://go-review.googlesource.com/c/go/+/410796/5, the implementation does not make any changes to the http/pprof and testing packages.

This proposal is related to #42502 and #36821.

@erifan erifan added the Proposal label Jun 8, 2022
@gopherbot gopherbot added this to the Proposal milestone Jun 8, 2022
@prattmic
Copy link
Member

prattmic commented Jun 8, 2022

I have a few minor questions:

  1. What is the output format of these profiles? Still pprof, I assume (rather than perf.data)? In that case, are there special considerations we need to take for particular perf events that may not map nicely to the existing pprof format, or should everything fit nicely?
  2. What is the plan for non-Linux OSes? proposal: runtime/pprof: add PMU-based profiles #36821 (comment) mentions that Darwin and Windows don't have clear PMU interfaces. Is PerfProfile only exported on Linux? Or Start returns an error on non-Linux?

@erifan
Copy link
Author

erifan commented Jun 9, 2022

  1. What is the output format of these profiles? Still pprof, I assume (rather than perf.data)? In that case, are there special considerations we need to take for particular perf events that may not map nicely to the existing pprof format, or should everything fit nicely?

Yes, still pprof. I didn't change the output format, just replace the original signal source (os-timer) with the PMU counter. The signal handler is also basically unchanged. So pprof is fitting well. The only change is to change the original sample type from nanoseconds to count. I'd like to convert count to nanoseconds because sometimes count doesn't seem so intuitive, but there doesn't seem to be a proper way to do this conversion.

  1. What is the plan for non-Linux OSes? proposal: runtime/pprof: add PMU-based profiles #36821 (comment) mentions that Darwin and Windows don't have clear PMU interfaces. Is PerfProfile only exported on Linux? Or Start returns an error on non-Linux?

Yes, my plan is to only support linux, if we find a appropriate method to support PMU on Windows and Darwin in the future, we will add support for them then. It may be appropriate to panic in Start before it is implemented.

@nsrip-dd
Copy link
Contributor

nsrip-dd commented Jun 9, 2022

Would it be possible to measure multiple events simultaneously by creating a PerfProfile for every event of interest? Alternatively, by giving PerfProfile a method like func (*PerfProfile) MeasureEvent(name string, period int) that can be called multiple times, and having all events go in the same profile?

@erifan
Copy link
Author

erifan commented Jun 10, 2022

Would it be possible to measure multiple events simultaneously by creating a PerfProfile for every event of interest? Alternatively, by giving PerfProfile a method like func (*PerfProfile) MeasureEvent(name string, period int) that can be called multiple times, and having all events go in the same profile?

My prototype implementation doesn't account for multiple events, but it's not hard to support multiple events. I can do it however I'm not a proposal approver. Thanks.

@prattmic
Copy link
Member

prattmic commented Jul 25, 2022

What is the plan for non-Linux OSes?

I did some research today on other OSes today to get a sense of the capabilities.

Windows

The Windows Performance Recorder CLI supports collecting some PMU events. There is an API to control profile collection, though I believe this communicates with a centralized service that actually does profile collection. As far as I can tell, there is no direct kernel interface. Additionally, it is not possible to limit collection to a specific process as far as I can tell.

Unanswered questions:

  1. Is it possible to collect stack traces / auxiliary information with events, or only counts?
  2. What permissions are required to use this interface? (Given the system-wide collection, I imagine elevated privileges are required.)

Open source libraries like PCM use a custom Windows kernel driver to access PMUs.

Visual Studio has a command capable of collecting PMU events for a specific process. It is unclear to me how this works. Does it use the WPR API?

My takeaway is that there isn't really anything feasible for us to programmatically integrate with, unless I've missed something.

Darwin

Xcode Instruments supports collecting counts from PMU events. How it does so is unclear.

PCM requires a custom kernel driver, like on Windows.

The Darwin kernel has a PMC subsystem called kpc. It looks like this is exposed to userspace, but is only usable by one "blessed" process?

I suspect that there is nothing feasible here either, but I am less confident than with Windows.

FreeBSD

FreeBSD has a complete libpmc library for accessing hardware performance counters in counting or sampling mode (includes support for sending SIGPROF). Internally this uses system calls to a kernel module. I haven't looked into this in much detail, but clearly there is something feasible here.

@erifan
Copy link
Author

erifan commented Jul 26, 2022

Thanks @prattmic for the research, I don't know much about other OSes, I'll look into how to access the PMU programmatically on other OS's. I wonder if you guys @golang/windows can provide some hints for Windows?

Also I wonder if it is really difficult to access PMU on a OS, can we not support it on this OS? Just like per thread OS timer, we also only support it on Linux.

@prattmic
Copy link
Member

I should have added more context to my previous comment. The purpose of my research was to get a sense of how hard we should try to design a platform-independent API (even if only Linux is supported initially). My initial take-away is that it is probably less important if Windows and Darwin will be impossible to support.

@prattmic
Copy link
Member

prattmic commented Jul 26, 2022

In this proposal, you are proposing a new type PerfProfile. However, your original CL introduced the previously approved (#42502) CPUProfile type and extended it with perf event support.

I actually like that approach better than adding a new top-level perf-specific profile type for a few reasons:

  • Most of the events that you’d use are CPU profiles. The existing StartCPUProfile is a CPU time profile. The standard perf cycles event would be a CPU cycles profile. Similar with instructions, branch mispredictions. Even events like L1D cache misses are attributed to specific instructions, and thus I think could be reasonably considered CPU profiles.
    • The main exceptions are much more obscure events that measure things like DRAM read/write bandwidth. I am not sure these would make much sense to use in a sampling mode anyways.
  • I think one of the most exciting parts of this proposal is the potential to automatically provide more accurate CPU profiles to everyone using StartCPUProfile/CPUProfile/-cpuprofile if the PMU is available. I think this conceptually make the most sense if the PMU options are directly in CPUProfile.
    • That said, perhaps CPUProfile could just provide options for timer (ITIMER_PROF), cycles (PMU), cycles-precise (PMU+PEBS equivalent) and any more advanced usage is in another profile type.

I also think that using Set methods rather than exported fields provides more flexibility on the struct internals plus the ability to return errors, though I don’t feel too strongly about this as it does make the API below more verbose. One such possible API:

type CPUProfile struct { ... }

func (*CPUProfile) SetRate(int) error
func (*CPUProfile) SetSource(CPUProfileSource) error

type CPUProfileSource int

const (
	CPUTimer CPUProfileSource = iota
	CPUCycles
	CPUCyclesPrecise
	Custom // Configure via Sys().
)

// Returns *LinuxCPUProfile on Linux.
func (*CPUProfile) Sys() any

// LinuxCPUProfile contains Linux perf specific advanced configuration options.
type LinuxCPUProfile struct{ ... }

// Use one of the generic PERF_TYPE_HARDWARE events.
func (*LinuxCPUProfile) SetHardwareSource(LinuxCPUProfileHardwareSource) error

// Use a raw hardware event (PERF_TYPE_RAW). Format of config, config1, and config2 are microarchitecture dependent.
func (*LinuxCPUProfile) SetRawSource(config, config1, config2 uint64) error

// Additional options. e.g., setting the “precise” flag.

The idea of hiding OS-dependent details behind a Sys() method comes from os.ProcessState.Sys() and os/exec.Cmd.SysProcAttr. I’m not the biggest fan of this pattern, but it is nice to be able to provide advanced options without needing to an OS-independent API for every feature, especially given #53286 (comment) indicating that we are unlikely to support Windows or Darwin anyways.


Another big unanswered question is how events should be described. The proposed PerfProfile API says Event string // perf event names, such as cycles, instructions. Are these names as described in the documentation for perf record -e and shown in perf list?

That would be great, but presents a big problem: these names aren’t understood directly by the kernel interface.

The simple generic events like cycles and instructions are simply stringified versions of PERF_TYPE_HARDWARE, so those are simple to support. However, another event I’ve used recently for example is cycle_activity.stalls_mem_any. This is defined here (for Skylake, other files for other microarches). This JSON file is used to generate code for the userspace perf tool that describes the raw event values for this named event. libpfm4 has a similar giant table that describes this event.

There is such a huge number of possible events here that I don’t think it is feasible for the Go standard library to be in the business of understanding the names of every single event. In the proposed API above, I’ve addressed this by directly supporting the generic PERF_TYPE_HARDWARE events and for anything beyond that just providing SetRawSource that accepts PERF_TYPE_RAW config values (which would be derived from the EventCode, UMask, etc seen in the JSON file). SetRawSource is not user-friendly, but it provides just enough API for a third-party libpfm4-style package to understand the full set of possible events to help users configure it.

@prattmic
Copy link
Member

I will also note that as a first-step proposal, we could just take the simple part of the hardware CPU profile API above. i.e.,

type CPUProfile struct { ... }

func (*CPUProfile) SetRate(int) error
func (*CPUProfile) SetSource(CPUProfileSource) error

type CPUProfileSource int

const (
	CPUTimer CPUProfileSource = iota
	CPUCycles
	CPUCyclesPrecise
)

That is, we add support for using a hardware cycle counter for CPU profiles (perhaps exposing a ‘precise’ option, perhaps not). Nothing more. Under the hood this uses the perf interface, but we don’t expose an API for using any other event types.

@erifan
Copy link
Author

erifan commented Jul 27, 2022

I will also note that as a first-step proposal, we could just take the simple part of the hardware CPU profile API above.

I don't have a particularly strong opinion on how to define this API, so I think your proposal is also good, not exposing the internal fields of CPUProfile gives more flexibility to our implementation. And I don't think it's necessary to support too much events, including raw hardware event?

Most of the events that you’d use are CPU profiles

Yes, I think we will only support a few common PERF_TYPE_HARDWARE events. Do we have to support too many perf events? Because in my opinion our main purpose is to improve the CPU profiling precise of the pprof, not to implement a complete perf .

Are these names as described in the documentation for perf record -e and shown in perf list?

Yes, they are. In contrast to exported constants, we don't need to export these strings, we won't support a lot of perf events.

const (
CPUTimer CPUProfileSource = iota
CPUCycles
CPUCyclesPrecise
)
Maybe we can support a little more events. CL 410799 supported 6 events:

cycles,
instructions,
cache-references,
cache-misses,
branch-misses,
bus-cycles.

I think at least cycles, instructions, cache-misses and branch-misses are still quite important.

My initial take-away is that it is probably less important if Windows and Darwin will be impossible to support.

So this is the basis of the above discussion. Does it make sense if we can't find a proper way to access the PMU in Darwin and Windows? As far as I know, there is currently no way to access the Arm PMU on Windows, Arm has some work in progress to provide low level profiling infrastructure and tool for Windows, but it also seems to be struggling to meet our needs. So can we support the above API only on Linux? If so then what's the behavior of SetSource on unsupported OSes? Panic or use os timer by default?

@prattmic
Copy link
Member

prattmic commented Jul 27, 2022

Yes, I think we will only support a few common PERF_TYPE_HARDWARE events. Do we have to support too many perf events? Because in my opinion our main purpose is to improve the CPU profiling precise of the pprof, not to implement a complete perf .

I completely agree. I was trying to be flexible since I've seen desire for arbitrary events (such as in #36821), but I think focusing just on cycles first makes the most sense.

Maybe we can support a little more events. I think at least cycles, instructions, cache-misses and branch-misses are still quite important.

I think it is hard to decide what is important. There was a lot of discussion on #36821 about paring down the initial proposal to focus narrowly on improving CPU profiles without adding tons of extra options, and I think that is pertinent here. If there is general agreement that just cycles is valuable, then I think we start there.

So can we support the above API only on Linux? If so then what's the behavior of SetSource on unsupported OSes? Panic or use os timer by default?

Yes, it would only be supported on Linux.

The way I envisioned this API is that NewCPUProfile() automatically selects the best available profile source. Users that know they want a specific profile source can call SetSource, which returns an error if that source is not available (either due to unsupported OS or lack of HW support). The primary use-cases I imagine for this API are:

  1. Folks that always want the timer-based profile either for more consistency across all profiles in their fleet or they have tooling that expects a cpu-s profile.
  2. Folks that definitely want a precise cycles profile and would rather not profile at all if that isn't available.

P.S. it's possible that (1) will prevent us from making CPUProfile/StartCPUProfile automatically use PMU profiles because it is too much of a behavior change. I think this needs more thought, but I hope it isn't a blocker because I'd love to magically provide better profiles for a new release rather than hiding them behind a method most folks won't use.

P.P.S. since we are introducing the first implementation of NewCPUProfile with this proposal, a halfway point would be for NewCPUProfile to automatically select the best available source, but StartCPUProfile to always use the OS timer. Existing code wouldn't automatically get better profiles, but StartCPUProfile would be effectively deprecated, so new code would at least get better profiles by default.

@prattmic
Copy link
Member

P.S. it's possible that (1) will prevent us from making CPUProfile/StartCPUProfile automatically use PMU profiles because it is too much of a behavior change. I think this needs more thought, but I hope it isn't a blocker because I'd love to magically provide better profiles for a new release rather than hiding them behind a method most folks won't use.

I will point out that https://pkg.go.dev/runtime/pprof#StartCPUProfile does not promise that profiles are measured in cpu-seconds, only that this provides a "CPU profile", without ever defining what that means. But of course Hyrum's Law doesn't care what the API says. :)

@prattmic
Copy link
Member

prattmic commented Jul 27, 2022

One concern that I believe @rhysh raised on #36821 (though I can't find the comment now) and that I share is whether or not the PMU will continue counting while executing in the kernel, even if we are not allowed access to kernel stacks. It appears that the answer is no.

I tested with an application that does some spinning in userspace as well as performs an expensive system call.

Here are the results:

pprof (CPU timers):
image

sudo perf -e cycles (kernel+user):
image

perf -e cycles:u (user only):
image

My perf_event_paranoid == 2, which prevents access to kernel sampling. Simply using perf record -e cycles has the same behavior of auto-selecting user mode only.

Perhaps there is a way to keep counting in kernel mode, but only receive user stacks, which would match the pprof behavior, but honestly I doubt it since I can imagine that even getting counts with small periods could provide interesting side-channel security insights into what the kernel is doing that they'd want to stop.

(Edit: I dug through the kernel code a bit and it doesn't look like this is possible. In fact, it turns out that at least for Intel chips the PMU hardware itself has config bits for ring 0 and ring 3 counting, so the kernel doesn't even have to turn counting off/on on entry/exit.)

I think not counting is kernel mode is perhaps OK for an opt-in API, but it may be too big of a change from pprof to automatically select a PMU profile over a CPU timer.

@rhysh
Copy link
Contributor

rhysh commented Jul 28, 2022

One concern that I believe @rhysh raised on #36821 (though I can't find the comment now) and that I share is whether or not the PMU will continue counting while executing in the kernel, even if we are not allowed access to kernel stacks. It appears that the answer is no.

I think that comment is #36821 (comment) . The design of the perf_event support at the time used a file descriptor per thread, and to prevent the profiler from using many more FDs than GOMAXPROCS it took explicit action to detach the perf_event FD from the thread when making a syscall. So my comment wasn't about a limitation of the perf_event interface (though from your update today, @prattmic , that limitation seems exist!), it was about how the proposed Go runtime change chose to spend file descriptors.

Yes, a default profiler that reports syscalls as taking no time at all seems likely to confuse.


I will point out that https://pkg.go.dev/runtime/pprof#StartCPUProfile does not promise that profiles are measured in cpu-seconds, only that this provides a "CPU profile", without ever defining what that means. But of course Hyrum's Law doesn't care what the API says. :)

For what it's worth, the feedback on one of my early attempts to change the default profiler (https://go.dev/cl/204279) included:

Right now, in practice, everybody sets the profiling rate to 100, runtime/pprof.StartCPUProfile. [...] I'm concerned that this will confuse people used to the standard rate.

and

I expect that many people look at the sample count and mentally divide by 100 to get the time spent.

But, there's a much better reason this time.

@erifan
Copy link
Author

erifan commented Jul 28, 2022

If there is general agreement that just cycles is valuable, then I think we start there.

Ok, this can be a good start. If people feel that other events are needed in the future, it will be much simpler.

I will point out that https://pkg.go.dev/runtime/pprof#StartCPUProfile does not promise that profiles are measured in cpu-seconds, only that this provides a "CPU profile", without ever defining what that means. But of course Hyrum's Law doesn't care what the API says. :)

Yes, as you mentioned before, we can keep StartCPUProfile still using the OS timer so that its behavior doesn't change. For users who don't care about the unit of the sample and only care about the profiling precision, we choose the best available source for them in NewCPUProfile. For users who wish to use a definite source can use SetSource to set the source.

One concern that I believe @rhysh raised on #36821 (though I can't find the comment now) and that I share is whether or not the PMU will continue counting while executing in the kernel, even if we are not allowed access to kernel stacks. It appears that the answer is no.

perf_event_open has an option exclude_kernel, enabling it can exclude kernel events, so my understanding is that if it is not set, the kernel events will be counted, that is, the PMU will count in the kernel execution state. I'm not sure if you are referring to this.

Perhaps there is a way to keep counting in kernel mode, but only receive user stacks, which would match the pprof behavior

If exclude_kernel is not set and the PMU keeps counting in the kernel mode, then this is feasible. As for the callchain, we collect it by stack tracing in go, so I don't think there will be any difference. Unless we want to read the FD directly to get the callchain, but there also seems to be options (exclude_callchain_kernel and exclude_callchain_user ) to control whether to include the user or kernel callchain.

In my latest code, if it's pure go code, I set this option because it helps the precision a little bit. To be honest, there is no right or wrong to count the execution of the kernel or not. For example, if a kernel operation affects multiple threads (such as allocating a new heap arena), which thread is appropriate to assign this time to? But considering that to be consistent with the behavior of the OS timer, it may be reasonable to let the PMU counting the event of the kernel.

So my comment wasn't about a limitation of the perf_event interface (though from your update today, @prattmic , that limitation seems exist!), it was about how the proposed Go runtime change chose to spend file descriptors.

Thank @rhysh for raising this problem, I think we have some discussions in CL 410798 and 36821, and there seems to be no good solution for this problem. I didn't deal with this in CL 410798 because I think such cases are really rare, especially for go, where OS thread is shared by multiple goroutines. I'd like to hear more people's thoughts on whether this is an issue that needs to be dealt with.

@erifan
Copy link
Author

erifan commented Jul 28, 2022

How does the perf_event_open profiler deal with threads that aren't created by the Go runtime?

@rhysh you asked this question in #36821 (comment), and this question really stumped me. With the current framework, non-go threads can only receive profiling signals from the process signal source created by setitimer, so I installed a PMU counter for the process. However the non-go thread rarely receive the profiling signals, I thought the process signal would randomly choose a thread to receive, but it doesn't seem to be so random. I looked at the code ( complete_signal) of the kernel, it seems that the main thread always gets the signal first. Since our main thread does not block the SIGPROF signal, most of the signals are received by the main thread, and the main thread has a per-thread PMU counter, so it ignores the signal. I don't know how to fix this problem.

@erifan
Copy link
Author

erifan commented Jul 28, 2022

@rsc Would you mind adding this proposal to active queue ? Thanks.

@prattmic
Copy link
Member

perf_event_open has an option exclude_kernel, enabling it can exclude kernel events, so my understanding is that if it is not set, the kernel events will be counted, that is, the PMU will count in the kernel execution state. I'm not sure if you are referring to this.

Yes, that is correct. However, if /proc/sys/kernel/perf_event_paranoid >= 2, then setting exclude_kernel=false requires CAP_PERFMON or CAP_SYS_ADMIN (effectively, you must be root): https://elixir.bootlin.com/linux/v5.18.14/source/include/linux/perf_event.h#L1351

(N.B., the perf tool automatically detects this case and sets exclude_kernel=true if you are unpriviledged)

As far as I know, must Linux distributions set /proc/sys/kernel/perf_event_paranoid to 2 by default. In practice this means that very few Go programs will be able to enable kernel counting.

I was brainstorming with @cherrymui earlier and one idea that came up is that we could collect both the OS timer and PMU profiles by default. The pprof format supports multiple different sample types in one file, so putting both in one file would not be a problem. I think the main downside is that we'd approximately double the overhead because we'd have SIGPROFs for both profiles.

@erifan
Copy link
Author

erifan commented Jul 29, 2022

However, if /proc/sys/kernel/perf_event_paranoid >= 2, then setting exclude_kernel=false requires CAP_PERFMON or CAP_SYS_ADMIN (effectively, you must be root)

Well, that's a bit of a hassle. I wonder if we can do a check on this before starting the PMU, if the environment has the appropriate permissions we will allow the PMU profiling, otherwise not. I think usually the development environment and debugging environment meet these conditions.

Or we don't count the execution of the kernel, which behaves a little differently from the OS timer source, but is actually more accurate for user programs. This is also the default mode of perf.

one idea that came up is that we could collect both the OS timer and PMU profiles by default.

This is a bit tricky because different types of samples have different units, so we need to differentiate between them.

@prattmic
Copy link
Member

I wonder if we can do a check on this before starting the PMU, if the environment has the appropriate permissions we will allow the PMU profiling, otherwise not. I think usually the development environment and debugging environment meet these conditions.

Certainly we can check permissions ahead of time, but I am skeptical that most development environments will have perf_event_paranoid < 2. Perhaps we should do a survey of the defaults across some common Linux distributions.

This is a bit tricky because different types of samples have different units, so we need to differentiate between them.

Yes, the pprof tool treats them as completely different, and only displays one at a time. You can see an example of this behavior by collecting a perf.data with multiple counters:

$ perf record -e cycles,instructions ./set
$ pprof perf.data                                                                                                                           
File: set
perf-version:5.17.11
perf-command:/usr/bin/perf record -e cycles,instructions ./set
Type: instructions:u_event
Duration: 5.43s, Total samples = 18125609784
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

Note that this is displaying Type: instructions:u_event.

(pprof) options
...
  sample_index              = 0                    //: [cycles:u_sample | cycles:u_event | instructions:u_sample | instructions:u_event]
...

Here are all the included types. Select cycles:u_event with pprof -sample_index=1.

@erifan
Copy link
Author

erifan commented Aug 1, 2022

but I am skeptical that most development environments will have perf_event_paranoid < 2. Perhaps we should do a survey of the defaults across some common Linux distributions.

I think you are right that the default perf_event_paranoid value is 2 and 3 or 4 on some distributions, I wonder if we can make PMU profiling a feature that requires users to give more permissions or privileged users, similar with perf tool. Or don't count kernel events.

Yes, the pprof tool treats them as completely different, and only displays one at a time. You can see an example of this behavior by collecting a perf.data with multiple counters:

But I think what we want to have is one accurate event rather than two less accurate events.

@erifan
Copy link
Author

erifan commented Aug 1, 2022

Based on the discussion above, I make a summary.
The proposal (by @prattmic) is this, I removed CPUCyclesPrecise, target for Linux only.

// 1, Default event type is os-timer.
// 2, Default period is provided.
type CPUProfile struct { ... }

// 1, The unit of the period is related to the event type. For time-based events such as os-timer, it's nanosecond,
//  for other events, it is the count of events that passed between samples.
// 2, If the parameter <= 0, reset it to 0, stop profiling.
// 3, The profiling period has a minimum threshold to prevent the profiling rate from being too high.
// When the parameter value exceeds the threshold, it will be adjusted to the minimum threshold.
// 4, If c is nil, report an error and return.
// 5, If the profiling has been started, calling this function will report an error, but will not affect the running profiling.
func (c *CPUProfile) SetPeriod(int64) error

// 1, If c is nil, report an error and return.
// 2, If the profiling has been started, we can't resest the event by call SetEvent.
// 3, If the source is not available, report and error and return.
// 4, The second parameter indicates whether to count kernel events. For the os-timer event, this parameter is always true.
// 5, For PMU to count the kernel events, the CAP_PERFORM permission or the admin permission is required, if the condition is not met and the second parameter is also specified as true, an error will be reported.
// 6, If cgo is used, only the CPUTimer event can be used.
// 7, The precise_ip of the PMU event is set to the maximum value available.
func (c *CPUProfile) SetEvent(CPUProfileEvent, bool) error

// SetSampleKernel sets whether to count the kernel events, pass in “true” to count the kernel events, otherwise not count.
// This function has no effect on os-timer based events, because os-timer events always count kernel events.
// To count the kernel events, the CAP_PERFORM permission or the admin permission is required.
func (c *CPUProfile) SetSampleKernel(bool) error

// 1, If the method receiver c is nil, report an error.
// 2, If the profiling has been started, report an error.
func (c *CPUProfile) Start(w io.Writer) error

// Stop stops a running CPUProfile and flushes any pending writes.
// It returns only after all writes for the profile have completed.
// If c is not running, Stop does nothing and returns nil.
//
// It is the caller's responsibility to close the profile's Writer if appropriate.
//
// A CPUProfile can be started again after calling Stop.
func (c *CPUProfile) Stop() error

type CPUProfileEvent int

const (
	CPUTimer CPUProfileEvent = iota
	CPUCycles
        CPUBranchMisses
        CPUCacheMisses
        CPUInstructions
)

There are three problems:
First, FD may be depleted. But that's probably not a problem as this is very rare.
Disposal method: Do not handle.

Second, for PMU to count the kernel events, perf_event_open requires the CAP_PERFORM permission or the admin permission, otherwise the behavior of the PMU will be inconsistent with the behavior of the OS timer.

Disposal method:

  1. Ask the user to give the appropriate permissions.
  2. Kernel events are not counted if the permissions are not satisfied.

Third, it is difficult for non-go threads to receive signals from the PMU.
Disposal method:
If CGO profiling is on, only os-timer events can be used.

Everyone is welcome to brainstorm to move this proposal forward, thanks.

Edit1: Add Start method.
Edit2: Added some problem and behavior descriptions that I'm not sure about.
Edit3: Change func (c *CPUProfile) SetRate(int) error to func (c *CPUProfile) SetPeriod(int64) error.
Rename CPUProfileSource to CPUProfileEvent.
Rename func (c *CPUProfile) SetSource(CPUProfileSource) error to func (c *CPUProfile) SetEvent(CPUProfileEvent) error.
Add func SetPMUEventPrecise(PMUEventPrecise ) error and type PMUEventPrecise int.
Add CPUBranchMisses, CPUCacheMisses and CPUInstructions.
Add Disposal methods for the above three problems.
Edit4: Change func SetPMUEventPrecise(PMUEventPrecise) error to func (c *CPUProfile) SetPMUEventSkid(PMUEventSkid) error
Add func (c *CPUProfile) Stop(w io.Writer) error method
Modified some behavior descriptions.
Edit5: Modified some behavior descriptions.
Changes func (c *CPUProfile) SetEvent(CPUProfileEvent) error to func (c *CPUProfile) SetEvent( CPUProfileEvent, bool) error
Edit6: Add method func (c *CPUProfile) SetSampleKernel(bool) error
Edit7: updated the behavior description of the Start method

@rsc rsc moved this to Incoming in Proposals Aug 10, 2022
@rsc rsc added this to Proposals Aug 10, 2022
@rsc
Copy link
Contributor

rsc commented Sep 7, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@aclements
Copy link
Member

I posted a more fleshed-out and documented version of the base (non-PMU) API in the CPUProfile issue.

@erifan
Copy link
Author

erifan commented Nov 24, 2022

I would expect SetPeriod to always return an error without affecting the profile if the profile is already running.

Yes, it is.

2, If the parameter <= 0, reset it to 0, stop profiling.

This is for consistency with SetCPUProfileRate , If we don't care about this, then I think it's fine to report an error for a negative period.

pprof.StartCPUProfile says that it "returns an error if profiling is already enabled." I think that's both reasonable, and we want to be consistent. In this case, I think Start should definitely return an error if c is already running, and should be allowed to return an error if any CPU profile is running.

Ok, make sense. Updated #53286 (comment)

@rsc
Copy link
Contributor

rsc commented Dec 14, 2022

Here is the updated API per @erifan's comment:

// 1, Default event type is os-timer.
// 2, Default period is provided.
type CPUProfile struct { ... }

// 1, The unit of the period is related to the event type. For time-based events such as os-timer, it's nanosecond,
//  for other events, it is the count of events that passed between samples.
// 2, If the parameter <= 0, reset it to 0, stop profiling.
// 3, The profiling period has a minimum threshold to prevent the profiling rate from being too high.
// When the parameter value exceeds the threshold, it will be adjusted to the minimum threshold.
// 4, If c is nil, report an error and return.
// 5, If the profiling has been started, calling this function will report an error, but will not affect the running profiling.
func (c *CPUProfile) SetPeriod(int64) error

// 1, If c is nil, report an error and return.
// 2, If the profiling has been started, we can't resest the event by call SetEvent.
// 3, If the source is not available, report and error and return.
// 4, The second parameter indicates whether to count kernel events. For the os-timer event, this parameter is always true.
// 5, For PMU to count the kernel events, the CAP_PERFORM permission or the admin permission is required, if the condition is not met and the second parameter is also specified as true, an error will be reported.
// 6, If cgo is used, only the CPUTimer event can be used.
// 7, The precise_ip of the PMU event is set to the maximum value available.
func (c *CPUProfile) SetEvent(CPUProfileEvent, bool) error

// SetSampleKernel sets whether to count the kernel events, pass in “true” to count the kernel events, otherwise not count.
// This function has no effect on os-timer based events, because os-timer events always count kernel events.
// To count the kernel events, the CAP_PERFORM permission or the admin permission is required.
func (c *CPUProfile) SetSampleKernel(bool) error

// 1, If the method receiver c is nil, report an error.
// 2, If the profiling has been started, report an error.
func (c *CPUProfile) Start(w io.Writer) error

// Stop stops a running CPUProfile and flushes any pending writes.
// It returns only after all writes for the profile have completed.
// If c is not running, Stop does nothing and returns nil.
//
// It is the caller's responsibility to close the profile's Writer if appropriate.
//
// A CPUProfile can be started again after calling Stop.
func (c *CPUProfile) Stop() error

type CPUProfileEvent int

const (
	CPUTimer CPUProfileEvent = iota
	CPUCycles
        CPUBranchMisses
        CPUCacheMisses
        CPUInstructions
)

Are there any remaining concerns about this API?

@rhysh
Copy link
Contributor

rhysh commented Dec 15, 2022

Thank you for re-sharing the current proposal, it's been hard to track the changes and to know when to comment. I have a concern about the "cgo" ban, and some (I think) minor clarifications.

// 2, If the parameter <= 0, reset it to 0, stop profiling. looks like that behavior is about compatibility with runtime.SetCPUProfileRate. From #40094, I don't think it should influence new APIs. I'd guess that calling with <=0 would result in an unusable CPUProfile, and would likely return an error. See also // 5, If the profiling has been started, calling this function will report an error, but will not affect the running profiling..

On // 6, If cgo is used, only the CPUTimer event can be used., what does it mean to use cgo? I take this to be about the difficulty of tracking threads in the process that aren't known to the Go runtime, created via cgo+pthreads or similar. But I wouldn't expect using the system's DNS resolver—or compiling+linking a program capable of selecting that resolver—to be enough cgo to cause a problem. I'd hope for "best effort" here, where a process that contains 0, or 1, or more pthread-created threads would report on behavior in the Go-runtime-owned threads.

Now that there's an explicit SetSampleKernel(bool) error method, the second arg of SetEvent(CPUProfileEvent, bool) error should go away.

@erifan
Copy link
Author

erifan commented Dec 15, 2022

I'd guess that calling with <=0 would result in an unusable CPUProfile, and would likely return an error.

@aclements also lean to this behavior, I'm ok with it, so let's change it to return an error for this case.

On // 6, If cgo is used, only the CPUTimer event can be used., what does it mean to use cgo? I take this to be about the difficulty of tracking threads in the process that aren't known to the Go runtime, created via cgo+pthreads or similar.

Yes. Because we can't install a PMU event for the threads created in cgo, we can only rely on the process PMU signal source to send them signals. But I found in the implementation that the threads created in cgo can hardly receive the signal from the process pmu signal source. The signal(7) claims that the delivering of process signals is random, but through actual testing, I found that PMU process signals almost always give priority to threads with small tids. Since the threads created in cgo have larger tids, they hardly receive signals (in my test, the process pmu signal source sent 100,000 signals, but cgo threads received only one). But the thread created in cgo can receive the signal sent by the os-timer process signal source normally. By checking the kernel code, I think this may be related to the mechanism of pmu signal and os-timer signal, but I don't quite understand the specific details. This problem causes we can only chose os-timer signal source for cgo profiling, otherwise the profiling is biased.

For cgo profiling, it seems to me that it is not a common scenario, and currently only a few architectures support this feature. And to enable cgo profiling, you need to first set functions such as runtime.cgoTraceback. The PMU support is also only for the Linux system, so I think it may not be a big problem that the PMU does not support cgo profiling.

But I wouldn't expect using the system's DNS resolver—or compiling+linking a program capable of selecting that resolver—to be enough cgo to cause a problem. I'd hope for "best effort" here, where a process that contains 0, or 1, or more pthread-created threads would report on behavior in the Go-runtime-owned threads.

Sorry I don't quite understand this, could you elaborate on this ? Thanks.

Now that there's an explicit SetSampleKernel(bool) error method, the second arg of SetEvent(CPUProfileEvent, bool) error should go away.

Yes, thanks for pointing this out.

The updated proposal:

// 1, Default event type is os-timer.
// 2, Default period is provided.
type CPUProfile struct { ... }

// 1, The unit of the period is related to the event type. For time-based events such as os-timer, it's nanosecond,
//  for other events, it is the count of events that passed between samples.
// 2, If the parameter <= 0, report an error and return.
// 3, The profiling period has a minimum threshold to prevent the profiling rate from being too high.
// When the parameter value exceeds the threshold, it will be adjusted to the minimum threshold.
// 4, If c is nil, report an error and return.
// 5, If the profiling has been started, calling this function will report an error, but will not affect the running profiling.
func (c *CPUProfile) SetPeriod(int64) error

// 1, If c is nil, report an error and return.
// 2, If the profiling has been started, we can't resest the event by call SetEvent.
// 3, If the source is not available, report and error and return.
// 4, If cgo is used, only the CPUTimer event can be used.
// 5, The precise_ip of the PMU event is set to the maximum value available.
func (c *CPUProfile) SetEvent(CPUProfileEvent) error

// SetSampleKernel sets whether to count the kernel events, pass in “true” to count the kernel events, otherwise not count.
// This function has no effect on os-timer based events, because os-timer events always count kernel events.
// To count the kernel events, the CAP_PERFORM permission or the admin permission is required.
func (c *CPUProfile) SetSampleKernel(bool) error

// 1, If the method receiver c is nil, report an error.
// 2, If the profiling has been started, report an error.
func (c *CPUProfile) Start(w io.Writer) error

// Stop stops a running CPUProfile and flushes any pending writes.
// It returns only after all writes for the profile have completed.
// If c is not running, Stop does nothing and returns nil.
//
// It is the caller's responsibility to close the profile's Writer if appropriate.
//
// A CPUProfile can be started again after calling Stop.
func (c *CPUProfile) Stop() error

type CPUProfileEvent int

const (
	CPUTimer CPUProfileEvent = iota
	CPUCycles
        CPUBranchMisses
        CPUCacheMisses
        CPUInstructions
)

@rhysh
Copy link
Contributor

rhysh commented Dec 15, 2022

But I wouldn't expect using the system's DNS resolver—or compiling+linking a program capable of selecting that resolver—to be enough cgo to cause a problem. I'd hope for "best effort" here, where a process that contains 0, or 1, or more pthread-created threads would report on behavior in the Go-runtime-owned threads.

Sorry I don't quite understand this, could you elaborate on this ? Thanks.

I'm not an expert on cgo. Most of what I know about it I learned from working on runtime/pprof's per-thread timers for Go 1.18.

There are a few different flavors of "using cgo".

  1. A Go function running on a Go-managed thread can use cgo to call into C code. This is somewhat common, including in the net package: https://pkg.go.dev/net#hdr-Name_Resolution
  2. A C function can create a new thread, which won't be known to the Go runtime. That thread can run C code.
  3. A C function can create a new thread, which won't be known to the Go runtime. That thread can call into Go code.
  4. Other combinations, probably. C-owned thread calling a Go function that calls back into C, for instance.

Before Go 1.17, profiling on Linux used setitimer to track CPU usage of the whole process and to use process-targeted signals for determining which code was using CPU time. That had some limitations, #35057. As of Go 1.18, profiling on Linux still uses setitimer for whole-process usage, but also uses timer_create to track usage on each thread that the Go runtime owns, with thread-targeted deliveries. Each SIGPROF says whether it was triggered by setitimer or timer_create, so the runtime can ignore the lower-quality setitimer signals if it knows the thread is also registered to receive signals from timer_create. This leads to good profiling (able to report more than "250%") of the work done by threads that Go created (whether running Go code or C code), and decent profiling (limited to "250%", depending on CONFIG_HZ) of work done by threads created in C code which are running C code.

Because the net package is able to use the system's native resolver, it's common for Go programs on GOOS=linux to be built with cgo enabled. And depending on the system configuration (what's in /etc/resolv.conf, whether the operator runs the program with GODEBUG=netdns=cgo, etc), it's easy to end up with a process that uses Go-owned threads to make calls into C code using cgo.

I think that a perf_events-based profiler that refuses to even try when a program is built with cgo enabled, or in a process where threads that the Go runtime created make calls into C code, will be less useful than one that makes a best effort.

I think a caveat like "this only reports on events triggered by threads that the Go runtime created" would be more appropriate. That's case 1 in my list above.

@erifan
Copy link
Author

erifan commented Dec 16, 2022

I should have said that "when using cgo" refers to user code using cgo, or to runtime.iscgo == true. Several libraries such as net do use c code, but it does not set runtime.iscgo.

I know there are many scenarios where Go interacts with C, but I don't know how to solve the above problem, if you have any solution please let me know so we don't have to add this limitation.

The OS-timer based profiler is still available, users can choose the most suitable profiler they think according to their needs. I think PMU profiler is only suitable for applicable scenarios, it has many limitations, such as Linux, such as sudo permissions, and this one "cgo". Just like timer_create is also only available on Linux, one can also argue that a boost that is not available for Windows and Darwin is less useful. But it improves Linux without affecting other systems, then I think that's valuable. By the same token, I think the PMU profiler is an improvement for pure Go code on Linux, so why is it less useful? I believe writing Go code in the Go language is the common case.

If the thread created in cgo cannot receive the profiling signal, the profiling is biased. But with OS-timer based profiler we can get a normal profiling result for cgo code. I think a normal result is more appropriate than a biased one? This is why I think profiling cgo programs with PMU should be disabled instead of a caveat.

@aclements
Copy link
Member

I worry that it's quite common for Go applications to have just a small amount of cgo use in them, and this makes the local effects of introducing cgo usage into a global constraint on the process.

I'm not actually positive that we can't solve this limitation. This is Linux-specific anyway, so we can use Linux-specific techniques to enumerate all threads in the process when we start profiling. And I think we can set up inheritance so it automatically starts profiling any new threads created later. I'm not sure we can do that in a way that propagates into new threads, but not into new processes. There are various exec-related flags that might have this effect (recalling that Go processes never fork without an exec). I'm also not sure about the signal issues you mention. I assume you're using an overflow signal that triggers on any event in the mmap buffer, so we can do our own stack walk from the signal handler?

@erifan
Copy link
Author

erifan commented Dec 22, 2022

Yes, I agree it would be best if we could fix this.

And I think we can set up inheritance so it automatically starts profiling any new threads created later.

I have tried this method, with this option, the thread created in cgo can occasionally receive the profiling signal, without this option, it cannot receive at all. I have tried many different options. But didn't find a good way.

I assume you're using an overflow signal that triggers on any event in the mmap buffer, so we can do our own stack walk from the signal handler?

That's what I'm doing now, I haven't changed the logic of the stack walk.

@erifan
Copy link
Author

erifan commented Dec 22, 2022

Let me describe the problem in detail.

Test case https://go.dev/play/p/hImHumVUxFx (derived from TestCgoPprofThread, I found this problem because this test always fails in PMU mode).
The draft implementation CL 410799.

In the implementation, we install an os-timer or pmu counter for each thread maintained by the Go runtime. At the same time, install an os-timer or pmu counter for the entire Go process to provide signals for non-Go threads.

The phenomenon is that if the os-timer is used as the process signal source, the threads created in cgo always receive the signal first. If the PMU counter is used as the process signal source, the threads created in cgo will rarely receive the profiling signal, causing the test to fail. Almost all signals are received by Go threads.

Analysis: The man page says that a process signal will be delivered to a randomly selected thread within the process, but it seems to be not. Looking at the kernel code, I found that the signal delivering flow is like this, see complete_signal:

  1. A process signal is generated (by os-timer or PMU counter), the kernel put the signal in a queue that is shared by the whole process, and then wakes one thread. To choose which thread to wake:
  2. If the main thread has not blocked the signal, then wake the main thread.
  3. Otherwise T is a pointer into a circular list of threads in the process. If T has not blocked the signal then wake T, otherwise advance T to the next thread and repeat.

To wake a thread it has to either schedule it or send an IPI to another CPU to interrupt it if it's running. During that time another thread can come in and "steal" the signal from the queue (e.g. if it does a syscall and a signal is pending), so it's not necessarily the case that the thread that was woken always handles the signal. So actually the first thread to transition from kernel->user pops the signal from the queue.

For the PMU mode. Since the PMU signal is generated from an interrupt handler, there's no user->kernel transition that checks the signal queue (execution returns directly from the interrupt handler to user-space without going through the kernel scheduler), so according to the above algorithm it's more likely that the main thread will wake up and dequeue the signal instead of the thread that was executing when the interrupt occurred. Therefore, in the above test, most of the process PMU signals are delivered to the main thread (a Go thread), and the threads created in cgo rarely receive signals. Test results:

c tid1 = 2661774
c tid2 = 2661775
2661768  ==>  68485
2661775  ==>  1
2661769  ==>  31513

For the os-timer mode. In the above test, the thread created in cgo is a cpu-bound thread, and during its execution, the main thread is sleeping. When the cpu-bound thread is de-scheduled by the kernel because its time slice is run out or for other reasons, the kernel will update the process timer. Note that the timers are updated just before a thread is about to be scheduled (not when it is descheduled). So the time between the signal being generated by the timer update and the CPU-bound thread returning to user space is very short, so it's likely to see the signal still in the queue (i.e. it's rare that the main thread can wake up (because it will take more time.) and start running in that time). So cgo threads are always more likely to get the signal. Test results:

c tid2 = 2659014
c tid1 = 2659013
2659014  ==>  217
2659008  ==>  1
2659013  ==>  1098

I'm not sure if this is correct, but it seems to explain the evidence.

Also, the cgo thread in the above test seems to be running in a new process?

Hope this helps in understanding and resolving the problem.

@aclements
Copy link
Member

aclements commented Jan 4, 2023

Hmm. I'm still wrapping my head around what you wrote, but one thing jumped out at me: if the signal is delivered to the whole process, then doing our own stack walk from the arbitrary thread that receives the signal is meaningless. E.g., suppose we're sampling branch misses and your process has two CPU-bound threads, but one has no branch misses and the other has many. If the PMU overflow signal can be delivered to either thread, then the signal context is unrelated to what caused the event, and doing the trace back from the signal context will result in an entirely incorrect profile.

If that's all true, then our only option is to use information from the perf-generated event. We can get stacks from that using frame-pointer unwinding. They won't be exactly the same as Go stack walks, but it would be better than nothing.

(Also thanks for digging into this so deeply. :) )

@erifan
Copy link
Author

erifan commented Jan 5, 2023

if the signal is delivered to the whole process, then doing our own stack walk from the arbitrary thread that receives the signal is meaningless.

Yeah, and I think this also apply to the os-timer based profiler. For example, a process has two threads, one takes all of the cpu time, and the other is sleeping, takes no cpu time, but the process signal is delivered to a random thread. If the signal is received by the sleeping thread, then the generated profile is incorrect. It may not be correct to say "incorrect", because this process signal source (os-timer or PMU counter) measures a certain indicator of the entire process, so maybe it is not appropriate that we use it for profiling? I think "meaningless" you mentioned above is very accurate.

If that's all true, then our only option is to use information from the perf-generated event. We can get stacks from that using frame-pointer unwinding.

Are you referring to the PC and SP information recorded at the time the overflow occurred? If so then I feel like this is actually only slightly more accurate, since the thread that causes the overflow is also random. However, this method can avoid the extreme situations, a thread has no branch miss at all or does not occupy CPU time at all.

@rhysh
Copy link
Contributor

rhysh commented Jan 5, 2023

I want to clarify that the current "os-timer" is, on Linux, a combination of two separate profiling mechanisms running concurrently. Some of the shortcomings and behaviors you've described @erifan are applicable to one, but not (as I understand it) to the other.

The first is based on setitimer. This one has been in place for several years and is the only profiler for Go programs on most unix-like OSes. There's a single timer for the whole process, and on overflow the OS delivers a signal to the process as a whole. When you say "the process signal is delivered to a random thread", that lines up with my experience with this profiler on GOOS=darwin, and somewhat with its behavior on GOOS=linux.

The second is based on the timer_create set of syscalls. This is new in Go 1.18, and is only for GOOS=linux. Each thread that is known to the Go runtime asks for a separate timer, and when each of those timers overflows the resulting signal goes only to the correct thread.

When programs that don't use cgo to create threads run on GOOS=linux and profile themselves, they will end up with one setitimer timer and one timer_create timer for each thread in the program. (All threads in the program are owned by the Go runtime.) When a thread gets a SIGPROF, it determines whether it came from the per-thread timer or the process-wide timer—and if the thread has its own timer_create timer, it'll discard any signals that were triggered by the process-wide timer (to avoid double-counting). For those programs, the process-wide timer (the one that is "delivered to a random thread") has no effect on the profile, and all recorded samples are from OS threads that really did consume on-CPU time.

For programs that use cgo to create threads, the work done in those cgo-owned threads will only earn signals from the process-wide timer. But code running on Go-owned threads will continue to get the higher quality profiling from those threads' individual timers.


If that's all true, then our only option is to use information from the perf-generated event. We can get stacks from that using frame-pointer unwinding. They won't be exactly the same as Go stack walks, but it would be better than nothing.

@aclements does that mean not collecting goroutine labels / tags, or is there a way to store those that could make them appear in the perf-generated events?

@aclements
Copy link
Member

Are you referring to the PC and SP information recorded at the time the overflow occurred? If so then I feel like this is actually only slightly more accurate, since the thread that causes the overflow is also random. However, this method can avoid the extreme situations, a thread has no branch miss at all or does not occupy CPU time at all.

I mean specifically the PC and SP (and possible stack) recorded by perf itself in the perf event. This should be extremely accurate (possibly with a few instruction skew, depending on the exact PMU configuration, but it's not going to be attributed to entirely the wrong PC).

@aclements does that mean not collecting goroutine labels / tags, or is there a way to store those that could make them appear in the perf-generated events?

Oh that's a good point. We might be able to still handle that. We can ask perf to record registers including the G register, which would let us peek into the goroutine's labels regardless of which thread the signal lands on. There are definitely some subtleties, like we have to make sure we can safely read another goroutine's labels (we already have to guard against nearly this, so this might already be okay), and we have to be careful to only follow the G register if it's definitely pointing to a G, so not if the sample landed in any C or non-ABIInternal Go code.

@erifan
Copy link
Author

erifan commented Jan 6, 2023

I mean specifically the PC and SP (and possible stack) recorded by perf itself in the perf event. This should be extremely accurate (possibly with a few instruction skew, depending on the exact PMU configuration, but it's not going to be attributed to entirely the wrong PC).

Since the signal source is for the entire process, I don't know what kind of stack trace will be collected by the perf event. There are three possibilities:

  1. Stack trace of a single OS thread.
  2. pc/sp randomly sampled for all os threads of the process.
  3. Stack trace of a goroutine.

I think what we want to get is the stack trace of a single goroutine, but if it is the first two, it is a bit troublesome.

@aclements
Copy link
Member

We might be talking past each other here.

I don't mean the PC/SP in the signal frame. If the signal is delivered to an arbitrary thread in the process, that is indeed pretty meaningless.

The PC/SP gathered by perf comes from the PMU interrupt itself and is thus very representative of the specific instruction that caused the PMU overflow. If you tell perf to unwind stacks, it will unwind from that PC/SP, so if the PMU overflow happened while executing a goroutine, it will collect the stack trace of the goroutine. If it happens while executing kernel code, then it depends on exactly what options you have set, but in general it unwinds the OS thread stack, then leaves a sentinel value in the stack trace, then continues with the user-space stack. I added platform-compatible frame pointers to Go on GOARCH=amd64 years ago almost entirely so that perf could correctly unwind Go stacks. :) There are some weird things in Go stacks that perf doesn't know about, but this works correctly the vast majority of the time. Nowadays, we have platform-compatible frame pointers on more GOARCHes, though I don't remember exactly which.

@aclements
Copy link
Member

aclements commented Jan 25, 2023

@erifan (or anyone else) , do you see any deal-breakers with using stacks gathered directly by perf_events, rather than doing Go tracebacks in the signal handler?

@erifan
Copy link
Author

erifan commented Jan 28, 2023

@erifan (or anyone else) , do you see any deal-breakers with using stacks gathered directly by perf_events, rather than doing Go tracebacks in the signal handler?

@aclements Sorry for the late reply due to the Chinese New Year holiday. I haven't tried this before, but it seems to be the only option. I have two concerns, hopefully neither of them are a problem:
One is that collecting the perf_events stacks will take a different code path from the existing implementation.
Second, is the format of the sample collected by perf_events consistent with the existing sample format?
If the format is not a problem, perhaps PMU implementations should always use the stacks gathered directly by perf_events, this method is more efficient.
I will try to make a draft implementation and then reply to your question. Thanks.

@aclements
Copy link
Member

Sorry for the late reply due to the Chinese New Year holiday.

No worries. I didn't mean to rush you. :)

Second, is the format of the sample collected by perf_events consistent with the existing sample format?

I'm not quite sure what you mean. perf_events will just give a list of PCs in the sample. We'll have to copy that into the proto format (via our internal profiling ring buffer). We might also have to do inline expansion on that; I forget if that happens before or after the internal ring buffer. I think for a first pass we can ignore the inline expansion problem. I'm planning to rewrite how we do inline expansion in general, at which point this will become easy to do.

I will try to make a draft implementation and then reply to your question. Thanks.

Great! This is probably the only way to be sure. :)

@rsc
Copy link
Contributor

rsc commented Feb 1, 2023

@erifan, thanks for doing a draft implementation. We will hold the final decision for that.

@rsc rsc moved this from Active to Hold in Proposals Feb 1, 2023
@rsc
Copy link
Contributor

rsc commented Feb 1, 2023

Placed on hold.
— rsc for the proposal review group

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/410797 mentions this issue: runtime: add perfEventOpen, fcntl, and ioctl and some consts for PMU profiling

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/410796 mentions this issue: runtime: add CPU profiling configuration structure in runtime

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/410798 mentions this issue: runtime: support PMU events based cpu profiling for linux

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/410799 mentions this issue: [DO-NOT-SUBMIT] runtime: support PMU profiling for non-go threads

@erifan
Copy link
Author

erifan commented Mar 29, 2023

Hi, sorry for the late reply.
I recently spent some time on this draft implementation, this is the update:

1, I found that the above-mentioned problem that the C threads created in cgo cannot capture the signal from PMU may be related to the specific kernel version. Here's what I found:

x86:
	22.04:
		Bad: 5.15.0-53,5.15.0-56, 5.15.0-57
		Good: 6.1.12-060112
	20.04:
		Bad: 5.15.0-60
arm64:
	22.04:
		Good: 6.1.12-060112

I haven't figured out what the kernel changes are, started from which version, because I happen to have the above environments, and I haven't tested other environments. But the test results show that 6.1.12-060112 is indeed good, no matter on x86 or arm64.

2, Reading samples directly from the perf event is feasible, but my implementation ran into some random crashes. I haven't figured out where the problem is. It seems to be related to GC, but I am not familiar with it, so it is progressing very slow.

So here I want to reconfirm whether this is an issue that will block this proposal going forward? And I also want to clarify that:

  1. this problem only exists for C threads created in C code.
  2. In kernel versions above 6.1.12-060112, this problem disappears.
  3. In the bad kernel versions, It is not that the C thread cannot receive the PMU signal, but sometimes it can receive it, and sometimes it cannot. The probability of receiving the PMU signal is not 100%.

The draft implementation based on the new APIs: https://go-review.googlesource.com/c/go/+/410798/12
Reading samples from perf event: https://go-review.googlesource.com/c/go/+/410799/13 (not finished yet, crash randomly)

In addition, I wonder if we can change the parameter of SetPeriod from int64 to int32.
func (c *CPUProfile) SetPeriod(int64) error
to
func (c *CPUProfile) SetPeriod(int32) error
So the configuration structure will use less memory. But it's not a big problem, just saying that if int32 is enough, maybe we can use it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Hold
Development

No branches or pull requests

8 participants