Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: feedback-guided optimization #28262

Closed
josharian opened this issue Oct 17, 2018 · 20 comments
Closed

cmd/compile: feedback-guided optimization #28262

josharian opened this issue Oct 17, 2018 · 20 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance umbrella
Milestone

Comments

@josharian
Copy link
Contributor

This is an umbrella issue for supporting feedback-guided optimization. This has been discussed offhand in a few other contexts (inlining decisions: #17566, code layout decisions: #20356, language changes: #24204, stack growth: #18138).

It is not clear what the design for FGO/PGO support might look like. In particular, what do the feedback/profile files look like?

@aclements observed during conversation at GopherCon 2018 that in order to preserve reproducible builds and efficiency, the feedback file needs to be hashed by cmd/go along with all other inputs, which suggests that it ought to be committed to version control.

cc @randall77 @ianlancetaylor @aclements @crawshaw @bradfitz @CAFxX (and the list could go on, but I'm sure everyone will find it)

@josharian josharian added this to the Unplanned milestone Oct 17, 2018
@seebs
Copy link
Contributor

seebs commented Aug 29, 2019

Oh, hey, I have a concrete example of a potentially-relevant case for this, which is a recurring thing that bites people with microbenchmarks, but may also affect real code:

In rare cases, trivial questions of code alignment can produce very large performance differences. I had a test case, at one point, where I had two functions which were getting roughly a 2x performance difference in benchmarking. (As in, one took twice as long as the other.) But the more I experimented, the more confused I became, because it turned out the functions were identical, and I could change which one was faster by reordering them in source.

The microbenchmark case probably doesn't matter, but there are real-world cases where a couple of likely-concurrent hot paths end up with some kind of problem which is entirely a function of the exact location of their code -- I'm guessing either "aligned or not aligned with cache line" or "in same cache-associativity group as other hot paths", but I don't have the expertise to dive into it far enough to be sure.

My guess is that there's some actual specialized code out there where inserting a handful of nops before some specific functions could produce a +/-20% performance change in the program as a whole. Figuring out which ones seems likely to be arbitrarily hard; you'd need some kind of framework for figuring out how to benchmark the functions, and how to benchmark different sets of functions in parallel, and so on.

This is the sort of thing which is a huge amount of work, and which doesn't provide any noticeable benefit until it does, and then it can be inordinately significant. (The big question is whether you end up perceiving it as "sometimes we get lucky and our code runs faster" or "sometimes we get unlucky and our code runs slower").

@dangit815
Copy link

More than two years have passed since the last discussion, is there any progress or plan on FGO recently?

@aclements
Copy link
Member

I hadn't remembered we even had an open issue for this!

@cherrymui and @prattmic are actively doing prototype work right now to help us understand the potential wins from PGO/FGO/FDO and how to best integrate it into the build process. It's early stages, but we're tentatively hoping to have initial PGO optimizations available in 1.19.

@josharian
Copy link
Contributor Author

Here's a related idea I've been kicking around.

Instead of doing running program -> profile -> compiler, add an extra intermediate step: running program -> profile -> optimization file -> compiler. The optimization file consists of optimization information and directives to the compiler.

There are a few advantages to this:

  • the compiler can stay simpler
  • being able to inspect and edit the optimizations file directly (or via a tool) will make it easier to understand what is happening, bisect, test, and minimize issues
  • it opens the door for other mechanisms/tooling to add information and optimizations, such as: offline static analysis, offline human analysis, bespoke profiling techniques, or combining profiles taken across a fleet of machines

@josharian
Copy link
Contributor Author

@cherrymui and @prattmic are actively doing prototype work right now to help us understand the potential wins from PGO/FGO/FDO and how to best integrate it into the build process

Though I am delighted to hear this, I feel compelled to say yet again: I wish we had a more predictable, repeatable mechanism to learn things like this than someone happening to ping a relevant issue.

@thepudds
Copy link
Contributor

thepudds commented Oct 18, 2021

Though I am delighted to hear this, I feel compelled to say yet again: I wish we had a more predictable, repeatable mechanism to learn things like this than someone happening to ping a relevant issue.

Is this the type of thing that could get covered in the “Compiler and Runtime Meeting Notes“ tracking issue #43930 ?

That did seem useful, although I know it takes some work.

@prattmic
Copy link
Member

Instead of doing running program -> profile -> compiler, add an extra intermediate step: running program -> profile -> optimization file -> compiler. The optimization file consists of optimization information and directives to the compiler.

I have received similar feedback in discussing this with one user: that they would like a way to manually specify optimizations that a PGO profile might indicate if capturing actual profiles is not possible/representative or via, as you say "offline human analysis". They specifically mentioned //go: directives, but I think a separate edit-able file would work as well.

@jeremyfaller
Copy link
Contributor

Is this the type of thing that could get covered in the “Compiler and Runtime Meeting Notes“ tracking issue #43930 ?

That did seem useful, although I know it takes some work.

CC @josharian

Yeah. I haven't been maintaining that of late. My apologies. (It's kinda fallen on the pile of "stuff I should get to".) I'll try to do better. :)

Having said that it's unlikely that PGO would have made that tracking issue. PGO started in experimental form, and we were only trying to see if it was worth pursuing. Our experimentation and the results from some external groups lead us to believe PGO might make the 2022 roadmap, but we're still trying to figure out what form it will take.

@cristaloleg
Copy link

cristaloleg commented Nov 8, 2021

Regarding PGO in the latest .NET 6 release https://devblogs.microsoft.com/dotnet/announcing-net-6/#dynamic-pgo and https://devblogs.microsoft.com/dotnet/announcing-net-6/#dynamic-pgo (UPD: there is a 2nd Dynamic PGO paragraph closer to the end of the article)

@mvdan
Copy link
Member

mvdan commented Nov 22, 2021

To echo #28262 (comment), #49688 came as a surprise to seemingly everyone except the compiler team. If #43930 isn't the right medium to keep everyone else in the loop, perhaps consider trying something else. My gut feeling is that those public weekly notes are a good starting point, and I'd like to see them resume - especially if they included a very brief summary of the main projects currently in progress.

To hopefully give a bit of perspective: the lack of transparency into what is being actively considered or experimented on discourages external contributions. Using this thread as an example, it seems like Josh has been thinking about this idea for a while, and I'd bet he would contribute towards the design, testing, and perhaps even coding/reviews. That kind of external involvement simply isn't going to happen if there isn't a public way to be kept in the loop, though.

@ianlancetaylor
Copy link
Contributor

ianlancetaylor commented Nov 22, 2021

@mvdan I'm sorry it seems that there is a lack of transparency by the Go team. The compiler team at Google hasn't done any work on PGO that is not public. The code in #49688 was developed entirely independently at Uber. The compiler team only became aware of it relatively recently when Uber reached out privately. The public #49688 is a step toward transparency and keeping everybody in the loop.

It would have been better to make a public announcement about it first. Still, we would now be in the same position anyhow. The communication on this is a lesson to bear in mind the next time there is a substantial privately developed contribution. Sorry for the confusion and angst.

@heschi heschi added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 22, 2021
@mvdan
Copy link
Member

mvdan commented Nov 23, 2021

Perhaps I'm not seeing the full picture - in #28262 (comment) it was described that some prototype work was already taking place, and in #49688 (comment) it seems like the design and coordination with Uber to upstream their work is currently underway.

These updates are certainly welcome, and I'm really excited to see some progress in this space. However, I'd like to clarify that I don't think we'd have arrived at the same position anyway. To once again take Josh as an example, he's only learning about developments around this proposal via comments that feel like after-thoughts, leaving him little room to participate or even be kept in the loop with some detail. Which is unfortunate, given he's the author of the proposal and some of its design ideas, and has been one of the most reliable compiler contributors over the years :)

Maybe the way the compiler team functions is indeed to work on this issue privately, and then come out publicly with a proposal document and possibly even a prototype implementation. I'm trying to argue, hopefully in a constructive way, that we should aim to be more transparent and welcoming to external contributors. A good first step would be to post regular and early updates somewhere like #43930, making it easy for others to be kept up to date in a timely manner. Ideally some of the team's communication channels would be public too, much like golang-tools has been doing for years, but at least weekly written updates are an easier incremental step.

@jeremyfaller
Copy link
Contributor

@mvdan I don't think you're missing the picture – there are a few details that could help fill you in, but most of it is all current events.

A bit of background:
The Go Compiler and Runtime team is split between two locations, Cambridge MA (with a small contingent in NYC), and Sunnyvale. The majority of the generics work has been handled in Sunnyvale, and while Go focused on 1.18, Cambridge decided to minimize instability in the compiler and runtime. So, we decided to do some experimentation. While lots of times work is internally motivated by things Google would like to see from Go (like the register ABI in 1.16/1.17/1.18), Google's happy if we just focus on generics right now. CAM decided to spend a small amount of resources to characterize what PGO would give us, but we didn't know if we would go further than experimentation. (and we still don't)

The present day:
About when Cherry started to get numbers from PGO inlining, Uber reached out about a PGO implementation they have. They have been thinking about it for longer than we have, and implemented more than we have with our recent experiments. Uber has expressed that they want to upstream their work, but we don't know it's the right thing for the Go ecosystem as a whole, so we need to take a look at it. I will be honest and say that there are parts of it weren't not sure are a good fit for cloud Go users and embedded Go users, but at first blush lots of decisions they've made sound good. We still need to see what they've got before we make any decisions. I doubt their work will be upstreamed without modification.

That is where we are. Uber wants us to consider their inlining implementation, and we need to take a look at it.

The feedback that the compiler and runtime team could do a better job is correct, but in this instance, I feel like we communicated when we knew things. The PGO work we undertook in 2021H2 was truly an experiment. We do experimentation all the time, and when we can share results (ie, it isn't based on internal Google data), we do.

I hope this explanation helps, and I'm happy to chat at any time.

@mvdan
Copy link
Member

mvdan commented Nov 28, 2021

Thanks, Jeremy - I appreciate the extra updates and context on PGO in this thread, as well as all the work that is happening here.

The Go Compiler and Runtime team

Just so we're clear, this implies only Google, correct? I think that's what I'm trying to get at. People like Josh certainly deserve being considered part of the compiler team, but are repeatedly kept out of relevant threads and video calls simply because they are not employed by Google :)

I admit this is getting off-topic, though, so this will be my last comment on the topic here. I've already outlined how I think you could improve transparency and encourage more significant contributions from non-Googlers, and I'd rather not repeat myself as I don't have much else to add.

I hope it's clear this is not an anti-Google sentiment in any way - but rather, how other companies (like Josh's Tailscale or this thread's Uber) would likely participate more regularly and actively in those teams if they were given an honest chance.

@jeremyfaller
Copy link
Contributor

Just so we're clear, this implies only Google, correct? I think that's what I'm trying to get at. People like Josh certainly deserve being considered part of the compiler team, but are repeatedly kept out of relevant threads and video calls simply because they are not employed by Google :)

Sorry, yes. That's how we split things up within google. These are the people within Google who think about the C&RT, don't mean to sound like we're alone in this fight. :)

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/357330 mentions this issue: cmd/compile: implement jump tables

gopherbot pushed a commit that referenced this issue Apr 14, 2022
Performance is kind of hard to exactly quantify.

One big difference between jump tables and the old binary search
scheme is that there's only 1 branch statement instead of O(n) of
them. That can be both a blessing and a curse, and can make evaluating
jump tables very hard to do.

The single branch can become a choke point for the hardware branch
predictor. A branch table jump must fit all of its state in a single
branch predictor entry (technically, a branch target predictor entry).
With binary search that predictor state can be spread among lots of
entries. In cases where the case selection is repetitive and thus
predictable, binary search can perform better.

The big win for a jump table is that it doesn't consume so much of the
branch predictor's resources. But that benefit is essentially never
observed in microbenchmarks, because the branch predictor can easily
keep state for all the binary search branches in a microbenchmark. So
that benefit is really hard to measure.

So predictable switch microbenchmarks are ~useless - they will almost
always favor the binary search scheme. Fully unpredictable switch
microbenchmarks are better, as they aren't lying to us quite so
much. In a perfectly unpredictable situation, a jump table will expect
to incur 1-1/N branch mispredicts, where a binary search would incur
lg(N)/2 of them. That makes the crossover point at about N=4. But of
course switches in real programs are seldom fully unpredictable, so
we'll use a higher crossover point.

Beyond the branch predictor, jump tables tend to execute more
instructions per switch but have no additional instructions per case,
which also argues for a larger crossover.

As far as code size goes, with this CL cmd/go has a slightly smaller
code segment and a slightly larger overall size (from the jump tables
themselves which live in the data segment).

This is a case where some FDO (feedback-directed optimization) would
be really nice to have. #28262

Some large-program benchmarks might help make the case for this
CL. Especially if we can turn on branch mispredict counters so we can
see how much using jump tables can free up branch prediction resources
that can be gainfully used elsewhere in the program.

name                         old time/op  new time/op  delta
Switch8Predictable         1.89ns ± 2%  1.27ns ± 3%  -32.58%  (p=0.000 n=9+10)
Switch8Unpredictable       9.33ns ± 1%  7.50ns ± 1%  -19.60%  (p=0.000 n=10+9)
Switch32Predictable        2.20ns ± 2%  1.64ns ± 1%  -25.39%  (p=0.000 n=10+9)
Switch32Unpredictable      10.0ns ± 2%   7.6ns ± 2%  -24.04%  (p=0.000 n=10+10)

Fixes #5496
Update #34381

Change-Id: I3ff56011d02be53f605ca5fd3fb96b905517c34f
Reviewed-on: https://go-review.googlesource.com/c/go/+/357330
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@google.com>
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022
@prattmic
Copy link
Member

@mvdan posted a comment at #49688 (comment) asking for a tracking bug, and updates on what's going on. This is the right tracking bug.

Apologies for lack of updates, our progress has been very on-and-off with other things going on. The current status here is:

  • @rajbarik and others at Uber have been prototyping compiler internals of a few different types of optimizations that we could do with profiling data. Their prototype is in pgo: profile-guided optimization for golang v1.17 #49688.
  • @cherrymui, myself, and @aclements have been thinking about what an interface to PGO could look like in terms of profile collection and use via the go tool. We hope to have a concrete initial proposal to post in the next few weeks, which will be a good discussion point and hopefully can accelerate progress around collaboration on optimizations, etc.

@cherrymui
Copy link
Member

The proposal @prattmic mentioned above is filed at #55022 . (Filing a new issue to make it clearer what we are proposing.)
Thanks.

@qiulaidongfeng
Copy link
Member

Go already supports PGO, why not close this issue?

@cherrymui
Copy link
Member

The initial PGO work is done at #55022, and some followup works are tracked at #62463. We can close this issue.

@golang golang locked and limited conversation to collaborators Oct 2, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance umbrella
Projects
None yet
Development

No branches or pull requests