Description
The compiler currently compiles a+b+c+d
as a+(b+(c+d))
. It should use (a+b)+(c+d)
instead, because the latter can be executed out of order.
More broadly, we should balance trees of associative computation.
Doing this in the compiler with a rule tailored for a single computation type in round 4 of md5block.go yielded a 15% throughput improvement. (See below for details.)
It's not obvious to me whether we can do this with carefully crafted rewrite rules or whether a dedicated pass would be better. But it looks like there may be significant performance wins available on tight mathematical code.
I don't plan to work on this further, but I really hope someone else picks it up.
cc @randall77 @martisch @FiloSottile @mmcloughlin @mdempsky
To reproduce the md5 results, disable the optimized assembly routines, and add this rewrite rule:
(Add32 <t> c:(Const32) (Add32 d:(Load _ _) (Add32 x:(Xor32 _ _) a:(Add32 _ _)))) => (Add32 (Add32 <t> c d) (Add32 <t> x a))
and disable this one to avoid an infinite loop:
(Add32 (Add32 i:(Const32 <t>) z) x) && (z.Op != OpConst32 && x.Op != OpConst32) => (Add32 i (Add32 <t> z x))
Metadata
Metadata
Assignees
Labels
Type
Projects
Status
Triage Backlog
Activity