8278114: New addnode ideal optimization: converting "x + x" into "x << 1" #6675


Closed
wants to merge 17 commits into master

Conversation

CptGit
Contributor

@CptGit CptGit commented Dec 2, 2021

A new ideal optimization can be introduced for addnode: converting "x + x" into "x << 1".

// Convert "x + x" into "x << 1"
if (in1 == in2) {
  return new LShiftINode(in1, phase->intcon(1));
}
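
At the Java source level, the pattern being targeted is simply doubling a value; a hypothetical illustration (not code from this PR):

    // Hypothetical illustration only: under the proposed idealization,
    // C2 would canonicalize the addition below to a left shift by one.
    static int doubled(int x) {
        return x + x; // becomes x << 1 in the ideal graph
    }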

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8278114: New addnode ideal optimization: converting "x + x" into "x << 1"

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/6675/head:pull/6675
$ git checkout pull/6675

Update a local copy of the PR:
$ git checkout pull/6675
$ git pull https://git.openjdk.java.net/jdk pull/6675/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 6675

View PR using the GUI difftool:
$ git pr show -t 6675

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/6675.diff

@bridgekeeper

bridgekeeper bot commented Dec 2, 2021

👋 Welcome back CptGit! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Dec 2, 2021
@openjdk

openjdk bot commented Dec 2, 2021

@CptGit The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Dec 2, 2021
@CptGit
Contributor Author

CptGit commented Dec 16, 2021

@merykitty If you get a chance, could you please take a quick look at this PR? I was wondering whether this conversion makes sense. I am concerned that this obvious optimization may already be done somewhere else that I was not able to find. Thank you very much.

@merykitty
Member

Tbh I don't find this transformation necessary; x + x is a very cheap operation and is generally easier for the optimiser to work with than x << 1.
Cheers.

@cl4es
Member

cl4es commented Dec 16, 2021

Tbh I don't find this transformation necessary; x + x is a very cheap operation and is generally easier for the optimiser to work with than x << 1. Cheers.

I agree this looks to be of dubious value on the face of it. A microbenchmark to prove it's beneficial in some scenario feels like a requirement here.

A targeted microbenchmark could explore whether we already do, or could better allow, constant folding of expressions that do subsequent shifts, e.g. turning (x + x) << 3 into x << 4.
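
A minimal sketch of what such a targeted microbenchmark might look like (hypothetical and illustrative only; class and method names are made up, and the benchmark eventually added in this PR differs):

    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.*;

    @State(Scope.Thread)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public class AddThenConstShift {
        private int x;

        @Setup(Level.Iteration)
        public void setup() {
            x = ThreadLocalRandom.current().nextInt();
        }

        @Benchmark
        public int addThenShift() {
            // Should fold to a single x << 4 if C2 combines the add with the constant shift.
            return (x + x) << 3;
        }
    }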

@CptGit
Contributor Author

CptGit commented Dec 17, 2021

Tbh I don't find this transformation necessary; x + x is a very cheap operation and is generally easier for the optimiser to work with than x << 1. Cheers.

I agree this looks to be of dubious value on the face of it. A microbenchmark to prove it's beneficial in some scenario feels like a requirement here.

A targeted microbenchmark could explore whether we already do, or could better allow, constant folding of expressions that do subsequent shifts, e.g. turning (x + x) << 3 into x << 4.

Thank you both for your comments and suggestions! @merykitty @cl4es

I moved the transformation into LShiftNode, i.e., to convert (x + x) << 3 into x << 4, and I observed a 5% performance improvement (percentage of average time reduced) with the microbenchmark I wrote. I am not sure how much improvement we usually expect from a transformation; is this one beneficial enough? Thank you!

@State(Scope.Thread)
@Warmup(iterations = 20, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 20, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3 , jvmArgsAppend = {"-XX:-TieredCompilation"})
Member


Is -TieredCompilation necessary to demonstrate an effect? Settings that are not strictly necessary - such as tuning - should generally be avoided (someone might want to set up separate runs with TieredCompilation enabled and disabled on a higher level, for example)

@cl4es
Member

cl4es commented Dec 18, 2021

Thank you both for your comments and suggestions! @merykitty @cl4es

I moved the transformation into LShiftNode, i.e., to convert (x + x) << 3 into x << 4, and I observed a 5% performance improvement (percentage of average time reduced) with the microbenchmark I wrote. I am not sure how much improvement we usually expect from a transformation; is this one beneficial enough? Thank you!

For such a small win you might be looking at noise and should inspect the generated assembly to see whether there's any real difference in the emitted code.

Running your microbenchmark with -prof perfasm on my Linux workstation, the relevant code generated for the helper method with your patch looks like this:

  4.40%     0x00007f2b0d6d720c:   mov    %esi,%eax
            0x00007f2b0d6d720e:   and    $0x1fffff,%eax               ;*iushr {reexecute=0 rethrow=0 return_oop=0}
                                                                      ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper@8 (line 88)

i.e., ((x + x) << 10) >>> 11 is fully transformed down into x & 0x1fffff, which seems like a pretty optimal result for this code (it zeroes out the top 11 bits).

Baseline build generates this:

  4.46%      0x00007f93e16d988c:   add    %esi,%esi
             0x00007f93e16d988e:   shl    $0xa,%esi
             0x00007f93e16d9891:   mov    %esi,%eax
             0x00007f93e16d9893:   shr    $0xb,%eax                    ;*iushr {reexecute=0 rethrow=0 return_oop=0}
                                                                       ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper@8 (line 88)

Which is a pretty direct transformation of the Java code into assembly. So it seems your patch does enable more optimal transformations!

Only a few percent of the benchmark time seems to be spent in the relevant code, though. As written, I can't establish a significant result between a baseline VM and a patched one, as the variance from run to run is more than a few percent. A narrowed-down benchmark that only tests a single value seems to better demonstrate the low-level effect:

    @Benchmark
    public int testInt() {
        return helper(ints_a[4711]);
    }
    
    @Benchmark
    public long testLong() {
        return helper(longs_a[4711]);
    }

Not using the hand-rolled sink blackhole means we rely on JMH blackholes. Those traditionally had a bit more overhead than what you get with sink, but since recently they can cooperate with the VM to remove the overhead almost completely if you run with -Djmh.blackhole.autoDetect=true (soon to be enabled by default). I also removed -TieredCompilation from the @Fork and got slightly better and more stable scores overall, though a bit more warmup is needed.

Baseline:

Benchmark                         Mode  Cnt  Score   Error  Units
AddIdeal_XPlusX_LShiftC.testInt   avgt   15  8.925 ± 0.019  ns/op
AddIdeal_XPlusX_LShiftC.testLong  avgt   15  8.969 ± 0.107  ns/op

Patch:

Benchmark                         Mode  Cnt  Score   Error  Units
AddIdeal_XPlusX_LShiftC.testInt   avgt   15  7.650 ± 0.024  ns/op
AddIdeal_XPlusX_LShiftC.testLong  avgt   15  7.655 ± 0.087  ns/op

A ~1.2ns improvement - or 1.15x speed-up after removing as much infrastructural overhead as possible. I think this looks in line with the observable change in the generated assembly (while we get rid of more than one instruction, they are likely pretty nicely pipelined on my Intel CPU). I think this looks like a reasonable improvement for such a straightforward change to C2.

@CptGit
Contributor Author

CptGit commented Dec 18, 2021

Thank you @cl4es so much for taking the time to review my pull request and investigate the microbenchmark in depth! I learned a lot about writing better JMH microbenchmarks from reading your comment.

Thank you both for your comments and suggestions! @merykitty @cl4es
I moved the transformation into LShiftNode, i.e., to convert (x + x) << 3 into x << 4, and I observed a 5% performance improvement (percentage of average time reduced) with the microbenchmark I wrote. I am not sure how much improvement we usually expect from a transformation; is this one beneficial enough? Thank you!

For such a small win you might be looking at noise and should inspect the generated assembly to see whether there's any real difference in the emitted code.

Great to know we can print assembly! Thanks.

Running your microbenchmark with -prof perfasm on my Linux workstation, the relevant code generated for the helper method with your patch looks like this:

  4.40%     0x00007f2b0d6d720c:   mov    %esi,%eax
            0x00007f2b0d6d720e:   and    $0x1fffff,%eax               ;*iushr {reexecute=0 rethrow=0 return_oop=0}
                                                                      ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper@8 (line 88)

i.e., ((x + x) << 10) >>> 11 is fully transformed down into x & 0x1fffff, which seems like a pretty optimal result for this code (it zeroes out the top 11 bits).

Baseline build generates this:

  4.46%      0x00007f93e16d988c:   add    %esi,%esi
             0x00007f93e16d988e:   shl    $0xa,%esi
             0x00007f93e16d9891:   mov    %esi,%eax
             0x00007f93e16d9893:   shr    $0xb,%eax                    ;*iushr {reexecute=0 rethrow=0 return_oop=0}
                                                                       ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper@8 (line 88)

Which is a pretty direct transformation of the Java code into assembly. So it seems your patch does enable more optimal transformations!

Only a few percent of the benchmark time seems to be spent in the relevant code, though. As written, I can't establish a significant result between a baseline VM and a patched one, as the variance from run to run is more than a few percent. A narrowed-down benchmark that only tests a single value seems to better demonstrate the low-level effect:

    @Benchmark
    public int testInt() {
        return helper(ints_a[4711]);
    }
    
    @Benchmark
    public long testLong() {
        return helper(longs_a[4711]);
    }

Thanks for pointing out this new feature. I did use Blackhole but found it brought too much infrastructure overhead, so I referred to the JMH examples for a sink method. I'll start using Blackhole with -Djmh.blackhole.autoDetect=true from now on.

Not using the hand-rolled sink blackhole means we rely on JMH blackholes. Those traditionally had a bit more overhead than what you get with sink, but since recently they can cooperate with the VM to remove the overhead almost completely if you run with -Djmh.blackhole.autoDetect=true (soon to be enabled by default). I also removed -TieredCompilation from the @Fork and got slightly better and more stable scores overall, though a bit more warmup is needed.

Baseline:

Benchmark                         Mode  Cnt  Score   Error  Units
AddIdeal_XPlusX_LShiftC.testInt   avgt   15  8.925 ± 0.019  ns/op
AddIdeal_XPlusX_LShiftC.testLong  avgt   15  8.969 ± 0.107  ns/op

Patch:

Benchmark                         Mode  Cnt  Score   Error  Units
AddIdeal_XPlusX_LShiftC.testInt   avgt   15  7.650 ± 0.024  ns/op
AddIdeal_XPlusX_LShiftC.testLong  avgt   15  7.655 ± 0.087  ns/op

If you get a chance, could you please share the modifications you made to the microbenchmark (if you made changes other than removing -TieredCompilation and testing only a single value), so I can test on my end for more data points? Also, the examples in the JMH repository seem out of date (e.g., regarding the Blackhole auto-detect feature), so I think your example will be good study material for me. Thank you.

A ~1.2ns improvement - or 1.15x speed-up after removing as much infrastructural overhead as possible. I think this looks in line with the observable change in the generated assembly (while we get rid of more than one instruction, they are likely pretty nicely pipelined on my Intel CPU). I think this looks like a reasonable improvement for such a straightforward change to C2.

@cl4es
Member

cl4es commented Dec 18, 2021

Thank you @cl4es so much for taking time to review my pull request and deeply investigate the microbenchmark! I learnt a lot on writing better JMH microbenchmarks by reading your comment.

My pleasure. I was intrigued since you picked up on my suggestion to enable folding of constant shift transforms so fast and wanted to see for myself that it ended up compiling down to something more compact/optimal.

If you get a chance, could you please share the modifications you made to the microbenchmark (if you made changes other than removing -TieredCompilation and testing only a single value), so I can test on my end for more data points? Also, the examples in the JMH repository seem out of date (e.g., regarding the Blackhole auto-detect feature), so I think your example will be good study material for me. Thank you.

No other code changes, just added in those two simplified benchmarks and commented out the existing ones. I did tweak the command line to print using ns (you can change the annotation to get the same result) and picked the smallest number of iterations that still gave stable results on my machine while trying things out: -i 15 -wi 15 -tu ns.

@CptGit
Contributor Author

CptGit commented Dec 18, 2021

No other code changes, just added in those two simplified benchmarks and commented out the existing ones. I did tweak the command line to print using ns (you can change the annotation to get the same result) and picked the smallest number of iterations that still gave stable results on my machine while trying things out: -i 15 -wi 15 -tu ns.

Is there any special configuration you were using? I was not able to reproduce similar results.

This is when I used 20 iterations.
Baseline:

Benchmark                         Mode  Cnt  Score   Error  Units
AddIdeal_XPlusX_LShiftC.testInt   avgt   60  3.846 ? 0.164  ns/op
AddIdeal_XPlusX_LShiftC.testLong  avgt   60  3.332 ? 0.008  ns/op

Patch:

Benchmark                         Mode  Cnt  Score   Error  Units
AddIdeal_XPlusX_LShiftC.testInt   avgt   60  3.743 ? 0.245  ns/op
AddIdeal_XPlusX_LShiftC.testLong  avgt   60  3.664 ? 0.224  ns/op

Then I tried 100 iterations to reduce noise.
Baseline:

Benchmark                         Mode  Cnt  Score   Error  Units
AddIdeal_XPlusX_LShiftC.testInt   avgt  300  3.566 ? 0.068  ns/op
AddIdeal_XPlusX_LShiftC.testLong  avgt  300  3.310 ? 0.004  ns/op

Patch:

Benchmark                         Mode  Cnt  Score   Error  Units
AddIdeal_XPlusX_LShiftC.testInt   avgt  300  3.724 ? 0.103  ns/op
AddIdeal_XPlusX_LShiftC.testLong  avgt  300  3.605 ? 0.086  ns/op

I was not able to print assembly via -prof perfasm on my end because of some errors from perf, which I do not have permission to configure, but I checked native coverage and it showed that this transformation did happen while the microbenchmark was running.

@merykitty
Member

Alternatively, you could use -XX:CompileCommand=print,*AddIdeal_XPlusX_LShiftC.testInt which will print every detail about the compilation, including opto assembly. In this case, you could use -XX:-TieredCompilation to avoid details about the C1-compiled method. Note that without hsdis the JVM will only output machine code, which is very hard to analyse.

@CptGit
Contributor Author

CptGit commented Dec 18, 2021

Alternatively, you could use -XX:CompileCommand=print,*AddIdeal_XPlusX_LShiftC.testInt which will print every detail about the compilation, including opto assembly. In this case, you could use -XX:-TieredCompilation to avoid details about the C1-compiled method. Note that without hsdis the JVM will only output machine code, which is very hard to analyse.

Thank you for providing this alternative.

It seems make install-hsdis does not install hsdis into the correct place. In my case, I had to manually copy it from build/cov/jdk/lib to the right place, which is build/cov/images/jdk/lib/server, to make it work. I don't know if this is a bug in the makefile.

This is the output I got, given @Fork(value = 1, jvmArgsAppend = {"-XX:-TieredCompilation", "-XX:CompileCommand=print,*AddIdeal_XPlusX_LShiftC.helper"}). I have two questions:

  • The assembly looks weird to me because I did not see any shift instruction.
    (Update: never mind; this is the patched version, so the shift instructions were optimized out. I checked the baseline version and shl and shr were both there.)
  • Why did it print at, and only at, the first warmup iteration? I did not expect C2 compilation to happen so early. Does the JMH framework force a compilation for every benchmark even before the first iteration?

Thanks in advance.

Running test 'micro:AddIdeal_XPlusX_LShiftC'
# JMH version: 1.33
# VM version: JDK 19-internal, OpenJDK 64-Bit Server VM, 19-internal+0-adhoc..my-jdk
# VM invoker: /home/zzq/jog/testing/my-jdk/build/cov/images/jdk/bin/java
# VM options: -Djava.library.path=/home/zzq/jog/testing/my-jdk/build/cov/images/test/micro/native -XX:-TieredCompilation -XX:CompileCommand=print,*AddIdeal_XPlusX_LShiftC.helper
# Blackhole mode: full + dont-inline hint (default, use -Djmh.blackhole.autoDetect=true to auto-detect)
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC.testInt

# Run progress: 0.00% complete, ETA 00:00:40
# Fork: 1 of 1
CompileCommand: print *AddIdeal_XPlusX_LShiftC.helper bool print = true
# Warmup Iteration   1: 
============================= C2-compiled nmethod ==============================
----------------------- MetaData before Compile_id = 188 ------------------------
{method}
 - this oop:          0x00007f977c476548
 - method holder:     'org/openjdk/bench/vm/compiler/AddIdeal_XPlusX_LShiftC'
 - constants:         0x00007f977c475f18 constant pool [76] {0x00007f977c475f18} for 'org/openjdk/bench/vm/compiler/AddIdeal_XPlusX_LShiftC' cache=0x00007f977c4ab6e0
 - access:            0x8100000a  private static 
 - name:              'helper'
 - signature:         '(I)I'
 - max stack:         3
 - max locals:        1
 - size of params:    1
 - method size:       13
 - vtable index:      -2
 - i2i entry:         0x00007f97b92f9c00
 - adapters:          AHE@0x00007f97c41125b0: 0xa i2c: 0x00007f97b9406660 c2i: 0x00007f97b9406719 c2iUV: 0x00007f97b94066e3 c2iNCI: 0x00007f97b9406756
 - compiled entry     0x00007f97b9406719
 - code size:         10
 - code start:        0x00007f977c476520
 - code end (excl):   0x00007f977c47652a
 - method data:       0x00007f977c4b4418
 - checked ex length: 0
 - linenumber start:  0x00007f977c47652a
 - localvar length:   1
 - localvar start:    0x00007f977c476532

------------------------ OptoAssembly for Compile_id = 188 -----------------------
#
#  int ( int )
#
#r018 rsi   : parm 0: int
# -- Old rsp -- Framesize: 32 --
#r591 rsp+28: in_preserve
#r590 rsp+24: return address
#r589 rsp+20: in_preserve
#r588 rsp+16: saved fp register
#r587 rsp+12: pad2, stack alignment
#r586 rsp+ 8: pad2, stack alignment
#r585 rsp+ 4: Fixed slot 1
#r584 rsp+ 0: Fixed slot 0
#
000     N1: #	out( B1 ) <- in( B1 )  Freq: 1

000     B1: #	out( N1 ) <- BLOCK HEAD IS JUNK  Freq: 1
000     # stack bang (96 bytes)
	pushq   rbp	# Save rbp
	subq    rsp, #16	# Create frame

00c     movl    RAX, RSI	# spill
00e     andl    RAX, #2097151	# int
014     addq    rsp, 16	# Destroy frame
	popq    rbp
	cmpq     rsp, poll_offset[r15_thread] 
	ja       #safepoint_stub	# Safepoint: poll for GC

026     ret

--------------------------------------------------------------------------------
----------------------------------- Assembly -----------------------------------

Compiled method (c2)    1974  188             org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper (10 bytes)
 total in heap  [0x00007f97b9418c10,0x00007f97b9418e60] = 592
 relocation     [0x00007f97b9418d80,0x00007f97b9418d90] = 16
 main code      [0x00007f97b9418da0,0x00007f97b9418de0] = 64
 stub code      [0x00007f97b9418de0,0x00007f97b9418df8] = 24
 oops           [0x00007f97b9418df8,0x00007f97b9418e00] = 8
 metadata       [0x00007f97b9418e00,0x00007f97b9418e08] = 8
 scopes data    [0x00007f97b9418e08,0x00007f97b9418e18] = 16
 scopes pcs     [0x00007f97b9418e18,0x00007f97b9418e58] = 64
 dependencies   [0x00007f97b9418e58,0x00007f97b9418e60] = 8

[Disassembly]
--------------------------------------------------------------------------------
[Constant Pool (empty)]

--------------------------------------------------------------------------------

[Verified Entry Point]
  # {method} {0x00007f977c476548} 'helper' '(I)I' in 'org/openjdk/bench/vm/compiler/AddIdeal_XPlusX_LShiftC'
  # parm0:    rsi       = int
  #           [sp+0x20]  (sp of caller)
 ;; N1: #	out( B1 ) <- in( B1 )  Freq: 1
 ;; B1: #	out( N1 ) <- BLOCK HEAD IS JUNK  Freq: 1
  0x00007f97b9418da0:   mov    %eax,-0x16000(%rsp)
  0x00007f97b9418da7:   push   %rbp
  0x00007f97b9418da8:   sub    $0x10,%rsp                   ;*synchronization entry
                                                            ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper@-1 (line 98)
  0x00007f97b9418dac:   mov    %esi,%eax
  0x00007f97b9418dae:   and    $0x1fffff,%eax               ;*iushr {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper@8 (line 98)
  0x00007f97b9418db4:   add    $0x10,%rsp
  0x00007f97b9418db8:   pop    %rbp
  0x00007f97b9418db9:   cmp    0x388(%r15),%rsp             ;   {poll_return}
  0x00007f97b9418dc0:   ja     0x00007f97b9418dc7
  0x00007f97b9418dc6:   ret    
  0x00007f97b9418dc7:   movabs $0x7f97b9418db9,%r10         ;   {internal_word}
  0x00007f97b9418dd1:   mov    %r10,0x3a0(%r15)
  0x00007f97b9418dd8:   jmp    0x00007f97b9408120           ;   {runtime_call SafepointBlob}
  0x00007f97b9418ddd:   hlt    
  0x00007f97b9418dde:   hlt    
  0x00007f97b9418ddf:   hlt    
[Exception Handler]
  0x00007f97b9418de0:   jmp    0x00007f97b936a6a0           ;   {no_reloc}
[Deopt Handler Code]
  0x00007f97b9418de5:   call   0x00007f97b9418dea
  0x00007f97b9418dea:   subq   $0x5,(%rsp)
  0x00007f97b9418def:   jmp    0x00007f97b9407240           ;   {runtime_call DeoptimizationBlob}
  0x00007f97b9418df4:   hlt    
  0x00007f97b9418df5:   hlt    
  0x00007f97b9418df6:   hlt    
  0x00007f97b9418df7:   hlt    
--------------------------------------------------------------------------------
[/Disassembly]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Oops:
  0x00007f97b9418df8:   0x00000007ff755d38 a 'jdk/internal/loader/ClassLoaders$AppClassLoader'{0x00000007ff755d38}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Metadata:
  0x00007f97b9418e00:   0x00007f977c476548 {method} {0x00007f977c476548} 'helper' '(I)I' in 'org/openjdk/bench/vm/compiler/AddIdeal_XPlusX_LShiftC'
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
pc-bytecode offsets:
PcDesc(pc=0x00007f97b9418d9f offset=ffffffff bits=0):
PcDesc(pc=0x00007f97b9418dac offset=c bits=0):
   org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper@-1 (line 98)
PcDesc(pc=0x00007f97b9418db4 offset=14 bits=0):
   org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper@8 (line 98)
PcDesc(pc=0x00007f97b9418df9 offset=59 bits=0):
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
oop maps:ImmutableOopMapSet contains 0 OopMaps

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
scopes:
ScopeDesc(pc=0x00007f97b9418dac offset=c):
   org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper@-1 (line 98)
ScopeDesc(pc=0x00007f97b9418db4 offset=14):
   org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper@8 (line 98)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
relocations:
         @0x00007f97b9418d80: b019
relocInfo@0x00007f97b9418d80 [type=11(poll_return) addr=0x00007f97b9418db9 offset=25]
         @0x00007f97b9418d82: f00e800e
relocInfo@0x00007f97b9418d84 [type=8(internal_word) addr=0x00007f97b9418dc7 offset=14 data=14] | [target=0x00007f97b9418db9]
         @0x00007f97b9418d86: 6411
relocInfo@0x00007f97b9418d86 [type=6(runtime_call) addr=0x00007f97b9418dd8 offset=17 format=1] | [destination=0x00007f97b9408120]
         @0x00007f97b9418d88: 0008
relocInfo@0x00007f97b9418d88 [type=0(none) addr=0x00007f97b9418de0 offset=8]
         @0x00007f97b9418d8a: 6400
relocInfo@0x00007f97b9418d8a [type=6(runtime_call) addr=0x00007f97b9418de0 offset=0 format=1] | [destination=0x00007f97b936a6a0]
         @0x00007f97b9418d8c: 640f
relocInfo@0x00007f97b9418d8c [type=6(runtime_call) addr=0x00007f97b9418def offset=15 format=1] | [destination=0x00007f97b9407240]
         @0x00007f97b9418d8e: 0000
relocInfo@0x00007f97b9418d8e [type=0(none) addr=0x00007f97b9418def offset=0]
         @0x00007f97b9418d90: 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Dependencies:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
ExceptionHandlerTable (size = 0 bytes)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
ImplicitExceptionTable is empty
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Recorded oops:
#0: 0x0000000000000000 NULL-oop
#1: 0x00000007ff755d38 a 'jdk/internal/loader/ClassLoaders$AppClassLoader'{0x00000007ff755d38}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Recorded metadata:
#0: 0x0000000000000000 NULL-oop
#1: 0x00007f977c476548 {method} {0x00007f977c476548} 'helper' '(I)I' in 'org/openjdk/bench/vm/compiler/AddIdeal_XPlusX_LShiftC'
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
3.711 ns/op
# Warmup Iteration   2: 3.515 ns/op
# Warmup Iteration   3: 3.313 ns/op
# Warmup Iteration   4: 3.433 ns/op
# Warmup Iteration   5: 3.526 ns/op
# Warmup Iteration   6: 3.458 ns/op
# Warmup Iteration   7: 3.446 ns/op
# Warmup Iteration   8: 3.497 ns/op
# Warmup Iteration   9: 3.471 ns/op
# Warmup Iteration  10: 3.461 ns/op
# Warmup Iteration  11: 3.429 ns/op
# Warmup Iteration  12: 3.533 ns/op
# Warmup Iteration  13: 3.545 ns/op
# Warmup Iteration  14: 3.613 ns/op
# Warmup Iteration  15: 3.534 ns/op
# Warmup Iteration  16: 3.495 ns/op
# Warmup Iteration  17: 3.567 ns/op
# Warmup Iteration  18: 3.466 ns/op
# Warmup Iteration  19: 3.310 ns/op
# Warmup Iteration  20: 3.305 ns/op
Iteration   1: 3.285 ns/op
Iteration   2: 3.368 ns/op
Iteration   3: 3.315 ns/op
Iteration   4: 3.304 ns/op
Iteration   5: 3.359 ns/op
Iteration   6: 3.340 ns/op
Iteration   7: 3.330 ns/op
Iteration   8: 3.331 ns/op
Iteration   9: 3.447 ns/op
Iteration  10: 3.370 ns/op
Iteration  11: 3.389 ns/op
Iteration  12: 3.342 ns/op
Iteration  13: 3.332 ns/op
Iteration  14: 3.349 ns/op
Iteration  15: 3.344 ns/op
Iteration  16: 3.330 ns/op
Iteration  17: 3.327 ns/op
Iteration  18: 3.419 ns/op
Iteration  19: 3.341 ns/op
Iteration  20: 3.346 ns/op
------------------------------------------------------------------------
static org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper(I)I
  interpreter_invocation_count:       93898
  invocation_counter:                 93898
  backedge_counter:                       0
  decompile_count:                        0
  mdo size: 392 bytes

0 iload_0
1 iload_0
2 iadd
3 bipush 10
5 ishl
6 bipush 11
8 iushr
9 ireturn
------------------------------------------------------------------------
Total MDO size: 392 bytes


Result "org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC.testInt":
  3.348 ±(99.9%) 0.032 ns/op [Average]
  (min, avg, max) = (3.285, 3.348, 3.447), stdev = 0.037
  CI (99.9%): [3.316, 3.381] (assumes normal distribution)


# Run complete. Total time: 00:00:42

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                        Mode  Cnt  Score   Error  Units
AddIdeal_XPlusX_LShiftC.testInt  avgt   20  3.348 ± 0.032  ns/op
Finished running test 'micro:AddIdeal_XPlusX_LShiftC'

@merykitty
Member

merykitty commented Dec 19, 2021

The reason you don't see any shift instruction is that x << 11 >>> 11 is transformed into x & 0x1FFFFF, which is the instruction you see at 0x00007f97b9418dae: and $0x1fffff,%eax.
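
(If it helps to convince yourself of that identity, here is a quick throwaway check; hypothetical, not part of the PR:)

    // Samples random ints and checks ((x << 11) >>> 11) == (x & 0x1FFFFF).
    public class ShiftMaskIdentity {
        public static void main(String[] args) {
            java.util.Random r = new java.util.Random(0);
            for (int i = 0; i < 1_000_000; i++) {
                int x = r.nextInt();
                if (((x << 11) >>> 11) != (x & 0x1FFFFF)) {
                    throw new AssertionError("mismatch for x = " + x);
                }
            }
            System.out.println("identity holds on all sampled values");
        }
    }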

As you can see, for these kinds of microbenchmarks the cost of calling functions dominates the cost of the actual transformation. You can mitigate this by pulling the helper method out to be the benchmark itself (I don't see the reason to have a separate helper method, either) and putting the operation in a loop whose results are sunk into a compiler blackhole in each iteration (a full blackhole or a non-inlined sink method won't work as effectively in this case, simply because it is also a function call). A minimal sketch of that shape follows below.
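
A minimal sketch of that shape (hypothetical and illustrative only; assumes compiler-assisted blackholes are available, e.g. JMH 1.34+ or -Djmh.blackhole.autoDetect=true):

    import java.util.Random;
    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.*;
    import org.openjdk.jmh.infra.Blackhole;

    @State(Scope.Thread)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public class XPlusXShiftLoop {
        private int[] ints;

        @Setup
        public void setup() {
            ints = new int[10_000];
            Random r = new Random(42);
            for (int i = 0; i < ints.length; i++) {
                ints[i] = r.nextInt();
            }
        }

        @Benchmark
        public void shiftLoop(Blackhole bh) {
            // The operation lives directly in the benchmark body; each result is
            // consumed by the blackhole so C2 cannot dead-code-eliminate the loop.
            for (int x : ints) {
                bh.consume(((x + x) << 10) >>> 11);
            }
        }
    }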

Contributor

@vnkozlov vnkozlov left a comment


Looks good to me.

@openjdk

openjdk bot commented Jan 4, 2022

@CptGit This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8278114: New addnode ideal optimization: converting "x + x" into "x << 1"

Reviewed-by: kvn, redestad

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 117 new commits pushed to the master branch:

  • 191f730: Merge
  • 95a3010: 8275830: C2: Receiver downcast is missing when inlining through method handle linkers
  • 58b5fb3: 8265317: [vector] assert(payload->is_object()) failed: expected 'object' value for scalar-replaced boxed vector but got: NULL
  • 06f3713: 8279379: GHA: Print tests that are in error
  • e7244c1: 8278966: two microbenchmarks tests fail "assert(!jvms->method()->has_exception_handlers()) failed: no exception handler expected" after JDK-8275638
  • b4b0328: 8278824: Uneven work distribution when scanning heap roots in G1
  • 99a8351: 8279386: Remove duplicate RefProcPhaseTimeTracker
  • d1e6f26: 8279351: [TESTBUG] SADebugDTest.java does not handle "Address already in use" error
  • 93c7d90: 8278282: G1: Log basic statistics for evacuation failure
  • 1ffdc52: 8279412: [JVMCI] failed speculations list must outlive any nmethod that refers to it
  • ... and 107 more: https://git.openjdk.java.net/jdk/compare/937126b1406ff0f6ac0828310e5e09003692dcd3...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@vnkozlov, @cl4es) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jan 4, 2022
@vnkozlov
Contributor

vnkozlov commented Jan 4, 2022

You still need approval from @cl4es

@CptGit
Contributor Author

CptGit commented Jan 4, 2022

/reviewers 2

@openjdk

openjdk bot commented Jan 4, 2022

@CptGit
The number of required reviews for this PR is now set to 2 (with at least 1 of role reviewers).

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Jan 4, 2022
Member

@cl4es cl4es left a comment


Looks good. The further microbenchmark improvements suggested by @merykitty make a lot of sense given the context.

The comment about the compiler blackhole is becoming a bit redundant with the upgrade to JMH 1.34 (#6955), which enables compiler-assisted blackholes by default.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jan 4, 2022
Contributor

@vnkozlov vnkozlov left a comment


I forgot to request updating the copyright year to 2022 in the headers of the changed and new files.

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Jan 4, 2022
@CptGit
Contributor Author

CptGit commented Jan 4, 2022

I forgot to request updating the copyright year to 2022 in the headers of the changed files.

Good catch! Also I addressed the redundant comment about compiler blackhole in the microbenchmark.

Contributor

@vnkozlov vnkozlov left a comment


Good. And my testing passed.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jan 4, 2022
@CptGit
Contributor Author

CptGit commented Jan 4, 2022

Good. And my testing passed.

Cool, thanks for checking. I think I can integrate now?

@vnkozlov
Contributor

vnkozlov commented Jan 5, 2022

Cool, thanks for checking. I think I can integrate now?

Yes. And I will sponsor it.

@CptGit
Contributor Author

CptGit commented Jan 5, 2022

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Jan 5, 2022
@openjdk

openjdk bot commented Jan 5, 2022

@CptGit
Your change (at version 49f9572) is now ready to be sponsored by a Committer.

@vnkozlov
Contributor

vnkozlov commented Jan 5, 2022

/sponsor

@openjdk

openjdk bot commented Jan 5, 2022

Going to push as commit f326305.
Since your change was applied there have been 117 commits pushed to the master branch:

  • 191f730: Merge
  • 95a3010: 8275830: C2: Receiver downcast is missing when inlining through method handle linkers
  • 58b5fb3: 8265317: [vector] assert(payload->is_object()) failed: expected 'object' value for scalar-replaced boxed vector but got: NULL
  • 06f3713: 8279379: GHA: Print tests that are in error
  • e7244c1: 8278966: two microbenchmarks tests fail "assert(!jvms->method()->has_exception_handlers()) failed: no exception handler expected" after JDK-8275638
  • b4b0328: 8278824: Uneven work distribution when scanning heap roots in G1
  • 99a8351: 8279386: Remove duplicate RefProcPhaseTimeTracker
  • d1e6f26: 8279351: [TESTBUG] SADebugDTest.java does not handle "Address already in use" error
  • 93c7d90: 8278282: G1: Log basic statistics for evacuation failure
  • 1ffdc52: 8279412: [JVMCI] failed speculations list must outlive any nmethod that refers to it
  • ... and 107 more: https://git.openjdk.java.net/jdk/compare/937126b1406ff0f6ac0828310e5e09003692dcd3...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jan 5, 2022
@openjdk openjdk bot closed this Jan 5, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Jan 5, 2022
@openjdk

openjdk bot commented Jan 5, 2022

@vnkozlov @CptGit Pushed as commit f326305.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Labels
hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated