
8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() #22922


Draft: wants to merge 4 commits into master

Conversation

erifan
Contributor

@erifan erifan commented Jan 6, 2025

Constant multiplication x*C can be optimized into LEFT SHIFT, ADD or SUB instructions, since these instructions generally have lower latency and higher throughput than MUL on most architectures. For example:

  1. x*8 can be optimized as x<<3.
  2. x*9 can be optimized as x + (x<<3), which can be lowered to one SHIFT-ADD (ADD instruction combined with LEFT SHIFT) instruction on some architectures, like aarch64 and x86_64.

Currently OpenJDK implements a few such patterns in the mid-end, including:

  1. |C| = 1<<n (n>0)
  2. |C| = (1<<n) - 1 (n>0)
  3. |C| = (1<<m) + (1<<n) (m>n, n>=0)

The first two are fine, because on most architectures they are lowered to a single ADD/SUB/SHIFT instruction.

But the third pattern doesn't always perform well on some architectures, such as aarch64. It can be split into the following sub-patterns:
3.1. C = (1<<n) + 1 (n>0)
3.2. C = -((1<<n) + 1) (n>0)
3.3. C = (1<<m) + (1<<n) (m>n, n>0)
3.4. C = -((1<<m) + (1<<n)) (m>n, n>0)
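
For reference, here is a small illustrative Java helper (hypothetical, not code from this patch) that maps a constant to these sub-patterns; for example, the benchmark constants used below map as 9 → 3.1, -33/-9 → 3.2, 18/36/96 → 3.3 and -18/-36/-96 → 3.4.

```
// Illustrative only: classify a multiplication constant into sub-patterns 3.1 - 3.4.
static String classifyPattern3(int c) {
    long abs = Math.abs((long) c);         // use long so Integer.MIN_VALUE is handled
    if (Long.bitCount(abs) != 2) {
        return "not pattern 3";            // patterns 1 and 2, or a general constant
    }
    boolean plusOne = (abs & 1) == 1;      // abs = (1<<n) + 1 vs. (1<<m) + (1<<n), m>n>0
    if (c > 0) {
        return plusOne ? "3.1" : "3.3";
    } else {
        return plusOne ? "3.2" : "3.4";
    }
}
```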

According to the Arm optimization guide, if the shift amount is greater than 4, the latency and throughput of an ADD with a shifted operand are the same as those of MUL, so in that case converting MUL to ADD is not profitable. Take a[i] * C on aarch64 as an example.

Before (MUL is not converted):

  mov x1, #C
  mul x2, x1, x0

Now (MUL is converted):
For 3.1:

    add x2, x0, x0, lsl #n

For 3.2:

    add x2, x0, x0, lsl #n  // same cost with mul if n > 4
    neg x2, x2

For 3.3:

    lsl x1, x0, #m
    add x2, x1, x0, lsl #n  // same cost with mul if n > 4

For 3.4:

    lsl x1, x0, #m
    add x2, x1, x0, lsl #n  // same cost with mul if n > 4
    neg x2, x2

Test results (ns/op) on Arm Neoverse V2:

Benchmark	Before	Now	Uplift		Pattern	Notes
testInt9	103.379	60.702	1.70305756	3.1
testIntN33	103.231	106.825	0.96635619	3.2	n > 4
testIntN9	103.448	103.005	1.004300762	3.2	n <= 4
testInt18	103.354	99.271	1.041129837	3.3	m <= 4, n <= 4
testInt36	103.396	99.186	1.042445506	3.3	m > 4, n <= 4
testInt96	103.337	105.416	0.980278136	3.3	m > 4, n > 4
testIntN18	103.333	139.258	0.742025593	3.4	m <= 4, n <= 4
testIntN36	103.208	139.132	0.741799155	3.4	m > 4, n <= 4
testIntN96	103.367	139.471	0.74113615	3.4	m > 4, n > 4

(S1) Based on these results, we should treat pattern 3 as follows:
3.1 C = (1<<n) + 1 (n>0)
3.2 C = -((1<<n) + 1) (0<n<=4)
3.3 C = (1<<m) + (1<<n) (m>n, 0<n<=4)
3.4 C = -((1<<m) + (1<<n)) (disable)

Since this conversion is implemented in the mid-end, it impacts other optimizations, such as auto-vectorization. Assume there's the following loop, which can be vectorized.
Vector-A:

for (int i=0; i<len; i++) {
  sum += a[i] * C;
}

Before:

  movi    v19.4s, #C  // this will be hoisted out of the loop
  mla     v16.4s, v17.4s, v19.4s

After:
For 3.1:

    shl     v19.4s, v17.4s, #m
    add     v17.4s, v19.4s, v17.4s
    add     v16.4s, v16.4s, v17.4s

For 3.2:

    (add    w11, w11, w11, lsl #m
    sub     w11, w12, w11) * 4  // *4 is for 4 ints

For 3.3:

    shl     v18.4s, v17.4s, #m
    shl     v19.4s, v17.4s, #n
    add     v18.4s, v19.4s, v18.4s
    add     v16.4s, v16.4s, v18.4s

For 3.4:

    (lsl    w12, w4, #m
    add     w11, w12, w4, lsl #n
    sub     w13, w13, w11) * 4  // *4 is for 4 ints

The code generated before the conversion is simpler and faster:

Benchmark		Before	Now	Uplift		Pattern
testInt9AddSum		47.958	63.696	0.752920121	3.1
testIntN33AddSum	48.013	147.834	0.324776438	3.2
testIntN9AddSum		48.026	149.149	0.322000148	3.2
testInt18AddSum		47.971	69.393	0.691294511	3.3
testInt36AddSum		47.98	69.395	0.69140428	3.3
testInt96AddSum		47.992	69.453	0.690999669	3.3
testIntN18AddSum	48.014	157.132	0.305564748	3.4
testIntN36AddSum	48.02	157.094	0.305676856	3.4
testIntN96AddSum	48.032	153.642	0.312622851	3.4

(S2) Based on these results, we should disable pattern 3 entirely.

But we can have different cases, for example:
Vector-B:

for (int i=0; i<100000; i++) {
  a[i] = a[i] * C;
}

Test results:

Benchmark	Before	Now	Uplift		Pattern
testInt9Store	43.392	33.338	1.301577779	3.1
testIntN33Store	43.365	75.993	0.570644665	3.2
testIntN9Store	43.5	75.452	0.576525473	3.2
testInt18Store	43.442	41.847	1.038115038	3.3
testInt36Store	43.369	41.843	1.03646966	3.3
testInt96Store	43.389	41.931	1.03477141	3.3
testIntN18Store	43.372	57.909	0.748968209	3.4
testIntN36Store	43.373	57.042	0.760369552	3.4
testIntN96Store	43.405	58.145	0.746495829	3.4

(S3) Based on these results, we should treat pattern 3 as follows:
3.1 C = (1<<n) + 1 (n>0)
3.2 C = -((1<<n) + 1) (disable)
3.3 C = (1<<m) + (1<<n) (m>n, n>0)
3.4 C = -((1<<m) + (1<<n)) (disable)

Combining S1, S2 and S3, we get:
Pattern	S1			S2			S3
3.1	(n>0, 1.7)		(disable, 0.75)		(n>0, 1.3)
3.2	(0<n<=4, 1.0)		(disable, 0.32)		(disable, 0.57)
3.3	(m>n, 0<n<=4, 1.04)	(disable, 0.69)		(m>n, n>0, 1.03)
3.4	(disable, 0.74)		(disable, 0.30)		(disable, 0.74)

For 3.1, it's similar to pattern 2 and is usually lowered to only one instruction, so we tend to keep it in the mid-end.
For 3.2, we tend to disable it in the mid-end and do S1 in the back-end if it's profitable.
For 3.3, although S3 shows a 3% performance gain, S2 shows a 31% regression, so we tend to disable it in the mid-end and redo S1 in the back-end.
For 3.4, we shouldn't do this optimization anywhere.

In theory, auto-vectorization should be able to generate the best vectorized code, and cases that cannot be vectorized would be converted into more optimal scalar instructions in the architecture backend (this is what GCC and LLVM do). However, we currently do not have a cost model or VPlan, and the results of auto-vectorization are significantly affected by their input. Therefore, this patch turns off patterns 3.2, 3.3 and 3.4 in the mid-end. Then, where profitable, these patterns can be implemented in the backend. If we implement a cost model and VPlan in the future, it would be best to move all patterns to the backend; this patch does not conflict with that direction.
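
With this change, the only constants still strength-reduced in the mid-end are |C| = 1<<n, |C| = (1<<n) - 1 and C = (1<<n) + 1. A rough sketch of that check (illustrative Java, not the actual mulnode.cpp code):

```
// Rough sketch (illustrative only): constants that the mid-end still strength-reduces
// after this patch; everything else is left as MUL for the backend to handle.
static boolean stillReducedInMidEnd(int c) {
    long abs = Math.abs((long) c);
    return isPowerOf2(abs)                    // |C| = 1<<n        -> shift (negate if C < 0)
        || isPowerOf2(abs + 1)                // |C| = (1<<n) - 1  -> shift then sub
        || (c > 0 && isPowerOf2(abs - 1));    // C  = (1<<n) + 1   -> shift then add (3.1)
}

static boolean isPowerOf2(long v) {
    return v > 0 && (v & (v - 1)) == 0;
}
```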

I also tested this patch on Arm N1, Intel SPR and AMD Genoa machines; no noticeable performance degradation was seen on any of them.

Here are the test results on an Arm V2 and an AMD Genoa machine:

Benchmark	  V2-now	V2-after	Uplift	Genoa-now	Genoa-after	Uplift	Pattern	Notes
testInt8	60.36989	60.276736	1	116.768294	116.772547	0.99	1
testInt8AddSum	63.658064	63.797732	0.99	16.04973	16.051491	0.99	1
testInt8Store	38.829618	39.054129	0.99	19.857453	20.006321	0.99	1
testIntN8	59.99655	60.150053	0.99	132.269926	132.252473	1	1
testIntN8AddSum	145.678098	146.181549	0.99	158.546226	158.806476	0.99	1
testIntN8Store	32.802445	32.897907	0.99	19.047873	19.065941	0.99	1
testInt7	98.978213	99.176574	0.99	114.07026	113.08989	1	2
testInt7AddSum	62.675636	62.310799	1	23.370851	20.971655	1.11	2
testInt7Store	32.850828	32.923315	0.99	23.884952	23.628681	1.01	2
testIntN7	60.27949	60.668158	0.99	174.224893	174.102295	1	2
testIntN7AddSum	62.746696	62.288476	1	20.93192	20.964557	0.99	2
testIntN7Store	32.812906	32.851355	0.99	23.810024	23.526074	1.01	2
testInt9	60.820402	60.331938	1	108.850777	108.846161	1	3.1
testInt9AddSum	62.24679	62.374637	0.99	20.698749	20.741137	0.99	3.1
testInt9Store	32.871723	32.912065	0.99	19.055537	19.080735	0.99	3.1
testIntN33	106.517618	103.450746	1.02	153.894345	140.641135	1.09	3.2	n > 4
testIntN33AddSum	147.589815	47.911612	3.08	153.851885	17.008453	9.04	3.2
testIntN33Store	75.434513	43.473053	1.73	26.612181	20.436323	1.3	3.2
testIntN9	102.173268	103.70682	0.98	155.858169	140.718967	1.1	3.2	n <= 4
testIntN9AddSum	148.724952	47.963305	3.1	186.902111	20.249414	9.23	3.2
testIntN9Store	74.783788	43.339188	1.72	20.150159	20.888448	0.96	3.2
testInt18	98.905625	102.942092	0.96	142.480636	140.748778	1.01	3.3	m <= 4, n <= 4
testInt18AddSum	68.695585	48.103536	1.42	26.88524	16.77886	1.6	3.3
testInt18Store	41.307909	43.385183	0.95	21.233238	20.875026	1.01	3.3
testInt36	99.039742	103.714745	0.95	142.265806	142.334039	0.99	3.3	m > 4, n <= 4
testInt36AddSum	68.736756	47.952189	1.43	26.868362	17.030035	1.57	3.3
testInt36Store	41.403698	43.414093	0.95	21.225454	20.52266	1.03	3.3
testInt96	105.00287	103.528144	1.01	237.649526	140.643255	1.68	3.3	m > 4, n > 4
testInt96AddSum	68.481133	48.04549	1.42	26.877407	16.918209	1.58	3.3
testInt96Store	41.276292	43.512994	0.94	23.456117	20.540181	1.14	3.3
testIntN18	138.629044	103.269657	1.34	210.315628	140.716818	1.49	3.4	m <= 4, n <= 4
testIntN18AddSum	156.635652	48.003989	3.26	215.807135	16.917665	12.75	3.4
testIntN18Store	57.584487	43.410415	1.32	26.819827	20.707778	1.29	3.4
testIntN36	139.068861	103.766774	1.34	209.522432	140.720322	1.48	3.4	m > 4, n <= 4
testIntN36AddSum	156.36928	48.027779	3.25	215.705842	16.893192	12.76	3.4
testIntN36Store	57.715418	43.493958	1.32	21.651252	20.676877	1.04	3.4
testIntN96	139.151761	103.453665	1.34	269.254161	140.753499	1.91	3.4	m > 4, n > 4
testIntN96AddSum	153.123557	48.110524	3.18	263.262635	17.011144	15.47	3.4
testIntN96Store	57.793179	43.47574	1.32	24.444592	20.530219	1.19	3.4

Limitations:
1. This patch only analyzes two vector cases; there may be other vector cases that regress with this patch.
2. This patch does not implement the disabled patterns in the backend; I will propose a follow-up patch to implement them in the aarch64 backend.
3. This patch does not handle the long type, because different architectures have different auto-vectorization support for long, resulting in very different performance, and it is difficult to find a solution that does not introduce significant performance degradation.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/22922/head:pull/22922
$ git checkout pull/22922

Update a local copy of the PR:
$ git checkout pull/22922
$ git pull https://git.openjdk.org/jdk.git pull/22922/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 22922

View PR using the GUI difftool:
$ git pr show -t 22922

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/22922.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper bridgekeeper bot added the oca Needs verification of OCA signatory status label Jan 6, 2025
@bridgekeeper

bridgekeeper bot commented Jan 6, 2025

Hi @erifan, welcome to this OpenJDK project and thanks for contributing!

We do not recognize you as Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing /signed in a comment in this pull request.

If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user erifan" as summary for the issue.

If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know by writing /covered in a comment in this pull request.

@openjdk

openjdk bot commented Jan 6, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk bot changed the title 8346964: C2: Improve integer multiplication with constant in MulINode… 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() Jan 6, 2025
@openjdk

openjdk bot commented Jan 6, 2025

@erifan The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Jan 6, 2025
@erifan
Contributor Author

erifan commented Jan 6, 2025

/signed

@bridgekeeper bridgekeeper bot added the oca-verify Needs verification of OCA signatory status label Jan 6, 2025
@bridgekeeper

bridgekeeper bot commented Jan 6, 2025

Thank you! Please allow for up to two weeks to process your OCA, although it is usually done within one to two business days. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated!

@erifan
Contributor Author

erifan commented Jan 6, 2025

/covered

@bridgekeeper

bridgekeeper bot commented Jan 6, 2025

Thank you! Please allow for a few business days to verify that your employer has signed the OCA. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated!

Contributor

@eme64 eme64 left a comment

Hi @erifan

I have some first questions / comments. I only scanned through quickly.

My biggest question:
You only mention aarch64. But we would need to know that your changes also work well on x64.

Also:
Can you summarize whether your changes are only for the performance of vectorization, or also for scalar code?

@@ -250,27 +250,33 @@ Node *MulINode::Ideal(PhaseGVN *phase, bool can_reshape) {
sign_flip = true;
}

// Get low bit; check for being the only bit
// TODO: abs_con = (1<<m)+(1<<n) and con = -((1<<n)+1) was supported
Contributor

Is this TODO here on purpose?

Contributor Author

Yes, it's a reminder to implement these two patterns for scalar cases in backends if they are worthwhile on the specific architecture.

Contributor

Aha. I see. We don't leave TODO's in the code. Because nobody will ever look at it again. If we agree to go ahead with this, then you should rather file an RFE to keep track of this. But before you file it, let's first discuss the over-all strategy.

private static int addTo6(int a) {
return a + a + a + a + a + a; // a*6 => (a<<1) + (a<<2)
return a + a + a + a + a + a;
Contributor

Is this an improvement?

Contributor

Is this an improvement on aarch64 for all implementations? What about x64?

Contributor Author

If a*6 is in a loop and can be vectorized, there may be a big performance improvement. If it's not vectorized, there may be a small performance loss. See the test results for a*18 = (a<<4) + (a<<1) (the same shape as a*6 = (a<<2) + (a<<1)) in three different cases:

Benchmark        V2-now  V2-after  Uplift  Genoa-now  Genoa-after  Uplift  Notes
testInt18        98.90   102.94    0.96    142.48     140.75       1.01    scalar
testInt18AddSum  68.70   48.10     1.42    26.88      16.78        1.6     vectorized
testInt18Store   41.31   43.39     0.95    21.23      20.88        1.01    vectorized

We can see that for the scalar case the conversion a*6 => (a<<2) + (a<<1) is profitable on aarch64; I have a follow-up patch to reimplement this pattern in the aarch64 backend and will file it later. But for x64 there is no obvious performance change either way. That is also why I left a TODO in mulnode.cpp.

Contributor

Benchmark        V2-now	V2-after  Uplift Genoa-now Genoa-after Uplift  Notes
testInt18        98.90  102.94	  0.96     142.48   140.75	1.01   scalar

Ok, that would be a 4% regression on V2. It is not much, but still possibly relevant.

I think I would need to see a clear strategy that we can actually pull off. Otherwise it may be that you introduce a regression here, that then nobody gets around to fixing later.

Comment on lines +262 to +263
// But if it's not vectorizable, maybe it's profitable to do the conversion on
// some architectures, support it in backends if it's worthwhile.
Contributor

If it is all about vectorization only: we could consider delaying the Ideal optimization until after loop-opts. Then we can keep the multiplication for vectorization, and only use shift/add once we know that we cannot vectorize.

Contributor Author

But for some cases, it's better to do the conversion before vectorization, for example:
x * 8 => x << 3

Through test results and theoretical analysis (only on aarch64, see the commit message), I found that it's better to do the conversion before vectorization if the multiplication can be transformed into no more than one or two shift/add instructions.

Contributor

In theory we could also do a transform with vectors, and convert a vector mul to a vector shift, right?

It is a bit scary to have these cases where some are better before and some better after vectorization. Makes performance quite unpredictable - your change may introduce improvements in some cases but regressions in others.

Node* n1 = phase->transform(
new LShiftINode(in(1), phase->intcon(log2i_exact(abs_con - 1))));
res = new AddINode(in(1), n1);
} else if (is_power_of_2(abs_con + 1)) {
Contributor

So now you only check for power_of_2 +- 1, right? But before we also looked at patterns with 2 bits, such as 64 + 8.

You would really need to prove that this is not a loss on any of the platforms we care about, incl. x64.

Contributor Author

Yes, I have tested the patch on several x64 machines, including AMD Genoa, Intel SPR and some older x86 machines; there's no noticeable performance loss. Test results on aarch64 and AMD Genoa are included in the commit message, please take a look, thanks.

Contributor Author

So now you only check for power_of_2 +- 1, right?

Yes now I only check these patterns:

1, Const = 1<<n,         (n > 0)
2, Const = -(1<<n),      (n > 0)
3, Const = (1<<n) + 1,   (n > 0)
4, Const = (1<<n) - 1,   (n > 0)
5, Const = -((1<<n) -1)  (n > 0)

Removed these patterns:

1, Const = (1<<m) + (1<<n)      (m > 0, n > 0)
2, Const = -((1<<m) + (1<<n))   (m > 0, n > 0)
3, Const = -((1<<n) + 1),       (n > 0)

@erifan
Contributor Author

erifan commented Jan 8, 2025

Hi @eme64 , thanks for your review!

My biggest question: You only mention aarch64. But we would need to know that your changes also work well on x64.

Yes, this patch also benefits the x86_64 platform. I tested it on aarch64 V2 and N1 processors and on AMD Genoa and Intel SPR processors (I can provide the test results if necessary). The results show that in some cases the performance uplift is very large and there is no obvious performance degradation. I'm not familiar with the x86_64 instruction set, so I did the detailed theoretical analysis only on aarch64.

I don't have machines with other architectures, so this patch has only been tested on aarch64 and x64.

Also: Can you summarize if your changes are only for performane of vectorization, or also for scalar code?

For both vectorized and scalar cases. For example, the pattern x * C => -((x<<m) + (x<<n)) is currently supported; it is a pessimization from both the vectorization and the scalar perspective on both aarch64 and x64, and this patch removes it.

As mentioned in my commit message, this patch is good for most cases, but there is a small performance loss in some cases. Ideally, if we implement VPlan, I think it would be better to keep the multiplication operation before vectorization and let the vectorizer generate the optimal vector code. Then, for scalar cases that cannot be vectorized, convert to other more optimal instructions in the backend (or in this pass if it's merged) if necessary.

@eme64
Contributor

eme64 commented Jan 8, 2025

Hi @erifan

Thanks for your responses. I now looked at the benchmark results.
I see regressions in the range of 5% on both of your tested platforms. I'm hesitant to accept that now without the follow-up patches standing first. Maybe you can change the order of your RFE's so that we have no regressions in between?

Maybe you could even create one big patch with everything in it, just so that we can see that there are no regressions. Then split it up into parts (multiple RFEs) for easier review.

Also: It would be nice to see benchmarks on as many different architectures as you can show. And please make sure that the table is nicely aligned - currently it is a bit difficult to read.

If we implement vplan
What do you mean by VPlan? Are you talking about LLVM? I am working on something a little similar with VTransform. But I'm not sure if it is relevant. I mean in theory you can also use the vector API, and then it would be nice if you had a vector mul, that this would also be changed into shifts if that is more profitable. Actually, maybe you should first address that: which vector mul could be vector shift. That would be a nice stand-alone change that you could implement without regressions, right?
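
For instance, something along these lines (just an illustrative Vector API sketch; the class and method names are made up):

```
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

// Illustrative sketch only: an explicit Vector API multiply by a power-of-two constant.
// Today this stays a vector mul; it could be strength-reduced to a vector shift
// (e.g. lanewise(VectorOperators.LSHL, 3)) when that is more profitable.
class VectorMulExample {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    static void scaleBy8(int[] a, int[] r) {
        int i = 0;
        for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            IntVector v = IntVector.fromArray(SPECIES, a, i);
            v.mul(8).intoArray(r, i);   // candidate for conversion to a shift
        }
        for (; i < a.length; i++) {
            r[i] = a[i] * 8;            // scalar tail
        }
    }
}
```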

What do you think?

@erifan
Contributor Author

erifan commented Jan 8, 2025

About the OCA, I am an employee from NVIDIA's Java compiler team, and NVIDIA has signed OCA.

@erifan
Contributor Author

erifan commented Jan 8, 2025

Maybe you could even create one big patch with everything in it, just so that we can see that there are no regressions. Then split it up into parts (multiple RFEs) for easier review.

Ok, I'll combine the patches and file one big patch, and update the test results.

Also: It would be nice to see benchmarks on as many different architectures as you can show. And please make sure that the table is nicely aligned - currently it is a bit difficult to read.

OK, I can provide test results on aarch64 V2, N1, AMD64 Genoa and Intel SPR processors.

What do you mean by VPlan? Are you talking about LLVM? I am working on something a little similar with VTransform. But I'm not sure if it is relevant. I mean in theory you can also use the vector API, and then it would be nice if you had a vector mul, that this would also be changed into shifts if that is more profitable. Actually, maybe you should first address that: which vector mul could be vector shift. That would be a nice stand-alone change that you could implement without regressions, right?

Yes, I mean LLVM VPlan, I noticed your related work, it's very interesting.

it would be nice if you had a vector mul, that this would also be changed into shifts if that is more profitable.

Yes.

@eme64
Contributor

eme64 commented Jan 9, 2025

@erifan I did some more thinking when falling asleep / waking up. This is a really interesting problem here.

For MulINode::Ideal with patterns var * con, we really have these options in assembly:

  • mul: the general case.
  • shift and add: when profitable.
  • lea: could this be an improvement over shift and add?

The issue is that different platforms have different characteristics here for these instructions - we would have to see how they differ. As far as I remember mul is not always available on all ALUs, but add and shift should be available. This impacts their throughput (more ports / ALU means more throughput generally). But the instructions also have different latency. Further, I could imagine that at some point more instructions may not just affect the throughput, but also the code-size: that in turn would increase IR and may at some point affect the instruction cache.

Additionally: if your workload has other mul, shift and add mixed in, then some ports may already be saturated, and that could tilt the balance as to which option you are supposed to take.

And then the characteristics of scalar ops may not be identical to vector ops.

It would be interesting to have a really solid benchmark, where you explore the impact of these different effects.
And it would be interesting to extract a table of latency + throughput characteristics for all relevant scalar + vector ops, for a number of different CPUs. Just so we get an overview of how easy this is to tune.

Maybe perfect tuning is not possible. Maybe we are willing to take a 5% regression in some cases to boost other cases by 30%. But that is a big maybe: we really do not like getting regressions in existing code, it tends to upset people more if they get regressions compared to how much they enjoy speedups - so work like this can be delicate.

Anyway, I don't right now have much time to investigate and work on this myself. So you'd have to do the work, benchmark, explanation etc. But I think the 30% speedup indicates that this work could really have potential!

As to what to do in sequence, here a suggestion:

  1. First work on Vector API cases of vector multiplication - this should have no impact on other things.
  2. Delay the MulINode::Ideal optimizations until after loop-opts: scalar code would still be handled in the old way, but auto-vectorized code would then be turned into MulV. And then go into the mul -> shift optimization for vectors under point 1.
  3. Tackle MulINode::Ideal for scalar cases after loop-opts, and see what you can do there.

This way you can separate scalar and vector changes.
All of this really depends on very good benchmarks, and benchmarks on various platforms. Good presentation would be key here. I find tables with numbers important, but a visual representation on top would be good too - it can give an easier overview over the patterns.

And please investigate when lea is applicable / profitable.

We may also want to get people from ARM and Intel into this discussion at some point.

That's enough for now. Let me know what you think :)

@erifan
Contributor Author

erifan commented Jan 9, 2025

Hi @eme64 thanks for your review.

1. First work on Vector API cases of vector multiplication - this should have no impact on other things.
2. Delay the MulINode::Ideal optimizations until after loop-opts: scalar code would still be handled in the old way, but auto-vectorized code would then be turned into MulV. And then go into the mul -> shift optimization for vectors under point 1.
3. Tackle MulINode::Ideal for scalar cases after loop-opts, and see what you can do there.

I agree with you. I am actually working on 1. The slightly troublesome thing is that 1 and 3 are both related to the architecture, so it might take a little more time.

lea could this be an improvement over shift and add?

AARCH64 doesn't actually have a lea instruction. On x64 there are already some rules that turn shift add into lea.

The issue is that different platforms have different characteristics here for these instructions - we would have to see how they differ. As far as I remember mul is not always available on all ALUs, but add and shift should be available. This impacts their throughput (more ports / ALU means more throughput generally). But the instructions also have different latency. Further, I could imagine that at some point more instructions may not just affect the throughput, but also the code-size: that in turn would increase IR and may at some point affect the instruction cache.

Additionally: if your workload has other mul, shift and add mixed in, then some ports may already be saturated, and that could tilt the balance as to which option you are supposed to take.

And then the characteristics of scalar ops may not be identical to vector ops.

Yes, this is very tricky; the actual performance is related to many aspects, such as the pipeline, latency, throughput, ROB, and even memory performance. We can only optimize based on certain references and generalities, such as the latency and throughput of different instructions. But when it comes to generalities, it is actually difficult to say which scenario is more general.

It would be interesting to have a really solid benchmark, where you explore the impact of these different effects.
And it would be interesting to extract a table of latency + throughput characteristics for all relevant scalar + vector ops, for a number of different CPUs. Just so we get an overview of how easy this is to tune.

I don't know of such a benchmark suite yet. For aarch64, I usually refer to the Arm Optimization Guide, but some instructions seem to be missing there. I guess AMD and Intel have similar documents? I'm not sure.

Maybe perfect tuning is not possible. Maybe we are willing to take a 5% regression in some cases to boost other cases by 30%. But that is a big maybe: we really do not like getting regressions in existing code, it tends to upset people more if they get regressions compared to how much they enjoy speedups - so work like this can be delicate.

Yes I agree, I will deal with the performance loss.

And please investigate when lea is applicable / profitable.

Do you mean shift add => lea ? I think this is already done on x64.

We may also want to get people from ARM and Intel into this discussion at some point.

Yes.

@eme64
Contributor

eme64 commented Jan 9, 2025

@erifan Amazing, thanks for the enthusiasm :)
Looking forward to what you come up with!

@jaskarth
Member

jaskarth commented Feb 2, 2025

I also think that moving mul-related idealization to happen after loop opts would be a good idea. In the past I've noticed that patterns such as x * 4 * 4 do not properly fold into a single multiply, because the mul is strength reduced into left shifts before constant folding takes place (shifts aren't currently constant folded either, but I think that is a separate issue). By doing such strength reductions after loop opts we can fix that issue as well.
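
For example (an illustrative snippet, not from the JDK tests):

```
static int mulBy4Twice(int x) {
    // Ideally this becomes x * 16, i.e. a single x << 4, but if x * 4 is strength-reduced
    // to x << 2 before the constants can combine, the two operations no longer fold.
    return x * 4 * 4;
}
```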

@bridgekeeper bridgekeeper bot removed oca Needs verification of OCA signatory status oca-verify Needs verification of OCA signatory status labels Feb 10, 2025
@openjdk

openjdk bot commented Feb 10, 2025

@erifan this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8346964
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added merge-conflict Pull request has merge conflict with target branch rfr Pull request is ready for review labels Feb 10, 2025
@mlbridge

mlbridge bot commented Feb 10, 2025

Webrevs

Change-Id: Ib47ed4f9c6d69326a0b7cb8ba7c29f604b8fc1ec
@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Feb 11, 2025
@bridgekeeper

bridgekeeper bot commented Mar 2, 2025

@erifan This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@bridgekeeper

bridgekeeper bot commented Mar 30, 2025

@erifan This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

@bridgekeeper bridgekeeper bot closed this Mar 30, 2025
@erifan
Contributor Author

erifan commented Mar 31, 2025

/open

@openjdk openjdk bot reopened this Mar 31, 2025
@openjdk

openjdk bot commented Mar 31, 2025

@erifan This pull request is now open

@eme64
Contributor

eme64 commented Apr 2, 2025

@erifan you opened this again. Does that mean we should review again? I see that you did not make any changes since our last conversation. If it is not ready for review, could you please convert it to Draft, so it is clear that you are not asking for reviews currently?

@erifan erifan marked this pull request as draft April 2, 2025 08:47
@openjdk openjdk bot removed the rfr Pull request is ready for review label Apr 2, 2025
@erifan
Contributor Author

erifan commented Apr 2, 2025

@erifan you opened this again. Does that mean we should review again? I see that you did not make any changes since our last conversation. If it is not ready for review, could you please convert it to Draft, so it is clear that you are not asking for reviews currently?

Oh sorry to bother you. I'm working on something else. For this PR, I have actually finished the modification according to your suggestion, and tests on aarch64 and x86_64 also showed that there's no performance regression. But I didn't file the patch for the following reasons:
1. The changes have grown so much that I'm wondering if they should be split into multiple patches to make review easier.
2. We'll do almost the same optimization for scalar MulINode and vector MulVINode, which looks a bit repetitive, although the IR is different. I wonder if there's a better method for this situation. We haven't done much optimization for vector nodes yet, but as we do more, this will become an issue. For example, should we do constant reassociation for vector nodes?
3. The interaction between different optimization phases is a general problem, and I wonder if there is a more general solution.

Thanks!

@bridgekeeper

bridgekeeper bot commented May 28, 2025

@erifan This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@erifan
Contributor Author

erifan commented Jun 4, 2025

/keepalive

@openjdk

openjdk bot commented Jun 4, 2025

@erifan The pull request is being re-evaluated and the inactivity timeout has been reset.

@bridgekeeper

bridgekeeper bot commented Jul 30, 2025

@erifan This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
