
8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() #22922


Draft: wants to merge 4 commits into master

Conversation

erifan
Contributor

@erifan erifan commented Jan 6, 2025

Constant multiplication x*C can be optimized into LEFT SHIFT, ADD or SUB instructions, since these instructions generally have lower latency and higher throughput than MUL on most architectures. For example:

  1. x*8 can be optimized as x<<3.
  2. x*9 can be optimized as x + (x<<3), which can be lowered to one SHIFT-ADD (ADD instruction combined with LEFT SHIFT) instruction on some architectures, like aarch64 and x86_64.

Currently OpenJDK implements a few such patterns in the mid-end, including:

  1. |C| = 1<<n (n>0)
  2. |C| = (1<<n) - 1 (n>0)
  3. |C| = (1<<m) + (1<<n) (m>n, n>=0)

The first two are fine, because on most architectures they are lowered to a single ADD/SUB/SHIFT instruction.

But the third pattern doesn't always perform well on some architectures, such as aarch64. It can be split into the following sub-patterns:
3.1. C = (1<<n) + 1 (n>0)
3.2. C = -((1<<n) + 1) (n>0)
3.3. C = (1<<m) + (1<<n) (m>n, n>0)
3.4. C = -((1<<m) + (1<<n)) (m>n, n>0)
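
For reference, here is a small illustrative Java helper (hypothetical, not code from this patch) that maps a constant to these sub-patterns; for example, the benchmark constants used below map as 9 → 3.1, -33/-9 → 3.2, 18/36/96 → 3.3 and -18/-36/-96 → 3.4.

```
// Illustrative only: classify a multiplication constant into sub-patterns 3.1 - 3.4.
static String classifyPattern3(int c) {
    long abs = Math.abs((long) c);         // use long so Integer.MIN_VALUE is handled
    if (Long.bitCount(abs) != 2) {
        return "not pattern 3";            // patterns 1 and 2, or a general constant
    }
    boolean plusOne = (abs & 1) == 1;      // abs = (1<<n) + 1 vs. (1<<m) + (1<<n), m>n>0
    if (c > 0) {
        return plusOne ? "3.1" : "3.3";
    } else {
        return plusOne ? "3.2" : "3.4";
    }
}
```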

According to the Arm optimization guide, if the shift amount is greater than 4, the latency and throughput of an ADD with a shifted operand are the same as those of MUL, so in that case converting MUL to ADD is not profitable. Take a[i] * C on aarch64 as an example.

Before (MUL is not converted):

  mov x1, #C
  mul x2, x1, x0

Now (MUL is converted):
For 3.1:

    add x2, x0, x0, lsl #n

For 3.2:

    add x2, x0, x0, lsl #n  // same cost with mul if n > 4
    neg x2, x2

For 3.3:

    lsl x1, x0, #m
    add x2, x1, x0, lsl #n  // same cost with mul if n > 4

For 3.4:

    lsl x1, x0, #m
    add x2, x1, x0, lsl #n  // same cost with mul if n > 4
    neg x2, x2

Test results (ns/op) on Arm Neoverse V2:

Benchmark	Before	Now	Uplift		Pattern	Notes
testInt9	103.379	60.702	1.70305756	3.1
testIntN33	103.231	106.825	0.96635619	3.2	n > 4
testIntN9	103.448	103.005	1.004300762	3.2	n <= 4
testInt18	103.354	99.271	1.041129837	3.3	m <= 4, n <= 4
testInt36	103.396	99.186	1.042445506	3.3	m > 4, n <= 4
testInt96	103.337	105.416	0.980278136	3.3	m > 4, n > 4
testIntN18	103.333	139.258	0.742025593	3.4	m <= 4, n <= 4
testIntN36	103.208	139.132	0.741799155	3.4	m > 4, n <= 4
testIntN96	103.367	139.471	0.74113615	3.4	m > 4, n > 4

(S1) Based on these results, we should treat pattern 3 as follows:
3.1 C = (1<<n) + 1 (n>0)
3.2 C = -((1<<n) + 1) (0<n<=4)
3.3 C = (1<<m) + (1<<n) (m>n, 0<n<=4)
3.4 C = -((1<<m) + (1<<n)) (disable)

Since this conversion is implemented in the mid-end, it impacts other optimizations, such as auto-vectorization. Assume there's the following loop, which can be vectorized.
Vector-A:

for (int i=0; i<len; i++) {
  sum += a[i] * C;
}

Before:

  movi    v19.4s, #C  // this will be hoisted out of the loop
  mla     v16.4s, v17.4s, v19.4s

After:
For 3.1:

    shl     v19.4s, v17.4s, #m
    add     v17.4s, v19.4s, v17.4s
    add     v16.4s, v16.4s, v17.4s

For 3.2:

    (add    w11, w11, w11, lsl #m
    sub     w11, w12, w11) * 4  // *4 is for 4 ints

For 3.3:

    shl     v18.4s, v17.4s, #m
    shl     v19.4s, v17.4s, #n
    add     v18.4s, v19.4s, v18.4s
    add     v16.4s, v16.4s, v18.4s

For 3.4:

    (lsl    w12, w4, #m
    add     w11, w12, w4, lsl #n
    sub     w13, w13, w11) * 4  // *4 is for 4 ints

The code generated before the conversion is simpler and faster:

Benchmark		Before	Now	Uplift		Pattern
testInt9AddSum		47.958	63.696	0.752920121	3.1
testIntN33AddSum	48.013	147.834	0.324776438	3.2
testIntN9AddSum		48.026	149.149	0.322000148	3.2
testInt18AddSum		47.971	69.393	0.691294511	3.3
testInt36AddSum		47.98	69.395	0.69140428	3.3
testInt96AddSum		47.992	69.453	0.690999669	3.3
testIntN18AddSum	48.014	157.132	0.305564748	3.4
testIntN36AddSum	48.02	157.094	0.305676856	3.4
testIntN96AddSum	48.032	153.642	0.312622851	3.4

(S2) Based on these results, we should disable pattern 3 entirely.

But we can have different cases, for example:
Vector-B:

for (int i=0; i<100000; i++) {
  a[i] = a[i] * C;
}

Test results:

Benchmark	Before	Now	Uplift		Pattern
testInt9Store	43.392	33.338	1.301577779	3.1
testIntN33Store	43.365	75.993	0.570644665	3.2
testIntN9Store	43.5	75.452	0.576525473	3.2
testInt18Store	43.442	41.847	1.038115038	3.3
testInt36Store	43.369	41.843	1.03646966	3.3
testInt96Store	43.389	41.931	1.03477141	3.3
testIntN18Store	43.372	57.909	0.748968209	3.4
testIntN36Store	43.373	57.042	0.760369552	3.4
testIntN96Store	43.405	58.145	0.746495829	3.4

(S3) Based on these results, we should treat pattern 3 as follows:
3.1 C = (1<<n) + 1 (n>0)
3.2 C = -((1<<n) + 1) (disable)
3.3 C = (1<<m) + (1<<n) (m>n, n>0)
3.4 C = -((1<<m) + (1<<n)) (disable)

Combining S1, S2 and S3, we get:
Pattern	S1			S2			S3
3.1	(n>0, 1.7)		(disable, 0.75)		(n>0, 1.3)
3.2	(0<n<=4, 1.0)		(disable, 0.32)		(disable, 0.57)
3.3	(m>n, 0<n<=4, 1.04)	(disable, 0.69)		(m>n, n>0, 1.03)
3.4	(disable, 0.74)		(disable, 0.30)		(disable, 0.74)

For 3.1, it's similar to pattern 2 and is usually lowered to only one instruction, so we tend to keep it in the mid-end.
For 3.2, we tend to disable it in the mid-end and do S1 in the back-end if it's profitable.
For 3.3, although S3 shows a 3% performance gain, S2 shows a 31% regression, so we tend to disable it in the mid-end and redo S1 in the back-end.
For 3.4, we shouldn't do this optimization anywhere.

In theory, auto-vectorization should be able to generate the best vectorized code, and cases that cannot be vectorized would be converted into more optimal scalar instructions in the architecture backend (this is what GCC and LLVM do). However, we currently do not have a cost model or VPlan, and the results of auto-vectorization are significantly affected by their input. Therefore, this patch turns off patterns 3.2, 3.3 and 3.4 in the mid-end. Then, where profitable, these patterns can be implemented in the backend. If we implement a cost model and VPlan in the future, it would be best to move all patterns to the backend; this patch does not conflict with that direction.
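
With this change, the only constants still strength-reduced in the mid-end are |C| = 1<<n, |C| = (1<<n) - 1 and C = (1<<n) + 1. A rough sketch of that check (illustrative Java, not the actual mulnode.cpp code):

```
// Rough sketch (illustrative only): constants that the mid-end still strength-reduces
// after this patch; everything else is left as MUL for the backend to handle.
static boolean stillReducedInMidEnd(int c) {
    long abs = Math.abs((long) c);
    return isPowerOf2(abs)                    // |C| = 1<<n        -> shift (negate if C < 0)
        || isPowerOf2(abs + 1)                // |C| = (1<<n) - 1  -> shift then sub
        || (c > 0 && isPowerOf2(abs - 1));    // C  = (1<<n) + 1   -> shift then add (3.1)
}

static boolean isPowerOf2(long v) {
    return v > 0 && (v & (v - 1)) == 0;
}
```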

I also tested this patch on Arm N1, Intel SPR and AMD Genoa machines; no noticeable performance degradation was seen on any of them.

Here are the test results on an Arm V2 and an AMD Genoa machine:

Benchmark	  V2-now	V2-after	Uplift	Genoa-now	Genoa-after	Uplift	Pattern	Notes
testInt8	60.36989	60.276736	1	116.768294	116.772547	0.99	1
testInt8AddSum	63.658064	63.797732	0.99	16.04973	16.051491	0.99	1
testInt8Store	38.829618	39.054129	0.99	19.857453	20.006321	0.99	1
testIntN8	59.99655	60.150053	0.99	132.269926	132.252473	1	1
testIntN8AddSum	145.678098	146.181549	0.99	158.546226	158.806476	0.99	1
testIntN8Store	32.802445	32.897907	0.99	19.047873	19.065941	0.99	1
testInt7	98.978213	99.176574	0.99	114.07026	113.08989	1	2
testInt7AddSum	62.675636	62.310799	1	23.370851	20.971655	1.11	2
testInt7Store	32.850828	32.923315	0.99	23.884952	23.628681	1.01	2
testIntN7	60.27949	60.668158	0.99	174.224893	174.102295	1	2
testIntN7AddSum	62.746696	62.288476	1	20.93192	20.964557	0.99	2
testIntN7Store	32.812906	32.851355	0.99	23.810024	23.526074	1.01	2
testInt9	60.820402	60.331938	1	108.850777	108.846161	1	3.1
testInt9AddSum	62.24679	62.374637	0.99	20.698749	20.741137	0.99	3.1
testInt9Store	32.871723	32.912065	0.99	19.055537	19.080735	0.99	3.1
testIntN33	106.517618	103.450746	1.02	153.894345	140.641135	1.09	3.2	n > 4
testIntN33AddSum	147.589815	47.911612	3.08	153.851885	17.008453	9.04	3.2
testIntN33Store	75.434513	43.473053	1.73	26.612181	20.436323	1.3	3.2
testIntN9	102.173268	103.70682	0.98	155.858169	140.718967	1.1	3.2	n <= 4
testIntN9AddSum	148.724952	47.963305	3.1	186.902111	20.249414	9.23	3.2
testIntN9Store	74.783788	43.339188	1.72	20.150159	20.888448	0.96	3.2
testInt18	98.905625	102.942092	0.96	142.480636	140.748778	1.01	3.3	m <= 4, n <= 4
testInt18AddSum	68.695585	48.103536	1.42	26.88524	16.77886	1.6	3.3
testInt18Store	41.307909	43.385183	0.95	21.233238	20.875026	1.01	3.3
testInt36	99.039742	103.714745	0.95	142.265806	142.334039	0.99	3.3	m > 4, n <= 4
testInt36AddSum	68.736756	47.952189	1.43	26.868362	17.030035	1.57	3.3
testInt36Store	41.403698	43.414093	0.95	21.225454	20.52266	1.03	3.3
testInt96	105.00287	103.528144	1.01	237.649526	140.643255	1.68	3.3	m > 4, n > 4
testInt96AddSum	68.481133	48.04549	1.42	26.877407	16.918209	1.58	3.3
testInt96Store	41.276292	43.512994	0.94	23.456117	20.540181	1.14	3.3
testIntN18	138.629044	103.269657	1.34	210.315628	140.716818	1.49	3.4	m <= 4, n <= 4
testIntN18AddSum	156.635652	48.003989	3.26	215.807135	16.917665	12.75	3.4
testIntN18Store	57.584487	43.410415	1.32	26.819827	20.707778	1.29	3.4
testIntN36	139.068861	103.766774	1.34	209.522432	140.720322	1.48	3.4	m > 4, n <= 4
testIntN36AddSum	156.36928	48.027779	3.25	215.705842	16.893192	12.76	3.4
testIntN36Store	57.715418	43.493958	1.32	21.651252	20.676877	1.04	3.4
testIntN96	139.151761	103.453665	1.34	269.254161	140.753499	1.91	3.4	m > 4, n > 4
testIntN96AddSum	153.123557	48.110524	3.18	263.262635	17.011144	15.47	3.4
testIntN96Store	57.793179	43.47574	1.32	24.444592	20.530219	1.19	3.4

Limitations:
1. This patch only analyzes two vector cases; there may be other vector cases that regress with this patch.
2. This patch does not implement the disabled patterns in the backend; I will propose a follow-up patch to implement them in the aarch64 backend.
3. This patch does not handle the long type, because different architectures have different auto-vectorization support for long, resulting in very different performance, and it is difficult to find a solution that does not introduce significant performance degradation.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/22922/head:pull/22922
$ git checkout pull/22922

Update a local copy of the PR:
$ git checkout pull/22922
$ git pull https://git.openjdk.org/jdk.git pull/22922/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 22922

View PR using the GUI difftool:
$ git pr show -t 22922

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/22922.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper bridgekeeper bot added the oca Needs verification of OCA signatory status label Jan 6, 2025
@bridgekeeper

bridgekeeper bot commented Jan 6, 2025

Hi @erifan, welcome to this OpenJDK project and thanks for contributing!

We do not recognize you as Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing /signed in a comment in this pull request.

If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user erifan" as summary for the issue.

If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know by writing /covered in a comment in this pull request.

@openjdk

openjdk bot commented Jan 6, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk bot changed the title 8346964: C2: Improve integer multiplication with constant in MulINode… 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() Jan 6, 2025
@openjdk

openjdk bot commented Jan 6, 2025

@erifan The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Jan 6, 2025
@erifan
Contributor Author

erifan commented Jan 6, 2025

/signed

@bridgekeeper bridgekeeper bot added the oca-verify Needs verification of OCA signatory status label Jan 6, 2025
@bridgekeeper

bridgekeeper bot commented Jan 6, 2025

Thank you! Please allow for up to two weeks to process your OCA, although it is usually done within one to two business days. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated!

@erifan
Contributor Author

erifan commented Jan 6, 2025

/covered

@bridgekeeper

bridgekeeper bot commented Jan 6, 2025

Thank you! Please allow for a few business days to verify that your employer has signed the OCA. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated!

Contributor

@eme64 eme64 left a comment

Hi @erifan

I have some first questions / comments. I only scanned through quickly.

My biggest question:
You only mention aarch64. But we would need to know that your changes also work well on x64.

Also:
Can you summarize whether your changes are only for the performance of vectorization, or also for scalar code?

@@ -250,27 +250,33 @@ Node *MulINode::Ideal(PhaseGVN *phase, bool can_reshape) {
sign_flip = true;
}

// Get low bit; check for being the only bit
// TODO: abs_con = (1<<m)+(1<<n) and con = -((1<<n)+1) was supported
Contributor

Is this TODO here on purpose?

Contributor Author

Yes, it's a reminder to implement these two patterns for scalar cases in backends if they are worthwhile on the specific architecture.

Contributor

Aha. I see. We don't leave TODO's in the code. Because nobody will ever look at it again. If we agree to go ahead with this, then you should rather file an RFE to keep track of this. But before you file it, let's first discuss the over-all strategy.

private static int addTo6(int a) {
return a + a + a + a + a + a; // a*6 => (a<<1) + (a<<2)
return a + a + a + a + a + a;
Contributor

Is this an improvement?

Contributor

Is this an improvement on aarch64 for all implementations? What about x64?

Contributor Author

If a*6 is in a loop and can be vectorized, there may be a big performance improvement. If it's not vectorized, there may be a small performance loss. See the test results for a*18 = (a<<4) + (a<<1) (the same shape as a*6 = (a<<2) + (a<<1)) in three different cases:

Benchmark        V2-now  V2-after  Uplift  Genoa-now  Genoa-after  Uplift  Notes
testInt18        98.90   102.94    0.96    142.48     140.75       1.01    scalar
testInt18AddSum  68.70   48.10     1.42    26.88      16.78        1.6     vectorized
testInt18Store   41.31   43.39     0.95    21.23      20.88        1.01    vectorized

We can see that for the scalar case the conversion a*6 => (a<<2) + (a<<1) is profitable on aarch64; I have a follow-up patch to reimplement this pattern in the aarch64 backend and will file it later. But for x64 there is no obvious performance change either way. That is also why I left a TODO in mulnode.cpp.

Contributor

Benchmark        V2-now	V2-after  Uplift Genoa-now Genoa-after Uplift  Notes
testInt18        98.90  102.94	  0.96     142.48   140.75	1.01   scalar

Ok, that would be a 4% regression on V2. It is not much, but still possibly relevant.

I think I would need to see a clear strategy that we can actually pull off. Otherwise it may be that you introduce a regression here, that then nobody gets around to fixing later.

Comment on lines +262 to +263
// But if it's not vectorizable, maybe it's profitable to do the conversion on
// some architectures, support it in backends if it's worthwhile.
Contributor

If it is all about vectorization only: we could consider delaying the Ideal optimization until after loop-opts. Then we can keep the multiplication for vectorization, and only use shift/add once we know that we cannot vectorize.

Contributor Author

But for some cases, it's better to do the conversion before vectorization, for example:
x * 8 => x << 3

Through test results and theoretical analysis (only on aarch64, see the commit message), I found that it's better to do the conversion before vectorization if the multiplication can be transformed into no more than one or two shift/add instructions.

Contributor

In theory we could also do a transform with vectors, and convert a vector mul to a vector shift, right?

It is a bit scary to have these cases where some are better before and some better after vectorization. Makes performance quite unpredictable - your change may introduce improvements in some cases but regressions in others.

Node* n1 = phase->transform(
new LShiftINode(in(1), phase->intcon(log2i_exact(abs_con - 1))));
res = new AddINode(in(1), n1);
} else if (is_power_of_2(abs_con + 1)) {
Contributor

So now you only check for power_of_2 +- 1, right? But before we also looked at patterns with 2 bits, such as 64 + 8.

You would really need to prove that this is not a loss on any of the platforms we care about, incl. x64.

Contributor Author

Yes, I have tested the patch on several x64 machines, including AMD Genoa, Intel SPR and some older x86 machines; there's no noticeable performance loss. Test results on aarch64 and AMD Genoa are included in the commit message, please take a look, thanks.

Contributor Author

So now you only check for power_of_2 +- 1, right?

Yes now I only check these patterns:

1, Const = 1<<n,         (n > 0)
2, Const = -(1<<n),      (n > 0)
3, Const = (1<<n) + 1,   (n > 0)
4, Const = (1<<n) - 1,   (n > 0)
5, Const = -((1<<n) -1)  (n > 0)

Removed these patterns:

1, Const = (1<<m) + (1<<n)      (m > 0, n > 0)
2, Const = -((1<<m) + (1<<n))   (m > 0, n > 0)
3, Const = -((1<<n) + 1),       (n > 0)

@erifan
Contributor Author

erifan commented Jan 8, 2025

Hi @eme64 , thanks for your review!

My biggest question: You only mention aarch64. But we would need to know that your changes also work well on x64.

Yes, this patch also benefits the x86_64 platform. I tested it on aarch64 V2 and N1 processors and on AMD Genoa and Intel SPR processors (I can provide the test results if necessary). The results show that in some cases the performance uplift is very large and there is no obvious performance degradation. I'm not familiar with the x86_64 instruction set, so I did the detailed theoretical analysis only on aarch64.

I don't have machines with other architectures, so this patch has only been tested on aarch64 and x64.

Also: Can you summarize if your changes are only for performane of vectorization, or also for scalar code?

For both vectorized and scalar cases. For example, the pattern x * C => -((x<<m) + (x<<n)) is currently supported; it is a pessimization from both the vectorization and the scalar perspective on both aarch64 and x64, and this patch removes it.

As mentioned in my commit message, this patch is good for most cases, but there is a small performance loss in some cases. Ideally, if we implement VPlan, I think it would be better to keep the multiplication operation before vectorization and let the vectorizer generate the optimal vector code. Then, for scalar cases that cannot be vectorized, convert to other more optimal instructions in the backend (or in this pass if it's merged) if necessary.

@eme64
Contributor

eme64 commented Jan 8, 2025

Hi @erifan

Thanks for your responses. I now looked at the benchmark results.
I see regressions in the range of 5% on both of your tested platforms. I'm hesitant to accept that now without the follow-up patches standing first. Maybe you can change the order of your RFE's so that we have no regressions in between?

Maybe you could even create one big patch with everything in it, just so that we can see that there are no regressions. Then split it up into parts (multiple RFEs) for easier review.

Also: It would be nice to see benchmarks on as many different architectures as you can show. And please make sure that the table is nicely aligned - currently it is a bit difficult to read.

If we implement vplan
What do you mean by VPlan? Are you talking about LLVM? I am working on something a little similar with VTransform. But I'm not sure if it is relevant. I mean in theory you can also use the vector API, and then it would be nice if you had a vector mul, that this would also be changed into shifts if that is more profitable. Actually, maybe you should first address that: which vector mul could be vector shift. That would be a nice stand-alone change that you could implement without regressions, right?
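
For instance, something along these lines (just an illustrative Vector API sketch; the class and method names are made up):

```
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

// Illustrative sketch only: an explicit Vector API multiply by a power-of-two constant.
// Today this stays a vector mul; it could be strength-reduced to a vector shift
// (e.g. lanewise(VectorOperators.LSHL, 3)) when that is more profitable.
class VectorMulExample {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    static void scaleBy8(int[] a, int[] r) {
        int i = 0;
        for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            IntVector v = IntVector.fromArray(SPECIES, a, i);
            v.mul(8).intoArray(r, i);   // candidate for conversion to a shift
        }
        for (; i < a.length; i++) {
            r[i] = a[i] * 8;            // scalar tail
        }
    }
}
```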

What do you think?

@erifan
Contributor Author

erifan commented Jan 8, 2025

About the OCA, I am an employee from NVIDIA's Java compiler team, and NVIDIA has signed OCA.

@erifan
Contributor Author

erifan commented Jan 8, 2025

Maybe you could even create one big patch with everything in it, just so that we can see that there are no regressions. Then split it up into parts (multiple RFEs) for easier review.

Ok, I'll combine the patches and file one big patch, and update the test results.

Also: It would be nice to see benchmarks on as many different architectures as you can show. And please make sure that the table is nicely aligned - currently it is a bit difficult to read.

OK, I can provide test results on aarch64 V2, N1, AMD64 Genoa and Intel SPR processors.

What do you mean by VPlan? Are you talking about LLVM? I am working on something a little similar with VTransform. But I'm not sure if it is relevant. I mean in theory you can also use the vector API, and then it would be nice if you had a vector mul, that this would also be changed into shifts if that is more profitable. Actually, maybe you should first address that: which vector mul could be vector shift. That would be a nice stand-alone change that you could implement without regressions, right?

Yes, I mean LLVM VPlan, I noticed your related work, it's very interesting.

it would be nice if you had a vector mul, that this would also be changed into shifts if that is more profitable.

Yes.

@eme64
Contributor

eme64 commented Jan 9, 2025

@erifan I did some more thinking when falling asleep / waking up. This is a really interesting problem here.

For MulINode::Ideal with patterns var * con, we really have these options in assembly:

  • mul: the general case.
  • shift and add: when profitable.
  • lea: could this be an improvement over shift and add?

The issue is that different platforms have different characteristics here for these instructions - we would have to see how they differ. As far as I remember mul is not always available on all ALUs, but add and shift should be available. This impacts their throughput (more ports / ALU means more throughput generally). But the instructions also have different latency. Further, I could imagine that at some point more instructions may not just affect the throughput, but also the code-size: that in turn would increase IR and may at some point affect the instruction cache.

Additionally: if your workload has other mul, shift and add mixed in, then some ports may already be saturated, and that could tilt the balance as to which option you are supposed to take.

And then the characteristics of scalar ops may not be identical to vector ops.

It would be interesting to have a really solid benchmark, where you explore the impact of these different effects.
And it would be interesting to extract a table of latency + throughput characteristics for all relevant scalar + vector ops, for a number of different CPUs. Just so we get an overview of how easy this is to tune.

Maybe perfect tuning is not possible. Maybe we are willing to take a 5% regression in some cases to boost other cases by 30%. But that is a big maybe: we really do not like getting regressions in existing code, it tends to upset people more if they get regressions compared to how much they enjoy speedups - so work like this can be delicate.

Anyway, I don't right now have much time to investigate and work on this myself. So you'd have to do the work, benchmark, explanation etc. But I think the 30% speedup indicates that this work could really have potential!

As to what to do in sequence, here a suggestion:

  1. First work on Vector API cases of vector multiplication - this should have no impact on other things.
  2. Delay the MulINode::Ideal optimizations until after loop-opts: scalar code would still be handled in the old way, but auto-vectorized code would then be turned into MulV. And then go into the mul -> shift optimization for vectors under point 1.
  3. Tackle MulINode::Ideal for scalar cases after loop-opts, and see what you can do there.

This way you can separate scalar and vector changes.
All of this really depends on very good benchmarks, and benchmarks on various platforms. Good presentation would be key here. I find tables with numbers important, but a visual representation on top would be good too - it can give an easier overview over the patterns.

And please investigate when lea is applicable / profitable.

We may also want to get people from ARM and Intel into this discussion at some point.

That's enough for now. Let me know what you think :)

@erifan
Contributor Author

erifan commented Jan 9, 2025

Hi @eme64 thanks for your review.

1. First work on Vector API cases of vector multiplication - this should have no impact on other things.
2. Delay the MulINode::Ideal optimizations until after loop-opts: scalar code would still be handled in the old way, but auto-vectorized code would then be turned into MulV. And then go into the mul -> shift optimization for vectors under point 1.
3. Tackle MulINode::Ideal for scalar cases after loop-opts, and see what you can do there.

I agree with you. I am actually working on 1. The slightly troublesome thing is that 1 and 3 are both related to the architecture, so it might take a little more time.

lea could this be an improvement over shift and add?

AARCH64 doesn't actually have a lea instruction. On x64 there are already some rules that turn shift add into lea.

The issue is that different platforms have different characteristics here for these instructions - we would have to see how they differ. As far as I remember mul is not always available on all ALUs, but add and shift should be available. This impacts their throughput (more ports / ALU means more throughput generally). But the instructions also have different latency. Further, I could imagine that at some point more instructions may not just affect the throughput, but also the code-size: that in turn would increase IR and may at some point affect the instruction cache.

Additionally: if your workload has other mul, shift and add mixed in, then some ports may already be saturated, and that could tilt the balance as to which option you are supposed to take.

And then the characteristics of scalar ops may not be identical to vector ops.

Yes, this is very tricky; the actual performance is related to many aspects, such as the pipeline, latency, throughput, ROB, and even memory performance. We can only optimize based on certain references and generalities, such as the latency and throughput of different instructions. But when it comes to generalities, it is actually difficult to say which scenario is more general.

It would be interesting to have a really solid benchmark, where you explore the impact of these different effects.
And it would be interesting to extract a table of latency + throughput characteristics for all relevant scalar + vector ops, for a number of different CPUs. Just so we get an overview of how easy this is to tune.

I don't know of such a benchmark suite yet. For aarch64, I usually refer to the Arm Optimization Guide, but some instructions seem to be missing there. I guess AMD and Intel have similar documents? I'm not sure.

Maybe perfect tuning is not possible. Maybe we are willing to take a 5% regression in some cases to boost other cases by 30%. But that is a big maybe: we really do not like getting regressions in existing code, it tends to upset people more if they get regressions compared to how much they enjoy speedups - so work like this can be delicate.

Yes I agree, I will deal with the performance loss.

And please investigate when lea is applicable / profitable.

Do you mean shift add => lea ? I think this is already done on x64.

We may also want to get people from ARM and Intel into this discussion at some point.

Yes.

@eme64
Contributor

eme64 commented Jan 9, 2025

@erifan Amazing, thanks for the enthusiasm :)
Looking forward to what you come up with!

@jaskarth
Member

jaskarth commented Feb 2, 2025

I also think that moving mul-related idealization to happen after loop opts would be a good idea. In the past I've noticed that patterns such as x * 4 * 4 do not properly fold into a single multiply, because the mul is strength reduced into left shifts before constant folding takes place (shifts aren't currently constant folded either, but I think that is a separate issue). By doing such strength reductions after loop opts we can fix that issue as well.
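
For example (an illustrative snippet, not from the JDK tests):

```
static int mulBy4Twice(int x) {
    // Ideally this becomes x * 16, i.e. a single x << 4, but if x * 4 is strength-reduced
    // to x << 2 before the constants can combine, the two operations no longer fold.
    return x * 4 * 4;
}
```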

@bridgekeeper bridgekeeper bot removed oca Needs verification of OCA signatory status oca-verify Needs verification of OCA signatory status labels Feb 10, 2025
@openjdk

openjdk bot commented Feb 10, 2025

@erifan this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8346964
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added merge-conflict Pull request has merge conflict with target branch rfr Pull request is ready for review labels Feb 10, 2025
@mlbridge

mlbridge bot commented Feb 10, 2025

Webrevs

Change-Id: Ib47ed4f9c6d69326a0b7cb8ba7c29f604b8fc1ec
@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Feb 11, 2025
@bridgekeeper

bridgekeeper bot commented Mar 2, 2025

@erifan This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@bridgekeeper

bridgekeeper bot commented Mar 30, 2025

@erifan This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

@bridgekeeper bridgekeeper bot closed this Mar 30, 2025
@erifan
Contributor Author

erifan commented Mar 31, 2025

/open

@openjdk openjdk bot reopened this Mar 31, 2025
@openjdk

openjdk bot commented Mar 31, 2025

@erifan This pull request is now open

@eme64
Contributor

eme64 commented Apr 2, 2025

@erifan you opened this again. Does that mean we should review again? I see that you did not make any changes since our last conversation. If it is not ready for review, could you please convert it to Draft, so it is clear that you are not asking for reviews currently?

@erifan erifan marked this pull request as draft April 2, 2025 08:47
@openjdk openjdk bot removed the rfr Pull request is ready for review label Apr 2, 2025
@erifan
Contributor Author

erifan commented Apr 2, 2025

@erifan you opened this again. Does that mean we should review again? I see that you did not make any changes since our last conversation. If it is not ready for review, could you please convert it to Draft, so it is clear that you are not asking for reviews currently?

Oh sorry to bother you. I'm working on something else. For this PR, I have actually finished the modification according to your suggestion, and tests on aarch64 and x86_64 also showed that there's no performance regression. But I didn't file the patch for the following reasons:
1. The changes have grown so much that I'm wondering if they should be split into multiple patches to make review easier.
2. We'll do almost the same optimization for scalar MulINode and vector MulVINode, which looks a bit repetitive, although the IR is different. I wonder if there's a better method for this situation. We haven't done much optimization for vector nodes yet, but as we do more, this will become an issue. For example, should we do constant reassociation for vector nodes?
3. The interaction between different optimization phases is a general problem, and I wonder if there is a more general solution.

Thanks!

@bridgekeeper

bridgekeeper bot commented May 28, 2025

@erifan This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@erifan
Contributor Author

erifan commented Jun 4, 2025

/keepalive

@openjdk

openjdk bot commented Jun 4, 2025

@erifan The pull request is being re-evaluated and the inactivity timeout has been reset.

@bridgekeeper

bridgekeeper bot commented Jul 30, 2025

@erifan This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
