Vectorize filterCompetitiveHits
#14896
Conversation
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.
BTW, the benchmark result shown above was run on a machine with
Wow, this is a big speedup! I'd like to get opinions on
FWIW I like higher values of this parameter better. Otherwise a change may look like a speedup when in fact it is overfitted for a particular query, and other similar queries don't get the same speedup. I usually keep it at 5, like the nightly benchmarks. Separately, I wonder if we should keep accumulating scores in
Yeah, I agree with that too; I've seen the
Of course! I ran luceneutil on
And it shows a similar speedup!
filterCompetitiveHits
This looks good to me. The usage of the vector API, and the structuring of the code and test look good. Let's add a micro benchmark to lucene/benchmark-jmh (though the in-place modification of the arrays complicates writing a micro benchmark! :-( ).
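The in-place mutation mentioned above is the tricky part: a benchmark state array is already filtered after the first invocation, so every measured call must work on a fresh copy. A minimal plain-Java sketch of that copy-per-invocation pattern (the scalar filter and all names here are hypothetical stand-ins, not Lucene's actual implementation; a real JMH benchmark would use `@Setup(Level.Invocation)` or copy inside the benchmark method):

```java
public class FilterBench {
  // Hypothetical stand-in for the method under test: compacts docs/scores in
  // place, keeping only hits with score >= minScore, and returns the new length.
  static int filterCompetitiveHits(int[] docs, float[] scores, int len, float minScore) {
    int kept = 0;
    for (int i = 0; i < len; i++) {
      if (scores[i] >= minScore) {
        docs[kept] = docs[i];
        scores[kept] = scores[i];
        kept++;
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    int size = 128;
    int[] docs = new int[size];
    float[] scores = new float[size];
    for (int i = 0; i < size; i++) {
      docs[i] = i;
      scores[i] = (float) Math.random();
    }
    // Because the method mutates its inputs, each iteration must run on a
    // pristine copy; otherwise later iterations see already-filtered arrays.
    int[] docsCopy = new int[size];
    float[] scoresCopy = new float[size];
    long best = Long.MAX_VALUE;
    for (int iter = 0; iter < 1000; iter++) {
      System.arraycopy(docs, 0, docsCopy, 0, size);
      System.arraycopy(scores, 0, scoresCopy, 0, size);
      long start = System.nanoTime();
      filterCompetitiveHits(docsCopy, scoresCopy, size, 0.5f);
      best = Math.min(best, System.nanoTime() - start);
    }
    System.out.println("best ns/op: " + best);
  }
}
```

Note the copy cost is measured inside the loop here for simplicity; JMH's per-invocation setup keeps it out of the timed region.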
LGTM
I left very minor comments, otherwise it looks good to me.
lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java
lucene/core/src/java24/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
I've added a JMH benchmark, here is the result on my machine:
I left very minor comments, otherwise LGTM.
lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/CompetitiveBenchmark.java
After the newest commit, the benchmark results are as follows (I added a case where
Hi, I am out of the house this weekend. In general the code and the usual VectorUtil abstraction look fine; about the risks:
lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/CompetitiveBenchmark.java
I get the following results on an AMD Ryzen 9 3900X (AVX2 support, but no AVX-512 support):
Agreed, I'm running the luceneutil on my machine based on a temporary branch to verify the branchless way, will share the result once it finishes its iterations.
Here is the result comparing the branchless way (candidate) vs main branch (baseline) under identical setup:
I think this change shows a good enough speedup; I will run another luceneutil comparing the explicit vectorization and the branchless way. BTW, should I keep that part of vectorized code, or just keep the branchless way if we are about to merge this? (IMHO, it might be beneficial if we can figure out a way to enable those complex vectorized operations (of course, not in this PR) without slowing down machines that don't support the underlying instructions (or where they are not enabled in the JVM), because there may be other places where we could benefit from vectorization?)
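For reference, the "branchless way" discussed here can be sketched as follows (a simplified illustration, not necessarily the exact code in the PR): instead of branching on the comparison, every element is copied unconditionally and the write pointer advances by the 0/1 result of the comparison, which the JIT can often compile without an unpredictable branch (e.g. via cmov/setcc):

```java
public class BranchlessFilter {
  // Branchless compaction: always copy, then advance the write index by the
  // comparison result instead of branching on it. The data dependency replaces
  // a branch whose outcome is hard to predict for mid-range minScore values.
  static int filterCompetitiveHits(int[] docs, float[] scores, int len, float minScoreInclusive) {
    int kept = 0;
    for (int i = 0; i < len; i++) {
      docs[kept] = docs[i];
      scores[kept] = scores[i];
      kept += scores[i] >= minScoreInclusive ? 1 : 0; // often compiled branch-free
    }
    return kept;
  }

  public static void main(String[] args) {
    int[] docs = {10, 11, 12, 13};
    float[] scores = {0.1f, 0.9f, 0.5f, 0.2f};
    int kept = filterCompetitiveHits(docs, scores, scores.length, 0.5f);
    System.out.println(kept + " competitive hits kept");
  }
}
```

This matches the JMH results above: the branchy baseline collapses when the filter keeps roughly half the hits (worst-case branch prediction), while the branchless variants stay flat across minScore values.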
We can try to improve it and make more checks available, but the developer has to remember to use them, e.g. we have
In cases like this one, I'm afraid it gets no simpler: SVE must be checked for on ARM, and AVX-512 likely should be avoided too. AVX-512 is a mess of smaller instruction sets, and maybe doing this compress requires VBMI or VBMI2 or something (I haven't looked yet, this is just an example). So better to use 256-bit vectors only, since OpenJDK has a basic AVX2 algorithm. https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#AVX-512_CPU_compatibility_table
honestly, to make it reasonable and prevent traps like this, a good approach would be to support less stuff: e.g. only support 256-bit SVE on ARM and 256-bit AVX2 on x86. by ditching NEON and AVX-512, it would be simpler for us to maintain and reduce traps. Of course I imagine this would be controversial, I'm just discussing ways we can make our lives easier. at least we can start by ditching SSE2/AVX-128, I figure this shouldn't be controversial: #14901 |
Thanks for running macro benchmarks with the branchless approach!
I was thinking of creating a new PR with the branchless approach, merging it, then rebasing this PR and continuing discussions wrt how/when to use explicit vectorization depending on how performance compares with our new (branchless) baseline. Would you like to open a PR that switches main to the branchless impl? |
Good idea I think; sometimes the preferred bit size is still 256-bit even on a machine that supports AVX-512, and 512-bit also carries a risk of thermal throttling.
Of course! I've opened a new PR #14906
I also ran luceneutil with branchless as baseline and explicit vectorization as candidate, with identical setup, here is the result if it helps: result
I want to share the luceneutil result against the latest code: result
There's still about a 5% margin for improvement.
Thank you for running benchmarks again. 5% is not a small speedup, we should definitely keep looking into this!
I left some minor comments, but to me the biggest question left is how the optimization should be disabled on architectures that do not support compress() efficiently such as Neon. The current check based on the number of lanes feels fragile. @rmuir may have ideas.
I ran the microbenchmark on my AMD Ryzen 3900X; it suggests similar performance to the branchless impl:
Benchmark (minScoreInclusive) (size) Mode Cnt Score Error Units
CompetitiveBenchmark.baseline 0 128 thrpt 5 16964.805 ± 811.642 ops/ms
CompetitiveBenchmark.baseline 0.2 128 thrpt 5 3349.967 ± 374.600 ops/ms
CompetitiveBenchmark.baseline 0.4 128 thrpt 5 1945.372 ± 41.630 ops/ms
CompetitiveBenchmark.baseline 0.5 128 thrpt 5 1849.745 ± 114.880 ops/ms
CompetitiveBenchmark.baseline 0.8 128 thrpt 5 3520.931 ± 142.512 ops/ms
CompetitiveBenchmark.branchlessCandidate 0 128 thrpt 5 17189.332 ± 935.697 ops/ms
CompetitiveBenchmark.branchlessCandidate 0.2 128 thrpt 5 8868.229 ± 315.512 ops/ms
CompetitiveBenchmark.branchlessCandidate 0.4 128 thrpt 5 8844.621 ± 213.875 ops/ms
CompetitiveBenchmark.branchlessCandidate 0.5 128 thrpt 5 8788.696 ± 305.367 ops/ms
CompetitiveBenchmark.branchlessCandidate 0.8 128 thrpt 5 8783.131 ± 639.710 ops/ms
CompetitiveBenchmark.branchlessCandidateCmov 0 128 thrpt 5 17920.213 ± 629.212 ops/ms
CompetitiveBenchmark.branchlessCandidateCmov 0.2 128 thrpt 5 10342.089 ± 343.342 ops/ms
CompetitiveBenchmark.branchlessCandidateCmov 0.4 128 thrpt 5 10493.824 ± 638.798 ops/ms
CompetitiveBenchmark.branchlessCandidateCmov 0.5 128 thrpt 5 10339.155 ± 314.769 ops/ms
CompetitiveBenchmark.branchlessCandidateCmov 0.8 128 thrpt 5 10461.349 ± 399.232 ops/ms
CompetitiveBenchmark.vectorizedCandidate 0 128 thrpt 5 10103.773 ± 193.359 ops/ms
CompetitiveBenchmark.vectorizedCandidate 0.2 128 thrpt 5 10136.886 ± 311.288 ops/ms
CompetitiveBenchmark.vectorizedCandidate 0.4 128 thrpt 5 10236.376 ± 255.335 ops/ms
CompetitiveBenchmark.vectorizedCandidate 0.5 128 thrpt 5 10100.938 ± 196.165 ops/ms
CompetitiveBenchmark.vectorizedCandidate 0.8 128 thrpt 5 10085.442 ± 358.695 ops/ms
However, luceneutil seems to see a speedup on wikibigall, so it's promising that it doesn't only help with AVX-512.
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
TermTitleSort 87.00 (4.2%) 85.96 (5.6%) -1.2% ( -10% - 9%) 0.447
AndHighOrMedMed 50.93 (2.1%) 50.74 (2.5%) -0.4% ( -4% - 4%) 0.610
FilteredDismaxOrHighHigh 70.96 (2.6%) 70.73 (2.7%) -0.3% ( -5% - 5%) 0.697
TermDayOfYearSort 286.00 (1.6%) 285.38 (1.3%) -0.2% ( -3% - 2%) 0.630
CountFilteredOrMany 27.14 (1.3%) 27.09 (1.9%) -0.2% ( -3% - 3%) 0.728
FilteredOrHighMed 154.55 (1.0%) 154.35 (1.0%) -0.1% ( -2% - 1%) 0.672
FilteredDismaxOrHighMed 131.33 (2.1%) 131.17 (2.0%) -0.1% ( -4% - 4%) 0.848
CountAndHighMed 309.76 (1.2%) 309.54 (1.7%) -0.1% ( -2% - 2%) 0.876
FilteredIntNRQ 297.68 (1.0%) 297.50 (0.7%) -0.1% ( -1% - 1%) 0.823
TermMonthSort 3565.51 (1.7%) 3564.45 (2.0%) -0.0% ( -3% - 3%) 0.960
FilteredOr3Terms 167.89 (0.9%) 167.88 (1.0%) -0.0% ( -1% - 1%) 0.978
CountFilteredOrHighHigh 137.15 (0.7%) 137.15 (0.9%) -0.0% ( -1% - 1%) 0.984
CountFilteredOrHighMed 149.10 (0.6%) 149.15 (0.6%) 0.0% ( -1% - 1%) 0.867
FilteredOr2Terms2StopWords 148.16 (1.0%) 148.22 (1.0%) 0.0% ( -1% - 2%) 0.895
CountAndHighHigh 357.92 (2.0%) 358.18 (2.4%) 0.1% ( -4% - 4%) 0.918
CombinedAndHighMed 89.28 (1.3%) 89.35 (1.8%) 0.1% ( -2% - 3%) 0.879
CountOrMany 29.08 (1.4%) 29.11 (1.8%) 0.1% ( -3% - 3%) 0.853
FilteredPhrase 32.36 (1.5%) 32.39 (1.0%) 0.1% ( -2% - 2%) 0.790
CountPhrase 4.22 (1.7%) 4.23 (1.5%) 0.2% ( -2% - 3%) 0.732
FilteredOrStopWords 45.94 (1.9%) 46.03 (1.5%) 0.2% ( -3% - 3%) 0.716
CombinedOrHighMed 88.00 (2.4%) 88.20 (1.9%) 0.2% ( -3% - 4%) 0.731
CountFilteredPhrase 25.42 (1.8%) 25.48 (1.0%) 0.2% ( -2% - 3%) 0.601
CombinedTerm 38.45 (2.6%) 38.54 (2.6%) 0.2% ( -4% - 5%) 0.776
FilteredPrefix3 151.00 (1.5%) 151.51 (1.8%) 0.3% ( -3% - 3%) 0.528
CountOrHighHigh 341.83 (1.9%) 343.02 (2.4%) 0.3% ( -3% - 4%) 0.617
FilteredOrHighHigh 67.43 (1.8%) 67.67 (1.6%) 0.3% ( -2% - 3%) 0.514
FilteredOrMany 16.53 (1.7%) 16.59 (2.1%) 0.4% ( -3% - 4%) 0.520
CountOrHighMed 361.72 (1.4%) 363.32 (2.3%) 0.4% ( -3% - 4%) 0.465
FilteredAnd2Terms2StopWords 213.67 (1.3%) 215.21 (1.1%) 0.7% ( -1% - 3%) 0.058
FilteredAndHighHigh 78.46 (1.9%) 79.03 (2.3%) 0.7% ( -3% - 5%) 0.280
CountTerm 9673.38 (2.7%) 9743.81 (2.5%) 0.7% ( -4% - 6%) 0.377
DismaxOrHighMed 188.33 (4.9%) 189.73 (5.1%) 0.7% ( -8% - 11%) 0.638
PKLookup 312.42 (5.1%) 314.78 (4.6%) 0.8% ( -8% - 11%) 0.624
TermDTSort 386.11 (2.0%) 389.12 (2.3%) 0.8% ( -3% - 5%) 0.257
FilteredAndStopWords 64.83 (2.0%) 65.36 (2.4%) 0.8% ( -3% - 5%) 0.247
OrHighMed 251.13 (8.3%) 253.57 (8.1%) 1.0% ( -14% - 19%) 0.709
DismaxTerm 737.97 (7.0%) 745.84 (6.3%) 1.1% ( -11% - 15%) 0.612
FilteredAnd3Terms 187.22 (2.1%) 189.33 (2.2%) 1.1% ( -3% - 5%) 0.091
FilteredDismaxTerm 161.96 (3.1%) 163.96 (1.3%) 1.2% ( -3% - 5%) 0.101
FilteredTerm 161.87 (3.5%) 163.99 (1.4%) 1.3% ( -3% - 6%) 0.117
CombinedAndHighHigh 23.01 (1.6%) 23.33 (1.7%) 1.4% ( -1% - 4%) 0.007
DismaxOrHighHigh 127.57 (5.7%) 129.55 (5.1%) 1.6% ( -8% - 13%) 0.364
Or2Terms2StopWords 200.65 (5.9%) 203.99 (5.9%) 1.7% ( -9% - 14%) 0.371
And2Terms2StopWords 199.17 (5.7%) 202.67 (5.5%) 1.8% ( -8% - 13%) 0.323
CombinedOrHighHigh 22.75 (3.3%) 23.16 (1.8%) 1.8% ( -3% - 7%) 0.030
OrStopWords 44.74 (11.7%) 45.62 (11.1%) 2.0% ( -18% - 28%) 0.589
Term 623.54 (9.7%) 635.98 (8.5%) 2.0% ( -14% - 22%) 0.489
OrHighRare 290.94 (8.7%) 297.45 (5.2%) 2.2% ( -10% - 17%) 0.325
OrHighHigh 71.01 (12.3%) 72.76 (11.9%) 2.5% ( -19% - 30%) 0.519
AndHighMed 190.36 (9.8%) 196.24 (9.8%) 3.1% ( -15% - 25%) 0.320
FilteredAndHighMed 150.51 (3.4%) 156.00 (3.1%) 3.6% ( -2% - 10%) 0.000
AndMedOrHighHigh 82.89 (5.2%) 86.16 (4.9%) 4.0% ( -5% - 14%) 0.013
AndHighHigh 61.20 (12.9%) 63.68 (13.1%) 4.0% ( -19% - 34%) 0.326
AndStopWords 41.77 (11.0%) 44.06 (11.0%) 5.5% ( -14% - 30%) 0.114
Or3Terms 216.89 (7.5%) 229.58 (7.0%) 5.8% ( -8% - 21%) 0.011
And3Terms 224.44 (7.3%) 237.93 (7.0%) 6.0% ( -7% - 21%) 0.008
OrMany 22.31 (4.9%) 23.66 (4.5%) 6.1% ( -3% - 16%) 0.000
lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/CompetitiveBenchmark.java
lucene/core/src/java24/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
lucene/core/src/test/org/apache/lucene/util/TestVectorUtil.java
To move this forward, I believe that we should try to add detection of
UseSVE should be easy to detect in the same way as this: lucene/lucene/core/src/java/org/apache/lucene/util/Constants.java, lines 93 to 95 in 633bb1c
For AVX, I think you have to parse the option as an integer and compare
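Along these lines, reading HotSpot flags like UseAVX can be sketched with the platform management API (hedged: helper names here are hypothetical, Lucene's Constants.java does this via reflection; UseAVX exists only on x86 HotSpot and UseSVE only on aarch64, so an unknown flag must fall back to a default):

```java
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class CpuFeatureCheck {
  // Read an integer-valued HotSpot option (e.g. UseAVX is 0..3 on x86),
  // returning defaultValue when the flag does not exist on this JVM/CPU.
  static int getIntVMOption(String name, int defaultValue) {
    try {
      HotSpotDiagnosticMXBean bean =
          ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
      return parseIntOption(bean.getVMOption(name).getValue(), defaultValue);
    } catch (Exception e) {
      // IllegalArgumentException for unknown flags, or no HotSpot bean at all
      return defaultValue;
    }
  }

  // Parse the flag's string value; non-numeric (e.g. a boolean flag) falls back.
  static int parseIntOption(String value, int defaultValue) {
    try {
      return Integer.parseInt(value);
    } catch (NumberFormatException e) {
      return defaultValue;
    }
  }

  public static void main(String[] args) {
    System.out.println("UseAVX=" + getIntVMOption("UseAVX", -1));
  }
}
```

A vectorized code path could then be gated on something like `getIntVMOption("UseAVX", 0) >= 2` on x86, in addition to checking the preferred vector species size.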
+1: I like that idea of having checks for each "feature". It makes this haphazard situation amenable to static analysis to prevent problems; I will separately look into that. Note, this PR also uses some other features such as VectorMask.cast() and so on; I don't know off the top of my head what CPU instructions they use. Maybe nothing scary/special, I just don't know.
Sorry for the late reply, I'm a little busy these days. As for the AVX, I'm a little bit confused by the source code; it seems we only need to check whether the preferred vector size is greater than or equal to 256 bits? Hope I'm not getting anything wrong. I'll try to implement this logic if it's correct, and check whether VectorMask.cast causes any extra trouble.
it seems
As for amd64, we still only need to check the vector bit size? (and maybe also the
If you find more slow vector methods (such as this VectorMask.cast that you found here), can you add them to this list? It may help the next PR.
Can you fix the message in forbidden-apis for compress? It currently reads: "@defaultMessage Potentially slow on some CPUs, please check the CPU has feature: Unsupported on NEON". There you should add your new message and then combine both sections.
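For context, a forbidden-apis signatures section of the kind being discussed might look like the sketch below (hypothetical wording and method signatures, shown only to illustrate the file format of a `@defaultMessage` followed by the signatures it covers):

```
@defaultMessage Potentially slow on some CPUs; check CPU features first (unsupported on NEON, and compress may need AVX-512 VBMI2 on x86)
jdk.incubator.vector.IntVector#compress(jdk.incubator.vector.VectorMask)
jdk.incubator.vector.FloatVector#compress(jdk.incubator.vector.VectorMask)
```

Combining both sections means every flagged method sits under one shared, accurate message instead of two near-duplicates.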
I didn't really check the
Description
This PR is a follow-up to a comment from #14827, trying to vectorize the filterCompetitiveHits function by utilizing (Int|Float)Vector#compress. I'm still working on it; tests are not added yet, nor is the code stable. Comments and suggestions are welcome!
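The core primitive here is `(Int|Float)Vector#compress(VectorMask)`, which packs the lanes selected by the mask to the front of the vector and zero-fills the rest. A scalar model of that semantic (illustrative only; the real API operates on SIMD registers and this sketch is not the PR's code):

```java
import java.util.Arrays;

public class CompressModel {
  // Scalar model of FloatVector#compress(mask): lanes whose mask bit is set
  // are packed to the front; remaining lanes are filled with the default (0f).
  static float[] compress(float[] lanes, boolean[] mask) {
    float[] out = new float[lanes.length];
    int k = 0;
    for (int i = 0; i < lanes.length; i++) {
      if (mask[i]) {
        out[k++] = lanes[i];
      }
    }
    return out;
  }

  public static void main(String[] args) {
    float[] scores = {0.1f, 0.9f, 0.5f, 0.2f};
    boolean[] competitive = {false, true, true, false};
    System.out.println(Arrays.toString(compress(scores, competitive)));
  }
}
```

In the vectorized filter, one would compute a mask of competitive scores per stride, compress both the doc and score vectors with it, store them at the write position, and advance the write pointer by the mask's true-lane count.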
But I did do a quick run of luceneutil based on
62e0276032189deee9559327cc53ac3f59f354a9
with wikimediumall
and searchConcurrency=0, taskCountPerCat=1, taskRepeatCount=20
; here is the result after 20 iterations, which seems promising (hope I didn't get anything wrong). Will do another run with a different setup. (BTW, the latest luceneutil has a constructor problem since #14873 was introduced; you will get an error like the one below.)
