[libc++][ranges] Optimize the performance of ranges::starts_with
#84570
Conversation
@llvm/pr-subscribers-libcxx

Author: Xiaoyang Liu (xiaoyang-sde)

Changes

Abstract

This pull request attempts to optimize the performance of ranges::starts_with. It is related to issue #83880.

Benchmarking

2024-03-08T23:40:18-08:00
Running ./ranges_starts_with.libcxx.out
Run on (12 X 2496 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 1280 KiB (x6)
L3 Unified 18432 KiB (x1)
Load Average: 0.79, 1.93, 2.27
-----------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------------------------------------
bm_starts_with_contiguous_iter_with_memcmp_optimization/16 1.50 ns 1.50 ns 460164239
bm_starts_with_contiguous_iter_with_memcmp_optimization/256 12.5 ns 12.5 ns 55148246
bm_starts_with_contiguous_iter_with_memcmp_optimization/4096 189 ns 189 ns 4030953
bm_starts_with_contiguous_iter_with_memcmp_optimization/65536 4889 ns 4889 ns 157791
bm_starts_with_contiguous_iter_with_memcmp_optimization/1048576 138974 ns 138964 ns 5810
bm_starts_with_contiguous_iter_with_memcmp_optimization/16777216 3874632 ns 3874633 ns 180
bm_starts_with_contiguous_iter/16 6.55 ns 6.55 ns 90161221
bm_starts_with_contiguous_iter/256 99.8 ns 99.8 ns 7093729
bm_starts_with_contiguous_iter/4096 1505 ns 1488 ns 420866
bm_starts_with_contiguous_iter/65536 24836 ns 24836 ns 29657
bm_starts_with_contiguous_iter/1048576 400852 ns 400847 ns 1700
bm_starts_with_contiguous_iter/16777216 7017180 ns 7017118 ns 101
bm_starts_with_random_access_iter/16 6.86 ns 6.86 ns 99695785
bm_starts_with_random_access_iter/256 102 ns 102 ns 6721592
bm_starts_with_random_access_iter/4096 1537 ns 1537 ns 478604
bm_starts_with_random_access_iter/65536 23564 ns 23564 ns 30089
bm_starts_with_random_access_iter/1048576 382977 ns 382985 ns 1850
bm_starts_with_random_access_iter/16777216 6952078 ns 6952032 ns 99
bm_starts_with_bidirectional_iter/16 6.23 ns 6.23 ns 111062808
bm_starts_with_bidirectional_iter/256 103 ns 103 ns 6981875
bm_starts_with_bidirectional_iter/4096 1498 ns 1468 ns 451407
bm_starts_with_bidirectional_iter/65536 23994 ns 23994 ns 30011
bm_starts_with_bidirectional_iter/1048576 381783 ns 381781 ns 1838
bm_starts_with_bidirectional_iter/16777216 7538205 ns 7538213 ns 103
bm_starts_with_forward_iter/16 6.37 ns 6.37 ns 103694333
bm_starts_with_forward_iter/256 103 ns 103 ns 7017987
bm_starts_with_forward_iter/4096 1535 ns 1535 ns 483272
bm_starts_with_forward_iter/65536 23208 ns 23208 ns 28449
bm_starts_with_forward_iter/1048576 370772 ns 370773 ns 1812
bm_starts_with_forward_iter/16777216 6758327 ns 6758279 ns 101

Full diff: https://github.com/llvm/llvm-project/pull/84570.diff

5 Files Affected:
diff --git a/libcxx/benchmarks/CMakeLists.txt b/libcxx/benchmarks/CMakeLists.txt
index b436e96f178b70..621b295a92c8d2 100644
--- a/libcxx/benchmarks/CMakeLists.txt
+++ b/libcxx/benchmarks/CMakeLists.txt
@@ -186,6 +186,7 @@ set(BENCHMARK_TESTS
algorithms/pstl.stable_sort.bench.cpp
algorithms/push_heap.bench.cpp
algorithms/ranges_contains.bench.cpp
+ algorithms/ranges_starts_with.bench.cpp
algorithms/ranges_ends_with.bench.cpp
algorithms/ranges_make_heap.bench.cpp
algorithms/ranges_make_heap_then_sort_heap.bench.cpp
diff --git a/libcxx/benchmarks/algorithms/ranges_starts_with.bench.cpp b/libcxx/benchmarks/algorithms/ranges_starts_with.bench.cpp
new file mode 100644
index 00000000000000..9986fdb1684fc7
--- /dev/null
+++ b/libcxx/benchmarks/algorithms/ranges_starts_with.bench.cpp
@@ -0,0 +1,107 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include <algorithm>
+#include <benchmark/benchmark.h>
+#include <functional>
+#include <vector>
+
+#include "test_iterators.h"
+
+static void bm_starts_with_contiguous_iter_with_memcmp_optimization(benchmark::State& state) {
+ std::vector<int> a(state.range(), 1);
+ std::vector<int> p(state.range(), 1);
+
+ for (auto _ : state) {
+ benchmark::DoNotOptimize(a);
+ benchmark::DoNotOptimize(p);
+
+ auto begin1 = contiguous_iterator(a.data());
+ auto end1 = contiguous_iterator(a.data() + a.size());
+ auto begin2 = contiguous_iterator(p.data());
+ auto end2 = contiguous_iterator(p.data() + p.size());
+
+ benchmark::DoNotOptimize(std::ranges::starts_with(begin1, end1, begin2, end2));
+ }
+}
+BENCHMARK(bm_starts_with_contiguous_iter_with_memcmp_optimization)->RangeMultiplier(16)->Range(16, 16 << 20);
+
+static void bm_starts_with_contiguous_iter(benchmark::State& state) {
+ std::vector<int> a(state.range(), 1);
+ std::vector<int> p(state.range(), 1);
+
+ for (auto _ : state) {
+ benchmark::DoNotOptimize(a);
+ benchmark::DoNotOptimize(p);
+
+ auto begin1 = contiguous_iterator(a.data());
+ auto end1 = contiguous_iterator(a.data() + a.size());
+ auto begin2 = contiguous_iterator(p.data());
+ auto end2 = contiguous_iterator(p.data() + p.size());
+
+ benchmark::DoNotOptimize(
+ std::ranges::starts_with(begin1, end1, begin2, end2, [](const int a, const int b) { return a == b; }));
+ }
+}
+BENCHMARK(bm_starts_with_contiguous_iter)->RangeMultiplier(16)->Range(16, 16 << 20);
+
+static void bm_starts_with_random_access_iter(benchmark::State& state) {
+ std::vector<int> a(state.range(), 1);
+ std::vector<int> p(state.range(), 1);
+
+ for (auto _ : state) {
+ benchmark::DoNotOptimize(a);
+ benchmark::DoNotOptimize(p);
+
+ auto begin1 = random_access_iterator(a.data());
+ auto end1 = random_access_iterator(a.data() + a.size());
+ auto begin2 = random_access_iterator(p.data());
+ auto end2 = random_access_iterator(p.data() + p.size());
+
+ benchmark::DoNotOptimize(std::ranges::starts_with(begin1, end1, begin2, end2));
+ }
+}
+BENCHMARK(bm_starts_with_random_access_iter)->RangeMultiplier(16)->Range(16, 16 << 20);
+
+static void bm_starts_with_bidirectional_iter(benchmark::State& state) {
+ std::vector<int> a(state.range(), 1);
+ std::vector<int> p(state.range(), 1);
+
+ for (auto _ : state) {
+ benchmark::DoNotOptimize(a);
+ benchmark::DoNotOptimize(p);
+
+ auto begin1 = bidirectional_iterator(a.data());
+ auto end1 = bidirectional_iterator(a.data() + a.size());
+ auto begin2 = bidirectional_iterator(p.data());
+ auto end2 = bidirectional_iterator(p.data() + p.size());
+
+ benchmark::DoNotOptimize(std::ranges::starts_with(begin1, end1, begin2, end2));
+ }
+}
+BENCHMARK(bm_starts_with_bidirectional_iter)->RangeMultiplier(16)->Range(16, 16 << 20);
+
+static void bm_starts_with_forward_iter(benchmark::State& state) {
+ std::vector<int> a(state.range(), 1);
+ std::vector<int> p(state.range(), 1);
+
+ for (auto _ : state) {
+ benchmark::DoNotOptimize(a);
+ benchmark::DoNotOptimize(p);
+
+ auto begin1 = forward_iterator(a.data());
+ auto end1 = forward_iterator(a.data() + a.size());
+ auto begin2 = forward_iterator(p.data());
+ auto end2 = forward_iterator(p.data() + p.size());
+
+ benchmark::DoNotOptimize(std::ranges::starts_with(begin1, end1, begin2, end2));
+ }
+}
+BENCHMARK(bm_starts_with_forward_iter)->RangeMultiplier(16)->Range(16, 16 << 20);
+
+BENCHMARK_MAIN();
diff --git a/libcxx/include/__algorithm/ranges_starts_with.h b/libcxx/include/__algorithm/ranges_starts_with.h
index 90e184aa9bccc2..ebcc15555f5544 100644
--- a/libcxx/include/__algorithm/ranges_starts_with.h
+++ b/libcxx/include/__algorithm/ranges_starts_with.h
@@ -9,12 +9,14 @@
#ifndef _LIBCPP___ALGORITHM_RANGES_STARTS_WITH_H
#define _LIBCPP___ALGORITHM_RANGES_STARTS_WITH_H
-#include <__algorithm/in_in_result.h>
+#include <__algorithm/ranges_equal.h>
#include <__algorithm/ranges_mismatch.h>
#include <__config>
#include <__functional/identity.h>
#include <__functional/ranges_operations.h>
+#include <__functional/reference_wrapper.h>
#include <__iterator/concepts.h>
+#include <__iterator/distance.h>
#include <__iterator/indirectly_comparable.h>
#include <__ranges/access.h>
#include <__ranges/concepts.h>
@@ -34,6 +36,48 @@ _LIBCPP_BEGIN_NAMESPACE_STD
namespace ranges {
namespace __starts_with {
struct __fn {
+ template <class _Iter1, class _Sent1, class _Iter2, class _Sent2, class _Pred, class _Proj1, class _Proj2>
+ _LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI static constexpr bool __starts_with_impl(
+ _Iter1 __first1,
+ _Sent1 __last1,
+ _Iter2 __first2,
+ _Sent2 __last2,
+ _Pred& __pred,
+ _Proj1& __proj1,
+ _Proj2& __proj2) {
+ if constexpr (sized_sentinel_for<_Sent1, _Iter1> && sized_sentinel_for<_Sent2, _Iter2>) {
+ auto __n1 = ranges::distance(__first1, __last1);
+ auto __n2 = ranges::distance(__first2, __last2);
+ if (__n2 == 0) {
+ return true;
+ }
+ if (__n2 > __n1) {
+ return false;
+ }
+
+ if constexpr (random_access_iterator<_Iter1> && random_access_iterator<_Iter2>) {
+ return ranges::equal(
+ std::move(__first1),
+ ranges::next(__first1, __n2),
+ std::move(__first2),
+ std::move(__last2),
+ std::ref(__pred),
+ std::ref(__proj1),
+ std::ref(__proj2));
+ }
+ }
+
+ return ranges::mismatch(
+ std::move(__first1),
+ std::move(__last1),
+ std::move(__first2),
+ std::move(__last2),
+ std::ref(__pred),
+ std::ref(__proj1),
+ std::ref(__proj2))
+ .in2 == __last2;
+ }
+
template <input_iterator _Iter1,
sentinel_for<_Iter1> _Sent1,
input_iterator _Iter2,
@@ -42,23 +86,16 @@ struct __fn {
class _Proj1 = identity,
class _Proj2 = identity>
requires indirectly_comparable<_Iter1, _Iter2, _Pred, _Proj1, _Proj2>
- _LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI constexpr bool operator()(
+ _LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI static constexpr bool operator()(
_Iter1 __first1,
_Sent1 __last1,
_Iter2 __first2,
_Sent2 __last2,
_Pred __pred = {},
_Proj1 __proj1 = {},
- _Proj2 __proj2 = {}) const {
- return __mismatch::__fn::__go(
- std::move(__first1),
- std::move(__last1),
- std::move(__first2),
- std::move(__last2),
- __pred,
- __proj1,
- __proj2)
- .in2 == __last2;
+ _Proj2 __proj2 = {}) {
+ return __starts_with_impl(
+ std::move(__first1), std::move(__last1), std::move(__first2), std::move(__last2), __pred, __proj1, __proj2);
}
template <input_range _Range1,
@@ -67,17 +104,16 @@ struct __fn {
class _Proj1 = identity,
class _Proj2 = identity>
requires indirectly_comparable<iterator_t<_Range1>, iterator_t<_Range2>, _Pred, _Proj1, _Proj2>
- _LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI constexpr bool operator()(
- _Range1&& __range1, _Range2&& __range2, _Pred __pred = {}, _Proj1 __proj1 = {}, _Proj2 __proj2 = {}) const {
- return __mismatch::__fn::__go(
- ranges::begin(__range1),
- ranges::end(__range1),
- ranges::begin(__range2),
- ranges::end(__range2),
- __pred,
- __proj1,
- __proj2)
- .in2 == ranges::end(__range2);
+ _LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI static constexpr bool
+ operator()(_Range1&& __range1, _Range2&& __range2, _Pred __pred = {}, _Proj1 __proj1 = {}, _Proj2 __proj2 = {}) {
+ return __starts_with_impl(
+ ranges::begin(__range1),
+ ranges::end(__range1),
+ ranges::begin(__range2),
+ ranges::end(__range2),
+ __pred,
+ __proj1,
+ __proj2);
}
};
} // namespace __starts_with
diff --git a/libcxx/include/__functional/operations.h b/libcxx/include/__functional/operations.h
index 7ddc00650f162f..4565e1415a6e02 100644
--- a/libcxx/include/__functional/operations.h
+++ b/libcxx/include/__functional/operations.h
@@ -13,6 +13,7 @@
#include <__config>
#include <__functional/binary_function.h>
#include <__functional/unary_function.h>
+#include <__fwd/functional.h>
#include <__type_traits/integral_constant.h>
#include <__type_traits/operation_traits.h>
#include <__utility/forward.h>
@@ -316,10 +317,18 @@ struct _LIBCPP_TEMPLATE_VIS equal_to<void> {
// comparison when we don't perform an implicit conversion when calling it.
template <class _Tp>
struct __desugars_to<__equal_tag, equal_to<_Tp>, _Tp, _Tp> : true_type {};
+template <class _Tp>
+struct __desugars_to<__equal_tag, reference_wrapper<equal_to<_Tp>>, _Tp, _Tp> : true_type {};
+template <class _Tp>
+struct __desugars_to<__equal_tag, reference_wrapper<const equal_to<_Tp>>, _Tp, _Tp> : true_type {};
// In the transparent case, we do not enforce that
template <class _Tp, class _Up>
struct __desugars_to<__equal_tag, equal_to<void>, _Tp, _Up> : true_type {};
+template <class _Tp, class _Up>
+struct __desugars_to<__equal_tag, reference_wrapper<equal_to<void>>, _Tp, _Up> : true_type {};
+template <class _Tp, class _Up>
+struct __desugars_to<__equal_tag, reference_wrapper<const equal_to<void>>, _Tp, _Up> : true_type {};
#if _LIBCPP_STD_VER >= 14
template <class _Tp = void>
diff --git a/libcxx/include/__functional/ranges_operations.h b/libcxx/include/__functional/ranges_operations.h
index 38b28018049eb7..be400c42a65790 100644
--- a/libcxx/include/__functional/ranges_operations.h
+++ b/libcxx/include/__functional/ranges_operations.h
@@ -13,6 +13,7 @@
#include <__concepts/equality_comparable.h>
#include <__concepts/totally_ordered.h>
#include <__config>
+#include <__fwd/functional.h>
#include <__type_traits/integral_constant.h>
#include <__type_traits/operation_traits.h>
#include <__utility/forward.h>
@@ -99,6 +100,10 @@ struct greater_equal {
// operator are of the same type
template <class _Tp, class _Up>
struct __desugars_to<__equal_tag, ranges::equal_to, _Tp, _Up> : true_type {};
+template <class _Tp, class _Up>
+struct __desugars_to<__equal_tag, reference_wrapper<ranges::equal_to>, _Tp, _Up> : true_type {};
+template <class _Tp, class _Up>
+struct __desugars_to<__equal_tag, reference_wrapper<const ranges::equal_to>, _Tp, _Up> : true_type {};
#endif // _LIBCPP_STD_VER >= 20
template <class _Tp, class _Up>
struct __desugars_to<__equal_tag, reference_wrapper<ranges::equal_to>, _Tp, _Up> : true_type {};
template <class _Tp, class _Up>
struct __desugars_to<__equal_tag, reference_wrapper<const ranges::equal_to>, _Tp, _Up> : true_type {};
The memcmp-optimized overload of __equal_impl requires that the predicate models __desugars_to<__equal_tag> and the projection function models __is_identity.
template <class _Tp,
class _Up,
class _Pred,
class _Proj1,
class _Proj2,
__enable_if_t<__desugars_to<__equal_tag, _Pred, _Tp, _Up>::value && __is_identity<_Proj1>::value &&
__is_identity<_Proj2>::value && !is_volatile<_Tp>::value && !is_volatile<_Up>::value &&
__libcpp_is_trivially_equality_comparable<_Tp, _Up>::value,
int> = 0>
_LIBCPP_NODISCARD inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 bool
__equal_impl(_Tp* __first1, _Tp* __last1, _Up* __first2, _Up*, _Pred&, _Proj1&, _Proj2&) {
return std::__constexpr_memcmp_equal(__first1, __first2, __element_count(__last1 - __first1));
}
Without these changes, this overload won't be selected when the predicate is wrapped in std::reference_wrapper.
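For illustration, here is a small self-contained sketch of the same mechanism using hypothetical stand-in names (desugars_to, equal_tag, and my_equal_to are not the libc++ spellings): the trait is queried with the exact predicate type it receives, so once the predicate is forwarded through std::ref, only a reference_wrapper specialization can match.

#include <functional>
#include <type_traits>

struct equal_tag {};   // stand-in for __equal_tag
struct my_equal_to {}; // stand-in for ranges::equal_to

// Primary template: by default a predicate does not desugar to ==.
template <class Tag, class Pred, class T, class U>
struct desugars_to : std::false_type {};

// The plain predicate type is recognized...
template <class T, class U>
struct desugars_to<equal_tag, my_equal_to, T, U> : std::true_type {};

// ...but a predicate passed as std::ref(pred) -- which is what
// ranges::starts_with does when forwarding to ranges::equal -- has type
// reference_wrapper<my_equal_to> and needs its own specialization.
template <class T, class U>
struct desugars_to<equal_tag, std::reference_wrapper<my_equal_to>, T, U> : std::true_type {};

static_assert(desugars_to<equal_tag, my_equal_to, int, int>::value);
static_assert(desugars_to<equal_tag, std::reference_wrapper<my_equal_to>, int, int>::value);
// A const-qualified wrapped predicate would likewise need the additional
// reference_wrapper<const my_equal_to> specialization, mirroring the patch.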
I think we should move this into its own PR, assuming that we want it.
With #73255 I don't think there will be any significant performance improvement anymore, so I'm not sure whether we want to continue pursuing this. It could probably reduce the amount of generated code a bit, but I'm not sure it's worth the additional amount of complexity. @ldionne do you have any thoughts?
@xiaoyang-sde Can you please update your branch to make sure #73255 is included and rerun the benchmarks? This might give us some indication on whether this patch is worth pursuing further.

FWIW, I'm somewhat in favor of landing this. If numbers show that there is no performance difference anymore, we should probably remove similar optimizations from
Here's the benchmark result of the unoptimized

2024-04-08T00:51:04-07:00
Running ./ranges_starts_with.libcxx.out
Run on (12 X 2496 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 1280 KiB (x6)
L3 Unified 18432 KiB (x1)
Load Average: 1.27, 2.29, 1.34
-----------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------------------------------------
bm_starts_with_contiguous_iter_with_memcmp_optimization/16 3.09 ns 3.09 ns 212067553
bm_starts_with_contiguous_iter_with_memcmp_optimization/256 20.9 ns 20.9 ns 30603943
bm_starts_with_contiguous_iter_with_memcmp_optimization/4096 310 ns 310 ns 2176699
bm_starts_with_contiguous_iter_with_memcmp_optimization/65536 5394 ns 5394 ns 136959
bm_starts_with_contiguous_iter_with_memcmp_optimization/1048576 134816 ns 134817 ns 4916
bm_starts_with_contiguous_iter_with_memcmp_optimization/16777216 4792548 ns 4792472 ns 150
bm_starts_with_contiguous_iter/16 5.38 ns 5.38 ns 123790610
bm_starts_with_contiguous_iter/256 81.4 ns 81.4 ns 8541221
bm_starts_with_contiguous_iter/4096 1107 ns 1107 ns 615306
bm_starts_with_contiguous_iter/65536 16869 ns 16868 ns 41362
bm_starts_with_contiguous_iter/1048576 293850 ns 293845 ns 2276
bm_starts_with_contiguous_iter/16777216 6079325 ns 6079284 ns 108
bm_starts_with_random_access_iter/16 7.17 ns 7.04 ns 101903410
bm_starts_with_random_access_iter/256 100 ns 100 ns 6806306
bm_starts_with_random_access_iter/4096 1449 ns 1449 ns 482025
bm_starts_with_random_access_iter/65536 23967 ns 23967 ns 27789
bm_starts_with_random_access_iter/1048576 387731 ns 387727 ns 1752
bm_starts_with_random_access_iter/16777216 6773716 ns 6773727 ns 100
bm_starts_with_bidirectional_iter/16 6.95 ns 6.95 ns 97150439
bm_starts_with_bidirectional_iter/256 103 ns 101 ns 6666863
bm_starts_with_bidirectional_iter/4096 1459 ns 1459 ns 477527
bm_starts_with_bidirectional_iter/65536 23458 ns 23458 ns 29772
bm_starts_with_bidirectional_iter/1048576 405195 ns 405189 ns 1822
bm_starts_with_bidirectional_iter/16777216 6768824 ns 6768460 ns 99
bm_starts_with_forward_iter/16 6.86 ns 6.76 ns 99222099
bm_starts_with_forward_iter/256 99.3 ns 99.3 ns 6935465
bm_starts_with_forward_iter/4096 1455 ns 1455 ns 482635
bm_starts_with_forward_iter/65536 23077 ns 23077 ns 29802
bm_starts_with_forward_iter/1048576 379438 ns 379432 ns 1795
bm_starts_with_forward_iter/16777216 6722769 ns 6722515 ns 104

To run the benchmark with the optimized

template <class _Tp, class _Up>
inline const bool __desugars_to_v<__equal_tag, reference_wrapper<ranges::equal_to>, _Tp, _Up> = true;
template <class _Tp, class _Up>
inline const bool __desugars_to_v<__equal_tag, reference_wrapper<const ranges::equal_to>, _Tp, _Up> = true;
With #73255 I don't think there will be any significant performance improvement anymore, so I'm not sure whether we want to continue pursuing this. It could probably reduce the amount of generated code a bit, but I'm not sure it's worth the additional amount of complexity. @ldionne do you have any thoughts?
I am not sure I read the benchmarks right, but if I did it looks like we do have quite a bit of a performance benefit with this patch. It makes sense too, since we only need to use memcmp to compare the two ranges instead of finding where they actually differ, which is more complicated.

So IMO this makes sense.
I think this LGTM, I don't see any problem with the patch and it does seem to provide a fair speedup. @philnik777 would you agree?
IIRC these aren't at all the numbers I got when benchmarking
I wrote a simple benchmark that compares the performance of ranges::equal and ranges::mismatch:

#include <algorithm>
#include <benchmark/benchmark.h>
#include <vector>

#include "test_iterators.h"

static void bm_starts_with_contiguous_iter_with_equal_impl(benchmark::State& state) {
  std::vector<int> a(state.range(), 1);
  std::vector<int> p(state.range(), 1);

  for (auto _ : state) {
    benchmark::DoNotOptimize(a);
    benchmark::DoNotOptimize(p);

    auto begin1 = contiguous_iterator(a.data());
    auto end1 = contiguous_iterator(a.data() + a.size());
    auto begin2 = contiguous_iterator(p.data());
    auto end2 = contiguous_iterator(p.data() + p.size());

    benchmark::DoNotOptimize(std::ranges::equal(begin1, end1, begin2, end2));
  }
}
BENCHMARK(bm_starts_with_contiguous_iter_with_equal_impl)->RangeMultiplier(16)->Range(16, 16 << 20);

static void bm_starts_with_contiguous_iter_with_mismatch_impl(benchmark::State& state) {
  std::vector<int> a(state.range(), 1);
  std::vector<int> p(state.range(), 1);

  for (auto _ : state) {
    benchmark::DoNotOptimize(a);
    benchmark::DoNotOptimize(p);

    auto begin1 = contiguous_iterator(a.data());
    auto end1 = contiguous_iterator(a.data() + a.size());
    auto begin2 = contiguous_iterator(p.data());
    auto end2 = contiguous_iterator(p.data() + p.size());

    benchmark::DoNotOptimize(std::ranges::mismatch(begin1, end1, begin2, end2).in2 == end2);
  }
}
BENCHMARK(bm_starts_with_contiguous_iter_with_mismatch_impl)->RangeMultiplier(16)->Range(16, 16 << 20);

BENCHMARK_MAIN();

The performance is similar on MacBook Air (M1, arm64):

2024-04-26T17:14:19-04:00
Running ./build/libcxx/benchmarks/ranges_starts_with.libcxx.out
Run on (8 X 24 MHz CPU s)
CPU Caches:
L1 Data 64 KiB
L1 Instruction 128 KiB
L2 Unified 4096 KiB (x8)
Load Average: 2.00, 3.59, 3.41
-----------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------------------------------
bm_starts_with_contiguous_iter_with_equal_impl/16 2.89 ns 2.86 ns 244051251
bm_starts_with_contiguous_iter_with_equal_impl/256 27.1 ns 26.1 ns 26963107
bm_starts_with_contiguous_iter_with_equal_impl/4096 451 ns 443 ns 1578169
bm_starts_with_contiguous_iter_with_equal_impl/65536 6693 ns 6667 ns 104706
bm_starts_with_contiguous_iter_with_equal_impl/1048576 110306 ns 109800 ns 6162
bm_starts_with_contiguous_iter_with_equal_impl/16777216 3129511 ns 2780193 ns 311
bm_starts_with_contiguous_iter_with_mismatch_impl/16 3.08 ns 3.05 ns 225504566
bm_starts_with_contiguous_iter_with_mismatch_impl/256 26.9 ns 26.7 ns 25911339
bm_starts_with_contiguous_iter_with_mismatch_impl/4096 422 ns 420 ns 1678504
bm_starts_with_contiguous_iter_with_mismatch_impl/65536 6834 ns 6722 ns 105055
bm_starts_with_contiguous_iter_with_mismatch_impl/1048576 124471 ns 123355 ns 5691
bm_starts_with_contiguous_iter_with_mismatch_impl/16777216 2337331 ns 2326288 ns 288

However, the performance is different on Arch Linux with a 4th Gen Xeon processor (x86_64):

2024-04-26T20:45:12+00:00
Running ./build/libcxx/benchmarks/ranges_starts_with.libcxx.out
Run on (4 X 2294.61 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 4096 KiB (x4)
Load Average: 0.00, 0.32, 0.66
-----------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------------------------------
bm_starts_with_contiguous_iter_with_equal_impl/16 7.68 ns 7.68 ns 93071510
bm_starts_with_contiguous_iter_with_equal_impl/256 31.4 ns 31.4 ns 25166061
bm_starts_with_contiguous_iter_with_equal_impl/4096 396 ns 396 ns 1528737
bm_starts_with_contiguous_iter_with_equal_impl/65536 10798 ns 10798 ns 59100
bm_starts_with_contiguous_iter_with_equal_impl/1048576 496691 ns 496671 ns 1499
bm_starts_with_contiguous_iter_with_equal_impl/16777216 13436051 ns 13435049 ns 50
bm_starts_with_contiguous_iter_with_mismatch_impl/16 10.7 ns 10.7 ns 68709479
bm_starts_with_contiguous_iter_with_mismatch_impl/256 59.0 ns 59.0 ns 10459829
bm_starts_with_contiguous_iter_with_mismatch_impl/4096 1069 ns 1069 ns 729445
bm_starts_with_contiguous_iter_with_mismatch_impl/65536 16881 ns 16880 ns 34519
bm_starts_with_contiguous_iter_with_mismatch_impl/1048576 583530 ns 583395 ns 1250
bm_starts_with_contiguous_iter_with_mismatch_impl/16777216 15792555 ns 15791353 ns 43

@philnik777 Thanks for the comment! Could you please review these benchmark results? If the result is valid, rather than focusing on optimizing
@philnik777 Friendly ping re. the benchmarks. :)
Sorry for being so slow on this. I had a lot to do recently. TL;DR: it's complicated.

The long version: AVX2 seems to make quite a significant difference for

I think we have to have a more general discussion on how far we want to go for improved performance. For now I don't want to block this change, but I'd like us to have a more rigid understanding of how far we want to go before doing similar optimizations (i.e. selecting our vector code vs. the libc function).
Thanks for the clarification! Whether this patch is merged or not isn't a concern for me, but I genuinely gained valuable knowledge from our discussion. I appreciate it and please take your time.
I finally caught up to this patch. I tend to like the patch. I think it applies an optimization that should almost always be a win, except in cases where ranges::equal is slower than ranges::mismatch, which boils down to whether memcmp is faster than our implementation of mismatch. If I misunderstood something, please let me know.

Assuming I'm on the same page as the other reviewers, I would say this: comparing whether two ranges are equal is definitely "a simpler operation" (conceptually at least) than finding the point at which they don't match. So I think it makes sense to assume that ranges::equal should always be at least as fast as ranges::mismatch (in the worst case, I guess we could even implement ranges::equal by calling ranges::mismatch and checking that the result is the last iterator).
I understand that some C libraries might provide extremely poor quality of implementation for some functions, however I think that should be a bug filed on them and I don't think we should start deciding how to optimize our code based on that.
I have a few comments and I'd like to know if @philnik777 is fine with what I said above, but I'd say this patch otherwise LGTM.
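As an aside on the parenthetical above, here is a minimal hedged sketch (hypothetical helper, not libc++ code) of expressing equal in terms of mismatch, which is why equal should conceptually never have to do more work than mismatch:

#include <algorithm>
#include <ranges>

// equal(r1, r2) holds exactly when mismatch(r1, r2) found no difference
// and both ranges ended at the same time.
template <std::ranges::forward_range R1, std::ranges::forward_range R2>
constexpr bool equal_via_mismatch(R1&& r1, R2&& r2) {
  auto [it1, it2] = std::ranges::mismatch(r1, r2);
  return it1 == std::ranges::end(r1) && it2 == std::ranges::end(r2);
}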
std::move(__first1),
ranges::next(__first1, __n2),
You'd have to compute ranges::next(__first1, __n2) before the function call if you want to move __first1. Otherwise you could end up using a moved-from iterator in the ranges::next call.
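A minimal sketch of the reordering being suggested (hypothetical helper name, not the actual patch): since the evaluation order of function arguments is unspecified, ranges::next(__first1, __n2) could otherwise observe __first1 after it has been moved from.

#include <algorithm>
#include <iterator>
#include <utility>

template <class Iter1, class Iter2, class Sent2>
constexpr bool prefix_equal(Iter1 first1, std::iter_difference_t<Iter1> n2, Iter2 first2, Sent2 last2) {
  auto prefix_end = std::ranges::next(first1, n2); // compute the prefix end first...
  return std::ranges::equal(std::move(first1), std::move(prefix_end), // ...then move first1
                            std::move(first2), std::move(last2));
}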
@xiaoyang-sde There's a bit of review feedback to apply, but otherwise I think this is good to go. Do you think you're able to pick this up and make the last few requested changes?
I completely forgot about this! I'll respond to the comments shortly.
When benchmarking locally, I get the following results for ranges::starts_with:

Using ranges::starts_with

Edit: Ah, it looks like #84570 (comment) and #84570 (comment) explain this result.

Note that I don't think this will actually fix #83880 since
@philnik777 What do you think the next step should be? Vectorize

Note that I think this patch makes sense in principle, but I also don't think we should check it in as-is since it regresses on some platforms. I'd even be fine with checking it in with some sort of
@ldionne I'm really not sure what to do. It's nice to piggy-back off of the
IMO the answer is that if we can beat the platform's
In that case we should probably add our own wrappers of all the C functions and check whether they're slower than what we could easily achieve.
Abstract

This pull request attempts to optimize the performance of the ranges::starts_with algorithm.

Implementation

- The ranges::starts_with algorithm will check whether the first range has fewer elements than the second range before accessing any elements, if both I1, S1 and I2, S2 model sized_sentinel_for.
- The ranges::starts_with algorithm will invoke ranges::equal if both I1, S1 and I2, S2 model sized_sentinel_for and both iterators model random_access_iterator.
- ranges::equal will use __builtin_memcmp for element comparison under specific conditions.
- The __desugars_to template gains more specializations, handling scenarios where the predicate is wrapped in a std::reference_wrapper. (In the existing implementation, ranges::ends_with forwards to ranges::equal, but it wraps the predicate with std::ref, thus failing to meet the constraint of the optimized implementation of __equal_impl. This change fixes that issue.)
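A rough self-contained sketch of the dispatch described above, written against the public std::ranges facilities with a hypothetical name (this is not the libc++ internal spelling):

#include <algorithm>
#include <iterator>
#include <ranges>

template <std::ranges::input_range R1, std::ranges::input_range R2>
constexpr bool starts_with_sketch(R1&& r1, R2&& r2) {
  if constexpr (std::ranges::sized_range<R1> && std::ranges::sized_range<R2>) {
    const auto n1 = std::ranges::distance(r1);
    const auto n2 = std::ranges::distance(r2);
    if (n2 == 0)
      return true;  // an empty prefix always matches
    if (n2 > n1)
      return false; // the prefix cannot be longer than the range itself
    if constexpr (std::ranges::random_access_range<R1> && std::ranges::random_access_range<R2>)
      // Compare exactly n2 elements; ranges::equal can lower this to memcmp
      // for trivially equality-comparable element types.
      return std::ranges::equal(std::ranges::begin(r1), std::ranges::begin(r1) + n2,
                                std::ranges::begin(r2), std::ranges::end(r2));
  }
  // Fallback: the pre-existing mismatch-based formulation.
  auto result = std::ranges::mismatch(r1, r2);
  return result.in2 == std::ranges::end(r2);
}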
Benchmarking

The results were obtained on Arch Linux running on WSL 2, with an i5-12400 processor and 32 GB of RAM. They indicate that the memcmp optimization provides significant performance improvements.

Reference

- starts_with and ends_with