
[libc++][ranges] Optimize the performance of ranges::starts_with #84570


Open
wants to merge 6 commits into main

Conversation

Member

@xiaoyang-sde xiaoyang-sde commented Mar 8, 2024

Abstract

This pull request attempts to optimize the performance of the ranges::starts_with algorithm.

template<input_iterator I1, sentinel_for<I1> S1, input_iterator I2, sentinel_for<I2> S2,
         class Pred = ranges::equal_to, class Proj1 = identity, class Proj2 = identity>
  requires indirectly_comparable<I1, I2, Pred, Proj1, Proj2>
constexpr bool ranges::starts_with(I1 first1, S1 last1, I2 first2, S2 last2, Pred pred = {},
                                   Proj1 proj1 = {}, Proj2 proj2 = {});
template<input_range R1, input_range R2, class Pred = ranges::equal_to, class Proj1 = identity,
         class Proj2 = identity>
  requires indirectly_comparable<iterator_t<R1>, iterator_t<R2>, Pred, Proj1, Proj2>
constexpr bool ranges::starts_with(R1&& r1, R2&& r2, Pred pred = {},
                                   Proj1 proj1 = {}, Proj2 proj2 = {});
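
For context, here is a minimal usage example of the algorithm (illustrative only, not part of this patch):

#include <algorithm>
#include <string_view>
#include <vector>

int main() {
  std::vector<int> haystack{1, 2, 3, 4, 5};
  std::vector<int> prefix{1, 2, 3};
  bool b1 = std::ranges::starts_with(haystack, prefix);  // true
  bool b2 = std::ranges::starts_with(std::string_view{"hello"},
                                     std::string_view{"help"});  // false
  return (b1 && !b2) ? 0 : 1;
}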

Implementation

  • If both I1, S1 and I2, S2 model sized_sentinel_for, ranges::starts_with checks whether the first range has fewer elements than the second before accessing any elements.
  • If, in addition, both iterators are random-access, ranges::starts_with forwards to ranges::equal, which uses __builtin_memcmp for element comparison under specific conditions. (A simplified sketch of this dispatch follows the list.)
  • The __desugars_to template gains additional specializations that handle predicates wrapped in std::reference_wrapper. (In the existing implementation, ranges::ends_with forwards to ranges::equal but wraps the predicate with std::ref, so the call fails to satisfy the constraints of the optimized overload of __equal_impl. This change fixes that issue.)
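
The following is a simplified sketch of that dispatch (illustrative; the exact implementation is in the diff below, which uses the usual libc++ underscore-prefixed names):

#include <algorithm>   // ranges::equal, ranges::mismatch
#include <functional>  // std::identity, std::ref
#include <iterator>    // ranges::distance, ranges::next, iterator concepts

template <class I1, class S1, class I2, class S2,
          class Pred = std::ranges::equal_to, class Proj1 = std::identity, class Proj2 = std::identity>
constexpr bool starts_with_sketch(I1 first1, S1 last1, I2 first2, S2 last2,
                                  Pred pred = {}, Proj1 proj1 = {}, Proj2 proj2 = {}) {
  if constexpr (std::sized_sentinel_for<S1, I1> && std::sized_sentinel_for<S2, I2>) {
    const auto n1 = std::ranges::distance(first1, last1);
    const auto n2 = std::ranges::distance(first2, last2);
    if (n2 == 0)
      return true;   // an empty prefix always matches
    if (n2 > n1)
      return false;  // the prefix cannot be longer than the range
    if constexpr (std::random_access_iterator<I1> && std::random_access_iterator<I2>)
      // compare exactly n2 elements; ranges::equal may lower this to memcmp
      return std::ranges::equal(first1, std::ranges::next(first1, n2), first2, last2,
                                std::ref(pred), std::ref(proj1), std::ref(proj2));
  }
  // general fallback: stop at the first mismatch and check the whole prefix was consumed
  return std::ranges::mismatch(std::move(first1), std::move(last1), std::move(first2), last2,
                               std::ref(pred), std::ref(proj1), std::ref(proj2))
             .in2 == last2;
}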

Benchmarking

The results were obtained on Arch Linux running on WSL 2, with an i5-12400 processor and 32 GB of RAM. They indicate that the memcmp optimization provides a significant performance improvement.

2024-03-08T23:40:18-08:00
Running ./ranges_starts_with.libcxx.out
Run on (12 X 2496 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 1280 KiB (x6)
  L3 Unified 18432 KiB (x1)
Load Average: 0.79, 1.93, 2.27
-----------------------------------------------------------------------------------------------------------
Benchmark                                                                 Time             CPU   Iterations
-----------------------------------------------------------------------------------------------------------
bm_starts_with_contiguous_iter_with_memcmp_optimization/16             1.50 ns         1.50 ns    460164239
bm_starts_with_contiguous_iter_with_memcmp_optimization/256            12.5 ns         12.5 ns     55148246
bm_starts_with_contiguous_iter_with_memcmp_optimization/4096            189 ns          189 ns      4030953
bm_starts_with_contiguous_iter_with_memcmp_optimization/65536          4889 ns         4889 ns       157791
bm_starts_with_contiguous_iter_with_memcmp_optimization/1048576      138974 ns       138964 ns         5810
bm_starts_with_contiguous_iter_with_memcmp_optimization/16777216    3874632 ns      3874633 ns          180
bm_starts_with_contiguous_iter/16                                      6.55 ns         6.55 ns     90161221
bm_starts_with_contiguous_iter/256                                     99.8 ns         99.8 ns      7093729
bm_starts_with_contiguous_iter/4096                                    1505 ns         1488 ns       420866
bm_starts_with_contiguous_iter/65536                                  24836 ns        24836 ns        29657
bm_starts_with_contiguous_iter/1048576                               400852 ns       400847 ns         1700
bm_starts_with_contiguous_iter/16777216                             7017180 ns      7017118 ns          101
bm_starts_with_random_access_iter/16                                   6.86 ns         6.86 ns     99695785
bm_starts_with_random_access_iter/256                                   102 ns          102 ns      6721592
bm_starts_with_random_access_iter/4096                                 1537 ns         1537 ns       478604
bm_starts_with_random_access_iter/65536                               23564 ns        23564 ns        30089
bm_starts_with_random_access_iter/1048576                            382977 ns       382985 ns         1850
bm_starts_with_random_access_iter/16777216                          6952078 ns      6952032 ns           99
bm_starts_with_bidirectional_iter/16                                   6.23 ns         6.23 ns    111062808
bm_starts_with_bidirectional_iter/256                                   103 ns          103 ns      6981875
bm_starts_with_bidirectional_iter/4096                                 1498 ns         1468 ns       451407
bm_starts_with_bidirectional_iter/65536                               23994 ns        23994 ns        30011
bm_starts_with_bidirectional_iter/1048576                            381783 ns       381781 ns         1838
bm_starts_with_bidirectional_iter/16777216                          7538205 ns      7538213 ns          103
bm_starts_with_forward_iter/16                                         6.37 ns         6.37 ns    103694333
bm_starts_with_forward_iter/256                                         103 ns          103 ns      7017987
bm_starts_with_forward_iter/4096                                       1535 ns         1535 ns       483272
bm_starts_with_forward_iter/65536                                     23208 ns        23208 ns        28449
bm_starts_with_forward_iter/1048576                                  370772 ns       370773 ns         1812
bm_starts_with_forward_iter/16777216                                6758327 ns      6758279 ns          101

Reference

This pull request is related to issue #83880.

github-actions bot commented Mar 8, 2024

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be
notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write
permissions for the repository. In which case you can instead tag reviewers by
name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review
by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate
is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@xiaoyang-sde xiaoyang-sde marked this pull request as ready for review March 9, 2024 08:14
@xiaoyang-sde xiaoyang-sde requested a review from a team as a code owner March 9, 2024 08:14
@llvmbot llvmbot added the libc++ label Mar 9, 2024
@llvmbot
Member

llvmbot commented Mar 9, 2024

@llvm/pr-subscribers-libcxx

Author: Xiaoyang Liu (xiaoyang-sde)

Changes

Abstract

This pull request attempts to optimize the performance of ranges::starts_with.

This pull request is related to issue #83880.

Benchmarking

(The benchmark results are identical to those shown in the PR description above.)

Full diff: https://github.com/llvm/llvm-project/pull/84570.diff

5 Files Affected:

  • (modified) libcxx/benchmarks/CMakeLists.txt (+1)
  • (added) libcxx/benchmarks/algorithms/ranges_starts_with.bench.cpp (+107)
  • (modified) libcxx/include/__algorithm/ranges_starts_with.h (+59-23)
  • (modified) libcxx/include/__functional/operations.h (+9)
  • (modified) libcxx/include/__functional/ranges_operations.h (+5)
diff --git a/libcxx/benchmarks/CMakeLists.txt b/libcxx/benchmarks/CMakeLists.txt
index b436e96f178b70..621b295a92c8d2 100644
--- a/libcxx/benchmarks/CMakeLists.txt
+++ b/libcxx/benchmarks/CMakeLists.txt
@@ -186,6 +186,7 @@ set(BENCHMARK_TESTS
     algorithms/pstl.stable_sort.bench.cpp
     algorithms/push_heap.bench.cpp
     algorithms/ranges_contains.bench.cpp
+    algorithms/ranges_starts_with.bench.cpp
     algorithms/ranges_ends_with.bench.cpp
     algorithms/ranges_make_heap.bench.cpp
     algorithms/ranges_make_heap_then_sort_heap.bench.cpp
diff --git a/libcxx/benchmarks/algorithms/ranges_starts_with.bench.cpp b/libcxx/benchmarks/algorithms/ranges_starts_with.bench.cpp
new file mode 100644
index 00000000000000..9986fdb1684fc7
--- /dev/null
+++ b/libcxx/benchmarks/algorithms/ranges_starts_with.bench.cpp
@@ -0,0 +1,107 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include <algorithm>
+#include <benchmark/benchmark.h>
+#include <functional>
+#include <vector>
+
+#include "test_iterators.h"
+
+static void bm_starts_with_contiguous_iter_with_memcmp_optimization(benchmark::State& state) {
+  std::vector<int> a(state.range(), 1);
+  std::vector<int> p(state.range(), 1);
+
+  for (auto _ : state) {
+    benchmark::DoNotOptimize(a);
+    benchmark::DoNotOptimize(p);
+
+    auto begin1 = contiguous_iterator(a.data());
+    auto end1   = contiguous_iterator(a.data() + a.size());
+    auto begin2 = contiguous_iterator(p.data());
+    auto end2   = contiguous_iterator(p.data() + p.size());
+
+    benchmark::DoNotOptimize(std::ranges::starts_with(begin1, end1, begin2, end2));
+  }
+}
+BENCHMARK(bm_starts_with_contiguous_iter_with_memcmp_optimization)->RangeMultiplier(16)->Range(16, 16 << 20);
+
+static void bm_starts_with_contiguous_iter(benchmark::State& state) {
+  std::vector<int> a(state.range(), 1);
+  std::vector<int> p(state.range(), 1);
+
+  for (auto _ : state) {
+    benchmark::DoNotOptimize(a);
+    benchmark::DoNotOptimize(p);
+
+    auto begin1 = contiguous_iterator(a.data());
+    auto end1   = contiguous_iterator(a.data() + a.size());
+    auto begin2 = contiguous_iterator(p.data());
+    auto end2   = contiguous_iterator(p.data() + p.size());
+
+    benchmark::DoNotOptimize(
+        std::ranges::starts_with(begin1, end1, begin2, end2, [](const int a, const int b) { return a == b; }));
+  }
+}
+BENCHMARK(bm_starts_with_contiguous_iter)->RangeMultiplier(16)->Range(16, 16 << 20);
+
+static void bm_starts_with_random_access_iter(benchmark::State& state) {
+  std::vector<int> a(state.range(), 1);
+  std::vector<int> p(state.range(), 1);
+
+  for (auto _ : state) {
+    benchmark::DoNotOptimize(a);
+    benchmark::DoNotOptimize(p);
+
+    auto begin1 = random_access_iterator(a.data());
+    auto end1   = random_access_iterator(a.data() + a.size());
+    auto begin2 = random_access_iterator(p.data());
+    auto end2   = random_access_iterator(p.data() + p.size());
+
+    benchmark::DoNotOptimize(std::ranges::starts_with(begin1, end1, begin2, end2));
+  }
+}
+BENCHMARK(bm_starts_with_random_access_iter)->RangeMultiplier(16)->Range(16, 16 << 20);
+
+static void bm_starts_with_bidirectional_iter(benchmark::State& state) {
+  std::vector<int> a(state.range(), 1);
+  std::vector<int> p(state.range(), 1);
+
+  for (auto _ : state) {
+    benchmark::DoNotOptimize(a);
+    benchmark::DoNotOptimize(p);
+
+    auto begin1 = bidirectional_iterator(a.data());
+    auto end1   = bidirectional_iterator(a.data() + a.size());
+    auto begin2 = bidirectional_iterator(p.data());
+    auto end2   = bidirectional_iterator(p.data() + p.size());
+
+    benchmark::DoNotOptimize(std::ranges::starts_with(begin1, end1, begin2, end2));
+  }
+}
+BENCHMARK(bm_starts_with_bidirectional_iter)->RangeMultiplier(16)->Range(16, 16 << 20);
+
+static void bm_starts_with_forward_iter(benchmark::State& state) {
+  std::vector<int> a(state.range(), 1);
+  std::vector<int> p(state.range(), 1);
+
+  for (auto _ : state) {
+    benchmark::DoNotOptimize(a);
+    benchmark::DoNotOptimize(p);
+
+    auto begin1 = forward_iterator(a.data());
+    auto end1   = forward_iterator(a.data() + a.size());
+    auto begin2 = forward_iterator(p.data());
+    auto end2   = forward_iterator(p.data() + p.size());
+
+    benchmark::DoNotOptimize(std::ranges::starts_with(begin1, end1, begin2, end2));
+  }
+}
+BENCHMARK(bm_starts_with_forward_iter)->RangeMultiplier(16)->Range(16, 16 << 20);
+
+BENCHMARK_MAIN();
diff --git a/libcxx/include/__algorithm/ranges_starts_with.h b/libcxx/include/__algorithm/ranges_starts_with.h
index 90e184aa9bccc2..ebcc15555f5544 100644
--- a/libcxx/include/__algorithm/ranges_starts_with.h
+++ b/libcxx/include/__algorithm/ranges_starts_with.h
@@ -9,12 +9,14 @@
 #ifndef _LIBCPP___ALGORITHM_RANGES_STARTS_WITH_H
 #define _LIBCPP___ALGORITHM_RANGES_STARTS_WITH_H
 
-#include <__algorithm/in_in_result.h>
+#include <__algorithm/ranges_equal.h>
 #include <__algorithm/ranges_mismatch.h>
 #include <__config>
 #include <__functional/identity.h>
 #include <__functional/ranges_operations.h>
+#include <__functional/reference_wrapper.h>
 #include <__iterator/concepts.h>
+#include <__iterator/distance.h>
 #include <__iterator/indirectly_comparable.h>
 #include <__ranges/access.h>
 #include <__ranges/concepts.h>
@@ -34,6 +36,48 @@ _LIBCPP_BEGIN_NAMESPACE_STD
 namespace ranges {
 namespace __starts_with {
 struct __fn {
+  template <class _Iter1, class _Sent1, class _Iter2, class _Sent2, class _Pred, class _Proj1, class _Proj2>
+  _LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI static constexpr bool __starts_with_impl(
+      _Iter1 __first1,
+      _Sent1 __last1,
+      _Iter2 __first2,
+      _Sent2 __last2,
+      _Pred& __pred,
+      _Proj1& __proj1,
+      _Proj2& __proj2) {
+    if constexpr (sized_sentinel_for<_Sent1, _Iter1> && sized_sentinel_for<_Sent2, _Iter2>) {
+      auto __n1 = ranges::distance(__first1, __last1);
+      auto __n2 = ranges::distance(__first2, __last2);
+      if (__n2 == 0) {
+        return true;
+      }
+      if (__n2 > __n1) {
+        return false;
+      }
+
+      if constexpr (random_access_iterator<_Iter1> && random_access_iterator<_Iter2>) {
+        return ranges::equal(
+            std::move(__first1),
+            ranges::next(__first1, __n2),
+            std::move(__first2),
+            std::move(__last2),
+            std::ref(__pred),
+            std::ref(__proj1),
+            std::ref(__proj2));
+      }
+    }
+
+    return ranges::mismatch(
+               std::move(__first1),
+               std::move(__last1),
+               std::move(__first2),
+               std::move(__last2),
+               std::ref(__pred),
+               std::ref(__proj1),
+               std::ref(__proj2))
+               .in2 == __last2;
+  }
+
   template <input_iterator _Iter1,
             sentinel_for<_Iter1> _Sent1,
             input_iterator _Iter2,
@@ -42,23 +86,16 @@ struct __fn {
             class _Proj1 = identity,
             class _Proj2 = identity>
     requires indirectly_comparable<_Iter1, _Iter2, _Pred, _Proj1, _Proj2>
-  _LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI constexpr bool operator()(
+  _LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI static constexpr bool operator()(
       _Iter1 __first1,
       _Sent1 __last1,
       _Iter2 __first2,
       _Sent2 __last2,
       _Pred __pred   = {},
       _Proj1 __proj1 = {},
-      _Proj2 __proj2 = {}) const {
-    return __mismatch::__fn::__go(
-               std::move(__first1),
-               std::move(__last1),
-               std::move(__first2),
-               std::move(__last2),
-               __pred,
-               __proj1,
-               __proj2)
-               .in2 == __last2;
+      _Proj2 __proj2 = {}) {
+    return __starts_with_impl(
+        std::move(__first1), std::move(__last1), std::move(__first2), std::move(__last2), __pred, __proj1, __proj2);
   }
 
   template <input_range _Range1,
@@ -67,17 +104,16 @@ struct __fn {
             class _Proj1 = identity,
             class _Proj2 = identity>
     requires indirectly_comparable<iterator_t<_Range1>, iterator_t<_Range2>, _Pred, _Proj1, _Proj2>
-  _LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI constexpr bool operator()(
-      _Range1&& __range1, _Range2&& __range2, _Pred __pred = {}, _Proj1 __proj1 = {}, _Proj2 __proj2 = {}) const {
-    return __mismatch::__fn::__go(
-               ranges::begin(__range1),
-               ranges::end(__range1),
-               ranges::begin(__range2),
-               ranges::end(__range2),
-               __pred,
-               __proj1,
-               __proj2)
-               .in2 == ranges::end(__range2);
+  _LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI static constexpr bool
+  operator()(_Range1&& __range1, _Range2&& __range2, _Pred __pred = {}, _Proj1 __proj1 = {}, _Proj2 __proj2 = {}) {
+    return __starts_with_impl(
+        ranges::begin(__range1),
+        ranges::end(__range1),
+        ranges::begin(__range2),
+        ranges::end(__range2),
+        __pred,
+        __proj1,
+        __proj2);
   }
 };
 } // namespace __starts_with
diff --git a/libcxx/include/__functional/operations.h b/libcxx/include/__functional/operations.h
index 7ddc00650f162f..4565e1415a6e02 100644
--- a/libcxx/include/__functional/operations.h
+++ b/libcxx/include/__functional/operations.h
@@ -13,6 +13,7 @@
 #include <__config>
 #include <__functional/binary_function.h>
 #include <__functional/unary_function.h>
+#include <__fwd/functional.h>
 #include <__type_traits/integral_constant.h>
 #include <__type_traits/operation_traits.h>
 #include <__utility/forward.h>
@@ -316,10 +317,18 @@ struct _LIBCPP_TEMPLATE_VIS equal_to<void> {
 // comparison when we don't perform an implicit conversion when calling it.
 template <class _Tp>
 struct __desugars_to<__equal_tag, equal_to<_Tp>, _Tp, _Tp> : true_type {};
+template <class _Tp>
+struct __desugars_to<__equal_tag, reference_wrapper<equal_to<_Tp>>, _Tp, _Tp> : true_type {};
+template <class _Tp>
+struct __desugars_to<__equal_tag, reference_wrapper<const equal_to<_Tp>>, _Tp, _Tp> : true_type {};
 
 // In the transparent case, we do not enforce that
 template <class _Tp, class _Up>
 struct __desugars_to<__equal_tag, equal_to<void>, _Tp, _Up> : true_type {};
+template <class _Tp, class _Up>
+struct __desugars_to<__equal_tag, reference_wrapper<equal_to<void>>, _Tp, _Up> : true_type {};
+template <class _Tp, class _Up>
+struct __desugars_to<__equal_tag, reference_wrapper<const equal_to<void>>, _Tp, _Up> : true_type {};
 
 #if _LIBCPP_STD_VER >= 14
 template <class _Tp = void>
diff --git a/libcxx/include/__functional/ranges_operations.h b/libcxx/include/__functional/ranges_operations.h
index 38b28018049eb7..be400c42a65790 100644
--- a/libcxx/include/__functional/ranges_operations.h
+++ b/libcxx/include/__functional/ranges_operations.h
@@ -13,6 +13,7 @@
 #include <__concepts/equality_comparable.h>
 #include <__concepts/totally_ordered.h>
 #include <__config>
+#include <__fwd/functional.h>
 #include <__type_traits/integral_constant.h>
 #include <__type_traits/operation_traits.h>
 #include <__utility/forward.h>
@@ -99,6 +100,10 @@ struct greater_equal {
 // operator are of the same type
 template <class _Tp, class _Up>
 struct __desugars_to<__equal_tag, ranges::equal_to, _Tp, _Up> : true_type {};
+template <class _Tp, class _Up>
+struct __desugars_to<__equal_tag, reference_wrapper<ranges::equal_to>, _Tp, _Up> : true_type {};
+template <class _Tp, class _Up>
+struct __desugars_to<__equal_tag, reference_wrapper<const ranges::equal_to>, _Tp, _Up> : true_type {};
 
 #endif // _LIBCPP_STD_VER >= 20
 

Comment on lines 103 to 106
template <class _Tp, class _Up>
struct __desugars_to<__equal_tag, reference_wrapper<ranges::equal_to>, _Tp, _Up> : true_type {};
template <class _Tp, class _Up>
struct __desugars_to<__equal_tag, reference_wrapper<const ranges::equal_to>, _Tp, _Up> : true_type {};
Member Author

@xiaoyang-sde xiaoyang-sde Mar 9, 2024


The memcmp-optimized overload of __equal_impl requires that the predicate satisfies __desugars_to<__equal_tag> and that the projections satisfy __is_identity.

template <class _Tp,
          class _Up,
          class _Pred,
          class _Proj1,
          class _Proj2,
          __enable_if_t<__desugars_to<__equal_tag, _Pred, _Tp, _Up>::value && __is_identity<_Proj1>::value &&
                            __is_identity<_Proj2>::value && !is_volatile<_Tp>::value && !is_volatile<_Up>::value &&
                            __libcpp_is_trivially_equality_comparable<_Tp, _Up>::value,
                        int> = 0>
_LIBCPP_NODISCARD inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 bool
__equal_impl(_Tp* __first1, _Tp* __last1, _Up* __first2, _Up*, _Pred&, _Proj1&, _Proj2&) {
  return std::__constexpr_memcmp_equal(__first1, __first2, __element_count(__last1 - __first1));
}

Without these changes, this overload won't be selected when the predicate is wrapped in std::reference_wrapper.
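
For illustration only (this snippet is not from the patch), the wrapped-predicate shape that previously missed the memcmp path can be reproduced directly with ranges::equal:

#include <algorithm>
#include <functional>
#include <vector>

int main() {
  std::vector<int> a{1, 2, 3};
  std::vector<int> b{1, 2, 3};
  auto eq = std::ranges::equal_to{};
  // The predicate type deduced here is reference_wrapper<ranges::equal_to>; without
  // the new __desugars_to specializations, the memcmp-optimized __equal_impl overload
  // is not selected for such a call.
  return std::ranges::equal(a, b, std::ref(eq)) ? 0 : 1;
}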

Contributor

I think we should move this into its own PR, assuming that we want it.

@var-const var-const added the ranges and performance labels Mar 10, 2024
@var-const var-const self-assigned this Mar 10, 2024
Contributor

@philnik777 philnik777 left a comment


With #73255 I don't think there will be any significant performance improvement anymore, so I'm not sure whether we want to continue pursuing this. It could probably reduce the amount of generated code a bit, but I'm not sure it's worth the additional amount of complexity. @ldionne do you have any thoughts?


@var-const
Member

With #73255 I don't think there will be any significant performance improvement anymore, so I'm not sure whether we want to continue pursuing this. It could probably reduce the amount of generated code a bit, but I'm not sure it's worth the additional amount of complexity. @ldionne do you have any thoughts?

@xiaoyang-sde Can you please update your branch to make sure #73255 is included and rerun the benchmarks? This might give us some indication on whether this patch is worth pursuing further.

FWIW, I'm somewhat in favor of landing this. If numbers show that there is no performance difference anymore, we should probably remove similar optimizations from ends_with -- whether we delegate to equal or not, IMO it should be consistent between starts_with and ends_with. Also, I think adding a benchmark and extending desugars_to is valuable independent of changes to the algorithm itself.

@xiaoyang-sde
Member Author

xiaoyang-sde commented Apr 8, 2024

Here are the benchmark results for the unoptimized ranges::starts_with (copied from the main branch) running against the vectorized mismatch. Please note that bm_starts_with_contiguous_iter won't invoke the vectorized mismatch because it uses a custom predicate rather than ranges::equal_to. It seems that forwarding to equal still provides some performance improvement.

2024-04-08T00:51:04-07:00
Running ./ranges_starts_with.libcxx.out
Run on (12 X 2496 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 1280 KiB (x6)
  L3 Unified 18432 KiB (x1)
Load Average: 1.27, 2.29, 1.34
-----------------------------------------------------------------------------------------------------------
Benchmark                                                                 Time             CPU   Iterations
-----------------------------------------------------------------------------------------------------------
bm_starts_with_contiguous_iter_with_memcmp_optimization/16             3.09 ns         3.09 ns    212067553
bm_starts_with_contiguous_iter_with_memcmp_optimization/256            20.9 ns         20.9 ns     30603943
bm_starts_with_contiguous_iter_with_memcmp_optimization/4096            310 ns          310 ns      2176699
bm_starts_with_contiguous_iter_with_memcmp_optimization/65536          5394 ns         5394 ns       136959
bm_starts_with_contiguous_iter_with_memcmp_optimization/1048576      134816 ns       134817 ns         4916
bm_starts_with_contiguous_iter_with_memcmp_optimization/16777216    4792548 ns      4792472 ns          150
bm_starts_with_contiguous_iter/16                                      5.38 ns         5.38 ns    123790610
bm_starts_with_contiguous_iter/256                                     81.4 ns         81.4 ns      8541221
bm_starts_with_contiguous_iter/4096                                    1107 ns         1107 ns       615306
bm_starts_with_contiguous_iter/65536                                  16869 ns        16868 ns        41362
bm_starts_with_contiguous_iter/1048576                               293850 ns       293845 ns         2276
bm_starts_with_contiguous_iter/16777216                             6079325 ns      6079284 ns          108
bm_starts_with_random_access_iter/16                                   7.17 ns         7.04 ns    101903410
bm_starts_with_random_access_iter/256                                   100 ns          100 ns      6806306
bm_starts_with_random_access_iter/4096                                 1449 ns         1449 ns       482025
bm_starts_with_random_access_iter/65536                               23967 ns        23967 ns        27789
bm_starts_with_random_access_iter/1048576                            387731 ns       387727 ns         1752
bm_starts_with_random_access_iter/16777216                          6773716 ns      6773727 ns          100
bm_starts_with_bidirectional_iter/16                                   6.95 ns         6.95 ns     97150439
bm_starts_with_bidirectional_iter/256                                   103 ns          101 ns      6666863
bm_starts_with_bidirectional_iter/4096                                 1459 ns         1459 ns       477527
bm_starts_with_bidirectional_iter/65536                               23458 ns        23458 ns        29772
bm_starts_with_bidirectional_iter/1048576                            405195 ns       405189 ns         1822
bm_starts_with_bidirectional_iter/16777216                          6768824 ns      6768460 ns           99
bm_starts_with_forward_iter/16                                         6.86 ns         6.76 ns     99222099
bm_starts_with_forward_iter/256                                        99.3 ns         99.3 ns      6935465
bm_starts_with_forward_iter/4096                                       1455 ns         1455 ns       482635
bm_starts_with_forward_iter/65536                                     23077 ns        23077 ns        29802
bm_starts_with_forward_iter/1048576                                  379438 ns       379432 ns         1795
bm_starts_with_forward_iter/16777216                                6722769 ns      6722515 ns          104

To run the benchmark with the optimized ranges::starts_with, please make sure to add these lines to libcxx/include/__functional/ranges_operations.h:

template <class _Tp, class _Up>
inline const bool __desugars_to_v<__equal_tag, reference_wrapper<ranges::equal_to>, _Tp, _Up> = true;

template <class _Tp, class _Up>
inline const bool __desugars_to_v<__equal_tag, reference_wrapper<const ranges::equal_to>, _Tp, _Up> = true;

Member

@ldionne ldionne left a comment


With #73255 I don't think there will be any significant performance improvement anymore, so I'm not sure whether we want to continue pursuing this. It could probably reduce the amount of generated code a bit, but I'm not sure it's worth the additional amount of complexity. @ldionne do you have any thoughts?

I am not sure I read the benchmarks right, but if I did it looks like we do have quite a bit of a performance benefit with this patch. It makes sense too, since we only need to use memcmp to compare the two ranges instead of finding where they actually differ, which is more complicated.

So IMO this makes sense.

Member

@ldionne ldionne left a comment


I think this LGTM, I don't see any problem with the patch and it does seem to provide a fair speedup. @philnik777 would you agree?

@philnik777
Contributor

IIRC these aren't at all the numbers I got when benchmarking std::mismatch against memcmp, so I'd really like to check this again before this lands. I also don't see how there'd be a significant difference, since the loop in both memcmp and std::mismatch is finding a mismatch. The part that is more work in std::mismatch is always only at the end of the loop, so I'd expect at most a constant time difference between the implementations (which is exactly what I observed IIRC).

@xiaoyang-sde
Member Author

xiaoyang-sde commented Apr 26, 2024

I wrote a simple benchmark that compares the performance of std::ranges::equal (memcmp) and std::ranges::mismatch (vectorized). I ran it on 2 machines and observed different results.

#include <algorithm>
#include <benchmark/benchmark.h>
#include <vector>

#include "test_iterators.h"

static void bm_starts_with_contiguous_iter_with_equal_impl(benchmark::State& state) {
  std::vector<int> a(state.range(), 1);
  std::vector<int> p(state.range(), 1);

  for (auto _ : state) {
    benchmark::DoNotOptimize(a);
    benchmark::DoNotOptimize(p);

    auto begin1 = contiguous_iterator(a.data());
    auto end1   = contiguous_iterator(a.data() + a.size());
    auto begin2 = contiguous_iterator(p.data());
    auto end2   = contiguous_iterator(p.data() + p.size());

    benchmark::DoNotOptimize(std::ranges::equal(begin1, end1, begin2, end2));
  }
}
BENCHMARK(bm_starts_with_contiguous_iter_with_equal_impl)->RangeMultiplier(16)->Range(16, 16 << 20);

static void bm_starts_with_contiguous_iter_with_mismatch_impl(benchmark::State& state) {
  std::vector<int> a(state.range(), 1);
  std::vector<int> p(state.range(), 1);

  for (auto _ : state) {
    benchmark::DoNotOptimize(a);
    benchmark::DoNotOptimize(p);

    auto begin1 = contiguous_iterator(a.data());
    auto end1   = contiguous_iterator(a.data() + a.size());
    auto begin2 = contiguous_iterator(p.data());
    auto end2   = contiguous_iterator(p.data() + p.size());

    benchmark::DoNotOptimize(std::ranges::mismatch(begin1, end1, begin2, end2).in2 == end2);
  }
}
BENCHMARK(bm_starts_with_contiguous_iter_with_mismatch_impl)->RangeMultiplier(16)->Range(16, 16 << 20);

BENCHMARK_MAIN();

The performance is similar on MacBook Air (M1, arm64):

2024-04-26T17:14:19-04:00
Running ./build/libcxx/benchmarks/ranges_starts_with.libcxx.out
Run on (8 X 24 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x8)
Load Average: 2.00, 3.59, 3.41
-----------------------------------------------------------------------------------------------------
Benchmark                                                           Time             CPU   Iterations
-----------------------------------------------------------------------------------------------------
bm_starts_with_contiguous_iter_with_equal_impl/16                2.89 ns         2.86 ns    244051251
bm_starts_with_contiguous_iter_with_equal_impl/256               27.1 ns         26.1 ns     26963107
bm_starts_with_contiguous_iter_with_equal_impl/4096               451 ns          443 ns      1578169
bm_starts_with_contiguous_iter_with_equal_impl/65536             6693 ns         6667 ns       104706
bm_starts_with_contiguous_iter_with_equal_impl/1048576         110306 ns       109800 ns         6162
bm_starts_with_contiguous_iter_with_equal_impl/16777216       3129511 ns      2780193 ns          311
bm_starts_with_contiguous_iter_with_mismatch_impl/16             3.08 ns         3.05 ns    225504566
bm_starts_with_contiguous_iter_with_mismatch_impl/256            26.9 ns         26.7 ns     25911339
bm_starts_with_contiguous_iter_with_mismatch_impl/4096            422 ns          420 ns      1678504
bm_starts_with_contiguous_iter_with_mismatch_impl/65536          6834 ns         6722 ns       105055
bm_starts_with_contiguous_iter_with_mismatch_impl/1048576      124471 ns       123355 ns         5691
bm_starts_with_contiguous_iter_with_mismatch_impl/16777216    2337331 ns      2326288 ns          288

However, the performance is different on Arch Linux with a 4th Gen Xeon processor (x86_64):

2024-04-26T20:45:12+00:00
Running ./build/libcxx/benchmarks/ranges_starts_with.libcxx.out
Run on (4 X 2294.61 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 4096 KiB (x4)
Load Average: 0.00, 0.32, 0.66
-----------------------------------------------------------------------------------------------------
Benchmark                                                           Time             CPU   Iterations
-----------------------------------------------------------------------------------------------------
bm_starts_with_contiguous_iter_with_equal_impl/16                7.68 ns         7.68 ns     93071510
bm_starts_with_contiguous_iter_with_equal_impl/256               31.4 ns         31.4 ns     25166061
bm_starts_with_contiguous_iter_with_equal_impl/4096               396 ns          396 ns      1528737
bm_starts_with_contiguous_iter_with_equal_impl/65536            10798 ns        10798 ns        59100
bm_starts_with_contiguous_iter_with_equal_impl/1048576         496691 ns       496671 ns         1499
bm_starts_with_contiguous_iter_with_equal_impl/16777216      13436051 ns     13435049 ns           50
bm_starts_with_contiguous_iter_with_mismatch_impl/16             10.7 ns         10.7 ns     68709479
bm_starts_with_contiguous_iter_with_mismatch_impl/256            59.0 ns         59.0 ns     10459829
bm_starts_with_contiguous_iter_with_mismatch_impl/4096           1069 ns         1069 ns       729445
bm_starts_with_contiguous_iter_with_mismatch_impl/65536         16881 ns        16880 ns        34519
bm_starts_with_contiguous_iter_with_mismatch_impl/1048576      583530 ns       583395 ns         1250
bm_starts_with_contiguous_iter_with_mismatch_impl/16777216   15792555 ns     15791353 ns           43

@philnik777 Thanks for the comment! Could you please review these benchmark results? If they are valid, then rather than focusing on optimizing std::ranges::starts_with, I guess we should further investigate the performance of std::ranges::mismatch on x86_64.

@var-const
Member

@philnik777 Friendly ping re. the benchmarks. :)

@philnik777
Contributor

Sorry for being so slow on this. I had a lot to do recently. TL;DR it's complicated.

The long version: AVX2 seems to make quite a significant difference for mismatch. @xiaoyang-sde I guess you compiled without any -m flags? The glibc functions are heavily optimized for a lot of platforms, and choose the implementation based on the CPU they're running on. I wasn't able to benchmark yet, but it seems that other implementations aren't as well optimized, making the performance improvement dependent on your compilation flags and on which libc you're using (e.g. musl doesn't seem to have an optimized memcmp at all, which would make our implementation 40x faster). We would be able to do something similar to what glibc does with [[gnu::target_clones]] or [[gnu::target_version]], but I'm not sure it's worth the cost.

I think we have to have a more general discussion on how far we want to go for improved performance. For now I don't want to block this change, but I'd like us to have a more rigid understanding of how far we want to go before doing similar optimizations (i.e. selecting our vector code vs. the libc function).
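
For reference, the per-CPU dispatch mentioned above can be expressed with the target_clones attribute; this is an illustrative sketch of the mechanism on ELF/GNU toolchains, not proposed libc++ code:

#include <cstddef>

// The compiler emits one clone per listed target and dispatches between them at
// load time (via an ifunc resolver on supporting platforms).
[[gnu::target_clones("avx2", "default")]]
bool bytes_equal(const unsigned char* a, const unsigned char* b, std::size_t n) {
  for (std::size_t i = 0; i != n; ++i)
    if (a[i] != b[i])
      return false;
  return true;
}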

@xiaoyang-sde
Member Author

xiaoyang-sde commented May 10, 2024

Sorry for being so slow on this. I had a lot to do recently. TL;DR it's complicated.

The long version: AVX2 seems to make quite a significant difference for mismatch. @xiaoyang-sde I guess you compiled without any -m flags? The glibc functions are heavily optimized for a lot of platforms, and chooses the implementation based on what CPU it's running on. I wasn't able to benchmark yet, but it seems that other implementations aren't optimized as well, making the performance improvement dependent on your compilation flags and what libc you're using (e.g. musl doesn't seem to have any optimized memcmp, which would make our implementation 40x faster). We would be able to do something similar to what glibc does with [[gnu::target_clones]] or [[gnu::target_version]], but I'm not sure it's worth the cost.

I think we have to have a more general discussion on how far we want to go for improved performance. For now I don't want to block this change, but I'd like us to have a more rigid understanding of how far we want to go before doing similar optimizations (i.e. selecting our vector code vs. the libc function).

Thanks for the clarification! Whether this patch is merged or not isn't a concern for me, but I genuinely gained valuable knowledge from our discussion. I appreciate it; please take your time.

Member

@ldionne ldionne left a comment


I finally caught up to this patch. I tend to like the patch. I think it applies an optimization that should almost always be a win, except in cases where ranges::equal is slower than ranges::mismatch, which boils down to whether memcmp is faster than our implementation of mismatch. If I misunderstood something, please let me know.

Assuming I'm on the same page as the other reviewers, I would say this: comparing whether two ranges are equal is definitely "a simpler operation" (conceptually at least) than finding the point at which they don't match. So I think it makes sense to assume that ranges::equal should always be at least as fast as ranges::mismatch (in the worst case, I guess we could even implement ranges::equal by calling ranges::mismatch and checking that the result is the last iterator).

I understand that some C libraries might provide extremely poor quality of implementation for some functions, however I think that should be a bug filed on them and I don't think we should start deciding how to optimize our code based on that.

I have a few comments and I'd like to know if @philnik777 is fine with what I said above, but I'd say this patch otherwise LGTM.

Comment on lines 69 to 70
std::move(__first1),
ranges::next(__first1, __n2),
Member


You'd have to compute ranges::next(__first1, __n2) before the function call if you want to move __first1. Otherwise you could end up using a moved-from iterator in the ranges::next call.
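
A sketch of one way to address this, reusing the names from the diff above (__mid1 is a hypothetical local, and this is not the author's final change):

      // Compute the end of the prefix-sized subrange before __first1 is moved from,
      // so the result no longer depends on argument evaluation order.
      auto __mid1 = ranges::next(__first1, __n2);
      return ranges::equal(
          std::move(__first1),
          std::move(__mid1),
          std::move(__first2),
          std::move(__last2),
          std::ref(__pred),
          std::ref(__proj1),
          std::ref(__proj2));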

@ldionne
Member

ldionne commented Oct 1, 2024

@xiaoyang-sde There's a bit of review feedback to apply, but otherwise I think this is good to go. Do you think you're able to pick this up and make the last few requested changes?

@xiaoyang-sde
Member Author

@xiaoyang-sde There's a bit of review feedback to apply, but otherwise I think this is good to go. Do you think you're able to pick this up and make the last few requested changes?

I completely forgot about this! I'll respond to the comments shortly.

@ldionne
Member

ldionne commented Mar 27, 2025

When benchmarking locally, I get the following results for std::vector:

Comparing build/default/libcxx/test/benchmarks/algorithms/nonmodifying/Output/starts_with.bench.cpp.dir/benchmark-result.json to build/candidate/libcxx/test/benchmarks/algorithms/nonmodifying/Output/starts_with.bench.cpp.dir/benchmark-result.json
Benchmark                                               Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------------------------------------
rng::starts_with(vector<int>)/8                      -0.0314         -0.0314             2             2             2             2
rng::starts_with(vector<int>)/1000                   +0.1724         +0.1719            99           116            99           115
rng::starts_with(vector<int>)/1024                   +0.1791         +0.1791           101           119           101           119
rng::starts_with(vector<int>)/8192                   +0.1894         +0.1893           821           976           821           976
rng::starts_with(vector<int>)/1048576                +0.0984         +0.0984        113321        124466        113308        124457
OVERALL_GEOMEAN                                      +5.1211         +5.1148             0             0             0             0

Using ranges::equal seems to be consistently worse than using ranges::mismatch.

Edit: Ah, it looks like #84570 (comment) and #84570 (comment) explain this result.

@ldionne ldionne changed the title from "[libc++][ranges] optimize the performance of ranges::starts_with" to "[libc++][ranges] Optimize the performance of ranges::starts_with" Mar 27, 2025
@ldionne
Member

ldionne commented Mar 27, 2025

Note that I don't think this will actually fix #83880 since ends_with uses starts_with with reverse iterators, and we don't have an optimization for equal on reverse iterators.
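
Conceptually, that relationship looks like the following hedged sketch (not libc++'s exact code), which is why the equal/memcmp fast path would also need to handle reverse iterators to help ends_with:

#include <algorithm>
#include <ranges>

// ends_with over bidirectional common ranges can be phrased as starts_with over
// reversed views; the reversed iterators are what the fast path would have to cover.
constexpr bool ends_with_via_reverse(const auto& r1, const auto& r2) {
  return std::ranges::starts_with(std::views::reverse(r1), std::views::reverse(r2));
}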

@ldionne
Member

ldionne commented Mar 27, 2025

@philnik777 What do you think the next step should be? Vectorize std::equal when we're on a platform where we know we can do better than memcmp?

Note that I think this patch makes sense in principle, but I also don't think we should check it in as-is since it regresses on some platforms. I'd even be fine with checking it in with some sort of #ifndef _LIBCPP_MISMATCH_FASTER_THAN_MEMCMP that we could fix in the near future. WDYT?

@philnik777
Contributor

@ldionne I'm really not sure what to do. It's nice to piggy-back off of the memcmp optimizations the compiler can do, which means that we probably cannot compete with some implementations (the compiler can assume more about memcmp than it would about our implementation). OTOH we're doing better with our mismatch implementation than the system libc in some situations, without the optimization barriers that would be required to match memcmp on all platforms. From a pure performance standpoint, introducing our own __memcmp to be used everywhere instead of the libc one would probably be best. We could replace that with our own implementation if we want (though I fear people will piggy-back off of that for embedded platforms). I'm not sure how maintainable that would be, though. Would we also introduce our own memmove if we can make a faster one?

@ldionne
Member

ldionne commented Mar 27, 2025

@ldionne I'm really not sure what to do. It's nice to piggy-back off of the memcmp optimizations the compiler can do, which means that we probably cannot compete with some implementations (the compiler can assume more about memcmp than it would about our implementation). OTOH we're doing better with our mismatch implementation than the system libc in some situations without the optimization barriers that would be required to match memcmp on all platforms. From a pure perofrmance stand-point, introducing our own __memcmp to be used everywhere instead of the libc one would probably be best. We could replace that with our own implementation if we want (though I fear people will piggy-back off of that for embedded platforms). I'm not sure how the maintainability of that would be though. Would we also introduce our own memmove if we can make a faster one?

IMO the answer is that if we can beat the platform's memcmp and memmove, we should implement our own. And we should also report a bug against any platform that we beat, but we should do it for the sake of libc++'s own performance. In particular, I find it unfortunate that this patch, which should obviously be an improvement, is actually a pessimization due to suboptimal memcmp on some platforms. I'd like to be able to unblock patches like this one.

@philnik777
Contributor

In that case we should probably add our own wrappers of all the C functions and check whether they're slower than what we could easily achieve.

Labels: libc++, performance, ranges