-
Couldn't load subscription status.
- Fork 15k
[libcxx] Remove Redundant Reset in ~basic_string #164718
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[libcxx] Remove Redundant Reset in ~basic_string #164718
Conversation
8dae17b refactors basic_string for more code reuse. This makes sense in most cases, but has performance overhead in the case of ~basic_string. The refactoring of ~basic_string to call __reset_internal_buffer() added a redundant (inside the destructor) reset of the object, which the optimizer is unable to optimize away in many cases. This patch prevents a ~1% regression we observed on an internal workload when applying the original refactoring. This does slightly pessimize the code readability, but I think this change is worth it given the performance impact. I'm hoping to add a benchmark(s) to the upstream libc++ benchmark suite around string construction/destruction to ensure that this case does not regress as it seems common in real world applications. I will put up a separate PR for that when I figure out a reasonable way to write it.
|
@llvm/pr-subscribers-libcxx Author: Aiden Grossman (boomanaiden154) Changes8dae17b refactors basic_string for more code reuse. This makes sense in most cases, but has performance overhead in the case of ~basic_string. The refactoring of ~basic_string to call __reset_internal_buffer() added a redundant (inside the destructor) reset of the object, which the optimizer is unable to optimize away in many cases. This patch prevents a ~1% regression we observed on an internal workload when applying the original refactoring. This does slightly pessimize the code readability, but I think this change is worth it given the performance impact. I'm hoping to add a benchmark(s) to the upstream libc++ benchmark suite around string construction/destruction to ensure that this case does not regress as it seems common in real world applications. I will put up a separate PR for that when I figure out a reasonable way to write it. Full diff: https://github.com/llvm/llvm-project/pull/164718.diff 1 Files Affected:
diff --git a/libcxx/include/string b/libcxx/include/string
index 8f80afbc2fd37..e680c17069140 100644
--- a/libcxx/include/string
+++ b/libcxx/include/string
@@ -1206,7 +1206,13 @@ public:
}
# endif // _LIBCPP_CXX03_LANG
- inline _LIBCPP_CONSTEXPR_SINCE_CXX20 ~basic_string() { __reset_internal_buffer(); }
+ inline _LIBCPP_CONSTEXPR_SINCE_CXX20 ~basic_string() {
+ // We do not use __reset_internal_buffer() here to avoid the overhead of
+ // resetting the object.
+ __annotate_delete();
+ if (__is_long())
+ __alloc_traits::deallocate(__alloc_, __get_long_pointer(), __get_long_cap());
+ }
_LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 operator __self_view() const _NOEXCEPT {
return __self_view(typename __self_view::__assume_valid(), data(), size());
|
|
Thanks, Aiden, for doing the legwork to gather performance data! I want to take a shameless victory lap for insisting that we run the numbers specifically just on this diff hunk. I eyeballed the rest of #128423, and I didn't see any other new obviously dead stores, but it's always worth keeping these concerns in mind when working on string and vector. It reminds me of a paper I heard about at CGO when I was a new grad. Someone built a dynamic instrumentation tool with Pin to find dead stores and identify optimization opportunities. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a good idea. IMO the main question to ask here is why the compiler can't figure out that it's a dead store. By just looking at the code it's rather obvious, so why is that not the case for the compiler? The point of the change was to simplify the string allocation and deallocation so we have a single point where we do the actual calls, which this patch defeats. I'd be much happier with some annotation to the compiler saying "after this point reading from the storage is UB" if that's required for the compiler to optimize this. That would probably allow eliminating a lot more cases of dead stores than just this specific one.
|
I guess what I'm asking is that either clang automatically annotates the this pointer in a destructor as |
|
Emitting |
|
Nevermind, |
|
Rolling out We have this problem in reverse for full constructor initialization. You end up with a bunch of zero initializing stores, and then a complex hierarchy of constructor calls that may or may not initialize the memory. The LLVM |
|
After thinking about this a bit more and talking to Reid, I've updated the patch at @dwblaikie's suggestion to refactor out |
I'm not sure I follow. Adding
Yeah, I'm a bit more happy with the refactoring. Still not a huge fan though.
I can certainly see that rolling such a change out might take some time. However, given that it at least seems to work, do you disagree that adding
I'm not sure I follow. Are you saying we're unnecessarily initializing stuff multiple times in the string constructors, or was that more of a general comment? |
It works when you do that, but you're not demonstrating the phase ordering problem that comes up. Inlining currently does not propagate void test() { some_string.~basic_string(); }
inline ~basic_string(/*implict this is marked dead on return*/) { __reset_internal_buffer(); }
void __reset_internal_buffer() {
__annotate_delete();
if (__is_long())
__alloc_traits::deallocate(__alloc_, __get_long_pointer(), __get_long_cap());
__rep_.__s = __short();
}We do some inlining and we get the following: void test() {
some_string.__annotate_delete();
if (some_string.__is_long())
__alloc_traits::deallocate(__alloc_, some_string.__get_long_pointer(), some_string.__get_long_cap());
some_string.__rep_.__s = __short();
}Now the Now I think (haven't looked at feasibility/how well it would work) we could theoretically fix this in two different ways:
Yeah, I agree it's not an optimal solution. But neither the standard library or the compiler are magic, so we have to make a tradeoff somewhere. I think that this keeps everything together enough to preserve most of the benefit of the refactoring while preserving performance.
I don't think it's a bad idea in general. There are two things here:
If the leg work ends up getting done in the compiler to optimize these away, I'm perfectly happy reverting this patch. I just think it's going to be a huge amount of work (quarters to a year like Reid mentioned).
This was a general comment related to us trying to deploy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the report. I had also noticed some change in benchmarks I've been running and I thought it might be related to this patch, but I was then unable to reproduce the regression so Nikolas and I had kind of assumed it wasn't an actual regression. Perhaps it was. This is where e.g. LNT would come in really handy -- let's hope we can unblock that soon!
I think it would be great to have a regression test or a benchmark for this, but I understand it might be difficult to capture.
|
|
||
| // Deallocate the long buffer if it exists and clear the short buffer so we are an empty string | ||
| _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 void __reset_internal_buffer() { | ||
| _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 void __deallocate_buffer_if_long() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to suggest the following solution, which solves the problem while IMO reducing the code complexity. It also addresses something that I didn't particularly like from the previous refactoring (the existence of __replace_internal_buffer). This makes the internal API closer in usage to unique_ptr::reset:
diff --git a/libcxx/include/string b/libcxx/include/string
index 8f80afbc2fd3..59d9722b40f4 100644
--- a/libcxx/include/string
+++ b/libcxx/include/string
@@ -1206,7 +1206,7 @@ public:
}
# endif // _LIBCPP_CXX03_LANG
- inline _LIBCPP_CONSTEXPR_SINCE_CXX20 ~basic_string() { __reset_internal_buffer(); }
+ inline _LIBCPP_CONSTEXPR_SINCE_CXX20 ~basic_string() { __reset_internal_buffer(__uninitialized_tag()); }
_LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 operator __self_view() const _NOEXCEPT {
return __self_view(typename __self_view::__assume_valid(), data(), size());
@@ -2259,18 +2259,23 @@ private:
return __long(__buffer, __capacity);
}
- // Deallocate the long buffer if it exists and clear the short buffer so we are an empty string
- _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 void __reset_internal_buffer() {
+ // Deallocate the long buffer if it exists. If __uninitialized_tag is passed, initialize neither
+ // the short nor the long representation. If a short or a long representation is passed, initialize
+ // the string with it.
+ _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 void __reset_internal_buffer(__uninitialized_tag) {
__annotate_delete();
if (__is_long())
__alloc_traits::deallocate(__alloc_, __get_long_pointer(), __get_long_cap());
- __rep_.__s = __short();
}
- // Replace the current buffer with __alloc; the first __size elements constitute a string
- _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 void __replace_internal_buffer(__long __alloc) {
- __reset_internal_buffer();
- __rep_.__l = __alloc;
+ _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 void __reset_internal_buffer(__short __contents) {
+ __reset_internal_buffer(__uninitialized_tag());
+ __rep_.__s = __contents;
+ }
+
+ _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 void __reset_internal_buffer(__long __allocation) {
+ __reset_internal_buffer(__uninitialized_tag());
+ __rep_.__l = __allocation; // this function assumes that the first __size elements constitute a string
}
// Initialize the internal buffer to hold __size elements
@@ -2438,13 +2443,13 @@ private:
__alloc_ = __str.__alloc_;
else {
if (!__str.__is_long()) {
- __reset_internal_buffer();
+ __reset_internal_buffer(__short());
__alloc_ = __str.__alloc_;
} else {
__annotate_delete();
auto __guard = std::__make_scope_guard(__annotate_new_size(*this));
auto __alloc = __str.__alloc_;
- __replace_internal_buffer(__allocate_long_buffer(__alloc, __str.size()));
+ __reset_internal_buffer(__allocate_long_buffer(__alloc, __str.size()));
__alloc_ = std::move(__alloc);
}
}
@@ -2655,7 +2660,7 @@ basic_string<_CharT, _Traits, _Allocator>::__init_with_sentinel(_InputIterator _
__rep_ = __rep();
__annotate_new(0);
- auto __guard = std::__make_exception_guard([this] { __reset_internal_buffer(); });
+ auto __guard = std::__make_exception_guard([this] { __reset_internal_buffer(__short()); });
for (; __first != __last; ++__first)
push_back(*__first);
__guard.__complete();
@@ -2675,7 +2680,7 @@ _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 void
basic_string<_CharT, _Traits, _Allocator>::__init_with_size(_InputIterator __first, _Sentinel __last, size_type __sz) {
pointer __p = __init_internal_buffer(__sz);
- auto __guard = std::__make_exception_guard([this] { __reset_internal_buffer(); });
+ auto __guard = std::__make_exception_guard([this] { __reset_internal_buffer(__short()); });
auto __end = __copy_non_overlapping_range(std::move(__first), std::move(__last), std::__to_address(__p));
traits_type::assign(*__end, value_type());
__guard.__complete();
@@ -2710,7 +2715,7 @@ _LIBCPP_CONSTEXPR_SINCE_CXX20 void basic_string<_CharT, _Traits, _Allocator>::__
__sec_cp_sz);
__buffer.__size_ = __n_copy + __n_add + __sec_cp_sz;
traits_type::assign(__buffer.__data_[__buffer.__size_], value_type());
- __replace_internal_buffer(__buffer);
+ __reset_internal_buffer(__buffer);
}
// __grow_by is deprecated because it does not set the size. It may not update the size when the size is changed, and it
@@ -2746,7 +2751,7 @@ _LIBCPP_DEPRECATED_("use __grow_by_without_replace") basic_string<_CharT, _Trait
// This is -1 to make sure the caller sets the size properly, since old versions of this function didn't set the size
// at all.
__buffer.__size_ = -1;
- __replace_internal_buffer(__buffer);
+ __reset_internal_buffer(__buffer);
}
template <class _CharT, class _Traits, class _Allocator>
@@ -2898,7 +2903,7 @@ basic_string<_CharT, _Traits, _Allocator>::__move_assign(basic_string& __str, tr
{
__annotate_delete();
if (__is_long()) {
- __reset_internal_buffer();
+ __reset_internal_buffer(__short());
# if _LIBCPP_STD_VER <= 14
if (!is_nothrow_move_assignable<allocator_type>::value) {
__set_short_size(0);
@@ -3394,7 +3399,7 @@ _LIBCPP_CONSTEXPR_SINCE_CXX20 void basic_string<_CharT, _Traits, _Allocator>::re
__long __buffer = __allocate_long_buffer(__alloc_, __requested_capacity);
__buffer.__size_ = size();
traits_type::copy(std::__to_address(__buffer.__data_), data(), __buffer.__size_ + 1);
- __replace_internal_buffer(__buffer);
+ __reset_internal_buffer(__buffer);
}
template <class _CharT, class _Traits, class _Allocator>
@@ -3433,7 +3438,7 @@ inline _LIBCPP_CONSTEXPR_SINCE_CXX20 void basic_string<_CharT, _Traits, _Allocat
}
traits_type::copy(std::__to_address(__buffer.__data_), std::__to_address(__get_long_pointer()), __size + 1);
- __replace_internal_buffer(__buffer);
+ __reset_internal_buffer(__buffer);
# if _LIBCPP_HAS_EXCEPTIONS
} catch (...) {
return;Alternatively, we could decide that we're comfortable with making __reset_internal_buffer() equivalent to __reset_internal_buffer(__short()) and we'd have to change fewer call sites. WDYT?
Edit: Note that __uninitialized_tag already exists for no_destroy.h, we'd probably want to lift it somewhere else.
|
@boomanaiden154 Could you demonstrate the phase ordering problem please? Looking at the pipeline it seems to me like the dse is done before inlining, and the store is in fact eliminated in the example I gave, even though the destructor (which I annotated manually) did get inlined. |
8dae17b refactors basic_string for more code reuse. This makes sense in most cases, but has performance overhead in the case of ~basic_string. The refactoring of ~basic_string to call __reset_internal_buffer() added a redundant (inside the destructor) reset of the object, which the optimizer is unable to optimize away in many cases. This patch prevents a ~1% regression we observed on an internal workload when applying the original refactoring. This does slightly pessimize the code readability, but I think this change is worth it given the performance impact.
I'm hoping to add a benchmark(s) to the upstream libc++ benchmark suite around string construction/destruction to ensure that this case does not regress as it seems common in real world applications. I will put up a separate PR for that when I figure out a reasonable way to write it.