Skip to content

Conversation

@boomanaiden154
Copy link
Contributor

8dae17b refactors basic_string for more code reuse. This makes sense in most cases, but has performance overhead in the case of ~basic_string. The refactoring of ~basic_string to call __reset_internal_buffer() added a redundant (inside the destructor) reset of the object, which the optimizer is unable to optimize away in many cases. This patch prevents a ~1% regression we observed on an internal workload when applying the original refactoring. This does slightly pessimize the code readability, but I think this change is worth it given the performance impact.

I'm hoping to add a benchmark(s) to the upstream libc++ benchmark suite around string construction/destruction to ensure that this case does not regress as it seems common in real world applications. I will put up a separate PR for that when I figure out a reasonable way to write it.

8dae17b refactors basic_string for more
code reuse. This makes sense in most cases, but has performance overhead
in the case of ~basic_string. The refactoring of ~basic_string to call
__reset_internal_buffer() added a redundant (inside the destructor)
reset of the object, which the optimizer is unable to optimize away in
many cases. This patch prevents a ~1% regression we observed on an
internal workload when applying the original refactoring. This does
slightly pessimize the code readability, but I think this change is
worth it given the performance impact.

I'm hoping to add a benchmark(s) to the upstream libc++ benchmark suite
around string construction/destruction to ensure that this case does not
regress as it seems common in real world applications. I will put up a
separate PR for that when I figure out a reasonable way to write it.
@boomanaiden154 boomanaiden154 requested a review from a team as a code owner October 22, 2025 22:09
@llvmbot llvmbot added the libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. label Oct 22, 2025
@llvmbot
Copy link
Member

llvmbot commented Oct 22, 2025

@llvm/pr-subscribers-libcxx

Author: Aiden Grossman (boomanaiden154)

Changes

8dae17b refactors basic_string for more code reuse. This makes sense in most cases, but has performance overhead in the case of ~basic_string. The refactoring of ~basic_string to call __reset_internal_buffer() added a redundant (inside the destructor) reset of the object, which the optimizer is unable to optimize away in many cases. This patch prevents a ~1% regression we observed on an internal workload when applying the original refactoring. This does slightly pessimize the code readability, but I think this change is worth it given the performance impact.

I'm hoping to add a benchmark(s) to the upstream libc++ benchmark suite around string construction/destruction to ensure that this case does not regress as it seems common in real world applications. I will put up a separate PR for that when I figure out a reasonable way to write it.


Full diff: https://github.com/llvm/llvm-project/pull/164718.diff

1 Files Affected:

  • (modified) libcxx/include/string (+7-1)
diff --git a/libcxx/include/string b/libcxx/include/string
index 8f80afbc2fd37..e680c17069140 100644
--- a/libcxx/include/string
+++ b/libcxx/include/string
@@ -1206,7 +1206,13 @@ public:
   }
 #  endif // _LIBCPP_CXX03_LANG
 
-  inline _LIBCPP_CONSTEXPR_SINCE_CXX20 ~basic_string() { __reset_internal_buffer(); }
+  inline _LIBCPP_CONSTEXPR_SINCE_CXX20 ~basic_string() {
+    // We do not use __reset_internal_buffer() here to avoid the overhead of
+    // resetting the object.
+    __annotate_delete();
+    if (__is_long())
+      __alloc_traits::deallocate(__alloc_, __get_long_pointer(), __get_long_cap());
+  }
 
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 operator __self_view() const _NOEXCEPT {
     return __self_view(typename __self_view::__assume_valid(), data(), size());

@rnk
Copy link
Collaborator

rnk commented Oct 22, 2025

Thanks, Aiden, for doing the legwork to gather performance data!

I want to take a shameless victory lap for insisting that we run the numbers specifically just on this diff hunk. I eyeballed the rest of #128423, and I didn't see any other new obviously dead stores, but it's always worth keeping these concerns in mind when working on string and vector.

It reminds me of a paper I heard about at CGO when I was a new grad. Someone built a dynamic instrumentation tool with Pin to find dead stores and identify optimization opportunities.

Copy link
Contributor

@philnik777 philnik777 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a good idea. IMO the main question to ask here is why the compiler can't figure out that it's a dead store. By just looking at the code it's rather obvious, so why is that not the case for the compiler? The point of the change was to simplify the string allocation and deallocation so we have a single point where we do the actual calls, which this patch defeats. I'd be much happier with some annotation to the compiler saying "after this point reading from the storage is UB" if that's required for the compiler to optimize this. That would probably allow eliminating a lot more cases of dead stores than just this specific one.

@philnik777
Copy link
Contributor

I guess what I'm asking is that either clang automatically annotates the this pointer in a destructor as dead_on_return or it is exposed as an attirbute by Clang so we can annotate destructors manually (which we would probably want to do for every destructor in the library for better or worse).

@boomanaiden154
Copy link
Contributor Author

Emitting dead_on_return in destructors within clang would probably work. I'm not sure of the implementation status though, and it doesn't look like it's been wired up in DSE/AA yet though, so probably needs quite a bit of work.

@boomanaiden154
Copy link
Contributor Author

Nevermind, dead_on_return is wired up in DSE. I'll see if I can get that wired up.

@rnk
Copy link
Collaborator

rnk commented Oct 23, 2025

Rolling out dead_on_return on this on destructors will be a major behavior change. The work involved in rolling out such a change would be measured in quarters. It's totally impractical to simply make this the compiler's problem. I feel strongly that the practical thing to do here is not emit the store so we don't have to carry the IR to optimize it away later.

We have this problem in reverse for full constructor initialization. You end up with a bunch of zero initializing stores, and then a complex hierarchy of constructor calls that may or may not initialize the memory. The LLVM initializes attribute makes this slightly better, but the compiler has to give up in the general case because of things like downcasts.

@boomanaiden154
Copy link
Contributor Author

boomanaiden154 commented Oct 23, 2025

After thinking about this a bit more and talking to Reid, dead_on_return has more problems. It's important to note that adding dead_on_return to the destructor's this parameter also would not actually help here given inlining and how the pass pipeline is currently set up. ~basic_string is likely to get inlined in many places, and after inlining dead_on_return goes away, so we have no additional information to guide AA/DSE. To fix that we would need to add an intrinsic similar to @llvm.lifetime.end to track that information. Or we would need to mark __reset_internal_buffer() as always_inline and add a EarlyDSE pass that takes care of trivial stores to handle this. Neither are great options, on top of what Reid already mentioned.

I've updated the patch at @dwblaikie's suggestion to refactor out __reset_internal_buffer() into two functions so that we can keep all the code reused, consolidate where everything gets reset into a similar number of lines, while preventing zeroing stores inside the destructor. I think that should help with the maintenance concerns at least a little bit.

@philnik777
Copy link
Contributor

After thinking about this a bit more and talking to Reid, dead_on_return has more problems. It's important to note that adding dead_on_return to the destructor's this parameter also would not actually help here given inlining and how the pass pipeline is currently set up.~basic_string is likely to get inlined in many places, and after inlining dead_on_return goes away, so we have no additional information to guide AA/DSE. To fix that we would need to add an intrinsic similar to @llvm.lifetime.end to track that information. Or we would need to mark __reset_internal_buffer() as always_inline and add a EarlyDSE pass that takes care of trivial stores to handle this. Neither are great options, on top of what Reid already mentioned.

I'm not sure I follow. Adding dead_on_return seems to work just fine: https://godbolt.org/z/dv4oEq3eK. The IR is taken from this: https://godbolt.org/z/eo3jbco1s
_ZNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEED2Ev is the function where I added an extra dead_on_return to the this pointer.

I've updated the patch at @dwblaikie's suggestion to refactor out __reset_internal_buffer() into two functions so that we can keep all the code reused, consolidate where everything gets reset into a similar number of lines, while preventing zeroing stores inside the destructor. I think that should help with the maintenance concerns at least a little bit.

Yeah, I'm a bit more happy with the refactoring. Still not a huge fan though.

Rolling out dead_on_return on this on destructors will be a major behavior change. The work involved in rolling out such a change would be measured in quarters. It's totally impractical to simply make this the compiler's problem. I feel strongly that the practical thing to do here is not emit the store so we don't have to carry the IR to optimize it away later.

I can certainly see that rolling such a change out might take some time. However, given that it at least seems to work, do you disagree that adding dead_on_return to destructors is a good idea in general, and once we've done that we can basically revert anything in this patch? IIUC the actual implementation should be fairly trivial. The actual problem is that people probably sometimes rely on the compiler not optimizing dead stores in destructors. I'd be quite happy to take this patch if this is a temporary solution until we've managed to land a patch for dead_on_return.

We have this problem in reverse for full constructor initialization. You end up with a bunch of zero initializing stores, and then a complex hierarchy of constructor calls that may or may not initialize the memory. The LLVM initializes attribute makes this slightly better, but the compiler has to give up in the general case because of things like downcasts.

I'm not sure I follow. Are you saying we're unnecessarily initializing stuff multiple times in the string constructors, or was that more of a general comment?

@boomanaiden154
Copy link
Contributor Author

I'm not sure I follow. Adding dead_on_return seems to work just fine: https://godbolt.org/z/dv4oEq3eK. The IR is taken from this: https://godbolt.org/z/eo3jbco1s
_ZNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEED2Ev is the function where I added an extra dead_on_return to the this pointer.

It works when you do that, but you're not demonstrating the phase ordering problem that comes up. Inlining currently does not propagate dead_on_return information and DSE runs very late. So if you start with something like this:

void test() { some_string.~basic_string(); }

inline ~basic_string(/*implict this is marked dead on return*/) { __reset_internal_buffer(); }

void __reset_internal_buffer() {
  __annotate_delete();
  if (__is_long())
    __alloc_traits::deallocate(__alloc_, __get_long_pointer(), __get_long_cap());
  __rep_.__s = __short();
}

We do some inlining and we get the following:

void test() {
  some_string.__annotate_delete();
  if (some_string.__is_long())
    __alloc_traits::deallocate(__alloc_, some_string.__get_long_pointer(), some_string.__get_long_cap());
  some_string.__rep_.__s = __short();
}

Now the dead_on_return attribute is no more and we get no assistance from the annotation in optimization. It's of course still trivial to pick out here, but it's easy to imagine a complicated scenario where proving aliasing ends up being difficult. This is likely to happen in tons of hot places where string destruction occurs in hot code (which will be a lot). We only run dead store elimination long after inlining, so we cannot delete the stores when they are simple to pick out. That's unlikely to change given the compile time overhead of good AA that's needed for DSE.

Now I think (haven't looked at feasibility/how well it would work) we could theoretically fix this in two different ways:

  1. Add a new intrinsic to IR similar to @llvm.lifetime.end but more general that the inliner can put in the terminator blocks of functions that have a dead_on_return parameter. That feels a bit specific to this case (although would be more general to anything adding dead_on_return), and I could forsee there being some design issues that would make implementation a lot more difficult.
  2. We could mark __reset_internal_buffer as always_inline and add a new EarlyDSE pass that only picks out trivially dead stores soon after the AlwaysInliner pass. That would let us eliminate these stores when they are easy to prove dead. I think there would be a decent amount of compile time overhead from this option that might not be tenable.

Yeah, I'm a bit more happy with the refactoring. Still not a huge fan though.

Yeah, I agree it's not an optimal solution. But neither the standard library or the compiler are magic, so we have to make a tradeoff somewhere. I think that this keeps everything together enough to preserve most of the benefit of the refactoring while preserving performance.

I can certainly see that rolling such a change out might take some time. However, given that it at least seems to work, do you disagree that adding dead_on_return to destructors is a good idea in general, and once we've done that we can basically revert anything in this patch? IIUC the actual implementation should be fairly trivial. The actual problem is that people probably sometimes rely on the compiler not optimizing dead stores in destructors. I'd be quite happy to take this patch if this is a temporary solution until we've managed to land a patch for dead_on_return.

I don't think it's a bad idea in general. There are two things here:

  1. Just adding dead_on_return to destructor's in clang's codegen does not actually solve the problem here without more work in the optimization pipeline. This optimization pipeline work would be pretty nontrivial.
  2. The amount of work to clean up people relying on UB in our large internal codebase would probably be huge. Even just this patch zeroing destructed strings caused us to have to cleanup 20-30 tests (some tsan failures and some tests just failing). The code was of course all wrong to begin with, but it's still a lot of work to cleanup, or even just file bugs against other teams to cleanup. I would imagine the amount of work cleaning up people relying on destructors in general not zeroing everything would be huge. I think our time would be better spent working on the compiler.

If the leg work ends up getting done in the compiler to optimize these away, I'm perfectly happy reverting this patch. I just think it's going to be a huge amount of work (quarters to a year like Reid mentioned).

I'm not sure I follow. Are you saying we're unnecessarily initializing stuff multiple times in the string constructors, or was that more of a general comment?

This was a general comment related to us trying to deploy -ftrivial-auto-var-init and trying to reduce performance overhead from it.

Copy link
Member

@ldionne ldionne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the report. I had also noticed some change in benchmarks I've been running and I thought it might be related to this patch, but I was then unable to reproduce the regression so Nikolas and I had kind of assumed it wasn't an actual regression. Perhaps it was. This is where e.g. LNT would come in really handy -- let's hope we can unblock that soon!

I think it would be great to have a regression test or a benchmark for this, but I understand it might be difficult to capture.


// Deallocate the long buffer if it exists and clear the short buffer so we are an empty string
_LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 void __reset_internal_buffer() {
_LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 void __deallocate_buffer_if_long() {
Copy link
Member

@ldionne ldionne Oct 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to suggest the following solution, which solves the problem while IMO reducing the code complexity. It also addresses something that I didn't particularly like from the previous refactoring (the existence of __replace_internal_buffer). This makes the internal API closer in usage to unique_ptr::reset:

diff --git a/libcxx/include/string b/libcxx/include/string
index 8f80afbc2fd3..59d9722b40f4 100644
--- a/libcxx/include/string
+++ b/libcxx/include/string
@@ -1206,7 +1206,7 @@ public:
   }
 #  endif // _LIBCPP_CXX03_LANG
 
-  inline _LIBCPP_CONSTEXPR_SINCE_CXX20 ~basic_string() { __reset_internal_buffer(); }
+  inline _LIBCPP_CONSTEXPR_SINCE_CXX20 ~basic_string() { __reset_internal_buffer(__uninitialized_tag()); }
 
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 operator __self_view() const _NOEXCEPT {
     return __self_view(typename __self_view::__assume_valid(), data(), size());
@@ -2259,18 +2259,23 @@ private:
     return __long(__buffer, __capacity);
   }
 
-  // Deallocate the long buffer if it exists and clear the short buffer so we are an empty string
-  _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 void __reset_internal_buffer() {
+  // Deallocate the long buffer if it exists. If __uninitialized_tag is passed, initialize neither
+  // the short nor the long representation. If a short or a long representation is passed, initialize
+  // the string with it.
+  _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 void __reset_internal_buffer(__uninitialized_tag) {
     __annotate_delete();
     if (__is_long())
       __alloc_traits::deallocate(__alloc_, __get_long_pointer(), __get_long_cap());
-    __rep_.__s = __short();
   }
 
-  // Replace the current buffer with __alloc; the first __size elements constitute a string
-  _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 void __replace_internal_buffer(__long __alloc) {
-    __reset_internal_buffer();
-    __rep_.__l = __alloc;
+  _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 void __reset_internal_buffer(__short __contents) {
+    __reset_internal_buffer(__uninitialized_tag());
+    __rep_.__s = __contents;
+  }
+
+  _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 void __reset_internal_buffer(__long __allocation) {
+    __reset_internal_buffer(__uninitialized_tag());
+    __rep_.__l = __allocation; // this function assumes that the first __size elements constitute a string
   }
 
   // Initialize the internal buffer to hold __size elements
@@ -2438,13 +2443,13 @@ private:
       __alloc_ = __str.__alloc_;
     else {
       if (!__str.__is_long()) {
-        __reset_internal_buffer();
+        __reset_internal_buffer(__short());
         __alloc_ = __str.__alloc_;
       } else {
         __annotate_delete();
         auto __guard = std::__make_scope_guard(__annotate_new_size(*this));
         auto __alloc = __str.__alloc_;
-        __replace_internal_buffer(__allocate_long_buffer(__alloc, __str.size()));
+        __reset_internal_buffer(__allocate_long_buffer(__alloc, __str.size()));
         __alloc_ = std::move(__alloc);
       }
     }
@@ -2655,7 +2660,7 @@ basic_string<_CharT, _Traits, _Allocator>::__init_with_sentinel(_InputIterator _
   __rep_ = __rep();
   __annotate_new(0);
 
-  auto __guard = std::__make_exception_guard([this] { __reset_internal_buffer(); });
+  auto __guard = std::__make_exception_guard([this] { __reset_internal_buffer(__short()); });
   for (; __first != __last; ++__first)
     push_back(*__first);
   __guard.__complete();
@@ -2675,7 +2680,7 @@ _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 void
 basic_string<_CharT, _Traits, _Allocator>::__init_with_size(_InputIterator __first, _Sentinel __last, size_type __sz) {
   pointer __p = __init_internal_buffer(__sz);
 
-  auto __guard = std::__make_exception_guard([this] { __reset_internal_buffer(); });
+  auto __guard = std::__make_exception_guard([this] { __reset_internal_buffer(__short()); });
   auto __end   = __copy_non_overlapping_range(std::move(__first), std::move(__last), std::__to_address(__p));
   traits_type::assign(*__end, value_type());
   __guard.__complete();
@@ -2710,7 +2715,7 @@ _LIBCPP_CONSTEXPR_SINCE_CXX20 void basic_string<_CharT, _Traits, _Allocator>::__
                       __sec_cp_sz);
   __buffer.__size_ = __n_copy + __n_add + __sec_cp_sz;
   traits_type::assign(__buffer.__data_[__buffer.__size_], value_type());
-  __replace_internal_buffer(__buffer);
+  __reset_internal_buffer(__buffer);
 }
 
 // __grow_by is deprecated because it does not set the size. It may not update the size when the size is changed, and it
@@ -2746,7 +2751,7 @@ _LIBCPP_DEPRECATED_("use __grow_by_without_replace") basic_string<_CharT, _Trait
   // This is -1 to make sure the caller sets the size properly, since old versions of this function didn't set the size
   // at all.
   __buffer.__size_ = -1;
-  __replace_internal_buffer(__buffer);
+  __reset_internal_buffer(__buffer);
 }
 
 template <class _CharT, class _Traits, class _Allocator>
@@ -2898,7 +2903,7 @@ basic_string<_CharT, _Traits, _Allocator>::__move_assign(basic_string& __str, tr
 {
   __annotate_delete();
   if (__is_long()) {
-    __reset_internal_buffer();
+    __reset_internal_buffer(__short());
 #    if _LIBCPP_STD_VER <= 14
     if (!is_nothrow_move_assignable<allocator_type>::value) {
       __set_short_size(0);
@@ -3394,7 +3399,7 @@ _LIBCPP_CONSTEXPR_SINCE_CXX20 void basic_string<_CharT, _Traits, _Allocator>::re
   __long __buffer  = __allocate_long_buffer(__alloc_, __requested_capacity);
   __buffer.__size_ = size();
   traits_type::copy(std::__to_address(__buffer.__data_), data(), __buffer.__size_ + 1);
-  __replace_internal_buffer(__buffer);
+  __reset_internal_buffer(__buffer);
 }
 
 template <class _CharT, class _Traits, class _Allocator>
@@ -3433,7 +3438,7 @@ inline _LIBCPP_CONSTEXPR_SINCE_CXX20 void basic_string<_CharT, _Traits, _Allocat
     }
 
     traits_type::copy(std::__to_address(__buffer.__data_), std::__to_address(__get_long_pointer()), __size + 1);
-    __replace_internal_buffer(__buffer);
+    __reset_internal_buffer(__buffer);
 #  if _LIBCPP_HAS_EXCEPTIONS
   } catch (...) {
     return;

Alternatively, we could decide that we're comfortable with making __reset_internal_buffer() equivalent to __reset_internal_buffer(__short()) and we'd have to change fewer call sites. WDYT?

Edit: Note that __uninitialized_tag already exists for no_destroy.h, we'd probably want to lift it somewhere else.

@philnik777
Copy link
Contributor

@boomanaiden154 Could you demonstrate the phase ordering problem please? Looking at the pipeline it seems to me like the dse is done before inlining, and the store is in fact eliminated in the example I gave, even though the destructor (which I annotated manually) did get inlined.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants