
Make Ref assignment atomic #46302

Closed
wants to merge 1 commit into from

Conversation

ellenhp
Contributor

@ellenhp ellenhp commented Feb 21, 2021

This PR ensures that references are counted correctly even when two threads assign to the same Ref<> variable at the same time. There's a nonzero performance cost to using atomics here, but we're already using atomics in SafeRefCount, so I think it's probably fine. I need correct behavior in Ref<> for the implementation of godotengine/godot-proposals#2299. See godotengine/godot-proposals#2330 for the justification for this change.

I can't seem to get ASAN to work on macOS so I'd really appreciate some more thorough testing. I'll update this if I find a way to test for leaks and memory errors.

Please review this as if I'm just some random person on the internet who barely knows C++ modifying one of the most fundamental data structures in the entire engine, because that is 100% accurate. 🤣

Fixes: godotengine/godot-proposals#2330

@fire fire added the bug label Feb 21, 2021
@fire fire requested review from a team and RandomShaper February 21, 2021 23:16
@ellenhp ellenhp marked this pull request as ready for review February 21, 2021 23:27
@YuriSizov YuriSizov added this to the 4.0 milestone Feb 21, 2021
@ellenhp
Contributor Author

ellenhp commented Feb 22, 2021

Question for @RandomShaper since I think you were working on some of the other atomics in the engine: Should we use a different memory order than the default? I don't understand the instructions that this code gets compiled down to, but it seems like the more relaxed memory orders might be faster.

https://www.cplusplus.com/reference/atomic/memory_order/

Obviously I haven't profiled it, but I don't personally care in what order assignments happen. If two threads write to the same Ref without a lock, the ordering should be undefined. I don't see any benefit to a stricter guarantee than promising that we'll reference count correctly and that the value of reference will match one of the two values assigned to it. Strict guarantees on ordering are what mutexes are for, IMO. 🤷‍♀️

@ellenhp
Contributor Author

ellenhp commented Feb 22, 2021

Related: #35046

@RandomShaper
Member

I think you don't need to make the local pointers atomic. Just by having an atomic member pointer, you can safely get its value into (and set it from) plain local pointers.

Regarding memory order, I'd need more time to reason about all the possibilities, but I'm quite confident that relaxed is fine, given the kind of guarantee you want to add with this PR, which is not full thread safety (that, as you said, would likely require some locking mechanism, or maybe one super clever lock-less algorithm).

Instead of adding multiple review comments, I've applied my changes locally (on top of yours) and created this patch file:
patch.txt

@ellenhp
Contributor Author

ellenhp commented Feb 23, 2021

Patch looks good. CI is breaking but I think it's ready for another look :)

Member

@RandomShaper RandomShaper left a comment


Looks good to me. However, before merging maybe it'd be good to add a test that exercises this. Or maybe it can be done later (I may be able to add the test myself, but not any sooner than the weekend).

@ellenhp
Contributor Author

ellenhp commented Feb 24, 2021

I don't have much experience with tests, and I think they're failing on master right now so I'm not sure how to work around that.

@RandomShaper
Member

I'll try to add a test myself today. CI seems fine. I guess this can be merged and, in case issues arise or the test fails, you can just make a fixup PR. Working on master has advantages... 😀

@akien-mga akien-mga requested a review from reduz February 24, 2021 09:46
@reduz
Member

reduz commented Feb 24, 2021

Making the pointer atomic forces the CPU to do a barrier every time it's accessed (which happens far more often than it is copied), which is going to have a serious performance implication. Barriers mean cache invalidations across CPUs, so this is not cheap. I don't think solving the case mentioned in this PR is worth this extra cost.

@RandomShaper mentions it is possible to ensure that barrier won't happen at compile time, so that might work. I was not aware of this C++11 feature.

@RandomShaper
Member

@ellenhp, after discussing this with @reduz on IRC, we've agreed that, in order to catch the case where some future CPU or C++ library needs to lock to achieve atomicity (so we know we must use a different approach if that happens), adding this inside the body of the class is desirable:
static_assert(std::atomic<T *>::is_always_lock_free);

That guarantee, together with the relaxed ordering, gives us enough peace of mind.

@ellenhp
Contributor Author

ellenhp commented Feb 24, 2021

I'll make this change later tonight and confirm it's working by going through the generated assembly for some toy examples. I do think that a read-modify-write operation like exchange is going to either invalidate some caches, stall the other CPUs or flush some pipelines, which is why I mentioned this would have a performance impact in the PR description. Like I said though, we're already doing this for the reference counts themselves and I don't think anyone has noticed a performance impact.

@ellenhp
Contributor Author

ellenhp commented Feb 25, 2021

It looks like there is a memory barrier on x86_64 (thank you @fire for dumping this assembly for me 👍)

.seh_proc _ZNSt3__121__cxx_atomic_exchangeIyEET_PNS_22__cxx_atomic_base_implIS1_EES1_NS_12memory_orderE
# %bb.0:
	subq	$56, %rsp
	.seh_stackalloc 56
	.seh_endprologue
	movq	%rcx, 48(%rsp)
	movq	%rdx, 40(%rsp)
	movl	%r8d, 36(%rsp)
	movq	48(%rsp), %rax
	movl	36(%rsp), %r8d
	movq	40(%rsp), %rcx
	movq	%rcx, 24(%rsp)
	addl	$-1, %r8d
	movl	%r8d, %ecx
	subl	$4, %r8d
	movq	%rax, 8(%rsp)                   # 8-byte Spill
	movq	%rcx, (%rsp)                    # 8-byte Spill
	ja	.LBB18_1
# %bb.7:
	leaq	.LJTI18_0(%rip), %rax
	movq	(%rsp), %rcx                    # 8-byte Reload
	movslq	(%rax,%rcx,4), %rdx
	addq	%rax, %rdx
	jmpq	*%rdx
.LBB18_1:
	movq	24(%rsp), %rax
	movq	8(%rsp), %rcx                   # 8-byte Reload
	xchgq	%rax, (%rcx)
	movq	%rax, 16(%rsp)
	jmp	.LBB18_6
.LBB18_2:
	movq	24(%rsp), %rax
	movq	8(%rsp), %rcx                   # 8-byte Reload
	xchgq	%rax, (%rcx)
	movq	%rax, 16(%rsp)
	jmp	.LBB18_6
.LBB18_3:
	movq	24(%rsp), %rax
	movq	8(%rsp), %rcx                   # 8-byte Reload
	xchgq	%rax, (%rcx)
	movq	%rax, 16(%rsp)
	jmp	.LBB18_6
.LBB18_4:
	movq	24(%rsp), %rax
	movq	8(%rsp), %rcx                   # 8-byte Reload
	xchgq	%rax, (%rcx)
	movq	%rax, 16(%rsp)
	jmp	.LBB18_6
.LBB18_5:
	movq	24(%rsp), %rax
	movq	8(%rsp), %rcx                   # 8-byte Reload
	xchgq	%rax, (%rcx)
	movq	%rax, 16(%rsp)
.LBB18_6:
	movq	16(%rsp), %rax
	addq	$56, %rsp
	retq
	.p2align	2, 0x90

I'm not the best at looking at assembly, but I think xchgq %rax, (%rcx) exchanges rax with the memory at pointer location rcx. I read through some docs here: https://www.felixcloutier.com/x86/xchg

The operands can be two general-purpose registers or a register and a memory location. If a memory operand is referenced, the processor’s locking protocol is automatically implemented for the duration of the exchange operation, regardless of the presence or absence of the LOCK prefix or of the value of the IOPL. (See the LOCK prefix description in this chapter for more information on the locking protocol.)

So, bad news. But again, we're already doing this for reference counts and I'm not aware of any performance issues with that. I can profile this too if needed :) Just let me know.

@RandomShaper
Member

RandomShaper commented Feb 25, 2021

When I said "no barriers," I meant no compiler barriers and no explicitly requested CPU barriers. That's only the minimum the programmer asks for, though; the CPU still has to do whatever it must to meet those expectations, falling back to the closest equal-or-safer thing it supports (for instance, if you request a kind of barrier not provided by some CPU instruction set, it may have to use a full barrier to at least provide the expected guarantees).

In our case, yes, XCHG in x86 has that implicit LOCK prefix, which keeps others' hands out of the data until the full operation is complete, and it turns out to be in practice a full memory barrier (there are some subtleties, but they are beyond my current knowledge).

In conclusion, the barrier is a must on x86(-64) and I guess we can live with it. I don't think it will hinder performance so much that it becomes a problem, and anyway, this is in harmony with our current leaning towards a thread-safer Godot.

P. S.: At least we can ensure the operation is lock-free; having to lock would be worse than the memory barrier.

@ellenhp
Contributor Author

ellenhp commented Feb 25, 2021

Agreed. I just wanted to bring this up because it theoretically could impact performance. I don't think it will in a serious way but it would be dishonest to not mention it. :)

@reduz
Member

reduz commented Mar 15, 2021

We should most likely measure this somehow before merging. IMO this protects against the same reference being accessed from different threads, which is not exactly what the refcount was meant for, and not a common case in the engine. Maybe this could be a special case where it is an actual problem.

Successfully merging this pull request may close these issues.

Make Ref<> assignments atomic
6 participants