Data race with rxbufstat in several nic drivers #852
Comments
Hmm, thanks for your post. Data races are always a serious bug. But as far as I can see, and supported by years of use in many applications, there is no race condition here. I think the tool is not taking into account the fact that rxbufstat is actually an array of int. A lot of care is taken to either have mutexes to protect it, or otherwise there is a guarantee of single ownership and only the owner modifies the variable. Each item in the array is of the native CPU word size and aligned to the word boundary, so there is no risk of partial writes or reads. You can call this an implicit atomic. We do expect that the socket read function recv() is thread safe and does not deliver the same packet twice. Do not trust outputs from ThreadSanitizer or other tools without checking the results.
Thanks for your quick reply. I agree that one should not blindly trust the results from code analysis tools. Looking at the code, I still think there are some actual data races here. Not all interactions between the mentioned functions are necessarily a data race because, as you mentioned:
The interesting part is when we receive a packet with an index which we did not request. Here the first data race is the check for […]. Another race is the check for […]. Now, both races might not actually cause problems. For the first race the read side might miss the […].
Indeed, the scenarios you described are intended this way. The write ([…]) and the second case are indeed separated by the transmission itself. On purpose the variable is set before the actual transmit, so it is guaranteed that even when this code is stalled before execution, there cannot be a receiving part that does not also see the change in the variable.
I agree with you that all of this is only true if there is no code reordering by the compiler, and the CPU itself also does not do any funny tricks. To prevent these side effects we should introduce memory barriers. On single-CPU systems these barriers would be no-ops. Also, the barriers only need to be per CPU and not system wide. Then again, I checked the code output of GCC with various optimization settings and the code reordering does not happen at the critical points. Multiple people have spent a lot of time analyzing precisely the code you are referring to for race conditions. The advantage of all this careful variable manipulation with minimal locks is very efficient code execution. And I want to thank you for diving this deep into the code and trying to understand / improve things. That is why I think you deserve an elaborate answer.
Thanks for the elaborate answer. After looking at the memory order guarantees given by the x86 architecture, I think you are correct that the code should work as expected, at least if the compiler does not perform any reordering of the critical code sections. While I’d like to see some actual atomic operations, I get why you chose to omit them. With atomics the compiler won’t perform any breaking reordering, but the generated instructions should be the same on x86. However, on other architectures with less strict ordering (e.g. ARM) the atomic operations are actually needed. So, if Linux on ARM is a supported platform and you can live with requiring a C11 compiler, you should consider using atomics.
I have to agree there. To solve the compiler diversity issue we could try to encapsulate the atomics in the abstraction layer. For compilers that do not support atomics it will just be a regular read/write operation. The disadvantage of more and more abstractions is that users no longer have a feeling for the inner workings, and things could subtly break if assumptions turn out to be wrong.
I was testing my application with GCC ThreadSanitizer on Linux and I got several warnings from the SOEM code (see below). Looking at the code there are several places in the nic drivers where rxbufstat is accessed, but no real locking seems to be included for this variable. In the Linux nicdrv.c I've identified the following places where this variable is accessed (ignoring initialization routines):
- ecx_getindex: access is protected by getindex_mutex
- ecx_setbufstat: no locking
- ecx_outframe: no locking
- ecx_outframe_red: access is protected by tx_mutex
- ecx_inframe: partially protected by rx_mutex
As you can see, there is either no locking at all, or if mutexes are used they are not the same across these functions. As I understand it, it should be possible to use the SOEM library to simultaneously exchange process data and access SDOs. In this scenario these functions may be called in parallel, so they should be thread safe.
To make this code thread safe, a lot of lock/unlock blocks around the rxbufstat variable would be needed, so I'm not sure that is the best way to fix this. Possibly an easier way would be to use atomics for this state.
Example ThreadSanitizer output: