-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs: add mutex improvement proposal #10191
Changes from all commits
381019b
d5781e2
f89f224
f0c8733
49865b1
77aa286
335bdd5
3718fa3
bd723a4
c59bae7
d326bac
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -220,3 +220,8 @@ idempotence | |
kube-scheduler | ||
kube-apiserver | ||
kubectl | ||
MutexStatus | ||
SemaphoreStatus | ||
structs | ||
mutexes | ||
ConfigMap | ||
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,96 @@ | ||
# Proposal for Mutex/Semaphore improvements | ||
|
||
## Introduction | ||
|
||
The motivation for this is to improve the reliability of how locking works via mutexes and semaphores. Currently the implementation makes use of | ||
string formatting, this is not scalable (with respect to the size of developers and features). | ||
|
||
### How do mutexes currently work? | ||
|
||
Nearly all of the code regarding how mutexes work reside in `sync_manager.go`. | ||
Here is an example run of how locks are acquired and released. Some parts have been omitted for brevity, I recommend opening up the file and following through the | ||
examples below. | ||
|
||
\```go | ||
getHolderKey({"Namespace": "argo", Name: "example"}, "node") = "argo/example/node" | ||
|
||
// -- MutexStatus After LockAcquired Call -- | ||
items = ["argo", "example", "node"] | ||
holdingName = "node" | ||
ms.Holding = [MutexHolding{Mutex: lockKey, Holder: holdingName}] | ||
|
||
getResourceKey("argo", "example", "node") = "argo/example/node" | ||
\``` | ||
|
||
This works fine but let's examine another case where this breaks. This is the bug from issue <https://github.com/argoproj/argo-workflows/issues/8684> | ||
|
||
\```go | ||
getHolderKey({"Namespace": "argo", Name: "deadlock-test-sn8p5"}, "deadlock-test-sn8p5") = "argo/deadlock-test-sn8p5/deadlock-test-sn8p5" | ||
|
||
// -- "MutexStatus" After LockAcquired Call -- | ||
items = ["argo", "deadlock-test-sn8p5", "deadlock-test-sn8p5"] | ||
holdingName = "deadlock-test-sn8p5" | ||
ms.Holding = [MutexHolding{Mutex: lockKey, Holder: holdingName}] | ||
|
||
getResourceKey("argo", "deadlock-test-sn8p5", "deadlock-test-sn8p5") = "argo/deadlock-test-sn8p5" | ||
\``` | ||
|
||
### A criticism of the current approach | ||
|
||
The current approach is heavily dependent on string formatting. This makes it incredibly difficult to understand functionality when there are no comments or documentation | ||
outlining the logic used. On top of this, it makes the "Find References" functionality on LSP powered editors near useless. My understanding of the function `getResourceKey` is that | ||
it seems to construct the holder key that is generated via `getHolderKey` through using processed strings generated via `LockAcquired`. | ||
I strongly believe this needs to be refactored to be less dependent on string formatting. I strongly believe that the key obtaining process should be far more obvious, the disconnect between key acquisition and release is inherently flawed and will only lead to further bugs down the line. | ||
|
||
### How should they work? | ||
|
||
We should not be generating information after it has already been generated. It is extremely important to only maintain a single source of truth, | ||
generating this information (the holder key) produces sources of truth at every generation point. Ideally we should be storing this information somewhere | ||
as we will be needing it eventually when a release call is made. | ||
|
||
### Solutions | ||
|
||
#### Store the holder key directly in the MutexStatus/SemaphoreStatus structs | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How will these solutions form the holder keys? Will it use string format? |
||
|
||
I may be wrong here but I don't see why it wouldn't be possible to store this information directly in the status structs. | ||
This seems to be the most simple way of ensuring that there is only a single source of truth. I didn't go with this solution in my existing PR because this would require changing the information MutexStatus/SemaphoreStatus held. But given this | ||
discussion is being opened in a proposal, it seems plausible to go with this solution instead. | ||
|
||
##### Advantages | ||
|
||
* More sensible approach than including a new dependency (ConfigMap). | ||
* Relatively simple change. | ||
* Single source of truth | ||
|
||
##### Disadvantages | ||
|
||
* Changing a field could be major change and might break projects in the ecosystem that rely on this behavior. This has been the behavior of mutexes ever since they have been implemented, it is high risk to change this in my opinion. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does this change a field on the CR itself? Or just the internal structs? If it is just internal structs, a restart of the controller should get everything into the new format. If it is on the CR, perhaps there could be a in-between option -- a struct that is encoded into a string? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think it is just internal structs unfortunately, it gets serialised into json and protobufs hence why I assumed there might be some usage here from the outside world. Look, honestly seems very unlikely though, think we can probably just change the struct. Chucking everything in a ConfigMap would keep things consistent between different versions. I am open to just changing the struct. This is basically an RFC to get some idea of what to do. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Now that I'm more familiar with the Controller, I believe this impacts the We can add fields though, as we do quite often already. So perhaps we deprecate the existing fields and only use the new fields moving forward. In either manner, we would need to ensure backward-compat with older Workflows that used the old format. I think that would mitigate the breaking change part and not have the drawbacks of additional ConfigMaps |
||
|
||
#### Store holding/pending information in a config map | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you add sample configmap content? |
||
|
||
We can store the holder keys inside a config map, on release, we refer to the holder keys inside this config map. There is an existing WIP PR for this, [here](https://github.com/argoproj/argo-workflows/pull/10009). | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you think there is a security concern in multi-tenant env? If Controller updates the holder on Configmap, The controller should have cluster-level RBAC to update the config map across the namespace. |
||
It requires handling pending workflows. If we are going with this solution, a small amendment to deal with pending items will have to be made, we may have to introduce two config maps. One will be used for storing information regarding acquired locks, the other for storing information regarding pending lock acquisitions. | ||
There is a possibility that it might be possible to use a single config map here, but that solution needs to be explored in order to confirm this. | ||
|
||
##### Advantages | ||
|
||
* The change in behavior is transparent to users. | ||
* Existing PR should allow for us to be quite quick in pushing a fix. | ||
|
||
##### Disadvantages | ||
|
||
* Fairly large change. | ||
* Introduces error paths, that being said these error paths being encountered is a bigger problem than the mutex issue itself. The error path here is encountered if the Kubernetes API is down. | ||
* Two sources of truth, the data structure and also the ConfigMap used to store metadata. | ||
|
||
#### Recommendation | ||
|
||
My recommendation is dependent on whether we classify the first solution as a major/breaking change or not. If it is a breaking change the second approach is clearly better as it is transparent | ||
to the users. I think in this case, it is potentially better to push the second solution as a minor change as a temporary fix and introduce the first when we publish the next major version. | ||
|
||
**Even if we don't go with either of these approaches, I strongly believe we should at least change how the mutex acquisition and release mechanism works at the moment because currently it is inherently flawed.** | ||
|
||
## Resources | ||
|
||
* [Open PR](https://github.com/argoproj/argo-workflows/pull/10009) | ||
* [Issue that prompted this discussion](https://github.com/argoproj/argo-workflows/issues/8684) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this file is roughly alphabetical.
MutexStatus
andSempahoreStatus
could be written with backticks which would no longer be spell checked