-
Notifications
You must be signed in to change notification settings - Fork 292
xapi_xenops: Avoid a race during suspend #6454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
xapi_xenops: Avoid a race during suspend #6454
Conversation
I've manually tested this fix and could not reproduce the issue with the xenopsd assert making a snapshot unresumable anymore. |
ocaml/xapi/xapi_xenops.ml
Outdated
@@ -1714,7 +1717,8 @@ module Xenopsd_metadata = struct | |||
in | |||
delete_nolock ~__context id ; | |||
md | |||
) | |||
) | |||
(fun () -> try Mutex.unlock metadata_m with Sys_error _ -> ()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I this correct even if the lock was aquired not in this function but higher up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not strictly - you're right. But there's no other users of the function and in this case we no longer need to hold the lock once pull
is done, so it's alright
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to leave a comment that release of the lock is ok. Altogether it would be better if we don't have to check dynamically if the lock is held already
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On line 2094 there is a with_lock
call which will do an unlock at the end of that call, and here if the lock is already held higher up, and a Mutex.unlock
is called, metadata_m
will be unlocked, and later on when it reaches the with_lock
on line 2094, it will be unlocked again. Is this true? If so, it is not safe to unlock a mutex twice in ocaml, see https://ocaml.org/manual/5.1/api/Mutex.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought so too - but didn't see any exceptions in testing. (the exception will anyhow be caught and logged by finally
)
So is the problem that this code in
grabs a stale |
Yes! Exactly. Perhaps I should describe it better in the commit message/code comment? |
5ed3c3e
to
ee19704
Compare
ocaml/xapi/xapi_xenops.ml
Outdated
@@ -1687,7 +1687,10 @@ module Xenopsd_metadata = struct | |||
|
|||
(* Unregisters a VM with xenopsd, and cleans up metadata and caches *) | |||
let pull ~__context id = | |||
with_lock metadata_m (fun () -> | |||
(* Continue even if mutex is already locked one level above *) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it better to use a reentrant lock here?
I think Colin has done something before to make the db lock reentrant (fae9b55), could we borrow that or use a library?
ocaml/xapi/xapi_xenops.ml
Outdated
@@ -1714,7 +1717,11 @@ module Xenopsd_metadata = struct | |||
in | |||
delete_nolock ~__context id ; | |||
md | |||
) | |||
) | |||
(* NOTE: This function has a single user and releasing the upper-level |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it has a single user always, how about making it as pull_no_lock
so that it could solely rely on its caller to acquire the lock?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. The outer lock will be acquired only when "power_state = `Suspended".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It won't be called unless power_state = Suspended
, so I could just make it pull_no_lock
indeed and leave the reentrant lock refactoring for the future if someone else starts using this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, locking/unlocking twice seems a bit strange to me.
So the xapi event handler pushed the wrong metadata to the xenopsd, and then the |
This could be another way of doing this - though I am not 100% sure we wouldn't prevent some desired behaviour this way. |
I just thought the locking scope is bigger and the lock is shared by all operations. But it may be fine as the bigger locking scope only covers the suspended case. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, pressed the Approve button by mistake 😅
ee19704
to
c8eb4c8
Compare
It's not just shrinking the window, but eliminating it: if the |
c8eb4c8
to
1e67e7d
Compare
As described in [xapi-project#6451](xapi-project#6451), a xapi event could prevent update_vm from pulling the latest Xenopsd metadata, overwriting it with stale information. In case of suspend, this would make the snapshot unresumable, raising an assert in xenopsd due to incongruities in memory values. Instead pull the xenopsd metadata right before updating DB.power_state in Xapi_vm_lifecycle.force_state_reset_keep_current_operations, eliminating the window for the race. Signed-off-by: Andrii Sultanov <andriy.sultanov@vates.tech>
1e67e7d
to
d875cfa
Compare
As described in #6451, a xapi event could prevent update_vm from pulling the latest Xenopsd metadata, overwriting it with stale information. In case of suspend, this would make the snapshot unresumable, raising an assert in xenopsd due to incongruities in memory values.
Instead pull the xenopsd metadata right before updating DB.power_state in
Xapi_vm_lifecycle.force_state_reset_keep_current_operations
, eliminating the window for the race.Closes #6451