remote/coordinator: run reservation scheduling under lock#1872
remote/coordinator: run reservation scheduling under lock#1872alextercete wants to merge 1 commit into
Conversation
Reservation scheduling mutates shared coordinator state (`reservations`, `places`, and the derived `place.reservation` view), but the periodic scheduler path was performing these updates without holding the coordinator lock. This can lead to races between the poll-based scheduler and concurrent RPC handlers (e.g. acquire/release/reservation operations), potentially resulting in inconsistent state or incorrect reservation transitions. We now run reservation scheduling under the coordinator lock by making it async and awaiting it in the poll path. This ensures that all mutations of coordinator state are serialized consistently with the rest of the coordinator operations. Renamed `schedule_reservations` to `_schedule_reservations` to reflect that it is now an internal async helper. All usages in the coordinator were updated. This method is not part of a public API, so renaming it is not expected to impact external users. Testing ======= - Ran existing unit tests covering reservation and acquire/release flows. - Exercised reservation creation/cancel and place acquire/release scenarios manually. - No deadlocks were observed in these runs. - This patch does not add a dedicated stress test, and the current tests do not prove the absence of deadlocks. Performance =========== - Measured `_schedule_reservations` runtime under load. With 100 places and 200 reservations created in parallel, observed runtime was ~0.0013s, with worst case ~0.0036 s. - RPC paths were already invoking reservation scheduling while holding the coordinator lock; the primary change is that the poll-based scheduler path is now also serialized under the same lock. This patch focuses on fixing the scheduler locking model and does not change coordinator scheduling behavior. Signed-off-by: Alex Tercete <alex.tercete@arm.com> Reviewed-by: Alex Tercete <alex.tercete@arm.com> # gatekeeper Co-authored-by: Idan Saadon <idan.saadon@arm.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1872 +/- ##
========================================
- Coverage 46.0% 46.0% -0.1%
========================================
Files 180 180
Lines 14439 14441 +2
========================================
Hits 6654 6654
- Misses 7785 7787 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Emantor
left a comment
There was a problem hiding this comment.
I agree with the locking, however I do not see what we gain from turning _schedule_reservation() into an async function.
| logging.exception("error during get places") | ||
|
|
||
| def schedule_reservations(self): | ||
| async def _schedule_reservations(self): |
There was a problem hiding this comment.
What exactly do we gain from turning thie into an async function? As far as I can tell this function does not employ an async primitive so it could be parallelized or am I missing something?
Description
Reservation scheduling mutates shared coordinator state (
reservations,places, and the derivedplace.reservationview), but the periodic scheduler path was performing these updates without holding the coordinator lock.This can lead to races between the poll-based scheduler and concurrent RPC handlers (e.g. acquire/release/reservation operations), potentially resulting in inconsistent state or incorrect reservation transitions.
We now run reservation scheduling under the coordinator lock by making it async and awaiting it in the poll path. This ensures that all mutations of coordinator state are serialized consistently with the rest of the coordinator operations.
Renamed
schedule_reservationsto_schedule_reservationsto reflect that it is now an internal async helper. All usages in the coordinator were updated. This method is not part of a public API, so renaming it is not expected to impact external users.Testing
Performance
_schedule_reservationsruntime under load. With 100 places and 200 reservations created in parallel, observed runtime was ~0.0013s, with worst case ~0.0036 s.This patch focuses on fixing the scheduler locking model and does not change coordinator scheduling behavior.
Checklist