Skip to content

remote/coordinator: run reservation scheduling under lock#1872

Open
alextercete wants to merge 1 commit into
labgrid-project:masterfrom
ARM-software:coordinator-scheduling-lock
Open

remote/coordinator: run reservation scheduling under lock#1872
alextercete wants to merge 1 commit into
labgrid-project:masterfrom
ARM-software:coordinator-scheduling-lock

Conversation

@alextercete
Copy link
Copy Markdown
Contributor

Description
Reservation scheduling mutates shared coordinator state (reservations, places, and the derived place.reservation view), but the periodic scheduler path was performing these updates without holding the coordinator lock.

This can lead to races between the poll-based scheduler and concurrent RPC handlers (e.g. acquire/release/reservation operations), potentially resulting in inconsistent state or incorrect reservation transitions.

We now run reservation scheduling under the coordinator lock by making it async and awaiting it in the poll path. This ensures that all mutations of coordinator state are serialized consistently with the rest of the coordinator operations.

Renamed schedule_reservations to _schedule_reservations to reflect that it is now an internal async helper. All usages in the coordinator were updated. This method is not part of a public API, so renaming it is not expected to impact external users.

Testing

  • Ran existing unit tests covering reservation and acquire/release flows.
  • Exercised reservation creation/cancel and place acquire/release scenarios manually.
  • No deadlocks were observed in these runs.
  • This patch does not add a dedicated stress test, and the current tests do not prove the absence of deadlocks.

Performance

  • Measured _schedule_reservations runtime under load. With 100 places and 200 reservations created in parallel, observed runtime was ~0.0013s, with worst case ~0.0036 s.
  • RPC paths were already invoking reservation scheduling while holding the coordinator lock; the primary change is that the poll-based scheduler path is now also serialized under the same lock.

This patch focuses on fixing the scheduler locking model and does not change coordinator scheduling behavior.

Checklist

  • PR has been tested

Reservation scheduling mutates shared coordinator state (`reservations`,
`places`, and the derived `place.reservation` view), but the periodic
scheduler path was performing these updates without holding the
coordinator lock.

This can lead to races between the poll-based scheduler and concurrent
RPC handlers (e.g. acquire/release/reservation operations), potentially
resulting in inconsistent state or incorrect reservation transitions.

We now run reservation scheduling under the coordinator lock by making
it async and awaiting it in the poll path. This ensures that all
mutations of coordinator state are serialized consistently with the rest
of the coordinator operations.

Renamed `schedule_reservations` to `_schedule_reservations` to reflect
that it is now an internal async helper. All usages in the coordinator
were updated.  This method is not part of a public API, so renaming it
is not expected to impact external users.

Testing
=======

- Ran existing unit tests covering reservation and acquire/release
  flows.
- Exercised reservation creation/cancel and place acquire/release
  scenarios manually.
- No deadlocks were observed in these runs.
- This patch does not add a dedicated stress test, and the current tests
  do not prove the absence of deadlocks.

Performance
===========

- Measured `_schedule_reservations` runtime under load. With 100 places
  and 200 reservations created in parallel, observed runtime was
  ~0.0013s, with worst case ~0.0036 s.
- RPC paths were already invoking reservation scheduling while holding
  the coordinator lock; the primary change is that the poll-based
  scheduler path is now also serialized under the same lock.

This patch focuses on fixing the scheduler locking model and does not
change coordinator scheduling behavior.

Signed-off-by: Alex Tercete <alex.tercete@arm.com>
Reviewed-by: Alex Tercete <alex.tercete@arm.com> # gatekeeper
Co-authored-by: Idan Saadon <idan.saadon@arm.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 0% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 46.0%. Comparing base (c027b2d) to head (1169700).
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
labgrid/remote/coordinator.py 0.0% 9 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master   #1872     +/-   ##
========================================
- Coverage    46.0%   46.0%   -0.1%     
========================================
  Files         180     180             
  Lines       14439   14441      +2     
========================================
  Hits         6654    6654             
- Misses       7785    7787      +2     
Flag Coverage Δ
3.10 46.0% <0.0%> (-0.1%) ⬇️
3.11 46.0% <0.0%> (-0.1%) ⬇️
3.12 46.0% <0.0%> (-0.1%) ⬇️
3.13 46.0% <0.0%> (-0.1%) ⬇️
3.14 46.0% <0.0%> (-0.1%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Emantor Emantor requested review from Emantor and jluebbe May 20, 2026 10:48
Copy link
Copy Markdown
Member

@Emantor Emantor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the locking, however I do not see what we gain from turning _schedule_reservation() into an async function.

logging.exception("error during get places")

def schedule_reservations(self):
async def _schedule_reservations(self):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly do we gain from turning thie into an async function? As far as I can tell this function does not employ an async primitive so it could be parallelized or am I missing something?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants