@domsolutions domsolutions commented Sep 22, 2025

Motivation

When multiple replicas are deployed, for example:

apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  creationTimestamp: "2025-09-19T10:10:31Z"
  generation: 1
  name: autotest-mlserver
  namespace: seldon-mesh
spec:
  maxReplicas: 4
  minReplicas: 4
  replicas: 4
  serverConfig: mlserver
  statefulSetPersistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain

With multiple models on each replica, for example:

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  creationTimestamp: "2025-09-17T06:30:35Z"
  finalizers:
  - seldon.model.finalizer
  generation: 2
  name: automatedtests-1-echo-2
  namespace: seldon-mesh
  resourceVersion: "3233061"
  uid: bc113e93-1e83-464d-bca1-6db11f63355a
spec:
  memory: 20k
  parameters:
  - name: response_length
    value: "10"
  replicas: 1
  requirements:
  - mlserver
  server: autotest-mlserver
  storageUri: gs://seldon-models/integration-tests/models/mlserver/echo
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  creationTimestamp: "2025-09-17T06:30:35Z"
  finalizers:
  - seldon.model.finalizer
  generation: 2
  name: automatedtests-1-echo-3
  namespace: seldon-mesh
  resourceVersion: "3233067"
  uid: a7683b5b-d2b8-430c-b206-9e2a86117459
spec:
  memory: 20k
  parameters:
  - name: response_length
    value: "10"
  replicas: 1
  requirements:
  - mlserver
  server: autotest-mlserver
  storageUri: gs://seldon-models/integration-tests/models/mlserver/echo

When the Server CR is then deleted, all replicas attempt to drain concurrently. Occasionally, one or two agents are blocked from completing the drain by this call:

	s.waiter.wait(serverName, serverReplicaIdx)

There, they are waiting for the models that were loaded on them to be rescheduled. There appears to be a race: while handling one replica's drain request, the scheduler may attempt to reschedule that replica's models onto another replica that is also draining, but whose drain request has not yet been received by the scheduler.
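To make the suspected race concrete, here is a minimal sketch in Go. It is not the scheduler's actual code: `pickReplica` and the `draining` map are hypothetical stand-ins for the placement logic and for the set of drain requests the scheduler has seen so far.

```go
package main

import "fmt"

// pickReplica is a hypothetical stand-in for the scheduler's placement logic.
// It returns the lowest-indexed replica that is neither the source replica
// nor already known to be draining.
func pickReplica(numReplicas, source int, draining map[int]bool) (int, bool) {
	for idx := 0; idx < numReplicas; idx++ {
		if idx == source || draining[idx] {
			continue
		}
		return idx, true
	}
	return -1, false
}

func main() {
	// Replica 0's drain request has arrived. Replica 1 is also draining,
	// but the scheduler has not received its request yet.
	known := map[int]bool{0: true}
	target, ok := pickReplica(4, 0, known)
	fmt.Println(target, ok) // prints "1 true": the model lands on a replica that is about to drain

	// Once every drain request is known, no replica is eligible.
	all := map[int]bool{0: true, 1: true, 2: true, 3: true}
	target, ok = pickReplica(4, 0, all)
	fmt.Println(target, ok) // prints "-1 false"
}
```

In the first case the model is moved to replica 1 and starts loading there, which is the situation the description above suspects of leaving a draining agent blocked.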

Summary of changes

  • Wait 0.5 seconds before rescheduling, to allow time for the drain requests from all replicas to arrive
  • Additionally, try to reschedule models that are still loading (sent to the agent and awaiting the load ACK)
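The two changes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `collectDrains`, `modelsToReschedule`, `placement`, and the `state*` constants are hypothetical names, and drain requests are modelled as replica indices arriving on a channel.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

type modelState int

const (
	stateLoading  modelState = iota // sent to the agent, awaiting the load ACK
	stateLoaded                     // load confirmed by the agent
	stateUnloaded                   // not present on the replica
)

// placement records where a model lives and in which state (hypothetical type).
type placement struct {
	replica int
	state   modelState
}

// collectDrains gathers drain requests (replica indices) until the grace
// period elapses or the channel is closed, so that later placement decisions
// see every replica that is draining.
func collectDrains(reqs <-chan int, grace time.Duration) map[int]bool {
	draining := make(map[int]bool)
	timer := time.NewTimer(grace)
	defer timer.Stop()
	for {
		select {
		case idx, ok := <-reqs:
			if !ok {
				return draining
			}
			draining[idx] = true
		case <-timer.C:
			return draining
		}
	}
}

// modelsToReschedule returns, in sorted order, the models on draining
// replicas that are either loaded or still loading.
func modelsToReschedule(placements map[string]placement, draining map[int]bool) []string {
	var names []string
	for name, p := range placements {
		if draining[p.replica] && (p.state == stateLoaded || p.state == stateLoading) {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	reqs := make(chan int, 4)
	reqs <- 0
	reqs <- 1
	draining := collectDrains(reqs, 500*time.Millisecond)

	placements := map[string]placement{
		"automatedtests-1-echo-2": {replica: 0, state: stateLoaded},
		"automatedtests-1-echo-3": {replica: 1, state: stateLoading},
	}
	fmt.Println(modelsToReschedule(placements, draining))
}
```

A fixed grace period is a pragmatic trade-off: it delays every drain by up to half a second, and it narrows the race window rather than closing it if a drain request arrives later than that.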

Checklist

  • Added/updated unit tests
  • Added/updated documentation
  • Checked for typos in variable names, comments, etc.
  • Added licences for new files

Testing

@domsolutions domsolutions requested a review from lc525 as a code owner September 22, 2025 14:26
@domsolutions domsolutions changed the title fix(scheduler): blocked draining agents when waiting for model to be loaded which… fix(scheduler): block draining agents Sep 22, 2025
@domsolutions domsolutions changed the title fix(scheduler): block draining agents fix(scheduler): blocked draining agents Sep 22, 2025
@lc525 lc525 added the v2 label Sep 23, 2025