
Conversation

@Zhangxunmt Zhangxunmt commented Jan 23, 2025

Description

Currently, remote model auto-deploy only happens when the model is not deployed at all, which is detected by checking that the number of running worker nodes is 0. In some edge cases, however, we'd like to auto-deploy the model even when it is in PARTIALLY_DEPLOYED status.
The planning worker nodes are now synced in memory, so when a model is in a partially deployed status, auto-deploy will apply.
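
For illustration, a minimal sketch of the broadened check. The helper name requiresAutoDeployment comes from the diff below; its body here is an assumption, not the exact implementation.

    // Sketch only (assumed logic): auto-deploy when the model is not deployed at all,
    // or when it is partially deployed, i.e. fewer running worker nodes than
    // planning/target worker nodes.
    private boolean requiresAutoDeployment(String[] runningWorkerNodes, String[] targetWorkerNodes) {
        if (runningWorkerNodes == null || runningWorkerNodes.length == 0) {
            return true; // not deployed anywhere
        }
        return targetWorkerNodes != null && runningWorkerNodes.length < targetWorkerNodes.length; // PARTIALLY_DEPLOYED
    }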

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

    if (modelCache == null) {
        return null;
    }
    return modelCache.getTargetWorkerNodes();
Collaborator

@ylwu-amzn ylwu-amzn Jan 23, 2025


We should consider the deploy-to-all-nodes case.
If deploy-to-all-nodes is used, the target worker nodes may be [node1, node2] on day 1.
Then on day 2 the user adds one more node to the cluster, and the target worker nodes should become [node1, node2, node3] so we can deploy to all nodes.

I think the worker nodes from the cache will still be [node1, node2], right? I can't remember the details; maybe it returns null or empty for the deploy-to-all-nodes case? Can you confirm?
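
Purely as an illustration of the scenario above (all names here are hypothetical, and whether the cache actually returns null or empty for deploy-to-all-nodes is the open question):

    // Hypothetical reconciliation for the deploy-to-all-nodes case: if a null/empty
    // cached target set means "deploy to all nodes", the effective targets should follow
    // current cluster membership, so node3 added on day 2 is picked up.
    static Set<String> resolveEffectiveTargetNodes(Set<String> cachedTargetNodes, Set<String> eligibleNodeIds) {
        if (cachedTargetNodes == null || cachedTargetNodes.isEmpty()) {
            return eligibleNodeIds; // e.g. [node1, node2, node3] after the new node joins
        }
        return cachedTargetNodes; // explicit target nodes stay as the user configured them
    }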

Collaborator Author


From what I see, the sync-up job only syncs the running worker nodes, not the target worker nodes. That is another enhancement that's needed: both the target worker nodes and the running worker nodes need to be kept up to date in memory so that all cases are covered.

Collaborator Author


The planning worker nodes are synced in the new commit, so I think this is no longer an issue.

@opensearch-project opensearch-project deleted a comment from dhrubo-os Jun 20, 2025
@Zhangxunmt Zhangxunmt force-pushed the main branch 3 times, most recently from ba93cec to 88332a0 Compare June 25, 2025 04:56
@Zhangxunmt Zhangxunmt had a problem deploying to ml-commons-cicd-env June 25, 2025 04:57 — with GitHub Actions Failure
@Zhangxunmt Zhangxunmt had a problem deploying to ml-commons-cicd-env June 25, 2025 04:57 — with GitHub Actions Failure
@Zhangxunmt Zhangxunmt force-pushed the main branch 2 times, most recently from efbdf48 to 707d7e1 Compare June 26, 2025 04:11
@Zhangxunmt Zhangxunmt temporarily deployed to ml-commons-cicd-env June 26, 2025 04:13 — with GitHub Actions Inactive
@Zhangxunmt Zhangxunmt temporarily deployed to ml-commons-cicd-env June 26, 2025 04:13 — with GitHub Actions Inactive
     * @param modelPlanningWorkerNodes planning worker nodes of all models
     */
    public void syncPlanningWorkerNodes(Map<String, Set<String>> modelPlanningWorkerNodes) {
        log.debug("sync model planning worker nodes");
Collaborator


Since this is a debug log, what do you think about adding node IDs to it? It can help with debugging.

Collaborator Author


This is a map of <modelId: set of nodes>, so logging the full matrix would be too much. Imagine what it looks like when there are hundreds of models and nodes in the domain. We can use the Profile API to check the memory to help debug.
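
For reference, the ML Commons Profile API mentioned above can be queried roughly like this to inspect a model's per-node state; the exact response fields vary by version, so treat this as a pointer rather than an exact contract:

    GET /_plugins/_ml/profile/models/<model_id>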

Collaborator


agreed
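
For readers following the diff, a minimal sketch of what a sync method with the signature shown above might do to the in-memory cache; everything except the signature and the debug log line is a hypothetical name for illustration, not the actual ml-commons internals.

    public void syncPlanningWorkerNodes(Map<String, Set<String>> modelPlanningWorkerNodes) {
        log.debug("sync model planning worker nodes");
        modelPlanningWorkerNodes.forEach((modelId, planningNodes) -> {
            MLModelCache modelCache = modelCaches.get(modelId); // hypothetical in-memory cache lookup
            if (modelCache != null) {
                // refresh the planning/target worker nodes so auto-deploy sees up-to-date membership
                modelCache.setTargetWorkerNodes(new ArrayList<>(planningNodes));
            }
        });
    }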

if (workerNodes == null || workerNodes.length == 0) {
String[] targetWorkerNodes = mlModelManager.getTargetWorkerNodes(modelId);

if (requiresAutoDeployment(workerNodes, targetWorkerNodes)) {
Collaborator


What about calling syncModelPlanningWorkerNodes here if this condition is true, in case predict runs in between syncUpCronJobs and the nodes have been updated? (I guess it will be handled in the next predict?)

Collaborator Author


Good call-out! But here are the things to consider: 1) This is on the runtime path for model prediction, so fewer steps means less overhead. In most cases this sync-up shouldn't be needed, so adding another sync-up layer to every model prediction is probably too expensive overall to justify. 2) This sync-up only updates memory, so it would cause inconsistency between memory and the index until the next sync-up job (which syncs both).

If predict runs in between syncUpCronJobs and the nodes haven't been updated yet, the prediction will still proceed with the partially loaded nodes, and it will automatically route to all nodes after the next sync-up. So the impact on CX is minimal, if not none.

With that said, I don't think another sync-up is justified given the overhead it would add to the system.

Collaborator


sounds good, thanks!

@Zhangxunmt Zhangxunmt had a problem deploying to ml-commons-cicd-env June 30, 2025 18:48 — with GitHub Actions Failure
Collaborator

@pyek-bot pyek-bot left a comment


LGTM

@Zhangxunmt Zhangxunmt merged commit 8fff3f3 into opensearch-project:main Jul 2, 2025
8 of 10 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jul 2, 2025
* run auto deploy remote model in partially deployed status

Signed-off-by: Xun Zhang <xunzh@amazon.com>

* add sync up for planning worker nodes

Signed-off-by: Xun Zhang <xunzh@amazon.com>

* add more UTs and java doc

Signed-off-by: Xun Zhang <xunzh@amazon.com>

* rename syncPlanningWorkerNodes from comments

Signed-off-by: Xun Zhang <xunzh@amazon.com>

---------

Signed-off-by: Xun Zhang <xunzh@amazon.com>
(cherry picked from commit 8fff3f3)
mingshl pushed a commit that referenced this pull request Jul 5, 2025

* run auto deploy remote model in partially deployed status

* add sync up for planning worker nodes

* add more UTs and java doc

* rename syncPlanningWorkerNodes from comments

---------

(cherry picked from commit 8fff3f3)

Signed-off-by: Xun Zhang <xunzh@amazon.com>
Co-authored-by: Xun Zhang <xunzh@amazon.com>
mingshl pushed a commit to mingshl/ml-commons that referenced this pull request Jul 8, 2025
…-project#3423)

* run auto deploy remote model in partially deployed status

Signed-off-by: Xun Zhang <xunzh@amazon.com>

* add sync up for planning worker nodes

Signed-off-by: Xun Zhang <xunzh@amazon.com>

* add more UTs and java doc

Signed-off-by: Xun Zhang <xunzh@amazon.com>

* rename syncPlanningWorkerNodes from comments

Signed-off-by: Xun Zhang <xunzh@amazon.com>

---------

Signed-off-by: Xun Zhang <xunzh@amazon.com>
