
Conversation

@Zhangxunmt Zhangxunmt commented Jan 23, 2025

Description

Currently, remote model auto-deploy only happens when the model is not deployed at all, which is detected by checking that the number of running worker nodes is 0. In some edge cases, however, we'd like to auto-deploy the model even when it is in PARTIALLY_DEPLOYED status.
The planning worker nodes are now synced in memory, so when a model is in a partially deployed status, auto-deploy will apply.
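
For illustration, a minimal sketch of the broadened check. The helper name requiresAutoDeployment comes from the diff below; its body here is an assumption, not the exact implementation.

    // Sketch only (assumed logic): auto-deploy when the model is not deployed at all,
    // or when it is partially deployed, i.e. fewer running worker nodes than
    // planning/target worker nodes.
    private boolean requiresAutoDeployment(String[] runningWorkerNodes, String[] targetWorkerNodes) {
        if (runningWorkerNodes == null || runningWorkerNodes.length == 0) {
            return true; // not deployed anywhere
        }
        return targetWorkerNodes != null && runningWorkerNodes.length < targetWorkerNodes.length; // PARTIALLY_DEPLOYED
    }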

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

    if (modelCache == null) {
        return null;
    }
    return modelCache.getTargetWorkerNodes();
Collaborator

@ylwu-amzn ylwu-amzn Jan 23, 2025


We should consider the deploy-to-all-nodes case.
If deploy-to-all-nodes is used, the target worker nodes may be [node1, node2] on day 1.
Then on day 2 the user adds one more node to the cluster, and the target worker nodes should become [node1, node2, node3] so we can deploy to all nodes.

I think the worker nodes from the cache will still be [node1, node2], right? I can't remember the details; maybe it returns null or empty for the deploy-to-all-nodes case? Can you confirm?
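
Purely as an illustration of the scenario above (all names here are hypothetical, and whether the cache actually returns null or empty for deploy-to-all-nodes is the open question):

    // Hypothetical reconciliation for the deploy-to-all-nodes case: if a null/empty
    // cached target set means "deploy to all nodes", the effective targets should follow
    // current cluster membership, so node3 added on day 2 is picked up.
    static Set<String> resolveEffectiveTargetNodes(Set<String> cachedTargetNodes, Set<String> eligibleNodeIds) {
        if (cachedTargetNodes == null || cachedTargetNodes.isEmpty()) {
            return eligibleNodeIds; // e.g. [node1, node2, node3] after the new node joins
        }
        return cachedTargetNodes; // explicit target nodes stay as the user configured them
    }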

Collaborator Author


From what I see, the sync-up job only syncs the running worker nodes, not the target worker nodes. That is another enhancement that's needed: both the target worker nodes and the running worker nodes need to be kept up to date in memory so that all cases are covered.

Collaborator Author


The planning worker nodes are synced in the new commit, so I think this is no longer an issue.

@opensearch-project opensearch-project deleted a comment from dhrubo-os Jun 20, 2025
@Zhangxunmt Zhangxunmt force-pushed the main branch 3 times, most recently from ba93cec to 88332a0 Compare June 25, 2025 04:56
@Zhangxunmt Zhangxunmt had a problem deploying to ml-commons-cicd-env June 25, 2025 04:57 — with GitHub Actions Failure
@Zhangxunmt Zhangxunmt had a problem deploying to ml-commons-cicd-env June 25, 2025 04:57 — with GitHub Actions Failure
@Zhangxunmt Zhangxunmt force-pushed the main branch 2 times, most recently from efbdf48 to 707d7e1 Compare June 26, 2025 04:11
@Zhangxunmt Zhangxunmt temporarily deployed to ml-commons-cicd-env June 26, 2025 04:13 — with GitHub Actions Inactive
@Zhangxunmt Zhangxunmt temporarily deployed to ml-commons-cicd-env June 26, 2025 04:13 — with GitHub Actions Inactive
     * @param modelPlanningWorkerNodes planning worker nodes of all models
     */
    public void syncPlanningWorkerNodes(Map<String, Set<String>> modelPlanningWorkerNodes) {
        log.debug("sync model planning worker nodes");
Collaborator


Since this is a debug log, what do you think about adding node IDs to it? It can help with debugging.

Collaborator Author


This is a map of <modelId: set of nodes>, so logging the full matrix would be too much. Imagine what it looks like when there are hundreds of models and nodes in the domain. We can use the Profile API to check the memory to help debug.
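
For reference, the ML Commons Profile API mentioned above can be queried roughly like this to inspect a model's per-node state; the exact response fields vary by version, so treat this as a pointer rather than an exact contract:

    GET /_plugins/_ml/profile/models/<model_id>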

Collaborator


agreed
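
For readers following the diff, a minimal sketch of what a sync method with the signature shown above might do to the in-memory cache; everything except the signature and the debug log line is a hypothetical name for illustration, not the actual ml-commons internals.

    public void syncPlanningWorkerNodes(Map<String, Set<String>> modelPlanningWorkerNodes) {
        log.debug("sync model planning worker nodes");
        modelPlanningWorkerNodes.forEach((modelId, planningNodes) -> {
            MLModelCache modelCache = modelCaches.get(modelId); // hypothetical in-memory cache lookup
            if (modelCache != null) {
                // refresh the planning/target worker nodes so auto-deploy sees up-to-date membership
                modelCache.setTargetWorkerNodes(new ArrayList<>(planningNodes));
            }
        });
    }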

if (workerNodes == null || workerNodes.length == 0) {
String[] targetWorkerNodes = mlModelManager.getTargetWorkerNodes(modelId);

if (requiresAutoDeployment(workerNodes, targetWorkerNodes)) {
Collaborator


What about calling syncModelPlanningWorkerNodes here if this condition is true, in case predict runs in between syncUpCronJobs and the nodes have been updated? (I guess it will be handled in the next predict?)

Collaborator Author


Good call-out! But here are the things to consider: 1) This is on the runtime path for model prediction, so fewer steps means less overhead. In most cases this sync-up shouldn't be needed, so adding another sync-up layer to every model prediction is probably too expensive overall to justify. 2) This sync-up only updates memory, so it would cause inconsistency between memory and the index until the next sync-up job (which syncs both).

If predict runs in between syncUpCronJobs and the nodes haven't been updated yet, the prediction will still proceed with the partially loaded nodes, and it will automatically route to all nodes after the next sync-up. So the impact on CX is minimal, if not none.

With that said, I don't think another sync-up is justified given the overhead it would add to the system.

Collaborator


sounds good, thanks!

@Zhangxunmt Zhangxunmt had a problem deploying to ml-commons-cicd-env June 30, 2025 18:48 — with GitHub Actions Failure
Collaborator

@pyek-bot pyek-bot left a comment


LGTM

@Zhangxunmt Zhangxunmt merged commit 8fff3f3 into opensearch-project:main Jul 2, 2025
8 of 10 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jul 2, 2025
* run auto deploy remote model in partially deployed status

Signed-off-by: Xun Zhang <xunzh@amazon.com>

* add sync up for planning worker nodes

Signed-off-by: Xun Zhang <xunzh@amazon.com>

* add more UTs and java doc

Signed-off-by: Xun Zhang <xunzh@amazon.com>

* rename syncPlanningWorkerNodes from comments

Signed-off-by: Xun Zhang <xunzh@amazon.com>

---------

Signed-off-by: Xun Zhang <xunzh@amazon.com>
(cherry picked from commit 8fff3f3)
mingshl pushed a commit that referenced this pull request Jul 5, 2025

* run auto deploy remote model in partially deployed status

* add sync up for planning worker nodes

* add more UTs and java doc

* rename syncPlanningWorkerNodes from comments

---------

(cherry picked from commit 8fff3f3)

Signed-off-by: Xun Zhang <xunzh@amazon.com>
Co-authored-by: Xun Zhang <xunzh@amazon.com>
mingshl pushed a commit to mingshl/ml-commons that referenced this pull request Jul 8, 2025
…-project#3423)

* run auto deploy remote model in partially deployed status

Signed-off-by: Xun Zhang <xunzh@amazon.com>

* add sync up for planning worker nodes

Signed-off-by: Xun Zhang <xunzh@amazon.com>

* add more UTs and java doc

Signed-off-by: Xun Zhang <xunzh@amazon.com>

* rename syncPlanningWorkerNodes from comments

Signed-off-by: Xun Zhang <xunzh@amazon.com>

---------

Signed-off-by: Xun Zhang <xunzh@amazon.com>
