run auto deploy remote model in partially deployed status #3423
Conversation
if (modelCache == null) {
    return null;
}
return modelCache.getTargetWorkerNodes();
We should consider the deploy-to-all-nodes case.
If deploy-to-all-nodes is used, the target worker nodes may be [node1, node2] on day 1.
Then on day 2 the user adds one more node to the cluster, so the target worker nodes should become [node1, node2, node3] and we can deploy to all three nodes.
I think the worker nodes from the cache will still be [node1, node2], right? I can't remember the details; maybe it returns null or empty for the deploy-to-all-nodes case? Can you confirm?
From what I see, the sync-up job only syncs the worker nodes, not the target worker nodes. This is another enhancement we need: both the target worker nodes and the running worker nodes have to be kept up to date in memory so it can cover all cases.
Planning worker nodes are synced in the new commit, so I think this is not an issue anymore.
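For readers following this thread, here is a minimal sketch of how a sync-up could keep the cached planning/target worker nodes current so getTargetWorkerNodes reflects cluster changes such as a newly added node. The class and method names (ModelCacheHolder, refreshPlanningWorkerNodes) are hypothetical illustrations, not the actual ml-commons code.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class ModelCacheHolder {
    // modelId -> planning (target) worker node ids, refreshed by the periodic sync-up job
    private final Map<String, Set<String>> planningWorkerNodes = new ConcurrentHashMap<>();

    // Replace the cached plan with the latest one read from the model index,
    // e.g. [node1, node2] becomes [node1, node2, node3] after a node joins.
    void refreshPlanningWorkerNodes(Map<String, Set<String>> latest) {
        planningWorkerNodes.clear();
        planningWorkerNodes.putAll(latest);
    }

    String[] getTargetWorkerNodes(String modelId) {
        Set<String> nodes = planningWorkerNodes.get(modelId);
        return nodes == null ? null : nodes.toArray(new String[0]);
    }
}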
 * @param modelPlanningWorkerNodes planning worker nodes of all models
 */
public void syncPlanningWorkerNodes(Map<String, Set<String>> modelPlanningWorkerNodes) {
    log.debug("sync model planning worker nodes");
Since this is a debug log, what do you think about adding the node IDs to it? It could help with debugging.
This is a Map of <modelId: set of nodes>, so logging the full matrix would be too much. Imagine what it would look like with hundreds of models and nodes in the domain. We can use the Profile API to inspect the memory when debugging.
agreed
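As a follow-up to the logging discussion, a small sketch of keeping the debug line cheap by logging only the model count rather than the full map. SLF4J and the wrapper class name are assumptions made only to keep the snippet self-contained; the real project may wire its logger differently.

import java.util.Map;
import java.util.Set;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class PlanningWorkerNodeSync {
    private static final Logger log = LoggerFactory.getLogger(PlanningWorkerNodeSync.class);

    public void syncPlanningWorkerNodes(Map<String, Set<String>> modelPlanningWorkerNodes) {
        // Log only the number of models; the full modelId -> nodes matrix is too
        // noisy with hundreds of models, and the Profile API can show the details.
        log.debug("sync model planning worker nodes for {} models", modelPlanningWorkerNodes.size());
        // ... update the in-memory cache per model ...
    }
}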
- if (workerNodes == null || workerNodes.length == 0) {
+ String[] targetWorkerNodes = mlModelManager.getTargetWorkerNodes(modelId);
+ if (requiresAutoDeployment(workerNodes, targetWorkerNodes)) {
What about calling syncModelPlanningWorkerNodes here if this condition is true, in case the predict runs in between syncUpCronJobs and the nodes have been updated? (I guess it will be handled by the next predict?)
Good call out! But here are the things to consider: 1) This sits in the runtime path for model prediction, so fewer steps mean less overhead. For most cases this sync-up shouldn't be needed, so adding another sync-up layer to every model prediction is probably too expensive overall to justify. 2) This sync-up only updates memory, so it would cause inconsistency between memory and the index until the next sync-up job (which syncs both).
If the predict runs between syncUpCronJobs and the nodes haven't been updated yet, the prediction will still proceed on the partially loaded nodes, and routing will automatically cover all nodes after the next sync-up. So the customer impact is minimal, if not none.
With that said, I don't think another sync-up is justified given the overhead it adds to the system.
sounds good, thanks!
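For context on the diff above, here is a sketch of what a requiresAutoDeployment check like this might look like. It is an illustration, not the exact PR implementation: auto deploy is needed when no worker node is running, or when fewer nodes are running than the plan calls for.

final class AutoDeployCheck {
    // true when the model is not deployed at all, or is PARTIALLY_DEPLOYED,
    // i.e. running on fewer nodes than the planning/target worker nodes
    static boolean requiresAutoDeployment(String[] workerNodes, String[] targetWorkerNodes) {
        boolean notDeployed = workerNodes == null || workerNodes.length == 0;
        boolean partiallyDeployed = !notDeployed
            && targetWorkerNodes != null
            && workerNodes.length < targetWorkerNodes.length;
        return notDeployed || partiallyDeployed;
    }
}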
pyek-bot left a comment
LGTM
* run auto deploy remote model in partially deployed status
* add sync up for planning worker nodes
* add more UTs and java doc
* rename syncPlanningWorkerNodes from comments

Signed-off-by: Xun Zhang <xunzh@amazon.com>
(cherry picked from commit 8fff3f3)
Description
Currently, the remote model auto-deploy only happens when the model is not deployed at all, determined by checking that the number of running worker nodes is 0. But in some edge cases we'd like to auto-deploy the model even when it's in PARTIALLY_DEPLOYED status.
The planning worker nodes are now synced in memory, so when the model is in a partially deployed status, auto deploy will apply (see the short demo below).
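To make the before/after behavior concrete, a small usage demo built on the AutoDeployCheck sketch shown earlier in the review thread (the class and the node names are illustrative, not taken from the PR):

public class AutoDeployDemo {
    public static void main(String[] args) {
        String[] running = { "node1" };                   // currently running worker nodes
        String[] planned = { "node1", "node2", "node3" }; // planning/target worker nodes

        // PARTIALLY_DEPLOYED: running nodes are a strict subset of the plan, so auto deploy applies
        System.out.println(AutoDeployCheck.requiresAutoDeployment(running, planned)); // true

        // Fully deployed: nothing to do
        System.out.println(AutoDeployCheck.requiresAutoDeployment(planned, planned)); // false

        // Not deployed at all: the previously existing auto-deploy case
        System.out.println(AutoDeployCheck.requiresAutoDeployment(new String[0], planned)); // true
    }
}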
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
Commits are signed per the DCO using --signoff.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following the Developer Certificate of Origin and signing off your commits, please check here.