-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
the dev of [FEATURE]Auto reload model when cluster rebooted/node rejoin #639
Conversation
Thanks for adding this feature. Can you add more details in description? |
plugin/src/main/java/org/opensearch/ml/plugin/MachineLearningPlugin.java
Outdated
Show resolved
Hide resolved
plugin/src/main/java/org/opensearch/ml/model/MLModelManager.java
Outdated
Show resolved
Hide resolved
plugin/src/main/java/org/opensearch/ml/model/MLModelAndNodeManager.java
Outdated
Show resolved
Hide resolved
plugin/src/main/java/org/opensearch/ml/model/MLModelAndNodeManager.java
Outdated
Show resolved
Hide resolved
plugin/src/main/java/org/opensearch/ml/model/MLModelAndNodeManager.java
Outdated
Show resolved
Hide resolved
plugin/src/main/java/org/opensearch/ml/model/MLModelAndNodeManager.java
Outdated
Show resolved
Hide resolved
plugin/src/main/java/org/opensearch/ml/model/MLModelAndNodeManager.java
Outdated
Show resolved
Hide resolved
plugin/src/main/java/org/opensearch/ml/model/MLModelAndNodeManager.java
Outdated
Show resolved
Hide resolved
plugin/src/main/java/org/opensearch/ml/model/MLModelAndNodeManager.java
Outdated
Show resolved
Hide resolved
plugin/src/main/java/org/opensearch/ml/model/MLModelAndNodeManager.java
Outdated
Show resolved
Hide resolved
plugin/src/main/java/org/opensearch/ml/model/MLModelAndNodeManager.java
Outdated
Show resolved
Hide resolved
plugin/src/main/java/org/opensearch/ml/model/MLModelAndNodeManager.java
Outdated
Show resolved
Hide resolved
plugin/src/main/java/org/opensearch/ml/model/MLModelAndNodeManager.java
Outdated
Show resolved
Hide resolved
plugin/src/main/java/org/opensearch/ml/model/MLModelAndNodeManager.java
Outdated
Show resolved
Hide resolved
plugin/src/main/java/org/opensearch/ml/model/MLModelAndNodeManager.java
Outdated
Show resolved
Hide resolved
Please review the Exceptions thrown, we should not throw any exception to outside of the auto reload logic. |
plugin/src/main/java/org/opensearch/ml/model/MLModelAndNodeManager.java
Outdated
Show resolved
Hide resolved
build.gradle
Outdated
@@ -72,7 +72,7 @@ ext { | |||
} | |||
|
|||
dependencies { | |||
implementation 'junit:junit:${versions.junit}' | |||
implementation 'junit:junit:4.13.1' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't modify this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it will report error when run "./gradlew build"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you rebase the code on main branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, of course, my code is forked from the latest code under the main branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rollback
@@ -26,129 +26,135 @@ public class CommonValue { | |||
public static String WARM_BOX_TYPE = "warm"; | |||
|
|||
public static final String ML_MODEL_INDEX = ".plugins-ml-model"; | |||
public static final String ML_MODEL_RELOAD_INDEX = ".plugins-ml-model-reload"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is the schema of this index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the method of creating index and saving data can be seen in MLModelAutoReLoader#saveLatestReTryTimes
ml-algorithms/build.gradle
Outdated
@@ -41,7 +41,7 @@ dependencies { | |||
} | |||
|
|||
configurations.all { | |||
resolutionStrategy.force 'com.google.protobuf:protobuf-java:3.21.9' | |||
resolutionStrategy.force 'com.google.protobuf:protobuf-java:3.21.7' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why protobuf version is downgraded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in github, When I commit code, there will be the dependency check and I changed it according to the tip. Please see the checks
tab and click [Mend Security Check]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also have a PR to ml-commons, and I didn't change this, no error raised: #645
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what happen,but I merge the code from main branch. It didn't pass the check ,so I changed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rollback
plugin/build.gradle
Outdated
'org.opensearch.ml.model.MLModelManager', | ||
'org.opensearch.ml.action.unload.TransportUnloadModelAction', | ||
'org.opensearch.ml.action.forward.TransportForwardAction', | ||
'org.opensearch.ml.action.syncup.TransportSyncUpOnNodeAction', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why adding these files to exclusions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are original content,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should rebase the main branch first to avoid the confused change shown in your PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rollback
@@ -470,8 +481,25 @@ public List<ExecutorBuilder<?>> getExecutorBuilders(Settings settings) { | |||
ML_THREAD_POOL_PREFIX + PREDICT_THREAD_POOL, | |||
false | |||
); | |||
FixedExecutorBuilder reloadModelThreadPool = new FixedExecutorBuilder( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to create a thread pool to accomplish this, we can reuse the general thread pool in opensearch or even a simple Thread would be sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I have removed it
} | ||
|
||
SearchRequest searchRequest = new SearchRequest(ML_TASK_INDEX); | ||
SearchResponse response = client.execute(SearchAction.INSTANCE, searchRequest).actionGet(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How many records we can get by default with this invocation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not all, so there will be modified, I will changed code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found 3 field "task_type","state" and "worker_node" are all keyword
type.
So I think I can use termQuery
to make search condition task_type=LOAD_MODEL and state =COMPLETED
but field of "worker_node" is a string separated by comma. So it will be hard to use termQuery
, this code I will make a test to see how to make an accurate query and to ensure smaller numbers for query records
|
||
SearchRequest searchRequest = new SearchRequest(ML_TASK_INDEX); | ||
SearchResponse response = client.execute(SearchAction.INSTANCE, searchRequest).actionGet(); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Three checks can be merged into one if check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!isExistedIndex(ML_TASK_INDEX)) { return; }
this check shouldn't be removed, because I found the code will throw the IndexNotFoundException if I didn't check whether ML_TASK_INDEX existed.
And other checks have been merged into 1 by me
Signed-off-by: wujunshen <frank_wjs@hotmail.com>
…mplement auto reload model function Signed-off-by: wujunshen <frank_wjs@hotmail.com>
Signed-off-by: wujunshen <frank_wjs@hotmail.com>
…om one node id Signed-off-by: wujunshen <frank_wjs@hotmail.com>
…ned node Signed-off-by: wujunshen <frank_wjs@hotmail.com>
… modify auto reload funtion to align with the latest requirement Signed-off-by: wujunshen <frank_wjs@hotmail.com>
Signed-off-by: wujunshen <frank_wjs@hotmail.com>
…update data" under index ".plugins-ml-model-reload" Signed-off-by: wujunshen <frank_wjs@hotmail.com>
…update data" under index ".plugins-ml-model-reload" Signed-off-by: wujunshen <frank_wjs@hotmail.com>
…under plugin directory) to unchanged status Signed-off-by: wujunshen <frank_wjs@hotmail.com>
Signed-off-by: wujunshen <frank_wjs@hotmail.com>
Signed-off-by: wujunshen <frank_wjs@hotmail.com>
Signed-off-by: wujunshen <frank_wjs@hotmail.com>
Signed-off-by: wujunshen <frank_wjs@hotmail.com>
Signed-off-by: wujunshen <frank_wjs@hotmail.com>
Signed-off-by: wujunshen <frank_wjs@hotmail.com>
Signed-off-by: wujunshen <frank_wjs@hotmail.com>
Signed-off-by: wujunshen <frank_wjs@hotmail.com>
add test code and modify code to tuning Signed-off-by: wujunshen <frank_wjs@hotmail.com>
add test code and modify code to tuning Signed-off-by: wujunshen <frank_wjs@hotmail.com>
add test code and modify code to tuning Signed-off-by: wujunshen <frank_wjs@hotmail.com>
add test code and modify code to tuning Signed-off-by: wujunshen <frank_wjs@hotmail.com>
add test code and modify code to tuning Signed-off-by: wujunshen <frank_wjs@hotmail.com>
add test code and modify code to tuning Signed-off-by: wujunshen <frank_wjs@hotmail.com>
add test code and modify code to tuning Signed-off-by: wujunshen <frank_wjs@hotmail.com>
according to the second codereview conversation by niu zan, modify the test and implement code to tuning Signed-off-by: wujunshen <frank_wjs@hotmail.com>
according to the second codereview conversation by niu zan, modify the test and implement code to tuning Signed-off-by: wujunshen <frank_wjs@hotmail.com>
according to the second codereview conversation by niu zan, rollback the gradle file
according to the second codereview conversation by niu zan, rollback the gradle file
Description
the new feature:
Auto reload model when cluster rebooted/node rejoin
Issues Resolved
please see: #577
When a ml node under the opensearch cluster halt down with some unknown reasons. The models under this node will be broken and impact the process of the inference or reduced performance. So we add a new feature: When a ml node halt down, we reboot this ml node, the opensearch on this node will auto reload all the models under this node,and user will not reload the model manually. Even in extreme cases, if the reload operation is still unsuccessful, opensearch will also tell the user via logs that the reload was unsuccessful.
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.