the dev of [FEATURE]Auto reload model when cluster rebooted/node rejoin #639

wujunshen · 2022-12-19T14:04:09Z

Description

New feature:
Auto reload model when cluster rebooted/node rejoin

Issues Resolved

Please see #577.

When an ML node in an OpenSearch cluster goes down for an unknown reason, the models loaded on that node become unavailable, which breaks inference or reduces its performance. This PR adds a new feature: when the ML node is rebooted after going down, OpenSearch on that node automatically reloads all the models that were loaded on it, so the user does not need to reload them manually. In the extreme case where the reload still fails, OpenSearch reports the failure to the user through the logs.

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

ylwu-amzn · 2022-12-20T00:41:21Z

Thanks for adding this feature. Can you add more details to the description?
BTW, please make sure ./gradlew build passes locally before publishing the PR.

plugin/src/main/java/org/opensearch/ml/plugin/MachineLearningPlugin.java

plugin/src/main/java/org/opensearch/ml/model/MLModelManager.java

plugin/src/main/java/org/opensearch/ml/model/MLModelAndNodeManager.java

zane-neo · 2022-12-21T08:16:04Z

Please review the exceptions thrown; we should not let any exception escape the auto reload logic.
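One way to read this request is sketched below: each model is reloaded inside its own try/catch, so a failure is recorded rather than propagated to the caller. This is a minimal sketch in plain Java, not the PR's code; the class SafeAutoReloader, its loader callback, and the failures list are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: none of these names come from the PR.
class SafeAutoReloader {
    private final Consumer<String> loader;              // loads one model by id; may throw
    private final List<String> failures = new ArrayList<>();

    SafeAutoReloader(Consumer<String> loader) {
        this.loader = loader;
    }

    /** Tries to reload every model and never propagates an exception to the caller. */
    int reloadAll(List<String> modelIds) {
        int reloaded = 0;
        for (String modelId : modelIds) {
            try {
                loader.accept(modelId);
                reloaded++;
            } catch (Exception e) {
                // Record the failure instead of rethrowing; a real plugin would log it.
                failures.add(modelId + ": " + e.getMessage());
            }
        }
        return reloaded;
    }

    /** Returns one entry per model that failed to reload. */
    List<String> failures() {
        return failures;
    }
}
```

Catching per model rather than around the whole loop means one broken model does not stop the remaining models from being reloaded.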

common/src/main/java/org/opensearch/ml/common/CommonValue.java

plugin/build.gradle

plugin/src/main/java/org/opensearch/ml/model/MLModelAndNodeManager.java

zane-neo · 2022-12-27T01:55:14Z

build.gradle

@@ -72,7 +72,7 @@ ext {
 }

 dependencies {
-    implementation 'junit:junit:${versions.junit}'
+    implementation 'junit:junit:4.13.1'


We shouldn't modify this.

It reports an error when running "./gradlew build".

Did you rebase your code on the main branch?

Yes, of course. My code is forked from the latest code on the main branch.

zane-neo · 2022-12-27T01:56:36Z

common/src/main/java/org/opensearch/ml/common/CommonValue.java

@@ -26,129 +26,135 @@ public class CommonValue {
    public static String WARM_BOX_TYPE = "warm";

    public static final String ML_MODEL_INDEX = ".plugins-ml-model";
+    public static final String ML_MODEL_RELOAD_INDEX = ".plugins-ml-model-reload";


Where is the schema of this index?

The logic for creating the index and saving data can be seen in MLModelAutoReLoader#saveLatestReTryTimes.

zane-neo · 2022-12-27T01:57:16Z

ml-algorithms/build.gradle

@@ -41,7 +41,7 @@ dependencies {
 }

 configurations.all {
-    resolutionStrategy.force 'com.google.protobuf:protobuf-java:3.21.9'
+    resolutionStrategy.force 'com.google.protobuf:protobuf-java:3.21.7'


Why is the protobuf version downgraded?

On GitHub, a dependency check runs when I commit code, and I changed the version according to its tip. Please see the Checks tab and click [Mend Security Check].

I also have a PR to ml-commons (#645). I didn't change this there, and no error was raised.

I don't know what happened, but after I merged the code from the main branch it didn't pass the check, so I changed it.

zane-neo · 2022-12-27T01:58:24Z

plugin/build.gradle

        'org.opensearch.ml.model.MLModelManager',
+        'org.opensearch.ml.action.unload.TransportUnloadModelAction',
+        'org.opensearch.ml.action.forward.TransportForwardAction',
+        'org.opensearch.ml.action.syncup.TransportSyncUpOnNodeAction',


Why are these files added to the exclusions?

These are part of the original content.

You should rebase on the main branch first to avoid the confusing changes shown in your PR.

zane-neo · 2022-12-27T03:05:21Z

plugin/src/main/java/org/opensearch/ml/plugin/MachineLearningPlugin.java

@@ -470,8 +481,25 @@ public List<ExecutorBuilder<?>> getExecutorBuilders(Settings settings) {
            ML_THREAD_POOL_PREFIX + PREDICT_THREAD_POOL,
            false
        );
+        FixedExecutorBuilder reloadModelThreadPool = new FixedExecutorBuilder(


We don't need to create a thread pool to accomplish this; we can reuse the general thread pool in OpenSearch, or even a simple Thread would be sufficient.

Yes, I have removed it.
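The "simple Thread" option mentioned above could look roughly like the following. This is a sketch in plain Java under assumed names (ReloadOnStartup, startReload, and the thread name are hypothetical, not taken from the PR); the other option, reusing OpenSearch's general thread pool, is not shown because it needs the plugin's ThreadPool instance.

```java
// Hypothetical sketch: names are illustrative, not from the PR.
class ReloadOnStartup {
    /** Runs the reload task once on a daemon thread so node startup is not blocked. */
    static Thread startReload(Runnable reloadTask) {
        Thread t = new Thread(() -> {
            try {
                reloadTask.run();
            } catch (Exception e) {
                // Keep failures inside the reload thread; a real plugin would log this.
                System.err.println("auto reload failed: " + e.getMessage());
            }
        }, "ml-model-auto-reload");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

Because the reload runs once per node start, a dedicated fixed-size pool adds configuration and lifecycle overhead without a clear benefit.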

zane-neo · 2022-12-27T03:14:54Z

plugin/src/main/java/org/opensearch/ml/model/MLModelAutoReLoader.java

+        }
+
+        SearchRequest searchRequest = new SearchRequest(ML_TASK_INDEX);
+        SearchResponse response = client.execute(SearchAction.INSTANCE, searchRequest).actionGet();


How many records can we get by default with this invocation?

Not all of them, so this needs to be modified. I will change the code.

I found that the three fields "task_type", "state" and "worker_node" are all of keyword type.
So I think I can use termQuery to build the search conditions task_type=LOAD_MODEL and state=COMPLETED. However, the "worker_node" field is a comma-separated string, so it will be hard to use termQuery on it. I will test how to build an accurate query for it and keep the number of returned records small.
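For context, a search request without an explicit size returns at most 10 hits by default. The sketch below shows one possible shape for the approach discussed above: a query body that filters on the two keyword fields with term clauses and sets an explicit size, plus a client-side check for the comma-separated worker_node value. It is plain Java that builds the JSON by hand; LoadTaskQuery, searchBody, and ranOnNode are hypothetical names, and filtering worker_node on the client is an assumption for illustration, not what the PR settled on.

```java
import java.util.Arrays;

// Hypothetical sketch: class and method names are illustrative, not from the PR.
class LoadTaskQuery {
    /** Builds a search body that filters on the two keyword fields and sets an explicit size. */
    static String searchBody(int size) {
        return "{"
            + "\"size\":" + size + ","
            + "\"query\":{\"bool\":{\"filter\":["
            + "{\"term\":{\"task_type\":\"LOAD_MODEL\"}},"
            + "{\"term\":{\"state\":\"COMPLETED\"}}"
            + "]}}}";
    }

    /** Returns true if the comma-separated worker_node value contains exactly this node id. */
    static boolean ranOnNode(String workerNode, String nodeId) {
        if (workerNode == null || nodeId == null) {
            return false;
        }
        return Arrays.stream(workerNode.split(","))
            .map(String::trim)
            .anyMatch(nodeId::equals);
    }
}
```

Splitting on the comma and comparing whole ids avoids false matches where one node id is a prefix of another.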

zane-neo · 2022-12-27T03:15:36Z

plugin/src/main/java/org/opensearch/ml/model/MLModelAutoReLoader.java

+
+        SearchRequest searchRequest = new SearchRequest(ML_TASK_INDEX);
+        SearchResponse response = client.execute(SearchAction.INSTANCE, searchRequest).actionGet();
+


The three checks can be merged into a single if check.

if (!isExistedIndex(ML_TASK_INDEX)) { return; }
This check shouldn't be removed, because I found that the code throws IndexNotFoundException if it doesn't first check whether ML_TASK_INDEX exists.
I have merged the other checks into one.

Signed-off-by: wujunshen <frank_wjs@hotmail.com>

…mplement auto reload model function Signed-off-by: wujunshen <frank_wjs@hotmail.com>

…om one node id Signed-off-by: wujunshen <frank_wjs@hotmail.com>

…ned node Signed-off-by: wujunshen <frank_wjs@hotmail.com>

… modify auto reload function to align with the latest requirement Signed-off-by: wujunshen <frank_wjs@hotmail.com>

…update data" under index ".plugins-ml-model-reload" Signed-off-by: wujunshen <frank_wjs@hotmail.com>

…under plugin directory) to unchanged status Signed-off-by: wujunshen <frank_wjs@hotmail.com>

add test code and modify code to tuning Signed-off-by: wujunshen <frank_wjs@hotmail.com>

according to the second codereview conversation by niu zan, modify the test and implement code to tuning Signed-off-by: wujunshen <frank_wjs@hotmail.com>

according to the second codereview conversation by niu zan, rollback the gradle file

opensearch-trigger-bot bot added the infra label Dec 19, 2022

wujunshen marked this pull request as ready for review December 19, 2022 14:04

wujunshen requested a review from a team December 19, 2022 14:04