Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

the dev of [FEATURE]Auto reload model when cluster rebooted/node rejoin #639

Closed
wants to merge 35 commits into from

Conversation

wujunshen
Copy link

@wujunshen wujunshen commented Dec 19, 2022

Description

the new feature:
Auto reload model when cluster rebooted/node rejoin

Issues Resolved

please see: #577

When a ml node under the opensearch cluster halt down with some unknown reasons. The models under this node will be broken and impact the process of the inference or reduced performance. So we add a new feature: When a ml node halt down, we reboot this ml node, the opensearch on this node will auto reload all the models under this node,and user will not reload the model manually. Even in extreme cases, if the reload operation is still unsuccessful, opensearch will also tell the user via logs that the reload was unsuccessful.

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@wujunshen wujunshen marked this pull request as ready for review December 19, 2022 14:04
@wujunshen wujunshen requested a review from a team December 19, 2022 14:04
@ylwu-amzn
Copy link
Collaborator

Thanks for adding this feature. Can you add more details in description?
BTW, please make sure ./gradlew build can pass locally before publishing PR.

@zane-neo
Copy link
Collaborator

Please review the Exceptions thrown, we should not throw any exception to outside of the auto reload logic.

plugin/build.gradle Outdated Show resolved Hide resolved
build.gradle Outdated
@@ -72,7 +72,7 @@ ext {
}

dependencies {
implementation 'junit:junit:${versions.junit}'
implementation 'junit:junit:4.13.1'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't modify this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will report error when run "./gradlew build"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you rebase the code on main branch?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, of course, my code is forked from the latest code under the main branch

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rollback

@@ -26,129 +26,135 @@ public class CommonValue {
public static String WARM_BOX_TYPE = "warm";

public static final String ML_MODEL_INDEX = ".plugins-ml-model";
public static final String ML_MODEL_RELOAD_INDEX = ".plugins-ml-model-reload";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the schema of this index?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the method of creating index and saving data can be seen in MLModelAutoReLoader#saveLatestReTryTimes

@@ -41,7 +41,7 @@ dependencies {
}

configurations.all {
resolutionStrategy.force 'com.google.protobuf:protobuf-java:3.21.9'
resolutionStrategy.force 'com.google.protobuf:protobuf-java:3.21.7'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why protobuf version is downgraded?

Copy link
Author

@wujunshen wujunshen Dec 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in github, When I commit code, there will be the dependency check and I changed it according to the tip. Please see the checks tab and click [Mend Security Check]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also have a PR to ml-commons, and I didn't change this, no error raised: #645

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what happen,but I merge the code from main branch. It didn't pass the check ,so I changed it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rollback

'org.opensearch.ml.model.MLModelManager',
'org.opensearch.ml.action.unload.TransportUnloadModelAction',
'org.opensearch.ml.action.forward.TransportForwardAction',
'org.opensearch.ml.action.syncup.TransportSyncUpOnNodeAction',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why adding these files to exclusions?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are original content,

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should rebase the main branch first to avoid the confused change shown in your PR.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rollback

@@ -470,8 +481,25 @@ public List<ExecutorBuilder<?>> getExecutorBuilders(Settings settings) {
ML_THREAD_POOL_PREFIX + PREDICT_THREAD_POOL,
false
);
FixedExecutorBuilder reloadModelThreadPool = new FixedExecutorBuilder(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to create a thread pool to accomplish this, we can reuse the general thread pool in opensearch or even a simple Thread would be sufficient.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I have removed it

}

SearchRequest searchRequest = new SearchRequest(ML_TASK_INDEX);
SearchResponse response = client.execute(SearchAction.INSTANCE, searchRequest).actionGet();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many records we can get by default with this invocation?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not all, so there will be modified, I will changed code

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found 3 field "task_type","state" and "worker_node" are all keyword type.
So I think I can use termQuery to make search condition task_type=LOAD_MODEL and state =COMPLETED but field of "worker_node" is a string separated by comma. So it will be hard to use termQuery, this code I will make a test to see how to make an accurate query and to ensure smaller numbers for query records


SearchRequest searchRequest = new SearchRequest(ML_TASK_INDEX);
SearchResponse response = client.execute(SearchAction.INSTANCE, searchRequest).actionGet();

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three checks can be merged into one if check.

Copy link
Author

@wujunshen wujunshen Dec 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (!isExistedIndex(ML_TASK_INDEX)) { return; }
this check shouldn't be removed, because I found the code will throw the IndexNotFoundException if I didn't check whether ML_TASK_INDEX existed.
And other checks have been merged into 1 by me

Signed-off-by: wujunshen <frank_wjs@hotmail.com>
…mplement auto reload model function

Signed-off-by: wujunshen <frank_wjs@hotmail.com>
Signed-off-by: wujunshen <frank_wjs@hotmail.com>
…om one node id

Signed-off-by: wujunshen <frank_wjs@hotmail.com>
…ned node

Signed-off-by: wujunshen <frank_wjs@hotmail.com>
… modify auto reload funtion to align with the latest requirement

Signed-off-by: wujunshen <frank_wjs@hotmail.com>
Signed-off-by: wujunshen <frank_wjs@hotmail.com>
…update data" under index ".plugins-ml-model-reload"

Signed-off-by: wujunshen <frank_wjs@hotmail.com>
…update data" under index ".plugins-ml-model-reload"

Signed-off-by: wujunshen <frank_wjs@hotmail.com>
…under plugin directory) to unchanged status

Signed-off-by: wujunshen <frank_wjs@hotmail.com>
Signed-off-by: wujunshen <frank_wjs@hotmail.com>
Signed-off-by: wujunshen <frank_wjs@hotmail.com>
Signed-off-by: wujunshen <frank_wjs@hotmail.com>
Signed-off-by: wujunshen <frank_wjs@hotmail.com>
Signed-off-by: wujunshen <frank_wjs@hotmail.com>
Signed-off-by: wujunshen <frank_wjs@hotmail.com>
Signed-off-by: wujunshen <frank_wjs@hotmail.com>
Signed-off-by: wujunshen <frank_wjs@hotmail.com>
add test code and modify code to tuning

Signed-off-by: wujunshen <frank_wjs@hotmail.com>
add test code and modify code to tuning

Signed-off-by: wujunshen <frank_wjs@hotmail.com>
add test code and modify code to tuning

Signed-off-by: wujunshen <frank_wjs@hotmail.com>
add test code and modify code to tuning

Signed-off-by: wujunshen <frank_wjs@hotmail.com>
add test code and modify code to tuning

Signed-off-by: wujunshen <frank_wjs@hotmail.com>
add test code and modify code to tuning

Signed-off-by: wujunshen <frank_wjs@hotmail.com>
add test code and modify code to tuning

Signed-off-by: wujunshen <frank_wjs@hotmail.com>
according to the second codereview conversation by niu zan,
modify the test and implement code to tuning

Signed-off-by: wujunshen <frank_wjs@hotmail.com>
according to the second codereview conversation by niu zan,
modify the test and implement code to tuning

Signed-off-by: wujunshen <frank_wjs@hotmail.com>
according to the second codereview conversation by niu zan,
rollback the gradle file
according to the second codereview conversation by niu zan,
rollback the gradle file
@wujunshen wujunshen closed this by deleting the head repository Dec 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants