Fix: ml/engine/utils/FileUtils casts long file length to int incorrectly #3198

Merged
merged 5 commits into from
Dec 12, 2024

Conversation

maxlepikhin
Contributor

@maxlepikhin maxlepikhin commented Nov 3, 2024

Description

"(int) file.length()" makes the length negative for file sizes greater than 2GB (but less than 4GB). As a result, the function returns an empty list of chunks and the model registration task stays stuck in the CREATED state.

The fix is to use longs when splitting the model zip file. I tested locally that the updated opensearch-ml-algorithms jar fixes the problem.
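
The overflow described above can be reproduced in isolation. The following is a minimal sketch, not the actual FileUtils code from this PR: the class name ChunkSplitSketch, both method names, and the 10,000,000-byte chunk size are assumptions made for illustration. It relies only on the Java rule that narrowing a long to an int keeps the low 32 bits, so a 3 GiB length becomes negative and a loop bounded by it never runs, while the same loop with long arithmetic produces the expected ranges.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of ml-commons.
final class ChunkSplitSketch {

    /** Returns the [start, end) byte range of every chunk, using long arithmetic throughout. */
    static List<long[]> chunkRanges(long fileLength, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        for (long start = 0; start < fileLength; start += chunkSize) {
            // start + chunkSize is evaluated as a long, so it cannot wrap here.
            long end = Math.min(start + chunkSize, fileLength);
            ranges.add(new long[] { start, end });
        }
        return ranges;
    }

    /** Mimics the buggy pattern: the length is narrowed to int before the loop. */
    static int chunkCountWithIntCast(long fileLength, int chunkSize) {
        int length = (int) fileLength; // negative when 2 GiB <= fileLength < 4 GiB
        int count = 0;
        for (long start = 0; start < length; start += chunkSize) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        long threeGiB = 3L * 1024 * 1024 * 1024; // 3221225472 bytes
        int chunkSize = 10_000_000;              // assumed chunk size

        System.out.println((int) threeGiB);                             // -1073741824
        System.out.println(chunkCountWithIntCast(threeGiB, chunkSize)); // 0
        System.out.println(chunkRanges(threeGiB, chunkSize).size());    // 323
    }
}
```

Note that for lengths of 4 GiB and above the cast wraps around to a value that may be positive but is still wrong, so checking for a negative length would not be a sufficient guard; keeping the length as a long avoids both cases.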

Bug: #3197

Related Issues

N/A

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@pyek-bot
Contributor

pyek-bot commented Nov 4, 2024

Should we create an issue for this and link it to the PR? @ylwu-amzn

Edit: I see that one already exists: #3197.
@maxlepikhin, can you add it to the description?

@maxlepikhin maxlepikhin had a problem deploying to ml-commons-cicd-env-require-approval November 5, 2024 16:11 — with GitHub Actions Failure
@mingshl
Collaborator

mingshl commented Nov 5, 2024

@maxlepikhin every commit needs your sign-off, which you add with -s, for example: git commit -m "commit message" -s

Your last two commits are missing the sign-off. You can fix this as follows.

To add your Signed-off-by line to every commit in this branch:

  1. Ensure you have a local copy of your branch by checking out the pull request locally via the command line.
  2. In your local branch, run: git rebase HEAD~2 --signoff
  3. Force push your changes to overwrite the branch: git push --force-with-lease origin fix-3197

@b4sjoo
Collaborator

b4sjoo commented Nov 5, 2024

It seems you need to run spotlessApply.

Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>
Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>
Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>
@maxlepikhin maxlepikhin temporarily deployed to ml-commons-cicd-env-require-approval November 6, 2024 18:14 — with GitHub Actions Inactive
@maxlepikhin maxlepikhin had a problem deploying to ml-commons-cicd-env-require-approval November 6, 2024 18:14 — with GitHub Actions Failure
@brianf-aws
Contributor

Hey @maxlepikhin! Just curious, how did you debug this?

@maxlepikhin
Contributor Author

By reading the code.

@maxlepikhin maxlepikhin had a problem deploying to ml-commons-cicd-env-require-approval November 11, 2024 19:17 — with GitHub Actions Failure
@maxlepikhin
Contributor Author

> will approve after all tests passed.

> @maxlepikhin can you identify the version in which this bug occurs? I'm trying to figure out the backport versions. cc @ylwu-amzn

From 11/17/22 (bfb0748): all releases, it seems. It would be great if one of the maintainers could help get the tests passing. Are they flaky?

@maxlepikhin maxlepikhin had a problem deploying to ml-commons-cicd-env-require-approval November 15, 2024 04:53 — with GitHub Actions Failure
@maxlepikhin maxlepikhin had a problem deploying to ml-commons-cicd-env-require-approval November 18, 2024 17:01 — with GitHub Actions Failure
@maxlepikhin maxlepikhin had a problem deploying to ml-commons-cicd-env-require-approval November 18, 2024 18:22 — with GitHub Actions Failure
@mingshl
Collaborator

mingshl commented Nov 18, 2024

It's a flaky test.

Created the issue to track it.

Approved.

RestMLInferenceSearchResponseProcessorIT > testMLInferenceProcessorRemoteModelStringField STANDARD_ERROR
    REPRODUCE WITH: ./gradlew ':opensearch-ml-plugin:integTest' --tests "org.opensearch.ml.rest.RestMLInferenceSearchResponseProcessorIT.testMLInferenceProcessorRemoteModelStringField" -Dtests.seed=9E7BCE94AFC0318E -Dtests.security.manager=false -Dtests.locale=luy-KE -Dtests.timezone=Asia/Dubai -Druntime.java=21

RestMLInferenceSearchResponseProcessorIT > testMLInferenceProcessorRemoteModelStringField FAILED
    org.opensearch.client.ResponseException: method [POST], host [http://127.0.0.1:33169/], URI [/_plugins/_ml/models/null/_deploy], status line [HTTP/1.1 404 Not Found]
    {"error":{"root_cause":[{"type":"status_exception","reason":"Failed to find model"}],"type":"status_exception","reason":"Failed to find model"},"status":404}

mingshl
mingshl previously approved these changes Nov 18, 2024
@maxlepikhin
Contributor Author

OK, thanks @mingshl. How can I rerun it, or override the result so the PR can be merged?

@maxlepikhin
Contributor Author

Can maintainers advise how to either re-run flaky tests or override the results to submit this PR?
@mingshl @pyek-bot @ylwu-amzn - can you help please?

@brianf-aws
Contributor

Hey Max (@maxlepikhin), I will reach out to my team to rerun the workflow. Thank you for your patience; as of now, only maintainers can rerun the workflow.

@dhrubo-os
Collaborator

I tried to re-run the workflow, but GitHub isn't letting me do that. Maybe you can push another commit based on the provided suggestion, which will trigger the workflow again.

Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>
@maxlepikhin maxlepikhin had a problem deploying to ml-commons-cicd-env-require-approval December 12, 2024 02:00 — with GitHub Actions Failure
@maxlepikhin maxlepikhin temporarily deployed to ml-commons-cicd-env-require-approval December 12, 2024 03:53 — with GitHub Actions Inactive
@maxlepikhin maxlepikhin temporarily deployed to ml-commons-cicd-env-require-approval December 12, 2024 04:11 — with GitHub Actions Inactive
@maxlepikhin maxlepikhin temporarily deployed to ml-commons-cicd-env-require-approval December 12, 2024 05:10 — with GitHub Actions Inactive
@mingshl mingshl merged commit e7e0dff into opensearch-project:main Dec 12, 2024
8 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Dec 12, 2024
…tly (#3198)

* Use longs when splitting model zip file

Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>

* add test

Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>

* spotless

Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>

* clean up test

Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>

---------

Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>
(cherry picked from commit e7e0dff)
@maxlepikhin maxlepikhin deleted the fix-3197 branch December 12, 2024 18:06
dhrubo-os pushed a commit that referenced this pull request Dec 13, 2024
…tly (#3198) (#3269)

* Use longs when splitting model zip file

Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>

* add test

Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>

* spotless

Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>

* clean up test

Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>

---------

Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>
(cherry picked from commit e7e0dff)

Co-authored-by: Max Lepikhin <46848373+maxlepikhin@users.noreply.github.com>
tkykenmt pushed a commit to tkykenmt/ml-commons that referenced this pull request Dec 15, 2024
…tly (opensearch-project#3198)

* Use longs when splitting model zip file

Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>

* add test

Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>

* spotless

Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>

* clean up test

Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>

---------

Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>
tkykenmt pushed a commit to tkykenmt/ml-commons that referenced this pull request Dec 15, 2024
…tly (opensearch-project#3198)

* Use longs when splitting model zip file

Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>

* add test

Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>

* spotless

Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>

* clean up test

Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>

---------

Signed-off-by: Max Lepikhin <max.lepikhin@dremio.com>
Signed-off-by: tkykenmt <tkykenmto+github.com@gmail.com>
Labels
backport 2.x bug Something isn't working