Change text embedding processor to async mode for better isolation #27

Merged: 2 commits into opensearch-project:main, Oct 27, 2022

Conversation

zane-neo (Collaborator) commented Oct 24, 2022

Description

Changed the text embedding processor to async mode when handling user input.
Previously, since inference runs asynchronously and only the execute(IngestDocument) method was overridden, TextEmbeddingProcessor needed a blocking approach to make sure document enrichment completed before indexing happened. The drawback is that this blocking happens in the write thread pool, which has only availableProcessors + 1 threads, so it could impact non-text-embedding indexing.
This change switches to async mode by overriding the execute(IngestDocument, BiConsumer) method, so threads in the write thread pool are no longer blocked, giving better isolation for non-text-embedding indexing.
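
For illustration, here is a minimal sketch of the new override, assembled from the diff excerpts quoted later in this review; the exact class context and the vector-setting helper name are assumptions, not the merged code:

// Sketch of the TextEmbeddingProcessor#execute override. Assumes imports of
// org.opensearch.ingest.IngestDocument, org.opensearch.action.ActionListener,
// java.util.Map and java.util.function.BiConsumer.
@Override
public void execute(IngestDocument ingestDocument, BiConsumer<IngestDocument, Exception> handler) {
    try {
        // Both calls below can throw (e.g. IllegalArgumentException from validation),
        // so failures are reported through the handler instead of propagating upward.
        validateEmbeddingFieldsValue(ingestDocument);
        Map<String, Object> knnMap = buildMapWithKnnKeyAndOriginalValue(ingestDocument);
        // Non-blocking inference: the write thread returns immediately and the
        // ingest pipeline resumes when the listener fires.
        mlCommonsClientAccessor.inferenceSentences(this.modelId, createInferenceList(knnMap), ActionListener.wrap(vectors -> {
            setVectorFieldsToDocument(ingestDocument, knnMap, vectors); // assumed helper name
            handler.accept(ingestDocument, null);
        }, e -> handler.accept(null, e)));
    } catch (Exception e) {
        handler.accept(null, e);
    }
}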

Issues Resolved

An enhancement without a linked issue.

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@zane-neo zane-neo requested a review from a team October 24, 2022 04:38
@zane-neo zane-neo changed the title Async text embedding Change text embedding processor to async mode for better isolation Oct 24, 2022
jmazanec15 (Member) commented:

@zane-neo Please add a brief description to the PR and/or link the issue this relates to.

navneet1v (Collaborator) commented:

@zane-neo Can you please add some more details, like what the flow will be when a client sends a bulk request? Is anything changing?

It's good that we are moving towards async; I just want to understand the flow.

navneet1v (Collaborator) commented:

@zane-neo Please resolve the conflicts, and can we also check why the checks are failing? As of the last commit all the checks were passing. You might want to rebase.

return ingestDocument;
}

public void execute(IngestDocument ingestDocument, BiConsumer<IngestDocument, Exception> handler) {
navneet1v (Collaborator):

Is this an override function?

zane-neo (Collaborator, Author):

Yes, this is an override function. Added the @Override annotation and also added proper Javadoc on this method.

Comment on lines +95 to 105
} catch (Exception e) {
handler.accept(null, e);
}
navneet1v (Collaborator):

Which function is throwing the exception that we are catching here?

zane-neo (Collaborator, Author):

The underlying validateEmbeddingFieldsValue is throwing IllegalArgumentException.

navneet1v (Collaborator):

If the exception is coming from that function only, put the try/catch around that function only and take the predict API call out of the try block.
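
For clarity, the suggestion amounts to something like this hypothetical rearrangement (using the same assumed names as the sketch in the description; this is not the merged code):

@Override
public void execute(IngestDocument ingestDocument, BiConsumer<IngestDocument, Exception> handler) {
    // Narrow try: catch only around validation, inference call left outside it.
    try {
        validateEmbeddingFieldsValue(ingestDocument);
    } catch (IllegalArgumentException e) {
        handler.accept(null, e);
        return;
    }
    Map<String, Object> knnMap = buildMapWithKnnKeyAndOriginalValue(ingestDocument);
    mlCommonsClientAccessor.inferenceSentences(this.modelId, createInferenceList(knnMap), ActionListener.wrap(vectors -> {
        setVectorFieldsToDocument(ingestDocument, knnMap, vectors); // assumed helper name
        handler.accept(ingestDocument, null);
    }, e -> handler.accept(null, e)));
}

As the next reply explains, this was rejected because the predict call itself can also throw synchronously.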

zane-neo (Collaborator, Author):

Checked again: not only validateEmbeddingFieldsValue throws an exception, but mlClient.predict does too, and listener.onFailure is not invoked when the exception arises. The exception could propagate to execute(IngestDocument, BiConsumer); if we don't catch it there, it will propagate further to upper methods and could impact the pipeline execution, which is what we don't want. The parent method also uses a try/catch to catch all exceptions; see https://github.com/opensearch-project/OpenSearch/blob/8c9ca4e858e6333265080972cf57809dbc086208/server/src/main/java/org/opensearch/ingest/Processor.java#L64.

Signed-off-by: Zan Niu <zaniu@amazon.com>
jmazanec15 (Member) left a comment:

A few minor comments, but overall it looks good.

}

/**
* When received a bulk indexing request, the pipeline will be executed in the <a href="https://github.com/opensearch-project/OpenSearch/blob/8fda187bb459757164cc80c91ca305d274ca2b53/server/src/main/java/org/opensearch/action/bulk/TransportBulkAction.java#L226">doInternalExecute</a> method
jmazanec15 (Member):

minor: Why is commit 8fda187bb459757164cc80c91ca305d274ca2b53 in the link?

zane-neo (Collaborator, Author):

Removed this.

throw new RuntimeException("Text embedding processor failed with exception", e);
validateEmbeddingFieldsValue(ingestDocument);
Map<String, Object> knnMap = buildMapWithKnnKeyAndOriginalValue(ingestDocument);
mlCommonsClientAccessor.inferenceSentences(this.modelId, createInferenceList(knnMap), ActionListener.wrap(x -> {
jmazanec15 (Member):

nit: can we keep the name "vectors" instead of "x"?

zane-neo (Collaborator, Author):

Sure, changed to vectors.

* @param ingestDocument
* @param handler
*/
@Override
jmazanec15 (Member):

In the future, could we refactor this method to batch calls to the model so we can improve throughput?

navneet1v (Collaborator):

@jmazanec15 what kind of batching are you suggesting here? Batching the writes, or batching the inference calls?

zane-neo (Collaborator, Author) commented Oct 26, 2022:

It's doable, but there are two things we need to consider.

  1. Effort: changing to batching takes more effort; we need to choose a proper threshold, either a batch size or a linger time, and decide whether to expose these settings to the user.
  2. Benefit: in a cluster, every pair of nodes keeps a TCP connection alive, so there is no connection overhead, and the network time is relatively low compared with CPU time. Based on the performance testing, the bottleneck is CPU rather than network I/O, so batching may not increase performance dramatically.

jmazanec15 (Member):

@zane-neo That makes sense. It might make sense to create a proof-of-concept test in the future to see what the benefit would be.

@navneet1v the idea would be to batch the inference calls. For the async action, instead of calling inference directly, submit the update to some kind of queue that then creates one large request for multiple docs and calls the handlers once it completes (see the sketch below).
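
A hypothetical sketch of that queue-based batching (not part of this PR; all names are invented for illustration, and it assumes Java 16+ for records):

import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

final class InferenceBatcher<DOC> {
    // One queued document plus the callback that resumes its ingest pipeline.
    private record Pending<DOC>(DOC doc, BiConsumer<DOC, Exception> handler) {}

    private final int batchSize;
    private final List<Pending<DOC>> pending = new ArrayList<>();

    InferenceBatcher(int batchSize) {
        this.batchSize = batchSize;
    }

    // Called by the async processor instead of invoking inference per document.
    synchronized void submit(DOC doc, BiConsumer<DOC, Exception> handler) {
        pending.add(new Pending<>(doc, handler));
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Sends one large request for every queued document, then invokes each
    // handler so every document's pipeline can resume.
    private void flush() {
        List<Pending<DOC>> batch = new ArrayList<>(pending);
        pending.clear();
        // ... issue a single combined inference call for `batch` here, then
        // complete each handler (accept(null, e) instead on failure):
        for (Pending<DOC> p : batch) {
            p.handler.accept(p.doc, null);
        }
    }
}

A real implementation would also need a linger-time flush so a partial batch is not stuck waiting, which is part of the effort zane-neo weighs above.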

navneet1v (Collaborator), replying to the above:

I am not 100% convinced by the idea of batching. The reason is that batching the data and making one call helps when the ML node has few threads and multiple requests are competing for them (the API call threads).
Secondly, to create the batch we need to put the data in a synchronized queue, which will further slow down the processing.
Third, batches create a problem where a single input sentence can delay the processing of all the documents.

We can discuss this further, but before even doing the POC we should first do a deep dive on the ML-Commons code to know whether batching can really help or not.

Comment on lines 87 to 92
/**
* When received a bulk indexing request, the pipeline will be executed in the <a href="https://github.com/opensearch-project/OpenSearch/blob/8fda187bb459757164cc80c91ca305d274ca2b53/server/src/main/java/org/opensearch/action/bulk/TransportBulkAction.java#L226">doInternalExecute</a> method
* Before the pipeline execution, the pipeline will be marked as resolved (means executed), and then this overriding method will be invoked when executing the text embedding processor.
* After the inference completes, the handler will invoke the doInternalExecute method again to run actual write operation.
* @param ingestDocument
* @param handler
navneet1v (Collaborator):

Improve the documentation to explain what this function is doing, not what the whole pipeline is doing, and try to remove the commit IDs from the Javadoc; point to the right function using the Javadoc @link tag.

zane-neo (Collaborator, Author):

Sure, added the pipeline execution explanation in the method body.

Signed-off-by: Zan Niu <zaniu@amazon.com>
jmazanec15 (Member) left a comment:

Failure https://github.com/opensearch-project/neural-search/actions/runs/3328176376/jobs/5504257477 is unrelated to this PR. It appears to be related to an ml-commons permission issue. Approving.


@zane-neo zane-neo merged commit d538ad1 into opensearch-project:main Oct 27, 2022
@zane-neo zane-neo added the 'backport 2.x' label Oct 27, 2022
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 27, 2022
* Change text embedding processor to async mode

Signed-off-by: Zan Niu <zaniu@amazon.com>

* Address review comments

Signed-off-by: Zan Niu <zaniu@amazon.com>

Signed-off-by: Zan Niu <zaniu@amazon.com>
(cherry picked from commit d538ad1)
@zane-neo zane-neo added and removed the 'backport 2.x' label Oct 28, 2022
opensearch-trigger-bot (Contributor) commented:

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-27-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 d538ad14679216eb91095143d4237ce58376b790
# Push it to GitHub
git push --set-upstream origin backport/backport-27-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-27-to-2.x.

@jmazanec15 jmazanec15 added and removed the 'backport 2.x' label Oct 29, 2022
opensearch-trigger-bot (Contributor) commented:

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

(Same manual backport instructions as above.)

opensearch-trigger-bot (Contributor) commented:

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

(Same manual backport instructions as above.)

@navneet1v navneet1v added the 'v2.4.0' label, and added and removed the 'backport 2.x' label Oct 29, 2022
opensearch-trigger-bot (Contributor) commented:

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

(Same manual backport instructions as above.)

navneet1v pushed a commit to navneet1v/neural-search that referenced this pull request Oct 29, 2022
…pensearch-project#27)

* Change text embedding processor to async mode

Signed-off-by: Zan Niu <zaniu@amazon.com>
Signed-off-by: Navneet Verma <navneev@amazon.com>
navneet1v added a commit that referenced this pull request Oct 29, 2022
…) (#46)

Signed-off-by: Zan Niu <zaniu@amazon.com>
Signed-off-by: Navneet Verma <navneev@amazon.com>
@jmazanec15 jmazanec15 added the 'Enhancements' label Nov 3, 2022