Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: change max chunk limit exception #717

Merged
merged 16 commits into from
Apr 30, 2024

Conversation

yuye-aws
Copy link
Member

@yuye-aws yuye-aws commented Apr 30, 2024

Description

Fix the issue: #716. You can check the issue for more examples.

Issues Resolved

Fix the issue: #716

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
@yuye-aws
Copy link
Member Author

@zhichao-aws Can you rerun the gradle checks?

@model-collapse model-collapse added backport 2.x Label will add auto workflow to backport PR to 2.x branch backport 2.13 labels Apr 30, 2024
@zhichao-aws
Copy link
Member

@zhichao-aws Can you rerun the gradle checks?

I re-runed it multiple times but still get same exception.

Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
…d delimiter algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
@yuye-aws
Copy link
Member Author

@model-collapse @zane-neo This PR is ready for review now

@@ -170,9 +172,11 @@ public IngestDocument execute(final IngestDocument ingestDocument) {
// fixed token length algorithm needs runtime parameter max_token_count for tokenization
Map<String, Object> runtimeParameters = new HashMap<>();
int maxTokenCount = getMaxTokenCount(sourceAndMetadataMap);
int stringTobeChunkedCount = getStringTobeChunkedCountFromMap(sourceAndMetadataMap, fieldMap);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename this to chunkFieldsCount?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A single chunking field may contain multiple strings like:

{
  "body": ["string 1", "string 2", "string 3"]
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be named as chunkStringCount

// update runtime max_chunk_limit if not disabled
// return an empty list for empty string
if (StringUtils.isEmpty(content)) {
return List.of();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this empty list used for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do not return empty list directly, stringTobeChunkedCount will be reduced by 1. This is not expected because only non-empty will take up max_chunk_limit. This parameter is thus ignored for empty string.

List<String> contentResult = chunker.chunk(content, runTimeParameters);
// update string_tobe_chunked_count for each string
int stringTobeChunkedCount = parseIntegerParameter(runTimeParameters, STRING_TOBE_CHUNKED_FIELD, 1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 is a default value? why choosing 1?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the default case is we only chunk one string for a single document. Besides, this parameter is always available in text chunking processor.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, with field_map and document, we can always calculate a chunk_string_count, we don't need a default value here. Suggesting optimize this code in next PR.

List<String> contentResult = chunker.chunk(content, runTimeParameters);
// update string_tobe_chunked_count for each string
int stringTobeChunkedCount = parseIntegerParameter(runTimeParameters, STRING_TOBE_CHUNKED_FIELD, 1);
runTimeParameters.put(STRING_TOBE_CHUNKED_FIELD, stringTobeChunkedCount - 1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why -1 here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we have finished chunking 1 string.

@@ -14,6 +14,7 @@
public interface Chunker {

String MAX_CHUNK_LIMIT_FIELD = "max_chunk_limit";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make it static final

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modifier stack and final is redundant for interface

@@ -14,6 +14,7 @@
public interface Chunker {

String MAX_CHUNK_LIMIT_FIELD = "max_chunk_limit";
String STRING_TOBE_CHUNKED_FIELD = "string_tobe_chunked_count";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename this and make this static final

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modifier stack and final is redundant for interface

@@ -14,6 +14,7 @@
public interface Chunker {

String MAX_CHUNK_LIMIT_FIELD = "max_chunk_limit";
String STRING_TOBE_CHUNKED_FIELD = "string_tobe_chunked_count";
int DEFAULT_MAX_CHUNK_LIMIT = 100;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static & final

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modifier stack and final is redundant for interface

@@ -14,6 +14,7 @@
public interface Chunker {

String MAX_CHUNK_LIMIT_FIELD = "max_chunk_limit";
String STRING_TOBE_CHUNKED_FIELD = "string_tobe_chunked_count";
int DEFAULT_MAX_CHUNK_LIMIT = 100;
int DISABLED_MAX_CHUNK_LIMIT = -1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same above

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modifier stack and final is redundant for interface

)
);
}
public static boolean checkRunTimeMaxChunkLimit(int chunkResultSize, int runtimeMaxChunkLimit, int stringTobeChunkedCount) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this method to default method in interface? Creating a new class file for one method seems overheading.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I will update it.

Copy link
Collaborator

@model-collapse model-collapse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@yuye-aws
Copy link
Member Author

@zhichao-aws Can you rerun the gradle checks?

I re-runed it multiple times but still get same exception.

These test failures are due to model deploy issue, which is not related with this PR. We can focus on these test failures later.

Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
@zane-neo zane-neo merged commit 86b70e0 into opensearch-project:main Apr 30, 2024
31 of 71 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Apr 30, 2024
* change max chunk limit exception

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix integration tests for two chunking algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update changelog

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add run time parameter string_tobe_chunked_count

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix unit test for fixed token length and delimiter algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement unit test for string to be chunked in fixed token length and delimiter algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update definition for string to be chunked parameter

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix text chunking processor ut

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add string to be chunked count in text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add string to be chunked count in text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add more test cases for text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove chunker util

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* change chunk limit check in boht algorithms

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update ut for text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update parameter name to chunk_string_count

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* run spot less apply

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

---------

Signed-off-by: yuye-aws <yuyezhu@amazon.com>
(cherry picked from commit 86b70e0)
opensearch-trigger-bot bot pushed a commit that referenced this pull request Apr 30, 2024
* change max chunk limit exception

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix integration tests for two chunking algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update changelog

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add run time parameter string_tobe_chunked_count

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix unit test for fixed token length and delimiter algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement unit test for string to be chunked in fixed token length and delimiter algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update definition for string to be chunked parameter

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix text chunking processor ut

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add string to be chunked count in text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add string to be chunked count in text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add more test cases for text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove chunker util

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* change chunk limit check in boht algorithms

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update ut for text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update parameter name to chunk_string_count

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* run spot less apply

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

---------

Signed-off-by: yuye-aws <yuyezhu@amazon.com>
(cherry picked from commit 86b70e0)
model-collapse pushed a commit that referenced this pull request Apr 30, 2024
* change max chunk limit exception

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix integration tests for two chunking algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update changelog

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add run time parameter string_tobe_chunked_count

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix unit test for fixed token length and delimiter algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement unit test for string to be chunked in fixed token length and delimiter algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update definition for string to be chunked parameter

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix text chunking processor ut

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add string to be chunked count in text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add string to be chunked count in text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add more test cases for text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove chunker util

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* change chunk limit check in boht algorithms

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update ut for text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update parameter name to chunk_string_count

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* run spot less apply

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

---------

Signed-off-by: yuye-aws <yuyezhu@amazon.com>
(cherry picked from commit 86b70e0)

Co-authored-by: Yuye Zhu <yuyezhu@amazon.com>
zane-neo pushed a commit that referenced this pull request May 1, 2024
* change max chunk limit exception

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix integration tests for two chunking algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update changelog

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add run time parameter string_tobe_chunked_count

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix unit test for fixed token length and delimiter algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement unit test for string to be chunked in fixed token length and delimiter algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update definition for string to be chunked parameter

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix text chunking processor ut

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add string to be chunked count in text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add string to be chunked count in text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add more test cases for text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove chunker util

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* change chunk limit check in boht algorithms

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update ut for text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update parameter name to chunk_string_count

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* run spot less apply

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

---------

Signed-off-by: yuye-aws <yuyezhu@amazon.com>
(cherry picked from commit 86b70e0)

Co-authored-by: Yuye Zhu <yuyezhu@amazon.com>
@yuye-aws yuye-aws deleted the Fix/ChunkingMaxChunkLimit branch May 6, 2024 07:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Label will add auto workflow to backport PR to 2.x branch backport 2.13
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants