-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: change max chunk limit exception #717
Fix: change max chunk limit exception #717
Conversation
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
@zhichao-aws Can you rerun the gradle checks? |
I re-runed it multiple times but still get same exception. |
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
…d delimiter algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
@model-collapse @zane-neo This PR is ready for review now |
@@ -170,9 +172,11 @@ public IngestDocument execute(final IngestDocument ingestDocument) { | |||
// fixed token length algorithm needs runtime parameter max_token_count for tokenization | |||
Map<String, Object> runtimeParameters = new HashMap<>(); | |||
int maxTokenCount = getMaxTokenCount(sourceAndMetadataMap); | |||
int stringTobeChunkedCount = getStringTobeChunkedCountFromMap(sourceAndMetadataMap, fieldMap); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename this to chunkFieldsCount
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A single chunking field may contain multiple strings like:
{
"body": ["string 1", "string 2", "string 3"]
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be named as chunkStringCount
// update runtime max_chunk_limit if not disabled | ||
// return an empty list for empty string | ||
if (StringUtils.isEmpty(content)) { | ||
return List.of(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this empty list used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do not return empty list directly, stringTobeChunkedCount will be reduced by 1. This is not expected because only non-empty will take up max_chunk_limit. This parameter is thus ignored for empty string.
List<String> contentResult = chunker.chunk(content, runTimeParameters); | ||
// update string_tobe_chunked_count for each string | ||
int stringTobeChunkedCount = parseIntegerParameter(runTimeParameters, STRING_TOBE_CHUNKED_FIELD, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 is a default value? why choosing 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the default case is we only chunk one string for a single document. Besides, this parameter is always available in text chunking processor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, with field_map and document, we can always calculate a chunk_string_count, we don't need a default value here. Suggesting optimize this code in next PR.
List<String> contentResult = chunker.chunk(content, runTimeParameters); | ||
// update string_tobe_chunked_count for each string | ||
int stringTobeChunkedCount = parseIntegerParameter(runTimeParameters, STRING_TOBE_CHUNKED_FIELD, 1); | ||
runTimeParameters.put(STRING_TOBE_CHUNKED_FIELD, stringTobeChunkedCount - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why -1
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we have finished chunking 1 string.
@@ -14,6 +14,7 @@ | |||
public interface Chunker { | |||
|
|||
String MAX_CHUNK_LIMIT_FIELD = "max_chunk_limit"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make it static final
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modifier stack and final is redundant for interface
@@ -14,6 +14,7 @@ | |||
public interface Chunker { | |||
|
|||
String MAX_CHUNK_LIMIT_FIELD = "max_chunk_limit"; | |||
String STRING_TOBE_CHUNKED_FIELD = "string_tobe_chunked_count"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename this and make this static final
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modifier stack and final is redundant for interface
@@ -14,6 +14,7 @@ | |||
public interface Chunker { | |||
|
|||
String MAX_CHUNK_LIMIT_FIELD = "max_chunk_limit"; | |||
String STRING_TOBE_CHUNKED_FIELD = "string_tobe_chunked_count"; | |||
int DEFAULT_MAX_CHUNK_LIMIT = 100; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static & final
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modifier stack and final is redundant for interface
@@ -14,6 +14,7 @@ | |||
public interface Chunker { | |||
|
|||
String MAX_CHUNK_LIMIT_FIELD = "max_chunk_limit"; | |||
String STRING_TOBE_CHUNKED_FIELD = "string_tobe_chunked_count"; | |||
int DEFAULT_MAX_CHUNK_LIMIT = 100; | |||
int DISABLED_MAX_CHUNK_LIMIT = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modifier stack and final is redundant for interface
) | ||
); | ||
} | ||
public static boolean checkRunTimeMaxChunkLimit(int chunkResultSize, int runtimeMaxChunkLimit, int stringTobeChunkedCount) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this method to default method in interface? Creating a new class file for one method seems overheading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I will update it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
These test failures are due to model deploy issue, which is not related with this PR. We can focus on these test failures later. |
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
* change max chunk limit exception Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix integration tests for two chunking algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update changelog Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add run time parameter string_tobe_chunked_count Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix unit test for fixed token length and delimiter algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement unit test for string to be chunked in fixed token length and delimiter algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update definition for string to be chunked parameter Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix text chunking processor ut Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add string to be chunked count in text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add string to be chunked count in text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add more test cases for text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove chunker util Signed-off-by: yuye-aws <yuyezhu@amazon.com> * change chunk limit check in boht algorithms Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update ut for text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update parameter name to chunk_string_count Signed-off-by: yuye-aws <yuyezhu@amazon.com> * run spot less apply Signed-off-by: yuye-aws <yuyezhu@amazon.com> --------- Signed-off-by: yuye-aws <yuyezhu@amazon.com> (cherry picked from commit 86b70e0)
* change max chunk limit exception Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix integration tests for two chunking algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update changelog Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add run time parameter string_tobe_chunked_count Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix unit test for fixed token length and delimiter algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement unit test for string to be chunked in fixed token length and delimiter algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update definition for string to be chunked parameter Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix text chunking processor ut Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add string to be chunked count in text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add string to be chunked count in text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add more test cases for text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove chunker util Signed-off-by: yuye-aws <yuyezhu@amazon.com> * change chunk limit check in boht algorithms Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update ut for text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update parameter name to chunk_string_count Signed-off-by: yuye-aws <yuyezhu@amazon.com> * run spot less apply Signed-off-by: yuye-aws <yuyezhu@amazon.com> --------- Signed-off-by: yuye-aws <yuyezhu@amazon.com> (cherry picked from commit 86b70e0)
* change max chunk limit exception Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix integration tests for two chunking algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update changelog Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add run time parameter string_tobe_chunked_count Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix unit test for fixed token length and delimiter algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement unit test for string to be chunked in fixed token length and delimiter algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update definition for string to be chunked parameter Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix text chunking processor ut Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add string to be chunked count in text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add string to be chunked count in text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add more test cases for text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove chunker util Signed-off-by: yuye-aws <yuyezhu@amazon.com> * change chunk limit check in boht algorithms Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update ut for text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update parameter name to chunk_string_count Signed-off-by: yuye-aws <yuyezhu@amazon.com> * run spot less apply Signed-off-by: yuye-aws <yuyezhu@amazon.com> --------- Signed-off-by: yuye-aws <yuyezhu@amazon.com> (cherry picked from commit 86b70e0) Co-authored-by: Yuye Zhu <yuyezhu@amazon.com>
* change max chunk limit exception Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix integration tests for two chunking algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update changelog Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add run time parameter string_tobe_chunked_count Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix unit test for fixed token length and delimiter algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement unit test for string to be chunked in fixed token length and delimiter algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update definition for string to be chunked parameter Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix text chunking processor ut Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add string to be chunked count in text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add string to be chunked count in text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add more test cases for text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove chunker util Signed-off-by: yuye-aws <yuyezhu@amazon.com> * change chunk limit check in boht algorithms Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update ut for text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update parameter name to chunk_string_count Signed-off-by: yuye-aws <yuyezhu@amazon.com> * run spot less apply Signed-off-by: yuye-aws <yuyezhu@amazon.com> --------- Signed-off-by: yuye-aws <yuyezhu@amazon.com> (cherry picked from commit 86b70e0) Co-authored-by: Yuye Zhu <yuyezhu@amazon.com>
Description
Fix the issue: #716. You can check the issue for more examples.
Issues Resolved
Fix the issue: #716
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.