Fix: change max chunk limit exception #717

yuye-aws · 2024-04-30T03:58:56Z

Description

Fixes issue #716. See the issue for more examples.

Issues Resolved

Fix the issue: #716

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

yuye-aws · 2024-04-30T05:33:40Z

@zhichao-aws Can you rerun the gradle checks?

zhichao-aws · 2024-04-30T09:02:26Z

@zhichao-aws Can you rerun the gradle checks?

I re-ran them multiple times but still get the same exception.

yuye-aws · 2024-04-30T10:22:42Z

@model-collapse @zane-neo This PR is ready for review now

zane-neo · 2024-04-30T10:46:59Z

src/main/java/org/opensearch/neuralsearch/processor/TextChunkingProcessor.java

@@ -170,9 +172,11 @@ public IngestDocument execute(final IngestDocument ingestDocument) {
        // fixed token length algorithm needs runtime parameter max_token_count for tokenization
        Map<String, Object> runtimeParameters = new HashMap<>();
        int maxTokenCount = getMaxTokenCount(sourceAndMetadataMap);
+        int stringTobeChunkedCount = getStringTobeChunkedCountFromMap(sourceAndMetadataMap, fieldMap);


Can we rename this to chunkFieldsCount?

A single chunking field may contain multiple strings like:

{ "body": ["string 1", "string 2", "string 3"] }

So it should be named chunkStringCount instead.
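To illustrate why the count is per string rather than per field, here is a minimal sketch. `ChunkStringCounter` and `countStrings` are hypothetical names, not the helper added in this PR, and the sketch counts every string it finds, including empty ones.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not the method from this PR.
final class ChunkStringCounter {

    // Counts every String reachable from a field value,
    // descending into nested lists and maps.
    static int countStrings(Object value) {
        if (value instanceof String) {
            return 1;
        }
        if (value instanceof List<?>) {
            int count = 0;
            for (Object element : (List<?>) value) {
                count += countStrings(element);
            }
            return count;
        }
        if (value instanceof Map<?, ?>) {
            int count = 0;
            for (Object nested : ((Map<?, ?>) value).values()) {
                count += countStrings(nested);
            }
            return count;
        }
        return 0; // null or an unsupported type contributes nothing
    }

    public static void main(String[] args) {
        Map<String, Object> document = Map.of("body", List.of("string 1", "string 2", "string 3"));
        // One chunking field, but three strings to chunk.
        System.out.println(countStrings(document.get("body")));
    }
}
```

With the document from the comment above, a field count would be 1 while the string count is 3.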

zane-neo · 2024-04-30T10:54:26Z

src/main/java/org/opensearch/neuralsearch/processor/TextChunkingProcessor.java

-        // update runtime max_chunk_limit if not disabled
+        // return an empty list for empty string
+        if (StringUtils.isEmpty(content)) {
+            return List.of();


What's this empty list used for?

If we do not return an empty list directly, stringTobeChunkedCount will be reduced by 1. That is not expected, because only non-empty strings count against max_chunk_limit. The parameter is therefore ignored for empty strings.

zane-neo · 2024-04-30T10:59:11Z

src/main/java/org/opensearch/neuralsearch/processor/TextChunkingProcessor.java

        List<String> contentResult = chunker.chunk(content, runTimeParameters);
+        // update string_tobe_chunked_count for each string
+        int stringTobeChunkedCount = parseIntegerParameter(runTimeParameters, STRING_TOBE_CHUNKED_FIELD, 1);


Is 1 a default value? Why choose 1?

Because in the default case we chunk only one string per document. Besides, this parameter is always available in the text chunking processor.

In fact, given the field_map and the document, we can always calculate chunk_string_count, so we don't need a default value here. I suggest optimizing this code in the next PR.

zane-neo · 2024-04-30T11:01:02Z

src/main/java/org/opensearch/neuralsearch/processor/TextChunkingProcessor.java

        List<String> contentResult = chunker.chunk(content, runTimeParameters);
+        // update string_tobe_chunked_count for each string
+        int stringTobeChunkedCount = parseIntegerParameter(runTimeParameters, STRING_TOBE_CHUNKED_FIELD, 1);
+        runTimeParameters.put(STRING_TOBE_CHUNKED_FIELD, stringTobeChunkedCount - 1);


Why -1 here?

Because we have finished chunking 1 string.
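For readers following the thread, the bookkeeping discussed above (return early for an empty string, otherwise decrement the counter once the string has been chunked) can be sketched roughly as follows. This is an illustration under assumptions, not the processor code: `ChunkCountBookkeeping`, `chunkString`, and the whitespace-splitting stand-in chunker are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; the names and the stand-in chunker are assumptions.
final class ChunkCountBookkeeping {

    static final String CHUNK_STRING_COUNT_FIELD = "chunk_string_count";

    static List<String> chunkString(String content, Map<String, Object> runtimeParameters) {
        // An empty string produces no chunks and must not touch the counter.
        if (content == null || content.isEmpty()) {
            return List.of();
        }
        // Stand-in for chunker.chunk(content, runtimeParameters): split on whitespace.
        List<String> chunks = Arrays.asList(content.trim().split("\\s+"));
        // One more string has been chunked, so one fewer remains.
        int remaining = ((Number) runtimeParameters.getOrDefault(CHUNK_STRING_COUNT_FIELD, 1)).intValue();
        runtimeParameters.put(CHUNK_STRING_COUNT_FIELD, remaining - 1);
        return chunks;
    }

    public static void main(String[] args) {
        Map<String, Object> runtimeParameters = new HashMap<>();
        runtimeParameters.put(CHUNK_STRING_COUNT_FIELD, 2); // two non-empty strings below
        for (String content : List.of("first string", "", "second string")) {
            List<String> chunks = chunkString(content, runtimeParameters);
            System.out.println(chunks + " remaining=" + runtimeParameters.get(CHUNK_STRING_COUNT_FIELD));
        }
    }
}
```

In this sketch the counter goes 2, 1, 1, 0 across the three inputs: the empty string in the middle leaves it unchanged.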

zane-neo · 2024-04-30T11:01:40Z

src/main/java/org/opensearch/neuralsearch/processor/chunker/Chunker.java

@@ -14,6 +14,7 @@
 public interface Chunker {

    String MAX_CHUNK_LIMIT_FIELD = "max_chunk_limit";


Make it static final

The static and final modifiers are redundant in an interface.
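The Java language treats every field declared in an interface as implicitly public, static, and final, which is why the explicit modifiers add nothing. A small reflection check shows this; `ChunkerConstants` and `InterfaceFieldModifiers` are hypothetical names used only for this sketch.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Hypothetical interface mirroring the constant from the diff, with no modifiers written.
interface ChunkerConstants {
    String MAX_CHUNK_LIMIT_FIELD = "max_chunk_limit";
}

final class InterfaceFieldModifiers {

    // Returns the modifiers the compiler actually recorded for the field.
    static String describe(String fieldName) {
        try {
            Field field = ChunkerConstants.class.getDeclaredField(fieldName);
            return Modifier.toString(field.getModifiers());
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("unknown field: " + fieldName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("MAX_CHUNK_LIMIT_FIELD")); // prints "public static final"
    }
}
```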

zane-neo · 2024-04-30T11:01:54Z

src/main/java/org/opensearch/neuralsearch/processor/chunker/Chunker.java

@@ -14,6 +14,7 @@
 public interface Chunker {

    String MAX_CHUNK_LIMIT_FIELD = "max_chunk_limit";
+    String STRING_TOBE_CHUNKED_FIELD = "string_tobe_chunked_count";


Rename this and make this static final

The static and final modifiers are redundant in an interface.

zane-neo · 2024-04-30T11:02:12Z

src/main/java/org/opensearch/neuralsearch/processor/chunker/Chunker.java

@@ -14,6 +14,7 @@
 public interface Chunker {

    String MAX_CHUNK_LIMIT_FIELD = "max_chunk_limit";
+    String STRING_TOBE_CHUNKED_FIELD = "string_tobe_chunked_count";
    int DEFAULT_MAX_CHUNK_LIMIT = 100;


static & final

The static and final modifiers are redundant in an interface.

zane-neo · 2024-04-30T11:02:20Z

src/main/java/org/opensearch/neuralsearch/processor/chunker/Chunker.java

@@ -14,6 +14,7 @@
 public interface Chunker {

    String MAX_CHUNK_LIMIT_FIELD = "max_chunk_limit";
+    String STRING_TOBE_CHUNKED_FIELD = "string_tobe_chunked_count";
    int DEFAULT_MAX_CHUNK_LIMIT = 100;
    int DISABLED_MAX_CHUNK_LIMIT = -1;


The static and final modifiers are redundant in an interface.

zane-neo · 2024-04-30T11:04:24Z

src/main/java/org/opensearch/neuralsearch/processor/chunker/ChunkerUtil.java

-                )
-            );
-        }
+    public static boolean checkRunTimeMaxChunkLimit(int chunkResultSize, int runtimeMaxChunkLimit, int stringTobeChunkedCount) {


Can we move this method into the interface as a default method? Creating a new class file for a single method seems like overkill.

Good point. I will update it.
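A rough sketch of what the suggestion could look like is below. The parameter list follows the signature shown in the diff, but the interface name, the method name, and especially the comparison inside the method body are assumptions made for illustration; the thread does not show the actual check.

```java
import java.util.List;

// Hypothetical interface; only the parameter list mirrors the diff above.
interface LimitAwareChunker {

    int DISABLED_MAX_CHUNK_LIMIT = -1;

    List<String> chunk(String content);

    // Assumed rule, for illustration only: the limit is reached once the chunks
    // produced so far plus the strings still waiting would use up the budget.
    default boolean exceedsMaxChunkLimit(int chunkResultSize, int runtimeMaxChunkLimit, int chunkStringCount) {
        return runtimeMaxChunkLimit != DISABLED_MAX_CHUNK_LIMIT
            && chunkResultSize + chunkStringCount >= runtimeMaxChunkLimit;
    }
}

final class DefaultMethodDemo {

    // Trivial implementation: the whole string is a single chunk.
    static LimitAwareChunker wholeStringChunker() {
        return content -> List.of(content);
    }

    public static void main(String[] args) {
        LimitAwareChunker chunker = wholeStringChunker();
        System.out.println(chunker.exceedsMaxChunkLimit(3, 5, 2));    // 3 + 2 >= 5
        System.out.println(chunker.exceedsMaxChunkLimit(100, -1, 2)); // limit disabled
    }
}
```

Every implementation inherits the check without a separate utility class, which is the point of the review comment.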

model-collapse

lgtm

yuye-aws · 2024-04-30T12:13:00Z

@zhichao-aws Can you rerun the gradle checks?

I re-ran them multiple times but still get the same exception.

These test failures are caused by a model deployment issue, which is unrelated to this PR. We can look into them later.

* change max chunk limit exception
* fix integration tests for two chunking algorithm
* update changelog
* add run time parameter string_tobe_chunked_count
* fix unit test for fixed token length and delimiter algorithm
* implement unit test for string to be chunked in fixed token length and delimiter algorithm
* update definition for string to be chunked parameter
* fix text chunking processor ut
* add string to be chunked count in text chunking processor
* add string to be chunked count in text chunking processor
* add more test cases for text chunking processor
* remove chunker util
* change chunk limit check in boht algorithms
* update ut for text chunking processor
* update parameter name to chunk_string_count
* run spot less apply

Signed-off-by: yuye-aws <yuyezhu@amazon.com>
(cherry picked from commit 86b70e0)
Co-authored-by: Yuye Zhu <yuyezhu@amazon.com>

change max chunk limit exception

bcb1c79

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

yuye-aws requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, martin-gaievski, sean-zheng-amazon, model-collapse, zane-neo, ylwu-amzn, jngz-es, vibrantvarun and zhichao-aws as code owners April 30, 2024 03:58

yuye-aws added 2 commits April 30, 2024 12:51

fix integration tests for two chunking algorithm

97ad445

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

update changelog

df89605

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

yuye-aws mentioned this pull request Apr 30, 2024

[BUG] Text chunking max_chunk_limit error #716

Closed

zhichao-aws approved these changes Apr 30, 2024

model-collapse assigned yuye-aws Apr 30, 2024

model-collapse added the backport 2.x and backport 2.13 labels Apr 30, 2024

yuye-aws added 5 commits April 30, 2024 17:18

add run time parameter string_tobe_chunked_count

da90fc2

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

fix unit test for fixed token length and delimiter algorithm

b66e358

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

implement unit test for string to be chunked in fixed token length and delimiter algorithm

5f73a18

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

update definition for string to be chunked parameter

d18e793

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

fix text chunking processor ut

18d6fed

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

yuye-aws added 3 commits April 30, 2024 18:12

add string to be chunked count in text chunking processor

c0b5b25

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

add string to be chunked count in text chunking processor

1fdf596

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

add more test cases for text chunking processor

416f33b

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

zane-neo reviewed Apr 30, 2024

model-collapse approved these changes Apr 30, 2024

yuye-aws added 5 commits April 30, 2024 20:14

remove chunker util

3490a92

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

change chunk limit check in boht algorithms

84cd362

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

update ut for text chunking processor

65f8cf7

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

update parameter name to chunk_string_count

18b7936

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

run spot less apply

489acc9

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

zane-neo merged commit 86b70e0 into opensearch-project:main Apr 30, 2024
31 of 71 checks passed

opensearch-trigger-bot bot mentioned this pull request Apr 30, 2024

[Backport 2.x] Fix: change max chunk limit exception #719

Merged

opensearch-trigger-bot bot mentioned this pull request Apr 30, 2024

[Backport 2.13] Fix: change max chunk limit exception #720

Merged

yuye-aws deleted the Fix/ChunkingMaxChunkLimit branch May 6, 2024 07:50

yuye-aws mentioned this pull request May 6, 2024

Optimize parameter parsing in text chunking processor #733

Merged
