[Bug] Setting OverlappingTokens (via appsettings) reduces the configured MaxTokensPerParagraph #318

Open

Description

Context / Scenario

Setting different values for MaxTokensPerParagraph and OverlappingTokens to test out the optimal chunking strategy for answering a set of test questions on a document.

What happened?

When I leave MaxTokensPerParagraph unchanged (1000) and increase only OverlappingTokens (via appsettings) by 100 between test sessions, the resulting chunks/paragraphs keep getting smaller.

I ended up with MaxTokensPerParagraph: 1000 and OverlappingTokens: 800, which produced paragraphs/chunks of only around 200 tokens, as counted by the OpenAI Tokenizer.


I expected the resulting chunk size to be one of the following:

  • MaxTokensPerParagraph + OverlappingTokens
  • MaxTokensPerParagraph (where the OverlappingTokens are included in the MaxTokensPerParagraph)
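To make the numbers concrete, here is a small Python sketch (not Kernel Memory code; the function and key names are hypothetical) that compares the two expected sizes with what I observed. The "observed" formula, MaxTokensPerParagraph minus OverlappingTokens, is only a guess that fits my single 1000/800 data point. I have not confirmed it against the partitioner source.

```python
def chunk_size_candidates(max_tokens: int, overlap: int) -> dict[str, int]:
    """Return the chunk sizes discussed in this issue for the given settings."""
    return {
        # Expectation 1: overlap is added on top of the configured maximum.
        "max_plus_overlap": max_tokens + overlap,
        # Expectation 2: overlap is counted inside the configured maximum.
        "max_including_overlap": max_tokens,
        # Assumption: what the observed ~200-token chunks seem to suggest.
        "observed_guess": max_tokens - overlap,
    }


if __name__ == "__main__":
    for name, size in chunk_size_candidates(1000, 800).items():
        print(f"{name}: {size}")
```

With the settings from my last test (1000 and 800), only the third value is close to the roughly 200 tokens I measured.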

I tested with a single 46-page document. Between test sessions I increased OverlappingTokens while leaving MaxTokensPerParagraph unchanged, saved appsettings, restarted the service, and re-ingested the same document with the exact same call to ImportDocumentAsync() (upserting/replacing (?) the previous chunks for the same document ID).


For now, the issue can be worked around by increasing MaxTokensPerParagraph and OverlappingTokens by the same amount, which keeps the resulting chunk size roughly equal to the originally specified MaxTokensPerParagraph.
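A minimal sketch of that workaround in Python, assuming the effective chunk size is roughly MaxTokensPerParagraph minus OverlappingTokens (an assumption based on my measurements, not on the library's code). The helper name is hypothetical; it only computes the values to put in appsettings.

```python
def workaround_settings(target_chunk_tokens: int, overlap: int) -> dict[str, int]:
    """Compute settings that should yield roughly target_chunk_tokens per chunk.

    Assumes effective chunk size ~= MaxTokensPerParagraph - OverlappingTokens.
    """
    if overlap < 0 or target_chunk_tokens <= 0:
        raise ValueError("target must be positive and overlap non-negative")
    return {
        # Raise the maximum by the overlap so the two cancel out.
        "MaxTokensPerParagraph": target_chunk_tokens + overlap,
        "OverlappingTokens": overlap,
    }


if __name__ == "__main__":
    print(workaround_settings(target_chunk_tokens=1000, overlap=800))
```

For a target of about 1000 tokens per chunk with 800 overlapping tokens, this gives MaxTokensPerParagraph: 1800 and OverlappingTokens: 800.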

Importance

a fix would make my life easier

Platform, Language, Versions

Windows 10, C#, Kernel Memory 0.27.240205.2

Relevant log output

No response


Labels

bug, triage
