The currentSize accumulation in TokenCountBatchingStrategy is inaccurate #5525

@libra19911018

Description

When I was examining the source code of TokenCountBatchingStrategy, I identified an issue with the batching and saving process in the batch() method.

When the loop creates a new currentBatch instance, currentSize is reset to 0, but the Document being processed in that iteration is still added to the new currentBatch.

As a result, every time currentSize is reset, the token count of the Document from that iteration is left out of the running total, so the new batch already holds one document whose tokens are not counted.

Relevant code:
    for (Map.Entry<Document, Integer> entry : documentTokens.entrySet()) {
        Document document = entry.getKey();
        currentSize += entry.getValue();
        if (currentSize > this.maxInputTokenCount) {
            batches.add(currentBatch);
            currentBatch = new ArrayList<>();
            currentSize = 0;
        }
        currentBatch.add(document);
    }
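
To make the undercount concrete, here is a minimal, self-contained sketch. It is not the Spring AI source: plain String ids stand in for Document, the class and method names (BatchSizeDemo, batchAsReported, batchWithCarryOver) are hypothetical, and the final "flush the last batch" step is an assumption about what follows the quoted loop. The second method shows one possible correction, which checks the limit before adding the document and always counts its tokens toward the batch it lands in.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BatchSizeDemo {

    // Mirrors the loop quoted in this issue (String ids replace Document).
    static List<List<String>> batchAsReported(Map<String, Integer> documentTokens, int maxInputTokenCount) {
        List<List<String>> batches = new ArrayList<>();
        List<String> currentBatch = new ArrayList<>();
        int currentSize = 0;
        for (Map.Entry<String, Integer> entry : documentTokens.entrySet()) {
            currentSize += entry.getValue();
            if (currentSize > maxInputTokenCount) {
                batches.add(currentBatch);
                currentBatch = new ArrayList<>();
                currentSize = 0; // the current document's tokens are lost here
            }
            currentBatch.add(entry.getKey());
        }
        if (!currentBatch.isEmpty()) { // assumed final flush, not shown in the issue
            batches.add(currentBatch);
        }
        return batches;
    }

    // One possible fix: test the limit first, then always count the document.
    static List<List<String>> batchWithCarryOver(Map<String, Integer> documentTokens, int maxInputTokenCount) {
        List<List<String>> batches = new ArrayList<>();
        List<String> currentBatch = new ArrayList<>();
        int currentSize = 0;
        for (Map.Entry<String, Integer> entry : documentTokens.entrySet()) {
            int tokens = entry.getValue();
            if (currentSize + tokens > maxInputTokenCount && !currentBatch.isEmpty()) {
                batches.add(currentBatch);
                currentBatch = new ArrayList<>();
                currentSize = 0;
            }
            currentBatch.add(entry.getKey());
            currentSize += tokens; // counted toward the batch that holds it
        }
        if (!currentBatch.isEmpty()) {
            batches.add(currentBatch);
        }
        return batches;
    }

    public static void main(String[] args) {
        Map<String, Integer> tokens = new LinkedHashMap<>();
        tokens.put("a", 6);
        tokens.put("b", 6);
        tokens.put("c", 6);
        System.out.println(batchAsReported(tokens, 10));    // [[a], [b, c]]
        System.out.println(batchWithCarryOver(tokens, 10)); // [[a], [b], [c]]
    }
}
```

With a limit of 10 and three documents of 6 tokens each, the reported loop yields a second batch [b, c] holding 12 tokens, because b's 6 tokens were discarded when currentSize was reset. The corrected variant keeps every batch at or below the limit.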

I apologize for my poor English. This bug description was produced with machine translation.
