fix: counting tokens HeadlineSplitter #2133


Open
lovets18 wants to merge 2 commits into main

Conversation

lovets18
Change
Replaced word counting with token counting in HeadlineSplitter.

Description
chunk_tokens = chunk.split() splits the chunk (a str) into a list of words, so len(chunk_tokens) counted the number of words, not the number of tokens.
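The gap between the two counts is easy to reproduce. A minimal sketch, using a toy sub-word encoder standing in for tiktoken's DEFAULT_TOKENIZER (the 4-character piece length and the toy_encode/toy_decode names are assumptions for illustration, not the library's code):

```python
import re

def toy_encode(text):
    # Break each word into <=4-char pieces, mimicking how a BPE
    # tokenizer produces several tokens per long word.
    pieces = []
    for word in re.findall(r"\S+\s*", text):
        pieces.extend(word[i:i + 4] for i in range(0, len(word), 4))
    return pieces

def toy_decode(tokens):
    return "".join(tokens)

chunk = "internationalization matters"
print(len(chunk.split()))      # word count: 2
print(len(toy_encode(chunk)))  # token count: 8
```

With a real BPE encoding the divergence is similar: long, rare, or non-English words expand into several tokens each, so len(chunk.split()) systematically undercounts against a token-based limit.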


@greptile-apps greptile-apps bot left a comment


Greptile Summary

This PR fixes a fundamental bug in the HeadlineSplitter class where word counting was incorrectly used instead of proper token counting. The changes replace the naive chunk.split() approach with DEFAULT_TOKENIZER.encode(chunk) to accurately measure tokens using tiktoken's encoder/decoder.

The key changes include:

  • In the adjust_chunks method, chunk_tokens now uses DEFAULT_TOKENIZER.encode(chunk) instead of chunk.split()
  • Token manipulation is done on encoded token arrays rather than word arrays
  • Chunks are reconstructed using DEFAULT_TOKENIZER.decode() to maintain proper text formatting
  • A new import of num_tokens_from_string is added for consistent token counting with accumulated chunks
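Taken together, the bullets above amount to logic along these lines. This is a hedged sketch: the toy encode/decode pair stands in for tiktoken, and the merge/truncate policy shown is an illustrative assumption, not the library's exact adjust_chunks implementation:

```python
def toy_encode(text):
    return list(text)  # one token per character, stand-in for tiktoken

def toy_decode(tokens):
    return "".join(tokens)

def adjust_chunks(chunks, min_tokens=5, max_tokens=20):
    adjusted, current = [], []
    for chunk in chunks:
        tokens = toy_encode(chunk)            # was: chunk.split()
        if len(tokens) > max_tokens:
            if current:                       # flush anything accumulated
                adjusted.append(toy_decode(current))
                current = []
            # Truncate on token boundaries, not word boundaries.
            adjusted.append(toy_decode(tokens[:max_tokens]))
            continue
        current.extend(tokens)                # accumulate small chunks
        if len(current) >= min_tokens:
            adjusted.append(toy_decode(current))
            current = []
    if current:
        adjusted.append(toy_decode(current))
    return adjusted
```

The point of the pattern is that every length check operates on the encoded token array, and text is only reconstructed through decode(), so min_tokens/max_tokens mean what their names say.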

This fix is crucial because the class attributes are named min_tokens and max_tokens, indicating the intent to work with tokens, but the previous implementation actually counted words. This distinction matters significantly for LLM applications where models have token-based limits, and the difference between word count and token count can be substantial, especially for non-English text or text with special characters.

Confidence score: 3/5

  • This PR addresses a real bug but introduces an inconsistency that could cause runtime issues
  • The main chunking logic now uses proper token counting, but there's a mismatch in tokenizer usage between different parts of the code that could lead to incorrect behavior
  • The split method at line 60 still uses word counting which creates an inconsistency with the updated adjust_chunks method

1 file reviewed, 1 comment


@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Jul 23, 2025

lovets18 commented Jul 23, 2025

  • The main chunking logic now uses proper token counting, but there's a mismatch in tokenizer usage between different parts of the code that could lead to incorrect behavior

Added encoding_name=DEFAULT_TOKENIZER.name to the num_tokens_from_string call.

  • The split method at line 60 still uses word counting which creates an inconsistency with the updated adjust_chunks method

Replaced it with num_tokens_from_string(text, encoding_name=DEFAULT_TOKENIZER.name).
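The effect of both replies is that every call site resolves the same named encoding. A hedged sketch following the common num_tokens_from_string recipe; the TOKENIZERS registry below stands in for tiktoken.get_encoding, and "toy" is a made-up encoding name:

```python
# Toy registry standing in for tiktoken's named encodings (assumption:
# the real helper resolves encoding_name via tiktoken.get_encoding).
TOKENIZERS = {"toy": lambda text: list(text)}  # one token per character

def num_tokens_from_string(text: str, encoding_name: str) -> int:
    encode = TOKENIZERS[encoding_name]
    return len(encode(text))

# split() and adjust_chunks() can now name the same encoding,
# so their token counts agree.
print(num_tokens_from_string("headline", encoding_name="toy"))  # 8
```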
