fix: counting tokens HeadlineSplitter #2133


Open
lovets18 wants to merge 2 commits into main

Conversation

lovets18
Change
Replaced word counting with token counting in HeadlineSplitter.

Description
chunk_tokens = chunk.split() splits the chunk (a str) into a list of words, so len(chunk_tokens) counted the number of words, not the number of tokens.
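The gap between the two counts is easy to reproduce. A minimal sketch, using a toy sub-word encoder standing in for tiktoken's DEFAULT_TOKENIZER (the 4-character piece length and the toy_encode/toy_decode names are assumptions for illustration, not the library's code):

```python
import re

def toy_encode(text):
    # Break each word into <=4-char pieces, mimicking how a BPE
    # tokenizer produces several tokens per long word.
    pieces = []
    for word in re.findall(r"\S+\s*", text):
        pieces.extend(word[i:i + 4] for i in range(0, len(word), 4))
    return pieces

def toy_decode(tokens):
    return "".join(tokens)

chunk = "internationalization matters"
print(len(chunk.split()))      # word count: 2
print(len(toy_encode(chunk)))  # token count: 8
```

With a real BPE encoding the divergence is similar: long, rare, or non-English words expand into several tokens each, so len(chunk.split()) systematically undercounts against a token-based limit.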


@greptile-apps greptile-apps bot left a comment


Greptile Summary

This PR fixes a fundamental bug in the HeadlineSplitter class where word counting was incorrectly used instead of proper token counting. The changes replace the naive chunk.split() approach with DEFAULT_TOKENIZER.encode(chunk) to accurately measure tokens using tiktoken's encoder/decoder.

The key changes include:

  • In the adjust_chunks method, chunk_tokens now uses DEFAULT_TOKENIZER.encode(chunk) instead of chunk.split()
  • Token manipulation is done on encoded token arrays rather than word arrays
  • Chunks are reconstructed using DEFAULT_TOKENIZER.decode() to maintain proper text formatting
  • A new import of num_tokens_from_string is added for consistent token counting with accumulated chunks
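Taken together, the bullets above amount to logic along these lines. This is a hedged sketch: the toy encode/decode pair stands in for tiktoken, and the merge/truncate policy shown is an illustrative assumption, not the library's exact adjust_chunks implementation:

```python
def toy_encode(text):
    return list(text)  # one token per character, stand-in for tiktoken

def toy_decode(tokens):
    return "".join(tokens)

def adjust_chunks(chunks, min_tokens=5, max_tokens=20):
    adjusted, current = [], []
    for chunk in chunks:
        tokens = toy_encode(chunk)            # was: chunk.split()
        if len(tokens) > max_tokens:
            if current:                       # flush anything accumulated
                adjusted.append(toy_decode(current))
                current = []
            # Truncate on token boundaries, not word boundaries.
            adjusted.append(toy_decode(tokens[:max_tokens]))
            continue
        current.extend(tokens)                # accumulate small chunks
        if len(current) >= min_tokens:
            adjusted.append(toy_decode(current))
            current = []
    if current:
        adjusted.append(toy_decode(current))
    return adjusted
```

The point of the pattern is that every length check operates on the encoded token array, and text is only reconstructed through decode(), so min_tokens/max_tokens mean what their names say.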

This fix is crucial because the class attributes are named min_tokens and max_tokens, indicating the intent to work with tokens, but the previous implementation actually counted words. This distinction matters significantly for LLM applications where models have token-based limits, and the difference between word count and token count can be substantial, especially for non-English text or text with special characters.

Confidence score: 3/5

  • This PR addresses a real bug but introduces an inconsistency that could cause runtime issues
  • The main chunking logic now uses proper token counting, but there's a mismatch in tokenizer usage between different parts of the code that could lead to incorrect behavior
  • The split method at line 60 still uses word counting which creates an inconsistency with the updated adjust_chunks method

1 file reviewed, 1 comment


@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Jul 23, 2025

lovets18 commented Jul 23, 2025

  • The main chunking logic now uses proper token counting, but there's a mismatch in tokenizer usage between different parts of the code that could lead to incorrect behavior

Added encoding_name=DEFAULT_TOKENIZER.name to the num_tokens_from_string call.

  • The split method at line 60 still uses word counting which creates an inconsistency with the updated adjust_chunks method

Replaced it with num_tokens_from_string(text, encoding_name=DEFAULT_TOKENIZER.name).
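The effect of both replies is that every call site resolves the same named encoding. A hedged sketch following the common num_tokens_from_string recipe; the TOKENIZERS registry below stands in for tiktoken.get_encoding, and "toy" is a made-up encoding name:

```python
# Toy registry standing in for tiktoken's named encodings (assumption:
# the real helper resolves encoding_name via tiktoken.get_encoding).
TOKENIZERS = {"toy": lambda text: list(text)}  # one token per character

def num_tokens_from_string(text: str, encoding_name: str) -> int:
    encode = TOKENIZERS[encoding_name]
    return len(encode(text))

# split() and adjust_chunks() can now name the same encoding,
# so their token counts agree.
print(num_tokens_from_string("headline", encoding_name="toy"))  # 8
```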
