
Conversation

@hytromo commented Dec 17, 2025

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have added testing in the tests/litellm/ directory. Adding at least 1 test is a hard requirement - see details
  • My PR passes all unit tests on `make test-unit` - all new unit tests succeed; some unrelated tests fail with `Unable to locate credentials` or similar.
  • My PR's scope is as isolated as possible, it only solves 1 specific problem

CI (LiteLLM team)

CI status guideline:

  • 50-55 passing tests: main is stable with minor issues.
  • 45-49 passing tests: acceptable but needs attention
  • <= 40 passing tests: unstable; be careful with your merges and assess the risk.
  • Branch creation CI run
    Link:

  • CI run for the last commit
    Link:

  • Merge / cherry-pick CI run
    Links:

Type

🐛 Bug Fix

Changes

Through pyleak we were able to detect significant event-loop-blocking behavior in the safety_checker for streaming LLM responses (>100 ms of blocking within a single response).

safety_checker runs on every new chunk that arrives. Its current implementation takes the last 100 chunks and short-circuits on the first difference it finds. For a 500-chunk response this means that 401 separate 100-chunk lists are created and compared.

This PR replaces that behavior with a simple counter that tracks the number of repetitions found, avoiding the expensive list creation on every chunk. This gives a 4-5x speedup in the common case of mixed content, and an even larger speedup when chunks repeat (a minimal sketch of the counter idea follows the table):

| Scenario | Chunks | Before (ms) | After (ms) | Speedup |
| --- | ---: | ---: | ---: | ---: |
| All chunks different | 500 | 1.35 | 0.35 | 3.81x |
| All chunks different | 5,000 | 15.18 | 3.35 | 4.54x |
| All chunks different | 10,000 | 31.39 | 6.92 | 4.54x |
| All chunks different | 20,000 | 62.49 | 14.00 | 4.46x |
| All chunks identical | 500 | 2.46 | 0.31 | 7.91x |
| All chunks identical | 5,000 | 28.52 | 3.14 | 9.10x |
| All chunks identical | 10,000 | 60.03 | 6.40 | 9.37x |
| All chunks identical | 20,000 | 119.21 | 13.16 | 9.06x |
| All chunks None | 500 | 2.33 | 0.27 | 8.67x |
| All chunks None | 5,000 | 28.33 | 2.75 | 10.29x |
| All chunks None | 10,000 | 57.18 | 5.55 | 10.30x |
| All chunks None | 20,000 | 118.37 | 11.30 | 10.47x |
| Mixed | 500 | 1.43 | 0.36 | 4.04x |
| Mixed | 5,000 | 16.08 | 3.54 | 4.55x |
| Mixed | 10,000 | 32.54 | 7.39 | 4.40x |
| Mixed | 20,000 | 65.60 | 16.97 | 3.87x |
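
For illustration, here is a minimal sketch of the counter idea in isolation. This is not the exact code in the PR: the `RepetitionCounter` class, the standalone `REPEATED_STREAMING_CHUNK_LIMIT = 100` constant, and the `RuntimeError` are stand-ins for the corresponding pieces inside LiteLLM's stream wrapper (which raises `litellm.InternalServerError`, as in the alternative implementation shown further below).

```python
# Minimal sketch of the counter idea, not the exact diff in this PR.
REPEATED_STREAMING_CHUNK_LIMIT = 100  # assumed default, for illustration only


class RepetitionCounter:
    """Tracks how many consecutive chunks carried identical content."""

    def __init__(self) -> None:
        self._last_content: str | None = None
        self._repeat_count: int = 0

    def on_chunk(self, content: str | None) -> None:
        # ignore empty/near-empty content, mirroring the existing safety_checker rule
        if content is None or not isinstance(content, str) or len(content) <= 2:
            self._last_content, self._repeat_count = None, 0
            return

        if content == self._last_content:
            self._repeat_count += 1
        else:
            self._last_content, self._repeat_count = content, 1

        if self._repeat_count >= REPEATED_STREAMING_CHUNK_LIMIT:
            # the real code raises litellm.InternalServerError here
            raise RuntimeError(f"The model is repeating the same chunk = {content}.")
```
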
Benchmark details

The above table was generated using the script below:

benchmark_raise_on_model_repetition.py
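
The attached script is not reproduced inline. As a rough, hypothetical illustration only (not the attached benchmark_raise_on_model_repetition.py), a harness along these lines can time the slice-based check against a counter-based check; `LIMIT`, `make_chunk`, and the check functions are simplified stand-ins.

```python
# Hypothetical, simplified timing harness -- NOT the attached benchmark script.
import time
from types import SimpleNamespace

LIMIT = 100  # stand-in for litellm.REPEATED_STREAMING_CHUNK_LIMIT


def make_chunk(content):
    # minimal stand-in for a streaming chunk: chunk.choices[0].delta.content
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=content))])


def slice_check(chunks, state):
    # slice the last LIMIT chunks on every new chunk and compare them
    if len(chunks) < LIMIT:
        return
    last = chunks[-LIMIT:]
    first = last[0].choices[0].delta.content
    if first is None or not isinstance(first, str) or len(first) <= 2:
        return
    if all(c.choices[0].delta.content == first for c in last):
        raise RuntimeError("repetition detected")


def counter_check(chunks, state):
    # track the length of the current run of identical contents instead of slicing
    content = chunks[-1].choices[0].delta.content
    if content is None or not isinstance(content, str) or len(content) <= 2:
        state["last"], state["count"] = None, 0
        return
    if content == state.get("last"):
        state["count"] += 1
    else:
        state["last"], state["count"] = content, 1
    if state["count"] >= LIMIT:
        raise RuntimeError("repetition detected")


def bench(check, contents):
    chunks, state = [], {}
    start = time.perf_counter()
    for content in contents:
        chunks.append(make_chunk(content))
        check(chunks, state)
    return (time.perf_counter() - start) * 1000  # milliseconds


if __name__ == "__main__":
    contents = [f"token-{i}" for i in range(5_000)]  # "all chunks different" scenario
    print(f"slice-based: {bench(slice_check, contents):.2f} ms")
    print(f"counter:     {bench(counter_check, contents):.2f} ms")
```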

There's also an alternative implementation that offers further performance improvements at the cost of complexity. I leave it to the maintainers to decide whether they believe it is a better overall approach for LiteLLM.

Alternative implementation

The implementation chosen in this PR is a very simple fix for the performance issue that gives good-enough performance even for large numbers of chunks.

There is also a moving-check-window approach that is a bit faster across the board (approx. 1.2x) at the cost of a noticeably more complex implementation. Its significant advantage is that all checks are paused until the number of chunks reaches REPEATED_STREAMING_CHUNK_LIMIT, so it can significantly benefit use cases where consumers set high values of REPEATED_STREAMING_CHUNK_LIMIT.

```python
        ...
        self._repetition_check_at_chunks_length = litellm.REPEATED_STREAMING_CHUNK_LIMIT
        ...

    def raise_on_model_repetition(self) -> None:
        all_chunks_length = len(self.chunks)

        if all_chunks_length < self._repetition_check_at_chunks_length:
            # we haven't filled the check window yet
            return

        # Sliding check window size = litellm.REPEATED_STREAMING_CHUNK_LIMIT
        # Example with window size = 4
        #                 2nd check               4th check - raises error
        #                ┌──────────┐            ┌──────────┐
        # chunks = [A, B, C, C, D, E, F, F, None, A, A, A, A ]
        #          └──────────┘   └─────────────┘
        #           1st check       3rd check

        last_chunks = self.chunks[-litellm.REPEATED_STREAMING_CHUNK_LIMIT:]
        last_content = last_chunks[-1].choices[0].delta.content

        if (
            last_content is None
            or not isinstance(last_content, str)
            or len(last_content) <= 2
        ):
            # ignore empty content - https://github.com/BerriAI/litellm/issues/5158#issuecomment-2287156946
            # scroll the check window forward so that this invalid chunk falls out of it;
            # a valid chunk will never be equal to it, so the repetition check could never
            # trigger while it is inside the window
            self._repetition_check_at_chunks_length = (
                all_chunks_length + litellm.REPEATED_STREAMING_CHUNK_LIMIT
            )
            return

        # walk backwards to find the most recent chunk that differs from the last chunk,
        # counting how many trailing chunks are identical to it (including the last chunk)
        equal_tail_length = 1
        for i in range(len(last_chunks) - 2, -1, -1):
            if last_chunks[i].choices[0].delta.content == last_content:
                equal_tail_length += 1
            else:
                break  # unequal chunk found

        if equal_tail_length == len(last_chunks):
            # all the last_chunks are identical - repetition found
            raise litellm.InternalServerError(
                message="The model is repeating the same chunk = {}.".format(
                    last_content
                ),
                model="",
                llm_provider="",
            )

        # we found two unequal chunks; scroll the check window completely past them
        chunks_to_skip = litellm.REPEATED_STREAMING_CHUNK_LIMIT - equal_tail_length
        self._repetition_check_at_chunks_length = all_chunks_length + chunks_to_skip
```

hytromo marked this pull request as ready for review December 17, 2025 11:12