Fix model repetition detection performance #18120
Open · +147 −22
Pre-Submission checklist
Please complete all items before asking a LiteLLM maintainer to review your PR
- Added testing in the `tests/litellm/` directory. Adding at least 1 test is a hard requirement - see details.
- `make test-unit` - all new unit tests succeed; some random tests fail with `Unable to locate credentials` or similar.

CI (LiteLLM team)

- Branch creation CI run - Link:
- CI run for the last commit - Link:
- Merge / cherry-pick CI run - Links:
Type
🐛 Bug Fix
Changes
Through pyleak we were able to detect significant event-loop blocking in the `safety_checker` for streaming LLM responses (>100 ms of blocking within a single response).

`safety_checker` runs on every new chunk that arrives. Its current implementation takes the last 100 chunks and short-circuits on the first difference it finds. For 500 total chunks this means that 401 lists of 100 chunks each will be created and compared (one check per chunk from the 100th through the 500th).

This PR changes this behavior by introducing a simple counter that holds the number of repetitions found, avoiding the expensive list creation on each iteration. This offers a 4-5x speedup in the common case of mixed content, and a larger speedup when all chunks are identical; see the benchmark details below.
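For illustration, here is a minimal sketch of the counter-based approach; the attribute names (`_last_chunk_content`, `_repeated_chunk_count`) are placeholders and may differ from the identifiers in the actual diff:

```python
...
self._last_chunk_content = None   # content of the chunk the current streak is counting
self._repeated_chunk_count = 0    # length of the current streak of identical chunks
...

def raise_on_model_repetition(self) -> None:
    last_content = self.chunks[-1].choices[0].delta.content
    if (
        last_content is None
        or not isinstance(last_content, str)
        or len(last_content) <= 2
    ):
        # ignore empty/trivial content, mirroring the filter in the existing check,
        # and reset the streak so it cannot span an invalid chunk
        self._last_chunk_content = None
        self._repeated_chunk_count = 0
        return
    if last_content == self._last_chunk_content:
        # same content as the previous chunk: O(1) counter bump, no list creation
        self._repeated_chunk_count += 1
    else:
        # different content: restart the streak at this chunk
        self._last_chunk_content = last_content
        self._repeated_chunk_count = 1
    if self._repeated_chunk_count >= litellm.REPEATED_STREAMING_CHUNK_LIMIT:
        raise litellm.InternalServerError(
            message="The model is repeating the same chunk = {}.".format(last_content),
            model="",
            llm_provider="",
        )
```

The per-chunk cost thus drops from building and comparing a `REPEATED_STREAMING_CHUNK_LIMIT`-sized list to a constant-time comparison.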
Benchmark details
The table above was generated using the script below:
benchmark_raise_on_model_repetition.py
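As a rough illustration of the methodology only - this is a sketch, not the collapsed script above - a micro-benchmark along these lines compares the old list-building check with a plain counter over synthetic chunks:

```python
import time
from types import SimpleNamespace

CHUNK_LIMIT = 100  # stands in for litellm.REPEATED_STREAMING_CHUNK_LIMIT


def make_chunk(content):
    # minimal stand-in for a streaming chunk exposing chunk.choices[0].delta.content
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=content))])


def old_style_check(chunks):
    # old behavior: materialize the last CHUNK_LIMIT contents on every call
    if len(chunks) < CHUNK_LIMIT:
        return False
    last = [c.choices[0].delta.content for c in chunks[-CHUNK_LIMIT:]]
    return all(x == last[0] for x in last)


def bench(label, contents):
    chunks = []
    start = time.perf_counter()
    for c in contents:
        chunks.append(make_chunk(c))
        old_style_check(chunks)
    old_t = time.perf_counter() - start

    chunks, prev, count = [], None, 0
    start = time.perf_counter()
    for c in contents:
        chunks.append(make_chunk(c))
        content = chunks[-1].choices[0].delta.content
        count = count + 1 if content == prev else 1
        prev = content
        _ = count >= CHUNK_LIMIT  # counter-based check, no list creation
    new_t = time.perf_counter() - start

    print(f"{label}: list-based={old_t:.4f}s counter={new_t:.4f}s ({old_t / new_t:.1f}x)")


bench("mixed content", [f"token-{i % 7}" for i in range(5000)])
bench("same chunks", ["abcdef"] * 5000)
```

The absolute numbers will vary by machine; the point is that the list-based variant re-materializes up to `CHUNK_LIMIT` contents on every chunk, while the counter does constant work.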
There's also an alternative implementation that offers a few further performance improvements at the cost of complexity. I leave it to the maintainers to decide whether this is a better overall approach for LiteLLM.
Alternative implementation
The implementation chosen in this PR is a very simple fix for the performance issue that gives good-enough performance even for a larger number of chunks.
There is also a moving check window approach that is a bit faster across the board (approx. 1.2x) but at the cost of a considerably more complex implementation. The significant advantage of this implementation is that all checks are paused until the number of chunks reaches `REPEATED_STREAMING_CHUNK_LIMIT`, so it can significantly benefit use cases where consumers use high values of `REPEATED_STREAMING_CHUNK_LIMIT`:

```python
...
self._repetition_check_at_chunks_length = litellm.REPEATED_STREAMING_CHUNK_LIMIT
...

def raise_on_model_repetition(self) -> None:
    all_chunks_length = len(self.chunks)
    if all_chunks_length < self._repetition_check_at_chunks_length:
        # we haven't filled the check window yet
        return

    # Sliding check window size = litellm.REPEATED_STREAMING_CHUNK_LIMIT
    # Example with window size = 4
    #                 2nd check               4th check - raises error
    #                ┌──────────┐            ┌──────────┐
    # chunks = [A, B, C, C, D, E, F, F, None, A, A, A, A ]
    #           └──────────┘   └─────────────┘
    #            1st check      3rd check
    last_chunks = self.chunks[-litellm.REPEATED_STREAMING_CHUNK_LIMIT:]
    last_content = last_chunks[-1].choices[0].delta.content
    if (
        last_content is None
        or not isinstance(last_content, str)
        or len(last_content) <= 2
    ):
        # ignore empty content - https://github.com/BerriAI/litellm/issues/5158#issuecomment-2287156946
        # scroll the check window forward so that this invalid chunk is out of it;
        # a valid chunk will never be equal to it, so the repetition check will never be triggered
        self._repetition_check_at_chunks_length = (
            all_chunks_length + litellm.REPEATED_STREAMING_CHUNK_LIMIT
        )
        return

    # find the most recent chunk that is different from the last chunk
    equal_tail_length = 1  # how many chunks are identical to the last chunk (including the last chunk)
    for i in range(len(last_chunks) - 2, -1, -1):
        if last_chunks[i].choices[0].delta.content == last_content:
            equal_tail_length += 1
        else:
            break  # unequal chunks found

    if equal_tail_length == len(last_chunks):
        # all the last_chunks are identical - repetition found
        raise litellm.InternalServerError(
            message="The model is repeating the same chunk = {}.".format(
                last_content
            ),
            model="",
            llm_provider="",
        )

    # we found two unequal chunks, scroll the check window completely past them
    chunks_to_skip = litellm.REPEATED_STREAMING_CHUNK_LIMIT - equal_tail_length
    self._repetition_check_at_chunks_length = all_chunks_length + chunks_to_skip
```
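Either variant could be exercised by a small unit test along these lines; `stream_wrapper_factory` below is a hypothetical fixture standing in for however the wrapper under test is actually constructed in `tests/litellm/`:

```python
import pytest
from types import SimpleNamespace

import litellm


def make_chunk(content):
    # minimal stand-in for a streaming chunk exposing chunk.choices[0].delta.content
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=content))])


def test_raises_on_repeated_chunks(monkeypatch, stream_wrapper_factory):
    # lower the limit so the test stays small
    monkeypatch.setattr(litellm, "REPEATED_STREAMING_CHUNK_LIMIT", 5)
    wrapper = stream_wrapper_factory()  # hypothetical fixture building the wrapper under test
    for _ in range(4):
        wrapper.chunks.append(make_chunk("abc"))
        wrapper.raise_on_model_repetition()  # below the limit: must not raise
    wrapper.chunks.append(make_chunk("abc"))
    with pytest.raises(litellm.InternalServerError):
        wrapper.raise_on_model_repetition()
```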