Fix model repetition detection performance #18120
Open · +147 −22
Pre-Submission checklist
Please complete all items before asking a LiteLLM maintainer to review your PR
- Added testing in the `tests/litellm/` directory. Adding at least 1 test is a hard requirement - see details.
- `make test-unit` - all new unit tests succeed; some random tests fail with `Unable to locate credentials` or similar.

CI (LiteLLM team)

- Branch creation CI run - Link:
- CI run for the last commit - Link:
- Merge / cherry-pick CI run - Links:
Type
🐛 Bug Fix
Changes
Through pyleak we were able to detect significant event-loop blocking in the `safety_checker` for streaming LLM responses (>100 ms of blocking within a single response).

`safety_checker` runs on every new chunk that arrives. Its current implementation takes the last 100 chunks and short-circuits on the first difference it finds. For 500 total chunks this means that 401 lists of 100 chunks each will be created and compared (one check per chunk from the 100th through the 500th).

This PR changes this behavior by introducing a simple counter that holds the number of repetitions found, avoiding the expensive list creation on each iteration. This offers a 4-5x speedup in the common case of mixed content, and a larger speedup when all chunks are identical; see the benchmark details below.
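For illustration, here is a minimal sketch of the counter-based approach; the attribute names (`_last_chunk_content`, `_repeated_chunk_count`) are placeholders and may differ from the identifiers in the actual diff:

```python
...
self._last_chunk_content = None   # content of the chunk the current streak is counting
self._repeated_chunk_count = 0    # length of the current streak of identical chunks
...

def raise_on_model_repetition(self) -> None:
    last_content = self.chunks[-1].choices[0].delta.content
    if (
        last_content is None
        or not isinstance(last_content, str)
        or len(last_content) <= 2
    ):
        # ignore empty/trivial content, mirroring the filter in the existing check,
        # and reset the streak so it cannot span an invalid chunk
        self._last_chunk_content = None
        self._repeated_chunk_count = 0
        return
    if last_content == self._last_chunk_content:
        # same content as the previous chunk: O(1) counter bump, no list creation
        self._repeated_chunk_count += 1
    else:
        # different content: restart the streak at this chunk
        self._last_chunk_content = last_content
        self._repeated_chunk_count = 1
    if self._repeated_chunk_count >= litellm.REPEATED_STREAMING_CHUNK_LIMIT:
        raise litellm.InternalServerError(
            message="The model is repeating the same chunk = {}.".format(last_content),
            model="",
            llm_provider="",
        )
```

The per-chunk cost thus drops from building and comparing a `REPEATED_STREAMING_CHUNK_LIMIT`-sized list to a constant-time comparison.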
Benchmark details
The table above was generated using the script below:
benchmark_raise_on_model_repetition.py
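As a rough illustration of the methodology only - this is a sketch, not the collapsed script above - a micro-benchmark along these lines compares the old list-building check with a plain counter over synthetic chunks:

```python
import time
from types import SimpleNamespace

CHUNK_LIMIT = 100  # stands in for litellm.REPEATED_STREAMING_CHUNK_LIMIT


def make_chunk(content):
    # minimal stand-in for a streaming chunk exposing chunk.choices[0].delta.content
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=content))])


def old_style_check(chunks):
    # old behavior: materialize the last CHUNK_LIMIT contents on every call
    if len(chunks) < CHUNK_LIMIT:
        return False
    last = [c.choices[0].delta.content for c in chunks[-CHUNK_LIMIT:]]
    return all(x == last[0] for x in last)


def bench(label, contents):
    chunks = []
    start = time.perf_counter()
    for c in contents:
        chunks.append(make_chunk(c))
        old_style_check(chunks)
    old_t = time.perf_counter() - start

    chunks, prev, count = [], None, 0
    start = time.perf_counter()
    for c in contents:
        chunks.append(make_chunk(c))
        content = chunks[-1].choices[0].delta.content
        count = count + 1 if content == prev else 1
        prev = content
        _ = count >= CHUNK_LIMIT  # counter-based check, no list creation
    new_t = time.perf_counter() - start

    print(f"{label}: list-based={old_t:.4f}s counter={new_t:.4f}s ({old_t / new_t:.1f}x)")


bench("mixed content", [f"token-{i % 7}" for i in range(5000)])
bench("same chunks", ["abcdef"] * 5000)
```

The absolute numbers will vary by machine; the point is that the list-based variant re-materializes up to `CHUNK_LIMIT` contents on every chunk, while the counter does constant work.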
There's also an alternative implementation that offers a few further performance improvements at the cost of complexity. I leave it to the maintainers to decide whether this is a better overall approach for LiteLLM.
Alternative implementation
The implementation chosen in this PR is a very simple fix for the performance issue that gives good-enough performance even for a larger number of chunks.
There is also a moving check window approach that is a bit faster across the board (approx. 1.2x) but at the cost of a considerably more complex implementation. The significant advantage of this implementation is that all checks are paused until the number of chunks reaches `REPEATED_STREAMING_CHUNK_LIMIT`, so it can significantly benefit use cases where consumers use high values of `REPEATED_STREAMING_CHUNK_LIMIT`:

```python
...
self._repetition_check_at_chunks_length = litellm.REPEATED_STREAMING_CHUNK_LIMIT
...

def raise_on_model_repetition(self) -> None:
    all_chunks_length = len(self.chunks)
    if all_chunks_length < self._repetition_check_at_chunks_length:
        # we haven't filled the check window yet
        return

    # Sliding check window size = litellm.REPEATED_STREAMING_CHUNK_LIMIT
    # Example with window size = 4
    #                 2nd check               4th check - raises error
    #                ┌──────────┐            ┌──────────┐
    # chunks = [A, B, C, C, D, E, F, F, None, A, A, A, A ]
    #           └──────────┘   └─────────────┘
    #            1st check      3rd check
    last_chunks = self.chunks[-litellm.REPEATED_STREAMING_CHUNK_LIMIT:]
    last_content = last_chunks[-1].choices[0].delta.content
    if (
        last_content is None
        or not isinstance(last_content, str)
        or len(last_content) <= 2
    ):
        # ignore empty content - https://github.com/BerriAI/litellm/issues/5158#issuecomment-2287156946
        # scroll the check window forward so that this invalid chunk is out of it;
        # a valid chunk will never be equal to it, so the repetition check will never be triggered
        self._repetition_check_at_chunks_length = (
            all_chunks_length + litellm.REPEATED_STREAMING_CHUNK_LIMIT
        )
        return

    # find the most recent chunk that is different from the last chunk
    equal_tail_length = 1  # how many chunks are identical to the last chunk (including the last chunk)
    for i in range(len(last_chunks) - 2, -1, -1):
        if last_chunks[i].choices[0].delta.content == last_content:
            equal_tail_length += 1
        else:
            break  # unequal chunks found

    if equal_tail_length == len(last_chunks):
        # all the last_chunks are identical - repetition found
        raise litellm.InternalServerError(
            message="The model is repeating the same chunk = {}.".format(
                last_content
            ),
            model="",
            llm_provider="",
        )

    # we found two unequal chunks, scroll the check window completely past them
    chunks_to_skip = litellm.REPEATED_STREAMING_CHUNK_LIMIT - equal_tail_length
    self._repetition_check_at_chunks_length = all_chunks_length + chunks_to_skip
```
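Either variant could be exercised by a small unit test along these lines; `stream_wrapper_factory` below is a hypothetical fixture standing in for however the wrapper under test is actually constructed in `tests/litellm/`:

```python
import pytest
from types import SimpleNamespace

import litellm


def make_chunk(content):
    # minimal stand-in for a streaming chunk exposing chunk.choices[0].delta.content
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=content))])


def test_raises_on_repeated_chunks(monkeypatch, stream_wrapper_factory):
    # lower the limit so the test stays small
    monkeypatch.setattr(litellm, "REPEATED_STREAMING_CHUNK_LIMIT", 5)
    wrapper = stream_wrapper_factory()  # hypothetical fixture building the wrapper under test
    for _ in range(4):
        wrapper.chunks.append(make_chunk("abc"))
        wrapper.raise_on_model_repetition()  # below the limit: must not raise
    wrapper.chunks.append(make_chunk("abc"))
    with pytest.raises(litellm.InternalServerError):
        wrapper.raise_on_model_repetition()
```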