Bug: Replicate Integration Fails for Slow-Starting Models (Kimi K2 Thinking)
Description
LiteLLM's Replicate integration does not properly poll for prediction completion, causing it to fail immediately for slow-starting models like moonshotai/kimi-k2-thinking. The handler checks the prediction status once and raises an error if the status is "starting", instead of polling until completion.
Environment
- LiteLLM Version: `ghcr.io/berriai/litellm:main-latest` (Docker image)
- Python Version: 3.13
- Model: `replicate/moonshotai/kimi-k2-thinking`
- Environment Variables: `LITELLM_REQUEST_TIMEOUT=60`, `LITELLM_REPLICATE_POLL_TIMEOUT=60`, `LITELLM_LOG=DEBUG`
Configuration
```yaml
model_list:
  - model_name: kimi-k2
    litellm_params:
      model: replicate/moonshotai/kimi-k2-thinking
      api_key: os.environ/REPLICATE_API_KEY
      timeout: 60
      num_retries: 0
      replicate_deployment: true
litellm_settings:
  set_verbose: True
  request_timeout: 60
```
Steps to Reproduce
1. Create config.yaml:
```yaml
model_list:
  - model_name: kimi-k2
    litellm_params:
      model: replicate/moonshotai/kimi-k2-thinking
      api_key: os.environ/REPLICATE_API_KEY
      timeout: 60
litellm_settings:
  set_verbose: True
  request_timeout: 60
```
2. Create docker-compose.yml:
```yaml
version: '3.8'
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./config.yaml:/app/config.yaml:ro
    environment:
      - REPLICATE_API_KEY=${REPLICATE_API_KEY}
      - LITELLM_LOG=DEBUG
    command: ["--config", "/app/config.yaml", "--port", "4000"]
```
3. Create .env file:
```
REPLICATE_API_KEY=r8_your_key_here
```
4. Start LiteLLM:
```shell
docker-compose up -d
```
5. Make a request:
```shell
curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi-k2",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```
6. Observe the error after ~2.6 seconds
Expected Behavior
LiteLLM should:
- Create a Replicate prediction
- Poll the prediction status (checking the `status` field)
- Wait until `status` is `succeeded`, `failed`, or `canceled`
- Return the result once complete
This is the standard Replicate API workflow as documented: https://replicate.com/docs/topics/predictions/create-a-prediction
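To make the expected workflow concrete, here is a minimal polling helper in the shape the handler could use. This is a sketch: `poll_prediction`, `fetch`, `poll_interval`, and `poll_timeout` are illustrative names, not LiteLLM internals.

```python
import time

# Terminal states per the Replicate predictions API
TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}

def poll_prediction(fetch, poll_interval=1.0, poll_timeout=300.0, sleep=time.sleep):
    """Call fetch() until the returned prediction dict reaches a terminal status.

    fetch() is any callable returning the prediction as a dict with a
    "status" key (e.g. a GET on /v1/predictions/{id}).
    """
    deadline = time.monotonic() + poll_timeout
    prediction = fetch()
    while prediction["status"] not in TERMINAL_STATUSES:
        if time.monotonic() > deadline:
            raise TimeoutError(
                f"prediction still {prediction['status']!r} after {poll_timeout}s"
            )
        sleep(poll_interval)
        prediction = fetch()
    return prediction
```

With a loop like this, a prediction that reports `starting` on the first check is simply re-fetched until it resolves, instead of being treated as a failure.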
Actual Behavior
LiteLLM fails after ~2.6 seconds with:
```
litellm.UnprocessableEntityError: ReplicateException - LiteLLM Error - prediction not succeeded - {
  'id': 'fcpx53c3wdrmc0ctk3wr3vphkc',
  'model': 'moonshotai/kimi-k2-thinking',
  'status': 'starting',
  'created_at': '2025-11-18T21:58:00.419Z'
}
```
Error Location: /usr/lib/python3.13/site-packages/litellm/llms/replicate/chat/transformation.py:257
Root Cause
The Replicate handler in litellm/llms/replicate/chat/transformation.py raises a ReplicateError immediately if the prediction status is not "succeeded", instead of implementing a polling loop.
Comparison: Direct Replicate API vs LiteLLM
Direct Replicate API (Working):
```shell
# Create prediction
curl -X POST https://api.replicate.com/v1/models/moonshotai/kimi-k2-thinking/predictions \
  -H "Authorization: Bearer $REPLICATE_API_KEY" \
  -d '{"input": {"prompt": "What is the capital of France?"}}'
# Returns: {"id": "...", "status": "starting", ...}

# Poll for completion (after ~5 seconds)
curl -H "Authorization: Bearer $REPLICATE_API_KEY" \
  https://api.replicate.com/v1/predictions/{id}
# Returns: {"status": "succeeded", "output": ["The", " capital", ...]}
```
Result: ✅ Success in ~5.8 seconds total
LiteLLM via Replicate (Failing):
- Fails after 2.6 seconds with "prediction not succeeded"
- Never polls for completion
Models Affected
- Fails: `moonshotai/kimi-k2-thinking` (slow-starting reasoning model)
- Works: `meta/meta-llama-3-8b-instruct` (fast model, ~2.8s)
The fast Llama model works because it completes before LiteLLM's single status check. Slow models fail because they're still in "starting" status when checked.
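The race can be shown with a toy model of the two behaviors (fake status sequences, not LiteLLM code): a single check only passes if the prediction happens to finish before it runs, while polling succeeds in both cases.

```python
def single_check(statuses):
    """Current behavior: look at the status once, error unless it is succeeded."""
    first = statuses[0]
    if first != "succeeded":
        raise RuntimeError(f"prediction not succeeded - status: {first}")
    return first

def poll(statuses):
    """Polling behavior: walk the statuses until a terminal one appears."""
    for status in statuses:
        if status in ("succeeded", "failed", "canceled"):
            return status
    raise RuntimeError("no terminal status reached")

# Fast model: already done by the time of the first check
fast_model = ["succeeded"]
# Slow model: still cold-booting at the first check
slow_model = ["starting", "processing", "succeeded"]
```

`single_check` passes for `fast_model` but raises for `slow_model`, while `poll` returns `"succeeded"` for both.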
Suggested Fix
The Replicate handler should implement proper polling logic similar to the official Replicate Python client:
```python
# Pseudo-code: poll until the prediction reaches a terminal state,
# bounded by a configurable timeout
prediction = create_prediction(...)
deadline = time.monotonic() + poll_timeout
while prediction.status not in ("succeeded", "failed", "canceled"):
    if time.monotonic() > deadline:
        raise ReplicateError("prediction polling timed out")
    time.sleep(polling_interval)
    prediction = get_prediction(prediction.id)
if prediction.status == "succeeded":
    return prediction.output
raise ReplicateError(f"prediction {prediction.status}")
```
Additional Context
- Replicate documentation mentions cold boot times of 3-5 minutes
- The Kimi K2 Thinking model typically takes 5-10 seconds to start
- Environment variables like `LITELLM_REPLICATE_POLL_TIMEOUT` don't seem to affect polling behavior
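If polling is added, the timeout could be read from that environment variable roughly like this. Note the variable name and the 600-second default are assumptions on my part, since `LITELLM_REPLICATE_POLL_TIMEOUT` does not appear to be wired up today:

```python
import os

def get_poll_timeout(default: float = 600.0) -> float:
    """Read the Replicate poll timeout from the environment, with a fallback.

    Assumes the (currently unimplemented) LITELLM_REPLICATE_POLL_TIMEOUT
    variable holds a number of seconds; falls back to the default when the
    variable is unset or unparsable.
    """
    raw = os.environ.get("LITELLM_REPLICATE_POLL_TIMEOUT")
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError:
        return default
```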