
[Bug] Replicate integration fails for slow-starting models - no polling for prediction completion #16801

@pditommaso


Bug: Replicate Integration Fails for Slow-Starting Models (Kimi K2 Thinking)

Description

LiteLLM's Replicate integration does not properly poll for prediction completion, causing it to fail immediately for slow-starting models like moonshotai/kimi-k2-thinking. The handler checks the prediction status once and raises an error if the status is "starting", instead of polling until completion.

Environment

  • LiteLLM Version: ghcr.io/berriai/litellm:main-latest (Docker image)
  • Python Version: 3.13
  • Model: replicate/moonshotai/kimi-k2-thinking
  • Environment Variables:
    • LITELLM_REQUEST_TIMEOUT=60
    • LITELLM_REPLICATE_POLL_TIMEOUT=60
    • LITELLM_LOG=DEBUG

Configuration

model_list:
  - model_name: kimi-k2
    litellm_params:
      model: replicate/moonshotai/kimi-k2-thinking
      api_key: os.environ/REPLICATE_API_KEY
      timeout: 60
      num_retries: 0
      replicate_deployment: true

litellm_settings:
  set_verbose: True
  request_timeout: 60

Steps to Reproduce

1. Create config.yaml:

model_list:
  - model_name: kimi-k2
    litellm_params:
      model: replicate/moonshotai/kimi-k2-thinking
      api_key: os.environ/REPLICATE_API_KEY
      timeout: 60

litellm_settings:
  set_verbose: True
  request_timeout: 60

2. Create docker-compose.yml:

version: '3.8'
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./config.yaml:/app/config.yaml:ro
    environment:
      - REPLICATE_API_KEY=${REPLICATE_API_KEY}
      - LITELLM_LOG=DEBUG
    command: ["--config", "/app/config.yaml", "--port", "4000"]

3. Create .env file:

REPLICATE_API_KEY=r8_your_key_here

4. Start LiteLLM:

docker-compose up -d

5. Make a request:

curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi-k2",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'

6. Observe the error after ~2.6 seconds
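
The same request can also be reproduced with the OpenAI Python SDK pointed at the proxy (a minimal sketch; assumes openai>=1.0 is installed and no master key is configured on the proxy, so any placeholder API key is accepted):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-placeholder")

response = client.chat.completions.create(
    model="kimi-k2",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)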

Expected Behavior

LiteLLM should:

  1. Create a Replicate prediction
  2. Poll the prediction status (checking status field)
  3. Wait until status is succeeded, failed, or canceled
  4. Return the result once complete

This is the standard Replicate API workflow as documented: https://replicate.com/docs/topics/predictions/create-a-prediction
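
For reference, a minimal sketch of that workflow against the documented REST endpoints (using the requests library; the endpoint paths match the curl commands in the comparison section below):

import os
import time

import requests

headers = {"Authorization": f"Bearer {os.environ['REPLICATE_API_KEY']}"}

# 1. Create the prediction
prediction = requests.post(
    "https://api.replicate.com/v1/models/moonshotai/kimi-k2-thinking/predictions",
    headers=headers,
    json={"input": {"prompt": "What is the capital of France?"}},
).json()

# 2./3. Poll until a terminal status is reached
while prediction["status"] not in ("succeeded", "failed", "canceled"):
    time.sleep(1)
    prediction = requests.get(
        f"https://api.replicate.com/v1/predictions/{prediction['id']}",
        headers=headers,
    ).json()

# 4. Use the result
print(prediction["status"], prediction.get("output"))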

Actual Behavior

LiteLLM fails after ~2.6 seconds with:

litellm.UnprocessableEntityError: ReplicateException - LiteLLM Error - prediction not succeeded - {
  'id': 'fcpx53c3wdrmc0ctk3wr3vphkc',
  'model': 'moonshotai/kimi-k2-thinking',
  'status': 'starting',
  'created_at': '2025-11-18T21:58:00.419Z'
}

Error Location: /usr/lib/python3.13/site-packages/litellm/llms/replicate/chat/transformation.py:257

Root Cause

The Replicate handler in litellm/llms/replicate/chat/transformation.py raises a ReplicateError immediately if the prediction status is not "succeeded", instead of implementing a polling loop.
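
The failing check is roughly of the following shape (paraphrased for illustration only; not the actual LiteLLM source):

# Illustrative paraphrase - the real check lives in
# litellm/llms/replicate/chat/transformation.py (around line 257)
if response_json.get("status") != "succeeded":
    raise ReplicateError(
        f"LiteLLM Error - prediction not succeeded - {response_json}"
    )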

Comparison: Direct Replicate API vs LiteLLM

Direct Replicate API (Working):

# Create prediction
curl -X POST https://api.replicate.com/v1/models/moonshotai/kimi-k2-thinking/predictions \
  -H "Authorization: Bearer $REPLICATE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": {"prompt": "What is the capital of France?"}}'
# Returns: {"id": "...", "status": "starting", ...}

# Poll for completion (after ~5 seconds)
curl -H "Authorization: Bearer $REPLICATE_API_KEY" \
  https://api.replicate.com/v1/predictions/{id}
# Returns: {"status": "succeeded", "output": ["The", " capital", ...]}

Result: ✅ Success in ~5.8 seconds total

LiteLLM via Replicate (Failing):

  • Fails after 2.6 seconds with "prediction not succeeded"
  • Never polls for completion

Models Affected

  • Fails: moonshotai/kimi-k2-thinking (slow-starting reasoning model)
  • Works: meta/meta-llama-3-8b-instruct (fast model, ~2.8s)

The fast Llama model works because it completes before LiteLLM's single status check. Slow models fail because they're still in "starting" status when checked.

Suggested Fix

The Replicate handler should implement proper polling logic similar to the official Replicate Python client:

# Sketch of the polling loop; create_prediction / get_prediction stand in for
# LiteLLM's existing Replicate HTTP helpers
import time

prediction = create_prediction(...)  # POST /v1/models/{owner}/{name}/predictions
deadline = time.monotonic() + poll_timeout
while prediction["status"] not in ("succeeded", "failed", "canceled"):
    if time.monotonic() > deadline:
        raise TimeoutError("Replicate prediction did not complete in time")
    time.sleep(polling_interval)
    prediction = get_prediction(prediction["id"])  # GET /v1/predictions/{id}

if prediction["status"] == "succeeded":
    return prediction["output"]
raise ReplicateError(f"prediction {prediction['status']}: {prediction}")
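
For comparison, the official Replicate Python client hides this waiting entirely (a sketch; assumes the replicate package is installed and REPLICATE_API_TOKEN is set - note the different variable name from the REPLICATE_API_KEY used above):

import replicate

# replicate.run() blocks until the prediction reaches a terminal state
output = replicate.run(
    "moonshotai/kimi-k2-thinking",
    input={"prompt": "What is the capital of France?"},
)
print(output)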

Additional Context

  • Replicate documentation mentions cold boot times of 3-5 minutes
  • The Kimi K2 Thinking model typically takes 5-10 seconds to start
  • Environment variables like LITELLM_REPLICATE_POLL_TIMEOUT don't seem to affect polling behavior

Related Documentation

  • Replicate – Create a prediction: https://replicate.com/docs/topics/predictions/create-a-prediction
