
[Bug] Replicate integration fails for slow-starting models - no polling for prediction completion #16801

@pditommaso


Bug: Replicate Integration Fails for Slow-Starting Models (Kimi K2 Thinking)

Description

LiteLLM's Replicate integration does not properly poll for prediction completion, causing it to fail immediately for slow-starting models like moonshotai/kimi-k2-thinking. The handler checks the prediction status once and raises an error if the status is "starting", instead of polling until completion.

Environment

  • LiteLLM Version: ghcr.io/berriai/litellm:main-latest (Docker image)
  • Python Version: 3.13
  • Model: replicate/moonshotai/kimi-k2-thinking
  • Environment Variables:
    • LITELLM_REQUEST_TIMEOUT=60
    • LITELLM_REPLICATE_POLL_TIMEOUT=60
    • LITELLM_LOG=DEBUG

Configuration

model_list:
  - model_name: kimi-k2
    litellm_params:
      model: replicate/moonshotai/kimi-k2-thinking
      api_key: os.environ/REPLICATE_API_KEY
      timeout: 60
      num_retries: 0
      replicate_deployment: true

litellm_settings:
  set_verbose: True
  request_timeout: 60

Steps to Reproduce

1. Create config.yaml:

model_list:
  - model_name: kimi-k2
    litellm_params:
      model: replicate/moonshotai/kimi-k2-thinking
      api_key: os.environ/REPLICATE_API_KEY
      timeout: 60

litellm_settings:
  set_verbose: True
  request_timeout: 60

2. Create docker-compose.yml:

version: '3.8'
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./config.yaml:/app/config.yaml:ro
    environment:
      - REPLICATE_API_KEY=${REPLICATE_API_KEY}
      - LITELLM_LOG=DEBUG
    command: ["--config", "/app/config.yaml", "--port", "4000"]

3. Create .env file:

REPLICATE_API_KEY=r8_your_key_here

4. Start LiteLLM:

docker-compose up -d

5. Make a request:

curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi-k2",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'

6. Observe the error after ~2.6 seconds
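
The same request can also be reproduced with the OpenAI Python SDK pointed at the proxy (a minimal sketch; assumes openai>=1.0 is installed and no master key is configured on the proxy, so any placeholder API key is accepted):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-placeholder")

response = client.chat.completions.create(
    model="kimi-k2",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)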

Expected Behavior

LiteLLM should:

  1. Create a Replicate prediction
  2. Poll the prediction status (checking status field)
  3. Wait until status is succeeded, failed, or canceled
  4. Return the result once complete

This is the standard Replicate API workflow as documented: https://replicate.com/docs/topics/predictions/create-a-prediction
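
For reference, a minimal sketch of that workflow against the documented REST endpoints (using the requests library; the endpoint paths match the curl commands in the comparison section below):

import os
import time

import requests

headers = {"Authorization": f"Bearer {os.environ['REPLICATE_API_KEY']}"}

# 1. Create the prediction
prediction = requests.post(
    "https://api.replicate.com/v1/models/moonshotai/kimi-k2-thinking/predictions",
    headers=headers,
    json={"input": {"prompt": "What is the capital of France?"}},
).json()

# 2./3. Poll until a terminal status is reached
while prediction["status"] not in ("succeeded", "failed", "canceled"):
    time.sleep(1)
    prediction = requests.get(
        f"https://api.replicate.com/v1/predictions/{prediction['id']}",
        headers=headers,
    ).json()

# 4. Use the result
print(prediction["status"], prediction.get("output"))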

Actual Behavior

LiteLLM fails after ~2.6 seconds with:

litellm.UnprocessableEntityError: ReplicateException - LiteLLM Error - prediction not succeeded - {
  'id': 'fcpx53c3wdrmc0ctk3wr3vphkc',
  'model': 'moonshotai/kimi-k2-thinking',
  'status': 'starting',
  'created_at': '2025-11-18T21:58:00.419Z'
}

Error Location: /usr/lib/python3.13/site-packages/litellm/llms/replicate/chat/transformation.py:257

Root Cause

The Replicate handler in litellm/llms/replicate/chat/transformation.py raises a ReplicateError immediately if the prediction status is not "succeeded", instead of implementing a polling loop.
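
The failing check is roughly of the following shape (paraphrased for illustration only; not the actual LiteLLM source):

# Illustrative paraphrase - the real check lives in
# litellm/llms/replicate/chat/transformation.py (around line 257)
if response_json.get("status") != "succeeded":
    raise ReplicateError(
        f"LiteLLM Error - prediction not succeeded - {response_json}"
    )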

Comparison: Direct Replicate API vs LiteLLM

Direct Replicate API (Working):

# Create prediction
curl -X POST https://api.replicate.com/v1/models/moonshotai/kimi-k2-thinking/predictions \
  -H "Authorization: Bearer $REPLICATE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": {"prompt": "What is the capital of France?"}}'
# Returns: {"id": "...", "status": "starting", ...}

# Poll for completion (after ~5 seconds)
curl -H "Authorization: Bearer $REPLICATE_API_KEY" \
  https://api.replicate.com/v1/predictions/{id}
# Returns: {"status": "succeeded", "output": ["The", " capital", ...]}

Result: ✅ Success in ~5.8 seconds total

LiteLLM via Replicate (Failing):

  • Fails after 2.6 seconds with "prediction not succeeded"
  • Never polls for completion

Models Affected

  • Fails: moonshotai/kimi-k2-thinking (slow-starting reasoning model)
  • Works: meta/meta-llama-3-8b-instruct (fast model, ~2.8s)

The fast Llama model works because it completes before LiteLLM's single status check. Slow models fail because they're still in "starting" status when checked.

Suggested Fix

The Replicate handler should implement proper polling logic similar to the official Replicate Python client:

# Sketch of the polling loop; create_prediction / get_prediction stand in for
# LiteLLM's existing Replicate HTTP helpers
import time

prediction = create_prediction(...)  # POST /v1/models/{owner}/{name}/predictions
deadline = time.monotonic() + poll_timeout
while prediction["status"] not in ("succeeded", "failed", "canceled"):
    if time.monotonic() > deadline:
        raise TimeoutError("Replicate prediction did not complete in time")
    time.sleep(polling_interval)
    prediction = get_prediction(prediction["id"])  # GET /v1/predictions/{id}

if prediction["status"] == "succeeded":
    return prediction["output"]
raise ReplicateError(f"prediction {prediction['status']}: {prediction}")
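
For comparison, the official Replicate Python client hides this waiting entirely (a sketch; assumes the replicate package is installed and REPLICATE_API_TOKEN is set - note the different variable name from the REPLICATE_API_KEY used above):

import replicate

# replicate.run() blocks until the prediction reaches a terminal state
output = replicate.run(
    "moonshotai/kimi-k2-thinking",
    input={"prompt": "What is the capital of France?"},
)
print(output)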

Additional Context

  • Replicate documentation mentions cold boot times of 3-5 minutes
  • The Kimi K2 Thinking model typically takes 5-10 seconds to start
  • Environment variables like LITELLM_REPLICATE_POLL_TIMEOUT don't seem to affect polling behavior

Related Documentation

  • Replicate – Create a prediction: https://replicate.com/docs/topics/predictions/create-a-prediction
