# Run LLMs and VLMs using SGLang
All behaviour is controlled through environment variables:
| Environment Variable | Description | Default | Options |
|---|---|---|---|
| `MODEL_PATH` | Path of the model weights | `"meta-llama/Meta-Llama-3-8B-Instruct"` | Local folder or Hugging Face repo ID |
| `HF_TOKEN` | Your Hugging Face access token, for gated/private models | | |
| `TOKENIZER_PATH` | Path of the tokenizer | | |
| `TOKENIZER_MODE` | Tokenizer mode | `"auto"` | `"auto"`, `"slow"` |
| `LOAD_FORMAT` | Format of model weights to load | `"auto"` | `"auto"`, `"pt"`, `"safetensors"`, `"npcache"`, `"dummy"` |
| `DTYPE` | Data type for weights and activations | `"auto"` | `"auto"`, `"half"`, `"float16"`, `"bfloat16"`, `"float"`, `"float32"` |
| `CONTEXT_LENGTH` | Model's maximum context length | | |
| `QUANTIZATION` | Quantization method | | `"awq"`, `"fp8"`, `"gptq"`, `"marlin"`, `"gptq_marlin"`, `"awq_marlin"`, `"squeezellm"`, `"bitsandbytes"` |
| `SERVED_MODEL_NAME` | Override the model name reported by the API | | |
| `CHAT_TEMPLATE` | Chat template name or path | | |
| `MEM_FRACTION_STATIC` | Fraction of memory for static allocation | | |
| `MAX_RUNNING_REQUESTS` | Maximum number of running requests | | |
| `MAX_TOTAL_TOKENS` | Maximum tokens in the memory pool | | |
| `CHUNKED_PREFILL_SIZE` | Max tokens per chunk for chunked prefill | | |
| `MAX_PREFILL_TOKENS` | Max tokens in a prefill batch | `16384` | |
| `SCHEDULE_POLICY` | Request scheduling policy | `"fcfs"` | `"lpm"`, `"random"`, `"fcfs"`, `"dfs-weight"` |
| `SCHEDULE_CONSERVATIVENESS` | Conservativeness of the schedule policy | `1.0` | |
| `TENSOR_PARALLEL_SIZE` | Tensor parallelism size | `1` | |
| `STREAM_INTERVAL` | Streaming interval in token length | `1` | |
| `RANDOM_SEED` | Random seed | | |
| `LOG_LEVEL` | Logging level for all loggers | `"info"` | |
| `LOG_LEVEL_HTTP` | Logging level for the HTTP server | | |
| `API_KEY` | API key for the server | | |
| `FILE_STORAGE_PATH` | Directory for storing uploaded/generated files | `"sglang_storage"` | |
| `DATA_PARALLEL_SIZE` | Data parallelism size | `1` | |
| `LOAD_BALANCE_METHOD` | Load balancing strategy | `"round_robin"` | `"round_robin"`, `"shortest_queue"` |
| `SKIP_TOKENIZER_INIT` | Skip tokenizer init | `false` | boolean (`true` or `false`) |
| `TRUST_REMOTE_CODE` | Allow custom models from the Hub | `false` | boolean (`true` or `false`) |
| `LOG_REQUESTS` | Log inputs and outputs of requests | `false` | boolean (`true` or `false`) |
| `SHOW_TIME_COST` | Show time cost of custom marks | `false` | boolean (`true` or `false`) |
| `DISABLE_RADIX_CACHE` | Disable RadixAttention for prefix caching | `false` | boolean (`true` or `false`) |
| `DISABLE_CUDA_GRAPH` | Disable CUDA Graph | `false` | boolean (`true` or `false`) |
| `DISABLE_OUTLINES_DISK_CACHE` | Disable the disk cache for Outlines grammars | `false` | boolean (`true` or `false`) |
| `ENABLE_TORCH_COMPILE` | Optimize the model with torch.compile | `false` | boolean (`true` or `false`) |
| `ENABLE_P2P_CHECK` | Enable P2P check for GPU access | `false` | boolean (`true` or `false`) |
| `ENABLE_FLASHINFER_MLA` | Enable FlashInfer MLA optimization | `false` | boolean (`true` or `false`) |
| `TRITON_ATTENTION_REDUCE_IN_FP32` | Cast the Triton attention reduce op to FP32 | `false` | boolean (`true` or `false`) |
| `TOOL_CALL_PARSER` | Parser used to interpret tool-call responses | `"qwen25"` | `"llama3"`, `"llama4"`, `"mistral"`, `"qwen25"`, `"deepseekv3"` |
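Boolean flags are passed as the literal strings `true` or `false`, and numeric settings as plain numbers. The sketch below shows how such values are typically coerced in Python; it is illustrative only, not worker-sglang's actual parsing code:

```python
import os

# Illustrative only: coercing string-valued environment variables
# to the types implied by the table above.
MODEL_PATH = os.getenv("MODEL_PATH", "meta-llama/Meta-Llama-3-8B-Instruct")
MAX_PREFILL_TOKENS = int(os.getenv("MAX_PREFILL_TOKENS", "16384"))
TENSOR_PARALLEL_SIZE = int(os.getenv("TENSOR_PARALLEL_SIZE", "1"))

# Boolean flags arrive as the strings "true" / "false"
DISABLE_RADIX_CACHE = os.getenv("DISABLE_RADIX_CACHE", "false").lower() == "true"
```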
This worker supports two API formats: RunPod native and OpenAI-compatible.
For testing directly in the RunPod UI, use these examples in your endpoint's request tab. The first sends a standard chat completion; the second enables token streaming.
```json
{
  "input": {
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "What is the capital of France?" }
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }
}
```
```json
{
  "input": {
    "messages": [
      { "role": "user", "content": "Write a short story about a robot." }
    ],
    "max_tokens": 500,
    "temperature": 0.8,
    "stream": true
  }
}
```
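Outside the UI, the same payloads go to RunPod's standard serverless routes: POST to `/runsync` for a blocking call, or to `/run` followed by polling `/stream/<job_id>` for streaming. Below is a minimal `requests` sketch for the non-streaming payload above, assuming `RUNPOD_API_KEY` is set in your environment; the exact shape of the returned `output` field depends on the worker version.

```python
import os
import requests

ENDPOINT = "https://api.runpod.ai/v2/<ENDPOINT_ID>"  # replace <ENDPOINT_ID>
HEADERS = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"}

payload = {
    "input": {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"},
        ],
        "max_tokens": 100,
        "temperature": 0.7,
    }
}

# /runsync blocks until the job completes and returns the result inline.
resp = requests.post(f"{ENDPOINT}/runsync", json=payload, headers=HEADERS, timeout=120)
resp.raise_for_status()
print(resp.json().get("output"))
```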
For direct SGLang text generation without the OpenAI chat format:

```json
{
  "input": {
    "text": "The capital of France is",
    "sampling_params": {
      "max_new_tokens": 64,
      "temperature": 0.0
    }
  }
}
```
You can also proxy OpenAI-style routes through the native job API with the `openai_route` key, for example to list the served models:

```json
{
  "input": {
    "openai_route": "/v1/models"
  }
}
```
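Submitting an `openai_route` request works like any other native payload; for example, under the same assumptions as the `requests` sketch above:

```python
import os
import requests

# List the models served by the endpoint via the native job API.
resp = requests.post(
    "https://api.runpod.ai/v2/<ENDPOINT_ID>/runsync",  # replace <ENDPOINT_ID>
    headers={"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"},
    json={"input": {"openai_route": "/v1/models"}},
    timeout=60,
)
print(resp.json().get("output"))
```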
For external clients and SDKs, use the `/openai/v1` path prefix with your RunPod API key.
Path: `/openai/v1/chat/completions`

```json
{
  "model": "meta-llama/Meta-Llama-3-8B-Instruct",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is the capital of France?" }
  ],
  "max_tokens": 100,
  "temperature": 0.7
}
```
```json
{
  "model": "meta-llama/Meta-Llama-3-8B-Instruct",
  "messages": [
    { "role": "user", "content": "Write a short story about a robot." }
  ],
  "max_tokens": 500,
  "temperature": 0.8,
  "stream": true
}
```
Path: `/openai/v1/models`

```json
{}
```
Both APIs return the same response format:
```json
{
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "Paris." },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 9, "completion_tokens": 1, "total_tokens": 10 }
}
```
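For instance, pulling the assistant's text out of that structure in Python:

```python
def extract_reply(response: dict) -> str:
    # Works for the non-streaming response shape shown above.
    return response["choices"][0]["message"]["content"]
```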
Below are minimal Python snippets you can copy-paste to get started quickly. Replace `<ENDPOINT_ID>` with your endpoint ID and `<API_KEY>` with a RunPod API key.
Minimal Python example using the official `openai` SDK:
```python
from openai import OpenAI
import os

# Initialize the OpenAI client with your RunPod API key and endpoint URL
client = OpenAI(
    api_key=os.getenv("RUNPOD_API_KEY"),
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
)
```
Chat Completions (Non-Streaming)
```python
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Give two lines on planet Earth."}],
    temperature=0,
    max_tokens=100,
)
print(f"Response: {response}")
```
Chat Completions (Streaming)
```python
response_stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Give two lines on planet Earth."}],
    temperature=0,
    max_tokens=100,
    stream=True,
)

# Print tokens as they arrive
for chunk in response_stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```
Anything not recognized by worker-sglang is forwarded verbatim to `/generate`, so advanced options from the SGLang docs (logprobs, sessions, images, etc.) also work.
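For example, a hedged sketch of a request asking for log probabilities; the `return_logprob` and `top_logprobs_num` field names come from SGLang's native `/generate` API and should be verified against the SGLang docs for your version:

```json
{
  "input": {
    "text": "The capital of France is",
    "sampling_params": { "max_new_tokens": 16, "temperature": 0.0 },
    "return_logprob": true,
    "top_logprobs_num": 5
  }
}
```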