A high-performance, stateless OpenRouter proxy service built with Node.js, TypeScript, and Express. It provides a REST API and WebSocket streaming for LLM inference, with no authentication or user tracking.
- Dual Interface: REST API for standard requests and WebSocket for streaming
- Comprehensive Parameter Support: System prompts, model/provider selection, temperature, tools, etc.
- Multi-modal Support: Text, audio, and image generation capabilities
- Robust Error Handling: Graceful failure recovery and informative error responses
- High Performance: Optimized for speed and low latency
- IP-based Rate Limiting: Protection against abuse while maintaining simplicity
- Node.js 20+
- npm or yarn
- OpenRouter API key
- Clone the repository:
git clone <repository-url>
cd llm-proxy
- Install dependencies:
npm install
- Set up environment variables:
cp .env.example .env
# Edit .env with your OpenRouter API key
- Build the project:
npm run build
- Start the server:
npm start
For development:
npm run dev
The service uses environment variables for configuration. See .env.example for all available options:
- OPENROUTER_API_KEY: Your OpenRouter API key
- PORT: Server port (default: 3000)
- HOST: Server host (default: 0.0.0.0)
- NODE_ENV: Environment (development/production/test)
- LOG_LEVEL: Logging level (debug/info/warn/error)
- RATE_LIMIT_WINDOW_MS: Rate limit window in milliseconds (default: 900000)
- RATE_LIMIT_MAX_REQUESTS: Max requests per window (default: 100)
- WS_MAX_CONNECTIONS: Max WebSocket connections (default: 1000)
- WS_HEARTBEAT_INTERVAL: WebSocket heartbeat interval (default: 30000)
- MAX_CONCURRENT_REQUESTS: Max concurrent requests (default: 100)
- REQUEST_TIMEOUT: Request timeout in milliseconds (default: 30000)
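For reference, a minimal sketch of reading these variables in TypeScript. The variable names and defaults come from the list above; the ProxyConfig interface and loadConfig function are illustrative names, not the project's actual config module.

// Illustrative config loader; the real module under src/ may differ.
export interface ProxyConfig {
  openRouterApiKey: string;
  port: number;
  host: string;
  rateLimitWindowMs: number;
  rateLimitMaxRequests: number;
  wsMaxConnections: number;
  wsHeartbeatInterval: number;
  maxConcurrentRequests: number;
  requestTimeout: number;
}

const toInt = (value: string | undefined, fallback: number): number =>
  value !== undefined ? Number.parseInt(value, 10) : fallback;

export function loadConfig(env: NodeJS.ProcessEnv = process.env): ProxyConfig {
  if (!env.OPENROUTER_API_KEY) {
    throw new Error('OPENROUTER_API_KEY is required');
  }
  return {
    openRouterApiKey: env.OPENROUTER_API_KEY,
    port: toInt(env.PORT, 3000),
    host: env.HOST ?? '0.0.0.0',
    rateLimitWindowMs: toInt(env.RATE_LIMIT_WINDOW_MS, 900000),
    rateLimitMaxRequests: toInt(env.RATE_LIMIT_MAX_REQUESTS, 100),
    wsMaxConnections: toInt(env.WS_MAX_CONNECTIONS, 1000),
    wsHeartbeatInterval: toInt(env.WS_HEARTBEAT_INTERVAL, 30000),
    maxConcurrentRequests: toInt(env.MAX_CONCURRENT_REQUESTS, 100),
    requestTimeout: toInt(env.REQUEST_TIMEOUT, 30000),
  };
}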
Check service health status.
Response:
{
"status": "healthy",
"timestamp": "2024-01-01T00:00:00.000Z",
"uptime": 123.45,
"version": "1.0.0",
"environment": "production"
}
Create a completion using the specified model.
Request Body:
{
"model": "openai/gpt-4o",
"messages": [
{
"role": "user",
"content": "Hello, world!"
}
],
"temperature": 0.7,
"max_tokens": 100,
"stream": false
}
Response:
{
"id": "chatcmpl-123",
"choices": [
{
"finish_reason": "stop",
"message": {
"content": "Hello! How can I help you today?",
"role": "assistant"
}
}
],
"usage": {
"prompt_tokens": 10,
"completion_tokens": 15,
"total_tokens": 25
},
"model": "openai/gpt-4o",
"created": 1704067200,
"object": "chat.completion"
}
Create a streaming completion using the specified model.
Request Body:
{
"model": "openai/gpt-4o",
"messages": [
{
"role": "user",
"content": "Tell me a story"
}
],
"temperature": 0.7,
"max_tokens": 500,
"stream": true
}
Response: Server-Sent Events stream
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1704067200,"model":"openai/gpt-4o","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1704067200,"model":"openai/gpt-4o","choices":[{"index":0,"delta":{"content":"Once"},"finish_reason":null}]}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1704067200,"model":"openai/gpt-4o","choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":null}]}
data: [DONE]
List all available models with optional filtering and pagination.
Query Parameters:
- provider (optional): Filter by provider (e.g., "openai", "anthropic")
- search (optional): Search in model name or description
- limit (optional): Number of models to return (default: 50, max: 100)
- offset (optional): Number of models to skip (default: 0)
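As a worked example of limit/offset paging, the sketch below follows the hasMore flag from the response shown further down. It assumes the proxy is running locally on the default port; the ModelsPage type and listAllModels helper are illustrative names.

// Illustrative pagination loop over GET /api/v1/models.
interface ModelsPage {
  data: Array<{ id: string; name: string }>;
  pagination: { total: number; limit: number; offset: number; hasMore: boolean };
}

async function listAllModels(baseUrl = 'http://localhost:3000'): Promise<string[]> {
  const ids: string[] = [];
  const limit = 100; // documented maximum per request
  let offset = 0;
  while (true) {
    const res = await fetch(`${baseUrl}/api/v1/models?limit=${limit}&offset=${offset}`);
    const page = (await res.json()) as ModelsPage;
    ids.push(...page.data.map((m) => m.id));
    if (!page.pagination.hasMore) break;
    offset += limit;
  }
  return ids;
}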
Example:
GET /api/v1/models?provider=openai&search=gpt&limit=10&offset=0
Response:
{
"data": [
{
"id": "openai/gpt-4o",
"name": "GPT-4o",
"description": "Most advanced GPT-4 model",
"context_length": 128000,
"pricing": {
"prompt": "0.005",
"completion": "0.015"
},
"supported_parameters": ["temperature", "max_tokens", "top_p"],
"is_moderated": true,
"max_completion_tokens": 4096
}
],
"pagination": {
"total": 150,
"limit": 10,
"offset": 0,
"hasMore": true
}
}
Get detailed information about a specific model.
Example:
GET /api/v1/models/openai/gpt-4o
Response:
{
"data": {
"id": "openai/gpt-4o",
"name": "GPT-4o",
"description": "Most advanced GPT-4 model",
"context_length": 128000,
"pricing": {
"prompt": "0.005",
"completion": "0.015"
},
"supported_parameters": ["temperature", "max_tokens", "top_p", "frequency_penalty", "presence_penalty"],
"is_moderated": true,
"max_completion_tokens": 4096
}
}
Get supported parameters for a specific model.
Example:
GET /api/v1/models/openai/gpt-4o/parameters
Response:
{
"data": {
"model": "openai/gpt-4o",
"supported_parameters": [
"temperature",
"max_tokens",
"top_p",
"frequency_penalty",
"presence_penalty",
"stop",
"stream"
]
}
}
Get pricing information for a specific model.
Example:
GET /api/v1/models/openai/gpt-4o/pricing
Response:
{
"data": {
"model": "openai/gpt-4o",
"pricing": {
"prompt": "0.005",
"completion": "0.015"
}
}
}
Get top models by context length.
Query Parameters:
- limit (optional): Number of models to return (default: 10)
Example:
GET /api/v1/models/top?limit=5
Response:
{
"data": [
{
"id": "anthropic/claude-3-5-sonnet-20241022",
"name": "Claude 3.5 Sonnet",
"context_length": 200000,
"pricing": {
"prompt": "0.003",
"completion": "0.015"
}
}
]
}
Search models by query.
Query Parameters:
- q (required): Search query
- limit (optional): Number of results to return (default: 20)
Example:
GET /api/v1/models/search?q=code&limit=5
Response:
{
"data": [
{
"id": "openai/gpt-4o",
"name": "GPT-4o",
"description": "Most advanced GPT-4 model with code capabilities"
}
],
"query": "code",
"total": 25
}
Get all available providers.
Response:
{
"data": [
"openai",
"anthropic",
"google",
"meta",
"mistral"
]
}
Get models by provider.
Example:
GET /api/v1/models/providers/openai
Response:
{
"data": [
{
"id": "openai/gpt-4o",
"name": "GPT-4o",
"context_length": 128000
}
],
"provider": "openai"
}
Connect to the WebSocket endpoint:
ws://localhost:3000/ws
Send an inference request:
{
"type": "inference_request",
"id": "req-123",
"data": {
"model": "openai/gpt-4o",
"messages": [
{
"role": "user",
"content": "Hello, world!"
}
],
"temperature": 0.7,
"max_tokens": 100
}
}
Inference response:
{
"type": "inference_response",
"id": "req-123",
"data": {
"content": "Hello! How can I help you today?",
"finish_reason": "stop",
"usage": {
"prompt_tokens": 10,
"completion_tokens": 15,
"total_tokens": 25
},
"model": "openai/gpt-4o",
"created": 1704067200
}
}
Heartbeat message:
{
"type": "heartbeat",
"timestamp": 1704067200000
}
Error message:
{
"type": "error",
"id": "req-123",
"error": {
"code": 400,
"message": "Invalid model",
"type": "validation"
}
}
Close message:
{
"type": "close",
"reason": "Client requested close",
"code": 1000
}
JavaScript (fetch) example:
// Standard completion
const response = await fetch('http://localhost:3000/api/v1/inference', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({
model: 'openai/gpt-4o',
messages: [
{ role: 'user', content: 'Hello, world!' }
],
temperature: 0.7,
max_tokens: 100
})
});
const data = await response.json();
console.log(data.choices[0].message.content);
JavaScript streaming (SSE) example:
// Streaming completion
const response = await fetch('http://localhost:3000/api/v1/inference/stream', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({
model: 'openai/gpt-4o',
messages: [
{ role: 'user', content: 'Tell me a story' }
],
temperature: 0.7,
max_tokens: 500,
stream: true
})
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
const { done, value } = await reader.read();
if (done) break;
const chunk = decoder.decode(value);
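    // Note: for brevity this assumes each SSE line arrives intact within a single chunk;
    // a robust client would buffer partial lines across reads.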
const lines = chunk.split('\n');
for (const line of lines) {
if (line.startsWith('data: ')) {
const data = line.slice(6);
if (data === '[DONE]') break;
try {
const parsed = JSON.parse(data);
if (parsed.choices?.[0]?.delta?.content) {
console.log(parsed.choices[0].delta.content);
}
} catch (e) {
// Ignore invalid JSON
}
}
}
}
JavaScript WebSocket example:
const ws = new WebSocket('ws://localhost:3000/ws');
ws.onopen = () => {
// Send inference request
ws.send(JSON.stringify({
type: 'inference_request',
id: 'req-123',
data: {
model: 'openai/gpt-4o',
messages: [
{ role: 'user', content: 'Hello, world!' }
],
temperature: 0.7
}
}));
};
ws.onmessage = (event) => {
const message = JSON.parse(event.data);
switch (message.type) {
case 'inference_response':
if (message.data.content) {
console.log(message.data.content);
}
if (message.data.finish_reason) {
console.log('Finished:', message.data.finish_reason);
}
break;
case 'error':
console.error('Error:', message.error.message);
break;
case 'heartbeat':
console.log('Heartbeat received');
break;
}
};
ws.onclose = () => {
console.log('WebSocket connection closed');
};
Python example:
import requests
import json
# Standard completion
response = requests.post('http://localhost:3000/api/v1/inference',
json={
'model': 'openai/gpt-4o',
'messages': [
{'role': 'user', 'content': 'Hello, world!'}
],
'temperature': 0.7,
'max_tokens': 100
}
)
data = response.json()
print(data['choices'][0]['message']['content'])
cURL examples:
# Standard completion
curl -X POST http://localhost:3000/api/v1/inference \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4o",
"messages": [
{"role": "user", "content": "Hello, world!"}
],
"temperature": 0.7,
"max_tokens": 100
}'
# List models
curl http://localhost:3000/api/v1/models
# Get model details
curl http://localhost:3000/api/v1/models/openai/gpt-4o
# Search models
curl "http://localhost:3000/api/v1/models/search?q=gpt&limit=5"All errors follow this format:
{
"error": {
"code": 400,
"message": "Validation error",
"type": "validation",
"details": {
"field": "model",
"message": "Model is required"
}
}
}
Error types:
- validation: Request validation failed
- rate_limit: Rate limit exceeded
- openrouter: OpenRouter API error
- internal: Internal server error
HTTP status codes:
- 400: Bad Request - Invalid request data
- 404: Not Found - Model or endpoint not found
- 429: Too Many Requests - Rate limit exceeded
- 500: Internal Server Error - Server error
- 502: Bad Gateway - OpenRouter API error
- 503: Service Unavailable - Service temporarily unavailable
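To make the error envelope concrete, here is a rough TypeScript sketch of client-side handling. The field names and type values come from the format above; the ProxyErrorBody type and callInference helper are illustrative names, not part of the service.

// Illustrative client-side error handling for the documented error format.
interface ProxyErrorBody {
  error: {
    code: number;
    message: string;
    type: 'validation' | 'rate_limit' | 'openrouter' | 'internal';
    details?: { field?: string; message?: string };
  };
}

async function callInference(body: unknown): Promise<unknown> {
  const res = await fetch('http://localhost:3000/api/v1/inference', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  if (res.ok) return res.json();

  const { error } = (await res.json()) as ProxyErrorBody;
  switch (error.type) {
    case 'validation':
      throw new Error(`Bad request (${error.details?.field ?? 'unknown field'}): ${error.message}`);
    case 'rate_limit':
      throw new Error('Rate limited - retry after the window resets');
    case 'openrouter':
      throw new Error(`Upstream OpenRouter error: ${error.message}`);
    default:
      throw new Error(`Server error ${error.code}: ${error.message}`);
  }
}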
The service implements IP-based rate limiting:
- Default: 100 requests per 15 minutes per IP
- Inference endpoints: 50 requests per 15 minutes per IP
- WebSocket: 5 connections per minute per IP
Rate limit headers are included in responses:
- X-RateLimit-Limit: Maximum requests allowed
- X-RateLimit-Remaining: Requests remaining in current window
- X-RateLimit-Reset: Time when the rate limit resets
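A short sketch of how a client might react to these headers. It assumes X-RateLimit-Reset is a Unix timestamp in seconds, which this README does not specify, so verify against real responses; the helper name is illustrative.

// Illustrative back-off check based on the documented rate-limit headers.
async function fetchWithRateLimitInfo(url: string, init?: RequestInit): Promise<Response> {
  const res = await fetch(url, init);
  const remaining = res.headers.get('X-RateLimit-Remaining');
  const reset = res.headers.get('X-RateLimit-Reset');
  if (remaining !== null && Number(remaining) === 0 && reset !== null) {
    // Assumption: reset header is a Unix timestamp in seconds.
    const waitMs = Math.max(0, Number(reset) * 1000 - Date.now());
    console.warn(`Rate limit exhausted; window resets in ~${Math.ceil(waitMs / 1000)}s`);
  }
  return res;
}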
- npm run dev - Start development server with hot reload
- npm run build - Build the project
- npm start - Start production server
- npm test - Run tests
- npm run test:watch - Run tests in watch mode
- npm run test:coverage - Run tests with coverage
- npm run lint - Run ESLint
- npm run lint:fix - Fix ESLint errors
src/
├── controllers/ # Request handlers
├── services/ # Business logic
├── middleware/ # Express middleware
├── routes/ # API routes
├── types/ # TypeScript definitions
├── utils/ # Utility functions
├── app.ts # Express app setup
└── server.ts # Server entry point
The project includes comprehensive tests:
- Unit tests: Test individual functions and classes
- Integration tests: Test complete request/response cycles
- Load tests: Test performance under load
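The README does not pin down the test stack; as one possible shape, assuming Jest with supertest and that src/app.ts exports the configured Express app (both assumptions), an integration test might look like:

// Hypothetical integration test sketch (Jest + supertest assumed).
import request from 'supertest';
import { app } from '../src/app'; // assumes app.ts exports the Express app

describe('GET /health', () => {
  it('reports a healthy service', async () => {
    const res = await request(app).get('/health').expect(200);
    expect(res.body.status).toBe('healthy');
    expect(typeof res.body.uptime).toBe('number');
  });
});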
Run tests:
npm test
# Build the image
docker build -f docker/Dockerfile -t llm-proxy .
# Run the container
docker run -p 3000:3000 -e OPENROUTER_API_KEY=your-key llm-proxy
# Start all services
docker-compose -f docker/docker-compose.yml up -d
# Stop all services
docker-compose -f docker/docker-compose.yml down
The service provides monitoring endpoints:
- GET /health - Health check with uptime and version info
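For example, an external monitor could poll this endpoint. The sketch below relies only on the /health response fields shown earlier; the checkHealth name is illustrative.

// Simple liveness probe against the documented /health endpoint.
async function checkHealth(baseUrl = 'http://localhost:3000'): Promise<boolean> {
  try {
    const res = await fetch(`${baseUrl}/health`);
    if (!res.ok) return false;
    const body = (await res.json()) as { status: string; uptime: number; version: string };
    return body.status === 'healthy';
  } catch {
    return false; // network error or service down
  }
}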
- IP-based rate limiting
- Input validation and sanitization
- CORS protection
- Security headers (Helmet)
- No authentication required (stateless design)
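As a rough illustration of how such a stack is commonly wired in Express, the sketch below uses helmet and cors (named above) plus express-rate-limit as a stand-in limiter configured with the documented defaults; it is not the project's actual src/app.ts.

// Illustrative Express security middleware wiring; the real setup may differ.
import express from 'express';
import helmet from 'helmet';
import cors from 'cors';
import rateLimit from 'express-rate-limit';

const app = express();
app.use(helmet());            // security headers
app.use(cors());              // CORS protection
app.use(express.json({ limit: '1mb' }));
app.use(
  rateLimit({
    windowMs: 15 * 60 * 1000, // 15-minute window (RATE_LIMIT_WINDOW_MS default)
    max: 100,                 // 100 requests per IP (RATE_LIMIT_MAX_REQUESTS default)
    standardHeaders: true,
    legacyHeaders: true,      // emit the X-RateLimit-* headers documented above
  })
);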
- Connection pooling for OpenRouter API
- Efficient WebSocket handling
- Memory-optimized streaming
- Request/response compression
- Caching for model information
- Stateless design for horizontal scaling
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
MIT License - see LICENSE file for details.
For issues and questions, please open an issue on GitHub.