An endpoint server for efficiently serving quantized open-source LLMs for code.
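For context, a minimal sketch of what such a server does under the hood, assuming an AWQ-quantized checkpoint and vLLM's quantization support; the model name is illustrative and the repo's actual code may differ:

```python
# Sketch: loading and querying an AWQ-quantized code model with vLLM.
# The checkpoint name is illustrative; any AWQ model on the Hub works.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/CodeLlama-7B-AWQ", quantization="awq")
params = SamplingParams(temperature=0.2, max_tokens=128)

outputs = llm.generate(["def fibonacci(n):"], params)
print(outputs[0].outputs[0].text)
```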
Embedding-based semantic search app for poetry [app and EDA notebooks].
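The core pattern behind an app like this, as a hedged sketch using sentence-transformers; the encoder name and the tiny corpus are illustrative, not the repo's actual choices:

```python
# Sketch of embedding-based semantic search over a small poem corpus.
from sentence_transformers import SentenceTransformer, util

poems = [
    "Shall I compare thee to a summer's day?",
    "Two roads diverged in a yellow wood,",
    "Because I could not stop for Death,",
]
model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder
corpus_emb = model.encode(poems, convert_to_tensor=True)

# Embed the query and rank poems by cosine similarity.
query_emb = model.encode("a poem about choices", convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(poems[hit["corpus_id"]], round(hit["score"], 3))
```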
This repository demonstrates running LLMs on CPUs with packages like llamafile, highlighting the low latency, high throughput, and cost effectiveness of CPU-based inference and serving.
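A hedged example of the client side, assuming a llamafile has already been started locally with its built-in server on the default port (8080); llamafile exposes an OpenAI-compatible endpoint:

```python
# Sketch: querying a locally running llamafile server.
# Assumes one was started beforehand, e.g.: ./model.llamafile --server
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",  # the local server accepts any model name
        "messages": [
            {"role": "user", "content": "Summarize CPU inference in one line."}
        ],
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```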
EchoSight is a tool that helps visually impaired individuals by audibly describing images captured with a Raspberry Pi Camera or supplied via an image path or URL, across different operating systems.
Dockerized LLM inference server with constrained output (JSON mode), built on top of vLLM and outlines. Faster, cheaper and without rate limits. Compare the quality and latency to your current LLM API provider.
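To illustrate the constrained-output idea, a sketch using the outlines library's schema-guided generation (v0.x API, which is version-dependent); the repo itself wraps vLLM in Docker, and the model and schema here are illustrative:

```python
# Sketch: schema-constrained JSON generation with outlines.
# Decoding is restricted so the output always parses as the schema.
from pydantic import BaseModel
import outlines

class Invoice(BaseModel):
    customer: str
    total: float

model = outlines.models.transformers("microsoft/phi-2")  # illustrative model
generator = outlines.generate.json(model, Invoice)

invoice = generator("Extract the invoice: Acme Corp owes $1,250.")
print(invoice)  # a valid Invoice instance, never malformed JSON
```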
MLOps library for LLM deployment with the vLLM engine on RunPod's infrastructure.
Low-latency JSON generation using LLMs ⚡️
A Discord bot that can call LLMs through either Hugging Face or vLLM on Windows, with function-calling support.
Call many AIs from a single API.
A simple service that integrates vLLM with Ray Serve for fast and scalable LLM serving.
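A minimal sketch of that integration, assuming a single-GPU deployment; the model name, request shape, and run command are illustrative:

```python
# Sketch: wrapping a vLLM engine in a Ray Serve deployment.
from ray import serve
from starlette.requests import Request
from vllm import LLM, SamplingParams

@serve.deployment(ray_actor_options={"num_gpus": 1})
class VLLMDeployment:
    def __init__(self, model_name: str):
        self.llm = LLM(model=model_name)

    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        params = SamplingParams(max_tokens=body.get("max_tokens", 128))
        # Note: LLM.generate is blocking; production code would use
        # AsyncLLMEngine for request-level concurrency.
        outputs = self.llm.generate([body["prompt"]], params)
        return {"text": outputs[0].outputs[0].text}

app = VLLMDeployment.bind("facebook/opt-125m")  # illustrative model
# Run locally with:  serve run my_module:app
```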