Skip to content
forked from bentoml/OpenLLM

Run any open-source LLMs, such as Llama 2, Mistral, as OpenAI compatible API endpoint in the cloud.

License

Notifications You must be signed in to change notification settings

isperfee/OpenLLM

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Banner for OpenLLM

🦾 OpenLLM: Self-Hosting LLMs Made Easy

pypi_status test_pypi_status ci pre-commit.ci status
Twitter Discord

📖 Introduction

OpenLLM helps developers run any open-source LLMs, such as Llama 2 and Mistral, as OpenAI-compatible API endpoints, locally and in the cloud, optimized for serving throughput and production deployment.

  • 🚂 Support a wide range of open-source LLMs including LLMs fine-tuned with your own data
  • ⛓️ OpenAI compatible API endpoints for seamless transition from your LLM app to open-source LLMs
  • 🔥 State-of-the-art serving and inference performance
  • 🎯 Simplified cloud deployment via BentoML

Gif showing OpenLLM Intro


💾 TL/DR

For starter, we provide two ways to quickly try out OpenLLM:

Jupyter Notebooks

Try this OpenLLM tutorial in Google Colab: Serving Llama 2 with OpenLLM.

🏃 Get started

The following provides instructions for how to get started with OpenLLM locally.

Prerequisites

You have installed Python 3.9 (or later) and pip. We highly recommend using a Virtual Environment to prevent package conflicts.

Install OpenLLM

Install OpenLLM by using pip as follows:

pip install openllm

To verify the installation, run:

$ openllm -h

Start a LLM server

OpenLLM allows you to quickly spin up an LLM server using openllm start. For example, to start a Llama 3 8B server, run the following:

openllm start meta-llama/Meta-Llama-3-8B

To interact with the server, you can visit the web UI at http://0.0.0.0:3000/ or send a request using curl. You can also use OpenLLM’s built-in Python client to interact with the server:

import openllm

client = openllm.HTTPClient('http://localhost:3000')
client.generate('Explain to me the difference between "further" and "farther"')

OpenLLM seamlessly supports many models and their variants. You can specify different variants of the model to be served. For example:

openllm start <model_id> --<options>

Note

OpenLLM supports specifying fine-tuning weights and quantized weights for any of the supported models as long as they can be loaded with the model architecture. Use the openllm models command to see the complete list of supported models, their architectures, and their variants.

🧩 Supported models

OpenLLM currently supports the following models. By default, OpenLLM doesn't include dependencies to run all models. The extra model-specific dependencies can be installed with the instructions below.

Baichuan

Quickstart

Note: Baichuan requires to install with:

pip install "openllm[baichuan]"

Run the following command to quickly spin up a Baichuan server:

TRUST_REMOTE_CODE=True openllm start baichuan-inc/baichuan-7b

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any Baichuan variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Baichuan-compatible models.

Supported models

You can specify any of the following Baichuan models via openllm start:

ChatGLM

Quickstart

Note: ChatGLM requires to install with:

pip install "openllm[chatglm]"

Run the following command to quickly spin up a ChatGLM server:

TRUST_REMOTE_CODE=True openllm start thudm/chatglm-6b

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any ChatGLM variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more ChatGLM-compatible models.

Supported models

You can specify any of the following ChatGLM models via openllm start:

Dbrx

Quickstart

Note: Dbrx requires to install with:

pip install "openllm[dbrx]"

Run the following command to quickly spin up a Dbrx server:

TRUST_REMOTE_CODE=True openllm start databricks/dbrx-instruct

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any Dbrx variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Dbrx-compatible models.

Supported models

You can specify any of the following Dbrx models via openllm start:

DollyV2

Quickstart

Run the following command to quickly spin up a DollyV2 server:

TRUST_REMOTE_CODE=True openllm start databricks/dolly-v2-3b

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any DollyV2 variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more DollyV2-compatible models.

Supported models

You can specify any of the following DollyV2 models via openllm start:

Falcon

Quickstart

Note: Falcon requires to install with:

pip install "openllm[falcon]"

Run the following command to quickly spin up a Falcon server:

TRUST_REMOTE_CODE=True openllm start tiiuae/falcon-7b

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any Falcon variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Falcon-compatible models.

Supported models

You can specify any of the following Falcon models via openllm start:

FlanT5

Quickstart

Run the following command to quickly spin up a FlanT5 server:

TRUST_REMOTE_CODE=True openllm start google/flan-t5-large

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any FlanT5 variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more FlanT5-compatible models.

Supported models

You can specify any of the following FlanT5 models via openllm start:

Gemma

Quickstart

Note: Gemma requires to install with:

pip install "openllm[gemma]"

Run the following command to quickly spin up a Gemma server:

TRUST_REMOTE_CODE=True openllm start google/gemma-7b

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any Gemma variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Gemma-compatible models.

Supported models

You can specify any of the following Gemma models via openllm start:

GPTNeoX

Quickstart

Run the following command to quickly spin up a GPTNeoX server:

TRUST_REMOTE_CODE=True openllm start eleutherai/gpt-neox-20b

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any GPTNeoX variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more GPTNeoX-compatible models.

Supported models

You can specify any of the following GPTNeoX models via openllm start:

Llama

Quickstart

Note: Llama requires to install with:

pip install "openllm[llama]"

Run the following command to quickly spin up a Llama server:

TRUST_REMOTE_CODE=True openllm start NousResearch/llama-2-7b-hf

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any Llama variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Llama-compatible models.

Supported models

You can specify any of the following Llama models via openllm start:

Mistral

Quickstart

Note: Mistral requires to install with:

pip install "openllm[mistral]"

Run the following command to quickly spin up a Mistral server:

TRUST_REMOTE_CODE=True openllm start mistralai/Mistral-7B-Instruct-v0.1

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any Mistral variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Mistral-compatible models.

Supported models

You can specify any of the following Mistral models via openllm start:

Mixtral

Quickstart

Note: Mixtral requires to install with:

pip install "openllm[mixtral]"

Run the following command to quickly spin up a Mixtral server:

TRUST_REMOTE_CODE=True openllm start mistralai/Mixtral-8x7B-Instruct-v0.1

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any Mixtral variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Mixtral-compatible models.

Supported models

You can specify any of the following Mixtral models via openllm start:

MPT

Quickstart

Note: MPT requires to install with:

pip install "openllm[mpt]"

Run the following command to quickly spin up a MPT server:

TRUST_REMOTE_CODE=True openllm start mosaicml/mpt-7b-instruct

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any MPT variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more MPT-compatible models.

Supported models

You can specify any of the following MPT models via openllm start:

OPT

Quickstart

Note: OPT requires to install with:

pip install "openllm[opt]"

Run the following command to quickly spin up a OPT server:

openllm start facebook/opt-1.3b

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any OPT variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more OPT-compatible models.

Supported models

You can specify any of the following OPT models via openllm start:

Phi

Quickstart

Note: Phi requires to install with:

pip install "openllm[phi]"

Run the following command to quickly spin up a Phi server:

TRUST_REMOTE_CODE=True openllm start microsoft/Phi-3-mini-4k-instruct

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any Phi variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Phi-compatible models.

Supported models

You can specify any of the following Phi models via openllm start:

Qwen

Quickstart

Note: Qwen requires to install with:

pip install "openllm[qwen]"

Run the following command to quickly spin up a Qwen server:

TRUST_REMOTE_CODE=True openllm start qwen/Qwen-7B-Chat

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any Qwen variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Qwen-compatible models.

Supported models

You can specify any of the following Qwen models via openllm start:

StableLM

Quickstart

Note: StableLM requires to install with:

pip install "openllm[stablelm]"

Run the following command to quickly spin up a StableLM server:

TRUST_REMOTE_CODE=True openllm start stabilityai/stablelm-tuned-alpha-3b

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any StableLM variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more StableLM-compatible models.

Supported models

You can specify any of the following StableLM models via openllm start:

StarCoder

Quickstart

Note: StarCoder requires to install with:

pip install "openllm[starcoder]"

Run the following command to quickly spin up a StarCoder server:

TRUST_REMOTE_CODE=True openllm start bigcode/starcoder

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any StarCoder variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more StarCoder-compatible models.

Supported models

You can specify any of the following StarCoder models via openllm start:

Yi

Quickstart

Note: Yi requires to install with:

pip install "openllm[yi]"

Run the following command to quickly spin up a Yi server:

TRUST_REMOTE_CODE=True openllm start 01-ai/Yi-6B

In a different terminal, run the following command to interact with the server:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Note: Any Yi variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Yi-compatible models.

Supported models

You can specify any of the following Yi models via openllm start:

More models will be integrated with OpenLLM and we welcome your contributions if you want to incorporate your custom LLMs into the ecosystem. Check out Adding a New Model Guide to learn more.

💻 Run your model on multiple GPUs

OpenLLM allows you to start your model server on multiple GPUs and specify the number of workers per resource assigned using the --workers-per-resource option. For example, if you have 4 available GPUs, you set the value as one divided by the number as only one instance of the Runner server will be spawned.

TRUST_REMOTE_CODE=True openllm start microsoft/phi-2 --workers-per-resource 0.25

Note

The amount of GPUs required depends on the model size itself. You can use the Model Memory Calculator from Hugging Face to calculate how much vRAM is needed to train and perform big model inference on a model and then plan your GPU strategy based on it.

When using the --workers-per-resource option with the openllm build command, the environment variable is saved into the resulting Bento.

For more information, see Resource scheduling strategy.

🛞 Runtime implementations

Different LLMs may support multiple runtime implementations. Models that have vLLM (vllm) supports will use vLLM by default, otherwise it fallback to use PyTorch (pt).

To specify a specific runtime for your chosen model, use the --backend option. For example:

openllm start meta-llama/Llama-2-7b-chat-hf --backend vllm

Note:

  1. To use the vLLM backend, you need a GPU with at least the Ampere architecture or newer and CUDA version 11.8.
  2. To see the backend options of each model supported by OpenLLM, see the Supported models section or run openllm models.

📐 Quantization

Quantization is a technique to reduce the storage and computation requirements for machine learning models, particularly during inference. By approximating floating-point numbers as integers (quantized values), quantization allows for faster computations, reduced memory footprint, and can make it feasible to deploy large models on resource-constrained devices.

OpenLLM supports the following quantization techniques

PyTorch backend

With PyTorch backend, OpenLLM supports int8, int4, and gptq.

For using int8 and int4 quantization through bitsandbytes, you can use the following command:

TRUST_REMOTE_CODE=True openllm start microsoft/phi-2 --quantize int8

To run inference with gptq, simply pass --quantize gptq:

openllm start TheBloke/Llama-2-7B-Chat-GPTQ --quantize gptq

Note

In order to run GPTQ, make sure you run pip install "openllm[gptq]" first to install the dependency. From the GPTQ paper, it is recommended to quantized the weights before serving. See AutoGPTQ for more information on GPTQ quantization.

vLLM backend

With vLLM backend, OpenLLM supports awq, squeezellm

To run inference with awq, simply pass --quantize awq:

openllm start TheBloke/zephyr-7B-alpha-AWQ --quantize awq

To run inference with squeezellm, simply pass --quantize squeezellm:

openllm start squeeze-ai-lab/sq-llama-2-7b-w4-s0 --quantize squeezellm --serialization legacy

Important

Since both squeezellm and awq are weight-aware quantization methods, meaning the quantization is done during training, all pre-trained weights needs to get quantized before inference time. Make sure to find compatible weights on HuggingFace Hub for your model of choice.

🛠️ Serving fine-tuning layers

PEFT, or Parameter-Efficient Fine-Tuning, is a methodology designed to fine-tune pre-trained models more efficiently. Instead of adjusting all model parameters, PEFT focuses on tuning only a subset, reducing computational and storage costs. LoRA (Low-Rank Adaptation) is one of the techniques supported by PEFT. It streamlines fine-tuning by using low-rank decomposition to represent weight updates, thereby drastically reducing the number of trainable parameters.

With OpenLLM, you can take advantage of the fine-tuning feature by serving models with any PEFT-compatible layers using the --adapter-id option. For example:

openllm start facebook/opt-6.7b --adapter-id aarnphm/opt-6-7b-quotes:default

OpenLLM also provides flexibility by supporting adapters from custom file paths:

openllm start facebook/opt-6.7b --adapter-id /path/to/adapters:local_adapter

To use multiple adapters, use the following format:

openllm start facebook/opt-6.7b --adapter-id aarnphm/opt-6.7b-lora:default --adapter-id aarnphm/opt-6.7b-french:french_lora

By default, all adapters will be injected into the models during startup. Adapters can be specified per request via adapter_name:

curl -X 'POST' \
  'http://localhost:3000/v1/generate' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "prompt": "What is the meaning of life?",
  "stop": [
    "philosopher"
  ],
  "llm_config": {
    "max_new_tokens": 256,
    "temperature": 0.75,
    "top_k": 15,
    "top_p": 1
  },
  "adapter_name": "default"
}'

To include this into the Bento, you can specify the --adapter-id option when using the openllm build command:

openllm build facebook/opt-6.7b --adapter-id ...

If you use a relative path for --adapter-id, you need to add --build-ctx.

openllm build facebook/opt-6.7b --adapter-id ./path/to/adapter_id --build-ctx .

Important

Fine-tuning support is still experimental and currently only works with PyTorch backend. vLLM support is coming soon.

⚙️ Integrations

OpenLLM is not just a standalone product; it's a building block designed to integrate with other powerful tools easily. We currently offer integration with OpenAI's Compatible Endpoints, LlamaIndex, LangChain, and Transformers Agents.

OpenAI Compatible Endpoints

OpenLLM Server can be used as a drop-in replacement for OpenAI's API. Simply specify the base_url to llm-endpoint/v1 and you are good to go:

import openai

client = openai.OpenAI(base_url='http://localhost:3000/v1', api_key='na')  # Here the server is running on 0.0.0.0:3000

completions = client.chat.completions.create(
  prompt='Write me a tag line for an ice cream shop.', model=model, max_tokens=64, stream=stream
)

The compatible endpoints supports /completions, /chat/completions, and /models

Note

You can find out OpenAI example clients under the examples folder.

To start a local LLM with llama_index, simply use llama_index.llms.openllm.OpenLLM:

import asyncio
from llama_index.llms.openllm import OpenLLM

llm = OpenLLM('HuggingFaceH4/zephyr-7b-alpha')

llm.complete('The meaning of life is')


async def main(prompt, **kwargs):
  async for it in llm.astream_chat(prompt, **kwargs):
    print(it)


asyncio.run(main('The time at San Francisco is'))

If there is a remote LLM Server running elsewhere, then you can use llama_index.llms.openllm.OpenLLMAPI:

from llama_index.llms.openllm import OpenLLMAPI

Note

All synchronous and asynchronous API from llama_index.llms.LLM are supported.

Spin up an OpenLLM server, and connect to it by specifying its URL:

from langchain.llms import OpenLLM

llm = OpenLLM(server_url='http://44.23.123.1:3000', server_type='http')
llm('What is the difference between a duck and a goose? And why there are so many Goose in Canada?')

Gif showing Agent integration


🚀 Deploying models to production

There are several ways to deploy your LLMs:

🐳 Docker container

  1. Building a Bento: With OpenLLM, you can easily build a Bento for a specific model, like mistralai/Mistral-7B-Instruct-v0.1, using the build command.:

    openllm build mistralai/Mistral-7B-Instruct-v0.1

    A Bento, in BentoML, is the unit of distribution. It packages your program's source code, models, files, artefacts, and dependencies.

  2. Containerize your Bento

    bentoml containerize <name:version>

    This generates a OCI-compatible docker image that can be deployed anywhere docker runs. For best scalability and reliability of your LLM service in production, we recommend deploy with BentoCloud。

☁️ BentoCloud

Deploy OpenLLM with BentoCloud, the inference platform for fast moving AI teams.

  1. Create a BentoCloud account: sign up here

  2. Log into your BentoCloud account:

    bentoml cloud login --api-token <your-api-token> --endpoint <bento-cloud-endpoint>

Note

Replace <your-api-token> and <bento-cloud-endpoint> with your specific API token and the BentoCloud endpoint respectively.

  1. Bulding a Bento: With OpenLLM, you can easily build a Bento for a specific model, such as mistralai/Mistral-7B-Instruct-v0.1:

    openllm build mistralai/Mistral-7B-Instruct-v0.1
  2. Pushing a Bento: Push your freshly-built Bento service to BentoCloud via the push command:

    bentoml push <name:version>
  3. Deploying a Bento: Deploy your LLMs to BentoCloud with a single bentoml deployment create command following the deployment instructions.

👥 Community

Engage with like-minded individuals passionate about LLMs, AI, and more on our Discord!

OpenLLM is actively maintained by the BentoML team. Feel free to reach out and join us in our pursuit to make LLMs more accessible and easy to use 👉 Join our Slack community!

🎁 Contributing

We welcome contributions! If you're interested in enhancing OpenLLM's capabilities or have any questions, don't hesitate to reach out in our discord channel.

Checkout our Developer Guide if you wish to contribute to OpenLLM's codebase.

📔 Citation

If you use OpenLLM in your research, we provide a citation to use:

@software{Pham_OpenLLM_Operating_LLMs_2023,
author = {Pham, Aaron and Yang, Chaoyu and Sheng, Sean and  Zhao, Shenyang and Lee, Sauyon and Jiang, Bo and Dong, Fog and Guan, Xipeng and Ming, Frost},
license = {Apache-2.0},
month = jun,
title = {{OpenLLM: Operating LLMs in production}},
url = {https://github.com/bentoml/OpenLLM},
year = {2023}
}

About

Run any open-source LLMs, such as Llama 2, Mistral, as OpenAI compatible API endpoint in the cloud.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 95.9%
  • Shell 2.4%
  • Other 1.7%