MoLink Project

MoLink (Model-Link) is a distributed LLM serving system that aims to deliver high-performance LLM inference using distributed computing resources, which may be spread across the Internet. MoLink can also run over heterogeneous devices.

Installation Guide

MoLink is built on top of vLLM and aims to stay compatible with its latest version; currently vLLM v0.8.5.post1 is supported. Please ensure that your server meets the requirements for running vLLM (refer to the vLLM installation documentation).

You can install MoLink with the following steps:

git clone https://github.com/oldcpple/MoLink.git
cd MoLink
pip install -e .
pip install grpcio-tools==1.71.0 protobuf==5.29.0

The grpcio-tools and protobuf packages are installed separately at pinned versions because they conflict with vLLM's own dependency constraints.
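If you want to verify the result, the following minimal sketch (plain Python, not a MoLink API) checks that the pinned versions from the command above are the ones actually installed:

# Sanity-check the pinned packages installed above (a sketch, not part of MoLink).
from importlib.metadata import version

for pkg, wanted in [("grpcio-tools", "1.71.0"), ("protobuf", "5.29.0")]:
    installed = version(pkg)
    status = "OK" if installed == wanted else f"expected {wanted}"
    print(f"{pkg}: {installed} ({status})")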

Usage Guide

Once MoLink is successfully installed, you can follow this guide to deploy LLMs with GPU servers.

As an example, assume we have two servers, each with one GPU, and want to deploy a 70B LLaMA2 model. On the first server, simply run:

python -m molink.entrypoints.api_server --model meta-llama/Llama-2-70b-chat-hf --port 8080 --dtype=half --max_model_len 4096 --pipeline_parallel_size 2 --serving_layers 0,39

Two important arguments are pipeline_parallel_size and serving_layers. Set pipeline_parallel_size to the number of servers used to serve the model (2 in this example). serving_layers specifies the range of transformer layers this server will hold; check the config.json of your target model on the Hugging Face Hub to see how many layers it has in total before deciding how to split them (the 70B LLaMA2 model has 80 layers, split here as 0-39 and 40-79 across the two servers). All other arguments are inherited from vLLM and remain compatible with it.
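To pick the --serving_layers split for other models or stage counts, a minimal sketch like the following can help; it assumes the transformers library (already a vLLM dependency) and access to the model repository on the Hugging Face Hub, and is not part of MoLink itself:

# Read the total transformer layer count from the model config and split it
# evenly across pipeline stages (a sketch; adjust the split to your hardware).
from transformers import AutoConfig

model = "meta-llama/Llama-2-70b-chat-hf"
num_stages = 2  # should match --pipeline_parallel_size

total = AutoConfig.from_pretrained(model).num_hidden_layers  # 80 for this model
per_stage = total // num_stages
for rank in range(num_stages):
    start = rank * per_stage
    end = total - 1 if rank == num_stages - 1 else start + per_stage - 1
    print(f"stage {rank}: --serving_layers {start},{end}")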

During startup, the first server will print logs like the following:

DISTRIBUTED SERVICE INFO: MoLink gRPC server works at 172.17.0.17:50051
DISTRIBUTED SERVICE INFO: If this is the first node of the swarm, you can copy the DHT INFO as the initial peer of following nodes

Simply copy the address of the DHT server from the first line (172.17.0.17:50051 in this example) and pass it as the initial_peer in the following command to start the second server:

python -m molink.entrypoints.api_server --model meta-llama/Llama-2-70b-chat-hf --port 9090 --dtype=half --max_model_len 4096 --pipeline_parallel_size 2 --serving_layers 40,79 --initial_peer 172.17.0.17:50051

You can also serve the LLM with a single node, in which case the system falls back to vanilla vLLM:

python -m molink.entrypoints.api_server --model meta-llama/Llama-2-70b-chat-hf --port 8080 --dtype=half --max_model_len 4096

For multi-GPU nodes, you can enable tensor parallelism by specifying the --tensor_parallel_size argument. Hybrid pipelines are also supported: the tensor-parallel size of each pipeline stage can differ, and the devices can be heterogeneous, as illustrated below.
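For illustration only, a hybrid two-stage deployment in which the first server uses two GPUs for tensor parallelism and the second uses one could be launched by combining the flags already shown above (adjust the layer ranges and the initial peer address to your setup):

python -m molink.entrypoints.api_server --model meta-llama/Llama-2-70b-chat-hf --port 8080 --dtype=half --max_model_len 4096 --pipeline_parallel_size 2 --serving_layers 0,39 --tensor_parallel_size 2
python -m molink.entrypoints.api_server --model meta-llama/Llama-2-70b-chat-hf --port 9090 --dtype=half --max_model_len 4096 --pipeline_parallel_size 2 --serving_layers 40,79 --tensor_parallel_size 1 --initial_peer 172.17.0.17:50051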

The inference service is also compatible with vLLM's API server. For example, you can simply run (replace localhost with your server's IP if you are not running locally):

curl http://localhost:8080/generate \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "San Francisco is a",
        "max_tokens": 20,
        "temperature": 0
    }'
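
The same request can be issued from Python; the sketch below uses the third-party requests library and simply prints the raw JSON response rather than assuming a particular response schema:

# POST the same payload as the curl example above (a sketch using `requests`).
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={"prompt": "San Francisco is a", "max_tokens": 20, "temperature": 0},
)
resp.raise_for_status()
print(resp.json())  # the generated text is returned as JSON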

MoLink also supports an OpenAI-compatible server; you can start one with:

python -m molink.entrypoints.openai.api_server --model XXXXX (same as examples above)

And access the API server like:

curl http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "San Francisco is a",
        "max_tokens": 20,
        "temperature": 0
    }'

or

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
  model="meta-llama/Llama-2-70b-chat-hf",
  messages=[
    {"role": "user", "content": "Hello!"}
  ]
)

print(completion.choices[0].message)

Supported Model Architectures:

  • BaichuanForCausalLM
  • BloomForCausalLM
  • ChatGLMForCausalLM
  • CohereForCausalLM
  • DeepseekForCausalLM
  • DeepseekV2ForCausalLM
  • DeepseekV3ForCausalLM
  • FalconForCausalLM
  • GemmaForCausalLM
  • Gemma2ForCausalLM
  • GlmForCausalLM
  • GPT2LMHeadModel
  • LlamaForCausalLM
  • MambaForCausalLM
  • MixtralForCausalLM
  • PhiForCausalLM
  • Phi3ForCausalLM
  • QWenLMHeadModel
  • Qwen2MoeForCausalLM
  • Qwen2ForCausalLM
  • Qwen3MoeForCausalLM
  • Qwen3ForCausalLM
