- Tri-process asynchronous collaboration: tokenization, model inference, and detokenization are performed asynchronously, leading to a considerable improvement in GPU utilization.
- Nopad (Unpad): offers support for nopad attention operations across multiple models to efficiently handle requests with large length disparities.
- Dynamic Batch: enables dynamic batch scheduling of requests.
- [FlashAttention](https://github.com/Dao-AILab/flash-attention): incorporates FlashAttention to improve speed and reduce GPU memory footprint during inference.
- Tensor Parallelism: utilizes tensor parallelism over multiple GPUs for faster inference.
- [Token Attention](./docs/TokenAttention.md): implements a token-wise KV cache memory management mechanism, allowing for zero memory waste during inference (see the sketch after this list).
- High-performance Router: collaborates with Token Attention to meticulously manage the GPU memory of each token, thereby optimizing system throughput.
- Int8KV Cache: stores the KV cache in INT8, nearly doubling the number of tokens that fit in GPU memory since each INT8 entry takes half the bytes of FP16. Currently only LLaMA models are supported.
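
As a rough illustration of the token-wise idea behind Token Attention, the sketch below hands out KV cache slots one per generated token from a shared pool, so no memory is padded or pre-reserved. It is a minimal, hypothetical example; `TokenKVPool`, `alloc`, and `free_slots` are illustrative names, not LightLLM's actual API.

```python
# Hypothetical sketch of token-wise KV cache slot management
# (illustrative only; not LightLLM's implementation).
import torch

class TokenKVPool:
    def __init__(self, max_tokens: int):
        # One flag per KV cache slot; True means the slot is free.
        self.free = torch.ones(max_tokens, dtype=torch.bool)

    def alloc(self, n: int) -> torch.Tensor:
        # Hand out exactly n free slots, one per newly generated token,
        # instead of reserving memory for a request's maximum length.
        idx = torch.nonzero(self.free).squeeze(1)[:n]
        assert idx.numel() == n, "KV cache pool exhausted"
        self.free[idx] = False
        return idx

    def free_slots(self, idx: torch.Tensor) -> None:
        # Return a finished request's slots to the pool immediately.
        self.free[idx] = True
```

A router built on such a pool knows exactly how many token slots remain at any moment, which is what allows it to admit new requests without over-committing GPU memory.
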
## Supported Model List
The following table lists the supported models along with any special arguments or notes required to run them.

| Model | Notes |
|-------|-------|
|[Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)|`--eos_id 151645 --trust_remote_code`, and run `pip install git+https://github.com/huggingface/transformers`|

- [2025/02] 🔥 LightLLM v1.0.0 release, achieving the **fastest DeepSeek-R1** serving performance on a single H200 machine.

## Get started
### Installation
Use LightLLM with `docker`:
```shell
docker pull ghcr.io/modeltc/lightllm:main
```
To start a container with GPU support and port mapping:
```shell
docker run -it --gpus all -p 8080:8080 \
--shm-size 1g -v your_local_path:/data/ \
ghcr.io/modeltc/lightllm:main /bin/bash
```
Note: if multiple GPUs are used, `--shm-size` in the `docker run` command should be increased.
Alternatively, you can [build the docker image](https://lightllm-en.readthedocs.io/en/latest/getting_started/installation.html#installing-with-docker) or [install from source with pip](https://lightllm-en.readthedocs.io/en/latest/getting_started/installation.html#installing-from-source).
### Quick Start
LightLLM provides LLM inference services with state-of-the-art throughput via its efficient request router and TokenAttention.
We provide examples of launching the LightLLM service and querying the model (via the console and Python) for both text and multimodal models.
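
For example, once the server is running, a text model can be queried from Python roughly as follows. This is a minimal sketch: the host, port, and prompt are placeholders, and the request fields are assumed to match the `/generate` API described in the documentation.

```python
# Query a running LightLLM server (host/port/prompt are placeholders;
# fields assumed to follow the documented /generate API).
import requests

resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "What is AI?",
        "parameters": {"max_new_tokens": 32},
    },
)
print(resp.json())
```
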
If LightLLM is run with `--tp > 1`, the visual model runs on GPU 0.
Input image format: a list of dicts like `{'type': 'url'/'base64', 'data': xxx}`.
The special image tag for Qwen-VL is `<img></img>` (`<image>` for Llava). The length of `data["multimodal_params"]["images"]` should equal the number of image tags in the prompt; that number can be 0, 1, 2, ... (see the sketch below).
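
Putting the above together, a multimodal request for a Qwen-VL style model might look like the sketch below: one `<img></img>` tag in the prompt and one matching entry in `multimodal_params["images"]`. The host, port, and image URL are placeholders; the field names follow the format described above.

```python
# Minimal multimodal request sketch (URL and endpoint host are placeholders).
import requests

payload = {
    "inputs": "<img></img>Describe this picture.",  # one image tag in the prompt
    "parameters": {"max_new_tokens": 64},
    "multimodal_params": {
        # one entry per image tag; 'type' may be 'url' or 'base64'
        "images": [{"type": "url", "data": "https://example.com/cat.jpg"}],
    },
}
resp = requests.post("http://127.0.0.1:8080/generate", json=payload)
print(resp.json())
```
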
### Other
Please refer to the [documentation](https://lightllm-en.readthedocs.io/en/latest/) for more information.
## Performance
LightLLM provides high-throughput serving. A performance comparison between LightLLM and vLLM is shown [here](https://lightllm-en.readthedocs.io/en/latest/dev/performance.html); compared with vLLM (up to version 0.1.2), we have achieved 2x higher throughput.
Learn more in the release blogs: [v1.0.0 blog](https://www.light-ai.top/lightllm-blog//by%20mtc%20team/2025/02/16/lightllm/).
## FAQ
Please refer to the [FAQ](https://lightllm-en.readthedocs.io/en/latest/faq.html) for more information.
We welcome any cooperation and contribution.

## Community
For further information and discussion, [join our Discord server](https://discord.gg/WzzfwVSguU). You are welcome to become a member, and we look forward to your contributions!