Bump up version to v0.3.0 (vllm-project#2656)
zhuohan123 authored and NikolaBorisov committed Jan 31, 2024
1 parent 1507a4c commit cd36ebc
Showing 3 changed files with 7 additions and 3 deletions.
README.md: 3 additions & 1 deletion
@@ -46,7 +46,7 @@ vLLM is fast with:
- Efficient management of attention key and value memory with **PagedAttention**
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
-- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629)
+- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629), FP8 KV Cache
- Optimized CUDA kernels

vLLM is flexible and easy to use with:
@@ -57,6 +57,8 @@ vLLM is flexible and easy to use with:
- Streaming outputs
- OpenAI-compatible API server
- Support NVIDIA GPUs and AMD GPUs
+- (Experimental) Prefix caching support
+- (Experimental) Multi-lora support

vLLM seamlessly supports many Hugging Face models, including the following architectures:

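For reference, the features listed in this README hunk surface through the offline LLM API whose names (`LLM`, `SamplingParams`) also appear in the vllm/__init__.py diff below. A minimal sketch of loading an AWQ-quantized model with this release follows; the model name and the `quantization="awq"` argument are illustrative assumptions, not part of this commit:

from vllm import LLM, SamplingParams

# Illustrative AWQ checkpoint name (an assumption); any AWQ-quantized model would do.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() returns one RequestOutput per prompt; each holds the generated completions.
outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)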
docs/source/index.rst: 3 additions & 1 deletion
@@ -31,7 +31,7 @@ vLLM is fast with:
* Efficient management of attention key and value memory with **PagedAttention**
* Continuous batching of incoming requests
* Fast model execution with CUDA/HIP graph
-* Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_
+* Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_, FP8 KV Cache
* Optimized CUDA kernels

vLLM is flexible and easy to use with:
@@ -42,6 +42,8 @@ vLLM is flexible and easy to use with:
* Streaming outputs
* OpenAI-compatible API server
* Support NVIDIA GPUs and AMD GPUs
+* (Experimental) Prefix caching support
+* (Experimental) Multi-lora support

For more information, check out the following:

vllm/__init__.py: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
from vllm.outputs import CompletionOutput, RequestOutput
from vllm.sampling_params import SamplingParams

__version__ = "0.2.7"
__version__ = "0.3.0"

__all__ = [
"LLM",

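The only code change is the version string in vllm/__init__.py, which vLLM exposes at the package root. A quick way to confirm the installed release after upgrading (assuming this 0.3.0 build is installed):

import vllm

# __version__ is the string set in vllm/__init__.py; prints "0.3.0" for this release.
print(vllm.__version__)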