v0.4.2 Release Tracker #4505

Closed
simon-mo opened this issue Apr 30, 2024 · 12 comments
Labels
release Related to new version release

Comments

@simon-mo
Collaborator

simon-mo commented Apr 30, 2024

ETA May 3rd, Friday.

@simon-mo simon-mo added misc release Related to new version release and removed misc labels Apr 30, 2024
@robertgshaw2-neuralmagic
Collaborator

robertgshaw2-neuralmagic commented Apr 30, 2024

@vrdn-23
Contributor

vrdn-23 commented Apr 30, 2024

Would it be possible to also get #4419, #4357, and #3763 included in this release? The dependency on Ray for multiple GPUs on a single node is a pain to deal with!

@nivibilla

Following on from @vrdn-23, #3466 would be great too. I already use Ray for scaling across multiple nodes, and this is the only solution that works for models that don't fit on a single GPU.

@vrdn-23
Contributor

vrdn-23 commented Apr 30, 2024

#3466 got split into a bunch of smaller PRs from what I understand (of which #4419 and #4357 have yet to be merged, I think), so I think we're asking for the same thing. :)

@jeejeelee
Contributor

Could we consider #4132? It has proven to be incredibly useful in my development process.

@simon-mo simon-mo pinned this issue May 1, 2024
@rkooo567
Collaborator

rkooo567 commented May 1, 2024

#4451 -> this has to be included in the next release (otherwise chunked prefill will crash when preemption is used)
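
(For context, a hypothetical repro sketch, since chunked prefill is opt-in: the model name and token budget below are placeholders, and preemption only kicks in once the KV cache fills up, which is the path #4451 fixes.)

# Hypothetical repro sketch, not an official test: enable chunked prefill with a
# small batched-token budget so long prompts are split across scheduler steps.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",       # placeholder model
    enable_chunked_prefill=True,     # opt in to chunked prefill
    max_num_batched_tokens=512,      # small budget so prefills get chunked
)
outputs = llm.generate(["Hello, my name is"] * 8, SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)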

@cadedaniel
Collaborator

@robertgshaw2-neuralmagic for block manager v2 we still need to do profiling before we swap over. I made a tracking issue: #4537

@robertgshaw2-neuralmagic
Collaborator

@cadedaniel do you need something from the NM side on this?

@cadedaniel
Collaborator

Sure, if there's interest :) I mention it because APC in BlockManagerV2 (https://github.com/vllm-project/vllm/pull/4142) is not strictly necessary for the release (block manager v2 isn't ready yet).

@aliozts

aliozts commented May 2, 2024

Is it possible to include #4305?

@simon-mo
Collaborator Author

simon-mo commented May 5, 2024

Released https://github.com/vllm-project/vllm/releases/tag/v0.4.2

Notably:

  • Even though our tests are starting to run against CUDA 12.4, this release is still built with CUDA 12.1 for both the wheel build and the Docker image (a quick check is sketched below).
  • The debug info has been stripped from the wheel due to the wheel size limit; the Docker release is not affected.
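
A quick sanity check of which build you ended up with (a rough proxy via the bundled PyTorch, not an authoritative inspection of vLLM's compiled extensions):

# Rough check: the CUDA version the installed torch was built against serves as
# a proxy for the wheel's CUDA build.
import torch
import vllm

print("vllm version:", vllm.__version__)        # expect 0.4.2
print("torch CUDA build:", torch.version.cuda)  # expect 12.1 for this wheel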

@simon-mo simon-mo closed this as completed May 5, 2024
@RyanWMHI

I used SqueezeLLM 4-bit to quantize my model, but it seems there is a bug:
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/ryan/vllm/benchmarks/benchmark_latency.py", line 195, in
[rank0]: main(args)
[rank0]: File "/home/ryan/vllm/benchmarks/benchmark_latency.py", line 20, in main
[rank0]: llm = LLM(model=args.model,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/ryan/vllm/vllm/entrypoints/llm.py", line 123, in init
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/ryan/vllm/vllm/engine/llm_engine.py", line 292, in from_engine_args
[rank0]: engine = cls(
[rank0]: ^^^^
[rank0]: File "/home/ryan/vllm/vllm/engine/llm_engine.py", line 160, in init
[rank0]: self.model_executor = executor_class(
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/home/ryan/vllm/vllm/executor/executor_base.py", line 41, in init
[rank0]: self._init_executor()
[rank0]: File "/home/ryan/vllm/vllm/executor/gpu_executor.py", line 23, in _init_executor
[rank0]: self._init_non_spec_worker()
[rank0]: File "/home/ryan/vllm/vllm/executor/gpu_executor.py", line 69, in _init_non_spec_worker
[rank0]: self.driver_worker.load_model()
[rank0]: File "/home/ryan/vllm/vllm/worker/worker.py", line 118, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/home/ryan/vllm/vllm/worker/model_runner.py", line 164, in load_model
[rank0]: self.model = get_model(
[rank0]: ^^^^^^^^^^
[rank0]: File "/home/ryan/vllm/vllm/model_executor/model_loader/init.py", line 19, in get_model
[rank0]: return loader.load_model(model_config=model_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/ryan/vllm/vllm/model_executor/model_loader/loader.py", line 224, in load_model
[rank0]: model.load_weights(
[rank0]: File "/home/ryan/vllm/vllm/model_executor/models/llama.py", line 407, in load_weights
[rank0]: param = params_dict[name]
[rank0]: ~~~~~~~~~~~^^^^^^
[rank0]: KeyError: 'model.layers.0.self_attn.qkv_proj.rows'

After I changed the code in llama.py to:

if name != "model.layers.0.self_attn.qkv_proj.rows":
    param = params_dict[name]
else:
    param = params_dict["model.layers.0.self_attn.qkv_proj.qweight"]

I still get a bug:
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/ryan/vllm/benchmarks/benchmark_latency.py", line 195, in
[rank0]: main(args)
[rank0]: File "/home/ryan/vllm/benchmarks/benchmark_latency.py", line 20, in main
[rank0]: llm = LLM(model=args.model,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/ryan/vllm/vllm/entrypoints/llm.py", line 123, in init
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/ryan/vllm/vllm/engine/llm_engine.py", line 292, in from_engine_args
[rank0]: engine = cls(
[rank0]: ^^^^
[rank0]: File "/home/ryan/vllm/vllm/engine/llm_engine.py", line 160, in init
[rank0]: self.model_executor = executor_class(
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/home/ryan/vllm/vllm/executor/executor_base.py", line 41, in init
[rank0]: self._init_executor()
[rank0]: File "/home/ryan/vllm/vllm/executor/gpu_executor.py", line 23, in _init_executor
[rank0]: self._init_non_spec_worker()
[rank0]: File "/home/ryan/vllm/vllm/executor/gpu_executor.py", line 69, in _init_non_spec_worker
[rank0]: self.driver_worker.load_model()
[rank0]: File "/home/ryan/vllm/vllm/worker/worker.py", line 118, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/home/ryan/vllm/vllm/worker/model_runner.py", line 164, in load_model
[rank0]: self.model = get_model(
[rank0]: ^^^^^^^^^^
[rank0]: File "/home/ryan/vllm/vllm/model_executor/model_loader/init.py", line 19, in get_model
[rank0]: return loader.load_model(model_config=model_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/ryan/vllm/vllm/model_executor/model_loader/loader.py", line 224, in load_model
[rank0]: model.load_weights(
[rank0]: File "/home/ryan/vllm/vllm/model_executor/models/llama.py", line 412, in load_weights
[rank0]: weight_loader(param, loaded_weight, shard_id)
[rank0]: File "/home/ryan/vllm/vllm/model_executor/layers/linear.py", line 561, in weight_loader
[rank0]: loaded_weight = loaded_weight.narrow(output_dim, start_idx,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
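
For what it's worth, before remapping names in llama.py it may help to dump the tensor names the quantized checkpoint actually stores and compare them against the keys in params_dict (e.g. qweight above). A hypothetical debugging sketch, with placeholder paths and assuming a safetensors checkpoint:

# Hypothetical debugging sketch (not part of vLLM): print one layer's tensor names
# from the quantized checkpoint so unexpected suffixes like ".rows" are visible.
import glob
from safetensors import safe_open

for shard in sorted(glob.glob("/path/to/quantized-model/*.safetensors")):  # placeholder path
    with safe_open(shard, framework="pt") as f:
        for name in f.keys():
            if ".layers.0." in name:  # one layer is enough to see the naming pattern
                print(shard, name)

(If the checkpoint is a .pt/.bin file instead, torch.load(...).keys() gives the same list.) If the checkpoint carries extra SqueezeLLM metadata tensors that vLLM does not register, remapping them onto qweight as in the patch above feeds a tensor of the wrong shape into the QKV weight loader in linear.py, which would explain the narrow() IndexError in the second traceback; those keys likely need to be skipped, or the model re-exported in the packing format vLLM's SqueezeLLM support expects.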

@simon-mo simon-mo unpinned this issue May 14, 2024