Thanks for the repo!
I can build the repo successfully on an H100 machine, but when I run the benchmarks, it fails with the error below:
FATAL: kernel `fmha_cutlassF_f16_aligned_64x128_rf_sm80` is for sm80-sm100, but was built for sm50
which then causes this downstream error:
Traceback (most recent call last):
File "benchmark_latency.py", line 77, in <module>
main(args)
File "benchmark_latency.py", line 57, in main
latencies.append(run_to_completion(profile=False))
File "benchmark_latency.py", line 41, in run_to_completion
llm.generate(prompt_token_ids=dummy_prompt_token_ids,
File "/home/ubuntu/vllm/vllm/entrypoints/llm.py", line 114, in generate
return self._run_engine(use_tqdm)
File "/home/ubuntu/vllm/vllm/entrypoints/llm.py", line 134, in _run_engine
step_outputs = self.llm_engine.step()
File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 225, in step
output = self._run_workers(
File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 307, in _run_workers
output = executor(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/cacheflow/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/vllm/vllm/worker/worker.py", line 279, in execute_model
output = self.model(
File "/home/ubuntu/anaconda3/envs/cacheflow/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/vllm/vllm/model_executor/models/llama.py", line 233, in forward
next_tokens = self.sampler(
File "/home/ubuntu/anaconda3/envs/cacheflow/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/vllm/vllm/model_executor/layers/sampler.py", line 81, in forward
return _sample(probs, logprobs, input_metadata)
File "/home/ubuntu/vllm/vllm/model_executor/layers/sampler.py", line 402, in _sample
parent_seq_ids, next_token_ids = _sample_from_generation_tokens(
File "/home/ubuntu/vllm/vllm/model_executor/layers/sampler.py", line 355, in _sample_from_generation_tokens
next_token_ids = torch.multinomial(
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
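The RuntimeError at the end of the traceback is a downstream symptom rather than the root cause: once the attention kernel misbehaves, NaNs propagate through the logits into the sampling probabilities, and `torch.multinomial` refuses to sample from them. A minimal sketch reproducing just that final check (using a hand-made tensor, not vLLM's actual sampler output):

```python
import torch

# A probability vector contaminated with NaN, as would result from
# garbage attention output flowing through softmax.
probs = torch.tensor([0.5, float("nan"), 0.5])

try:
    torch.multinomial(probs, num_samples=1)
except RuntimeError as e:
    # torch validates the distribution and raises the same error
    # seen in the traceback above.
    print(e)
```

This confirms the multinomial error is only telling you the probabilities were already invalid by the time sampling ran.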
Environment info is as follows:
xFormers 0.0.20
memory_efficient_attention.cutlassF: available
memory_efficient_attention.cutlassB: available
memory_efficient_attention.flshattF: available
memory_efficient_attention.flshattB: available
memory_efficient_attention.smallkF: available
memory_efficient_attention.smallkB: available
memory_efficient_attention.tritonflashattF: available
memory_efficient_attention.tritonflashattB: available
indexing.scaled_index_addF: available
indexing.scaled_index_addB: available
indexing.index_select: available
swiglu.dual_gemm_silu: available
swiglu.gemm_fused_operand_sum: available
swiglu.fused.p.cpp: available
is_triton_available: True
is_functorch_available: False
pytorch.version: 2.0.1
pytorch.cuda: available
gpu.compute_capability: 9.0
gpu.name: NVIDIA H100 PCIe
build.info: available
build.cuda_version: 1108
build.python_version: 3.8.16
build.torch_version: 2.0.1+cu118
build.env.TORCH_CUDA_ARCH_LIST: 5.0+PTX 6.0 6.1 7.0 7.5 8.0 8.6
build.env.XFORMERS_BUILD_TYPE: Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS: None
build.env.NVCC_FLAGS: None
build.env.XFORMERS_PACKAGE_FROM: wheel-v0.0.20
build.nvcc_version: 11.8.89
source.privacy: open source
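Note that `build.env.TORCH_CUDA_ARCH_LIST` in the dump above ends at `8.6`, while the H100 reports compute capability `9.0`, so the xFormers wheel contains no sm90 binaries and the runtime falls back to the `5.0+PTX` path, matching the `built for sm50` message. A small sketch of that mismatch check (the helper `arch_list_covers` is hypothetical, written for this report, not part of xFormers):

```python
def arch_list_covers(arch_list: str, capability: float) -> bool:
    """Return True if a TORCH_CUDA_ARCH_LIST-style string contains a
    native binary for the given compute capability.  This deliberately
    ignores +PTX forward-compatibility, since JIT-compiled PTX is what
    produced the sm50 fallback seen in the error above."""
    archs = [a.replace("+PTX", "") for a in arch_list.split()]
    return any(float(a) == capability for a in archs)

# The wheel's arch list, copied from the environment dump:
wheel_archs = "5.0+PTX 6.0 6.1 7.0 7.5 8.0 8.6"

print(arch_list_covers(wheel_archs, 9.0))  # False: no sm90 kernels for H100
print(arch_list_covers(wheel_archs, 8.0))  # True: A100-class binaries exist
```

If this diagnosis is right, rebuilding xFormers with `9.0` included in `TORCH_CUDA_ARCH_LIST` (or installing a wheel built with sm90 support) should resolve both errors.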