
Add support for H100 #199

Closed

Description

@LiuXiaoxuanPKU

Thanks for the repo!
I can build the repo successfully on an H100 machine, but when I run the benchmarks, they fail with the error below:

FATAL: kernel `fmha_cutlassF_f16_aligned_64x128_rf_sm80` is for sm80-sm100, but was built for sm50

which in turn causes the following failure:

Traceback (most recent call last):                                                                                 
  File "benchmark_latency.py", line 77, in <module>                                                                
    main(args)                                                                                                     
  File "benchmark_latency.py", line 57, in main                                                                    
    latencies.append(run_to_completion(profile=False))                                                             
  File "benchmark_latency.py", line 41, in run_to_completion                                                       
    llm.generate(prompt_token_ids=dummy_prompt_token_ids,                                                          
  File "/home/ubuntu/vllm/vllm/entrypoints/llm.py", line 114, in generate                                          
    return self._run_engine(use_tqdm)                                                                              
  File "/home/ubuntu/vllm/vllm/entrypoints/llm.py", line 134, in _run_engine                                       
    step_outputs = self.llm_engine.step()                                                                          
  File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 225, in step                                            
    output = self._run_workers(                                                                                    
  File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 307, in _run_workers                                    
    output = executor(*args, **kwargs)                                                                             
  File "/home/ubuntu/anaconda3/envs/cacheflow/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in
 decorate_context                                                                                                  
    return func(*args, **kwargs)                                                                                   
  File "/home/ubuntu/vllm/vllm/worker/worker.py", line 279, in execute_model                                       
    output = self.model(                                                                                           
  File "/home/ubuntu/anaconda3/envs/cacheflow/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, i
n _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/model_executor/models/llama.py", line 233, in forward
    next_tokens = self.sampler(
  File "/home/ubuntu/anaconda3/envs/cacheflow/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, i
n _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/vllm/vllm/model_executor/layers/sampler.py", line 81, in forward
    return _sample(probs, logprobs, input_metadata)
  File "/home/ubuntu/vllm/vllm/model_executor/layers/sampler.py", line 402, in _sample
    parent_seq_ids, next_token_ids = _sample_from_generation_tokens(
  File "/home/ubuntu/vllm/vllm/model_executor/layers/sampler.py", line 355, in _sample_from_generation_tokens
    next_token_ids = torch.multinomial(
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
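
The sampler crash looks like a downstream symptom: if the fMHA kernel is dispatched for the wrong architecture, attention can emit `inf`/`nan` values that eventually reach `torch.multinomial`. A minimal sketch to confirm the mismatch (the `grep` filter is an assumption about the exact field name; `python -m xformers.info` is the command that produced the dump below):

# Check the GPU's compute capability against the arch list xFormers was built with
python -c "import torch; print(torch.cuda.get_device_capability())"   # prints (9, 0) on H100
python -m xformers.info | grep TORCH_CUDA_ARCH_LIST                   # 9.0 is absent in the dump below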

Environment info is below:

xFormers 0.0.20
memory_efficient_attention.cutlassF:               available
memory_efficient_attention.cutlassB:               available
memory_efficient_attention.flshattF:               available
memory_efficient_attention.flshattB:               available
memory_efficient_attention.smallkF:                available
memory_efficient_attention.smallkB:                available
memory_efficient_attention.tritonflashattF:        available
memory_efficient_attention.tritonflashattB:        available
indexing.scaled_index_addF:                        available
indexing.scaled_index_addB:                        available
indexing.index_select:                             available
swiglu.dual_gemm_silu:                             available
swiglu.gemm_fused_operand_sum:                     available
swiglu.fused.p.cpp:                                available
is_triton_available:                               True
is_functorch_available:                            False
pytorch.version:                                   2.0.1
pytorch.cuda:                                      available
gpu.compute_capability:                            9.0
gpu.name:                                          NVIDIA H100 PCIe
build.info:                                        available
build.cuda_version:                                1108
build.python_version:                              3.8.16
build.torch_version:                               2.0.1+cu118
build.env.TORCH_CUDA_ARCH_LIST:                    5.0+PTX 6.0 6.1 7.0 7.5 8.0 8.6
build.env.XFORMERS_BUILD_TYPE:                     Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              None
build.env.XFORMERS_PACKAGE_FROM:                   wheel-v0.0.20
build.nvcc_version:                                11.8.89
source.privacy:                                    open source
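
Note that `build.env.TORCH_CUDA_ARCH_LIST` above stops at 8.6, so the prebuilt wheel (`wheel-v0.0.20`) ships no sm90 kernels for the H100 (compute capability 9.0). A possible workaround is to rebuild xFormers from source with 9.0 in the arch list; the sketch below is an untested assumption about the build invocation, not a verified recipe:

pip uninstall -y xformers
# Rebuild with Hopper (sm90) kernels; 9.0+PTX also keeps forward compatibility via PTX
TORCH_CUDA_ARCH_LIST="9.0+PTX" pip install -v --no-build-isolation git+https://github.com/facebookresearch/xformers.git@v0.0.20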
