Closed
Description
The server is unable to handle a completions request for model mosaicml/mpt-30b-chat: a POST to /v1/completions returns 500 Internal Server Error, and the engine raises RuntimeError: attn_bias is not correctly aligned. Server log:
INFO 07-09 00:50:38 llm_engine.py:131] # GPU blocks: 716, # CPU blocks: 195
INFO: Started server process [89934]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 07-09 00:50:42 async_llm_engine.py:117] Received request cmpl-41fa40b022f54beaa423ec71c5c090e9: prompt: 'hello', sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, temperature=0.0, top_p=1.0, top_k=-1, use_beam_search=False, stop=[], ignore_eos=False, max_tokens=7, logprobs=None), prompt token ids: None.
INFO 07-09 00:50:42 scheduler.py:269] Throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%
INFO 07-09 00:50:42 async_llm_engine.py:196] Aborted request cmpl-41fa40b022f54beaa423ec71c5c090e9.
INFO: 8.218.79.36:49514 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
return await self.app(scope, receive, send)
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/fastapi/applications.py", line 289, in __call__
await super().__call__(scope, receive, send)
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
await self.app(scope, receive, send)
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
raise e
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
await self.app(scope, receive, send)
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
response = await func(request)
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 273, in app
raw_response = await run_endpoint_function(
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 190, in run_endpoint_function
return await dependant.call(**values)
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 481, in create_completion
async for res in result_generator:
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 151, in generate
raise e
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 148, in generate
await self.engine_step(request_id)
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 74, in engine_step
request_outputs = self.engine.step()
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 242, in step
output = self._run_workers(
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 330, in _run_workers
output = executor(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 284, in execute_model
output = self.model(
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/mpt.py", line 233, in forward
hidden_states = self.transformer(input_ids, positions, kv_caches,
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/mpt.py", line 201, in forward
hidden_states = block(
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/mpt.py", line 152, in forward
x = self.attn(
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/mpt.py", line 101, in forward
attn_output = self.attn(q, k, v, k_cache, v_cache, input_metadata,
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/attention.py", line 170, in forward
self.multi_query_kv_attention(
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/attention.py", line 352, in multi_query_kv_attention
out = xops.memory_efficient_attention_forward(
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 213, in memory_efficient_attention_forward
return _memory_efficient_attention_forward(
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 310, in _memory_efficient_attention_forward
out, *_ = op.apply(inp, needs_gradient=False)
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/cutlass.py", line 186, in apply
out, lse, rng_seed, rng_offset = cls.OPERATOR(
File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_ops.py", line 502, in __call__
return self._op(*args, **kwargs or {})
RuntimeError: attn_bias is not correctly aligned
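For context, this error is raised by xFormers' CUTLASS attention kernel, which checks the storage layout of the attn_bias tensor and rejects one whose last-dimension storage does not meet the kernel's alignment requirement. A known workaround pattern is to allocate the bias buffer padded out to a multiple of the alignment and then slice it back to the logical length, so the resulting view keeps the padded row stride. Below is a minimal NumPy sketch of that pad-then-slice trick; the function name and the alignment value of 8 are illustrative assumptions, not vLLM's actual code:

```python
import numpy as np

def make_aligned_bias(num_heads: int, seq_len: int, alignment: int = 8) -> np.ndarray:
    """Allocate an attn_bias buffer padded in its last dim, then slice it back.

    The kernel inspects the storage layout of attn_bias, so padding the
    allocation (rather than the logical shape) is what satisfies the check.
    Names and the alignment value here are illustrative assumptions.
    """
    padded_len = -(-seq_len // alignment) * alignment  # round up to a multiple of `alignment`
    buf = np.zeros((num_heads, seq_len, padded_len), dtype=np.float32)
    # ... the real (e.g. ALiBi) bias values would be written into buf[..., :seq_len] ...
    return buf[..., :seq_len]  # logical shape restored; padded row stride preserved

bias = make_aligned_bias(num_heads=2, seq_len=5)
print(bias.shape)                         # (2, 5, 5)
print(bias.strides[1] // bias.itemsize)   # 8: each row spans the padded length
```

In this traceback the bias in question is the one built for MPT's attention in multi_query_kv_attention; if misalignment of that bias is indeed the cause, applying the same pad-then-slice allocation there, or upgrading to a vLLM release that includes such a fix, should avoid the error.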
Here is my xFormers setup (output of python -m xformers.info):
xFormers 0.0.21+55a4798.d20230709
memory_efficient_attention.cutlassF: available
memory_efficient_attention.cutlassB: available
memory_efficient_attention.flshattF: available
memory_efficient_attention.flshattB: available
memory_efficient_attention.smallkF: available
memory_efficient_attention.smallkB: available
memory_efficient_attention.tritonflashattF: available
memory_efficient_attention.tritonflashattB: available
indexing.scaled_index_addF: available
indexing.scaled_index_addB: available
indexing.index_select: available
swiglu.dual_gemm_silu: available
swiglu.gemm_fused_operand_sum: available
swiglu.fused.p.cpp: available
is_triton_available: True
is_functorch_available: False
pytorch.version: 2.0.1+cu118
pytorch.cuda: available
gpu.compute_capability: 9.0
gpu.name: NVIDIA H100 PCIe
build.info: available
build.cuda_version: 1108
build.python_version: 3.10.12
build.torch_version: 2.0.1+cu118
build.env.TORCH_CUDA_ARCH_LIST: 9.0
build.env.XFORMERS_BUILD_TYPE: None
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS: None
build.env.NVCC_FLAGS: None
build.env.XFORMERS_PACKAGE_FROM: None
build.nvcc_version: 11.8.89
source.privacy: open source
PyTorch environment (output of python -m torch.utils.collect_env):
Collecting environment information...
PyTorch version: 2.0.1+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.4
Libc version: glibc-2.31
Python version: 3.10.12 (main, Jul 5 2023, 18:54:27) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-73-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA H100 PCIe
Nvidia driver version: 525.105.17
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 57 bits virtual
CPU(s): 26
On-line CPU(s) list: 0-25
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 26
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 143
Model name: Intel(R) Xeon(R) Platinum 8480+
Stepping: 8
CPU MHz: 2000.000
BogoMIPS: 4000.00
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 832 KiB
L1i cache: 832 KiB
L2 cache: 104 MiB
L3 cache: 416 MiB
NUMA node0 CPU(s): 0-25
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Unknown: No mitigations
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; TSX disabled
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b fsrm md_clear serialize tsxldtrk avx512_fp16 arch_capabilities
Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.25.1
[pip3] torch==2.0.1+cu118
[pip3] torchaudio==2.0.2+cu118
[pip3] torchvision==0.15.2+cu118
[conda] numpy 1.25.1 pypi_0 pypi
[conda] torch 2.0.1+cu118 pypi_0 pypi
[conda] torchaudio 2.0.2+cu118 pypi_0 pypi
[conda] torchvision 0.15.2+cu118 pypi_0 pypi