Description
Problem Description
Building SGLang for the gfx90a (MI250) architecture fails in the Aiter build step, even when gfx90a is set as the target architecture. The failure appears to be caused by fp8 kernels being included in the build. Is there a flag I should pass to disable all fp8 kernels, or some other set of arguments that lets the build go forward on MI250s?
Operating System
Ubuntu 22.04.5 LTS (Jammy Jellyfish)
CPU
AMD EPYC 7713 64-Core Processor
GPU
AMD Instinct MI250X/MI250 - amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
ROCm Version
ROCm 6.3.3
ROCm Component
No response
Steps to Reproduce
git clone git@github.com:sgl-project/sglang.git
cd sglang/docker
In the aiter (commit) build step, at line 60 of Dockerfile.rocm, replace
GPU_ARCHS=gfx942
with
GPU_ARCHS=gfx90a
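The substitution above can also be scripted instead of edited by hand. The following is only a minimal sketch: the helper name `retarget_gpu_archs` is made up for illustration, and it assumes the architecture is set through a plain `GPU_ARCHS=<value>` assignment, as in the Dockerfile quoted in this issue.

```python
import re


def retarget_gpu_archs(dockerfile_text: str, new_arch: str) -> str:
    """Return the text with every GPU_ARCHS=<value> assignment set to new_arch.

    Hypothetical helper, not part of SGLang or aiter.
    """
    # Match GPU_ARCHS= followed by arch names, e.g. gfx942 or gfx90a;gfx942.
    pattern = re.compile(r"\bGPU_ARCHS=[\w;,]+")
    # A function replacement avoids any backslash interpretation in new_arch.
    return pattern.sub(lambda _match: f"GPU_ARCHS={new_arch}", dockerfile_text)


if __name__ == "__main__":
    line = "&& PREBUILD_KERNELS=1 GPU_ARCHS=gfx942 python3 setup.py develop"
    print(retarget_gpu_archs(line, "gfx90a"))
```

Applied to the full contents of Dockerfile.rocm, this changes only the `GPU_ARCHS` assignment and leaves the rest of the file as-is. It does not address the fp8 build failure itself.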
The Docker build then fails at this command:
PREBUILD_KERNELS=1 GPU_ARCHS=gfx90a python3 setup.py develop
This is seemingly due to fp8 kernels being included, which are not supported on gfx90a. Is there a flag I should pass to disable all fp8 kernels, or to make the build work on gfx90a? The fp8 support seems to be largely hardcoded in some places (although admittedly the following example is for the DeepSeek CSRC kernels):
aiter/aiter/jit/optCompilerConfig.json
Line 342 in e12d350
"'-DENABLE_FP8'"
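To see how widely the fp8 define is hardcoded, one option is to walk the JSON config and list every entry that mentions it. The sketch below deliberately does not assume the real schema of optCompilerConfig.json: it recurses through any nested dicts and lists. The sample data, the `flags` key, and the `find_flag_paths` helper are hypothetical stand-ins for illustration.

```python
import json


def find_flag_paths(node, needle, path=()):
    """Yield (path, value) for every string in a nested JSON value containing needle."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_flag_paths(value, needle, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_flag_paths(value, needle, path + (index,))
    elif isinstance(node, str) and needle in node:
        yield path, node


# Hypothetical sample; the real file's keys and layout may differ.
sample = json.loads("""
{
  "module_a": {"flags": ["'-DENABLE_FP8'"], "srcs": ["a.cu"]},
  "module_b": {"flags": [], "srcs": ["b.cu"]}
}
""")

hits = list(find_flag_paths(sample, "ENABLE_FP8"))
# The first path component is the top-level entry that carries the flag.
modules = sorted({path[0] for path, _value in hits})

if __name__ == "__main__":
    print(modules)
```

To run it against the real file, replace `sample` with `json.load(open("aiter/jit/optCompilerConfig.json"))`, assuming that path is correct relative to the aiter checkout. Whether removing the flag from the listed entries is enough for a gfx90a build is not established here.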
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
I have multiple GPUs on this node, but here is one of them:
ROCk module version 6.10.5 is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
==========
...
*******
Agent 16
*******
Name: gfx90a
Uuid: GPU-124d13ccd7b050e5
Marketing Name: AMD Instinct MI250X/MI250
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 15
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 29708(0x740c)
ASIC Revision: 1(0x1)
Cacheline Size: 128(0x80)
Max Clock Freq. (MHz): 1700
BDFID: 37632
Internal Node ID: 15
Compute Unit: 104
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 92
SDMA engine uCode:: 9
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 4
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
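The architecture to pass as GPU_ARCHS can be read from output like the above rather than typed by hand. This is a rough sketch, assuming each GPU agent reports its architecture on a line of the form `Name: gfx...` as in the listing here; the function name `gpu_archs_from_rocminfo` is made up for illustration.

```python
import re


def gpu_archs_from_rocminfo(text: str) -> list:
    """Return the distinct gfx architecture names from agent 'Name:' lines, in order."""
    archs = []
    # Only lines that start with "Name:" and hold a bare gfx target are matched,
    # so "Marketing Name:" and the full ISA name lines are skipped.
    pattern = re.compile(r"^[ \t]*Name:[ \t]+(gfx[0-9a-z]+)[ \t]*$", re.MULTILINE)
    for match in pattern.finditer(text):
        if match.group(1) not in archs:
            archs.append(match.group(1))
    return archs


if __name__ == "__main__":
    sample = (
        "Name: gfx90a\n"
        "Marketing Name: AMD Instinct MI250X/MI250\n"
        "Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-\n"
    )
    print(";".join(gpu_archs_from_rocminfo(sample)))
```

In practice the input would be the captured output of `/opt/rocm/bin/rocminfo`, for example from `subprocess.run(..., capture_output=True, text=True).stdout`.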
Additional Information
Full SGLang Dockerfile.rocm:
# Usage (to build SGLang ROCm docker image):
# docker build --build-arg SGL_BRANCH=v0.4.3.post2 -t v0.4.3.post2-rocm630 -f Dockerfile.rocm .
# default base image
ARG BASE_IMAGE="rocm/vllm-dev:20250114"
FROM $BASE_IMAGE AS base
USER root
WORKDIR /sgl-workspace
ARG BUILD_TYPE=all
ARG SGL_REPO="https://github.com/sgl-project/sglang"
ENV SGL_DEFAULT="main"
ARG SGL_BRANCH=${SGL_DEFAULT}
ARG TRITON_REPO="https://github.com/ROCm/triton.git"
ARG TRITON_COMMIT="improve_fa_decode_3.0.0"
ARG AITER_REPO="https://github.com/ROCm/aiter.git"
ARG AITER_COMMIT="testx"
RUN git clone ${SGL_REPO} \
    && cd sglang \
    && if [ "${SGL_BRANCH}" = ${SGL_DEFAULT} ]; then \
         echo "Using ${SGL_DEFAULT}, default branch."; \
       else \
         echo "Using ${SGL_BRANCH} branch."; \
         git checkout ${SGL_BRANCH}; \
       fi \
    && cd sgl-kernel \
    && python setup_rocm.py install \
    && cd .. \
    && if [ "$BUILD_TYPE" = "srt" ]; then \
         python -m pip --no-cache-dir install -e "python[srt_hip]"; \
       else \
         python -m pip --no-cache-dir install -e "python[all_hip]"; \
       fi
RUN cp -r /sgl-workspace/sglang /sglang
RUN python -m pip cache purge
RUN pip install IPython \
    && pip install orjson \
    && pip install python-multipart \
    && pip install torchao \
    && pip install pybind11
RUN pip uninstall -y triton
RUN git clone ${TRITON_REPO} \
    && cd triton \
    && git checkout ${TRITON_COMMIT} \
    && cd python \
    && python3 setup.py install
RUN git clone ${AITER_REPO} \
    && cd aiter \
    && git checkout ${AITER_COMMIT} \
    && git submodule update --init --recursive \
    && PREBUILD_KERNELS=1 GPU_ARCHS=gfx942 python3 setup.py develop
# Copy config files to support MI300X in virtualized environments (MI300X_VF). Symlinks will not be created in image build.
RUN find /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/ \
         /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/ \
         -type f -name '*MI300X*' | xargs -I {} sh -c 'vf_config=$(echo "$1" | sed "s/MI300X/MI300X_VF/"); cp "$1" "$vf_config"' -- {}
# Performance environment variable.
ENV HIP_FORCE_DEV_KERNARG=1
ENV HSA_NO_SCRATCH_RECLAIM=1
ENV SGLANG_SET_CPU_AFFINITY=1
ENV SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
ENV NCCL_MIN_NCHANNELS=112
ENV MOE_PADDING=1
ENV VLLM_FP8_PADDING=1
ENV VLLM_FP8_ACT_PADDING=1
ENV VLLM_FP8_WEIGHT_PADDING=1
ENV VLLM_FP8_REDUCE_CONV=1
ENV TORCHINDUCTOR_MAX_AUTOTUNE=1
ENV TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1
CMD ["/bin/bash"]