[Issue]: Build errors for gfx90a (MI250) architecture

### Problem Description

Building SGLang for the `gfx90a` (MI250) architecture fails in the Aiter build step, even when the build targets `gfx90a`. The failure seems to be caused by fp8 kernels being included in the build. Is there a flag I should pass to disable all fp8 kernels, or some other set of arguments that lets the build go forward for MI250s?

### Operating System

Ubuntu 22.04.5 LTS (Jammy Jellyfish)

### CPU

AMD EPYC 7713 64-Core Processor

### GPU

AMD Instinct MI250X/MI250 - amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-

### ROCm Version

ROCm 6.3.3

### ROCm Component

_No response_

### Steps to Reproduce

```bash
git clone git@github.com:sgl-project/sglang.git
cd sglang/docker
```
In the aiter ([commit](https://github.com/ROCm/aiter/commit/e12d350a7ddff818baec43eafcc66ba7b2191567)) build step, at [line 60 of `Dockerfile.rocm`](https://github.com/sgl-project/sglang/blob/d3d4d76758b15c2c03e37e82cb85044f45332bfa/docker/Dockerfile.rocm#L60), replace
```bash
GPU_ARCHS=gfx942
```
with
```bash
GPU_ARCHS=gfx90a
```
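For anyone reproducing this, the substitution above can be scripted. This is only a sketch: `retarget_gpu_arch` is a hypothetical helper name, and it assumes GNU `sed` (for `-i`) and that the Dockerfile still contains the literal string `GPU_ARCHS=gfx942`.

```bash
# Hypothetical helper: switch the aiter build target from gfx942 to gfx90a
# in the given Dockerfile, editing the file in place (GNU sed assumed).
retarget_gpu_arch() {
    local dockerfile="$1"
    # Only the GPU_ARCHS value is rewritten; every other line is untouched.
    sed -i 's/GPU_ARCHS=gfx942/GPU_ARCHS=gfx90a/' "$dockerfile"
}
```

Run from `sglang/docker`, it would be called as `retarget_gpu_arch Dockerfile.rocm`, and `grep -n 'GPU_ARCHS' Dockerfile.rocm` confirms the change.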
The Docker build then fails at this command:
```bash
PREBUILD_KERNELS=1 GPU_ARCHS=gfx90a python3 setup.py develop
```
This seems to be because fp8 kernels, which are not supported on `gfx90a`, are included in the build. Is there a flag I should pass to disable all `fp8` kernels, or otherwise get the build to work on `gfx90a`? fp8 seems largely hardcoded in some places (although admittedly the following example is for the DeepSeek CSRC kernels):

https://github.com/ROCm/aiter/blob/e12d350a7ddff818baec43eafcc66ba7b2191567/aiter/jit/optCompilerConfig.json#L342
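To see how widespread these references are, a plain text search over the config file is one option. This is a rough sketch: `list_fp8_refs` is a hypothetical helper, it makes no assumption about the JSON structure, and a match only shows that the string `fp8` appears on a line, not that the kernel is actually built for `gfx90a`.

```bash
# Hypothetical helper: print every line of a file that mentions fp8
# (case-insensitive), prefixed with its line number.
list_fp8_refs() {
    grep -n -i 'fp8' "$1"
}
```

From the root of an aiter checkout at the commit above, it would be called as `list_fp8_refs aiter/jit/optCompilerConfig.json`.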


### (Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

I have multiple GPUs on this node, but here is one of them:
``` 
ROCk module version 6.10.5 is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========
HSA Agents
==========

...

*******
Agent 16
*******
  Name:                    gfx90a
  Uuid:                    GPU-124d13ccd7b050e5
  Marketing Name:          AMD Instinct MI250X/MI250
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    15
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      8192(0x2000) KB
  Chip ID:                 29708(0x740c)
  ASIC Revision:           1(0x1)
  Cacheline Size:          128(0x80)
  Max Clock Freq. (MHz):   1700
  BDFID:                   37632
  Internal Node ID:        15
  Compute Unit:            104
  SIMDs per CU:            4
  Shader Engines:          8
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    2048(0x800)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 92
  SDMA engine uCode::      9
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 4
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***
```

### Additional Information

Full SGLang `Dockerfile.rocm`:
```
# Usage (to build SGLang ROCm docker image):
#   docker build --build-arg SGL_BRANCH=v0.4.3.post2 -t v0.4.3.post2-rocm630 -f Dockerfile.rocm .

# default base image
ARG BASE_IMAGE="rocm/vllm-dev:20250114"

FROM $BASE_IMAGE AS base
USER root

WORKDIR /sgl-workspace
ARG BUILD_TYPE=all
ARG SGL_REPO="https://github.com/sgl-project/sglang"
ENV SGL_DEFAULT="main"
ARG SGL_BRANCH=${SGL_DEFAULT}

ARG TRITON_REPO="https://github.com/ROCm/triton.git"
ARG TRITON_COMMIT="improve_fa_decode_3.0.0"


ARG AITER_REPO="https://github.com/ROCm/aiter.git"
ARG AITER_COMMIT="testx"

RUN git clone ${SGL_REPO} \
    && cd sglang \
    && if [ "${SGL_BRANCH}" = ${SGL_DEFAULT} ]; then \
         echo "Using ${SGL_DEFAULT}, default branch."; \
       else \
         echo "Using ${SGL_BRANCH} branch."; \
         git checkout ${SGL_BRANCH}; \
       fi \
    && cd sgl-kernel \
    && python setup_rocm.py install \
    && cd .. \
    && if [ "$BUILD_TYPE" = "srt" ]; then \
         python -m pip --no-cache-dir install -e "python[srt_hip]"; \
       else \
         python -m pip --no-cache-dir install -e "python[all_hip]"; \
       fi

RUN cp -r /sgl-workspace/sglang /sglang
RUN python -m pip cache purge

RUN pip install IPython \
    && pip install orjson \
    && pip install python-multipart \
    && pip install torchao \
    && pip install pybind11

RUN pip uninstall -y triton
RUN git clone ${TRITON_REPO} \
    && cd triton \
    && git checkout ${TRITON_COMMIT} \
    && cd python \
    && python3 setup.py install

RUN git clone ${AITER_REPO} \
    && cd aiter \
    && git checkout ${AITER_COMMIT} \
    && git submodule update --init --recursive \
    && PREBUILD_KERNELS=1 GPU_ARCHS=gfx942 python3 setup.py develop

# Copy config files to support MI300X in virtualized environments (MI300X_VF).  Symlinks will not be created in image build.
RUN find /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/ \
         /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/ \
         -type f -name '*MI300X*' | xargs -I {} sh -c 'vf_config=$(echo "$1" | sed "s/MI300X/MI300X_VF/"); cp "$1" "$vf_config"' -- {}

# Performance environment variable.

ENV HIP_FORCE_DEV_KERNARG=1
ENV HSA_NO_SCRATCH_RECLAIM=1
ENV SGLANG_SET_CPU_AFFINITY=1
ENV SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
ENV NCCL_MIN_NCHANNELS=112

ENV MOE_PADDING=1
ENV VLLM_FP8_PADDING=1
ENV VLLM_FP8_ACT_PADDING=1
ENV VLLM_FP8_WEIGHT_PADDING=1
ENV VLLM_FP8_REDUCE_CONV=1
ENV TORCHINDUCTOR_MAX_AUTOTUNE=1
ENV TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1

CMD ["/bin/bash"]

```
