Hello, I tried to reproduce the performance of the chunk_scan kernel reported in the PipeThreader paper, but my measured result does not match the reported numbers.
tilelang kernel: examples/linear_attention/example_mamba_chunk_scan.py
commit: 569b012
nvcc version: 12.4
torch version: 2.6.0
cmd: python example_mamba_chunk_scan.py --batch 64 --seq_len 8192 --tune
hardware: H100 PCIe
The output is
Best latency: 48.831199645996094
Best TFlops: 28.145725369921212
Best config: {'block_M': 64, 'block_N': 32, 'block_K': 64, 'block_Dstate': 128, 'num_stages': 4}
However, according to the paper (Table 3), ChunkScan for bs=64, seq=8k should take 6.981 ms.
Is this example file the one used in the paper's experiments?
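For context, here is a quick back-of-the-envelope comparison of the measured result against the paper's Table 3 number. This is only a sketch: the FLOP count is inferred from the tuner's own latency/TFLOPS pair quoted above, and the 6.981 ms figure is taken from the paper as cited, not re-derived from the kernel.

```python
# Rough comparison of the measured result vs. the paper's reported
# ChunkScan latency at bs=64, seq=8k (all numbers from the text above).

measured_latency_ms = 48.831199645996094   # reported by the tuner
measured_tflops = 28.145725369921212       # reported by the tuner
paper_latency_ms = 6.981                   # PipeThreader paper, Table 3

# Total work (in TFLOP) implied by the tuner's own latency/TFLOPS pair.
total_tflop = measured_tflops * measured_latency_ms / 1e3   # ~1.37 TFLOP

# Throughput the paper's latency would imply for the same workload.
implied_paper_tflops = total_tflop / (paper_latency_ms / 1e3)

print(f"measured: {measured_latency_ms:.3f} ms, {measured_tflops:.2f} TFLOPS")
print(f"paper:    {paper_latency_ms:.3f} ms -> ~{implied_paper_tflops:.1f} TFLOPS "
      f"({measured_latency_ms / paper_latency_ms:.1f}x faster than measured)")
```

So the paper's number corresponds to roughly a 7x speedup over what I observe on an H100 PCIe, which is why I suspect either the example file or my setup differs from the paper's experiments.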