Hello, I tried to reproduce the performance of the chunk_scan kernel reported in the PipeThreader paper, but my measured result does not match the reported numbers.
tilelang kernel: examples/linear_attention/example_mamba_chunk_scan.py
commit: 569b012
nvcc version: 12.4
torch version: 2.6.0
cmd: python example_mamba_chunk_scan.py --batch 64 --seq_len 8192 --tune
hardware: H100 PCIe
The output is
Best latency: 48.831199645996094
Best TFlops: 28.145725369921212
Best config: {'block_M': 64, 'block_N': 32, 'block_K': 64, 'block_Dstate': 128, 'num_stages': 4}
However, according to the paper (Table 3), ChunkScan for bs=64, seq=8k should take 6.981 ms.
Is this example file the one used in the paper's experiments?
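For context, here is a quick back-of-the-envelope comparison of the measured result against the paper's Table 3 number. This is only a sketch: the FLOP count is inferred from the tuner's own latency/TFLOPS pair quoted above, and the 6.981 ms figure is taken from the paper as cited, not re-derived from the kernel.

```python
# Rough comparison of the measured result vs. the paper's reported
# ChunkScan latency at bs=64, seq=8k (all numbers from the text above).

measured_latency_ms = 48.831199645996094   # reported by the tuner
measured_tflops = 28.145725369921212       # reported by the tuner
paper_latency_ms = 6.981                   # PipeThreader paper, Table 3

# Total work (in TFLOP) implied by the tuner's own latency/TFLOPS pair.
total_tflop = measured_tflops * measured_latency_ms / 1e3   # ~1.37 TFLOP

# Throughput the paper's latency would imply for the same workload.
implied_paper_tflops = total_tflop / (paper_latency_ms / 1e3)

print(f"measured: {measured_latency_ms:.3f} ms, {measured_tflops:.2f} TFLOPS")
print(f"paper:    {paper_latency_ms:.3f} ms -> ~{implied_paper_tflops:.1f} TFLOPS "
      f"({measured_latency_ms / paper_latency_ms:.1f}x faster than measured)")
```

So the paper's number corresponds to roughly a 7x speedup over what I observe on an H100 PCIe, which is why I suspect either the example file or my setup differs from the paper's experiments.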