
Conversation

@lzhangzz
Collaborator

No description provided.

@lzhangzz lzhangzz merged commit 62c6080 into InternLM:main Jun 21, 2023
lvhan028 pushed a commit that referenced this pull request Aug 30, 2024
* support ascend using infer_ext

* fix(ascend): make infer_ext using TND format q,k,v in paged_token_attention

* support ascend using infer_ext

* feat: support ascend moe_gating_topk_softmax

* feat: change infer_ext ops function param order (#2)

* ascend: align attention mask to 32bytes (#7)

* fix attn args (#9)

* fix: expand shape of attn_mask (#10)

* feat: update infer_ext ops interface (#13)

* rename infer_ext to dlinfer

* format code

* Support internlm 2.5 (#14)

* refactor ascend pagedattention

* fix ascend apply_rotary_pos_emb

* fix import dlinfer (#16)

* fix: fix rms_norm params (#18)

* fix sync on ascend

---------

Co-authored-by: chenchiyu <chenchiyu@pjlab.org.cn>
Co-authored-by: CyCle1024 <ccy_justin@163.com>
Co-authored-by: Wei Tao <1136862851@qq.com>
Co-authored-by: jinminxi104 <jinminxi104@hotmail.com>
Co-authored-by: pdx1989 <pdx1989@gmail.com>
roy-shih pushed a commit to roy-shih/lmdeploy that referenced this pull request Nov 24, 2025
This commit implements 4 high-priority kernels to bridge the gap between
the TurboMind CUDA backend and the PyTorch (Triton) backend, enabling
cross-platform deployment:

1. GELU and Mul kernel (activation.py)
   - Fused GELU activation + elementwise multiply
   - Follows TurboMind's GELU formula: x * 0.5 * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3)))
   - Auto-tuned for different vocab sizes
   - Estimated speedup: 1.2-1.5x vs unfused PyTorch (see the sketch after this list)

2. Top-K Sampling kernel (topk_sampling.py)
   - High-performance top-k sampling with softmax normalization
   - Iterative max-finding approach optimized for Triton
   - Includes topk_filter for logits filtering
   - Reference PyTorch implementation for testing (sketched after this list)
   - Critical for inference quality

3. Top-P (Nucleus) Sampling kernel (topp_sampling.py)
   - Nucleus sampling with cumulative probability threshold
   - Greedy nucleus selection for Triton efficiency
   - Fused softmax + cumsum + sampling
   - topp_filter for pre-sampling logits filtering
   - Reference implementation included (sketched after this list)

4. Embedding Lookup + Position Encoding kernel (embedding_lookup.py)
   - Fused embedding lookup + position encoding
   - Three variants:
     * embedding_lookup: Basic lookup
     * embedding_lookup_pos_encoding: Fused lookup + pos encoding + scaling
     * add_position_encoding: Add pos encoding to existing embeddings
   - Auto-tuned for different hidden dimensions
   - Memory bandwidth optimized with vectorized loads (see the sketch after this list)
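
For illustration, here is a minimal Triton sketch of what such a fused
GELU-and-mul kernel can look like. The kernel name, flat contiguous layout,
and fixed block size are assumptions of this sketch; the PR's activation.py
additionally auto-tunes the launch configuration:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _gelu_and_mul_kernel(x_ptr, y_ptr, out_ptr, n_elements,
                             BLOCK: tl.constexpr):
        pid = tl.program_id(axis=0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n_elements
        x = tl.load(x_ptr + offs, mask=mask, other=0.0).to(tl.float32)
        y = tl.load(y_ptr + offs, mask=mask, other=0.0).to(tl.float32)
        # tanh-approximated GELU from the formula above, using the identity
        # 0.5 * (1 + tanh(t)) == sigmoid(2 * t) to avoid a tanh intrinsic.
        inner = 0.7978845608028654 * (x + 0.044715 * x * x * x)  # sqrt(2/pi) * (x + 0.044715 x^3)
        gelu = x * tl.sigmoid(2.0 * inner)
        tl.store(out_ptr + offs, (gelu * y).to(out_ptr.dtype.element_ty),
                 mask=mask)

    def gelu_and_mul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # out = GELU(x) * y, elementwise over flat contiguous tensors
        assert x.shape == y.shape and x.is_contiguous() and y.is_contiguous()
        out = torch.empty_like(x)
        n = x.numel()
        grid = (triton.cdiv(n, 1024),)
        _gelu_and_mul_kernel[grid](x, y, out, n, BLOCK=1024)
        return out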
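
A minimal PyTorch sketch of the top-k filter-then-sample idea described
above (topk_filter is the name the commit mentions; the body and the
topk_sample helper are assumptions of this sketch, with logits assumed to be
a (batch, vocab) tensor):

    import torch

    def topk_filter(logits: torch.Tensor, k: int) -> torch.Tensor:
        # Keep the k largest logits per row; mask the rest to -inf.
        kth = torch.topk(logits, k, dim=-1).values[..., -1:]
        return logits.masked_fill(logits < kth, float('-inf'))

    def topk_sample(logits: torch.Tensor, k: int) -> torch.Tensor:
        # Softmax over the surviving logits renormalizes the top-k mass,
        # then one token id is drawn per row.
        probs = torch.softmax(topk_filter(logits, k), dim=-1)
        return torch.multinomial(probs, num_samples=1)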
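
Similarly, a minimal PyTorch sketch of nucleus filtering (topp_filter is the
name the commit mentions; the body and the topp_sample helper are
assumptions of this sketch):

    import torch

    def topp_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
        # Sort descending, accumulate probability mass, and drop every token
        # after the cumulative mass first exceeds p.
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        drop = cum_probs > p
        drop[..., 1:] = drop[..., :-1].clone()  # shift: keep the crossing token
        drop[..., 0] = False                    # the top token always survives
        # Scatter the per-rank mask back to vocabulary order.
        mask = drop.scatter(dim=-1, index=sorted_idx, src=drop)
        return logits.masked_fill(mask, float('-inf'))

    def topp_sample(logits: torch.Tensor, p: float) -> torch.Tensor:
        probs = torch.softmax(topp_filter(logits, p), dim=-1)
        return torch.multinomial(probs, num_samples=1)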
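
And a minimal PyTorch sketch of the fused variant
embedding_lookup_pos_encoding (the function name comes from the commit; the
sqrt(hidden) scale and the learned pos_table layout are assumptions of this
sketch):

    import math
    import torch

    def embedding_lookup_pos_encoding(token_ids: torch.Tensor,
                                      embed_table: torch.Tensor,
                                      pos_table: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) int64, embed_table: (vocab, hidden),
        # pos_table: (max_seq_len, hidden). A fused kernel performs the
        # lookup, scaling, and position add in a single memory pass.
        hidden_dim = embed_table.shape[-1]
        hidden = embed_table[token_ids] * math.sqrt(hidden_dim)  # assumed scaling
        positions = torch.arange(token_ids.shape[-1], device=token_ids.device)
        return hidden + pos_table[positions]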

Additionally:
- test_gelu_kernel.py: Comprehensive correctness and performance tests

These kernels address critical gaps identified in KERNEL_MIGRATION_CHECKLIST.md:
- Sampling: PyTorch backend had only multinomial, now has Top-K/Top-P
- Activation: Extended from SiLU to include GELU
- Embedding: Enables fused prefill operations

Performance targets (vs TurboMind CUDA):
- GELU and Mul: ≥95% (simple elementwise)
- Embedding Lookup: ≥90% (memory-bound)
- Top-K/Top-P Sampling: ≥85% (compute-bound)

All kernels support:
- FP16/BF16/FP32 precision
- Auto-tuning for optimal performance
- Cross-platform (CUDA/ROCm/Intel XPU via Triton)

Resolves tasks from KERNEL_TODO_QUICK_REF.md:
- Task InternLM#8: GELU and Mul ✅
- Task InternLM#1: Top-K Sampling ✅
- Task InternLM#2: Top-P Sampling ✅
- Task InternLM#10: Embedding + Pos Encoding ✅

Next steps:
- Performance benchmarking on GPU
- Integration tests with lmdeploy models
- KV Cache quantization kernels (INT4/INT8)
