
Conversation

@lzhangzz
Collaborator

No description provided.

@lzhangzz lzhangzz merged commit 62c6080 into InternLM:main Jun 21, 2023
lvhan028 pushed a commit that referenced this pull request Aug 30, 2024
* support ascend using infer_ext

* fix(ascend): make infer_ext using TND format q,k,v in paged_token_attention

* support ascend using infer_ext

* feat: support ascend moe_gating_topk_softmax

* feat: change infer_ext ops function param order (#2)

* ascend: align attention mask to 32bytes (#7)

* fix attn args (#9)

* fix: expand shape of attn_mask (#10)

* feat: update infer_ext ops interface (#13)

* rename infer_ext to dlinfer

* format code

* Support internlm 2.5 (#14)

* refactor ascend pagedattention

* fix ascend apply_rotary_pos_emb

* fix import dlinfer (#16)

* fix: fix rms_norm params (#18)

* fix sync on ascend

---------

Co-authored-by: chenchiyu <chenchiyu@pjlab.org.cn>
Co-authored-by: CyCle1024 <ccy_justin@163.com>
Co-authored-by: Wei Tao <1136862851@qq.com>
Co-authored-by: jinminxi104 <jinminxi104@hotmail.com>
Co-authored-by: pdx1989 <pdx1989@gmail.com>
roy-shih pushed a commit to roy-shih/lmdeploy that referenced this pull request Nov 24, 2025
This commit implements 4 high-priority kernels to bridge the gap between
the TurboMind CUDA backend and the PyTorch (Triton) backend, enabling
cross-platform deployment:

1. GELU and Mul kernel (activation.py)
   - Fused GELU activation + elementwise multiply
   - Follows TurboMind's GELU formula: x * 0.5 * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3)))
   - Auto-tuned for different vocab sizes
   - Estimated speedup: 1.2-1.5x vs unfused PyTorch (see the sketch after this list)

2. Top-K Sampling kernel (topk_sampling.py)
   - High-performance top-k sampling with softmax normalization
   - Iterative max-finding approach optimized for Triton
   - Includes topk_filter for logits filtering
   - Reference PyTorch implementation for testing (sketched after this list)
   - Critical for inference quality

3. Top-P (Nucleus) Sampling kernel (topp_sampling.py)
   - Nucleus sampling with cumulative probability threshold
   - Greedy nucleus selection for Triton efficiency
   - Fused softmax + cumsum + sampling
   - topp_filter for pre-sampling logits filtering
   - Reference implementation included (sketched after this list)

4. Embedding Lookup + Position Encoding kernel (embedding_lookup.py)
   - Fused embedding lookup + position encoding
   - Three variants:
     * embedding_lookup: Basic lookup
     * embedding_lookup_pos_encoding: Fused lookup + pos encoding + scaling
     * add_position_encoding: Add pos encoding to existing embeddings
   - Auto-tuned for different hidden dimensions
   - Memory bandwidth optimized with vectorized loads (see the sketch after this list)
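
For illustration, here is a minimal Triton sketch of what such a fused
GELU-and-mul kernel can look like. The kernel name, flat contiguous layout,
and fixed block size are assumptions of this sketch; the PR's activation.py
additionally auto-tunes the launch configuration:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _gelu_and_mul_kernel(x_ptr, y_ptr, out_ptr, n_elements,
                             BLOCK: tl.constexpr):
        pid = tl.program_id(axis=0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n_elements
        x = tl.load(x_ptr + offs, mask=mask, other=0.0).to(tl.float32)
        y = tl.load(y_ptr + offs, mask=mask, other=0.0).to(tl.float32)
        # tanh-approximated GELU from the formula above, using the identity
        # 0.5 * (1 + tanh(t)) == sigmoid(2 * t) to avoid a tanh intrinsic.
        inner = 0.7978845608028654 * (x + 0.044715 * x * x * x)  # sqrt(2/pi) * (x + 0.044715 x^3)
        gelu = x * tl.sigmoid(2.0 * inner)
        tl.store(out_ptr + offs, (gelu * y).to(out_ptr.dtype.element_ty),
                 mask=mask)

    def gelu_and_mul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # out = GELU(x) * y, elementwise over flat contiguous tensors
        assert x.shape == y.shape and x.is_contiguous() and y.is_contiguous()
        out = torch.empty_like(x)
        n = x.numel()
        grid = (triton.cdiv(n, 1024),)
        _gelu_and_mul_kernel[grid](x, y, out, n, BLOCK=1024)
        return out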
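
A minimal PyTorch sketch of the top-k filter-then-sample idea described
above (topk_filter is the name the commit mentions; the body and the
topk_sample helper are assumptions of this sketch, with logits assumed to be
a (batch, vocab) tensor):

    import torch

    def topk_filter(logits: torch.Tensor, k: int) -> torch.Tensor:
        # Keep the k largest logits per row; mask the rest to -inf.
        kth = torch.topk(logits, k, dim=-1).values[..., -1:]
        return logits.masked_fill(logits < kth, float('-inf'))

    def topk_sample(logits: torch.Tensor, k: int) -> torch.Tensor:
        # Softmax over the surviving logits renormalizes the top-k mass,
        # then one token id is drawn per row.
        probs = torch.softmax(topk_filter(logits, k), dim=-1)
        return torch.multinomial(probs, num_samples=1)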
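
Similarly, a minimal PyTorch sketch of nucleus filtering (topp_filter is the
name the commit mentions; the body and the topp_sample helper are
assumptions of this sketch):

    import torch

    def topp_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
        # Sort descending, accumulate probability mass, and drop every token
        # after the cumulative mass first exceeds p.
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        drop = cum_probs > p
        drop[..., 1:] = drop[..., :-1].clone()  # shift: keep the crossing token
        drop[..., 0] = False                    # the top token always survives
        # Scatter the per-rank mask back to vocabulary order.
        mask = drop.scatter(dim=-1, index=sorted_idx, src=drop)
        return logits.masked_fill(mask, float('-inf'))

    def topp_sample(logits: torch.Tensor, p: float) -> torch.Tensor:
        probs = torch.softmax(topp_filter(logits, p), dim=-1)
        return torch.multinomial(probs, num_samples=1)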
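
And a minimal PyTorch sketch of the fused variant
embedding_lookup_pos_encoding (the function name comes from the commit; the
sqrt(hidden) scale and the learned pos_table layout are assumptions of this
sketch):

    import math
    import torch

    def embedding_lookup_pos_encoding(token_ids: torch.Tensor,
                                      embed_table: torch.Tensor,
                                      pos_table: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) int64, embed_table: (vocab, hidden),
        # pos_table: (max_seq_len, hidden). A fused kernel performs the
        # lookup, scaling, and position add in a single memory pass.
        hidden_dim = embed_table.shape[-1]
        hidden = embed_table[token_ids] * math.sqrt(hidden_dim)  # assumed scaling
        positions = torch.arange(token_ids.shape[-1], device=token_ids.device)
        return hidden + pos_table[positions]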

Additionally:
- test_gelu_kernel.py: Comprehensive correctness and performance tests

These kernels address critical gaps identified in KERNEL_MIGRATION_CHECKLIST.md:
- Sampling: PyTorch backend had only multinomial, now has Top-K/Top-P
- Activation: Extended from SiLU to include GELU
- Embedding: Enables fused prefill operations

Performance targets (vs TurboMind CUDA):
- GELU and Mul: ≥95% (simple elementwise)
- Embedding Lookup: ≥90% (memory-bound)
- Top-K/Top-P Sampling: ≥85% (compute-bound)

All kernels support:
- FP16/BF16/FP32 precision
- Auto-tuning for optimal performance
- Cross-platform (CUDA/ROCm/Intel XPU via Triton)

Resolves tasks from KERNEL_TODO_QUICK_REF.md:
- Task InternLM#8: GELU and Mul ✅
- Task InternLM#1: Top-K Sampling ✅
- Task InternLM#2: Top-P Sampling ✅
- Task InternLM#10: Embedding + Pos Encoding ✅

Next steps:
- Performance benchmarking on GPU
- Integration tests with lmdeploy models
- KV Cache quantization kernels (INT4/INT8)
