Conversation

@yuanheng-zhao (Contributor) commented on Apr 23, 2024

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs
  • I have installed pre-commit: pip install pre-commit && pre-commit install

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

📝 What does this PR do?

Summarize your work here.
If you have any plots/diagrams/screenshots/tables, please attach them here.

  1. Update the Llama3 config and inference settings in the example benchmark.
  2. Add a Llama3 generation demo script (see the sketch below).
  3. Add a cleaned-up version of the Llama3 benchmark.
  4. Fix rotary embedding (RoPE) initialization in the Llama policy.
[image attached]

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

@yuanheng-zhao added the example (example-related issue or pull request) label on Apr 23, 2024
@yuanheng-zhao marked this pull request as ready for review on April 23, 2024 09:26
@yuanheng-zhao requested a review from a team as a code owner on April 23, 2024 09:26
@yuanheng-zhao changed the title from "[example] Update Llama Inference Benchmark" to "[example] Update Llama Inference example" on Apr 23, 2024
@yuanheng-zhao (Contributor, Author) commented:

A cleaned-up version of the Llama benchmark has been added (examples/inference/benchmark_llama3.py) so that users no longer need to install vLLM to run the benchmark, as the older comparison script requires. The cleaned-up version removes unused args and invalid statements and focuses solely on benchmarking the Colossal-Inference engine.
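
For reference, below is a minimal sketch of the kind of timing loop such a standalone benchmark runs (only the Colossal-Inference engine, no vLLM path). The `engine.generate(prompts=..., generation_config=...)` call and the parameter names are assumptions, not the exact code in `benchmark_llama3.py`.

```python
# Hedged sketch: throughput/latency measurement against only the
# Colossal-Inference engine (no vLLM comparison path). The engine API
# (engine.generate) and parameter names are assumptions.
import time


def benchmark_engine(engine, generation_config, batch_size=16,
                     input_len=128, output_len=128, warmup=3, iters=10):
    """Time batched generation and report per-batch latency and token throughput."""
    # Synthetic prompts of roughly `input_len` tokens; a real run tokenizes real text.
    prompts = ["hello " * input_len] * batch_size

    # Warm-up iterations so kernel compilation / CUDA graph capture is excluded.
    for _ in range(warmup):
        engine.generate(prompts=prompts, generation_config=generation_config)

    start = time.perf_counter()
    for _ in range(iters):
        engine.generate(prompts=prompts, generation_config=generation_config)
    # engine.generate is assumed to be synchronous (returns decoded outputs);
    # otherwise a torch.cuda.synchronize() would be needed before stopping the timer.
    elapsed = time.perf_counter() - start

    total_out_tokens = batch_size * output_len * iters
    print(f"per-batch latency: {elapsed / iters:.3f} s | "
          f"throughput: {total_out_tokens / elapsed:.1f} tokens/s")
```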

@yuanheng-zhao (Contributor, Author) commented:

The CI check "Test Example on PR" will be fixed in a separate PR that addresses the Inference CI issues.

@yuanheng-zhao merged commit 04863a9 into hpcaitech:feature/colossal-infer on Apr 23, 2024
@yuanheng-zhao deleted the inference/example/benchmark-llama3 branch on April 23, 2024 14:23
botbw added a commit that referenced this pull request May 23, 2024
* [Inference] First PR for rebuild colossal-infer (#5143)

* add engine and scheduler

* add dirs

---------

Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>

* [Inference] Add readme (roadmap) and fulfill request handler (#5147)

* request handler

* add readme

---------

Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>

* [Inference/NFC] Clean outdated inference tests and deprecated kernels (#5159)

* [inference/nfc] remove outdated inference tests

* remove outdated kernel tests

* remove deprecated triton kernels

* remove imports from deprecated kernels

* [Inference]Add BatchInferState, Sequence and InferConfig (#5149)

* add infer_struct and infer_config

* update codes

* change InferConfig

* Add hf_model_config to the engine

* rm _get_hf_model_config

* update codes

* made adjustments according to the feedback from the reviewer.

* update codes

* add ci test for config and struct

* [Inference] Add CacheBlock and KV-Cache Manager (#5156)

* [Inference] Add KVCache Manager

* function refactored

* add test for KVCache Manager

* add attr beam width

* Revise alloc func in CacheManager

* Fix docs and pytests

* add tp slicing for head number

* optimize shapes of tensors used as physical cache

* Apply using InferenceConfig on KVCacheManager

* rm duplicate config file

* Optimize cache allocation: use contiguous cache

* Fix config in pytest (and config)

* [Inference]Update inference config and fix test (#5178)

* unify the config setting

* fix test

* fix import

* fix test

* fix

* fix

* add logger

* revise log info

---------

Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>

* [Inference] Add the logic of the inference engine (#5173)

* add infer_struct and infer_config

* update codes

* change InferConfig

* Add hf_model_config to the engine

* rm _get_hf_model_config

* update codes

* made adjustments according to the feedback from the reviewer.

* update codes

* add ci test for config and struct

* Add the logic of the inference engine

* update engine and test

* Recover cache_manager.py

* add logger

* fix conflict

* update codes

* update codes

* update model and tokenizer

* fix add the logic about shardformer

* change kvcache_manager docstring

* add policy

* fix ci bug in test_kvcache_manager.py

* remove codes related o tokenizer and move model_policy

* fix  code style

* add ordered_set to requirements-infer.txt

* Delete extra empty lines

* add ordered_set to requirements-test.txt

* [Inference] add logit processor and request handler (#5166)

* add logit processor and request handler

* add

* add

* add

* fix

* add search tokens and update func

* finish request handler

* add running list test

* fix test

* fix some bug

* add

* add

* fix bugs

* fix some bugs

* fix bug

* fix

* fix

* add copy fun

* del useless attn

* fix request status

---------

Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>

* Add padding llama model

* Fixed a bug in the inference frame

* fix bugs in request_handler

* precision alignment

* Fixed a writing error

* [kernel] Add triton kernel for context attention (FAv2) without padding (#5192)

* add context attn unpadded triton kernel

* test compatibility

* kv cache copy (testing)

* fix k/v cache copy

* fix kv cache copy and test

* fix boundary of block ptrs

* add support for GQA/MQA and testing

* fix import statement

---------

Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local>

* add context_attention_unpadded

* fix bugs in sampler

* Fixed a typo

* fix beam_width

* [Inference] Pytorch Attention func, pad&nopad input support (#5219)

* add attn

* add attention test

* fix attn forward

* fix decoding

* fix bugs in attention.py and request_handler.py

* adapted to pad_context_forward

* [Hotfix] Fix accuracy and align attention method api with Triton kernel (#5229)

* fix accuracy

* alignment in attention

* fix attention

* fix

* fix bugs

* fix bugs

* fix bugs

* fix bugs related to processing padding mask

* fix CI bugs

* rm torch.cuda.synchronize

* fix bugs in request_handler.py and engine.py

* [Inference] Kernel: no pad rotary embedding (#5252)

* fix bugs

* comment

* use more accurate atol

* fix

* [kernel] Add flash decoding triton kernel for blocked kv cache (#5249)

* add flash decoding unpad triton kernel

* rename flash decoding kernel

* add kernel testing (draft)

* revise pytest

* support kv group (GQA)

* (trivial) fix api and pytest

* (trivial) func renaming

* (trivial) func/file renaming

* refactor pytest for attention

* (trivial) format and consistent vars of context/decode attn

* (trivial) remove test redundancy

* [git] fixed rebased files

* [kernel] Add KV cache copy kernel during decoding  (#5261)

* add kv copy triton kernel during decoding stage

* add pytest and fix kernel

* fix test utilities

* revise kernel config

* add benchmark for kvcache copy

* [doc] updated inference readme (#5269)

* [Inference] Fix request handler and add recycle logic (#5260)

* fix request handler

* fix comment

* [kernel] Revise KVCache copy triton kernel API (#5273)

* [kernel/fix] revise kvcache copy kernel api

* fix benchmark

* [Inference]Adapted to the triton attn kernels (#5264)

* adapted to the triton attn kernels

* fix pad input

* adapted to copy_kv_to_blocked_cache

* fix ci test

* update kv memcpy

* remove print

* [kernel] Add RMSLayerNorm triton kernel (#5262)

* add layerrmsnorm triton kernel

* add layerrmsnorm kernel

* modify the atol and rtol in test file

* Remove the logics of mean computations, and update the name of ther kernel functions and files

* add benchmark of rms norm

* [Hotfix] Fix bugs in testing continuous batching (#5270)

* fix bug

* fix bugs

* fix bugs

* fix bugs and add padding

* add funcs and fix bugs

* fix typos

* fix bugs

* add func

* [kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking (#5274)

* prevent re-creating intermediate tensors

* add singleton class holding intermediate values

* fix triton kernel api

* add benchmark in pytest

* fix kernel api and add benchmark

* revise flash decoding triton kernel in/out shapes

* fix calling of triton kernel in modeling

* fix pytest: extract to util functions

* [inference] Adapted to Rotary Embedding and RMS Norm (#5283)

* adapted to rotary_embedding

* adapted to nopad rms norm

* fix bugs in benchmark

* fix flash_decoding.py

* add utils.py

* [Inference] Benchmarking rotary embedding and add a fetch function (#5277)

* fix bugs and add a cos/sin cache fetch func

* add docstring

* fix bug

* fix

* [Kernel/Fix] Revise flash attention triton kernel API and add benchmark (#5301)

* fix decoding kernel pytest

* revise and add triton context attn benchmark

* [Inference]Add fused rotary kernel and get cos cache kernel (#5302)

* add fused rotary and get cos cache func

* staged

* fix bugs

* fix bugs

* [hotfix] fix boundary check in batch (#5306)

* [inference]Optimize the usage of the mid tensors space in flash attn (#5304)

* opt flash attn

* opt tmp tensor

* fix benchmark_llama

* fix code style

* fix None logic for output tensor

* fix adapted to get_xine_cache

* add comment

* fix ci bugs

* fix some codes

* rm duplicated codes

* rm duplicated codes

* fix code style

* add _get_dtype in config.py

* fix (#5311)

* [Inference] Update rms norm kernel, benchmark with vLLM (#5315)

* add

* xi

* del

* del

* fix

* [DOC] Update inference readme  (#5280)

* add readme

* add readme

* 1

* update engine

* finish readme

* add readme

* [Inference]Add Nopadding Llama Modeling (#5327)

* add nopadding llama modeling

* add nopadding_llama.py

* rm unused codes

* fix bugs in test_xine_copy.py

* fix code style

* [Infer] Optimize Blocked KVCache And Kernels Using It (#5325)

* revise shape of kvcache (context attn kernel)

* revise shape of kvcache (flash decoding kernel)

* revise shape of kvcache (kvcache copy) and attn func

* init of kvcache in kvcache manager

* revise llama modeling

* revise block size retrieval

* use torch for rms_norm benchmarking

* revise block size retrieval

* [Inference] Kernel Fusion, fused copy kv cache into rotary embedding (#5336)

* revise rotary embedding

* remove useless print

* adapt

* [inference] simplified config verification (#5346)

* [inference] simplified config verification

* polish

* polish

* [Inference]Repalce Attention layer and MLP layer by shardformer to optimize the weight transpose operation,add fused_qkv and fused linear_add (#5340)

* add fused qkv

* replace attn and mlp by shardformer

* fix bugs in mlp

* add docstrings

* fix test_inference_engine.py

* add optimize unbind

* add fused_addmm

* rm squeeze(1)

* refactor codes

* fix ci bugs

* rename ShardFormerLlamaMLP and ShardFormerLlamaAttention

* Removed the dependency on LlamaFlashAttention2

* rollback test_inference_engine.py

* [inference] removed redundancy init_batch (#5353)

* [inference] moved ops tests to test_infer (#5354)

* [doc] updated inference readme (#5343)

* [Inference/opt]Optimize the mid tensor of RMS Norm (#5350)

* opt rms_norm

* fix bugs in rms_layernorm

* [Inference]Optimize generation process of inference engine (#5356)

* opt inference engine

* fix run_benchmark.sh

* fix generate in engine.py

* rollback tesh_inference_engine.py

* [Fix/Infer] Remove unused deps and revise requirements (#5341)

* remove flash-attn dep

* rm padding llama

* revise infer requirements

* move requirements out of module

* [Inference]Fused the gate and up proj in mlp,and optimized the autograd process. (#5365)

* fused the gate and up proj in mlp

* fix code styles

* opt auto_grad

* rollback test_inference_engine.py

* modifications based on the review feedback.

* fix bugs in flash attn

* Change reshape to view

* fix test_rmsnorm_triton.py

* [Inference] Adapt to Fused rotary (#5348)

* revise rotary embedding

* remove useless print

* adapt

* fix

* add

* fix

* modeling

* fix

* fix

* fix

* Revert "[Inference] Adapt to Fused rotary (#5348)" (#5373)

This reverts commit 9f4ab2e.

* [inference] added inference template (#5375)

* [Inference/opt] Fused KVCahce Memcopy (#5374)

* fused kv memcopy

* add TODO in test_kvcache_copy.py

* [Inference] User Experience: update the logic of default tokenizer and generation config.  (#5337)

* add

* fix

* fix

* pause

* fix

* fix pytest

* align

* fix

* license

* fix

* fix

* fix readme

* fix some bugs

* remove tokenizer config

* [inference] refactored config (#5376)

* [Inference]Support vllm testing in benchmark scripts (#5379)

* add vllm benchmark scripts

* fix code style

* update run_benchmark.sh

* fix code style

* [Inference] Optimize and Refactor Inference Batching/Scheduling (#5367)

* add kvcache manager funcs for batching

* add batch bucket for batching

* revise RunningList struct in handler

* add kvcache/batch funcs for compatibility

* use new batching methods

* fix indexing bugs

* revise abort logic

* use cpu seq lengths/block tables

* rm unused attr in Sequence

* fix type conversion/default arg

* add and revise pytests

* revise pytests, rm unused tests

* rm unused statements

* fix pop finished indexing issue

* fix: use index in batch when retrieving inputs/update seqs

* use dict instead of odict in batch struct

* arg type hinting

* fix make compress

* refine comments

* fix: pop_n_seqs to pop the first n seqs

* add check in request handler

* remove redundant conversion

* fix test for request handler

* fix pop method in batch bucket

* fix prefill adding

* [Inference]Fused kv copy into rotary calculation (#5383)

* revise rotary embedding

* remove useless print

* adapt

* fix

* add

* fix

* modeling

* fix

* fix

* fix

* fused kv copy

* fused copy

* colossalai/kernel/triton/no_pad_rotary_embedding.py

* del padding llama

* del

* Optimized the execution interval time between cuda kernels caused by view and memcopy (#5390)

* opt_view_and_memcopy

* fix bugs in ci

* fix ci bugs

* update benchmark scripts

* fix ci bugs

* [Fix/Inference] Fix format of input prompts and input model  in inference engine (#5395)

* Fix bugs in inference_engine

* fix bugs in engine.py

* rm  CUDA_VISIBLE_DEVICES

* add request_ids in generate

* fix bug in engine.py

* add logger.debug for BatchBucket

* [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)

fix dependency in pytest

* [Inference]Add CUDA KVCache Kernel (#5406)

* add cuda KVCache kernel

* annotation benchmark_kvcache_copy

* add use cuda

* fix import path

* move benchmark scripts to example/

* rm benchmark codes in test_kv_cache_memcpy.py

* rm redundancy codes

* rm redundancy codes

* pr was modified according to the review

* [Inference]Move benchmark-related code to the example directory. (#5408)

* move benchmark-related code to the example directory.

* fix bugs in test_fused_rotary_embedding.py

* add silu_and_mul for infer

* [feat] cuda graph support and refactor non-functional api

* add reusable utils for cuda

* refactor code

* feat rmsnorm cuda kernel and add unittest, benchmark script (#5417)

* [fix] multi graphs capture error

* [fix] multi graphs capture error

* [doc] add doc

* refactor code

* optimize rmsnorm: add vectorized elementwise op, feat loop unrolling (#5441)

* fix include path

* fix rmsnorm template function invocation problem(template function partial specialization is not allowed in Cpp) and luckily pass e2e precision test (#5454)

* [Inference/kernel]Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel (#5418)

* add rotary embedding kernel

* add rotary_embedding_kernel

* add fused rotary_emb and kvcache memcopy

* add fused_rotary_emb_and_cache_kernel.cu

* add fused_rotary_emb_and_memcopy

* fix bugs in fused_rotary_emb_and_cache_kernel.cu

* fix ci bugs

* use vec memcopy and opt the  gloabl memory access

* fix code style

* fix test_rotary_embdding_unpad.py

* codes revised based on the review comments

* fix bugs about include path

* rm inline

* [fix] pytest and fix dyn grid bug

* diverse tests

* add implementatino for GetGPULaunchConfig1D

* [fix] tmp for test

* add some comments

* refactor vector utils

* [feat] add use_cuda_kernel option

* add vec_type_trait implementation (#5473)

* [fix] unused option

* [fix]

* [fix]

* [fix] remove unused comment

* [Inference]Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding (#5461)

* Support FP16/BF16 Flash Attention 2

* fix bugs in test_kv_cache_memcpy.py

* add context_kv_cache_memcpy_kernel.cu

* rm typename MT

* add tail process

* add high_precision

* add high_precision to config.py

* rm unused code

* change the comment for the high_precision parameter

* update test_rotary_embdding_unpad.py

* fix vector_copy_utils.h

* add comment for self.high_precision when using float32

* [fix] PR #5354 (#5501)

* [fix]

* [fix]

* Update config.py docstring

* [fix] docstring align

* [fix] docstring align

* [fix] docstring align

* [Inference] Optimize request handler of llama (#5512)

* optimize request_handler

* fix ways of writing

* The writing style of tail processing and the logic related to macro definitions have been optimized. (#5519)

* [Inference/Kernel]Add get_cos_and_sin Kernel (#5528)

* Add get_cos_and_sin kernel

* fix code comments

* fix code typos

* merge common codes of get_cos_and_sin kernel.

* Fixed a typo

* Changed 'asset allclose' to 'assert equal'.

* [Inference] Add Reduce Utils (#5537)

* add reduce utils

* add using to delele namespace prefix

* [Fix/Inference] Remove unused and non-functional functions (#5543)

* [fix] remove unused func

* rm non-functional partial

* add cast and op_functor for cuda build-in types (#5546)

* remove unused triton kernels

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove outdated triton test

* [Infer] Revise and Adapt Triton Kernels for Spec-Dec (#5401)

* [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)

fix dependency in pytest

* resolve conflicts for revising flash-attn

* adapt kv cache copy kernel for spec-dec

* fix seqlen-n kvcache copy kernel/tests

* test kvcache copy - use torch.equal

* add assertions

* (trivial) comment out

* [Inference/SpecDec] Add Basic Drafter Model Container (#5405)

* [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)

fix dependency in pytest

* add drafter model container (basic ver)

* [Inference/SpecDec] Add Speculative Decoding Implementation (#5423)

* fix flash decoding mask during verification

* add spec-dec

* add test for spec-dec

* revise drafter init

* remove drafter sampling

* retire past kv in drafter

* (trivial) rename attrs

* (trivial) rename arg

* revise how we enable/disable spec-dec

* [SpecDec] Fix inputs for speculation and revise past KV trimming (#5449)

* fix drafter pastkv and usage of batch bucket

* [Inference/SpecDec] Support GLIDE Drafter Model (#5455)

* add glide-llama policy and modeling

* update glide modeling, compitable with transformers 4.36.2

* revise glide llama modeling/usage

* fix issues of glimpsing large kv

* revise the way re-loading params for glide drafter

* fix drafter and engine tests

* enable convert to glide strict=False

* revise glide llama modeling

* revise vicuna prompt template

* revise drafter and tests

* apply usage of glide model in engine

* [doc] Add inference/speculative-decoding README (#5552)

* add README for spec-dec

* update roadmap

* [Fix] resolve conflicts of rebasing feat/speculative-decoding (#5557)

- resolve conflicts of rebasing feat/speculative-decoding

* [Fix] Llama Modeling Control with Spec-Dec (#5580)

- fix ref before asgmt
- fall back to use triton kernels when using spec-dec

* refactor csrc (#5582)

* [Inference/Refactor] Delete Duplicated code and refactor vec_copy utils and reduce utils (#5593)

* delete duplicated code and refactor vec_copy utils and reduce utils

* delete unused header file

* [inference/model]Adapted to the baichuan2-7B model (#5591)

* Adapted to the baichuan2-7B model

* modified according to the review comments.

* Modified the method of obtaining random weights.

* modified according to the review comments.

* change mlp layewr 'NOTE'

* [Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block (#5531)

* feat flash decoding for paged attention

* refactor flashdecodingattention

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [Feat]Tensor Model Parallel Support For Inference (#5563)

* tensor parallel support naive source

* [fix]precision, model load and refactor the framework

* add tp unit test

* docstring

* fix do_sample

* feat baichuan2 rmsnorm whose hidden size equals to 5120 (#5611)

* [Fix/Inference] Fix GQA Triton and Support Llama3 (#5624)

* [fix] GQA calling of flash decoding triton

* fix kv cache alloc shape

* fix rotary triton - GQA

* fix sequence max length assigning

* Sequence max length logic

* fix scheduling and spec-dec

* skip without import error

* fix pytest - skip without ImportError

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [Fix/Inference]Fix CUDA Rotary Rmbedding GQA (#5623)

* fix rotary embedding GQA

* change test_rotary_embdding_unpad.py KH

* [example] Update Llama Inference example (#5629)

* [example] add infernece benchmark llama3

* revise inference config - arg

* remove unused args

* add llama generation demo script

* fix init rope in llama policy

* add benchmark-llama3 - cleanup

* [Inference/Refactor] Refactor compilation mechanism and unified multi hw (#5613)

* refactor compilation mechanism and unified multi hw

* fix file path bug

* add init.py to make pybind a module to avoid relative path error caused by softlink

* delete duplicated micros

* fix micros bug in gcc

* [Fix/Inference]Fix vllm benchmark (#5630)

* Fix bugs about OOM when running vllm-0.4.0

* rm used params

* change generation_config

* change benchmark log file name

* [Inference/Kernel] Optimize paged attention: Refactor key cache layout (#5643)

* optimize flashdecodingattention: refactor code with different key cache layout(from [num_blocks, num_kv_heads, block_size, head_size] to [num_blocks, num_kv_heads, head_size/x, block_size, x])

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [Fix] Remove obsolete files - inference (#5650)

* [Inference]Adapt to baichuan2 13B (#5614)

* adapt to baichuan2 13B

* adapt to baichuan2 13B

* change BAICHUAN_MODEL_NAME_OR_PATH

* fix test_decoding_attn.py

* Modifications based on review comments.

* change BAICHUAN_MODEL_NAME_OR_PATH

* mv attn mask processes to test flash decoding

* mv get_alibi_slopes baichuan modeling

* fix bugs in test_baichuan.py

* [kernel] Support new KCache Layout - Context Attention Triton Kernel (#5658)

* add context attn triton kernel - new kcache layout

* add benchmark triton

* tiny revise

* trivial - code style, comment

* [Inference/Feat] Add kvcache quantization support for FlashDecoding (#5656)

* [Inference/Feat] Feat quant kvcache step2 (#5674)

* [Inference] Adapt Baichuan2-13B TP (#5659)

* adapt to baichuan2 13B

* add baichuan2 13B TP

* update baichuan tp logic

* rm unused code

* Fix TP logic

* fix alibi slopes tp logic

* rm nn.Module

* Polished the code.

* change BAICHUAN_MODEL_NAME_OR_PATH

* Modified the logic for loading Baichuan weights.

* fix typos

* [Inference/Kernel] refactor kvcache manager and rotary_embedding and kvcache_memcpy oper… (#5663)

* refactor kvcache manager and rotary_embedding and kvcache_memcpy operator

* refactor decode_kv_cache_memcpy

* enable alibi in pagedattention

* [Inference/Feat] Add kvcache quant support for fused_rotary_embedding_cache_copy (#5680)

* [inference]Add alibi to flash attn function (#5678)

* add alibi to flash attn function

* rm redundant modifications

* [Inference] Fix quant bits order (#5681)

* [kernel] Support New KCache Layout - Triton Kernel (#5677)

* kvmemcpy triton for new kcache layout

* revise tests for new kcache layout

* naive triton flash decoding - new kcache layout

* rotary triton kernel - new kcache layout

* remove redundancy - triton decoding

* remove redundancy - triton kvcache copy

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [Fix] Fix & Update Inference Tests (compatibility w/ main)

* [Inference] Remove unnecessary float4_ and rename float8_ to float8 (#5679)

* [Inference/Feat] Add quant kvcache support for decode_kv_cache_memcpy (#5686)

* [hotfix] Fix KV Heads Number Assignment in KVCacheManager (#5695)

- Fix key value number assignment in KVCacheManager, as well as method of accessing

* [Fix] Fix Inference Example, Tests, and Requirements (#5688)

* clean requirements

* modify example inference struct

* add test ci scripts

* mark test_infer as submodule

* rm deprecated cls & deps

* import of HAS_FLASH_ATTN

* prune inference tests to be run

* prune triton kernel tests

* increment pytest timeout mins

* revert import path in openmoe

* [hotfix] fix OpenMOE example import path (#5697)

* [Inference]Adapt temperature processing logic (#5689)

* Adapt temperature processing logic

* add ValueError for top_p and top_k

* add GQA Test

* fix except_msg

* [Inference] Support the logic related to ignoring EOS token (#5693)

* Adapt temperature processing logic

* add ValueError for top_p and top_k

* add GQA Test

* fix except_msg

* support ignore EOS token

* change variable's name

* fix annotation

* [Inference] ADD  async and sync Api server using FastAPI (#5396)

* add api server

* fix

* add

* add completion service and fix bug

* add generation config

* revise shardformer

* fix bugs

* add docstrings and fix some bugs

* fix bugs and add choices for prompt template

* [Inference] Finish Online Serving Test, add streaming output api, continuous batching test and example (#5432)

* finish online test and add examples

* fix test_contionus_batching

* fix some bugs

* fix bash

* fix

* fix inference

* finish revision

* fix typos

* revision

* [Online Server] Chat Api for streaming and not streaming response (#5470)

* fix bugs

* fix bugs

* fix api server

* fix api server

* add chat api and test

* del request.n

* [Inference] resolve rebase conflicts

fix

* [Inference] Fix bugs and docs for feat/online-server (#5598)

* fix test bugs

* add do sample test

* del useless lines

* fix comments

* fix tests

* delete version tag

* delete version tag

* add

* del test sever

* fix test

* fix

* Revert "add"

This reverts commit b9305fb.

* resolve rebase conflicts on Branch feat/online-serving

* [Inference] Add example test_ci script

* [Inference/Feat] Add quant kvcache interface (#5700)

* add quant kvcache interface

* delete unused output

* complete args comments

* [Inference/Feat] Add convert_fp8 op for fp8 test in the future (#5706)

* add convert_fp8 op for fp8 test in the future

* rerun ci

* [Inference]Adapt repetition_penalty and no_repeat_ngram_size (#5708)

* Adapt repetition_penalty and no_repeat_ngram_size

* fix no_repeat_ngram_size_logit_process

* remove batch_updated

* fix annotation

* modified codes based on the review feedback.

* rm get_batch_token_ids

* [Feat]Inference RPC Server Support (#5705)

* rpc support source
* kv cache logical/physical disaggregation
* sampler refactor
* colossalai launch built in
* Unitest
* Rpyc support

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* add paged-attetionv2: support seq length split across thread block (#5707)

* [Inference] Delete duplicated copy_vector (#5716)

* [ci] Fix example tests (#5714)

* [fix] revise timeout value on example CI

* trivial

* [Fix] Llama3 Load/Omit CheckpointIO Temporarily (#5717)

* Fix Llama3 Load error
* Omit Checkpoint IO Temporarily

* [Inference] Fix API server, test and example (#5712)

* fix api server

* fix generation config

* fix api server

* fix comments

* fix infer hanging bug

* resolve comments, change backend to free port

* 【Inference] Delete duplicated package (#5723)

* [example] Update Inference Example (#5725)

* [example] update inference example

* [lazy] fix lazy cls init (#5720)

* fix

* fix

* fix

* fix

* fix

* remove kernel intall

* rebase

revert

fix

* fix

* fix

* [Inference] Fix Inference Generation Config and Sampling (#5710)

* refactor and add

* config default values

* fix gen config passing

* fix rpc generation config

* [Fix/Inference] Add unsupported auto-policy error message (#5730)

* [fix] auto policy error message

* trivial

* [doc] Update Inference Readme (#5736)

* [doc] update inference readme

* add contents

* trivial

* [Shardformer] Add parallel output for shardformer models(bloom, falcon) (#5702)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

* add parallel cross entropy output for falcon model & fix some typos in bloom.py

* fix module name error, self.model -> self.transformers in bloom, falcon model

* Fix the overflow bug of distributed cross entropy loss function when training with fp16

* add dtype to parallel cross entropy loss function

* fix dtype related typos adn prettify the loss.py

* fix grad dtype and update dtype mismatch error

* fix typo bugs

* [bug] fix silly bug

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [chore] add test for prefetch

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [ci] Temporary fix for build on pr (#5741)

* temporary fix for CI

* timeout to 90

* [NFC] Fix code factors on inference triton kernels (#5743)

* [NFC]  fix requirements (#5744)

* [inference] release (#5747)

* [inference] release

* [inference] release

* [inference] release

* [inference] release

* [inference] release

* [inference] release

* [inference] release

---------

Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local>
Co-authored-by: FrankLeeeee <somerlee.9@gmail.com>
Co-authored-by: Yaozheng Fang <62918515+nkfyz@users.noreply.github.com>
Co-authored-by: xs_courtesy <xs1580802568@gmail.com>
Co-authored-by: Runyu Lu <runyulu@umich.edu>
Co-authored-by: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Co-authored-by: Runyu Lu <77330637+LRY89757@users.noreply.github.com>
Co-authored-by: Yuanheng <jonathan.zhaoyh@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
Co-authored-by: flybird11111 <1829166702@qq.com>
Co-authored-by: Haze188 <haze188@qq.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
botbw added a commit that referenced this pull request May 23, 2024
commit 4647ec28c8450ee96f4709626617763712efd77e
Author: binmakeswell <binmakeswell@gmail.com>
Date:   Thu May 23 17:44:06 2024 +0800

    [inference] release (#5747)

    * [inference] release

    * [inference] release

    * [inference] release

    * [inference] release

    * [inference] release

    * [inference] release

    * [inference] release

commit df6747603f11e2a1929db193ceb014799e02e2c1
Merge: 22ce873c 498f42c4
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Wed May 22 14:31:09 2024 +0800

    [Colossal-Inference] (v0.1.0) Merge pull request #5739 from hpcaitech/feature/colossal-infer

    [Inference] Merge feature/colossal-infer

commit 498f42c45b256b5cfc32d74b552e1e306f317a42
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Wed May 22 12:08:49 2024 +0800

    [NFC]  fix requirements (#5744)

commit bd38fe6b912379080673a43d77fd3bdf0e5c852e
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue May 21 22:12:15 2024 +0800

    [NFC] Fix code factors on inference triton kernels (#5743)

commit c2c8c9cf17d67000df8a5b75ae9dbecee0e1c00a
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue May 21 18:20:57 2024 +0800

    [ci] Temporary fix for build on pr (#5741)

    * temporary fix for CI

    * timeout to 90

commit c06208e72c35d74e150b6a83e72375f5021d10b1
Merge: d8b1ea4a 8633c15d
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue May 21 11:26:37 2024 +0800

    Merge pull request #5737 from yuanheng-zhao/inference/sync/main

    [sync] Sync feature/colossal-infer with main

commit 22ce873c3f26fd7f4217cdf19071c173683c2b47
Author: Haze188 <haze188@qq.com>
Date:   Tue May 21 11:07:13 2024 +0800

    [Shardformer] Add parallel output for shardformer models(bloom, falcon) (#5702)

    * [pre-commit.ci] auto fixes from pre-commit.com hooks

    * add parallel cross entropy output for falcon model & fix some typos in bloom.py

    * fix module name error, self.model -> self.transformers in bloom, falcon model

    * Fix the overflow bug of distributed cross entropy loss function when training with fp16

    * add dtype to parallel cross entropy loss function

    * fix dtype related typos adn prettify the loss.py

    * fix grad dtype and update dtype mismatch error

    * fix typo bugs

commit 8633c15da9b82c675c59ad292e7f0d77f092653c
Merge: d8b1ea4a 9d83c6d7
Author: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
Date:   Mon May 20 15:50:53 2024 +0000

    [sync] Sync feature/colossal-infer with main

commit d8b1ea4ac90317ad6126acbd854e66583a8f9c8f
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Mon May 20 22:50:04 2024 +0800

    [doc] Update Inference Readme (#5736)

    * [doc] update inference readme

    * add contents

    * trivial

commit bdf9a001d61cfad4bb68752c4a808295165307a0
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Mon May 20 22:49:18 2024 +0800

    [Fix/Inference] Add unsupported auto-policy error message (#5730)

    * [fix] auto policy error message

    * trivial

commit 283c407a19002118bda7edd1b8a3acf099843205
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Sun May 19 15:08:42 2024 +0800

    [Inference] Fix Inference Generation Config and Sampling (#5710)

    * refactor and add

    * config default values

    * fix gen config passing

    * fix rpc generation config

commit 9d83c6d715e8cdb802f82335e651923baab5cfc6
Author: flybird11111 <1829166702@qq.com>
Date:   Fri May 17 18:18:59 2024 +0800

    [lazy] fix lazy cls init (#5720)

    * fix

    * fix

    * fix

    * fix

    * fix

    * remove kernel intall

    * rebase

    revert

    fix

    * fix

    * fix

commit 8bcfe360fdae7ccec7051aaced48497519afc2f2
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Fri May 17 11:28:53 2024 +0800

    [example] Update Inference Example (#5725)

    * [example] update inference example

commit a8d459f99a1d415fc843327e4dafce19ecee1f3e
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Thu May 16 10:49:03 2024 +0800

    【Inference] Delete duplicated package (#5723)

commit f47f2fbb2467df15548d2c663b119f4ae0103890
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Wed May 15 15:47:31 2024 +0800

    [Inference] Fix API server, test and example (#5712)

    * fix api server

    * fix generation config

    * fix api server

    * fix comments

    * fix infer hanging bug

    * resolve comments, change backend to free port

commit 74c47921facd26dbd93172bf887abcad4eab2d5c
Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com>
Date:   Tue May 14 20:17:43 2024 +0800

    [Fix] Llama3 Load/Omit CheckpointIO Temporarily (#5717)

    * Fix Llama3 Load error
    * Omit Checkpoint IO Temporarily

commit 5bbab1533ae7672ab37e91b7bc9e584b3a4e9cc1
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue May 14 16:08:51 2024 +0800

    [ci] Fix example tests (#5714)

    * [fix] revise timeout value on example CI

    * trivial

commit 121d7ad629c746e52a96ec53d6e26c0194016a03
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Tue May 14 14:35:33 2024 +0800

    [Inference] Delete duplicated copy_vector (#5716)

commit 7806842f2dbb4b6d6e74014efc7db5be8ccf0bbd
Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Date:   Tue May 14 12:46:54 2024 +0800

    add paged-attetionv2: support seq length split across thread block (#5707)

commit 18d67d0e8e79c22bded0745c7d3daf8ca40d445c
Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com>
Date:   Tue May 14 10:00:55 2024 +0800

    [Feat]Inference RPC Server Support (#5705)

    * rpc support source
    * kv cache logical/physical disaggregation
    * sampler refactor
    * colossalai launch built in
    * Unitest
    * Rpyc support

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

commit de4bf3dedf2c7cb7ba6c3044745bab3c3ef6352d
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Sat May 11 15:13:25 2024 +0800

    [Inference]Adapt repetition_penalty and no_repeat_ngram_size (#5708)

    * Adapt repetition_penalty and no_repeat_ngram_size

    * fix no_repeat_ngram_size_logit_process

    * remove batch_updated

    * fix annotation

    * modified codes based on the review feedback.

    * rm get_batch_token_ids

commit 50104ab340e6c7067fbaaf9b47c608eb828aa95b
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Fri May 10 18:39:54 2024 +0800

    [Inference/Feat] Add convert_fp8 op for fp8 test in the future (#5706)

    * add convert_fp8 op for fp8 test in the future

    * rerun ci

commit bfad39357b0fe31ecf6f7639e2c4056165078a3f
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Thu May 9 18:03:24 2024 +0800

    [Inference/Feat] Add quant kvcache interface (#5700)

    * add quant kvcache interface

    * delete unused output

    * complete args comments

commit 492520dbdb962d207ac40d216e0414807f73eb19
Merge: d4829220 5d9a4948
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Thu May 9 17:19:45 2024 +0800

    Merge pull request #5588 from hpcaitech/feat/online-serving

    [Feature]Online Serving

commit 5d9a49483d98ccd4bebebbfd039162caceefe6bd
Author: CjhHa1 <cjh18671720497@outlook.com>
Date:   Thu May 9 05:44:05 2024 +0000

    [Inference] Add example test_ci script

commit bc9063adf1598c3be32fc2d12577d76b9daa79bf
Author: CjhHa1 <cjh18671720497@outlook.com>
Date:   Wed May 8 10:36:42 2024 +0000

    resolve rebase conflicts on Branch feat/online-serving

commit 61a1b2e798edcbf91ac35966a4047407ad6aa62d
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Wed May 8 15:14:06 2024 +0800

    [Inference] Fix bugs and docs for feat/online-server (#5598)

    * fix test bugs

    * add do sample test

    * del useless lines

    * fix comments

    * fix tests

    * delete version tag

    * delete version tag

    * add

    * del test sever

    * fix test

    * fix

    * Revert "add"

    This reverts commit b9305fb02440d5cd566d32b508bee9f9c13dda15.

commit 7bbb28e48bdb5849d9dfb118d7bf2959d79bbe02
Author: CjhHa1 <cjh18671720497@outlook.com>
Date:   Thu Apr 11 10:12:31 2024 +0800

    [Inference] resolve rebase conflicts

    fix

commit c06403286567f62cb0a6dfc5e075cf60e291cea9
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Sun Apr 7 14:45:43 2024 +0800

    [Online Server] Chat Api for streaming and not streaming response (#5470)

    * fix bugs

    * fix bugs

    * fix api server

    * fix api server

    * add chat api and test

    * del request.n

commit de378cd2abd77b464786dc5f8298c9edbf023fbc
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Mon Mar 18 17:06:05 2024 +0800

    [Inference] Finish Online Serving Test, add streaming output api, continuous batching test and example (#5432)

    * finish online test and add examples

    * fix test_contionus_batching

    * fix some bugs

    * fix bash

    * fix

    * fix inference

    * finish revision

    * fix typos

    * revision

commit 69cd7e069d5705c7e431b301ac14924711c74e41
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Fri Mar 1 14:47:36 2024 +0800

    [Inference] ADD  async and sync Api server using FastAPI (#5396)

    * add api server

    * fix

    * add

    * add completion service and fix bug

    * add generation config

    * revise shardformer

    * fix bugs

    * add docstrings and fix some bugs

    * fix bugs and add choices for prompt template

commit d482922035ff7b6fe7ced8e6c4028faa2d68197f
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Wed May 8 19:59:10 2024 +0800

     [Inference] Support the logic related to ignoring EOS token (#5693)

    * Adapt temperature processing logic

    * add ValueError for top_p and top_k

    * add GQA Test

    * fix except_msg

    * support ignore EOS token

    * change variable's name

    * fix annotation

commit 9c2fe7935ff5aaec4f174cfba6f324df623c7447
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Wed May 8 17:58:29 2024 +0800

    [Inference]Adapt temperature processing logic (#5689)

    * Adapt temperature processing logic

    * add ValueError for top_p and top_k

    * add GQA Test

    * fix except_msg

commit 12e7c28d5e8f219480d1dbc682fd225dc76fcc2b
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Wed May 8 15:48:47 2024 +0800

    [hotfix] fix OpenMOE example import path (#5697)

commit 55cc7f3df7c600deae2f344ee162abae5a5c63e1
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Wed May 8 11:30:15 2024 +0800

    [Fix] Fix Inference Example, Tests, and Requirements (#5688)

    * clean requirements

    * modify example inference struct

    * add test ci scripts

    * mark test_infer as submodule

    * rm deprecated cls & deps

    * import of HAS_FLASH_ATTN

    * prune inference tests to be run

    * prune triton kernel tests

    * increment pytest timeout mins

    * revert import path in openmoe

commit f9afe0addd89303de4819debd93efe97d5618238
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue May 7 23:13:14 2024 +0800

    [hotfix] Fix KV Heads Number Assignment in KVCacheManager (#5695)

    - Fix key value number assignment in KVCacheManager, as well as method of accessing

commit 1ace1065e6bff175a0af88cae86d272acef29c9f
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Mon May 6 15:35:13 2024 +0800

    [Inference/Feat] Add quant kvcache support for decode_kv_cache_memcpy (#5686)

commit db7b3051f4379862f88790bf1653ddb6443c002e
Merge: 725fbd2e 8754abae
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Mon May 6 14:43:38 2024 +0800

    [Sync] Update from main to feature/colossal-infer (Merge pull request #5685)

    [Sync] Update from main to feature/colossal-infer

    - Merge pull request #5685 from yuanheng-zhao/inference/merge/main

commit 725fbd2ed067f9c58ac04670377d3e6f2a96fe00
Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Date:   Mon May 6 10:55:34 2024 +0800

    [Inference] Remove unnecessary float4_ and rename float8_ to float8 (#5679)

commit 8754abae24dbcc492d2992d1091428592b615285
Author: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
Date:   Sun May 5 16:28:56 2024 +0000

    [Fix] Fix & Update Inference Tests (compatibility w/ main)

commit 56ed09aba5e017fc0c211dac70215c2f83815919
Merge: 537a3cbc d3f34ee8
Author: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
Date:   Sun May 5 05:14:00 2024 +0000

    [sync] resolve conflicts of merging main

commit 537a3cbc4df445786c8ecf2af0a2998e2fd881b6
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Fri May 3 17:20:45 2024 +0800

    [kernel] Support New KCache Layout - Triton Kernel (#5677)

    * kvmemcpy triton for new kcache layout

    * revise tests for new kcache layout

    * naive triton flash decoding - new kcache layout

    * rotary triton kernel - new kcache layout

    * remove redundancy - triton decoding

    * remove redundancy - triton kvcache copy

    * [pre-commit.ci] auto fixes from pre-commit.com hooks

    for more information, see https://pre-commit.ci

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

commit 9df016fc4520a5a5c95a11ed04a8ac62bde039c4
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Tue Apr 30 19:38:00 2024 +0800

    [Inference] Fix quant bits order (#5681)

commit f79963199cd30c5e917d430aedd79113d06d608c
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Apr 30 19:35:05 2024 +0800

    [inference]Add alibi to flash attn function (#5678)

    * add alibi to flash attn function

    * rm redundant modifications

commit ef8e4ffe310bfe21f83feb965d962d816d75bc88
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Tue Apr 30 18:33:53 2024 +0800

    [Inference/Feat] Add kvcache quant support for fused_rotary_embedding_cache_copy (#5680)

commit 5cd75ce4c7edc95bacd8ec5fc04b8add339e8331
Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Date:   Tue Apr 30 15:52:23 2024 +0800

    [Inference/Kernel] refactor kvcache manager and rotary_embedding and kvcache_memcpy oper… (#5663)

    * refactor kvcache manager and rotary_embedding and kvcache_memcpy operator

    * refactor decode_kv_cache_memcpy

    * enable alibi in pagedattention

commit 5f00002e43bd738a99fea250306e54c8c908f05a
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Apr 30 15:47:07 2024 +0800

    [Inference] Adapt Baichuan2-13B TP (#5659)

    * adapt to baichuan2 13B

    * add baichuan2 13B TP

    * update baichuan tp logic

    * rm unused code

    * Fix TP logic

    * fix alibi slopes tp logic

    * rm nn.Module

    * Polished the code.

    * change BAICHUAN_MODEL_NAME_OR_PATH

    * Modified the logic for loading Baichuan weights.

    * fix typos

commit 808ee6e4addccb51990398434547fa5df3c255b0
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Tue Apr 30 11:26:36 2024 +0800

    [Inference/Feat] Feat quant kvcache step2 (#5674)

commit 8ccb6714e79137c8e6e50d9a585eadbf70ae6fc0
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Fri Apr 26 19:40:37 2024 +0800

    [Inference/Feat] Add kvcache quantization support for FlashDecoding (#5656)

commit 5be590b99eb6c58c3aa809d453680139fdd2b9f7
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Fri Apr 26 17:51:49 2024 +0800

    [kernel] Support new KCache Layout - Context Attention Triton Kernel (#5658)

    * add context attn triton kernel - new kcache layout

    * add benchmark triton

    * tiny revise

    * trivial - code style, comment

commit 3c91e3f1763d2a30a85187a3a606dbe4d1b9454d
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Thu Apr 25 23:11:30 2024 +0800

    [Inference]Adapt to baichuan2 13B (#5614)

    * adapt to baichuan2 13B

    * adapt to baichuan2 13B

    * change BAICHUAN_MODEL_NAME_OR_PATH

    * fix test_decoding_attn.py

    * Modifications based on review comments.

    * change BAICHUAN_MODEL_NAME_OR_PATH

    * mv attn mask processes to test flash decoding

    * mv get_alibi_slopes baichuan modeling

    * fix bugs in test_baichuan.py

commit f342a9387168cedc2e5cc33155939c6d0c4e99a0
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Thu Apr 25 22:04:59 2024 +0800

    [Fix] Remove obsolete files - inference (#5650)

commit a8fd3b034235e1fa987a1ae85a9a2b465ee6128f
Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Date:   Thu Apr 25 14:24:02 2024 +0800

    [Inference/Kernel] Optimize paged attention: Refactor key cache layout (#5643)

    * optimize flashdecodingattention: refactor code with different key cache layout(from [num_blocks, num_kv_heads, block_size, head_size] to [num_blocks, num_kv_heads, head_size/x, block_size, x])

    * [pre-commit.ci] auto fixes from pre-commit.com hooks

    for more information, see https://pre-commit.ci

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

commit 90cd5227a348dfe506e95b2e49f2a8dcd34fdbca
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Wed Apr 24 14:51:36 2024 +0800

    [Fix/Inference]Fix vllm benchmark (#5630)

    * Fix bugs about OOM when running vllm-0.4.0

    * rm used params

    * change generation_config

    * change benchmark log file name

commit 279300dc5f34db219c90a297c0996d00221eae96
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Wed Apr 24 14:17:54 2024 +0800

    [Inference/Refactor] Refactor compilation mechanism and unified multi hw (#5613)

    * refactor compilation mechanism and unified multi hw

    * fix file path bug

    * add init.py to make pybind a module to avoid relative path error caused by softlink

    * delete duplicated micros

    * fix micros bug in gcc

commit 04863a9b144fc7dd46a57d2c7b0cf2f4b351ffb6
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue Apr 23 22:23:07 2024 +0800

    [example] Update Llama Inference example (#5629)

    * [example] add infernece benchmark llama3

    * revise inference config - arg

    * remove unused args

    * add llama generation demo script

    * fix init rope in llama policy

    * add benchmark-llama3 - cleanup

commit 12f10d5b0b49a180bc162e166337942e0bbfb96b
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Apr 23 13:44:49 2024 +0800

    [Fix/Inference]Fix CUDA Rotary Rmbedding GQA (#5623)

    * fix rotary embedding GQA

    * change test_rotary_embdding_unpad.py KH

commit 5d4c1fe8f5f7019284f6cbc0ed29506748f63bf1
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue Apr 23 13:09:55 2024 +0800

    [Fix/Inference] Fix GQA Triton and Support Llama3 (#5624)

    * [fix] GQA calling of flash decoding triton

    * fix kv cache alloc shape

    * fix rotary triton - GQA

    * fix sequence max length assigning

    * Sequence max length logic

    * fix scheduling and spec-dec

    * skip without import error

    * fix pytest - skip without ImportError

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

commit ccf72797e3bfafcbfc42870ce24ee484858d4852
Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Date:   Fri Apr 19 15:34:53 2024 +0800

    feat baichuan2 rmsnorm whose hidden size equals to 5120 (#5611)

commit e37ee2fb65fc77c275b816968d91776322fd7695
Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com>
Date:   Thu Apr 18 16:56:46 2024 +0800

    [Feat]Tensor Model Parallel Support For Inference (#5563)

    * tensor parallel support naive source

    * [fix]precision, model load and refactor the framework

    * add tp unit test

    * docstring

    * fix do_sample

commit be396ad6cc102fa610731291bf28e531a5641c7a
Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Date:   Thu Apr 18 16:45:07 2024 +0800

    [Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block (#5531)

    * feat flash decoding for paged attention

    * refactor flashdecodingattention

    * [pre-commit.ci] auto fixes from pre-commit.com hooks

    for more information, see https://pre-commit.ci

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

commit 56b222eff8c996a4677a158d4b5d4834a1bc0cfc
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Mon Apr 15 16:53:02 2024 +0800

    [inference/model]Adapted to the baichuan2-7B model (#5591)

    * Adapted to the baichuan2-7B model

    * modified according to the review comments.

    * Modified the method of obtaining random weights.

    * modified according to the review comments.

    * change mlp layewr 'NOTE'

commit d4cb023b62ea8e092783be437cb16d74a1afc6a7
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Mon Apr 15 10:57:51 2024 +0800

    [Inference/Refactor] Delete Duplicated code and refactor vec_copy utils and reduce utils (#5593)

    * delete duplicated code and refactor vec_copy utils and reduce utils

    * delete unused header file

commit a21912339a2c41627b43fd00e6adba38308a2ea0
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Thu Apr 11 15:41:36 2024 +0800

    refactor csrc (#5582)

commit 25928d84961b60264a6dabbddeae32af04a43fa2
Merge: d56c9633 f8598e3e
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Wed Apr 10 18:39:27 2024 +0800

    [Inference/Spec-Dec] Merge pull request #5565 from hpcaitech/feat/speculative-decoding

    Add Speculative Decoding and GLIDE Spec-Dec

commit f8598e3ec56bbe6bc6dd9fd84a1e0543adbd3073
Author: Yuanheng <jonathan.zhaoyh@gmail.com>
Date:   Wed Apr 10 11:14:04 2024 +0800

    [Fix] Llama Modeling Control with Spec-Dec (#5580)

    - fix ref before asgmt
    - fall back to use triton kernels when using spec-dec

commit e60d430cf53c9009af4682908d01742147654429
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Sun Apr 7 14:53:30 2024 +0800

    [Fix] resolve conflicts of rebasing feat/speculative-decoding (#5557)

    - resolve conflicts of rebasing feat/speculative-decoding

commit e1acb58423c53ece50b72db3bf9b91475d5d3d64
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Wed Apr 3 18:06:23 2024 +0800

    [doc] Add inference/speculative-decoding README (#5552)

    * add README for spec-dec

    * update roadmap

commit d85d91435ae25d875bfeb012b1e66cbfce6f6525
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Mon Apr 1 21:54:24 2024 +0800

    [Inference/SpecDec] Support GLIDE Drafter Model (#5455)

    * add glide-llama policy and modeling

    * update glide modeling, compitable with transformers 4.36.2

    * revise glide llama modeling/usage

    * fix issues of glimpsing large kv

    * revise the way re-loading params for glide drafter

    * fix drafter and engine tests

    * enable convert to glide strict=False

    * revise glide llama modeling

    * revise vicuna prompt template

    * revise drafter and tests

    * apply usage of glide model in engine

commit 912e24b2aaf4acda0e2b9a45a7d4327fbfc8bd39
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue Mar 12 17:57:01 2024 +0800

    [SpecDec] Fix inputs for speculation and revise past KV trimming (#5449)

    * fix drafter pastkv and usage of batch bucket

commit a37f82629d7b9e3c3a0f430b8dd3ff6f38ddf1d4
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Mon Mar 11 09:51:42 2024 +0800

    [Inference/SpecDec] Add Speculative Decoding Implementation (#5423)

    * fix flash decoding mask during verification

    * add spec-dec

    * add test for spec-dec

    * revise drafter init

    * remove drafter sampling

    * retire past kv in drafter

    * (trivial) rename attrs

    * (trivial) rename arg

    * revise how we enable/disable spec-dec

commit 5a9b05f7b297bc9ce3479990aeee94891c7f5edf
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Wed Feb 28 13:48:17 2024 +0800

    [Inference/SpecDec] Add Basic Drafter Model Container (#5405)

    * [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)

    fix dependency in pytest

    * add drafter model container (basic ver)

commit d63c469f45bc20115aaf5ba01e62dc67ab47953f
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Wed Feb 28 13:47:00 2024 +0800

    [Infer] Revise and Adapt Triton Kernels for Spec-Dec (#5401)

    * [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)

    fix dependency in pytest

    * resolve conflicts for revising flash-attn

    * adapt kv cache copy kernel for spec-dec

    * fix seqlen-n kvcache copy kernel/tests

    * test kvcache copy - use torch.equal

    * add assertions

    * (trivial) comment out

commit d56c96334e8a0626696609c3803ba5c73798f073
Merge: 7ebdf48a 7ca1d1c5
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue Apr 9 10:09:34 2024 +0800

    Sync main to feature/colossal-infer

    [Sync] Merge feature/colossal-infer with main

commit 7ca1d1c5453de3e726bca6334c360045050f94c4
Author: Yuanheng <jonathan.zhaoyh@gmail.com>
Date:   Mon Apr 8 17:00:55 2024 +0800

    remove outdated triton test

commit d78817539ea03b7b4bc79e0ef50db33d3e347f24
Author: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Date:   Mon Apr 8 08:41:07 2024 +0000

    [pre-commit.ci] auto fixes from pre-commit.com hooks

    for more information, see https://pre-commit.ci

commit ce9401ad52b870012846abcde120f1e87d5da7fe
Author: Yuanheng <jonathan.zhaoyh@gmail.com>
Date:   Mon Apr 8 16:25:12 2024 +0800

    remove unused triton kernels

commit ed5ebd1735db4541709eebdd37839ad161f542e8
Merge: 7ebdf48a 641b1ee7
Author: Yuanheng <jonathan.zhaoyh@gmail.com>
Date:   Mon Apr 8 16:21:47 2024 +0800

    [Fix] resolve conflicts of merging main

commit 7ebdf48ac50ca7bab827ef611551c6c48113b684
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Mon Apr 8 11:38:05 2024 +0800

    add cast and op_functor for cuda built-in types (#5546)

commit 4bb5d8923a6e85a0f89a483f15933698635a9f9c
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue Apr 2 14:16:59 2024 +0800

    [Fix/Inference] Remove unused and non-functional functions (#5543)

    * [fix] remove unused func

    * rm non-functional partial

commit a2878e39f42f509f237f3d3fd0741f53e3feff0e
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Mon Apr 1 15:34:25 2024 +0800

    [Inference] Add Reduce Utils (#5537)

    * add reduce utils

    * add using to delete namespace prefix

commit 04aca9e55bd91ea4dd8d1231aa66df7848b08f03
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Mon Apr 1 13:47:14 2024 +0800

    [Inference/Kernel]Add get_cos_and_sin Kernel (#5528)

    * Add get_cos_and_sin kernel

    * fix code comments

    * fix code typos

    * merge common codes of get_cos_and_sin kernel.

    * Fixed a typo

    * Changed 'assert allclose' to 'assert equal'.
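
For readers skimming this log: the get_cos_and_sin kernel above feeds rotary position embedding (RoPE). A minimal plain-PyTorch sketch of how precomputed cos/sin tables are applied (illustrative only, not the fused ColossalAI kernel; shapes assume cos/sin broadcast over heads):

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # split the head dimension in two and rotate: (x1, x2) -> (-x2, x1)
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope_ref(q, k, cos, sin):
    # cos/sin are looked up per token position; the fused kernels in this
    # branch also copy k into the KV cache in the same pass
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot
```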

commit 934e31afb22d2a281464aebde074eb2f238fb812
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Thu Mar 28 10:42:51 2024 +0800

    The writing style of tail processing and the logic related to macro definitions have been optimized. (#5519)

commit e6496dd37144202c8602dfdd66bb83f297eb5805
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Tue Mar 26 16:37:14 2024 +0800

    [Inference] Optimize request handler of llama (#5512)

    * optimize request_handler

    * fix ways of writing

commit 6251d68dc9f92c333a8f07ddf94e80ff7462726e
Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com>
Date:   Mon Mar 25 15:24:17 2024 +0800

    [fix] PR #5354 (#5501)

    * [fix]

    * [fix]

    * Update config.py docstring

    * [fix] docstring align

    * [fix] docstring align

    * [fix] docstring align

commit 1d626233ce8dbf35405cb7d92a5638ee1d830e8f
Merge: 87079cff 68e9396b
Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com>
Date:   Mon Mar 25 14:55:59 2024 +0800

    Merge pull request #5434 from LRY89757/colossal-infer-cuda-graph

    [feat] cuda graph support and refactor non-functional api

commit 68e9396bc084f03fe9315e9fed93292c0efc7a48
Merge: ff4998c6 87079cff
Author: Runyu Lu <runyulu@umich.edu>
Date:   Mon Mar 25 14:48:28 2024 +0800

    [fix] merge conflicts

commit 87079cffe8e006d4949aa7ca7cb60e6b813ff701
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Mon Mar 25 13:40:34 2024 +0800

    [Inference]Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding (#5461)

    * Support FP16/BF16 Flash Attention 2

    * fix bugs in test_kv_cache_memcpy.py

    * add context_kv_cache_memcpy_kernel.cu

    * rm typename MT

    * add tail process

    * add high_precision

    * add high_precision to config.py

    * rm unused code

    * change the comment for the high_precision parameter

    * update test_rotary_embdding_unpad.py

    * fix vector_copy_utils.h

    * add comment for self.high_precision when using float32

commit ff4998c6f39cbfd6d3d11f038c55cca3c9d3abd0
Author: Runyu Lu <runyulu@umich.edu>
Date:   Mon Mar 25 12:00:57 2024 +0800

    [fix] remove unused comment

commit 9fe61b44753083c89a50540daa1e9a3daedeb335
Author: Runyu Lu <runyulu@umich.edu>
Date:   Mon Mar 25 11:37:58 2024 +0800

    [fix]

commit 5b017d6324c9881e02a5440e0b1a3156612a8044
Author: Runyu Lu <runyulu@umich.edu>
Date:   Thu Mar 21 15:55:25 2024 +0800

    [fix]

commit 606603bb8805c39f6ee01029337ddc614c8d46ef
Merge: 4eafe0c8 7ff42cc0
Author: Runyu Lu <runyulu@umich.edu>
Date:   Thu Mar 21 14:25:22 2024 +0800

    Merge branch 'feature/colossal-infer' of https://github.com/hpcaitech/ColossalAI into colossal-infer-cuda-graph

commit 4eafe0c8141c120229be3ddce9c5591c1535348a
Author: Runyu Lu <runyulu@umich.edu>
Date:   Thu Mar 21 11:28:42 2024 +0800

    [fix] unused option

commit 7ff42cc06d007ae78fe091da65cb89c4bb62bc38
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Tue Mar 19 18:36:40 2024 +0800

    add vec_type_trait implementation (#5473)

commit b96557b5e15dbb521bf0f77b6b1f24dcbd9464d6
Merge: b6e97858 48c4f29b
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Tue Mar 19 13:53:26 2024 +0800

    Merge pull request #5469 from Courtesy-Xs/add_vec_traits

    Refactor vector utils

commit aabc9fb6aada9e7feb2ff8cf1f34e6ac37ade2e7
Author: Runyu Lu <runyulu@umich.edu>
Date:   Tue Mar 19 13:24:25 2024 +0800

    [feat] add use_cuda_kernel option

commit 48c4f29b275e2d8105842913cd84f5d66c378b36
Author: xs_courtesy <xs1580802568@gmail.com>
Date:   Tue Mar 19 11:32:01 2024 +0800

    refactor vector utils

commit b6e97858856ee8637216c51f14ac544b1bc0f872
Merge: f366a5ea 5724b9e3
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Fri Mar 15 11:23:44 2024 +0800

    Merge pull request #5457 from Courtesy-Xs/ly_add_implementation_for_launch_config

    add implementation for GetGPULaunchConfig1D

commit 5724b9e31e13e07d8ade0444c3e2f3e6894d13b1
Author: xs_courtesy <xs1580802568@gmail.com>
Date:   Fri Mar 15 11:18:57 2024 +0800

    add some comments

commit 6e30248683c0e4ccc63d15f39f8149875cba1263
Author: Runyu Lu <runyulu@umich.edu>
Date:   Thu Mar 14 16:13:00 2024 +0800

    [fix] tmp for test

commit 388e0439301834a1ad0d11da26b23f4cdc6c82d7
Author: xs_courtesy <xs1580802568@gmail.com>
Date:   Thu Mar 14 11:13:40 2024 +0800

    add implementation for GetGPULaunchConfig1D

commit d02e257abd778812d64491dde893c0d691ed4328
Merge: ae24b4f0 f366a5ea
Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com>
Date:   Thu Mar 14 10:37:05 2024 +0800

    Merge branch 'feature/colossal-infer' into colossal-infer-cuda-graph

commit ae24b4f025285949253a21c41bee4b80679a0bfe
Author: Runyu Lu <runyulu@umich.edu>
Date:   Thu Mar 14 10:35:08 2024 +0800

    diverse tests

commit 1821a6dab0ad6ad24ae25216e56268c4b0c0d365
Author: Runyu Lu <runyulu@umich.edu>
Date:   Wed Mar 13 17:28:32 2024 +0800

    [fix] pytest and fix dyn grid bug

commit f366a5ea1f2626a7870acaf8866f21d5fb49c388
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Wed Mar 13 17:20:03 2024 +0800

    [Inference/kernel]Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel (#5418)

    * add rotary embedding kernel

    * add rotary_embedding_kernel

    * add fused rotary_emb and kvcache memcopy

    * add fused_rotary_emb_and_cache_kernel.cu

    * add fused_rotary_emb_and_memcopy

    * fix bugs in fused_rotary_emb_and_cache_kernel.cu

    * fix ci bugs

    * use vec memcopy and opt the global memory access

    * fix code style

    * fix test_rotary_embdding_unpad.py

    * codes revised based on the review comments

    * fix bugs about include path

    * rm inline

commit ed431de4e4f73584e6b9c11ab041ef54a8e83de6
Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Date:   Wed Mar 13 16:00:55 2024 +0800

    fix rmsnorm template function invocation problem (template function partial specialization is not allowed in C++) and pass e2e precision test (#5454)

commit 6fd355a5a6bb46bfee41d2bc75578e8fba001144
Merge: b699f540 c1c45e9d
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Wed Mar 13 11:26:41 2024 +0800

    Merge pull request #5452 from Courtesy-Xs/fix_include_path

    fix include path

commit c1c45e9d8ecb6743e88e63dd151c617c0014e7c1
Author: xs_courtesy <xs1580802568@gmail.com>
Date:   Wed Mar 13 11:21:06 2024 +0800

    fix include path

commit b699f54007c52b2f4ec56326a495b06858cf8856
Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Date:   Tue Mar 12 17:48:02 2024 +0800

    optimize rmsnorm: add vectorized elementwise op, feat loop unrolling (#5441)
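
As background for the rmsnorm kernel work above, a minimal PyTorch reference of the RMSNorm computation these CUDA/Triton kernels implement (a sketch for orientation, not the optimized kernel):

```python
import torch

def rms_norm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # mean of squares over the hidden dimension; unlike LayerNorm, no mean subtraction
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight
```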

commit 368a2aa5433d127adaa3674c6d00bb9dc3e0729c
Merge: 21e1e364 095c070a
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Tue Mar 12 14:14:37 2024 +0800

    Merge pull request #5445 from Courtesy-Xs/refactor_infer_compilation

    Refactor colossal-infer code arch

commit 095c070a6eefe1a76fe3483b21986826114d6d17
Author: xs_courtesy <xs1580802568@gmail.com>
Date:   Mon Mar 11 17:06:57 2024 +0800

    refactor code

commit 21e1e3645c8f2e0d4e556f3e13d0d2aa5053911b
Merge: f7aecc0c 5eb5ff14
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Mon Mar 11 11:15:29 2024 +0800

    Merge pull request #5435 from Courtesy-Xs/add_gpu_launch_config

    Add query and other components

commit 633e95b301336c4c237537f584882b3d8e5f4145
Author: Runyu Lu <runyulu@umich.edu>
Date:   Mon Mar 11 10:56:51 2024 +0800

    [doc] add doc

commit 9dec66fad6c2f85166903aa80d0c077e37512fce
Author: Runyu Lu <runyulu@umich.edu>
Date:   Mon Mar 11 10:51:16 2024 +0800

    [fix] multi graphs capture error

commit b2c0d9ff2b4e4015660f2967837688cf7293b21e
Author: Runyu Lu <runyulu@umich.edu>
Date:   Mon Mar 11 10:49:31 2024 +0800

    [fix] multi graphs capture error

commit f7aecc0c6bac001d10c1dd00274e0152e4c86df6
Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Date:   Fri Mar 8 16:21:12 2024 +0800

    feat rmsnorm cuda kernel and add unittest, benchmark script (#5417)

commit 5eb5ff1464311ac16c29307d03a3c076aced7e03
Author: xs_courtesy <xs1580802568@gmail.com>
Date:   Fri Mar 8 15:41:14 2024 +0800

    refactor code

commit 01d289d8e51384131d536b1c223c473aeea463e9
Merge: a46598ac 2b28b54a
Author: xs_courtesy <xs1580802568@gmail.com>
Date:   Fri Mar 8 15:04:55 2024 +0800

    Merge branch 'feature/colossal-infer' of https://github.com/hpcaitech/ColossalAI into add_gpu_launch_config

commit a46598ac5984c7dc5804d0cf8621698f1a6a8720
Author: xs_courtesy <xs1580802568@gmail.com>
Date:   Fri Mar 8 14:53:29 2024 +0800

    add reusable utils for cuda

commit 2b28b54ac6d19d33079d9117b9717fd2779f2b08
Merge: 593a72e4 95c21498
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Fri Mar 8 14:44:37 2024 +0800

    Merge pull request #5433 from Courtesy-Xs/add_silu_and_mul

    【Inference】Add silu_and_mul for infer

commit cefaeb5fdd551c8b95837a475cb810f4991cf674
Author: Runyu Lu <runyulu@umich.edu>
Date:   Fri Mar 8 14:19:35 2024 +0800

    [feat] cuda graph support and refactor non-functional api
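
The cuda graph support mentioned above builds on the standard PyTorch capture/replay pattern. A self-contained sketch under the assumption that a CUDA device is available (illustrative, not the engine's actual graph runner):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
static_in = torch.randn(8, 1024, device="cuda")

# warm up on a side stream before capture, as required for CUDA graph capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(static_in)  # capture one forward pass into the graph

# later: refill the static input buffer and replay the captured kernels
static_in.copy_(torch.randn(8, 1024, device="cuda"))
g.replay()
```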

commit 95c21498d4f6e640e218f4b00349020f4ae7c69a
Author: xs_courtesy <xs1580802568@gmail.com>
Date:   Thu Mar 7 16:57:49 2024 +0800

    add silu_and_mul for infer
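
silu_and_mul fuses the activation of LLaMA-style gate/up MLP projections. A one-line PyTorch equivalent of the math (a hedged sketch, not the CUDA kernel added here):

```python
import torch
import torch.nn.functional as F

def silu_and_mul_ref(gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
    # the fused kernel computes SiLU(gate) * up in one pass, avoiding an
    # extra elementwise launch and an intermediate tensor
    return F.silu(gate) * up
```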

commit 593a72e4d58b8c3feebde2d19c78d44f702f7b06
Merge: 0aa27f19 0310b76e
Author: Frank Lee <somerlee.9@gmail.com>
Date:   Mon Mar 4 10:13:59 2024 +0800

    Merge pull request #5424 from FrankLeeeee/sync/main

    Sync/main

commit 0310b76e9d485703d5afc128b8d97d01b00f3317
Merge: 0aa27f19 4b8312c0
Author: FrankLeeeee <somerlee.9@gmail.com>
Date:   Mon Mar 4 10:09:36 2024 +0800

    Merge branch 'main' into sync/main

commit 0aa27f196109bfb4ce6171d7ce921052b9eee969
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Wed Feb 28 16:46:03 2024 +0800

    [Inference]Move benchmark-related code to the example directory. (#5408)

    * move benchmark-related code to the example directory.

    * fix bugs in test_fused_rotary_embedding.py

commit 600881a8ea9b17c436ded922a9d4e3d5969acd87
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Wed Feb 28 14:36:50 2024 +0800

    [Inference]Add CUDA KVCache Kernel (#5406)

    * add cuda KVCache kernel

    * annotation benchmark_kvcache_copy

    * add use cuda

    * fix import path

    * move benchmark scripts to example/

    * rm benchmark codes in test_kv_cache_memcpy.py

    * rm redundancy codes

    * rm redundancy codes

    * pr was modified according to the review

commit 19061188c396d851ef17bc34b526e2f2b4fc1479
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Mon Feb 26 16:17:47 2024 +0800

    [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)

    fix dependency in pytest

commit bc1da87366d81e144f1f133801d5f20520433c52
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Fri Feb 23 10:51:35 2024 +0800

    [Fix/Inference] Fix format of input prompts and input model  in inference engine (#5395)

    * Fix bugs in inference_engine

    * fix bugs in engine.py

    * rm  CUDA_VISIBLE_DEVICES

    * add request_ids in generate

    * fix bug in engine.py

    * add logger.debug for BatchBucket

commit 2a718c8be89918ec70b88f1f059148a7294dbccb
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Wed Feb 21 13:23:57 2024 +0800

    Optimized the execution interval time between cuda kernels caused by view and memcopy (#5390)

    * opt_view_and_memcopy

    * fix bugs in ci

    * fix ci bugs

    * update benchmark scripts

    * fix ci bugs

commit 730103819dc0636c85af1af80cc17914dcf196c1
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Wed Feb 21 11:31:48 2024 +0800

    [Inference]Fused kv copy into rotary calculation (#5383)

    * revise rotary embedding

    * remove useless print

    * adapt

    * fix

    * add

    * fix

    * modeling

    * fix

    * fix

    * fix

    * fused kv copy

    * fused copy

    * colossalai/kernel/triton/no_pad_rotary_embedding.py

    * del padding llama

    * del

commit b21aac5baeddf7ea19615fae454e6f78f7469cd2
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Mon Feb 19 17:18:20 2024 +0800

    [Inference] Optimize and Refactor Inference Batching/Scheduling (#5367)

    * add kvcache manager funcs for batching

    * add batch bucket for batching

    * revise RunningList struct in handler

    * add kvcache/batch funcs for compatibility

    * use new batching methods

    * fix indexing bugs

    * revise abort logic

    * use cpu seq lengths/block tables

    * rm unused attr in Sequence

    * fix type conversion/default arg

    * add and revise pytests

    * revise pytests, rm unused tests

    * rm unused statements

    * fix pop finished indexing issue

    * fix: use index in batch when retrieving inputs/update seqs

    * use dict instead of odict in batch struct

    * arg type hinting

    * fix make compress

    * refine comments

    * fix: pop_n_seqs to pop the first n seqs

    * add check in request handler

    * remove redundant conversion

    * fix test for request handler

    * fix pop method in batch bucket

    * fix prefill adding

commit 8c69debdc7128e1b8839f12aa3f19ad327569017
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Thu Feb 8 15:27:26 2024 +0800

     [Inference]Support vllm testing in benchmark scripts (#5379)

    * add vllm benchmark scripts

    * fix code style

    * update run_benchmark.sh

    * fix code style

commit 9afa52061f89dde87a73e36f740f62781d658a01
Author: Frank Lee <somerlee.9@gmail.com>
Date:   Thu Feb 8 14:04:14 2024 +0800

    [inference] refactored config (#5376)

commit 1f8c7e70469191610d9536029f624b4f30db8caf
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Wed Feb 7 17:55:48 2024 +0800

    [Inference] User Experience: update the logic of default tokenizer and generation config.  (#5337)

    * add

    * fix

    * fix

    * pause

    * fix

    * fix pytest

    * align

    * fix

    * license

    * fix

    * fix

    * fix readme

    * fix some bugs

    * remove tokenizer config

commit 6fb4bcbb2420b9f977ab74de60c6d311b6c9ed9a
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Wed Feb 7 17:15:42 2024 +0800

    [Inference/opt] Fused KVCahce Memcopy (#5374)

    * fused kv memcopy

    * add TODO in test_kvcache_copy.py

commit 58740b5f6872bc5a26dbf7c3112b86a1b66c083a
Author: Frank Lee <somerlee.9@gmail.com>
Date:   Wed Feb 7 17:11:43 2024 +0800

    [inference] added inference template (#5375)

commit 8106ede07fae7e239203feb815162efdf46975ec
Author: Frank Lee <somerlee.9@gmail.com>
Date:   Wed Feb 7 14:27:04 2024 +0800

    Revert "[Inference] Adapt to Fused rotary (#5348)" (#5373)

    This reverts commit 9f4ab2eb924b938348df2c713bb4580972f18eb1.

commit 9f4ab2eb924b938348df2c713bb4580972f18eb1
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Wed Feb 7 11:36:04 2024 +0800

    [Inference] Adapt to Fused rotary (#5348)

    * revise rotary embedding

    * remove useless print

    * adapt

    * fix

    * add

    * fix

    * modeling

    * fix

    * fix

    * fix

commit 35382a7fbf96c731ba1ed76cf5529ea3220a5b66
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Feb 6 19:38:25 2024 +0800

    [Inference]Fused the gate and up proj in mlp,and optimized the autograd process. (#5365)

    * fused the gate and up proj in mlp

    * fix code styles

    * opt auto_grad

    * rollback test_inference_engine.py

    * modifications based on the review feedback.

    * fix bugs in flash attn

    * Change reshape to view

    * fix test_rmsnorm_triton.py

commit 1dedb57747270f32be5d0e67abc1ad2fff658f8f
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue Feb 6 17:27:45 2024 +0800

    [Fix/Infer] Remove unused deps and revise requirements (#5341)

    * remove flash-attn dep

    * rm padding llama

    * revise infer requirements

    * move requirements out of module

commit 631862f3390f874db118a25c0137f86630e9b167
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Fri Feb 2 15:38:21 2024 +0800

    [Inference]Optimize generation process of inference engine (#5356)

    * opt inference engine

    * fix run_benchmark.sh

    * fix generate in engine.py

    * rollback test_inference_engine.py

commit 21ad4a27f91659220bec6c4d4f2d0f62f7093a45
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Fri Feb 2 15:06:01 2024 +0800

    [Inference/opt]Optimize the mid tensor of RMS Norm (#5350)

    * opt rms_norm

    * fix bugs in rms_layernorm

commit 027aa1043f1c7b3668d5ca9b91d35c846736e9c4
Author: Frank Lee <somerlee.9@gmail.com>
Date:   Fri Feb 2 14:31:10 2024 +0800

    [doc] updated inference readme (#5343)

commit e76acbb076582e0aade1ee8a5fa7696d95c1bef5
Author: Frank Lee <somerlee.9@gmail.com>
Date:   Fri Feb 2 13:51:22 2024 +0800

    [inference] moved ops tests to test_infer (#5354)

commit db1a763307a54ca262751ebebd5f1c503d9bca74
Author: Frank Lee <somerlee.9@gmail.com>
Date:   Fri Feb 2 11:44:15 2024 +0800

    [inference] removed redundancy init_batch (#5353)

commit 249644c23b0402ccf9d0908f13ed15b41b95145f
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Thu Feb 1 15:49:39 2024 +0800

    [Inference]Replace Attention layer and MLP layer by shardformer to optimize the weight transpose operation, add fused_qkv and fused linear_add (#5340)

    * add fused qkv

    * replace attn and mlp by shardformer

    * fix bugs in mlp

    * add docstrings

    * fix test_inference_engine.py

    * add optimize unbind

    * add fused_addmm

    * rm squeeze(1)

    * refactor codes

    * fix ci bugs

    * rename ShardFormerLlamaMLP and ShardFormerLlamaAttention

    * Removed the dependency on LlamaFlashAttention2

    * rollback test_inference_engine.py

commit f8e456d20295af52665ca06a21f9fd8b468204d7
Author: Frank Lee <somerlee.9@gmail.com>
Date:   Thu Feb 1 15:31:01 2024 +0800

    [inference] simplified config verification (#5346)

    * [inference] simplified config verification

    * polish

    * polish

commit df0aa49585d2dd19d7397dfbd3b5f136abac609b
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Wed Jan 31 16:31:29 2024 +0800

    [Inference] Kernel Fusion, fused copy kv cache into rotary embedding (#5336)

    * revise rotary embedding

    * remove useless print

    * adapt

commit 1336838a9149fb210a956b0ad338197c4ae77821
Merge: 5f98a9d6 c5655199
Author: Frank Lee <somerlee.9@gmail.com>
Date:   Wed Jan 31 16:29:26 2024 +0800

    Merge pull request #5339 from FrankLeeeee/sync/merge-main

    Sync/merge main

commit c56551991379a457fc34df699710ab94132779fc
Merge: 5f98a9d6 71321a07
Author: FrankLeeeee <somerlee.9@gmail.com>
Date:   Wed Jan 31 10:41:47 2024 +0800

    merge commit

commit 5f98a9d68a0a35031e1c740c19e33b32f4fa8d9c
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue Jan 30 16:06:09 2024 +0800

    [Infer] Optimize Blocked KVCache And Kernels Using It (#5325)

    * revise shape of kvcache (context attn kernel)

    * revise shape of kvcache (flash decoding kernel)

    * revise shape of kvcache (kvcache copy) and attn func

    * init of kvcache in kvcache manager

    * revise llama modeling

    * revise block size retrieval

    * use torch for rms_norm benchmarking

    * revise block size retrieval
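
To illustrate what a blocked KV cache buys: a logical token position is mapped to a physical block and an in-block slot through a block table. The toy sketch below uses hypothetical shapes and a hypothetical write_kv helper, not ColossalAI's actual KVCacheManager layout:

```python
import torch

block_size = 16                                   # tokens per cache block (assumed)
num_blocks, num_heads, head_dim = 64, 8, 128
k_cache = torch.zeros(num_blocks, num_heads, block_size, head_dim)

def write_kv(block_table: torch.Tensor, seq_len: int, k_new: torch.Tensor) -> None:
    """Write the key vector of the newest token (position seq_len - 1)."""
    pos = seq_len - 1
    block_id = block_table[pos // block_size]     # which physical block holds this position
    offset = pos % block_size                     # slot inside that block
    k_cache[block_id, :, offset, :] = k_new       # k_new: (num_heads, head_dim)
```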

commit e8f0642f2841f6aeb6ed0e6695ff9d9ef14f198b
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Jan 30 10:31:46 2024 +0800

    [Inference]Add Nopadding Llama Modeling (#5327)

    * add nopadding llama modeling

    * add nopadding_llama.py

    * rm unused codes

    * fix bugs in test_xine_copy.py

    * fix code style

commit c7c104cb7ccc353faa10667853ed210e042f1be8
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Mon Jan 29 16:21:06 2024 +0800

    [DOC] Update inference readme  (#5280)

    * add readme

    * add readme

    * 1

    * update engine

    * finish readme

    * add readme

commit 1f8a75d470d548bfd4db877e73102b8fad5cdfa9
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Mon Jan 29 10:22:33 2024 +0800

    [Inference] Update rms norm kernel, benchmark with vLLM (#5315)

    * add

    * xi

    * del

    * del

    * fix

commit 7ddd8b37f0f1160e28a2919a2e37f8e8ad199773
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Fri Jan 26 15:02:12 2024 +0800

    fix (#5311)

commit 4f28cb43c0c2afbc970b9f0f300e7aa28e39bd2e
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Fri Jan 26 14:00:10 2024 +0800

    [inference]Optimize the usage of the mid tensors space in flash attn (#5304)

    * opt flash attn

    * opt tmp tensor

    * fix benchmark_llama

    * fix code style

    * fix None logic for output tensor

    * fix adapted to get_xine_cache

    * add comment

    * fix ci bugs

    * fix some codes

    * rm duplicated codes

    * rm duplicated codes

    * fix code style

    * add _get_dtype in config.py

commit af8359c430ce3fabb22748870b67b0c6c33f610c
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Thu Jan 25 10:23:12 2024 +0800

    [hotfix] fix boundary check in batch (#5306)

commit c647e00e3c092d3d6219f7686f260f2932a0c27d
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Wed Jan 24 16:20:42 2024 +0800

    [Inference]Add fused rotary kernel and get cos cache kernel (#5302)

    * add fused rotary and get cos cache func

    * staged

    * fix bugs

    * fix bugs

commit 3da9993b0d03923755c1fcd6279cc4c7b8d00d1e
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue Jan 23 17:16:02 2024 +0800

    [Kernel/Fix] Revise flash attention triton kernel API and add benchmark (#5301)

    * fix decoding kernel pytest

    * revise and add triton context attn benchmark

commit 8e606ecc7e89ffed80537e89a27bb1eb6759f4bc
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Tue Jan 23 12:11:53 2024 +0800

    [Inference] Benchmarking rotary embedding and add a fetch function (#5277)

    * fix bugs and add a cos/sin cache fetch func

    * add docstring

    * fix bug

    * fix

commit b7853196a0a46558d7c0cac7deac9a36c7a5ba38
Merge: bfff9254 cea9c86e
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Mon Jan 22 17:07:14 2024 +0800

    Merge pull request #5297 from yuehuayingxueluo/fix_rotary_embedding

    [Inference/fix]Add utils.py for Rotary Embedding

commit cea9c86e453e36b4848064312c9a4f0d2de6ea98
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Mon Jan 22 16:06:27 2024 +0800

    add utils.py

commit bfff9254ac8ca866673746ec47cfd2f87aab2b66
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Mon Jan 22 10:55:34 2024 +0800

     [inference] Adapted to Rotary Embedding and RMS Norm (#5283)

    * adapted to rotary_embedding

    * adapted to nopad rms norm

    * fix bugs in benchmark

    * fix flash_decoding.py

commit 6e487e7d3cf5295ca908fa69c8e03af8980391bf
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Fri Jan 19 15:47:16 2024 +0800

    [kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking (#5274)

    * prevent re-creating intermediate tensors

    * add singleton class holding intermediate values

    * fix triton kernel api

    * add benchmark in pytest

    * fix kernel api and add benchmark

    * revise flash decoding triton kernel in/out shapes

    * fix calling of triton kernel in modeling

    * fix pytest: extract to util functions

commit 9e2342bde2c0ffe1a8cdd2fe8917254ef0a06e7f
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Thu Jan 18 16:31:14 2024 +0800

    [Hotfix] Fix bugs in testing continuous batching (#5270)

    * fix bug

    * fix bugs

    * fix bugs

    * fix bugs and add padding

    * add funcs and fix bugs

    * fix typos

    * fix bugs

    * add func

commit 5ae9099f9203a4f8350f383b838e8f2ad15d6fdd
Author: Yaozheng Fang <62918515+nkfyz@users.noreply.github.com>
Date:   Thu Jan 18 10:21:03 2024 +0800

    [kernel] Add RMSLayerNorm triton kernel (#5262)

    * add layerrmsnorm triton kernel

    * add layerrmsnorm kernel

    * modify the atol and rtol in test file

    * Remove the logic of mean computations, and update the names of the kernel functions and files

    * add benchmark of rms norm

commit 86b63f720cf60deefe40874517b3d8e1dccb7af3
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Wed Jan 17 16:03:10 2024 +0800

    [Inference]Adapted to the triton attn kernels (#5264)

    * adapted to the triton attn kernels

    * fix pad input

    * adapted to copy_kv_to_blocked_cache

    * fix ci test

    * update kv memcpy

    * remove print

commit 0f2b46a41c2c308cc6fbeaf0e86d0e0b93435b77
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue Jan 16 14:41:02 2024 +0800

    [kernel] Revise KVCache copy triton kernel API (#5273)

    * [kernel/fix] revise kvcache copy kernel api

    * fix benchmark

commit d8db500efc0e67dea995c2124d20aadd07afb6f0
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Mon Jan 15 17:50:46 2024 +0800

    [Inference] Fix request handler and add recycle logic (#5260)

    * fix request handler

    * fix comment

commit c597678da475abd4ecc075c0b80996989f1bcdc0
Author: Frank Lee <somerlee.9@gmail.com>
Date:   Mon Jan 15 17:37:41 2024 +0800

    [doc] updated inference readme (#5269)

commit fa85e02b3b1b316009c4557482f998b903730ec3
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Mon Jan 15 17:37:20 2024 +0800

    [kernel] Add KV cache copy kernel during decoding  (#5261)

    * add kv copy triton kernel during decoding stage

    * add pytest and fix kernel

    * fix test utilities

    * revise kernel config

    * add benchmark for kvcache copy

commit 1ded7e81ef08d574798dd98d1f4d33da07b7f4c9
Author: FrankLeeeee <somerlee.9@gmail.com>
Date:   Thu Jan 11 13:50:45 2024 +0000

    [git] fixed rebased files

commit 1513f20f4d80f782fab381996368ff2c2f3c95c3
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Thu Jan 11 18:06:39 2024 +0800

    [kernel] Add flash decoding triton kernel for blocked kv cache (#5249)

    * add flash decoding unpad triton kernel

    * rename flash decoding kernel

    * add kernel testing (draft)

    * revise pytest

    * support kv group (GQA)

    * (trivial) fix api and pytest

    * (trivial) func renaming

    * (trivial) func/file renaming

    * refactor pytest for attention

    * (trivial) format and consistent vars of context/decode attn

    * (trivial) remove test redundancy
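
For orientation, the flash-decoding kernel above accelerates single-query attention over the cached keys/values during decoding. The unoptimized reference computation (a sketch that ignores the block layout and GQA grouping) is:

```python
import torch

def decode_attention_ref(q, k_cache, v_cache):
    # q: (num_heads, head_dim); k_cache, v_cache: (seq_len, num_heads, head_dim)
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("hd,shd->hs", q, k_cache) * scale
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("hs,shd->hd", probs, v_cache)
```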

commit fded91d049997ed87dee965fc42c35a239e3ec03
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Thu Jan 11 16:24:54 2024 +0800

    [Inference] Kernel: no pad rotary embedding (#5252)

    * fix bugs

    * comment

    * use more accurate atol

    * fix

commit d40eb26029e8c61fc2b8ef3a1b8126a229e48047
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Wed Jan 10 10:38:53 2024 +0800

    fix bugs in request_handler.py and engine.py

commit 10e3c9f923caf4fb68ab61e96c244bd5cca9b9da
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Jan 9 15:53:04 2024 +0800

    rm torch.cuda.synchronize

commit fab294c7f4a5db0a4e19109ac5656492ff3ca08b
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Jan 9 15:18:28 2024 +0800

    fix CI bugs

commit 2a73e828eba565017d19eaf70a304e1b1eddba1f
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Jan 9 14:29:45 2024 +0800

    fix bugs related to processing padding mask

commit e545a871b8a89093f5d01e3fea1fe873ef52d51a
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Mon Jan 8 15:56:00 2024 +0800

    [Hotfix] Fix accuracy and align attention method api with Triton kernel (#5229)

    * fix accuracy

    * alignment in attention

    * fix attention

    * fix

    * fix bugs

    * fix bugs

    * fix bugs

commit fa4fbdbffb6996e8aa1f65bddce5844f2bbbfdf1
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Jan 9 13:52:53 2024 +0800

    adapted to pad_context_forward

commit 47e53eaa1ca08fd55b657b53b75d13cc72f9cd05
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Mon Jan 8 12:35:06 2024 +0800

    fix bugs in attention.py and request_handler.py

commit bfd9b1b494b4414835b22cbba52005921127e4f6
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Thu Jan 4 16:39:00 2024 +0800

    [Inference] Pytorch Attention func, pad&nopad input support (#5219)

    * add attn

    * add attention test

    * fix attn forward

    * fix decoding

commit 3ad1f3b78b830c90079ed9f1e0b5cd26601194fa
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Thu Jan 4 16:48:53 2024 +0800

    fix beam_width

commit b2eb9cd18665317ec7900364ef21a38c3edb9e3f
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Thu Jan 4 15:09:06 2024 +0800

    Fixed a typo

commit bbfebfb9fc5250c1e4d3a6f008af652f7a0a9ca0
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Thu Jan 4 15:03:18 2024 +0800

    fix bugs in sampler

commit 02c1bf8b2abef137a653b86b733d66b6dfbcc022
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Wed Jan 3 18:50:26 2024 +0800

    add context_attention_unpadded

commit 07b5283b6a3899ebe84cbe8c7902d142ffbc4b9c
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Wed Jan 3 14:41:35 2024 +0800

    [kernel] Add triton kernel for context attention (FAv2) without padding (#5192)

    * add context attn unpadded triton kernel

    * test compatibility

    * kv cache copy (testing)

    * fix k/v cache copy

    * fix kv cache copy and test

    * fix boundary of block ptrs

    * add support for GQA/MQA and testing

    * fix import statement

    ---------

    Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local>

commit 4df8876fcad799ace567b2458df5feb3109ee917
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Jan 2 18:34:19 2024 +0800

    Fixed a writing error

commit 9489dc64d8e01b04c9033c3dcaee83e25afebe42
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Jan 2 18:30:11 2024 +0800

    precision alignment

commit 62968588d195126adc9b1bdb3adc02f199303ddf
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Jan 2 13:02:20 2024 +0800

    fix bugs in request_handler

commit 62fd08ee4425e031f8f1c43b25bf1ba5e7e33e8d
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Dec 26 21:34:27 2023 +0800

    Fixed a bug in the inference framework

commit 86853a37d5243b40d4b229d163494624b8027cd0
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Mon Dec 25 14:07:43 2023 +0800

    Add padding llama model

commit 0e616462a7f9e8faaa33d1700a2020ceb03ccd34
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Mon Dec 25 12:15:15 2023 +0800

    [Inference] add logit processor and request handler (#5166)

    * add logit processor and request handler

    * add

    * add

    * add

    * fix

    * add search tokens and update func

    * finish request handler

    * add running list test

    * fix test

    * fix some bug

    * add

    * add

    * fix bugs

    * fix some bugs

    * fix bug

    * fix

    * fix

    * add copy fun

    * del useless attn

    * fix request status

    ---------

    Co-authored-by: CjhHa1 <cjh18671720497outlook.com>

commit 8daee26989adad5ae5b152b24d3344db727986fe
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Mon Dec 18 10:40:47 2023 +0800

    [Inference] Add the logic of the inference engine (#5173)

    * add infer_struct and infer_config

    * update codes

    * change InferConfig

    * Add hf_model_config to the engine

    * rm _get_hf_model_config

    * update codes

    * made adjustments according to the feedback from the reviewer.

    * update codes

    * add ci test for config and struct

    * Add the logic of the inference engine

    * update engine and test

    * Recover cache_manager.py

    * add logger

    * fix conflict

    * update codes

    * update codes

    * update model and tokenizer

    * fix add the logic about shardformer

    * change kvcache_manager docstring

    * add policy

    * fix ci bug in test_kvcache_manager.py

    * remove codes related to tokenizer and move model_policy

    * fix  code style

    * add ordered_set to requirements-infer.txt

    * Delete extra empty lines

    * add ordered_set to requirements-test.txt

commit 93aeacca342ab03732362dbb9096ab1265f4a8b3
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Tue Dec 12 17:22:41 2023 +0800

    [Inference]Update inference config and fix test (#5178)

    * unify the config setting

    * fix test

    * fix import

    * fix test

    * fix

    * fix

    * add logger

    * revise log info

    ---------

    Co-authored-by: CjhHa1 <cjh18671720497outlook.com>

commit 3de2e622995321b042d4a8cffcd61686cda4a58e
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Mon Dec 11 10:56:18 2023 +0800

    [Inference] Add CacheBlock and KV-Cache Manager (#5156)

    * [Inference] Add KVCache Manager

    * function refactored

    * add test for KVCache Manager

    * add attr beam width

    * Revise alloc func in CacheManager

    * Fix docs and pytests

    * add tp slicing for head number

    * optimize shapes of tensors used as physical cache

    * Apply using InferenceConfig on KVCacheManager

    * rm duplicate config file

    * Optimize cache allocation: use contiguous cache

    * Fix config in pytest (and config)

commit fab9b931d9e24c6e8ada8025cf8cf12719c3d2af
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Thu Dec 7 14:34:01 2023 +0800

    [Inference]Add BatchInferState, Sequence and InferConfig (#5149)

    * add infer_struct and infer_config

    * update codes

    * change InferConfig

    * Add hf_model_config to the engine

    * rm _get_hf_model_config

    * update codes

    * made adjustments according to the feedback from the reviewer.

    * update codes

    * add ci test for config and struct

commit 2bb92243d4151873d75a9d6d9c2275b390e1716a
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue Dec 5 15:12:57 2023 +0800

    [Inference/NFC] Clean outdated inference tests and deprecated kernels (#5159)

    * [inference/nfc] remove outdated inference tests

    * remove outdated kernel tests

    * remove deprecated triton kernels

    * remove imports from deprecated kernels

commit 56e75eeb063279fbc0fc84e25f267f1ca208e784
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Fri Dec 1 17:31:31 2023 +0800

    [Inference] Add readme (roadmap) and fulfill request handler (#5147)

    * request handler

    * add readme

    ---------

    Co-authored-by: CjhHa1 <cjh18671720497outlook.com>

commit 4cf4682e70f70dea8e0510705d3383de0bf1a4a8
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Fri Dec 1 17:02:44 2023 +0800

    [Inference] First PR for rebuild colossal-infer (#5143)

    * add engine and scheduler

    * add dirs

    ---------

    Co-authored-by: CjhHa1 <cjh18671720497outlook.com>