Merge with mlc-ai/main (835223541d4135e511a50cba1deca06731b03abd, April 18th 2024) #260

Merged · 203 commits · Apr 22, 2024

Commits

43d38ee
[Attn] Making decode attn kernel be aware of webgpu target (#1817)
MasterJH5574 Feb 22, 2024
e30a457
[Serving][Refactor] Logit processor and logit bias support (#1828)
MasterJH5574 Feb 23, 2024
bcb9b6a
[Serving][Grammar] BNF grammar simplifier and matcher (#1801)
Ubospica Feb 24, 2024
ce42880
[Serving] LogProbs support (#1832)
MasterJH5574 Feb 24, 2024
1cbd67b
[Serving] Support Mixtral in MLC Serve (#1840)
MasterJH5574 Feb 26, 2024
607dc5a
[Fix] Fix `u_char` for Windows build (#1848)
MasterJH5574 Feb 27, 2024
c4d1b69
Auto updated submodule references
Feb 27, 2024
31e0571
[Fix] Add phi lm head name to is_final_fc, add q4f16_ft to CI (#1849)
CharlieFRuan Feb 28, 2024
89f3e41
[Build] Replace mod_transform_before_build with IRModule pass (#1852)
Lunderberg Feb 28, 2024
6ce1759
[SLM] Add support for InternLM architecture (#1835)
tlopex Feb 28, 2024
1497744
[Bugfix] Handle model names with multiple path components (#1851)
Lunderberg Feb 28, 2024
7456314
[KVCache] Add max num threads awareness to KVCache kernels (#1822)
CharlieFRuan Feb 28, 2024
52d002f
[KVCache] Migrate Baichuan model to PagedKVCache (#1854)
tlopex Feb 28, 2024
ac57c03
[Python] Lazy import of transformers for tiktoken conversion (#1860)
MasterJH5574 Feb 29, 2024
1f70d71
[SLM] RWKV5 World Support (#1787)
Hzfengsy Feb 29, 2024
eb465ec
[Serving] Register the ChatML conversation template (#1862)
tlopex Feb 29, 2024
5bbe204
[Utils][Transform] Added SetEntryFuncs transform (#1855)
Lunderberg Mar 1, 2024
eb66452
[Build] Update transform_params_for_each_rank to IRModule pass (#1856)
Lunderberg Mar 1, 2024
5f2a06e
[Serving][Grammar] Integrate JSON grammar into the generation pipelin…
Ubospica Mar 2, 2024
7806dee
[Serving] Support "n" for parallel generation (#1868)
MasterJH5574 Mar 2, 2024
63c338b
[CI] Add retry to scm checkout (#1869)
tqchen Mar 2, 2024
e8b5b0b
[Attn] Use float32 accumulation in attention kernel (#1870)
MasterJH5574 Mar 2, 2024
91008ae
[Utils] Allow ReorderTransformFunc to be used without param manager (…
Lunderberg Mar 3, 2024
731616e
[SLM] Migrate Phi-2 to paged KV Cache #1871 (#1872)
Kartik14 Mar 3, 2024
e4341b3
[Fix] Fix the use of "call_inplace_packed" and "call_pure_packed" (#1…
MasterJH5574 Mar 3, 2024
c0606ec
[Fix] Add the missing BundleModelParams pass (#1875)
MasterJH5574 Mar 3, 2024
07af0f9
[Docs] Update Android APK download link (#1876)
MasterJH5574 Mar 3, 2024
837869a
Fix MLC-LLM website link weight convert not accessible (#1877)
DiegoCao Mar 3, 2024
d2cfb1e
[Serving][Grammar] Support termination state in GrammarStateMatcher (…
Ubospica Mar 4, 2024
65ec85d
[Serving] Make RequestState as a standalone object class (#1878)
MasterJH5574 Mar 4, 2024
ffef890
[SLM] Update StableLM model and migrate it to paged KV Cache (#1882)
tlopex Mar 4, 2024
ef2db85
[KVCache] Qwen 1.0 Model PagedKV Support (#1887)
DiegoCao Mar 4, 2024
25877f9
[Serving] Estimate KV cache memory usage with metadata (#1888)
MasterJH5574 Mar 4, 2024
aeb55f1
[KVCache] Migrate bigcode arch to PagedKVCache (#1891)
davidpissarra Mar 5, 2024
e7b6cbc
[Serving] Add Phi-2 conv template to mlc serve (#1890)
Kartik14 Mar 5, 2024
8a8c529
[Attn] Fix attention kernel for head dim not divisble by 32 (#1889)
MasterJH5574 Mar 5, 2024
b345a9e
[Python] Enable "thrust" for CUDA by default (#1866)
MasterJH5574 Mar 5, 2024
2f26e05
[Serving] Fix loading presharded weights (#1894)
vinx13 Mar 6, 2024
a41f903
[Serving] Address embedding lookup OOM issue (#1899)
MasterJH5574 Mar 7, 2024
88ac813
[Model] Remove redundant `batch_forward` and move broadcast (#1900)
MasterJH5574 Mar 7, 2024
1eaef7c
[KVCache]Migrate Qwen2 model to PagedKVCache (#1903)
tlopex Mar 7, 2024
068d5ea
[CI] Skip not supported quantization in model compilation test (#1904)
MasterJH5574 Mar 7, 2024
655ae5c
[Serving] Add missing header for `std::iota` (#1905)
MasterJH5574 Mar 8, 2024
068091c
[Serving] Fix Model TokenEmbed function with TP (#1906)
MasterJH5574 Mar 8, 2024
73fa4a2
[SLM] Add support for Orion architecture. (#1883)
gesanqiu Mar 8, 2024
3f3e3fd
[Model] Eliminate the reshape in embedding func (#1908)
MasterJH5574 Mar 8, 2024
3f05a1f
[Pass] Low batch GEMM using GEMV-like schedule (#1769)
jinhongyii Mar 8, 2024
c2258ae
Auto updated submodule references
Mar 8, 2024
1b3cfd5
[Serving] Avoid unnecessary worker sync in Model (#1909)
MasterJH5574 Mar 9, 2024
448c5c4
[Serving][Grammar] Enhance GrammarStateMatcher to support general gra…
Ubospica Mar 9, 2024
b44cdc5
[Android] Improve perf of TIR PagedAttn kernel on Android (#1915)
spectrometerHBH Mar 10, 2024
20efccb
Deprecate old flow (#1928)
tqchen Mar 11, 2024
716b8e1
[Serving] Register the StableLM3B conversation template (#1920)
tlopex Mar 11, 2024
2e6f9cb
Remove deprecated build.py
tqchen Mar 11, 2024
9c80105
[Fix] KVCache creation with call_pure_packed (#1930)
MasterJH5574 Mar 11, 2024
d8fedd1
[KVCache] Update FlashInfer PackedFunc names (#1931)
MasterJH5574 Mar 11, 2024
4290a05
[REFACTOR] remove tests/legacy-python (#1933)
tqchen Mar 12, 2024
8beed7a
[REFACTOR] rename mlc_chat => mlc_llm (#1932)
tqchen Mar 12, 2024
c268f95
Auto updated submodule references
Mar 12, 2024
d6d972c
[Docs] Deprecating CUDA 11.7/11.8 support (#1939)
MasterJH5574 Mar 12, 2024
9df8f03
[Fix] Fix KV cache call in mistral (#1938)
MasterJH5574 Mar 12, 2024
4893415
[ChatModule] Remove eos_token_ids (#1940)
MasterJH5574 Mar 12, 2024
738e353
[SLM] Weight conversion with generator (#1916)
MasterJH5574 Mar 12, 2024
5b8c529
[Serve] Introducing GPU sampler for CUDA (#1934)
MasterJH5574 Mar 12, 2024
73b9965
[Serve] Constrain KV cache capacity on Metal (#1943)
MasterJH5574 Mar 13, 2024
8a29ee1
[CI] Add windows ci (#1942)
tqchen Mar 13, 2024
5c29f02
Auto updated submodule references
Mar 13, 2024
8d192ef
[Fix] Fix embedding shape check in ChatModule (#1953)
MasterJH5574 Mar 13, 2024
c0b2ccd
[Fix] Fetching the Git-LFS tokenizer files (#1954)
MasterJH5574 Mar 14, 2024
2872f70
[LogitProcessor] Add max thread awareness to logit processing kernels…
CharlieFRuan Mar 14, 2024
d546134
[Model] Use static hidden size in mixtral scatter_output (#1959)
vinx13 Mar 14, 2024
01527e9
Auto updated submodule references
Mar 15, 2024
09fe1bc
[CompilerFlag] Detect if FlashInfer is enabled from libinfo (#1941)
MasterJH5574 Mar 15, 2024
c7d52c4
[Serving][Grammar] Add grammar termination as a stop condition (#1964)
Ubospica Mar 15, 2024
994f928
Unify schema for conversation template and embed into mlc-chat-config…
rickzx Mar 15, 2024
73f2b27
[SLM] Small correction on Stablelm and Qwen2. (#1958)
tlopex Mar 16, 2024
d6b86d1
[Serving][Fix] Fix JSON output check in test_server.py (#1966)
Ubospica Mar 16, 2024
edffce4
[Model] Migrate Mistral to use PagedKVCache (#1967)
MasterJH5574 Mar 16, 2024
8f5e25d
Auto updated submodule references
Mar 18, 2024
386af8d
[REST] Update Rest API docs for the latest serve flow (#1972)
Kartik14 Mar 18, 2024
4db4373
[Conv] Add bos_token to llama and mistral in ConvTemplateRegistry (#1…
rickzx Mar 18, 2024
949ff2d
[Model][Serve] Add support for LLaVa model in serving engine (#1974)
anibohara2000 Mar 18, 2024
058c583
[Serve] Hot fix for the mixtral serving (#1975)
yongwww Mar 19, 2024
3cbc169
[REST] REST API Deprecated (#1973)
shreygupta2809 Mar 19, 2024
587e341
[Fix] Fix handling of non-numerical cuda arch (#1976)
vinx13 Mar 19, 2024
bed4f53
[Serving][Grammar] Support specifying the main rule in grammar (#1982)
Ubospica Mar 19, 2024
5485782
[Fix] Fix `MLC_MULTI_ARCH` with arch `sm_90a` (#1984)
cyx-6 Mar 19, 2024
06d6115
Fix Llama-2 and Mistral conversation template. Update ConvTemplateReg…
rickzx Mar 20, 2024
39d0865
[SpecDecode] Fix sampler selection. (#1971)
KnowingNothing Mar 20, 2024
a0484bd
[Serving][Grammar] Utility to convert json schema to EBNF grammar (#1…
Ubospica Mar 20, 2024
3b9b51a
Auto updated submodule references
Mar 20, 2024
d4ec25e
[Fix] Fix serve model to adapt the latest Allocator signature (#1989)
MasterJH5574 Mar 20, 2024
c74f176
[Model] Use optimized group gemm for Mixtral (#1988)
vinx13 Mar 20, 2024
244c2e7
[Attn] Fix the construction of attn result merge kernel (#1995)
MasterJH5574 Mar 21, 2024
ddfbcda
[iOS][Android] Add validation of library file for iOS and Android bui…
tqchen Mar 21, 2024
cc36324
Auto updated submodule references
Mar 21, 2024
96d9c8b
[Serve] add allocator in Storage as the upstream change (#1997)
yongwww Mar 21, 2024
0772940
[Compiler] Support IPC memory and customized all-reduce kernels (#1990)
MasterJH5574 Mar 22, 2024
ae97b8d
Auto updated submodule references
Mar 22, 2024
8405cb1
[Model] Fix the top-k TIR script for well-formedness (#2002)
MasterJH5574 Mar 22, 2024
64badb5
Fix invalid use of dataflow var in sampler output (#2003)
vinx13 Mar 22, 2024
837ee53
[Fix] Fix KV cache creation pass after nn.Module changes (#2011)
MasterJH5574 Mar 24, 2024
10f2d00
[iOS] Fix typo in prepare_model_lib.py (#2013)
HuitingLiu Mar 24, 2024
a6de1ff
Remove unstable assertion in KV cache creation dispatch (#2017)
MasterJH5574 Mar 24, 2024
1c8b72e
Auto updated submodule references
Mar 25, 2024
ab9fa81
[SLM] Qwen2 Multi-GPU support (#1985)
tlopex Mar 25, 2024
f04cd3e
more info for preshard (#2027)
na20215 Mar 25, 2024
1c975de
Register stablelm-2 conversation template (#2029)
rickzx Mar 25, 2024
8796fb4
[Serving][Fix] Fix problems in PopenServer (#2032)
Ubospica Mar 26, 2024
a6d31d7
[Quantization] Skip MoE gate layer (#2012)
MasterJH5574 Mar 26, 2024
f2518ab
[Serving][Grammar] Integration of JSON schema generation (#2030)
Ubospica Mar 27, 2024
0a23af5
[Compiler] Support AUTO mode for all-reduce strategy (#2034)
MasterJH5574 Mar 27, 2024
47c8350
[LLaVa] Follow-up for TODOs in LLaVa model (#2010)
anibohara2000 Mar 27, 2024
2d68e64
[Pipeline] Defer GPU IPC memory lowering (#2038)
MasterJH5574 Mar 27, 2024
be42bec
[Model] Add missing broadcast of logit_position for multigpu (#2040)
vinx13 Mar 28, 2024
5ebcda1
[Preshard] apply presharding after quantization (#2039)
vinx13 Mar 28, 2024
a0c0f21
[SLM] Baichuan Multi-GPU support (#2037)
tlopex Mar 28, 2024
34497ea
Auto updated submodule references
Mar 28, 2024
cf8d458
[Model] Skip TVMSynchronize when tracing is not enabled (#2041)
MasterJH5574 Mar 28, 2024
4255a45
[Serving] Support NVTX for benchmarking (#2043)
MasterJH5574 Mar 28, 2024
2b82091
Update huggingface_loader.py
tqchen Mar 28, 2024
522db05
[Serve] Separate callback invocation to another thread in AsyncEngine…
MasterJH5574 Mar 29, 2024
ad068c2
[LLaVa] Fix random token output after first sentence (#2048)
anibohara2000 Mar 29, 2024
b4b8e91
Auto updated submodule references
Mar 29, 2024
1acd5f5
[Pass] Fix LiftGlobalBufferAlloc for proper GlobalVar struct info (#2…
MasterJH5574 Mar 29, 2024
2f171b4
Auto updated submodule references
Mar 29, 2024
55d7dc3
[Serving] CLI Support for SERVE (#2014)
Kartik14 Mar 29, 2024
203afab
[Pipeline] Insert hints to enable cuda graph symbolic capture (#2050)
vinx13 Mar 29, 2024
6431bda
[Loader] Print message when multi-GPU loader is finished (#2051)
vinx13 Mar 30, 2024
12c9808
[KVCache] Support matching arbitrary element offset for aux data (#2057)
MasterJH5574 Mar 30, 2024
af7ef3e
[Serving] Support copy stream in LogitProcessor and GPUSampler (#2058)
MasterJH5574 Mar 30, 2024
2600a70
[SLM] Stablelm Multi-GPU support (#2052)
tlopex Mar 30, 2024
9ecc00e
[KVCache] Introducing single page copy func for KV cache fork (#2060)
MasterJH5574 Mar 30, 2024
e370ac7
[Python] Implement testing.DebugChat for end-to-end model debugging (…
rickzx Mar 30, 2024
069b73a
[Docs] Fix docs for python server and rest call (#2066)
yogeshg Mar 31, 2024
3e91e70
[CI] Enable submodule clone for WASM model compilation (#2068)
MasterJH5574 Mar 31, 2024
ed62796
[Serve] Fork sequence at specified positions (#2067)
MasterJH5574 Mar 31, 2024
5243b27
[SLM] Add support for RWKV6 model (#1977)
Celve Mar 31, 2024
8cac74c
[Quantization] Reorganize utils code in group_quantization (#2055)
vinx13 Apr 1, 2024
8a82f93
[Serving] Bugfix for empty stop string (#2070)
Kartik14 Apr 1, 2024
eb3d1e4
[SLM] Internlm Multi-GPU support (#2072)
tlopex Apr 1, 2024
10017db
[WebGPU] Add mlc wasm runtime, support grammar in web (#2061)
CharlieFRuan Apr 1, 2024
9121126
[Build] Use TVM_HOME environment variable (#2073)
Lunderberg Apr 1, 2024
b7416c0
[Serving] Support input chunking (#2069)
MasterJH5574 Apr 1, 2024
52de798
[Docs] API Code Completion Guide (#2054)
davidpissarra Apr 2, 2024
12ca8fd
Allow "mlc_llm --host" option to override host triple the model compi…
yuxuanchiadm Apr 2, 2024
63fc972
[Web] Move prep emcc deps script to web folder (#2077)
CharlieFRuan Apr 2, 2024
5bc3ffa
[SLM] Qwen Multi-GPU support (#2075)
tlopex Apr 2, 2024
96b8c33
Fix mismatch of metadata func and global symbol (#2078)
vinx13 Apr 3, 2024
1d34527
[Disco] Set worker CPU affinity with env variable (#2042)
MasterJH5574 Apr 3, 2024
7f1aacc
[Quantization] Introduce PerTensor and F8 quantization (#2079)
vinx13 Apr 4, 2024
700206b
[Serving][Refactor] Rename AsyncThreadedEngine to ThreadedEngine (#2081)
MasterJH5574 Apr 4, 2024
2e9cc1c
[Serving] Add cuda profiling in benchmark test (#2084)
yongwww Apr 5, 2024
41da87a
[Grammar] Fix broken grammar tests (#2083)
MasterJH5574 Apr 5, 2024
791623a
[Serving][Fix] Fix chunked prefill condition (#2082)
MasterJH5574 Apr 5, 2024
7e0f102
[Conversation] Fix RedPajama conversation template (#2087)
MasterJH5574 Apr 5, 2024
c2f2e59
[Serving][Refactor] Python interface refactor (#2085)
MasterJH5574 Apr 5, 2024
5cf700b
[Serving] Separating ThreadedEngine creation and initialization (#2090)
MasterJH5574 Apr 5, 2024
d6d3d7e
[Serving] Enhance robustness with small KV capacity (#2091)
MasterJH5574 Apr 5, 2024
a73eae2
[REST] Update REST API docs (#2092)
Kartik14 Apr 5, 2024
466fa8a
[DOCS] Clarify vulkan loader dependency (#2095)
tqchen Apr 5, 2024
a75eb0b
[SLM] Add support for Chatglm3 architecture (#2096)
tlopex Apr 6, 2024
3d564f3
[Quantization] Add OpenCL device (#2097)
mengshyu Apr 6, 2024
61f76c7
[Serving] Support stream=True for Python API (#2098)
MasterJH5574 Apr 6, 2024
50766fd
[Serving][Refactor] OpenAI API Python interface alignment (#2099)
MasterJH5574 Apr 7, 2024
fb24fcf
[DOC] fix small python env install error (#2102)
DiegoCao Apr 7, 2024
cc8b747
[JSONFFIEngine] Initial implementation of JSONFFIEngine (#2101)
anibohara2000 Apr 8, 2024
95d268b
[Model] Use tanh approximation of GeLU in Gemma MLP (#2106)
jeethu Apr 8, 2024
36d0e6a
Auto updated submodule references
Apr 8, 2024
3e71b70
[Quantization] Stricter checks for MoE gate (#2109)
MasterJH5574 Apr 9, 2024
623ed62
Auto updated submodule references
Apr 10, 2024
021c29c
[LLaVa] Fix allowed text model value in config (#2062)
anibohara2000 Apr 10, 2024
c4169d8
Auto updated submodule references
Apr 10, 2024
f832bde
Revert "Allow "mlc_llm --host" option to override host triple the mod…
tqchen Apr 10, 2024
716a5ed
Revert "Auto updated submodule references" (#2117)
MasterJH5574 Apr 10, 2024
6c48755
[Metadata] Include picojson rather than forward declaring (#2118)
MasterJH5574 Apr 10, 2024
39dfa3e
Auto updated submodule references
Apr 10, 2024
7f7c01f
Auto updated submodule references
Apr 11, 2024
a815148
[Serving][Grammar] Porting the json schema converter from python to C…
Ubospica Apr 11, 2024
9b71443
[Model] Use R.topk/cumsum for mixtral (#2107)
vinx13 Apr 11, 2024
880c68a
Enable flashinfer when group_size == 6 (#2124)
vinx13 Apr 12, 2024
4dfb9f0
[SpecDecode] Support Eagle in speculative decoding (#2080)
KnowingNothing Apr 12, 2024
65e4a56
[Pass] Attach non-negative TIR var attributes (#2125)
MasterJH5574 Apr 12, 2024
8e8a921
[Serving][Refactor] Engine constructor interface refactor (#2126)
MasterJH5574 Apr 12, 2024
8139a47
[Serving] Revamp engine mode selection logging info (#2128)
MasterJH5574 Apr 13, 2024
a361119
[SLM] Chatglm3 Multi-GPU support (#2123)
tlopex Apr 14, 2024
661abb2
[Serving] Fix support of large `n` under low max batch size (#2136)
MasterJH5574 Apr 14, 2024
3403a4e
[Docs] Revamp landing page with Engine Python API and server (#2137)
MasterJH5574 Apr 15, 2024
4cbda04
[Target] Update Target tags (#2141)
Hzfengsy Apr 16, 2024
8f33c30
[Util] Support debug debug_compare (#2142)
Hzfengsy Apr 16, 2024
3d25d9d
[Minor][SpecInfer] Fix Optional FC Bias for Mixtral Eagle Model (#2146)
zxybazh Apr 17, 2024
2de2875
[Serving] fix hardcoded host and port in popen_server (#2147)
yongwww Apr 17, 2024
8c673b4
[Docs] Introductory tutorial (#2145)
MasterJH5574 Apr 17, 2024
9f9436b
[Serving] Support `DebugCallFuncOnAllAllWorker` and CUDA profiler (#2…
MasterJH5574 Apr 17, 2024
2a24f13
[DOCS] Update introduction (#2151)
tqchen Apr 17, 2024
5a37e55
[Serving][Python] Rename Engine to LLMEngine (#2152)
MasterJH5574 Apr 17, 2024
751783b
Auto updated submodule references
Apr 17, 2024
e9a4a0b
[Quantization] Add e4m3 mode and enable fp8 storage type (#2154)
vinx13 Apr 17, 2024
7d3f34e
Revert "[Quantization] Add e4m3 mode and enable fp8 storage type" (#2…
tqchen Apr 18, 2024
8352235
[Serving] EngineConfig refactor (#2159)
MasterJH5574 Apr 18, 2024
cfd22f3
merged
sunggg Apr 18, 2024
09e91c0
temporary hack for byoc
sunggg Apr 22, 2024
16c676b
Merge remote-tracking branch 'origin/mlc-serve-v0.2.0' into HEAD
sunggg Apr 22, 2024
1 change: 1 addition & 0 deletions ci/task/pylint.sh
@@ -8,6 +8,7 @@ export PYTHONPATH="./python":${PYTHONPATH:-""}
 
 # TVM Unity is a dependency to this testing
 pip install --quiet --pre -U -f https://mlc.ai/wheels mlc-ai-nightly
+pip install --quiet --pre -U cuda-python
 
 pylint --jobs $NUM_THREADS ./python/
 pylint --jobs $NUM_THREADS --recursive=y ./tests/python/
204 changes: 204 additions & 0 deletions cpp/json_ffi/json_ffi_engine.cc
@@ -0,0 +1,204 @@
#include "json_ffi_engine.h"

#include <picojson.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/registry.h>

namespace mlc {
namespace llm {
namespace json_ffi {

using namespace tvm::runtime;

JSONFFIEngine::JSONFFIEngine() { engine_ = serve::ThreadedEngine::Create(); }

bool JSONFFIEngine::ChatCompletion(std::string request_json_str, std::string request_id) {
  bool success = this->AddRequest(request_json_str, request_id);
  if (!success) {
    this->StreamBackError(request_id);
  }
  return success;
}

void JSONFFIEngine::StreamBackError(std::string request_id) {
  ChatCompletionMessage delta;
  delta.content = std::vector<std::unordered_map<std::string, std::string>>{
      {{"type", "text"}, {"text", this->err_}}};
  delta.role = Role::assistant;

  ChatCompletionStreamResponseChoice choice;
  choice.finish_reason = FinishReason::error;
  choice.index = 0;
  choice.delta = delta;

  ChatCompletionStreamResponse response;
  response.id = request_id;
  response.choices = std::vector<ChatCompletionStreamResponseChoice>{choice};
  response.model = "json_ffi";  // TODO: Return model name from engine (or from args)
  response.system_fingerprint = "";

  this->request_stream_callback_(Array<String>{picojson::value(response.ToJSON()).serialize()});
}

bool JSONFFIEngine::AddRequest(std::string request_json_str, std::string request_id) {
  std::optional<ChatCompletionRequest> optional_request =
      ChatCompletionRequest::FromJSON(request_json_str, &err_);
  if (!optional_request.has_value()) {
    return false;
  }
  ChatCompletionRequest request = optional_request.value();
  // Create Request
  // TODO: Check if request_id is present already

  // inputs
  // TODO: Apply conv template
  Array<Data> inputs;
  for (const auto& message : request.messages) {
    if (message.content.has_value()) {
      for (const auto& content : message.content.value()) {
        if (content.find("type") == content.end()) {
          err_ += "Content should have a type field";
          return false;
        }
        std::string type = content.at("type");
        if (type == "text") {
          if (content.find("text") == content.end()) {
            err_ += "Content should have a text field";
            return false;
          }
          std::string text = content.at("text");
          inputs.push_back(TextData(text));
        } else {
          err_ += "Content type not supported";
          return false;
        }
      }
    }
  }

  // generation_cfg
  Optional<GenerationConfig> generation_cfg = GenerationConfig::FromJSON(request_json_str, &err_);
  if (!generation_cfg.defined()) {
    return false;
  }

  Request engine_request(request_id, inputs, generation_cfg.value());
  this->engine_->AddRequest(engine_request);

  return true;
}

bool JSONFFIEngine::Abort(std::string request_id) {
  this->engine_->AbortRequest(request_id);
  return true;
}

std::string JSONFFIEngine::GetLastError() { return err_; }

void JSONFFIEngine::ExitBackgroundLoop() { this->engine_->ExitBackgroundLoop(); }

JSONFFIEngine::~JSONFFIEngine() { this->ExitBackgroundLoop(); }

class JSONFFIEngineImpl : public JSONFFIEngine, public ModuleNode {
 public:
  TVM_MODULE_VTABLE_BEGIN("mlc.json_ffi");
  TVM_MODULE_VTABLE_ENTRY("init_background_engine", &JSONFFIEngineImpl::InitBackgroundEngine);
  TVM_MODULE_VTABLE_ENTRY("chat_completion", &JSONFFIEngineImpl::ChatCompletion);
  TVM_MODULE_VTABLE_ENTRY("abort", &JSONFFIEngineImpl::Abort);
  TVM_MODULE_VTABLE_ENTRY("get_last_error", &JSONFFIEngineImpl::GetLastError);
  TVM_MODULE_VTABLE_ENTRY("run_background_loop", &JSONFFIEngineImpl::RunBackgroundLoop);
  TVM_MODULE_VTABLE_ENTRY("run_background_stream_back_loop",
                          &JSONFFIEngineImpl::RunBackgroundStreamBackLoop);
  TVM_MODULE_VTABLE_ENTRY("exit_background_loop", &JSONFFIEngineImpl::ExitBackgroundLoop);
  TVM_MODULE_VTABLE_END();

  void InitBackgroundEngine(EngineConfig engine_config,
                            Optional<PackedFunc> request_stream_callback,
                            Optional<EventTraceRecorder> trace_recorder) {
    this->streamer_ = TextStreamer(Tokenizer::FromPath(engine_config->model));

    CHECK(request_stream_callback.defined())
        << "JSONFFIEngine requires request stream callback function, but it is not given.";
    this->request_stream_callback_ = request_stream_callback.value();

    auto frequest_stream_callback_wrapper = [this](TVMArgs args, TVMRetValue* ret) {
      ICHECK_EQ(args.size(), 1);
      Array<RequestStreamOutput> delta_outputs = args[0];
      Array<String> responses = this->GetResponseFromStreamOutput(delta_outputs);
      this->request_stream_callback_(responses);
    };

    request_stream_callback = PackedFunc(frequest_stream_callback_wrapper);
    this->engine_->InitBackgroundEngine(
        std::move(engine_config), std::move(request_stream_callback), std::move(trace_recorder));
  }

  void RunBackgroundLoop() { this->engine_->RunBackgroundLoop(); }

  void RunBackgroundStreamBackLoop() { this->engine_->RunBackgroundStreamBackLoop(); }

  Array<String> GetResponseFromStreamOutput(Array<RequestStreamOutput> delta_outputs) {
    std::unordered_map<std::string, std::vector<ChatCompletionStreamResponseChoice>> response_map;
    for (const auto& delta_output : delta_outputs) {
      std::string request_id = delta_output->request_id;
      if (response_map.find(request_id) == response_map.end()) {
        response_map[request_id] = std::vector<ChatCompletionStreamResponseChoice>();
      }
      ChatCompletionStreamResponseChoice choice;

      if (delta_output->group_finish_reason.size() != 1) {
        // Only support n = 1 in ChatCompletionStreamResponse for now
        this->err_ += "Group finish reason should have exactly one element";
      }
      Optional<String> finish_reason = delta_output->group_finish_reason[0];
      if (finish_reason.defined()) {
        if (finish_reason.value() == "stop") {
          choice.finish_reason = FinishReason::stop;
        } else if (finish_reason.value() == "length") {
          choice.finish_reason = FinishReason::length;
        } else if (finish_reason.value() == "tool_calls") {
          choice.finish_reason = FinishReason::tool_calls;
        } else if (finish_reason.value() == "error") {
          choice.finish_reason = FinishReason::error;
        }
      } else {
        choice.finish_reason = std::nullopt;
      }

      choice.index = response_map[request_id].size();

      ChatCompletionMessage delta;
      // Size of delta_output->group_delta_token_ids Array should be 1
      IntTuple delta_token_ids = delta_output->group_delta_token_ids[0];
      std::vector<int32_t> delta_token_ids_vec(delta_token_ids.begin(), delta_token_ids.end());
      delta.content = std::vector<std::unordered_map<std::string, std::string>>();
      delta.content.value().push_back(std::unordered_map<std::string, std::string>{
          {"type", "text"}, {"text", this->streamer_->Put(delta_token_ids_vec)}});

      delta.role = Role::assistant;

      choice.delta = delta;

      response_map[request_id].push_back(choice);
    }

    Array<String> response_arr;
    for (const auto& [request_id, choices] : response_map) {
      ChatCompletionStreamResponse response;
      response.id = request_id;
      response.choices = choices;
      response.model = "json_ffi";  // TODO: Return model name from engine (or from args)
      response.system_fingerprint = "";
      response_arr.push_back(picojson::value(response.ToJSON()).serialize());
    }
    return response_arr;
  }
};

TVM_REGISTER_GLOBAL("mlc.json_ffi.CreateJSONFFIEngine").set_body_typed([]() {
  return Module(make_object<JSONFFIEngineImpl>());
});

} // namespace json_ffi
} // namespace llm
} // namespace mlc
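
Since JSONFFIEngineImpl is exposed only through the TVM module vtable above, a client drives the engine entirely via PackedFuncs. Below is a minimal, hypothetical C++ sketch of that flow, not part of this diff: the global and method names come from the registration above, while RunJSONFFIEngineSketch, the example request JSON, and the engine_config argument are illustrative assumptions.

#include <tvm/runtime/logging.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/registry.h>

#include <string>
#include <thread>

using namespace tvm::runtime;

// `engine_config` is assumed to be a valid serve::EngineConfig built elsewhere.
void RunJSONFFIEngineSketch(ObjectRef engine_config) {
  // Look up the factory registered by TVM_REGISTER_GLOBAL above.
  const PackedFunc* create = Registry::Get("mlc.json_ffi.CreateJSONFFIEngine");
  ICHECK(create != nullptr) << "MLC LLM runtime library is not loaded";
  Module engine = (*create)();

  // Receives an Array<String> of serialized ChatCompletionStreamResponse JSON
  // (see GetResponseFromStreamOutput above).
  PackedFunc on_stream([](TVMArgs args, TVMRetValue* ret) {
    Array<String> responses = args[0];
    for (const String& response : responses) {
      LOG(INFO) << response;
    }
  });

  engine.GetFunction("init_background_engine")(engine_config, on_stream, nullptr);

  // Both loops block, so each runs on its own thread.
  std::thread loop([engine]() mutable { engine.GetFunction("run_background_loop")(); });
  std::thread stream([engine]() mutable { engine.GetFunction("run_background_stream_back_loop")(); });

  // Message content uses the {"type": "text", "text": ...} shape parsed in AddRequest.
  std::string request =
      R"({"messages": [{"role": "user", "content": [{"type": "text", "text": "Hello!"}]}]})";
  engine.GetFunction("chat_completion")(request, "request-0");

  // ... consume streamed responses, then shut down.
  engine.GetFunction("exit_background_loop")();
  loop.join();
  stream.join();
}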
56 changes: 56 additions & 0 deletions cpp/json_ffi/json_ffi_engine.h
@@ -0,0 +1,56 @@
/*!
 * Copyright (c) 2023 by Contributors
 * \file json_ffi/json_ffi_engine.h
 * \brief The header of JSON FFI engine in MLC LLM.
 */
#ifndef MLC_LLM_JSON_FFI_JSON_FFI_ENGINE_H_
#define MLC_LLM_JSON_FFI_JSON_FFI_ENGINE_H_

#include <tvm/runtime/packed_func.h>

#include <string>

#include "../serve/threaded_engine.h"
#include "../streamer.h"
#include "openai_api_protocol.h"

namespace mlc {
namespace llm {
namespace json_ffi {

using namespace tvm::runtime;
using namespace mlc::llm::serve;

/*!
 * \brief // Todo: document this class, fields and member functions
 */
class JSONFFIEngine {
 public:
  JSONFFIEngine();

  ~JSONFFIEngine();

  bool ChatCompletion(std::string request_json_str, std::string request_id);

  bool AddRequest(std::string request_json_str, std::string request_id);

  void StreamBackError(std::string request_id);

  bool Abort(std::string request_id);

  std::string GetLastError();

  void ExitBackgroundLoop();

 protected:
  std::unique_ptr<ThreadedEngine> engine_;
  std::string err_;
  PackedFunc request_stream_callback_;
  TextStreamer streamer_;  // TODO: Support "n", and support different streamers for each request
};

} // namespace json_ffi
} // namespace llm
} // namespace mlc

#endif // MLC_LLM_JSON_FFI_JSON_FFI_ENGINE_H_