Merge with mlc-ai/main (835223541d4135e511a50cba1deca06731b03abd, April 18th 2024) #260
Conversation
This PR makes the decode attention kernel aware of the WebGPU backend, ensuring that the total number of threads does not exceed WebGPU's limit of 256. Co-authored-by: Bohan Hou <spectrometerh@gmail.com>
This PR refactors the existing logit processing pipeline with a unified logit processor class. The logit processor class exposes two functions:

- `InplaceUpdateLogits`, which takes in the raw logits produced by the model and applies logit bias (introduced in this PR), presence/frequency/repetition penalties, and the token id mask, in order, when needed.
- `ComputeProbsFromLogits`, which takes in the updated logits and invokes softmax with temperature to compute the probability distribution.

The logit processor runs completely on GPU; that is, all of the logit bias / penalty / mask application and the softmax are backed by GPU kernels. This is a key difference from the logit processing prior to this PR, where the processing happened on CPU, and the softmax also ran on CPU whenever any logit processing was needed.

With the unified logit processor, we simplified the interface for handling the model's output logits in engine actions to make it cleaner. We also simplified the interface of the Sampler. Preliminary results show that the LogitProcessor brings a slight performance improvement when any processing is needed.
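To make the two-step flow concrete, here is a minimal NumPy sketch of the same idea. The real LogitProcessor runs these steps as GPU kernels inside the engine; the function names below mirror the interface described above, but the signatures and field names are illustrative assumptions, not the actual implementation.

```
import numpy as np

def inplace_update_logits(logits, logit_bias=None, presence_penalty=0.0,
                          frequency_penalty=0.0, token_counts=None):
    # Apply logit bias, then presence/frequency penalties, in place.
    if logit_bias is not None:
        for token_id, bias in logit_bias.items():
            logits[token_id] += bias
    if token_counts is not None:
        logits -= presence_penalty * (token_counts > 0)
        logits -= frequency_penalty * token_counts
    return logits

def compute_probs_from_logits(logits, temperature=1.0):
    # Softmax with temperature over the updated logits.
    scaled = logits / max(temperature, 1e-6)
    scaled = scaled - scaled.max()  # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# Toy usage over a 32k-token vocabulary.
logits = np.random.randn(32000).astype("float32")
counts = np.zeros(32000, dtype="int32")
probs = compute_probs_from_logits(
    inplace_update_logits(logits, logit_bias={42: 5.0}, token_counts=counts),
    temperature=0.7,
)
```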
This PR introduces logprobs support with OpenAI API compatibility. It enhances the sampler with a function to get the top-probability tokens (supporting at most 5 tokens as of now). To make it easy to pass logprob results back from the serving engine to the frontend, we choose to pass logprob results as a JSON string following the OpenAI API spec. Unit tests are added to ensure the correctness of logprobs. The logprobs support also works with speculative decoding.
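As a usage illustration, requesting logprobs through an OpenAI-compatible client could look like the sketch below. The endpoint URL and model name are placeholders, and the parameter names follow the OpenAI chat API; exact acceptance of these names by the MLC endpoint is an assumption.

```
from openai import OpenAI

# Placeholder endpoint and model; point these at your running MLC serve server.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="Llama-2-7b-chat-hf-q4f16_1-MLC",
    messages=[{"role": "user", "content": "Hello!"}],
    logprobs=True,
    top_logprobs=5,  # at most 5 top-probability tokens, per the limit above
)
print(response.choices[0].logprobs)
```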
This PR supports Mixtral in MLC serve. The main change is introducing the Mistral conversation template to the Python registry so that MLC Serve can use it. Besides that, this PR updates the KV cache capacity analysis to make it more accurate in terms of usage calculation, while remaining conservative, since there is a known issue regarding batch-prefill embedding taking which may lead to OOM. We will follow up on the issue with a fix in the future and then enable the estimation to use more GPU vRAM.
Prior to this PR, `u_char` was used, which is not a standard type in C++ and causes Windows build failures. This PR fixes it by using `unsigned char`.
…#1849) [Fix] Add phi lm head name to is_final_fc
…#1852) Instead of a Python function that returns an updated `IRModule`, the new `optimize_mod_pipeline` function returns a `tvm.ir.transform.Pass` which can be applied to an `IRModule`.
* Create __init__.py
* Add files via upload
* Update model.py
* Update model_preset.py
* Update conv_templates.cc
* Update internlm_loader.py
* Update internlm_quantization.py
* fix name of notes
* Update model.py
* Migration
* fix pylint issue
* fix pylint issue
* fix pylint error
* Update internlm_loader.py
* Update __init__.py
* Update __init__.py
* Delete python/mlc_chat/model/internlm/__init__.py
* Add files via upload
Prior to this commit, a model name with multiple path components (e.g. `dist/models/group_name/model_name`) would have duplicated path components (e.g. `dist/group_name/artifact_path/group_name/libname.so`). This commit resolves the duplication.
* [KVCache] Add max num threads to KVCache kernels, fix WebGPU
* Read max_num_threads_per_block when available
* Change merge state in place kernel
* Make attention decode aware of max num threads, not just webgpu

Co-authored-by: Egor Churaev <egor.churaev@gmail.com>

* Change util function name

---------

Co-authored-by: Egor Churaev <egor.churaev@gmail.com>
…1860) This PR moves the import of transformers into the function body of the tiktoken tokenizer conversion, so that we do not force a dependency on transformers.
This PR adds RWKV5 support with RNNState, an interface similar to PagedAttention. Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Following mlc-ai#1854, this PR registers the ChatML conversation template.
Sets the entry functions for a module. This utility is intended for cases where a module contains several externally exposed functions and only one is desired for use (e.g. separating out a `transform_params` function from an `IRModule` that also contains inference functions). This commit only updates the external visibility, after which `relax.transform.DeadCodeElimination()` can be applied.
…i#1856) This allows it to be used as part of an optimization pipeline specified as a `tvm.ir.transform.Sequential`.
mlc-ai#1867) This PR is the 3rd part of the grammar-guided generation. It integrates the grammar framework into the generation process and supports JSON output for now. The API this PR provides is compatible with the OpenAI API.

### APIs

#### Python API

```
@dataclass
class ResponseFormat:
    type: Literal["text", "json_object"] = "text"
    json_schema: Optional[str] = None

@dataclass
class GenerationConfig:
    response_format: ResponseFormat = ResponseFormat(type="text")
```

#### REST API

```
response_format: { "type": "text" }                             # text generation, by default
response_format: { "type": "json_object" }                      # json generation
response_format: { "type": "json_object", json_schema="..." }   # json generation with schema
```

JSON generation with schema is not supported yet, but is planned for the future.

### Performance

#### Without JSON

```
Single token prefill latency: 891.2234 ms/tok
Single token decode latency: 31.3399 ms/tok
Prefill token throughput: 4693.3077 tok/s
Decode token throughput: 226.4406 tok/s
Overall token throughput: 470.3180 tok/s
```

#### With JSON

```
Single token prefill latency: 219.2287 ms/tok
Single token decode latency: 29.1399 ms/tok
Prefill token throughput: 7392.1555 tok/s
Decode token throughput: 179.2296 tok/s
Overall token throughput: 1052.1996 tok/s
```

We observed a slight decrease in performance under JSON mode. This will be further optimized in the future.
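For illustration, a JSON-mode request against the REST API described above could be sent as in the following sketch; the server address and model name are placeholders, not values taken from this PR.

```
import requests

payload = {
    "model": "Llama-2-7b-chat-hf-q4f16_1-MLC",
    "messages": [{"role": "user", "content": "List three fruits as a JSON object."}],
    "response_format": {"type": "json_object"},
}
resp = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```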
This PR brings the field `n` to the generation config and thereby supports parallel generation. The parallel generation effectively leverages the "fork" functionality of the paged KV cache. This PR supports specifying the number of parallel generations `n` in the standard OpenAI ChatCompletion API. This is the last feature towards OpenAI API feature completeness.
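A hedged sketch of requesting parallel generations through the OpenAI-style chat completion API; the endpoint and model name are placeholders.

```
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="Llama-2-7b-chat-hf-q4f16_1-MLC",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    n=3,  # three parallel generations, sharing the prompt via the KV cache "fork"
)
for choice in response.choices:
    print(choice.index, choice.message.content)
```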
Sometimes the SCM checkout can time out; this PR adds a retry for that.
Prior to this PR, the TIR attention kernels did not cast matmul operands to fp32 before multiplying. For models like Phi-2, which may have large Q/K/V data (at the level of a few hundred), the fp16 multiplication exceeds the range of fp16 and sometimes leads to the attention results being NaN. This PR fixes the issue.
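A small NumPy illustration of the overflow being fixed: multiplying two fp16 values of a few hundred already exceeds fp16's maximum (about 65504), so the product becomes inf, which can then turn the attention output into NaN; casting to fp32 first avoids this.

```
import numpy as np

q = np.float16(300.0)
k = np.float16(400.0)

print(q * k)                          # inf: 120000 overflows fp16
print(np.float32(q) * np.float32(k))  # 120000.0: fine after casting to fp32
```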
…lc-ai#1857) Prior to this commit, the `ReorderTransformFunc` required several components of the `ParamManager` in order to be used. The functionality it provides, reordering dataflow blocks to minimize the liveset, is useful outside of the context of the `ParamManager`. This commit makes the following changes, allowing it to be used independently of the `ParamManager`:

- Generate the `pidx2binname` dictionary outside of `ReorderTransformFunc`.
- Allow parameters to be separate `func.params`, rather than a single bundled tuple parameter.
This PR migrates Phi-2 to paged KV cache attention as a part of the model definition migration per mlc-ai#1749. Co-authored-by: Shrey Gupta <shrey2809@gmail.com>
…c-ai#1874) The use of `call_inplace_packed` and `call_pure_packed` in the old flow is outdated due to signature changes. This PR fixes the issue.
PR mlc-ai#1852 missed applying the BundleModelParams pass and thus made the compiled models not runnable through ChatModule (mlc-ai#1864). This PR fixes the issue.
As pointed out by mlc-ai#1830, this PR fixes the Android app download link in docs.
Fix website link not accessible
This PR adopts suggestions from the support of OpenAI API parallel generation `n` in mlc-ai#1868. The main update in this PR is to make RequestState a standalone object class, which was previously a typedef of `std::vector<RequestStateEntry>`. This PR also fixes a bug in prefill that causes engine failure when `n` is large.
This PR fixes the picojson usage in MLC that conflicts with the latest changes on the picojson side.
…++ (mlc-ai#2112) [Serve][Grammar] Porting the json schema converter from python to C++

This PR ports the JSON schema converter from Python to C++. It defines the interface:

```
std::string JSONSchemaToEBNF(
    std::string schema,
    std::optional<int> indent = std::nullopt,
    std::optional<std::pair<std::string, std::string>> separators = std::nullopt,
    bool strict_mode = true);
```

and uses it in `BNFGrammar::FromSchema`. This helps cases where Python cannot be deployed.
1. Add Eagle-Llama-7b-chat model support. 2. Add speculative decoding support with Eagle.
This PR attaches the attributes of `tir.non_negative_var` for memory planning.
This PR is a refactor of the engine's constructor interface and the serve CLI interface. It introduces the "mode" argument for the engine, which has options "local", "interactive" and "server". The choice of mode affects the automatically inferred values of `max_batch_size`, `max_total_sequence_length` and `prefill_chunk_size` (only effective when the arguments are not specified; once an argument is specified, we will not override it). For a detailed specification of the modes, please check out the CLI help messages in `mlc_llm/help.py` or the engine constructor in `mlc_llm/serve/engine.py`.

No matter which mode is chosen, we print out the current mode and the values of these arguments so that people can understand the settings of the engine. We also provide hints on how to adjust the mode. For example,

```
[2024-04-12 16:12:26] INFO chat_module.py:379: Using model folder: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16-MLC
[2024-04-12 16:12:26] INFO chat_module.py:380: Using mlc chat config: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16-MLC/mlc-chat-config.json
[2024-04-12 16:12:26] INFO chat_module.py:529: Using library model: dist/Llama-2-7b-chat-hf-q0f16-MLC/Llama-2-7b-chat-hf-q0f16-MLC-cuda.so
[2024-04-12 16:12:26] INFO chat_module.py:379: Using model folder: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q4f16_1-MLC
[2024-04-12 16:12:26] INFO chat_module.py:380: Using mlc chat config: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json
[2024-04-12 16:12:26] INFO chat_module.py:529: Using library model: dist/Llama-2-7b-chat-hf-q4f16_1-MLC/Llama-2-7b-chat-hf-q4f16_1-MLC-cuda.so
[2024-04-12 16:12:29] INFO engine_base.py:382: Engine mode is "local". Max batch size is set to 4. Max KV cache token capacity is set to 4096. Prefill chunk size is set to 4096.
[2024-04-12 16:12:29] INFO engine_base.py:387: Estimated total single GPU memory usage: 21543.74 MB (Parameters: 16467.64 MB. KVCache: 4450.07 MB. Temporary buffer: 626.03 MB). The actual usage might be slightly larger than the estimated number.
[2024-04-12 16:12:29] INFO engine_base.py:398: Please switch to mode "server" if you want to use more GPU memory and support more concurrent requests.
```

After the refactor, we bring speculative decoding to the serve CLI so that people can use multiple models and run speculative decoding with the server launched in the CLI (which was not doable before).
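A hedged sketch of what constructing the engine with an explicit mode might look like in Python, based only on the description above; the class name, import path and argument names are assumptions, not verified API.

```
from mlc_llm.serve import Engine  # later renamed to LLMEngine

engine = Engine(
    model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC",
    mode="local",  # one of "local", "interactive", "server"
    # max_batch_size, max_total_sequence_length and prefill_chunk_size are
    # inferred from the mode unless given explicitly, in which case the
    # user-specified values are kept.
)
```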
This PR revamps the logging info for engine mode selection to provide more detailed information and the rationale for the different modes.
This PR enables TP for Chatglm3 model.
Prior to this PR, due to the improper prefill policy on `n` (parallel generation), the engine would loop forever when a request has `n` larger than the maximum batch size the engine can support. This PR fixes the issue by updating the prefill action, and with this PR, even the "interactive" engine mode can support multiple parallel generations well.

After this fix, it is possible that a request requires 10 parallel generations while the max batch size is 1. Given that the shapes of the temporary NDArrays in the GPU sampler are determined by the max batch size, the GPU sampler does not natively support sampling 10 tokens at a time. To address this, this PR introduces chunking to the GPU sampler. In this particular case, the GPU sampler will have chunk size 1, and the 10 required samples will be processed by the GPU sampler one by one in order. Chunking is the minimal change we can make to support large `n`.
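The chunking idea can be summarized with the following sketch (illustrative only, not the actual GPU sampler code): when a request needs `n` samples but the sampler's buffers are sized for `max_batch_size`, the work is split into chunks of at most `max_batch_size` and processed in order.

```
from typing import Callable, List

def chunked_sample(num_samples: int, max_batch_size: int,
                   sample_chunk: Callable[[int], List[int]]) -> List[int]:
    # Draw num_samples tokens by repeatedly invoking sample_chunk(k),
    # where each k is capped at max_batch_size (one sampler launch per chunk).
    tokens: List[int] = []
    remaining = num_samples
    while remaining > 0:
        k = min(remaining, max_batch_size)
        tokens.extend(sample_chunk(k))
        remaining -= k
    return tokens

# Example: n = 10 with max batch size 1 results in 10 chunks of size 1.
print(len(chunked_sample(10, 1, lambda k: [0] * k)))  # prints 10
```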
…2137) This PR revamps the landing documentation page.

* The Python API panel is changed from showing ChatModule to showing Engine.
* A new panel "REST Server" is added to show a quick-start example of launching the REST server and sending a request.
* A "what to do next" section is introduced at the bottom of the landing page.

Todo items for future PRs:

* add the page of the Python API with Engine.
* revamp the weight conversion page.
* revamp the model library compilation page.
This commit updates the target tags in order to identify the different SoC hardware targets for further target-specific optimizations. Meanwhile, it updates the Vulkan support for int64.
…ai#2146) * Add optional fc bias for mixtral. * Fix lint.
This PR updates the documentation with an introduction tutorial. The landing page now directs to the quick start page and the tutorial.
…c-ai#2148) This PR adds a new function `DebugCallFuncOnAllAllWorker` which calls a global function of signature `[] -> None` on all distributed workers when tensor parallelism is enabled (or on the local session itself if not enabled). As the name suggests, this function is only for debugging purposes, and we will not expose any public interface to invoke it. This PR also introduces the global functions `"mlc.debug_cuda_profiler_start"` and `"mlc.debug_cuda_profiler_stop"`, which enable CUDA profiling when using PopenServer.
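A hedged sketch of invoking the profiler-control global functions from Python through TVM's FFI; it assumes the MLC runtime that registers these functions has been loaded, otherwise `get_global_func` returns None.

```
import tvm

start = tvm.get_global_func("mlc.debug_cuda_profiler_start", allow_missing=True)
stop = tvm.get_global_func("mlc.debug_cuda_profiler_stop", allow_missing=True)

if start is not None and stop is not None:
    start()  # begin CUDA profiling on all workers
    # ... run the requests to be profiled ...
    stop()   # stop CUDA profiling
```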
* [DOCS] Update introduction

Some minor tweaks on the introduction doc

* Update docs/get_started/introduction.rst

Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>

---------

Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
We rename the public Python serve interface from `Engine` to `LLMEngine` (and from `AsyncEngine` to `AsyncLLMEngine` accordingly) for better class name clarity. This is because, in cases where people do a wildcard import, the name `Engine` by itself does not convey enough meaning.
* [Quantization] Add e4m3 mode and enable fp8 storage type * add quantize linear flag
…c-ai#2158) Revert "[Quantization] Add e4m3 mode and enable fp8 storage type (mlc-ai#2154)" This reverts commit e9a4a0b.
This PR refactors EngineConfig for a cleaner interface of the internal Engine constructor in MLC serve. This is a preparation step towards the engine reload/unload, which will be introduced in follow-up PRs for JSONFFIEngine functionality on mobile and other platforms.
```
def transform(self) -> IRModule:
    """Entry point of the transformation"""
    for g_var, func in self.mod.functions_items():
        # TODO(@eric): This is a temporary hack to get around with two functions for BYOC.
```
@Lunderberg, please follow-up.
Can you describe what issue is occurring here? From a quick glance, it looks like this is a bug in the `remove_global_buf_alloc`, in that it assumes any `PrimFunc` present in the `IRModule` will be a schedulable `PrimFunc`.