Merge with mlc-ai/main (adc6ee6ae2de97a507291aaff6279af4e3d16a83, July 2nd 2024) #272
Conversation
This PR migrates JSONFFIEngine to a formal namespace. It also lists TODOs to further simplify the JSONFFIEngine.
improve Install via environment variable
This PR integrates the sampling function from FlashInfer. For now we integrate the variant without top-p.
* add model lib delivery * fix lint
This PR simplifies the tool function names in encoding.h. The new names are:
- PrintAsUTF8
- PrintAsEscaped
- ParseNextUTF8
- ParseUTF8
- ParseNextUTF8OrEscaped

It also makes ParseNextUTF8 return the new char pointer instead of the number of chars processed, which simplifies the interface.
Use DraftTokenWorkspaceManager to maintain the workspace for draft probabilities and hidden states (if needed). This allows draft token states to be kept entirely on the GPU.
* Fix typo in event_tracer
[Model] Fix llama2 chat template and remove redundant separator added by engine (mlc-ai#2264)
(mlc-ai#2268)
* This PR refactors the EngineConfig to allow minimal JSON string passing. This is helpful for the JSONFFIEngine construction.
* This PR moves the automatic engine config inference from the Python side to the C++ side, so that we don't have duplicate code on multiple platforms.
* This PR renames `model_lib_path` to `model_lib`.
* This PR makes the reload/unload of ThreadedEngine act in a blocking style.
* This PR refactors the default generation config process flow and unifies everything in C++.
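As a rough illustration of the minimal JSON string passing described above (only the `model_lib` key is taken from this PR; the other field name and the paths are illustrative assumptions):

```
import json

# Hypothetical minimal engine config serialized to JSON for the FFI boundary.
engine_config = {
    "model": "./dist/Llama-3-8B-Instruct-q4f16_1-MLC",          # assumed field name
    "model_lib": "./dist/libs/Llama-3-8B-Instruct-q4f16_1.so",  # renamed from model_lib_path
}
engine_config_json = json.dumps(engine_config)
print(engine_config_json)  # this string is what the JSONFFIEngine would receive
```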
* [Serving] Add some try-except captures in AsyncMLCEngine
* [Fix] Fix the two-stage softmax func by removing log2e. When the two-stage softmax was introduced, we used a log2e numeric transformation for potentially better performance. However, at low temperature the log2e transformation is not numerically stable, which may cause the softmax result to not sum to 1. This PR fixes this by removing all the log2e-related calculation. * Remove redundant import
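As a point of reference, here is a toy NumPy sketch of the two mathematically equivalent exp formulations involved; it only illustrates the transformation that was removed and does not reproduce the kernel-level instability the PR fixes:

```
import numpy as np

LOG2E = np.float32(1.4426950408889634)  # rounded log2(e), as used by the removed path

def softmax_plain(logits, temperature):
    # Formulation kept after the fix: exp(x - max) / sum(exp(x - max)).
    x = (logits / temperature).astype(np.float32)
    x -= x.max()
    e = np.exp(x)
    return e / e.sum()

def softmax_log2e(logits, temperature):
    # Removed formulation: exp(v) rewritten as 2 ** (v * log2(e)).  Identical in
    # exact arithmetic, but the rounded LOG2E constant perturbs the intermediate
    # values, which the PR reports becoming problematic at low temperatures.
    x = (logits / temperature).astype(np.float32)
    x -= x.max()
    e = np.exp2(x * LOG2E)
    return e / e.sum()
```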
[Eagle] Fix missing broadcast in hidden states gather/scatter (#2271)
(#2272) This PR integrates pivot-based probability renormalization for top-p sampling, which is a few times faster than the current sort-based top-p sampling on CUDA.
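For context, a sort-based NumPy reference of what top-p renormalization computes; this PR's CUDA kernel finds the cutoff with a pivot search instead of a full sort, but the intended result is the same up to tie-breaking:

```
import numpy as np

def renorm_top_p(probs, top_p):
    # Keep the smallest set of highest-probability tokens whose cumulative mass
    # reaches top_p, zero out the rest, and renormalize.
    order = np.argsort(probs)[::-1]                    # tokens from most to least likely
    csum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(csum, top_p) + 1]   # tokens inside the nucleus
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

print(renorm_top_p(np.array([0.5, 0.3, 0.15, 0.05]), top_p=0.9))
# -> [0.5263...  0.3157...  0.1578...  0.]
```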
(#2275) This PR updates the error checking in JSONFFIEngine and related request parsing to use the Result class.
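The Result class itself lives in the C++ code; as a rough Python analogue of the pattern (names here are illustrative, not the MLC API), parsing returns either a value or an error message instead of throwing:

```
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")

@dataclass
class Result(Generic[T]):
    # Either `value` is set (success) or `error` holds a message (failure).
    value: Optional[T] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None

def parse_temperature(raw: str) -> Result[float]:
    try:
        t = float(raw)
    except ValueError:
        return Result(error=f"temperature is not a number: {raw!r}")
    if t < 0:
        return Result(error="temperature must be non-negative")
    return Result(value=t)

res = parse_temperature("-1")
print(res.ok, res.error)  # False "temperature must be non-negative"
```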
(mlc-ai#2277)
* improve Install via environment variable
* [HotFix] fix kv_cache_transpose_append buffer region
By default the prefill chunk size is set to the context window size or the sliding window size. When that number is large, our memory planning during model compilation allocates a lot of memory. Given that we support input chunking, we can reduce the prefill chunk size to a smaller value to save runtime memory. This PR caps the prefill chunk size at 2048.
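A sketch of the implied default (hypothetical helper, not the actual MLC code):

```
def default_prefill_chunk_size(context_window_size, sliding_window_size=-1, cap=2048):
    # Previously the default was the full context/sliding window size; since input
    # chunking is supported, the default is now capped (at most 2048 in this PR)
    # so compile-time memory planning allocates less memory.
    base = sliding_window_size if sliding_window_size > 0 else context_window_size
    return min(base, cap)

print(default_prefill_chunk_size(context_window_size=8192))  # 2048
print(default_prefill_chunk_size(context_window_size=1024))  # 1024
```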
[iOS] Initial scaffolding of LLMEngine in Swift. This PR adds the initial scaffolding of LLMEngine in Swift. We wrap the callback in an AsyncStream so it can be consumed with the `for await` API. We also add a minimal example app to showcase the new MLCEngine; the old ChatModule is still used in the MLCChat app. The return value is already converted to structs; we still need to do the same for the chat completion interface.
Using new Result interface Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu>
(mlc-ai#2559) This PR updates the tokenizer load logic so that we prioritize the HuggingFace and SentencePiece tokenizers over the ByteLevelBPE tokenizer. This fixes an issue where the `<im_start>` token in the Qwen model was tokenized into multiple tokens when the ByteLevelBPE tokenizer was chosen while others were available.
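A rough sketch of that prioritization (the file names follow common tokenizer conventions and are assumptions, not taken from the MLC source):

```
import os

def pick_tokenizer_backend(model_dir: str) -> str:
    # Prefer the HuggingFace tokenizer, then SentencePiece, and fall back to
    # ByteLevelBPE only when neither is present, so tokens such as `<im_start>`
    # in Qwen stay single tokens.
    if os.path.isfile(os.path.join(model_dir, "tokenizer.json")):
        return "huggingface"
    if os.path.isfile(os.path.join(model_dir, "tokenizer.model")):
        return "sentencepiece"
    if os.path.isfile(os.path.join(model_dir, "vocab.json")):
        return "byte_level_bpe"
    raise FileNotFoundError(f"no supported tokenizer files in {model_dir}")
```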
[Serve][Grammar] Jump-forward decoding

This PR supports jump-forward decoding as described in <https://lmsys.org/blog/2024-02-05-compressed-fsm/>. Jump-forward decoding uses the grammar constraint to predict the next output string, tokenizes that string into tokens, and therefore speeds up decoding.

This PR implements the following optimizations to ensure output quality:
- Retokenization in jump-forward: tokenize the last k tokens as a string with the predicted string appended. If the tokenization result differs from the old tokens, roll back those tokens and accept the new ones.
- Retokenization in decoding: tokenize the last k tokens as a string with the decoded token appended. This happens in the decode stage when jump-forward decoding took place in the previous round. If the result differs, the old tokens are rolled back.
- Skip prefix tokens in jump-forward: we call a token that is a prefix of another token a prefix token. If the last token produced by jump-forward is a prefix token, it is very likely to be rolled back in the next decode stage, because it may be combined with the decoded token. It also affects the output distribution, since such patterns are rare in training data. Therefore we skip the last prefix token in jump-forward decoding.

A sketch of the retokenization step is shown after the benchmark results below.

This PR also includes the following changes:
- Add several metrics for requests and the engine, especially around jump-forward decoding.
- Fix a bug in `_async_query_engine_metrics` to avoid throwing CancelledError from an early return.

Performance and benchmark:

Schema (Pydantic):
```
class Product(BaseModel):
    product_id: int
    is_available: bool
    price: float
    is_featured: Literal[True]
    category: Literal["Electronics", "Clothing", "Food"]
    tags: List[str]
    stock: Dict[str, int]
```

Platform: AMD Ryzen 9 5900X, NVIDIA 3080 10G

Engine metrics:

| Jump forward | Batch | engine_decode_time_sum | engine_jump_forward_time_sum | completion_tokens_sum | decode_tokens_sum | jump_forward_tokens_sum | decode_tokens_per_s |
|---|---|---|---|---|---|---|---|
| False | 1 | 0.4988938220000001 | 0 | 66 | 66 | 0 | 132.2926785010378 |
| True | 1 | 0.37242740600000007 | 0.027989265000000006 | 68 | 68 | 28 | 182.58591850246378 |
| False | 4 | 0.9106805410000002 | 0 | 261 | 261 | 0 | 286.5988546470984 |
| True | 4 | 0.6843025599999999 | 0.028089531999999997 | 266 | 266 | 112 | 388.71694415405966 |
| False | 8 | 1.62462493 | 0 | 538 | 538 | 0 | 331.1533573475325 |
| True | 8 | 1.0509048310000002 | 0.027971332000000022 | 525 | 525 | 224 | 499.5694990767436 |
| False | 16 | 2.317279175 | 0 | 1068 | 1068 | 0 | 460.8853398080531 |
| True | 16 | 1.3962938819999997 | 0.030129287999999994 | 1059 | 1059 | 448 | 758.4363246533227 |
Some improvements of the delivery script:
- provide different overrides for different quantization, e.g. we can change prefill chunk size for q0/q3/q4
- rerun gen config only if only conv_template changes
- do NOT recreate the HF repo when the repo already exists. This will preserve commit history
- dry-run validation
(mlc-ai#2566) This PR enhances error reporting for multi-GPU model compilation, so that we can report as many error reasons as possible before loading and running the models.
This adds an interface to the draft token state and the sampler so that the tree structure can be recorded and used for verification.
* [Bench] JSON mode bench. This PR refactors mlc bench to enable JSON mode in the dataset. * upd * fix lint
This PR introduces the multi-GPU support for the Qwen-MoE model. Validated on 4090x2.
This PR adds the missing fields that were not being cleared in `EngineMetrics::Reset`.
Update documentation for WebLLM. Currently we only provide a high-level view of the WebLLM runtime here and refer users to the WebLLM repo README for more detail. The documentation focuses on adding your own model variant / model library for WebLLM. Will follow up with more thorough runtime documentation.
This PR introduces a top-4 kernel for MoE models (particularly for Qwen-MoE) at the moment. It is still a manual implementation and has some duplication with the existing top-2 kernel. In the future we'll consider leveraging TIR meta-programming to unify the top-k kernel implementations.
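For reference, a NumPy sketch of the math a top-k gating kernel computes (one common formulation; the exact Qwen-MoE gating and the TIR kernel are not shown here):

```
import numpy as np

def moe_topk_gating(gate_logits, k=4):
    # For each token, select the k highest-scoring experts and softmax-normalize
    # their scores; returns expert indices and per-token mixing weights.
    topk_idx = np.argsort(gate_logits, axis=-1)[:, ::-1][:, :k]   # (tokens, k)
    topk_val = np.take_along_axis(gate_logits, topk_idx, axis=-1)
    topk_val = topk_val - topk_val.max(axis=-1, keepdims=True)
    w = np.exp(topk_val)
    return topk_idx, w / w.sum(axis=-1, keepdims=True)

logits = np.random.randn(3, 8)          # 3 tokens, 8 experts
idx, weights = moe_topk_gating(logits)  # top-4 experts per token
```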
This PR updates the Gemma config so that MLC can work properly with Gemma 1.1.
This PR adds support for hybrid prefill: during the prefill engine action, the engine also performs decode for running requests.
Update quick_start.rst: fix broken links for the convert-weights and compile-model pages.
This PR fixes a bug where the prefill finish time was not set, which resulted in a metrics error.
This PR updates the Android app to reduce the binary size. It can now be reduced to 108 MB when building with only the Phi-3-mini-4k model.
This PR fixes the Gemma config compatibility issue.
This PR fixes a bug in the debug_compare.py script.
This commit introduces the InternLM2 model support.
(mlc-ai#2615) This PR updates the prefix cache to align with the logic for enabling the sliding window. Now sliding window attention is enabled only for leaf sequences.
This PR updates the include directories for the Android app so that we can avoid using macros for src file includes.
```
@@ -429,109 +287,6 @@ def _quantize(  # pylint: disable=too-many-locals
        scale = topi.transpose(scale)
        return quantized_weight, scale

    def _quantize_float8(  # pylint: disable=too-many-locals
```
FYI @csullivan @JosephTheOctonaut , it looks like this merge removed some of the FP8 quantization functions, which may need to be restored.
Our FP8 quantization flow mainly uses functions defined in mlc-serve/slm, which unfortunately has now diverged from the upstream copy. E.g., the specific function `_quantize_float8` lives in slm/quantization/per_tensor_quantization.py for our flow. I'm not totally clear on the upstream-only flow for FP8, but @vinx13 has been updating that over time, so I'm assuming it's intact.

We probably want to look at unifying the flows down the line, if reasonable, to avoid the duplication and conflicts that Sung pointed out.
Summary

… (`cpp/`, `tests/`, etc.) are removed. For our usage, only `python/` and some of the scripts (e.g., `setup.py`) matter.