Merge with mlc-ai/main (adc6ee6ae2de97a507291aaff6279af4e3d16a83, July 2nd 2024) #272
Conversation
This PR migrates JSONFFIEngine to a formal namespace. It also lists TODOs to further simplify the JSONFFIEngine.
improve Install via environment variable
This PR integrates the sampling function from FlashInfer. For now we integrate the variant without top-p.
* add model lib delivery * fix lint
This PR simplifies the tool function names in encoding.h. The new names are:
- PrintAsUTF8
- PrintAsEscaped
- ParseNextUTF8
- ParseUTF8
- ParseNextUTF8OrEscaped

It also makes ParseNextUTF8 return the new char pointer instead of the number of chars processed, which simplifies the interface.
Use DraftTokenWorkspaceManager to maintain the workspace for draft probabilities and hidden states (if needed). This allows draft token states to be kept entirely on the GPU.
* Fix typo in event_tracer
[Model] Fix llama2 chat template and remove redundant separator added by engine (mlc-ai#2264)
(mlc-ai#2268)
* This PR refactors the EngineConfig to allow minimal JSON string passing. This is helpful for the JSONFFIEngine construction.
* This PR moves the automatic engine config inference from the Python side to the C++ side, so that we don't have duplicate code on multiple platforms.
* This PR renames `model_lib_path` to `model_lib`.
* This PR makes the reload/unload of ThreadedEngine act in a blocking style.
* This PR refactors the default generation config process flow and unifies everything in C++.
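As a rough illustration of the minimal JSON string passing described above (only the `model_lib` key is taken from this PR; the other field name and the paths are illustrative assumptions):

```
import json

# Hypothetical minimal engine config serialized to JSON for the FFI boundary.
engine_config = {
    "model": "./dist/Llama-3-8B-Instruct-q4f16_1-MLC",          # assumed field name
    "model_lib": "./dist/libs/Llama-3-8B-Instruct-q4f16_1.so",  # renamed from model_lib_path
}
engine_config_json = json.dumps(engine_config)
print(engine_config_json)  # this string is what the JSONFFIEngine would receive
```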
* [Serving] Add some try-except captures in AsyncMLCEngine
* [Fix] Fix the two-stage softmax func by removing log2e. When the two-stage softmax was introduced, we used a log2e numeric transformation for potentially better performance. However, at low temperature the log2e transformation is not numerically stable, which may cause the softmax result to not sum to 1. This PR fixes this by removing all the log2e-related calculation. * Remove redundant import
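As a point of reference, here is a toy NumPy sketch of the two mathematically equivalent exp formulations involved; it only illustrates the transformation that was removed and does not reproduce the kernel-level instability the PR fixes:

```
import numpy as np

LOG2E = np.float32(1.4426950408889634)  # rounded log2(e), as used by the removed path

def softmax_plain(logits, temperature):
    # Formulation kept after the fix: exp(x - max) / sum(exp(x - max)).
    x = (logits / temperature).astype(np.float32)
    x -= x.max()
    e = np.exp(x)
    return e / e.sum()

def softmax_log2e(logits, temperature):
    # Removed formulation: exp(v) rewritten as 2 ** (v * log2(e)).  Identical in
    # exact arithmetic, but the rounded LOG2E constant perturbs the intermediate
    # values, which the PR reports becoming problematic at low temperatures.
    x = (logits / temperature).astype(np.float32)
    x -= x.max()
    e = np.exp2(x * LOG2E)
    return e / e.sum()
```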
[Eagle] Fix missing broadcast in hidden states gather/scatter (#2271)
(#2272) This PR integrates pivot-based probability renormalization for top-p sampling, which is a few times faster than the current sort-based top-p sampling on CUDA.
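For context, a sort-based NumPy reference of what top-p renormalization computes; this PR's CUDA kernel finds the cutoff with a pivot search instead of a full sort, but the intended result is the same up to tie-breaking:

```
import numpy as np

def renorm_top_p(probs, top_p):
    # Keep the smallest set of highest-probability tokens whose cumulative mass
    # reaches top_p, zero out the rest, and renormalize.
    order = np.argsort(probs)[::-1]                    # tokens from most to least likely
    csum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(csum, top_p) + 1]   # tokens inside the nucleus
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

print(renorm_top_p(np.array([0.5, 0.3, 0.15, 0.05]), top_p=0.9))
# -> [0.5263...  0.3157...  0.1578...  0.]
```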
(#2275) This PR updates the error checking in JSONFFIEngine and related request parsing to use the Result class.
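The Result class itself lives in the C++ code; as a rough Python analogue of the pattern (names here are illustrative, not the MLC API), parsing returns either a value or an error message instead of throwing:

```
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")

@dataclass
class Result(Generic[T]):
    # Either `value` is set (success) or `error` holds a message (failure).
    value: Optional[T] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None

def parse_temperature(raw: str) -> Result[float]:
    try:
        t = float(raw)
    except ValueError:
        return Result(error=f"temperature is not a number: {raw!r}")
    if t < 0:
        return Result(error="temperature must be non-negative")
    return Result(value=t)

res = parse_temperature("-1")
print(res.ok, res.error)  # False "temperature must be non-negative"
```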
(mlc-ai#2277)
* improve Install via environment variable
* [HotFix] fix kv_cache_transpose_append buffer region
By default the prefill chunk size is set to the context window size or the sliding window size. When that number is large, our memory planning during model compilation allocates a lot of memory. Given that we support input chunking, we can reduce the prefill chunk size to a smaller value to save runtime memory. This PR caps the prefill chunk size at 2048.
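A sketch of the implied default (hypothetical helper, not the actual MLC code):

```
def default_prefill_chunk_size(context_window_size, sliding_window_size=-1, cap=2048):
    # Previously the default was the full context/sliding window size; since input
    # chunking is supported, the default is now capped (at most 2048 in this PR)
    # so compile-time memory planning allocates less memory.
    base = sliding_window_size if sliding_window_size > 0 else context_window_size
    return min(base, cap)

print(default_prefill_chunk_size(context_window_size=8192))  # 2048
print(default_prefill_chunk_size(context_window_size=1024))  # 1024
```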
[iOS] Initial scaffolding of LLMEngine in Swift. This PR adds the initial scaffolding of LLMEngine in Swift. We wrap the callback in an AsyncStream so it can be consumed with the `for await` API. We also add a minimal example app to showcase the new MLCEngine; the old ChatModule is still used in the MLCChat app. The return value is already converted to structs; we still need to do the same for the chat completion interface.
Using new Result interface Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu>
(mlc-ai#2559) This PR updates the tokenizer load logic so that we prioritize the HuggingFace and SentencePiece tokenizers over the ByteLevelBPE tokenizer. This fixes an issue where the `<im_start>` token in the Qwen model was tokenized into multiple tokens when the ByteLevelBPE tokenizer was chosen while others were available.
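A rough sketch of that prioritization (the file names follow common tokenizer conventions and are assumptions, not taken from the MLC source):

```
import os

def pick_tokenizer_backend(model_dir: str) -> str:
    # Prefer the HuggingFace tokenizer, then SentencePiece, and fall back to
    # ByteLevelBPE only when neither is present, so tokens such as `<im_start>`
    # in Qwen stay single tokens.
    if os.path.isfile(os.path.join(model_dir, "tokenizer.json")):
        return "huggingface"
    if os.path.isfile(os.path.join(model_dir, "tokenizer.model")):
        return "sentencepiece"
    if os.path.isfile(os.path.join(model_dir, "vocab.json")):
        return "byte_level_bpe"
    raise FileNotFoundError(f"no supported tokenizer files in {model_dir}")
```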
[Serve][Grammar] Jump-forward decoding

This PR supports jump-forward decoding as described in <https://lmsys.org/blog/2024-02-05-compressed-fsm/>. Jump-forward decoding uses the grammar constraint to predict the next output string, tokenizes that string into tokens, and therefore speeds up decoding.

This PR implements the following optimizations to ensure output quality:
- Retokenization in jump-forward: tokenize the last k tokens as a string with the predicted string appended. If the tokenization result differs from the old tokens, roll back those tokens and accept the new ones.
- Retokenization in decoding: tokenize the last k tokens as a string with the decoded token appended. This happens in the decode stage when jump-forward decoding took place in the previous round. If the result differs, the old tokens are rolled back.
- Skip prefix tokens in jump-forward: we call a token that is a prefix of another token a prefix token. If the last token produced by jump-forward is a prefix token, it is very likely to be rolled back in the next decode stage, because it may be combined with the decoded token. It also affects the output distribution, since such patterns are rare in training data. Therefore we skip the last prefix token in jump-forward decoding.

A sketch of the retokenization step is shown after the benchmark results below.

This PR also includes the following changes:
- Add several metrics for requests and the engine, especially around jump-forward decoding.
- Fix a bug in `_async_query_engine_metrics` to avoid throwing CancelledError from an early return.

Performance and benchmark:

Schema (Pydantic):
```
class Product(BaseModel):
    product_id: int
    is_available: bool
    price: float
    is_featured: Literal[True]
    category: Literal["Electronics", "Clothing", "Food"]
    tags: List[str]
    stock: Dict[str, int]
```

Platform: AMD Ryzen 9 5900X, NVIDIA 3080 10G

Engine metrics:

| Jump forward | Batch | engine_decode_time_sum | engine_jump_forward_time_sum | completion_tokens_sum | decode_tokens_sum | jump_forward_tokens_sum | decode_tokens_per_s |
|---|---|---|---|---|---|---|---|
| False | 1 | 0.4988938220000001 | 0 | 66 | 66 | 0 | 132.2926785010378 |
| True | 1 | 0.37242740600000007 | 0.027989265000000006 | 68 | 68 | 28 | 182.58591850246378 |
| False | 4 | 0.9106805410000002 | 0 | 261 | 261 | 0 | 286.5988546470984 |
| True | 4 | 0.6843025599999999 | 0.028089531999999997 | 266 | 266 | 112 | 388.71694415405966 |
| False | 8 | 1.62462493 | 0 | 538 | 538 | 0 | 331.1533573475325 |
| True | 8 | 1.0509048310000002 | 0.027971332000000022 | 525 | 525 | 224 | 499.5694990767436 |
| False | 16 | 2.317279175 | 0 | 1068 | 1068 | 0 | 460.8853398080531 |
| True | 16 | 1.3962938819999997 | 0.030129287999999994 | 1059 | 1059 | 448 | 758.4363246533227 |
Some improvements of the delivery script:
- provide different overrides for different quantization, e.g. we can change prefill chunk size for q0/q3/q4
- rerun gen config only if only conv_template changes
- do NOT recreate the HF repo when the repo already exists. This will preserve commit history
- dry-run validation
(mlc-ai#2566) This PR enhances error reporting for multi-GPU model compilation, so that we can report as many error reasons as possible before loading and running the models.
This adds an interface to the draft token state and the sampler so that the tree structure can be recorded and used for verification.
* [Bench] JSON mode bench. This PR refactors mlc bench to enable JSON mode in the dataset. * upd * fix lint
This PR introduces the multi-GPU support for the Qwen-MoE model. Validated on 4090x2.
This PR adds the missing fields that were not being cleared in `EngineMetrics::Reset`.
Update documentation for WebLLM. Currently we only provide a high-level view of the WebLLM runtime here and refer users to the WebLLM repo README for more detail. The documentation focuses on adding your own model variant / model library for WebLLM. Will follow up with more thorough runtime documentation.
This PR introduces a top-4 kernel for MoE models (particularly for Qwen-MoE) at the moment. It is still a manual implementation and has some duplication with the existing top-2 kernel. In the future we'll consider leveraging TIR meta-programming to unify the top-k kernel implementations.
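For reference, a NumPy sketch of the math a top-k gating kernel computes (one common formulation; the exact Qwen-MoE gating and the TIR kernel are not shown here):

```
import numpy as np

def moe_topk_gating(gate_logits, k=4):
    # For each token, select the k highest-scoring experts and softmax-normalize
    # their scores; returns expert indices and per-token mixing weights.
    topk_idx = np.argsort(gate_logits, axis=-1)[:, ::-1][:, :k]   # (tokens, k)
    topk_val = np.take_along_axis(gate_logits, topk_idx, axis=-1)
    topk_val = topk_val - topk_val.max(axis=-1, keepdims=True)
    w = np.exp(topk_val)
    return topk_idx, w / w.sum(axis=-1, keepdims=True)

logits = np.random.randn(3, 8)          # 3 tokens, 8 experts
idx, weights = moe_topk_gating(logits)  # top-4 experts per token
```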
This PR updates the Gemma config so that MLC can work properly with Gemma 1.1.
This PR adds support for hybrid prefill: during the prefill engine action, the engine also performs decode for running requests.
Update quick_start.rst: fix broken links for the convert-weights and compile-model pages.
This PR fixes a bug where the prefill finish time was not set, which resulted in a metrics error.
This PR updates the Android app to reduce the binary size. It can now be reduced to 108 MB when building with only the Phi-3-mini-4k model.
This PR fixes the Gemma config compatibility issue.
This PR fixes a bug in the debug_compare.py script.
This commit introduces the InternLM2 model support.
(mlc-ai#2615) This PR updates the prefix cache to align with the logic for enabling the sliding window. Now sliding window attention is enabled only for leaf sequences.
This PR updates the include directories for the Android app so that we can avoid using macros for src file includes.
```
@@ -429,109 +287,6 @@ def _quantize(  # pylint: disable=too-many-locals
        scale = topi.transpose(scale)
        return quantized_weight, scale

    def _quantize_float8(  # pylint: disable=too-many-locals
```
FYI @csullivan @JosephTheOctonaut , it looks like this merge removed some of the FP8 quantization functions, which may need to be restored.
Our FP8 quantization flow mainly uses functions defined in mlc-serve/slm, which unfortunately has now diverged from the upstream copy. E.g., the specific function `_quantize_float8` lives in slm/quantization/per_tensor_quantization.py for our flow. I'm not totally clear on the upstream-only flow for FP8, but @vinx13 has been updating that over time, so I'm assuming it's intact.

We probably want to look at unifying the flows down the line, if reasonable, to avoid the duplication and conflicts that Sung pointed out.
Summary

… (`cpp/`, `tests/`, etc.) are removed. For our usage, only `python/` and some of the scripts (e.g., `setup.py`) matter.