Merge with mlc-ai/main (adc6ee6ae2de97a507291aaff6279af4e3d16a83, July 2nd 2024) #272

Merged
merged 494 commits into mlc-serve-v0.2.0 on Jul 8, 2024

Conversation

sunggg (Member) commented Jul 3, 2024

Summary

rickzx and others added 30 commits April 27, 2024 15:52
This PR migrates JSONFFIEngine to a formal namespace.
It also lists TODOs to further simplify the JSONFFIEngine.
improve Install via environment variable
This PR integrates the sampling function in FlashInfer.
We integrate the one without top-p for now.
* add model lib delivery

* fix lint
This PR simplifies the tool function names in encoding.h. The new names are
- PrintAsUTF8
- PrintAsEscaped
- ParseNextUTF8
- ParseUTF8
- ParseNextUTF8OrEscaped

Also, ParseNextUTF8 now returns the new char pointer instead of the number of
chars processed, which makes the interface simpler.
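
To illustrate the pointer-returning interface style, here is a conceptual Python sketch of my own (not the C++ code in encoding.h; the function name and signature are hypothetical): the parser returns the decoded character together with the position right after it, so callers chain calls without tracking a separate "chars processed" count.

```
def parse_next_utf8(data: bytes, pos: int) -> tuple[int, int]:
    """Decode one UTF-8 character starting at `pos`; return (codepoint, next_pos)."""
    first = data[pos]
    if first < 0x80:                 # 1-byte (ASCII)
        return first, pos + 1
    if first >> 5 == 0b110:          # 2-byte sequence
        n, cp = 2, first & 0x1F
    elif first >> 4 == 0b1110:       # 3-byte sequence
        n, cp = 3, first & 0x0F
    elif first >> 3 == 0b11110:      # 4-byte sequence
        n, cp = 4, first & 0x07
    else:
        raise ValueError(f"invalid UTF-8 leading byte at position {pos}")
    for b in data[pos + 1 : pos + n]:
        cp = (cp << 6) | (b & 0x3F)  # accumulate 6 bits per continuation byte
    return cp, pos + n

# The caller feeds the returned position straight into the next call.
data = "héllo".encode("utf-8")
pos = 0
while pos < len(data):
    cp, pos = parse_next_utf8(data, pos)
    print(hex(cp))
```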
Use DraftTokenWorkspaceManager to maintain the workspace for draft probs
and hidden states (if needed). This allows the draft token states to be
kept fully on GPU.
[Model] Fix llama2 chat template and remove redundant separator added by engine (mlc-ai#2264)
… (mlc-ai#2268)

* This PR refactors the EngineConfig to allow minimal JSON string
passing. This is helpful for the JSONFFIEngine construction.
* This PR moves the automatic engine config inference from Python side
to C++ side, so that we don't have duplicate code on multiple platforms.
* This PR renames `model_lib_path` to `model_lib`.
* This PR makes the reload/unload of ThreadedEngine act in a blocking
style.
* This PR refactors the default generation config process flow,
and unifies everything to C++.
* [Serving] Add some try-except captures in AsyncMLCEngine
* [Fix] Fix the two-stage softmax func by removing log2e

When two-stage softmax was introduced, we used a log2e numeric
transformation for potentially better performance.

However, at low temperatures the log2e transformation is not
numerically stable, which may cause the softmax result to not
sum to 1.

This PR fixes this by removing all the log2e-related calculation; a small
numerical sketch of the two-stage structure follows below.

* Remove redundant import
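
For context, here is a NumPy sketch of the two-pass softmax structure using plain exp, as the fix describes. This is my own illustration of the idea, not the actual TIR kernel.

```
import numpy as np

def two_stage_softmax(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Two-pass softmax: pass 1 tracks the running max and exp-sum, pass 2 normalizes.

    Plain exp() is used throughout; the removed log2e trick instead rewrote
    exp(x) as exp2(x * log2(e)) for speed.
    """
    x = logits.astype(np.float32) / np.float32(temperature)
    # Pass 1: running max and running sum of exp(x - max).
    m = np.float32(-np.inf)
    s = np.float32(0.0)
    for v in x:
        new_m = max(m, v)
        s = s * np.exp(m - new_m) + np.exp(v - new_m)
        m = new_m
    # Pass 2: normalize.
    return np.exp(x - m) / s

probs = two_stage_softmax(np.array([3.2, 1.1, -0.7, 5.0]), temperature=0.05)
print(probs, probs.sum())  # the probabilities should sum to (very nearly) 1
```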
[Eagle] Fix missing broadcast in hidden states gather/scatter (#2271)
… (#2272)

This PR integrates the pivot-based prob renormalization for top-p
sampling, which is a few times faster than the current sort-based
top-p sampling on CUDA.
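
For reference, a NumPy sketch of the sort-based renormalization being replaced; as I understand it, the pivot-based kernel computes the same result by searching for a probability threshold (pivot) instead of fully sorting, which parallelizes better on GPU.

```
import numpy as np

def renorm_top_p_sort_based(probs: np.ndarray, top_p: float) -> np.ndarray:
    """Reference sort-based top-p renormalization.

    Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches top_p, zero out the rest, and renormalize.
    """
    order = np.argsort(-probs)            # token indices in descending probability
    sorted_probs = probs[order]
    cumulative = np.cumsum(sorted_probs)
    # First position where the cumulative mass reaches top_p (inclusive).
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    kept = np.zeros_like(probs)
    kept[order[:cutoff]] = sorted_probs[:cutoff]
    return kept / kept.sum()

probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
print(renorm_top_p_sort_based(probs, top_p=0.8))  # renormalized over {0.5, 0.2, 0.15}
```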
… (#2275)

This PR updates the error checking in JSONFFIEngine and related request
parsing to use the Result class.
… (mlc-ai#2277)

* improve Install via environment variable

* [HotFix] fix kv_cache_transpose_append buffer region
By default the prefill chunk size is set to the context window size
or the sliding window size. When that number is large, our memory
planning during model compilation will allocate a lot of memory.

Given we have support for input chunking, we can reduce the prefill
chunk size to a smaller value to save runtime memory.

This PR sets the prefill chunk size to be at most 2048.
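
A minimal sketch of the cap described above; the function and field names here are illustrative, not the actual config code.

```
def infer_prefill_chunk_size(context_window_size: int,
                             sliding_window_size: int = -1) -> int:
    """Default the prefill chunk size to the effective window, capped at 2048.

    -1 marks an unused sliding window. With input chunking, a smaller chunk
    size only splits long prompts into more prefill steps; it does not limit
    the context length.
    """
    window = sliding_window_size if sliding_window_size != -1 else context_window_size
    return min(window, 2048)

print(infer_prefill_chunk_size(context_window_size=32768))        # -> 2048
print(infer_prefill_chunk_size(context_window_size=32768,
                               sliding_window_size=1024))         # -> 1024
```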
[iOS] Initial scaffolding of LLMEngine in Swift

This PR adds the initial scaffolding of LLMEngine in Swift.
We wrap the callbacks into AsyncStream so they can be consumed via the `for await` API.

We also added a minimal example app to showcase the new MLCEngine;
the old ChatModule is still used in the MLCChat app.

The return value already uses structured types.
We still need to do the same for the chat completion interface.
Using new Result interface

Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu>
Neet-Nestor and others added 26 commits June 9, 2024 18:18
… (mlc-ai#2559)

This PR updates the tokenizer load logic so that we prioritize
the use of HuggingFace and SentencePiece tokenizers over the
ByteLevelBPE tokenizer.

This fixes the issue that the token `<im_start>` in the Qwen model
was tokenized into multiple tokens because the ByteLevelBPE tokenizer
was chosen whenever it was available.
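
A hedged sketch of the prioritization idea; the file names and exact fallback order here are my assumptions, not necessarily the implementation.

```
from pathlib import Path

def detect_tokenizer_kind(model_dir: str) -> str:
    """Prefer HuggingFace / SentencePiece tokenizers; use ByteLevelBPE only as a last resort.

    This ordering avoids special tokens (such as the `<im_start>` case above)
    being split into multiple tokens by a ByteLevelBPE tokenizer even though
    a richer tokenizer definition is available.
    """
    root = Path(model_dir)
    if (root / "tokenizer.json").exists():        # HuggingFace tokenizers
        return "huggingface"
    if (root / "tokenizer.model").exists():       # SentencePiece
        return "sentencepiece"
    if (root / "vocab.json").exists() and (root / "merges.txt").exists():
        return "byte_level_bpe"                   # last resort
    raise FileNotFoundError(f"no known tokenizer files under {model_dir}")
```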
[Serve][Grammar] Jump-forward decoding

This PR supports jump-forward decoding as described in
<https://lmsys.org/blog/2024-02-05-compressed-fsm/>. Jump-forward
decoding uses the grammar constraint to predict the next output string,
tokenizes that string into tokens, and therefore speeds up decoding.

This PR implements these optimizations to ensure output quality:
- Retokenization in jump-forward: tokenize the last k tokens as a string with the predicted
  string appended (see the sketch after this list). If the tokenization result differs from
  the old tokens, roll back those tokens and accept the new ones.
- Retokenization in decoding: tokenize the last k tokens as a string with the decoded
  token appended. This happens in the decode stage when jump-forward decoding ran in the
  previous round. If the result differs, the old tokens are rolled back.
- Skip prefix tokens in jump-forward: we call a token that is a prefix of another token
  a prefix token. If the last token from jump-forward is a prefix token, it is very likely
  to be rolled back in the next decode stage, as it may be combined with the decoded token.
  It also affects the output distribution, since such a pattern is rare in training data.
  Therefore, we skip the last prefix token in jump-forward decoding.
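
As referenced in the first optimization above, here is a simplified sketch of the retokenization check, written against a generic tokenizer with `encode`/`decode`. It is a conceptual illustration, not the engine code.

```
def retokenize_after_jump_forward(tokenizer, output_tokens, predicted_string, k=5):
    """Re-tokenize the last k tokens together with the jump-forward string.

    If the combined text tokenizes differently from the old suffix, roll the
    old suffix back and accept the new tokenization instead.
    """
    suffix = output_tokens[-k:] if k else []
    text = tokenizer.decode(suffix) + predicted_string
    new_suffix = tokenizer.encode(text)
    if new_suffix[: len(suffix)] == suffix:
        # The old tokens are a prefix of the new ones: just append the rest.
        return output_tokens + new_suffix[len(suffix):]
    # Otherwise roll back the last k tokens and accept the retokenized suffix.
    return output_tokens[: len(output_tokens) - len(suffix)] + new_suffix
```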

This PR also includes the following changes:
- Add several metrics for request and engine, especially about the jumpforward decoding
- Fix a bug in `_async_query_engine_metrics` to avoid throwing CancelledError from early return

Performance and benchmark:

Schema(Pydantic):
```
class Product(BaseModel):
    product_id: int
    is_available: bool
    price: float
    is_featured: Literal[True]
    category: Literal["Electronics", "Clothing", "Food"]
    tags: List[str]
    stock: Dict[str, int]
```

Platform: AMD Ryzen 9 5900X, NVIDIA 3080 10G

Results:
```
Jump forward: False, Batch: 1
Engine metrics:
{
    "engine_decode_time_sum": 0.4988938220000001,
    "engine_jump_forward_time_sum": 0,
    "completion_tokens_sum": 66,
    "decode_tokens_sum": 66,
    "jump_forward_tokens_sum": 0,
    "decode_tokens_per_s": 132.2926785010378,
}
Jump forward: True, Batch: 1
Engine metrics:
{
    "engine_decode_time_sum": 0.37242740600000007,
    "engine_jump_forward_time_sum": 0.027989265000000006,
    "completion_tokens_sum": 68,
    "decode_tokens_sum": 68,
    "jump_forward_tokens_sum": 28,
    "decode_tokens_per_s": 182.58591850246378,
}
Jump forward: False, Batch: 4
Engine metrics:
{
    "engine_decode_time_sum": 0.9106805410000002,
    "engine_jump_forward_time_sum": 0,
    "completion_tokens_sum": 261,
    "decode_tokens_sum": 261,
    "jump_forward_tokens_sum": 0,
    "decode_tokens_per_s": 286.5988546470984,
}
Jump forward: True, Batch: 4
Engine metrics:
{
    "engine_decode_time_sum": 0.6843025599999999,
    "engine_jump_forward_time_sum": 0.028089531999999997,
    "completion_tokens_sum": 266,
    "decode_tokens_sum": 266,
    "jump_forward_tokens_sum": 112,
    "decode_tokens_per_s": 388.71694415405966,
}
Jump forward: False, Batch: 8
Engine metrics:
{
    "engine_decode_time_sum": 1.62462493,
    "engine_jump_forward_time_sum": 0,
    "completion_tokens_sum": 538,
    "decode_tokens_sum": 538,
    "jump_forward_tokens_sum": 0,
    "decode_tokens_per_s": 331.1533573475325,
}
Jump forward: True, Batch: 8
Engine metrics:
{
    "engine_decode_time_sum": 1.0509048310000002,
    "engine_jump_forward_time_sum": 0.027971332000000022,
    "completion_tokens_sum": 525,
    "decode_tokens_sum": 525,
    "jump_forward_tokens_sum": 224,
    "decode_tokens_per_s": 499.5694990767436,
}
Jump forward: False, Batch: 16
Engine metrics:
{
    "engine_decode_time_sum": 2.317279175,
    "engine_jump_forward_time_sum": 0,
    "completion_tokens_sum": 1068,
    "decode_tokens_sum": 1068,
    "jump_forward_tokens_sum": 0,
    "decode_tokens_per_s": 460.8853398080531,
}
Jump forward: True, Batch: 16
Engine metrics:
{
    "engine_decode_time_sum": 1.3962938819999997,
    "engine_jump_forward_time_sum": 0.030129287999999994,
    "completion_tokens_sum": 1059,
    "decode_tokens_sum": 1059,
    "jump_forward_tokens_sum": 448,
    "decode_tokens_per_s": 758.4363246533227,
}
```
Some improvements to the delivery script:

- provide different overrides for different quantizations, e.g. we can change the
prefill chunk size for q0/q3/q4
- rerun only gen config when just the conv_template changes
- do NOT recreate the HF repo when the repo already exists, which preserves
commit history
- dry-run validation
… (mlc-ai#2566)

This PR enhances the error reporting for multi-GPU model compilation,
so we can provide as many error reasons as possible before loading and
running the models.
This adds an interface to the draft token state and the sampler so that the tree
structure can be recorded and used for verification.
* [Bench] Json mode bench

This PR refactors mlc bench to enable json mode in dataset.

* upd

* fix lint
This PR introduces the multi-GPU support for the Qwen-MoE model.
Validated on 4090x2.
This PR adds the missing fields that were not cleared in
`EngineMetrics::Reset`.
Update documentation for WebLLM. Currently we only provide a high-level view of the WebLLM runtime here, and refer users to the WebLLM repo README for more. The documentation focuses on adding your own model variant / model library for WebLLM. We will follow up with more thorough runtime documentation.
This PR introduces a top-4 kernel for MoE models (particularly for
Qwen-MoE at this moment).

This is still a manual implementation and has some duplication
with the existing top-2 kernel. In the future we'll consider leveraging
meta-programming of TIR to unify the top-k kernel implementations.
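
For intuition, here is a NumPy sketch of the top-4 expert gating math such a kernel implements. This is my own illustration; the real kernel is a fused TIR implementation, and whether scores are normalized before or after expert selection is model-specific.

```
import numpy as np

def moe_topk_gate(router_logits: np.ndarray, k: int = 4):
    """Pick the top-k experts per token and softmax-normalize their scores.

    router_logits: (num_tokens, num_experts). Returns (indices, weights),
    both of shape (num_tokens, k).
    """
    # Indices of the k largest logits per row (unsorted), then sorted by score.
    topk_idx = np.argpartition(-router_logits, k - 1, axis=-1)[:, :k]
    topk_val = np.take_along_axis(router_logits, topk_idx, axis=-1)
    order = np.argsort(-topk_val, axis=-1)
    topk_idx = np.take_along_axis(topk_idx, order, axis=-1)
    topk_val = np.take_along_axis(topk_val, order, axis=-1)
    # Softmax over the selected experts only.
    e = np.exp(topk_val - topk_val.max(axis=-1, keepdims=True))
    return topk_idx, e / e.sum(axis=-1, keepdims=True)

logits = np.random.randn(2, 60).astype(np.float32)  # e.g. 60 routed experts
idx, w = moe_topk_gate(logits, k=4)
print(idx, w.sum(axis=-1))                          # weights sum to 1 per token
```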
This PR updates the Gemma config so that MLC can work properly with
Gemma 1.1.
This PR adds support for hybrid prefill, so during the prefill
engine action it will also run decode for the running requests.
Update quick_start.rst

Fix broken links for convert weights and compile model pages
This PR fixes a bug that failed to set the prefill finish time
and resulted in a metrics error.
This PR updates the Android app to reduce the binary size.
Right now it can be reduced to 108MB when only building with the
Phi-3-mini-4k model.
This PR fixes the Gemma config compatibility issue.
This PR fixes a bug of the debug_compare.py script.
This commit introduces the InternLM2 model support.
… (mlc-ai#2615)

This PR updates the prefix cache to align with the logic of enabling the sliding window. Now sliding window attention is enabled only for leaf sequences.
This PR updates the include directories for the Android app
so that we can avoid using macros for source file includes.
sunggg mentioned this pull request Jul 7, 2024
sunggg merged commit 843ad8a into mlc-serve-v0.2.0 on Jul 8, 2024
@@ -429,109 +287,6 @@ def _quantize( # pylint: disable=too-many-locals
scale = topi.transpose(scale)
return quantized_weight, scale

def _quantize_float8( # pylint: disable=too-many-locals
Member

FYI @csullivan @JosephTheOctonaut, it looks like this merge removed some of the FP8 quantization functions, which may need to be restored.

Member

Our FP8 quantization flow mainly uses functions defined in mlc-serve/slm, which unfortunately has now diverged from the upstream copy. E.g., the specific function _quantize_float8 lives in slm/quantization/per_tensor_quantization.py for our flow. I'm not totally clear on the upstream-only flow for FP8, but @vinx13 has been updating that over time, so I'm assuming it's intact.

We probably want to look into unifying the flows down the line, if reasonable, to avoid the duplication and conflicts that Sung pointed out.
