
Merge with mlc-ai/main (835223541d4135e511a50cba1deca06731b03abd, April 18th 2024) #260

Merged

merged 203 commits into octoml:mlc-serve-v0.2.0 on Apr 22, 2024

Conversation

sunggg (Member) commented Apr 22, 2024

No description provided.

MasterJH5574 and others added 30 commits February 22, 2024 17:46
This PR makes the decode attention kernel aware of
the WebGPU backend, so that it can make sure the total
number of threads does not exceed the 256-thread limit of WebGPU.

Co-authored-by: Bohan Hou <spectrometerh@gmail.com>
This PR refactors the existing logit processing pipeline
with a unified logit processor class. The logit processor class
exposes two functions:
- `InplaceUpdateLogits`, which takes in the raw logits produced
by the model, and applies logit bias (which is introduced in this PR),
presence/frequency/repetition penalties, and the token id mask in
order when needed.
- `ComputeProbsFromLogits`, which takes in the updated logits,
and invokes softmax with temperature to compute the probability
distribution.

The logit processor runs entirely on GPU. That is to say,
all the logit bias / penalty / mask application and the softmax
are backed by GPU kernels. This is a key difference compared
with the logit processing prior to this PR, where the processing
happened on CPU, and softmax also happened on CPU whenever any logit
processing was needed.

With the unified logit processor, we simplified the interface
for handling the model's output logits in engine actions to make it
cleaner. We also simplified the interface of the Sampler.

Preliminary results show that the LogitProcessor brings a slight
performance improvement when any processing is needed.
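For reference, a minimal CPU sketch (in numpy) of the two functions described above; the actual processor is backed by GPU kernels, and all argument names besides the two function names are illustrative assumptions.

```
import numpy as np

def inplace_update_logits(logits, logit_bias=None, presence_penalty=0.0,
                          frequency_penalty=0.0, token_counts=None, token_mask=None):
    # Apply per-token logit bias.
    if logit_bias is not None:
        for token_id, bias in logit_bias.items():
            logits[token_id] += bias
    # Apply frequency/presence penalties based on how often each token appeared.
    if token_counts is not None:
        logits -= frequency_penalty * token_counts
        logits -= presence_penalty * (token_counts > 0)
    # Mask out disallowed token ids.
    if token_mask is not None:
        logits[~token_mask] = -np.inf
    return logits

def compute_probs_from_logits(logits, temperature=1.0):
    # Softmax with temperature to obtain the sampling distribution.
    scaled = logits / max(temperature, 1e-6)
    scaled -= scaled.max()
    exp = np.exp(scaled)
    return exp / exp.sum()
```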
This PR introduces logprobs support with OpenAI API
compatibility. It enhances the sampler with a function to get
the top-probability tokens (supporting at most 5 tokens as of now).

To make it easy to pass logprob results back from the serving engine
to the frontend, we choose to pass logprob results as JSON strings
following the OpenAI API spec.

Unit tests are added to ensure the correctness of logprobs.
The logprobs support also works with speculative decoding.
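A rough sketch of how the top-probability tokens can be collected and serialized as a JSON string; the field names loosely follow the OpenAI logprobs format and are assumptions here.

```
import json
import numpy as np

def logprob_json(probs, sampled_token, top_k=5):
    # Pick the top-k highest-probability token ids (capped at 5, as noted above).
    top_ids = np.argsort(probs)[::-1][:top_k]
    result = {
        "token": int(sampled_token),
        "logprob": float(np.log(probs[sampled_token])),
        "top_logprobs": [
            {"token": int(i), "logprob": float(np.log(probs[i]))} for i in top_ids
        ],
    }
    # Serialize to a JSON string so it can be passed from the engine to the frontend.
    return json.dumps(result)
```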
This PR supports Mixtral in MLC serve. The main change is
introducing the Mistral conversation template to the Python registry
so that MLC Serve can use it.

Besides that, this PR updates the KV cache capacity analysis to
make the usage calculation more accurate, while staying
conservative since there is a known issue regarding batch-prefill
embedding taking which may lead to OOM. We will follow up
on the issue with a fix in the future and then enable the estimation
to use more GPU VRAM.
Prior to this PR, `u_char` was used, but it is not a standard
type in C++, which caused Windows build failures.

This PR fixes it by using `unsigned char`.
…#1852)

Instead of a Python function that returns an updated `IRModule`, the
new `optimize_mod_pipeline` function returns a `tvm.ir.transform.Pass`
which can be applied to an `IRModule`.
* Create __init__.py

* Add files via upload

* Update model.py

* Update model_preset.py

* Update conv_templates.cc

* Update internlm_loader.py

* Update internlm_quantization.py

* fix name of notes

* Update model.py

* Migration

* fix pylint issue

* fix pylint issue

* fix pylint error

* Update internlm_loader.py

* Update __init__.py

* Update __init__.py

* Delete python/mlc_chat/model/internlm/__init__.py

* Add files via upload
Prior to this commit, a model name with multiple path
components (e.g. `dist/models/group_name/model_name`) would have
duplicated path components
(e.g. `dist/group_name/artifact_path/group_name/libname.so`).
This commit resolves the duplication.
* [KVCache] Add max num threads to KVCache kernels, fix WebGPU

* Read max_num_threads_per_block when available

* Change merge state in place kernel

* Make attention decode aware of max num threads, not just webgpu

Co-authored-by: Egor Churaev <egor.churaev@gmail.com>

* Change util function name

---------

Co-authored-by: Egor Churaev <egor.churaev@gmail.com>
…1860)

This PR moves the import of transformers into the function body
of the tiktoken tokenizer conversion, so we do not have a hard dependency
on transformers.
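A minimal sketch of the lazy-import pattern this refers to (the function name and arguments here are hypothetical):

```
def convert_tiktoken_tokenizer(model_path: str):
    # Import transformers inside the function body so it is only required
    # when tiktoken tokenizer conversion is actually requested.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
```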
This PR adds RWKV5 support with RNNState, a similar interface to
PagedAttention.

Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Following mlc-ai#1854, this PR registers the ChatML conversation template.
Sets the entry functions for a module.  This utility is intended for
cases where a module contains several externally-exposed functions
and only one is desired for use (e.g. separating out a
`transform_params` function from an `IRModule` that also contains
inference functions).  This commit only updates the external
visibility, after which `relax.transform.DeadCodeElimination()` can be
applied.
…i#1856)

This allows it to be used as part of an optimization pipeline specified
as a `tvm.ir.transform.Sequential`.
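For illustration, a hedged sketch of composing passes this way; `custom_pass` and `mod` are assumed to exist in the surrounding build script.

```
import tvm
from tvm import relax

pipeline = tvm.ir.transform.Sequential(
    [
        custom_pass,                            # the pass described above (assumed handle)
        relax.transform.DeadCodeElimination(),  # clean up unused functions afterwards
    ]
)
mod = pipeline(mod)
```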
mlc-ai#1867)

This PR is the 3rd part of the grammar-guided generation work.
It integrates the grammar framework into the generation
process, and supports JSON output for now.

The API this PR provides is compatible with the OpenAI API.

### APIs
#### Python API
```
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class ResponseFormat:
    type: Literal["text", "json_object"] = "text"
    json_schema: Optional[str] = None

@dataclass
class GenerationConfig:
    response_format: ResponseFormat = ResponseFormat(type="text")
```

#### Rest API
```
response_format: { "type": "text" } # text generation, by default
response_format: { "type": "json_object" } # json generation
response_format: { "type": "json_object", json_schema="..."} # json generation with schema
```

JSON generation with schema is not supported yet,
but is planned for the future.

### Performance
#### Without JSON
```
Single token prefill latency: 891.2234 ms/tok
Single token decode latency: 31.3399 ms/tok
Prefill token throughput: 4693.3077 tok/s
Decode token throughput: 226.4406 tok/s
Overall token throughput: 470.3180 tok/s
```
#### With JSON
```
Single token prefill latency: 219.2287 ms/tok
Single token decode latency: 29.1399 ms/tok
Prefill token throughput: 7392.1555 tok/s
Decode token throughput: 179.2296 tok/s
Overall token throughput: 1052.1996 tok/s
```

We observed a slight decrease in performance under JSON mode.
This will be further optimized in the future.
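As a usage sketch of the REST API above (the endpoint path, port, and model name are assumptions):

```
import requests

payload = {
    "model": "Llama-2-7b-chat-hf-q4f16_1-MLC",
    "messages": [{"role": "user", "content": "List three fruits as a JSON object."}],
    "response_format": {"type": "json_object"},
}
resp = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```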
This PR brings field `n` to generation config and thereby
supports parallel generation. This parallel generation effectively
leverages the "fork" functionality of paged KV cache.

This PR supports specifying the number of parallel generations
`n` in the standard OpenAI ChatCompletion API. This is the last
feature needed for OpenAI API feature completeness.
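A hedged request sketch using `n` through the OpenAI-compatible endpoint (endpoint, port, and model name are assumptions):

```
import requests

payload = {
    "model": "Llama-2-7b-chat-hf-q4f16_1-MLC",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "n": 3,  # request three parallel generations
}
resp = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload)
for choice in resp.json()["choices"]:
    print(choice["message"]["content"])
```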
Sometimes SCM checkout can time out; this PR adds a retry for that.
Prior to this PR, the TIR attention kernels did not cast matmul
operands to fp32 before multiplying.
For models like Phi-2, which may have large Q/K/V values (at the level
of a few hundred), the fp16 multiplication exceeds the range of
fp16 and sometimes leads to the attention result being NaN.

This PR fixes this issue.
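A small numpy sketch of the cast-before-matmul idea (illustrative only, not the TIR kernel itself):

```
import numpy as np

def attention_probs(q_fp16, k_fp16, scale):
    # Cast the fp16 operands to fp32 before the matmul: with Q/K values in
    # the hundreds, the fp16 products can exceed the fp16 range (~65504)
    # and turn into inf, which later becomes NaN.
    scores = (q_fp16.astype(np.float32) @ k_fp16.astype(np.float32).T) * scale
    scores -= scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    # Probabilities are in [0, 1], so casting back to fp16 is safe.
    return probs.astype(np.float16)
```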
…lc-ai#1857)

Prior to this commit, the `ReorderTransformFunc` required several
components of the `ParamManager`.  The functionality it
provides, reordering dataflow blocks to minimize the liveset, is
useful outside of the context of the `ParamManager`.  This commit
makes the following changes, allowing it to be used independently of
the `ParamManager`.

- Generate the `pidx2binname` dictionary outside of `ReorderTransformFunc`

- Allow parameters to be separate `func.params`, rather than a single
  bundled tuple parameter.
This PR migrates Phi-2 to paged KV cache attention as a part of the model definition migration according to mlc-ai#1749.

Co-authored-by: Shrey Gupta <shrey2809@gmail.com>
…c-ai#1874)

The use of `call_inplace_packed` and `call_pure_packed` in the old
flow is outdated due to signature changes. This PR fixes the issue.
PR mlc-ai#1852 missed applying the BundleModelParams pass and thus made
the compiled models not runnable through ChatModule (mlc-ai#1864). This PR
fixes the issue.
As pointed out by mlc-ai#1830, this PR fixes the Android app download
link in docs.
This PR adopts suggestions from the support of OpenAI API parallel
generation `n` in mlc-ai#1868. The main update in this PR is to make
RequestState a standalone object class, which was previously a typedef
of `std::vector<RequestStateEntry>`.

This PR also fixes a bug in prefill that would cause engine failure
when `n` is large.
MasterJH5574 and others added 28 commits April 10, 2024 11:31
This PR fixes the picojson uses in MLC that conflict with the latest
changes on the picojson side.
…++ (mlc-ai#2112)

[Serve][Grammar] Porting the json schema converter from python to C++

This PR ports the json schema converter from python to C++. It defines
the interface:
```
std::string JSONSchemaToEBNF(
    std::string schema, std::optional<int> indent = std::nullopt,
    std::optional<std::pair<std::string, std::string>> separators = std::nullopt,
    bool strict_mode = true);
```

And uses it in BNFGrammar::FromSchema.

This helps cases where Python cannot be deployed.
1. Add Eagle-Llama-7b-chat model support.
2. Add speculative decoding support with Eagle.
This PR attaches the attributes of `tir.non_negative_var` for memory
planning.
This PR is a refactor of the engine's constructor interface
and the serve CLI interface.

This PR introduces the "mode" argument for the engine, which has options
"local", "interactive" and "server". The choice of mode affects
the automatically inferred values of `max_batch_size`,
`max_total_sequence_length` and `prefill_chunk_size` (only effective
when the arguments are not specified; once an argument is specified,
we will not override it). For a detailed specification of the modes,
please check out the CLI help messages in `mlc_llm/help.py` or the
engine constructor in `mlc_llm/serve/engine.py`.

No matter which mode is chosen, we print out the current mode
and the values of these arguments, so that people can understand the
settings of the engine. We also provide hints on how to adjust the
mode. For example,

```
[2024-04-12 16:12:26] INFO chat_module.py:379: Using model folder: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16-MLC
[2024-04-12 16:12:26] INFO chat_module.py:380: Using mlc chat config: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16-MLC/mlc-chat-config.json
[2024-04-12 16:12:26] INFO chat_module.py:529: Using library model: dist/Llama-2-7b-chat-hf-q0f16-MLC/Llama-2-7b-chat-hf-q0f16-MLC-cuda.so
[2024-04-12 16:12:26] INFO chat_module.py:379: Using model folder: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q4f16_1-MLC
[2024-04-12 16:12:26] INFO chat_module.py:380: Using mlc chat config: /home/ruihang/Workspace/mlc-llm/dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json
[2024-04-12 16:12:26] INFO chat_module.py:529: Using library model: dist/Llama-2-7b-chat-hf-q4f16_1-MLC/Llama-2-7b-chat-hf-q4f16_1-MLC-cuda.so
[2024-04-12 16:12:29] INFO engine_base.py:382: Engine mode is "local". Max batch size is set to 4. Max KV cache token capacity is set to 4096. Prefill chunk size is set to 4096.
[2024-04-12 16:12:29] INFO engine_base.py:387: Estimated total single GPU memory usage: 21543.74 MB (Parameters: 16467.64 MB. KVCache: 4450.07 MB. Temporary buffer: 626.03 MB). The actual usage might be slightly larger than the estimated number.
[2024-04-12 16:12:29] INFO engine_base.py:398: Please switch to mode "server" if you want to use more GPU memory and support more concurrent requests.
```

After the refactor, we bring speculative decoding to the serve
CLI so that people can use multiple models and run speculative
decoding with the server launched from the CLI (which was not doable before).
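A hedged sketch of what choosing a mode looks like from Python; the class and argument names are assumptions based on `mlc_llm/serve/engine.py` mentioned above, and the class is renamed to `LLMEngine` in a later commit below.

```
from mlc_llm.serve import Engine  # renamed to LLMEngine later in this PR

# Unspecified values such as max_batch_size, max_total_sequence_length and
# prefill_chunk_size are inferred from the chosen mode and logged at startup.
engine = Engine(
    model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC",
    mode="local",  # "local" | "interactive" | "server"
)
```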
This PR revamps the logging info for engine mode selection to provide
more detailed information and the rationale for the different modes.
This PR enables TP for Chatglm3 model.
)

Prior to this PR, due to the improper prefill policy on `n` (parallel
generation), the engine would loop forever when a request has `n`
larger than the maximum batch size that the engine can support.

This PR fixes this issue by updating the prefill action. With this
PR, even the "interactive" engine mode can well support multiple
parallel generations.

After this fix, it is possible that a request requires 10 parallel
generations while the max batch size is 1. Given that the shapes of the
temporary NDArrays in the GPU sampler are determined by the max batch
size, the GPU sampler does not natively support sampling 10 tokens at a
time. To address this issue, this PR introduces chunking to the GPU
sampler. In this particular case, the GPU sampler will have chunk size 1,
and the 10 required samples will be processed by the GPU sampler
one by one in order. Chunking is the minimal change we can make to
support large `n`.
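A minimal sketch of the chunking idea (the sampling callback is an assumed placeholder):

```
def sample_in_chunks(sample_inputs, max_batch_size, gpu_sample_fn):
    # The GPU sampler's temporary NDArrays are sized by max_batch_size, so an
    # oversized batch (e.g. n = 10 with max batch size 1) is split into chunks
    # that are sampled one after another, in order.
    results = []
    for start in range(0, len(sample_inputs), max_batch_size):
        chunk = sample_inputs[start : start + max_batch_size]
        results.extend(gpu_sample_fn(chunk))
    return results
```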
…2137)

This PR revamps the landing documentation page.

* The Python API panel is changed from showing ChatModule to showing
Engine.
* A new panel "REST Server" is added to show a quick start example
of launching the REST server and sending requests.
* A "what to do next" section is introduced at the bottom of the
landing page.

Todo items for future PRs:

* Add the Python API page with Engine.
* Revamp the weight conversion page.
* Revamp the model library compilation page.
The commit updates the target tags in order to identify the different
SoC hardware targets for further target-specific optimizations.

Meanwhile, it updates the Vulkan support for int64.
This PR updates the documentation with an introduction tutorial.
The landing page now directs to the quick start page and the tutorial.
…c-ai#2148)

This PR adds a new function `DebugCallFuncOnAllAllWorker`, which calls
a global function of signature `[] -> None` on all distributed workers
when tensor parallelism is enabled (or on the local session itself if not
enabled).

As the name suggests, this function is only for debugging purposes, and
we will not expose any public interface to invoke it.

This PR also introduces the global functions
`"mlc.debug_cuda_profiler_start"` and `"mlc.debug_cuda_profiler_stop"`,
which enable CUDA profiling when using PopenServer.
* [DOCS] Update introduction

Some minor tweaks on the introduction doc

* Update docs/get_started/introduction.rst

Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>

---------

Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
We rename the public Python serve interface from `Engine` to
`LLMEngine` (and from `AsyncEngine` to `AsyncLLMEngine` accordingly)
for better class name clarity.

This is because when people do wildcard imports,
the name `Engine` by itself does not convey enough meaning.
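In user code the rename amounts to updating imports, roughly as follows (the module path is assumed):

```
# from mlc_llm.serve import Engine, AsyncEngine        # before
from mlc_llm.serve import LLMEngine, AsyncLLMEngine    # after

engine = LLMEngine(model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC")
```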
* [Quantization] Add e4m3 mode and enable fp8 storage type

* add quantize linear flag
…c-ai#2158)

Revert "[Quantization] Add e4m3 mode and enable fp8 storage type (mlc-ai#2154)"

This reverts commit e9a4a0b.
This PR refactors EngineConfig for a cleaner interface of the internal
Engine constructor in MLC serve. This is a preparation step towards
the engine reload/unload support, which will be introduced in follow-up
PRs for JSONFFIEngine functionality on mobile and other platforms.
```
def transform(self) -> IRModule:
    """Entry point of the transformation"""
    for g_var, func in self.mod.functions_items():
        # TODO(@eric): This is a temporary hack to get around with two functions for BYOC.
```
Member Author:
@Lunderberg, please follow-up.

Member:

Can you describe what issue is occurring here? From a quick glance, it looks like this is a bug in remove_global_buf_alloc, in that it assumes any PrimFunc present in the IRModule will be a schedulable PrimFunc.

@sunggg sunggg merged commit d448fdb into octoml:mlc-serve-v0.2.0 Apr 22, 2024