[doc] seq_scheduler_document #1336

Merged (3 commits, Nov 22, 2023)
seq_scheduler_tutorial
KexinFeng committed Nov 21, 2023
commit 6ef609ad4767ba12df56ec3575356e6cc7e9c714
99 changes: 99 additions & 0 deletions serving/docs/lmi/tutorials/seq_scheduler_tutorial.md
@@ -0,0 +1,99 @@
# Seq-Scheduler Tutorial

Seq-scheduler is a language model serving system built on the native Hugging Face inference backend, and it works with
any Hugging Face model out of the box. It features dynamic batching, dynamic batch splitting, and prompt caching, with
support for various autoregressive search methods, including greedy, sampling, and [contrastive search](https://huggingface.co/blog/introducing-csearch). The rest of
this tutorial shows how to run these features.

## Start a model server with a serving.properties file
The model server is configured by a stand-alone serving.properties file. Here the `/workspace/llama-2-70b-hf/` model is
used as an example.
```
# serving.properties

engine=MPI
option.model_id=/workspace/llama-2-70b-hf/
option.tensor_parallel_degree=1
option.dtype=fp16
option.trust_remote_code=true

# rolling-batch parameters
option.max_rolling_batch_size=64
option.rolling_batch=scheduler

# seq-scheduler parameters
option.max_sparsity=0.33
option.max_splits=3
option.decoding_strategy=contrastive # other options: greedy, sample
```

Then, inside a djl-serving Docker container, start the server by running the following command.
```bash
djl-serving -m <directory-of-serving.properties>
```

## Send a request to the server with a curl command
Next, a request can be sent to the model server for text generation by running a curl command. A generic request
looks like the following.
```bash
export CONCURRENCY=64
export REPETITION=3
TOKENIZER=/workspace/llama-2-70b-hf/ ./awscurl -c ${CONCURRENCY} -N ${REPETITION} -X POST http://127.0.0.1:8080/invocations \
--connect-timeout 60 \
-H "Content-type: application/json" \
-d '{"inputs":"The new movie that got Oscar this year","parameters":{"max_new_tokens":50, "do_sample":true},
"cached_prompt":"This is a prompt that will be cached and appended to the front of the inputs strings."}' \
-t \
-o output-test.txt
```
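
If you prefer to call the endpoint programmatically rather than with awscurl, the following is a minimal Python sketch
using the `requests` library. It sends a single request with the same payload as above; the concurrency and repetition
that awscurl handles are omitted here.

```python
import requests

# Same payload as the awscurl example above: inputs, generation
# parameters, and an optional cached prompt.
payload = {
    "inputs": "The new movie that got Oscar this year",
    "parameters": {"max_new_tokens": 50, "do_sample": True},
    "cached_prompt": "This is a prompt that will be cached and appended "
                     "to the front of the inputs strings.",
}

response = requests.post(
    "http://127.0.0.1:8080/invocations",
    json=payload,
    headers={"Content-type": "application/json"},
    timeout=60,
)
print(response.status_code)
print(response.text)
```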

## Dynamic batch splitting
In real applications of large language models (LLMs), input prompt lengths not only differ but can vary widely,
ranging over several orders of magnitude. The most common way to handle such inhomogeneous-length sequences is
padding.

Building on the padding solution, this section introduces the dynamic batch splitting feature, which optimally
splits the batch kv_cache into sub-batches to avoid excessive padding, and runs inference on each sub-batch
separately. The optimal partition is computed efficiently by a dynamic optimization process.
This allows serving prompts whose lengths vary widely.

### Max-sparsity parameters
* Feature: with the padding solution for variable-length inputs, a batch of lengths [300, 50] requires paddings of [0, 250], which causes a sparsity of 0.7 and wastes a large amount of memory. The max-sparsity threshold effectively solves this problem.
* Configuration:
  * `max_sparsity`: 0 - 1, default 0.33
  * `max_splits`: >= 1, default 3.


* `max_splits` limits the maximum number of sequential inference calls in one step of generation, which guarantees that latency does not grow too large.
* `max_sparsity` controls the sparsity caused by padding, and hence the number of splits.
  * When `max_sparsity` is small, high sparsity is not tolerated, so the number of splits increases. Memory is saved, but latency may increase due to the larger number of inference calls.
  * When `max_sparsity` is large, high sparsity is tolerated, so the number of splits decreases. More padding is needed, but latency decreases.
* Both parameters are set by passing their values through serving.properties, as shown in the section above.

### Demonstration

This demonstration shows how the same sequence-length array [38, 33, 24, 22] is dynamically partitioned under different max_sparsity thresholds. The partition is optimal in the sense that, for a given max_sparsity threshold, the amount of padding after partitioning is minimized, saving as much memory as possible.

```bash
seq_lengths: [38, 33, 24, 22]
max_sparsity: 0.01
num_splits: 4
partition: [[1], [2], [3], [4]]


seq_lengths: [38, 33, 24, 22]
max_sparsity: 0.025
num_splits: 3
partition: [[1], [2], [3, 4]]

seq_lengths: [38, 33, 24, 22]
max_sparsity: 0.1
num_splits: 2
partition: [[1, 2], [3, 4]]

seq_lengths: [38, 33, 24, 22]
max_sparsity: 0.9
num_splits: 1
partition: [[1, 2, 3, 4]]
```
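
To make the partitioning concrete, below is a minimal Python sketch of how such a partition could be computed by brute
force. It assumes that the sequences are sorted by descending length and split into contiguous groups, and that
sparsity is the total padding divided by the total number of real tokens (a definition consistent with the [300, 50]
example above, but still an assumption). The actual seq-scheduler computes the partition with a more efficient dynamic
optimization; this sketch is for illustration only.

```python
from itertools import combinations

def sparsity(groups):
    # Total padding divided by total real tokens, where each sub-batch
    # is padded to its own longest sequence (assumed definition).
    padding = sum(len(g) * max(g) - sum(g) for g in groups)
    tokens = sum(sum(g) for g in groups)
    return padding / tokens

def partition(seq_lengths, max_sparsity, max_splits):
    # Find the smallest number of contiguous splits whose sparsity stays
    # within max_sparsity; among those, pick the one with the least padding.
    n = len(seq_lengths)
    for num_splits in range(1, min(max_splits, n) + 1):
        best = None
        for cuts in combinations(range(1, n), num_splits - 1):
            bounds = (0, *cuts, n)
            groups = [seq_lengths[a:b] for a, b in zip(bounds, bounds[1:])]
            s = sparsity(groups)
            if s <= max_sparsity and (best is None or s < best[0]):
                best = (s, bounds)
        if best is not None:
            bounds = best[1]
            return [list(range(a + 1, b + 1)) for a, b in zip(bounds, bounds[1:])]
    # No partition within max_splits satisfies the threshold: one sequence per sub-batch.
    return [[i + 1] for i in range(n)]

for ms in (0.01, 0.025, 0.1, 0.9):
    print(f"max_sparsity={ms}: partition={partition([38, 33, 24, 22], ms, max_splits=4)}")
```

With these assumptions, the sketch reproduces the partitions shown in the demonstration output above.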