Lots of improvements (Still 2 allocators) #2449
Conversation
Force-pushed from 8be5e32 to c1572bf
Remove paged as a default too, and using FD everywhere.
Co-authored-by: drbh <david.richard.holtz@gmail.com>
Change `add_special_tokens` in order to have the correct tokens for chat input and not (since it's super important with the prefixing now)
Force-pushed from c70335f to ccaf1d0
Update server tests
- Default to throughput test in k6
- Use TGI_WIGGLE_ROOM to adjust wiggle room
} else {
    16
};
let prefix_caching =
In the future, this could be done by calling `Info` from the gRPC client.
use thiserror::Error;
use tracing_subscriber::{filter::LevelFilter, EnvFilter};

mod env_runtime;

fn get_config(
Same, this could be done through `Info` in the future.
I think I prefer doing it super early on, so that dependencies between flags can be determined early, and the regular code can just use a single resolved value.
But I see the point.
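A minimal sketch of that idea, purely illustrative (the actual launcher is Rust; `ResolvedConfig` and `resolve_config` are made-up names): all flag interdependencies are settled once, up front, and everything downstream only ever sees the final values.

import os
from dataclasses import dataclass

@dataclass(frozen=True)
class ResolvedConfig:
    attention: str
    prefix_caching: bool

def resolve_config(head_dim: int, is_vlm: bool) -> ResolvedConfig:
    # Illustrative only: resolve env/CLI inputs once, in one place.
    attention = os.environ.get("ATTENTION", "flashinfer")
    prefix_caching = os.environ.get("USE_PREFIX_CACHING", "1") == "1"
    # Flag dependencies are handled here: flashinfer is disabled on odd
    # head dims, and VLMs don't get prefix caching yet (per this PR).
    if head_dim % 2 == 1:
        attention = "flashdecoding"
    if is_vlm:
        prefix_caching = False
    return ResolvedConfig(attention=attention, prefix_caching=prefix_caching)

config = resolve_config(head_dim=128, is_vlm=False)
print(config)  # e.g. ResolvedConfig(attention='flashinfer', prefix_caching=True)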
/// already contain the templated input therefore
/// we shouldn't add the special tokens.
#[serde(default = "default_true", skip)]
pub add_special_tokens: bool,
Shouldn't we add this to the internal `tokenize`, `infer`... router functions instead?
This was way faster to do it this way (as we're passing `GenerateRequest` everywhere).
Using `#[serde(skip)]` means it's not settable through the public API, nor is it visible with utoipa.
In my first attempt I used a separate variable and found it just added lots of noise for nothing (we need that information on every `GenerateRequest`).
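For readers less familiar with serde, here is a rough Python analogue of what `#[serde(default = "default_true", skip)]` buys (the parsing helper below is hypothetical, not TGI code): the field exists on every request internally, with a default of true, but the public payload can never set it.

import json
from dataclasses import dataclass

@dataclass
class GenerateRequest:
    inputs: str
    # Internal-only flag with a default of True; not part of the public schema.
    add_special_tokens: bool = True

def parse_request(raw: str) -> GenerateRequest:
    data = json.loads(raw)
    data.pop("add_special_tokens", None)  # ignored even if a client sends it
    return GenerateRequest(inputs=data["inputs"])

req = parse_request('{"inputs": "Hello", "add_special_tokens": false}')
assert req.add_special_tokens is True  # the public API could not override it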
os.environ["USE_PREFIX_CACHING"] = "1"
os.environ["ATTENTION"] = "flashinfer"
?
This is for the server tests; those are run in isolation (without the router/launcher).
Therefore we need these variables to be set (they are enforced to exist in the Python code, in order to minimize issues in the resolution of their values).
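In other words (hypothetical helper below, not the actual TGI code), the server side treats these variables as required, so a missing value fails loudly rather than being silently defaulted:

import os

def require_env(name: str) -> str:
    # Fail loudly if the launcher (or the test setup) forgot to set the variable.
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"{name} must be set before importing the server")
    return value

# Server tests bypass the launcher, so they set the variables themselves:
os.environ["USE_PREFIX_CACHING"] = "1"
os.environ["ATTENTION"] = "flashinfer"

PREFIX_CACHING = require_env("USE_PREFIX_CACHING") == "1"
ATTENTION = require_env("ATTENTION")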
    device=device,
    dtype=torch.int32,
)
if cu_seqlen_q is None:
When can it be None, and shouldn't the arange take speculative tokens into account?
This is already done beforehand currently.
Although this should be fixed once speculation is correctly implemented with `cu_seqlen_q` and `cu_speculated_ids` or something similar (to get the offsets of the speculated `input_ids`, for instance).
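For context, a rough sketch of the shape of this (the speculative handling below is hypothetical; per the comment above it isn't wired up this way yet): in decode mode `cu_seqlen_q` is just an arange with one query token per request, and speculated tokens would widen each step.

import torch

def build_cu_seqlen_q(batch_size: int, speculated_tokens: int = 0,
                      device: str = "cpu") -> torch.Tensor:
    # Decode: one query token per request, plus any speculated tokens.
    step = 1 + speculated_tokens
    return torch.arange(
        0,
        (batch_size + 1) * step,
        step=step,
        device=device,
        dtype=torch.int32,
    )

print(build_cu_seqlen_q(3))                       # tensor([0, 1, 2, 3], dtype=torch.int32)
print(build_cu_seqlen_q(3, speculated_tokens=2))  # tensor([0, 3, 6, 9], dtype=torch.int32)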
    assert prefix_len > 0
    prefix_len -= 1
else:
    prefix_len = 0
Do we correctly default to 0 in the router?
Yes. However, we do not correctly account for leaving 1 token spare so that the prefill can actually occur normally.
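The rule in the hunk above, as a tiny self-contained sketch (the full-match guard is assumed from the decrement; names are illustrative): if the radix cache covers the whole input, give one token back so the prefill still has something to compute.

def effective_prefix_len(prefix_len: int, input_len: int) -> int:
    # Never let the cached prefix cover the entire input: prefill needs at
    # least one token to run and produce the first logits.
    if prefix_len == input_len:
        assert prefix_len > 0
        prefix_len -= 1
    return prefix_len

assert effective_prefix_len(prefix_len=0, input_len=5) == 0  # router default: no cache hit
assert effective_prefix_len(prefix_len=5, input_len=5) == 4  # full match: keep one token spare
assert effective_prefix_len(prefix_len=3, input_len=5) == 3  # partial match unchanged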
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
* Making prefix/flashinfer the default and testing the full release tests.
* Include flashinfer in the docker.
* Using prebuilt.
* Allowing window_left_size (dummy version).
* Disabling flashinfer/prefix caching on odd head_dim
* Disable prefix caching for lora.
* More specific codes.
* Update lock
* Updating integration tests with new values with FI/FD. Remove paged as a default too, and using FD everywhere.
* Update cargo lock ?
* Upgrade to 1.80 because of bitstream...
* Everywhere 1.80
* Forgot last default place.
* Apply suggestions from code review
  Co-authored-by: drbh <david.richard.holtz@gmail.com>
* Updated flake lock
* Tmp
* Upgrade resolution system for less errors in resolution.
* Remove lambda for cleaner function.
* Handling debugger.
* OVerride the env in server tests.
* Is this enough to make it work ?
* This seems to be working.
* Downgrade some logs.
* Fixing the default for vlm.
* Don't enable prefix caching on VLM just yet.
* Change `add_special_tokens` in order to have the correct tokens for chat input and not (since it's super important with the prefixing now)
* Fixing prefix caching for flashdecoding.
* Update all models.
* Fixed flashinfer version.
* add_special_tokens is internal only
* Fixing seqlen with the new vlms.
* Fixing the issue with `add_special_tokens` not being passed around.
* Fixing the test.
* Removing encoder_decoder (seq2seq).
* Update the chat test.
* Fixing the batching tokenization in flash causal lm.
* Truncating left for radix purposes.
* Oops this doesn't belong here.
* Put back default pure shell.
* Update server tests
  - Default to throughput test in k6
  - Use TGI_WIGGLE_ROOM to adjust wiggle room
* Only n_heads / process_group.size() are necessary.
* Revert the integrationt tests change (seem linked to head_size modification).
* Adding error message when assert is violated.
* Fixing the free algorithm to handle times where the common prefix is smaller.
* Apply suggestions from code review
  Co-authored-by: OlivierDehaene <olivier@huggingface.co>
* Update server/text_generation_server/layers/attention/common.py
  Co-authored-by: OlivierDehaene <olivier@huggingface.co>
* Fix disabling prefix caching - Fix windowing checks.
* Revert the Cohere tokenizer change (for now using a revision instead).
* Fmt.
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
What does this PR do?
Fixes # (issue)
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.