# v3.0.0
## TL;DR

Big new release, centered on chunking (chunked prefill) together with automatic configuration of prefill sizes and `max_new_tokens`.
Details: https://huggingface.co/docs/text-generation-inference/conceptual/chunking
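To make the TL;DR concrete, below is a minimal client sketch against a v3 server's OpenAI-compatible chat endpoint. The localhost URL, the placeholder model name, and the prompt are assumptions for illustration, not values shipped with this release.

```python
# A hedged sketch of querying a v3 server via the OpenAI-compatible
# chat endpoint. URL, model name, and prompt are illustrative assumptions.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # assumed local TGI instance
    json={
        "model": "tgi",  # TGI serves one model; this name is a placeholder
        "messages": [
            {"role": "user", "content": "Explain chunked prefill in one sentence."}
        ],
        # max_tokens is omitted on purpose: with auto max_new_tokens (#2803)
        # the server can pick a value based on the remaining context.
        "stream": False,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```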
## What's Changed
- feat: concat the adapter id to the model id in chat response by @drbh in #2779
- Move JSON grammar -> regex grammar conversion to the router by @danieldk in #2772
- Use FP8 KV cache when specified by compressed-tensors by @danieldk in #2761
- upgrade ipex cpu to fix coredump in tiiuae/falcon-7b-instruct (pageat… by @sywangyi in #2778
- Fix: docs typo by @jp1924 in #2777
- Support continue final message by @drbh in #2733 (see the sketch after this list)
- Fix doc. by @Narsil in #2792
- Removing ../ that broke the link by @Getty in #2789
- fix: add merge-lora arg for model id by @drbh in #2788
- fix: only use eos_token_id as pad_token_id if int by @dvrogozh in #2774
- Sync (most) server dependencies with Nix by @danieldk in #2782
- Saving some VRAM. by @Narsil in #2790
- fix: avoid setting use_sgmv if no kernels present by @drbh in #2796
- use oneapi 2024 docker image directly for xpu by @sywangyi in #2793
- feat: auto max_new_tokens by @OlivierDehaene in #2803
- Auto max prefill by @Narsil in #2797
- Adding A100 compute. by @Narsil in #2806
- Enable paligemma2 by @drbh in #2807
- Attempt for cleverer auto batch_prefill values (some simplifications). by @Narsil in #2808
- V3 doc by @Narsil in #2809
- Prep new version by @Narsil in #2810
- Hotfixing the link. by @Narsil in #2811
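As referenced above for #2733, here is a minimal sketch of the continue-final-message pattern: the request ends on an assistant turn, and the server extends that message instead of opening a new one. The endpoint URL and the message contents are assumptions for illustration.

```python
# Hedged sketch of "continue final message" (#2733): when the last chat
# message has role "assistant", generation continues that message rather
# than starting a new turn. URL and contents are illustrative assumptions.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # assumed local TGI instance
    json={
        "model": "tgi",  # placeholder model name
        "messages": [
            {"role": "user", "content": "Write a haiku about rivers."},
            # Ending on an assistant turn asks the server to extend this text.
            {"role": "assistant", "content": "Cold water whispers"},
        ],
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```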
## New Contributors
**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v2.4.1...v3.0.0