# v3.0.0
## TL;DR

Big new release, centered on chunking (chunked prefill) together with automatic configuration of prefill sizes and `max_new_tokens`.
Details: https://huggingface.co/docs/text-generation-inference/conceptual/chunking
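To make the TL;DR concrete, below is a minimal client sketch against a v3 server's OpenAI-compatible chat endpoint. The localhost URL, the placeholder model name, and the prompt are assumptions for illustration, not values shipped with this release.

```python
# A hedged sketch of querying a v3 server via the OpenAI-compatible
# chat endpoint. URL, model name, and prompt are illustrative assumptions.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # assumed local TGI instance
    json={
        "model": "tgi",  # TGI serves one model; this name is a placeholder
        "messages": [
            {"role": "user", "content": "Explain chunked prefill in one sentence."}
        ],
        # max_tokens is omitted on purpose: with auto max_new_tokens (#2803)
        # the server can pick a value based on the remaining context.
        "stream": False,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```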
## What's Changed
- feat: concat the adapter id to the model id in chat response by @drbh in #2779
- Move JSON grammar -> regex grammar conversion to the router by @danieldk in #2772
- Use FP8 KV cache when specified by compressed-tensors by @danieldk in #2761
- upgrade ipex cpu to fix coredump in tiiuae/falcon-7b-instruct (pageat… by @sywangyi in #2778
- Fix: docs typo by @jp1924 in #2777
- Support continue final message by @drbh in #2733 (see the sketch after this list)
- Fix doc. by @Narsil in #2792
- Removing ../ that broke the link by @Getty in #2789
- fix: add merge-lora arg for model id by @drbh in #2788
- fix: only use eos_token_id as pad_token_id if int by @dvrogozh in #2774
- Sync (most) server dependencies with Nix by @danieldk in #2782
- Saving some VRAM. by @Narsil in #2790
- fix: avoid setting use_sgmv if no kernels present by @drbh in #2796
- use oneapi 2024 docker image directly for xpu by @sywangyi in #2793
- feat: auto max_new_tokens by @OlivierDehaene in #2803
- Auto max prefill by @Narsil in #2797
- Adding A100 compute. by @Narsil in #2806
- Enable paligemma2 by @drbh in #2807
- Attempt for cleverer auto batch_prefill values (some simplifications). by @Narsil in #2808
- V3 doc by @Narsil in #2809
- Prep new version by @Narsil in #2810
- Hotfixing the link. by @Narsil in #2811
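As referenced above for #2733, here is a minimal sketch of the continue-final-message pattern: the request ends on an assistant turn, and the server extends that message instead of opening a new one. The endpoint URL and the message contents are assumptions for illustration.

```python
# Hedged sketch of "continue final message" (#2733): when the last chat
# message has role "assistant", generation continues that message rather
# than starting a new turn. URL and contents are illustrative assumptions.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # assumed local TGI instance
    json={
        "model": "tgi",  # placeholder model name
        "messages": [
            {"role": "user", "content": "Write a haiku about rivers."},
            # Ending on an assistant turn asks the server to extend this text.
            {"role": "assistant", "content": "Cold water whispers"},
        ],
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```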
## New Contributors
**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v2.4.1...v3.0.0