v3.2.0
Important changes
- BREAKING CHANGE: Lots of changes around tool calling. Tool calling now fully matches the OpenAI return format (the `arguments` field is returned as a JSON-encoded string instead of a parsed JSON object), as shown in the sketch below. Several other tool-calling improvements and side-effect fixes are included.
- Added Gemma 3 support.
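A minimal sketch of how the breaking change affects clients, assuming a TGI server exposing the OpenAI-compatible API at `http://localhost:8080/v1` and the `openai` Python client; the endpoint URL, model name, and tool schema below are illustrative.

```python
# Sketch: reading tool-call arguments after this release.
# Assumes a local TGI server with the OpenAI-compatible API; the tool
# schema and endpoint are illustrative, not part of this release.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
)

tool_call = response.choices[0].message.tool_calls[0]
# `arguments` is now a JSON-encoded string (matching OpenAI), not a dict,
# so it must be parsed before use.
arguments = json.loads(tool_call.function.arguments)
print(tool_call.function.name, arguments)
```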
What's Changed
- fix(neuron): explicitly install toolchain by @dacorvo in #3072
- Only add token when it is defined. by @Narsil in #3073
- Making sure Olmo (transformers backend) works. by @Narsil in #3074
- Making `tool_calls` a vector. by @Narsil in #3075
- Nix: add `openai` to impure shell for integration tests by @danieldk in #3081
- Update `--max-batch-total-tokens` description by @alvarobartt in #3083
- Fix tool call2 by @Narsil in #3076
- Nix: the launcher needs a Python env with Torch for GPU detection by @danieldk in #3085
- Add request parameters to OTel span for `/v1/chat/completions` endpoint by @aW3st in #3000
- Add qwen2 multi lora layers support by @EachSheep in #3089
- Add modules_to_not_convert in quantized model by @jiqing-feng in #3053
- Small test and typing fixes by @danieldk in #3078
- hotfix: qwen2 formatting by @danieldk in #3093
- Pr 3003 ci branch by @drbh in #3007
- Update the llamacpp backend by @angt in #3022
- Fix qwen vl by @Narsil in #3096
- Update README.md by @celsowm in #3095
- Fix tool call3 by @Narsil in #3086
- Add gemma3 model by @mht-sharma in #3099
- Fix tool call4 by @Narsil in #3094
- Update neuron backend by @dacorvo in #3098
- Preparing release 3.2.0 by @Narsil in #3100
- Try to fix on main CI color. by @Narsil in #3101
New Contributors
- @EachSheep made their first contribution in #3089
- @jiqing-feng made their first contribution in #3053
Full Changelog: v3.1.1...v3.2.0