Releases: huggingface/text-generation-inference
v1.3.4
What's Changed
- feat: relax mistral requirements by @OlivierDehaene in #1351
- fix: fix logic if sliding window key is not present in config by @OlivierDehaene in #1352
- fix: fix offline (#1341) by @OlivierDehaene in #1347
- fix: fix gpt-q with groupsize = -1 by @OlivierDehaene in #1358
- Peft safetensors. by @Narsil in #1364
- Change URL for Habana Gaudi support in doc by @regisss in #1343
- feat: update exllamav2 kernels by @OlivierDehaene in #1370
- Fix local load for peft by @Narsil in #1373
Full Changelog: v1.3.3...v1.3.4
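The sliding-window fix in #1352 guards against model configs that omit the `sliding_window` key entirely rather than setting it to null. A minimal sketch of that kind of defensive lookup (function name and semantics hypothetical, not TGI's actual code):

```python
from typing import Optional


def get_sliding_window(config: dict) -> Optional[int]:
    # Some checkpoints omit "sliding_window" entirely, others set it to null;
    # treat a missing key, an explicit null, and a non-positive value all as
    # "no sliding window configured".
    value = config.get("sliding_window")
    if value is None or value <= 0:
        return None
    return value


print(get_sliding_window({"sliding_window": 4096}))  # 4096
print(get_sliding_window({}))                        # None
print(get_sliding_window({"sliding_window": None}))  # None
```

Reading the key with `.get()` instead of indexing is what keeps configs without the key from raising a `KeyError`.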
v1.3.3
What's Changed
- fix gptq params loading
- improve decode latency for long sequences twofold
- feat: add more latency metrics in forward by @OlivierDehaene in #1346
- fix: max_past default value must be -1, not 0 by @OlivierDehaene in #1348
Full Changelog: v1.3.2...v1.3.3
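The `max_past` fix in #1348 hinges on `-1` being the sentinel for "no limit": a default of `0` would mask out all past tokens. A toy illustration of the distinction (helper name hypothetical):

```python
def effective_window(max_past: int, seq_len: int) -> int:
    # max_past == -1 means "no limit": attend over the whole sequence.
    # A default of 0 would instead hide every past token, which is the bug
    # the fix addresses.
    return seq_len if max_past == -1 else min(max_past, seq_len)


print(effective_window(-1, 1024))   # 1024
print(effective_window(512, 1024))  # 512
print(effective_window(0, 1024))    # 0  <- why 0 is a bad default
```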
v1.3.2
What's Changed
- fix: support null sliding window for mistral models by @OlivierDehaene in #1337
- feat: add quant to mixtral by @OlivierDehaene in #1337
Full Changelog: v1.3.1...v1.3.2
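Supporting a null sliding window (#1337) means falling back to plain causal attention when no window is configured. A self-contained sketch of the masking rule (pure-Python boolean mask, not TGI's kernel code):

```python
def causal_mask(seq_len, sliding_window=None):
    # Row i may attend to column j when j <= i (causality) and, if a sliding
    # window is configured, when j lies within the last `sliding_window`
    # positions. sliding_window=None (a Mistral config with a null value)
    # degrades to ordinary full causal attention.
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i
            if sliding_window is not None:
                visible = visible and j > i - sliding_window
            row.append(visible)
        mask.append(row)
    return mask


# With a window of 2, position 3 sees only positions 2 and 3.
print(causal_mask(4, sliding_window=2)[3])  # [False, False, True, True]
# With no window, position 2 sees all of positions 0..2.
print(causal_mask(3)[2])                    # [True, True, True]
```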
v1.3.1
Hotfix Mixtral implementation
Full Changelog: v1.3.0...v1.3.1
v1.3.0
What's Changed
- Fix AMD documentation by @fxmarty in #1307
- Medusa and N-Gram Speculative decoding by @Narsil in #1308
- Mixtral support by @OlivierDehaene in #1328
Full Changelog: v1.2.0...v1.3.0
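The n-gram flavour of speculative decoding added in #1308 proposes draft tokens by looking the current suffix up in the text generated so far; the model then verifies the draft in one forward pass. (Medusa instead uses trained draft heads.) A toy sketch of the lookup step, with hypothetical names:

```python
def ngram_speculate(tokens, n=2, k=3):
    # Find the most recent earlier occurrence of the last n tokens and
    # propose the k tokens that followed it as a speculative draft.
    tail = tokens[-n:]
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == tail:
            return tokens[start + n:start + n + k]
    return []  # no match: nothing to speculate


seq = [5, 6, 7, 8, 5, 6]
print(ngram_speculate(seq, n=2, k=2))  # [7, 8]
```

If the verification pass accepts the draft, several tokens are emitted for the cost of a single decode step, which is where the latency win comes from.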
v1.2.0
What's Changed
- fix: do not leak inputs on error by @OlivierDehaene in #1228
- Fix missing `trust_remote_code` flag for AutoTokenizer in utils.peft by @creatorrr in #1270
- Load PEFT weights from local directory by @tleyden in #1260
- chore: update to torch 2.1.0 by @OlivierDehaene in #1182
- Fix IDEFICS dtype by @vakker in #1214
- Exllama v2 by @Narsil in #1211
- Add RoCm support by @fxmarty in #1243
- Let each model resolve their own default dtype. by @Narsil in #1287
- Make GPTQ test less flaky by @Narsil in #1295
New Contributors
- @creatorrr made their first contribution in #1270
- @tleyden made their first contribution in #1260
- @vakker made their first contribution in #1214
Full Changelog: v1.1.1...v1.2.0
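Letting each model resolve its own default dtype (#1287) replaces one global fallback with a per-model choice that an explicit user request still overrides. A purely illustrative sketch (the mapping and names here are hypothetical, not TGI's actual defaults):

```python
from typing import Optional

# Hypothetical per-model-family defaults, used only when the caller
# does not request a dtype explicitly.
DEFAULT_DTYPES = {"llama": "float16", "mixtral": "bfloat16"}


def resolve_dtype(model_type: str, requested: Optional[str] = None) -> str:
    # An explicit user choice always wins; otherwise fall back to the
    # model family's default, then to a global float16 default.
    if requested is not None:
        return requested
    return DEFAULT_DTYPES.get(model_type, "float16")


print(resolve_dtype("mixtral"))             # bfloat16
print(resolve_dtype("mixtral", "float32"))  # float32
print(resolve_dtype("unknown-model"))       # float16
```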
v1.1.1
What's Changed
- Fix launcher.md by @mishig25 in #1075
- Update launcher.md to wrap code blocks by @mishig25 in #1076
- Fixing eetq dockerfile. by @Narsil in #1081
- Fix window_size_left for flash attention v1 by @peterlowrance in #1089
- raise exception on invalid images by @leot13 in #999
- [Doc page] Fix launcher page highlighting by @mishig25 in #1080
- Handling bloom prefix. by @Narsil in #1090
- Update idefics_image_processing.py by @Narsil in #1091
- fixed command line arguments in docs by @Fluder-Paradyne in #1092
- Adding titles to CLI doc. by @Narsil in #1094
- Receive base64 encoded images for idefics. by @Narsil in #1096
- Modify the default for `max_new_tokens`. by @Narsil in #1097
- fix: type hint typo in tokens.py by @vejvarm in #1102
- Fixing GPTQ exllama kernel usage. by @Narsil in #1101
- Adding yarn support. by @Narsil in #1099
- Hotfixing idefics base64 parsing. by @Narsil in #1103
- Prepare for v1.1.1 by @Narsil in #1100
- Remove some content from the README in favour of the documentation by @osanseviero in #958
- Fix link in preparing_model.md by @mishig25 in #1140
- Fix calling cuda() on load_in_8bit by @mmngays in #1153
- Fix: Replace view() with reshape() in neox_modeling.py to resolve RuntimeError by @Mario928 in #1155
- fix: EETQLinear with bias in layers.py by @SidaZh in #1176
- fix: remove useless token by @rtrompier in #1179
- #1049 CI by @OlivierDehaene in #1178
- Fix link to quantization page in preparing_model.md by @aasthavar in #1187
- feat: paged attention v2 by @OlivierDehaene in #1183
- feat: remove flume by @OlivierDehaene in #1184
- Adding the video -> moving the architecture picture lower by @Narsil in #1239
- Narsil patch 1 by @Narsil in #1241
- Update README.md by @Narsil in #1242
- Fix link in quantization guide by @osanseviero in #1246
New Contributors
- @peterlowrance made their first contribution in #1089
- @leot13 made their first contribution in #999
- @Fluder-Paradyne made their first contribution in #1092
- @vejvarm made their first contribution in #1102
- @mmngays made their first contribution in #1153
- @Mario928 made their first contribution in #1155
- @SidaZh made their first contribution in #1176
- @rtrompier made their first contribution in #1179
- @aasthavar made their first contribution in #1187
Full Changelog: v1.1.0...v1.1.1
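Accepting base64-encoded images for Idefics (#1096) lets clients inline image bytes in the prompt instead of pointing at a fetchable URL. A minimal sketch of building such a data URI (helper name hypothetical; the encoding itself is just standard base64):

```python
import base64


def to_data_uri(image_bytes: bytes, mime: str = "image/png") -> str:
    # Inline raw image bytes as a base64 data URI so they can be embedded
    # directly in the prompt text rather than served from a URL.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


uri = to_data_uri(b"\x89PNG...")
print(uri.startswith("data:image/png;base64,"))  # True
```

On the server side, decoding is the mirror image: split off the `data:...;base64,` prefix and run the remainder through `base64.b64decode`.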
v1.1.0
What's Changed
- Fix f180 by @Narsil in #951
- Fix Falcon weight mapping for H2O.ai checkpoints by @Vinno97 in #953
- Fixing top_k tokens when k ends up < 0 by @Narsil in #966
- small fix on idefics by @VictorSanh in #954
- chore(client): Support Pydantic 2 by @JelleZijlstra in #900
- docs: typo in streaming.js by @revolunet in #971
- Disabling exllama on old compute. by @Narsil in #986
- sync text-generation version from 0.3.0 to 0.6.0 with pyproject.toml by @yzbx in #950
- Fix exllama wrongfully loading by @maximelaboisson in #990
- add transformers gptq support by @flozi00 in #963
- Fix call vs forward. by @Narsil in #993
- fit for baichuan models by @XiaoBin1992 in #981
- Fix missing arguments in Galactica's from_pb by @Vinno97 in #1022
- Fixing t5 loading. by @Narsil in #1042
- Add AWQ quantization inference support (#1019) by @Narsil in #1054
- Fix GQA llama + AWQ by @Narsil in #1061
- support local model config file by @zhangsibo1129 in #1058
- fix discard_names bug in safetensors conversion by @zhangsibo1129 in #1052
- Install curl to be able to perform more advanced healthchecks by @oOraph in #1033
- Fix position ids logic instantiation of idefics vision part by @VictorSanh in #1064
- Fix top_n_tokens returning non-log probs for some models by @Vinno97 in #1023
- Support eetq weight only quantization by @Narsil in #1068
- Remove the stripping of the prefix space (and any other mangling that tokenizers might do). by @Narsil in #1065
- Complete FastLinear.load parameters in OPTDecoder initialization by @zhangsibo1129 in #1060
- feat: add mistral model by @OlivierDehaene in #1071
New Contributors
- @VictorSanh made their first contribution in #954
- @JelleZijlstra made their first contribution in #900
- @revolunet made their first contribution in #971
- @yzbx made their first contribution in #950
- @maximelaboisson made their first contribution in #990
- @XiaoBin1992 made their first contribution in #981
- @sywangyi made their first contribution in #1034
- @zhangsibo1129 made their first contribution in #1058
Full Changelog: v1.0.3...v1.1.0
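The top_k fix in #966 deals with `k` drifting below zero. A toy sketch of one way to guard a sampler against that (hypothetical helper, not TGI's actual fix): a non-positive `k` is treated as "consider the whole vocabulary" rather than an empty candidate set.

```python
def clamp_top_k(top_k: int, vocab_size: int) -> int:
    # A non-positive k (e.g. after subtracting reserved tokens) would yield
    # an empty candidate set; fall back to the full vocabulary instead,
    # and never ask for more candidates than exist.
    if top_k <= 0:
        return vocab_size
    return min(top_k, vocab_size)


print(clamp_top_k(-3, 32000))     # 32000
print(clamp_top_k(50, 32000))     # 50
print(clamp_top_k(40000, 32000))  # 32000
```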
v1.0.3
What's Changed
Codellama.
- Upgrade version number in docs. by @Narsil in #910
- Added gradio example to docs by @merveenoyan in #867
- Supporting code llama. by @Narsil in #918
- Fixing the lora adaptation on docker. by @Narsil in #935
- Rebased #617 by @Narsil in #868
- New release. by @Narsil in #941
Full Changelog: v1.0.2...v1.0.3
v1.0.2
What's Changed
- Have snippets in Python/JavaScript in quicktour by @osanseviero in #809
- Added two more features in readme.md file by @sawanjr in #831
- Fix rope dynamic + factor by @Narsil in #822
- fix: LlamaTokenizerFast to AutoTokenizer at flash_llama.py by @dongs0104 in #619
- README edit -- running the service with no GPU or CUDA support by @pminervini in #773
- Fix `tokenizers==0.13.4`. by @Narsil in #838
- Update README.md by @adarshxs in #848
- Fixing watermark. by @Narsil in #851
- Misc minor improvements for InferenceClient docs by @osanseviero in #852
- "Fix" for rw-1b. by @Narsil in #860
- Upgrading versions of python client. by @Narsil in #862
- Adding Idefics multi modal model. by @Narsil in #842
- Add streaming guide by @osanseviero in #858
- Adding small benchmark script. by @Narsil in #881
New Contributors
- @sawanjr made their first contribution in #831
- @dongs0104 made their first contribution in #619
- @pminervini made their first contribution in #773
- @adarshxs made their first contribution in #848
Full Changelog: v1.0.1...v1.0.2
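The dynamic rope fix (#822) concerns combining rope scaling with a user-supplied factor. For context, dynamic NTK-style scaling stretches the rotary base once the sequence outgrows the trained context; the formula below follows the widely used transformers-style scaling, shown here as a hedged sketch rather than TGI's exact implementation:

```python
def dynamic_rope_base(base: float, factor: float, seq_len: int,
                      max_position: int, dim: int) -> float:
    # Within the trained context, leave the rotary base untouched.
    if seq_len <= max_position:
        return base
    # Beyond it, stretch the base so the rotary frequencies cover the
    # longer range (dynamic NTK-aware scaling).
    scale = (factor * seq_len / max_position) - (factor - 1)
    return base * scale ** (dim / (dim - 2))


print(dynamic_rope_base(10000.0, 2.0, 2048, 4096, 128))  # 10000.0
# Past the trained context the base grows, stretching the frequencies:
print(dynamic_rope_base(10000.0, 2.0, 8192, 4096, 128) > 10000.0)  # True
```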