Llama 3.2, mllama, Qwen2-Audio, Qwen2-VL, OLMoE, Llava Onevision, Pixtral, FalconMamba, Modular Transformers
New model additions
mllama
The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.
- Add MLLama #33703, by @qubvel, @zucchini-nlp, @ArthurZucker
Qwen2-VL
The Qwen2-VL is a major update from the previous Qwen-VL by the Qwen team.
An extract from the Qwen2-VL blogpost available here is as follows:
Qwen2-VL is the latest version of the vision language models based on Qwen2 in the Qwen model families. Compared with Qwen-VL, Qwen2-VL has the capabilities of:
- SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
- Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
- Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.
- Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.
Qwen2-Audio
The Qwen2-Audio is the new model series of large audio-language models from the Qwen team. Qwen2-Audio is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions.
They introduce two distinct audio interaction modes:
- voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input
- audio analysis: users can provide audio and text instructions for analysis during the interaction
OLMoE
OLMoE is a series of Open Language Models using sparse Mixture-of-Experts designed to enable the science of language models. The team releases all code, checkpoints, logs, and details involved in training these models.
- Add OLMoE by @Muennighoff in #32406
Llava Onevision
LLaVA-Onevision is a Vision-Language Model that can generate text conditioned on one or several images/videos. The model consists of a SigLIP vision encoder and a Qwen2 language backbone. Images are processed with the anyres-9 technique, where the image is split into 9 patches to better process high-resolution images and capture as much detail as possible. Videos, however, are pooled to a total sequence length of 196 tokens per frame for more memory-efficient computation. LLaVA-Onevision is available in three sizes: 0.5B, 7B and 72B, and achieves remarkable performance on benchmark evaluations.
- Llava Onevision: add model by @zucchini-nlp in #32673
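As a rough illustration of the two numbers quoted above, the sketch below cuts an image into 9 crops and computes the token budget of a video at 196 tokens per frame. It assumes, purely for illustration, a fixed 3x3 grid; the actual processor may choose the grid differently, and `grid_boxes` and `video_token_count` are hypothetical helpers, not library functions.

```python
def grid_boxes(width, height, rows=3, cols=3):
    """Return (left, top, right, bottom) crop boxes tiling the image in a rows x cols grid.

    Assumption: a fixed 3x3 grid stands in for the "9 patches" of anyres-9.
    """
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            top = r * height // rows
            right = (c + 1) * width // cols
            bottom = (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


def video_token_count(num_frames, tokens_per_frame=196):
    """Total visual tokens for a video when every frame is pooled to a fixed length."""
    return num_frames * tokens_per_frame


boxes = grid_boxes(1152, 1152)
print(len(boxes), boxes[0], boxes[-1])
print(video_token_count(32))
```

The fixed per-frame budget is what keeps long videos affordable: the token count grows linearly with the number of frames rather than with the frame resolution.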
FalconMamba
The FalconMamba model was proposed by TII UAE (Technology Innovation Institute) in their release.
The model has been trained on approximately 6T tokens consisting of a mixture of many data sources such as RefinedWeb, Cosmopedia and Math data.
The team releases an accompanying blog post.
- Add new model by @younesbelkada in #32615
Granite Language Models
The Granite model was proposed in Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.
PowerLM-3B is a 3B state-of-the-art small language model trained with the Power learning rate scheduler. It is trained on a wide range of open-source and synthetic datasets with permissive licenses. PowerLM-3B has shown promising results compared to other models in the size categories across various benchmarks, including natural language multi-choices, code generation, and math reasoning.
- Granite language models by @mayank31398 in #31502
Granite MOE
The GraniteMoe model was proposed in Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.
PowerMoE-3B is a 3B sparse Mixture-of-Experts (sMoE) language model trained with the Power learning rate scheduler. It sparsely activates 800M parameters for each token. It is trained on a mix of open-source and proprietary datasets. PowerMoE-3B has shown promising results compared to other dense models with 2x active parameters across various benchmarks, including natural language multi-choices, code generation, and math reasoning.
- Granitemoe by @mayank31398 in #33207
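To show what "sparsely activates" means in practice, here is a minimal sketch of top-k expert routing in plain Python. It is a generic illustration of sparse Mixture-of-Experts, not the GraniteMoe implementation: the number of experts, the value of k, and the renormalisation of the selected weights are all assumptions, and `route_top_k` and `moe_forward` are hypothetical names.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Select the k experts with the highest router probability.

    Returns {expert_index: weight}, with weights renormalised to sum to 1
    (an assumption; real models differ in how they normalise).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs."""
    weights = route_top_k(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Four toy "experts": each one just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 1.5]

weights = route_top_k(router_logits, k=2)
y = moe_forward(10.0, experts, router_logits, k=2)
print(sorted(weights), y)
```

Only two of the four experts are evaluated for this token, which is why the number of active parameters per token is a fraction of the total parameter count.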
Descript-Audio-Codec
The Descript Audio Codec (DAC) model is a powerful tool for compressing audio data, making it highly efficient for storage and transmission. By compressing 44.1 kHz audio into tokens at just 8 kbps bandwidth, the DAC model enables high-quality audio processing while significantly reducing the data footprint. This is particularly useful in scenarios where bandwidth is limited or storage space is at a premium, such as in streaming applications, remote conferencing, and archiving large audio datasets.
- Add Descript-Audio-Codec model by @kamilakesbi in #31494
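The arithmetic behind the reduced data footprint can be worked through in a few lines. The baseline is an assumption made for this sketch: uncompressed 16-bit mono PCM at the same 44.1 kHz sample rate. Only the 44.1 kHz and 8 kbps figures come from the description above.

```python
SAMPLE_RATE_HZ = 44_100
BITS_PER_SAMPLE = 16       # assumption: 16-bit mono PCM as the uncompressed baseline
CODEC_BITRATE_BPS = 8_000  # 8 kbps, as quoted for DAC

pcm_bitrate_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE
compression_ratio = pcm_bitrate_bps / CODEC_BITRATE_BPS


def size_in_megabytes(bitrate_bps, seconds):
    """Storage needed for `seconds` of audio at a constant bitrate (1 MB = 10^6 bytes)."""
    return bitrate_bps * seconds / 8 / 1_000_000


print(pcm_bitrate_bps, round(compression_ratio, 1))
print(size_in_megabytes(pcm_bitrate_bps, 3600), size_in_megabytes(CODEC_BITRATE_BPS, 3600))
```

Under these assumptions, an hour of audio shrinks from roughly 318 MB to 3.6 MB; a stereo or higher bit-depth baseline would make the ratio larger.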
Pixtral
The Pixtral model was released by the Mistral AI team. Pixtral is a multimodal model, taking images and text as input, and producing text as output. This model follows the Llava family, meaning image embeddings are placed instead of the [IMG] token placeholders.
The model uses PixtralVisionModel for its vision encoder, and MistralForCausalLM for its language decoder. The main contribution is the 2D RoPE (rotary position embeddings) on the images, and support for arbitrary image sizes (the images are not padded together nor are they resized).
- Add support for Pixtral by @ArthurZucker in #33449
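The placeholder mechanism described above can be sketched without any model: wherever the token sequence contains an [IMG] placeholder, the text embedding at that position is swapped for the next image embedding. This is a simplified illustration with plain lists; the token id `10` and the function `merge_image_embeddings` are hypothetical and do not reflect the library's internals.

```python
IMG_TOKEN_ID = 10  # hypothetical id standing in for the [IMG] placeholder token


def merge_image_embeddings(input_ids, text_embeddings, image_embeddings,
                           img_token_id=IMG_TOKEN_ID):
    """Replace the embedding at each [IMG] position with the next image embedding, in order."""
    num_slots = sum(1 for token_id in input_ids if token_id == img_token_id)
    if num_slots != len(image_embeddings):
        raise ValueError(
            f"{num_slots} [IMG] placeholders but {len(image_embeddings)} image embeddings"
        )
    image_iter = iter(image_embeddings)
    merged = []
    for token_id, text_vec in zip(input_ids, text_embeddings):
        merged.append(next(image_iter) if token_id == img_token_id else text_vec)
    return merged


input_ids = [1, 10, 10, 5]                        # text, [IMG], [IMG], text
text_embeddings = [[0.1], [0.0], [0.0], [0.5]]    # one tiny vector per token
image_embeddings = [[9.0], [8.0]]                 # one vector per image patch

merged = merge_image_embeddings(input_ids, text_embeddings, image_embeddings)
print(merged)
```

Because the number of placeholders must match the number of image embeddings, the processor has to emit one [IMG] token per image patch, which is how arbitrary image sizes translate into variable-length sequences.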
Mimi
The Mimi model was proposed in Moshi: a speech-text foundation model for real-time dialogue by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour. Mimi is a high-fidelity audio codec model developed by the Kyutai team, that combines semantic and acoustic information into audio tokens running at 12Hz and a bitrate of 1.1kbps. In other words, it can be used to map audio waveforms into “audio tokens”, known as “codebooks”.
OmDet-Turbo
The OmDet-Turbo model was proposed in Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head by Tiancheng Zhao, Peng Liu, Xuan He, Lu Zhang, Kyusong Lee. OmDet-Turbo incorporates components from RT-DETR and introduces a swift multimodal fusion module to achieve real-time open-vocabulary object detection capabilities while maintaining high accuracy. The base model achieves performance of up to 100.2 FPS and 53.4 AP on COCO zero-shot.
- Add OmDet-Turbo by @yonigozlan in #31843
Quantization
GGUF
GGUF support continues to be enhanced in the library by offering a way to load GGUF models within transformers by dequantizing them, before re-quantizing them for re-use within the GGUF/GGML ecosystem.
- Add Qwen2Moe GGUF loading support by @VladOS95-cyber in #33264
- Fix incorrect vocab size retrieval in GGUF config by @Isotr0py in #32551
- Add chat_template for tokenizer extracted from GGUF model by @Isotr0py in #32908
- 🚨 Support dequantization for most GGML types by @Isotr0py in #32625
- Add support for GGUF Phi-3 by @a8nova in #31844
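For readers unfamiliar with what dequantization involves, here is a minimal sketch for one GGML type, Q8_0. It assumes the commonly documented Q8_0 layout (blocks of 32 weights, each block stored as one little-endian float16 scale followed by 32 signed 8-bit integers) and is an illustration only, not the code used by the library; `dequantize_q8_0` is a hypothetical name.

```python
import struct

BLOCK_SIZE = 32               # weights per Q8_0 block (assumed layout)
BLOCK_BYTES = 2 + BLOCK_SIZE  # float16 scale + 32 int8 values


def dequantize_q8_0(raw: bytes):
    """Turn packed Q8_0 blocks back into a flat list of floats: value = scale * quant."""
    if len(raw) % BLOCK_BYTES != 0:
        raise ValueError("buffer length is not a multiple of the Q8_0 block size")
    values = []
    for offset in range(0, len(raw), BLOCK_BYTES):
        # "<e32b": little-endian float16 followed by 32 signed bytes
        scale, *quants = struct.unpack_from("<e32b", raw, offset)
        values.extend(scale * q for q in quants)
    return values


# Build one synthetic block: scale 0.5, quantized values -16..15.
block = struct.pack("<e32b", 0.5, *range(-16, 16))
weights = dequantize_q8_0(block)
print(len(weights), weights[0], weights[-1])
```

Other GGML types pack their scales and quants differently (for example with 4-bit values or per-sub-block scales), so each needs its own unpacking routine.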
Torch AO
An ongoing effort is to add the ability to use torchao as a quantization backend. Future PRs will enable saving and fine-tuning with peft.
- Add TorchAOHfQuantizer by @jerryzh168 in #32306
Liger Kernel
The Liger kernel is now supported in the Trainer class.
- Integrate Liger (Linkedin GPU Efficient Runtime) Kernel to Trainer by @JasonZhu1313 in #32860
Modular Transformers
This PR introduces Modularity for transformers, which has always been prohibited when working with transformers (see blog post for the accompanying design philosophy).
The core idea behind this PR is to facilitate model addition by enabling Pythonic inheritance while keeping true to our single-file policy in which models/processors must be contained within a single file, enabling working around the object without going through 10 layers of abstractions.
It is heavily recommended to read the PR description in order to understand the depth of the change: #33248
- Modular transformers: modularity and inheritance for new model additions by @ArthurZucker in #33248
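The idea of Pythonic inheritance for model additions can be illustrated with toy classes. In this sketch a new model reuses an existing model's decoder layer and overrides only the attention class. All class names are hypothetical, no transformers code is involved, and the step described in the PR that turns such a definition into a standalone single file is not reproduced here.

```python
class BaseAttention:
    """Stand-in for an existing model's attention module."""

    def __init__(self, hidden_size, num_heads):
        self.hidden_size = hidden_size
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads


class BaseDecoderLayer:
    """Stand-in for an existing model's decoder layer."""

    attention_class = BaseAttention

    def __init__(self, hidden_size, num_heads):
        self.self_attn = self.attention_class(hidden_size, num_heads)


# --- what a modular definition of a new model might look like ---

class NewModelAttention(BaseAttention):
    """Only the difference from the parent is written down: a sliding window."""

    def __init__(self, hidden_size, num_heads, sliding_window=4096):
        super().__init__(hidden_size, num_heads)
        self.sliding_window = sliding_window


class NewModelDecoderLayer(BaseDecoderLayer):
    """Inherits everything, swapping in the new attention class."""

    attention_class = NewModelAttention


layer = NewModelDecoderLayer(hidden_size=64, num_heads=8)
print(type(layer.self_attn).__name__, layer.self_attn.head_dim, layer.self_attn.sliding_window)
```

The contributor writes only the delta against an existing model, while the single-file policy is preserved for the code that readers and users eventually work with.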
Agents
Agents continue to be improved with each release; this time, it becomes much simpler to leverage a local engine through a local Transformers Engine.
- Multi agents with manager by @aymeric-roucher in #32687
- Add new documentation page for advanced agent usage by @aymeric-roucher in #33265
- Create local Transformers Engine by @aymeric-roucher in #33218
- Agents use grammar by @aymeric-roucher in #31735
Dynamic cache for decoder-only models
This PR adds dynamic cache support to all decoder-only models (except XLNet).
The documentation for the Dynamic cache can be found here, and documentation related to the KV cache in transformers in general can be found here.
- Cache: new Cache format in decoder-only models by @zucchini-nlp in #31421
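Conceptually, a dynamic cache keeps the keys and values of every layer and grows them as tokens are generated, so earlier positions never need to be recomputed. The toy class below illustrates that behaviour with plain lists instead of tensors. It is a sketch and not the library's cache class: `ToyDynamicCache` is a hypothetical name, and its method names are chosen for readability only.

```python
class ToyDynamicCache:
    """Per-layer key/value store that grows along the sequence dimension."""

    def __init__(self):
        self.key_cache = []    # one list of keys per layer
        self.value_cache = []  # one list of values per layer

    def update(self, keys, values, layer_idx):
        """Append new keys/values for a layer and return everything cached so far."""
        if layer_idx == len(self.key_cache):
            # First time this layer is seen: create its entry.
            self.key_cache.append(list(keys))
            self.value_cache.append(list(values))
        elif layer_idx < len(self.key_cache):
            # Later decoding steps: extend the existing entry.
            self.key_cache[layer_idx].extend(keys)
            self.value_cache[layer_idx].extend(values)
        else:
            raise IndexError("layers must be added in order")
        return self.key_cache[layer_idx], self.value_cache[layer_idx]

    def get_seq_length(self, layer_idx=0):
        """Number of positions cached for the given layer."""
        if layer_idx >= len(self.key_cache):
            return 0
        return len(self.key_cache[layer_idx])


cache = ToyDynamicCache()
# Prefill: three prompt tokens pass through two layers.
cache.update(["k0", "k1", "k2"], ["v0", "v1", "v2"], layer_idx=0)
cache.update(["k0", "k1", "k2"], ["v0", "v1", "v2"], layer_idx=1)
# Decoding: one new token reaches layer 0.
keys, values = cache.update(["k3"], ["v3"], layer_idx=0)
print(cache.get_seq_length(0), cache.get_seq_length(1), keys)
```

Unlike a statically allocated cache, nothing is reserved up front: memory grows with the generated sequence.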
Chat templates updates
We've made several updates to our handling of chat models and chat templates. The most noticeable change is that assistant prefill is now supported. This means you can end a chat with an assistant message, and the model will continue that message instead of starting a new one, allowing you to guide the model's response:
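from transformers import pipeline

# model_checkpoint is a placeholder: set it to the name of any chat model checkpoint
pipe = pipeline("text-generation", model_checkpoint)
chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'},
]
output = pipe(chat)  # The model will continue outputting JSON!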
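To make the effect of prefill concrete, the sketch below renders a chat into a prompt string with a toy template. The `<|role|>` and `<|end|>` markers and the `render_chat` function are hypothetical and do not correspond to any real model's template; the point is only that a final assistant message is left open instead of being closed with an end-of-turn marker.

```python
def render_chat(messages, end_of_turn="<|end|>"):
    """Render messages with a toy template, leaving a trailing assistant message open."""
    parts = []
    for index, message in enumerate(messages):
        text = f"<|{message['role']}|>{message['content']}"
        is_last = index == len(messages) - 1
        if is_last and message["role"] == "assistant":
            parts.append(text)  # prefill: no end-of-turn, the model continues from here
        else:
            parts.append(text + end_of_turn)
    if messages[-1]["role"] != "assistant":
        parts.append("<|assistant|>")  # usual case: open a fresh assistant turn
    return "".join(parts)


chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'},
]
prefilled_prompt = render_chat(chat)
fresh_prompt = render_chat(chat[:1])
print(prefilled_prompt)
print(fresh_prompt)
```

With prefill, generation resumes right after `{"name": "`, so the model is strongly steered toward completing a JSON object rather than starting its reply from scratch.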
We've also enabled several new functionalities in Jinja that will allow more powerful templates in the future, including Loop Controls and a strftime_now function that can get the current date and time, which is commonly used in system messages. For more details, see the updated chat template docs.
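As a rough idea of what such a helper provides, the sketch below models strftime_now as a plain Python function and uses it to stamp a system message. The assumption here is that the helper simply formats the current time with a strftime-style format string; in the library it is made available inside Jinja templates, which this sketch does not reproduce. `build_system_message` is a hypothetical function.

```python
from datetime import datetime


def strftime_now(date_format):
    """Format the current local date and time (assumed behaviour of the template helper)."""
    return datetime.now().strftime(date_format)


def build_system_message(instructions):
    """Prepend today's date to a system prompt, as chat templates commonly do."""
    return f"Today's date: {strftime_now('%Y-%m-%d')}\n{instructions}"


system_message = build_system_message("You are a helpful assistant.")
print(system_message)
```

Computing the date when the template is rendered means the system message stays current without the caller having to pass the date in.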
- Enable some Jinja extensions and add datetime capabilities by @Rocketknight1 in #32684
- Update Jinja docs with new functions and general cleanup by @Rocketknight1 in #33097
- Add assistant prefill for chat templates and TextGenerationPipeline by @Rocketknight1 in #33198
- Add a warning to the chat template docs about the tool_calls format by @Rocketknight1 in #33277
- Add tip to clarify tool calling by @Rocketknight1 in #32883
Bugfixes and improvements
- 🌐 [i18n-KO] Translated mask_generation.md to Korean by @jeongiin in #32257
- 🌐 [i18n-KO] Translated idefics.md to Korean by @boyunJang in #32258
- 🌐 [i18n-KO] Translated image_to_image.md to Korean by @shinhyunji36 in #32327
- Gemma2: add cache warning by @zucchini-nlp in #32279
- enable xla fsdp by @hanwen-sun in #32048
- Fix typo in tokenization_utils_base.py by @blubitz in #32484
- fix broken link in docs by @jorahn in #32491
- Docs: alert for the possibility of manipulating logits by @gante in #32467
- 🌐 [i18n-KO] Translated gptq.md to Korean by @1kmmk1 in #32293
- 🌐 [i18n-KO] Translated prompting.md to Korean by @chhaewxn in #32294
- 🌐 [i18n-KO] Translated quantization/quanto.md to Korean by @fabxoe in #32281
- 🌐 [i18n-KO] Translated image_feature_extraction.md to Korean by @mreraser in #32239
- Fix references to model google mt5 small by @JuanFKurucz in #32497
- Docs: Fixed WhisperModel.forward’s docstring link by @Sai-Suraj-27 in #32498
- 🌐 [i18n-KO] Translated chat_templating.md to Korean by @enchantee00 in #32362
- Fix link to autoclass_tutorial.md in i18n.md by @JuanFKurucz in #32501
- Fix typo: depracted -> deprecated by @tomaarsen in #32489
- Fix issue #32518: Update llm_tutorial.md by @doomdagadiggiedahdah in #32523
- Change Phi3 _supports_sdpa to True by @pocca2048 in #32457
- Uniformize kwargs for processors - GroundingDINO by @SangbumChoi in #31964
- Fix add-new-model-like by @molbap in #31773
- filter flash_attn optional imports loading remote code by @eaidova in #30954
- 🌐 [i18n-KO] Translated ko-llm_tutorial_optimization.md to Korean by @010kim in #32372
- 🌐 [i18n-KO] Translated trainer.md to Korean by @cjfghk5697 in #32260
- 🌐 [i18n-KO] Translated eetq.md to Korean by @jun048098 in #32352
- 🌐 [i18n-KO] Translated fsdp.md to Korean by @win2dvp21 in #32261
- 🌐 [i18n-KO] Translated bitsandbytes.md to Korean by @SeungAhSon in #32408
- Fix generate with inputs_embeds as input by @molbap in #32493
- Fixed test test_static_cache_exportability with torch 2.4.0 by @guangy10 in #32516
- Fix code example to load bigcode starcoder2 7b by @JuanFKurucz in #32474
- [docs] Translation guide by @stevhliu in #32547
- Gemma2: fix FA2 generation by @zucchini-nlp in #32553
- Fix a bug in Qwen2Audio by @faychu in #32552
- fix slow integration gemma2 test by @ArthurZucker in #32534
- fix non contiguous tensor value error in save_pretrained by @congcongke in #32422
- 🌐 [i18n-KO] Translated agent.md to Korean by @Jwaminju in #32351
- Fix: FA2 with packed training by @zucchini-nlp in #32487
- Fix sliding window attention used in Gemma2FlashAttention2 by @brcps12 in #32522
- fix: Fixed conditional check for encodec model names by @Sai-Suraj-27 in #32581
- Fix .push_to_hub(..., create_pr=True, revision="my-branch") when creating PR on not-owned repo by @Wauplin in #32094
- Cleanup tool calling documentation and rename doc by @Rocketknight1 in #32337
- 🌐 [i18n-KO] Translated deepspeed.md to Korean by @4N3MONE in #32431
- 🌐 [i18n-KO] Translated awq.md to Korean by @ahnjj in #32324
- fix: Fixed failing test_find_base_model_checkpoint by @Sai-Suraj-27 in #32638
- "to be not" -> "not to be" by @qgallouedec in #32636
- fix: Updated the is_torch_mps_available() function to include min_version argument by @Sai-Suraj-27 in #32545
- Expand inputs in processors for VLMs by @zucchini-nlp in #30962
- Automatically add transformers tag to the modelcard by @LysandreJik in #32623
- Fix tests by @molbap in #32649
- fix tensors on different devices in WhisperGenerationMixin by @faaany in #32316
- Add support for GrokAdamW optimizer by @ehartford in #32521
- Add Depth Anything V2 Metric models by @bt2513 in #32126
- Fix: Fixed directory path for utils folder in test_tokenization_utils.py by @Sai-Suraj-27 in #32601
- Modify ProcessorTesterMixin for better generalization by @yonigozlan in #32637
- TF_Deberta supporting mixed precision by @pinesnow72 in #32618
- Fix tests recurrent by @molbap in #32651
- Support MUSA (Moore Threads GPU) backend in transformers by @fmo-mt in #31913
- fix: Fixed failing tests in tests/utils/test_add_new_model_like.py by @Sai-Suraj-27 in #32678
- Update translation docs review by @stevhliu in #32662
- Fix JetMoeIntegrationTest by @ydshieh in #32332
- Update the distributed CPU training on Kubernetes documentation by @dmsuehir in #32669
- fix: Fixed unknown pytest config option doctest_glob by @Sai-Suraj-27 in #32475
- Unpin deepspeed in Docker image/tests by @muellerzr in #32572
- Updated workflows to the latest versions by @Sai-Suraj-27 in #32405
- reopen: llava-next fails to consider padding_side during Training by @jp1924 in #32679
- fix: Corrected falcon-mamba-7b model checkpoint name by @Sai-Suraj-27 in #32837
- fix: update doc link for runhouse in README.md by @muddlebee in #32664
- VLMs: small clean-up for cache class by @zucchini-nlp in #32417
- add back the position ids by @ArthurZucker in #32554
- Use head_dim if in config for RoPE by @suiyoubi in #32495
- Generate: unify LogitsWarper and LogitsProcessor by @gante in #32626
- [tests] make test_sdpa_equivalence device-agnostic by @faaany in #32520
- Cache: use batch_size instead of max_batch_size by @gante in #32657
- Fix AutoConfig and AutoModel support for Llava-Next-Video by @TKONIY in #32844
- improve _get_is_as_tensor_fns by @zrr1999 in #32596
- Revert PR 32299, flag users when Zero-3 was missed by @muellerzr in #32851
- fix multi-gpu with static cache by @SunMarc in #32543
- Reduce the error log when using core models that need their weights renamed, and provide a step forward by @muellerzr in #32656
- Make beam_constraints.Constraint.advance() docstring more accurate by @alex-calderwood in #32674
- generate: missing to in DoLa body, causing exceptions in multi-gpu generation by @gante in #32856
- Add Flax Dinov2 by @MHRDYN7 in #31960
- support torch-speech by @itazap in #32537
- [tests] make test_sdpa_can_compile_dynamic device-agnostic by @faaany in #32519
- Add repr for Conv1D by @AaronZLT in #32425
- Support save/load ckpt for XLA FSDP by @yitongh in #32311
- RT-DETR parameterized batchnorm freezing by @AlanBlanchet in #32631
- Mamba / FalconMamba: Fix mamba left padding by @younesbelkada in #32677
- Fix: Mamba2 generation mismatch between input_ids and inputs_embeds by @vasqu in #32694
- Docs: Fixed whisper-large-v2 model link in docs by @Sai-Suraj-27 in #32871
- Allow-head-dim by @ArthurZucker in #32857
- 🚨🚨🚨 Update min version of accelerate to 0.26.0 by @SunMarc in #32627
- Fix repr for conv by @ArthurZucker in #32897
- fix: jamba cache fails to use torch.nn.module by @xgal in #32894
- Fix: Mamba2 norm_before_gate usage by @vasqu in #32686
- Replace tensor.norm() with decomposed version for CLIP executorch export by @qubvel in #32887
- link for optimizer names by @nbroad1881 in #32400
- [i18n-ar] add README_ar.md to README.md by @AhmedAlmaghz in #32583
- fix: [whisper] don't overwrite GenerationConfig's return_timestamps when return_timestamps is not passed to generate function by @hrl in #31296
- Update docker image building by @ArthurZucker in #32918
- Jamba: update integration tests by @gante in #32250
- fix: Added missing huggingface_hub installation to workflows by @Sai-Suraj-27 in #32891
- fix: no need to dtype A in jamba by @xgal in #32924
- FEAT / Trainer: Add adamw 4bit optimizer by @SunMarc in #31865
- CI: separate step to download nltk files by @gante in #32935
- FIX / Hub: Also catch for exceptions.ConnectionError by @younesbelkada in #31469
- Add SynCode to llm_tutorial by @shubhamugare in #32884
- Fix benchmark script by @ydshieh in #32635
- Improve greedy search memory usage by @regisss in #32895
- fix: (issue #32689) AttributeError raised when using Trainer with eval_on_start=True in Jupyter Notebook. by @fshp971 in #32849
- Gemma2: eager attention by default by @gante in #32865
- [run_slow] idefics2 by @andimarafioti in #32840
- Fix regression on Processor.save_pretrained caused by #31691 by @leloykun in #32921
- 🌐 [i18n-KO] Translated knowledge_distillation_for_image_classification.md to Korean by @JinukHong in #32334
- Generate: Deprecate returning legacy cache by default; Handle use_cache=False by @gante in #32863
- docs: fix outdated link to TF32 explanation by @anakin87 in #32947
- Reducing memory usage: removing useless logits computation in generate() by @Cyrilvallez in #31292
- Forbid PretrainedConfig from saving generate parameters; Update deprecations in generate-related code 🧹 by @gante in #32659
- DeviceGuard added to use Deformable Attention more safely on multi-GPU by @DonggeunYu in #32910
- added doctring to SchedulerType class by @Arunprakash-A in #32898
- Updated the custom_models.md changed cross_entropy code by @S-M-J-I in #33118
- CI: add torchvision to the consistency image by @gante in #32941
- Test: add higher atol in test_forward_with_num_logits_to_keep by @gante in #33093
- mps: add isin_mps_friendly, a wrapper function for torch.isin by @gante in #33099
- Add changes for uroman package to handle non-Roman characters by @nandwalritik in #32404
- fix: Fixed pydantic required version in dockerfiles to make it compatible with DeepSpeed by @Sai-Suraj-27 in #33105
- quickfix documentation by @molbap in #32566
- Fixup py 38 type hints for mps friendly by @muellerzr in #33128
- fix: Fixed CodeGenTokenizationTest::test_truncation failing test by @Sai-Suraj-27 in #32850
- fix: multilingual midel convert to tflite get wrong token by @Ayaa17 in #32079
- disable scheduled daily CI temporarily by @ydshieh in #33136
- CI: fix efficientnet pipeline timeout and prevent future similar issues due to large image size by @gante in #33123
- Log additional test metrics with the CometCallback by @Lothiraldan in #33124
- [docs] add quick usage snippet to Whisper. by @Vaibhavs10 in #31289
- Update stateful_callbacks state before saving checkpoint by @pedrobrs in #32115
- fix Idefics2VisionConfig type annotation by @chenzizhao in #33103
- Add a fix for custom code tokenizers in pipelines by @Rocketknight1 in #32300
- Llama: make slow tests green 🟢 by @gante in #33138
- fix redundant checkpointing in example training scripts by @eminorhan in #33131
- update torch req for 4-bit optimizer by @SunMarc in #33144
- 🌐 [i18n-KO] Translated conversations.md to Korean by @newfull5 in #32468
- Very small change to one of the function parameters by @alisalamatian1 in #32548
- 🚨 Add Blip2ForImageTextRetrieval by @jpizarrom in #29261
- fix model name and copyright by @mayank31398 in #33152
- Fix: Jamba batched generation by @vasqu in #32914
- [whisper] pass attention_mask to generate_with_fallback() by @benniekiss in #33145
- [RoBERTa-based] Add support for sdpa by @hackyon in #30510
- Fix import paths for test_module by @rasmi in #32888
- Zero-shot pipelines: minor doc changes by @pcuenca in #33127
- Customise the separator used for splicing in DataCollatorWithFlattening by @beep-bebop in #33114
- Fix spell mistakes by @matsuo1234567 in #33149
- update push CI workflow files for security by @ydshieh in #33142
- added quick clarification by @DuyguA in #33166
- pass module to Params4bit.from_prequantized to ensure quant_state by @winglian in #32524
- Mamba2 conversion script for original models by @vasqu in #32580
- Add a static cache that offloads to the CPU or other device by @gerbenvv in #32161
- use a single for loop by @ArthurZucker in #33148
- Pipeline: fix bad generation kwargs docs by @gante in #33205
- Add missing quotes in modeling_llava_next_video.py by @juliendenize in #33214
- Add warning for stop string edge case by @Rocketknight1 in #33169
- Fix local repos with remote code not registering for pipelines by @Rocketknight1 in #33100
- Refactor CI: more explicit by @ArthurZucker in #30674
- 🌐 [i18n-KO] Translated llm_optims.md to Korean by @yijun-lee in #32325
- Fix red amin by @ArthurZucker in #33220
- Test fetcher: missing return on filtered tests; don't write empty files by @gante in #33224
- Generate: throw warning when return_dict_in_generate is False but should be True by @gante in #33146
- Add video text to text docs by @merveenoyan in #33164
- Add GraniteRMSNorm by @NielsRogge in #33177
- Add duckduckgo search tool by @aymeric-roucher in #32882
- Fix: Suppressed 'use_reentrant=False' warning by @ankush13r in #33208
- docs: Replace package abbreviations with full name(bitsandbytes) in docstrings by @rapsealk in #33230
- Generate: fix assistant in different device by @gante in #33257
- remove to restriction for 4-bit model by @SunMarc in #33122
- Fixed typo repeated word in DETR docs by @sergiopaniego in #33250
- Fix: use torch.from_numpy() to create tensors for np.ndarrays by @shinyano in #33201
- remove torch input dependant control flow by @ArthurZucker in #33245
- Fix: num_logits_to_keep in composite models by @zucchini-nlp in #33168
- Fix Bark saving by @ylacombe in #33266
- Update chat template docs to remove Blenderbot by @Rocketknight1 in #33254
- Add sdpa support for Albert by @OmarManzoor in #32092
- Only disallow DeepSpeed Zero-3 for auto bs finder by @muellerzr in #31731
- fix the parallel number of CI nodes when it is smaller than number of tests by @ArthurZucker in #33276
- Repo checks: check documented methods exist by @gante in #32320
- Fix: multigpu training by @zucchini-nlp in #33271
- Cache docs: update by @zucchini-nlp in #32929
- Config: unified logic to retrieve text config by @gante in #33219
- [fix] LlavaNextProcessor '_get_unpadded_features' method by @laurentd-lunit in #33263
- wait 15m before SSH into runner workflow stops by @ydshieh in #33300
- Bugfix/alexsherstinsky/fix none check for attention factor in rope scaling 2024 08 28 0 by @alexsherstinsky in #33188
- [InstructBLIP] qformer_tokenizer is required input by @amyeroberts in #33222
- [BUG] fix upper nltk version by @ylacombe in #33301
- Fix excessive CPU memory usage with FSDP and cpu_ram_efficient_loading by @matthewdouglas in #33154
- Add validate images and text inputs order util for processors and test_processing_utils by @yonigozlan in #33285
- Fix: Fix FalconMamba training issues due to incompatible kernels by @younesbelkada in #33195
- Add paper link by @Muennighoff in #33305
- 🚨 Fix torch.jit.trace for interpolate_pos_encoding in all vision models by @xenova in #33226
- Update SECURITY.md by @Michellehbn in #32680
- simple align qwen2vl kv_seq_len calculation with qwen2 by @simonJJJ in #33161
- Add a community notebook for fine-tuning with QLoRA, PEFT, and MLflow by @daniellok-db in #33319
- Fix: StaticCache & inputs_embeds by @zucchini-nlp in #32932
- Docs: add more cross-references to the KV cache docs by @gante in #33323
- [whisper] alternative fix for long-form timestamps by @sanchit-gandhi in #32131
- fix qwen2vl vision eager-attention by @simonJJJ in #33213
- Load dynamic module (remote code) only once if code isn't change by @XuehaiPan in #33162
- support loading model without config.json file by @itazap in #32356
- Add validation for maximum sequence length in modeling_whisper.py by @AmirMohammadFakhimi in #33196
- add self.head_dim for VisionAttention in Qwen2-VL by @GeLee-Q in #33211
- support 3D attention mask in bert by @gathierry in #32105
- Support reading tiktoken tokenizer.model file by @itazap in #31656
- red-ci on main, fix copies by @ArthurZucker in #33356
- RoPE: fix BC warning by @gante in #33331
- Fix Prefill docs by @Rocketknight1 in #33352
- Update author for QLorA/PEFT community notebook by @daniellok-db in #33338
- add sdpa mbart by @nbroad1881 in #32033
- Fix quantized cache tests by @zucchini-nlp in #33351
- schedulefree optimizers by @winglian in #30079
- Add visit webpage tool by @aymeric-roucher in #33353
- Fixed Majority of the Typos in transformers[en] Documentation by @nnilayy in #33350
- Compile compatibilty for decoder-only models by @zucchini-nlp in #32617
- Adjust templates by @LysandreJik in #33384
- Remove repeated prepare_images in processor tests by @amyeroberts in #33163
- Fix import of FalconMambaForCausalLM by @younesbelkada in #33381
- Import structure & first three model refactors by @LysandreJik in #31329
- VLM: fixes after refactor by @zucchini-nlp in #32907
- fixed Mask2Former image processor segmentation maps handling by @maciej-adamiak in #33364
- Bug Fix: Update hub.py to fix NoneType error by @rishiraj in #33315
- Update WhisperTokenizer Doc: Timestamps and Previous Tokens Behaviour by @bruno-hays in #33390
- Make StaticCache configurable at model construct time by @guangy10 in #32830
- use diff internal model in tests by @itazap in #33387
- Fix FbgemmFp8Linear not preserving tensor shape by @vgel in #33239
- Fix failing windows by @LysandreJik in #33436
- Remove deprecated task in load_dataset by @albertvillanova in #33433
- Dynamic number of speculative tokens in order to accelerate speculative decoding by @jmamou in #33258
- Fix: Cast prefetch_bucket_size to integer for deepspeed >= 0.15 by @kiddj in #33402
- [docs] add the missing huggingface hub username by @faaany in #33431
- [docs] add the missing tokenizer when pushing models to huggingface hub by @faaany in #33428
- Update stale.yml by @LysandreJik in #33434
- Docs - update formatting of llama3 model card by @MichaelCurrin in #33438
- Fix incomplete sentence in Zero-shot object detection documentation by @sergiopaniego in #33430
- Fix flax whisper tokenizer bug by @hannan72 in #33151
- Clean-up deprecated code by @zucchini-nlp in #33446
- Fix default revision for pipelines by @ankane in #33395
- Revive AMD scheduled CI by @ydshieh in #33448
- Allow send SSH into runner info. to DM by @ydshieh in #33346
- Correct Whisper's beam search scores computation by @ylacombe in #32336
- Qwen2-VL: clean-up and add more tests by @zucchini-nlp in #33354
- [whisper] Clarify error message when setting max_new_tokens by @benniekiss in #33324
- [docs] refine the doc for train with a script by @faaany in #33423
- Return image hidden states by @zucchini-nlp in #33426
- add a callback hook right before the optimizer step by @winglian in #33444
- Enable padding_side as call time kwargs by @zucchini-nlp in #33385
- Mitigate a conflict when using sentencepiece by @tengomucho in #33327
- [Phi-3] Bug on stale kv cache by @garg-amit in #33129
- Fix the initialization of the cache when we have multi gpu by @SunMarc in #33303
- Enable finetuning with torchao quantized model by @SunMarc in #33361
- Corrected Agents and tools documentation links typos by @sergiopaniego in #33471
- chore: fix typo in comment in tokenization_utils_base.py by @DavidLemayian in #33466
- Cohere: update RoPE structure by @gante in #33408
- Fix SSH workflow by @ydshieh in #33451
- Add keypoint-detection task guide by @merveenoyan in #33274
- Uniformize kwargs for LLaVa processor and update docs by @yonigozlan in #32858
- Agents, supercharged - Multi-agents, External tools, and more docs typo fixed by @sergiopaniego in #33478
- [i18n-ar] Add File : docs/source/ar/_toctree.yml by @AhmedAlmaghz in #32696
- [Whisper test] Fix some failing tests by @ylacombe in #33450
- Fix: Qwen2-VL training on video datasets by @hiyouga in #33307
- Updated Trainer's liger-kernel integration to call correct patching API by @shimizust in #33502
- Replace accelerator.use_fp16 in examples by @hlky in #33513
- Fix parametrization-based weight norm by @ylacombe in #33275
- Fix number of patch check for different vision feature select strategy by @insujang in #32494
- chore: migrate coverage cfg to pyproject.toml by @SauravMaheshkar in #32650
- idefics2 enable_input_require_grads not aligned with disable_input_re… by @sywangyi in #33194
- Update chameleon.md — fix runtime type error by @maxwbuckley in #33494
- Add explicit example for RAG chat templating by @A-Duss in #33503
- CI Build image - move runners by @glegendre01 in #33530
- fix to jamba config, asserting attention and expert offset by @ErezSC42 in #33316
- Fix missing sequences_scores in the Whisper beam search output by @Nik-Kras in #32970
- Uniformize kwargs for Pixtral processor by @yonigozlan in #33521
- Add revision to trainer push_to_hub by @teamclouday in #33482
- fix patch_attention_mask incorrect setting which leads to the differe… by @sywangyi in #33499
- Support LLaVa-OV-Chat by @zucchini-nlp in #33532
- Decorator for easier tool building by @aymeric-roucher in #33439
- Fix for slow the bug tokenizer adding spaces to single id decodes by @DuyguA in #32564
- Chat template: save and load correctly for processors by @zucchini-nlp in #33462
- Fix missing head_dim in llama config from gguf model by @Isotr0py in #33526
- [i18n-ur] Added README_ur.md file by @akkefa in #33461
- fix the wandb logging issue by @ZIYU-DEEP in #33464
- Fix tests in ASR pipeline by @ylacombe in #33545
- Added support for bfloat16 to zero-shot classification pipeline by @umarbutler in #33554
- Pipeline: no side-effects on model.config and model.generation_config 🔫 by @gante in #33480
- Return attention mask in ASR pipeline to avoid warnings by @Rocketknight1 in #33509
- enforce original size to be a list by @dom-dziela in #33564
- Improve compiled RT-DETR inference speed by @yonigozlan in #33412
- Fix bnb dequantization by @SunMarc in #33546
- Load and save video-processor from separate folder by @zucchini-nlp in #33562
- VLMs: enable generation tests by @zucchini-nlp in #33533
- rag: fix CI by @gante in #33578
- Cache: don't show warning in forward passes when past_key_values is None by @gante in #33541
- fix tests with main revision and read token by @molbap in #33560
- add uniform processors for altclip + chinese_clip by @molbap in #31198
- Generate: check that attention_mask is 2D by @gante in #33575
- change sequence_bias type of SequenceBiasLogitsProcessor to list, add… by @VladOS95-cyber in #33375
- [Mamba2] Move dt calculations to kernel by @vasqu in #33520
- Cache: don't throw warnings on gemma2 when instantiating a new cache by @gante in #33595
- Uniformize kwargs for Paligemma processor and update docs by @yonigozlan in #33571
- [tests] skip tests for xpu by @faaany in #33553
- [tests] enable GemmaIntegrationTest on XPU by @faaany in #33555
- Fix Llama 3 TikToken conversion by @pcuenca in #33538
- Docs: add the ability to manually trigger jobs by @gante in #33598
- Fix CircleCI nightly run by @ydshieh in #33558
- Allow CI to be run on private forked repositories (e.g. new model additions) by @ydshieh in #33594
- [tests] make more tests device-agnostic by @faaany in #33580
- Update modeling_mamba2.py, fix pad size by @klae01 in #32599
- Generate: remove flakiness in test_generate_from_inputs_embeds_decoder_only by @gante in #33602
- Remove unnecessary CPM model tests by @amyeroberts in #33621
- Add sdpa for BioGpt by @OmarManzoor in #33592
- VLM generate: tests can't generate image/video tokens by @gante in #33623
- Fix missing test in torch_job by @ydshieh in #33593
- Add support for args to ProcessorMixin for backward compatibility by @yonigozlan in #33479
- Fix contrastive search to correctly handle input with padding by @ducviet00 in #33507
- Generate: assistant should sample when the main model samples by @gante in #33534
- Fix some missing tests in circleci by @ydshieh in #33559
- Update daily ci to use new cluster by @ydshieh in #33627
- Fix qwen2vl float16 inference bug by @GeLee-Q in #33312
- Fix typos by @litianjian in #33583
- enable low-precision pipeline by @jiqing-feng in #31625
- Pixtral update example checkpoint by @amyeroberts in #33633
- Sdpa dino v2 by @avishaiElmakies in #33403
- Clean up Unpack imports by @molbap in #33631
- Fix DPT /Dinov2 sdpa regression on main by @molbap in #33660
- handle dependency errors in check_imports by @molbap in #33622
- add back self.max_position_embeddings = config.max_position_embeddings by @chengchengpei in #33550
- Fix Llava conversion for LlavaQwen2ForCausalLM with Clip vision tower by @Isotr0py in #33613
- Uniformize kwargs for Udop processor and update docs by @yonigozlan in #33628
- Generation: deprecate PreTrainedModel inheriting from GenerationMixin by @gante in #33203
- Enable BNB multi-backend support by @jiqing-feng in #31098
- Fix error string after refactoring into get_chat_template by @tibor-reiss in #33652
- uniformize git processor by @yonigozlan in #33668
- Fix CIs post merging modular transformers by @ArthurZucker in #33681
- Fixed docstring for cohere model regarding unavailability of prune_he… by @mnauf in #33253
- Generation tests: update imagegpt input name, remove unused functions by @gante in #33663
- Improve Error Messaging for Flash Attention 2 on CPU by @sizhky in #33655
- Gemma2: fix config initialization (cache_implementation) by @gante in #33684
- Fix ByteLevel alphabet missing when Sequence pretokenizer is used by @umarbutler in #33556
- Uniformize kwargs for image-text-to-text processors by @yonigozlan in #32544
- 🚨🚨 Setting default behavior of assisted decoding by @jmamou in #33657
- tests: fix pytorch tensor placement errors by @dvrogozh in #33485
- bump tokenizers, fix added tokens fast by @ArthurZucker in #32535
- [Pixtral] Improve docs, rename model by @NielsRogge in #33491
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @enchantee00
- 🌐 [i18n-KO] Translated chat_templating.md to Korean (#32362)
- @faychu
- @010kim
- 🌐 [i18n-KO] Translated ko-llm_tutorial_optimization.md to Korean (#32372)
- @cjfghk5697
- 🌐 [i18n-KO] Translated trainer.md to Korean (#32260)
- @younesbelkada
- @4N3MONE
- 🌐 [i18n-KO] Translated deepspeed.md to Korean (#32431)
- @jerryzh168
- Add TorchAOHfQuantizer (#32306)
- @MHRDYN7
- Add Flax Dinov2 (#31960)
- @kamilakesbi
- Add Descript-Audio-Codec model (#31494)
- @Isotr0py
- Fix incorrect vocab size retrieval in GGUF config (#32551)
- Add chat_template for tokenizer extracted from GGUF model (#32908)
- 🚨 Support dequantization for most GGML types (#32625)
- Fix missing head_dim in llama config from gguf model (#33526)
- Fix Llava conversion for LlavaQwen2ForCausalLM with Clip vision tower (#33613)
- @AhmedAlmaghz
- @simonJJJ
- @jpizarrom
- 🚨 Add Blip2ForImageTextRetrieval (#29261)
- @mayank31398
- @hackyon
- [RoBERTa-based] Add support for sdpa (#30510)
- @Muennighoff
- @VladOS95-cyber
- @jiqing-feng