Merged
113 commits
00bcfd4
raw start
Cyrilvallez Feb 25, 2025
aef5f66
update
Cyrilvallez Feb 26, 2025
60595b3
update
Cyrilvallez Feb 26, 2025
ddfe10a
add to imports
Cyrilvallez Feb 26, 2025
88f473e
update
Cyrilvallez Feb 26, 2025
5012749
up
Cyrilvallez Feb 26, 2025
bc1d197
simplify configs
Cyrilvallez Feb 26, 2025
e56e7b0
clean configs
Cyrilvallez Feb 26, 2025
8d35ac9
style
Cyrilvallez Feb 26, 2025
f490482
typos
Cyrilvallez Feb 26, 2025
c0e1da4
Update convert_phi4_multimodal_weights_to_hf.py
Cyrilvallez Feb 26, 2025
c435c22
Update convert_phi4_multimodal_weights_to_hf.py
Cyrilvallez Feb 26, 2025
98b393c
fix
Cyrilvallez Feb 26, 2025
0bd29a3
up
Cyrilvallez Feb 26, 2025
52bf0e8
up
Cyrilvallez Feb 26, 2025
a37b084
up
Cyrilvallez Feb 26, 2025
ce4735b
Update convert_phi4_multimodal_weights_to_hf.py
Cyrilvallez Feb 26, 2025
fe9fed1
Update convert_phi4_multimodal_weights_to_hf.py
Cyrilvallez Feb 26, 2025
5fffe53
up
Cyrilvallez Feb 26, 2025
c102b46
up
Cyrilvallez Feb 26, 2025
dbbad21
up
Cyrilvallez Feb 26, 2025
67cad7f
Update feature_extraction_phi4_multimodal.py
Cyrilvallez Feb 26, 2025
cc4cd0e
up
Cyrilvallez Feb 26, 2025
da8b0aa
up
Cyrilvallez Feb 26, 2025
d41fde1
up
Cyrilvallez Feb 26, 2025
f78aec5
up
Cyrilvallez Feb 26, 2025
18c33de
up
Cyrilvallez Feb 26, 2025
abd15e4
simplify configs
Cyrilvallez Feb 26, 2025
438ee1a
typo
Cyrilvallez Feb 26, 2025
01f68a0
cut code
Cyrilvallez Feb 27, 2025
d942e26
typo
Cyrilvallez Feb 27, 2025
0c1d082
typo
Cyrilvallez Feb 27, 2025
e9910cc
typo
Cyrilvallez Feb 27, 2025
28c4d40
re
Cyrilvallez Feb 27, 2025
1c744f0
typo
Cyrilvallez Feb 27, 2025
2659e69
up
Cyrilvallez Feb 27, 2025
d42a60e
up
Cyrilvallez Feb 27, 2025
9a0c374
up
Cyrilvallez Feb 27, 2025
7b83b9e
add tests
Cyrilvallez Feb 27, 2025
23bbdd5
fix
Cyrilvallez Feb 27, 2025
7598e61
fix
Cyrilvallez Feb 27, 2025
a52acd4
Update test_modeling_phi4_multimodal.py
Cyrilvallez Feb 27, 2025
42c9ca5
up
Cyrilvallez Feb 27, 2025
c35fdc0
Update test_modeling_phi4_multimodal.py
Cyrilvallez Feb 27, 2025
3e50728
doc
Cyrilvallez Feb 27, 2025
6638418
fix
Cyrilvallez Feb 27, 2025
09ea6b7
up
Cyrilvallez Feb 27, 2025
41fc578
up
Cyrilvallez Feb 27, 2025
e109109
up
Cyrilvallez Feb 27, 2025
0f4f425
up
Cyrilvallez Feb 27, 2025
ca1d04b
up
Cyrilvallez Feb 27, 2025
6377c06
up
Cyrilvallez Feb 27, 2025
cb829f3
simplify
Cyrilvallez Feb 28, 2025
046cd48
up
Cyrilvallez Feb 28, 2025
3cbd8dd
simplify
Cyrilvallez Feb 28, 2025
07f5827
config docstrings
Cyrilvallez Feb 28, 2025
28b29e9
cleanup
Cyrilvallez Feb 28, 2025
719f204
clean
Cyrilvallez Feb 28, 2025
5dd09fa
typo
Cyrilvallez Feb 28, 2025
741bfdc
typo
Cyrilvallez Feb 28, 2025
4f909ab
fix
Cyrilvallez Feb 28, 2025
406ed4c
Update phi4_multimodal.md
Cyrilvallez Feb 28, 2025
2de1ac3
fix
Cyrilvallez Mar 2, 2025
65bdb47
fix
Cyrilvallez Mar 2, 2025
a20dfeb
Update test_modeling_phi4_multimodal.py
Cyrilvallez Mar 2, 2025
6acd428
update
Cyrilvallez Mar 3, 2025
249d1fb
simplify reshapes and permutes
Cyrilvallez Mar 3, 2025
1da7e9b
up
Cyrilvallez Mar 3, 2025
5bcc6c8
simplify special tokens
Cyrilvallez Mar 3, 2025
014c89e
simplify processor a lot
Cyrilvallez Mar 3, 2025
4bf35b9
Update processing_phi4_multimodal.py
Cyrilvallez Mar 3, 2025
d06a808
Update processing_phi4_multimodal.py
Cyrilvallez Mar 3, 2025
f754f10
switch to fast processor
Cyrilvallez Mar 4, 2025
1d18749
image processor
Cyrilvallez Mar 4, 2025
0177893
Update image_processing_phi4_multimodal_fast.py
Cyrilvallez Mar 4, 2025
1bbb298
add lora extraction to converter
Cyrilvallez Mar 4, 2025
80d0e83
Update convert_phi4_multimodal_weights_to_hf.py
Cyrilvallez Mar 4, 2025
923b2d0
Update __init__.py
Cyrilvallez Mar 4, 2025
136d45a
add AudioInput type in audio_utils
eustlb Mar 7, 2025
fcd909b
rewrite feature_extraction: support torch batched FFT
eustlb Mar 7, 2025
11444c7
input_audio_embeds -> audio_input_features, input_image_embeds -> ima…
eustlb Mar 7, 2025
44c7296
test update
eustlb Mar 7, 2025
33c61fd
not mono channel warning update
eustlb Mar 10, 2025
1a1e024
remove auto maps from processor
Cyrilvallez Mar 10, 2025
e28dbb0
kargs dispatch in processor
Cyrilvallez Mar 10, 2025
07c2153
simplify kwargs dispatch
Cyrilvallez Mar 10, 2025
6334dd7
simplify merging
Cyrilvallez Mar 10, 2025
4aa8086
remove default sampling rate
Cyrilvallez Mar 10, 2025
93323fc
style
Cyrilvallez Mar 10, 2025
95e5597
Update test_modeling_phi4_multimodal.py
Cyrilvallez Mar 10, 2025
37b3dbe
update doc
Cyrilvallez Mar 10, 2025
bc6d6a5
doc
Cyrilvallez Mar 10, 2025
b241377
torch only feature extractor
Cyrilvallez Mar 24, 2025
9c752b2
make fake tokens adjustable
Cyrilvallez Mar 24, 2025
47664e1
Update feature_extraction_phi4_multimodal.py
Cyrilvallez Mar 24, 2025
d9beef2
fix
Cyrilvallez Mar 24, 2025
17985f9
Update processing_phi4_multimodal.py
Cyrilvallez Mar 24, 2025
c169f36
simplify mask
Cyrilvallez Mar 24, 2025
067edbf
last touch
Cyrilvallez Mar 24, 2025
9bee9f3
fix copies
Cyrilvallez Mar 24, 2025
653b8ec
style
Cyrilvallez Mar 24, 2025
4213e97
Update audio_utils.py
Cyrilvallez Mar 24, 2025
2439003
style
Cyrilvallez Mar 24, 2025
16f5ca8
Update feature_extraction_phi4_multimodal.py
Cyrilvallez Mar 24, 2025
5b773c8
Update __init__.py
Cyrilvallez Mar 24, 2025
a70f307
docstrings
Cyrilvallez Mar 24, 2025
ac699b1
copies
Cyrilvallez Mar 24, 2025
aa6664b
fix all checks
Cyrilvallez Mar 24, 2025
c3a1a89
back to fix-copies
Cyrilvallez Mar 24, 2025
095bb8a
trigger CIs
Cyrilvallez Mar 24, 2025
bdc8e38
Update feature_extraction_phi4_multimodal.py
Cyrilvallez Mar 24, 2025
4f52195
improve tests with multimodal inputs
Cyrilvallez Mar 24, 2025
ec726d7
trigger CIs
Cyrilvallez Mar 25, 2025
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -583,6 +583,8 @@
title: Phi
- local: model_doc/phi3
title: Phi-3
- local: model_doc/phi4_multimodal
title: Phi4 Multimodal
- local: model_doc/phimoe
title: PhiMoE
- local: model_doc/phobert
149 changes: 149 additions & 0 deletions docs/source/en/model_doc/phi4_multimodal.md
@@ -0,0 +1,149 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->

# Phi4 Multimodal

## Overview

Phi4 Multimodal is a lightweight open multimodal foundation model that leverages the language, vision, and speech research and datasets used for the Phi-3.5 and 4.0 models. The model processes text, image, and audio inputs, generates text outputs, and has a 128K-token context length. It underwent an enhancement process that combined supervised fine-tuning, direct preference optimization, and RLHF (Reinforcement Learning from Human Feedback) to support precise instruction adherence and safety measures. The languages supported by each modality are the following:

- Text: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
- Vision: English
- Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese

This model was contributed by [Cyril Vallez](https://huggingface.co/cyrilvallez). The most recent code can be
found [here](https://github.com/huggingface/transformers/blob/main/src/transformers/models/phi4_multimodal/modeling_phi4_multimodal.py).


## Usage tips

`Phi4-multimodal-instruct` can be found on the [Hugging Face Hub](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).

In the following, we demonstrate how to use it for inference depending on the input modalities (text, image, audio).

```python
import io
from urllib.request import urlopen

import requests
import soundfile as sf
import torch
from PIL import Image

from transformers import AutoModelForCausalLM, AutoProcessor


# Define model path
model_path = "microsoft/Phi-4-multimodal-instruct"
device = "cuda:0"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device, torch_dtype=torch.float16)

# Optional: load the adapters (note that without them, the base model will very likely not work well)
model.load_adapter(model_path, adapter_name="speech", device_map=device, adapter_kwargs={"subfolder": 'speech-lora'})
model.load_adapter(model_path, adapter_name="vision", device_map=device, adapter_kwargs={"subfolder": 'vision-lora'})

# Define prompt structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# Part 1: Image Processing
model.set_adapter("vision") # if loaded, activate the vision adapter
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open image
image = Image.open(requests.get(image_url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to(device)

# Generate response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    do_sample=False,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')

# Part 2: Audio Processing
model.set_adapter("speech") # if loaded, activate the speech adapter
print("\n--- AUDIO PROCESSING ---")
audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
speech_prompt = "Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation."
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open audio file
audio, sample_rate = sf.read(io.BytesIO(urlopen(audio_url).read()))

# Process with the model
inputs = processor(text=prompt, audios=audio, sample_rate=sample_rate, return_tensors='pt').to(device)

generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    do_sample=False,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
```
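
The prompts in the example above follow a fixed pattern: a `<|user|>` tag, one numbered placeholder per image or audio clip, the instruction, `<|end|>`, and finally `<|assistant|>`. The following is a minimal sketch of a helper that assembles such strings. `build_prompt` is a hypothetical name and not part of `transformers`; the numbering of additional placeholders (`<|image_2|>`, and so on) and the ordering of image tags before audio tags are assumptions extrapolated from the `_1` suffix used above.

```python
def build_prompt(instruction: str, num_images: int = 0, num_audios: int = 0) -> str:
    """Assemble a single-turn prompt in the format used in the example above.

    Hypothetical helper: placeholders are numbered from 1, which is assumed
    from the `<|image_1|>` / `<|audio_1|>` tokens shown in the docs.
    """
    image_tags = "".join(f"<|image_{i}|>" for i in range(1, num_images + 1))
    audio_tags = "".join(f"<|audio_{i}|>" for i in range(1, num_audios + 1))
    return f"<|user|>{image_tags}{audio_tags}{instruction}<|end|><|assistant|>"


# Reproduces the image prompt string built by hand in the example above.
image_prompt = build_prompt("What is shown in this image?", num_images=1)
print(image_prompt)
```

The resulting string can be passed as `text=` to the processor in the same way as the hand-written prompts.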

## Phi4MultimodalFeatureExtractor

[[autodoc]] Phi4MultimodalFeatureExtractor

## Phi4MultimodalImageProcessorFast

[[autodoc]] Phi4MultimodalImageProcessorFast

## Phi4MultimodalProcessor

[[autodoc]] Phi4MultimodalProcessor

## Phi4MultimodalAudioConfig

[[autodoc]] Phi4MultimodalAudioConfig

## Phi4MultimodalVisionConfig

[[autodoc]] Phi4MultimodalVisionConfig

## Phi4MultimodalConfig

[[autodoc]] Phi4MultimodalConfig

## Phi4MultimodalAudioModel

[[autodoc]] Phi4MultimodalAudioModel

## Phi4MultimodalVisionModel

[[autodoc]] Phi4MultimodalVisionModel

## Phi4MultimodalModel

[[autodoc]] Phi4MultimodalModel
- forward

## Phi4MultimodalForCausalLM

[[autodoc]] Phi4MultimodalForCausalLM
- forward
36 changes: 36 additions & 0 deletions src/transformers/__init__.py
@@ -699,6 +699,13 @@
"models.persimmon": ["PersimmonConfig"],
"models.phi": ["PhiConfig"],
"models.phi3": ["Phi3Config"],
"models.phi4_multimodal": [
"Phi4MultimodalAudioConfig",
"Phi4MultimodalConfig",
"Phi4MultimodalFeatureExtractor",
"Phi4MultimodalProcessor",
"Phi4MultimodalVisionConfig",
],
"models.phimoe": ["PhimoeConfig"],
"models.phobert": ["PhobertTokenizer"],
"models.pix2struct": [
@@ -1348,6 +1355,7 @@
_import_structure["models.llava"].append("LlavaImageProcessorFast")
_import_structure["models.llava_next"].append("LlavaNextImageProcessorFast")
_import_structure["models.llava_onevision"].append("LlavaOnevisionImageProcessorFast")
_import_structure["models.phi4_multimodal"].append("Phi4MultimodalImageProcessorFast")
_import_structure["models.pixtral"].append("PixtralImageProcessorFast")
_import_structure["models.qwen2_vl"].append("Qwen2VLImageProcessorFast")
_import_structure["models.rt_detr"].append("RTDetrImageProcessorFast")
@@ -2802,6 +2810,17 @@
"LlavaNextPreTrainedModel",
]
)
_import_structure["models.phi4_multimodal"].extend(
[
"Phi4MultimodalForCausalLM",
"Phi4MultimodalPreTrainedModel",
"Phi4MultimodalAudioModel",
"Phi4MultimodalAudioPreTrainedModel",
"Phi4MultimodalModel",
"Phi4MultimodalVisionModel",
"Phi4MultimodalVisionPreTrainedModel",
]
)
_import_structure["models.llava_next_video"].extend(
[
"LlavaNextVideoForConditionalGeneration",
@@ -5914,6 +5933,13 @@
)
from .models.phi import PhiConfig
from .models.phi3 import Phi3Config
from .models.phi4_multimodal import (
Phi4MultimodalAudioConfig,
Phi4MultimodalConfig,
Phi4MultimodalFeatureExtractor,
Phi4MultimodalProcessor,
Phi4MultimodalVisionConfig,
)
from .models.phimoe import PhimoeConfig
from .models.phobert import PhobertTokenizer
from .models.pix2struct import (
@@ -6587,6 +6613,7 @@
from .models.llava import LlavaImageProcessorFast
from .models.llava_next import LlavaNextImageProcessorFast
from .models.llava_onevision import LlavaOnevisionImageProcessorFast
from .models.phi4_multimodal import Phi4MultimodalImageProcessorFast
from .models.pixtral import PixtralImageProcessorFast
from .models.qwen2_vl import Qwen2VLImageProcessorFast
from .models.rt_detr import RTDetrImageProcessorFast
@@ -8153,6 +8180,15 @@
Phi3Model,
Phi3PreTrainedModel,
)
from .models.phi4_multimodal import (
Phi4MultimodalAudioModel,
Phi4MultimodalAudioPreTrainedModel,
Phi4MultimodalForCausalLM,
Phi4MultimodalModel,
Phi4MultimodalPreTrainedModel,
Phi4MultimodalVisionModel,
Phi4MultimodalVisionPreTrainedModel,
)
from .models.phimoe import (
PhimoeForCausalLM,
PhimoeForSequenceClassification,
7 changes: 6 additions & 1 deletion src/transformers/audio_utils.py
@@ -17,11 +17,16 @@
"""

import warnings
from typing import Optional, Union
from typing import List, Optional, Tuple, Union

import numpy as np


AudioInput = Union[
    np.ndarray, "torch.Tensor", List[np.ndarray], Tuple[np.ndarray], List["torch.Tensor"], Tuple["torch.Tensor"]  # noqa: F821
]


def hertz_to_mel(freq: Union[float, np.ndarray], mel_scale: str = "htk") -> Union[float, np.ndarray]:
"""
Convert frequency from hertz to mels.
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -212,6 +212,7 @@
persimmon,
phi,
phi3,
phi4_multimodal,
phimoe,
phobert,
pix2struct,
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -235,6 +235,7 @@
("persimmon", "PersimmonConfig"),
("phi", "PhiConfig"),
("phi3", "Phi3Config"),
("phi4_multimodal", "Phi4MultimodalConfig"),
("phimoe", "PhimoeConfig"),
("pix2struct", "Pix2StructConfig"),
("pixtral", "PixtralVisionConfig"),
@@ -587,6 +588,7 @@
("persimmon", "Persimmon"),
("phi", "Phi"),
("phi3", "Phi3"),
("phi4_multimodal", "Phi4Multimodal"),
("phimoe", "Phimoe"),
("phobert", "PhoBERT"),
("pix2struct", "Pix2Struct"),
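
The first hunk above registers the key `phi4_multimodal` against the class name `Phi4MultimodalConfig`. The lookup code itself is not part of this diff, so the following is only a rough sketch of how a name table of this shape could be queried. `TOY_CONFIG_NAMES` and `config_class_name_for` are hypothetical names invented for illustration, and the table contains just the entries visible in the hunk.

```python
from collections import OrderedDict

# Toy excerpt of the (model_type, config class name) pairs shown in the hunk above.
TOY_CONFIG_NAMES = OrderedDict(
    [
        ("phi", "PhiConfig"),
        ("phi3", "Phi3Config"),
        ("phi4_multimodal", "Phi4MultimodalConfig"),
        ("phimoe", "PhimoeConfig"),
    ]
)


def config_class_name_for(model_type: str) -> str:
    """Return the registered config class name, or raise a readable error."""
    try:
        return TOY_CONFIG_NAMES[model_type]
    except KeyError:
        known = ", ".join(TOY_CONFIG_NAMES)
        raise ValueError(f"Unrecognized model type {model_type!r}; known types: {known}") from None


print(config_class_name_for("phi4_multimodal"))
```

Storing class names as strings rather than class objects keeps the table cheap to build, since nothing has to be imported until a specific entry is requested.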
1 change: 1 addition & 0 deletions src/transformers/models/auto/feature_extraction_auto.py
@@ -78,6 +78,7 @@
("nat", "ViTFeatureExtractor"),
("owlvit", "OwlViTFeatureExtractor"),
("perceiver", "PerceiverFeatureExtractor"),
("phi4_multimodal", "Phi4MultimodalFeatureExtractor"),
("poolformer", "PoolFormerFeatureExtractor"),
("pop2piano", "Pop2PianoFeatureExtractor"),
("regnet", "ConvNextFeatureExtractor"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/image_processing_auto.py
@@ -124,6 +124,7 @@
("owlvit", ("OwlViTImageProcessor",)),
("paligemma", ("SiglipImageProcessor", "SiglipImageProcessorFast")),
("perceiver", ("PerceiverImageProcessor",)),
("phi4_multimodal", "Phi4MultimodalImageProcessorFast"),
("pix2struct", ("Pix2StructImageProcessor",)),
("pixtral", ("PixtralImageProcessor", "PixtralImageProcessorFast")),
("poolformer", ("PoolFormerImageProcessor",)),
2 changes: 2 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -218,6 +218,7 @@
("persimmon", "PersimmonModel"),
("phi", "PhiModel"),
("phi3", "Phi3Model"),
("phi4_multimodal", "Phi4MultimodalModel"),
("phimoe", "PhimoeModel"),
("pixtral", "PixtralVisionModel"),
("plbart", "PLBartModel"),
@@ -566,6 +567,7 @@
("persimmon", "PersimmonForCausalLM"),
("phi", "PhiForCausalLM"),
("phi3", "Phi3ForCausalLM"),
("phi4_multimodal", "Phi4MultimodalForCausalLM"),
("phimoe", "PhimoeForCausalLM"),
("plbart", "PLBartForCausalLM"),
("prophetnet", "ProphetNetForCausalLM"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/processing_auto.py
@@ -91,6 +91,7 @@
("owlv2", "Owlv2Processor"),
("owlvit", "OwlViTProcessor"),
("paligemma", "PaliGemmaProcessor"),
("phi4_multimodal", "Phi4MultimodalProcessor"),
("pix2struct", "Pix2StructProcessor"),
("pixtral", "PixtralProcessor"),
("pop2piano", "Pop2PianoProcessor"),
32 changes: 32 additions & 0 deletions src/transformers/models/phi4_multimodal/__init__.py
@@ -0,0 +1,32 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_phi4_multimodal import *
    from .feature_extraction_phi4_multimodal import *
    from .image_processing_phi4_multimodal_fast import *
    from .modeling_phi4_multimodal import *
    from .processing_phi4_multimodal import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
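
The `__init__.py` above replaces the package in `sys.modules` with a lazy module, so that submodules are only imported when one of their names is first accessed. The sketch below illustrates that general idea with the standard library only. It is not the `_LazyModule` implementation from `transformers`; the class `ToyLazyModule`, its `loaded` attribute, and the example import structure are all hypothetical.

```python
import importlib
import types


class ToyLazyModule(types.ModuleType):
    """Toy stand-in for a lazy module: resolves names on first attribute access."""

    def __init__(self, name: str, import_structure: dict):
        super().__init__(name)
        # Map each exported name to the module that defines it.
        self._name_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }
        self.loaded = []  # records which modules this object has imported so far

    def __getattr__(self, attr: str):
        # Only called when normal lookup fails, i.e. the name is not cached yet.
        if attr not in self._name_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module = importlib.import_module(self._name_to_module[attr])
        self.loaded.append(module.__name__)
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so later lookups skip __getattr__
        return value


# Standard-library modules stand in for the model's submodules here.
lazy = ToyLazyModule("demo", {"math": ["sqrt"], "json": ["dumps"]})
print(lazy.loaded)     # nothing resolved yet
print(lazy.sqrt(9.0))  # triggers the import of "math" on first access
print(lazy.loaded)
```

Deferring imports this way keeps importing the top-level package fast even when it exposes many models with heavy optional dependencies.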