
Conversation

@DoctorKey

What does this PR do?

This PR introduces Ovis-Image into the diffusers library. Ovis-Image integrates a diffusion-based visual decoder with the Ovis 2.5 multimodal backbone, leveraging a text-centric training pipeline that combines large-scale pre-training with carefully tailored post-training refinements. Despite its compact architecture, Ovis-Image achieves text rendering performance on par with significantly larger open models such as Qwen-Image and approaches closed-source systems like Seedream and GPT4o.
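
A minimal usage sketch of what this integration enables. The checkpoint id and the automatic pipeline resolution are assumptions for illustration; the actual class name and call arguments are defined in this PR (see e.g. true_cfg_scale in the signature discussed below).

import torch
from diffusers import DiffusionPipeline

# Hypothetical checkpoint id; DiffusionPipeline resolves the concrete pipeline class
# added in this PR from the repository's model_index.json.
pipe = DiffusionPipeline.from_pretrained("AIDC-AI/Ovis-Image", torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = pipe(
    prompt="A storefront sign that reads 'Ovis-Image', photorealistic",
    negative_prompt="blurry, low quality",
    true_cfg_scale=5.0,  # default discussed further down in this review
).images[0]
image.save("ovis_image.png")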

@DoctorKey
Author

Ovis-Image has been released:

@yiyixuxu
Collaborator

yiyixuxu commented Dec 1, 2025

@bot /style

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@yiyixuxu yiyixuxu left a comment


Thanks so much for the PR! I left some feedback, and I think we can merge this very soon.

Congrats on the release!! Sorry we overlooked the PR (it was the Thanksgiving holiday in the US).
We will reach out to set up a collaboration channel for your future releases.

Comment on lines +365 to +416
def enable_vae_slicing(self):
r"""
Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
"""
depr_message = f"Calling `enable_vae_slicing()` on a `{self.__class__.__name__}` is deprecated and this method will be removed in a future version. Please use `pipe.vae.enable_slicing()`."
deprecate(
"enable_vae_slicing",
"0.40.0",
depr_message,
)
self.vae.enable_slicing()

def disable_vae_slicing(self):
r"""
Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
computing decoding in one step.
"""
depr_message = f"Calling `disable_vae_slicing()` on a `{self.__class__.__name__}` is deprecated and this method will be removed in a future version. Please use `pipe.vae.disable_slicing()`."
deprecate(
"disable_vae_slicing",
"0.40.0",
depr_message,
)
self.vae.disable_slicing()

def enable_vae_tiling(self):
r"""
Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
processing larger images.
"""
depr_message = f"Calling `enable_vae_tiling()` on a `{self.__class__.__name__}` is deprecated and this method will be removed in a future version. Please use `pipe.vae.enable_tiling()`."
deprecate(
"enable_vae_tiling",
"0.40.0",
depr_message,
)
self.vae.enable_tiling()

def disable_vae_tiling(self):
r"""
Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
computing decoding in one step.
"""
depr_message = f"Calling `disable_vae_tiling()` on a `{self.__class__.__name__}` is deprecated and this method will be removed in a future version. Please use `pipe.vae.disable_tiling()`."
deprecate(
"disable_vae_tiling",
"0.40.0",
depr_message,
)
self.vae.disable_tiling()
Collaborator


Suggested change
def enable_vae_slicing(self):
r"""
Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
"""
depr_message = f"Calling `enable_vae_slicing()` on a `{self.__class__.__name__}` is deprecated and this method will be removed in a future version. Please use `pipe.vae.enable_slicing()`."
deprecate(
"enable_vae_slicing",
"0.40.0",
depr_message,
)
self.vae.enable_slicing()
def disable_vae_slicing(self):
r"""
Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
computing decoding in one step.
"""
depr_message = f"Calling `disable_vae_slicing()` on a `{self.__class__.__name__}` is deprecated and this method will be removed in a future version. Please use `pipe.vae.disable_slicing()`."
deprecate(
"disable_vae_slicing",
"0.40.0",
depr_message,
)
self.vae.disable_slicing()
def enable_vae_tiling(self):
r"""
Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
processing larger images.
"""
depr_message = f"Calling `enable_vae_tiling()` on a `{self.__class__.__name__}` is deprecated and this method will be removed in a future version. Please use `pipe.vae.enable_tiling()`."
deprecate(
"enable_vae_tiling",
"0.40.0",
depr_message,
)
self.vae.enable_tiling()
def disable_vae_tiling(self):
r"""
Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
computing decoding in one step.
"""
depr_message = f"Calling `disable_vae_tiling()` on a `{self.__class__.__name__}` is deprecated and this method will be removed in a future version. Please use `pipe.vae.disable_tiling()`."
deprecate(
"disable_vae_tiling",
"0.40.0",
depr_message,
)
self.vae.disable_tiling()
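
For reference, the deprecation messages above already point at the replacement; a minimal sketch of the recommended direct calls on the VAE, assuming a loaded pipeline object named pipe:

# Memory-saving options called directly on the VAE instead of through the
# deprecated pipeline-level wrappers above.
pipe.vae.enable_slicing()   # decode the batch in slices to lower peak memory
pipe.vae.enable_tiling()    # decode/encode in tiles to handle larger images

# Revert to single-step decoding when no longer needed.
pipe.vae.disable_slicing()
pipe.vae.disable_tiling()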

self,
prompt: Union[str, List[str]] = None,
negative_prompt: Union[str, List[str]] = None,
true_cfg_scale: float = 5.0,
Collaborator


Suggested change
true_cfg_scale: float = 5.0,
guidance_scale: float = 5.0,

Can we use `guidance_scale` if it is not a distilled checkpoint? Since the model is already out with this PR, we can add a deprecation message if you prefer.
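
If backward compatibility is wanted, one possible shim is to keep accepting true_cfg_scale and warn, mirroring the deprecate pattern used by the VAE wrappers above. A sketch as a standalone helper (the helper name and the choice to keep true_cfg_scale as an optional argument are assumptions):

from typing import Optional

from diffusers.utils import deprecate


def resolve_guidance_scale(guidance_scale: float = 5.0, true_cfg_scale: Optional[float] = None) -> float:
    # Hypothetical shim: map the old `true_cfg_scale` argument onto `guidance_scale`,
    # emitting the same style of deprecation warning used elsewhere in this pipeline.
    if true_cfg_scale is not None:
        deprecate(
            "true_cfg_scale",
            "0.40.0",
            "Passing `true_cfg_scale` is deprecated; please use `guidance_scale` instead.",
        )
        guidance_scale = true_cfg_scale
    return guidance_scale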

Comment on lines +669 to +670
if image_embeds is not None:
self._joint_attention_kwargs["ip_adapter_image_embeds"] = image_embeds
Collaborator


Suggested change
if image_embeds is not None:
self._joint_attention_kwargs["ip_adapter_image_embeds"] = image_embeds

Let's remove the IP-Adapter related logic if we don't support it yet.


device = self._execution_device

has_neg_prompt = negative_prompt is not None or (
Collaborator


The Flux/Qwen pipelines were written this way to support both distilled guidance and regular CFG. The user experience was pretty bad, and we regret that design choice very much.
If Ovis only supports regular CFG, let's not follow their path :)

Collaborator


For standard CFG, one pattern you can use is:

prompt_embeds, text_ids = self.encode_prompt(...)
if do_classifier_free_guidance:
    negative_prompt_embeds, negative_text_ids = self.encode_prompt(...)
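
And in the denoising loop, the two predictions are combined with the standard CFG formula, roughly as follows (a sketch; the function and tensor names are illustrative):

import torch


def apply_cfg(noise_pred: torch.Tensor, neg_noise_pred: torch.Tensor, guidance_scale: float) -> torch.Tensor:
    # Classifier-free guidance: move the conditional prediction away from the
    # unconditional (negative-prompt) prediction by the guidance scale.
    return neg_noise_pred + guidance_scale * (noise_pred - neg_noise_pred)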

Comment on lines +656 to +657
image_embeds = None
negative_image_embeds = None
Collaborator


Suggested change
image_embeds = None
negative_image_embeds = None
