OmniGen is a new image generation model built by fine-tuning an existing Phi-3 model into a transformer for the diffusion task. It appears to have next-level multi-modal capability, such as taking images as inputs, referencing them in the prompt text, and composing them into the generated image in a flexible way.
It's a capable model, but the architecture is surprisingly simple. Besides the regular patchifying as in DiT and the use of an SDXL VAE, it looks like it only adds a few layers on top of a standard Phi-3 model and changes the attention mask for image and timestep tokens. This means an implementation of this model can largely borrow from the existing impls for Phi3, DiT, and the VAE.
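To make the attention-mask change concrete, here is a minimal sketch of what the modified mask could look like, assuming the rule is: text tokens keep standard causal attention, while tokens belonging to the same image block attend to each other bidirectionally. The `Segment` enum and `build_mask` helper are hypothetical names for illustration, not OmniGen's or candle's API; only basic `candle_core` tensor construction is used.

```rust
use candle_core::{Device, Result, Tensor};

/// Token segments in the packed sequence: plain text, or the i-th image block.
/// (Hypothetical type for this sketch.)
#[derive(Clone, Copy)]
enum Segment {
    Text,
    Image(usize), // image index, so distinct input images don't attend to each other
}

/// Build an additive attention mask of shape (seq_len, seq_len):
/// 0.0 where attention is allowed, -inf where it is blocked.
fn build_mask(segments: &[Segment], device: &Device) -> Result<Tensor> {
    let n = segments.len();
    let mut data = vec![f32::NEG_INFINITY; n * n];
    for q in 0..n {
        for k in 0..n {
            // Standard causal rule: a query may look at itself and earlier tokens.
            let causal = k <= q;
            // Assumed OmniGen-style rule: tokens of the same image block see each
            // other bidirectionally, even if the key comes after the query.
            let same_image = matches!(
                (segments[q], segments[k]),
                (Segment::Image(a), Segment::Image(b)) if a == b
            );
            if causal || same_image {
                data[q * n + k] = 0.0;
            }
        }
    }
    Tensor::from_vec(data, (n, n), device)
}

fn main() -> Result<()> {
    // Example: two text tokens, a three-token image block, then one more text token.
    let segments = [
        Segment::Text,
        Segment::Text,
        Segment::Image(0),
        Segment::Image(0),
        Segment::Image(0),
        Segment::Text,
    ];
    let mask = build_mask(&segments, &Device::Cpu)?;
    println!("{mask}");
    Ok(())
}
```

The rest of the model would then be the usual composition: VAE-encode the image, patchify the latents into tokens, concatenate them with the text and timestep tokens, and run the Phi-3 stack with a mask built along these lines, so most of the code should be reusable from the existing Phi3 and Stable Diffusion VAE implementations.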
@LaurentMazare I can give it a try. Let me know if you are already working on it :D.