Model support: OmniGen #2593

Open
Czxck001 opened this issue Nov 3, 2024 · 1 comment
Comments

@Czxck001 (Contributor) commented Nov 3, 2024

OmniGen is a new image generation model built by fine-tuning an existing Phi-3 model into a transformer for diffusion tasks. It appears to have next-level multi-modal capability: it can take images as inputs, refer to them in the prompt text, and compose them into the generated image in a flexible way.
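
For a concrete sense of that interface: in the upstream OmniGen repo, a prompt references an attached image via a placeholder, roughly `The man is the right man in <img><|image_1|></img>` (exact syntax worth double-checking against their README), and the model composes the referenced image into the output.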

It's a capable model, but the architecture is surprisingly simple. Besides the regular patchifying as in DiT and an SDXL VAE, it appears to add only a few layers on top of a standard Phi-3 model and to change the attention mask for image and timestep tokens. This means an implementation of this model can largely borrow from the existing impls of Phi-3, DiT, and the VAE.
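
On the mask change specifically, the paper describes causal attention over the full sequence but bidirectional attention within each image's token span. A minimal sketch of that in candle-style Rust (the function name and the span representation are illustrative, not an existing API):

```rust
use candle_core::{Device, Result, Tensor};

/// Hypothetical helper: build OmniGen's modified attention mask.
/// Attention is causal over the whole sequence, but fully bidirectional
/// within each image-token span (1.0 = attend, 0.0 = masked).
fn omnigen_attn_mask(
    seq_len: usize,
    image_spans: &[(usize, usize)], // half-open (start, end) token ranges
    device: &Device,
) -> Result<Tensor> {
    let mut mask = vec![0f32; seq_len * seq_len];
    // Standard causal part: token i may attend to tokens 0..=i.
    for i in 0..seq_len {
        for j in 0..=i {
            mask[i * seq_len + j] = 1.0;
        }
    }
    // Tokens inside an image span attend bidirectionally within it.
    for &(start, end) in image_spans {
        for i in start..end {
            for j in start..end {
                mask[i * seq_len + j] = 1.0;
            }
        }
    }
    Tensor::from_vec(mask, (seq_len, seq_len), device)
}

fn main() -> Result<()> {
    // Example: a 10-token sequence where tokens 3..7 are one input image.
    let mask = omnigen_attn_mask(10, &[(3, 7)], &Device::Cpu)?;
    println!("{mask}");
    Ok(())
}
```

In an actual port, a mask like this would replace the plain causal mask inside the Phi-3 attention; the VAE encode/decode and the DiT-style patchify/unpatchify should map directly onto code candle already has.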

@LaurentMazare I can give it a try. Let me know if you are already working on it :D.

@LaurentMazare (Collaborator) commented:

Sounds like a nice model to support, feel free to give it a stab.
