Model support: OmniGen #2593

Open
Czxck001 opened this issue Nov 3, 2024 · 1 comment
Comments

@Czxck001 (Contributor) commented Nov 3, 2024

OmniGen is a new image generation model built by fine-tuning an existing Phi-3 model into a transformer for diffusion tasks. It appears to have next-level multi-modal capability: it can take images as inputs, refer to them in the prompt text, and compose them into the generated image in a flexible way.
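
For a concrete sense of that interface: in the upstream OmniGen repo, a prompt references an attached image via a placeholder, roughly `The man is the right man in <img><|image_1|></img>` (exact syntax worth double-checking against their README), and the model composes the referenced image into the output.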

It's a capable model, but the architecture is surprisingly simple. Besides the regular patchifying as in DiT and an SDXL VAE, it appears to add only a few layers on top of a standard Phi-3 model and to change the attention mask for image and timestep tokens. This means an implementation of this model can largely borrow from the existing impls of Phi-3, DiT, and the VAE.
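
On the mask change specifically, the paper describes causal attention over the full sequence but bidirectional attention within each image's token span. A minimal sketch of that in candle-style Rust (the function name and the span representation are illustrative, not an existing API):

```rust
use candle_core::{Device, Result, Tensor};

/// Hypothetical helper: build OmniGen's modified attention mask.
/// Attention is causal over the whole sequence, but fully bidirectional
/// within each image-token span (1.0 = attend, 0.0 = masked).
fn omnigen_attn_mask(
    seq_len: usize,
    image_spans: &[(usize, usize)], // half-open (start, end) token ranges
    device: &Device,
) -> Result<Tensor> {
    let mut mask = vec![0f32; seq_len * seq_len];
    // Standard causal part: token i may attend to tokens 0..=i.
    for i in 0..seq_len {
        for j in 0..=i {
            mask[i * seq_len + j] = 1.0;
        }
    }
    // Tokens inside an image span attend bidirectionally within it.
    for &(start, end) in image_spans {
        for i in start..end {
            for j in start..end {
                mask[i * seq_len + j] = 1.0;
            }
        }
    }
    Tensor::from_vec(mask, (seq_len, seq_len), device)
}

fn main() -> Result<()> {
    // Example: a 10-token sequence where tokens 3..7 are one input image.
    let mask = omnigen_attn_mask(10, &[(3, 7)], &Device::Cpu)?;
    println!("{mask}");
    Ok(())
}
```

In an actual port, a mask like this would replace the plain causal mask inside the Phi-3 attention; the VAE encode/decode and the DiT-style patchify/unpatchify should map directly onto code candle already has.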

@LaurentMazare I can give it a try. Let me know if you are already working on it :D.

@LaurentMazare (Collaborator) commented:

Sounds like a nice model to support, feel free to give it a stab.
