Open
Hi! Great work!
Have you tried using a multimodal LLM (MLLM) as the prompt encoder? Open-source MLLMs are available now, and I think this would be an easy but very powerful extension. For example, we could give image prompts without ControlNet or other mechanisms for injecting image information: we just tell the MLLM what we want with text and images, and SD generates it for us.
Update: I see this is already mentioned in Conclusion and Limitation. If you can release the training code, the community could also explore this direction and adapt various LLMs.