Multimodal Large Language Model Support?  #15

Open
@ifsheldon

Description

Hi! Great work!

Have you tried leveraging an MLLM as the prompt encoder? We have open-source MLLMs now, and I think this would be an easy but very powerful extension. For example, we could give image prompts without ControlNet or other mechanisms for injecting image information: we just tell the MLLM what we want with text and images, and SD generates it for us.
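To make the idea concrete, here is a minimal sketch of the interface this would need. All specifics are assumptions for illustration: many open-source MLLMs emit roughly 4096-d hidden states, while SD 1.x cross-attention expects 768-d sequences, so a small learned adapter could bridge the two. Random arrays stand in for the MLLM output; nothing here is from the paper's actual implementation.

```python
import numpy as np

# Hypothetical dimensions: ~4096-d MLLM hidden states vs. the 768-d,
# 77-token conditioning sequence SD 1.x cross-attention expects.
MLLM_DIM, SD_DIM, SEQ_LEN = 4096, 768, 77

rng = np.random.default_rng(0)

# Stand-in for the MLLM's last hidden states over a text+image prompt.
mllm_hidden = rng.standard_normal((1, SEQ_LEN, MLLM_DIM))

# A learned linear adapter would map MLLM space into SD's conditioning
# space; here it is just randomly initialized for shape-checking.
W = rng.standard_normal((MLLM_DIM, SD_DIM)) / np.sqrt(MLLM_DIM)
b = np.zeros(SD_DIM)

# The result has the shape the UNet takes as encoder_hidden_states.
cond = mllm_hidden @ W + b
print(cond.shape)
```

Training such an adapter (plus possibly fine-tuning the MLLM) is exactly why having the training code released would matter.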

Update: I see this mentioned in the Conclusion and Limitations section. If you can release the training code, the community could also try to pursue this direction and adapt various LLMs.
