
Fix SD2.X clip single file load projection_dim #10770


Merged

DN6 merged 4 commits into huggingface:main from sd2x_ldm_singlefile_fix on Mar 3, 2025

Conversation

Teriks
Contributor

@Teriks Teriks commented Feb 11, 2025

Infer projection_dim from the checkpoint before loading from pretrained, overriding any incorrect hub config.

The hub configuration for SD2.X specifies projection_dim=512, which is incorrect for SD2.X checkpoints loaded from CivitAI and similar sources.

Previously, an exception was thrown when attempting load_model_dict_into_meta for SD2.X single-file checkpoints.

Such LDM models usually require projection_dim=1024 for the CLIP text encoder.
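For context, here is a minimal sketch of the inference idea. The helper and the exact key it probes are illustrative (not the code in this PR); it assumes an SD2.X OpenCLIP state dict, where the fused attention weight of the first transformer block has shape (3 * hidden_size, hidden_size):

```python
import torch

def infer_projection_dim(checkpoint, default=1024):
    """Read the text encoder width off the checkpoint itself.

    In SD2.X OpenCLIP state dicts the fused qkv weight of the first
    transformer block has shape (3 * hidden_size, hidden_size), so
    shape[1] gives the width regardless of what the hub config says.
    """
    key = "cond_stage_model.model.transformer.resblocks.0.attn.in_proj_weight"
    if key in checkpoint:
        return int(checkpoint[key].shape[1])
    return default

# Toy SD2.X-style state dict with hidden_size 1024
ckpt = {"cond_stage_model.model.transformer.resblocks.0.attn.in_proj_weight":
        torch.zeros(3 * 1024, 1024)}
print(infer_projection_dim(ckpt))  # 1024
```

A value inferred this way can then override whatever projection_dim the hub config supplies before the model is instantiated.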

What does this PR do?

Fixes # (issue)

Before submitting

Who can review?

@sayakpaul @yiyixuxu @DN6

@DN6
Collaborator

DN6 commented Feb 14, 2025

@Teriks could you share an example I can use to reproduce the error? Along with a link to the checkpoint you're trying to use?

@Teriks
Contributor Author

Teriks commented Feb 14, 2025

@DN6

Model page: https://civitai.com/models/2711/21-sd-modern-buildings-style-md

Checkpoint: https://civitai.com/api/download/models/3002?type=Model&format=PickleTensor&size=full&fp=fp16

Original Config: https://civitai.com/api/download/models/3002?type=Config&format=Other

Here is a script that reproduces the error condition, along with a checkpoint to test.

This exception occurs with any LDM checkpoint hosted on CivitAI under the SD2.0 and SD2.1 categories.

Some models probably need additional configuration to work fully; the fix I am applying just makes most of them function out of the box.

import diffusers

# https://civitai.com/models/2711/21-sd-modern-buildings-style-md
# These are the ckpt and YAML config from the same page:
# https://civitai.com/api/download/models/3002?type=Model&format=PickleTensor&size=full&fp=fp16
# https://civitai.com/api/download/models/3002?type=Config&format=Other

# this will fail with an exception
pipe = diffusers.StableDiffusionPipeline.from_single_file(
    '21SDModernBuildings_midjourneyBuildings.ckpt',
    original_config='21SDModernBuildings_midjourneyBuildings.yaml')

This fails with the following exception because projection_dim for the text_encoder is wrong in the hub config (taken from SD2.1) for this model:

Fetching 10 files: 100%|██████████| 10/10 [00:00<?, ?it/s]
Loading pipeline components...:  33%|███▎      | 2/6 [00:00<00:00,  8.02it/s]
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    pipe = diffusers.StableDiffusionPipeline.from_single_file(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\Lib\site-packages\huggingface_hub\utils\_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "REDACT\Lib\site-packages\diffusers\loaders\single_file.py", line 495, in from_single_file
    loaded_sub_model = load_single_file_sub_model(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\Lib\site-packages\diffusers\loaders\single_file.py", line 113, in load_single_file_sub_model
    loaded_sub_model = create_diffusers_clip_model_from_ldm(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\Lib\site-packages\diffusers\loaders\single_file_utils.py", line 1571, in create_diffusers_clip_model_from_ldm
    unexpected_keys = load_model_dict_into_meta(model, diffusers_format_checkpoint, dtype=torch_dtype)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\Lib\site-packages\diffusers\models\model_loading_utils.py", line 230, in load_model_dict_into_meta
    raise ValueError(
ValueError: Cannot load  because text_model.encoder.layers.0.self_attn.q_proj.weight expected shape torch.Size([1024, 1024]), but got torch.Size([512, 1024]). If you want to instead overwrite randomly initialized weights, please make sure to pass both `low_cpu_mem_usage=False` and `ignore_mismatched_sizes=True`. For more information, see also: https://github.com/huggingface/diffusers/issues/1619#issuecomment-1345604389 as an example.

@hlky
Contributor

hlky commented Feb 18, 2025

if text_proj_key in checkpoint:
    text_proj_dim = int(checkpoint[text_proj_key].shape[0])
elif hasattr(text_model.config, "projection_dim"):
    text_proj_dim = text_model.config.projection_dim
else:
    text_proj_dim = LDM_OPEN_CLIP_TEXT_PROJECTION_DIM

text_model_dict[diffusers_key + ".q_proj.weight"] = weight_value[:text_proj_dim, :].clone().detach()
text_model_dict[diffusers_key + ".k_proj.weight"] = (
    weight_value[text_proj_dim : text_proj_dim * 2, :].clone().detach()
)
text_model_dict[diffusers_key + ".v_proj.weight"] = weight_value[text_proj_dim * 2 :, :].clone().detach()

We're getting text_proj_dim from either the text_projection key, config.projection_dim, or LDM_OPEN_CLIP_TEXT_PROJECTION_DIM (hard-coded at 1024), then using it to split qkv.

The issue is with both the text_projection key path and config.projection_dim.

config.projection_dim is 512 because it's used as the final output shape in CLIPTextModelWithProjection.

CLIPTextModelWithProjection

self.text_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False)

config.hidden_size is used in CLIPAttention for q_proj shape

CLIPAttention

self.embed_dim = config.hidden_size
self.q_proj = nn.Linear(self.embed_dim, self.embed_dim)
>>> torch.nn.Linear(1024, 512).state_dict()["weight"].shape
torch.Size([512, 1024])

text_proj_dim = int(checkpoint[text_proj_key].shape[0])

This should use shape[1], and the config.projection_dim path can use config.hidden_size instead.
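To make the mismatch concrete, here is a toy reproduction of the split (tensor values are random; only the shapes matter):

```python
import torch

hidden_size = 1024      # SD2.X OpenCLIP text encoder width
projection_dim = 512    # value from the SD2.1 hub config

# Fused qkv weight as stored in an OpenCLIP-style checkpoint:
# q, k and v are stacked along dim 0.
in_proj_weight = torch.randn(3 * hidden_size, hidden_size)

# Correct split: chunk along dim 0 by hidden_size
q, k, v = in_proj_weight.split(hidden_size, dim=0)
assert q.shape == (hidden_size, hidden_size)

# Splitting by projection_dim instead slices (512, 1024) chunks --
# exactly the mismatched shape in the traceback above
bad_q = in_proj_weight[:projection_dim, :]
print(tuple(bad_q.shape))  # (512, 1024)
```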

WDYT @DN6?

Teriks added a commit to Teriks/dgenerate that referenced this pull request Feb 18, 2025
@Teriks
Contributor Author

Teriks commented Mar 1, 2025

@DN6

I am pretty sure the fix described by @hlky is correct.

I can update this PR later.

Teriks added 2 commits March 3, 2025 01:33
Infer projection_dim from the checkpoint before loading
from pretrained, override any incorrect hub config.

Hub configuration for SD2.X specifies projection_dim=512
which is incorrect for SD2.X checkpoints loaded from civitai
and similar.

Exception was previously thrown upon attempting to
load_model_dict_into_meta for SD2.X single file checkpoints.

Such LDM models usually require projection_dim=1024
@Teriks Teriks force-pushed the sd2x_ldm_singlefile_fix branch from 511316c to d25c76a Compare March 3, 2025 07:39
@DN6
Collaborator

DN6 commented Mar 3, 2025

Hmm, so the checkpoint @Teriks shared appears to be missing the cond_stage_model.model.text_projection key, which is weird. Guessing it's because SD 2.X doesn't use CLIPTextModelWithProjection.

Your suggested change looks good @hlky. I think we would only need to swap config.projection_dim for config.hidden_size though. The text_projection dims for axes 0 and 1 in OpenCLIP are the same. @Teriks would you mind updating the PR please?
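Putting the two suggestions together, here is a hedged sketch of the amended lookup order. The function name is illustrative and this is not the exact merged diff, but the precedence matches the discussion: read the width off the checkpoint tensor's shape[1] when the key exists, fall back to config.hidden_size, then to the hard-coded OpenCLIP default:

```python
import types
import torch

LDM_OPEN_CLIP_TEXT_PROJECTION_DIM = 1024  # diffusers' hard-coded fallback

def infer_text_proj_dim(checkpoint, text_model_config, text_proj_key):
    """Illustrative lookup order after the fix."""
    if text_proj_key in checkpoint:
        # shape[1] is the input dim of text_projection, i.e. hidden_size;
        # safe even when axes 0 and 1 differ (CLIPTextModelWithProjection)
        return int(checkpoint[text_proj_key].shape[1])
    if hasattr(text_model_config, "hidden_size"):
        return text_model_config.hidden_size
    return LDM_OPEN_CLIP_TEXT_PROJECTION_DIM

key = "cond_stage_model.model.text_projection"
cfg = types.SimpleNamespace(hidden_size=1024, projection_dim=512)

# With the key present: read hidden_size off the tensor itself
ckpt = {key: torch.zeros(1024, 1024)}
print(infer_text_proj_dim(ckpt, cfg, key))  # 1024

# Without the key (as in the reported checkpoint): use hidden_size,
# not projection_dim, so SD2.X still splits qkv correctly
print(infer_text_proj_dim({}, cfg, key))    # 1024
```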

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@DN6 DN6 merged commit 9e910c4 into huggingface:main Mar 3, 2025
10 of 12 checks passed
Teriks added a commit to Teriks/dgenerate that referenced this pull request Apr 16, 2025