
Fix SD2.X clip single file load projection_dim #10770


Merged

DN6 merged 4 commits into huggingface:main from sd2x_ldm_singlefile_fix on Mar 3, 2025

Conversation

Teriks
Contributor

@Teriks Teriks commented Feb 11, 2025

Infer projection_dim from the checkpoint before loading from pretrained, overriding any incorrect hub config.

The hub configuration for SD2.X specifies projection_dim=512, which is incorrect for SD2.X checkpoints loaded from CivitAI and similar sources.

Previously, an exception was thrown when attempting load_model_dict_into_meta for SD2.X single-file checkpoints.

Such LDM models usually require projection_dim=1024 for the CLIP text encoder.
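For context, here is a minimal sketch of the inference idea. The helper and the exact key it probes are illustrative (not the code in this PR); it assumes an SD2.X OpenCLIP state dict, where the fused attention weight of the first transformer block has shape (3 * hidden_size, hidden_size):

```python
import torch

def infer_projection_dim(checkpoint, default=1024):
    """Read the text encoder width off the checkpoint itself.

    In SD2.X OpenCLIP state dicts the fused qkv weight of the first
    transformer block has shape (3 * hidden_size, hidden_size), so
    shape[1] gives the width regardless of what the hub config says.
    """
    key = "cond_stage_model.model.transformer.resblocks.0.attn.in_proj_weight"
    if key in checkpoint:
        return int(checkpoint[key].shape[1])
    return default

# Toy SD2.X-style state dict with hidden_size 1024
ckpt = {"cond_stage_model.model.transformer.resblocks.0.attn.in_proj_weight":
        torch.zeros(3 * 1024, 1024)}
print(infer_projection_dim(ckpt))  # 1024
```

A value inferred this way can then override whatever projection_dim the hub config supplies before the model is instantiated.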

What does this PR do?

Fixes # (issue)

Before submitting

Who can review?

@sayakpaul @yiyixuxu @DN6

@DN6
Collaborator

DN6 commented Feb 14, 2025

@Teriks could you share an example I can use to reproduce the error? Along with a link to the checkpoint you're trying to use?

@Teriks
Contributor Author

Teriks commented Feb 14, 2025

@DN6

Model page: https://civitai.com/models/2711/21-sd-modern-buildings-style-md

Checkpoint: https://civitai.com/api/download/models/3002?type=Model&format=PickleTensor&size=full&fp=fp16

Original Config: https://civitai.com/api/download/models/3002?type=Config&format=Other

Here is a script that reproduces the error condition, along with a checkpoint to test.

This exception occurs with any LDM checkpoint hosted on CivitAI under the SD2.0 and SD2.1 categories.

Some models probably need additional configuration to work fully; the fix I am applying just makes most of them function out of the box.

import diffusers

# https://civitai.com/models/2711/21-sd-modern-buildings-style-md
# These are the ckpt and YAML config from the same page:
# https://civitai.com/api/download/models/3002?type=Model&format=PickleTensor&size=full&fp=fp16
# https://civitai.com/api/download/models/3002?type=Config&format=Other

# this will fail with an exception
pipe = diffusers.StableDiffusionPipeline.from_single_file(
    '21SDModernBuildings_midjourneyBuildings.ckpt',
    original_config='21SDModernBuildings_midjourneyBuildings.yaml')

This fails with the following exception because projection_dim for the text_encoder is wrong in the hub config (taken from SD2.1) for this model:

Fetching 10 files: 100%|██████████| 10/10 [00:00<?, ?it/s]
Loading pipeline components...:  33%|███▎      | 2/6 [00:00<00:00,  8.02it/s]
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    pipe = diffusers.StableDiffusionPipeline.from_single_file(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\Lib\site-packages\huggingface_hub\utils\_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "REDACT\Lib\site-packages\diffusers\loaders\single_file.py", line 495, in from_single_file
    loaded_sub_model = load_single_file_sub_model(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\Lib\site-packages\diffusers\loaders\single_file.py", line 113, in load_single_file_sub_model
    loaded_sub_model = create_diffusers_clip_model_from_ldm(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\Lib\site-packages\diffusers\loaders\single_file_utils.py", line 1571, in create_diffusers_clip_model_from_ldm
    unexpected_keys = load_model_dict_into_meta(model, diffusers_format_checkpoint, dtype=torch_dtype)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\Lib\site-packages\diffusers\models\model_loading_utils.py", line 230, in load_model_dict_into_meta
    raise ValueError(
ValueError: Cannot load  because text_model.encoder.layers.0.self_attn.q_proj.weight expected shape torch.Size([1024, 1024]), but got torch.Size([512, 1024]). If you want to instead overwrite randomly initialized weights, please make sure to pass both `low_cpu_mem_usage=False` and `ignore_mismatched_sizes=True`. For more information, see also: https://github.com/huggingface/diffusers/issues/1619#issuecomment-1345604389 as an example.

@hlky
Contributor

hlky commented Feb 18, 2025

if text_proj_key in checkpoint:
    text_proj_dim = int(checkpoint[text_proj_key].shape[0])
elif hasattr(text_model.config, "projection_dim"):
    text_proj_dim = text_model.config.projection_dim
else:
    text_proj_dim = LDM_OPEN_CLIP_TEXT_PROJECTION_DIM

text_model_dict[diffusers_key + ".q_proj.weight"] = weight_value[:text_proj_dim, :].clone().detach()
text_model_dict[diffusers_key + ".k_proj.weight"] = (
    weight_value[text_proj_dim : text_proj_dim * 2, :].clone().detach()
)
text_model_dict[diffusers_key + ".v_proj.weight"] = weight_value[text_proj_dim * 2 :, :].clone().detach()

We're getting text_proj_dim from either the text_projection key, config.projection_dim, or LDM_OPEN_CLIP_TEXT_PROJECTION_DIM (hard-coded at 1024), then using it to split qkv.

The issue is with both the text_projection key path and config.projection_dim.

config.projection_dim is 512 because it's used as the final output shape in CLIPTextModelWithProjection.

CLIPTextModelWithProjection

self.text_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False)

config.hidden_size is used in CLIPAttention for q_proj shape

CLIPAttention

self.embed_dim = config.hidden_size
self.q_proj = nn.Linear(self.embed_dim, self.embed_dim)
>>> torch.nn.Linear(1024, 512).state_dict()["weight"].shape
torch.Size([512, 1024])

text_proj_dim = int(checkpoint[text_proj_key].shape[0])

This should use shape[1], and the config.projection_dim path can use config.hidden_size instead.
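To make the mismatch concrete, here is a toy reproduction of the split (tensor values are random; only the shapes matter):

```python
import torch

hidden_size = 1024      # SD2.X OpenCLIP text encoder width
projection_dim = 512    # value from the SD2.1 hub config

# Fused qkv weight as stored in an OpenCLIP-style checkpoint:
# q, k and v are stacked along dim 0.
in_proj_weight = torch.randn(3 * hidden_size, hidden_size)

# Correct split: chunk along dim 0 by hidden_size
q, k, v = in_proj_weight.split(hidden_size, dim=0)
assert q.shape == (hidden_size, hidden_size)

# Splitting by projection_dim instead slices (512, 1024) chunks --
# exactly the mismatched shape in the traceback above
bad_q = in_proj_weight[:projection_dim, :]
print(tuple(bad_q.shape))  # (512, 1024)
```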

WDYT @DN6?

Teriks added a commit to Teriks/dgenerate that referenced this pull request Feb 18, 2025
@Teriks
Contributor Author

Teriks commented Mar 1, 2025

@DN6

I am pretty sure the fix described by @hlky is correct.

I can update this PR later.

Teriks added 2 commits March 3, 2025 01:33
Infer projection_dim from the checkpoint before loading
from pretrained, override any incorrect hub config.

Hub configuration for SD2.X specifies projection_dim=512
which is incorrect for SD2.X checkpoints loaded from civitai
and similar.

Exception was previously thrown upon attempting to
load_model_dict_into_meta for SD2.X single file checkpoints.

Such LDM models usually require projection_dim=1024
@Teriks Teriks force-pushed the sd2x_ldm_singlefile_fix branch from 511316c to d25c76a Compare March 3, 2025 07:39
@DN6
Collaborator

DN6 commented Mar 3, 2025

Hmm, so the checkpoint @Teriks shared appears to be missing the cond_stage_model.model.text_projection key, which is weird. Guessing it's because SD 2.X doesn't use CLIPTextModelWithProjection.

Your suggested change looks good @hlky. I think we would only need to swap config.projection_dim for config.hidden_size though. The text_projection dims for axes 0 and 1 in OpenCLIP are the same. @Teriks would you mind updating the PR please?
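Putting the two suggestions together, here is a hedged sketch of the amended lookup order. The function name is illustrative and this is not the exact merged diff, but the precedence matches the discussion: read the width off the checkpoint tensor's shape[1] when the key exists, fall back to config.hidden_size, then to the hard-coded OpenCLIP default:

```python
import types
import torch

LDM_OPEN_CLIP_TEXT_PROJECTION_DIM = 1024  # diffusers' hard-coded fallback

def infer_text_proj_dim(checkpoint, text_model_config, text_proj_key):
    """Illustrative lookup order after the fix."""
    if text_proj_key in checkpoint:
        # shape[1] is the input dim of text_projection, i.e. hidden_size;
        # safe even when axes 0 and 1 differ (CLIPTextModelWithProjection)
        return int(checkpoint[text_proj_key].shape[1])
    if hasattr(text_model_config, "hidden_size"):
        return text_model_config.hidden_size
    return LDM_OPEN_CLIP_TEXT_PROJECTION_DIM

key = "cond_stage_model.model.text_projection"
cfg = types.SimpleNamespace(hidden_size=1024, projection_dim=512)

# With the key present: read hidden_size off the tensor itself
ckpt = {key: torch.zeros(1024, 1024)}
print(infer_text_proj_dim(ckpt, cfg, key))  # 1024

# Without the key (as in the reported checkpoint): use hidden_size,
# not projection_dim, so SD2.X still splits qkv correctly
print(infer_text_proj_dim({}, cfg, key))    # 1024
```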

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@DN6 DN6 merged commit 9e910c4 into huggingface:main Mar 3, 2025
10 of 12 checks passed
Teriks added a commit to Teriks/dgenerate that referenced this pull request Apr 16, 2025