
Commit 0719bbe

Merge branch 'main' into fix-types
2 parents: aba2793 + 5afbcce

26 files changed: +612 / -45 lines

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -525,6 +525,8 @@
       title: Kandinsky 2.2
     - local: api/pipelines/kandinsky3
       title: Kandinsky 3
+    - local: api/pipelines/kandinsky5
+      title: Kandinsky 5
     - local: api/pipelines/kolors
       title: Kolors
     - local: api/pipelines/latent_consistency_models

docs/source/en/api/models/chroma_transformer.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.

 # ChromaTransformer2DModel

-A modified flux Transformer model from [Chroma](https://huggingface.co/lodestones/Chroma)
+A modified flux Transformer model from [Chroma](https://huggingface.co/lodestones/Chroma1-HD)

 ## ChromaTransformer2DModel

docs/source/en/api/pipelines/chroma.md

Lines changed: 7 additions & 6 deletions
@@ -19,20 +19,21 @@ specific language governing permissions and limitations under the License.

 Chroma is a text to image generation model based on Flux.

-Original model checkpoints for Chroma can be found [here](https://huggingface.co/lodestones/Chroma).
+Original model checkpoints for Chroma can be found here:
+* High-resolution finetune: [lodestones/Chroma1-HD](https://huggingface.co/lodestones/Chroma1-HD)
+* Base model: [lodestones/Chroma1-Base](https://huggingface.co/lodestones/Chroma1-Base)
+* Original repo with progress checkpoints: [lodestones/Chroma](https://huggingface.co/lodestones/Chroma) (loading this repo with `from_pretrained` will load a Diffusers-compatible version of the `unlocked-v37` checkpoint)

 > [!TIP]
 > Chroma can use all the same optimizations as Flux.

 ## Inference

-The Diffusers version of Chroma is based on the [`unlocked-v37`](https://huggingface.co/lodestones/Chroma/blob/main/chroma-unlocked-v37.safetensors) version of the original model, which is available in the [Chroma repository](https://huggingface.co/lodestones/Chroma).
-
 ```python
 import torch
 from diffusers import ChromaPipeline

-pipe = ChromaPipeline.from_pretrained("lodestones/Chroma", torch_dtype=torch.bfloat16)
+pipe = ChromaPipeline.from_pretrained("lodestones/Chroma1-HD", torch_dtype=torch.bfloat16)
 pipe.enable_model_cpu_offload()

 prompt = [

@@ -63,10 +64,10 @@ Then run the following example
 import torch
 from diffusers import ChromaTransformer2DModel, ChromaPipeline

-model_id = "lodestones/Chroma"
+model_id = "lodestones/Chroma1-HD"
 dtype = torch.bfloat16

-transformer = ChromaTransformer2DModel.from_single_file("https://huggingface.co/lodestones/Chroma/blob/main/chroma-unlocked-v37.safetensors", torch_dtype=dtype)
+transformer = ChromaTransformer2DModel.from_single_file("https://huggingface.co/lodestones/Chroma1-HD/blob/main/Chroma1-HD.safetensors", torch_dtype=dtype)

 pipe = ChromaPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=dtype)
 pipe.enable_model_cpu_offload()
docs/source/en/api/pipelines/kandinsky5.md

Lines changed: 149 additions & 0 deletions

@@ -0,0 +1,149 @@

<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Kandinsky 5.0

Kandinsky 5.0 is created by the Kandinsky team: Alexey Letunovskiy, Maria Kovaleva, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Dmitrii Mikhailov, Anna Averchenkova, Andrey Shutkin, Julia Agafonova, Olga Kim, Anastasiia Kargapoltseva, Nikita Kiselev, Anna Dmitrienko, Anastasia Maltseva, Kirill Chernyshev, Ilia Vasiliev, Viacheslav Vasilev, Vladimir Polovnikov, Yury Kolabushin, Alexander Belykh, Mikhail Mamaev, Anastasia Aliaskina, Tatiana Nikulina, Polina Gavrilova, Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Denis Dimitrov.

Kandinsky 5.0 is a family of diffusion models for video and image generation. Kandinsky 5.0 T2V Lite is a lightweight video generation model (2B parameters) that ranks #1 among open-source models in its class. It outperforms larger models and offers the best understanding of Russian concepts in the open-source ecosystem.

The model introduces several key innovations:
- **Latent diffusion pipeline** with **Flow Matching** for improved training stability
- **Diffusion Transformer (DiT)** as the main generative backbone, with cross-attention to text embeddings
- Dual text encoding using **Qwen2.5-VL** and **CLIP** for comprehensive text understanding
- **HunyuanVideo 3D VAE** for efficient video encoding and decoding
- **Sparse attention mechanisms** (NABLA) for efficient long-sequence processing

The original codebase can be found at [ai-forever/Kandinsky-5](https://github.com/ai-forever/Kandinsky-5).
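
These components can be inspected directly on a loaded pipeline. A minimal sketch, assuming the attribute names follow the usual Diffusers conventions (`transformer`, `vae`, `text_encoder`, `text_encoder_2`); check the model repository for the exact layout:

```python
# Sketch: inspect the architecture described above. The attribute names are
# assumed from common Diffusers conventions, not confirmed by this commit.
import torch
from diffusers import Kandinsky5T2VPipeline

pipe = Kandinsky5T2VPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite-sft-5s-Diffusers", torch_dtype=torch.bfloat16
)

print(type(pipe.transformer).__name__)     # DiT generative backbone
print(type(pipe.vae).__name__)             # HunyuanVideo 3D VAE
print(type(pipe.text_encoder).__name__)    # one of the dual text encoders (Qwen2.5-VL / CLIP)
print(type(pipe.text_encoder_2).__name__)  # the other text encoder
```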

> [!TIP]
> Check out the [AI Forever](https://huggingface.co/ai-forever) organization on the Hub for the official model checkpoints for text-to-video generation, including pretrained, SFT, no-CFG, and distilled variants.

## Available Models

Kandinsky 5.0 T2V Lite comes in several variants optimized for different use cases:

| model_id | Description | Use Cases |
|---|---|---|
| **ai-forever/Kandinsky-5.0-T2V-Lite-sft-5s-Diffusers** | 5-second supervised fine-tuned model | Highest generation quality |
| **ai-forever/Kandinsky-5.0-T2V-Lite-sft-10s-Diffusers** | 10-second supervised fine-tuned model | Highest generation quality |
| **ai-forever/Kandinsky-5.0-T2V-Lite-nocfg-5s-Diffusers** | 5-second classifier-free-guidance distilled model | 2× faster inference |
| **ai-forever/Kandinsky-5.0-T2V-Lite-nocfg-10s-Diffusers** | 10-second classifier-free-guidance distilled model | 2× faster inference |
| **ai-forever/Kandinsky-5.0-T2V-Lite-distilled16steps-5s-Diffusers** | 5-second diffusion model distilled to 16 steps | 6× faster inference, minimal quality loss |
| **ai-forever/Kandinsky-5.0-T2V-Lite-distilled16steps-10s-Diffusers** | 10-second diffusion model distilled to 16 steps | 6× faster inference, minimal quality loss |
| **ai-forever/Kandinsky-5.0-T2V-Lite-pretrain-5s-Diffusers** | 5-second base pretrained model | Research and fine-tuning |
| **ai-forever/Kandinsky-5.0-T2V-Lite-pretrain-10s-Diffusers** | 10-second base pretrained model | Research and fine-tuning |

All models are available in 5-second and 10-second video generation versions.

## Kandinsky5T2VPipeline

[[autodoc]] Kandinsky5T2VPipeline
  - all
  - __call__

## Usage Examples

### Basic Text-to-Video Generation

```python
import torch
from diffusers import Kandinsky5T2VPipeline
from diffusers.utils import export_to_video

# Load the pipeline
model_id = "ai-forever/Kandinsky-5.0-T2V-Lite-sft-5s-Diffusers"
pipe = Kandinsky5T2VPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

# Generate video
prompt = "A cat and a dog baking a cake together in a kitchen."
negative_prompt = "Static, 2D cartoon, cartoon, 2d animation, paintings, images, worst quality, low quality, ugly, deformed, walking backwards"

output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=512,
    width=768,
    num_frames=121,  # ~5 seconds at 24fps
    num_inference_steps=50,
    guidance_scale=5.0,
).frames[0]

export_to_video(output, "output.mp4", fps=24, quality=9)
```

### 10-second Models

**⚠️ Warning!** All 10-second models should be used with Flex attention and `max-autotune-no-cudagraphs` compilation:

```python
import torch
from diffusers import Kandinsky5T2VPipeline
from diffusers.utils import export_to_video

pipe = Kandinsky5T2VPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite-sft-10s-Diffusers",
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")

pipe.transformer.set_attention_backend("flex")  # <--- Set attention backend to Flex
pipe.transformer.compile(
    mode="max-autotune-no-cudagraphs",
    dynamic=True,
)  # <--- Compile with max-autotune-no-cudagraphs

prompt = "A cat and a dog baking a cake together in a kitchen."
negative_prompt = "Static, 2D cartoon, cartoon, 2d animation, paintings, images, worst quality, low quality, ugly, deformed, walking backwards"

output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=512,
    width=768,
    num_frames=241,  # ~10 seconds at 24fps
    num_inference_steps=50,
    guidance_scale=5.0,
).frames[0]

export_to_video(output, "output.mp4", fps=24, quality=9)
```

### Diffusion Distilled Model

**⚠️ Warning!** All no-CFG and diffusion-distilled models should be run without CFG (`guidance_scale=1.0`):

```python
import torch
from diffusers import Kandinsky5T2VPipeline
from diffusers.utils import export_to_video

model_id = "ai-forever/Kandinsky-5.0-T2V-Lite-distilled16steps-5s-Diffusers"
pipe = Kandinsky5T2VPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

output = pipe(
    prompt="A beautiful sunset over mountains",
    num_inference_steps=16,  # <--- Model is distilled to 16 steps
    guidance_scale=1.0,  # <--- no CFG
).frames[0]

export_to_video(output, "output.mp4", fps=24, quality=9)
```

## Citation

```bibtex
@misc{kandinsky2025,
    author = {Alexey Letunovskiy and Maria Kovaleva and Ivan Kirillov and Lev Novitskiy and Denis Koposov and
              Dmitrii Mikhailov and Anna Averchenkova and Andrey Shutkin and Julia Agafonova and Olga Kim and
              Anastasiia Kargapoltseva and Nikita Kiselev and Vladimir Arkhipkin and Vladimir Korviakov and
              Nikolai Gerasimenko and Denis Parkhomenko and Anna Dmitrienko and Anastasia Maltseva and
              Kirill Chernyshev and Ilia Vasiliev and Viacheslav Vasilev and Vladimir Polovnikov and
              Yury Kolabushin and Alexander Belykh and Mikhail Mamaev and Anastasia Aliaskina and
              Tatiana Nikulina and Polina Gavrilova and Denis Dimitrov},
    title = {Kandinsky 5.0: A family of diffusion models for Video & Image generation},
    howpublished = {\url{https://github.com/ai-forever/Kandinsky-5}},
    year = 2025
}
```

docs/source/en/optimization/attention_backends.md

Lines changed: 2 additions & 0 deletions
@@ -21,6 +21,7 @@ Refer to the table below for an overview of the available attention families and

 | attention family | main feature |
 |---|---|
 | FlashAttention | minimizes memory reads/writes through tiling and recomputation |
+| AI Tensor Engine for ROCm | FlashAttention implementation optimized for AMD ROCm accelerators |
 | SageAttention | quantizes attention to int8 |
 | PyTorch native | built-in PyTorch implementation using [scaled_dot_product_attention](./fp16#scaled-dot-product-attention) |
 | xFormers | memory-efficient attention with support for various attention kernels |

@@ -139,6 +140,7 @@ Refer to the table below for a complete list of available attention backends and

 | `_native_xla` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | XLA-optimized attention |
 | `flash` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | FlashAttention-2 |
 | `flash_varlen` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | Variable length FlashAttention |
+| `aiter` | [AI Tensor Engine for ROCm](https://github.com/ROCm/aiter) | FlashAttention for AMD ROCm |
 | `_flash_3` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | FlashAttention-3 |
 | `_flash_varlen_3` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | Variable length FlashAttention-3 |
 | `_flash_3_hub` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | FlashAttention-3 from kernels |
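
The new `aiter` backend plugs into the same `set_attention_backend` API as the other backends in these tables (the Kandinsky doc in this commit uses it for `flex`). A minimal sketch of opting into it, assuming a ROCm build of PyTorch with `aiter>=0.1.5` installed; the checkpoint is illustrative:

```python
# Sketch: route a model's attention through AITER's FlashAttention on AMD ROCm.
# Assumes a ROCm PyTorch build and `aiter>=0.1.5`; the checkpoint is illustrative.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")  # ROCm devices are exposed through the "cuda" device type

pipe.transformer.set_attention_backend("aiter")
image = pipe("a photo of a cat", num_inference_steps=28).images[0]
```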

src/diffusers/guiders/__init__.py

Lines changed: 0 additions & 6 deletions
@@ -17,12 +17,6 @@
 from ..utils import is_torch_available, logging


-logger = logging.get_logger(__name__)
-logger.warning(
-    "Guiders are currently an experimental feature under active development. The API is subject to breaking changes in future releases."
-)
-
-
 if is_torch_available():
     from .adaptive_projected_guidance import AdaptiveProjectedGuidance
     from .adaptive_projected_guidance_mix import AdaptiveProjectedMixGuidance

src/diffusers/guiders/guider_utils.py

Lines changed: 4 additions & 0 deletions
@@ -41,6 +41,10 @@ class BaseGuidance(ConfigMixin, PushToHubMixin):
     _identifier_key = "__guidance_identifier__"

     def __init__(self, start: float = 0.0, stop: float = 1.0, enabled: bool = True):
+        logger.warning(
+            "Guiders are currently an experimental feature under active development. The API is subject to breaking changes in future releases."
+        )
+
         self._start = start
         self._stop = stop
         self._step: int = None
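
Together with the `__init__.py` hunk above, the net effect is that the experimental-feature warning moves from module import time to guider instantiation time. A minimal sketch of the new behavior; the concrete guider class and its argument are illustrative:

```python
# Importing the guiders module no longer logs the experimental warning;
# constructing any guider now does, via BaseGuidance.__init__.
from diffusers.guiders import ClassifierFreeGuidance  # no warning emitted here

guider = ClassifierFreeGuidance(guidance_scale=5.0)  # warning logged at __init__
```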

src/diffusers/models/attention_dispatch.py

Lines changed: 60 additions & 0 deletions
@@ -27,6 +27,8 @@

 from ..utils import (
     get_logger,
+    is_aiter_available,
+    is_aiter_version,
     is_flash_attn_3_available,
     is_flash_attn_available,
     is_flash_attn_version,

@@ -47,13 +49,15 @@

     from ._modeling_parallel import ParallelConfig

 _REQUIRED_FLASH_VERSION = "2.6.3"
+_REQUIRED_AITER_VERSION = "0.1.5"
 _REQUIRED_SAGE_VERSION = "2.1.1"
 _REQUIRED_FLEX_VERSION = "2.5.0"
 _REQUIRED_XLA_VERSION = "2.2"
 _REQUIRED_XFORMERS_VERSION = "0.0.29"

 _CAN_USE_FLASH_ATTN = is_flash_attn_available() and is_flash_attn_version(">=", _REQUIRED_FLASH_VERSION)
 _CAN_USE_FLASH_ATTN_3 = is_flash_attn_3_available()
+_CAN_USE_AITER_ATTN = is_aiter_available() and is_aiter_version(">=", _REQUIRED_AITER_VERSION)
 _CAN_USE_SAGE_ATTN = is_sageattention_available() and is_sageattention_version(">=", _REQUIRED_SAGE_VERSION)
 _CAN_USE_FLEX_ATTN = is_torch_version(">=", _REQUIRED_FLEX_VERSION)
 _CAN_USE_NPU_ATTN = is_torch_npu_available()

@@ -78,6 +82,12 @@

     flash_attn_3_func = None
     flash_attn_3_varlen_func = None

+
+if _CAN_USE_AITER_ATTN:
+    from aiter import flash_attn_func as aiter_flash_attn_func
+else:
+    aiter_flash_attn_func = None
+
 if DIFFUSERS_ENABLE_HUB_KERNELS:
     if not is_kernels_available():
         raise ImportError(

@@ -178,6 +188,9 @@ class AttentionBackendName(str, Enum):

     _FLASH_3_HUB = "_flash_3_hub"
     # _FLASH_VARLEN_3_HUB = "_flash_varlen_3_hub"  # not supported yet.

+    # `aiter`
+    AITER = "aiter"
+
     # PyTorch native
     FLEX = "flex"
     NATIVE = "native"

@@ -414,6 +427,12 @@ def _check_attention_backend_requirements(backend: AttentionBackendName) -> None

             f"Flash Attention 3 Hub backend '{backend.value}' is not usable because the `kernels` package isn't available. Please install it with `pip install kernels`."
         )

+    elif backend == AttentionBackendName.AITER:
+        if not _CAN_USE_AITER_ATTN:
+            raise RuntimeError(
+                f"Aiter Attention backend '{backend.value}' is not usable because of missing package or the version is too old. Please install `aiter>={_REQUIRED_AITER_VERSION}`."
+            )
+
     elif backend in [
         AttentionBackendName.SAGE,
         AttentionBackendName.SAGE_VARLEN,

@@ -1397,6 +1416,47 @@ def _flash_varlen_attention_3(

     return (out, lse) if return_lse else out


+@_AttentionBackendRegistry.register(
+    AttentionBackendName.AITER,
+    constraints=[_check_device_cuda, _check_qkv_dtype_bf16_or_fp16, _check_shape],
+)
+def _aiter_flash_attention(
+    query: torch.Tensor,
+    key: torch.Tensor,
+    value: torch.Tensor,
+    dropout_p: float = 0.0,
+    is_causal: bool = False,
+    scale: Optional[float] = None,
+    return_lse: bool = False,
+    _parallel_config: Optional["ParallelConfig"] = None,
+) -> torch.Tensor:
+    if not return_lse and torch.is_grad_enabled():
+        # aiter requires return_lse=True by assertion when gradients are enabled.
+        out, lse, *_ = aiter_flash_attn_func(
+            q=query,
+            k=key,
+            v=value,
+            dropout_p=dropout_p,
+            softmax_scale=scale,
+            causal=is_causal,
+            return_lse=True,
+        )
+    else:
+        out = aiter_flash_attn_func(
+            q=query,
+            k=key,
+            v=value,
+            dropout_p=dropout_p,
+            softmax_scale=scale,
+            causal=is_causal,
+            return_lse=return_lse,
+        )
+        if return_lse:
+            out, lse, *_ = out
+
+    return (out, lse) if return_lse else out
+
+
 @_AttentionBackendRegistry.register(
     AttentionBackendName.FLEX,
     constraints=[_check_attn_mask_or_causal, _check_device, _check_shape],

src/diffusers/models/transformers/transformer_chroma.py

Lines changed: 1 addition & 1 deletion
@@ -379,7 +379,7 @@ class ChromaTransformer2DModel(
     """
     The Transformer model introduced in Flux, modified for Chroma.

-    Reference: https://huggingface.co/lodestones/Chroma
+    Reference: https://huggingface.co/lodestones/Chroma1-HD

     Args:
         patch_size (`int`, defaults to `1`):

src/diffusers/models/transformers/transformer_kandinsky.py

Lines changed: 2 additions & 0 deletions
@@ -324,6 +324,7 @@ def apply_rotary(x, rope):
                 sparse_params["sta_mask"],
                 thr=sparse_params["P"],
             )
+
         else:
             attn_mask = None

@@ -335,6 +336,7 @@ def apply_rotary(x, rope):
             backend=self._attention_backend,
             parallel_config=self._parallel_config,
         )
+
         hidden_states = hidden_states.flatten(-2, -1)

         attn_out = attn.out_layer(hidden_states)
