
UPSTREAM PR #1296: feat: add Anima support #67

Open
loci-dev wants to merge 1 commit into main from loci/pr-1296-anima

Conversation

@loci-dev

Note

Source pull request: leejet/stable-diffusion.cpp#1296

For leejet/stable-diffusion.cpp#1245

./build/bin/sd-cli --diffusion-model models/anima-preview.safetensors --llm models/qwen_3_06b_base.safetensors --vae models/qwen_image_vae.safetensors -p "a cute cat" --fa -v -H 1024 -W 1024 --cfg-scale 4

Download model: https://huggingface.co/circlestone-labs/Anima/tree/main/split_files

Example output:

[output image]

@loci-dev loci-dev temporarily deployed to stable-diffusion-cpp-prod on February 26, 2026 at 04:17 with GitHub Actions (Inactive)
@loci-review

loci-review bot commented Feb 26, 2026

Overview

Analysis of 49,755 functions across two binaries reveals mixed performance impacts from adding ANIMA model support: 122 functions modified, 1,454 new, 4 removed, and 48,175 unchanged.

Binaries analyzed:

  • build.bin.sd-server: +1.458% power consumption (518,798 → 526,362 nJ)
  • build.bin.sd-cli: +1.574% power consumption (483,665 → 491,278 nJ)

The single commit "add anima" introduces a new diffusion model architecture through additive changes (774-line anima.hpp, model detection logic). Performance changes stem primarily from compiler optimization artifacts rather than algorithmic modifications.

Function Analysis

STL Vector Accessors - Improvements:

  • std::vector<gguf_kv>::end() (sd-server): Response time 263.83ns → 80.54ns (-69.47%), throughput 243.07ns → 59.78ns (-75.41%)
  • std::vector<std::thread>::end() (sd-cli): Response time 265.16ns → 81.87ns (-69.12%), throughput 243.07ns → 59.78ns (-75.41%)
  • std::vector<ggml_backend_reg_entry>::cbegin() (sd-server): Response time 264.07ns → 83.25ns (-68.47%), throughput 243.31ns → 62.49ns (-74.32%)

These improvements benefit model loading and backend initialization through better compiler code generation for simple template types.

STL Vector Accessors - Regressions:

  • std::vector<std::pair<...>>::end() (sd-server): Response time 80.53ns → 263.84ns (+227.65%), throughput 59.77ns → 243.08ns (+306.70%)

Complex nested template types suffer from optimization challenges, affecting AnimaConditioner initialization.

Model Detection Logic:

  • sd_version_is_dit() (both binaries): Response time +9.07% (~31ns), throughput +10.06% (~9ns)

Functionally justified regression from adding ANIMA classification check, enabling DiT optimizations for the new model.

Memory Management:

  • std::shared_ptr<T5CLIPEmbedder>::_M_destroy() (sd-cli): Throughput 293.74ns → 105.03ns (-64.24%), response 495.60ns → 307.53ns (-37.95%)
  • std::make_shared<WanModel> (sd-server): Throughput +43.38% (+41ns), but response time +0.05% (468ns out of 938µs)

Construction shows throughput regressions confined to function prologues; destruction improves significantly.

Other analyzed functions (std::swap, thread::joinable, file I/O operations) showed minor regressions (8-59% throughput increases, 9-76ns absolute) attributable to compiler optimization variations rather than source changes.

Additional Findings

Performance changes are isolated to initialization and utility code; inference hot paths remain unaffected. The analyzed functions do not involve GPU kernel execution—impacts are limited to model loading, backend registration, and object lifecycle management. The 1.5% power consumption increase is negligible compared to actual inference costs. ANIMA integration successfully leverages existing DiT optimization infrastructure (EasyCache, CacheDIT) with minimal overhead.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev
