

candle-video


A Rust library for AI video generation built on the Candle ML framework. It provides high-performance, standalone video generation inference without Python runtime dependencies.



✨ What is this?

candle-video is a Rust-native implementation of video generation models, targeting deployment scenarios where startup time, binary size, and memory efficiency matter. It provides inference for state-of-the-art text-to-video models without requiring a Python runtime.

Supported Models

  • LTX-Video — Text-to-video generation using DiT (Diffusion Transformer) architecture
    • 2B and 13B parameter variants
    • Standard and distilled versions (0.9.5 – 0.9.8)
    • T5-XXL text encoder with GGUF quantization support
    • 3D VAE for video encoding/decoding
    • Flow Matching scheduler
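
For intuition about the last component, the sampler behind a Flow Matching scheduler boils down to an Euler update of the latents along a model-predicted velocity field. Below is a minimal, framework-agnostic sketch of that update; it is illustrative only and is not the crate's scheduler API.

```rust
/// One Euler step of a flow-matching sampler: the model predicts a velocity
/// field v(x, sigma) and the sample moves along it as sigma decreases toward 0.
fn flow_match_euler_step(sample: &mut [f32], velocity: &[f32], sigma: f32, sigma_next: f32) {
    let dt = sigma_next - sigma; // negative, because sigmas decrease over the sampling loop
    for (x, v) in sample.iter_mut().zip(velocity.iter()) {
        *x += dt * *v;
    }
}

fn main() {
    // Toy 4-element "latent" and a constant predicted velocity, just to show the update.
    let mut latent = vec![1.0_f32, -0.5, 0.25, 0.0];
    let velocity = vec![0.2_f32; 4];
    // Two consecutive sigmas from a hypothetical schedule (1.0 = pure noise, 0.0 = clean).
    flow_match_euler_step(&mut latent, &velocity, 1.0, 0.75);
    println!("{latent:?}");
}
```

In the actual pipeline, the velocity comes from the DiT transformer conditioned on the T5 text embeddings, and the sigmas come from the scheduler's noise schedule.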

🚀 Key Features

  • High Performance — Native Rust with GPU acceleration via CUDA/cuDNN
  • Memory Efficient — BF16 inference, VAE tiling/slicing, GGUF quantized text encoders
  • Flexible — Run on CPU or GPU, with optional Flash Attention v2
  • Standalone — No Python runtime required in production
  • Fast Startup — ~2 seconds vs ~15-30 seconds for Python/PyTorch

Hardware Acceleration

| Feature | Description |
|------------|-------------|
| flash-attn | Flash Attention v2 for efficient attention (default) |
| cudnn | cuDNN for faster convolutions (default) |
| mkl | Intel MKL for optimized CPU operations (x86_64) |
| accelerate | Apple Accelerate for Metal (macOS) |
| nccl | Multi-GPU support via NCCL |
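
Because the crate is built on Candle, application code typically selects the compute device with candle_core itself. The sketch below shows the usual CPU/CUDA fallback pattern; it is a general Candle idiom, not a candle-video-specific API.

```rust
use candle_core::Device;

/// Choose the compute device the way Candle applications usually do:
/// CUDA device 0 when available (and compiled in), otherwise fall back to CPU.
fn pick_device(force_cpu: bool) -> candle_core::Result<Device> {
    if force_cpu {
        Ok(Device::Cpu)
    } else {
        // Returns Device::Cpu when no CUDA device is available.
        Device::cuda_if_available(0)
    }
}

fn main() -> candle_core::Result<()> {
    let device = pick_device(false)?;
    println!("using CUDA: {}", device.is_cuda());
    Ok(())
}
```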

🎬 Demonstration

| Model | Video | Prompt |
|-------|-------|--------|
| LTX-Video-0.9.5 | Waves and Rocks | The waves crash against the jagged rocks of the shoreline, sending spray high into the air... |
| LTX-Video-0.9.8-2b-distilled | woman_with_blood | A woman with blood on her face and a white tank top looks down and to her right... |

More examples are available in the examples/ directory.


🖥️ System Requirements

Prerequisites

  • Rust 1.82+ (Edition 2024)
  • CUDA Toolkit 12.x (for GPU acceleration)
  • cuDNN 8.x/9.x (optional, for faster convolutions)
  • Hugging Face CLI (huggingface-cli) for downloading model weights

Approximate VRAM Requirements (512×768, 97 frames)

  • Full model: ~8-12GB
  • With VAE tiling: ~8GB
  • With GGUF-quantized T5: saves an additional ~8GB

🛠️ Installation & Setup

Add to your project

[dependencies]
candle-video = { git = "https://github.com/FerrisMind/candle-video" }

Build from source

# Clone the repository
git clone https://github.com/FerrisMind/candle-video.git
cd candle-video

# Default build (CUDA + cuDNN + Flash Attention)
cargo build --release

# CPU-only build
cargo build --release --no-default-features

# With specific features
cargo build --release --features "cudnn,flash-attn"

Model Weights

Download from oxide-lab/LTX-Video-0.9.8-2B-distilled:

huggingface-cli download oxide-lab/LTX-Video-0.9.8-2B-distilled --local-dir ./models/ltx-video

Note: This is the same model as the official Lightricks/LTX-Video release, but this repository bundles all the necessary files in one place, so you don't need to track them down individually.

Required files for diffusers model versions:

  • transformer/diffusion_pytorch_model.safetensors — DiT model
  • vae/diffusion_pytorch_model.safetensors — 3D VAE
  • text_encoder_gguf/t5-v1_1-xxl-encoder-Q5_K_M.gguf — Quantized T5
  • text_encoder_gguf/tokenizer.json — T5 tokenizer

Required files for official model versions:

  • ltxv-2b-0.9.8-distilled.safetensors — DiT + 3D VAE in single file
  • text_encoder_gguf/t5-v1_1-xxl-encoder-Q5_K_M.gguf — Quantized T5
  • text_encoder_gguf/tokenizer.json — T5 tokenizer

📖 How to Start Using

Running the Examples with Local Weights (Recommended)

For diffusers model versions:

cargo run --example ltx-video --release --features flash-attn,cudnn -- \
    --local-weights ./models/ltx-video \
    --ltxv-version 0.9.5 \
    --prompt "A cat playing with a ball of yarn" 

For official model versions:

cargo run --example ltx-video --release --features flash-attn,cudnn -- \
    --local-weights ./models/ltx-video-model \
    --unified-weights ./models/ltx-video-model.safetensors \
    --ltxv-version 0.9.8-2b-distilled \
    --prompt "A cat playing with a ball of yarn" 

Fast Preview (Lower Resolution)

cargo run --example ltx-video --release --features flash-attn,cudnn -- \
    --local-weights ./models/ltx-video-model \
    --unified-weights ./models/ltx-video-model.safetensors \
    --ltxv-version 0.9.8-2b-distilled \
    --prompt "A cat playing with a ball of yarn" \
    --height 256 --width 384 --num-frames 25 

Low VRAM Mode

cargo run --example ltx-video --release --features flash-attn,cudnn -- \
    --local-weights ./models/ltx-video \
    --prompt "A majestic eagle soaring over mountains" \
    --vae-tiling --vae-slicing

CLI Options

| Argument | Default | Description |
|----------|---------|-------------|
| --prompt | "A video of a cute cat..." | Text prompt for generation |
| --negative-prompt | "" | Negative prompt |
| --height | 512 | Video height (divisible by 32) |
| --width | 768 | Video width (divisible by 32) |
| --num-frames | 97 | Number of frames (should be 8n + 1) |
| --steps | (from version config) | Diffusion steps |
| --guidance-scale | (from version config) | Classifier-free guidance scale |
| --ltxv-version | "0.9.5" | Model version |
| --local-weights | (None) | Path to local weights |
| --output-dir | "output" | Directory to save results |
| --seed | random | Random seed for reproducibility |
| --vae-tiling | false | Enable VAE tiling for memory efficiency |
| --vae-slicing | false | Enable VAE batch slicing |
| --frames | false | Save individual PNG frames |
| --gif | false | Save as GIF animation |
| --cpu | false | Run on CPU instead of GPU |
| --use-bf16-t5 | false | Use BF16 T5 instead of GGUF quantized |
| --unified-weights | (None) | Path to unified safetensors file |
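
The resolution and frame-count constraints above (height/width divisible by 32, frames of the form 8n + 1) can be handled by snapping requested values down to the nearest valid ones. The helper below is purely illustrative and is not part of the example binary.

```rust
/// Snap a requested resolution and frame count to values the CLI accepts:
/// height and width divisible by 32, frame count of the form 8n + 1.
fn snap_to_valid(height: u32, width: u32, frames: u32) -> (u32, u32, u32) {
    let snap32 = |v: u32| (v.max(32) / 32) * 32;             // round down to a multiple of 32
    let snap_frames = |f: u32| ((f.max(1) - 1) / 8) * 8 + 1; // round down to 8n + 1
    (snap32(height), snap32(width), snap_frames(frames))
}

fn main() {
    // A 500x760 request at 100 frames becomes 480x736 at 97 frames.
    println!("{:?}", snap_to_valid(500, 760, 100));
}
```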

Supported Model Versions

| Version | Parameters | Steps | Guidance | Notes |
|---------|------------|-------|----------|-------|
| 0.9.5 | 2B | 40 | 3.0 | Standard model |
| 0.9.6-dev | 2B | 40 | 3.0 | Development version |
| 0.9.6-distilled | 2B | 8 | 1.0 | Fast inference |
| 0.9.8-2b-distilled | 2B | 7 | 1.0 | Latest distilled |
| 0.9.8-13b-dev | 13B | 30 | 8.0 | Large model |
| 0.9.8-13b-distilled | 13B | 7 | 1.0 | Large distilled |
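
If you drive generation programmatically, the per-version defaults from this table can be expressed as a simple lookup. This only mirrors the values above for illustration; the crate resolves them from its own version configs (configs.rs).

```rust
/// Default diffusion steps and guidance scale per model version, mirroring the
/// table above. Illustrative only; not the crate's config loader.
fn version_defaults(version: &str) -> Option<(u32, f32)> {
    match version {
        "0.9.5" | "0.9.6-dev" => Some((40, 3.0)),
        "0.9.6-distilled" => Some((8, 1.0)),
        "0.9.8-2b-distilled" | "0.9.8-13b-distilled" => Some((7, 1.0)),
        "0.9.8-13b-dev" => Some((30, 8.0)),
        _ => None,
    }
}

fn main() {
    assert_eq!(version_defaults("0.9.8-2b-distilled"), Some((7, 1.0)));
}
```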

Memory Optimization

For limited VRAM:

# VAE tiling - processes image in tiles
--vae-tiling

# VAE slicing - processes batches sequentially
--vae-slicing

# Lower resolution
--height 256 --width 384

# Fewer frames
--num-frames 25
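
Conceptually, --vae-tiling decodes the latent video in overlapping spatial tiles rather than all at once, which caps peak activation memory at the cost of a little extra compute. The sketch below shows how overlapping tile boundaries along one axis can be computed; it is illustrative only, not the crate's implementation.

```rust
/// Compute overlapping tile boundaries along one spatial axis: the idea behind
/// VAE tiling is to decode each (start, end) window separately and blend the overlaps.
fn tile_ranges(size: usize, tile: usize, overlap: usize) -> Vec<(usize, usize)> {
    let stride = tile.saturating_sub(overlap).max(1);
    let mut ranges = Vec::new();
    let mut start = 0;
    loop {
        let end = (start + tile).min(size);
        ranges.push((start, end));
        if end == size {
            break;
        }
        start += stride;
    }
    ranges
}

fn main() {
    // A 768-pixel-wide frame decoded in 256-wide tiles with 32 pixels of overlap.
    println!("{:?}", tile_ranges(768, 256, 32));
}
```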

Project Structure

candle-video/
├── src/
│   ├── lib.rs                    # Library entry point
│   └── models/
│       └── ltx_video/            # LTX-Video implementation
│           ├── ltx_transformer.rs    # DiT transformer
│           ├── vae.rs                # 3D VAE
│           ├── text_encoder.rs       # T5 text encoder
│           ├── quantized_t5_encoder.rs # GGUF T5 encoder
│           ├── scheduler.rs          # Flow matching scheduler
│           ├── t2v_pipeline.rs       # Text-to-video pipeline
│           ├── loader.rs             # Weight loading
│           └── configs.rs            # Model version configs
├── examples/
│   └── ltx-video/                # Main CLI example
├── tests/                        # Parity and unit tests
├── scripts/                      # Python reference generators
└── benches/                      # Performance benchmarks
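
To make the layout above concrete, here is a deliberately simplified sketch of how the pipeline stages compose in a text-to-video run. Every trait and type name is hypothetical and is not candle-video's public API; see t2v_pipeline.rs for the real wiring.

```rust
// Hypothetical sketch of how the modules above fit together at run time.
struct Embeddings(Vec<f32>);
struct Latents(Vec<f32>);
struct VideoFrames(Vec<Vec<u8>>);

trait TextEncoder {        // text_encoder.rs / quantized_t5_encoder.rs
    fn encode(&self, prompt: &str) -> Embeddings;
}
trait DitTransformer {     // ltx_transformer.rs
    fn predict_velocity(&self, latents: &Latents, cond: &Embeddings, sigma: f32) -> Latents;
}
trait FlowScheduler {      // scheduler.rs
    fn sigmas(&self, steps: usize) -> Vec<f32>;
    fn step(&self, latents: Latents, velocity: &Latents, sigma: f32, sigma_next: f32) -> Latents;
}
trait VideoVae {           // vae.rs
    fn decode(&self, latents: &Latents) -> VideoFrames;
}

/// Text-to-video, end to end: encode the prompt, run the denoising loop, decode frames.
fn generate<E, D, S, V>(
    prompt: &str,
    encoder: &E,
    transformer: &D,
    scheduler: &S,
    vae: &V,
    steps: usize,
    mut latents: Latents,
) -> VideoFrames
where
    E: TextEncoder,
    D: DitTransformer,
    S: FlowScheduler,
    V: VideoVae,
{
    let cond = encoder.encode(prompt);
    let sigmas = scheduler.sigmas(steps);
    for pair in sigmas.windows(2) {
        let velocity = transformer.predict_velocity(&latents, &cond, pair[0]);
        latents = scheduler.step(latents, &velocity, pair[0], pair[1]);
    }
    vae.decode(&latents)
}
```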

🙏 Acknowledgments


License

Licensed under the Apache License, Version 2.0. See LICENSE for details.

Copyright 2025 FerrisMind
