GRN: Generative Refinement Networks





🌟 Introduction

This is the official implementation of the paper Generative Refinement Networks for Visual Synthesis. Neither diffusion nor autoregressive, GRN is a third way: 🧠 it refines globally like an artist, ⚡ it generates adaptively according to sample complexity, and 🏆 it sets a new state of the art across image and video generation.

Diffusion models dominate visual generation, but they allocate uniform computational effort to samples of varying complexity. Autoregressive (AR) models are complexity-aware, as evidenced by their variable likelihoods, but suffer from lossy tokenization and error accumulation.

We introduce Generative Refinement Networks (GRN), a new visual synthesis paradigm that addresses these issues:

  • Near-lossless tokenization via Hierarchical Binary Quantization (HBQ)
  • Global refinement mechanism that progressively perfects outputs like a human artist
  • Entropy-guided sampling for complexity-aware, adaptive-step generation
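HBQ itself is defined in the paper; purely as intuition for why binary codes can be near-lossless, here is a minimal coarse-to-fine sign-quantization sketch. The function names and the residual-halving schedule are illustrative assumptions, not the repo's actual HBQ:

```python
import numpy as np

def binary_quantize(z):
    """Map a real-valued latent to {-1, +1} codes (hard sign quantization)."""
    return np.where(z >= 0, 1.0, -1.0)

def hierarchical_binary_quantize(z, levels=3):
    """Quantize a latent coarse-to-fine: each level encodes the sign of the
    residual left by the previous levels, halving the step size each time."""
    residual = z.astype(np.float64)
    recon = np.zeros_like(residual)
    codes, scale = [], 1.0
    for _ in range(levels):
        bits = binary_quantize(residual)
        codes.append(bits)
        recon += scale * bits
        residual = z - recon
        scale *= 0.5
    return codes, recon

z = np.array([0.7, -0.2, 1.3, -0.9])
codes, recon = hierarchical_binary_quantize(z, levels=8)
# The reconstruction error is bounded by the final (smallest) step size,
# so it shrinks geometrically with depth, which is the intuition behind
# "near-lossless" binary tokenization.
```

With 8 levels the worst-case error on inputs in [-2, 2] is already below 0.01, at a cost of 8 bits per latent dimension in this toy setup.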

GRN achieves state-of-the-art results on ImageNet reconstruction and class-conditional generation, and scales effectively to text-to-image and text-to-video tasks.


Generative Refinement Framework

Framework

Starting from a random token map, GRN selects progressively more predictions at each step while refining all input tokens. For example, compared with the second step, the third step fills six new tokens (pink), keeps two tokens (blue), erases two tokens (yellow), and leaves six tokens blank (gray).
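The keep/erase logic above can be sketched with entropy-guided selection. This is an illustrative toy, not the repo's sampler; `refinement_step` and its signature are hypothetical:

```python
import numpy as np

def refinement_step(probs, keep_fraction, rng):
    """One illustrative refinement step: re-predict every token, then keep
    only the lowest-entropy (most confident) fraction.

    probs: (num_tokens, vocab) per-token categorical distributions.
    Returns (tokens, keep) where keep[i] is True for tokens committed this
    step; positions with keep[i] == False would be re-predicted next step.
    """
    num_tokens, vocab = probs.shape
    # Sample a candidate token at every position (global refinement: all
    # positions are re-predicted, not just the previously blank ones).
    tokens = np.array([rng.choice(vocab, p=p) for p in probs])
    # Per-token entropy: low entropy means a confident prediction.
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    k = max(1, int(keep_fraction * num_tokens))
    keep = np.zeros(num_tokens, dtype=bool)
    keep[np.argsort(entropy)[:k]] = True  # keep the k most confident tokens
    return tokens, keep

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(16) * 0.3, size=12)  # 12 tokens, vocab of 16
tokens, keep = refinement_step(probs, keep_fraction=0.5, rng=rng)
```

Raising `keep_fraction` across steps reproduces the "more predictions at each step" schedule described above.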


✨ Gallery

GRN-8B Text-to-Video Examples

01.mp4
02.mp4
03.mp4
04.mp4
05.mp4
06.mp4
07.mp4
08.mp4
09.mp4

GRN-8B Image-to-Video Examples

01.mp4
02.mp4
03.mp4
04.mp4
05.mp4
06.mp4
07.mp4
08.mp4
09.mp4

GRN-2B Class-to-Image Examples

Class-to-Image Examples

GRN-2B Text-to-Image Examples

Text-to-Image Examples


🚀 Demo

🖼️ Text-to-Image

Try our interactive Text-to-Image demo on 🤗 Hugging Face Space:

GRN T2I Demo

Experience the power of Generative Refinement Networks firsthand by generating images from text prompts directly in your browser!


🎬 Text-to-Video

Try our interactive Text-to-Video demo on Discord:

Discord



📦 Model Zoo

| Model | Checkpoints |
| --- | --- |
| Tokenizers | ✅ ImageNet Tokenizer<br>✅ Joint Image/Video Tokenizer |
| GRN_ind_C2I | ✅ B<br>⬜ L (TBD)<br>⬜ H (TBD)<br>⬜ G (TBD) |
| GRN_bit_T2I | ✅ GRN_T2I |
| GRN_bit_T2V | ✅ GRN_T2V |

πŸ› οΈ Installation

Step 1: Clone the repository

git clone https://github.com/MGenAI/GRN
cd GRN

Step 2: Create conda environment

A suitable conda environment named GRN can be created and activated with:

conda env create -f environment.yaml
conda activate GRN

Troubleshooting

If you get undefined symbol: iJIT_NotifyEvent when importing torch, simply:

pip uninstall torch
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124

Check this issue for more details.


πŸ–ΌοΈ Class-to-Image

Dataset

Download the ImageNet dataset and place it at your IMAGENET_PATH.

Training

All training scripts are located in scripts/c2i/. We suggest using 8x80GB GPUs for most models.

| Model | Training Script | GPUs Required |
| --- | --- | --- |
| GRN_ind_B | bash scripts/c2i/train_GRN_ind_B.sh | 8x80GB |
| GRN_bit_B | bash scripts/c2i/train_GRN_bit_B.sh | 8x80GB |
| GRN_ind_L | bash scripts/c2i/train_GRN_ind_L.sh | 8x80GB |
| GRN_ind_H | bash scripts/c2i/train_GRN_ind_H.sh | 16x80GB |
| GRN_ind_G | bash scripts/c2i/train_GRN_ind_G.sh | 32x80GB |

Evaluation

PyTorch pre-trained models are available here.

All evaluation scripts are located in scripts/c2i/. We suggest using 8x80GB GPUs.

| Model | Evaluation Script |
| --- | --- |
| GRN_ind_B | bash scripts/c2i/eval_GRN_ind_B.sh |
| GRN_bit_B | bash scripts/c2i/eval_GRN_bit_B.sh |
| GRN_ind_L | bash scripts/c2i/eval_GRN_ind_L.sh |
| GRN_ind_H | bash scripts/c2i/eval_GRN_ind_H.sh |
| GRN_ind_G | bash scripts/c2i/eval_GRN_ind_G.sh |

We use torch-fidelity to evaluate FID and IS against a reference image folder or pre-computed statistics. We use JiT's pre-computed reference stats under grn/utils_c2i/fid_stats.
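For intuition about what torch-fidelity reports, FID is the Fréchet distance between two Gaussians fitted to feature statistics (mean and covariance). A minimal pure-NumPy version of that standard formula, not the repo's code, is:

```python
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(a)
    vals = np.clip(vals, 0.0, None)  # clamp tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between two Gaussians, the quantity FID reports:
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2))."""
    diff = mu1 - mu2
    s2_half = _sqrtm_psd(sigma2)
    # Tr((S1 S2)^{1/2}) == Tr((S2^{1/2} S1 S2^{1/2})^{1/2}) for PSD inputs,
    # which lets us stay in symmetric-matrix land.
    tr_covmean = np.trace(_sqrtm_psd(s2_half @ sigma1 @ s2_half))
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2)
                 - 2.0 * tr_covmean)

# Identical statistics give a distance of 0 (up to numerical noise).
mu, sigma = np.zeros(4), np.eye(4)
d = frechet_distance(mu, sigma, mu, sigma)
```

In practice the statistics come from Inception features over the generated and reference image sets, which is exactly what the pre-computed fid_stats files store.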


🎨 Text-to-Image

Inference

You can simply run python3 t2i_infer.py or use the following code:

from PIL import Image
from grn_pipeline import GRNPipeline

# Load pipeline
pipeline = GRNPipeline.from_pretrained(
    hf_repo_id='bytedance-research/GRN',
    task='T2I',
    device='cpu',
).to('cuda')

# Generate one image
result = pipeline(
    prompt="A cute cat playing in the garden",
    guidance_scale=3.0,
    temperature=1.1,
    complexity_aware_Tmin=10,
    complexity_aware_Tmax=50,
    complexity_aware_k=0,
    complexity_aware_b=50,
    complexity_aware_wp=5,
    snr_shift=1.0,
    h_div_w=1.0,
    content_type='image',
    seed=42,
)
image = result.images[0]
image.save('./generated_image.jpg')
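The complexity_aware_* knobs are not documented here; one plausible reading of their names (an assumption on our part, not confirmed by the source) is a linear, clamped mapping from an estimated sample complexity to a refinement-step budget:

```python
def adaptive_num_steps(complexity, k, b, t_min, t_max):
    """Hypothetical interpretation of the complexity_aware_* parameters:
    a linear map from estimated sample complexity to a step budget,
    clamped to [t_min, t_max]. With k=0, as in the example above, every
    sample gets b steps regardless of its complexity."""
    return int(min(max(k * complexity + b, t_min), t_max))

# k=0 disables adaptivity: the budget is always b (here 50, the Tmax cap).
steps = adaptive_num_steps(0.8, k=0, b=50, t_min=10, t_max=50)
```

Under this reading, a positive k would let simple prompts finish in few steps while complex ones use the full budget, matching the "complexity-aware, adaptive-step generation" described in the introduction.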

🎬 Text-to-Video

Inference

You can simply run python3 t2v_infer.py or use the following code:

from grn_pipeline import GRNPipeline

# Load pipeline
pipeline = GRNPipeline.from_pretrained(
    hf_repo_id='bytedance-research/GRN', 
    task='T2V', 
    pn='0.41M', 
    device='cpu'
).to('cuda')

# Generate one video
result = pipeline(
    prompt="Two women demonstrate a makeup product, applying it with a sponge while smiling and engaging with the camera in a bright, clean setting.",
    guidance_scale=4.0,
    temperature=1.0,
    complexity_aware_Tmin=10,
    complexity_aware_Tmax=50,
    complexity_aware_k=0,
    complexity_aware_b=50,
    complexity_aware_wp=5,
    snr_shift=1.0,
    h_div_w=9/16,
    duration=2.,
    content_type='video',
    seed=42,
)
video_file = result.videos[0]
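How pn='0.41M' and h_div_w translate into a frame size is not documented here; a hypothetical sketch, where resolution_from_budget is an invented helper and the multiple-of-16 snapping is an assumption, would be:

```python
import math

def resolution_from_budget(pixel_budget, h_div_w, multiple=16):
    """Hypothetical sketch: solve h * w = pixel_budget with h / w = h_div_w,
    then round both sides to a patch-friendly multiple. pn='0.41M' is read
    as a budget of roughly 0.41 megapixels per frame."""
    w = math.sqrt(pixel_budget / h_div_w)
    h = w * h_div_w
    snap = lambda x: max(multiple, int(round(x / multiple)) * multiple)
    return snap(h), snap(w)

# A 0.41M budget at 9:16 lands near 480x848 under this scheme.
h, w = resolution_from_budget(0.41e6, 9 / 16)
```

The snapped product stays within a few percent of the requested budget, so the token count per frame is roughly constant across aspect ratios.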

📧 Contact

If you are interested in scaling GRN for image generation, image editing, video generation, video editing, or unified models, please feel free to reach out!

📧 Email: hanjian.thu123@bytedance.com


🤗 Acknowledgements


πŸ“ Citation

If you find our work useful, please consider citing:

@misc{han2026grn,
      title={Generative Refinement Networks for Visual Synthesis}, 
      author={Jian Han and Jinlai Liu and Jiahuan Wang and Bingyue Peng and Zehuan Yuan},
      year={2026},
      eprint={2604.13030},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.13030}, 
}
