Ye Liu1,2, Zongyang Ma2,3, Junfu Pu2, Zhongang Qi4, Yang Wu5, Ying Shan2, Chang Wen Chen1*
1The Hong Kong Polytechnic University 2ARC Lab, Tencent PCG
3Chinese Academy of Sciences 4vivo Mobile Communication Co. 5Tencent AI Lab
UniPixel is a unified MLLM for pixel-level vision-language understanding. It flexibly supports a variety of fine-grained tasks, including image/video segmentation, regional understanding, and a novel PixelQA task that jointly requires object-centric referring, segmentation, and question-answering in videos.
- **2025.10.03** Our online demo is available on Hugging Face Spaces. Enjoy!
- **2025.09.27** Try our model on custom data in one click.
- **2025.09.21** Code, model, and dataset released.
- **2025.09.18** Our paper has been accepted by NeurIPS 2025.
| Benchmark | Evaluation Results (3B/7B) |
|---|---|
| CT ReVOS (val) | J: 59.7/61.7, F: 64.4/65.7, J&F: 62.1/63.7 |
| CT MeViS (val) | J: 50.4/53.2, F: 55.7/58.3, J&F: 53.1/55.8 |
| CT Ref-YouTube-VOS (val) | J: 68.6/69.5, F: 72.3/72.4, J&F: 70.5/71.0 |
| CT Ref-DAVIS17 (val) | J: 70.7/72.7, F: 77.8/80.1, J&F: 74.2/76.4 |
| CT Ref-SAV (val) | J: 66.9/68.5, F: 67.6/69.6, J&F: 67.2/69.0 |
| CT GroundMoRe (test) | J: 36.0/36.5, F: 38.7/39.1, J&F: 37.4/37.8 |
| CT RefCOCO (RES) | val: 80.5/80.8, testA: 82.6/83.0, testB: 76.9/77.4 |
| CT RefCOCO+ (RES) | val: 74.3/75.3, testA: 78.9/80.1, testB: 68.4/70.0 |
| CT RefCOCOg (RES) | val(U): 76.3/76.4, test(U): 77.0/77.1 |
| CT ReasonSeg (val) | gIoU: 64.0/60.5, cIoU: 56.2/58.7 |
| CT VideoRefer-Bench-D | single-frame: 3.42/3.37, multi-frame: 3.44/3.36 |
| CT VideoRefer-Bench-Q | single-frame: 72.2/73.4, multi-frame: 72.8/74.1 |
| ZS MVBench | Acc: 62.5/64.3 |
CT and ZS refer to multi-task co-training and zero-shot settings, respectively. Evaluation results under more settings can be found in our paper.
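For reference, J in the table above is the region similarity (mask IoU) and J&F averages it with the boundary accuracy F. A minimal sketch of these standard DAVIS-style definitions (this is an illustration, not this repo's evaluation code; `region_jaccard` and `j_and_f` are hypothetical helper names):

```python
import numpy as np

def region_jaccard(pred, gt):
    """J: intersection-over-union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: perfect agreement by convention
    return np.logical_and(pred, gt).sum() / union

def j_and_f(j, f):
    """J&F: mean of region similarity J and boundary accuracy F."""
    return (j + f) / 2

pred = np.array([[1, 1], [0, 0]])
gt   = np.array([[1, 0], [0, 0]])
print(region_jaccard(pred, gt))  # 0.5
print(j_and_f(0.5, 0.7))         # 0.6
```

The boundary measure F additionally matches mask contours (usually via a bipartite matching with a small tolerance), which is omitted here for brevity.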
Turn on your sound and enjoy the BGM from Stardew Valley!
demo.mp4
Play with our demo online, or see DEMO.md for guidelines on deploying it locally.
- Make sure you have set up the environment.
- Run the following script for image or video segmentation.
```shell
# Set the Python path
export PYTHONPATH="./:$PYTHONPATH"

# Run inference on custom data
python tools/inference.py <media-path> <prompt>

# Example
python tools/inference.py example.jpg 'Please segment the rabbit'
```

Here, `<media-path>` can be a path to an image, a video, or a folder containing video frames (001.jpg, 002.jpg).
Here are some example prompts:
1. Please segment the tallest giraffe.
2. Where is the nearest sheep? Please provide the segmentation mask.
3. Why is the boy crying? Please provide the segmentation mask and explain why.
4. Who shot the ball? Please answer the question and provide the segmentation mask.
5. Please segment the object according to the description: <a-long-description>
| Model | Base MLLM | Checkpoint | Training Log |
|---|---|---|---|
| UniPixel-3B | Qwen2.5-VL-3B-Instruct | ๐ค Link | ๐ค Link |
| UniPixel-7B | Qwen2.5-VL-7B-Instruct | ๐ค Link | ๐ค Link |
We provide raw images/videos and pre-processed annotations of 23 referring/segmentation/QA datasets, including our UniPixel-SFT-1M for training and multiple benchmarks for evaluation. The list of source datasets is shown below. See our dataset repo for more details.
Our codebase supports training and evaluation on 23 datasets and benchmarks, with the following features:
- Flexible hardware settings: NVIDIA GPU / Ascend NPU, Single-Node / Multi-Node
- Efficient training techniques: DeepSpeed ZeRO, BF16, LoRA, SDPA, FlashAttention2, Liger-Kernel
- Customizing the base LLM and conversation templates
- Monitoring the training process via TensorBoard / Weights & Biases (wandb)
- Group sampling for mixed dataset training
- Multi-process / multi-device evaluation on public benchmarks
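The group-sampling feature listed above keeps each mini-batch homogeneous, i.e., drawn from a single dataset, so samples with different annotation types are not mixed in one optimization step. A minimal illustration of that idea (not the repo's actual sampler; `make_group_batches` and its arguments are hypothetical):

```python
import random

def make_group_batches(dataset_sizes, batch_size, seed=0):
    """Build homogeneous batches: every batch draws samples from one dataset,
    then the batch order is shuffled across datasets."""
    rng = random.Random(seed)
    batches = []
    for ds_id, size in enumerate(dataset_sizes):
        indices = list(range(size))
        rng.shuffle(indices)
        # drop the trailing partial batch so every batch stays full
        for start in range(0, size - size % batch_size, batch_size):
            batches.append([(ds_id, i) for i in indices[start:start + batch_size]])
    rng.shuffle(batches)
    return batches

batches = make_group_batches([10, 6], batch_size=2)
# every batch contains indices from exactly one dataset
assert all(len({ds for ds, _ in b}) == 1 for b in batches)
```

In practice this pattern is typically wrapped in a framework batch sampler (e.g., a PyTorch `Sampler`) so each device receives same-dataset batches during mixed-dataset training.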
See TRAIN.md for a quick start guide.
See EVAL.md for details about evaluating UniPixel on public benchmarks.
Please cite our paper if you find this project helpful.
@inproceedings{liu2025unipixel,
title={UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning},
author={Liu, Ye and Ma, Zongyang and Pu, Junfu and Qi, Zhongang and Wu, Yang and Shan, Ying and Chen, Chang Wen},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
year={2025}
}