
Can Large Models Fool the Eye? A New Turing Test for Biological Animation 👀

Even young infants can easily interpret biological motion from point-light displays without any prior knowledge.

¹Shanghai Jiao Tong University
²Shanghai AI Laboratory
³Macao Polytechnic University
*Corresponding author


We propose BioMotion Arena, the first biological motion-based visual preference evaluation framework for large models. We focus on ten typical human motions and introduce fine-grained control over gender, weight, mood, and direction. In total, we collect more than 45k votes for 53 mainstream LLMs and MLLMs on 90 biological motion variants.

Release 🚀

  • [2025/08/22] 🔥 We added the Gradio version of BioMotion Arena.
  • [2025/08/11] 🔥 BioMotion Arena was highlighted in a Medium article by Berend Watchus!
  • [2025/08/08] ⚡️ The project website for BioMotion Arena is online!

Motivations 💡

Evaluating the abilities of large models and exposing the gaps between them is challenging. Current benchmarks adopt either ground-truth-based, score-form evaluation on static datasets or indistinct, chatbot-style collection of textual human preferences, neither of which gives users immediate, intuitive, and perceptible feedback on performance differences. In this paper, we introduce BioMotion Arena, a novel framework for evaluating large language models (LLMs) and multimodal large language models (MLLMs) via visual animation.

Motion Space 🧩

We include 10 typical human actions as well as four fine-grained attributes (see the sketch after this list for one way they can be combined into motion variants):

  • Action: Walking, running, waving a hand, jumping up, jumping forward, bowing, lying down, sitting down, turning around, and forward rolling
  • Gender: Man, woman
  • Happiness: Happy, sad
  • Weight: Heavy, light
  • Direction: Left, right, facing forward
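
For reference, below is a minimal sketch of how the motion variants could be enumerated. It assumes each action is paired with one attribute value at a time (10 actions × 9 attribute values = 90), which matches the variant count reported above but may not be the exact scheme or prompt phrasing used in the paper.

# Hypothetical enumeration of motion variants (10 actions x 9 attribute values = 90).
# The exact variant composition and prompt templates in the paper may differ.
ACTIONS = ["walking", "running", "waving a hand", "jumping up", "jumping forward",
           "bowing", "lying down", "sitting down", "turning around", "forward rolling"]
ATTRIBUTES = {  # the four fine-grained attributes listed above
    "gender": ["man", "woman"],
    "happiness": ["happy", "sad"],
    "weight": ["heavy", "light"],
    "direction": ["left", "right", "facing forward"],
}

variants = [(action, attribute, value)
            for action in ACTIONS
            for attribute, values in ATTRIBUTES.items()
            for value in values]
print(len(variants))  # 90 (action, attribute, value) combinations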

Participating LLMs and MLLMs 🤖

Our BioMotion Arena currently includes 53 large models (both LLMs and MLLMs) in total, with a mix of cutting-edge proprietary models, open-source models, and code-specific models.

Run with Gradio 🎮

We use an unofficial third-party API. If you rely on a different provider, you need to update the API-calling part of the code (the call_model_api function) accordingly.
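
A minimal sketch of such a function is shown below, assuming an OpenAI-compatible chat-completions endpoint via the openai Python SDK; the signature, base_url, and key handling are our assumptions, and the actual implementation in biomotion_gradio.py may differ.

# Hypothetical drop-in for call_model_api, assuming an OpenAI-compatible endpoint.
# Adjust base_url, model names, and key handling to match your provider.
from openai import OpenAI

def call_model_api(model_name, prompt, api_key, base_url="https://api.openai.com/v1"):
    client = OpenAI(api_key=api_key, base_url=base_url)
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
    )
    # Return the generated code/text from the first choice.
    return response.choices[0].message.content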

python biomotion_gradio.py --default-key xxxxxxxxxxx --special-key xxxxxxxxxx

Running on local URL: http://127.0.0.1:7860

  1. Use the default recommended prompt, or enter your own action prompt.
  2. Click the Generate Code button and wait for the responses from two anonymous models.
  3. Click the Run Code A and Run Code B buttons respectively.
  4. Make a preference selection based on the motion animations produced by the executed code. The result is automatically saved locally to preferences.csv.
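
The exact columns written to preferences.csv are not documented here; for illustration only, a vote row might follow a hypothetical schema like the one below (also assumed by the Elo sketch later in this README).

model_a,model_b,winner
gpt-4o,qwen2.5-72b-instruct,A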

Code 💻

We recommend directly installing the environment for the model to be evaluated, such as Qwen2.5-VL, Qwen2.5, Llama3.3-70B, InternVL2.5, or the OpenAI API.

Two code examples are given, covering a proprietary LLM (OpenAI) and an open-source LLM (Qwen); openai-MLLM.py additionally provides a demo for MLLMs with reference-image input. A sketch of such a query script follows the commands below.

python openai.py
python qwen.py
python openai-MLLM.py
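
For reference, here is a minimal sketch of a qwen.py-style query script, assuming the Hugging Face transformers library and the Qwen/Qwen2.5-7B-Instruct checkpoint; the prompt, checkpoint, and decoding settings are illustrative and not necessarily those used by the scripts in this repo.

# Hypothetical qwen.py-style script: ask an open-source LLM to write point-light
# biological motion code. Model choice, prompt, and decoding settings are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = ("Write a Python program that shows a point-light stimulus of a man walking, "
          "rendered as white points moving against a solid black background.")
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(inputs, max_new_tokens=2048)
# Strip the prompt tokens and decode only the newly generated code.
print(tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True))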

Human Preference Collection

Configure the evaluation pool and output path in anmoy-subjective-exp.py, then launch the UI code for anonymous subjective experiments.
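
For illustration, the configuration might look like the following; the variable names are hypothetical, so check the actual script for the real ones.

# Hypothetical configuration block in anmoy-subjective-exp.py (names are assumptions).
MODEL_POOL = ["gpt-4o", "qwen2.5-72b-instruct", "llama3.3-70b-instruct"]  # models whose animations are compared
OUTPUT_PATH = "results/preferences.csv"  # where the collected pairwise votes are written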

cd subjective-exp-tool
python anmoy-subjective-exp.py

Calculate Elo scores from the collected human preferences:

python elo_score.py
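
For reference, below is a minimal sketch of how Elo scores can be computed from pairwise votes, assuming a CSV with columns model_a, model_b, and winner ("A", "B", or "tie"); elo_score.py may use different column names, a different K or initial rating, or a more elaborate procedure such as averaging over shuffled vote orders.

# Minimal online Elo update over the collected votes (schema and constants are assumptions).
import csv
from collections import defaultdict

K = 32              # update step size
INIT_RATING = 1000  # starting rating for every model

def update(ratings, a, b, score_a):
    """score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1 / (1 + 10 ** ((ratings[b] - ratings[a]) / 400))
    ratings[a] += K * (score_a - expected_a)
    ratings[b] += K * ((1 - score_a) - (1 - expected_a))

ratings = defaultdict(lambda: INIT_RATING)
with open("preferences.csv", newline="") as f:
    for row in csv.DictReader(f):
        score = {"A": 1.0, "B": 0.0, "tie": 0.5}[row["winner"]]
        update(ratings, row["model_a"], row["model_b"], score)

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.1f}")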

Main Results 📌

  • Average lines of code for biological motion representation
  • Win-rate and the rate of ‘Both-are-bad’
  • Elo scores of a subset of models
  • Comparison with other benchmarks

Contact ✉️

Please contact the first author of this paper for queries.

  • Zijian Chen, zijian.chen@sjtu.edu.cn

Citation 📎

If you find our work interesting, please feel free to cite our paper:

@article{chen2025can,
  title={Can Large Models Fool the Eye? A New Turing Test for Biological Animation},
  author={Chen, Zijian and Deng, Lirong and Chen, Zhengyu and Zhang, Kaiwei and Jia, Qi and Tian, Yuan and Zhu, Yucheng and Zhai, Guangtao},
  journal={arXiv preprint arXiv:2508.06072},
  year={2025}
}
