This software project accompanies the research paper *CoMotion: Concurrent Multi-person 3D Motion* by Alejandro Newell, Peiyun Hu, Lahav Lipson, Stephan R. Richter, and Vladlen Koltun.
We introduce CoMotion, an approach for detecting and tracking detailed 3D poses of multiple people from a single monocular camera stream. Our system maintains temporally coherent predictions in crowded scenes filled with difficult poses and occlusions. Our model performs both strong per-frame detection and a learned pose update to track people from frame to frame. Rather than match detections across time, poses are updated directly from a new input image, which enables online tracking through occlusion.
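At a glance, the online loop pairs per-frame detection with the learned pose update. The sketch below is illustrative only; `detect`, `update`, and `spawn_new` are hypothetical stand-ins for the two stages described above, not the API of this repository.

```python
def track_video(frames, detect, update, spawn_new):
    """Illustrative sketch of CoMotion's online loop (hypothetical API).

    detect(image)                 -> per-frame detections (detection stage)
    update(tracks, image)         -> tracks with poses refreshed directly
                                     from the new image (update stage)
    spawn_new(tracks, detections) -> tracks plus any newly appeared people
    """
    tracks = []
    for t, image in enumerate(frames):
        if tracks:
            # Poses are updated directly from the new input image rather
            # than by matching detections across time, which is what lets
            # tracks survive occlusion.
            tracks = update(tracks, image)
        # Detection still runs every frame so new people can enter the scene.
        tracks = spawn_new(tracks, detect(image))
        yield t, tracks
```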
The code in this directory provides helper functions and scripts for inference and visualization.
To set up the environment:

```bash
conda create -n comotion -y python=3.10
conda activate comotion
pip install -e '.[all]'
```
To download pretrained checkpoints, run:
```bash
bash get_pretrained_models.sh
```
Checkpoint data will be downloaded to `src/comotion_demo/data`. You will find pretrained weights for the detection stage, which includes the main vision backbone (`comotion_detection_checkpoint.pt`), as well as a separate checkpoint for the update stage (`comotion_refine_checkpoint.pt`). You can use the detection stage standalone for single-image multi-person pose estimation.
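If you want to poke at the detection checkpoint directly, a minimal sketch with `torch.load` follows; the actual state-dict layout and model class are defined by the `comotion_demo` package, so treat the printed keys as exploratory.

```python
import torch

# Inspect the downloaded detection checkpoint. The exact contents (state
# dict layout, wrapper keys) are defined by the comotion_demo package; this
# only peeks at what was saved.
ckpt = torch.load(
    "src/comotion_demo/data/comotion_detection_checkpoint.pt",
    map_location="cpu",
)
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])  # first few keys, whatever they may be
```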
For macOS, we provide a pre-compiled CoreML version of the detection stage of the model, which offers significant speedups when running locally on a personal device.
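As a rough sketch of loading that model with `coremltools` (the file name below is a placeholder; use whatever the download script placed under `src/comotion_demo/data`):

```python
import coremltools as ct

# Load the pre-compiled CoreML detection model (predictions run on macOS
# only). The path here is a placeholder, not the actual file name.
model = ct.models.MLModel("src/comotion_demo/data/comotion_detection.mlpackage")
print(model.get_spec().description)  # inputs/outputs the model expects
```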
In order to run CoMotion and the corresponding visualization, the neutral SMPL body model is required. Please go to the SMPL website and follow the provided instructions to download the model (version 1.1.0). After downloading, copy `basicmodel_neutral_lbs_10_207_0_v1.1.0.pkl` to `src/comotion_demo/data/smpl/SMPL_NEUTRAL.pkl` (we rename the file to be compatible with the visualization library `aitviewer`).
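If you prefer to script the copy, here is a small sketch; it assumes the archive was extracted to your downloads folder, so adjust the source path as needed.

```python
import shutil
from pathlib import Path

# Copy the downloaded SMPL neutral model into the location the demo expects,
# renaming it for compatibility with aitviewer.
downloads = Path("~/Downloads").expanduser()  # assumption: your download folder
src = downloads / "basicmodel_neutral_lbs_10_207_0_v1.1.0.pkl"
dst = Path("src/comotion_demo/data/smpl/SMPL_NEUTRAL.pkl")
dst.parent.mkdir(parents=True, exist_ok=True)
shutil.copy(src, dst)
```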
We provide a demo script that takes either a video file or a directory of images as input. To run it, call:
```bash
python demo.py -i path/to/video.mp4 -o results/
```
Optional arguments include `--start-frame` and `--num-frames` to select a subset of the video to run on. The network will save a `.pt` file with all of the detected SMPL pose parameters, as well as a rendered `.mp4` with the predictions overlaid on the input video. We also automatically produce a `.txt` file in the MOT format with bounding boxes compatible with most standard tracking evaluation code. If you wish to skip the visualization, pass the `--skip-visualization` flag.
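The MOT `.txt` can be read with a few lines of Python. This sketch assumes the standard MOT row layout (`frame, id, bb_left, bb_top, bb_width, bb_height, conf, x, y, z`) and an output file named after the input video; check the actual file before relying on either.

```python
import csv
from collections import defaultdict

tracks = defaultdict(list)  # track id -> list of (frame, bbox)
with open("results/video.txt") as f:  # assumption: named after the input video
    for row in csv.reader(f):
        frame, track_id = int(row[0]), int(row[1])
        bbox = tuple(float(v) for v in row[2:6])  # left, top, width, height
        tracks[track_id].append((frame, bbox))
print(f"{len(tracks)} tracks found")
```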
The demo code supports running on a single image as well, which the code will infer automatically if the input path provided has a `.png` or `.jpeg`/`.jpg` suffix:
```bash
python demo.py -i path/to/image.jpg -o results/
```
In this case, we save a `.pt` file with the detected SMPL poses as well as 2D and 3D coordinates and confidences associated with each detection.
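A quick way to inspect that output (the file name and exact keys are assumptions; both are determined by `demo.py`):

```python
import torch

# Load the saved predictions and list whatever was stored, without assuming
# particular key names.
preds = torch.load("results/image.pt", map_location="cpu")  # assumed name
if isinstance(preds, dict):
    for key, value in preds.items():
        shape = tuple(value.shape) if hasattr(value, "shape") else type(value)
        print(key, shape)
```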
Tips:

- If you encounter an error that `libc++.1.dylib` is not found, resolve it with `conda install libcxx`.
- For headless rendering on a remote server, you may encounter an error like `XOpenDisplay: cannot open display`. In this case, start a virtual display with `Xvfb :0 -screen 0 640x480x24 & export DISPLAY=:0.0`. You may need to install `xvfb` first (`apt install xvfb`).
If you find our work useful, please cite the following paper:
```bibtex
@inproceedings{newell2025comotion,
  title     = {CoMotion: Concurrent Multi-person 3D Motion},
  author    = {Alejandro Newell and Peiyun Hu and Lahav Lipson and Stephan R. Richter and Vladlen Koltun},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://openreview.net/forum?id=qKu6KWPgxt},
}
```
This sample code is released under the LICENSE terms.
The model weights are released under the MODEL LICENSE terms.
Our codebase is built using multiple open source contributions; please see Acknowledgements for more details.
Please check the paper for a complete list of references and datasets used in this work.