Timing Yang1, Sicheng He1, Hongyi Jing1, Jiawei Yang1, Zhijian Liu2,3, Chuhang Zou4†, Yue Wang1,3†
1USC Physical Superintelligence (PSI) Lab 2University of California, San Diego 3NVIDIA 4Meta Reality Labs
† Joint corresponding authors
Speed-accuracy overview of Fast SAM 3D Body. Top left: Qualitative results on in-the-wild images show our framework preserves high-fidelity reconstruction. Top right: Our method achieves up to a 10.9x end-to-end speedup over SAM 3D Body and replaces the iterative MHR-to-SMPL bottleneck with a 10,000x faster neural mapping. Bottom: Our system enables real-time humanoid robot control from a single RGB stream at ~65 ms per frame on an NVIDIA RTX 5090.
SAM 3D Body (3DB) achieves state-of-the-art accuracy in monocular 3D human mesh recovery, yet its inference latency of several seconds per image precludes real-time application. We present Fast SAM 3D Body, a training-free acceleration framework that reformulates the 3DB inference pathway to achieve interactive rates. By decoupling serial spatial dependencies and applying architecture-aware pruning, we enable parallelized multi-crop feature extraction and streamlined transformer decoding. Moreover, to extract the joint-level kinematics (SMPL) compatible with existing humanoid control and policy learning frameworks, we replace the iterative mesh fitting with a direct feedforward mapping, accelerating this specific conversion by over 10,000x. Overall, our framework delivers up to a 10.9x end-to-end speedup while maintaining on-par reconstruction fidelity, even surpassing 3DB on benchmarks such as LSPET. We demonstrate its utility by deploying Fast SAM 3D Body in a vision-only teleoperation system that enables real-time humanoid control and the direct collection of manipulation policies from a single RGB stream.
Qualitative comparison. The original SAM 3D Body (left) and our Fast variant (right) yield visually comparable mesh reconstructions across diverse poses and multi-person scenes on 3DPW and EMDB.
Please refer to SAM 3D Body for environment setup, or use our setup script:
bash setup_env.sh
conda activate fast_sam_3d_body
Dependencies installed by setup_env.sh:
- Python 3.11, PyTorch 2.5.1 + CUDA 12.4
- Detectron2, Ultralytics YOLO, MoGe, ONNX Runtime GPU
- pyrender, roma, einops, timm, huggingface_hub
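Before launching the demos, it can help to confirm that the packages listed above are importable. The following is a minimal sketch; the import names (for example `moge` for MoGe and `onnxruntime` for ONNX Runtime GPU) are assumptions inferred from the list and may differ from what `setup_env.sh` actually installs.

```python
from importlib.util import find_spec

# Import names are assumptions inferred from the dependency list above.
ASSUMED_MODULES = [
    "torch", "detectron2", "ultralytics", "moge", "onnxruntime",
    "pyrender", "roma", "einops", "timm", "huggingface_hub",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(ASSUMED_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```

An empty result only shows that the modules can be located; it does not verify versions or CUDA support.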
checkpoints/
├── sam-3d-body-dinov3/          # Auto-downloaded from HuggingFace on first run
│   ├── model.ckpt               (~2.0 GB)
│   ├── model_config.yaml
│   └── assets/
│       └── mhr_model.pt         (~664 MB)
├── yolo/                        # Place YOLO-Pose weights here
│   ├── yolo11m-pose.pt
│   └── yolo11m-pose.engine      # Generated by convert_yolo_pose_trt.py (optional)
└── moge_trt/                    # Generated by build_tensorrt.sh (optional)
    └── moge_dinov2_encoder_fp16.engine
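The layout above can be checked with a short script before running the demos. This is a sketch rather than part of the repository: the helper name `missing_files` is hypothetical, and the split into expected and optional files simply follows the comments in the tree. The `sam-3d-body-dinov3` files may legitimately be absent before the first run, since they are downloaded automatically.

```python
from pathlib import Path

# Paths copied from the layout above, relative to the checkpoints/ directory.
EXPECTED = [
    "sam-3d-body-dinov3/model.ckpt",
    "sam-3d-body-dinov3/model_config.yaml",
    "sam-3d-body-dinov3/assets/mhr_model.pt",
    "yolo/yolo11m-pose.pt",
]
# TensorRT engines are generated on demand and marked optional in the tree.
OPTIONAL = [
    "yolo/yolo11m-pose.engine",
    "moge_trt/moge_dinov2_encoder_fp16.engine",
]


def missing_files(root, relative_paths):
    """Return the relative paths that do not exist as files under root."""
    root = Path(root)
    return [p for p in relative_paths if not (root / p).is_file()]


if __name__ == "__main__":
    print("missing expected files:", missing_files("checkpoints", EXPECTED))
    print("missing optional engines:", missing_files("checkpoints", OPTIONAL))
```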
The sam-3d-body-dinov3 checkpoint is fetched automatically on first run via huggingface_hub. To pre-download manually:
python -c "from huggingface_hub import snapshot_download; snapshot_download('facebook/sam-3d-body-dinov3', local_dir='checkpoints/sam-3d-body-dinov3')"
# Optimized demo – single image or webcam (torch.compile + TensorRT)
bash run_demo.sh
# Webcam real-time demo
bash run_webcam.sh
# Quick single-image test (no TensorRT required)
python demo_human.py \
--image_path assets/teaser.png \
--detector yolo \
--detector_model checkpoints/yolo/yolo11m-pose.pt
# Convert all models (YOLO-Pose + MoGe encoder + DINOv3 backbone)
bash build_tensorrt.sh
# Or convert individually
python convert_yolo_pose_trt.py --model yolo11m-pose.pt --imgsz 640 --half
python convert_moge_encoder_trt.py --all
python convert_backbone_tensorrt.py --all
All generated engines are stored under ./checkpoints/.
fast_sam_3dbody_cpp/ is a self-contained C++ library and CLI that runs the full
pipeline (YOLO → backbone → decoder → MHR heads) with zero Python runtime dependency.
It also includes two Python frontends that wrap the compiled library via ctypes.
Image (BGR uint8)
│
▼ YOLO11m-pose (ONNX Runtime, CUDA EP) person bboxes + COCO keypoints
▼ backbone.onnx (DINOv3-ViT-H/14+) feature map [B, 1280, 32, 32]
▼ decoder.onnx (6-layer PromptableDecoder) pose token [B, 1024]
▼ pipeline.gguf (MHR + camera heads, CPU) pose params [B, 519] camera [B, 3]
▼ body_model.onnx (optional LBS skinning) vertices [18439×3] + joints
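When wiring the stages together, the shapes in the diagram above act as a contract between them. The sketch below encodes that contract and checks a set of reported shapes against it. The stage keys (`feature_map`, `pose_token`, and so on) and the function name are hypothetical labels chosen for illustration; only the dimensions come from the diagram.

```python
# Expected per-stage output shapes, copied from the pipeline diagram above.
# "B" stands for the batch dimension (number of person crops).
EXPECTED_SHAPES = {
    "feature_map": ("B", 1280, 32, 32),
    "pose_token": ("B", 1024),
    "pose_params": ("B", 519),
    "camera": ("B", 3),
}


def shapes_consistent(shapes):
    """Check that every stage matches the diagram and shares one batch size."""
    batch = None
    for name, expected in EXPECTED_SHAPES.items():
        actual = shapes.get(name)
        if actual is None or len(actual) != len(expected):
            return False
        for got, want in zip(actual, expected):
            if want == "B":
                if batch is None:
                    batch = got  # first stage fixes the batch size
                elif got != batch:
                    return False
            elif got != want:
                return False
    return True


if __name__ == "__main__":
    two_people = {
        "feature_map": (2, 1280, 32, 32),
        "pose_token": (2, 1024),
        "pose_params": (2, 519),
        "camera": (2, 3),
    }
    print(shapes_consistent(two_people))
```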
Run once from the repo root (Python venv must be active):
source venv/bin/activate # or: conda activate fast_sam_3d_body
python fast_sam_3dbody_cpp/prepare_models.py \
--checkpoint ./checkpoints/sam-3d-body-dinov3
This writes to fast_sam_3dbody_cpp/onnx/:
| File | Size | Description |
|---|---|---|
| backbone.onnx + .data | ~3.2 GB | DINOv3-ViT-H/14+ backbone |
| decoder.onnx | ~174 MB | 6-layer transformer decoder |
| pipeline.gguf | ~5 MB | MHR + camera projection heads |
| yolo.onnx | ~81 MB | YOLO11m-pose person detector |
| body_model.pt | ~664 MB | TorchScript LBS body model (optional) |
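To estimate the disk space needed for the exported models, the sizes in the table above can be added up. The sketch below uses the approximate figures from the table and treats 1 GB as 1000 MB, so the totals are rough estimates only.

```python
# Approximate sizes in MB, taken from the table above (1 GB treated as 1000 MB).
SIZES_MB = {
    "backbone.onnx + .data": 3200,
    "decoder.onnx": 174,
    "pipeline.gguf": 5,
    "yolo.onnx": 81,
    "body_model.pt": 664,  # optional
}


def total_gb(names):
    """Sum the approximate sizes of the named files and return gigabytes."""
    return sum(SIZES_MB[n] for n in names) / 1000


core = [n for n in SIZES_MB if n != "body_model.pt"]
print(f"core models: ~{total_gb(core):.2f} GB")          # ~3.46 GB
print(f"with body model: ~{total_gb(SIZES_MB):.2f} GB")  # ~4.12 GB
```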
To skip steps where the output already exists:
python fast_sam_3dbody_cpp/prepare_models.py --skip onnx # skip backbone + decoder
python fast_sam_3dbody_cpp/prepare_models.py --skip gguf # skip pipeline.gguf
python fast_sam_3dbody_cpp/prepare_models.py --skip yolo # skip yolo.onnx
Requirements: CMake ≥ 3.18, g++ with C++17, CUDA Toolkit (optional but recommended), OpenCV.
cd fast_sam_3dbody_cpp
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
# Optional: point to a pre-downloaded ONNX Runtime to avoid the 300 MB download
cmake .. -DCMAKE_BUILD_TYPE=Release \
-DONNX_RUNTIME_DIR=/path/to/onnxruntime-linux-x64-gpu-1.20.1
make -j$(nproc)
CMake auto-detects CUDA (defaults to sm_86; change with -DCMAKE_CUDA_ARCHITECTURES=<arch>).
If CUDA is not found, a CPU-only build is produced automatically.
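If the target GPU is not an sm_86 device, the architecture value has to be derived from its compute capability. The helper below is a small sketch that converts a capability string such as "8.6" into the corresponding flag. It assumes the project accepts the standard numeric form of `CMAKE_CUDA_ARCHITECTURES`; the function name is hypothetical. On recent NVIDIA drivers the capability string can typically be read with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`.

```python
def cmake_arch_flag(compute_cap):
    """Turn a compute capability string such as "8.6" into a CMake flag."""
    major, minor = compute_cap.strip().split(".")
    return f"-DCMAKE_CUDA_ARCHITECTURES={int(major)}{int(minor)}"


if __name__ == "__main__":
    print(cmake_arch_flag("8.6"))   # -DCMAKE_CUDA_ARCHITECTURES=86
    print(cmake_arch_flag("12.0"))  # -DCMAKE_CUDA_ARCHITECTURES=120
```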
Outputs in fast_sam_3dbody_cpp/build/:
- libfast_sam_3dbody.so – shared library for ctypes / C++ linking
- fast_sam_3dbody_run – standalone CLI executable
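Since the shared library is meant to be consumed via ctypes, a first sanity check is simply to load it. The sketch below does only that: the exported function names and signatures are not documented in this section, so no calls into the library are shown, and `load_engine` is a hypothetical helper name.

```python
import ctypes
from pathlib import Path


def load_engine(build_dir):
    """Load libfast_sam_3dbody.so from build_dir, or return None if it is absent."""
    lib_path = Path(build_dir) / "libfast_sam_3dbody.so"
    if not lib_path.is_file():
        return None
    # ctypes.CDLL raises OSError if the file exists but cannot be loaded,
    # for example when a dependent shared library is missing.
    return ctypes.CDLL(str(lib_path))


if __name__ == "__main__":
    lib = load_engine("fast_sam_3dbody_cpp/build")
    print("library loaded" if lib is not None else "library not built yet")
```

For the actual bindings, the two bundled Python frontends are the authoritative reference.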
cd fast_sam_3dbody_cpp/build
# Single image
./fast_sam_3dbody_run \
--onnx-dir ../onnx \
--gguf ../onnx/pipeline.gguf \
--yolo ../onnx/yolo.onnx \
--from ../../assets/teaser.png
# Webcam (device 0)
./fast_sam_3dbody_run \
--onnx-dir ../onnx --gguf ../onnx/pipeline.gguf --yolo ../onnx/yolo.onnx \
--from 0
# Video file
./fast_sam_3dbody_run \
--onnx-dir ../onnx --gguf ../onnx/pipeline.gguf --yolo ../onnx/yolo.onnx \
--from /path/to/video.mp4
# Skip body model (fastest – pose params only, no 3D mesh)
./fast_sam_3dbody_run ... --skip-body
# CPU-only inference
./fast_sam_3dbody_run ... --cuda -1
# Cap persons per frame
./fast_sam_3dbody_run ... --max-persons 4
Full option list: ./fast_sam_3dbody_run --help
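When the CLI is driven from a script, the invocations above can be assembled programmatically. The sketch below builds the argument list from the flags shown in this section only; `build_command` and its parameter names are hypothetical, and the default paths assume the command is run from `fast_sam_3dbody_cpp/build`.

```python
def build_command(source, onnx_dir="../onnx", skip_body=False, cuda=None, max_persons=None):
    """Assemble an argument list for fast_sam_3dbody_run using the documented flags."""
    cmd = [
        "./fast_sam_3dbody_run",
        "--onnx-dir", onnx_dir,
        "--gguf", f"{onnx_dir}/pipeline.gguf",
        "--yolo", f"{onnx_dir}/yolo.onnx",
        "--from", str(source),  # image path, video path, or webcam index
    ]
    if skip_body:
        cmd.append("--skip-body")  # pose params only, no 3D mesh
    if cuda is not None:
        cmd += ["--cuda", str(cuda)]  # -1 selects CPU-only inference
    if max_persons is not None:
        cmd += ["--max-persons", str(max_persons)]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_command(0, cuda=-1, max_persons=4)))
```

The resulting list can be passed to `subprocess.run(cmd, cwd="fast_sam_3dbody_cpp/build", check=True)`.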
Visualises COCO 2D skeletons + pose bar panel. Requires only opencv-python and numpy.
# From repo root
python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend.py --from assets/teaser.png
# Webcam with at most 3 skeletons
python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend.py --from 0 --max-skeletons 3
# Save output image
python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend.py \
--from assets/teaser.png --out out.jpg
# Headless / video processing
python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend.py \
--from video.mp4 --headless --out out_video.mp4
# Key options
# --cuda N CUDA device for C engine (default 0; -1 = CPU)
# --thresh 0.5 YOLO confidence threshold
# --max-skeletons N cap person count
# --fx / --fy custom focal length in pixels
# --cx / --cy custom principal point
Full 3D mesh rendering identical to demo_webcam.py ([orig | 2D skeleton | front mesh | side mesh]).
Uses the C engine for detection + backbone + decoder + MHR FFN, then the Python body model for LBS skinning.
Requires the full Python environment (sam_3d_body package, pyrender, torch).
python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend-3D.py --from assets/teaser.png
# Webcam, limit to 3 persons
python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend-3D.py --from 0 --max-skeletons 3
# Custom checkpoint paths
python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend-3D.py \
--from assets/teaser.png \
--checkpoint ./checkpoints/sam-3d-body-dinov3/model.ckpt \
--mhr-model ./checkpoints/sam-3d-body-dinov3/assets/mhr_model.pt
# Save output
python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend-3D.py \
--from assets/teaser.png --out out_3d.jpg
# Key options (same as lightweight frontend, plus):
# --checkpoint PATH model.ckpt path (default: checkpoints/sam-3d-body-dinov3/model.ckpt)
# --mhr-model PATH mhr_model.pt path
# --device cuda|cpu PyTorch device for body model
To package all runtime models into a single zip for deployment on another machine:
# C++ models only (~3.5 GB)
bash scripts/create_redist.sh
# Include Python checkpoint too (~6 GB)
bash scripts/create_redist.sh --with-python
# Include LBS body_model.pt (~664 MB extra)
bash scripts/create_redist.sh --with-body
# Write to a specific directory
bash scripts/create_redist.sh --output /mnt/storage
Output: fast_sam_3dbody_models_YYYYMMDD.zip
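A deployment script that picks up the archive needs to know its name in advance. The sketch below reproduces the `YYYYMMDD` naming pattern shown above. It assumes the stamp is the local date on which the script ran, which is not confirmed here; the function name is hypothetical.

```python
from datetime import date


def redist_archive_name(day):
    """Return the archive name for the given date, following the pattern above."""
    return f"fast_sam_3dbody_models_{day:%Y%m%d}.zip"


if __name__ == "__main__":
    print(redist_archive_name(date.today()))
```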
For instructions on running the publisher, see docs/realworld_deployment.md.
We demonstrate a real-time, vision-only teleoperation system for the Unitree G1 humanoid robot using a single RGB camera, operating at ~65 ms end-to-end latency on an NVIDIA RTX 5090.
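For reference, the quoted latency can be converted into an approximate frame rate. The sketch assumes frames are processed one after another with no pipelining across frames, so throughput is simply the inverse of latency.

```python
def frames_per_second(latency_ms):
    """Convert a per-frame latency in milliseconds into frames per second."""
    return 1000.0 / latency_ms


print(f"{frames_per_second(65):.1f} FPS")  # 15.4 FPS
```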
Humanoid teleoperation. The system tracks diverse whole-body motions including upper-body gestures (a), body rotations (b-e), walking (f), wide stance (g), single-leg standing (h), squatting (i), and kneeling (j).
Humanoid policy rollout. The robot grasps a box on the table with both hands, squats down, and steps to the right. The policy achieves an 80% task success rate with 40 demonstrations collected via our system.
Single-View vs Multi-View. Multi-view fusion resolves depth ambiguities inherent in single-view reconstruction, producing more accurate SMPL body estimates.
@article{yang2026fastsam3dbody,
title={Fast SAM 3D Body: Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery},
author={Yang, Timing and He, Sicheng and Jing, Hongyi and Yang, Jiawei and Liu, Zhijian and Zou, Chuhang and Wang, Yue},
journal={arXiv preprint arXiv:2603.15603},
year={2026}
}
This project builds upon SAM 3D Body (3DB) and Momentum Human Rig (MHR). We thank the original authors for releasing their models and codebases, which served as the foundation for our acceleration framework.




