AmmarkoV/Fast-SAM-3D-Body

 
 


Fast SAM 3D Body

Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery

Timing Yang¹, Sicheng He¹, Hongyi Jing¹, Jiawei Yang¹, Zhijian Liu²,³, Chuhang Zou⁴, Yue Wang¹,³

¹USC Physical Superintelligence (PSI) Lab   ²University of California, San Diego   ³NVIDIA   ⁴Meta Reality Labs

Joint corresponding authors

Paper   Project Page

Speed-accuracy overview of Fast SAM 3D Body. Top left: Qualitative results on in-the-wild images show our framework preserves high-fidelity reconstruction. Top right: Our method achieves up to a 10.25x end-to-end speedup over SAM 3D Body and replaces the iterative MHR-to-SMPL bottleneck with a 10,000x faster neural mapping. Bottom: Our system enables real-time humanoid robot control from a single RGB stream at ~65 ms per frame on an NVIDIA RTX 5090.

Abstract

SAM 3D Body (3DB) achieves state-of-the-art accuracy in monocular 3D human mesh recovery, yet its inference latency of several seconds per image precludes real-time application. We present Fast SAM 3D Body, a training-free acceleration framework that reformulates the 3DB inference pathway to achieve interactive rates. By decoupling serial spatial dependencies and applying architecture-aware pruning, we enable parallelized multi-crop feature extraction and streamlined transformer decoding. Moreover, to extract the joint-level kinematics (SMPL) compatible with existing humanoid control and policy learning frameworks, we replace the iterative mesh fitting with a direct feedforward mapping, accelerating this specific conversion by over 10,000x. Overall, our framework delivers up to a 10.9x end-to-end speedup while maintaining on-par reconstruction fidelity, even surpassing 3DB on benchmarks such as LSPET. We demonstrate its utility by deploying Fast SAM 3D Body in a vision-only teleoperation system that enables real-time humanoid control and the direct collection of manipulation policies from a single RGB stream.

Qualitative comparison. The original SAM 3D Body (left) and our Fast variant (right) yield visually comparable mesh reconstructions across diverse poses and multi-person scenes on 3DPW and EMDB.

Getting Started

Environment

Please refer to SAM 3D Body for environment setup, or use our setup script:

bash setup_env.sh
conda activate fast_sam_3d_body

Dependencies installed by setup_env.sh:

  • Python 3.11, PyTorch 2.5.1 + CUDA 12.4
  • Detectron2, Ultralytics YOLO, MoGe, ONNX Runtime GPU
  • pyrender, roma, einops, timm, huggingface_hub

Checkpoints

checkpoints/
├── sam-3d-body-dinov3/       # Auto-downloaded from HuggingFace on first run
│   ├── model.ckpt            (~2.0 GB)
│   ├── model_config.yaml
│   └── assets/
│       └── mhr_model.pt      (~664 MB)
├── yolo/                     # Place YOLO-Pose weights here
│   ├── yolo11m-pose.pt
│   └── yolo11m-pose.engine   # Generated by convert_yolo_pose_trt.py (optional)
└── moge_trt/                 # Generated by build_tensorrt.sh (optional)
    └── moge_dinov2_encoder_fp16.engine

The sam-3d-body-dinov3 checkpoint is fetched automatically on first run via huggingface_hub. To pre-download manually:

python -c "from huggingface_hub import snapshot_download; snapshot_download('facebook/sam-3d-body-dinov3', local_dir='checkpoints/sam-3d-body-dinov3')"
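Before running the demos, it can help to confirm that the checkpoint tree matches the layout shown above. The following is a minimal sketch, not part of the repository: `missing_checkpoint_files` is a hypothetical helper, and the list of required files is read off the tree above (the optional TensorRT engines are left out).

```python
from pathlib import Path

# Required files, copied from the checkpoint tree above.
# The optional TensorRT engines are deliberately not listed.
REQUIRED_FILES = [
    "sam-3d-body-dinov3/model.ckpt",
    "sam-3d-body-dinov3/model_config.yaml",
    "sam-3d-body-dinov3/assets/mhr_model.pt",
    "yolo/yolo11m-pose.pt",
]


def missing_checkpoint_files(root="checkpoints"):
    """Return the required files that are absent under `root` (hypothetical helper)."""
    root = Path(root)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoint_files()
    if missing:
        print("Missing checkpoint files:")
        for rel in missing:
            print("  ", rel)
    else:
        print("All required checkpoint files are present.")
```

Since the sam-3d-body-dinov3 files are fetched on first run, a missing entry there is only a problem when working offline; the YOLO-Pose weights always have to be placed manually.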

Run (Python pipeline)

# Optimized demo – single image or webcam (torch.compile + TensorRT)
bash run_demo.sh

# Webcam real-time demo
bash run_webcam.sh

# Quick single-image test (no TensorRT required)
python demo_human.py \
    --image_path assets/teaser.png \
    --detector yolo \
    --detector_model checkpoints/yolo/yolo11m-pose.pt

TensorRT Acceleration (Optional)

# Convert all models (YOLO-Pose + MoGe encoder + DINOv3 backbone)
bash build_tensorrt.sh

# Or convert individually
python convert_yolo_pose_trt.py --model yolo11m-pose.pt --imgsz 640 --half
python convert_moge_encoder_trt.py --all
python convert_backbone_tensorrt.py --all

All generated engines are stored under ./checkpoints/.


C++ Inference Engine

fast_sam_3dbody_cpp/ is a self-contained C++ library and CLI that runs the full pipeline (YOLO → backbone → decoder → MHR heads) with zero Python runtime dependency. It also includes two Python frontends that wrap the compiled library via ctypes.
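As a rough illustration of the ctypes approach, the sketch below only loads the shared library and reports whether that succeeded. The library path is the build output named in the Build step; `load_engine` is a hypothetical helper, and no exported function names are assumed here because this README does not list them (the bundled frontends are the reference for the actual C API).

```python
import ctypes
from pathlib import Path

# Build output path as documented in the Build step; adjust if you build elsewhere.
DEFAULT_LIB = Path("fast_sam_3dbody_cpp/build/libfast_sam_3dbody.so")


def load_engine(lib_path=DEFAULT_LIB):
    """Return a ctypes.CDLL handle, or None if the library is missing or unloadable."""
    lib_path = Path(lib_path)
    if not lib_path.is_file():
        return None
    try:
        # Resolve to an absolute path so the loader does not search system directories.
        return ctypes.CDLL(str(lib_path.resolve()))
    except OSError:
        # Raised for non-library files and for unresolved dependencies (e.g. missing CUDA libs).
        return None


if __name__ == "__main__":
    lib = load_engine()
    print("engine loaded" if lib is not None else "engine not available; build it first")
```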

Pipeline overview

Image (BGR uint8)
  │
  ▼  YOLO11m-pose  (ONNX Runtime, CUDA EP)       person bboxes + COCO keypoints
  ▼  backbone.onnx  (DINOv3-ViT-H/14+)           feature map  [B, 1280, 32, 32]
  ▼  decoder.onnx   (6-layer PromptableDecoder)   pose token   [B, 1024]
  ▼  pipeline.gguf  (MHR + camera heads, CPU)     pose params  [B, 519]  camera [B, 3]
  ▼  body_model.onnx  (optional LBS skinning)     vertices [18439×3] + joints
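To get a feel for the tensor sizes in the overview above, the following sketch computes element counts and memory footprints per detected person. The shapes are copied from the diagram; the 4-byte element size (float32) is an assumption, since the README does not state the precision used by each stage.

```python
# Per-person shapes copied from the pipeline overview (batch dimension B removed).
STAGE_SHAPES = {
    "feature map": (1280, 32, 32),  # backbone output
    "pose token": (1024,),          # decoder output
    "pose params": (519,),          # MHR head output
    "camera": (3,),                 # camera head output
    "vertices": (18439, 3),         # optional body model output
}


def num_elements(shape):
    """Product of all dimensions in `shape`."""
    n = 1
    for dim in shape:
        n *= dim
    return n


def tensor_bytes(shape, batch=1, bytes_per_element=4):
    """Memory footprint in bytes; 4 bytes per element assumes float32."""
    return batch * num_elements(shape) * bytes_per_element


if __name__ == "__main__":
    for name, shape in STAGE_SHAPES.items():
        mib = tensor_bytes(shape) / 2**20
        print(f"{name:12s} {num_elements(shape):>9d} elements  {mib:8.3f} MiB")
```

Under the float32 assumption, the backbone feature map dominates at 1280 × 32 × 32 = 1,310,720 elements, which is exactly 5 MiB per person.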

1 Prepare ONNX / GGUF models

Run once from the repo root (Python venv must be active):

source venv/bin/activate   # or: conda activate fast_sam_3d_body

python fast_sam_3dbody_cpp/prepare_models.py \
    --checkpoint ./checkpoints/sam-3d-body-dinov3

This writes to fast_sam_3dbody_cpp/onnx/:

File                    Size      Description
backbone.onnx + .data   ~3.2 GB   DINOv3-ViT-H/14+ backbone
decoder.onnx            ~174 MB   6-layer transformer decoder
pipeline.gguf           ~5 MB     MHR + camera projection heads
yolo.onnx               ~81 MB    YOLO11m-pose person detector
body_model.pt           ~664 MB   TorchScript LBS body model (optional)

To skip steps where the output already exists:

python fast_sam_3dbody_cpp/prepare_models.py --skip onnx   # skip backbone + decoder
python fast_sam_3dbody_cpp/prepare_models.py --skip gguf   # skip pipeline.gguf
python fast_sam_3dbody_cpp/prepare_models.py --skip yolo   # skip yolo.onnx
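To decide which steps can be skipped, one option is to check which outputs already exist. This is a minimal sketch under stated assumptions: `completed_steps` is a hypothetical helper, and the mapping from each `--skip` value to its output files is inferred from the comments in the commands above and the file table, not taken from prepare_models.py itself.

```python
from pathlib import Path

# Assumed mapping from --skip value to the files that step produces.
# Only backbone.onnx is checked for the backbone; its external .data file is not.
STEP_OUTPUTS = {
    "onnx": ["backbone.onnx", "decoder.onnx"],
    "gguf": ["pipeline.gguf"],
    "yolo": ["yolo.onnx"],
}


def completed_steps(onnx_dir="fast_sam_3dbody_cpp/onnx"):
    """Return the --skip values whose output files all exist in `onnx_dir`."""
    onnx_dir = Path(onnx_dir)
    return [
        step
        for step, files in STEP_OUTPUTS.items()
        if all((onnx_dir / name).is_file() for name in files)
    ]


if __name__ == "__main__":
    done = completed_steps()
    print("Steps with existing outputs:", ", ".join(done) if done else "none")
```

Whether several `--skip` values can be combined in a single invocation is not stated here, so the sketch only reports the candidates.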

2 Build

Requirements: CMake ≥ 3.18, g++ with C++17, CUDA Toolkit (optional but recommended), OpenCV.

cd fast_sam_3dbody_cpp
mkdir -p build && cd build

# Default: ONNX Runtime is downloaded automatically (~300 MB)
cmake .. -DCMAKE_BUILD_TYPE=Release

# Alternative: point to a pre-downloaded ONNX Runtime to avoid the download
cmake .. -DCMAKE_BUILD_TYPE=Release \
         -DONNX_RUNTIME_DIR=/path/to/onnxruntime-linux-x64-gpu-1.20.1

make -j$(nproc)

CMake auto-detects CUDA (defaults to sm_86; change with -DCMAKE_CUDA_ARCHITECTURES=<arch>). If CUDA is not found, a CPU-only build is produced automatically.
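If the default sm_86 does not match your GPU, the architecture value can be derived from the device's compute capability. The sketch below is one way to do that, assuming a driver recent enough for `nvidia-smi` to support the `compute_cap` query field; both function names are hypothetical.

```python
import subprocess


def arch_from_compute_cap(compute_cap):
    """Convert a compute capability such as '8.6' into the CMake form '86'."""
    major, minor = compute_cap.strip().split(".")
    return f"{int(major)}{int(minor)}"


def detect_arch():
    """Return the first GPU's architecture string, or None if it cannot be determined."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
        lines = out.strip().splitlines()
        return arch_from_compute_cap(lines[0]) if lines else None
    except (OSError, subprocess.CalledProcessError, ValueError):
        # nvidia-smi missing, query unsupported, or unexpected output format.
        return None


if __name__ == "__main__":
    arch = detect_arch()
    if arch is None:
        print("Could not detect a GPU; keep the default or set the architecture manually.")
    else:
        print(f"cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES={arch}")
```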

Outputs in fast_sam_3dbody_cpp/build/:

  • libfast_sam_3dbody.so – shared library for ctypes / C++ linking
  • fast_sam_3dbody_run – standalone CLI executable

3 Run — CLI executable

cd fast_sam_3dbody_cpp/build

# Single image
./fast_sam_3dbody_run \
    --onnx-dir ../onnx \
    --gguf     ../onnx/pipeline.gguf \
    --yolo     ../onnx/yolo.onnx \
    --from     ../../assets/teaser.png

# Webcam (device 0)
./fast_sam_3dbody_run \
    --onnx-dir ../onnx --gguf ../onnx/pipeline.gguf --yolo ../onnx/yolo.onnx \
    --from 0

# Video file
./fast_sam_3dbody_run \
    --onnx-dir ../onnx --gguf ../onnx/pipeline.gguf --yolo ../onnx/yolo.onnx \
    --from /path/to/video.mp4

# Skip body model (fastest – pose params only, no 3D mesh)
./fast_sam_3dbody_run ... --skip-body

# CPU-only inference
./fast_sam_3dbody_run ... --cuda -1

# Cap persons per frame
./fast_sam_3dbody_run ... --max-persons 4

Full option list: ./fast_sam_3dbody_run --help
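When driving the CLI from a script, it can be convenient to assemble the argument list programmatically. The following sketch uses only the flags shown above; `build_command` is a hypothetical helper, and the default paths assume the working directory is fast_sam_3dbody_cpp/build.

```python
import shlex


def build_command(source, onnx_dir="../onnx", exe="./fast_sam_3dbody_run",
                  skip_body=False, cuda=None, max_persons=None):
    """Assemble an argument list for fast_sam_3dbody_run from the documented flags."""
    cmd = [
        exe,
        "--onnx-dir", onnx_dir,
        "--gguf", f"{onnx_dir}/pipeline.gguf",
        "--yolo", f"{onnx_dir}/yolo.onnx",
        "--from", str(source),  # image path, video path, or webcam index
    ]
    if skip_body:
        cmd.append("--skip-body")  # pose params only, no 3D mesh
    if cuda is not None:
        cmd += ["--cuda", str(cuda)]  # -1 selects CPU-only inference
    if max_persons is not None:
        cmd += ["--max-persons", str(max_persons)]
    return cmd


if __name__ == "__main__":
    cmd = build_command(0, skip_body=True, max_persons=4)
    print(shlex.join(cmd))
    # To execute it: subprocess.run(cmd, check=True)
```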

4 Run — Python lightweight frontend

Visualises COCO 2D skeletons plus a pose bar panel. Requires only opencv-python and numpy.

# From repo root
python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend.py --from assets/teaser.png

# Webcam with at most 3 skeletons
python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend.py --from 0 --max-skeletons 3

# Save output image
python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend.py \
    --from assets/teaser.png --out out.jpg

# Headless / video processing
python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend.py \
    --from video.mp4 --headless --out out_video.mp4

# Key options
#   --cuda N          CUDA device for C engine (default 0; -1 = CPU)
#   --thresh 0.5      YOLO confidence threshold
#   --max-skeletons N cap person count
#   --fx / --fy       custom focal length in pixels
#   --cx / --cy       custom principal point
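If your camera's intrinsics are not calibrated, values for `--fx`, `--fy`, `--cx` and `--cy` can be approximated from the image size and the horizontal field of view using the standard pinhole relation fx = (W / 2) / tan(FOV / 2). The sketch below assumes square pixels and a principal point at the image centre; `intrinsics_from_fov` is a hypothetical helper, and how the engine chooses its defaults when these flags are omitted is not covered here.

```python
import math


def intrinsics_from_fov(width, height, horizontal_fov_deg):
    """Approximate pinhole intrinsics in pixels from image size and horizontal FOV."""
    fx = (width / 2.0) / math.tan(math.radians(horizontal_fov_deg) / 2.0)
    fy = fx                      # assumes square pixels
    cx, cy = width / 2.0, height / 2.0  # assumes a centred principal point
    return fx, fy, cx, cy


if __name__ == "__main__":
    fx, fy, cx, cy = intrinsics_from_fov(1280, 720, horizontal_fov_deg=90.0)
    flags = ["--fx", f"{fx:.1f}", "--fy", f"{fy:.1f}",
             "--cx", f"{cx:.1f}", "--cy", f"{cy:.1f}"]
    print(" ".join(flags))
```

A calibrated camera matrix, where available, is preferable to this approximation.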

5 Run — Python 3D frontend

Full 3D mesh rendering identical to demo_webcam.py ([orig | 2D skeleton | front mesh | side mesh]). Uses the C engine for detection + backbone + decoder + MHR FFN, then the Python body model for LBS skinning. Requires the full Python environment (sam_3d_body package, pyrender, torch).

python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend-3D.py --from assets/teaser.png

# Webcam, limit to 3 persons
python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend-3D.py --from 0 --max-skeletons 3

# Custom checkpoint paths
python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend-3D.py \
    --from assets/teaser.png \
    --checkpoint ./checkpoints/sam-3d-body-dinov3/model.ckpt \
    --mhr-model  ./checkpoints/sam-3d-body-dinov3/assets/mhr_model.pt

# Save output
python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend-3D.py \
    --from assets/teaser.png --out out_3d.jpg

# Key options (same as lightweight frontend, plus):
#   --checkpoint PATH   model.ckpt path (default: checkpoints/sam-3d-body-dinov3/model.ckpt)
#   --mhr-model PATH    mhr_model.pt path
#   --device cuda|cpu   PyTorch device for body model

Distributing model files

To package all runtime models into a single zip for deployment on another machine:

# C++ models only (~3.5 GB)
bash scripts/create_redist.sh

# Include Python checkpoint too (~6 GB)
bash scripts/create_redist.sh --with-python

# Include LBS body_model.pt (~664 MB extra)
bash scripts/create_redist.sh --with-body

# Write to a specific directory
bash scripts/create_redist.sh --output /mnt/storage

Output: fast_sam_3dbody_models_YYYYMMDD.zip
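On the receiving machine, a quick integrity check of the archive can catch a truncated transfer before unpacking several gigabytes. This is a minimal sketch using Python's standard `zipfile` module; both helpers are hypothetical and rely only on the output file name pattern given above.

```python
import zipfile
from pathlib import Path


def newest_archive(directory="."):
    """Return the most recent model archive in `directory`, or None if there is none."""
    # The YYYYMMDD suffix sorts chronologically when sorted as text.
    archives = sorted(Path(directory).glob("fast_sam_3dbody_models_*.zip"))
    return archives[-1] if archives else None


def archive_is_intact(path):
    """Return True if every member of the zip passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as zf:
            return zf.testzip() is None
    except zipfile.BadZipFile:
        return False


if __name__ == "__main__":
    archive = newest_archive()
    if archive is None:
        print("No model archive found.")
    else:
        status = "OK" if archive_is_intact(archive) else "CORRUPT"
        print(f"{archive.name}: {status}")
```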


Real-World Deployment

For instructions on running the publisher, see docs/realworld_deployment.md.

We demonstrate a real-time, vision-only teleoperation system for the Unitree G1 humanoid robot using a single RGB camera, operating at ~65 ms end-to-end latency on an NVIDIA RTX 5090.
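For reference, the per-frame latency translates into a frame rate as follows. This sketch assumes frames are processed strictly one after another with no pipelining, which may not reflect how the deployed system schedules its stages.

```python
def frames_per_second(latency_ms):
    """Frame rate implied by a per-frame latency, assuming sequential processing."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    latency_ms = 65.0  # approximate end-to-end latency quoted above
    print(f"{latency_ms:.0f} ms per frame is about {frames_per_second(latency_ms):.1f} frames per second")
```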

Humanoid teleoperation. The system tracks diverse whole-body motions including upper-body gestures (a), body rotations (b-e), walking (f), wide stance (g), single-leg standing (h), squatting (i), and kneeling (j).

Humanoid policy rollout. The robot grasps a box on the table with both hands, squats down, and steps to the right. The policy achieves an 80% task success rate with 40 demonstrations collected via our system.

Single-View vs Multi-View. Multi-view fusion resolves depth ambiguities inherent in single-view reconstruction, producing more accurate SMPL body estimates.

Citation

@article{yang2026fastsam3dbody,
  title={Fast SAM 3D Body: Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery},
  author={Yang, Timing and He, Sicheng and Jing, Hongyi and Yang, Jiawei and Liu, Zhijian and Zou, Chuhang and Wang, Yue},
  journal={arXiv preprint arXiv:2603.15603},
  year={2026}
}

Acknowledgements

This project builds upon SAM 3D Body (3DB) and Multi-HMR (MHR). We thank the original authors for releasing their models and codebases, which served as the foundation for our acceleration framework.
