Tabletop Perception for Beginner Robot Kits

Live demo · Duke AIPI 540 · License: MIT · Python 3.10+ · ONNX Runtime Web

Live demo — point a webcam at a desk, watch a specialist and a generalist reason side-by-side. In-browser, no server, no API keys.

A fine-tuned MobileNetV3-small (2.5M frozen + 6K head) reaches 97% top-1 accuracy at ~15 ms/frame on a 6-class tabletop task (cell_phone, cup, headphone, laptop, scissors, stapler). On the same input, a 450M-parameter open-vocabulary VLM (LFM2.5-VL-450M) is roughly 85× slower per detection (1.3 s/query vs. 15 ms/frame) and collapses to 0% recall@IoU≥0.3 on stylized, out-of-distribution synthetic scenes. On natural photos, however, it correctly refuses queries for objects that are absent, a property the closed-set specialist cannot offer. The production shape is both running at once: the specialist on the camera stream and the generalist on typed queries, with the open-vocab tier extending coverage where the closed-set head cannot reach.

Choosing between them is a deployment decision, not a benchmark one.
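
For readers unfamiliar with the recall@IoU≥0.3 metric quoted above, here is a minimal sketch of how such a number could be computed. It assumes one ground-truth box and at most one predicted box per query, with boxes given as `(x_min, y_min, x_max, y_max)`. The function names `iou` and `recall_at_iou` are illustrative and are not taken from this repository's evaluation code.

```python
from typing import List, Optional, Tuple

# Assumed box format (hypothetical, not from the repo): (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(
    truths: List[Box],
    preds: List[Optional[Box]],
    threshold: float = 0.3,
) -> float:
    """Fraction of ground-truth boxes whose paired prediction has IoU >= threshold.

    A prediction of None means the model returned no box for that query.
    """
    if not truths:
        return 0.0
    hits = sum(
        1
        for truth, pred in zip(truths, preds)
        if pred is not None and iou(truth, pred) >= threshold
    )
    return hits / len(truths)
```

Under this definition, 0% recall means that no predicted box overlapped its target by at least 30%, whether because the boxes were misplaced or because no box was returned at all.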

Results

| Tier | Model | Params | Latency | Top-1 |
|---|---|---|---|---|
| Naive | HSV threshold | 0 | ~1 ms | 24% |
| Classical | Color-hist + HOG + GBM | ~2.4K trees | ~40 ms | 76% |
| DL | MobileNetV3-small fine-tune | 2.5M frozen + 6K head | ~15 ms | 97% |
| VLM | LFM2.5-VL-450M zero-shot | 450M | ~1300 ms | open-vocab |

Per-class F1, latency breakdown, and a five-case error analysis: report/report.md.
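
To put the latency column in deployment terms, the sketch below converts each approximate latency from the results above into frames per second and a slowdown factor relative to the DL tier. It assumes strictly serial processing (one input at a time, no batching or pipelining), which may not match how the demo schedules work. The helper names are illustrative.

```python
# Approximate per-inference latencies in milliseconds, copied from the results.
LATENCY_MS = {
    "Naive (HSV threshold)": 1.0,
    "Classical (color-hist + HOG + GBM)": 40.0,
    "DL (MobileNetV3-small fine-tune)": 15.0,
    "VLM (LFM2.5-VL-450M zero-shot)": 1300.0,
}


def throughput_fps(latency_ms: float) -> float:
    """Inferences per second, assuming one inference at a time."""
    return 1000.0 / latency_ms


def slowdown(latency_ms: float, baseline_ms: float) -> float:
    """How many times slower a tier is than the baseline tier."""
    return latency_ms / baseline_ms


baseline = LATENCY_MS["DL (MobileNetV3-small fine-tune)"]
for name, ms in LATENCY_MS.items():
    print(
        f"{name:<36} {ms:>7.0f} ms "
        f"{throughput_fps(ms):>7.1f} fps "
        f"{slowdown(ms, baseline):>6.1f}x vs DL"
    )
```

With these rounded inputs the VLM comes out at about 87× the DL latency, in line with the ~85× figure quoted earlier, and at well under one query per second. That is why it suits typed queries and not a per-frame camera stream.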

Reproduce

git clone https://github.com/jonasneves/aipi540-tabletop-perception
cd aipi540-tabletop-perception
pip install -r requirements.txt
make dataset   # downloads + stages Caltech-101, ~2 min
make eval      # runs all three models + exports ONNX
make serve     # local demo on :8088

Structure

.
├── README.md
├── SCOPE.md
├── requirements.txt
├── Makefile               # dataset | eval | sync | serve | deploy
├── scripts/
│   ├── make_dataset.py    # Caltech-101 download + 6-class filter
│   ├── naive.py           # HSV dominant-hue baseline
│   ├── classical.py       # color-hist + HOG + GBM
│   └── train_dl.py        # MobileNetV3-small fine-tune + ONNX export
├── models/                # ONNX + pickle artifacts
├── data/
│   ├── raw/
│   └── processed/
├── results/               # scores.json, plots
├── report/                # written report + figures
├── public/                # static site: ONNX Runtime Web + WebGPU
└── docs -> public         # GH Pages serves main/docs → public
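
As an illustration of what an "HSV dominant-hue baseline" like the one listed for `scripts/naive.py` might involve, here is a standard-library-only sketch. It is not the repository's implementation: the bin count, the saturation and value cutoffs, and the function names `dominant_hue_bin` and `fit_hue_lookup` are all assumptions made for this example.

```python
import colorsys
from collections import Counter
from typing import Dict, Iterable, List, Tuple

RGB = Tuple[int, int, int]  # 8-bit channels, 0-255


def dominant_hue_bin(
    pixels: Iterable[RGB],
    bins: int = 12,          # assumed hue resolution
    min_sat: float = 0.2,    # assumed cutoff: ignore near-gray pixels
    min_val: float = 0.2,    # assumed cutoff: ignore near-black pixels
) -> int:
    """Index of the most common hue bin, or -1 if no pixel is colorful enough."""
    counts: Counter = Counter()
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        if s >= min_sat and v >= min_val:
            counts[min(int(h * bins), bins - 1)] += 1
    return counts.most_common(1)[0][0] if counts else -1


def fit_hue_lookup(
    samples: List[Tuple[List[RGB], str]], bins: int = 12
) -> Dict[int, str]:
    """Map each dominant-hue bin to the majority label among training images."""
    votes: Dict[int, Counter] = {}
    for pixels, label in samples:
        votes.setdefault(dominant_hue_bin(pixels, bins), Counter())[label] += 1
    return {hue_bin: c.most_common(1)[0][0] for hue_bin, c in votes.items()}
```

At prediction time, a baseline like this would compute the dominant hue bin of a new image and look it up in the fitted table. Hue alone says little about object shape, so low accuracy is expected from this tier.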

Team

Jonas Neves · Duke University · AIPI 540 · Spring 2026
