Ultrafast sitting classification. A 32x24-pixel input is sufficient for estimating the state of the whole human body.
| Variant | Size | F1 | CPU inference latency | ONNX |
|---|---|---|---|---|
| P | 115 KB | 0.8923 | 0.13 ms | Download |
| N | 176 KB | 0.9076 | 0.24 ms | Download |
| T | 279 KB | 0.8935 | 0.31 ms | Download |
| S | 494 KB | 0.9168 | 0.39 ms | Download |
| C | 875 KB | 0.9265 | 0.47 ms | Download |
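The latencies above can be reproduced with plain ONNX Runtime on CPU. The following is a minimal sketch, assuming one of the downloaded variants (`sc_c_32x24.onnx` is used here as an example) is in the working directory and takes a float32 image tensor; the input name and shape are read from the graph rather than hard-coded.

```python
# Minimal sketch: rough CPU latency measurement for one ONNX variant.
# Assumes the file name below matches a downloaded model and a float32 input.
import time

import numpy as np
import onnxruntime as ort

# Swap in "CUDAExecutionProvider" / "TensorrtExecutionProvider" for GPU runs.
session = ort.InferenceSession("sc_c_32x24.onnx", providers=["CPUExecutionProvider"])
inp = session.get_inputs()[0]

# Replace any dynamic dimensions (e.g. the batch axis) with 1.
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
dummy = np.random.rand(*shape).astype(np.float32)

# Warm up, then time repeated runs.
for _ in range(10):
    session.run(None, {inp.name: dummy})
runs = 1000
start = time.perf_counter()
for _ in range(runs):
    session.run(None, {inp.name: dummy})
print(f"mean latency: {(time.perf_counter() - start) / runs * 1000:.2f} ms")
```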
git clone https://github.com/PINTO0309/SC.git && cd SC
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
source .venv/bin/activate

uv run python demo_sc.py \
-pm sc_c_32x24.onnx \
-v 0 \
-ep cuda \
-dlr -dnm -dgm -dhm -dhd
uv run python demo_sc.py \
-pm sc_c_32x24.onnx \
-v 0 \
-ep tensorrt \
-dlr -dnm -dgm -dhm -dhd

- AVA Actions Download (v2.2) - CC BY 4.0 License: https://research.google.com/ava/download.html
uv run python 01_download_ava_videos.py

uv run python 03_renumber_files.py
uv run python 04_data_prep.py
uv run python 04_data_prep.py \
--timestamp-stride 32 \
--image-dir ./data/images \
--annotation-file ./data/annotation.txt \
--histogram-file ./data/image_size_hist.png \
--class-ratio-file ./data/class_ratio.png \
--balance-classes \
--dry-run
uv run python 04_data_prep.py \
--timestamp-stride 1 \
--image-dir ./data/images \
--annotation-file ./data/annotation.txt \
--histogram-file ./data/image_size_hist.png \
--class-ratio-file ./data/class_ratio.png \
--balance-classes
python 04_data_prep.py \
--timestamp-stride 1 \
--image-dir ./data/images \
--annotation-file ./data/annotation.txt \
--histogram-file ./data/image_size_hist.png \
--class-ratio-file ./data/class_ratio.png \
--balance-classes \
--resume-existing \
--start-index 198727
uv run python 05_make_parquet.py \
--embed-images \
--overwrite

uv run python 06_merge_parquet.py dataset1.parquet dataset2.parquet \
--output dataset.parquet \
--overwrite
- Use the images located under `dataset/output/002_xxxx_front_yyyyyy` together with their annotations in `dataset/output/002_xxxx_front.csv`.
- Every augmented image that originates from the same `still_image` stays in the same split to prevent leakage.
- The training loop relies on `BCEWithLogitsLoss` plus a class-balanced `pos_weight` to stabilise optimisation under class imbalance; inference produces sigmoid probabilities (see the sketch after this list). Use `--train_resampling weighted` to switch on the previous `WeightedRandomSampler` behaviour, or `--train_resampling balanced` to physically duplicate minority classes before shuffling.
- Training history, validation metrics, optional test predictions, checkpoints, configuration JSON, and ONNX exports are produced automatically.
- Per-epoch checkpoints named like `sc_epoch_0001.pt` are retained (latest 10), as well as the best checkpoints named like `sc_best_epoch0004_f1_0.9321.pt` (also latest 10).
- The backbone can be switched with `--arch_variant`. Supported combinations with `--head_variant` are:

  | `--arch_variant` | Default (`--head_variant auto`) | Explicitly selectable heads | Remarks |
  |---|---|---|---|
  | `baseline` | `avg` | `avg`, `avgmax_mlp` | When using `transformer`/`mlp_mixer`, adjust the height and width of the feature map so that they are divisible by `--token_mixer_grid` (otherwise an exception occurs during ONNX conversion or inference). |
  | `inverted_se` | `avgmax_mlp` | `avg`, `avgmax_mlp` | When using `transformer`/`mlp_mixer`, `--token_mixer_grid` must be adjusted as above. |
  | `convnext` | `transformer` | `avg`, `avgmax_mlp`, `transformer`, `mlp_mixer` | For both token-mixer heads, the feature map must be divisible by the grid (the default `3x2` fits a 30x48 input). |

- The classification head is selected with `--head_variant` (`avg`, `avgmax_mlp`, `transformer`, `mlp_mixer`, or `auto`, which derives a sensible default from the backbone).
- Pass `--rgb_to_yuv_to_y` to convert RGB crops to YUV, keep only the Y (luma) channel inside the network, and train a single-channel stem without modifying the dataloader.
- Alternatively, use `--rgb_to_lab` or `--rgb_to_luv` to convert inputs to CIE Lab/Luv (3-channel) before the stem; these options are mutually exclusive with each other and with `--rgb_to_yuv_to_y`.
- Mixed precision can be enabled with `--use_amp` when CUDA is available.
- Resume training with `--resume path/to/sc_epoch_XXXX.pt`; all optimiser/scheduler/AMP states and history are restored.
- Loss/accuracy/F1 metrics are logged to TensorBoard under `output_dir`, and `tqdm` progress bars expose per-epoch progress for the train/val/test loops.
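As a rough illustration of the loss setup described above, the sketch below shows how a class-balanced `pos_weight` can be derived from label counts and handed to `BCEWithLogitsLoss`. The counts and tensor shapes are placeholders, not the repository's actual values.

```python
# Minimal sketch (illustrative values): class-balanced pos_weight for BCEWithLogitsLoss.
import torch
import torch.nn as nn

# Hypothetical label counts for the binary "sitting" target.
num_positive = 4_000   # sitting crops
num_negative = 12_000  # non-sitting crops

# Up-weight the positive class in proportion to the imbalance.
pos_weight = torch.tensor([num_negative / num_positive])  # -> 3.0
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8, 1)                     # raw model outputs
targets = torch.randint(0, 2, (8, 1)).float()  # 0/1 labels
loss = criterion(logits, targets)

# At inference time the logits are squashed into probabilities.
probs = torch.sigmoid(logits)
```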
Baseline depthwise-separable CNN:
SIZE=32x24
uv run python -m sc train \
--data_root data/dataset.parquet \
--output_dir runs/sc_${SIZE} \
--epochs 100 \
--batch_size 256 \
--train_ratio 0.9 \
--val_ratio 0.1 \
--train_resampling balanced \
--image_size ${SIZE} \
--base_channels 32 \
--num_blocks 4 \
--arch_variant baseline \
--seed 42 \
--device auto \
--use_amp

Inverted residual + SE variant (recommended for higher capacity):
SIZE=32x24
uv run python -m sc train \
--data_root data/dataset.parquet \
--output_dir runs/sc_is_s_${SIZE} \
--epochs 100 \
--batch_size 256 \
--train_ratio 0.9 \
--val_ratio 0.1 \
--train_resampling balanced \
--image_size ${SIZE} \
--base_channels 32 \
--num_blocks 4 \
--arch_variant inverted_se \
--head_variant avgmax_mlp \
--seed 42 \
--device auto \
--use_amp

ConvNeXt-style backbone with transformer head over pooled tokens:
SIZE=32x24
uv run python -m sc train \
--data_root data/dataset.parquet \
--output_dir runs/sc_convnext_${SIZE} \
--epochs 100 \
--batch_size 256 \
--train_ratio 0.9 \
--val_ratio 0.1 \
--train_resampling balanced \
--image_size ${SIZE} \
--base_channels 32 \
--num_blocks 4 \
--arch_variant convnext \
--head_variant transformer \
--token_mixer_grid 2x2 \
--seed 42 \
--device auto \
--use_amp

- Outputs include the latest 10 `sc_epoch_*.pt`, the latest 10 `sc_best_epochXXXX_f1_YYYY.pt` (highest validation F1, or training F1 when there is no validation split), `history.json`, `summary.json`, an optional `test_predictions.csv`, and `train.log`.
- After every epoch a confusion matrix and ROC curve are saved under `runs/sc/diagnostics/<split>/confusion_<split>_epochXXXX.png` and `roc_<split>_epochXXXX.png`.
- `--image_size` accepts either a single integer for square crops (e.g. `--image_size 48`) or `HEIGHTxWIDTH` to resize non-square frames (e.g. `--image_size 64x48`); see the parsing sketch after this list.
- Add `--resume <checkpoint>` to continue from an earlier epoch. Remember that `--epochs` indicates the desired total epoch count (e.g. resuming with `--epochs 40` after training to epoch 30 will run 10 additional epochs).
- Launch TensorBoard with:

tensorboard --logdir runs/sc
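A minimal sketch of how a `HEIGHTxWIDTH`-or-integer flag like `--image_size` could be interpreted; this illustrates the accepted syntax and is not the repository's actual parser.

```python
# Minimal sketch: parse "--image_size" values such as "48" or "64x48" (HEIGHTxWIDTH).
def parse_image_size(value: str) -> tuple[int, int]:
    """Return (height, width); a single integer means a square crop."""
    if "x" in value.lower():
        height, width = (int(part) for part in value.lower().split("x"))
        return height, width
    side = int(value)
    return side, side

assert parse_image_size("48") == (48, 48)
assert parse_image_size("64x48") == (64, 48)
```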
uv run python -m sc exportonnx \
--checkpoint runs/sc_is_s_32x24/sc_best_epoch0049_f1_0.9939.pt \
--output sc_s.onnx \
--opset 17

- The saved graph exposes `images` as input and `prob_pointing` as output (the batch dimension is dynamic); probabilities can be consumed directly.
- After exporting, the tool runs `onnxsim` for simplification and rewrites any remaining BatchNormalization nodes into affine `Mul`/`Add` primitives (see the sketch below). If simplification fails, a warning is emitted and the unsimplified model is preserved.
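The BatchNormalization rewrite mentioned above is the standard affine fold: at inference time BN reduces to a per-channel scale and shift. Below is a minimal numpy sketch of that arithmetic with made-up per-channel parameters, independent of the export tool itself.

```python
# Minimal sketch: fold inference-time BatchNormalization into per-channel Mul/Add.
import numpy as np

def fold_batchnorm(gamma, beta, mean, var, eps=1e-5):
    """BN(x) = gamma * (x - mean) / sqrt(var + eps) + beta  ==  scale * x + bias."""
    scale = gamma / np.sqrt(var + eps)
    bias = beta - mean * scale
    return scale, bias

# Per-channel parameters of a hypothetical 3-channel BN node.
gamma = np.array([1.0, 0.5, 2.0])
beta = np.array([0.1, 0.0, -0.2])
mean = np.array([0.3, 0.1, 0.7])
var = np.array([1.2, 0.9, 0.4])

scale, bias = fold_batchnorm(gamma, beta, mean, var)
x = np.random.rand(4, 3)  # (batch, channels)
bn_out = gamma * (x - mean) / np.sqrt(var + 1e-5) + beta
assert np.allclose(bn_out, scale * x + bias)
```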
- VSDLM: Visual-only speech detection driven by lip movements - MIT License
- OCEC: Open closed eyes classification. Ultra-fast wink and blink estimation model - MIT License
- PGC: Ultrafast pointing gesture classification - MIT License
- SC: Ultrafast sitting classification - MIT License
- PUC: Phone Usage Classifier is a three-class image classification pipeline for understanding how people interact with smartphones - MIT License
- HSC: Happy smile classifier - MIT License
If you find this project useful, please consider citing:
@software{hyodo2025sc,
author = {Katsuya Hyodo},
title = {PINTO0309/SC},
month = {11},
year = {2025},
publisher = {Zenodo},
doi = {10.5281/zenodo.17625710},
url = {https://github.com/PINTO0309/sc},
abstract = {Ultrafast sitting classification.},
}

- AVA Actions Download (v2.2) - CC BY 4.0 License
- https://github.com/PINTO0309/PINTO_model_zoo/tree/main/472_DEIMv2-Wholebody34: Apache 2.0 License
@software{DEIMv2-Wholebody34, author={Katsuya Hyodo}, title={Lightweight human detection models generated on high-quality human data sets. It can detect objects with high accuracy and speed in a total of 34 classes: body, adult, child, male, female, body_with_wheelchair, body_with_crutches, head, front, right-front, right-side, right-back, back, left-back, left-side, left-front, face, eye, nose, mouth, ear, collarbone, shoulder, solar_plexus, elbow, wrist, hand, hand_left, hand_right, abdomen, hip_joint, knee, ankle, foot.}, url={https://github.com/PINTO0309/PINTO_model_zoo/tree/main/472_DEIMv2-Wholebody34}, year={2025}, month={10}, doi={10.5281/zenodo.17625710} }
- https://github.com/PINTO0309/bbalg: MIT License