Happy smile classifier. Estimation is performed on the entire head crop (48x48 pixels) rather than on the face alone.
| Variant | Size | F1 | CPU inference latency | ONNX |
|---|---|---|---|---|
| P | 115 KB | 0.4841 | 0.23 ms | Download |
| N | 176 KB | 0.5849 | 0.41 ms | Download |
| T | 280 KB | 0.6701 | 0.52 ms | Download |
| S | 495 KB | 0.7394 | 0.64 ms | Download |
| C | 876 KB | 0.7344 | 0.69 ms | Download |
| M | 1.7 MB | 0.8144 | 0.85 ms | Download |
| L | 6.4 MB | 0.8293 | 1.03 ms | Download |
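The latencies above refer to CPU inference of the exported ONNX models. A minimal sketch for reproducing a comparable measurement with onnxruntime is shown below; the model file name, the 1x3x48x48 input layout, and the warm-up/iteration counts are assumptions for illustration, not the exact benchmark setup.

```python
# Minimal CPU latency check for an exported HSC model (model file name,
# input layout, and iteration counts are illustrative assumptions).
import time

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("hsc_s_48x48.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 48, 48).astype(np.float32)

for _ in range(10):  # warm-up runs
    session.run(None, {input_name: dummy})

iterations = 1000
start = time.perf_counter()
for _ in range(iterations):
    session.run(None, {input_name: dummy})
print(f"mean latency: {(time.perf_counter() - start) / iterations * 1000:.2f} ms")
```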
```bash
git clone https://github.com/PINTO0309/HSC.git && cd HSC
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
source .venv/bin/activate
```

```bash
uv run python demo_hsc.py \
  -hm hsc_l_48x48.onnx \
  -v 0 \
  -ep cuda \
  -dlr -dnm -dgm -dhm -dhd
```

```bash
uv run python demo_hsc.py \
  -hm hsc_l_48x48.onnx \
  -v 0 \
  -ep tensorrt \
  -dlr -dnm -dgm -dhm -dhd
```

```bash
uv run python 01_build_smile_parquet.py
```
- Use the images located under `dataset/output/002_xxxx_front_yyyyyy` together with their annotations in `dataset/output/002_xxxx_front.csv`.
- Every augmented image that originates from the same `still_image` stays in the same split to prevent leakage.
- The training loop relies on `BCEWithLogitsLoss` plus a class-balanced `pos_weight` to stabilise optimisation under class imbalance; inference produces sigmoid probabilities. Use `--train_resampling weighted` to switch on the previous `WeightedRandomSampler` behaviour, or `--train_resampling balanced` to physically duplicate minority classes before shuffling (see the loss sketch after this list).
- Training history, validation metrics, optional test predictions, checkpoints, configuration JSON, and ONNX exports are produced automatically.
- Per-epoch checkpoints named like `hsc_epoch_0001.pt` are retained (latest 10), as well as the best checkpoints named like `hsc_best_epoch0004_f1_0.9321.pt` (also latest 10).
- The backbone can be switched with `--arch_variant`. Supported combinations with `--head_variant` are:

  | `--arch_variant` | Default (`--head_variant auto`) | Explicitly selectable heads | Remarks |
  |---|---|---|---|
  | `baseline` | `avg` | `avg`, `avgmax_mlp` | When using `transformer`/`mlp_mixer`, the height and width of the feature map must be divisible by `--token_mixer_grid` (otherwise an exception occurs during ONNX conversion or inference). |
  | `inverted_se` | `avgmax_mlp` | `avg`, `avgmax_mlp` | When using `transformer`/`mlp_mixer`, `--token_mixer_grid` must be adjusted as above. |
  | `convnext` | `transformer` | `avg`, `avgmax_mlp`, `transformer`, `mlp_mixer` | For both heads, the feature map must be divisible by the grid (the default `3x2` fits a 30x48 input). |
- The classification head is selected with `--head_variant` (`avg`, `avgmax_mlp`, `transformer`, `mlp_mixer`, or `auto`, which derives a sensible default from the backbone).
- Pass `--rgb_to_yuv_to_y` to convert RGB crops to YUV, keep only the Y (luma) channel inside the network, and train a single-channel stem without modifying the dataloader (see the luma-conversion sketch after this list).
- Alternatively, use `--rgb_to_lab` or `--rgb_to_luv` to convert inputs to CIE Lab/Luv (3-channel) before the stem; these options are mutually exclusive with each other and with `--rgb_to_yuv_to_y`.
- Mixed precision can be enabled with `--use_amp` when CUDA is available.
- Resume training with `--resume path/to/hsc_epoch_XXXX.pt`; all optimiser/scheduler/AMP states and history are restored.
- Loss/accuracy/F1 metrics are logged to TensorBoard under `output_dir`, and `tqdm` progress bars expose per-epoch progress for the train/val/test loops.
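As a concrete illustration of the class-balanced loss mentioned above, the sketch below derives `pos_weight` from label counts and plugs it into `BCEWithLogitsLoss`. The variable names and the count computation are assumptions for illustration, not the repository's exact implementation.

```python
# Sketch: class-balanced pos_weight for BCEWithLogitsLoss.
# Assumption: `labels` is a 1-D tensor of 0/1 smile labels for the training split.
import torch
import torch.nn as nn

labels = torch.tensor([0, 0, 0, 1, 1, 0, 1, 0], dtype=torch.float32)

num_pos = labels.sum()
num_neg = labels.numel() - num_pos
# Up-weight the positive (smiling) class in proportion to the imbalance.
pos_weight = num_neg / num_pos.clamp(min=1)

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(labels.shape[0])  # raw model outputs (one logit per image)
loss = criterion(logits, labels)       # training consumes logits directly
probs = torch.sigmoid(logits)          # inference produces sigmoid probabilities
```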
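The `--rgb_to_yuv_to_y` option keeps only the luma channel. The sketch below shows the underlying conversion; the BT.601 coefficients are standard, but treating it as a fixed in-network tensor op is an assumption about how the flag is wired up internally.

```python
# Sketch: collapse an RGB batch to a single luma (Y) channel inside the network,
# so the dataloader can keep feeding ordinary 3-channel RGB crops.
import torch

def rgb_to_luma(x: torch.Tensor) -> torch.Tensor:
    """x: (N, 3, H, W) RGB in [0, 1] -> (N, 1, H, W) luma using BT.601 weights."""
    weights = torch.tensor([0.299, 0.587, 0.114], device=x.device, dtype=x.dtype)
    return (x * weights.view(1, 3, 1, 1)).sum(dim=1, keepdim=True)

batch = torch.rand(4, 3, 48, 48)  # RGB crops straight from the dataloader
luma = rgb_to_luma(batch)         # (4, 1, 48, 48) feeds the single-channel stem
```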
Baseline depthwise-separable CNN:
```bash
SIZE=48x48
uv run python -m hsc train \
  --data_root data/dataset.parquet \
  --output_dir runs/hsc_${SIZE} \
  --epochs 100 \
  --batch_size 256 \
  --train_resampling balanced \
  --image_size ${SIZE} \
  --base_channels 32 \
  --num_blocks 4 \
  --arch_variant baseline \
  --seed 42 \
  --device auto \
  --use_amp
```

Inverted residual + SE variant (recommended for higher capacity):
```bash
SIZE=48x48
VAR=s
uv run python -m hsc train \
  --data_root data/dataset.parquet \
  --output_dir runs/hsc_is_${VAR}_${SIZE} \
  --epochs 100 \
  --batch_size 256 \
  --train_resampling balanced \
  --image_size ${SIZE} \
  --base_channels 32 \
  --num_blocks 4 \
  --arch_variant inverted_se \
  --head_variant avgmax_mlp \
  --seed 42 \
  --device auto \
  --use_amp
```

ConvNeXt-style backbone with transformer head over pooled tokens:
```bash
SIZE=48x48
uv run python -m hsc train \
  --data_root data/dataset.parquet \
  --output_dir runs/hsc_convnext_${SIZE} \
  --epochs 100 \
  --batch_size 256 \
  --train_resampling balanced \
  --image_size ${SIZE} \
  --base_channels 32 \
  --num_blocks 4 \
  --arch_variant convnext \
  --head_variant transformer \
  --token_mixer_grid 3x3 \
  --seed 42 \
  --device auto \
  --use_amp
```

- Outputs include the latest 10 `hsc_epoch_*.pt`, the latest 10 `hsc_best_epochXXXX_f1_YYYY.pt` (highest validation F1, or training F1 when there is no validation split), `history.json`, `summary.json`, an optional `test_predictions.csv`, and `train.log`.
- After every epoch a confusion matrix and ROC curve are saved under `runs/hsc/diagnostics/<split>/confusion_<split>_epochXXXX.png` and `roc_<split>_epochXXXX.png`.
- `--image_size` accepts either a single integer for square crops (e.g. `--image_size 48`) or `HEIGHTxWIDTH` to resize non-square frames (e.g. `--image_size 64x48`).
- Add `--resume <checkpoint>` to continue from an earlier epoch. Remember that `--epochs` indicates the desired total epoch count (e.g. resuming with `--epochs 40` after training to epoch 30 will run 10 additional epochs).
- Launch TensorBoard with:

```bash
tensorboard --logdir runs/hsc
```
```bash
uv run python -m hsc exportonnx \
  --checkpoint runs/hsc_is_s_48x48/hsc_best_epoch0049_f1_0.9939.pt \
  --output hsc_s_48x48.onnx \
  --opset 17
```

- The saved graph exposes `images` as its input and `prob_smiling` as its output (the batch dimension is dynamic); the probabilities can be consumed directly.
- After exporting, the tool runs `onnxsim` for simplification and rewrites any remaining BatchNormalization nodes into affine `Mul`/`Add` primitives. If simplification fails, a warning is emitted and the unsimplified model is preserved.
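A quick way to confirm the exported graph's I/O names and the dynamic batch dimension is to load it with onnxruntime, as in the sketch below. The model path and the 3-channel 48x48 input are assumptions; adjust them to match your export.

```python
# Sketch: verify the exported graph exposes `images` -> `prob_smiling` and
# accepts an arbitrary batch size. Model path and input layout are assumptions.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("hsc_s_48x48.onnx", providers=["CPUExecutionProvider"])
print([i.name for i in session.get_inputs()])   # expected: ['images']
print([o.name for o in session.get_outputs()])  # expected: ['prob_smiling']

# Any batch size should work because the batch dimension is dynamic.
batch = np.random.rand(5, 3, 48, 48).astype(np.float32)
(probs,) = session.run(["prob_smiling"], {"images": batch})
print(probs.shape)  # one sigmoid probability per image
```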
- VSDLM: Visual-only speech detection driven by lip movements - MIT License
- OCEC: Open closed eyes classification. Ultra-fast wink and blink estimation model - MIT License
- PGC: Ultrafast pointing gesture classification - MIT License
- SC: Ultrafast sitting classification - MIT License
- PUC: Phone Usage Classifier is a three-class image classification pipeline for understanding how people interact with smartphones - MIT License
- HSC: Happy smile classifier - MIT License
If you find this project useful, please consider citing:
```bibtex
@software{hyodo2025hsc,
  author    = {Katsuya Hyodo},
  title     = {PINTO0309/HSC},
  month     = {11},
  year      = {2025},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.17670546},
  url       = {https://github.com/PINTO0309/hsc},
  abstract  = {Happy smile classifier.},
}
```

- https://github.com/microsoft/FERPlus: MIT License
```bibtex
@inproceedings{BarsoumICMI2016,
  title     = {Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution},
  author    = {Barsoum, Emad and Zhang, Cha and Canton Ferrer, Cristian and Zhang, Zhengyou},
  booktitle = {ACM International Conference on Multimodal Interaction (ICMI)},
  year      = {2016}
}
```
- https://github.com/PINTO0309/PINTO_model_zoo/tree/main/472_DEIMv2-Wholebody34: Apache 2.0 License
```bibtex
@software{DEIMv2-Wholebody34,
  author = {Katsuya Hyodo},
  title  = {Lightweight human detection models generated on high-quality human data sets. It can detect objects with high accuracy and speed in a total of 28 classes: body, adult, child, male, female, body_with_wheelchair, body_with_crutches, head, front, right-front, right-side, right-back, back, left-back, left-side, left-front, face, eye, nose, mouth, ear, collarbone, shoulder, solar_plexus, elbow, wrist, hand, hand_left, hand_right, abdomen, hip_joint, knee, ankle, foot.},
  url    = {https://github.com/PINTO0309/PINTO_model_zoo/tree/main/472_DEIMv2-Wholebody34},
  year   = {2025},
  month  = {10},
  doi    = {10.5281/zenodo.17625710}
}
```