A computer vision and machine learning project for detecting and classifying hand gestures captured from a laptop camera. The project combines face detection, skin-color-based hand localization, dataset generation, and neural network classification to recognize hand-made letter signs.
This repository contains a university computer vision lab project originally developed in Google Colab and later organized for GitHub presentation.
The project explores a full pipeline for real-time hand gesture recognition:
- detect the face in the camera frame,
- use the detected face region to estimate a skin-color distribution,
- suppress the face region and search for the hand,
- capture and preprocess hand images,
- build datasets for selected letters,
- train MLP models to classify the gestures,
- run inference on live camera input.
The selected gesture classes in this project are the letters M, N, and W.
- Face detection using a Haar cascade on grayscale images
- Region of interest tracking for identifying relevant areas in the frame
- CamShift-based color tracking to model skin-color distribution
- Hand extraction and cropping from the video feed
- Dataset generation with different class-balance and variability settings
- MLP classification for recognizing hand gesture letters
- Live prediction on camera input
Automatic-Signal-Detector/
├── README.md
├── .gitignore
├── notebooks/
│ └── CompVision_Ilaria.ipynb
├── models/
│ ├── model1.json
│ ├── model1_weights.h5
│ ├── model2.json
│ ├── model2_weights.h5
│ ├── model3.json
│ └── model3_weights.h5
├── results/
│ ├── dataset1.txt
│ ├── dataset2.txt
│ └── dataset3.txt
- notebooks/CompVision_Ilaria.ipynb: main notebook containing the full project workflow
- results/dataset1.txt, results/dataset2.txt, results/dataset3.txt: dataset and experiment output logs
- models/: saved model architectures and trained weights
The first stage detects the face using a Haar cascade on a grayscale image. Converting to grayscale reduces each frame from three color channels to one, which makes detection cheaper than working directly on full-color frames.
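The face-detection stage could look roughly like the sketch below. It assumes the frontal-face cascade file that ships with the `opencv-python` package; the `scaleFactor` and `minNeighbors` values and the helper names `largest_face` and `detect_face` are illustrative and not taken from the notebook.

```python
def largest_face(faces):
    """Return the (x, y, w, h) box with the largest area, or None if empty."""
    boxes = [tuple(int(v) for v in box) for box in faces]
    if not boxes:
        return None
    return max(boxes, key=lambda b: b[2] * b[3])


def detect_face(frame_bgr):
    """Detect faces in a BGR frame and return the largest box, or None."""
    # OpenCV is imported here so that largest_face() stays usable without it.
    import cv2

    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # Assumption: the frontal-face cascade bundled with opencv-python.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    # Illustrative parameters, not the project's actual values.
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return largest_face(faces)
```

Keeping only the largest box is one simple way to ignore spurious small detections when a single user sits in front of the camera.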
After detecting the face, the project uses the face region as a reference area to estimate a skin-color distribution. This information is then used to search for other regions in the frame with similar characteristics.
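One common way to turn a reference region into a skin-color model is hue-histogram back-projection. The NumPy sketch below illustrates the idea on a synthetic hue channel; it assumes hue values in the OpenCV range 0 to 179, and the function names are hypothetical. OpenCV offers `cv2.calcHist` and `cv2.calcBackProject` for the same purpose, and the notebook may use those instead.

```python
import numpy as np


def hue_histogram(hue, roi, bins=180):
    """Hue histogram of the region roi = (x, y, w, h), scaled so its peak is 1."""
    x, y, w, h = roi
    patch = hue[y:y + h, x:x + w]
    hist = np.bincount(patch.ravel(), minlength=bins).astype(np.float64)
    peak = hist.max()
    return hist / peak if peak > 0 else hist


def back_project(hue, hist):
    """Replace each pixel by the histogram value of its hue (0.0 to 1.0)."""
    return hist[hue]


# Synthetic hue channel: background hue 100, "face" and "hand" patches hue 10.
hue = np.full((60, 80), 100, dtype=np.uint8)
hue[5:25, 10:30] = 10     # face region
hue[35:50, 50:70] = 10    # hand region
face_roi = (10, 5, 20, 20)

prob = back_project(hue, hue_histogram(hue, face_roi))
```

In the resulting map, pixels whose hue matches the face region score close to 1, while the background scores 0.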
The face region is excluded from the probability map so that the algorithm focuses on locating the hands instead of repeatedly identifying the face.
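A minimal sketch of this suppression step, assuming the probability map is a 2-D NumPy array and the face box is given as (x, y, w, h). The thresholded bounding box is a simplified stand-in for the CamShift search used in the project, and the threshold value is an arbitrary choice.

```python
import numpy as np


def suppress_region(prob, roi):
    """Return a copy of prob with the region roi = (x, y, w, h) set to zero."""
    x, y, w, h = roi
    out = prob.copy()
    out[y:y + h, x:x + w] = 0
    return out


def bounding_box(prob, threshold=0.5):
    """Bounding box (x, y, w, h) of pixels above threshold, or None."""
    ys, xs = np.nonzero(prob > threshold)
    if ys.size == 0:
        return None
    x0, y0 = int(xs.min()), int(ys.min())
    return (x0, y0, int(xs.max()) - x0 + 1, int(ys.max()) - y0 + 1)


prob = np.zeros((60, 80))
prob[5:25, 10:30] = 1.0     # face
prob[35:50, 50:70] = 0.9    # hand
face_roi = (10, 5, 20, 20)

hand_box = bounding_box(suppress_region(prob, face_roi))
```

Without the suppression, the bounding box would span both the face and the hand, which is why the face is zeroed out first.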
The system captures hand images at user-defined intervals and stores them in multiple sizes, including 16×16 and 224×224, for later processing and training.
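The capture logic can be sketched as a time check plus a resize to each target size. The example below uses a nearest-neighbour resize in NumPy purely for illustration; the notebook may well rely on `cv2.resize` with a different interpolation, and the function names here are hypothetical.

```python
import numpy as np


def should_capture(now_s, last_saved_s, interval_s):
    """True once at least interval_s seconds have passed since the last save."""
    return now_s - last_saved_s >= interval_s


def resize_nearest(image, size):
    """Nearest-neighbour resize of a 2-D or 3-D image to (size, size)."""
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return image[rows[:, None], cols]


# Hypothetical 48x64 grayscale hand crop.
crop = (np.arange(48 * 64) % 256).astype(np.uint8).reshape(48, 64)

# Store the same crop at both sizes mentioned above.
saved = {size: resize_nearest(crop, size) for size in (16, 224)}
```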
Three datasets were created to compare how class balance and variability affect model performance:
- Dataset 1: balanced classes with high variability
- Dataset 2: unbalanced classes (50 / 100 / 150 samples) with high variability
- Dataset 3: balanced classes where one class has low variability
Three MLP models were trained and evaluated on the datasets to compare their behavior under different data conditions.
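For readers unfamiliar with MLPs, the forward pass of a small one can be written in a few lines of NumPy. The layer sizes below (256 inputs from a flattened 16×16 image, 64 hidden units, 3 output classes) and the random, untrained weights are assumptions for illustration only; the actual architectures are stored in the `models/` JSON files.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def mlp_forward(x, params):
    """Forward pass: one ReLU hidden layer followed by a softmax output."""
    hidden = np.maximum(0.0, x @ params["W1"] + params["b1"])
    return softmax(hidden @ params["W2"] + params["b2"])


rng = np.random.default_rng(0)
# Hypothetical sizes: 256 inputs, 64 hidden units, 3 classes (M, N, W).
params = {
    "W1": rng.normal(0, 0.1, (256, 64)), "b1": np.zeros(64),
    "W2": rng.normal(0, 0.1, (64, 3)), "b2": np.zeros(3),
}

batch = rng.random((5, 256))         # five flattened 16x16 images
probs = mlp_forward(batch, params)   # one probability row per image
```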
| Dataset (Model 1) | Train/Test Split | Validation Loss | Validation Accuracy |
|---|---|---|---|
| Dataset 1 | 210 / 90 | 1.4553 | 0.6556 |
| Dataset 2 | 244 / 106 | 0.9063 | 0.8302 |
| Dataset 3 | 210 / 90 | 0.6691 | 0.8444 |
Observation: Model 1 performs best on Datasets 2 and 3 (accuracy 0.8302 and 0.8444) and clearly worse on Dataset 1 (0.6556).
| Dataset (Model 2) | Train/Test Split | Validation Loss | Validation Accuracy |
|---|---|---|---|
| Dataset 1 | 210 / 90 | 1.7095 | 0.7667 |
| Dataset 2 | 244 / 106 | 0.9224 | 0.8396 |
| Dataset 3 | 210 / 90 | 1.2521 | 0.7556 |
Observation: Model 2 performs best on Dataset 2 (accuracy 0.8396), possibly because the largest class dominates the unbalanced data.
| Dataset (Model 3) | Train/Test Split | Validation Loss | Validation Accuracy |
|---|---|---|---|
| Dataset 1 | 210 / 90 | 1.2044 | 0.7889 |
| Dataset 2 | 244 / 106 | 1.2693 | 0.7642 |
| Dataset 3 | 210 / 90 | 1.8945 | 0.7000 |
Observation: Model 3 performs best on Datasets 1 and 2 (accuracy 0.7889 and 0.7642) and worst on Dataset 3 (0.7000).
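To compare the three models side by side, the validation accuracies from the tables above can be collected in one structure. The short script below is only a convenience for reading the results; it assumes the three tables belong to Models 1, 2, and 3 in that order, as the observations indicate.

```python
# Validation accuracy copied from the three tables above.
accuracy = {
    "Model 1": {"Dataset 1": 0.6556, "Dataset 2": 0.8302, "Dataset 3": 0.8444},
    "Model 2": {"Dataset 1": 0.7667, "Dataset 2": 0.8396, "Dataset 3": 0.7556},
    "Model 3": {"Dataset 1": 0.7889, "Dataset 2": 0.7642, "Dataset 3": 0.7000},
}
datasets = ["Dataset 1", "Dataset 2", "Dataset 3"]

# Model with the highest validation accuracy on each dataset.
best_per_dataset = {
    d: max(accuracy, key=lambda m: accuracy[m][d]) for d in datasets
}

# Unweighted mean accuracy of each model across the three datasets.
mean_accuracy = {
    m: sum(scores.values()) / len(scores) for m, scores in accuracy.items()
}

for d in datasets:
    print(d, "->", best_per_dataset[d])
```

Each model comes out on top for exactly one dataset: Model 3 on Dataset 1, Model 2 on Dataset 2, and Model 1 on Dataset 3.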
For the live test phase, the project uses Model 1 for prediction. The system:
- detects the hand in the camera frame,
- generates a grayscale probability image,
- reshapes the processed image for model input,
- loads the trained model,
- predicts the performed letter,
- overlays the prediction on the video stream.
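The reshape and prediction steps above can be sketched as follows. The class order in `LETTERS`, the 16×16 input size, and the 0 to 1 scaling are assumptions, and `dummy_predict` is a stand-in for the trained model so that the example runs without TensorFlow.

```python
import numpy as np

LETTERS = ["M", "N", "W"]  # assumed class order; the notebook's may differ


def prepare_input(prob_image):
    """Flatten a grayscale probability image into a (1, n_pixels) float row."""
    x = np.asarray(prob_image, dtype=np.float32)
    if x.max() > 1.0:          # treat values above 1 as 8-bit and rescale
        x = x / 255.0
    return x.reshape(1, -1)


def predict_letter(predict_fn, prob_image):
    """Run predict_fn on the prepared input and map the argmax to a letter."""
    scores = np.asarray(predict_fn(prepare_input(prob_image)))
    index = int(np.argmax(scores, axis=1)[0])
    return LETTERS[index], float(scores[0, index])


def dummy_predict(x):
    """Stand-in for a trained model's predict method; always favours 'W'."""
    return np.tile([0.1, 0.2, 0.7], (x.shape[0], 1))


image = np.random.default_rng(0).integers(0, 256, (16, 16), dtype=np.uint8)
letter, score = predict_letter(dummy_predict, image)
```

With a Keras model, `predict_fn` would typically be the model's `predict` method, and the returned letter could be drawn on the frame with `cv2.putText`.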
- Python
- OpenCV
- NumPy
- Matplotlib
- TensorFlow / Keras
- Google Colab
This project was originally developed in Google Colab and includes Colab-specific components such as:
- camera capture through browser-side JavaScript,
- Google Drive mounting,
- Colab utility imports.
Because of this, the notebook is best understood as a documented academic project and prototype rather than a packaged, fully reproducible local application.
The full image dataset is stored externally on Google Drive rather than in this repository.
- The implementation is tightly coupled to the Google Colab environment.
- Only three gesture classes are considered: M, N, and W.
- The dataset is relatively small and tailored to the project experiment.
- The repository is focused on demonstrating the pipeline and results rather than production deployment.