Existing apps require you to frame each face of the cube in a grid and take a photo per side, following specific orientation requirements, in order to detect the state. This project instead detects the cube state from a video stream, with the user simply rotating the cube in front of a camera.
This project started as a CalHacks project, where we built a SwiftUI app using OpenCV's C++ API (chosen for its Swift interoperability). The cube detection was done with pure classical CV (see project here), using techniques including masking and thresholding, contour maps, connected components, Canny edge detection, and RDP polygonal approximation. Results can be seen here:
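As a rough illustration of that classical pipeline, here is a minimal Python/OpenCV sketch (not the original C++ code; thresholds, kernel sizes, and the area cutoff are illustrative):

```python
import cv2
import numpy as np

def find_sticker_quads(frame_bgr):
    """Threshold, find contours, and keep 4-sided polygons
    that could plausibly be cube stickers."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)

    # Canny edge detection, then a morphological close so edges form loops.
    edges = cv2.Canny(blurred, 50, 150)
    edges = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, np.ones((3, 3), np.uint8))

    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)

    quads = []
    for c in contours:
        # RDP polygonal approximation: simplify the contour, keep quads.
        approx = cv2.approxPolyDP(c, 0.04 * cv2.arcLength(c, True), True)
        if len(approx) == 4 and cv2.contourArea(approx) > 500:
            quads.append(approx)
    return quads
```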
We use DINO + LangSAM, passing the bounding-box output from DINO into LangSAM for segmentation, to produce a segmented cube (1-cube-segmentation); this is fed into a classical CV pipeline that detects the individual pieces (2-piece-detection), from which the final state is extracted (3-state-mapping).
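A condensed sketch of the first stage is below; it assumes the `lang_sam` package's `LangSAM.predict` interface, and the prompt string and file name are placeholders, not the repo's actual code:

```python
import numpy as np
from PIL import Image
from lang_sam import LangSAM  # assumes the lang-segment-anything package

# LangSAM pairs DINO box proposals with SAM for text-prompted segmentation.
model = LangSAM()

def segment_cube(frame: Image.Image) -> np.ndarray:
    """Stage 1 (1-cube-segmentation): return a binary mask of the cube."""
    masks, boxes, phrases, logits = model.predict(frame, "a rubik's cube")
    best = int(logits.argmax())  # keep the highest-confidence detection
    return masks[best].cpu().numpy()

frame = Image.open("frame.png").convert("RGB")
cube_mask = segment_cube(frame)
# The mask then feeds the classical piece-detection stage (2-piece-detection).
```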

We obtained reasonable results on most cube inputs; the main failure mode was hand placements that obscure the cube's corners, which causes line detection to fail.
An immediate next goal is to train an end-to-end model, bypassing the classical CV steps.
This was our Fall 2023 NMEP project for Machine Learning @ Berkeley.
