A full-stack hardware acceleration project for MNIST digit recognition, featuring a complete implementation of both a custom weight-stationary Tensor Processing Unit (TPU) and a 5-stage pipelined MIPS CPU.
Figure 1: High-Level System Architecture
- 16x16 Systolic Array: Custom weight-stationary architecture optimized for high-throughput General Matrix Multiplication (GEMM) operations.
- Quantization Aware Training (QAT): Custom training pipeline in PyTorch to optimize model weights for 8-bit hardware precision.
- Zero-Copy Dataflow: Row-major, batch-by-batch output format allows layer-to-layer inference without CPU intervention.
- Pipelined CPU: A complete 5-stage MIPS CPU that acts as the master controller for the TPU via Memory-Mapped IO (MMIO).
- Bit-Exact Verification: Comprehensive test suite that verifies hardware output matches the golden Python simulation bit-for-bit.
- Automated Toolchain: End-to-end compiler that automatically converts PyTorch models into the exact binary/hex files required by the hardware.
- Real-time Visualization: Web-based frontend that bridges handwriting input directly to the Verilog hardware simulation.
This system accelerates matrix multiplications for neural network inference. It spans the entire stack:
- ML: PyTorch model training and Quantization Aware Training (QAT) for 8-bit precision (a sketch follows this list).
- Architecture: A custom 16x16 Systolic Array TPU integrated with a complete 5-stage pipelined MIPS CPU implementation.
- Simulation: Verilog hardware simulation using Icarus Verilog.
- Frontend: A Next.js web demo for real-time interaction with the hardware simulator.
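To make the QAT step concrete, here is a minimal eager-mode PyTorch QAT sketch. The layer sizes, module names, and qconfig choice are illustrative assumptions; the actual training script in `ml/` may differ.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    QuantStub, DeQuantStub, get_default_qat_qconfig, prepare_qat, convert,
)

class MnistMlp(nn.Module):
    """Hypothetical MLP; the real model in ml/ may be shaped differently."""
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()        # records input scale/zero-point
        self.fc1 = nn.Linear(784, 16)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(16, 10)
        self.dequant = DeQuantStub()

    def forward(self, x):
        x = self.quant(x.flatten(1))
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = MnistMlp()
model.qconfig = get_default_qat_qconfig("fbgemm")  # int8 weights and activations
model.train()
prepare_qat(model, inplace=True)  # inserts fake-quant observers around each layer

# ... run the usual training loop here; forward passes see quantization noise ...

model.eval()
int8_model = convert(model)  # folds observers into real int8 weights
```

Training through the fake-quantization noise is what lets the exported 8-bit weights match full-precision accuracy closely on the hardware.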
By offloading matrix multiplications to the weight-stationary systolic array, we achieve up to a 7,000x speedup in GEMM calculations compared to a standard scalar CPU implementation. The TPU performs 256 multiplications per cycle (one per PE in the 16x16 array), drastically reducing inference latency.
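As a simplified picture of what the array computes, here is a golden-model sketch in NumPy. It reproduces the int8 arithmetic of one weight tile, not the cycle-accurate RTL, and the function and variable names are illustrative:

```python
import numpy as np

def weight_stationary_gemm(weights, activations):
    """Golden-model GEMM for one weight tile.

    weights:     (16, 16) int8 tile, held fixed in the PEs
    activations: (batch, 16) int8 rows streamed through the array
    returns:     (batch, 16) int32 accumulators, one per output column
    """
    acc = np.zeros((activations.shape[0], weights.shape[1]), dtype=np.int32)
    for b, row in enumerate(activations):
        # In hardware all 256 PEs fire in parallel each cycle; here we only
        # reproduce the same int8 x int8 -> int32 accumulation.
        acc[b] = row.astype(np.int32) @ weights.astype(np.int32)
    return acc

# Example: one random tile, a batch of 8 input rows.
rng = np.random.default_rng(0)
w = rng.integers(-128, 128, size=(16, 16), dtype=np.int8)
x = rng.integers(-128, 128, size=(8, 16), dtype=np.int8)
assert np.array_equal(weight_stationary_gemm(w, x),
                      x.astype(np.int32) @ w.astype(np.int32))
```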
Performance scales significantly with batch size. Because our architecture is weight-stationary, loading weights into the systolic array is an expensive operation that only needs to happen once for a given set of inputs. Larger batch sizes allow us to amortize this setup cost across more compute cycles, keeping the processing elements (PEs) active for longer periods and maximizing utilization.
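To see why batching helps, consider a back-of-the-envelope cycle model. The constants below are illustrative assumptions, not measurements from the RTL:

```python
# Utilization model for a weight-stationary 16x16 tile.
N = 16

def tile_cycles(batch, weight_load_cycles=N):
    # Weights shift in once (~N cycles), each input row then streams through,
    # and draining the pipeline costs roughly 2N more cycles.
    return weight_load_cycles + batch + 2 * N

for batch in (1, 16, 256):
    useful_macs = batch * N * N              # MACs that do real work
    peak_macs = tile_cycles(batch) * N * N   # MACs if every PE fired every cycle
    print(f"batch={batch:4d}  PE utilization={useful_macs / peak_macs:.0%}")
```

Under these assumptions, utilization climbs from roughly 2% at batch 1 to over 80% at batch 256, which is the amortization effect described above.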
Figure 2: TPU vs CPU Performance Comparison
- `src/`: Hardware source code. Contains the Verilog implementation of the TPU and CPU. See `src/README.md` for detailed hardware specifications and interface requirements.
- `tests/`: Test suite. Comprehensive, multi-tier testbenches covering everything from unit tests to full system integration. See `TESTING.md` for detailed testing documentation and the test runner guide.
- `ml/`: Machine learning. Python scripts for training the MNIST model and exporting weights/biases for the hardware (export format sketched after this list). See `ml/README.md` for training workflow details.
- `demo/`: Web demo. A Next.js application to interact with the hardware simulator. See `demo/README.md` for setup details.
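As a sketch of what that export step can look like (the function name, row-major layout, and two's-complement hex format are assumptions; Verilog's `$readmemh` loads this kind of file, but see `ml/README.md` for the real workflow):

```python
import numpy as np

def export_weights_hex(tensor, path):
    """Write an int8 weight tensor as one two-digit hex byte per line.

    Row-major order is assumed to match the array's weight-loading sequence.
    """
    flat = tensor.detach().cpu().numpy().astype(np.int8).ravel()
    with open(path, "w") as f:
        for w in flat:
            f.write(f"{int(w) & 0xff:02x}\n")  # two's-complement byte
```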
- Python 3.x
- Icarus Verilog (`iverilog` and `vvp`)
- Node.js (for the demo)
We have a robust testing framework located in `tests/`.

To run all tests:

```bash
python test.py --all
```

To run specific suites:

```bash
python test.py tpu --all   # Run all TPU tests
python test.py cpu --all   # Run all CPU tests
```
Figure 3: Rich Text Test Runner Output
For more details, check out `TESTING.md`.
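At the heart of these suites is the bit-exact check from the feature list: comparing values dumped by the Verilog testbench against the golden Python model. A minimal sketch of that comparison, with hypothetical file names and a plain-text integer dump format assumed:

```python
import numpy as np

def assert_bit_exact(golden_path, hw_dump_path):
    """Fail loudly if the hardware dump differs from the golden model anywhere."""
    golden = np.loadtxt(golden_path, dtype=np.int64)
    hw = np.loadtxt(hw_dump_path, dtype=np.int64)
    assert golden.shape == hw.shape, "output shapes differ"
    bad = np.flatnonzero(golden != hw)
    assert bad.size == 0, f"{bad.size} mismatches, first at flat index {bad[0]}"
```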
The demo visualizes the hardware's internal state (systolic array heatmaps) as it processes your handwriting.
- Navigate to the demo folder:

  ```bash
  cd demo
  ```

- Install dependencies and run:

  ```bash
  npm install
  npm run dev
  ```

- Open http://localhost:3000.
- Michael Scutari
- Praneeth Muvva
Built for Duke University's ECE350.
For a deep dive into the architecture, design decisions, and detailed performance analysis, check out the Full Project Writeup.
