A full-stack hardware acceleration project for MNIST digit recognition, featuring a complete implementation of both a custom weight-stationary Tensor Processing Unit (TPU) and a 5-stage pipelined MIPS CPU.
Figure 1: High-Level System Architecture
- 16x16 Systolic Array: Custom weight-stationary architecture optimized for high-throughput General Matrix Multiplication (GEMM) operations.
- Quantization Aware Training (QAT): Custom training pipeline in PyTorch to optimize model weights for 8-bit hardware precision.
- Zero-Copy Dataflow: Row-major, batch-by-batch output format allows layer-to-layer inference without CPU intervention.
- Pipelined CPU: A complete 5-stage MIPS CPU that acts as the master controller for the TPU via Memory-Mapped IO (MMIO).
- Bit-Exact Verification: Comprehensive test suite that verifies hardware output matches the golden Python simulation bit-for-bit.
- Automated Toolchain: End-to-end compiler that automatically converts PyTorch models into the exact binary/hex files required by the hardware.
- Real-time Visualization: Web-based frontend that bridges handwriting input directly to the Verilog hardware simulation.
This system accelerates matrix multiplications for neural network inference. It spans the entire stack:
- ML: PyTorch model training and Quantization Aware Training (QAT) for 8-bit precision (a sketch follows this list).
- Architecture: A custom 16x16 Systolic Array TPU integrated with a complete 5-stage pipelined MIPS CPU implementation.
- Simulation: Verilog hardware simulation using Icarus Verilog.
- Frontend: A Next.js web demo for real-time interaction with the hardware simulator.
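To make the QAT step concrete, here is a minimal eager-mode PyTorch QAT sketch. The layer sizes, module names, and qconfig choice are illustrative assumptions; the actual training script in `ml/` may differ.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    QuantStub, DeQuantStub, get_default_qat_qconfig, prepare_qat, convert,
)

class MnistMlp(nn.Module):
    """Hypothetical MLP; the real model in ml/ may be shaped differently."""
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()        # records input scale/zero-point
        self.fc1 = nn.Linear(784, 16)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(16, 10)
        self.dequant = DeQuantStub()

    def forward(self, x):
        x = self.quant(x.flatten(1))
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = MnistMlp()
model.qconfig = get_default_qat_qconfig("fbgemm")  # int8 weights and activations
model.train()
prepare_qat(model, inplace=True)  # inserts fake-quant observers around each layer

# ... run the usual training loop here; forward passes see quantization noise ...

model.eval()
int8_model = convert(model)  # folds observers into real int8 weights
```

Training through the fake-quantization noise is what lets the exported 8-bit weights match full-precision accuracy closely on the hardware.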
By offloading matrix multiplications to the weight-stationary systolic array, we achieve up to a 7,000x speedup in GEMM calculations compared to a standard scalar CPU implementation. The TPU performs 256 multiplications per cycle (one per PE in the 16x16 array), drastically reducing inference latency.
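As a simplified picture of what the array computes, here is a golden-model sketch in NumPy. It reproduces the int8 arithmetic of one weight tile, not the cycle-accurate RTL, and the function and variable names are illustrative:

```python
import numpy as np

def weight_stationary_gemm(weights, activations):
    """Golden-model GEMM for one weight tile.

    weights:     (16, 16) int8 tile, held fixed in the PEs
    activations: (batch, 16) int8 rows streamed through the array
    returns:     (batch, 16) int32 accumulators, one per output column
    """
    acc = np.zeros((activations.shape[0], weights.shape[1]), dtype=np.int32)
    for b, row in enumerate(activations):
        # In hardware all 256 PEs fire in parallel each cycle; here we only
        # reproduce the same int8 x int8 -> int32 accumulation.
        acc[b] = row.astype(np.int32) @ weights.astype(np.int32)
    return acc

# Example: one random tile, a batch of 8 input rows.
rng = np.random.default_rng(0)
w = rng.integers(-128, 128, size=(16, 16), dtype=np.int8)
x = rng.integers(-128, 128, size=(8, 16), dtype=np.int8)
assert np.array_equal(weight_stationary_gemm(w, x),
                      x.astype(np.int32) @ w.astype(np.int32))
```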
Performance scales significantly with batch size. Because our architecture is weight-stationary, loading weights into the systolic array is an expensive operation that only needs to happen once for a given set of inputs. Larger batch sizes allow us to amortize this setup cost across more compute cycles, keeping the processing elements (PEs) active for longer periods and maximizing utilization.
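To see why batching helps, consider a back-of-the-envelope cycle model. The constants below are illustrative assumptions, not measurements from the RTL:

```python
# Utilization model for a weight-stationary 16x16 tile.
N = 16

def tile_cycles(batch, weight_load_cycles=N):
    # Weights shift in once (~N cycles), each input row then streams through,
    # and draining the pipeline costs roughly 2N more cycles.
    return weight_load_cycles + batch + 2 * N

for batch in (1, 16, 256):
    useful_macs = batch * N * N              # MACs that do real work
    peak_macs = tile_cycles(batch) * N * N   # MACs if every PE fired every cycle
    print(f"batch={batch:4d}  PE utilization={useful_macs / peak_macs:.0%}")
```

Under these assumptions, utilization climbs from roughly 2% at batch 1 to over 80% at batch 256, which is the amortization effect described above.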
Figure 2: TPU vs CPU Performance Comparison
- `src/`: Hardware source code. Contains the Verilog implementation of the TPU and CPU. See `src/README.md` for detailed hardware specifications and interface requirements.
- `tests/`: Test suite. Comprehensive, multi-tier testbenches covering everything from unit tests to full system integration. See `TESTING.md` for detailed testing documentation and the test runner guide.
- `ml/`: Machine learning. Python scripts for training the MNIST model and exporting weights/biases for the hardware (export format sketched after this list). See `ml/README.md` for training workflow details.
- `demo/`: Web demo. A Next.js application to interact with the hardware simulator. See `demo/README.md` for setup details.
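As a sketch of what that export step can look like (the function name, row-major layout, and two's-complement hex format are assumptions; Verilog's `$readmemh` loads this kind of file, but see `ml/README.md` for the real workflow):

```python
import numpy as np

def export_weights_hex(tensor, path):
    """Write an int8 weight tensor as one two-digit hex byte per line.

    Row-major order is assumed to match the array's weight-loading sequence.
    """
    flat = tensor.detach().cpu().numpy().astype(np.int8).ravel()
    with open(path, "w") as f:
        for w in flat:
            f.write(f"{int(w) & 0xff:02x}\n")  # two's-complement byte
```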
- Python 3.x
- Icarus Verilog (`iverilog` and `vvp`)
- Node.js (for the demo)
We have a robust testing framework located in `tests/`.

To run all tests:

```bash
python test.py --all
```

To run specific suites:

```bash
python test.py tpu --all   # Run all TPU tests
python test.py cpu --all   # Run all CPU tests
```
Figure 3: Rich Text Test Runner Output
For more details, check out `TESTING.md`.
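At the heart of these suites is the bit-exact check from the feature list: comparing values dumped by the Verilog testbench against the golden Python model. A minimal sketch of that comparison, with hypothetical file names and a plain-text integer dump format assumed:

```python
import numpy as np

def assert_bit_exact(golden_path, hw_dump_path):
    """Fail loudly if the hardware dump differs from the golden model anywhere."""
    golden = np.loadtxt(golden_path, dtype=np.int64)
    hw = np.loadtxt(hw_dump_path, dtype=np.int64)
    assert golden.shape == hw.shape, "output shapes differ"
    bad = np.flatnonzero(golden != hw)
    assert bad.size == 0, f"{bad.size} mismatches, first at flat index {bad[0]}"
```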
The demo visualizes the hardware's internal state (systolic array heatmaps) as it processes your handwriting.
- Navigate to the demo folder:

  ```bash
  cd demo
  ```

- Install dependencies and run:

  ```bash
  npm install
  npm run dev
  ```

- Open http://localhost:3000.
- Michael Scutari
- Praneeth Muvva
Built for Duke University's ECE350.
For a deep dive into the architecture, design decisions, and detailed performance analysis, check out the Full Project Writeup.
