
Mini TPU

Built with Verilog, Python, Next.js, and PyTorch.

A full-stack hardware acceleration project for MNIST digit recognition, featuring a custom weight-stationary Tensor Processing Unit (TPU) paired with a complete 5-stage pipelined MIPS CPU.

Figure 1: High-Level System Architecture

Key Features

  • 16x16 Systolic Array: Custom weight-stationary architecture optimized for high-throughput General Matrix Multiplication (GEMM) operations.
  • Quantization Aware Training (QAT): Custom training pipeline in PyTorch to optimize model weights for 8-bit hardware precision.
  • Zero-Copy Dataflow: Row-major, batch-by-batch output format allows layer-to-layer inference without CPU intervention.
  • Pipelined CPU: A complete 5-stage MIPS CPU that acts as the master controller for the TPU via Memory-Mapped IO (MMIO).
  • Bit-Exact Verification: Comprehensive test suite that verifies hardware output matches the golden Python simulation bit-for-bit (a golden-model sketch follows this list).
  • Automated Toolchain: End-to-end compiler that automatically converts PyTorch models into the exact binary/hex files required by the hardware.
  • Real-time Visualization: Web-based frontend that bridges handwriting input directly to the Verilog hardware simulation.
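
A golden model of this kind can be sketched in a few lines of NumPy. The snippet below is illustrative only, not the repository's actual reference script; the function name, the right-shift requantization step, and the saturation range are assumptions:

import numpy as np

def golden_gemm_int8(weights, activations, shift=8):
    """Reference int8 GEMM: multiply int8 operands, accumulate in
    int32, then requantize back to int8 (shift value is assumed)."""
    acc = weights.astype(np.int32) @ activations.astype(np.int32)
    out = acc >> shift                               # assumed requantization
    return np.clip(out, -128, 127).astype(np.int8)   # saturate to int8

# A bit-exact test then reduces to a single comparison against the
# parsed hardware output: np.array_equal(hw_out, golden_gemm_int8(W, X))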

Project Overview

This system accelerates matrix multiplications for neural network inference. It spans the entire stack:

  1. ML: PyTorch model training and Quantization Aware Training (QAT) for 8-bit precision (see the sketch after this list).
  2. Architecture: A custom 16x16 Systolic Array TPU integrated with a complete 5-stage pipelined MIPS CPU.
  3. Simulation: Verilog hardware simulation using Icarus Verilog.
  4. Frontend: A Next.js web demo for real-time interaction with the hardware simulator.
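
QAT works by simulating 8-bit rounding during the forward pass while keeping full-precision weights and gradients. Below is a minimal sketch of the idea, not the project's actual training pipeline; the class name, the per-tensor scale choice, and the layer shapes are assumptions:

import torch
import torch.nn as nn

class FakeQuantLinear(nn.Linear):
    """Linear layer that rounds its weights to int8 levels in the
    forward pass; the straight-through estimator lets gradients
    flow to the underlying full-precision weights."""
    def forward(self, x):
        scale = self.weight.abs().max() / 127.0           # assumed per-tensor scale
        q = torch.clamp(torch.round(self.weight / scale), -128, 127) * scale
        w_q = self.weight + (q - self.weight).detach()    # straight-through estimator
        return nn.functional.linear(x, w_q, self.bias)

# e.g. layer = FakeQuantLinear(784, 128) for a 28x28 MNIST input.
# After training, round(weight / scale) yields the int8 values that
# a toolchain like this one would export to the hardware's hex files.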

Performance

By offloading matrix multiplications to the weight-stationary systolic array, we achieve up to a 7,000x speedup in General Matrix Multiplication (GEMM) calculations compared to a standard scalar CPU implementation. The TPU performs 256 multiplications per cycle (one per processing element in the 16x16 array), drastically reducing inference latency.

Performance scales significantly with batch size. Because our architecture is weight-stationary, loading weights into the systolic array is an expensive operation that only needs to happen once for a given set of inputs. Larger batch sizes allow us to amortize this setup cost across more compute cycles, keeping the processing elements (PEs) active for longer periods and maximizing utilization.
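
The amortization argument can be made concrete with a simple cycle model. All constants below are illustrative assumptions, not measured figures from this project:

# Toy cycle model for a 16x16 weight-stationary systolic array.
TILE = 16
LOAD_CYCLES = TILE * TILE          # one-time cost to stream weights into the PEs

def tpu_cycles(batch):
    # After loading, inputs stream through at ~1 column per cycle,
    # plus an assumed pipeline fill/drain of 2*TILE cycles.
    return LOAD_CYCLES + batch + 2 * TILE

def cpu_cycles(batch):
    # Scalar baseline: ~1 multiply-accumulate per cycle.
    return batch * TILE * TILE

for batch in (1, 16, 256):
    print(batch, cpu_cycles(batch) / tpu_cycles(batch))

# At batch 1 the weight load dominates and the array barely breaks even;
# as batch grows, the load cost is amortized and the speedup climbs.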

Figure 2: TPU vs CPU Performance Comparison

Repository Structure

  • src/: Hardware Source Code. Contains the Verilog implementation of the TPU and CPU.
    • See src/README.md for detailed hardware specifications and interface requirements.
  • tests/: Test Suite. Comprehensive, multi-tier testbenches covering everything from unit tests to full system integration.
    • See TESTING.md for detailed testing documentation and the test runner guide.
  • ml/: Machine Learning. Python scripts for training the MNIST model and exporting weights/biases for the hardware.
  • demo/: Web Demo. A Next.js application to interact with the hardware simulator.

Getting Started

Prerequisites

  • Python 3.x
  • Icarus Verilog (iverilog) and vvp
  • Node.js (for the demo)

Running Tests

We have a robust testing framework located in tests/.

To run all tests:

python test.py --all

To run specific suites:

python test.py tpu --all  # Run all TPU tests
python test.py cpu --all  # Run all CPU tests

Figure 3: Rich Text Test Runner Output

For more details, check out TESTING.md.
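
A bit-exact comparison of this kind typically parses the simulator's output dump and diffs it against the golden model. The sketch below assumes a testbench that writes one 8-bit hex value per line; the file names and format are hypothetical, not the repository's actual conventions:

import numpy as np

def load_hex_dump(path):
    """Parse a dump with one 8-bit hex word per line into signed int8
    (the one-word-per-line format is an assumption)."""
    with open(path) as f:
        words = [int(line, 16) for line in f if line.strip()]
    return np.array(words, dtype=np.uint8).view(np.int8)

hw = load_hex_dump("build/tpu_out.hex")       # hypothetical path
ref = load_hex_dump("build/golden_out.hex")   # hypothetical path
assert np.array_equal(hw, ref), "hardware output diverged from golden model"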

Running the Demo

The demo visualizes the hardware's internal state (systolic array heatmaps) as it processes your handwriting.

  1. Navigate to the demo folder:
    cd demo
  2. Install dependencies and run:
    npm install
    npm run dev
  3. Open http://localhost:3000.

Figure 4: Web Demo Interface

Authors

  • Michael Scutari
  • Praneeth Muvva

Context

Built for Duke University's ECE350.

For a deep dive into the architecture, design decisions, and detailed performance analysis, check out the Full Project Writeup.
