A minimal tensor processing unit (TPU), reinvented from Google's TPU V2 and V1.
Demo video: `tinytpu.mp4`
- Motivation
- Architecture
- Instruction Set
- Example Instruction Sequence
- Future Steps
- Setup
- Adding a new module to the tiny-tpu
- Running commands from Makefile
- Fixed point viewing in gtkwave
- What is a gtkw file?
## Architecture

### Processing Element (MAC)

- Function: performs a multiply-accumulate operation every clock cycle
- Data flow (sketched below):
  - Incoming data is multiplied by a stored weight and added to an incoming partial sum to produce an output sum
  - Incoming data also passes through to the next element for propagation across the array
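To make the dataflow concrete, here is a tiny behavioral sketch of one processing element in Python. It is our own illustration, not the RTL; the class name `ProcessingElement` and its ports are made up for clarity.

```python
# Behavioral sketch of one processing element (PE); illustrative only, not RTL.
class ProcessingElement:
    def __init__(self, weight):
        self.weight = weight  # the weight stays fixed during computation

    def cycle(self, data_in, psum_in):
        """One clock cycle: multiply-accumulate, then pass the input along."""
        psum_out = psum_in + data_in * self.weight  # flows down to the PE below
        data_out = data_in                          # flows across to the next PE
        return data_out, psum_out

pe = ProcessingElement(weight=3)
print(pe.cycle(data_in=2, psum_in=10))  # -> (2, 16)
```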
### Systolic Array

- Architecture: a 2D grid of processing elements
- Data movement:
  - Input values flow horizontally across the array
  - Partial sums flow vertically down the array
  - Weights remain fixed within each processing element during computation
- Input preprocessing:
  - Input matrices are rotated 90 degrees (implemented in hardware)
  - Inputs are staggered for correct computation in the systolic array (see the sketch below)
  - Weight matrices are transposed and staggered to align with the mathematical formulas
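The sketch below is a simplified, cycle-level Python model of a weight-stationary systolic array that we wrote to illustrate why inputs and outputs are staggered. It skips the 90-degree rotation/transposition preprocessing and the fixed-point details of the real hardware; all names are illustrative.

```python
import numpy as np

def systolic_matmul(A, W):
    """Cycle-level sketch of a weight-stationary K x N systolic array.

    W[k][n] is pre-loaded into PE(k, n). Activations A (M x K) stream in from
    the left with row k delayed by k cycles (the stagger), partial sums flow
    down each column, and C = A @ W drains out of the bottom row, also skewed.
    """
    M, K = A.shape
    K2, N = W.shape
    assert K == K2, "inner dimensions must match"

    data_reg = np.zeros((K, N))   # activation latched in each PE
    psum_reg = np.zeros((K, N))   # partial sum latched in each PE
    C = np.zeros((M, N))

    for t in range(M + K + N):
        # Staggered input: at cycle t, array row k receives A[t - k, k].
        left_in = np.zeros(K)
        for k in range(K):
            if 0 <= t - k < M:
                left_in[k] = A[t - k, k]

        new_data = np.zeros((K, N))
        new_psum = np.zeros((K, N))
        for k in range(K):
            for n in range(N):
                d = left_in[k] if n == 0 else data_reg[k, n - 1]  # from the left
                p = 0.0 if k == 0 else psum_reg[k - 1, n]         # from above
                new_data[k, n] = d                 # activation moves right
                new_psum[k, n] = p + d * W[k, n]   # MAC result moves down
        data_reg, psum_reg = new_data, new_psum

        # Finished sums leave the bottom row, one column later per cycle.
        for n in range(N):
            m = t - (K - 1) - n
            if 0 <= m < M:
                C[m, n] = psum_reg[K - 1, n]
    return C

A = np.arange(6, dtype=float).reshape(2, 3)
W = np.arange(12, dtype=float).reshape(3, 4)
assert np.allclose(systolic_matmul(A, W), A @ W)
```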
### Vector Processing Unit (VPU)

- Performs element-wise operations after the systolic array
- Control: module selection depends on the computation stage
- Modules (pipelined; reference math sketched below):
  - Bias addition
  - Leaky ReLU activation function
  - MSE loss
  - Leaky ReLU derivative
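For reference, the math behind these pipeline stages is plain element-wise arithmetic. The NumPy sketch below is our own summary of those formulas (the RTL works on fixed-point values); `leak` and `inv_batch_size_times_two` mirror the constants described in the ISA below.

```python
import numpy as np

# Reference math for the VPU stages; an illustrative NumPy sketch, not the
# fixed-point RTL.
def bias_add(x, b):
    return x + b

def leaky_relu(x, leak):
    return np.where(x > 0, x, leak * x)

def leaky_relu_derivative(x, leak):
    return np.where(x > 0, 1.0, leak)

def mse_loss(pred, target):
    return np.mean((pred - target) ** 2)

def mse_grad(pred, target, inv_batch_size_times_two):
    # d(MSE)/d(pred) = (2 / batch_size) * (pred - target)
    return inv_batch_size_times_two * (pred - target)
```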
### Unified Buffer (UB)

- Dual-port memory for storing intermediate values
- Stored data:
  - Input matrices
  - Weight matrices
  - Bias vectors
  - Post-activation values for backpropagation
  - Activation leak factors
  - Inverse batch size constant for MSE backpropagation
- Interface:
  - Two read and two write ports per data type
  - Data is accessed by specifying a start address and count
  - Reads can occur continuously in the background until the requested count is reached (modeled in the sketch below)
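One way to picture the read interface: request a start address and a count, then values stream out one per cycle until the count is reached. The Python model below is a rough simplification we wrote for illustration, not the actual dual-port RTL.

```python
# Rough behavioral model of a streaming UB read (illustrative, single-port).
class UnifiedBufferModel:
    def __init__(self, depth=256):
        self.mem = [0] * depth

    def write(self, addr, value):
        self.mem[addr] = value

    def read_stream(self, start_addr, count):
        """Yield one stored value per 'cycle' until `count` values are read."""
        for offset in range(count):
            yield self.mem[start_addr + offset]

ub = UnifiedBufferModel()
for i in range(8):
    ub.write(0x10 + i, i * i)
print(list(ub.read_stream(start_addr=0x10, count=8)))
```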
### Control Unit

- Instruction width: 94 bits
- See the Instruction Set section below for more information
## Instruction Set

Our ISA is 94 bits wide. The full image is available in the `images/` folder.

The ISA defines all of the signals necessary for transferring data and interacting with our TPU. The control unit, which decodes these instructions, is implemented in `src/control_unit.sv`.

The instruction bus is 94 bits wide (`[93:0]`) and is divided into fields that directly control the TPU's subsystems.
| Field | Signal | Meaning | Example |
|---|---|---|---|
| [0] | `sys_switch_in` | System mode switch (general-purpose "on/off" for the CU) | `1` = system active, `0` = idle |
| [1] | `ub_rd_start_in` | Start a UB (Unified Buffer) read transaction | `1` = trigger read, `0` = no read |
| [2] | `ub_rd_transpose` | UB read transpose mode | `1` = transpose, `0` = normal |
| [3] | `ub_wr_host_valid_in_1` | Host write channel 1 valid flag | `1` = write valid, `0` = not valid |
| [4] | `ub_wr_host_valid_in_2` | Host write channel 2 valid flag | `1` = write valid, `0` = not valid |
| [6:5] | `ub_rd_col_size` | Number of columns to read | `00` = 0, `01` = 1, `10` = 2, `11` = 3 |
| [14:7] | `ub_rd_row_size` | Number of rows to read (0–255) | `0x08` = read 8 rows |
| [22:15] | `ub_rd_addr_in` | UB read address (0–255) | `0x10` = read bank 16 |
| [25:23] | `ub_ptr_sel` | Selects UB pointer | `3'b001` = route read ptr to bias module in VPU |
| [41:26] | `ub_wr_host_data_in_1` | First host write word | `0xABCD` |
| [57:42] | `ub_wr_host_data_in_2` | Second host write word | `0x1234` |
| [61:58] | `vpu_data_pathway` | Routing of data in the VPU | `0001` = bias + ReLU routing |
| [77:62] | `inv_batch_size_times_two_in` | Precomputed scaling factor (2 / batch size) | `0x0010` = 2/32 |
| [93:78] | `vpu_leak_factor_in` | Leak factor for activation (e.g., Leaky ReLU) | `0x00A0` = 0.625 |
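Because instructions are assembled by hand in the testbench, it can help to see how these fields pack into a single 94-bit word. The bit offsets and signal names below come straight from the table; the packing helper itself is just an illustrative sketch, not code from the repository.

```python
# Illustrative packer for the 94-bit instruction word ([93:0]) described above.
# Offsets/widths follow the table; this helper is not part of the repository.
FIELDS = {  # name: (lsb, width)
    "sys_switch_in":               (0, 1),
    "ub_rd_start_in":              (1, 1),
    "ub_rd_transpose":             (2, 1),
    "ub_wr_host_valid_in_1":       (3, 1),
    "ub_wr_host_valid_in_2":       (4, 1),
    "ub_rd_col_size":              (5, 2),
    "ub_rd_row_size":              (7, 8),
    "ub_rd_addr_in":               (15, 8),
    "ub_ptr_sel":                  (23, 3),
    "ub_wr_host_data_in_1":        (26, 16),
    "ub_wr_host_data_in_2":        (42, 16),
    "vpu_data_pathway":            (58, 4),
    "inv_batch_size_times_two_in": (62, 16),
    "vpu_leak_factor_in":          (78, 16),
}

def pack_instruction(**fields):
    """Pack named fields into one 94-bit integer."""
    word = 0
    for name, value in fields.items():
        lsb, width = FIELDS[name]
        assert 0 <= value < (1 << width), f"{name} does not fit in {width} bits"
        word |= value << lsb
    return word

# Example: start a UB read at address 0x10 (8 rows, col_size code 0b11).
instr = pack_instruction(sys_switch_in=1, ub_rd_start_in=1,
                         ub_rd_col_size=0b11, ub_rd_row_size=8,
                         ub_rd_addr_in=0x10)
print(f"{instr:024x}")  # 94 bits fit in 24 hex digits
```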
## Example Instruction Sequence

Instructions are loaded directly into an on-chip instruction buffer from a testbench file.

- See `tests/test_tpu.py` for our forward and backward pass instruction sequence
- See the Setup section for how to run this testbench
## Future Steps

- A compiler for this instruction set
- Scaling the TPU to larger dimensions (256×256 or 512×512)
## Setup

We are open source and appreciate any contributions! Here is our workflow and the steps to set up our development environment.

### macOS

- Create a virtual environment and run `pip install cocotb`
- Install iverilog using Homebrew: `brew install iverilog`
- Build gtkwave from source (important: other installation methods currently do not work)

### Linux

- Create a virtual environment and run `pip install cocotb`
- Install gtkwave: `sudo apt install gtkwave`
- Install iverilog: `sudo apt install iverilog`
## Adding a new module to the tiny-tpu

Follow these steps to add a new module to the project:

1. Add your new module file `<MODULE_NAME>.sv` to the `src/` directory.

2. Create `dump_<MODULE_NAME>.sv` in the `test/` directory with the following code:

   ```verilog
   module dump();
     initial begin
       $dumpfile("waveforms/<MODULE_NAME>.vcd");
       $dumpvars(0, <MODULE_NAME>);
     end
   endmodule
   ```

3. Create `test_<MODULE_NAME>.py` in the `test/` directory (a minimal skeleton is sketched after these steps).

4. In the Makefile, add your module to the `SOURCES` variable and create a test target:

   ```make
   test_<MODULE_NAME>: $(SIM_BUILD_DIR)
   	$(IVERILOG) -o $(SIM_VVP) -s <MODULE_NAME> -s dump -g2012 $(SOURCES) test/dump_<MODULE_NAME>.sv
   	PYTHONOPTIMIZE=$(NOASSERT) MODULE=test_<MODULE_NAME> $(VVP) -M $(COCOTB_LIBS) -m libcocotbvpi_icarus $(SIM_VVP)
   	! grep failure results.xml
   	mv <MODULE_NAME>.vcd waveforms/ 2>/dev/null || true
   ```

5. View the generated waveforms with `gtkwave waveforms/<MODULE_NAME>.vcd`.
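If you are unsure what goes inside `test_<MODULE_NAME>.py` (step 3), a minimal cocotb test looks roughly like the sketch below. The signal names `clk`, `rst`, `data_in`, and `data_out` are placeholders; use the ports your module actually exposes.

```python
# test/test_<MODULE_NAME>.py -- minimal cocotb skeleton (signal names are
# placeholders; adapt them to your module's ports).
import cocotb
from cocotb.clock import Clock
from cocotb.triggers import RisingEdge


@cocotb.test()
async def basic_test(dut):
    # Drive a 10 ns clock and pulse reset for one cycle.
    cocotb.start_soon(Clock(dut.clk, 10, units="ns").start())
    dut.rst.value = 1
    await RisingEdge(dut.clk)
    dut.rst.value = 0

    # Apply a stimulus and check the output a couple of cycles later.
    dut.data_in.value = 42
    await RisingEdge(dut.clk)
    await RisingEdge(dut.clk)
    assert dut.data_out.value == 42, f"unexpected output: {dut.data_out.value}"
```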
## Running commands from Makefile

- Run tests: `make test_<MODULE_NAME>`
- View waveforms: `gtkwave waveforms/<MODULE_NAME>.vcd`
- Or use the shorthand: `make show_<MODULE_NAME>`
## Fixed point viewing in gtkwave

- Right-click all signals
- Navigate to: Data Format → Fixed Point Shift → Specify
- Enter `8` and click OK
- Set: Data Format → Signed Decimal
- Enable: Data Format → Fixed Point Shift → ON
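The `8` is the number of fractional bits: with a fixed-point shift of 8, a signed 16-bit raw value `v` is displayed as `v / 256`, which matches the examples in the ISA table (e.g., `0x00A0` = 0.625). A quick sanity check in Python (our own helper, not repo code):

```python
# Interpret a raw 16-bit value as signed fixed point with 8 fractional bits,
# matching gtkwave's "Fixed Point Shift = 8" + "Signed Decimal" settings.
def to_fixed_point(raw, frac_bits=8, width=16):
    if raw >= 1 << (width - 1):   # undo two's-complement wrapping
        raw -= 1 << width
    return raw / (1 << frac_bits)

print(to_fixed_point(0x00A0))  # 0.625   (leak factor example from the ISA table)
print(to_fixed_point(0x0010))  # 0.0625  (= 2/32, the inverse-batch-size example)
```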
## What is a gtkw file?

A `.gtkw` file stores the signal configuration used by `make show_<MODULE_NAME>`. You only need to save it once after running `gtkwave waveforms/<MODULE_NAME>.vcd`.

## Motivation
The details of TPU architecture are closed source, as is most of chip design. We want this resource to be the ultimate guide to breaking into building chip accelerators for all levels of technical expertise — even if you just learned high school math and only know y = mx + b.
Before this project, none of us had professional experience in hardware architecture/design. We started this ambitious project as a dedicated group wanting to break into hardware design. We've collectively gained significant design experience from this project.
We hope that the inventive nature of the article at tinytpu.com, this README, and the code in this repository will help you walk through our steps and learn how to approach problems with an inventive mindset.