README.md: 12 additions & 12 deletions
@@ -65,13 +65,13 @@ The GPU itself consists of the following units:
4. Memory controllers for data memory & program memory
5. Cache

-**Device Control Register:**
+### Device Control Register

The device control register usually stores metadata specifying how kernels should be executed on the GPU.

In this case, the device control register just stores the `thread_count` - the total number of threads to launch for the active kernel.
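
A minimal sketch of what such a register could look like in SystemVerilog - the port names and widths are illustrative, not tiny-gpu's exact interface:

```systemverilog
// Device control register: latch thread_count when the host writes it,
// and expose it to the dispatcher. Port names are illustrative.
module dcr (
    input  wire       clk,
    input  wire       reset,
    input  wire       write_enable,   // host writes the register
    input  wire [7:0] write_data,
    output reg  [7:0] thread_count
);
    always @(posedge clk) begin
        if (reset)
            thread_count <= 8'b0;
        else if (write_enable)
            thread_count <= write_data;
    end
endmodule
```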

-**Dispatcher:**
+### Dispatcher

Once a kernel is launched, the dispatcher is the unit that actually manages the distribution of threads to different compute cores.
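
A sketch of that dispatch loop, under assumed names and a made-up handshake: compute ceil(`thread_count` / `THREADS_PER_BLOCK`) blocks, hand a block id to each idle core, and (as the hunk below notes) report done once every block has retired.

```systemverilog
// Dispatcher sketch: hand out block ids to idle cores until
// ceil(thread_count / THREADS_PER_BLOCK) blocks have been run,
// then report the kernel as done. Interface is illustrative.
module dispatcher #(
    parameter NUM_CORES         = 2,
    parameter THREADS_PER_BLOCK = 4
) (
    input  wire                 clk,
    input  wire                 reset,
    input  wire                 start,
    input  wire [7:0]           thread_count,
    input  wire [NUM_CORES-1:0] core_done,    // core finished its block
    output reg  [NUM_CORES-1:0] core_start,   // one-cycle launch pulse
    output reg  [7:0]           core_block_id [0:NUM_CORES-1],
    output reg                  kernel_done
);
    // Round up so a partially full final block is still dispatched.
    wire [7:0] total_blocks =
        (thread_count + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;

    reg [NUM_CORES-1:0] busy;           // which cores own a block
    reg [7:0] next_block, blocks_done;  // registered counters
    reg [7:0] nb, bd;                   // same-cycle working copies
    integer i;

    always @(posedge clk) begin
        if (reset) begin
            busy <= '0; core_start <= '0; kernel_done <= 1'b0;
            next_block <= 8'b0; blocks_done <= 8'b0;
        end else if (start && !kernel_done) begin
            nb = next_block;
            bd = blocks_done;
            core_start <= '0;
            for (i = 0; i < NUM_CORES; i = i + 1) begin
                if (busy[i] && core_done[i]) begin
                    busy[i] <= 1'b0;             // retire a finished block
                    bd = bd + 1;
                end else if (!busy[i] && nb < total_blocks) begin
                    busy[i]          <= 1'b1;    // launch the next block
                    core_start[i]    <= 1'b1;
                    core_block_id[i] <= nb;
                    nb = nb + 1;
                end
            end
            next_block  <= nb;
            blocks_done <= bd;
            if (bd == total_blocks) kernel_done <= 1'b1;
        end
    end
endmodule
```

The blocking working copies (`nb`, `bd`) let several cores launch or retire blocks in the same cycle without the counters colliding.
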
@@ -83,7 +83,7 @@ Once all blocks have been processed, the dispatcher reports back that the kernel
The GPU is built to interface with an external global memory. Here, data memory and program memory are separated out for simplicity.

-**Global Memory:**
+### Global Memory

tiny-gpu data memory has the following specifications:

@@ -95,15 +95,15 @@ tiny-gpu program memory has the following specifications:
- 8 bit addressability (256 rows of program memory)
- 16 bit data (each instruction is 16 bits as specified by the ISA)
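
One parameterized synchronous memory can model both: 256 x 16 for program memory as specified above; data memory would use a narrower data width and add a write port. The port names and single-cycle latency below are simplifying assumptions:

```systemverilog
// Behavioural model of an external memory: 256 rows, parameterized
// data width. A real external memory would be far slower than the
// one-cycle latency modelled here.
module external_memory #(
    parameter ADDR_BITS = 8,    // 256 rows
    parameter DATA_BITS = 16    // one 16-bit instruction per row
) (
    input  wire                 clk,
    input  wire                 read_valid,
    input  wire [ADDR_BITS-1:0] read_address,
    output reg                  read_ready,
    output reg  [DATA_BITS-1:0] read_data
);
    reg [DATA_BITS-1:0] mem [0:(1 << ADDR_BITS) - 1];

    always @(posedge clk) begin
        read_ready <= read_valid;         // respond one cycle later
        if (read_valid)
            read_data <= mem[read_address];
    end
endmodule
```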

-**Memory Controllers:**
+### Memory Controllers

Global memory has fixed read/write bandwidth, but there may be far more incoming requests across all cores to access data from memory than the external memory is actually able to handle.

The memory controllers keep track of all the outgoing requests to memory from the compute cores, throttle requests based on actual external memory bandwidth, and relay responses from external memory back to the proper resources.

Each memory controller has a fixed number of channels based on the bandwidth of global memory.
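
A sketch of the throttling idea, reduced to a single channel with a fixed-priority scan over consumers - the valid/ready handshake and all names are assumptions, not tiny-gpu's confirmed interface:

```systemverilog
// Many LSUs may request at once, but only as many requests as there
// are channels can be in flight - here just one. A waiting consumer
// simply holds its request until the channel frees up.
module mem_controller #(
    parameter NUM_CONSUMERS = 4
) (
    input  wire                     clk,
    input  wire                     reset,
    input  wire [NUM_CONSUMERS-1:0] consumer_read_valid,
    input  wire [7:0]               consumer_read_address [0:NUM_CONSUMERS-1],
    output reg  [NUM_CONSUMERS-1:0] consumer_read_ready,
    output reg  [7:0]               consumer_read_data [0:NUM_CONSUMERS-1],
    // the single channel to external memory
    output reg                      mem_read_valid,
    output reg  [7:0]               mem_read_address,
    input  wire                     mem_read_ready,
    input  wire [7:0]               mem_read_data
);
    reg                             busy;
    reg [$clog2(NUM_CONSUMERS)-1:0] owner;   // who holds the channel
    integer i;

    always @(posedge clk) begin
        if (reset) begin
            busy <= 1'b0; mem_read_valid <= 1'b0;
            consumer_read_ready <= '0;
        end else if (!busy) begin
            consumer_read_ready <= '0;
            // Scan from high to low so the lowest-index requester wins.
            for (i = NUM_CONSUMERS - 1; i >= 0; i = i - 1)
                if (consumer_read_valid[i]) begin
                    busy             <= 1'b1;
                    owner            <= i;
                    mem_read_valid   <= 1'b1;
                    mem_read_address <= consumer_read_address[i];
                end
        end else if (mem_read_ready) begin
            // Relay the response, then free the channel for the next LSU.
            busy                       <= 1'b0;
            mem_read_valid             <= 1'b0;
            consumer_read_data[owner]  <= mem_read_data;
            consumer_read_ready[owner] <= 1'b1;
        end
    end
endmodule
```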

-**Cache:**
+### Cache

The same data is often requested from global memory by multiple cores. Constantly accessing global memory is expensive, and since the data has already been fetched once, it would be more efficient to store it on device in SRAM to be retrieved much quicker on later requests.

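A direct-mapped lookup is one minimal way to get that on-device reuse. The sketch below shows just the hit path; the fill path on a miss (via the memory controller) is omitted, and nothing here is tiny-gpu's actual cache interface:

```systemverilog
// Direct-mapped lookup: split the address into tag + index, and hit
// when the indexed line is valid and its stored tag matches.
module cache #(
    parameter ADDR_BITS = 8,
    parameter DATA_BITS = 8,
    parameter NUM_LINES = 16
) (
    input  wire [ADDR_BITS-1:0] addr,
    output wire                 hit,
    output wire [DATA_BITS-1:0] hit_data
);
    localparam INDEX_BITS = $clog2(NUM_LINES);

    reg                             valid [0:NUM_LINES-1];
    reg [ADDR_BITS-INDEX_BITS-1:0]  tags  [0:NUM_LINES-1];
    reg [DATA_BITS-1:0]             data  [0:NUM_LINES-1];

    wire [INDEX_BITS-1:0]           index = addr[INDEX_BITS-1:0];
    wire [ADDR_BITS-INDEX_BITS-1:0] tag   = addr[ADDR_BITS-1:INDEX_BITS];

    assign hit      = valid[index] && (tags[index] == tag);
    assign hit_data = data[index];
endmodule
```

Direct mapping keeps the lookup to one comparator per access, at the cost of conflict misses when two hot addresses share an index.
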
@@ -115,7 +115,7 @@ Each core has a number of compute resources, often built around a certain number

In this simplified GPU, each core processes one **block** at a time, and for each thread in a block, the core has a dedicated ALU, LSU, PC, and register file. Managing the execution of thread instructions on these resources is one of the most challenging problems in GPUs.

-**Scheduler:**
+### Scheduler

Each core has a single scheduler that manages the execution of threads.

@@ -125,33 +125,33 @@ In more advanced schedulers, techniques like **pipelining** are used to stream t

The main constraint the scheduler has to work around is the latency associated with loading & storing data from global memory. While most instructions can be executed synchronously, these load-store operations are asynchronous, meaning the rest of the instruction execution has to be built around these long wait times.
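
A simple in-order scheduler can be pictured as a state machine that walks every instruction through fixed phases, parking in a wait state until each pending LSU response arrives; pipelined schedulers overlap these phases across instructions instead. The state names below are an illustrative guess, not the module's confirmed states:

```systemverilog
// Phases a minimal non-pipelined core scheduler might step through.
typedef enum logic [2:0] {
    FETCH,     // request the instruction at the current PC
    DECODE,    // turn it into control signals
    REQUEST,   // LSUs issue any LDR/STR memory requests
    WAIT,      // stall until every pending LSU request completes
    EXECUTE,   // ALUs compute
    UPDATE     // write back registers, advance the PCs
} core_state_t;
```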

-**Fetcher:**
+### Fetcher

Asynchronously fetches the instruction at the current program counter from program memory (most should actually be fetching from cache after a single block is executed).
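
That asynchronous fetch can be sketched as a small handshake state machine; signal names here are illustrative:

```systemverilog
// Fetcher sketch: issue a program-memory read for the current PC and
// hold the instruction once the response arrives.
module fetcher (
    input  wire        clk,
    input  wire        reset,
    input  wire        fetch_enable,       // core is in its fetch phase
    input  wire [7:0]  current_pc,
    // program memory request channel
    output reg         mem_read_valid,
    output reg  [7:0]  mem_read_address,
    input  wire        mem_read_ready,
    input  wire [15:0] mem_read_data,
    output reg         instruction_ready,
    output reg  [15:0] instruction
);
    always @(posedge clk) begin
        if (reset) begin
            mem_read_valid    <= 1'b0;
            instruction_ready <= 1'b0;
        end else if (fetch_enable && !mem_read_valid && !instruction_ready) begin
            mem_read_valid   <= 1'b1;       // launch the async fetch
            mem_read_address <= current_pc;
        end else if (mem_read_valid && mem_read_ready) begin
            mem_read_valid    <= 1'b0;      // response arrived
            instruction       <= mem_read_data;
            instruction_ready <= 1'b1;
        end else if (!fetch_enable) begin
            instruction_ready <= 1'b0;      // ready for the next fetch
        end
    end
endmodule
```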

-**Decoder:**
+### Decoder

Decodes the fetched instruction into control signals for thread execution.
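
A sketch of such a decoder as a purely combinational opcode-to-control-signal mapping. The 4-bit opcode values and the exact set of control signals are assumptions for illustration, not the ISA's confirmed encoding:

```systemverilog
// Decoder sketch: top 4 bits of the 16-bit instruction select the
// operation; everything defaults to "no side effects".
module decoder (
    input  wire [15:0] instruction,
    output reg         reg_write_enable,
    output reg         mem_read_enable,   // LDR
    output reg         mem_write_enable,  // STR
    output reg         nzp_write_enable,  // CMP
    output reg         branch             // branch on NZP (assumed)
);
    always @(*) begin
        {reg_write_enable, mem_read_enable, mem_write_enable,
         nzp_write_enable, branch} = 5'b0;
        case (instruction[15:12])
            4'b0011: reg_write_enable = 1'b1;              // ADD
            4'b0111: begin                                 // LDR
                mem_read_enable  = 1'b1;
                reg_write_enable = 1'b1;
            end
            4'b1000: mem_write_enable = 1'b1;              // STR
            4'b0010: nzp_write_enable = 1'b1;              // CMP
            4'b0001: branch = 1'b1;                        // branch
            default: ;                                     // others omitted
        endcase
    end
endmodule
```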

-**Register Files:**
+### Register Files

Each thread has its own dedicated register file. The register files hold the data that each thread is performing computations on, which enables the single-instruction multiple-data (SIMD) pattern.

Importantly, each register file contains a few read-only registers holding data about the current block & thread being executed locally, enabling kernels to be executed with different data based on the local thread id.
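
A sketch of one thread's register file with those read-only specials. Reserving the top three registers for `%blockIdx`, `%blockDim`, and `%threadIdx` is an assumption here, as are the port names and widths:

```systemverilog
// One thread's register file: 16 x 8-bit registers, with the last
// three treated as read-only block/thread metadata (slots assumed).
module register_file #(
    parameter THREADS_PER_BLOCK = 4,
    parameter THREAD_ID         = 0   // fixed per thread within the core
) (
    input  wire       clk,
    input  wire       block_start,    // a new block was assigned to the core
    input  wire [7:0] block_id,
    input  wire       write_enable,
    input  wire [3:0] write_addr,
    input  wire [7:0] write_data,
    input  wire [3:0] read_addr,
    output wire [7:0] read_data
);
    reg [7:0] regs [0:15];
    assign read_data = regs[read_addr];

    always @(posedge clk) begin
        if (block_start) begin
            regs[13] <= block_id;           // %blockIdx (assumed slot)
            regs[14] <= THREADS_PER_BLOCK;  // %blockDim (assumed slot)
            regs[15] <= THREAD_ID;          // %threadIdx (assumed slot)
        end else if (write_enable && write_addr < 4'd13) begin
            regs[write_addr] <= write_data; // specials stay read-only
        end
    end
endmodule
```

Because `%threadIdx` differs per register file, every thread running the same instruction stream can still index different data - which is exactly the SIMD pattern described above.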

-**ALUs:**
+### ALUs

Dedicated arithmetic-logic unit for each thread to perform computations. Handles the `ADD`, `SUB`, `MUL`, `DIV` arithmetic instructions.

Also handles the `CMP` comparison instruction, which outputs whether the difference between two registers is negative, zero or positive - and stores the result in the `NZP` register in the PC unit.
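
A per-thread ALU sketch covering those instructions; the 2-bit op encoding, divide-by-zero handling, and the packed NZP format are assumptions:

```systemverilog
// ALU sketch: four arithmetic ops plus NZP flag generation for CMP.
module alu (
    input  wire [1:0] op,          // 00 ADD, 01 SUB, 10 MUL, 11 DIV
    input  wire       is_cmp,      // CMP shares the subtract path
    input  wire [7:0] a, b,
    output reg  [7:0] result,
    output reg  [2:0] nzp          // {negative, zero, positive}
);
    always @(*) begin
        case (op)
            2'b00: result = a + b;
            2'b01: result = a - b;
            2'b10: result = a * b;
            2'b11: result = (b == 0) ? 8'b0 : a / b;  // avoid x on div-by-0
        endcase
        // CMP: classify a - b; the core latches this into the NZP
        // register in the PC unit.
        nzp = is_cmp
            ? {($signed(a - b) < 0), (a == b), ($signed(a - b) > 0)}
            : 3'b000;
    end
endmodule
```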

-**LSUs:**
+### LSUs

Dedicated load-store unit for each thread to access global data memory.

Handles the `LDR` & `STR` instructions - and handles async wait times for memory requests to be processed and relayed by the memory controller.
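
The load path can be sketched as a small state machine: the scheduler watches `lsu_state` to know when it must keep waiting. The store path is omitted for brevity and all names are illustrative:

```systemverilog
// LSU sketch: issue a read to the data memory controller, then idle
// until the response arrives - this is where the async waits live.
module lsu (
    input  wire       clk,
    input  wire       reset,
    input  wire       mem_read_enable,    // decoded LDR
    input  wire [7:0] address,
    // channel to the data memory controller
    output reg        mem_read_valid,
    output reg  [7:0] mem_read_address,
    input  wire       mem_read_ready,
    input  wire [7:0] mem_read_data,
    output reg  [1:0] lsu_state,          // scheduler stalls while WAITING
    output reg  [7:0] load_data
);
    localparam IDLE = 2'b00, WAITING = 2'b01, DONE = 2'b10;

    always @(posedge clk) begin
        if (reset) begin
            lsu_state      <= IDLE;
            mem_read_valid <= 1'b0;
        end else case (lsu_state)
            IDLE: if (mem_read_enable) begin
                mem_read_valid   <= 1'b1;    // fire the async request
                mem_read_address <= address;
                lsu_state        <= WAITING;
            end
            WAITING: if (mem_read_ready) begin
                mem_read_valid <= 1'b0;
                load_data      <= mem_read_data;
                lsu_state      <= DONE;      // scheduler may resume
            end
            DONE: if (!mem_read_enable) lsu_state <= IDLE;
            default: lsu_state <= IDLE;
        endcase
    end
endmodule
```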

-**PCs:**
+### PCs

Dedicated program-counter for each thread to determine the next instruction to execute.
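
A sketch of a per-thread PC that also owns the `NZP` register the ALU section mentioned. The branch-on-NZP-mask behaviour is an assumption about how that register gets used; names are illustrative:

```systemverilog
// PC sketch: advance to the next instruction, or jump when a branch's
// NZP condition mask matches the flags stored from CMP.
module pc_unit (
    input  wire       clk,
    input  wire       reset,
    input  wire       update_enable,      // core is in its update phase
    input  wire       nzp_write_enable,   // executing CMP
    input  wire [2:0] alu_nzp,            // flags computed by the ALU
    input  wire       branch,             // decoded branch instruction
    input  wire [2:0] branch_nzp,         // condition mask from instruction
    input  wire [7:0] branch_target,
    output reg  [7:0] pc
);
    reg [2:0] nzp;  // the NZP register this unit owns

    always @(posedge clk) begin
        if (reset) begin
            pc  <= 8'b0;
            nzp <= 3'b000;
        end else if (update_enable) begin
            if (nzp_write_enable)
                nzp <= alu_nzp;            // CMP result lands here
            if (branch && (nzp & branch_nzp) != 3'b000)
                pc <= branch_target;       // taken branch
            else
                pc <= pc + 1;              // fall through
        end
    end
endmodule
```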