**For anyone trying to run the simulation or play with this repo, please feel free to DM me on [twitter](https://twitter.com/majmudaradam) if you run into any issues - I want you to get this running!**
# Advanced Functionality
For the sake of simplicity, tiny-gpu omits many additional features implemented in modern GPUs that heavily improve performance & functionality. We'll discuss some of the most critical of these features in this section.
### Multi-layered Cache & Shared Memory
In modern GPUs, multiple levels of caches are used to minimize the amount of data that needs to be accessed from global memory. tiny-gpu implements only a single cache layer, storing recently accessed data, between the individual compute units requesting memory and the memory controllers.
Implementing multi-layered caches allows frequently accessed data to be cached more locally to where it's being used (with some caches within individual compute cores), minimizing load times for this data.
Different caching algorithms are used to maximize cache-hits - this is a critical dimension that can be improved on to optimize memory access.
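To make the replacement-policy idea concrete, here's a minimal software sketch (Python, not hardware) of a least-recently-used cache, one of the simpler policies a cache layer might use. The `LRUCache` class and its interface are hypothetical, purely for illustration - tiny-gpu's actual cache is simpler than this:

```python
from collections import OrderedDict

class LRUCache:
    """Toy cache with least-recently-used eviction (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> data, ordered by recency
        self.hits = 0
        self.misses = 0

    def read(self, address, load_from_memory):
        if address in self.lines:
            self.hits += 1
            self.lines.move_to_end(address)  # mark as most recently used
            return self.lines[address]
        self.misses += 1
        data = load_from_memory(address)     # slow path: go to global memory
        self.lines[address] = data
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)   # evict the least recently used line
        return data

# Simulated global memory: the data at address a is just a * 2.
cache = LRUCache(capacity=2)
for addr in [0, 1, 0, 2, 0, 1]:
    cache.read(addr, lambda a: a * 2)
print(cache.hits, cache.misses)  # 2 4
```

Both re-reads of address 0 hit; everything else misses, including the final read of address 1, which was evicted when address 2 was loaded.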
Additionally, GPUs often use **shared memory** for threads within the same block to access a single memory space that can be used to share results with other threads.
### Memory Coalescing
Another critical memory optimization used by GPUs is **memory coalescing.** Multiple threads running in parallel often need to access sequential addresses in memory (for example, a group of threads accessing neighboring elements in a matrix) - but each of these memory requests is issued separately.

Memory coalescing analyzes queued memory requests and combines neighboring requests into a single transaction, minimizing time spent on addressing and issuing all the requests together.
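As an illustration, here's a minimal Python sketch of the grouping step. Real hardware coalesces within fixed-size, aligned memory segments; the `coalesce` helper and its `max_width` parameter are simplified assumptions:

```python
def coalesce(addresses, max_width=4):
    """Combine neighboring memory requests into wider transactions."""
    transactions = []
    for addr in sorted(set(addresses)):
        last = transactions[-1] if transactions else None
        # Extend the previous transaction if this address is adjacent
        # to it and the transaction still has room.
        if last and addr == last[-1] + 1 and len(last) < max_width:
            last.append(addr)
        else:
            transactions.append([addr])
    return transactions

# Four threads reading neighboring matrix elements -> one transaction.
print(coalesce([3, 1, 0, 2]))    # [[0, 1, 2, 3]]
# A strided access pattern cannot be coalesced.
print(coalesce([0, 8, 16, 24]))  # [[0], [8], [16], [24]]
```

The contiguous accesses collapse into a single transaction, while the strided pattern still costs one transaction per request - which is why access patterns matter so much for GPU memory performance.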
### Pipelining
In the control flow for tiny-gpu, cores wait for one instruction to be executed on a group of threads before starting execution of the next instruction.
Modern GPUs use **pipelining** to stream execution of multiple sequential instructions at once while ensuring that instructions with dependencies on each other still get executed sequentially.
This helps to maximize resource utilization within cores as resources are not sitting idle while waiting (ex: during async memory requests).
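The cycle savings can be sketched with back-of-the-envelope arithmetic, assuming a hypothetical 3-stage pipeline and fully independent instructions (no hazards - real pipelines must also stall or forward around dependencies):

```python
def cycles_unpipelined(num_instructions, stages=3):
    # tiny-gpu style: each instruction finishes all stages before
    # the next one begins.
    return num_instructions * stages

def cycles_pipelined(num_instructions, stages=3):
    # With pipelining, a new independent instruction enters the pipe
    # every cycle once it is full, so only the first instruction pays
    # the full latency.
    return stages + (num_instructions - 1)

print(cycles_unpipelined(8))  # 24
print(cycles_pipelined(8))    # 10
```

Throughput approaches one instruction per cycle as the instruction stream grows, regardless of pipeline depth.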
### Warp Scheduling
Another strategy used to maximize resource utilization within cores is **warp scheduling.** This approach involves breaking blocks up into individual batches of threads that can be executed together.
Multiple warps can be executed on a single core simultaneously by executing instructions from one warp while another warp is waiting. This is similar to pipelining, but dealing with instructions from different threads.
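A toy Python model of this interleaving - the `run` function, the 2-cycle memory latency, and the instruction names are all made-up assumptions for illustration, picking ready warps round-robin and skipping warps stalled on memory:

```python
from collections import deque

def run(warps):
    """Interleave instructions from multiple warps on one core.

    warps: dict of warp id -> list of instructions. 'LDR' stalls its
    warp for 2 cycles (a toy memory latency) while other warps issue.
    """
    ready = deque(warps)   # warp ids currently able to issue
    stalled = {}           # warp id -> cycles left until its load returns
    trace = []             # one entry per issued instruction
    while ready or stalled:
        # Wake up warps whose memory requests have completed.
        for wid in list(stalled):
            stalled[wid] -= 1
            if stalled[wid] == 0:
                del stalled[wid]
                ready.append(wid)
        if not ready:
            continue       # every warp is waiting on memory this cycle
        wid = ready.popleft()
        instr = warps[wid].pop(0)
        trace.append(f'{wid}:{instr}')
        if not warps[wid]:
            continue       # warp finished all its instructions
        if instr == 'LDR':
            stalled[wid] = 2      # latency hidden behind the other warp
        else:
            ready.append(wid)
    return trace

print(run({'w0': ['LDR', 'ADD'], 'w1': ['MUL', 'ADD']}))
# ['w0:LDR', 'w1:MUL', 'w1:ADD', 'w0:ADD']
```

While `w0` waits on its load, the core issues both of `w1`'s instructions, so no cycle is wasted - the essence of latency hiding.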
### Branch Divergence
tiny-gpu assumes that all threads in a single batch end up on the same PC after each instruction, meaning that threads can be executed in parallel for their entire lifetime.
In reality, individual threads could diverge from each other and branch to different lines based on their data. With different PCs, these threads would need to split into separate lines of execution, which requires managing diverging threads & paying attention to when threads converge again.
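Here's a small Python sketch of how SIMT hardware might serialize such a branch with an active mask: one pass runs the taken side with the other lanes masked off, a second pass runs the not-taken side with the mask inverted, and all lanes reconverge afterwards. The per-lane condition and operations are arbitrary examples:

```python
def simt_branch(thread_data):
    """Serialize an if/else over a warp using a per-lane active mask."""
    taken = [x % 2 == 0 for x in thread_data]  # per-lane branch condition
    results = list(thread_data)
    # First pass: only lanes where the branch is taken are active.
    for lane, active in enumerate(taken):
        if active:
            results[lane] = thread_data[lane] * 2   # "if" side
    # Second pass: the mask flips so the remaining lanes execute.
    for lane, active in enumerate(taken):
        if not active:
            results[lane] = thread_data[lane] + 1   # "else" side
    # Reconvergence point: all lanes continue together from here.
    return results

print(simt_branch([0, 1, 2, 3]))  # [0, 2, 4, 4]
```

The cost is visible in the model: the warp executes both sides of the branch, so divergent code pays for the sum of both paths rather than just one.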
### Synchronization & Barriers
Another core functionality of modern GPUs is the ability to set **barriers** so that groups of threads in a block can synchronize and wait until all other threads in the same block have gotten to a certain point before continuing execution.
This is useful for cases where threads need to exchange shared data with each other so they can ensure that the data has been fully processed.
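As a software analogy, Python's `threading.Barrier` shows the same pattern: each thread writes its result, waits at the barrier, and only then reads another thread's result. The worker logic here is a made-up example, not how GPU barriers are implemented in hardware:

```python
import threading

NUM_THREADS = 4
shared = [0] * NUM_THREADS            # stand-in for block shared memory
results = [0] * NUM_THREADS
barrier = threading.Barrier(NUM_THREADS)

def worker(tid):
    shared[tid] = tid * 10            # produce this thread's value
    barrier.wait()                    # block until every thread has written
    # Safe to read a neighbor's slot: the barrier guarantees it's written.
    results[tid] = shared[(tid + 1) % NUM_THREADS]

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [10, 20, 30, 0]
```

Without the `barrier.wait()`, a fast thread could read its neighbor's slot before the neighbor had written it - exactly the race that GPU barriers exist to prevent.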
# Next Steps
Updates I want to make in the future to improve the design, anyone else is welcome to contribute as well:
- [ ] Build an adapter to use GPU with Tiny Tapeout 7
- [ ] Add basic branch divergence
- [ ] Add basic memory coalescing
- [ ] Add basic pipelining
- [ ] Optimize control flow and use of registers to improve cycle time
- [ ] Write a basic graphics kernel or add simple graphics hardware to demonstrate graphics functionality
**For anyone curious to play around or make a contribution, feel free to put up a PR with any improvements you'd like to add 😄**