**For anyone trying to run the simulation or play with this repo, please feel free to DM me on [twitter](https://twitter.com/majmudaradam) if you run into any issues - I want you to get this running!**
# Advanced Functionality
For the sake of simplicity, tiny-gpu omits many additional features implemented in modern GPUs that heavily improve performance & functionality. We'll discuss some of the most critical of these features in this section.
### Multi-layered Cache & Shared Memory
In modern GPUs, multiple levels of caches are used to minimize the amount of data that needs to be accessed from global memory. tiny-gpu implements only a single cache layer, storing recently accessed data, between the individual compute units requesting memory and the memory controllers.
Implementing multi-layered caches allows frequently accessed data to be cached more locally to where it's being used (with some caches within individual compute cores), minimizing load times for this data.
Different caching algorithms are used to maximize cache-hits - this is a critical dimension that can be improved on to optimize memory access.
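To make the replacement-policy idea concrete, here's a minimal software sketch (Python, not hardware) of a least-recently-used cache, one of the simpler policies a cache layer might use. The `LRUCache` class and its interface are hypothetical, purely for illustration - tiny-gpu's actual cache is simpler than this:

```python
from collections import OrderedDict

class LRUCache:
    """Toy cache with least-recently-used eviction (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> data, ordered by recency
        self.hits = 0
        self.misses = 0

    def read(self, address, load_from_memory):
        if address in self.lines:
            self.hits += 1
            self.lines.move_to_end(address)  # mark as most recently used
            return self.lines[address]
        self.misses += 1
        data = load_from_memory(address)     # slow path: go to global memory
        self.lines[address] = data
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)   # evict the least recently used line
        return data

# Simulated global memory: the data at address a is just a * 2.
cache = LRUCache(capacity=2)
for addr in [0, 1, 0, 2, 0, 1]:
    cache.read(addr, lambda a: a * 2)
print(cache.hits, cache.misses)  # 2 4
```

Both re-reads of address 0 hit; everything else misses, including the final read of address 1, which was evicted when address 2 was loaded.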
Additionally, GPUs often use **shared memory** for threads within the same block to access a single memory space that can be used to share results with other threads.
### Memory Coalescing
Another critical memory optimization used by GPUs is **memory coalescing.** Multiple threads running in parallel often need to access sequential addresses in memory (for example, a group of threads accessing neighboring elements in a matrix) - but each of these memory requests is issued separately.

Memory coalescing analyzes queued memory requests and combines neighboring requests into a single transaction, minimizing time spent on addressing and issuing all the requests together.
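As an illustration, here's a minimal Python sketch of the grouping step. Real hardware coalesces within fixed-size, aligned memory segments; the `coalesce` helper and its `max_width` parameter are simplified assumptions:

```python
def coalesce(addresses, max_width=4):
    """Combine neighboring memory requests into wider transactions."""
    transactions = []
    for addr in sorted(set(addresses)):
        last = transactions[-1] if transactions else None
        # Extend the previous transaction if this address is adjacent
        # to it and the transaction still has room.
        if last and addr == last[-1] + 1 and len(last) < max_width:
            last.append(addr)
        else:
            transactions.append([addr])
    return transactions

# Four threads reading neighboring matrix elements -> one transaction.
print(coalesce([3, 1, 0, 2]))    # [[0, 1, 2, 3]]
# A strided access pattern cannot be coalesced.
print(coalesce([0, 8, 16, 24]))  # [[0], [8], [16], [24]]
```

The contiguous accesses collapse into a single transaction, while the strided pattern still costs one transaction per request - which is why access patterns matter so much for GPU memory performance.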
### Pipelining
In the control flow for tiny-gpu, cores wait for one instruction to be executed on a group of threads before starting execution of the next instruction.
Modern GPUs use **pipelining** to stream execution of multiple sequential instructions at once while ensuring that instructions with dependencies on each other still get executed sequentially.
This helps to maximize resource utilization within cores as resources are not sitting idle while waiting (ex: during async memory requests).
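The cycle savings can be sketched with back-of-the-envelope arithmetic, assuming a hypothetical 3-stage pipeline and fully independent instructions (no hazards - real pipelines must also stall or forward around dependencies):

```python
def cycles_unpipelined(num_instructions, stages=3):
    # tiny-gpu style: each instruction finishes all stages before
    # the next one begins.
    return num_instructions * stages

def cycles_pipelined(num_instructions, stages=3):
    # With pipelining, a new independent instruction enters the pipe
    # every cycle once it is full, so only the first instruction pays
    # the full latency.
    return stages + (num_instructions - 1)

print(cycles_unpipelined(8))  # 24
print(cycles_pipelined(8))    # 10
```

Throughput approaches one instruction per cycle as the instruction stream grows, regardless of pipeline depth.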
### Warp Scheduling
Another strategy used to maximize resource utilization within cores is **warp scheduling.** This approach involves breaking blocks up into individual batches of threads that can be executed together.
Multiple warps can be executed on a single core simultaneously by executing instructions from one warp while another warp is waiting. This is similar to pipelining, but dealing with instructions from different threads.
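A toy Python model of this interleaving - the `run` function, the 2-cycle memory latency, and the instruction names are all made-up assumptions for illustration, picking ready warps round-robin and skipping warps stalled on memory:

```python
from collections import deque

def run(warps):
    """Interleave instructions from multiple warps on one core.

    warps: dict of warp id -> list of instructions. 'LDR' stalls its
    warp for 2 cycles (a toy memory latency) while other warps issue.
    """
    ready = deque(warps)   # warp ids currently able to issue
    stalled = {}           # warp id -> cycles left until its load returns
    trace = []             # one entry per issued instruction
    while ready or stalled:
        # Wake up warps whose memory requests have completed.
        for wid in list(stalled):
            stalled[wid] -= 1
            if stalled[wid] == 0:
                del stalled[wid]
                ready.append(wid)
        if not ready:
            continue       # every warp is waiting on memory this cycle
        wid = ready.popleft()
        instr = warps[wid].pop(0)
        trace.append(f'{wid}:{instr}')
        if not warps[wid]:
            continue       # warp finished all its instructions
        if instr == 'LDR':
            stalled[wid] = 2      # latency hidden behind the other warp
        else:
            ready.append(wid)
    return trace

print(run({'w0': ['LDR', 'ADD'], 'w1': ['MUL', 'ADD']}))
# ['w0:LDR', 'w1:MUL', 'w1:ADD', 'w0:ADD']
```

While `w0` waits on its load, the core issues both of `w1`'s instructions, so no cycle is wasted - the essence of latency hiding.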
### Branch Divergence
tiny-gpu assumes that all threads in a single batch end up on the same PC after each instruction, meaning that threads can be executed in parallel for their entire lifetime.
In reality, individual threads could diverge from each other and branch to different lines based on their data. With different PCs, these threads would need to split into separate lines of execution, which requires managing diverging threads & paying attention to when threads converge again.
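Here's a small Python sketch of how SIMT hardware might serialize such a branch with an active mask: one pass runs the taken side with the other lanes masked off, a second pass runs the not-taken side with the mask inverted, and all lanes reconverge afterwards. The per-lane condition and operations are arbitrary examples:

```python
def simt_branch(thread_data):
    """Serialize an if/else over a warp using a per-lane active mask."""
    taken = [x % 2 == 0 for x in thread_data]  # per-lane branch condition
    results = list(thread_data)
    # First pass: only lanes where the branch is taken are active.
    for lane, active in enumerate(taken):
        if active:
            results[lane] = thread_data[lane] * 2   # "if" side
    # Second pass: the mask flips so the remaining lanes execute.
    for lane, active in enumerate(taken):
        if not active:
            results[lane] = thread_data[lane] + 1   # "else" side
    # Reconvergence point: all lanes continue together from here.
    return results

print(simt_branch([0, 1, 2, 3]))  # [0, 2, 4, 4]
```

The cost is visible in the model: the warp executes both sides of the branch, so divergent code pays for the sum of both paths rather than just one.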
### Synchronization & Barriers
Another core functionality of modern GPUs is the ability to set **barriers** so that groups of threads in a block can synchronize and wait until all other threads in the same block have gotten to a certain point before continuing execution.
This is useful for cases where threads need to exchange shared data with each other so they can ensure that the data has been fully processed.
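As a software analogy, Python's `threading.Barrier` shows the same pattern: each thread writes its result, waits at the barrier, and only then reads another thread's result. The worker logic here is a made-up example, not how GPU barriers are implemented in hardware:

```python
import threading

NUM_THREADS = 4
shared = [0] * NUM_THREADS            # stand-in for block shared memory
results = [0] * NUM_THREADS
barrier = threading.Barrier(NUM_THREADS)

def worker(tid):
    shared[tid] = tid * 10            # produce this thread's value
    barrier.wait()                    # block until every thread has written
    # Safe to read a neighbor's slot: the barrier guarantees it's written.
    results[tid] = shared[(tid + 1) % NUM_THREADS]

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [10, 20, 30, 0]
```

Without the `barrier.wait()`, a fast thread could read its neighbor's slot before the neighbor had written it - exactly the race that GPU barriers exist to prevent.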
# Next Steps
Updates I want to make in the future to improve the design, anyone else is welcome to contribute as well:
- [ ] Build an adapter to use GPU with Tiny Tapeout 7
- [ ] Add basic branch divergence
- [ ] Add basic memory coalescing
- [ ] Add basic pipelining
- [ ] Optimize control flow and use of registers to improve cycle time
- [ ] Write a basic graphics kernel or add simple graphics hardware to demonstrate graphics functionality
**For anyone curious to play around or make a contribution, feel free to put up a PR with any improvements you'd like to add 😄**