
Commit efbd393: Update blog
1 parent: aa3468e

File tree: 1 file changed (+3, -3 lines)


content/blog/2025-10-27-1761560082.md

Lines changed: 3 additions & 3 deletions
@@ -18,13 +18,13 @@ This analogy can help understand the scale and performance penalty for data transfers…

For example, reading constantly from the Global Memory is like driving between the factory and the warehouse outside the city each time (with the traffic of city roads). This is much slower than going to the shed inside the factory (i.e. Shared Memory), and much, much slower than just sticking your hand into the tray next to your stamping machine (i.e. Registers). And reading from the Host Memory (CPU) is like taking an overnight trip to another city.
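To make the warehouse/shed/tray analogy concrete, here is a minimal CUDA sketch (my own illustration; the post contains no code) in which each tier of the analogy maps onto a memory space. The kernel name `blockSum` and all sizes are arbitrary:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void blockSum(const float* in, float* out, int n) {
    extern __shared__ float shed[];              // Shared Memory: the shed inside the factory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float tray = (i < n) ? in[i] : 0.0f;         // Register: the tray next to the machine
    shed[threadIdx.x] = tray;                    // one short trip to the shed
    __syncthreads();

    // Tree reduction entirely inside the shed; no warehouse trips until the end.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) shed[threadIdx.x] += shed[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = shed[0];  // Global Memory: the warehouse outside the city
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = n / threads;
    float* h = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h, n * sizeof(float), cudaMemcpyHostToDevice);  // Host Memory: the overnight trip

    blockSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);

    float* h_out = (float*)malloc(blocks * sizeof(float));
    cudaMemcpy(h_out, d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    float total = 0.0f;
    for (int b = 0; b < blocks; b++) total += h_out[b];
    printf("sum = %.0f\n", total);  // expect 1048576
    return 0;
}
```

The staging pattern is exactly the logistics point above: batch the expensive warehouse trips, then do the repeated work out of the shed and the tray.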

-Therefore the job of running a computation graph (like ONNX) efficiently on GPU(s) is like planning the logistics of a manufacturing company. You've got raw materials in one city that you need to ship to potentially different cities, and store and process them across different factories and machines. And you need to make sure that the production process follows the chart laid out in the computation graph. You need to make sure that every machine in each factory is being utilized optimally, and account for the time it takes to move things between cities/factories/machines.
+Therefore the job of running a computation graph (like ONNX) efficiently on GPU(s) is like planning the logistics of a manufacturing company. You've got raw materials in the main warehouse that you need to transfer between cities, and store/process/transfer artifacts across different factories and machines. You need to make sure that the production process follows the chart laid out in the computation graph. And, most importantly, you need to make sure that every machine in each factory is being utilized optimally, and account for the time it takes to move things between cities/factories/machines.

If you're supporting multiple models, then you're dealing with multiple computation graphs. And if you're supporting multiple GPU vendors (NVIDIA, AMD, etc.), and multiple architectures from each vendor (e.g. 3060, 4080, 5080), then you're dealing with multiple factory configurations.

-You can analyze the computation graph ahead-of-time (AOT) and perform some obvious optimizations like fusing operations etc. And you can take the factory configuration (GPU specs) into account and plan the task division and schedule.
+So you can analyze the computation graph ahead-of-time (AOT) and perform some obvious optimizations like fusing operations, etc. And you can take the factory configuration (GPU specs) into account and plan the task division and schedule.
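As a hedged illustration of what "fusing operations" buys you (again my own sketch, not code from the post), compare an unfused multiply-then-add against a fused version. The unfused pipeline makes an extra round trip to Global Memory for the intermediate result; the fused kernel keeps it in a register:

```cuda
// Unfused: two kernels, with an intermediate round trip to the "warehouse".
__global__ void mulKernel(const float* a, const float* b, float* tmp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = a[i] * b[i];             // write the intermediate to Global Memory
}
__global__ void addKernel(const float* tmp, const float* c, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp[i] + c[i];           // read it right back
}

// Fused: one kernel; the intermediate never leaves the "tray" (a register).
__global__ void fusedMulAdd(const float* a, const float* b, const float* c,
                            float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = a[i] * b[i];                   // stays in a register
        out[i] = t + c[i];
    }
}

// Usage (same launch shape either way), e.g.:
//   fusedMulAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, d_out, n);
```

Fusing also saves a kernel launch and the `tmp` allocation, which is why it is one of the "obvious" AOT optimizations.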

-But it might also make sense to have a "realtime" supervisor. This supervisor would get realtime information about how things are actually going, and would adjust the task division and layout in real time. Maybe even change the compiled graph in realtime.
+But it might also make sense to have a "realtime" supervisor. This supervisor would get realtime information about how things are actually going, and adjust the task division and layout in real time. Maybe even change the compiled graph in realtime.
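What such a supervisor could look like is necessarily speculative; here is a rough sketch under assumptions the post doesn't make: two visible GPUs, a `busyKernel`/`runPartition` pair standing in for "run this device's share of the graph", and a naive rebalancing rule driven by CUDA event timings:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cmath>

constexpr int TOTAL = 1 << 20;
float* devBuf[2];

__global__ void busyKernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 1.0001f + 0.5f;     // stand-in for one partition of the real graph
}

// Hypothetical stand-in for "run this device's share of the compiled graph".
void runPartition(int device, float fraction) {
    int n = (int)(fraction * TOTAL);
    if (n > 0) busyKernel<<<(n + 255) / 256, 256>>>(devBuf[device], n);
}

// Time one step on one device with CUDA events.
float timeStepMs(int device, float fraction) {
    cudaSetDevice(device);
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    runPartition(device, fraction);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return ms;
}

int main() {
    for (int d = 0; d < 2; d++) {                // assumes two GPUs are visible
        cudaSetDevice(d);
        cudaMalloc(&devBuf[d], TOTAL * sizeof(float));
    }
    float split = 0.5f;                          // fraction of work assigned to device 0
    for (int step = 0; step < 100; step++) {
        float t0 = timeStepMs(0, split);
        float t1 = timeStepMs(1, 1.0f - split);
        // Rebalance: shift work toward whichever "factory" finished earlier.
        // Equal times leave the split unchanged.
        split = fminf(fmaxf(split * 2.0f * t1 / (t0 + t1), 0.05f), 0.95f);
        printf("step %d: split=%.2f (t0=%.2f ms, t1=%.2f ms)\n", step, split, t0, t1);
    }
    return 0;
}
```

A real supervisor would run the two partitions concurrently (separate streams or threads) instead of timing them back to back, and would feed the measurements into whatever planned the AOT layout; the serial loop here just keeps the sketch short.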

Notes:
1. Apple Silicon and mobile devices use a concept of "unified memory", so they don't have an overnight trip between cities. You can think of Apple Silicon as neighboring cities that almost overlap, like twin cities in some countries.
