
EuroSys '21 | Accelerating Graph Sampling for Graph Machine Learning using GPUs #313

Open
jasperzhong opened this issue Jul 6, 2022 · 2 comments

@jasperzhong
Owner

https://dl.acm.org/doi/pdf/10.1145/3447786.3456244

@jasperzhong jasperzhong self-assigned this Jul 6, 2022
@jasperzhong
Owner Author

I didn't expect the background section of this paper to improve my understanding of GPUs...

The first is about warps. When an SM executes a thread block, it schedules a subset of the block's threads at a time, called a warp (usually 32 consecutive threads). GPUs use the SIMT execution model: all threads in a warp run the same instruction in lock-step. Note: the same instruction. This means that when a branch is hit, the threads in the warp that do not take the branch have to wait for the threads that do take it to finish before execution can continue. This phenomenon is called warp divergence, and it can hurt performance badly.
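
A minimal sketch of this (my own, not from the paper; kernel names are hypothetical): in `divergent` the even and odd threads of every warp take different sides of the branch, so each warp executes both paths serially with half its lanes masked off; in `uniform` the condition is constant within each 32-thread warp, so no warp diverges.

```cuda
#include <cuda_runtime.h>

// Divergent: even and odd threads of the SAME warp take different paths,
// so the hardware runs both paths one after the other, each time with
// half of the warp's lanes masked off.
__global__ void divergent(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        out[i] = sinf((float)i);
    else
        out[i] = cosf((float)i);
}

// Uniform: i / 32 is the warp index, so the condition is the same for all
// 32 threads of a warp; every warp takes a single path and never diverges.
__global__ void uniform(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0)
        out[i] = sinf((float)i);
    else
        out[i] = cosf((float)i);
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    divergent<<<n / 256, 256>>>(d);
    uniform<<<n / 256, 256>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```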

The second is that an SM cannot context-switch between thread blocks. Suppose two thread blocks want to run on some SM: if thread block A stalls during execution (e.g. on memory latency), the SM cannot context-switch to thread block B. (What the SM can do is interleave warps that are already resident on it, since every resident warp's registers stay on-chip; it just cannot swap a block's state out to memory to make room for another.) This is very different from a CPU, which can context-switch easily (thread A saves its registers to memory).
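
To make the contrast concrete, a sketch under my own assumptions (the kernel and sizes are made up): the GPU's substitute for context switching is to keep many warps resident per SM and let the warp scheduler pick whichever warp is ready each cycle.

```cuda
#include <cuda_runtime.h>

// Memory-bound kernel: each warp stalls on the global load of x[i].
// The SM hides the stall by issuing instructions from OTHER resident
// warps, whose registers never leave the chip. It does not save this
// block's registers to memory to bring in a block that is still waiting.
__global__ void axpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 22;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    // 256-thread blocks give 8 warps per block; with several blocks
    // resident per SM the scheduler almost always has a ready warp to
    // issue while others wait on memory.
    axpy<<<(n + 255) / 256, 256>>>(2.0f, x, y, n);
    cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```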

The last one is something I knew of but was never quite clear on: simultaneous accesses to consecutive global memory addresses by threads in the same warp can be coalesced. Global memory latency is high, so merging accesses improves throughput. I knew about this optimization but had never used it.
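
A small sketch of the difference (hypothetical kernels, not from the paper): in `copy_coalesced` each warp reads 32 consecutive floats (128 contiguous bytes), which the hardware merges into one or a few transactions; in `copy_strided` the same warp touches addresses `stride` floats apart and can generate up to 32 separate transactions for the same amount of useful data.

```cuda
#include <cuda_runtime.h>

// Coalesced: thread k of a warp reads in[base + k], so the warp's 32
// loads fall in one 128-byte segment and merge into few transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: addresses within a warp are stride * 4 bytes apart, so one
// warp may touch up to 32 different segments, wasting most of each one.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}

int main() {
    const int n = 1 << 22, stride = 32;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    copy_coalesced<<<(n + 255) / 256, 256>>>(in, out, n);
    copy_strided<<<(n / stride + 255) / 256, 256>>>(in, out, n, stride);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```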

@jasperzhong
Owner Author

I can't understand it.
