Add CudnnConvAlgoCache #3649
Conversation
// The best algorithm for larger workspace can also be used for smaller
// workspace
// There might be a case where only the memory size `pair.second.memory`
// was required by the best algorithm, even though a workspace of
// `pair.first` was supplied
Does this mean that when trial-running the same algorithm within one stream, different workloads in other streams can also affect the trial result?
Yes. Kernels within the same stream execute serially and do not interfere with each other, but kernels in different streams may execute concurrently. If other streams have kernels running during the trial run, the measured time will certainly become longer.
* Add CudnnConvAlgoCache
* refine

Former-commit-id: a2af59e
The current cuDNN conv algorithm cache has the following two problems:
Adding an extra global cache layer can mitigate the impact of problems 1 and 2 to some extent (in the multi-node case, problem 2 still cannot be avoided).
Because the workspace size used when inferring the algorithm at compile time is the cudnn buffer size, while at runtime the workspace size inferred at compile time is used, the cache key drops the workspace size information, and the cache is looked up based on the rule that "the best algorithm inferred with a larger workspace size also applies to a smaller workspace".
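As a rough illustration of the lookup rule above, here is a minimal, runnable C++ sketch (all type and field names are assumptions for illustration, not OneFlow's actual implementation): the cache key omits the workspace size, each entry records the workspace supplied at search time together with the memory the best algorithm actually needs, and a result found with a larger workspace is reused for a smaller one as long as the required memory still fits.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical perf record: the algorithm found and the workspace memory
// it actually requires (loosely mirrors cudnnConvolutionFwdAlgoPerf_t).
struct AlgoPerf {
  int algo;
  size_t memory;  // bytes actually required by this algorithm
};

// Cache keyed by the conv configuration WITHOUT the workspace size.
// Each entry keeps pairs of (workspace supplied at search time, best perf).
class CudnnConvAlgoCache {
 public:
  // Look up a cached result usable with `workspace` bytes available.
  // A result searched with a larger workspace is reusable for a smaller
  // one as long as the algorithm's required memory still fits.
  bool Find(const std::string& key, size_t workspace, AlgoPerf* out) const {
    auto it = cache_.find(key);
    if (it == cache_.end()) { return false; }
    for (const auto& pair : it->second) {
      // pair.first: workspace supplied when the search ran
      // pair.second.memory: memory the best algorithm actually needs
      if (pair.first >= workspace && pair.second.memory <= workspace) {
        *out = pair.second;
        return true;
      }
    }
    return false;
  }

  void Insert(const std::string& key, size_t workspace, const AlgoPerf& perf) {
    cache_[key].emplace_back(workspace, perf);
  }

 private:
  std::map<std::string, std::vector<std::pair<size_t, AlgoPerf>>> cache_;
};
```

For example, a result searched with a 1 MiB workspace whose best algorithm only needs 4 KiB can be reused when only 64 KiB is available, but must miss when only 1 KiB is available.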
TODO:
Solve problem 2 in the multi-node case. One candidate approach is to traverse the plan at runtime startup and "warm up" the CudnnConvAlgoCache; alternatively, we can discuss whether there is a better solution.
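The warm-up idea in the TODO could look roughly like the following minimal sketch (`PlanOp`, `SearchBestAlgo`, and the plain `std::map` cache are illustrative placeholders, not OneFlow's real plan schema or API): walk the plan once at startup and run the algorithm search for every distinct conv configuration, so the cache is fully populated before the first iteration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical op record extracted from the compiled plan; the field
// names are illustrative, not OneFlow's actual plan schema.
struct PlanOp {
  std::string type;      // e.g. "conv2d"
  std::string conv_key;  // cache key derived from the conv configuration
};

// Placeholder for the (expensive) cuDNN algorithm search; here it just
// returns a dummy algorithm id so the sketch stays runnable.
int SearchBestAlgo(const std::string& /*conv_key*/) { return 1; }

// Walk the plan once at runtime startup and run the algorithm search for
// every conv op, so the algo cache is already populated ("warmed up")
// before the first training iteration begins.
void WarmUpConvAlgoCache(const std::vector<PlanOp>& plan,
                         std::map<std::string, int>* cache) {
  for (const PlanOp& op : plan) {
    if (op.type != "conv2d") { continue; }
    if (cache->count(op.conv_key) != 0) { continue; }  // already cached
    (*cache)[op.conv_key] = SearchBestAlgo(op.conv_key);
  }
}
```

Note that with a per-node warm-up like this, each node still runs its own trial searches, so by itself it does not guarantee all nodes pick the same algorithm in the multi-node case; it only moves the trial runs out of the steady-state timeline.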