[PIR] speed up pir interpreter by event #69513
Your PR was submitted successfully. Thank you for contributing to this open-source project!
@@ -224,6 +230,8 @@ class InstructionBase {
  std::unordered_map<::pir::Value, std::vector<int>> output_index_;

  std::unordered_set<::pir::Value> no_need_buffer_values_;

  int need_record_stream_for_gc_{0};  // 0:not init, 1:need record, 2:not need
Isn't this design more complex than necessary? Could the default simply be 1, meaning recording is needed, with 0 meaning it is not?
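To make the trade-off in this comment concrete, here is a minimal sketch that compares the tri-state encoding from the diff with the two-state alternative the reviewer suggests. Only the 0/1/2 encoding comes from the diff comment; `RecordState`, `TriStateCache`, `TwoStateCache`, and the `decide` callback are hypothetical names for illustration and are not the PR's actual code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names; only the 0/1/2 encoding is taken from the diff comment.
enum class RecordState : std::uint8_t {
  kNotInit = 0,     // not decided yet
  kNeedRecord = 1,  // stream recording is required
  kNotNeed = 2      // stream recording can be skipped
};

struct TriStateCache {
  RecordState state{RecordState::kNotInit};
  int analyses{0};  // how many times the decision callback ran

  // `decide` stands in for whatever check makes the real decision.
  template <typename Decide>
  bool NeedRecord(Decide decide) {
    if (state == RecordState::kNotInit) {
      ++analyses;
      state = decide() ? RecordState::kNeedRecord : RecordState::kNotNeed;
    }
    assert(state != RecordState::kNotInit);
    return state == RecordState::kNeedRecord;
  }
};

// Two-state variant: default to "needed", clear once found unnecessary.
struct TwoStateCache {
  bool need_record{true};
  int analyses{0};

  template <typename Decide>
  bool NeedRecord(Decide decide) {
    if (need_record) {  // the "needed" path doubles as the first-run path
      ++analyses;
      need_record = decide();
    }
    return need_record;
  }
};
```

In this sketch the two-state variant re-runs the decision on every call while the answer stays "needed", whereas the tri-state variant decides once. Whether that difference matters depends on how expensive the real check is, which the page does not state.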
for (const EventInter& event_iter : events_to_wait_) {
  // If InterpreterCore in on CPUPlace, do nothing.
  if (phi::is_cpu_place(place)) {
Does this change bring any performance optimization or improvement?
for (const EventInter& event_iter : events_to_wait_) {
  // If InterpreterCore in on CPUPlace, do nothing.
  if (phi::is_cpu_place(place)) {
Same as above.
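The multi-stream analysis in StreamAnalyzer already handles some event-pruning cases. Could this new pruning logic be implemented directly by extending StreamAnalyzer? The reasons are as follows:
- In scenarios such as distributed cross-step multi-stream overlap and dynamic switching of execution streams during inference, the events that actually need a record operation may differ between the first run and later runs, so what the first run records is not fully accurate.
- It avoids intrusive changes to the allocation module.
- PIR's execution architecture was intended to decouple IR analysis from scheduling and execution by making scheduling information fully static. Deriving static information from one dynamic run may hinder later upgrades; a direct static pre-analysis would be better.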
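The first concern above, that a snapshot taken during the first run can go stale, can be illustrated with a toy model. This is a sketch under assumptions: `EventIds`, `FirstRunSnapshot`, and `MissedEvents` are hypothetical names, and the integer event ids have no connection to the real interpreter's data structures.

```cpp
#include <set>

// Hypothetical model: each step lists the event ids that really need a record.
using EventIds = std::set<int>;

// Snapshot taken during the first step, then reused for all later steps.
struct FirstRunSnapshot {
  bool initialized{false};
  EventIds cached;

  const EventIds& EventsToRecord(const EventIds& needed_this_step) {
    if (!initialized) {
      cached = needed_this_step;  // only the first step is ever observed
      initialized = true;
    }
    return cached;
  }
};

// Returns the events a step needed but the reused snapshot would skip.
EventIds MissedEvents(const EventIds& needed, const EventIds& recorded) {
  EventIds missed;
  for (int id : needed) {
    if (recorded.count(id) == 0) missed.insert(id);
  }
  return missed;
}
```

If a later step needs an event that the first step did not, `MissedEvents` returns it. That is the situation the reviewer describes for cross-step overlap and dynamic stream switching.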
… speed_up_pir_interpreter_by_event
… speed_up_pir_interpreter_by_event
LGTM
LGTM
PR Category
Execute Infrastructure
PR Types
Performance
Description
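This PR attempts to reduce the CPU scheduling overhead caused by the event-related operations that multi-stream execution introduces.
As the figure shows:
In one of 炤伍's test cases, this PR improves the performance of CheckGC as a whole by 25%.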
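Because the cache strategy in PirInterpreter::RecordStreamForGC may affect scenarios such as distributed cross-step multi-stream overlap and dynamic switching of execution streams during inference, this PR adds FLAGS_pir_interpreter_record_stream_for_gc_cache, which is off by default.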
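The sketch below shows one way a cache of this kind could be gated behind a flag that defaults to off. It assumes a plain global `g_record_stream_for_gc_cache` as a hypothetical stand-in for FLAGS_pir_interpreter_record_stream_for_gc_cache; `GcRecordDecision` and the `decide` callback are also illustrative and do not reproduce the actual PirInterpreter::RecordStreamForGC code.

```cpp
// Hypothetical stand-in for FLAGS_pir_interpreter_record_stream_for_gc_cache.
// A plain global here, off by default as the PR description states.
bool g_record_stream_for_gc_cache = false;

struct GcRecordDecision {
  int cached{0};       // 0: not init, 1: need record, 2: not need (encoding from the diff)
  int evaluations{0};  // how many times the decision callback ran

  // `decide` stands in for the real per-instruction check.
  template <typename Decide>
  bool NeedRecord(Decide decide) {
    if (g_record_stream_for_gc_cache && cached != 0) {
      return cached == 1;  // flag on: reuse the first result
    }
    ++evaluations;
    const bool need = decide();
    if (g_record_stream_for_gc_cache) {
      cached = need ? 1 : 2;  // only remember the result when the flag is on
    }
    return need;
  }
};
```

With the flag off, every call re-evaluates the decision, so a change between steps is picked up. With the flag on, the first answer is reused, which is faster but can miss such a change.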
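StreamAnalyzer is not used for now because it would require a large amount of development work, and a "planned GPU memory" strategy is envisioned for the future. For these reasons, GC is not yet optimized through StreamAnalyzer.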
Pcard-67164