[PIR] speed up pir interpreter by event #69513

@wanghuancoder wanghuancoder commented Nov 19, 2024

PR Category

Execute Infrastructure

PR Types

Performance

Description

This PR attempts to reduce the CPU scheduling overhead caused by the event-related operations that multi-stream execution introduces.

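  • Key multi-stream functions
  1. PirInterpreter::RecordMemcpyD2H: the main stream records an event (RecordEvent) that the D2H stream waits on before it copies data.
  2. InstructionBase::WaitEvent: called before a kernel runs, to wait until prerequisite work on other streams has finished. For the D2H stream, this means waiting for the relevant kernels on the main stream to finish. Every Instruction calls WaitEvent; on the main stream it runs as a no-op even when there are no prerequisite dependencies.
  3. PirInterpreter::RecordStreamForGC: the D2H stream records an event that marks when the data copy completes. The Allocation used by the D2H stream belongs to the main stream, so the main stream must not release it under the FastGC policy until the D2H stream has finished with it. The Allocation can be reclaimed only after the event has occurred.
  4. StreamSafeCUDAAllocator::FreeImpl: (1) returns GPU memory to the cache pool; (2) may, with low probability, trigger cudaFree; (3) on every Free, the main stream uses EventQuery to check whether the D2H stream's copy has finished and, if so, reclaims the Allocations whose release was deferred. If there are currently no deferred Allocations to reclaim, EventQuery is not executed.
  5. InstructionBase::RecordEvent: called after every kernel finishes; a no-op in most cases. I have not traced when a Record is actually needed.
  • Typical timeline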
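To make the deferred release in item 4 concrete, here is a minimal host-only sketch. It is not Paddle's implementation and makes no CUDA calls: `FakeEvent`, `Allocation`, `DeferredFreePool` and `DeferredFreeDemo` are hypothetical names, and `CanRelease` only stands in for an event query on the events that other streams recorded.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a CUDA event. `done` flips to true once the side
// stream (for example the D2H copy) has finished using the memory.
struct FakeEvent {
  bool done = false;
};

struct Allocation {
  int id = 0;
  // Events recorded by other streams that still use this allocation.
  std::vector<std::shared_ptr<FakeEvent>> pending_events;
};

class DeferredFreePool {
 public:
  // Called when the owning (main) stream no longer needs `alloc`.
  void Free(Allocation alloc) {
    // Poll old events only if some earlier allocation is still waiting.
    if (!deferred_.empty()) ReleaseFinished();
    if (CanRelease(alloc)) {
      released_ids_.push_back(alloc.id);
    } else {
      deferred_.push_back(std::move(alloc));
    }
  }

  std::size_t deferred_count() const { return deferred_.size(); }
  int event_queries() const { return event_queries_; }
  const std::vector<int>& released_ids() const { return released_ids_; }

 private:
  // Plays the role of an event query: true when every pending event is done.
  bool CanRelease(const Allocation& alloc) {
    for (const auto& ev : alloc.pending_events) {
      ++event_queries_;
      if (!ev->done) return false;
    }
    return true;
  }

  // Releases every deferred allocation whose events have all completed.
  void ReleaseFinished() {
    for (auto it = deferred_.begin(); it != deferred_.end();) {
      if (CanRelease(*it)) {
        released_ids_.push_back(it->id);
        it = deferred_.erase(it);
      } else {
        ++it;
      }
    }
  }

  std::list<Allocation> deferred_;
  std::vector<int> released_ids_;
  int event_queries_ = 0;
};

// Usage: allocation 1 is still being copied when freed, so it is deferred
// until a later Free observes that its event has completed.
inline std::vector<int> DeferredFreeDemo() {
  DeferredFreePool pool;
  auto copy_done = std::make_shared<FakeEvent>();
  Allocation a;
  a.id = 1;
  a.pending_events.push_back(copy_done);
  pool.Free(a);
  assert(pool.deferred_count() == 1);
  Allocation b;
  b.id = 2;
  pool.Free(b);            // released at once; allocation 1 still waits
  copy_done->done = true;  // the side stream finishes
  Allocation c;
  c.id = 3;
  pool.Free(c);            // reclaims allocation 1, then releases 3
  return pool.released_ids();
}
```

In this sketch, freeing allocations that no other stream touched performs no queries at all, which matches the observation that most Free calls never reach EventQuery.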

[Image: timeline screenshot (1732067968793)]

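The timeline shows that:

  1. PirInterpreter::RecordStreamForGC is a no-op in the vast majority of calls, yet it takes a lot of time.
  2. StreamSafeCUDAAllocator::Free and AutoGrowthBestFitAllocator::Free run when GC is needed and take a fair amount of time. AutoGrowthBestFitAllocator::Free is unavoidable; the time spent in StreamSafeCUDAAllocator::Free is overhead introduced by multi-stream management. In the vast majority of calls, StreamSafeCUDAAllocator::Free does not call EventQuery.
  3. InstructionBase::RecordEvent is a no-op in the vast majority of calls and takes little time.
  • Optimizations in this PR:
  1. Adjust InstructionBase::RecordEvent and InstructionBase::WaitEvent to reduce unnecessary overhead.
  2. Adjust Instruction::RecordEvent and Instruction::WaitEvent to reduce unnecessary overhead.
  3. Add a cache-based pruning mechanism for PirInterpreter::RecordStreamForGC: every Instruction records, based on the result of its first run, whether it needs RecordStream. This reduces unnecessary overhead.
  • Follow-up optimizations
  1. Free could be moved to a separate thread, but this seems to have been tried early on and may have been abandoned because of some problems.
  2. Multi-threaded Free may delay the release of memory and thereby raise peak GPU memory usage.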

In one of 炤伍's test cases, this PR improves the performance of CheckGC as a whole by 25%.

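Because the cache strategy for PirInterpreter::RecordStreamForGC may affect scenarios such as distributed cross-step multi-stream overlap and dynamic switching of execution streams during inference, this PR adds FLAGS_pir_interpreter_record_stream_for_gc_cache, which is off by default.
StreamAnalyzer is not used for now because it would require a large amount of development work, and there is also an idea for a future "planned GPU memory" strategy. For the time being, GC is not optimized through StreamAnalyzer.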
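As a rough illustration of the cache pruning described above (a sketch under stated assumptions, not the code in this PR): `Instr`, `Interp` and `CountFullChecks` are hypothetical, `cache_enabled` stands in for FLAGS_pir_interpreter_record_stream_for_gc_cache, and `output_used_on_other_stream` replaces the real per-variable check. Only the three states of need_record_stream_for_gc_ come from the diff.

```cpp
#include <cassert>

// Mirrors the three states of need_record_stream_for_gc_ in the diff.
enum class RecordState { kNotInit = 0, kNeed = 1, kNotNeed = 2 };

// Hypothetical, heavily simplified instruction.
struct Instr {
  bool output_used_on_other_stream = false;  // assumption for illustration
  RecordState record_state = RecordState::kNotInit;
};

// Hypothetical interpreter stand-in that counts work instead of doing it.
struct Interp {
  // Stand-in for FLAGS_pir_interpreter_record_stream_for_gc_cache (off by default).
  bool cache_enabled = false;
  int full_checks = 0;  // times the costly scan ran
  int records = 0;      // times a stream was actually recorded

  void RecordStreamForGC(Instr& instr) {
    // Fast path: the first run showed that this instruction never records.
    if (cache_enabled && instr.record_state == RecordState::kNotNeed) return;

    ++full_checks;  // stands in for scanning the instruction's variables
    const bool need = instr.output_used_on_other_stream;
    if (need) ++records;

    // Remember the result of the first run for later runs.
    if (cache_enabled && instr.record_state == RecordState::kNotInit) {
      instr.record_state = need ? RecordState::kNeed : RecordState::kNotNeed;
    }
  }
};

// Runs `steps` iterations over two instructions and returns the number of
// full checks that were executed.
inline int CountFullChecks(bool cache_enabled, int steps) {
  assert(steps >= 0);
  Interp interp;
  interp.cache_enabled = cache_enabled;
  Instr plain;         // never needs a record
  Instr d2h_producer;  // always needs a record
  d2h_producer.output_used_on_other_stream = true;
  for (int i = 0; i < steps; ++i) {
    interp.RecordStreamForGC(plain);
    interp.RecordStreamForGC(d2h_producer);
  }
  return interp.full_checks;
}
```

With the stand-in flag off, 10 steps perform 20 full checks; with it on, the instruction that never records is checked once and skipped afterwards, leaving 11.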

Pcard-67164

paddle-bot bot commented Nov 19, 2024

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@@ -224,6 +230,8 @@ class InstructionBase {
std::unordered_map<::pir::Value, std::vector<int>> output_index_;

std::unordered_set<::pir::Value> no_need_buffer_values_;

int need_record_stream_for_gc_{0}; // 0:not init, 1:need record, 2:not need

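Isn't this design more complicated than necessary? Could the default simply be 1, meaning a record is needed, with 0 meaning it is not?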

for (const EventInter& event_iter : events_to_wait_) {
// If InterpreterCore in on CPUPlace, do nothing.
if (phi::is_cpu_place(place)) {

Does this change bring any performance optimization or improvement?

for (const EventInter& event_iter : events_to_wait_) {
// If InterpreterCore in on CPUPlace, do nothing.
if (phi::is_cpu_place(place)) {

Same as above.

@From00 From00 left a comment

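StreamAnalyzer's multi-stream analysis already covers some event-pruning scenarios. Could this new pruning logic be implemented directly by extending StreamAnalyzer? The reasons are as follows:

  1. In scenarios such as distributed cross-step multi-stream overlap and dynamic switching of execution streams during inference, the events that actually need a record operation may differ between the first run and later runs, so what the first run records is not fully accurate.
  2. It avoids intrusive changes to the allocation module.
  3. In its execution architecture, PIR previously aimed to decouple IR analysis from scheduling and execution by making the scheduling information fully static. Deriving static information from one dynamic run may hinder later upgrades in this area; implementing it directly through static pre-analysis would be better.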
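The concern in point 1 can be shown with a small hypothetical sketch. `Instr`, `RecordWithFirstRunCache` and `CountMissedRecords` are illustrative names, not Paddle APIs, and the assumption is that an instruction starts to need a record at `switch_step`, after the first run has already been cached.

```cpp
#include <cassert>

// Hypothetical instruction whose stream usage can change between steps.
struct Instr {
  bool runs_on_side_stream = false;  // assumption: flips in a later step
  int cached = 0;  // 0: not init, 1: need record, 2: not need (as in the PR)
};

// Returns true if a record was issued for this run.
inline bool RecordWithFirstRunCache(Instr& instr) {
  if (instr.cached == 2) return false;  // pruned based on the first run
  const bool need = instr.runs_on_side_stream;
  if (instr.cached == 0) instr.cached = need ? 1 : 2;
  return need;
}

// Returns how many steps needed a record but did not get one.
inline int CountMissedRecords(int steps, int switch_step) {
  assert(steps >= 0);
  Instr instr;
  int missed = 0;
  for (int step = 0; step < steps; ++step) {
    if (step == switch_step) instr.runs_on_side_stream = true;
    const bool needed = instr.runs_on_side_stream;
    const bool recorded = RecordWithFirstRunCache(instr);
    if (needed && !recorded) ++missed;
  }
  return missed;
}
```

If the switch happens at step 2 of 5, the cached "not need" from step 0 suppresses the record in steps 2, 3 and 4; if the instruction already needs the record in step 0, nothing is missed.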

zhangbo9674 previously approved these changes Nov 22, 2024
@zhangbo9674 zhangbo9674 left a comment

LGTM

@wanghuancoder wanghuancoder changed the title Speed up pir interpreter by event [PIR] speed up pir interpreter by event Nov 22, 2024
@zhangbo9674 zhangbo9674 left a comment

LGTM

@wanghuancoder wanghuancoder merged commit ea50ffe into PaddlePaddle:develop Nov 22, 2024
27 of 28 checks passed
4 participants