[PIR] speed up pir interpreter by event #69513

wanghuancoder · 2024-11-19T11:28:16Z

PR Category

Execute Infrastructure

PR Types

Performance

Description

This PR attempts to reduce the CPU scheduling overhead caused by the event-related operations that multi-stream execution introduces.

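Key multi-stream functions

PirInterpreter::RecordMemcpyD2H: the main stream records an event (RecordEvent), which the D2H stream waits on before copying data.
InstructionBase::WaitEvent: called before a kernel executes, to wait until prerequisite work on other streams has finished. For the D2H stream, this means waiting for the relevant kernels on the main stream to finish. Every Instruction calls WaitEvent; on the main stream it runs as a no-op even when there is no such dependency.
PirInterpreter::RecordStreamForGC: the D2H stream records an event to mark when the data copy completes. The Allocation used by the D2H stream belongs to the main stream, so until the D2H stream has finished with it, the main stream must not release the Allocation under the FastGC rule. It can be reclaimed only after the event has occurred.
StreamSafeCUDAAllocator::FreeImpl: 1) releases device memory back to the buffer pool; 2) may, with low probability, trigger cudaFree; 3) on every Free, the main stream must use EventQuery to check whether the D2H stream's copy has finished, and if so, reclaims the Allocations whose release was deferred. If there are currently no deferred Allocations to reclaim, EventQuery is not executed.
InstructionBase::RecordEvent: called each time a kernel finishes; a no-op in most cases. When a Record is actually needed has not been traced.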
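
The interaction between RecordStreamForGC and FreeImpl described above can be sketched without any GPU code. The following is a minimal simulation, not Paddle's implementation: `FakeEvent`, `FakeAllocation`, and `DeferredFreeSketch` are hypothetical names, a plain `bool` stands in for a device event, and the exact order of the checks inside `Free` is an assumption. It only illustrates that an allocation with an unfinished event is parked, and that event queries happen only while deferred allocations exist.

```cpp
#include <cstddef>
#include <list>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a device event: `done` becomes true once the
// work recorded on the other stream (e.g. a D2H copy) has finished.
struct FakeEvent {
  bool done = false;
};

// Hypothetical allocation that remembers events recorded by other streams.
struct FakeAllocation {
  std::vector<std::shared_ptr<FakeEvent>> outstanding_events;
};

class DeferredFreeSketch {
 public:
  // Plays the role of RecordStreamForGC: another stream marks that it
  // still uses this allocation until `event` completes.
  void RecordStream(FakeAllocation* allocation,
                    std::shared_ptr<FakeEvent> event) {
    allocation->outstanding_events.push_back(std::move(event));
  }

  // Plays the role of FreeImpl: release immediately when no other stream
  // recorded an event, otherwise defer. Deferred allocations are re-checked
  // on every later Free (assumed ordering, for illustration only).
  void Free(std::unique_ptr<FakeAllocation> allocation) {
    if (allocation->outstanding_events.empty()) {
      ++released_;  // nothing to wait for
    } else {
      deferred_.push_back(std::move(allocation));
    }
    if (!deferred_.empty()) ReclaimFinished();  // no queries if list is empty
  }

  std::size_t released() const { return released_; }
  std::size_t deferred() const { return deferred_.size(); }
  std::size_t event_queries() const { return event_queries_; }

 private:
  // Counts one "event query" per event inspected.
  bool AllEventsDone(const FakeAllocation& allocation) {
    for (const auto& event : allocation.outstanding_events) {
      ++event_queries_;
      if (!event->done) return false;
    }
    return true;
  }

  void ReclaimFinished() {
    for (auto it = deferred_.begin(); it != deferred_.end();) {
      if (AllEventsDone(**it)) {
        it = deferred_.erase(it);
        ++released_;
      } else {
        ++it;
      }
    }
  }

  std::list<std::unique_ptr<FakeAllocation>> deferred_;
  std::size_t released_ = 0;
  std::size_t event_queries_ = 0;
};
```

In this sketch, freeing an allocation that no other stream touched costs zero event queries, which matches the observation that most Free calls never reach EventQuery.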

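Typical timeline

The figure shows:

PirInterpreter::RecordStreamForGC: the vast majority of calls are no-ops, yet they take a lot of time.
StreamSafeCUDAAllocator::Free and AutoGrowthBestFitAllocator::Free: executed when GC is needed, and fairly time-consuming. AutoGrowthBestFitAllocator::Free is unavoidable; StreamSafeCUDAAllocator::Free is overhead introduced by multi-stream management. The vast majority of StreamSafeCUDAAllocator::Free calls do not invoke EventQuery.
InstructionBase::RecordEvent: the vast majority of calls are no-ops and take little time.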

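Optimizations in this PR:

Adjust InstructionBase::RecordEvent and InstructionBase::WaitEvent to reduce unnecessary overhead.
Adjust Instruction::RecordEvent and Instruction::WaitEvent to reduce unnecessary overhead.
Add a cache-based pruning mechanism to PirInterpreter::RecordStreamForGC: each Instruction records, based on the result of its first run, whether it needs RecordStream. This reduces unnecessary overhead.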
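
The cache-based pruning can be pictured with a small simulation. This is a sketch under stated assumptions rather than the PR's code: `FakeInstruction`, `RecordStreamCacheSketch`, and the `input_on_other_stream` field are hypothetical, and the tri-state enum only mirrors the 0/1/2 convention of the `need_record_stream_for_gc_` member quoted in the review thread. The first run inspects every input and caches the outcome; later runs of an instruction that needed nothing return immediately.

```cpp
#include <cstddef>
#include <vector>

// Mirrors the "0: not init, 1: need record, 2: not need" convention.
enum class NeedRecord { kNotInit = 0, kNeed = 1, kNoNeed = 2 };

// Hypothetical instruction: one flag per input saying whether that input's
// allocation belongs to a different stream than the one executing it.
struct FakeInstruction {
  std::vector<bool> input_on_other_stream;
  NeedRecord need_record = NeedRecord::kNotInit;
};

class RecordStreamCacheSketch {
 public:
  // Returns true if at least one input required a stream record.
  bool RecordStreamForGC(FakeInstruction* instr) {
    if (instr->need_record == NeedRecord::kNoNeed) {
      ++skipped_;  // pruned: the first run found nothing to record
      return false;
    }
    bool recorded = false;
    for (bool other_stream : instr->input_on_other_stream) {
      ++inspected_;  // stands in for the per-input cost being avoided
      if (other_stream) recorded = true;
    }
    if (instr->need_record == NeedRecord::kNotInit) {
      // Cache the outcome of the first run only.
      instr->need_record =
          recorded ? NeedRecord::kNeed : NeedRecord::kNoNeed;
    }
    return recorded;
  }

  std::size_t inspected() const { return inspected_; }
  std::size_t skipped() const { return skipped_; }

 private:
  std::size_t inspected_ = 0;
  std::size_t skipped_ = 0;
};
```

An instruction whose inputs all live on its own stream pays the inspection cost once and is skipped afterwards; an instruction that does need a record is still processed in full on every run.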

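Follow-up optimizations

Free could be moved to a separate thread. This seems to have been tried early on, however, and the approach may have been abandoned because of some problems.
Multi-threaded Free could delay frees and thereby raise peak device memory usage.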

In one of 炤伍's test cases, this PR yields a 25% performance gain for CheckGC as a whole.

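Because the cache strategy of PirInterpreter::RecordStreamForGC may affect scenarios such as distributed cross-step multi-stream overlap and dynamic switching of execution streams in inference, FLAGS_pir_interpreter_record_stream_for_gc_cache is added; it is off by default.
StreamAnalyzer is not used for now because it would require substantial development work, and there is also a future plan for a "planned memory" strategy. So for the time being, GC is not optimized through StreamAnalyzer.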
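
One way the concern behind the flag could play out is sketched below. Everything here is hypothetical: `StepInstruction`, `SecondStepRecords`, and the `use_cache` parameter (standing in for FLAGS_pir_interpreter_record_stream_for_gc_cache) are illustration-only names, and the real conditions under which a record is needed are more involved. The sketch only shows why a decision cached from the first run can go stale if the executing stream changes between steps.

```cpp
// Hypothetical instruction whose executing stream may change across steps.
struct StepInstruction {
  bool on_side_stream = false;
  int cached = 0;  // 0: not init, 1: need record, 2: not need
};

// Returns whether a stream record is issued on this run.
// `use_cache` stands in for the flag; false means "always re-check".
bool RecordStreamForGC(StepInstruction* instr, bool use_cache) {
  if (use_cache && instr->cached == 2) return false;  // pruned by the cache
  bool need = instr->on_side_stream;  // simplified "does it need a record?"
  if (use_cache && instr->cached == 0) instr->cached = need ? 1 : 2;
  return need;
}

// Runs two steps; the instruction moves to a side stream before step two.
// Returns whether step two issued the record it then requires.
bool SecondStepRecords(bool use_cache) {
  StepInstruction instr;
  RecordStreamForGC(&instr, use_cache);  // step 1: nothing to record
  instr.on_side_stream = true;           // execution stream switched
  return RecordStreamForGC(&instr, use_cache);
}
```

With the cache disabled, the second step notices the switch and records; with it enabled, the first step's "not needed" result suppresses the record, which is the kind of inaccuracy that keeping the flag off by default guards against.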

Pcard-67164

paddle-bot · 2024-11-19T11:28:23Z

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

zhangbo9674 · 2024-11-20T02:57:14Z

paddle/fluid/framework/new_executor/instruction/instruction_base.h

@@ -224,6 +230,8 @@ class InstructionBase {
  std::unordered_map<::pir::Value, std::vector<int>> output_index_;

  std::unordered_set<::pir::Value> no_need_buffer_values_;
+
+  int need_record_stream_for_gc_{0};  // 0:not init, 1:need record, 2:not need


Does this need to be so complicated? Could the default simply be 1, meaning a record is needed, with 0 meaning it is not?

zhangbo9674 · 2024-11-20T02:57:41Z

paddle/fluid/framework/new_executor/instruction/instruction_base.cc

  for (const EventInter& event_iter : events_to_wait_) {
+    // If InterpreterCore in on CPUPlace, do nothing.
+    if (phi::is_cpu_place(place)) {


Will this change bring any performance optimization or improvement?

zhangbo9674 · 2024-11-20T03:02:50Z

paddle/fluid/framework/new_executor/new_executor_defs.cc

  for (const EventInter& event_iter : events_to_wait_) {
+    // If InterpreterCore in on CPUPlace, do nothing.
+    if (phi::is_cpu_place(place)) {


From00

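StreamAnalyzer's multi-stream analysis already covers some event-pruning scenarios. Could this new pruning logic be implemented directly by enhancing StreamAnalyzer instead? The reasons are as follows:

In scenarios such as distributed cross-step multi-stream overlap and dynamic switching of execution streams in inference, the events that actually need a record operation may differ between the first round and later rounds, so what the first round records is not fully accurate.
It would avoid intrusive changes to the allocation module.
PIR's execution architecture was designed to decouple IR analysis from scheduling and execution by making scheduling information fully static. Deriving static information from one round of dynamic execution may create friction for later upgrades; implementing it directly as static pre-analysis would be preferable.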

zhangbo9674

LGTM

zhangbo9674

LGTM

wanghuancoder added 3 commits November 19, 2024 11:24

speed up pir_interpreter on multi stream

aa10d12

speed up pir_interpreter on multi stream

be8c692

speed up pir_interpreter on multi stream

f859797

wanghuancoder added 2 commits November 20, 2024 01:43

refine

e7fa609

refine

80878d2

zhangbo9674 reviewed Nov 20, 2024

From00 reviewed Nov 20, 2024

wanghuancoder added 6 commits November 21, 2024 01:49

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

6e22ec2

… speed_up_pir_interpreter_by_event

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

6016084

… speed_up_pir_interpreter_by_event

refine

c48956c

refine

96ee6c2

refine

a3eca64

refine

b246ea3

zhangbo9674 previously approved these changes Nov 22, 2024

refine

b15d3e8

wanghuancoder dismissed zhangbo9674’s stale review via b15d3e8 November 22, 2024 02:53

wanghuancoder changed the title ~~Speed up pir interpreter by event~~ [PIR] speed up pir interpreter by event Nov 22, 2024

zhangbo9674 approved these changes Nov 22, 2024

luotao1 approved these changes Nov 22, 2024

wanghuancoder merged commit ea50ffe into PaddlePaddle:develop Nov 22, 2024
27 of 28 checks passed
