Fix eager memory leak and re-enable new checkpoint #4008
Merged
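`IsStreamInParallelDesc` did not special-case `ControlStreamType`, so some `DeleteObject` instructions were wrongly ignored. That left object reference counts incorrect and caused the memory leak.

After the fix, I tested with the code that previously hit OOM when calling save repeatedly, and the memory leak no longer occurs. The GPU memory used by checkpoint on each card is now only 32M, which is the size of one slice during streaming save/load/init.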
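To make the failure mode concrete, here is a minimal toy model in Python. It is not the OneFlow implementation: the `Instruction` class, the `run` helper, the stream-type strings, and the two `is_stream_in_parallel_desc_*` functions are all hypothetical stand-ins. It also assumes, for illustration only, that `DeleteObject` instructions arrive on a control stream that is not listed among a placement's device streams.

```python
from dataclasses import dataclass

# Hypothetical stream-type tags; CONTROL stands in for ControlStreamType.
CONTROL = "control"
COMPUTE = "compute"


@dataclass(frozen=True)
class Instruction:
    op: str           # "NewObject" or "DeleteObject"
    stream_type: str  # stream the instruction is issued on
    object_id: int


def is_stream_in_parallel_desc_buggy(stream_type, device_streams):
    # Toy version of the bug: only device-bound streams pass the check,
    # so anything issued on the control stream is filtered out.
    return stream_type in device_streams


def is_stream_in_parallel_desc_fixed(stream_type, device_streams):
    # Toy version of the fix: the control stream is special-cased
    # and always accepted.
    if stream_type == CONTROL:
        return True
    return stream_type in device_streams


def run(instructions, accepts, device_streams):
    """Apply instructions and return the reference counts still held."""
    refcounts = {}
    for instr in instructions:
        if not accepts(instr.stream_type, device_streams):
            continue  # instruction is silently ignored
        if instr.op == "NewObject":
            refcounts[instr.object_id] = refcounts.get(instr.object_id, 0) + 1
        elif instr.op == "DeleteObject":
            remaining = refcounts.get(instr.object_id, 0) - 1
            if remaining <= 0:
                refcounts.pop(instr.object_id, None)  # object is freed
            else:
                refcounts[instr.object_id] = remaining
    return refcounts


# Three objects are created on the compute stream and deleted via control.
program = [Instruction("NewObject", COMPUTE, i) for i in range(3)]
program += [Instruction("DeleteObject", CONTROL, i) for i in range(3)]

leaked = run(program, is_stream_in_parallel_desc_buggy, {COMPUTE})
clean = run(program, is_stream_in_parallel_desc_fixed, {COMPUTE})
```

In this sketch, `leaked` still holds a reference for every object because the delete instructions never ran, while `clean` is empty. Repeating the same program in a loop with the buggy check grows the set of live objects without bound, which mirrors the OOM seen when save was called repeatedly.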