
Conversation

dzhwinter
Contributor

Follow-up work on #8594.
In the current implementation, a LoD describes the Tensor as several blocks, and every range in the LoD launches a CUDA kernel once.
However, this is not optimal, because the CUDA kernel launch time far exceeds the kernel execution time. So I merge these per-range operations into a single CUDA kernel to accelerate sequence_softmax.
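A minimal sketch of the idea (not this PR's actual kernel): launch one fused kernel in which each thread block handles one LoD range, instead of launching a softmax kernel per range. The kernel name `fused_sequence_softmax` and the `seq_offsets` layout (LoD offsets copied to device memory, length `num_seqs + 1`) are illustrative assumptions.

```cuda
#include <cfloat>
#include <cuda_runtime.h>

// One thread block per sequence; threads cooperate on the max / sum / normalize
// passes of a numerically stable softmax over x[seq_offsets[seq] .. seq_offsets[seq+1]).
// blockDim.x is assumed to be a power of two for the shared-memory reductions.
__global__ void fused_sequence_softmax(const float* x, float* out,
                                       const int* seq_offsets, int num_seqs) {
  int seq = blockIdx.x;
  if (seq >= num_seqs) return;
  int begin = seq_offsets[seq];
  int end   = seq_offsets[seq + 1];

  extern __shared__ float buf[];

  // 1) max over the sequence (for numerical stability)
  float m = -FLT_MAX;
  for (int i = begin + threadIdx.x; i < end; i += blockDim.x)
    m = fmaxf(m, x[i]);
  buf[threadIdx.x] = m;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s)
      buf[threadIdx.x] = fmaxf(buf[threadIdx.x], buf[threadIdx.x + s]);
    __syncthreads();
  }
  m = buf[0];
  __syncthreads();

  // 2) sum of exp(x - max)
  float sum = 0.f;
  for (int i = begin + threadIdx.x; i < end; i += blockDim.x)
    sum += expf(x[i] - m);
  buf[threadIdx.x] = sum;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
    __syncthreads();
  }
  sum = buf[0];

  // 3) normalize
  for (int i = begin + threadIdx.x; i < end; i += blockDim.x)
    out[i] = expf(x[i] - m) / sum;
}
```

With this layout the host side makes a single launch covering all sequences, e.g. `fused_sequence_softmax<<<num_seqs, 128, 128 * sizeof(float)>>>(d_x, d_out, d_seq_offsets, num_seqs);`, which removes the per-range launch overhead that dominated the original implementation.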

dzhwinter changed the title "add detail of merge softmax kernel" to [Speed] "merge softmax kernel" on Mar 27, 2018
paddle-bot-old bot closed this on May 22, 2020
@paddle-bot-old

Since you haven't replied for a long time, we have closed this issue/PR.
If the problem is not solved or there is a follow-up, please reopen it at any time and we will continue to follow up.

blacksheep-Aristotle pushed a commit to blacksheep-Aristotle/Paddle that referenced this pull request Nov 22, 2024
