Description
Expected gain: for 13B models, we should see a 20%-30% latency reduction on a single GPU and a 2-3x speedup on 4 GPUs. For smaller models, the gain should be even larger.
Having a single iteration's computation run entirely in C++ should be enough for high performance. This way, we can keep most of the complicated scheduling logic in Python, including weight loading.
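
A minimal sketch of this split, assuming a hypothetical PyTorch C++ extension: the Python scheduler would call a single per-iteration entry point, so the Python/C++ boundary is crossed only once per step. The function name and arguments below are made up for illustration.

```cpp
#include <torch/extension.h>

// Hypothetical entry point: runs one full decoding iteration in C++,
// so the Python scheduler crosses the language boundary once per step.
torch::Tensor execute_iteration(torch::Tensor input_ids,
                                torch::Tensor positions) {
  // Placeholder body; a real implementation would run the whole model
  // forward pass here and return the sampled token ids.
  return input_ids;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("execute_iteration", &execute_iteration,
        "Run one full decoding iteration in C++");
}
```

The Python side would keep doing all scheduling and weight loading, and call this function once inside its decode loop.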
Potential sources of overhead:
1. Python vs. C++.
2. PyTorch (even when driven from C++) vs. FasterTransformer.
How to implement a C++ version:
- (Fake C++) The Torch compiler (torch.jit): the model code stays in Python but is compiled to TorchScript (see the first sketch below).
- LibTorch, the C++ API of PyTorch (easier to implement and extend, but only addresses overhead 1; see the second sketch below).
- Prune the single-model code we need out of FasterTransformer and port it into CacheFlow (see the third sketch below). This addresses both overheads but is harder to implement.
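
For the torch.jit route, a TorchScript module produced in Python (e.g. with `torch.jit.script(model).save("model.pt")`) can be invoked from Python or, as in the sketch below, exported and loaded through LibTorch's JIT loader. The file name and input shape here are hypothetical:

```cpp
#include <torch/script.h>

#include <vector>

int main() {
  // Load a module previously exported from Python with
  // torch.jit.script(model).save("model.pt").
  torch::jit::script::Module module = torch::jit::load("model.pt");
  module.eval();

  // Hypothetical input: token ids for a single decoding iteration.
  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::ones({1, 1}, torch::kLong));

  torch::Tensor logits = module.forward(inputs).toTensor();
  return 0;
}
```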
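For the LibTorch route, the model would be rewritten against the C++ frontend. Below is a minimal sketch with a hypothetical MLP block standing in for a real decoder layer; the ops are the same eager PyTorch kernels as in Python, which is why this removes only overhead 1:

```cpp
#include <torch/torch.h>

// Hypothetical stand-in for a decoder layer, written with the LibTorch
// C++ frontend. Dimensions are illustrative only.
struct MlpBlockImpl : torch::nn::Module {
  torch::nn::Linear fc1{nullptr}, fc2{nullptr};

  explicit MlpBlockImpl(int64_t hidden) {
    fc1 = register_module("fc1", torch::nn::Linear(hidden, 4 * hidden));
    fc2 = register_module("fc2", torch::nn::Linear(4 * hidden, hidden));
  }

  torch::Tensor forward(torch::Tensor x) {
    // Same eager PyTorch ops as in Python, dispatched from C++:
    // no interpreter in the loop, but still per-op dispatch overhead.
    return fc2->forward(torch::gelu(fc1->forward(x)));
  }
};
TORCH_MODULE(MlpBlock);

int main() {
  MlpBlock block(256);
  torch::Tensor y = block->forward(torch::randn({1, 8, 256}));
  return 0;
}
```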
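For the FasterTransformer route, the pruned-out code would own raw buffers and run its computation directly, with neither Python nor PyTorch dispatch in the loop. The CPU-only sketch below is purely illustrative of that structure; a real port would launch FasterTransformer's fused CUDA kernels instead of this naive loop:

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: a standalone layer that owns its weights as raw
// buffers and computes directly, framework-free. A real port would
// call fused CUDA kernels here.
struct StandaloneLinear {
  size_t in_dim, out_dim;
  std::vector<float> weight;  // row-major [out_dim, in_dim]

  StandaloneLinear(size_t in, size_t out)
      : in_dim(in), out_dim(out), weight(in * out, 0.01f) {}

  void forward(const float* x, float* y) const {
    for (size_t o = 0; o < out_dim; ++o) {
      float acc = 0.0f;
      for (size_t i = 0; i < in_dim; ++i)
        acc += weight[o * in_dim + i] * x[i];
      y[o] = acc;
    }
  }
};

int main() {
  StandaloneLinear proj(8, 8);
  std::vector<float> x(8, 1.0f), y(8);
  proj.forward(x.data(), y.data());
  return 0;
}
```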