Description
Expected gain: for 13B models, we should see a 20%-30% latency reduction on a single GPU and a 2-3x speedup on 4 GPUs. For smaller models, the gain should be even larger.
Having a single iteration's computation run entirely in C++ should be enough for high performance. This way, we can keep most of the complicated scheduling logic in Python, including weight loading.
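
A minimal sketch of this split, assuming a hypothetical PyTorch C++ extension: the Python scheduler would call a single per-iteration entry point, so the Python/C++ boundary is crossed only once per step. The function name and arguments below are made up for illustration.

```cpp
#include <torch/extension.h>

// Hypothetical entry point: runs one full decoding iteration in C++,
// so the Python scheduler crosses the language boundary once per step.
torch::Tensor execute_iteration(torch::Tensor input_ids,
                                torch::Tensor positions) {
  // Placeholder body; a real implementation would run the whole model
  // forward pass here and return the sampled token ids.
  return input_ids;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("execute_iteration", &execute_iteration,
        "Run one full decoding iteration in C++");
}
```

The Python side would keep doing all scheduling and weight loading, and call this function once inside its decode loop.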
Potential sources of overhead:
1. Python vs. C++.
2. PyTorch (even when driven from C++) vs. FasterTransformer.
How to implement a C++ version:
- (Fake C++) The Torch compiler (torch.jit): the model code stays in Python but is compiled to TorchScript (see the first sketch below).
- LibTorch, the C++ API of PyTorch (easier to implement and extend, but only addresses overhead 1; see the second sketch below).
- Prune the single-model code we need out of FasterTransformer and port it into CacheFlow (see the third sketch below). This addresses both overheads but is harder to implement.
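
For the torch.jit route, a TorchScript module produced in Python (e.g. with `torch.jit.script(model).save("model.pt")`) can be invoked from Python or, as in the sketch below, exported and loaded through LibTorch's JIT loader. The file name and input shape here are hypothetical:

```cpp
#include <torch/script.h>

#include <vector>

int main() {
  // Load a module previously exported from Python with
  // torch.jit.script(model).save("model.pt").
  torch::jit::script::Module module = torch::jit::load("model.pt");
  module.eval();

  // Hypothetical input: token ids for a single decoding iteration.
  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::ones({1, 1}, torch::kLong));

  torch::Tensor logits = module.forward(inputs).toTensor();
  return 0;
}
```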
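For the LibTorch route, the model would be rewritten against the C++ frontend. Below is a minimal sketch with a hypothetical MLP block standing in for a real decoder layer; the ops are the same eager PyTorch kernels as in Python, which is why this removes only overhead 1:

```cpp
#include <torch/torch.h>

// Hypothetical stand-in for a decoder layer, written with the LibTorch
// C++ frontend. Dimensions are illustrative only.
struct MlpBlockImpl : torch::nn::Module {
  torch::nn::Linear fc1{nullptr}, fc2{nullptr};

  explicit MlpBlockImpl(int64_t hidden) {
    fc1 = register_module("fc1", torch::nn::Linear(hidden, 4 * hidden));
    fc2 = register_module("fc2", torch::nn::Linear(4 * hidden, hidden));
  }

  torch::Tensor forward(torch::Tensor x) {
    // Same eager PyTorch ops as in Python, dispatched from C++:
    // no interpreter in the loop, but still per-op dispatch overhead.
    return fc2->forward(torch::gelu(fc1->forward(x)));
  }
};
TORCH_MODULE(MlpBlock);

int main() {
  MlpBlock block(256);
  torch::Tensor y = block->forward(torch::randn({1, 8, 256}));
  return 0;
}
```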
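For the FasterTransformer route, the pruned-out code would own raw buffers and run its computation directly, with neither Python nor PyTorch dispatch in the loop. The CPU-only sketch below is purely illustrative of that structure; a real port would launch FasterTransformer's fused CUDA kernels instead of this naive loop:

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: a standalone layer that owns its weights as raw
// buffers and computes directly, framework-free. A real port would
// call fused CUDA kernels here.
struct StandaloneLinear {
  size_t in_dim, out_dim;
  std::vector<float> weight;  // row-major [out_dim, in_dim]

  StandaloneLinear(size_t in, size_t out)
      : in_dim(in), out_dim(out), weight(in * out, 0.01f) {}

  void forward(const float* x, float* y) const {
    for (size_t o = 0; o < out_dim; ++o) {
      float acc = 0.0f;
      for (size_t i = 0; i < in_dim; ++i)
        acc += weight[o * in_dim + i] * x[i];
      y[o] = acc;
    }
  }
};

int main() {
  StandaloneLinear proj(8, 8);
  std::vector<float> x(8, 1.0f), y(8);
  proj.forward(x.data(), y.data());
  return 0;
}
```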