
Modify the current PyTorch model to C++ #42

Closed as not planned

Description

@zhuohan123

Expected gain: for 13B models, we should see a 20%-30% latency improvement on a single GPU and a 2-3x improvement on 4 GPUs. For smaller models, the gain should be even higher.

Having a single iteration's computation run entirely in C++ should be enough for high performance. This way, we can keep most of the complicated scheduling logic, including weight loading, in Python.
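
Concretely, the split could look like the minimal sketch below. All names and sizes here (`DecoderStep`, `run_scheduler`, the layer dimensions) are invented for illustration, not CacheFlow's actual code: Python keeps batching, preemption, and weight loading, while only the per-iteration forward pass is compiled.

```python
import torch

class DecoderStep(torch.nn.Module):
    """One full decoding iteration; the part worth moving to C++."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        # Placeholder for the real transformer layers + output projection.
        self.proj = torch.nn.Linear(hidden_size, vocab_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden_states)

# Compile the iteration once; the scheduling below stays in plain Python.
step = torch.jit.script(DecoderStep(hidden_size=1024, vocab_size=32000))

def run_scheduler(batches):
    # Batching, preemption, and weight loading remain Python-side;
    # only the hot per-iteration forward is handed to the compiled step.
    for hidden_states in batches:
        with torch.no_grad():
            yield step(hidden_states)
```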

Potential sources of overhead (a micro-benchmark sketch for estimating the first follows this list):

  1. Python vs. C++.
  2. PyTorch (even in C++) vs. FasterTransformer.
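
A rough way to put a number on overhead 1 is to time the same forward pass in eager mode and as TorchScript; the gap approximates the Python dispatch cost per iteration. The model below is a placeholder stack of linear layers, not the real 13B model, and the numbers will vary by hardware:

```python
import time
import torch

# Placeholder model: a stack of small linear layers, deliberately cheap so
# that per-op Python overhead dominates the measurement.
model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(24)])
x = torch.randn(1, 1024)
traced = torch.jit.trace(model, x)

def bench(fn, iters: int = 200) -> float:
    with torch.no_grad():
        fn(x)  # warmup
        start = time.perf_counter()
        for _ in range(iters):
            fn(x)
        # On GPU, torch.cuda.synchronize() would be needed before stopping.
        return (time.perf_counter() - start) / iters

print(f"eager : {bench(model) * 1e3:.3f} ms/iter")
print(f"traced: {bench(traced) * 1e3:.3f} ms/iter")
```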

How to implement a C++ version:

  1. (Fake C++) The Torch compiler (torch.jit).
  2. LibTorch, the C++ version of PyTorch (easier to implement and extend, but only addresses overhead 1; see the export sketch after this list for how it composes with option 1).
  3. Prune the single-model code we need out of FasterTransformer into CacheFlow. This addresses both overheads but is harder to implement.
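
Options 1 and 2 compose naturally: a module compiled with torch.jit can be serialized from Python and loaded from LibTorch in C++, so the per-iteration loop can run with no Python in it at all. A minimal sketch, with a placeholder module standing in for the real model:

```python
import torch

class IterationStep(torch.nn.Module):
    # Placeholder for the real single-iteration computation.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x) + 1.0

scripted = torch.jit.script(IterationStep())
scripted.save("iteration_step.pt")

# C++ side (LibTorch), roughly:
#   auto module = torch::jit::load("iteration_step.pt");
#   auto out = module.forward({input}).toTensor();
```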

Labels: performance (Performance-related issues)
