[TPU] Add example for profiling TPU inference #12531
Conversation
Signed-off-by: mgoin <mgoin@redhat.com>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Signed-off-by: mgoin <mgoin@redhat.com> Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: mgoin <mgoin@redhat.com> Signed-off-by: Linkun Chen <github@lkchen.net>
Signed-off-by: mgoin <mgoin@redhat.com> Signed-off-by: saeediy <saidakbarp@gmail.com>
Provides an example for simple prefill or decode profiling on TPUs. This is a starting point equivalent to text-only inference using `benchmark_latency.py`, where the user can specify only batch size, input length, and output length. Future work should expand this example to cover realistic data and multimodal inference.
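A minimal sketch of how collecting such a profile could look is below. The script path and flag names are assumptions modeled on `benchmark_latency.py`, not necessarily the exact interface this PR adds:

```shell
# Hypothetical invocation; the script path and flag names below are
# assumptions patterned after benchmark_latency.py, not this PR's exact CLI.
PROFILE_DIR=profiles
mkdir -p "$PROFILE_DIR"

# On a TPU VM (commented out here, since it requires TPU hardware):
# python examples/offline_inference/tpu_profile.py \
#     --batch-size 8 \
#     --input-len 128 \
#     --output-len 64 \
#     --profile-result-dir "$PROFILE_DIR"
```

The resulting trace directory can then be opened in TensorBoard, as shown in the screenshot below.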
Example screenshot of a profile in TensorBoard (`tensorboard --logdir profiles/ --port 6006`):