feat: 2-3x inference speedup, faster than real-time #71
Conversation
A few clarifications requested.
fam/llm/gptfast_inference.py
num_samples=1,
seed=1337,
device="cuda",
dtype="bfloat16",
Auto-handle the dtype instead of hard-coding it; check fam/llm/utils.py.
done
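For reference, a minimal sketch of what auto dtype selection can look like (illustrative only; the helper name here is hypothetical and not necessarily what fam/llm/utils.py implements):

```python
import torch

def resolve_dtype(device: str = "cuda") -> torch.dtype:
    """Pick a sensible default dtype for the target device (illustrative helper)."""
    if device == "cuda" and torch.cuda.is_available():
        # Prefer bfloat16 on GPUs that support it (Ampere and newer),
        # otherwise fall back to float16.
        return torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    # CPU or no GPU available: use full precision.
    return torch.float32
```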
Dirty fix to speed up inference by porting gpt-fast. Supports a single utterance only; no batching.
First stage only:
RTX 4090: 230T/s (~1.5 seconds of speech generated in 1 second of wall-clock time)
H100: 382T/s (~2.5 seconds of speech generated in 1 second of wall-clock time)
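A quick back-of-the-envelope check of the real-time factors above (the ~150 tokens per second of audio is implied by these numbers, not an official figure):

```python
# Implied first-stage tokens per second of generated audio,
# derived from the (throughput, real-time factor) pairs above.
for toks_per_s, speech_s_per_wall_s in [(230, 1.5), (382, 2.5)]:
    tokens_per_audio_second = toks_per_s / speech_s_per_wall_s
    print(f"{toks_per_s} T/s -> ~{tokens_per_audio_second:.0f} tokens per second of audio")
# Both work out to ~153 tokens per second of audio, so the two
# measurements are consistent with each other.
```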
These throughputs are consistent across context lengths thanks to the static KV cache (a sketch of the idea follows below).
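For readers unfamiliar with the technique, here is a minimal sketch of a static (pre-allocated) KV cache in the gpt-fast style; the class name, shapes, and signatures are illustrative, not the exact ones in this PR:

```python
import torch
from torch import nn

class StaticKVCache(nn.Module):
    """Pre-allocated KV cache: buffers are sized for max_seq_len up front,
    so tensor shapes (and any compiled kernels) never change as the sequence grows."""

    def __init__(self, batch, n_heads, max_seq_len, head_dim, dtype=torch.bfloat16):
        super().__init__()
        shape = (batch, n_heads, max_seq_len, head_dim)
        self.register_buffer("k_cache", torch.zeros(shape, dtype=dtype))
        self.register_buffer("v_cache", torch.zeros(shape, dtype=dtype))

    def update(self, input_pos, k_val, v_val):
        # input_pos: (seq_len,) positions being written at this decode step
        # k_val / v_val: (batch, n_heads, seq_len, head_dim)
        self.k_cache[:, :, input_pos] = k_val
        self.v_cache[:, :, input_pos] = v_val
        return self.k_cache, self.v_cache
```

Attention then always runs over the fixed-size buffers with a causal mask, which is why per-token cost stays roughly flat as the context grows.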
Notes: