
add fallback tokenizer #1046

Merged
merged 4 commits into dottxt-ai:main on Jul 17, 2024
Conversation

JerryKwan
Contributor

Add a fallback tokenizer if tiktoken cannot get an encoding from the model name.
This supports LLM services that provide an OpenAI-compatible API, such as ollama.

add fallback tokenizer if tiktoken can not get encoding from model name
support llm services which provide openai compatibility api
@lapp0
Contributor

lapp0 commented Jul 17, 2024

Doesn't cl100k_base only apply to gpt4 and gpt3? Or am I missing something?

@JerryKwan
Contributor Author

@lapp0
As far as I know, gpt-4, gpt-3.5-turbo, text-embedding-ada-002, text-embedding-3-small, and text-embedding-3-large use cl100k_base, while GPT-3 models like davinci use r50k_base.
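The fallback behavior under discussion can be sketched as follows. This is an illustrative sketch of the pattern, not the actual PR diff: `MODEL_ENCODINGS` and `get_encoding_name` are hypothetical stand-ins for tiktoken's model-to-encoding lookup (`tiktoken.encoding_for_model`), and `cl100k_base` is assumed as the fallback encoding name.

```python
import warnings

# Hypothetical stand-in for tiktoken's model -> encoding registry;
# the real lookup is tiktoken.encoding_for_model(model_name).
MODEL_ENCODINGS = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "davinci": "r50k_base",
}

FALLBACK_ENCODING = "cl100k_base"


def get_encoding_name(model_name: str) -> str:
    """Return the encoding for a model, falling back with a warning."""
    try:
        return MODEL_ENCODINGS[model_name]
    except KeyError:
        # Models served through OpenAI-compatible APIs (e.g. ollama)
        # are unknown to the registry, so fall back to a default
        # and warn the user, as suggested in the review.
        warnings.warn(
            f"Could not find encoding for model {model_name!r}; "
            f"falling back to {FALLBACK_ENCODING!r}."
        )
        return FALLBACK_ENCODING


print(get_encoding_name("gpt-4"))   # cl100k_base
print(get_encoding_name("llama3"))  # cl100k_base (emits a warning)
```

The key design point is that the lookup failure is caught and downgraded to a warning rather than an error, so model names that tiktoken does not recognize still get a usable tokenizer.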

add warning when using fallback tokenizer
@rlouf
Member

rlouf commented Jul 17, 2024

Please run pre-commit locally and push the changes

fix code style problems
@lapp0 lapp0 left a comment (Contributor)
Thanks!

change import order
@rlouf rlouf merged commit d64bfc2 into dottxt-ai:main Jul 17, 2024
5 of 7 checks passed
3 participants