Fix and test cases for unstable tokenization schemes like phi3 #830
Conversation
lm += "\n" + gen(name="five", max_tokens=1)
lm += "\n" + gen(name="six", max_tokens=1)

assert True
I assume that without this fix, something above will throw an exception?
Yes, the newlines cause special issues with phi-3's tokenizer.
Can you add a comment to that effect?
I think this is great. It is clearly a patch/hack, but it is really fixing a bug with phi-3 so I don't think there is a nice way around it looking a bit hacky.
tests/models/test_transformers.py (Outdated)
with assistant():
    lm += gen(name="five", max_tokens=10)

assert "5" in lm["five"]
What is it about this test which is specific to Llama3? Shouldn't all models be able to pass?
Nothing in particular, I was just using it to debug llama3. Removed the test.
Technically all the phi-3 tests should run on any model too; phi-3 is just what they were designed around, since it's the most likely to surface errors due to its tokenizer choices. The llama3 test was identical to the phi tests, so we should just generalize those in future PRs.
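(For illustration, one way that generalization could eventually look is a parametrized pytest fixture. This is only a sketch, and the model list, fixture name, and test body are hypothetical rather than the repo's actual fixtures.)

import pytest
from guidance import assistant, gen, models, user


# Hypothetical parametrization: run the same tokenizer-stability checks against
# several chat models instead of hard-coding phi-3.
@pytest.fixture(params=["microsoft/Phi-3-mini-4k-instruct", "meta-llama/Meta-Llama-3-8B-Instruct"])
def chat_model(request):
    return models.Transformers(request.param)


def test_counting_generalized(chat_model):
    lm = chat_model
    with user():
        lm += "You are a counting bot. Just keep counting numbers."
    with assistant():
        lm += "1,2,3,4,"
        lm += gen(name="five", max_tokens=10)
    assert "5" in lm["five"]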
tests/models/test_transformers.py (Outdated)
# Bad user tokens, but we should still generate /something/
lm += f"""<|use\n\nYou are a counting bot. Just keep counting numbers.<|end|><|assistant|>1,2,3,4,"""
lm += gen("five", max_tokens=10)
assert len(str(lm)) > 0
Surely you mean len(lm["five"]) > 0?
updated
tests/models/test_transformers.py (Outdated)
def test_phi3_chat_unrolled(phi3_model: models.Model):
    lm = phi3_model
    # Manually convert the chat format into completions style
    lm += f"""<|user|>\nYou are a counting bot. Just keep counting numbers.<|end|><|assistant|>1,2,3,4,"""
I think you need \n<|assistant|>\n here. Here's the template for reference:

{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|user|>' + '\n' + message['content'] + '<|end|>' + '\n' + '<|assistant|>' + '\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|end|>' + '\n'}}{% endif %}{% endfor %}
Phi-3 just has a really weird tokenizer. While the chat template inserts newlines, the tokenizer immediately strips them due to an rstrip they enabled on the special tokens, so functionally the two forms are identical. The rest of the tests, with newlines/whitespace everywhere, are there to check for this (and the instability it leads to). That said, this test should pass either way, but as written it is the closest thing to what actually gets passed to the model.
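(A quick way to probe that stripping behavior, as a minimal sketch: it assumes the public microsoft/Phi-3-mini-4k-instruct tokenizer loaded via transformers, and it prints the comparison rather than asserting, since the exact result depends on how the checkpoint configures its added tokens.)

from transformers import AutoTokenizer

# Assumption: the public phi-3 instruct checkpoint; any phi-3 tokenizer with
# rstrip enabled on its special tokens should behave the same way.
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

with_newlines = "<|user|>\nCount to five.<|end|>\n<|assistant|>\n"
without_newlines = "<|user|>Count to five.<|end|><|assistant|>"

ids_with = tok.encode(with_newlines, add_special_tokens=False)
ids_without = tok.encode(without_newlines, add_special_tokens=False)

# If rstrip on the special tokens swallows the following newline, the two
# encodings collapse to the same ids, and decoding never brings the "\n" back.
print(ids_with == ids_without)
print(repr(tok.decode(ids_with)))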
updated to be safe!
Test failures appear unrelated.
Some models like phi-3 have unstable tokenizers that aren't reversible in certain circumstances. This can cause stability issues with guidance's attempt to stay on distribution by retokenizing prompt inputs. I don't believe we have a choice but to skip this step for malformed tokenizers.
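(For concreteness, "reversible" here means that decoding a token sequence and re-encoding it reproduces the same ids. A minimal check might look like the sketch below; it is not code from this PR, and the model name is only an example of a tokenizer known to misbehave around special tokens and newlines.)

from transformers import AutoTokenizer


def roundtrips(tokenizer, text: str) -> bool:
    """Return True if encode -> decode -> encode reproduces the same token ids."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    recoded = tokenizer.encode(tokenizer.decode(ids), add_special_tokens=False)
    return recoded == ids


# Prompts with newlines around special tokens are where phi-3-style tokenizers
# tend to lose information, so that's a useful probe.
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
print(roundtrips(tok, "<|user|>\nYou are a counting bot.<|end|>\n<|assistant|>\n1,2,3,4,"))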
This isn't really in a merge-ready state; it's a band-aid fix plus some reproducible tests to start a discussion. If we expect this to be a trend, we should formalize the exception process, but if it's a small number of models, the hack in the TransformersEngine and Engine classes may be the minimal code change we need.
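(To make the shape of that exception concrete, the escape hatch could look roughly like the sketch below. This is purely illustrative: the attribute and method names are hypothetical and not the actual TransformersEngine/Engine changes in this PR.)

# Hypothetical sketch of an opt-out for unstable tokenizers; the real
# Engine/TransformersEngine code in guidance is structured differently.
class Engine:
    def __init__(self, tokenizer, retokenize: bool = True):
        self.tokenizer = tokenizer
        # Turned off for models whose tokenizers are not reversible.
        self.enable_retokenize = retokenize

    def _recode(self, token_ids):
        """Re-tokenize the prompt to stay on distribution, unless disabled."""
        if not self.enable_retokenize:
            return token_ids  # malformed tokenizer: pass the ids through untouched
        text = self.tokenizer.decode(token_ids)
        return self.tokenizer.encode(text)


class TransformersEngine(Engine):
    # Known-unstable tokenizers for which the recode step is skipped.
    UNSTABLE_MODELS = {"microsoft/Phi-3-mini-4k-instruct"}

    def __init__(self, model_name, tokenizer):
        super().__init__(tokenizer, retokenize=model_name not in self.UNSTABLE_MODELS)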
(Following a discussion with @slundberg and @paulbkoch -- tagging for your awareness)