
Fix and test cases for unstable tokenization schemes like phi3 #830

Merged

8 commits merged into main on May 18, 2024

Conversation

Collaborator

@Harsha-Nori Harsha-Nori commented May 16, 2024

Some models like phi-3 have unstable tokenizers that aren't reversible in certain circumstances. This can cause stability issues with guidance's attempt to stay on-distribution by retokenizing prompt inputs. I don't believe we have a choice but to skip this step for malformed tokenizers.

This isn't really in a merge-ready state, but it is a band-aid fix with some reproducible tests to start a discussion. If we expect this to be a trend, we should formalize the exception process; but if it's a small number of models, the hack in the TransformersEngine and Engine classes may be the minimal code change we need.

(Following a discussion with @slundberg and @paulbkoch -- tagging for your awareness)
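To make the failure mode concrete, here is a small self-contained sketch (not guidance's actual code) of the "reversibility" property these tokenizers violate: decoding the encoded text should give back the original string. The `ToyRstripTokenizer` below is a hypothetical stand-in that, like phi-3's tokenizer, drops whitespace adjacent to a special token, so the round trip is lossy for exactly the newline-heavy inputs exercised in these tests.

```python
def is_reversible(tokenizer, text: str) -> bool:
    """Return True if a round trip through the tokenizer preserves the text."""
    tokens = tokenizer.encode(text)
    return tokenizer.decode(tokens) == text


class ToyRstripTokenizer:
    """Toy tokenizer that, like phi-3's, absorbs whitespace following a
    special token -- which makes encode/decode lossy for some inputs."""

    SPECIAL = "<|end|>"

    def encode(self, text):
        parts = text.split(self.SPECIAL)
        tokens = []
        for i, part in enumerate(parts):
            if i > 0:
                tokens.append(self.SPECIAL)
                # Mimic rstrip on the special token: the newline that
                # followed it in the original text is silently dropped.
                part = part.lstrip("\n")
            tokens.extend(part)  # one toy "token" per character
        return tokens

    def decode(self, tokens):
        return "".join(tokens)


tok = ToyRstripTokenizer()
assert is_reversible(tok, "1,2,3,4")         # plain text round-trips fine
assert not is_reversible(tok, "<|end|>\n5")  # newline after <|end|> is lost
```

When a prompt fails this check, retokenizing it silently changes its content, which is why the patch skips the retokenization step rather than attempting it on such tokenizers.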

lm += "\n" + gen(name="five", max_tokens=1)
lm += "\n" + gen(name="six", max_tokens=1)

assert True
Collaborator

I assume that without this fix, something above will throw an exception?

Collaborator Author

Yes, the newlines cause particular issues with phi-3's tokenizer.

Collaborator

Can you add a comment to that effect?

Collaborator

@slundberg slundberg left a comment

I think this is great. It is clearly a patch/hack, but it is really fixing a bug with phi-3, so I don't think there is a nice way around it looking a bit hacky.

@Harsha-Nori Harsha-Nori changed the title [Not for merging] Fix and test cases for unstable tokenization schemes like phi3 Fix and test cases for unstable tokenization schemes like phi3 May 16, 2024
guidance/models/_model.py (resolved)
tests/models/test_transformers.py (outdated, resolved)
with assistant():
lm += gen(name="five", max_tokens=10)

assert "5" in lm["five"]
Collaborator

What is it about this test that is specific to Llama3? Shouldn't all models be able to pass?

Collaborator Author

Nothing in particular, I was just using it to debug llama3. Removed the test.

Collaborator Author

Technically, all the phi-3 tests should run on any model too; phi is just what they were designed for (as it's likely to produce errors due to its tokenizer choices). The llama3 test was identical to the phi tests, so we should just generalize those in future PRs.

lm += "\n" + gen(name="five", max_tokens=1)
lm += "\n" + gen(name="six", max_tokens=1)

assert True
Collaborator

Can you add a comment to that effect?

# Bad user tokens, but we should still generate /something/
lm += f"""<|use\n\nYou are a counting bot. Just keep counting numbers.<|end|><|assistant|>1,2,3,4,"""
lm += gen("five", max_tokens=10)
assert len(str(lm)) > 0
Collaborator

Surely you mean len(lm["five"]) > 0?

Collaborator Author

Updated.

def test_phi3_chat_unrolled(phi3_model: models.Model):
lm = phi3_model
# Manually convert the chat format into completions style
lm += f"""<|user|>\nYou are a counting bot. Just keep counting numbers.<|end|><|assistant|>1,2,3,4,"""
Contributor

I think you need \n<|assistant|>\n here. Here's the template for reference:

{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|user|>' + '
' + message['content'] + '<|end|>' + '
' + '<|assistant|>' + '
'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|end|>' + '
'}}{% endif %}{% endfor %}

Collaborator Author

Phi-3 just has a really weird tokenizer. While the chat template inserts newlines, the tokenizer immediately strips them due to an rstrip they enabled on the special tokens, so the two forms are functionally identical; the rest of the tests with newlines/whitespace everywhere are there to check for this (and the instability it leads to). That said, this test should pass either way, but as written it is the closest thing to what actually gets passed to the model.
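A minimal sketch of the rstrip behavior being described (this is an illustrative stand-in, not the Hugging Face implementation): when special tokens absorb the whitespace that follows them, `<|user|>\nHello` and `<|user|>Hello` split into identical token sequences. The special-token list and `split_with_rstrip` helper here are assumptions for illustration only.

```python
import re

# Hypothetical special-token list mirroring phi-3's markers.
SPECIALS = ["<|user|>", "<|assistant|>", "<|end|>"]


def split_with_rstrip(text):
    """Split text around special tokens, dropping whitespace after each
    special token -- mimicking special tokens registered with rstrip."""
    pattern = "(" + "|".join(re.escape(s) for s in SPECIALS) + r")\s*"
    out, pos = [], 0
    for m in re.finditer(pattern, text):
        if m.start() > pos:
            out.append(text[pos:m.start()])
        out.append(m.group(1))  # whitespace after the special token is dropped
        pos = m.end()
    if pos < len(text):
        out.append(text[pos:])
    return out


# The template's newline after <|user|> is stripped, so both prompts
# reduce to the same sequence.
assert split_with_rstrip("<|user|>\nHello") == split_with_rstrip("<|user|>Hello")
```

This is why the test prompt without the template's newlines is the closest match to what the model actually sees after tokenization.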

Collaborator Author

Updated to be safe!

@Harsha-Nori
Collaborator Author

Test failures appear unrelated.

@Harsha-Nori Harsha-Nori merged commit b488431 into main May 18, 2024
61 of 62 checks passed
@Harsha-Nori Harsha-Nori deleted the unstabletokenizers branch May 18, 2024 04:55