Added space in prompt for better infills and proper stop token #34

Merged: 1 commit into ex3ndr:main from prompt-fix on Feb 8, 2024

Conversation

@Kevsnz (Contributor) commented Feb 3, 2024

Hi!

I've been experimenting with CodeLlama FIM for the last couple of days. What I discovered is that CodeLlama gives more robust results when the sentinel tokens in the prompt are surrounded by spaces. It's especially noticeable at the beginning of a file, when the 'system' part of the prefix dominates the prompt.

A typical failure is shown below:
[screenshot: incorrect infill before the fix]

After the fix, the model fills in the code properly:
[screenshot: correct infill after the fix]

I also changed the stop tokens: I added `<EOT>` and `<EOD>` according to the paper, and removed the `<PRE>`, `<SUF>` and `<MID>` tokens since they stopped showing up in model responses after I fixed the prompt.
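For context, here's a minimal sketch of what the changed prompt assembly could look like (the helper name `buildFimPrompt` is hypothetical, not the extension's actual code):

```typescript
// Hypothetical sketch of the prompt change; names are illustrative only.
function buildFimPrompt(prefix: string, suffix: string): string {
  // Surrounding the sentinel tokens with spaces makes CodeLlama's
  // infills noticeably more robust, especially near the start of a file.
  return `<PRE> ${prefix} <SUF> ${suffix} <MID>`;
}

// Stop on the end-of-sequence sentinels from the paper; <PRE>/<SUF>/<MID>
// no longer show up in responses once the prompt is fixed.
const stopTokens = ['<EOT>', '<EOD>'];
```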

@Kevsnz (Contributor, Author) commented Feb 4, 2024

So, after some more digging I think I've found the culprit.

I noticed that this happens only on Windows. Windows uses `\r\n` as newlines, and for some reason CodeLlama (or its tokenizer) fails to recognize the `<SUF>` token if it's followed by `\r`.

That's why adding a space after `<SUF>` stops those infilling failures. Removing the `\r` after the token also fixes the problem.

Experiments

Failure: `<SUF>\r\n`
[screenshot: failed infill]

OK: `<SUF> \r\n`
[screenshot: correct infill with the extra space]

OK: `<SUF>\n` (funny thing is that the model put the missing `\r` back into the generated completion 🤣):
[screenshot: correct infill without the carriage return]
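The three variants can be reproduced with something like the following (the prefix/suffix content here is made up for illustration):

```typescript
// Hypothetical repro of the three experiments; only the text right
// after <SUF> differs between the variants.
const prefix = 'function add(a: number, b: number) {\n    return';
const suffix = '\r\n}\r\n'; // Windows-style line endings

const failing = `<PRE> ${prefix} <SUF>${suffix} <MID>`;  // <SUF> followed by \r: broken infill
const okSpace = `<PRE> ${prefix} <SUF> ${suffix} <MID>`; // space after <SUF>: correct infill
const okNoCr  = `<PRE> ${prefix} <SUF>${suffix.replace(/\r\n/g, '\n')} <MID>`; // \r stripped: correct infill
```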

Another possible way to fix it would be to replace all occurrences of `\r\n` with `\n` in the prompt and put the `\r`s back into the model output, and do this only on Windows. I'm not sure how to check which OS is running, and/or whether that's possible in VS Code at all.
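A VS Code extension runs under Node.js, so the OS check itself is easy via `process.platform`; here's a sketch of that alternative (the helper names are hypothetical):

```typescript
// Hypothetical sketch of the \r\n normalization alternative.
const isWindows = process.platform === 'win32'; // Node.js API, available to VS Code extensions

function normalizePrompt(prompt: string): string {
  // Strip carriage returns before the prompt reaches the tokenizer.
  return isWindows ? prompt.replace(/\r\n/g, '\n') : prompt;
}

function restoreLineEndings(completion: string): string {
  // Re-insert \r so the completion matches the document's EOL style.
  // Assumes the model emits plain \n, as observed in the experiments above.
  return isWindows ? completion.replace(/\n/g, '\r\n') : completion;
}
```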

@ex3ndr (Owner) commented Feb 4, 2024

Wow, that's crazy! Did they change the FIM tokens? Which paper are you referring to?

@Kevsnz (Contributor, Author) commented Feb 5, 2024

It doesn't look like those tokens changed.

In the Code Llama paper (Section 2.3) they refer to Bavarian et al. (2022) (Section 3), where the prompt format for fill-in-the-middle is described. The `<EOT>` token finishes the whole sequence there.
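Concretely, the PSM ("prefix-suffix-middle") layout from Bavarian et al. terminates the whole sequence with `<EOT>`, so a complete FIM example looks roughly like this (the spacing follows this PR's convention; the exact token-level spacing in training may differ):

```typescript
// Rough shape of a full FIM sequence per Bavarian et al. (2022).
// At inference time the prompt stops at <MID>; the model then generates
// the middle and signals completion by emitting <EOT>.
const fimSequence = (prefix: string, suffix: string, middle: string) =>
  `<PRE> ${prefix} <SUF> ${suffix} <MID> ${middle} <EOT>`;
```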

However, I can't find where I saw the `<EOD>` token; probably in the source files somewhere on Hugging Face. I haven't seen it in the model's output, so maybe it can be removed.

Also, today I encountered the same FIM failure on my Mac (using the original extension version), where there are no `\r`s in the prompts. Adding a space after the `<SUF>` token seems to have fixed the problem there.

@ex3ndr merged commit 1e2431a into ex3ndr:main on Feb 8, 2024
@ex3ndr (Owner) commented Feb 8, 2024

Perfect, thanks!

@Kevsnz deleted the prompt-fix branch on February 9, 2024