Support UTF-8 with BOM #233

Closed
aegoroff opened this issue Oct 16, 2018 · 5 comments

Comments

@aegoroff

Currently the library doesn't support TOML files saved as UTF-8 with a BOM.

@aaaasmile

Yes, I got this error:

Near line 0 (last key parsed ''): bare keys cannot contain '\ufeff'

This happened to me when I modified a UTF-8 TOML file with Notepad on Windows Server. Notepad saved the file with a BOM by default, and as a result the parser stopped working.
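For anyone wondering where the '\ufeff' in that message comes from: the UTF-8 BOM is the three bytes EF BB BF, which decode to the single code point U+FEFF. Below is a minimal sketch using only the Go standard library. The file contents and the `firstRune` helper are made up for illustration and are not part of the toml package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRune decodes the first UTF-8 code point of data, which is what a
// rune-based lexer would see before anything else in the file.
func firstRune(data []byte) rune {
	r, _ := utf8.DecodeRune(data)
	return r
}

func main() {
	// Hypothetical file as saved with a BOM: EF BB BF followed by a TOML key.
	data := []byte("\xEF\xBB\xBFname = \"example\"\n")
	fmt.Printf("%q\n", firstRune(data)) // prints '\ufeff'
}
```

Since U+FEFF is not a valid bare-key character, a parser that does not skip it will reject the very first key in the file.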

@arp242
Collaborator

arp242 commented Jun 8, 2021

UTF-8 shouldn't have a BOM; it looks like you're trying to read a UTF-16 file and the TOML specification supports only UTF-8. Since #276 the error on that should be clearer.

@arp242 arp242 closed this as completed Jun 8, 2021
@BurntSushi
Owner

@arp242 Unfortunately, it's somewhat common for UTF-8 encoded files on Windows to have a BOM. Byte order is of course an irrelevant concept for UTF-8. As far as I can tell, it's mostly only useful as a signal that the file is UTF-8 encoded, even though its use is nowhere near universal.

(The way I've handled the UTF-8 BOM in other projects is mostly to just look for it, allow it, but otherwise ignore it.)
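The "look for it, allow it, ignore it" approach can be sketched in a few lines of Go. This is only an illustration of the idea, not code from this library or from the other projects mentioned; `stripBOM` and `utf8BOM` are hypothetical names.

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is the byte sequence some editors write at the start of UTF-8 files.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// stripBOM returns data without a leading UTF-8 BOM. Input that has no BOM
// is returned unchanged.
func stripBOM(data []byte) []byte {
	return bytes.TrimPrefix(data, utf8BOM)
}

func main() {
	withBOM := []byte("\xEF\xBB\xBFname = \"example\"\n")
	withoutBOM := []byte("name = \"example\"\n")

	// Both inputs yield the same bytes once the BOM is skipped.
	fmt.Println(bytes.Equal(stripBOM(withBOM), withoutBOM)) // prints true
}
```

As a workaround on versions that still reject the BOM, a caller could apply something like this to the file contents before handing them to the decoder.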

@arp242
Collaborator

arp242 commented Jun 8, 2021

Oh right, what a curious thing to do.

I'll change it to ignore it then; the other UTF-16 check should still work to produce reasonable errors.

arp242 added a commit that referenced this issue Jun 9, 2021
Apparently some UTF-8 files can start with a BOM, so read over that
instead of assuming it's UTF-16. Also move the check for NULL out of the
lexer, so it can remain "UTF-8 clean"; just examine the first few bytes
instead.

Ref: #233 (comment)
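A rough sketch of what the commit message describes: skip a UTF-8 BOM, and detect UTF-16 by examining only the first few bytes, before any lexing happens. This is not the code from the commit. The function name `prepareInput`, the error text, and the choice of inspecting four bytes are all assumptions made for illustration.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// errUTF16 is a hypothetical error; the library's real message may differ.
var errUTF16 = errors.New("input looks like UTF-16; TOML files must be UTF-8")

// prepareInput rejects input that looks like UTF-16 and skips a UTF-8 BOM.
func prepareInput(data []byte) ([]byte, error) {
	// A UTF-16 BOM (little- or big-endian) is a clear sign of the wrong encoding.
	if bytes.HasPrefix(data, []byte{0xFF, 0xFE}) || bytes.HasPrefix(data, []byte{0xFE, 0xFF}) {
		return nil, errUTF16
	}

	// A UTF-8 BOM is allowed but carries no meaning, so read over it.
	data = bytes.TrimPrefix(data, []byte{0xEF, 0xBB, 0xBF})

	// BOM-less UTF-16 text has NUL bytes between ASCII characters. Checking
	// only the first few bytes (4 here, an arbitrary choice) keeps this cheap.
	head := data
	if len(head) > 4 {
		head = head[:4]
	}
	if bytes.IndexByte(head, 0) >= 0 {
		return nil, errUTF16
	}
	return data, nil
}

func main() {
	out, err := prepareInput([]byte("\xEF\xBB\xBFname = \"example\"\n"))
	fmt.Printf("%q %v\n", out, err)

	_, err = prepareInput([]byte("n\x00a\x00m\x00e\x00"))
	fmt.Println(err)
}
```

Doing the check up front means the lexer itself can assume it only ever sees UTF-8 text.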
@arp242 arp242 mentioned this issue Jun 9, 2021
@arp242
Collaborator

arp242 commented Jun 9, 2021

Fixed it now in #277
