Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect utf-8 handling on require('file.yaml') #54

Closed
maxp opened this issue Nov 1, 2012 · 4 comments
Closed

Incorrect utf-8 handling on require('file.yaml') #54

maxp opened this issue Nov 1, 2012 · 4 comments
Labels

Comments

@maxp
Copy link

maxp commented Nov 1, 2012

When js-yaml loads long files with multibyte utf-8 symbols sometimes it brokes characters (replaces valid unicode char by two error marks)

yaml = require('js-yaml')
data_broken = require('mydata.yaml')
data_correct = yaml.load( fs.readFileSync('mydata.yaml') )

Probably it happens when utf-8 symbol lays exactly on read buffer boundary and byte reader splits it on two incorrect parts.

@ixti
Copy link
Contributor

ixti commented Nov 2, 2012

Can you please provide a broken YAML example, so I could play with it.
You can either send it to my by e-mail, or fork js-yaml repo and put "failing" YAML file as ./test/issues/data/issue-54.yml

@maxp
Copy link
Author

maxp commented Nov 2, 2012

js and yaml:
https://gist.github.com/3997997
https://gist.github.com/3998001

That file filled with cyrillic letter 'у' (two bytes).

The bug appears near 8k buffer boundary on odd aligned utf8 bytes.
Reader splits bytes of the character and produces two broken symbols.

ixti added a commit that referenced this issue Nov 4, 2012
@ixti
Copy link
Contributor

ixti commented Nov 4, 2012

Yes, you're absolutely right. We read stream by 4K chunks, so it gets broken when it reads half of a unicode char.

@ixti
Copy link
Contributor

ixti commented Nov 4, 2012

As we are not using asynchronous work with input streams, there was no point of using fs.openFile under the hood. So I completely dropped support of raw Buffers and file descriptors in favor of plain Strings.

@ixti ixti closed this as completed Nov 4, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants