-
Notifications
You must be signed in to change notification settings - Fork 683
Change internal encoding of strings to CESU-8 #616
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
049d399
to
34369e3
Compare
fyi, measurements with RaspPi2. Please, be carefull with address space randomization and measurements on x86. It's possible to get great perf gain on x86, and at the same time degradation on ARM.
PS Nice results ^_^ |
@@ -94,6 +94,16 @@ typedef ecma_char_t *ecma_char_ptr_t; | |||
#define LIT_UTF8_MAX_BYTES_IN_CODE_POINT (4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you, please, clarify, do we need the define after switching to CESU-8?
Is it correct that CESU-8 uses UTF-8 encoding scheme, but only for 1-byte and 2-byte code points?
Shouldn't we rename all UTF8
/ utf8
definitions to CESU8
/ cesu8
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay.
You are correct, CESU uses UTF-8 encoding, the only difference is with non-BMP code points. In that case we first separate the code point into a surrogate pair, and encode the surrogates separately in UTF-8, which means that these characters will be encoded in 6 bytes instead of 4. This may seem like a waste, but in this case we can directly decode the string into code units, without the need for special handling of non-BMP code points, which are very rare anyway.
We'll have to keep some of the UTF-8 support though, so I don't think that we should mass rename every definition.
73ede07
to
ce3459c
Compare
Hi, I've finished refactoring the built-ins to use CESU, so I think there won't be any major updates to this patch anymore, only some tweaks. @ruben-ayrapetyan, @egavrin, @zherczeg, please review. Latest measurements with RasPi2:
Binary size:
|
79f496f
to
4994ca4
Compare
{ | ||
bool is_negative = false; | ||
|
||
if (lit_utf8_iterator_peek_next (&iter) == '-') | ||
if (lit_cesu8_peek_next (&date_str_curr_p) == '-') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we simply use *date_str_curr_p ?
I do like the direction. I am happy that so much code can be removed. Please check all & usage, I think most of them are unnecessary. And please avoid for loops when simpler loops are available. |
7ce06fd
to
0764e75
Compare
@zherczeg, thanks for your comments, I've updated the patch. |
Is it needed to add lit_utf8_size_t read_size = lit_read_code_unit_from_cesu8 (current_p, ¤t_char); Regardless style, this PR is OK for me. |
@@ -545,6 +479,29 @@ lit_utf8_string_length (const lit_utf8_byte_t *utf8_buf_p, /**< utf-8 string */ | |||
} /* lit_utf8_string_length */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't, thanks for noting. I'll remove it.
lit_utf8_size_t string1_size, /**< string size */ | ||
const lit_utf8_byte_t *string2_p, /**< utf-8 string */ | ||
lit_utf8_size_t string2_size) /**< string size */ | ||
{ | ||
lit_utf8_iterator_t iter1 = lit_utf8_iterator_create (string1_p, string1_size); | ||
lit_utf8_iterator_t iter2 = lit_utf8_iterator_create (string2_p, string2_size); | ||
lit_utf8_byte_t *string1_pos = (lit_utf8_byte_t *) string1_p; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be useful to introduce CESU iterators?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In some cases it could be useful, but it would also over complicate others. In lots of cases we know the string is only ASCII, or we expect it to be ASCII. If we don't use iterators, then we can easily take advantage of these situations.
LGTM |
@egavrin, I don't think mixing the notation would cause any confusion, since, as you said, CESU-8 is a variant of UTF-8. On the other hand, keeping everything as |
0764e75
to
d84ac63
Compare
@dbatyai we can simply write in documentation that we're implementing "UTF-8 in CESU-8", but we can leave as is. |
JerryScript-DCO-1.0-Signed-off-by: Zsolt Borbély zsborbely.u-szeged@partner.samsung.com JerryScript-DCO-1.0-Signed-off-by: Dániel Bátyai dbatyai.u-szeged@partner.samsung.com
d84ac63
to
23831cf
Compare
JerryScript-DCO-1.0-Signed-off-by: Zsolt Borbély zsborbely.u-szeged@partner.samsung.com JerryScript-DCO-1.0-Signed-off-by: Dániel Bátyai dbatyai.u-szeged@partner.samsung.com
23831cf
to
579b1ed
Compare
@egavrin, I updated the patch to keep |
Good to me. |
Use CESU-8 encoding internally for strings instead of UTF-8. This simplifies handling of strings, since in this case we don't need special handling for surrogate pairs.
This is still work in progress.
Current status (measured on x86 linux):
(+ is better)
(+ is better)