Skip to content

Change internal encoding of strings to CESU-8 #616

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 20, 2015

Conversation

dbatyai
Copy link
Member

@dbatyai dbatyai commented Sep 3, 2015

Use CESU-8 encoding internally for strings instead of UTF-8. This simplifies handling of strings, since in this case we don't need special handling for surrogate pairs.
This is still work in progress.

Current status (measured on x86 linux):

Benchmark RSS
(+ is better)
Perf
(+ is better)
3d-cube.js 156->152 (2.564) 0.429->0.417 (2.797)
access-binary-trees.js 96->92 (4.167) 0.382->0.369 (3.403)
access-fannkuch.js 44->48 (-9.091) 1.111->1.053 (5.221)
access-nbody.js 68->68 (0.000) 0.617->0.565 (8.428)
bitops-3bit-bits-in-byte.js 44->44 (0.000) 0.407->0.39 (4.177)
bitops-bits-in-byte.js 40->36 (10.000) 0.519->0.509 (1.927)
bitops-bitwise-and.js 32->36 (-12.500) 0.438->0.433 (1.142)
controlflow-recursive.js 292->288 (1.370) 0.384->0.384 (0.000)
date-format-xparb.js 108->104 (3.704) 0.422->0.26 (38.389)
math-cordic.js 52->48 (7.692) 0.491->0.463 (5.703)
math-partial-sums.js 44->40 (9.091) 0.294->0.285 (3.061)
math-spectral-norm.js 52->56 (-7.692) 0.392->0.379 (3.316)
string-fasta.js 64->60 (6.250) 3.79->2.172 (42.691)
Geometric mean: RSS reduction: 1.4238% Speed up: 10.505%

@dbatyai dbatyai added enhancement An improvement performance Affects performance labels Sep 3, 2015
@egavrin
Copy link
Contributor

egavrin commented Sep 4, 2015

fyi, measurements with RaspPi2. Please, be carefull with address space randomization and measurements on x86. It's possible to get great perf gain on x86, and at the same time degradation on ARM.

Benchmark RSS
(+ is better)
Perf
(+ is better)
3d-cube.js 132-> 132 (0.000) 3.7904-> 3.756 (0.908)
access-binary-trees.js 88-> 88 (0.000) 2.8832->2.8456 (1.304)
access-fannkuch.js 40-> 40 (0.000) 10.1056->9.8144 (2.882)
access-nbody.js 64-> 64 (0.000) 4.7656->4.6792 (1.813)
bitops-3bit-bits-in-byte.js 36-> 36 (0.000) 3.3376->3.2632 (2.229)
bitops-bits-in-byte.js 36-> 36 (0.000) 4.468->4.4256 (0.949)
bitops-bitwise-and.js 32-> 32 (0.000) 4.3424->4.2184 (2.856)
controlflow-recursive.js 220-> 220 (0.000) 3.3696->3.2576 (3.324)
date-format-xparb.js 100-> 100 (0.000) 3.0168->1.8536 (38.557)
math-cordic.js 48-> 48 (0.000) 4.9792->4.8968 (1.655)
math-partial-sums.js 40-> 40 (0.000) 2.7184-> 2.644 (2.737)
math-spectral-norm.js 52-> 52 (0.000) 3.3016-> 3.256 (1.381)
string-fasta.js 52-> 52 (0.000) 27.4368->19.016 (30.692)
Geometric mean: RSS reduction: 0% Speed up: 7.9482%

PS Nice results ^_^

@@ -94,6 +94,16 @@ typedef ecma_char_t *ecma_char_ptr_t;
#define LIT_UTF8_MAX_BYTES_IN_CODE_POINT (4)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you, please, clarify, do we need the define after switching to CESU-8?

Is it correct that CESU-8 uses UTF-8 encoding scheme, but only for 1-byte and 2-byte code points?
Shouldn't we rename all UTF8 / utf8 definitions to CESU8 / cesu8?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay.
You are correct, CESU uses UTF-8 encoding, the only difference is with non-BMP code points. In that case we first separate the code point into a surrogate pair, and encode the surrogates separately in UTF-8, which means that these characters will be encoded in 6 bytes instead of 4. This may seem like a waste, but in this case we can directly decode the string into code units, without the need for special handling of non-BMP code points, which are very rare anyway.

We'll have to keep some of the UTF-8 support though, so I don't think that we should mass rename every definition.

@dbatyai dbatyai force-pushed the utf8_to_cesu8 branch 3 times, most recently from 73ede07 to ce3459c Compare September 16, 2015 13:47
@dbatyai
Copy link
Member Author

dbatyai commented Sep 17, 2015

Hi, I've finished refactoring the built-ins to use CESU, so I think there won't be any major updates to this patch anymore, only some tweaks. @ruben-ayrapetyan, @egavrin, @zherczeg, please review.

Latest measurements with RasPi2:

Benchmark RSS
(+ is better)
Perf
(+ is better)
3d-cube.js 132 -> 128 (3.0303) 3.7 -> 3.696 (0.1081)
access-binary-trees.js 84 -> 88 (-4.762) 2.846 -> 2.828 (0.6325)
access-fannkuch.js 48 -> 44 (8.3333) 9.874 -> 9.874 (0)
access-nbody.js 64 -> 64 (0) 4.674 -> 4.632 (0.8986)
bitops-3bit-bits-in-byte.js 36 -> 36 (0) 3.088 -> 3.152 (-2.073)
bitops-bits-in-byte.js 36 -> 36 (0) 4.224 -> 4.308 (-1.989)
bitops-bitwise-and.js 28 -> 28 (0) 4.06 -> 4.084 (-0.591)
controlflow-recursive.js 216 -> 212 (1.8519) 3.138 -> 3.12 (0.5736)
date-format-xparb.js 100 -> 100 (0) 3.042 -> 1.798 (40.8941)
math-cordic.js 44 -> 44 (0) 4.166 -> 4.172 (-0.144)
math-partial-sums.js 36 -> 36 (0) 2.508 -> 2.55 (-1.675)
math-spectral-norm.js 48 -> 52 (-8.333) 3.178 -> 3.178 (0)
string-fasta.js 56 -> 52 (7.1429) 26.888 -> 9.534 (64.5418)
Geometric mean: RSS reduction: 0.6443% Speed up: 11.0395%

Binary size:

  • Before: 198336
  • After: 195816

@dbatyai dbatyai force-pushed the utf8_to_cesu8 branch 2 times, most recently from 79f496f to 4994ca4 Compare September 17, 2015 13:08
{
bool is_negative = false;

if (lit_utf8_iterator_peek_next (&iter) == '-')
if (lit_cesu8_peek_next (&date_str_curr_p) == '-')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we simply use *date_str_curr_p ?

@zherczeg
Copy link
Member

I do like the direction. I am happy that so much code can be removed. Please check all & usage, I think most of them are unnecessary. And please avoid for loops when simpler loops are available.

@dbatyai dbatyai force-pushed the utf8_to_cesu8 branch 3 times, most recently from 7ce06fd to 0764e75 Compare September 23, 2015 08:18
@dbatyai
Copy link
Member Author

dbatyai commented Sep 23, 2015

@zherczeg, thanks for your comments, I've updated the patch.

@egavrin
Copy link
Contributor

egavrin commented Sep 23, 2015

Is it needed to add cesu8 instead of utf8? As far as I understand CESU-8 is a variant of UTF-8, so I think we can leave utf8 and mention CESU-8 only in comments.
In some places there are mix of utf8 and cesu8:

lit_utf8_size_t read_size = lit_read_code_unit_from_cesu8 (current_p, &current_char);

Regardless style, this PR is OK for me.

@egavrin egavrin assigned dbatyai and unassigned ruben-ayrapetyan Sep 23, 2015
@@ -545,6 +479,29 @@ lit_utf8_string_length (const lit_utf8_byte_t *utf8_buf_p, /**< utf-8 string */
} /* lit_utf8_string_length */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't, thanks for noting. I'll remove it.

lit_utf8_size_t string1_size, /**< string size */
const lit_utf8_byte_t *string2_p, /**< utf-8 string */
lit_utf8_size_t string2_size) /**< string size */
{
lit_utf8_iterator_t iter1 = lit_utf8_iterator_create (string1_p, string1_size);
lit_utf8_iterator_t iter2 = lit_utf8_iterator_create (string2_p, string2_size);
lit_utf8_byte_t *string1_pos = (lit_utf8_byte_t *) string1_p;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be useful to introduce CESU iterators?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some cases it could be useful, but it would also over complicate others. In lots of cases we know the string is only ASCII, or we expect it to be ASCII. If we don't use iterators, then we can easily take advantage of these situations.

@zherczeg
Copy link
Member

LGTM

@dbatyai
Copy link
Member Author

dbatyai commented Oct 12, 2015

@egavrin, I don't think mixing the notation would cause any confusion, since, as you said, CESU-8 is a variant of UTF-8. On the other hand, keeping everything as utf8 would be misleading in my opinion, since in that case some functions would not behave as implied by their names, while others would. Of course, lots of comments could help with this, but still, I think naming functions by their behavior is much clearer.

@dbatyai dbatyai assigned egavrin and unassigned dbatyai Oct 12, 2015
@egavrin
Copy link
Contributor

egavrin commented Oct 14, 2015

@dbatyai we can simply write in documentation that we're implementing "UTF-8 in CESU-8", but we can leave as is.

JerryScript-DCO-1.0-Signed-off-by: Zsolt Borbély zsborbely.u-szeged@partner.samsung.com
JerryScript-DCO-1.0-Signed-off-by: Dániel Bátyai dbatyai.u-szeged@partner.samsung.com
JerryScript-DCO-1.0-Signed-off-by: Zsolt Borbély zsborbely.u-szeged@partner.samsung.com
JerryScript-DCO-1.0-Signed-off-by: Dániel Bátyai dbatyai.u-szeged@partner.samsung.com
@dbatyai
Copy link
Member Author

dbatyai commented Oct 16, 2015

@egavrin, I updated the patch to keep utf8.

@egavrin
Copy link
Contributor

egavrin commented Oct 20, 2015

Good to me. make push

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement An improvement performance Affects performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants