Change internal encoding of strings to CESU-8 #616

dbatyai · 2015-09-03T12:36:23Z

Use CESU-8 encoding internally for strings instead of UTF-8. This simplifies handling of strings, since in this case we don't need special handling for surrogate pairs.
This is still work in progress.
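
A minimal sketch of why the surrogate-pair special case goes away, assuming well-formed CESU-8 input: every UTF-16 code unit is encoded as its own 1 to 3 byte sequence, so the string length in code units is simply the number of bytes that are not continuation bytes. The name `cesu8_length_in_code_units` is hypothetical and is not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the patch): counts UTF-16 code units
 * in a well-formed CESU-8 buffer. */
static size_t
cesu8_length_in_code_units (const uint8_t *buf_p, /* CESU-8 bytes */
                            size_t size)          /* size in bytes */
{
  size_t length = 0;

  for (size_t i = 0; i < size; i++)
  {
    /* Continuation bytes look like 10xxxxxx; any other byte starts a code unit. */
    if ((buf_p[i] & 0xC0) != 0x80)
    {
      length++;
    }
  }

  return length;
}
```

With plain UTF-8 the same loop would count a 4-byte sequence such as U+1F600 as one unit, while JavaScript reports a `length` of two for it, so 4-byte lead bytes would need a special case.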

Current status (measured on x86 linux):

Benchmark	RSS (+ is better)	Perf (+ is better)
3d-cube.js	156->152 (2.564)	0.429->0.417 (2.797)
access-binary-trees.js	96->92 (4.167)	0.382->0.369 (3.403)
access-fannkuch.js	44->48 (-9.091)	1.111->1.053 (5.221)
access-nbody.js	68->68 (0.000)	0.617->0.565 (8.428)
bitops-3bit-bits-in-byte.js	44->44 (0.000)	0.407->0.39 (4.177)
bitops-bits-in-byte.js	40->36 (10.000)	0.519->0.509 (1.927)
bitops-bitwise-and.js	32->36 (-12.500)	0.438->0.433 (1.142)
controlflow-recursive.js	292->288 (1.370)	0.384->0.384 (0.000)
date-format-xparb.js	108->104 (3.704)	0.422->0.26 (38.389)
math-cordic.js	52->48 (7.692)	0.491->0.463 (5.703)
math-partial-sums.js	44->40 (9.091)	0.294->0.285 (3.061)
math-spectral-norm.js	52->56 (-7.692)	0.392->0.379 (3.316)
string-fasta.js	64->60 (6.250)	3.79->2.172 (42.691)
Geometric mean:	RSS reduction: 1.4238%	Speed up: 10.505%

egavrin · 2015-09-04T10:57:55Z

FYI, here are measurements on a Raspberry Pi 2. Please be careful with address space randomization when measuring on x86. It is possible to see a large perf gain on x86 and, at the same time, a degradation on ARM.

Benchmark	RSS (+ is better)	Perf (+ is better)
3d-cube.js	132->132 (0.000)	3.7904->3.756 (0.908)
access-binary-trees.js	88->88 (0.000)	2.8832->2.8456 (1.304)
access-fannkuch.js	40->40 (0.000)	10.1056->9.8144 (2.882)
access-nbody.js	64->64 (0.000)	4.7656->4.6792 (1.813)
bitops-3bit-bits-in-byte.js	36->36 (0.000)	3.3376->3.2632 (2.229)
bitops-bits-in-byte.js	36->36 (0.000)	4.468->4.4256 (0.949)
bitops-bitwise-and.js	32->32 (0.000)	4.3424->4.2184 (2.856)
controlflow-recursive.js	220->220 (0.000)	3.3696->3.2576 (3.324)
date-format-xparb.js	100->100 (0.000)	3.0168->1.8536 (38.557)
math-cordic.js	48->48 (0.000)	4.9792->4.8968 (1.655)
math-partial-sums.js	40->40 (0.000)	2.7184->2.644 (2.737)
math-spectral-norm.js	52->52 (0.000)	3.3016->3.256 (1.381)
string-fasta.js	52->52 (0.000)	27.4368->19.016 (30.692)
Geometric mean:	RSS reduction: 0%	Speed up: 7.9482%

PS Nice results ^_^

ruben-ayrapetyan · 2015-09-04T11:12:52Z

jerry-core/lit/lit-globals.h

@@ -94,6 +94,16 @@ typedef ecma_char_t *ecma_char_ptr_t;
 #define LIT_UTF8_MAX_BYTES_IN_CODE_POINT (4)


Could you please clarify whether we still need this define after switching to CESU-8?

Is it correct that CESU-8 uses the UTF-8 encoding scheme, but only for 1-byte and 2-byte code points?
Shouldn't we rename all UTF8 / utf8 definitions to CESU8 / cesu8?

dbatyai · 2015-09-16

Sorry for the delay.
You are correct: CESU-8 uses the UTF-8 encoding scheme, and the only difference is in how non-BMP code points are handled. For those, we first split the code point into a surrogate pair and then encode each surrogate separately as a 3-byte UTF-8 sequence, so these characters take 6 bytes instead of 4. This may seem like a waste, but it lets us decode the string directly into code units without any special handling for non-BMP code points, which are very rare anyway.

We'll have to keep some of the UTF-8 support though, so I don't think that we should mass rename every definition.
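
The encoding described in this reply can be sketched as follows. This is an illustration under stated assumptions, not code from the patch: the names `encode_code_unit` and `cesu8_encode_code_point` are hypothetical, the input is assumed to be a valid code point (at most 0x10FFFF), and the output buffer is assumed to have room for 6 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one 16-bit code unit as 1-3 bytes using the ordinary UTF-8 bit layout.
 * Hypothetical helper for illustration only. */
static size_t
encode_code_unit (uint16_t unit, uint8_t *out_p)
{
  if (unit < 0x80)
  {
    out_p[0] = (uint8_t) unit;
    return 1;
  }

  if (unit < 0x800)
  {
    out_p[0] = (uint8_t) (0xC0 | (unit >> 6));
    out_p[1] = (uint8_t) (0x80 | (unit & 0x3F));
    return 2;
  }

  out_p[0] = (uint8_t) (0xE0 | (unit >> 12));
  out_p[1] = (uint8_t) (0x80 | ((unit >> 6) & 0x3F));
  out_p[2] = (uint8_t) (0x80 | (unit & 0x3F));
  return 3;
}

/* Encode a code point as CESU-8: BMP code points are encoded exactly as in
 * UTF-8, non-BMP code points become two 3-byte sequences (6 bytes total). */
static size_t
cesu8_encode_code_point (uint32_t code_point, uint8_t *out_p)
{
  if (code_point < 0x10000)
  {
    return encode_code_unit ((uint16_t) code_point, out_p);
  }

  /* Split into a UTF-16 surrogate pair. */
  code_point -= 0x10000;
  uint16_t high = (uint16_t) (0xD800 | (code_point >> 10));
  uint16_t low = (uint16_t) (0xDC00 | (code_point & 0x3FF));

  size_t size = encode_code_unit (high, out_p);
  return size + encode_code_unit (low, out_p + size);
}
```

For example, U+1F600 splits into the surrogates 0xD83D and 0xDE00 and is stored as the six bytes ED A0 BD ED B8 80, whereas UTF-8 would store it as the four bytes F0 9F 98 80.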

dbatyai · 2015-09-17T12:46:55Z

Hi, I've finished refactoring the built-ins to use CESU, so I think there won't be any major updates to this patch anymore, only some tweaks. @ruben-ayrapetyan, @egavrin, @zherczeg, please review.

Latest measurements with RasPi2:

Benchmark	RSS (+ is better)	Perf (+ is better)
3d-cube.js	132 -> 128 (3.0303)	3.7 -> 3.696 (0.1081)
access-binary-trees.js	84 -> 88 (-4.762)	2.846 -> 2.828 (0.6325)
access-fannkuch.js	48 -> 44 (8.3333)	9.874 -> 9.874 (0)
access-nbody.js	64 -> 64 (0)	4.674 -> 4.632 (0.8986)
bitops-3bit-bits-in-byte.js	36 -> 36 (0)	3.088 -> 3.152 (-2.073)
bitops-bits-in-byte.js	36 -> 36 (0)	4.224 -> 4.308 (-1.989)
bitops-bitwise-and.js	28 -> 28 (0)	4.06 -> 4.084 (-0.591)
controlflow-recursive.js	216 -> 212 (1.8519)	3.138 -> 3.12 (0.5736)
date-format-xparb.js	100 -> 100 (0)	3.042 -> 1.798 (40.8941)
math-cordic.js	44 -> 44 (0)	4.166 -> 4.172 (-0.144)
math-partial-sums.js	36 -> 36 (0)	2.508 -> 2.55 (-1.675)
math-spectral-norm.js	48 -> 52 (-8.333)	3.178 -> 3.178 (0)
string-fasta.js	56 -> 52 (7.1429)	26.888 -> 9.534 (64.5418)
Geometric mean:	RSS reduction: 0.6443%	Speed up: 11.0395%

Binary size:

Before: 198336
After: 195816

zherczeg · 2015-09-18T07:00:09Z

jerry-core/ecma/builtin-objects/ecma-builtin-date.cpp

        {
          bool is_negative = false;

-          if (lit_utf8_iterator_peek_next (&iter) == '-')
+          if (lit_cesu8_peek_next (&date_str_curr_p) == '-')


Can't we simply use *date_str_curr_p ?
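
The suggestion relies on a property shared by UTF-8 and CESU-8: bytes below 0x80 only ever encode ASCII characters and never occur inside a multi-byte sequence, so comparing the current byte with an ASCII literal is safe. A minimal sketch of that idea follows; `parse_signed_decimal` is a hypothetical helper, not a function from the patch, and it does not guard against integer overflow.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: parse an optional '-' followed by decimal digits
 * directly from the byte pointer and advance it past the parsed text. */
static int32_t
parse_signed_decimal (const uint8_t **str_pp, /* in/out: current position */
                      const uint8_t *end_p)   /* one past the last byte */
{
  const uint8_t *curr_p = *str_pp;
  bool is_negative = false;
  int32_t value = 0;

  /* A plain dereference is enough: '-' is ASCII, so it is a single byte. */
  if (curr_p < end_p && *curr_p == '-')
  {
    is_negative = true;
    curr_p++;
  }

  while (curr_p < end_p && *curr_p >= '0' && *curr_p <= '9')
  {
    value = value * 10 + (*curr_p - '0');
    curr_p++;
  }

  *str_pp = curr_p;
  return is_negative ? -value : value;
}
```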

zherczeg · 2015-09-18T07:35:43Z

I do like the direction. I am happy that so much code can be removed. Please check every use of the & operator; I think most of them are unnecessary. And please avoid for loops when simpler loops are available.

dbatyai · 2015-09-23T08:32:19Z

@zherczeg, thanks for your comments, I've updated the patch.

egavrin · 2015-09-23T09:51:45Z

Is it necessary to introduce cesu8 in place of utf8? As far as I understand, CESU-8 is a variant of UTF-8, so I think we can keep utf8 in the names and mention CESU-8 only in comments.
In some places there is a mix of utf8 and cesu8:

lit_utf8_size_t read_size = lit_read_code_unit_from_cesu8 (current_p, &current_char);

Regardless of style, this PR is OK for me.

ruben-ayrapetyan · 2015-09-28T12:11:37Z

jerry-core/lit/lit-strings.cpp

@@ -545,6 +479,29 @@ lit_utf8_string_length (const lit_utf8_byte_t *utf8_buf_p, /**< utf-8 string */
 } /* lit_utf8_string_length */


Do we need this function?

dbatyai · 2015-10-12

We don't, thanks for noting. I'll remove it.

ruben-ayrapetyan · 2015-09-28T12:13:16Z

jerry-core/lit/lit-strings.cpp

                                          lit_utf8_size_t string1_size, /**< string size */
                                          const lit_utf8_byte_t *string2_p, /**< utf-8 string */
                                          lit_utf8_size_t string2_size) /**< string size */
 {
-  lit_utf8_iterator_t iter1 = lit_utf8_iterator_create (string1_p, string1_size);
-  lit_utf8_iterator_t iter2 = lit_utf8_iterator_create (string2_p, string2_size);
+  lit_utf8_byte_t *string1_pos = (lit_utf8_byte_t *) string1_p;


Wouldn't it be useful to introduce CESU iterators?

dbatyai · 2015-10-12

In some cases it could be useful, but it would also overcomplicate others. In many cases we know that the string is ASCII-only, or we expect it to be. If we don't use iterators, we can easily take advantage of these situations.
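
A sketch of what iterator-free decoding with an ASCII fast path might look like, assuming well-formed CESU-8 input and an output buffer with room for one code unit per input byte. The function name `cesu8_to_code_units` is hypothetical and is not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode a well-formed CESU-8 buffer into UTF-16 code
 * units using a raw pointer instead of an iterator object. */
static size_t
cesu8_to_code_units (const uint8_t *str_p, /* CESU-8 bytes */
                     size_t size,          /* size in bytes */
                     uint16_t *out_p)      /* output, at least `size` entries */
{
  const uint8_t *end_p = str_p + size;
  size_t count = 0;

  while (str_p < end_p)
  {
    if (*str_p < 0x80)
    {
      /* ASCII fast path: the byte is the code unit. */
      out_p[count++] = *str_p++;
    }
    else if ((*str_p & 0xE0) == 0xC0)
    {
      /* 2-byte sequence: 110xxxxx 10xxxxxx */
      out_p[count++] = (uint16_t) (((str_p[0] & 0x1F) << 6) | (str_p[1] & 0x3F));
      str_p += 2;
    }
    else
    {
      /* 3-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx (includes surrogates) */
      out_p[count++] = (uint16_t) (((str_p[0] & 0x0F) << 12)
                                   | ((str_p[1] & 0x3F) << 6)
                                   | (str_p[2] & 0x3F));
      str_p += 3;
    }
  }

  return count;
}
```

Because CESU-8 never produces 4-byte sequences, the loop has only three cases, and a caller that already knows the string is ASCII-only can skip decoding entirely and index the bytes directly.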

zherczeg · 2015-09-29T08:29:57Z

LGTM

dbatyai · 2015-10-12T10:37:04Z

@egavrin, I don't think mixing the notation would cause any confusion, since, as you said, CESU-8 is a variant of UTF-8. On the other hand, keeping everything as utf8 would be misleading in my opinion, since in that case some functions would not behave as implied by their names, while others would. Of course, lots of comments could help with this, but still, I think naming functions by their behavior is much clearer.

egavrin · 2015-10-14T11:18:49Z

@dbatyai, we can simply write in the documentation that we're implementing "UTF-8 in CESU-8", but we can leave it as is.

JerryScript-DCO-1.0-Signed-off-by: Zsolt Borbély zsborbely.u-szeged@partner.samsung.com JerryScript-DCO-1.0-Signed-off-by: Dániel Bátyai dbatyai.u-szeged@partner.samsung.com

dbatyai · 2015-10-16T08:30:44Z

@egavrin, I updated the patch to keep utf8.

egavrin · 2015-10-20T06:49:43Z

Good to me. make push

dbatyai added the enhancement and performance labels Sep 3, 2015

dbatyai added this to the Engine optimization & enhancement milestone Sep 3, 2015

dbatyai force-pushed the utf8_to_cesu8 branch from 049d399 to 34369e3 September 4, 2015 09:00

ruben-ayrapetyan reviewed Sep 4, 2015
dbatyai force-pushed the utf8_to_cesu8 branch 3 times, most recently from 73ede07 to ce3459c September 16, 2015 13:47

dbatyai force-pushed the utf8_to_cesu8 branch 2 times, most recently from 79f496f to 4994ca4 September 17, 2015 13:08

zherczeg reviewed Sep 18, 2015
dbatyai force-pushed the utf8_to_cesu8 branch 3 times, most recently from 7ce06fd to 0764e75 September 23, 2015 08:18

dbatyai assigned ruben-ayrapetyan Sep 23, 2015

egavrin assigned dbatyai and unassigned ruben-ayrapetyan Sep 23, 2015

ruben-ayrapetyan reviewed Sep 28, 2015
dbatyai force-pushed the utf8_to_cesu8 branch from 0764e75 to d84ac63 October 12, 2015 11:09

dbatyai assigned egavrin and unassigned dbatyai Oct 12, 2015

Change internal encoding of strings to CESU-8

dcd610b

JerryScript-DCO-1.0-Signed-off-by: Zsolt Borbély zsborbely.u-szeged@partner.samsung.com JerryScript-DCO-1.0-Signed-off-by: Dániel Bátyai dbatyai.u-szeged@partner.samsung.com

dbatyai force-pushed the utf8_to_cesu8 branch from d84ac63 to 23831cf October 15, 2015 13:45

Refactor builtins to handle CESU-8 encoded strings.

579b1ed

JerryScript-DCO-1.0-Signed-off-by: Zsolt Borbély zsborbely.u-szeged@partner.samsung.com JerryScript-DCO-1.0-Signed-off-by: Dániel Bátyai dbatyai.u-szeged@partner.samsung.com

dbatyai force-pushed the utf8_to_cesu8 branch from 23831cf to 579b1ed October 15, 2015 14:01

egavrin assigned dbatyai and unassigned egavrin Oct 20, 2015

dbatyai merged commit 579b1ed into jerryscript-project:master Oct 20, 2015

dbatyai mentioned this pull request Nov 19, 2015

Q. current internal encoding for lit_utf8_byte_t * ? #722

Closed

lvidacs mentioned this pull request Dec 1, 2015

CESU-8 indexing problem in String index_of helper #757

Closed

dbatyai deleted the utf8_to_cesu8 branch March 1, 2016 11:50

zherczeg mentioned this pull request Jul 20, 2016

Nominating Dániel Bátyai (dbatyai) for JerryScript Maintainer status #1220

Closed

martijnthe mentioned this pull request Aug 15, 2016

Rename symbols from ...utf8... to ...cesu8... #1268

Closed

robertsipka mentioned this pull request Nov 14, 2016

Add API functions to create string from a valid UTF-8 string. #1430

Merged
