-
Notifications
You must be signed in to change notification settings - Fork 683
Add API functions to create string from a valid UTF-8 string. #1430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
1c51485
to
8210285
Compare
|
||
dest_buf_size = lit_calculate_cesu8_size_from_utf8 ((lit_utf8_byte_t *) str_p, (lit_utf8_size_t) str_size); | ||
|
||
if ((lit_utf8_size_t) str_size != dest_buf_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this only handles ASCII strings. We should also allow strings which does not have 4 byte characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will handle all utf8 encoded string.
while (size < buf_size) | ||
{ | ||
if ((buf_p[size] & LIT_UTF8_4_BYTE_MASK) == LIT_UTF8_4_BYTE_MARKER) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somehow we should return with this information.
|
||
if ((lit_utf8_size_t) str_size != dest_buf_size) | ||
{ | ||
lit_utf8_byte_t dest_buf[dest_buf_size]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not allocate a big buffer on stack. We should pre-allocate a string for this, copy the string there, and set the string flags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, but I think we should use JMEM_DEFINE_LOCAL_ARRAY also and it wouldn't require too much modification, what is your opinion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zherczeg : I modified this part, and I defined a local array variable and allocated the memory for the array on the heap, please check it.
8210285
to
c02585b
Compare
|
||
if ((lit_utf8_size_t) str_size != dest_buf_size) | ||
{ | ||
JMEM_DEFINE_LOCAL_ARRAY (dest_buf, dest_buf_size, lit_utf8_byte_t); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't think we should allocate memory here. We could allocate memory for the final string directly. It is a bit more complicated, but I think memory reduction is critcal.
c02585b
to
a7a9e4e
Compare
@zherczeg : Thanks for the advice, I've modified this patch based on your comments. |
@@ -135,4 +135,6 @@ ecma_char_t lit_utf8_peek_prev (const lit_utf8_byte_t *); | |||
void lit_utf8_incr (const lit_utf8_byte_t **); | |||
void lit_utf8_decr (const lit_utf8_byte_t **); | |||
|
|||
bool lit_utf8_contains_four_bytes_unicode (const lit_utf8_byte_t *, lit_utf8_size_t); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I liked the original approach that the size is also computed by this function, and it can be passed to ecma_new_ecma_string_from_utf8_converted_to_cesu8 to avoid double scan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it was a double scan, so I had an idea based on your suggestion and I made a small change on this checking parts. I hope you like this way too.
@@ -216,6 +216,111 @@ ecma_new_ecma_string_from_utf8 (const lit_utf8_byte_t *string_p, /**< utf-8 stri | |||
} /* ecma_new_ecma_string_from_utf8 */ | |||
|
|||
/** | |||
* Allocate new ecma-string and fill it with characters from the utf8 string | |||
* It converts all 4-bytes long unicode sequence to two 3-bytes long sequence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allocate a new ecma-string and initialize it from the utf8 string argument.
All 4-bytes long unicode sequences are converted into two 3-bytes long sequences.
The patch is definitely going in the right direction. Some finishing touch is still needed. |
717f31a
to
6a3dec5
Compare
str_length++; | ||
} | ||
|
||
JERRY_ASSERT (size == string_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is true, there is no need to computing it except for debugging purposes. This would simplify the for loop above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've renamed the misleading variable names.
6a3dec5
to
c851e1a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I think the patch is ok. @LaszloLango do you think we should update the documentation and some api tests in this or a follow-up patch? |
c851e1a
to
cc1c4c8
Compare
Thanks, I've rebased it with master. @LaszloLango : documentation and some api tests in a follow-up patch is ok for you too? |
JerryScript-DCO-1.0-Signed-off-by: Robert Sipka rsipka.uszeged@partner.samsung.com
cc1c4c8
to
6f9353a
Compare
Yes, follow-up patch is fine. Just don't forget to do it. :) |
LGTM |
I think this patch needs to be followed by another, which contains symbol (functions names, types, ...) renames. Because CESU-8 is a variant of UTF-8, the names was left utf8 and mentioned the CESU-8 only in comments after #616, but I think this is very confusing right now. For example the 'ecma_new_ecma_string_from_utf8' function check that the string is a cesu-8 encoded string at the first step, so it doesn't accept an utf-8 encoded string which contains 4 bytes long characters, and throw an assert.