Skip to content

Add API functions to create string from a valid UTF-8 string. #1430

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 16, 2016

Conversation

robertsipka
Copy link
Contributor

I think this patch needs to be followed by another, which contains symbol (functions names, types, ...) renames. Because CESU-8 is a variant of UTF-8, the names was left utf8 and mentioned the CESU-8 only in comments after #616, but I think this is very confusing right now. For example the 'ecma_new_ecma_string_from_utf8' function check that the string is a cesu-8 encoded string at the first step, so it doesn't accept an utf-8 encoded string which contains 4 bytes long characters, and throw an assert.

@robertsipka robertsipka force-pushed the api_utf8 branch 2 times, most recently from 1c51485 to 8210285 Compare November 14, 2016 10:39

dest_buf_size = lit_calculate_cesu8_size_from_utf8 ((lit_utf8_byte_t *) str_p, (lit_utf8_size_t) str_size);

if ((lit_utf8_size_t) str_size != dest_buf_size)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this only handles ASCII strings. We should also allow strings which does not have 4 byte characters.

Copy link
Contributor Author

@robertsipka robertsipka Nov 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will handle all utf8 encoded string.

while (size < buf_size)
{
if ((buf_p[size] & LIT_UTF8_4_BYTE_MASK) == LIT_UTF8_4_BYTE_MARKER)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow we should return with this information.


if ((lit_utf8_size_t) str_size != dest_buf_size)
{
lit_utf8_byte_t dest_buf[dest_buf_size];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not allocate a big buffer on stack. We should pre-allocate a string for this, copy the string there, and set the string flags.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, but I think we should use JMEM_DEFINE_LOCAL_ARRAY also and it wouldn't require too much modification, what is your opinion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zherczeg : I modified this part, and I defined a local array variable and allocated the memory for the array on the heap, please check it.


if ((lit_utf8_size_t) str_size != dest_buf_size)
{
JMEM_DEFINE_LOCAL_ARRAY (dest_buf, dest_buf_size, lit_utf8_byte_t);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't think we should allocate memory here. We could allocate memory for the final string directly. It is a bit more complicated, but I think memory reduction is critcal.

@robertsipka
Copy link
Contributor Author

@zherczeg : Thanks for the advice, I've modified this patch based on your comments.

@@ -135,4 +135,6 @@ ecma_char_t lit_utf8_peek_prev (const lit_utf8_byte_t *);
void lit_utf8_incr (const lit_utf8_byte_t **);
void lit_utf8_decr (const lit_utf8_byte_t **);

bool lit_utf8_contains_four_bytes_unicode (const lit_utf8_byte_t *, lit_utf8_size_t);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I liked the original approach that the size is also computed by this function, and it can be passed to ecma_new_ecma_string_from_utf8_converted_to_cesu8 to avoid double scan.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it was a double scan, so I had an idea based on your suggestion and I made a small change on this checking parts. I hope you like this way too.

@@ -216,6 +216,111 @@ ecma_new_ecma_string_from_utf8 (const lit_utf8_byte_t *string_p, /**< utf-8 stri
} /* ecma_new_ecma_string_from_utf8 */

/**
* Allocate new ecma-string and fill it with characters from the utf8 string
* It converts all 4-bytes long unicode sequence to two 3-bytes long sequence
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allocate a new ecma-string and initialize it from the utf8 string argument.
All 4-bytes long unicode sequences are converted into two 3-bytes long sequences.

@zherczeg
Copy link
Member

The patch is definitely going in the right direction. Some finishing touch is still needed.

@robertsipka robertsipka force-pushed the api_utf8 branch 2 times, most recently from 717f31a to 6a3dec5 Compare November 15, 2016 14:07
@LaszloLango LaszloLango added api Related to the public API feature request Requested feature labels Nov 15, 2016
str_length++;
}

JERRY_ASSERT (size == string_size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is true, there is no need to computing it except for debugging purposes. This would simplify the for loop above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've renamed the misleading variable names.

Copy link
Contributor

@LaszloLango LaszloLango left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zherczeg
Copy link
Member

I think the patch is ok. @LaszloLango do you think we should update the documentation and some api tests in this or a follow-up patch?

@robertsipka
Copy link
Contributor Author

Thanks, I've rebased it with master. @LaszloLango : documentation and some api tests in a follow-up patch is ok for you too?

JerryScript-DCO-1.0-Signed-off-by: Robert Sipka rsipka.uszeged@partner.samsung.com
@LaszloLango
Copy link
Contributor

Yes, follow-up patch is fine. Just don't forget to do it. :)

@zherczeg
Copy link
Member

zherczeg commented Nov 16, 2016

LGTM
Please also reference this patch for the appropriate issue.

@robertsipka robertsipka merged commit ffaca58 into jerryscript-project:master Nov 16, 2016
@robertsipka robertsipka deleted the api_utf8 branch November 24, 2016 08:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api Related to the public API feature request Requested feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants