
Improve utf8_to_utf16 speed for common path #892


Merged (3 commits, Oct 26, 2018)
59 changes: 40 additions & 19 deletions Release/src/utilities/asyncrt_utils.cpp
@@ -347,19 +347,33 @@ const std::error_category & __cdecl linux_category()
#define H_SURROGATE_END 0xDBFF
#define SURROGATE_PAIR_START 0x10000

// Create a dedicated type for characters to avoid the issue
// of different platforms defaulting char to be either signed
// or unsigned.
using UtilCharInternal_t = signed char;


inline size_t count_utf8_to_utf16(const std::string& s)
{
const size_t sSize = s.size();
const char* const sData = s.data();
auto sData = reinterpret_cast<const UtilCharInternal_t* const>(s.data());
size_t result{ sSize };

for (size_t index = 0; index < sSize;)
{
const char c{ sData[index++] };
if ((c & BIT8) == 0)
if( sData[index] > 0 )
Member

This transformation doesn't work on platforms where char is unsigned -- e.g., most non-Windows platforms. Clearly this is missing some test coverage if that was missed...
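
A minimal sketch of the pitfall being described (illustrative only, not part of the PR): whether a comparison like `sData[index] > 0` filters out UTF-8 lead and continuation bytes depends entirely on whether plain char is signed on the target platform.

```cpp
// Sketch: why `c > 0` is not a portable "single-byte code point" test.
#include <iostream>

int main()
{
    const char buf[] = "\xC3\xA4"; // UTF-8 encoding of U+00E4
    const char c = buf[0];

    // Where char is signed, 0xC3 reads back as -61, so (c > 0) is false and
    // the multi-byte handling is reached. Where char is unsigned, it reads
    // back as 195, so (c > 0) is true and the lead byte would wrongly be
    // treated as a 7-bit code point.
    std::cout << std::boolalpha << (c > 0) << '\n';
}
```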

Contributor Author

You're right, I missed that. I fixed the code, still thinking about testability. This would probably require a separate test binary with different char defaults.

Member (@BillyONeal, Oct 16, 2018)

I think the tests might be avoiding this with the __GLIBCXX__ bits? (GCC defaults to char being unsigned)

Contributor Author

__GLIBCXX__ seems only to be used to determine whether the test code should compare byte-wise (using unsigned char) or against the result of a conversion via codecvt_utf8_utf16. I'm not sure about the reason for this, especially since codecvt_utf8_utf16 is deprecated in C++17 (https://en.cppreference.com/w/cpp/locale/codecvt_utf8_utf16).

With regard to the test, I was considering whether there is an elegant way to make use of -funsigned-char and -fsigned-char with GCC (http://gcc.gnu.org/onlinedocs/gcc-4.8.0/gcc/C-Dialect-Options.html) or the /J option in VS (https://msdn.microsoft.com/en-us/library/0d294k5z.aspx). By elegant I mean that we would at least avoid building the tests and their dependencies twice (once signed, once unsigned).
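
For context, a rough sketch of the codecvt comparison path referred to above (the fixture name `conversion` appears in the test diff below; the rest of this snippet is an assumption about how such a reference conversion is set up):

```cpp
// Sketch: reference UTF-8 -> UTF-16 conversion via codecvt_utf8_utf16
// (deprecated in C++17, but still usable as a test oracle).
#include <codecvt>
#include <locale>
#include <string>

int main()
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conversion;
    const std::string input = "\xC3\xA4";                         // UTF-8 bytes
    const std::u16string expected = conversion.from_bytes(input); // reference UTF-16
    // A test would compare utf8_to_utf16(input) against `expected`.
    (void)expected;
}
```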

Member

Hmmmm I stand corrected on non-Windows platforms. Maybe it was just an Android thing. https://twitter.com/ubsanitizer/status/1052343628201259008

A static_assert(std::is_signed<char>::value, "Untested for unsigned char") is probably sufficient then.

Contributor Author

OK, but this would prevent the change from working on unsigned-char platforms without further changes. Or is this just a marker to ensure unsigned-char checks are done before the next release?

Contributor Author

Another thought: we could reinterpret_cast the pointer retrieved from s.data() to signed char (or unsigned char, whichever fits the project better) and avoid the platform dependency. From what I see, this should also be fine with regard to the strict aliasing rules, since we're only working on the buffer via our own pointers and are not using any std::string operations.
What do you think?

Member

That should be OK (because signed char/unsigned char are character types)
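
For illustration (the PR's own version of this is the `UtilCharInternal_t` alias and reinterpret_cast shown in the diff), the proposed cast in isolation, as a sketch only:

```cpp
// Sketch: view a std::string's bytes through signed char so comparisons such
// as `p[i] > 0` behave identically whether the platform's plain char is
// signed or unsigned.
#include <string>

int main()
{
    const std::string s = "\xC3\xA4";                        // two UTF-8 bytes
    auto p = reinterpret_cast<const signed char*>(s.data()); // char -> signed char view
    const bool asciiFastPath = p[0] > 0;                     // false on every platform
    (void)asciiFastPath;
}
```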

{
continue;
// use fast inner loop to skip single byte code points (which are
// expected to be the most frequent)
while ((++index < sSize) && (sData[index] > 0))
;

if (index >= sSize) break;
}

// start special handling for multi-byte code points
const UtilCharInternal_t c{ sData[index++] };

if ((c & BIT7) == 0)
{
throw std::range_error("UTF-8 string character can never start with 10xxxxxx");
@@ -371,7 +385,7 @@ inline size_t count_utf8_to_utf16(const std::string& s)
throw std::range_error("UTF-8 string is missing bytes in character");
}

const char c2{ sData[index++] };
const UtilCharInternal_t c2{ sData[index++] };
if ((c2 & 0xC0) != BIT8)
{
throw std::range_error("UTF-8 continuation byte is missing leading bit mask");
@@ -387,8 +401,8 @@ inline size_t count_utf8_to_utf16(const std::string& s)
throw std::range_error("UTF-8 string is missing bytes in character");
}

const char c2{ sData[index++] };
const char c3{ sData[index++] };
const UtilCharInternal_t c2{ sData[index++] };
const UtilCharInternal_t c3{ sData[index++] };
if (((c2 | c3) & 0xC0) != BIT8)
{
throw std::range_error("UTF-8 continuation byte is missing leading bit mask");
@@ -403,9 +417,9 @@ inline size_t count_utf8_to_utf16(const std::string& s)
throw std::range_error("UTF-8 string is missing bytes in character");
}

const char c2{ sData[index++] };
const char c3{ sData[index++] };
const char c4{ sData[index++] };
const UtilCharInternal_t c2{ sData[index++] };
const UtilCharInternal_t c3{ sData[index++] };
const UtilCharInternal_t c4{ sData[index++] };
if (((c2 | c3 | c4) & 0xC0) != BIT8)
{
throw std::range_error("UTF-8 continuation byte is missing leading bit mask");
@@ -427,21 +441,21 @@ utf16string __cdecl conversions::utf8_to_utf16(const std::string &s)
{
// Save repeated heap allocations, use the length of resulting sequence.
const size_t srcSize = s.size();
const std::string::value_type* const srcData = &s[0];
auto srcData = reinterpret_cast<const UtilCharInternal_t* const>(s.data());
utf16string dest(count_utf8_to_utf16(s), L'\0');
utf16string::value_type* const destData = &dest[0];
size_t destIndex = 0;

for (size_t index = 0; index < srcSize; ++index)
{
std::string::value_type src = srcData[index];
UtilCharInternal_t src = srcData[index];
switch (src & 0xF0)
{
case 0xF0: // 4 byte character, 0x10000 to 0x10FFFF
{
const char c2{ srcData[++index] };
const char c3{ srcData[++index] };
const char c4{ srcData[++index] };
const UtilCharInternal_t c2{ srcData[++index] };
const UtilCharInternal_t c3{ srcData[++index] };
const UtilCharInternal_t c4{ srcData[++index] };
uint32_t codePoint = ((src & LOW_3BITS) << 18) | ((c2 & LOW_6BITS) << 12) | ((c3 & LOW_6BITS) << 6) | (c4 & LOW_6BITS);
if (codePoint >= SURROGATE_PAIR_START)
{
@@ -464,20 +478,27 @@ utf16string __cdecl conversions::utf8_to_utf16(const std::string &s)
break;
case 0xE0: // 3 byte character, 0x800 to 0xFFFF
{
const char c2{ srcData[++index] };
const char c3{ srcData[++index] };
const UtilCharInternal_t c2{ srcData[++index] };
const UtilCharInternal_t c3{ srcData[++index] };
destData[destIndex++] = static_cast<utf16string::value_type>(((src & LOW_4BITS) << 12) | ((c2 & LOW_6BITS) << 6) | (c3 & LOW_6BITS));
}
break;
case 0xD0: // 2 byte character, 0x80 to 0x7FF
case 0xC0:
{
const char c2{ srcData[++index] };
const UtilCharInternal_t c2{ srcData[++index] };
destData[destIndex++] = static_cast<utf16string::value_type>(((src & LOW_5BITS) << 6) | (c2 & LOW_6BITS));
}
break;
default: // single byte character, 0x0 to 0x7F
destData[destIndex++] = static_cast<utf16string::value_type>(src);
// try to use a fast inner loop for following single byte characters,
// since they are quite probable
do
{
destData[destIndex++] = static_cast<utf16string::value_type>(srcData[index++]);
} while (index < srcSize && srcData[index] > 0);
// adjust index since it will be incremented by the for loop
--index;
}
}
return dest;
25 changes: 25 additions & 0 deletions Release/tests/functional/utils/strings.cpp
@@ -205,6 +205,31 @@ TEST(utf8_to_utf16)
#else
VERIFY_ARE_EQUAL(conversion.from_bytes(input), result);
#endif


// 1 byte character followed by 4 byte character
input.clear();
input.push_back( 51u); // 00110011
// U+100000
input.push_back(244u); // 11110100
input.push_back(128u); // 10000000
input.push_back(128u); // 10000000
input.push_back(128u); // 10000000
// U+10FFFF
input.push_back(244u); // 11110100
input.push_back(143u); // 10001111
input.push_back(191u); // 10111111
input.push_back(191u); // 10111111
result = utility::conversions::utf8_to_utf16(input);
#if defined(__GLIBCXX__)
VERIFY_ARE_EQUAL(51, result[0]);
VERIFY_ARE_EQUAL(56256, result[1]);
VERIFY_ARE_EQUAL(56320, result[2]);
VERIFY_ARE_EQUAL(56319, result[3]);
VERIFY_ARE_EQUAL(57343, result[4]);
#else
VERIFY_ARE_EQUAL(conversion.from_bytes(input), result);
#endif
}

TEST(utf16_to_utf8_errors)
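
As a cross-check of the expected values in the new test (a hedged sketch, not part of the PR), the UTF-16 surrogate pair for a supplementary code point is derived as follows:

```cpp
// Sketch: derive the surrogate pair for a code point at or above 0x10000.
#include <cassert>
#include <cstdint>

int main()
{
    const uint32_t codePoint = 0x10FFFF;        // last sequence in the test input
    const uint32_t v = codePoint - 0x10000;     // 20-bit offset
    const uint16_t high = 0xD800 + (v >> 10);   // 0xDBFF == 56319
    const uint16_t low  = 0xDC00 + (v & 0x3FF); // 0xDFFF == 57343
    assert(high == 56319 && low == 57343);      // matches the VERIFY_ARE_EQUAL values
}
```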