Closed
Description
How about adding these three lines (or a better rewrite of them) to the function code_point_length()?
if (len >= 2 && (static_cast<unsigned char>(*(begin + 1)) >> 6) != 0x2) len = 1;
if (len >= 3 && (static_cast<unsigned char>(*(begin + 2)) >> 6) != 0x2) len = 1;
if (len == 4 && (static_cast<unsigned char>(*(begin + 3)) >> 6) != 0x2) len = 1;
Based on the 9.0 source code, the function becomes:
template <typename Char>
FMT_CONSTEXPR auto code_point_length(const Char* begin) -> int {
  if (const_check(sizeof(Char) != 1)) return 1;
  auto lengths =
      "\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\0\0\0\0\0\0\0\0\2\2\2\2\3\3\4";
  int len = lengths[static_cast<unsigned char>(*begin) >> 3];
  if (len >= 2 && (static_cast<unsigned char>(*(begin + 1)) >> 6) != 0x2) len = 1;
  if (len >= 3 && (static_cast<unsigned char>(*(begin + 2)) >> 6) != 0x2) len = 1;
  if (len == 4 && (static_cast<unsigned char>(*(begin + 3)) >> 6) != 0x2) len = 1;

  // Compute the pointer to the next character early so that the next
  // iteration can start working on the next character. Neither Clang
  // nor GCC figure out this reordering on their own.
  return len + !len;
}
This simply means that a byte that should introduce a 2-, 3-, or 4-byte UTF-8 sequence is only counted as such if the right number of following bytes are indeed UTF-8 continuation bytes (of the form 10xxxxxx); otherwise it is counted as a single character.
If the library is used with char strings in a single-byte encoding such as ISO 8859, it no longer miscounts lengths when padding.
And it still works properly for valid UTF-8 strings:
string iso{ -23, 99, 111, 108, 101 }; // "école" (ISO 8859-1)
string utf{ -61, -87, 99, 111, 108, 101 }; // "école" (UTF-8)
string asc{ 101, 99, 111, 108, 101 }; // "ecole" (ASCII)
string out_iso = fmt::format("{:<10}", iso); // size() == 10 (correct)
string out_utf = fmt::format("{:<10}", utf); // size() == 11 (correct)
string out_asc = fmt::format("{:<10}", asc); // size() == 10 (correct)
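The rule can also be tried in isolation, without patching the library. The following is only a sketch: checked_code_point_length and count_code_points are hypothetical names, not part of {fmt}, and the bounds check (treating a sequence truncated at the end of the string as a single byte) is an added assumption that the three proposed lines do not contain.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper, not part of {fmt}: number of bytes to consume for the
// character starting at s[pos]. A lead byte only counts as the start of a
// multi-byte sequence if all of its continuation bytes are present.
inline std::size_t checked_code_point_length(std::string_view s,
                                             std::size_t pos) {
  // Same table as in the quoted function, indexed by the top 5 bits of the
  // lead byte. The implicit terminating '\0' covers index 31 (0xF8..0xFF).
  static constexpr char lengths[] =
      "\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\0\0\0\0\0\0\0\0\2\2\2\2\3\3\4";
  auto byte = [&](std::size_t i) { return static_cast<unsigned char>(s[i]); };

  std::size_t len = static_cast<std::size_t>(lengths[byte(pos) >> 3]);
  if (len < 2) return 1;               // ASCII, stray continuation, invalid lead
  if (pos + len > s.size()) return 1;  // assumption: truncated tail is one byte
  for (std::size_t i = 1; i < len; ++i)
    if ((byte(pos + i) >> 6) != 0x2) return 1;  // not a 10xxxxxx byte
  return len;
}

// Counts characters by stepping through the string with the helper above.
inline std::size_t count_code_points(std::string_view s) {
  std::size_t n = 0;
  for (std::size_t pos = 0; pos < s.size();
       pos += checked_code_point_length(s, pos))
    ++n;
  return n;
}
```

With this sketch, count_code_points returns 5 for all three spellings of the word above: the ISO 8859-1 bytes "\xE9" "cole", the UTF-8 bytes "\xC3\xA9" "cole", and the ASCII "ecole".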
Of course, it is possible to construct byte sequences in a single-byte character set that "look like" valid UTF-8, but those are generally unusual combinations in real text.
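To make the ambiguous case concrete, here is a small sketch of when two single-byte characters would pass the proposed check as one 2-byte sequence. The function name is hypothetical. In ISO 8859-1 the first byte has to fall in 0xC0..0xDF (mostly uppercase accented letters) and the second in 0x80..0xBF (C1 controls, punctuation and symbols), which is why such pairs are rare in ordinary text.

```cpp
// Hypothetical helper: true when two bytes of single-byte text would be
// counted as one 2-byte UTF-8 sequence by the proposed check.
constexpr bool passes_as_two_byte_utf8(unsigned char first,
                                       unsigned char second) {
  bool lead = (first >> 5) == 0x6;    // 110xxxxx, i.e. 0xC0..0xDF
  bool trail = (second >> 6) == 0x2;  // 10xxxxxx, i.e. 0x80..0xBF
  return lead && trail;
}
```

For example, the ISO 8859-1 text "Ã©" (bytes 0xC3 0xA9) passes and would be counted as one character, while "éc" (0xE9 0x63) and "Ét" (0xC9 0x74) do not.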