Handling strings which might not be proper UTF-8 in a better way... #3059

Closed
@omascia

How about adding these 3 lines (or a better version of them) to the function code_point_length()?

  if (len >= 2 && (static_cast<unsigned char>(*(begin + 1)) >> 6) != 0x2) len = 1;
  if (len >= 3 && (static_cast<unsigned char>(*(begin + 2)) >> 6) != 0x2) len = 1;
  if (len == 4 && (static_cast<unsigned char>(*(begin + 3)) >> 6) != 0x2) len = 1;

Based on the 9.0 source code, the function becomes:

template <typename Char>
FMT_CONSTEXPR auto code_point_length(const Char* begin) -> int {
  if (const_check(sizeof(Char) != 1)) return 1;
  auto lengths =
      "\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\0\0\0\0\0\0\0\0\2\2\2\2\3\3\4";
  int len = lengths[static_cast<unsigned char>(*begin) >> 3];
  if (len >= 2 && (static_cast<unsigned char>(*(begin + 1)) >> 6) != 0x2) len = 1;
  if (len >= 3 && (static_cast<unsigned char>(*(begin + 2)) >> 6) != 0x2) len = 1;
  if (len == 4 && (static_cast<unsigned char>(*(begin + 3)) >> 6) != 0x2) len = 1;

  // Compute the pointer to the next character early so that the next
  // iteration can start working on the next character. Neither Clang
  // nor GCC figure out this reordering on their own.
  return len + !len;
}

The idea is simple: a lead byte that announces a 2-, 3-, or 4-byte UTF-8 sequence is only counted as such if the right number of following bytes are actually UTF-8 continuation bytes (of the form 10xxxxxx). Otherwise it is counted as a single character.
If the library is used with char strings in a single-byte encoding such as ISO 8859, it no longer miscounts lengths when padding.
And it still works properly for valid UTF-8 strings:

		string iso{ -23, 99, 111, 108, 101 };  // "école" (ISO 8859-1)
		string utf{ -61, -87, 99, 111, 108, 101 }; // "école" (UTF-8)
		string asc{ 101, 99, 111, 108, 101 };  // "ecole" (ASCII)

		string out_iso = fmt::format("{:<10}", iso);  // size() == 10 (correct)
		string out_utf = fmt::format("{:<10}", utf);  // size() == 11 (correct)
		string out_asc = fmt::format("{:<10}", asc);  // size() == 10 (correct)
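
The same check can be tried outside the library. What follows is a minimal standalone sketch, not the {fmt} implementation: `checked_code_point_length` and `count_code_points` are hypothetical names. Unlike the three lines above, it takes the string length as well, on the assumption that a truncated sequence at the end of the buffer should also fall back to length 1 rather than read past the end.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of {fmt}): byte length of the code point that
// starts at s[pos]. Falls back to 1 when the expected continuation bytes
// (10xxxxxx) are missing or the sequence is cut off by the end of the string.
inline std::size_t checked_code_point_length(std::string_view s,
                                             std::size_t pos) {
  assert(pos < s.size());
  // Same table as code_point_length(): indexed by the top 5 bits of the
  // lead byte; 0 marks bytes that cannot start a sequence.
  constexpr char lengths[] =
      "\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\0\0\0\0\0\0\0\0\2\2\2\2\3\3\4";
  auto len = static_cast<std::size_t>(
      lengths[static_cast<unsigned char>(s[pos]) >> 3]);
  if (len == 0) return 1;              // stray continuation or invalid byte
  if (pos + len > s.size()) return 1;  // truncated sequence
  for (std::size_t i = 1; i < len; ++i) {
    // Every trailing byte must have the bit pattern 10xxxxxx.
    if ((static_cast<unsigned char>(s[pos + i]) >> 6) != 0x2) return 1;
  }
  return len;
}

// Counts code points, treating each byte of a malformed sequence as one unit.
inline std::size_t count_code_points(std::string_view s) {
  std::size_t n = 0;
  for (std::size_t pos = 0; pos < s.size();
       pos += checked_code_point_length(s, pos)) {
    ++n;
  }
  return n;
}
```

With this sketch, `count_code_points` returns 5 for all three spellings of the word above (ISO 8859-1, UTF-8, and ASCII), which is why each one receives exactly 5 padding spaces at width 10.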

Of course, it is possible to construct byte sequences in a single-byte character set that "look like" valid UTF-8, but such combinations are unusual in real text.
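
One concrete instance of such an ambiguous sequence: in ISO 8859-1 the two characters "Ã©" are the bytes 0xC3 0xA9, which are exactly the UTF-8 encoding of "é". The sketch below uses a hypothetical helper, `starts_with_two_byte_sequence`, which only checks the bit patterns and does not reject overlong encodings. It shows that a continuation-byte check cannot tell the two readings apart.

```cpp
#include <string_view>

// Hypothetical helper: true when the first two bytes of s have the shape of a
// 2-byte UTF-8 sequence (lead byte 110xxxxx, trailing byte 10xxxxxx).
// This is a structural check only; it says nothing about the intended encoding.
inline bool starts_with_two_byte_sequence(std::string_view s) {
  if (s.size() < 2) return false;
  auto lead = static_cast<unsigned char>(s[0]);
  auto trail = static_cast<unsigned char>(s[1]);
  return (lead >> 5) == 0x6 && (trail >> 6) == 0x2;
}
```

Here `starts_with_two_byte_sequence("\xC3\xA9")` is true whether the bytes were meant as ISO 8859-1 "Ã©" or UTF-8 "é", while an ISO 8859-1 "é" followed by a plain letter (`"\xE9" "c"`) is not matched.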
