Add ASSUMEs for UTF-8 byte lengths #23680

khwilliamson · 2025-09-03T20:53:56Z

The maximum number of bytes in a Perl extended UTF-8 character is 13 on ASCII platforms; 14 on EBCDIC. Yet the variable that returns that number is a Size_t. By adding these clues to these inline functions, the compiler may be able to do some optimizations.
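To make the idea concrete, here is a minimal sketch of a range hint on a returned length. Everything named `SKETCH_*` or `sketch_*` is hypothetical and for illustration only: Perl's real ASSUME() is defined in its own headers and differs in detail, and the function below handles only standard 1 to 4 byte UTF-8 start bytes rather than Perl's extended UTF-8.

```c
#include <stddef.h>

/* Hypothetical stand-in for Perl's ASSUME(); NOT the real definition.
 * On GCC/clang it tells the optimizer the condition always holds. */
#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
#  define SKETCH_ASSUME(cond) ((void) 0)
#endif

/* 13 is the ASCII-platform maximum stated above; the name is illustrative. */
#define SKETCH_UTF8_MAXBYTES 13

/* Returns the sequence length implied by a standard UTF-8 start byte
 * (1..4), or 1 for any byte that is not a valid start byte. */
static inline size_t
sketch_char_len(unsigned char first_byte)
{
    size_t len;

    if (first_byte < 0x80)                 len = 1;
    else if ((first_byte & 0xE0) == 0xC0)  len = 2;
    else if ((first_byte & 0xF0) == 0xE0)  len = 3;
    else if ((first_byte & 0xF8) == 0xF0)  len = 4;
    else                                   len = 1;

    /* The hint: although len is a full-width size_t, its value is tiny. */
    SKETCH_ASSUME(len >= 1 && len <= SKETCH_UTF8_MAXBYTES);
    return len;
}
```

Once this is inlined into a caller, the compiler is free to use the stated bounds, for example when the length feeds a loop count or a copy size. Whether any particular compiler actually does so is not something this sketch demonstrates.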

This isn't done here on another inline function, utf8_to_uv_msgs(). That is because it currently returns the call of a non-inline function, so the ASSUME would be unreachable code. I don't know if that actually matters. Alternatively, that function's boolean result could be stored in a temporary, the ASSUME done, and then utf8_to_uv_msgs() would return the temporary's value.

Opinions welcome

This set of changes does not require a perldelta entry.

tonycoz

I do wonder if assuming expectlen >= 1 (etc) is useful.

khwilliamson · 2025-09-04T02:15:13Z

Revised to check that the value is in the range 1 <= x <= MAX

khwilliamson · 2025-09-04T11:51:15Z

@tonycoz Any ideas on the second paragraph of #23680 (comment)?

tonycoz · 2025-09-05T03:37:28Z

This isn't done here on another inline function, utf8_to_uv_msgs(). That is because it currently returns the call of a non-inline function, so the ASSUME would be unreachable code. I don't know if that actually matters. Alternatively, that function's boolean result could be stored in a temporary, the ASSUME done, and then utf8_to_uv_msgs() would return the temporary's value.

I think there's some value in:

bool result = somefunc(...advancep);
ASSUME(advancep == NULL || inRANGE(...));
return result;
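The store-in-a-temporary pattern above can be sketched in full as follows. This is only an illustration under stated assumptions: `SKETCH_ASSUME`, `SKETCH_UTF8_MAXBYTES`, `sketch_decode_worker()` and `sketch_decode()` are hypothetical names, not Perl's API, and the worker decodes only 1- and 2-byte sequences with no overlong checks.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for Perl's ASSUME(); NOT the real definition. */
#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
#  define SKETCH_ASSUME(cond) ((void) 0)
#endif

/* 13 is the ASCII-platform maximum from the PR description. */
#define SKETCH_UTF8_MAXBYTES 13

/* Stand-in for the non-inline worker.  Decodes a 1- or 2-byte sequence
 * starting at s (end of buffer at e); anything else is reported as a
 * failure that still advances by one byte. */
static bool
sketch_decode_worker(const unsigned char *s, const unsigned char *e,
                     unsigned long *cp, size_t *advancep)
{
    size_t len = 1;
    bool ok = false;

    *cp = 0xFFFD;   /* replacement character on failure */
    if (s < e && s[0] < 0x80) {
        *cp = s[0];
        ok = true;
    }
    else if (e - s >= 2 && (s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *cp = ((unsigned long) (s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        len = 2;
        ok = true;
    }

    if (advancep)
        *advancep = len;
    return ok;
}

/* Inline wrapper: hold the result in a temporary so the hint is
 * reachable, then return the temporary. */
static inline bool
sketch_decode(const unsigned char *s, const unsigned char *e,
              unsigned long *cp, size_t *advancep)
{
    bool result = sketch_decode_worker(s, e, cp, advancep);

    SKETCH_ASSUME(advancep == NULL
                  || (*advancep >= 1 && *advancep <= SKETCH_UTF8_MAXBYTES));
    return result;
}
```

Because the wrapper no longer returns the call expression directly, the hint sits on a reachable path, and it tolerates callers that pass NULL for the length pointer.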

Also, while -fanalyzer -flto is impractical, -flto is practical, so ASSUME()s in non-inline functions can have some value.

The maximum number of bytes in a Perl extended UTF-8 character is 13 on ASCII platforms; 14 on EBCDIC. Yet the variable that returns that number is a Size_t in the cases changed by this commit. By adding these ASSUMEs to these functions, the compiler may be able to do some optimizations. I looked through the code base, and found no other instances where such a small value could be stored in a full-width variable. With link-time optimization, an ASSUME may be helpful even in non-inline functions.

khwilliamson · 2025-09-05T21:12:33Z

I added an ASSUME in a place where it would only help link time optimization, and changed as suggested in #23680 (comment).

I also audited the code base for other potential spots to change, and found none.

tonycoz approved these changes Sep 4, 2025


khwilliamson force-pushed the ASSUME branch from d9d7c5d to 770df21 on September 4, 2025 02:07

tonycoz approved these changes Sep 4, 2025


khwilliamson force-pushed the ASSUME branch from 770df21 to f1d3134 on September 5, 2025 21:09

khwilliamson merged commit b60f610 into Perl:blead Sep 6, 2025
33 checks passed

khwilliamson deleted the ASSUME branch September 6, 2025 02:38
