Commit 5824601

utf8n_to_uvchr_msgs: Use different DFA tables
I'm uncertain about this commit.

There are already three separate DFA tables in core. One accepts Perl extended UTF-8; one accepts only strict Unicode UTF-8; and the third accepts the modified Unicode UTF-8 spelled out in Unicode's Corrigendum #9 (C9). Both Unicode varieties reject surrogate code points and anything above U+10FFFF. C9 accepts non-character code points; the strict table rejects them.

Without this commit, the function always uses the most restrictive table for the DFA. Anything that table accepts is always valid. Anything it rejects is potentially problematic, and a non-inlined function is called to examine the input more slowly, to determine whether it is acceptable and/or whether a warning needs to be raised.

This commit examines the input flags to determine which DFA to use. The benefit is that the slower routine can be avoided for many more code points. But the vast majority of calls to this function aren't for problematic code points, so the extra cost will very rarely be recouped. Translation from UTF-8 is critically important, and we want it to be as fast as possible. I would not even consider this commit if the extra cost weren't very small.

A complicating factor is that 2048 (approximately 20% of the total) Korean Hangul syllable code points are not handled by the strict table, so they must be handled by the slower function, though that happens at its very beginning. These code points are never problematic, so it is unfortunate that they have to go through the slower function. Still, this function will rarely be called with them. Only the strict table has this problem.

The way this commit works is to have a table containing pointers to the three DFA tables. The function looks at the input flags: if none are present, it uses the loosest DFA; if any restrictions are present, it adds 1 to the index; and if the non-character restrictions are present, it adds an extra 1, selecting the strict table instead of the C9 one. Each flag test is cast to bool to get each addition.
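The index computation described above can be sketched in isolation. This is a minimal illustration, not the code from the commit: the DEMO_* flag bits, the demo_dfa_index() helper, and the assumption that the "illegal interchange" mask includes the non-character bit are all hypothetical stand-ins; the real UTF8_* values live in Perl's headers and are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits, for illustration only; not Perl's real values. */
#define DEMO_DISALLOW_SURROGATE 0x01u
#define DEMO_DISALLOW_SUPER     0x02u
#define DEMO_DISALLOW_NONCHAR   0x04u

/* Assumption: this mask covers every restriction, so it is nonzero
 * whenever any restriction at all has been requested. */
#define DEMO_DISALLOW_ILLEGAL_INTERCHANGE \
    (DEMO_DISALLOW_SURROGATE | DEMO_DISALLOW_SUPER | DEMO_DISALLOW_NONCHAR)

/* Return the index into a { extended, c9, strict } table of tables:
 *   0 when no restriction is requested,
 *   1 when some restriction is requested but not the non-character one,
 *   2 when the non-character restriction is requested. */
static unsigned demo_dfa_index(unsigned flags)
{
    /* Precondition of this sketch: only the demo bits may be set. */
    assert((flags & ~DEMO_DISALLOW_ILLEGAL_INTERCHANGE) == 0);

    /* Each cast to bool collapses "any bit set" to exactly 1. */
    return (bool) (flags & DEMO_DISALLOW_ILLEGAL_INTERCHANGE)
         + (bool) (flags & DEMO_DISALLOW_NONCHAR);
}
```

Under these assumptions, passing no flags selects index 0, a surrogate-only restriction selects index 1, and anything that includes the non-character bit selects index 2.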
If the bool casts didn't generate conditionals, the only cost would be two additions and an indirection, and I would say that cost is so tiny that this would be worth it. But I looked at godbolt, and casting to bool requires a comparison on both modern clang and gcc. That makes me unsure of the tradeoff.

Another option would be to use just two DFAs, loose and most strict. Then there would be a single conditional, and the Hanguls would still be handled by the DFA when no flags restrict things.
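For comparison, the two-table alternative mentioned above might look roughly like the following. This is only a sketch under stated assumptions: DEMO_ANY_RESTRICTION, the placeholder tables, and demo_pick_table() are hypothetical names invented for illustration, and the one-byte arrays merely stand in for real DFA tables.

```c
#include <assert.h>

/* Hypothetical mask covering every restriction flag. */
#define DEMO_ANY_RESTRICTION 0x07u

/* Placeholders standing in for the loose and strict DFA tables. */
static const unsigned char demo_loose_tab[]  = { 0 };
static const unsigned char demo_strict_tab[] = { 1 };

/* A single test on the flags chooses between the two tables. */
static const unsigned char *demo_pick_table(unsigned flags)
{
    /* Precondition of this sketch: only the demo bits may be set. */
    assert((flags & ~DEMO_ANY_RESTRICTION) == 0);

    return (flags & DEMO_ANY_RESTRICTION) ? demo_strict_tab
                                          : demo_loose_tab;
}
```

With no flags the loose table is chosen, so unrestricted callers never pay for the strict table's gaps; any restriction at all falls back to the strict table. Whether a compiler turns the ternary into a branch or a conditional move is not something this sketch establishes.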
1 parent 301e785 commit 5824601

3 files changed, +29 -4 lines changed

globvar.sym

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ PL_bincompat_options
 PL_bitcount
 PL_block_type
 PL_c9_utf8_dfa_tab
+PL_which_utf8_dfa_tab
 PL_charclass
 PL_check
 PL_core_reg_engine

inline.h

Lines changed: 13 additions & 4 deletions
@@ -3023,6 +3023,15 @@ Perl_utf8n_to_uvchr_msgs(const U8 * const s0,
     const U8 * s = s0;
     const U8 * send = s + curlen;
 
+    /* Find which dfa table to use. If no restrictions, use [0]. If any,
+     * use at least [1]; and use [2] for the most restrictive */
+    const U8 * table =
+        PL_which_utf8_dfa_tab[
+              (bool) (flags & ( UTF8_WARN_ILLEGAL_INTERCHANGE
+                               |UTF8_DISALLOW_ILLEGAL_INTERCHANGE))
+            + (bool) (flags & (UTF8_WARN_NONCHAR | UTF8_DISALLOW_NONCHAR))
+        ];
+
     /* This dfa is fast. If it accepts the input, it was for a
      * well-formed, non-problematic code point, which can be returned
      * immediately. Otherwise we call a helper function to figure out the
@@ -3038,13 +3047,13 @@ Perl_utf8n_to_uvchr_msgs(const U8 * const s0,
      * The terminology of the dfa refers to a 'class'. The variable 'type'
      * would have been named 'class' except that is a reserved word in C++
      * */
-    PERL_UINT_FAST8_T type = PL_strict_utf8_dfa_tab[*s];
-    PERL_UINT_FAST8_T state = PL_strict_utf8_dfa_tab[256 + type];
+    PERL_UINT_FAST8_T type = table[*s];
+    PERL_UINT_FAST8_T state = table[256 + type];
     UV uv = (0xff >> type) & NATIVE_UTF8_TO_I8(*s);
 
     while (state > 1 && ++s < send) {
-        type = PL_strict_utf8_dfa_tab[*s];
-        state = PL_strict_utf8_dfa_tab[256 + state + type];
+        type = table[*s];
+        state = table[256 + state + type];
 
         uv = UTF8_ACCUMULATE(uv, *s);
     }

perl.h

Lines changed: 15 additions & 0 deletions
@@ -6927,12 +6927,27 @@ EXTCONST U8 PL_c9_utf8_dfa_tab[] = {
 /*N7*/ 1, 1, 1, 1, 1, 1, 1, 1, 1, N2, 1, 1,
 };
 
+/* There are 3 different tables driving the dfa, depending on what sort of
+ * restrictions are wanted.
+ * [0] is where all of Perl's extended UTF-8 is accepted.
+ * [2] is where what is acceptable is the subset of [0] that conforms to
+ * Unicode's requirements for free exchange between processes.
+ * [1] is where what is acceptable is the superset of [2] that conforms to
+ * Unicode Corrigendum #9, for exchange between processes that each have
+ * agreed not to send certain portions of [2]. */
+EXTCONST U8 * PL_which_utf8_dfa_tab[] = {
+    PL_extended_utf8_dfa_tab,
+    PL_c9_utf8_dfa_tab,
+    PL_strict_utf8_dfa_tab
+};
+
 # endif /* defined(PERL_CORE) */
 # else /* End of is DOINIT */
 
 EXTCONST U8 PL_extended_utf8_dfa_tab[];
 EXTCONST U8 PL_strict_utf8_dfa_tab[];
 EXTCONST U8 PL_c9_utf8_dfa_tab[];
+EXTCONST U8 * PL_which_utf8_dfa_tab[];
 
 # endif
 #endif /* end of isn't EBCDIC */
