Skip to content

Commit f34acfe

Browse files
committed
Reimplement tr/// without swashes
This large commit removes the last use of swashes from core. It replaces swashes by inversion maps. This data structure is already in use for some Unicode properties, such as case changing. The inversion map data structure leads to straight forward implementation code, so I collapsed the two doop.c routines do_trans_complex_utf8() and do_trans_simple_utf8() into one. A few conditionals could be avoided in the loop if this function were split so that one version didn't have to test for, e.g., squashing, but I suspect these are in the noise in the loop, which has to deal with UTF-8 conversions. This should be faster than the previous implementation anyway. I measured the differences some releases back, and inversion maps were faster than the equivalent swash for up to 512 or 1024 different ranges. These numbers are unlikely to be exceeded in tr/// except possibly in machine-generated ones. Inversion maps are capable of handling both UTF-8 and non-UTF-8 cases, but I left in the existing non-UTF-8 implementation, which uses tables, because I suspect it is faster. This means that there is extra code, purely for runtime performance. An inversion map is always created from the input, and then if the table implementation is to be used, the table is easily derived from the map. Prior to this commit, the table implementation was used in certain edge cases involving code points above 255. Those cases are now handled by the inversion map implementation, because it would have taken extra code to detect them, and I didn't think it was worth it. That could be changed if I am wrong. Creating an inversion map for all inputs essentially normalizes them, and then the same logic is usable for all. This fixes some false negatives in the previous implementation. It also allows for detecting if the actual transliteration can be done in place. Previously, the code mostly punted on that detection for the UTF-8 case. This also allows for accurate counting of the lengths of the two sides, fixing some longstanding TODO warning tests. A new flag is created, OPpTRANS_CAN_FORCE_UTF8, when the tr/// has a below 256 character resolving to one that requires UTF-8. If this isn't set, the code knows that a non-UTF-8 input won't become UTF-8 in the process, and so can take short cuts. The bit representing this flag is the same as OPpTRANS_FROM_UTF, which is no longer used. That name is left in so that the dozen-ish modules in cpan that refer to it can still compile. AFAICT none of them actually use the flag, as well they shouldn't since it is private to the core. Inversion maps are ideally suited for tr/// implementations. An issue with them in general is that for some pathological data, they can become fragmented requiring more space than you would expect, to represent the underlying data. However, the typical tr/// would not have this issue, requiring only very short inversion maps to represent; in some cases shorter than the table implementation. Inversion maps are also easier to deparse than swashes. A deparse TODO was also fixed by this commit, and the code to deparse UTF-8 inputs is simplified. One could implement specialized data structures for specific types of inputs. For example, a common tr/// form is a single range, like tr/A-Z/a-z/. That could be implemented without a table and be quite fast. An intermediate step would be to use the inversion map implementation always when the transliteration is a single range, and then special case length=1 maps at execution time. Thanks to Nicholas Rochemagne for his help on B
1 parent 8c90d3a commit f34acfe

File tree

10 files changed

+1341
-834
lines changed

10 files changed

+1341
-834
lines changed

doop.c

Lines changed: 219 additions & 287 deletions
Large diffs are not rendered by default.

dump.c

Lines changed: 6 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1305,13 +1305,13 @@ S_do_op_dump_bar(pTHX_ I32 level, UV bar, PerlIO *file, const OP *o)
13051305

13061306
case OP_TRANS:
13071307
case OP_TRANSR:
1308-
if (o->op_private & (OPpTRANS_FROM_UTF | OPpTRANS_TO_UTF)) {
1309-
/* utf8: table stored as a swash */
1308+
if (o->op_private & OPpTRANS_USE_SVOP) {
1309+
/* utf8: table stored as an inversion map */
13101310
#ifndef USE_ITHREADS
1311-
/* with ITHREADS, swash is stored in the pad, and the right pad
1311+
/* with ITHREADS, it is stored in the pad, and the right pad
13121312
* may not be active here, so skip */
13131313
S_opdump_indent(aTHX_ o, level, bar, file,
1314-
"SWASH = 0x%" UVxf "\n",
1314+
"INVMAP = 0x%" UVxf "\n",
13151315
PTR2UV(MUTABLE_SV(cSVOPo->op_sv)));
13161316
#endif
13171317
}
@@ -2986,11 +2986,10 @@ Perl_op_class(pTHX_ const OP *o)
29862986
* pointer to a table of shorts used to look up translations.
29872987
* Under utf8, however, a simple table isn't practical; instead,
29882988
* the OP is an SVOP (or, under threads, a PADOP),
2989-
* and the SV is a reference to a swash
2990-
* (i.e., an RV pointing to an HV).
2989+
* and the SV is an AV.
29912990
*/
29922991
return (!custom &&
2993-
(o->op_private & (OPpTRANS_TO_UTF|OPpTRANS_FROM_UTF))
2992+
(o->op_private & OPpTRANS_USE_SVOP)
29942993
)
29952994
#if defined(USE_ITHREADS)
29962995
? OPclass_PADOP : OPclass_PVOP;

embed.fnc

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1870,7 +1870,7 @@ Apd |void |sv_vsetpvfn |NN SV *const sv|NN const char *const pat|const STRLEN pa
18701870
ApR |NV |str_to_version |NN SV *sv
18711871
EXpR |SV* |swash_init |NN const char* pkg|NN const char* name|NN SV* listsv|I32 minbits|I32 none
18721872
EXp |UV |swash_fetch |NN SV *swash|NN const U8 *ptr|bool do_utf8
1873-
#if defined(PERL_IN_REGCOMP_C)
1873+
#if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_OP_C) || defined(PERL_IN_DOOP_C)
18741874
EiR |SV* |add_cp_to_invlist |NULLOK SV* invlist|const UV cp
18751875
Ei |void |invlist_extend |NN SV* const invlist|const UV len
18761876
Ei |void |invlist_set_len|NN SV* const invlist|const UV len|const bool offset
@@ -1922,7 +1922,8 @@ EpX |SV* |invlist_clone |NN SV* const invlist|NULLOK SV* newlist
19221922
#endif
19231923
#if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) \
19241924
|| defined(PERL_IN_TOKE_C) || defined(PERL_IN_UTF8_C) \
1925-
|| defined(PERL_IN_PP_C) || defined(PERL_IN_OP_C)
1925+
|| defined(PERL_IN_PP_C) || defined(PERL_IN_OP_C) \
1926+
|| defined(PERL_IN_DOOP_C)
19261927
EiRT |UV* |invlist_array |NN SV* const invlist
19271928
EiRT |bool |is_invlist |NULLOK SV* const invlist
19281929
EiRT |bool* |get_invlist_offset_addr|NN SV* invlist
@@ -2308,9 +2309,8 @@ p |void |init_constants
23082309
SR |Size_t |do_trans_simple |NN SV * const sv|NN const OPtrans_map * const tbl
23092310
SR |Size_t |do_trans_count |NN SV * const sv|NN const OPtrans_map * const tbl
23102311
SR |Size_t |do_trans_complex |NN SV * const sv|NN const OPtrans_map * const tbl
2311-
SR |Size_t |do_trans_simple_utf8 |NN SV * const sv
2312-
SR |Size_t |do_trans_count_utf8 |NN SV * const sv
2313-
SR |Size_t |do_trans_complex_utf8 |NN SV * const sv
2312+
SR |Size_t |do_trans_invmap |NN SV * const sv|NN AV * const map
2313+
SR |Size_t |do_trans_count_invmap |NN SV * const sv|NN AV * const map
23142314
#endif
23152315

23162316
#if defined(PERL_IN_GV_C)

embed.h

Lines changed: 13 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -1014,7 +1014,6 @@
10141014
# endif
10151015
# if defined(PERL_IN_REGCOMP_C)
10161016
#define add_above_Latin1_folds(a,b,c) S_add_above_Latin1_folds(aTHX_ a,b,c)
1017-
#define add_cp_to_invlist(a,b) S_add_cp_to_invlist(aTHX_ a,b)
10181017
#define add_data S_add_data
10191018
#define add_multi_match(a,b,c) S_add_multi_match(aTHX_ a,b,c)
10201019
#define change_engine_size(a,b) S_change_engine_size(aTHX_ a,b)
@@ -1024,20 +1023,13 @@
10241023
#define edit_distance S_edit_distance
10251024
#define get_ANYOFM_contents(a) S_get_ANYOFM_contents(aTHX_ a)
10261025
#define get_ANYOF_cp_list_for_ssc(a,b) S_get_ANYOF_cp_list_for_ssc(aTHX_ a,b)
1027-
#define get_invlist_iter_addr S_get_invlist_iter_addr
10281026
#define grok_bslash_N(a,b,c,d,e,f,g) S_grok_bslash_N(aTHX_ a,b,c,d,e,f,g)
10291027
#define handle_named_backref(a,b,c,d) S_handle_named_backref(aTHX_ a,b,c,d)
10301028
#define handle_possible_posix(a,b,c,d,e) S_handle_possible_posix(aTHX_ a,b,c,d,e)
10311029
#define handle_regex_sets(a,b,c,d,e) S_handle_regex_sets(aTHX_ a,b,c,d,e)
10321030
#define handle_user_defined_property(a,b,c,d,e,f,g,h,i,j) Perl_handle_user_defined_property(aTHX_ a,b,c,d,e,f,g,h,i,j)
10331031
#define invlist_contents(a,b) S_invlist_contents(aTHX_ a,b)
1034-
#define invlist_extend(a,b) S_invlist_extend(aTHX_ a,b)
1035-
#define invlist_highest S_invlist_highest
10361032
#define invlist_is_iterating S_invlist_is_iterating
1037-
#define invlist_iterfinish S_invlist_iterfinish
1038-
#define invlist_iterinit S_invlist_iterinit
1039-
#define invlist_iternext S_invlist_iternext
1040-
#define invlist_set_len(a,b,c) S_invlist_set_len(aTHX_ a,b,c)
10411033
#define is_ssc_worth_it S_is_ssc_worth_it
10421034
#define join_exact(a,b,c,d,e,f,g) S_join_exact(aTHX_ a,b,c,d,e,f,g)
10431035
#define make_exactf_invlist(a,b) S_make_exactf_invlist(aTHX_ a,b)
@@ -1083,6 +1075,16 @@
10831075
# if defined(PERL_IN_REGCOMP_C) || defined (PERL_IN_DUMP_C) || defined(PERL_IN_OP_C)
10841076
#define _invlist_dump(a,b,c,d) Perl__invlist_dump(aTHX_ a,b,c,d)
10851077
# endif
1078+
# if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_OP_C) || defined(PERL_IN_DOOP_C)
1079+
#define add_cp_to_invlist(a,b) S_add_cp_to_invlist(aTHX_ a,b)
1080+
#define get_invlist_iter_addr S_get_invlist_iter_addr
1081+
#define invlist_extend(a,b) S_invlist_extend(aTHX_ a,b)
1082+
#define invlist_highest S_invlist_highest
1083+
#define invlist_iterfinish S_invlist_iterfinish
1084+
#define invlist_iterinit S_invlist_iterinit
1085+
#define invlist_iternext S_invlist_iternext
1086+
#define invlist_set_len(a,b,c) S_invlist_set_len(aTHX_ a,b,c)
1087+
# endif
10861088
# if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_PERL_C) || defined(PERL_IN_UTF8_C)
10871089
#define _invlistEQ(a,b,c) Perl__invlistEQ(aTHX_ a,b,c)
10881090
#define _new_invlist_C_array(a) Perl__new_invlist_C_array(aTHX_ a)
@@ -1094,7 +1096,7 @@
10941096
#endif
10951097
#define regprop(a,b,c,d,e) Perl_regprop(aTHX_ a,b,c,d,e)
10961098
# endif
1097-
# if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_TOKE_C) || defined(PERL_IN_UTF8_C) || defined(PERL_IN_PP_C) || defined(PERL_IN_OP_C)
1099+
# if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_TOKE_C) || defined(PERL_IN_UTF8_C) || defined(PERL_IN_PP_C) || defined(PERL_IN_OP_C) || defined(PERL_IN_DOOP_C)
10981100
#define _invlist_contains_cp S__invlist_contains_cp
10991101
#define _invlist_len S__invlist_len
11001102
#define _invlist_search Perl__invlist_search
@@ -1603,11 +1605,10 @@
16031605
# endif
16041606
# if defined(PERL_IN_DOOP_C)
16051607
#define do_trans_complex(a,b) S_do_trans_complex(aTHX_ a,b)
1606-
#define do_trans_complex_utf8(a) S_do_trans_complex_utf8(aTHX_ a)
16071608
#define do_trans_count(a,b) S_do_trans_count(aTHX_ a,b)
1608-
#define do_trans_count_utf8(a) S_do_trans_count_utf8(aTHX_ a)
1609+
#define do_trans_count_invmap(a,b) S_do_trans_count_invmap(aTHX_ a,b)
1610+
#define do_trans_invmap(a,b) S_do_trans_invmap(aTHX_ a,b)
16091611
#define do_trans_simple(a,b) S_do_trans_simple(aTHX_ a,b)
1610-
#define do_trans_simple_utf8(a) S_do_trans_simple_utf8(aTHX_ a)
16111612
# endif
16121613
# if defined(PERL_IN_DUMP_C)
16131614
#define deb_curcv(a) S_deb_curcv(aTHX_ a)

invlist_inline.h

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,8 @@
1414
|| defined(PERL_IN_REGEXEC_C) \
1515
|| defined(PERL_IN_TOKE_C) \
1616
|| defined(PERL_IN_PP_C) \
17-
|| defined(PERL_IN_OP_C)
17+
|| defined(PERL_IN_OP_C) \
18+
|| defined(PERL_IN_DOOP_C)
1819

1920
/* An element is in an inversion list iff its index is even numbered: 0, 2, 4,
2021
* etc */
@@ -92,7 +93,7 @@ S_invlist_array(SV* const invlist)
9293
}
9394

9495
#endif
95-
#if defined(PERL_IN_REGCOMP_C)
96+
#if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_OP_C) || defined(PERL_IN_DOOP_C)
9697

9798
PERL_STATIC_INLINE void
9899
S_invlist_extend(pTHX_ SV* const invlist, const UV new_max)

lib/B/Deparse.pm

Lines changed: 73 additions & 90 deletions
Original file line numberDiff line numberDiff line change
@@ -279,6 +279,8 @@ BEGIN { for (qw[ const stringify rv2sv list glob pushmark null aelem
279279
# _pessimise_walk(): recursively walk the optree of a sub,
280280
# possibly undoing optimisations along the way.
281281

282+
sub DEBUG { 0 }
283+
282284
sub _pessimise_walk {
283285
my ($self, $startop) = @_;
284286

@@ -5714,100 +5716,81 @@ sub tr_chr {
57145716
}
57155717
}
57165718

5717-
# XXX This doesn't yet handle all cases correctly either
5719+
sub tr_invmap {
5720+
my ($invlist_ref, $map_ref) = @_;
57185721

5719-
sub tr_decode_utf8 {
5720-
my($swash_hv, $flags) = @_;
5721-
my %swash = $swash_hv->ARRAY;
5722-
my $final = undef;
5723-
$final = $swash{'FINAL'}->IV if exists $swash{'FINAL'};
5724-
my $none = $swash{"NONE"}->IV;
5725-
my $extra = $none + 1;
5726-
my(@from, @delfrom, @to);
5727-
my $line;
5728-
foreach $line (split /\n/, $swash{'LIST'}->PV) {
5729-
my($min, $max, $result) = split(/\t/, $line);
5730-
$min = hex $min;
5731-
if (length $max) {
5732-
$max = hex $max;
5733-
} else {
5734-
$max = $min;
5735-
}
5736-
$result = hex $result;
5737-
if ($result == $extra) {
5738-
push @delfrom, [$min, $max];
5739-
} else {
5740-
push @from, [$min, $max];
5741-
push @to, [$result, $result + $max - $min];
5742-
}
5743-
}
5744-
for my $i (0 .. $#from) {
5745-
if ($from[$i][0] == ord '-') {
5746-
unshift @from, splice(@from, $i, 1);
5747-
unshift @to, splice(@to, $i, 1);
5748-
last;
5749-
} elsif ($from[$i][1] == ord '-') {
5750-
$from[$i][1]--;
5751-
$to[$i][1]--;
5752-
unshift @from, ord '-';
5753-
unshift @to, ord '-';
5754-
last;
5755-
}
5756-
}
5757-
for my $i (0 .. $#delfrom) {
5758-
if ($delfrom[$i][0] == ord '-') {
5759-
push @delfrom, splice(@delfrom, $i, 1);
5760-
last;
5761-
} elsif ($delfrom[$i][1] == ord '-') {
5762-
$delfrom[$i][1]--;
5763-
push @delfrom, ord '-';
5764-
last;
5765-
}
5722+
my $infinity = ~0 >> 1; # IV_MAX
5723+
my $from = "";
5724+
my $to = "";
5725+
5726+
for my $i (0.. @$invlist_ref - 1) {
5727+
my $this_from = $invlist_ref->[$i];
5728+
my $map = $map_ref->[$i];
5729+
my $upper = ($i < @$invlist_ref - 1)
5730+
? $invlist_ref->[$i+1]
5731+
: $infinity;
5732+
my $range = $upper - $this_from - 1;
5733+
if (DEBUG) {
5734+
print STDERR "i=$i, from=$this_from, upper=$upper, range=$range\n";
5735+
}
5736+
next if $map == ~0;
5737+
next if $map == ~0 - 1;
5738+
$from .= tr_chr($this_from);
5739+
$to .= tr_chr($map);
5740+
next if $range == 0; # Single code point
5741+
if ($range == 1) { # Adjacent code points
5742+
$from .= tr_chr($this_from + 1);
5743+
$to .= tr_chr($map + 1);
5744+
}
5745+
elsif ($upper != $infinity) {
5746+
$from .= "-" . tr_chr($this_from + $range);
5747+
$to .= "-" . tr_chr($map + $range);
5748+
}
5749+
else {
5750+
$from .= "-INFTY";
5751+
$to .= "-INFTY";
5752+
}
57665753
}
5767-
if (defined $final and $to[$#to][1] != $final) {
5768-
push @to, [$final, $final];
5754+
5755+
return ($from, $to);
5756+
}
5757+
5758+
sub tr_decode_utf8 {
5759+
my($tr_av, $flags) = @_;
5760+
printf STDERR "flags=0x%x\n", $flags if DEBUG;
5761+
my $invlist = $tr_av->ARRAYelt(0);
5762+
my @invlist = unpack("J*", $invlist->PV);
5763+
my @map = unpack("J*", $tr_av->ARRAYelt(1)->PV);
5764+
5765+
if (DEBUG) {
5766+
for my $i (0 .. @invlist - 1) {
5767+
printf STDERR "[%d]\t%x\t", $i, $invlist[$i];
5768+
my $map = $map[$i];
5769+
if ($map == ~0) {
5770+
print STDERR "TR_UNMAPPED\n";
5771+
}
5772+
elsif ($map == ~0 - 1) {
5773+
print STDERR "TR_SPECIAL\n";
5774+
}
5775+
else {
5776+
printf STDERR "%x\n", $map;
5777+
}
5778+
}
57695779
}
5770-
push @from, @delfrom;
5780+
5781+
my ($from, $to) = tr_invmap(\@invlist, \@map);
5782+
57715783
if ($flags & OPpTRANS_COMPLEMENT) {
5772-
my @newfrom;
5773-
my $next = 0;
5774-
for my $i (0 .. $#from) {
5775-
push @newfrom, [$next, $from[$i][0] - 1];
5776-
$next = $from[$i][1] + 1;
5777-
}
5778-
@from = ();
5779-
for my $range (@newfrom) {
5780-
if ($range->[0] <= $range->[1]) {
5781-
push @from, $range;
5782-
}
5783-
}
5784-
}
5785-
my($from, $to, $diff);
5786-
for my $chunk (@from) {
5787-
$diff = $chunk->[1] - $chunk->[0];
5788-
if ($diff > 1) {
5789-
$from .= tr_chr($chunk->[0]) . "-" . tr_chr($chunk->[1]);
5790-
} elsif ($diff == 1) {
5791-
$from .= tr_chr($chunk->[0]) . tr_chr($chunk->[1]);
5792-
} else {
5793-
$from .= tr_chr($chunk->[0]);
5794-
}
5784+
shift @map;
5785+
pop @invlist;
5786+
my $throw_away;
5787+
($from, $throw_away) = tr_invmap(\@invlist, \@map);
57955788
}
5796-
for my $chunk (@to) {
5797-
$diff = $chunk->[1] - $chunk->[0];
5798-
if ($diff > 1) {
5799-
$to .= tr_chr($chunk->[0]) . "-" . tr_chr($chunk->[1]);
5800-
} elsif ($diff == 1) {
5801-
$to .= tr_chr($chunk->[0]) . tr_chr($chunk->[1]);
5802-
} else {
5803-
$to .= tr_chr($chunk->[0]);
5804-
}
5789+
5790+
if (DEBUG) {
5791+
print STDERR "Returning ", escape_str($from), "/",
5792+
escape_str($to), "\n";
58055793
}
5806-
#$final = sprintf("%04x", $final) if defined $final;
5807-
#$none = sprintf("%04x", $none) if defined $none;
5808-
#$extra = sprintf("%04x", $extra) if defined $extra;
5809-
#print STDERR "final: $final\n none: $none\nextra: $extra\n";
5810-
#print STDERR $swash{'LIST'}->PV;
58115794
return (escape_str($from), escape_str($to));
58125795
}
58135796

@@ -5821,9 +5804,9 @@ sub pp_trans {
58215804
($from, $to) = tr_decode_byte($op->pv, $priv_flags);
58225805
} elsif ($class eq "PADOP") {
58235806
($from, $to)
5824-
= tr_decode_utf8($self->padval($op->padix)->RV, $priv_flags);
5807+
= tr_decode_utf8($self->padval($op->padix), $priv_flags);
58255808
} else { # class($op) eq "SVOP"
5826-
($from, $to) = tr_decode_utf8($op->sv->RV, $priv_flags);
5809+
($from, $to) = tr_decode_utf8($op->sv, $priv_flags);
58275810
}
58285811
my $flags = "";
58295812
$flags .= "c" if $priv_flags & OPpTRANS_COMPLEMENT;

0 commit comments

Comments
 (0)