Replies: 1 comment 1 reply
-
I worry that it's hard to exactly replicate the matching semantics of forward matches when going backwards, unless it's something simple like |
Beta Was this translation helpful? Give feedback.
1 reply
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
-
The match function in the
_sre
module works by iterating over the pattern, one op at a time, advancing the input pointer if necessary. While this works just fine, we're not taking advantage of vector instructions, that would (potentially) improve the performance.One of my ideas here is, if the pattern has a
SRE_OP_LITERAL
opcode, we use a fast search mode (e.g.memmem()
,memchr()
, or similar things). If the literal isn't found in that alternation, we can skip to the next one. If the literal is found, we have three possible options:SRE_OP_LITERAL
opcode is the first thing in that alternation. In this case, we proceed forwards as usual.SRE_OP_LITERAL
in that alternation, but the pattern ends in that literal. In this case, we match the pattern backwards until we reach the beginning of the pattern.SRE_OP_LITERAL
. In this case, we both go backwards, and then forwards.In the cases we go backwards, we can stop iterating if we find another
SRE_OP_LITERAL
in the same alternation. This is not necessary to do when going forwards because we're already handling that.If we have a run of literals, it should be possible to combine them into a single
SRE_OP_LITERAL_RUN
or something like that, and use a fast substring search rather than searching a single character/unicode rune.(The same ideas are valid for the locale-ignoring versions, and even more so, in fact.)
I believe this can yield a sizable performance improvement to how we match regular expressions, and would be a good place to start. There are many other things that can be explored, such as:
a{1,8}
), we can transform this into two instructions: in the previous example,aa{0,7}
, which would generate aSRE_OP_LITERAL(a)
and aSRE_OP_REPEAT_ONE(a, 0, 7)
. This would let us take advantage of the already optimizedSRE_OP_LITERAL
before we get into the current implementation forSRE_OP_REPEAT_ONE
.[0-7a-f]
, generate the range and use something likestrspn()
(or something similar that would work with unicode characters) to find any ocurrence of, in the previous example, any character from the string"01234567abcdef"
.Thoughts?
Beta Was this translation helpful? Give feedback.
All reactions