aho-corasick should be applied for cases like \b(literal1|literal2|...|literalN)\b #891

Closed
@BurntSushi

Discussed in #890

Originally posted by Guillermogsjc July 19, 2022
Hi. To efficiently match a large number of alternations, I guess it would be useful to trigger the aho_corasick variant here:

/// An Aho-Corasick automaton with leftmost-first match semantics.
regarding this doc comment:

/// This is only set when the entire regex is a simple unanchored
/// alternation of literals. We could probably use it more circumstances,
/// but this is already hacky enough in this architecture.

The question is: is there any way to use word boundaries so that an expression like the following is still highly optimized?

r"\b(a|... #massive amount of literal alternations here# ...|z)\b"

or with (?-u:\b) instead of \b.
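To make the desired semantics concrete, here is a rough std-only sketch of what `(?-u:\b)(lit1|...|litN)(?-u:\b)` should report: at each position, try the literals in order and accept the first one that has an ASCII word boundary on both sides. This is not how the regex crate works internally, and it is not an Aho-Corasick automaton (the inner loop is a naive stand-in for one). The names `is_word_byte`, `is_ascii_word_boundary` and `find_bounded_literals` are made up for illustration, and the literals are assumed to be non-empty.

```rust
/// ASCII word character, i.e. what `(?-u:\w)` matches.
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// True if byte offset `i` of `hay` is an ASCII word boundary, i.e.
/// exactly one of the bytes on either side of `i` is a word byte.
fn is_ascii_word_boundary(hay: &[u8], i: usize) -> bool {
    let before = i > 0 && is_word_byte(hay[i - 1]);
    let after = i < hay.len() && is_word_byte(hay[i]);
    before != after
}

/// Hypothetical helper: returns the (start, end) byte offsets of
/// non-overlapping matches of `(?-u:\b)(lit1|...|litN)(?-u:\b)`.
/// Literals are tried in order, so earlier literals win when several fit.
/// Assumes every literal is non-empty (empty ones are skipped).
fn find_bounded_literals(hay: &str, literals: &[&str]) -> Vec<(usize, usize)> {
    let bytes = hay.as_bytes();
    let mut matches = Vec::new();
    let mut at = 0;
    while at < bytes.len() {
        let mut hit = None;
        // A match can only start at a word boundary.
        if is_ascii_word_boundary(bytes, at) {
            for lit in literals {
                let end = at + lit.len();
                if !lit.is_empty()
                    && bytes[at..].starts_with(lit.as_bytes())
                    && is_ascii_word_boundary(bytes, end)
                {
                    hit = Some(end);
                    break;
                }
            }
        }
        match hit {
            Some(end) => {
                matches.push((at, end));
                at = end; // resume after the match
            }
            None => at += 1,
        }
    }
    matches
}
```

For example, calling `find_bounded_literals("cat concat cats dog_ dog", &["cat", "dog"])` should only report the standalone `cat` and the final `dog`. The cost of this naive version grows with the number of literals at every candidate position, which is exactly what a real Aho-Corasick search followed by a boundary check would avoid.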

And regarding the PERFORMANCE documentation here:

there is no problem with using non-greedy matching or having lots of alternations in your regex

would the regex above fall into the "no problem" category?

Thanks
