-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support match alignments #7321
Support match alignments #7321
Conversation
Thanks, this is a very interesting solution, and the implementation looks good at first glance. The matcher code has been a source of many bugs over the years, and it's often not easy to debug. So I want to be careful before merging this. Another concern is whether any changes can change the time-complexity or runtime unfavourably. It looks like your change is good in this respect. There's a copy of the alignment vector on each extend action I believe, it might be better if that could be avoided, especially if the alignments aren't being used? On the other hand it might make no difference at all to the practical runtime. |
I think we can get this merged for v3.1. |
I agree that this looks like a useful addition. It looks really lightweight, but speed looks like it might be more of a concern than I thought at first glance. Here's a very rough comparison:
As to naming: |
Thanks for the benchmark! Yes that's surprising. Can we only copy the vector if we're using the alignments? |
@adrianeboyd That naming looks better. |
…al flow if with_alignments is given, validate with_alignments, add related test case
@honnibal I made the vector copying happen depending on the
|
Again, this is just a rough timing test, but the changes look good!
I don't think we need the additional check/error for And it would be nice to revert all the whitespace edits. If it would be simpler, I can handle the remaining edits if you'd like. |
@adrianeboyd Thank you for the good feedback.
|
Thanks, everything looks good! I'll make a few minor edits and add it to the API docs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks!
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
@adrianeboyd @honnibal Is there any way to access alignments in callbacks? |
proposed change at #9001 |
Description
Support for match alignments.
Many users wanted rule-base matcher to support subgroup labeling(#3275) or Group capture(#4642), Look-around operator like regex(#6420).
However, it wasn't an easy task as seen in the issue(#3275).
To address this issue, I propose the concept of match alignments.
It represents the part of a token pattern that contributed to the match.
For example, suppose we have pattern
[{"ORTH": "a", "OP": "+"}, {"ORTH": "b"}]
and the text is given as
a a a b
The matched span will have four tokens(in the longest greedy setup).
We can easily verify that the first three matched tokens(
a a a
) was matched by the first token pattern ({"ORTH": "a", "OP": "+"}
),and the last token(
b
) was matched by the second token pattern ({"ORTH": "b"}
)We can rewrite this in
List[int]
,[0, 0, 0, 1]
.Using this information, it can be applied to have the same effect as group capture or look around operator, subgroup labeling.
Any better alternative is welcome. The API needs to be strongly managed by maintainers, so it is not necessary to use the expression
match_alignments.
Implementation details
API
Types of change
New feature
Checklist