Support match alignments #7321

broaddeep · 2021-03-06T12:55:46Z

Description

Support for match alignments.

Many users have asked for the rule-based matcher to support subgroup labeling (#3275), group capture (#4642), or regex-like look-around operators (#6420).
However, as the discussion in #3275 shows, this is not an easy task.

To address this, I propose the concept of match alignments.
An alignment records, for each matched token, which token pattern in the pattern list matched it.

For example, suppose we have the pattern
[{"ORTH": "a", "OP": "+"}, {"ORTH": "b"}]
and the text
a a a b

The matched span will have four tokens (in the longest greedy setup).
The first three matched tokens (a a a) were matched by the first token pattern ({"ORTH": "a", "OP": "+"}),
and the last token (b) was matched by the second token pattern ({"ORTH": "b"}).

We can write this as a List[int] of token pattern indices: [0, 0, 0, 1].

With this information, users can get the same effect as group capture, look-around operators, or subgroup labeling.
Better alternatives are welcome. The API is ultimately up to the maintainers, so the name match_alignments does not have to be kept.
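As a rough illustration of how an alignment list could stand in for group capture, the sketch below groups the matched tokens by the index of the token pattern that consumed them. The helper name `group_by_pattern` is hypothetical and not part of spaCy; it works on plain lists so that it does not depend on whatever the final API looks like.

```python
from itertools import groupby
from typing import Dict, List, Tuple


def group_by_pattern(start: int, alignments: List[int]) -> Dict[int, List[Tuple[int, int]]]:
    """Map each token-pattern index to the (start, end) token ranges it matched.

    `start` is the index of the first matched token in the document and
    `alignments[i]` is the pattern index that matched token `start + i`.
    This helper is hypothetical; it is not part of the matcher itself.
    """
    groups: Dict[int, List[Tuple[int, int]]] = {}
    offset = start
    for pattern_index, run in groupby(alignments):
        length = len(list(run))
        groups.setdefault(pattern_index, []).append((offset, offset + length))
        offset += length
    return groups


# The "a a a b" example: pattern 0 covers the three "a" tokens, pattern 1 the "b".
tokens = ["a", "a", "a", "b"]
for pattern_index, ranges in group_by_pattern(0, [0, 0, 0, 1]).items():
    for lo, hi in ranges:
        print(pattern_index, tokens[lo:hi])
```

Each range can then be treated like a captured group, for example by slicing the document with it.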

Implementation details

Each time the matcher state changes, the implementation records the index of the current token pattern and the length of the span so far.
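The idea can be sketched with a toy matcher in pure Python. This is not the Cython state machine changed in this PR: the function `greedy_alignment` is a hypothetical stand-in that only understands exact-text entries with an optional "+" operator, and it records the pattern index every time it consumes a token.

```python
from typing import Dict, List, Optional


def greedy_alignment(
    tokens: List[str], pattern: List[Dict[str, str]], start: int = 0
) -> Optional[List[int]]:
    """Return the alignment of a greedy match beginning at `start`, or None.

    Toy model only: entries are {"ORTH": text} with an optional "OP": "+".
    """

    def step(tok_i: int, pat_i: int, alignment: List[int]) -> Optional[List[int]]:
        if pat_i == len(pattern):
            return alignment  # every pattern entry has been satisfied
        if tok_i == len(tokens) or tokens[tok_i] != pattern[pat_i]["ORTH"]:
            return None
        # Record which pattern entry consumed this token.
        # Note that this builds a new list on every transition.
        extended = alignment + [pat_i]
        if pattern[pat_i].get("OP") == "+":
            # Greedy: try to stay on the same entry before moving on.
            result = step(tok_i + 1, pat_i, extended)
            if result is not None:
                return result
        return step(tok_i + 1, pat_i + 1, extended)

    return step(start, 0, [])


pattern = [{"ORTH": "a", "OP": "+"}, {"ORTH": "b"}]
print(greedy_alignment(["a", "a", "a", "b"], pattern))
```

The length of the returned list is the length of the matched span, so tracking the alignment also tracks the span length.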

API

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

pattern = [
    {'ENT_TYPE': 'PERSON', 'OP': '+'}, 
    {'LEMMA': 'love'}, 
    {'ENT_TYPE': 'PERSON', 'OP': '+'}
]

matcher = Matcher(nlp.vocab)
matcher.add("test", [pattern], greedy='LONGEST')

doc = nlp("John Doe loves Jane Doe. John loves Jane.")

matches = matcher(doc, match_alignments=True)

for m in matches:
    print(m)
# (1618900948208871284, 0, 5, [0, 0, 1, 2, 2])
# (1618900948208871284, 6, 9, [0, 1, 2])

This does not require breaking changes (all tests passed).
A test case has been added.

Types of change

New feature

Checklist

I have submitted the spaCy Contributor Agreement.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.

honnibal · 2021-03-09T03:15:27Z

Thanks, this is a very interesting solution, and the implementation looks good at first glance.

The matcher code has been a source of many bugs over the years, and it's often not easy to debug. So I want to be careful before merging this.

Another concern is whether a change affects the time complexity or runtime unfavourably. Your change looks good in this respect. I believe the alignment vector is copied on each extend action; it might be better to avoid that, especially when the alignments aren't being used. On the other hand, it might make no difference at all to the practical runtime.

honnibal · 2021-03-09T06:23:27Z

I think we can get this merged for v3.1.

adrianeboyd · 2021-03-09T09:12:46Z

I agree that this looks like a useful addition. It looks really lightweight, but speed looks like it might be more of a concern than I thought at first glance. Here's a very rough comparison:

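| Branch | # Keywords | # Docs | # Matches | Time |
| --- | --- | --- | --- | --- |
| `master` | 10 | 1000 | 0 | 0.72 |
| | 100 | 1000 | 41 | 4.03 |
| | 1000 | 1000 | 305 | 36.88 |
| | 10 | 10000 | 0 | 8.78 |
| | 100 | 10000 | 587 | 48.31 |
| | 1000 | 10000 | 8006 | 438.94 |
| this PR | 10 | 1000 | 0 | 0.89 |
| | 100 | 1000 | 41 | 5.49 |
| | 1000 | 1000 | 305 | 50.78 |
| | 10 | 10000 | 0 | 10.55 |
| | 100 | 10000 | 587 | 62.91 |
| | 1000 | 10000 | 8006 | 595.40 |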
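
To make the gap easier to read, the relative overhead can be computed from the timings in the table above. The snippet below is only arithmetic on those numbers; the unit of the Time column is not stated in the thread, so only ratios are shown. By this calculation the PR is somewhere between roughly 20% and 38% slower than `master` in these runs.

```python
# Timings copied from the table above: (keywords, docs, master, this PR).
timings = [
    (10, 1000, 0.72, 0.89),
    (100, 1000, 4.03, 5.49),
    (1000, 1000, 36.88, 50.78),
    (10, 10000, 8.78, 10.55),
    (100, 10000, 48.31, 62.91),
    (1000, 10000, 438.94, 595.40),
]

# Percentage by which the PR timing exceeds the master timing.
overheads = [(pr / master - 1) * 100 for _, _, master, pr in timings]

for (keywords, docs, _, _), overhead in zip(timings, overheads):
    print(f"{keywords:>5} keywords, {docs:>6} docs: {overhead:5.1f}% slower")
```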

As to naming: with_alignments?

honnibal · 2021-03-09T12:06:14Z

Thanks for the benchmark! Yes that's surprising.

Can we only copy the vector if we're using the alignments?

broaddeep · 2021-03-09T12:41:55Z

@adrianeboyd That naming looks better. with_alignments=True
@honnibal Yes, it is possible to execute conditionally depending on with_alignments=True or False, but the key is how much code duplication and complexity can be reduced. I will try several things to see if there is a better alternative.

broaddeep · 2021-03-16T02:21:39Z

@honnibal The vector copying now only happens when the with_alignments option is set.

The code is a bit more complicated, but the rules are clear:
- The states and align_states vectors, and the matches and align_matches vectors, always correspond 1:1.
- When the option is enabled, the data is copied into the align_* vectors before any change is made to the states/matches vectors; otherwise the logic is the same as the existing flow.
- Therefore, I believe that with the option disabled there will be no significant difference in execution speed.

I also added a test case and a validation error message.
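
The rules above can be pictured with a small Python sketch. It is not the Cython code from this PR: the `State` layout and the `advance` function are hypothetical, and only the names `states`, `align_states`, and `with_alignments` come from the discussion. The point is that the alignment data is touched only when the option is set, and before the state itself changes.

```python
from typing import List, Tuple

# Hypothetical state layout: (pattern index, match length so far).
State = Tuple[int, int]


def advance(
    states: List[State],
    align_states: List[List[int]],
    with_alignments: bool,
) -> None:
    """Advance every state by one token, keeping both lists in 1:1 order."""
    for i, (pattern_index, length) in enumerate(states):
        if with_alignments:
            # Copy the alignment only when requested, before the state changes.
            align_states[i] = align_states[i] + [pattern_index]
        # This part is the same whether or not alignments are tracked.
        states[i] = (pattern_index, length + 1)


states = [(0, 1)]
align_states = [[0]]
advance(states, align_states, with_alignments=True)
print(states, align_states)
```

With `with_alignments=False` the `align_states` list is never read or written, which is why the disabled path should cost about the same as the existing flow.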

adrianeboyd · 2021-03-16T15:31:23Z

Again, this is just a rough timing test, but the changes look good!

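| Branch | # Keywords | # Docs | # Matches | Time |
| --- | --- | --- | --- | --- |
| `master` | 10 | 1000 | 0 | 0.70 |
| | 100 | 1000 | 41 | 4.03 |
| | 1000 | 1000 | 305 | 38.00 |
| this PR without alignments | 10 | 1000 | 0 | 0.81 |
| | 100 | 1000 | 41 | 4.08 |
| | 1000 | 1000 | 305 | 37.11 |
| this PR with alignments | 10 | 1000 | 0 | 0.76 |
| | 100 | 1000 | 41 | 4.55 |
| | 1000 | 1000 | 305 | 42.93 |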

I don't think we need the additional check/error for bool for the kwarg. The reason for the doclike type error is that cython doesn't support a Union[Doc,Span] kind of type, but the boolean args can just be typed bint (cython is quirky) and then you don't need the check or the conversion. I think it's okay for whatever the user puts in here to be accepted as bool(val) since we don't do this kind of type checking elsewhere for similar args, either.

And it would be nice to revert all the whitespace edits. If it would be simpler, I can handle the remaining edits if you'd like.

broaddeep · 2021-03-17T04:10:37Z

@adrianeboyd Thank you for the good feedback.

- Dropped type checking for the bool type.
- Added the bint type for function args.
- Reverted all the whitespace edits.

adrianeboyd · 2021-03-17T08:21:55Z

Thanks, everything looks good! I'll make a few minor edits and add it to the API docs.

honnibal

Looks great, thanks!

website/docs/api/matcher.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

kwhumphreys · 2021-08-19T01:45:52Z

@adrianeboyd @honnibal Is there any way to access alignments in callbacks?
It looks like callbacks are called before alignments are built: https://github.com/explosion/spaCy/blob/master/spacy/matcher/matcher.pyx#L287
but it would be useful for callbacks to see the same matches that the Matcher returns.

kwhumphreys · 2021-08-19T18:58:35Z

proposed change at #9001

Support match alignments

7669a49

svlandeg added the enhancement and feat / matcher labels Mar 6, 2021

adrianeboyd added the v3.0 and v3.1 labels and removed the v3.0 label Mar 15, 2021

change naming from match_alignments to with_alignments, add conditional flow if with_alignments is given, validate with_alignments, add related test case

7940305

broaddeep added 2 commits March 17, 2021 02:22

remove added errors, utilize bint type, cleanup whitespace

c2e5507

fix no new line in end of file

a55a5d6

adrianeboyd added 3 commits March 17, 2021 09:27

Minor formatting

4b4f946

Skip alignments processing if as_spans is set

86132fb

Add with_alignments to Matcher API docs

1c4c606

adrianeboyd approved these changes Mar 17, 2021

honnibal approved these changes Mar 29, 2021

Merge branch 'master' into feature/matcher-alignment

c6f069f

svlandeg reviewed Apr 1, 2021

website/docs/api/matcher.md

Update website/docs/api/matcher.md

e81b9f8

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

honnibal merged commit ee159b8 into explosion:master Apr 8, 2021

polm mentioned this pull request Jun 2, 2021

spaCy Token Matcher does not support group capture #4642

Closed

polm mentioned this pull request Jul 31, 2021

Lookaround operators on Matcher patterns #6420

Open
