Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support match alignments #7321

Merged
merged 9 commits into from
Apr 8, 2021

Conversation

broaddeep
Copy link
Contributor

Description

Support for match alignments.

Many users wanted rule-base matcher to support subgroup labeling(#3275) or Group capture(#4642), Look-around operator like regex(#6420).
However, it wasn't an easy task as seen in the issue(#3275).

To address this issue, I propose the concept of match alignments.
It represents the part of a token pattern that contributed to the match.

For example, suppose we have pattern
[{"ORTH": "a", "OP": "+"}, {"ORTH": "b"}]
and the text is given as
a a a b

The matched span will have four tokens(in the longest greedy setup).
We can easily verify that the first three matched tokens(a a a) was matched by the first token pattern ({"ORTH": "a", "OP": "+"}),
and the last token(b) was matched by the second token pattern ({"ORTH": "b"})

We can rewrite this in List[int], [0, 0, 0, 1].

Using this information, it can be applied to have the same effect as group capture or look around operator, subgroup labeling.
Any better alternative is welcome. The API needs to be strongly managed by maintainers, so it is not necessary to use the expression match_alignments.

Implementation details

  • Each time the state changes, it keeps track of the index of the token pattern at that time and the length of the span.

API

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

pattern = [
    {'ENT_TYPE': 'PERSON', 'OP': '+'}, 
    {'LEMMA': 'love'}, 
    {'ENT_TYPE': 'PERSON', 'OP': '+'}
]

matcher = Matcher(nlp.vocab)
matcher.add("test", [pattern], greedy='LONGEST')

doc = nlp("John Doe loves Jane Doe. John loves Jane.")

matches = matcher(doc, match_alignments=True)

for m in matches:
    print(m)
# (1618900948208871284, 0, 5, [0, 0, 1, 2, 2])
# (1618900948208871284, 6, 9, [0, 1, 2])
  • It does not require breaking changes. (All test passed)
  • Test case added.

Types of change

New feature

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@svlandeg svlandeg added enhancement Feature requests and improvements feat / matcher Feature: Token, phrase and dependency matcher labels Mar 6, 2021
@honnibal
Copy link
Member

honnibal commented Mar 9, 2021

Thanks, this is a very interesting solution, and the implementation looks good at first glance.

The matcher code has been a source of many bugs over the years, and it's often not easy to debug. So I want to be careful before merging this.

Another concern is whether any changes can change the time-complexity or runtime unfavourably. It looks like your change is good in this respect. There's a copy of the alignment vector on each extend action I believe, it might be better if that could be avoided, especially if the alignments aren't being used? On the other hand it might make no difference at all to the practical runtime.

@honnibal
Copy link
Member

honnibal commented Mar 9, 2021

I think we can get this merged for v3.1.

@adrianeboyd
Copy link
Contributor

adrianeboyd commented Mar 9, 2021

I agree that this looks like a useful addition. It looks really lightweight, but speed looks like it might be more of a concern than I thought at first glance. Here's a very rough comparison:

# Keywords # Docs # Matches Time
master 10 1000 0 0.72
100 1000 41 4.03
1000 1000 305 36.88
10 10000 0 8.78
100 10000 587 48.31
1000 10000 8006 438.94
this PR 10 1000 0 0.89
100 1000 41 5.49
1000 1000 305 50.78
10 10000 0 10.55
100 10000 587 62.91
1000 10000 8006 595.40

As to naming: with_alignments?

@honnibal
Copy link
Member

honnibal commented Mar 9, 2021

Thanks for the benchmark! Yes that's surprising.

Can we only copy the vector if we're using the alignments?

@broaddeep
Copy link
Contributor Author

@adrianeboyd That naming looks better. with_alignments=True
@honnibal Yes, it is possible to execute conditionally depending on with_alignments=True or False, but the key is how much code duplication and complexity can be reduced. I will try several things to see if there is a better alternative.

@adrianeboyd adrianeboyd added v3.0 Related to v3.0 v3.1 Related to v3.1 and removed v3.0 Related to v3.0 labels Mar 15, 2021
…al flow if with_alignments is given, validate with_alignments, add related test case
@broaddeep
Copy link
Contributor Author

broaddeep commented Mar 16, 2021

@honnibal I made the vector copying happen depending on the with_alignments option.

  • The code is a bit more complicated, but the rules are clear.
    • The states - align_states vector, matches - align_matches vector always have a 1:1 match relationship.
    • Enabling this option will copy the data into the align_* vectors before any changes to the states/matches vector occur, otherwise the logic is the same as the existing flow.
    • Therefore, I believe that disabling this option will not make a significant difference in execution speed.
  • Added test case, validation error message.

@adrianeboyd
Copy link
Contributor

Again, this is just a rough timing test, but the changes look good!

# Keywords # Docs # Matches Time
master 10 1000 0 0.70
100 1000 41 4.03
1000 1000 305 38.00
this PR without alignments 10 1000 0 0.81
100 1000 41 4.08
1000 1000 305 37.11
this PR with alignments 10 1000 0 0.76
100 1000 41 4.55
1000 1000 305 42.93

I don't think we need the additional check/error for bool for the kwarg. The reason for the doclike type error is that cython doesn't support a Union[Doc,Span] kind of type, but the boolean args can just be typed bint (cython is quirky) and then you don't need the check or the conversion. I think it's okay for whatever the user puts in here to be accepted as bool(val) since we don't do this kind of type checking elsewhere for similar args, either.

And it would be nice to revert all the whitespace edits. If it would be simpler, I can handle the remaining edits if you'd like.

@broaddeep
Copy link
Contributor Author

@adrianeboyd Thank you for the good feedback.

  • dropped type checking for bool type
  • added bint type for function args
  • revert all the whitespace edits.

@adrianeboyd
Copy link
Contributor

Thanks, everything looks good! I'll make a few minor edits and add it to the API docs.

Copy link
Member

@honnibal honnibal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks!

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
@kwhumphreys
Copy link
Contributor

@adrianeboyd @honnibal Is there any way to access alignments in callbacks?
It looks like callbacks are called before alignments are built: https://github.com/explosion/spaCy/blob/master/spacy/matcher/matcher.pyx#L287
but it would be useful for callbacks to see the same matches that the Matcher returns.

@kwhumphreys
Copy link
Contributor

proposed change at #9001

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests and improvements feat / matcher Feature: Token, phrase and dependency matcher v3.1 Related to v3.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants