Lookaround operators on Matcher patterns #6420

Open
kinghuang opened this issue Nov 20, 2020 · 3 comments
Labels
enhancement Feature requests and improvements feat / matcher Feature: Token, phrase and dependency matcher help wanted Contributions welcome!

Comments

@kinghuang

kinghuang commented Nov 20, 2020

Feature description

The Matcher supports !, ?, +, and * operators and quantifiers. I have text where it would be useful to have something like the regex lookaround patterns, where a pattern should or should not be matched, but is not included as part of the matched range.

For example, consider the following text.

Haul from AB CD site to XY site.

I want to create patterns for AB CD site and XY site and label them as source and destination spans. The from and to tokens are needed to distinguish between AB CD site and XY site, but should not be part of the match.

from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()

m = Matcher(nlp.vocab)
m.add("from_loc", None, [{"ORTH": "from"}, {"ORTH": {"NOT_IN": ["to"]}, "OP": "+"}, {"ORTH": "site"}])
m.add("to_loc",   None, [{"ORTH": "to"}, {"ORTH": {"NOT_IN": ["from"]}, "OP": "+"}, {"ORTH": "site"}])

doc = nlp.make_doc("Haul from AB CD site to XY site.")
matches = m(doc)

for match_id, start, end in matches:
  print(doc[start:end])

Output:

from AB CD site
to XY site

The first match spans the tokens for from AB CD site. I want just AB CD site back as the match. The same goes for the second match.

Proposal

The Matcher should support the following new ops, roughly based on the regex counterparts.

Op | Name                | Description
?= | Positive lookaround | The token pattern matches, but is not part of the match result.
?! | Negative lookaround | The token pattern does not match, and is not part of the match result.

Zero or more lookaround tokens can be used at the start and at the end of a pattern. A lookaround operator cannot be surrounded on both sides by non-lookaround operators in a pattern.

While there is a distinction between lookahead and lookbehind in regex, these operators are just positive/negative matchers that are not included in the result.

m = Matcher(nlp.vocab)
m.add("from_loc", None, [{"ORTH": "from", "OP": "?="}, {"ORTH": {"NOT_IN": ["to"]}, "OP": "+"}, {"ORTH": "site"}])
m.add("to_loc",   None, [{"ORTH": "to", "OP": "?="}, {"ORTH": {"NOT_IN": ["from"]}, "OP": "+"}, {"ORTH": "site"}])

doc = nlp.make_doc("Haul from AB CD site to XY site.")
matches = m(doc)

for match_id, start, end in matches:
  print(doc[start:end])

Desired output:

AB CD site
XY site

The from and to tokens are matched but are not part of the match range.
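To make the proposed semantics concrete, here is a minimal sketch of how `?=` could be emulated outside the Matcher. It is not part of spaCy: `split_lookaround` and `trim` are hypothetical helper names, and the sketch assumes every lookaround token consumes exactly one token and that only `?=` is handled (`?!` is left out). The helper enforces the edge-only placement rule, strips the hypothetical op so the pattern is one the current Matcher accepts, and reports how many tokens to cut from each side of a match.

```python
from typing import Any, Dict, List, Tuple

TokenPattern = Dict[str, Any]
# Hypothetical op from the proposal; "?!" is deliberately not handled here.
LOOKAROUND_OPS = {"?="}


def split_lookaround(pattern: List[TokenPattern]) -> Tuple[List[TokenPattern], int, int]:
    """Return (plain_pattern, lead, trail) for a pattern using the proposed "?=" op.

    lead and trail count the lookaround tokens at the start and end of the
    pattern. Raises ValueError if a lookaround token sits between ordinary tokens.
    """
    flags = [tok.get("OP") in LOOKAROUND_OPS for tok in pattern]
    lead = 0
    while lead < len(flags) and flags[lead]:
        lead += 1
    trail = 0
    while trail < len(flags) - lead and flags[len(flags) - 1 - trail]:
        trail += 1
    if any(flags[lead:len(flags) - trail]):
        raise ValueError("lookaround tokens are only allowed at the edges of a pattern")
    # Drop the hypothetical op so each lookaround token becomes an ordinary
    # single-token pattern.
    plain = [
        {k: v for k, v in tok.items() if not (k == "OP" and v in LOOKAROUND_OPS)}
        for tok in pattern
    ]
    return plain, lead, trail


def trim(match: Tuple[int, int, int], lead: int, trail: int) -> Tuple[int, int, int]:
    """Shrink a (match_id, start, end) tuple by the lookaround widths."""
    match_id, start, end = match
    return match_id, start + lead, end - trail
```

For the from_loc pattern above, `split_lookaround` gives back the pattern from the first example with `lead == 1` and `trail == 0`. The match for from AB CD site covers tokens 1 to 5 of the doc, so `trim` turns it into tokens 2 to 5, which is AB CD site.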

Could the feature be a custom component or spaCy plugin?

No.

@svlandeg svlandeg added enhancement Feature requests and improvements feat / matcher Feature: Token, phrase and dependency matcher labels Nov 21, 2020
@svlandeg
Member

This is related to another feature request discussed here: #2262 and there have been other questions around matching/naming/group capture: #3275 & #4642. I don't know off the top of my head whether it would make sense to address these issues together (or at least keep in mind that there may be other extensions in the future), but I just thought I'd link them.

In general I think we'd be happy to accept contributions to enhance the functionality in the Matcher but we may not have time to look into this ourselves in the near future.

@MajorTal

Ugly as hell, but I managed to work around it with this on_match function:

def on_match_kill_last_token(_matcher, _doc, current_id, matches):
    # Shrink the match that triggered the callback by one token at the end.
    match_id, match_start, match_end = matches[current_id]
    matches[current_id] = (match_id, match_start, match_end - 1)
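The same trick can be generalised to cut tokens from either side, which also covers the from/to case in the original post. This is only a sketch: `make_trimmer` is a hypothetical name, and it relies on the same behaviour as the snippet above, namely that rewriting an entry of `matches` inside the callback changes what the Matcher returns.

```python
from typing import Callable, List, Tuple

Match = Tuple[int, int, int]


def make_trimmer(lead: int = 0, trail: int = 0) -> Callable:
    """Build an on_match callback that drops `lead` tokens from the start
    and `trail` tokens from the end of the match it is called for."""

    def on_match(_matcher, _doc, i: int, matches: List[Match]) -> None:
        match_id, start, end = matches[i]
        # Rewrite the entry in place, as in the snippet above.
        matches[i] = (match_id, start + lead, end - trail)

    return on_match


# Hypothetical wiring (spaCy v3 signature; v2 takes the callback as the
# second positional argument instead):
# m.add("from_loc", [pattern], on_match=make_trimmer(lead=1))
```

With `lead=1`, the from_loc match for from AB CD site would come back as AB CD site. Like the original function, this only works when the tokens to drop have a fixed width.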

@polm
Contributor

polm commented Jul 31, 2021

Note that it's a little more work than simple lookaround, but you should be able to do this with the match alignments in #7321.
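A rough sketch of what that could look like, under a few assumptions: the alignments arrive as a list with one entry per matched token, each entry being the index of the token pattern that matched it (my understanding of `with_alignments=True` in recent spaCy v3 releases), and `trim_by_alignment` is a hypothetical helper, not a spaCy function. Unlike fixed-width trimming, this also copes with lookaround tokens that carry a quantifier.

```python
from typing import List, Set, Tuple


def trim_by_alignment(start: int, end: int, alignments: List[int],
                      lookaround: Set[int]) -> Tuple[int, int]:
    """Drop edge tokens whose aligned pattern index is in `lookaround`.

    `alignments[k]` is assumed to be the pattern-token index that matched
    doc token `start + k`.
    """
    if len(alignments) != end - start:
        raise ValueError("expected one alignment entry per matched token")
    lo, hi = 0, len(alignments)
    # Skip lookaround tokens at the start of the match.
    while lo < hi and alignments[lo] in lookaround:
        lo += 1
    # Skip lookaround tokens at the end of the match.
    while hi > lo and alignments[hi - 1] in lookaround:
        hi -= 1
    return start + lo, start + hi


# Hypothetical wiring, assuming the four-element match tuples described above:
# for match_id, start, end, alignments in m(doc, with_alignments=True):
#     new_start, new_end = trim_by_alignment(start, end, alignments, {0})
#     print(doc[new_start:new_end])
```

For from AB CD site (tokens 1 to 5) an alignment of `[0, 1, 1, 2]` with pattern index 0 marked as lookaround would give tokens 2 to 5, i.e. AB CD site.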
