Lookaround operators on Matcher patterns #6420

kinghuang · 2020-11-20T21:58:03Z

Feature description

The Matcher supports !, ?, +, and * operators and quantifiers. I have text where it would be useful to have something like the regex lookaround patterns, where a pattern should or should not be matched, but is not included as part of the matched range.

For example, consider the following text.

Haul from AB CD site to XY site.

I want to create patterns for AB CD site and XY site and label them as source and destination spans. The from and to tokens are needed to distinguish between AB CD site and XY site, but should not be part of the match.

from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()

m = Matcher(nlp.vocab)
m.add("from_loc", None, [{"ORTH": "from"}, {"ORTH": {"NOT_IN": ["to"]}, "OP": "+"}, {"ORTH": "site"}])
m.add("to_loc",   None, [{"ORTH": "to"}, {"ORTH": {"NOT_IN": ["from"]}, "OP": "+"}, {"ORTH": "site"}])

doc = nlp.make_doc("Haul from AB CD site to XY site.")
matches = m(doc)

for match_id, start, end in matches:
  print(doc[start:end])

from AB CD site
to XY site

The first match spans the tokens for from AB CD site. I want just AB CD site back as the match. The same goes for the second match.
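For comparison, this is roughly what the character-level regex version looks like with Python's re module. It is only a sketch of the analogy: the two pattern strings are made up for this one sentence, and they operate on characters rather than on tokens the way the Matcher does.

```python
import re

text = "Haul from AB CD site to XY site."

# (?<=from ) is a lookbehind: "from " must come right before the match,
# but it is not included in the matched text.
source = re.search(r"(?<=from )(?:\w+ )+?site", text)
destination = re.search(r"(?<=to )(?:\w+ )+?site", text)

print(source.group())       # AB CD site
print(destination.group())  # XY site
```

The context words are required for the match to succeed but are left out of the result, which is the behavior requested here at the token level.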

Proposal

The Matcher should support the following new ops, roughly based on the regex counterparts.

Op	Name	Description
`?=`	Positive lookaround	The token pattern matches, but is not part of the match result.
`?!`	Negative lookaround	The token pattern does not match, and is not part of the match result.

Zero or more lookaround operators can be used at the start and at the end of a pattern. A lookaround operator cannot be surrounded on both sides by non-lookaround operators in a pattern.

While there is a distinction between lookahead and lookbehind in regex, these operators are just positive/negative matchers that are not included in the result.

m = Matcher(nlp.vocab)
m.add("from_loc", None, [{"ORTH": "from", "OP": "?="}, {"ORTH": {"NOT_IN": ["to"]}, "OP": "+"}, {"ORTH": "site"}])
m.add("to_loc",   None, [{"ORTH": "to", "OP": "?="}, {"ORTH": {"NOT_IN": ["from"]}, "OP": "+"}, {"ORTH": "site"}])

doc = nlp.make_doc("Haul from AB CD site to XY site.")
matches = m(doc)

for match_id, start, end in matches:
  print(doc[start:end])

AB CD site
XY site

The from and to tokens are matched but are not part of the match range.

Could the feature be a custom component or spaCy plugin?

No.

svlandeg · 2020-11-24T11:00:37Z

This is related to another feature request discussed here: #2262 and there have been other questions around matching/naming/group capture: #3275 & #4642. I don't know off the top of my head whether it would make sense to address these issues together (or at least keep in mind that there may be other extensions in the future), but I just thought I'd link them.

In general I think we'd be happy to accept contributions to enhance the functionality in the Matcher but we may not have time to look into this ourselves in the near future.

MajorTal · 2020-12-23T12:10:13Z

Ugly as hell, but I managed to work around this with the following on_match function:

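def on_match_kill_last_token(_matcher, _doc, current_id, matches):
    # current_id is the index of the current match in the matches list
    match_id, match_start, match_end = matches[current_id]
    # Shrink the span in place so the last matched token is excluded
    matches[current_id] = (match_id, match_start, match_end - 1)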
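The same trick might be generalized to trim tokens from either edge, which would cover the leading from/to case in the original example. The sketch below is an assumption-laden illustration: make_trim_callback is a hypothetical helper, not part of spaCy, and the approach relies on the Matcher returning the same list it passes to on_match callbacks, which is observed behavior rather than something this thread confirms as guaranteed. The matches list here is a hand-written stand-in for Matcher output so the snippet runs without spaCy.

```python
def make_trim_callback(leading=0, trailing=0):
    """Build an on_match-style callback that drops tokens from the match edges.

    Hypothetical helper: `leading` and `trailing` are the number of context
    tokens to cut from the start and end of each match.
    """
    def on_match(_matcher, _doc, i, matches):
        match_id, start, end = matches[i]
        new_start, new_end = start + leading, end - trailing
        # Only rewrite the match if something is left after trimming
        if new_start < new_end:
            matches[i] = (match_id, new_start, new_end)
    return on_match


# Stand-in for Matcher output on "Haul from AB CD site to XY site."
# Token indices: Haul=0 from=1 AB=2 CD=3 site=4 to=5 XY=6 site=7 .=8
matches = [(1, 1, 5), (2, 5, 8)]

trim_first = make_trim_callback(leading=1)
for i in range(len(matches)):
    trim_first(None, None, i, matches)

print(matches)  # [(1, 2, 5), (2, 6, 8)]
```

With a real Matcher, the callback would be passed as on_match when adding the pattern. It only works when the context is a fixed number of tokens, so it does not cover quantified or negative context.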

polm · 2021-07-31T12:03:41Z

Note that it's a little more work than a simple lookaround, but you should be able to do this with the match alignments in #7321.
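A rough sketch of how that could look, assuming the alignment for each match is a list with one entry per matched token, giving the index of the pattern token that matched it. trim_context and the choice of pattern index 0 as the context token are hypothetical, and the matches below are hand-written stand-ins rather than real Matcher output (real match IDs are integer hashes, not strings).

```python
def trim_context(start, end, alignments, context_indices):
    """Drop edge tokens that were matched by context pattern tokens.

    alignments[k] is the index of the pattern token that matched doc[start + k].
    context_indices is the set of pattern-token indices to leave out of the span.
    """
    lo, hi = 0, len(alignments)
    while lo < hi and alignments[lo] in context_indices:
        lo += 1
    while hi > lo and alignments[hi - 1] in context_indices:
        hi -= 1
    return start + lo, start + hi


# Stand-in for alignment-style results on "Haul from AB CD site to XY site."
matches = [
    ("from_loc", 1, 5, [0, 1, 1, 2]),  # from, AB, CD, site
    ("to_loc", 5, 8, [0, 1, 2]),       # to, XY, site
]

# Treat pattern token 0 (the from/to token) as context
trimmed = [(name, *trim_context(s, e, a, {0})) for name, s, e, a in matches]
print(trimmed)  # [('from_loc', 2, 5), ('to_loc', 6, 8)]
```

Unlike trimming a fixed number of tokens, this keeps working when the context part of the pattern has a quantifier, since every token it consumed carries the same pattern index.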

svlandeg added the "enhancement" and "feat / matcher" labels Nov 21, 2020

svlandeg added the "help wanted" label Nov 24, 2020

adrianeboyd mentioned this issue Dec 4, 2020: Token-based matching: Exclude token in pattern from matches #6492 (Closed)

This was referenced Mar 6, 2021: Feature/matcher alignment #7319 (Closed); Support match alignments #7321 (Merged)

polm mentioned this issue Jul 31, 2021: feature request: zero-width lookahead/-behind expression in the Matcher #2262 (Closed)
