Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

phoneNumbers, json offset request #1144

Open
priley86 opened this issue Aug 30, 2024 · 2 comments
Open

phoneNumbers, json offset request #1144

priley86 opened this issue Aug 30, 2024 · 2 comments

Comments

@priley86
Copy link

hey there again @spencermountain !

This is a small feature request on the phoneNumbers() method exposed.

I'm noting some offset weirdness.

Example that offsets correctly:

const json = nlp('My co-worker's number is +1 212-456-7890.').phoneNumbers(0).json({ offset: true })[0];

console.log(json.offset); 
// offset: { index: 4, start: 25, length: 16 }
 
console.log('My co-worker\'s number is +1 212-456-7890.'.substr(25,16));
// '+1 212-456-7890.'

Example that offsets incorrectly:

const json = nlp('My phone number is (201) 111-2222.').phoneNumbers(0).json({ offset: true })[0];

console.log(json.offset); 
// offset: { index: 4, start: 20, length: 15 }
 
console.log('My phone number is (201) 111-2222.'.substr(20,15));
// '201) 111-2222.'

I would expect the opening ( to have been included in the top level offset to be consistent with the first response. The closing . (period) could also be omitted in both cases, but that seems more of a global issue and unrelated (and i think you've mentioned using other helpers to remove those post-facto).

I would have contributed this change directly, but the fix appears non obvious, so i've added a simple unit test instead that may help indicate the suggested fix.
priley86@5182385

Could this an extension of the toJson method on the phoneNumbers View defined here?
https://github.com/spencermountain/compromise/blob/master/src/3-three/misc/selections/index.js#L39

Sorry, totally unfamiliar here but trying to help 😄

@priley86
Copy link
Author

I'm planning to handle this entity w/ some post-processing or with a custom regex extractor for now, so no urgency on this!

thanks again for all the hard work & support on this library b.t.w.! 🎉

@spencermountain
Copy link
Owner

spencermountain commented Sep 4, 2024

hey Patrick, thank you for the good issue. This is a doozy.

The issue is from an awkward decision in the offset-calculator. When a word is split into it's whitespace (both pre & post), and it's text, the offsets don't include the whitespace characters. You can see the same issue here:

let obj = nlp('(201) 111-2222').phoneNumbers().json({ offset: true })[0]
console.log(obj.terms[0].offset) //{start:1}

obj = nlp('(patrick) after').people().json({ offset: true })[0]
console.log(obj.terms[0].offset) //{start:1}

the reasoning for this is that offsets are used in text-highlighting applications, so it looked awkward highlighting spaces and things.

but yeah, this is awkward because the ( character is central to the phonenumber, and should not be thought of as whitespace.

Will keep this open as a bug. Maybe phone-number classification should happen up at the top, when words get split up. Maybe the offsets results needs more verbose values. Maybe both. I'm not sure what is best.
cheers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants