-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
people() seems to get confused by commas #1111
Comments
hey Sandro - good catch! |
got it fixed in 14.14.0! |
I'd like to tell and celebrate the value of the work you do to build and maintain this excellent library. So I'll share a concrete example of the positive impact your diligence makes. This is the test output I was seeing which prompted my original question:
Then you released 14.14.0, and I updated to that version. I did nothing else. This is now the test output:
Happy start to the day. Thank you. |
hey @spencermountain - in relationship to this issue, wanted to quickly ask if you can think of any easy methodology or option to remove periods from the I'm noting that periods always seem to be included w/ these extractors when they fall at the end of a sentence. ex:
(also depicted in your documentation: https://observablehq.com/@spencermountain/topics-named-entity-recognition) Whereas, i'd love the option to make the output be:
So that my tokenization utilities applied afterward can correctly identify the same token & correctly locate offsets w/o periods (and better align entities in ML models). Full context here, i'm experimenting w/ this NER along with a few others and seeing some inconsistency w/ some including and others not including the period. Currently using the nlp.js builtin-compromise here for extracting entities:
Any thoughts or suggestions? |
It would seem we need some additional grammatical handling here for |
hey Patrick, you can print the matches off with any text options you'd like. There are some janky default choices about when to include sentence-end punctuation in the text output, which you can always override with config. I would do something like this: const prompt = 'Hello my name is John Doe. My email is john@gmail.com. I live in New York. Jane Smith also works at my company as the chief operating officer and lives in New Jersey. Our company is Smith & Doe LLC.';
let opts = {trim:true, keepPunct:false}
const processed = nlp(prompt).people();
processed.forEach(person => {
console.log(person.text(opts))
}) cheers |
hey @spencermountain ! Really appreciate your work on this library and this response. I understand that maybe this could be used in the internals of nlp.js 's built-in here: For now though, i did some post-processing here using your
This seems to work pretty well when entities are at the end of the sentence (as long as they are not abbreviated). One other challenge I'm definitely noting here though with this approach is the Some examples they don't seem to properly note the end of the clause/sentence when a single period is used (to complete the acronym and the sentence): a.m. — “Her surgery is scheduled for Wednesday at 10:30 a.m.” |
Great library!
Loving the various entity extraction utilities. They work great. One I use is
people()
. However, it seems to be unable to separate a list of comma-separated names into individual names, at least in this case.This is what I'm seeing [ NodeJs 22, OSX, "compromise": "^14.13.0" ]:
As a side note, you can also see it isn't catching
W.E.B Du Bois
but that seems a complex pattern, and prob best here would be to add to the custom lexicon I'm guessing.Thanks again for
compromise
!The text was updated successfully, but these errors were encountered: