What Metric exactly is reported/computed for NER? #3
Comments
Thank you for your interest. Yes, you are correct. Due to a mistake in our NER evaluation, the metrics computed by the current code are token-level. We have tested entity-level metrics internally (allowing exact matches only) and will update the code soon. The NER numbers reported in the arXiv paper will be corrected accordingly as soon as possible. To summarize the entity-level results: with the same hyperparameters we could no longer reach average SOTA across the 9 NER datasets, but we still obtained new SOTA performance on 4 out of 9 NER datasets without any tuning. We are sorry for the confusion caused by our mistake in the NER evaluation. Any further questions are always welcome.
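To make the distinction concrete for readers of this thread, here is a minimal sketch of entity-level exact-match P/R/F1 over BIO tags. It is illustrative only, not the repository's evaluation code, but it follows the same conlleval-style scheme that libraries such as seqeval implement:

```python
# Illustrative sketch only -- not this repository's evaluation code.
# Entity-level exact-match P/R/F1 over BIO tag sequences: a prediction
# counts only if both the span boundaries and the entity type match,
# unlike token-level metrics, which score each tag position independently.

def extract_entities(tags):
    """Return a set of (start, end, type) spans from a BIO tag sequence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # "O" sentinel flushes the last span
        boundary = tag == "O" or tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != etype)
        if boundary and start is not None:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]  # a stray I- is leniently treated as B-
    return entities

def entity_prf1(gold_tags, pred_tags):
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    tp = len(gold & pred)  # exact match on both boundaries and type
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

A token-level metric would instead compare gold and predicted tags position by position, which tends to inflate scores because a long entity contributes several individually easy tokens.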
Commit 6ddd053 is for the entity-level evaluation.
Thanks for checking, clarifying, and getting back to me so quickly! Would you be able to provide the most up-to-date entity-level NER results? (I assume the arXiv update may take several days.) Note that for some of the datasets the standard reported metric is not always entity-level exact match (e.g., BC2GM; some details on which metrics are standard are here: https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-017-1776-8). This might mean that you are doing better than you think on those datasets. It may be worth checking in depth, since sadly some datasets depart from the standard entity-level exact-match metric.
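For concreteness, a relaxed scheme credits a prediction that merely overlaps a gold mention of the same type. The sketch below is a generic overlap matcher and only a hypothetical approximation: BC2GM's official scorer is stricter than arbitrary overlap, matching predictions against a supplied list of acceptable alternative gold spans.

```python
# Hypothetical relaxed matcher, for contrast with exact match above.
# A predicted span counts if it overlaps any gold span of the same type.
# (BC2GM's official evaluation instead checks predictions against an
# explicit list of acceptable alternative gold annotations.)

def relaxed_correct(pred, gold_spans):
    """pred and gold_spans hold (start, end, type) tuples, end exclusive."""
    ps, pe, pt = pred
    return any(gt == pt and ps < ge and gs < pe for gs, ge, gt in gold_spans)

# Example: pred (2, 5, "GENE") overlaps gold (3, 7, "GENE"), so it is
# counted as correct here but rejected by the exact-match metric above.
```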
Thank you for the information on the metric details. We are carefully re-validating our NER results right now, so it is best to wait for the updated arXiv paper (which will take several days). We will post an update on this issue. Thank you.
Hi @Iwontbecreative, we've updated our arXiv paper with entity-level results. In summary, we improved over SOTA on 6 out of 9 NER datasets. All the baseline SOTA results are based on the exact-match entity-level metric as well. As you noted, there are two metrics for BC2GM, and we reported exact-match scores (not the alternative-match scores for relaxed boundaries) for both the current SOTA and our models. Thanks.
Thanks for fixing this so quickly! Helpful to have your updated numbers to compare against other approaches :)
Hey @jhyuklee. Can you confirm my observation? Please let me know if I am missing anything here.
Hi,
I am reading through the code and implementation and was wondering what exactly the reported P/R/F1 refer to.
In particular:
Are the metrics at the token level or the entity level? Do you count only exact matches, or is partial credit given for partial matches as well?
The paper is not very clear about what exactly these metrics mean. I'm not very familiar with TensorFlow, but it looks to me like the code computes token-level metrics. Or is the text "detokenized" and grouped into entities before evaluation is run? Are predictions only made for the first token of each word, as in the CoNLL-2003 setup of BERT?
I also noticed that your implementation is based on https://github.com/kyzhouhzau/BERT-NER, which mentions that its scores differ from the standard evaluation (e.g., using conlleval.pl). Is this also the case for the numbers reported in the paper?
Thanks in advance for your answer!
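For context on the CoNLL-2003 setup mentioned above: BERT assigns each word's label to its first WordPiece and masks the remaining subtokens out of the loss and evaluation (kyzhouhzau/BERT-NER uses an "X" placeholder for them). Below is a minimal sketch, assuming a WordPiece tokenizer with a tokenize(word) method as in the original BERT codebase; the function name and the example split are illustrative:

```python
# Minimal sketch of first-subtoken label alignment (BERT CoNLL-2003 style).
# `tokenizer` is assumed to expose tokenize(word) -> list of WordPieces,
# as in the original BERT codebase; other names here are illustrative.

def align_labels(words, labels, tokenizer, pad_label="X"):
    tokens, aligned = [], []
    for word, label in zip(words, labels):
        pieces = tokenizer.tokenize(word) or ["[UNK]"]
        tokens.extend(pieces)
        # Only the first subtoken carries the word's label; the rest get a
        # placeholder that is ignored in the loss and in evaluation.
        aligned.extend([label] + [pad_label] * (len(pieces) - 1))
    return tokens, aligned

# e.g. a vocabulary might split "Immunoglobulin" into
#   ["Imm", "##uno", "##glo", "##bul", "##in"]
# so the label "B-GENE" becomes
#   ["B-GENE", "X", "X", "X", "X"]
```

Whether the metrics are then computed on these subtoken sequences directly, or after stripping the placeholders and regrouping to words, is exactly the kind of detail that can make reported numbers diverge from conlleval.pl.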