
What Metric exactly is reported/computed for NER? #3

Closed
Iwontbecreative opened this issue Jan 30, 2019 · 7 comments

Comments

Iwontbecreative commented Jan 30, 2019

Hi,

I am reading through the code and implementation and was wondering what exactly the reported R/P/F1 values refer to.
In particular:
Are the metrics computed at the token level or the entity level? Do you count only exact matches, or do you give some credit for partial matches as well?
The paper is not very clear about what exactly these metrics refer to. I'm not very familiar with TensorFlow, but it seems to me that the code is computing token-level metrics. Or is the text "detokenized" and grouped by entity before evaluation is run? Are predictions only made for the first sub-token of each word, as in the CoNLL-2003 setup of BERT?
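
To make the distinction concrete, here is a toy sketch of the two scoring schemes I have in mind (purely illustrative; the helpers and data are made up, nothing from this repo):

```python
# Toy contrast of token-level vs entity-level (exact match) scoring on BIO tags.

def extract_spans(tags):
    """Collect (start, end) spans of B/I chunks, end exclusive."""
    spans, start = set(), None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.add((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.add((start, i))
            start = None
    if start is not None:
        spans.add((start, len(tags)))
    return spans

gold = ["B", "I", "O", "B", "O"]
pred = ["B", "O", "O", "B", "O"]  # first entity predicted one token short

# Token level: each tag position is scored on its own.
tp = sum(g == p != "O" for g, p in zip(gold, pred))
p_tok = tp / sum(t != "O" for t in pred)     # 2/2 = 1.00
r_tok = tp / sum(t != "O" for t in gold)     # 2/3 ≈ 0.67
f_tok = 2 * p_tok * r_tok / (p_tok + r_tok)  # ≈ 0.80

# Entity level, exact match: a span counts only if it matches exactly.
g_sp, p_sp = extract_spans(gold), extract_spans(pred)
tp_ent = len(g_sp & p_sp)                    # only span (3, 4) matches
p_ent, r_ent = tp_ent / len(p_sp), tp_ent / len(g_sp)
f_ent = 2 * p_ent * r_ent / (p_ent + r_ent)  # 0.50, vs 0.80 above
print(f_tok, f_ent)
```

The truncated first entity still earns token-level credit for its B tag, which is why the two F1 numbers can diverge substantially.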

In particular, I noticed that your implementation is based on https://github.com/kyzhouhzau/BERT-NER, which mentions that its scores differ from the standard evaluation (e.g., using conlleval.pl). Is this also the case for the numbers reported in the paper?

Thanks in advance for your answer!

Iwontbecreative changed the title from "What Metric exactly is reported/computed for NER." to "What Metric exactly is reported/computed for NER?" on Jan 30, 2019
jhyuklee (Member) commented Jan 31, 2019

Thank you for your interest.

Yes, you are correct. Due to a mistake in our NER evaluation, the code currently computes token-level metrics. We have tested entity-level metrics internally (allowing only exact matches) and will update our code soon. The NER numbers reported in the arXiv paper will be corrected accordingly as soon as possible.

To summarize the entity-level results: with the same hyperparameters, we no longer reach average SOTA across the 9 NER datasets as before, but we still obtain new SOTA performance on 4 out of 9 NER datasets without any tuning. We are sorry for any confusion caused by our mistake in the NER evaluation. Further questions are always welcome.
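
As background on the "detokenization" point above, here is a minimal sketch of how sub-token predictions can be mapped back to words before entity-level scoring. This is an illustration under my own assumptions (the "##" continuation convention and a dummy "X" label for non-first sub-tokens, as in BERT's CoNLL-2003 setup), not necessarily our exact evaluation code:

```python
# Map WordPiece-level predictions back to word-level tags by keeping only
# each word's first sub-token label; continuation pieces carry a dummy "X".

def first_subtoken_labels(pieces, labels):
    """One label per word: drop labels of '##'-continuation pieces."""
    return [lab for piece, lab in zip(pieces, labels)
            if not piece.startswith("##")]

pieces = ["auto", "##phag", "##y", "gene", "at", "##g", "##7"]
labels = ["B",    "X",      "X",   "O",    "B",  "X",   "X"]
print(first_subtoken_labels(pieces, labels))  # ['B', 'O', 'B'] — word-level tags
```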

wonjininfo (Member) commented Jan 31, 2019

Commit 6ddd053 adds the entity-level evaluation.
We have updated the README with detailed usage of the code. Please let us know if you have further questions.
If there are no further questions, we will close this issue.
Thank you.

Iwontbecreative (Author) commented

Thanks for checking, clarifying, and getting back to me so quickly! Would you be able to provide the most up-to-date entity-level NER results? (I assume the arXiv update may take several days to appear.)

Note that for some of the datasets, the standard reported metric is not entity-level exact match (e.g., BC2GM; some details on which metrics are standard are here: https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-017-1776-8). This might mean that you are doing better than you think on those datasets. It may be worth checking in depth, since unfortunately some datasets depart from the standard entity-level exact-match metric.
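
For illustration, here is a toy overlap-based "relaxed" criterion next to exact match. This is only a sketch of the general idea: BC2GM's official alternative-match evaluation relies on curated alternative gene-mention annotations, which a simple span-overlap test does not reproduce.

```python
# Exact match vs a toy overlap-based relaxed match over (start, end) spans.

def overlaps(a, b):
    """True if half-open spans a and b share at least one token."""
    return a[0] < b[1] and b[0] < a[1]

gold = {(0, 2), (5, 7)}
pred = {(0, 1), (5, 7)}  # first prediction truncates the gold mention

exact_tp = len(gold & pred)                                        # 1
relaxed_tp = sum(any(overlaps(p, g) for g in gold) for p in pred)  # 2
print(exact_tp, relaxed_tp)  # relaxed scoring credits the truncated mention
```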

jhyuklee (Member) commented Feb 1, 2019

Thank you for the information on the metric details. We are carefully re-validating our NER results right now, so it would be best to wait for the updated arXiv paper (which will take several days). We will post an update in this issue. Thank you.

jhyuklee (Member) commented Feb 5, 2019

Hi @Iwontbecreative, we've updated our arXiv paper with entity-level results. In summary, we improved over SOTA on 6 out of 9 NER datasets. All the baseline SOTA results are based on the exact-match entity-level metric as well. As you noted, there are two metrics for BC2GM; we report exact-match scores (not the alternative-match scores with relaxed boundaries) for both the current SOTA and our models. Thanks.

jhyuklee closed this as completed Feb 8, 2019
Iwontbecreative (Author) commented

Thanks for fixing this so quickly! It's helpful to have your updated numbers to compare against other approaches :)

ardakdemir commented Jul 12, 2020

Hey @jhyuklee,
I have a follow-up question about the NER results. I obtained the datasets from the repository, and each dataset is annotated only with boundary information (B, I, or O, without the entity type). However, the original versions of some of these datasets are annotated with multiple entity types (JNLPBA is annotated with DNA, RNA, and Protein, for example). When I cross-check your dataset against the one provided in MTL-Bioinformatics, I observe that B-Protein, B-DNA, and B-RNA are all mapped to B in the datasets you provide. In that case, the task you are evaluating BioBERT on is "entity boundary detection", since no "entity-type detection" is done with this approach.

Can you confirm my observation? Please let me know if I am missing anything here.
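
Here is a minimal sketch of the mapping I mean (illustrative only; the snippet and data are mine, not from the repo):

```python
# Typed BIO tags from the original JNLPBA annotations collapsed to untyped
# boundary tags, discarding the entity type.

def collapse_label(tag):
    """Map e.g. 'B-Protein' -> 'B', 'I-DNA' -> 'I', 'O' -> 'O'."""
    return tag.split("-", 1)[0]

original = ["B-Protein", "I-Protein", "O", "B-DNA", "I-DNA"]
collapsed = [collapse_label(t) for t in original]
print(collapsed)  # ['B', 'I', 'O', 'B', 'I'] — types are no longer recoverable
```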
