Headnote missing and/or having wrong labels #1208

Open
ronny3 opened this issue Dec 5, 2024 · 4 comments
Labels
error cases (Some error/test case for future improvements), models:segmentation

Comments

@ronny3

ronny3 commented Dec 5, 2024

Operating System and architecture (arm64, amd64, x86, etc.)

No response

What is your Java version

No response

Log and information

No response

Further information

I'm running the latest version, 0.8.1, on this open-access (OA) article: https://www.sciencedirect.com/science/article/pii/S1386505620310650

After the first page there are typical headnotes: the author on the left side and the journal on the right side.
According to the docs, they should be tagged with <note place="headnote">.
For this PDF (and others) they are either:

  1. Missing completely. For the given example, the journal headnote is missing, and the author headnote is missing on some pages.
  2. Lacking the note category, so they are only <note> followed by <p>.
  3. Often inside tables. See the example from Table 3:
    <figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"> <head>Table 3</head> <label>3</label> <figDesc>(continued ) </figDesc> <table> <row> <cell>O. Fennelly et al.</cell>
  4. Interfering with references. Not in this example, but in another article that was not OA, with an issue similar to Table 3 above, the author headnote was inferred as a reference in the bibl.

If I have understood Grobid correctly, the segmentation and fulltext models are responsible for this.
I am especially unsure about 4): which model should be retrained for this?
Also, as I am new to Grobid, I wonder how many examples I would need for this to improve. Should I go over your training files to check that "headnote" is tagged correctly?
This is not important information, but according to the docs it should be present, and it is causing issues, so I need some guidance!
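
For anyone who wants to check their own output for points 2 and 3, here is a minimal sketch using only the Python standard library. It assumes the fulltext TEI uses the `http://www.tei-c.org/ns/1.0` namespace (as in the table snippet above) and that you already know the running-header strings to look for. The function name `audit_tei` and the sample document are made up for illustration and are not part of Grobid.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, modelled on the Table 3 snippet quoted in this issue.
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<note><p>O. Fennelly et al.</p></note>'
    '<figure type="table" xml:id="tab_5"><head>Table 3</head>'
    '<table><row><cell>O. Fennelly et al.</cell></row></table></figure>'
    '</body></text></TEI>'
)


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def audit_tei(xml_text, running_headers):
    """Report <note> elements without a place attribute and table cells
    whose whole text equals one of the known running headers."""
    root = ET.fromstring(xml_text)
    wanted = {h.strip() for h in running_headers}
    untyped_notes = []
    leaked_cells = []
    for elem in root.iter():
        name = local_name(elem.tag)
        text = "".join(elem.itertext()).strip()
        if name == "note" and elem.get("place") is None:
            untyped_notes.append(text)
        elif name == "cell" and text in wanted:
            leaked_cells.append(text)
    return {"untyped_notes": untyped_notes, "leaked_cells": leaked_cells}


if __name__ == "__main__":
    print(audit_tei(SAMPLE, ["O. Fennelly et al."]))
```

The exact-match check on cells is deliberately strict. A header merged with other cell text would need a substring or fuzzy comparison instead.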

@ronny3
Author

ronny3 commented Dec 5, 2024

It seems I was hasty and only evaluated the output from the service API. They are in fact present in the segmentation.tei from createTraining. I guess the service API Process Fulltext Document then just hides those?

So point 1) is invalid, while points 2-4 still stand and require some training samples.
The author name (upper left side) in particular seems to get confused with body, tables, figures and references, especially when there is more than one author.
I will try to retrain the segmentation model and see how it affects the output. From another issue, I read that you suggested only a couple of examples should make a difference.
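
To review what the segmentation model labelled as headnotes before correcting the training data, something like the following sketch could help. It assumes the segmentation training file marks headnotes as `<note place="headnote">` with `<lb/>` line-break elements inside, which should be checked against the real files. The function name `headnotes` and the sample content are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt in the assumed shape of a segmentation training file.
SAMPLE_TRAINING = (
    '<tei xml:space="preserve"><text xml:lang="en">'
    '<note place="headnote">O. Fennelly et al.<lb/></note>'
    '<body>Some body text<lb/></body>'
    '<note place="footnote">1<lb/></note>'
    '</text></tei>'
)


def headnotes(xml_text):
    """Return the whitespace-normalised text of every note marked as a headnote."""
    root = ET.fromstring(xml_text)
    found = []
    for elem in root.iter():
        # Compare local names so the check works with or without a namespace.
        if elem.tag.rsplit("}", 1)[-1] == "note" and elem.get("place") == "headnote":
            found.append(" ".join("".join(elem.itertext()).split()))
    return found


if __name__ == "__main__":
    for line in headnotes(SAMPLE_TRAINING):
        print(line)
```

Running this over each generated file gives a quick list of what was caught. Author names that do not show up in the list are the ones that were labelled as body, table or reference text and need manual correction.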

@lfoppiano
Collaborator

Hi @ronny3,
the headnotes are not supposed to be output, as they are just visual information and don't add any useful information to the article. They are identified in the segmentation model, as you noticed.
Indeed they should not appear anywhere, and the article you shared could be used as training data material (it's CC-BY).
If you have more CC-BY examples feel free to share them here.

For correcting the training data, there are the guidelines, which you seem to have already checked. In addition, I summarised the process with regard to cascading models here.

@lfoppiano added the "error cases" and "models:segmentation" labels Dec 5, 2024
@ronny3
Author

ronny3 commented Dec 5, 2024

@lfoppiano great, thanks for the answer. I will try training soon and see how it affects the output, then add a few samples to my training folder. I estimate the Wapiti training should take 8 hours if I run the default 2000 iterations.

Should I make a PR after I produce more gold-standard data, if the articles are CC-BY? Or would you rather do them on your own?

At the moment for this case I will only use a text editor, but I was also wondering about a better annotation tool, something that would allow changing the tags by clicking.

@lfoppiano
Collaborator

Yes, make a PR with the training data corrected by you and I will review it. It might take a few iterations back and forth.

As for the editor, I generally use PyCharm/IntelliJ for this work. I got used to it, but if you find a better tool, feel free to share it.
