GROBID splits sentences, puts second half in a figure description #1160
Comments
Hi @mariadelmarq, and thanks again for reporting the issue. This is a recurring issue in the fulltext, and it is likely to be solved by #963.
Thanks heaps, @lfoppiano! Do you have a rough timeline for the next release? No pressure at all, of course, it's just a great package and I would like to know whether a new iteration will be out before the end of the project I'm working on, later this year. Thanks again for all your work on this!
Hi @mariadelmarq, we are currently working on releasing version 0.8.1 (#1123). We've been facing an issue with the JVM that requires processing a large number of PDF documents, and this is taking more time. The change I mentioned is going to come next year.
Hi @mariadelmarq, @lfoppiano, I've been facing similar issues this past week and was about to enquire myself. I'm seeing very simple, plain PDFs (clean front page, pages that are typically just a subheader plus a paragraph, clear bibliography in a standard format) being parsed incorrectly. Mostly, text disappears into figure descriptions, with sentences split in the middle. I mostly experience this with non-English PDFs, typically German.
Is there any way to contribute to speed up the work on this? I've found that GROBID is the best solution for full-text extraction from scholarly PDFs/documents. Or do you recommend any other way of extracting full texts that is less involved than the GROBID biblio and header extraction? I'm not looking for bibliography data or headers, just the clean paragraph-level text from the documents, with any metadata, footers, author info, etc. removed. Thanks
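For the "just the clean paragraph-level text" use case, one possible approach is to post-process the TEI XML that GROBID already produces. The following is a minimal sketch, not a GROBID API: `body_paragraphs` is a hypothetical helper, and it assumes the TEI was obtained from GROBID's full-text processing and that body paragraphs appear as `<p>` elements directly under top-level `<div>` elements of `<body>`. Text that GROBID has misplaced into a `<figure>`, as described in this issue, would be dropped by this filter.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# TEI namespace used by GROBID's XML output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def body_paragraphs(tei_xml: str) -> list[str]:
    """Return body paragraph texts, skipping header, figures and bibliography.

    Hypothetical helper: assumes paragraphs live in <body>/<div>/<p>.
    """
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:text/tei:body", TEI_NS)
    if body is None:
        return []
    paragraphs = []
    # Only direct <div> children of <body>, so <figure> content is ignored.
    for p in body.iterfind("tei:div/tei:p", TEI_NS):
        # Join nested text (e.g. inline <ref>) and normalise whitespace.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return paragraphs
```

Calling `body_paragraphs(tei_string)` on one document's TEI returns a list with one string per paragraph.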
Hi @vegarab, I'm assuming you are dealing with scientific articles. Unfortunately, creating new training data can appear complicated at first. There are two steps: a) generate pre-annotated training data, and b) correct it following the guidelines. Refer to the documentation. Since the GROBID models work in cascade, you will have to start from the segmentation model and go through each one. I explained this in another issue here. Unfortunately, I don't have time to work on the training data at the moment, but I can help you with the process if needed.
Adding additional cases here.
Another example, with body text misclassified as figure:
And then:
And the output is in a figure with no attributes besides the caption:
DNA libraries will be constructed using the Illumina DNA Prep with Enrichment, Tagmentation kit and IDT xGen Exome Research Panel v2 with xGen Universal Blockers-NXT Mix and dual unique barcodes. Paired-end sequencing (2 × 150 bps) on the Illumina NovaSeq 6000 System at 100× depth for exome sequencing. Library preparation and sequencing will be performed for all family members at the same time to minimize potential artifactual differences due to sample preparation. DNA samples will be stored at -80° centigrade to allow for future verification studies.
Here it seems that everything gets classified as PDF: pub.1160333290.pdf
@mariadelmarq I'm working on a different issue (#1206), but the fix there for tables also fixes the issue you reported on the Elsevier paper. It's in the branch #1207, which is still WIP, but it seems to correct the clearest cases of table misclassification. The issue is in fact text classified as table (not as figure, which is more difficult to validate).
Potential error case; I'm not sure whether the paper is open access (i.e., whether it can be used for training). The PDF file is from: https://link.springer.com/article/10.1007/s12144-016-9469-4.
The PDF looks like this:
Whereas GROBID appears to split the text inside this section:
and arbitrarily puts the second half into a figure description:
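To find other documents affected by this kind of split, a rough heuristic over the TEI output may help. The sketch below is only an illustration: `suspicious_figures` is a hypothetical helper, and the rule it applies (a figure description that starts with a lowercase letter is probably the tail of a split sentence) is an assumption that will produce both false positives and false negatives.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by GROBID's XML output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def suspicious_figures(tei_xml):
    """Return (figure id, description) pairs whose description looks like
    the second half of a sentence. Heuristic only: starts with lowercase."""
    root = ET.fromstring(tei_xml)
    flagged = []
    for fig in root.iterfind(".//tei:figure", TEI_NS):
        desc = fig.find("tei:figDesc", TEI_NS)
        if desc is None:
            continue
        # Collect all nested text and normalise whitespace.
        text = " ".join("".join(desc.itertext()).split())
        if text and text[0].islower():
            flagged.append((fig.get(XML_ID, ""), text))
    return flagged
```

Running this over a batch of TEI files gives a shortlist of figures worth checking by hand against the original PDF.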