[WIP] Use CERMINE as PDF parser #2474
import java.util.List;
import java.util.Optional;

public class Date {
Are you sure you want to call this class Date? The name will cause a lot of confusion with imports.
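To illustrate the naming concern, here is a minimal sketch. The nested `Date` class and its `getYear` method are hypothetical stand-ins, not the class from this PR. Once a type called `Date` is in scope, the standard `java.util.Date` can only be used through its fully qualified name.

```java
import java.util.Optional;

class DateNameClashDemo {

    // Hypothetical stand-in for a project-specific Date class (assumption, not the PR's code).
    static class Date {
        private final int year;

        Date(int year) {
            this.year = year;
        }

        Optional<Integer> getYear() {
            return Optional.of(year);
        }
    }

    static String describe() {
        // The simple name "Date" resolves to the nested class above.
        Date parsed = new Date(2017);
        // The JDK class now has to be written with its full package name.
        java.util.Date now = new java.util.Date();
        return parsed.getClass().getSimpleName() + " vs " + now.getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

A more specific name would avoid the clash and keep imports unambiguous in files that need both types.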
Really like the idea!
Grobid is Apache-licensed: https://github.com/kermitt2/grobid/blob/master/LICENSE
This is still Maybe, someone should re-try grobid.
We are currently trying to focus on other things. 🔥
This PR replaces our own PDF parser with CERMINE.
In my tests, this library was able to extract (relatively) correct information from a wide variety of articles. It had some problems with books and theses, though.
As far as I understand it, it uses neural networks that try to analyze the PDF on a structural level (e.g. the title is often placed rather prominently). More information can be found in a paper.
In summary:
Pros:
Cons:
A comparison of different metadata extraction tools can be found in the following blog post, which ends with the following summary:
I tried to include Grobid but it appears to be hard to use and has some problems on Windows (until recently?).
@koppor I added you as reviewer since, as far as I understand, you wrote the PdfImporter.
gradle localizationUpdate?