Update pdfbox to 2.0.0 and migrate from jempbox to xmpbox #1096
@@ -147,7 +136,7 @@ public static void writeXMP(String filename, BibEntry entry,
         if (meta.isPresent()) {
-            List<XMPSchema> schemas = meta.get().getSchemasByNamespaceURI(XMPSchemaBibtex.NAMESPACE);
+            List<XMPSchema> schemas = meta.get().getAllSchemas();
I also experimented with pdfbox and came to this solution:
XMPSchemaBibtex bib = (XMPSchemaBibtex) meta.get().getSchema(XMPSchemaBibtex.class);
Edit: As I understand it, the code was simply looking for the BibTeX schema, and the pdfbox internal method already does that traversal.
In general, it could be helpful to have a look at the DublinCoreSchema implementation.
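To illustrate the point about the lookup doing the traversal itself, here is a minimal, library-free sketch. The classes `SketchSchema`, `SketchBibtexSchema`, and `SchemaLookupSketch` are hypothetical stand-ins invented for this example; they are not the pdfbox or xmpbox types, and the real method may well work differently internally.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for a generic schema type (not the library class).
class SketchSchema {
    final String namespace;

    SketchSchema(String namespace) {
        this.namespace = namespace;
    }
}

// Hypothetical stand-in for the BibTeX schema subclass.
class SketchBibtexSchema extends SketchSchema {
    SketchBibtexSchema() {
        super("http://jabref.sourceforge.net/bibteXMP/");
    }
}

final class SchemaLookupSketch {
    // Walks the list once and returns the first schema of the requested class,
    // so the caller does not have to loop over all schemas and cast by hand.
    static <T extends SketchSchema> Optional<T> findSchema(List<SketchSchema> schemas, Class<T> type) {
        return schemas.stream()
                .filter(type::isInstance)
                .map(type::cast)
                .findFirst();
    }

    public static void main(String[] args) {
        List<SketchSchema> all = List.of(
                new SketchSchema("http://purl.org/dc/elements/1.1/"),
                new SketchBibtexSchema());
        Optional<SketchBibtexSchema> bib = findSchema(all, SketchBibtexSchema.class);
        System.out.println(bib.map(s -> s.namespace).orElse("no BibTeX schema found"));
    }
}
```

Returning an `Optional` here is a design choice of the sketch only; it avoids the unchecked cast and the null check that the one-liner above would otherwise need.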
Thanks for your comments! I am integrating them and am starting to get the tests working. One problem I am facing is that xmpbox seems to leave out all Edit: The rdf information seems to be inserted only upon serialization.
Regarding
@koppor: Thanks, this provides some context. In this PR, I'll only do the migration to the new pdf library though and not to a new format.
@JabRef/developers I think I have run into a show-stopper when it comes to replacing jempbox with xmpbox. The problem is that the parser that ships with xmpbox,
@Test
public void testParsing() throws XmpParsingException {
String testData = "<?xpacket begin=\"\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?><x:xmpmeta xmlns:x=\"adobe:ns:meta/\">\n" +
" <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n" +
" <rdf:Description xmlns:dc=\"http://purl.org/dc/elements/1.1/\" rdf:about=\"\">\n" +
" <dc:description>\n" +
" <rdf:Alt>\n" +
" <rdf:li xml:lang=\"x-default\">The success of the Linux operating system has demonstrated the viability of an alternative form of software development � open source software � that challenges traditional assumptions about software markets. Understanding what drives open source developers to participate in open source projects is crucial for assessing the impact of open source software. This article identifies two broad types of motivations that account for their participation in open source projects. The first category includes internal factors such as intrinsic motivation and altruism, and the second category focuses on external rewards such as expected future returns and personal needs. This article also reports the results of a survey administered to open source programmers.</rdf:li>\n" +
" </rdf:Alt>\n" +
" </dc:description>\n" +
" <dc:creator>\n" +
" <rdf:Seq>\n" +
" <rdf:li>Kelly Clarkson</rdf:li>\n" +
" <rdf:li>Ozzy Osbourne</rdf:li>\n" +
" </rdf:Seq>\n" +
" </dc:creator>\n" +
" <dc:relation>\n" +
" <rdf:Bag>\n" +
" <rdf:li>bibtex/bibtexkey/Clarkson06</rdf:li>\n" +
" <rdf:li>bibtex/booktitle/Catch-22</rdf:li>\n" +
" <rdf:li>bibtex/journal/International Journal of High Fidelity</rdf:li>\n" +
" <rdf:li>bibtex/pdf/YeKis03 - Towards.pdf</rdf:li>\n" +
" </rdf:Bag>\n" +
" </dc:relation>\n" +
" <dc:contributor>\n" +
" <rdf:Bag>\n" +
" <rdf:li>Huey Duck</rdf:li>\n" +
" <rdf:li>Dewey Duck</rdf:li>\n" +
" <rdf:li>Louie Duck</rdf:li>\n" +
" </rdf:Bag>\n" +
" </dc:contributor>\n" +
" <dc:subject>\n" +
" <rdf:Bag>\n" +
" <rdf:li>peanut</rdf:li>\n" +
" <rdf:li>butter</rdf:li>\n" +
" <rdf:li>jelly</rdf:li>\n" +
" </rdf:Bag>\n" +
" </dc:subject>\n" +
" <dc:title>\n" +
" <rdf:Alt>\n" +
" <rdf:li xml:lang=\"x-default\">Hypersonic ultra-sound</rdf:li>\n" +
" </rdf:Alt>\n" +
" </dc:title>\n" +
" <dc:date>\n" +
" <rdf:Seq>\n" +
" <rdf:li>1982-07</rdf:li>\n" +
" </rdf:Seq>\n" +
" </dc:date>\n" +
" <dc:format>application/pdf</dc:format>\n" +
" <dc:type>\n" +
" <rdf:Bag>\n" +
" <rdf:li>InProceedings</rdf:li>\n" +
" </rdf:Bag>\n" +
" </dc:type>\n" +
" </rdf:Description>\n" +
" <rdf:Description xmlns:bibtex=\"http://jabref.sourceforge.net/bibteXMP/\" rdf:about=\"\">\n" +
" <bibtex:abstract>The success of the Linux operating system has demonstrated the viability of an alternative form of software development � open source software � that challenges traditional assumptions about software markets. Understanding what drives open source developers to participate in open source projects is crucial for assessing the impact of open source software. This article identifies two broad types of motivations that account for their participation in open source projects. The first category includes internal factors such as intrinsic motivation and altruism, and the second category focuses on external rewards such as expected future returns and personal needs. This article also reports the results of a survey administered to open source programmers.</bibtex:abstract>\n" +
" <bibtex:author>\n" +
" <rdf:Seq>\n" +
" <rdf:li>Kelly Clarkson</rdf:li>\n" +
" <rdf:li>Ozzy Osbourne</rdf:li>\n" +
" </rdf:Seq>\n" +
" </bibtex:author>\n" +
" <bibtex:bibtexkey>Clarkson06</bibtex:bibtexkey>\n" +
" <bibtex:booktitle>Catch-22</bibtex:booktitle>\n" +
" <bibtex:editor>\n" +
" <rdf:Seq>\n" +
" <rdf:li>Huey Duck</rdf:li>\n" +
" <rdf:li>Dewey Duck</rdf:li>\n" +
" <rdf:li>Louie Duck</rdf:li>\n" +
" </rdf:Seq>\n" +
" </bibtex:editor>\n" +
" <bibtex:journal>International Journal of High Fidelity</bibtex:journal>\n" +
" <bibtex:keywords>peanut, butter, jelly</bibtex:keywords>\n" +
" <bibtex:month>#jul#</bibtex:month>\n" +
" <bibtex:pdf>YeKis03 - Towards.pdf</bibtex:pdf>\n" +
" <bibtex:title>Hypersonic ultra-sound</bibtex:title>\n" +
" <bibtex:year>1982</bibtex:year>\n" +
" <bibtex:entrytype>inproceedings</bibtex:entrytype>\n" +
" </rdf:Description>\n" +
" </rdf:RDF>\n" +
"</x:xmpmeta><?xpacket end=\"w\"?>";
InputStream is = new ByteArrayInputStream(testData.getBytes(StandardCharsets.UTF_8));
DomXmpParser parser = new DomXmpParser();
XMPMetadata meta = parser.parse(is);
}
The result is:
org.apache.xmpbox.xml.XmpParsingException: Cannot find a definition for the namespace http://jabref.sourceforge.net/bibteXMP/
    at org.apache.xmpbox.xml.DomXmpParser.checkPropertyDefinition(DomXmpParser.java:853)
    at org.apache.xmpbox.xml.DomXmpParser.parseChildrenAsProperties(DomXmpParser.java:290)
    at org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:234)
    at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:198)
    at net.sf.jabref.logic.xmp.XMPUtilTest.testParsing(XMPUtilTest.java:1444)
So unless there is something I did not see, the question is how to proceed. I do not think we should write our own custom XMP parser as long as jempbox still exists. We might be able to update to pdfbox-2.0.0 and keep jempbox, but that needs to be evaluated separately.
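For readers who want to see which namespaces in a packet would trip such a check, the following is a rough sketch using only the JDK's DOM parser. It is not how `DomXmpParser` is implemented; it merely assumes, based on the exception message above, that properties under `rdf:Description` are rejected when their namespace has no known definition. The class name `NamespaceCheckSketch`, the shortened sample packet, and the set of "known" namespaces are all made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class NamespaceCheckSketch {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Shortened version of the test packet above: one Dublin Core property, one bibtex property.
    static String samplePacket() {
        return "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
                + "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
                + "<rdf:Description xmlns:dc=\"http://purl.org/dc/elements/1.1/\" rdf:about=\"\">"
                + "<dc:format>application/pdf</dc:format>"
                + "</rdf:Description>"
                + "<rdf:Description xmlns:bibtex=\"http://jabref.sourceforge.net/bibteXMP/\" rdf:about=\"\">"
                + "<bibtex:year>1982</bibtex:year>"
                + "</rdf:Description>"
                + "</rdf:RDF></x:xmpmeta>";
    }

    // Returns the namespaces of property elements directly under rdf:Description
    // that are not contained in the caller-supplied set of known namespaces.
    static Set<String> unknownNamespaces(String xmp, Set<String> known) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmp.getBytes(StandardCharsets.UTF_8)));
            Set<String> unknown = new LinkedHashSet<>();
            NodeList descriptions = doc.getElementsByTagNameNS(RDF_NS, "Description");
            for (int i = 0; i < descriptions.getLength(); i++) {
                NodeList props = descriptions.item(i).getChildNodes();
                for (int j = 0; j < props.getLength(); j++) {
                    Node prop = props.item(j);
                    String ns = prop.getNamespaceURI();
                    if (prop.getNodeType() == Node.ELEMENT_NODE && ns != null && !known.contains(ns)) {
                        unknown.add(ns);
                    }
                }
            }
            return unknown;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XMP packet", e);
        }
    }

    public static void main(String[] args) {
        // Assumption for the sketch: only Dublin Core counts as a known namespace.
        Set<String> known = Set.of("http://purl.org/dc/elements/1.1/");
        System.out.println(unknownNamespaces(samplePacket(), known));
    }
}
```

A lenient reader could skip or collect such properties instead of failing, which is roughly the behaviour the old files seem to rely on.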
From what I see, we are not the only ones who have problems with the XMPBox DomParser.
+1 for asking on the mailing list. Or report an issue at https://issues.apache.org/jira/browse/PDFBOX/. Others seem to have had issues too: https://issues.apache.org/jira/browse/PDFBOX-2416. Are we sure that old JabRef versions wrote the correct XMP data? 😇 Do we really need that XMP thing? Shouldn't we replace it in the long term with something else? See #938 (comment) I cannot really judge now, because I have too little knowledge about this metadata thing in PDFs.
Ok, I will ask on the mailing list, but I get the feeling that the developers of pdfbox switched to xmpbox because they want strict parsing (i.e., rejecting non-standard extensions to XMP metadata). Regarding the relevance of the XMP feature, I really have no clue. I am not using it and do not know anyone who does. If we do not need it, I would be very happy to throw it away. Is there any chance of finding someone who knows and uses the feature and can shed some light on this? We could disable it for v3.3 and wait until someone complains ;-)
And here is the reply from the pdfbox mailing list:
So that pretty much says it. For now, we cannot switch to xmpbox. I'd suggest leaving this PR open until there is a new release of xmpbox.
I am just closing the issue. We will find it again when querying for on-hold issues.
What is the status here? I couldn't find any related bug on https://issues.apache.org/jira/browse/PDFBOX/fixforversion/12328837/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel.
There is no change, really. We cannot use the most recent version of pdfbox, so our options are:
Currently, we are going for option 3. However, that might be a long wait. "Long" as in "years".
👍 for Dublin Core. Seems to be the best option.
Update pdfbox and fontbox from 1.8.13 to 2.0.8 and migrate from jempbox to xmpbox. See pull JabRef#1096. Next step: Writing test cases for XMPUtil (DublinCore).
This fixes #938
- Reading and writing multiple DublinCore entries works: XMPUtilWriter supports multiple metadata entries in DublinCore and a single entry in the PDDocumentInformation. To test the reading of multiple entries locally, the PDF file JabRef_multipleMetaEntries.pdf contains three metadata entries in DublinCore.
- Removed too much code when refactoring the XMPUtil. Non-XMP metadata is also relevant when retrieving org.apache.pdfbox.pdmodel.PDDocumentInformation.
- Update pdfbox and fontbox from 1.8.13 to 2.0.8 and migrate from jempbox to xmpbox. See pull #1096.
- Refactor extraction from DublinCoreSchema.
- The tests cover the most important use cases, which include reading and writing metadata from PDF files. Both formats, DublinCore and PDMetadata (which is not XMP metadata), are tested.
- Separated XMPUtils into a reader and a writer utility class.
- Add meaningful names in DublinCoreExtractor and use StringUtils.isNullOrEmpty.
- Log exception in XMPUtilShared.
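As a small illustration of the Dublin Core route: the test packet earlier in this thread stores BibTeX fields without a direct Dublin Core counterpart as `dc:relation` items of the form `bibtex/<field>/<value>`. The sketch below shows one way such items could be turned back into fields. It is only an assumption about how an extractor might do this, not the code from the commit; `DublinCoreRelationSketch` and `fieldsFromRelations` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DublinCoreRelationSketch {
    static final String PREFIX = "bibtex/";

    // Turns items such as "bibtex/bibtexkey/Clarkson06" into field/value pairs.
    // Items without the prefix or without a field name are skipped.
    static Map<String, String> fieldsFromRelations(List<String> relations) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String relation : relations) {
            if (!relation.startsWith(PREFIX)) {
                continue;
            }
            String rest = relation.substring(PREFIX.length());
            // Split at the first slash only, so values may themselves contain slashes.
            int slash = rest.indexOf('/');
            if (slash <= 0) {
                continue;
            }
            fields.put(rest.substring(0, slash), rest.substring(slash + 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Values taken from the dc:relation bag in the test packet above.
        List<String> relations = List.of(
                "bibtex/bibtexkey/Clarkson06",
                "bibtex/booktitle/Catch-22",
                "bibtex/journal/International Journal of High Fidelity",
                "bibtex/pdf/YeKis03 - Towards.pdf");
        fieldsFromRelations(relations).forEach((field, value) -> System.out.println(field + " = " + value));
    }
}
```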
This was the basis for #3710, so this is integrated and not a freeze anymore.
I am glad to hear that my work was of some use in the end :)
This PR addresses #1004
There are significant differences between the APIs of jempbox and xmpbox. The current state of this PR is just a plain translation from jempbox to xmpbox to get the code to compile. The tests are not working yet, so there are probably some errors in the translation that need to be fixed. Also, Travis seems to have problems with xmpbox.
Comments from anyone who is familiar with XMP handling are very welcome.