PMC JATS dates are incorrectly represented by pandoc metadata #8865
Comments
This is a similar problem to #8866.
Which, when converted to markdown, works all right and gives us:
But when converted to DocBook, yields:
In contrast, the original one-liner representation of date, gives the following in DocBook:
So the problem is not that JATS dates are incorrectly represented; it is that, across formats, there is no agreed multi-level structure that ensures that information not stored as a one-liner won't be lost. I could go ahead and fix this and allow a multi-level, more complex representation in the native format, but if I do that, crucial information will be lost at some point, for some formats. As @jgm said, this is much more work than it seems. My answer to this is the same as for #8866: given that this involves not only the JATS reader but also a number of writers, I personally prefer not to approach it until I have understood how to propose a more coordinated approach. Of course, if someone else has an alternative solution, I would be curious to see it.
Seems to me there are fundamental long-term problems with dates in PMC JATS. I suspect they will always be incompatible with dating in other document formats. FWIW, my current thinking for "Baseprints JATS" is that only one single date is internal and stored inside the baseprint. I'm currently using the intentionally ambiguous name "Author Date" for this single internal date. Other dates are external and not stored inside the baseprint, like when the baseprint is publicly archived. These other external dates are evidence from sources separate from the baseprint itself. For a single date internal to a document I think a string in ISO 8601 format is a great standard and there is little need for multi-level dates in the pandoc object model or XML. A string in ISO 8601 sounds like a standard that works well with the current pandoc implementation in all other formats.
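To make the single-date idea concrete, here is a minimal Python sketch of treating an "author date" as one ISO 8601 string. The function name `parse_author_date` and the sample value are assumptions for illustration only; they are not part of pandoc or of any Baseprints tooling.

```python
from datetime import date


def parse_author_date(text: str) -> date:
    """Parse a full ISO 8601 calendar date (YYYY-MM-DD).

    Hypothetical helper: it raises ValueError for anything that is not
    a complete calendar date, such as a bare month name.
    """
    return date.fromisoformat(text.strip())


# Sample value chosen for illustration only.
author_date = parse_author_date("2021-12-13")
print(author_date.isoformat())
```

Because the value is a single string, it round-trips through any format that can carry plain metadata text, which is the property argued for above.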
Agreed. And it looks like that's what we were trying to achieve in the reader (but not completely successfully).
I've fixed the problem with the month (it will now be
I have a great solution: take the average of all the dates; it's a robust estimate using all the data! 😆 For PMC JATS I don't think there is a good solution. At least for my use case of a subset of JATS (with only an "author date") the current pandoc behaviour works fine.
Should we close this?
I am submitting this issue because @kamoe was interested in seeing cases like this. My opinion is that fixing an issue like this is out of scope for pandoc.
My advice for extracting PMC JATS specific metadata is to not use pandoc for that and instead use an XML parser. #8359 has more discussion and a list of JATS dialects.
Pandoc is a great tool for converting between many different formats. I think it is the wrong choice for extracting PMC JATS specific metadata out of the millions of JATS XML files for published journal articles archived by PMC for the long term.
Here is a summary of the attached jats.xml.txt:
which is a simplification of the PMC JATS XML file of article
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9176297/
This is not a contrived example. This is the very first JATS XML file I picked out of the millions of JATS XML files archived in the PMC Open Access Subset.
At least four of these dates show up in either the HTML page or the PDF file for this article. Arguably the most important one is the "Published online" date which is 2021 Dec 13.
Here is what pandoc returns as metadata:
In addition to pandoc returning text that isn't even a date (and I suspect not even a valid ISO month), it isn't even the month of the date one would choose as the single date for the document. That date would be the one that PMC shows prominently on the HTML page and PDF file: 2021 Dec 13, which is not in March 2022.
In addition to getting an actual date, and one that makes sense, one would want to get the date-type attribute value to know what kind of date one is looking at. Lastly, I think one would want a PMC JATS parser to return a list of dates, not just one.
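As a rough sketch of the XML-parser approach suggested above, the following Python uses only the standard library to return every date together with its type. The `SAMPLE` document is made up for illustration and is not the attached jats.xml.txt; the function name `extract_dates` and the fallback from `date-type` to the older `pub-type` attribute are likewise assumptions, not a complete treatment of PMC JATS.

```python
import xml.etree.ElementTree as ET

# Made-up sample for illustration; not the attached jats.xml.txt.
SAMPLE = """<article>
  <front>
    <article-meta>
      <pub-date pub-type="epub">
        <day>13</day><month>12</month><year>2021</year>
      </pub-date>
      <pub-date pub-type="collection">
        <month>3</month><year>2022</year>
      </pub-date>
      <history>
        <date date-type="received">
          <day>1</day><month>6</month><year>2021</year>
        </date>
      </history>
    </article-meta>
  </front>
</article>"""


def extract_dates(xml_text):
    """Return a list of (kind, iso_string) pairs in document order.

    kind is the date-type attribute, falling back to pub-type.
    iso_string is YYYY, YYYY-MM, or YYYY-MM-DD, depending on which
    numeric child elements are present.
    """
    root = ET.fromstring(xml_text)
    results = []
    for elem in root.iter():
        if elem.tag not in ("pub-date", "date"):
            continue
        year = elem.findtext("year")
        if year is None:
            continue  # no usable date without a year
        kind = elem.get("date-type") or elem.get("pub-type")
        parts = [year.strip().zfill(4)]
        month = (elem.findtext("month") or "").strip()
        day = (elem.findtext("day") or "").strip()
        if month.isdigit():
            parts.append(month.zfill(2))
            if day.isdigit():
                parts.append(day.zfill(2))
        results.append((kind, "-".join(parts)))
    return results


for kind, iso in extract_dates(SAMPLE):
    print(kind, iso)
```

Returning a list leaves the choice of "the" document date to the caller, who can select by type instead of relying on whichever element a converter happens to read.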