WTPDF / PDF/UA-2 Examples by the LaTeX Project #72
-
Awesome work! But I did note 2 files with errors and a few other issues:
Just for discussion: several files have private PTEX entries for XObjects etc., such as PTEX.FileName and PTEX.InfoDict, which can include the author, filename, and so on, as per the pdfTeX documentation (https://texdoc.org/serve/pdftex-a.pdf/0). Since PDF/A files are intended for long-term preservation, this has the potential to cause issues for FOIA and similar requests, since the presence of private data might slip past various redaction workflows. A modern equivalent is to use an XMP Metadata stream instead of second-class custom PDF keys, which makes this more discoverable.
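As a rough illustration of how such entries could be spotted before publishing, here is a minimal Python sketch that scans the raw bytes of a PDF for PTEX.* key names. The function name and the sample bytes are made up for this example. It only sees keys stored in uncompressed dictionaries, so it is a quick check under that assumption, not a redaction tool.

```python
import re

# Matches pdfTeX's private keys such as /PTEX.FileName or /PTEX.InfoDict.
PTEX_KEY = re.compile(rb"/PTEX\.[A-Za-z]+")


def find_ptex_keys(pdf_bytes: bytes) -> set[str]:
    """Return the distinct PTEX.* key names visible in the raw bytes.

    Assumption: the dictionaries holding the keys are not compressed.
    Keys inside compressed object streams are not detected.
    """
    return {m.group(0)[1:].decode("ascii") for m in PTEX_KEY.finditer(pdf_bytes)}


# Hypothetical XObject dictionary, written by hand for illustration.
sample = b"<< /Type /XObject /PTEX.FileName (./fig.pdf) /PTEX.InfoDict 12 0 R >>"
print(sorted(find_ptex_keys(sample)))
```

For a real file, the bytes would come from something like `open("some.pdf", "rb").read()`.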
-
Speaking as an accessibility professional (I am no expert in LaTeX), the dependence on VeraPDF is not wise. While it seems to be able to verify that tags exist in a nominal structure, the quality and usefulness of the actual tags and structure being generated are sub-standard. The lack of ActualText in math equations means this fails to meet the PDF/UA-2 or WTPDF standards. Tables are very baseline and primitive, without any header cells or scoping. Image captions are not contained correctly, and alt text seems to have some issue where it is not being picked up by screen readers. It is premature to claim any level of real compliance. All of these issues are ones that there is no automatic checker for and can only be picked up through human testing.
-
On 14.05.24 at 01:54, ErroneousBosch wrote:
> More importantly, we have to test for
> demonstrable accessibility which both the example files above and files
> we generated with TeXLive 2024 do not meet.
The big question here is why are they not meeting it? Because there are
errors in them with respect to implementing UA-2 or because consuming
software up to now is not capable of properly handling PDF 2 structures yet?
So far industry hasn't bothered much with the improved structures
provided by PDF 2 (and necessary for higher quality accessibility)
because there were (nearly) no documents that used them, so why
bother if there is no use case?
> Screen reader performance was especially poor, checking with Apple VO,
> NVDA, and Adobe's own reader. That is honestly where the rubber meets
> the road. Compliance to a standard that isn't implemented anywhere isn't
> a useful compliance, especially if it means not meeting real-world
> accessibility needs.
All true, but if your road is currently a gravel surface with huge holes
in it, the question is: do you want to continue running over it only
with noisy tanks because everything else that would be a comfortable car
will break down, or do you strive to improve the road?
Right now accessibility of PDFs is so poor because consuming software is
based on 1.7 and UA-1 plus a lot of heuristics (which differ from
implementation to implementation and therefore also do not give a good
user experience overall).
Now, by producing UA-2 docs you do not magically get better
accessibility; in fact you are likely to get even worse, because
consumer software handles the improved structures badly or not at all
and their heuristics fail with such documents.
But as was pointed out, the goal of producing documents that comply
with the new (and better) standards was to make showcases of where
current consumer software fails with PDF/UA-2 and in this way drive good
implementations of the new standard in the consumer software. With the
ability to provide a corpus of complex documents that meet PDF/UA-2,
we are fairly confident that this could happen, and in fact we already
see movement in this respect.
> Like I said, I am gathering more useful details to submit in one or more
> issues.
Please do, but also please keep in mind the purpose of the generated
documents, e.g.,
- things that we get wrong should be improved on our end to make
the documents better
- but things that go wrong in consumer apps because they do not
understand the standard should really (with some pressure) be
communicated by the community to the vendors.
-
I'm at the workshop right now! Can I request feedback for this PDF I made? The LaTeX source code uses amsthm, which has many votes to be made compatible:
-
I fail to get tagging to work. I have the following very minimal example:
and run with
This generates a file foo.pdf, but neither Preview nor Adobe Acrobat manages to read it. I then tried to use the tagpdf example files from https://github.com/latex3/tagging-project/tree/main/project-examples/tagpdf. I run
which fails with
I have TeX Live 2024 for macOS. This is the version information when compiling:
Any suggestions?
-
@davidcarlisle doing more in-depth tests, we encountered the issue on 2401.09965v1-tagged: the inline math element
and
The file names also seem to indicate the mismatch:
We suspect that there are other cases like this in this and other test PDFs, but we stopped at the first one.
-
Oh, thank you. Fortunately (or perhaps unfortunately) this is not a systematic error but a manual editing failure on my part. The arXiv examples were trialing obtaining MathML for each TeX fragment by extracting it from the arXiv-supplied XHTML version of the document. The tricky part (and why we never published the scripts used) is matching up the TeX fragments as obtained by different systems. The file fragment 18 is
I'll fix the hash and re-generate....
-
see 5dc2b94
-
LaTeX is an incredible document format. My understanding is that PDF/UA and PDF/UA-2 are tagged extensively. Does that mean PDF/UA docs can be reverse-engineered into LaTeX? Is this possible?
-
WTPDF / PDF/UA-2 Examples by the LaTeX Project
The following files demonstrate various aspects of Well Tagged PDF documents conforming to PDF/UA-2.
They were all generated with LuaLaTeX (lualatex-dev in TeX Live 2024).
The files are a mixture of small examples demonstrating specific features, older out-of-copyright documents that have been re-typeset as tagged PDF, and contemporary documents including recently published arXiv papers, course notes, and conference papers.
The files here are all PDF 2.0. PDF 1.7 versions of the same documents are available from PDF/UA-1 Examples by the LaTeX Project.
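To tell the PDF 2.0 and PDF 1.7 variants apart after downloading, one quick check is the version in the file header. The following Python sketch reads it from the first bytes. The helper name is invented for this example, and it looks only at the `%PDF-x.y` header line, ignoring any version override in the document catalog.

```python
import re
from typing import Optional

# The header is expected at the very start of the file, e.g. b"%PDF-2.0".
HEADER = re.compile(rb"%PDF-(\d\.\d)")


def pdf_header_version(pdf_bytes: bytes) -> Optional[str]:
    """Return the version string from the PDF header, or None if absent."""
    match = HEADER.match(pdf_bytes)
    return match.group(1).decode("ascii") if match else None


# Hand-written header bytes for illustration, not taken from a real file.
print(pdf_header_version(b"%PDF-2.0\n%\xe2\xe3\xcf\xd3\n"))
```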
Access to the Files
The full collection of PDF files is available at Google Drive, where you may select one or more individual files to download, or use the Download all link at the top of the page, which will generate a zip file and download the full collection.
Google Drive directory of all example PDF files
The LaTeX sources are available from this repository. Where appropriate, we also link to the original files used as source material.
Verification of PDF/UA-2 compliance
There are not yet many validators that correctly handle UA-2 (not surprising, given that the standard was released in March 2024). One online validator you can try on the smaller examples is
VeraPDF — PDF/A and PDF/UA Validation
Please note that some PDF viewers modify the PDF when opening it (to allow for annotations, for example). In some cases this is known to break PDF/UA-2 conformance. If that happens, re-download the file and use a different viewer.
The Samples
Simple Examples with MathML Associated files
All three conform to: PDF/UA-2 PDF/A-4F WTPDF/Accessibility WTPDF/Reuse Arlington
Three small examples demonstrating the use of Associated Files to tag mathematics. Each formula has two Associated Files: a LaTeX fragment representing the original source, and a MathML document.
mathml-AF-ex1
mathml-AF-ex2
Sample-AF-Math-LaTeX
amsmath LaTeX package documentation
Conforms to: PDF/UA-2 PDF/A-4F WTPDF/Accessibility WTPDF/Reuse Arlington
The amsmath package defines the main markup structures for mathematics in LaTeX. This manual has examples of many kinds of aligned equations and similar structures. This version has been enhanced to produce Well Tagged PDF.
amsldoc-tagged
tagpdf LaTeX package documentation
Conforms to: PDF/UA-2 PDF/A-4 WTPDF/Accessibility WTPDF/Reuse Arlington
The tagpdf LaTeX package is a core part of the LaTeX support for tagged PDF. Its documentation already conforms to WTPDF and PDF/UA-2, and a snapshot is included here.
tagpdf
ArXiv publications
Tagged using MathML extracted from the arXiv-supplied HTML versions of the documents.
They were each submitted to arXiv under a CC Licence permitting re-use such as this experiment. The tagged documents are available under the same licence.
Conforms to: PDF/UA-2 PDF/A-4F WTPDF/Accessibility WTPDF/Reuse Arlington
2401.09965v1-tagged — Original Source
Conforms to: PDF/UA-2 PDF/A-4F WTPDF/Accessibility WTPDF/Reuse Arlington
2401.09436v1-tagged — Original Source
Conforms to: PDF/UA-2 PDF/A-4F WTPDF/Accessibility WTPDF/Reuse Arlington
2401.05361v1-tagged — Original Source
Niels Bohr: The Theory of Spectra and Atomic Constitution; Three Essays
Conforms to: PDF/UA-2 PDF/A-4F WTPDF/Accessibility WTPDF/Reuse Arlington
These essays by Niels Bohr are available as LaTeX source from Project Gutenberg.
Additional TeX markup has been added to produce Tagged PDF. In addition, all math expressions were converted to MathML using LaTeXML.
47464-t-tagged — Original Source
William Shakespeare: MACBETH
Conforms to: PDF/UA-2 PDF/A-4 WTPDF/Accessibility WTPDF/Reuse Arlington
macbeth-tagged — Original Source
This document uses a provided LaTeX source of the play text. The LaTeX markup has been enhanced to produce Well Tagged PDF.
American Standard Version of the Bible (1901 text)
Conforms to: PDF/UA-2 PDF/A-4 WTPDF/Accessibility WTPDF/Reuse Arlington
The source is the plain text of the ASV Bible (1901) as provided by Wikisource, which has been marked up as LaTeX to generate Well Tagged PDF. This example demonstrates a custom role map with structured tagging corresponding to the Testament/Book/Chapter/Verse structure shown in this work.
ASV Bible — Original Source
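To make the idea of a custom role map for the ASV Bible example concrete, here is a small Python sketch of how a consumer might resolve custom structure tags to standard ones. The mapping targets below are purely hypothetical and are not taken from the actual document, and the sketch ignores PDF 2.0 namespaces for simplicity.

```python
from typing import Optional

# Small subset of standard structure types, enough for this illustration.
STANDARD_TYPES = {"Document", "Part", "Sect", "P"}

# Hypothetical role map: custom tag -> tag it is mapped to.
# The real example file may well use different targets.
ROLE_MAP = {
    "Testament": "Part",
    "Book": "Sect",
    "Chapter": "Sect",
    "Verse": "P",
}


def resolve_role(tag: str, role_map: dict, standard: set) -> Optional[str]:
    """Follow role-map entries until a standard type is reached.

    Returns None if the tag is unmapped or the mappings form a cycle.
    """
    seen = set()
    while tag not in standard:
        if tag in seen or tag not in role_map:
            return None
        seen.add(tag)
        tag = role_map[tag]
    return tag


for custom in ("Testament", "Book", "Chapter", "Verse"):
    print(custom, "->", resolve_role(custom, ROLE_MAP, STANDARD_TYPES))
```

The loop follows chains of mappings, so a custom tag mapped to another custom tag still resolves as long as the chain ends in a standard type.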
DEIMS 2024 Conference paper
Conforms to: PDF/UA-2 PDF/A-4 WTPDF/Accessibility WTPDF/Reuse Arlington
The paper Enhancing LaTeX to Automatically Produce Tagged and Accessible PDF, submitted to DEIMS 2024, Tokyo.
As well as describing the approach to PDF tagging used for these examples, the paper itself forms an example of tagging a contemporary conference paper. This is the version as prepared for the TeX Users Group publication, TUGboat.
tb139mitt-deims24
The presentation at the DEIMS conference including a demonstration is available as a video.
PDF Association sample poster
Conforms to: PDF/UA-2 PDF/A-4 WTPDF/Accessibility WTPDF/Reuse Arlington
An article describing the PDF Association work on accessibility produced for the PDF Association launch of Well Tagged PDF.
pdfa-art
Sample Chemistry/Math notes
This is a small contemporary document used as notes on mathematical aspects of Chemistry.
In this example, the math is associated with just LaTeX source Associated files, not MathML.
Conforms to: PDF/UA-2 PDF/A-4F WTPDF/Accessibility WTPDF/Reuse Arlington
525Da-23-group-theory
A small template exam paper.
Conforms to: PDF/UA-2 PDF/A-4F WTPDF/Accessibility WTPDF/Reuse Arlington
PHY-exam
Wilhelm Busch: Max and Moritz
Conforms to: PDF/UA-2 PDF/A-4 WTPDF/Accessibility WTPDF/Reuse Arlington
A LaTeX document without math whose main language is not English. It shows tagging of images, verse structures, and the use of more than one (marked-up) language in a document.
pg17161-tagged — Original Source