Dog Ear automatically bookmarks PDF documents by matching user-supplied regular expressions to patterns in text files generated from the PDF. Highlights include:
- Access to text files for testing regular expressions
- Regular expression templates
- In-app text editor to modify the Table of Contents (TOC) file before applying bookmarks
- Optional post-processing script (.sh or .py) to modify the TOC file
The user supplies a PDF document with embedded text, as well as a set of regular expressions. The regular expressions must appear in plain text files. The name of each text file becomes a category of bookmarks. For every PDF page in the user-supplied PDF document, Dog Ear generates a numbered text file. It then runs each regular expression against each text file and compiles matches into a table of contents file (TOC). The TOC file lists each bookmark under the appropriate category along with its corresponding page number. The user may edit the TOC file or modify it with a custom script. When happy with the TOC file, the user creates the bookmarked PDF. This can be an iterative process, with the bookmarked PDF changing according to changes made to the TOC file.
If the user supplies multiple PDF documents, they will be joined alphanumerically. If the user supplies a PDF that already contains bookmarks, the final bookmarked PDF will not keep them because Dog Ear works from a cleaned copy of the original PDF. Dog Ear comes preloaded with sample PDFs, regular-expression text files, and a sample post-processing python script. These are for demonstration purposes and can be easily removed.
Dog Ear uses multiline, case-sensitive, Python regular expressions. If more than one regular expression is used in a text file, each regular expression must be placed on a separate line. It is okay to separate the regular expressions with an empty line.
In an effort to simplify the creation of regular expressions for bookmarking, Dog Ear provides the following basic template (with some variations):
(?s)(PATTERN)(?=.*?HELPER)
In this template, (PATTERN) is a capture group, meaning that text matching PATTERN will appear in the TOC file. (?=.*?HELPER) is a positive lookahead non-capturing group. It will not appear in the TOC file, but must be present in the text file. (?s) is the DOTALL expression. Normally, the dot character matches any character except a newline. However, the DOTALL expression makes the dot also match a newline. * means that whatever character comes before it can be repeated an indefinite number of times. Therefore, by including .* before the HELPER pattern, the PATTERN will only appear in the TOC if HELPER also appears somewhere in the text file after PATTERN.
For example, the following regular expression will match any date (with an abbreviated month) only if the literal word "ABSTRACT" appears somewhere in the same text file after the matching date.
(?s)((Jan\.|Feb\.|Mar\.|Apr\.|May|Jun\.|Jul\.|Aug\.|Sept\.|Oct\.|Nov\.|Dec\.)\s\d{1,2}\,\s?\d{4}\s*$)(?=.*?ABSTRACT)
This approach works well for matching information in structured forms with consistent elements. A more precise regular expression that more directly ties the HELPER pattern to the capture group would be faster and more efficient. However, this tends to make the regular expression more complicated and less readable. But depending on your use case, speed and precision may take priority. Consult your favorite LLM for assistance.