atanasj/doc-tools-toc

Doc Tools TOC

License: GPLv3

Create, cleanup, add and manage Table Of Contents (TOC) of pdf and djvu documents with Emacs

Introduction

Doc Tools TOC is a package for creating, cleaning, adding and managing the Table Of Contents (TOC) of pdf and djvu documents.

This package is also provided by the toc-layer for Spacemacs.

Features:

  • Extract Table of Contents from documents via text layer or via Tesseract OCR
  • Auto detect indentation levels from leading spaces or by selecting a level separator
  • Quickly adjust page numbers while viewing the document
  • Add Table of Contents to document

Installation

For Spacemacs, use the toc-layer for Spacemacs. Installation instructions are found on that page.

For regular Emacs users, well… you probably know how to install packages.

Requirements

The pdf.tocgen functionality requires that pdf.tocgen is installed (see https://krasjet.com/voice/pdf.tocgen/). The remaining functionality requires the command line utilities pdftotext (part of poppler-utils), pdfoutline (part of fntsample, or from GitHub; the PyPI package appears to be broken) and djvused (part of http://djvu.sourceforge.net/). Extraction with OCR additionally requires the tesseract command line utility.
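To check quickly which of these utilities Emacs can find, something like the following sketch may help. The helper name my/doc-toc-missing-programs is hypothetical and not part of doc-toc, and the executable name pdftocgen is an assumption about how pdf.tocgen is installed on your system. The only Emacs function it relies on is the built-in executable-find.

```elisp
;; Hypothetical helper, not part of doc-toc: list the external
;; utilities that cannot be found on `exec-path'.
(defun my/doc-toc-missing-programs (&optional programs)
  "Return the names in PROGRAMS that `executable-find' cannot locate.
PROGRAMS defaults to the utilities named in the requirements;
the name \"pdftocgen\" is an assumption about the pdf.tocgen install."
  (let ((programs (or programs
                      '("pdftotext" "pdfoutline" "djvused"
                        "tesseract" "pdftocgen"))))
    ;; Keep a name only when no executable with that name is found.
    (delq nil
          (mapcar (lambda (program)
                    (unless (executable-find program) program))
                  programs))))
```

Evaluating (my/doc-toc-missing-programs) returns nil when every utility is available, and otherwise a list of the names that are missing.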

Usage

pdf-tocgen (software-generated PDFs)

https://krasjet.com/voice/pdf.tocgen/

For ‘software-generated’ PDF files (i.e. PDFs not created from scans) it is recommended to use doc-toc-extract-with-pdf-tocgen. This function needs the font properties of each headline level. To provide them, select a word in a headline of a given level and type M-x doc-toc-gen-set-level. The command asks which level you are setting; the highest level is level 1. After setting the levels (1, 2, etc.), run M-x doc-toc-extract-with-pdf-tocgen. If a TOC is extracted successfully, press C-c C-c in the pdftocgen-mode buffer to add the contents to the PDF. The contents are added to a copy of the original PDF named output.pdf, which is opened in a new buffer. If pdf-tocgen does not work well, continue with the steps below.

toc-mode

In each step below, C-h m lists the available shortcuts. You can also discover the available commands by typing the mode prefix after M-x (e.g. M-x doc-toc-cleanup) and using completion; some commands have two dashes in the prefix (e.g. M-x doc-toc--cleanup). If you use a package like Ivy or Helm, its fuzzy search does the same job.

Extraction and adding contents to a document is done in 4 steps:

  1. extraction
  2. cleanup
  3. adjust/correct page numbers
  4. add TOC to document

1. Extraction

For PDFs without TOC pages, with a very complicated TOC (i.e. one that requires much cleanup work) or with headlines well suited to automatic extraction (you will have to decide for yourself by trying it), consider using the pdf.tocgen functionality described above.

Otherwise, start by opening a pdf or djvu file in Emacs (the pdf-tools and djvu packages are recommended) and find the page numbers of the TOC. Then type M-x doc-toc-extract-pages, or M-x doc-toc-extract-pages-ocr if the document has no text layer or the text layer is bad, and answer the prompts by entering the numbers of the first and the last TOC page, each followed by RET.

For PDF extraction with OCR, all contents pages currently have to be viewed once before extraction (doc-toc uses the cached file data). For TOCs that are formatted as two columns per page, invoke doc-toc-extract-pages-ocr with two universal arguments (C-u C-u). After the prompts for the first and last page numbers, a third prompt then asks for the tesseract psm code. For a two column layout it is best (as far as I know) to use psm code 1. The languages used for tesseract OCR can be customized via the doc-toc-ocr-languages variable.
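To make the role of the psm code and the language list concrete, here is a minimal sketch of how a tesseract command line could be assembled. It is not doc-toc's implementation. The function name my/tesseract-args and the file name toc-page.png are made up, and representing the languages as a list of strings is an assumption (check C-h v doc-toc-ocr-languages for the real type). The --psm and -l options and the stdout output base belong to the tesseract command line itself.

```elisp
;; Illustration only: build an argument list for the tesseract CLI.
(defun my/tesseract-args (image languages psm)
  "Return the tesseract arguments for recognizing IMAGE.
LANGUAGES is a list of tesseract language codes (may be nil) and
PSM is the page segmentation mode as an integer."
  (append (list image "stdout" "--psm" (number-to-string psm))
          ;; tesseract expects several languages joined with \"+\".
          (when languages
            (list "-l" (mapconcat #'identity languages "+")))))

;; A two column contents page in English and German, using psm 1:
(my/tesseract-args "toc-page.png" '("eng" "deu") 1)
;; => ("toc-page.png" "stdout" "--psm" "1" "-l" "eng+deu")
```

Such a list could be handed to tesseract with (apply #'call-process "tesseract" nil t nil args), which inserts the recognized text into the current buffer.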

[Screencast: doc-toc-extract.gif]

A buffer with the extracted, somewhat cleaned up, text opens in TOC-cleanup mode. Prefix the command with the universal argument (C-u) to skip the cleanup and get the raw text. If the quality of the extracted text is too low, you can either hack/extend the doc-toc-extract-pages-ocr definition or try extracting the text with the python document-contents-extractor script, which is more configurable (you are also welcome to hack on and improve that script). The tesseract documentation might be useful for this.

If you merely want to extract text without further processing then you can use the command doc-toc-extract-only.

2. TOC-Cleanup

In this mode you can clean up the contents further to create a list where each line has the structure:

TITLE (SOME) PAGENUMBER

There can be any number of spaces between TITLE and PAGENUMBER. The page numbers themselves can be corrected in the next step. A document outline supports different levels, and levels are assigned automatically in order of increasing number of leading spaces: the lines with the fewest leading spaces get level 0, the lines with the next larger indentation get level 1, and so on. Lines with an equal number of leading spaces get the same level.

Contents   1
Chapter 1      2
 Section 1 3
  Section 1.1     4
Chapter 2      5
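The level assignment described above can be sketched as follows. This is an illustration of the rule, not the code doc-toc runs. Both function names are hypothetical, and treating the trailing number of a line as its page number is an assumption based on the line structure shown above.

```elisp
(require 'cl-lib)

;; Hypothetical helper: split one cleaned-up line into title and page.
(defun my/toc-split-line (line)
  "Return (TITLE . PAGE) for LINE, or nil if LINE has no trailing number."
  (when (string-match "\\` *\\(.*?\\) +\\([0-9]+\\) *\\'" line)
    (cons (match-string 1 line)
          (string-to-number (match-string 2 line)))))

;; Hypothetical helper: derive outline levels from leading spaces.
(defun my/toc-levels-from-indentation (lines)
  "Return the outline level of each string in LINES.
The smallest indentation gets level 0, the next larger one level 1, etc."
  (let* ((indents (mapcar (lambda (line)
                            ;; Always matches; the match end is the
                            ;; number of leading spaces.
                            (string-match "\\` *" line)
                            (match-end 0))
                          lines))
         ;; Distinct indentation widths, smallest first.
         (widths (sort (delete-dups (copy-sequence indents)) #'<)))
    (mapcar (lambda (indent) (cl-position indent widths)) indents)))

(my/toc-levels-from-indentation
 '("Contents   1" "Chapter 1      2" " Section 1 3"
   "  Section 1.1     4" "Chapter 2      5"))
;; => (0 0 1 2 0)
```

Because levels follow the rank of the indentation width rather than the width itself, lines indented by 0, 2 and 7 spaces would still get levels 0, 1 and 2.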

There are some handy functions to assist in the cleanup. C-c C-j jumps to the next line that does not end with a number and joins it with the line that follows. If the indentation of the lines does not correspond to the levels, the levels can be set automatically from the number of separators in the indices with M-x doc-toc-cleanup-set-level-by-index. The default separator is a . (period), but a different separator can be entered by preceding the command with the universal argument (C-u). Some documents contain a structure like

1 Chapter 1    1
Section 1      2

Here the indentation can be set with M-x replace-regexp, replacing ^[^0-9] with a space followed by \& (i.e. the replacement string is " \&").
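Both ideas from this step can be sketched in a few lines of Emacs Lisp. The function names are hypothetical, and the first function only illustrates the idea of counting separators in a leading numeric index; how doc-toc-cleanup-set-level-by-index handles edge cases such as letter or roman indices is not covered here. The second function is the programmatic counterpart of the replace-regexp call above.

```elisp
;; Hypothetical helper: level from the separators in a leading index.
(defun my/toc-level-from-index (line &optional separator)
  "Return the level of LINE from the SEPARATOR count in its leading index.
SEPARATOR defaults to \".\".  Lines without a leading numeric index
get level 0."
  (let ((separator (or separator ".")))
    (if (string-match "\\` *\\([0-9][^ ]*\\)" line)
        ;; \"1.2.3\" splits into three parts, i.e. level 2.  Omitting
        ;; empty parts makes a trailing separator (\"1.\") harmless.
        (max 0 (1- (length (split-string (match-string 1 line)
                                         (regexp-quote separator) t))))
      0)))

;; Hypothetical helper: same effect as the replace-regexp call above.
(defun my/toc-indent-unnumbered-lines ()
  "Prefix a space to each line of the buffer not starting with a digit."
  (save-excursion
    (goto-char (point-min))
    (while (re-search-forward "^[^0-9]" nil t)
      ;; \\& stands for the matched character; FIXEDCASE is t so the
      ;; replacement is inserted without case conversion.
      (replace-match " \\&" t))))
```

For example, (my/toc-level-from-index "1.2.3 Subsection 7") gives 2, and (my/toc-level-from-index "2-1 Title 3" "-") gives 1.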

Type C-c C-c when finished.

3. TOC-tabular (adjust pagenumbers)

This mode provides the functionality for easy adjustment of page numbers. The buffer can be navigated with the up/down arrow keys. The left and right arrow keys decrease/increase all page numbers from the current line downward; combine them with SHIFT to change only the page number of the current entry.

The TAB key jumps to the page of the current line in the document window, while C-right/C-left increase/decrease all remaining page numbers and also show the page of the current line in the document window. Because the numbering of scanned books often breaks at sections of a certain level, C-j lets you jump quickly to the next entry of a certain level (e.g. you can quickly check whether the page numbers of all level 0 sections correspond to the page numbers in the document). In the tablist window, S-up/S-down scroll the document window by a full page, and C-up/C-down scroll that window smoothly.

If you discover a small error in a field, put the cursor on that field and press r to correct its text.
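The effect of shifting the remaining page numbers can be sketched on plain data. The representation of entries as (LEVEL TITLE PAGE) lists and the function name are assumptions made for this illustration; doc-toc itself works on the tablist buffer.

```elisp
;; Hypothetical helper: shift the page numbers from one entry onward.
(defun my/toc-shift-pages (entries index delta)
  "Return a copy of ENTRIES with DELTA added to the pages from INDEX on.
ENTRIES is a list of (LEVEL TITLE PAGE) lists; INDEX counts from 0.
Entries before INDEX are returned unchanged."
  (let ((i -1))
    (mapcar (lambda (entry)
              (setq i (1+ i))
              (if (< i index)
                  entry
                (list (nth 0 entry)
                      (nth 1 entry)
                      (+ (nth 2 entry) delta))))
            entries)))

;; Printed page 2 is page 14 of the scan, so shift by 12 from the
;; second entry onward:
(my/toc-shift-pages '((0 "Contents" 1) (0 "Chapter 1" 2) (1 "Section 1" 3))
                    1 12)
;; => ((0 "Contents" 1) (0 "Chapter 1" 14) (1 "Section 1" 15))
```

Shifting everything below the current line at once is what makes a constant offset between printed and scanned page numbers quick to correct: one adjustment at the first wrong entry fixes all entries after it.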

Type C-c C-c when done.

4. TOC-mode (add outline to document)

The text of this buffer should have the right structure for adding the contents to the original document (for PDFs, optionally to a copy of it). Final adjustments can be made but should not be necessary. Type C-c C-c to add the contents to the document.

By default, the TOC is simply added to the original file. For PDFs only: if the customizable variable doc-toc-replace-original-file is nil, the TOC is instead added to a copy of the original PDF, at the path given by the variable doc-toc-destination-file-name. That path can be either absolute or relative to the directory of the original file.
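A configuration along these lines might look like the sketch below. The values are assumptions for illustration: the file name with-toc.pdf and the directory /home/user/books/ are made up, and the accepted value types should be checked with C-h v. The expand-file-name calls only show how Emacs itself resolves relative and absolute names on a Unix-like system; whether doc-toc resolves the destination in exactly this way is also an assumption.

```elisp
;; Assumed values; check each variable's documentation with C-h v.
(setq doc-toc-replace-original-file nil)            ; leave the original PDF untouched
(setq doc-toc-destination-file-name "with-toc.pdf") ; hypothetical relative name

;; A relative name is resolved against the given directory:
(expand-file-name "with-toc.pdf" "/home/user/books/")
;; => "/home/user/books/with-toc.pdf"

;; An absolute name ignores the directory:
(expand-file-name "/tmp/with-toc.pdf" "/home/user/books/")
;; => "/tmp/with-toc.pdf"
```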

Sometimes the pdfoutline/djvused application is not able to add the TOC to the document. In that case you can debug the problem by copying the terminal command from the *Messages* buffer and running it manually in a terminal inside the document’s folder. Alternatively, delete the outline source buffer and run doc-toc--tablist-to-handyoutliner from the tablist buffer to get an outline source file that can be used with HandyOutliner. Unfortunately the handyoutliner command does not take arguments, but if you customize the doc-toc-handyoutliner-path and doc-toc-file-browser-command variables, Emacs will try to open HandyOutliner and the file browser so that you can drag the file contents.txt directly into HandyOutliner.

Key bindings

all-modes (i.e. all steps)

Key Binding    Description
C-c C-c        dispatch (next step)

doc-toc-cleanup-mode

C-c C-j        doc-toc-join-next-unnumbered-lines
C-c C-s        doc-toc--roman-to-arabic

doc-toc-mode (tablist)

TAB              preview/jump-to-page
right/left       doc-toc-in/decrease-remaining
C-right/C-left   doc-toc-in/decrease-remaining and view page
S-right/S-left   in/decrease pagenumber current entry
C-down/C-up      scroll document other window (only when other buffer shows document)
S-down/S-up      full page scroll document other window (idem)
C-j              doc-toc--jump-to-next-entry-by-level
r                doc-toc--replace-input
