|
1 | | -# pdfAssembler |
| 1 | +# PDF Assembler |
| 2 | + |
| 5 | + |
| 6 | +The missing piece to edit PDF files directly in the browser. |
| 7 | + |
| 8 | +PDF Assembler disassembles PDF files into editable JavaScript objects, then assembles them back into PDF files, ready to save, download, or open. |
| 9 | + |
| 10 | +## Overview |
| 11 | + |
| 12 | +Actually, PDF Assembler itself only does one thing: it assembles PDF files (hence the name). However, it uses Mozilla's terrific [pdf.js](https://mozilla.github.io/pdf.js/) library to disassemble PDFs into JavaScript objects. Those objects can then be modified, after which PDF Assembler can re-assemble them back into PDFs, to display, save, or download. |
| 13 | + |
| 14 | +### Scope and future development |
| 15 | + |
| 16 | +PDF is a complex format (the [ISO standard describing it](https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf) is 756 pages long). PDF Assembler makes working with PDFs (somewhat) simpler by separating the physical structure of a PDF from its logical structure. In the future, PDF Assembler will likely offer better defaults for generating PDFs, such as cross-reference streams and object compression, as well as more options, such as linearizing or encrypting the output PDF. However, editing features (like adding or editing pages, or even centering or wrapping text) are outside the scope of this library. |
| 17 | + |
| 18 | +### Alternatives |
| 19 | + |
| 20 | +If you want a library to simplify creating PDFs, in a browser or on a server, you can use [jsPDF](https://github.com/MrRio/jsPDF) or [PDFKit](https://github.com/devongovett/pdfkit). |
| 21 | + |
| 22 | +If you want to simplify editing existing PDFs on a server, you can use the command-line tools [QPDF](http://qpdf.sourceforge.net/) or [PDFTk](https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/), the Java tools [PDFBox](https://pdfbox.apache.org/) or [iText](https://github.com/ymasory/iText-4.2.0), or the Node module [Hummus](https://github.com/galkahana/HummusJS/wiki). |
| 23 | + |
| 24 | +If you want to simplify editing existing PDFs in a browser, I haven't found that library yet. This library helps, but still requires a good understanding of how the logical structure of a PDF works. |
| 25 | + |
| 26 | +If you want to learn more about the logical structure of PDFs, I recommend O'Reilly's [PDF Explained](http://shop.oreilly.com/product/0636920021483.do). If you use this library, pdf.js and PDF Assembler will take care of reading and writing the raw bytes of the PDF, so you can skip to Chapter 4, "Document Structure": |
| 27 | + |
| 28 | + |
| 29 | +Figure 4-1 shows the logical structure of a typical document. (PDF Explained, Chapter 4, page 39) |
| 30 | + |
| 31 | + |
| 32 | +## How it works - the PDF structure object |
| 33 | + |
| 34 | +PDF Assembler accepts or creates a PDF structure object, which is a specially formatted JavaScript object that represents the logical structure of a PDF document as simply as possible, by mapping each type of PDF data to its closest JavaScript counterpart: |
| 35 | + |
| 36 | +| PDF data type | JavaScript data type | |
| 37 | +|---------------|--------------------------------------| |
| 38 | +| dictionary | object | |
| 39 | +| array | array | |
| 40 | +| number | number | |
| 41 | +| name | string, starting with "/" | |
| 42 | +| string | string, surrounded with "()" or "<>" | |
| 43 | +| boolean | boolean | |
| 44 | +| null | null | |
| 45 | + |
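To make the mapping in the table concrete, here is a minimal sketch of how each JavaScript type could be written out as PDF syntax. The `toPdfSyntax` function is a hypothetical helper written for this illustration; it is not part of PDF Assembler's API and is not how the library is implemented. It also ignores streams, indirect references, and unusual number formats.

```javascript
// Hypothetical helper for illustration only (not PDF Assembler's API).
// Converts a structure-object value to PDF syntax, following the table above.
function toPdfSyntax(value) {
  if (value === null) return 'null';
  if (typeof value === 'boolean') return value ? 'true' : 'false';
  if (typeof value === 'number') return String(value);
  // Names ("/Name") and strings ("(text)" or "<hex>") are already in PDF form.
  if (typeof value === 'string') return value;
  if (Array.isArray(value)) {
    return '[ ' + value.map(toPdfSyntax).join(' ') + ' ]';
  }
  if (typeof value === 'object') {
    const entries = Object.keys(value)
      .map(function (key) { return key + ' ' + toPdfSyntax(value[key]); });
    return '<< ' + entries.join(' ') + ' >>';
  }
  throw new TypeError('Unsupported value: ' + String(value));
}

const fontDictionary = {
  '/Type': '/Font',
  '/BaseFont': '/Helvetica',
  '/Subtype': '/Type1'
};

console.log(toPdfSyntax(fontDictionary));
// prints: << /Type /Font /BaseFont /Helvetica /Subtype /Type1 >>
```

Because names and strings already carry their PDF delimiters ("/", "()", or "<>"), the sketch can pass them through unchanged, which is the main convenience of this representation.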
| 46 | +### "Hello world" example |
| 47 | + |
| 48 | +Here's the structure object for a simple "Hello world" PDF: |
| 49 | + |
| 50 | +```JavaScript |
| 51 | +const helloWorldPdf = { |
| 52 | +  '/Root': { |
| 53 | +    '/Type': '/Catalog', |
| 54 | +    '/Pages': { |
| 55 | +      '/Type': '/Pages', |
| 56 | +      '/Count': 1, |
| 57 | +      '/Kids': [ { |
| 58 | +        '/Type': '/Page', |
| 59 | +        '/MediaBox': [ 0, 0, 612, 792 ], |
| 60 | +        '/Contents': [ { |
| 61 | +          'stream': '1 0 0 1 72 708 cm BT /Helv 12 Tf (Hello world!) Tj ET' |
| 62 | +        } ], |
| 63 | +        '/Resources': { |
| 64 | +          '/Font': { |
| 65 | +            '/Helv': { |
| 66 | +              '/Type': '/Font', |
| 67 | +              '/BaseFont': '/Helvetica', |
| 68 | +              '/Subtype': '/Type1' |
| 69 | +            } |
| 70 | +          } |
| 71 | +        }, |
| 72 | +      } ], |
| 73 | +    } |
| 74 | +  } |
| 75 | +}; |
| 76 | +``` |
| 77 | + |
| 78 | +In this object, the main document catalog dictionary is '/Root' (and if there were a document information dictionary, it would be '/Info', because '/Root' and '/Info' are the names used to refer to these objects in the PDF trailer dictionary). |
| 79 | + |
| 80 | +There are a few small differences from a true PDF structure. For example, streams are _inside_ their dictionary objects in order to keep them together, even though in the final PDF they will be rendered immediately after their dictionaries instead. |
| 81 | + |
| 82 | +Also, structure objects do not need to include stream '/Length' or page '/Parent' entries, because those entries will be automatically calculated and added when the PDF is assembled. (Adding them won't hurt anything, but there is no reason to, as they will just be overwritten.) |
| 83 | + |
| 84 | +### Re-using shared dictionary items |
| 85 | + |
| 86 | +If you want to use the same dictionary object in multiple places in a PDF, simply set the second location equal to the first, to create a reference from one part of the PDF structure object to another. (PDF Assembler will automatically recognize this, and sort out the details of creating an indirect object and adding PDF object references in the appropriate places.) |
| 87 | + |
| 88 | +For example, here is how to add a second page to the above PDF, and re-use the resources from the first page: |
| 89 | +```javascript |
| 90 | +// add new page |
| 91 | +helloWorldPdf['/Root']['/Pages']['/Kids'].push({ |
| 92 | +  '/Type': '/Page', |
| 93 | +  '/MediaBox': [ 0, 0, 612, 792 ], |
| 94 | +  '/Contents': [ { |
| 95 | +    'stream': '1 0 0 1 72 708 cm BT /Helv 12 Tf (This is page two!) Tj ET' |
| 96 | +  } ] |
| 97 | +}); |
| 98 | + |
| 99 | +// assign page 2 (/Kids array item 1) to re-use |
| 100 | +// the resources from page 1 (/Kids array item 0) |
| 101 | +helloWorldPdf['/Root']['/Pages']['/Kids'][1]['/Resources'] = |
| 102 | +  helloWorldPdf['/Root']['/Pages']['/Kids'][0]['/Resources']; |
| 103 | +``` |
| 104 | + |
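Sharing works because both locations point at the same JavaScript object, not at two equal-looking copies. The sketch below illustrates that idea with a hypothetical `findSharedObjects` helper, which lists every object reachable from more than one place in a structure object. It is not part of PDF Assembler, and it is only an assumption that the library detects sharing in a similar way.

```javascript
// Hypothetical helper for illustration only (not PDF Assembler's API).
// Returns every object that is reachable from more than one place.
function findSharedObjects(root) {
  const seen = new Set();
  const shared = new Set();
  (function visit(value) {
    if (value === null || typeof value !== 'object') return;
    if (seen.has(value)) { shared.add(value); return; }
    seen.add(value);
    const children = Array.isArray(value) ? value : Object.values(value);
    children.forEach(visit);
  })(root);
  return Array.from(shared);
}

const resources = {
  '/Font': {
    '/Helv': { '/Type': '/Font', '/BaseFont': '/Helvetica', '/Subtype': '/Type1' }
  }
};
const pages = {
  '/Type': '/Pages',
  '/Count': 2,
  '/Kids': [
    { '/Type': '/Page', '/Resources': resources },
    { '/Type': '/Page', '/Resources': resources } // same object, not a copy
  ]
};

const sharedObjects = findSharedObjects(pages);
console.log(sharedObjects.length === 1 && sharedObjects[0] === resources); // true
```

One practical consequence: cloning a structure object (for example with `JSON.parse(JSON.stringify(...))`) produces separate copies of anything that was shared, so the shared identity is lost.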
| 105 | +### Grouping page trees |
| 106 | + |
| 107 | +By default, PDF Assembler takes care of grouping pages for you. When you import a document, it will automatically flatten the page tree into one long array, and then re-group the pages when assembling the final PDF. Optionally, you can change the group size (the default is 16), or disable grouping. But in general, you can forget about grouping and just let PDF Assembler take care of it. |
| 108 | + |
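The flatten-then-regroup idea described above can be sketched as follows. Both `flattenPageTree` and `groupPages` are hypothetical helpers written for this illustration, not PDF Assembler functions, and the sketch only builds a single level of groups, which may differ from what the library actually produces.

```javascript
// Hypothetical helpers for illustration only (not PDF Assembler's code).

// Collects every /Page leaf from a (possibly nested) /Pages tree, in order.
function flattenPageTree(node) {
  if (node['/Type'] === '/Page') return [node];
  return (node['/Kids'] || []).reduce(function (pages, kid) {
    return pages.concat(flattenPageTree(kid));
  }, []);
}

// Splits a flat page array into /Pages nodes of at most groupSize pages each.
// Short documents are returned unchanged, since they need no grouping.
function groupPages(pages, groupSize = 16) {
  if (pages.length <= groupSize) return pages;
  const groups = [];
  for (let i = 0; i < pages.length; i += groupSize) {
    const kids = pages.slice(i, i + groupSize);
    groups.push({ '/Type': '/Pages', '/Count': kids.length, '/Kids': kids });
  }
  return groups;
}

// A 40-page document whose first 25 pages sit in a nested /Pages node.
const makePage = function () {
  return { '/Type': '/Page', '/MediaBox': [ 0, 0, 612, 792 ] };
};
const nestedTree = {
  '/Type': '/Pages',
  '/Count': 40,
  '/Kids': [
    { '/Type': '/Pages', '/Count': 25, '/Kids': Array.from({ length: 25 }, makePage) }
  ].concat(Array.from({ length: 15 }, makePage))
};

const flatPages = flattenPageTree(nestedTree);
const groupedPages = groupPages(flatPages);
console.log(flatPages.length);                                           // 40
console.log(groupedPages.map(function (g) { return g['/Count']; }));     // [ 16, 16, 8 ]
```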
| 109 | +## Installing and using PDF Assembler |
| 110 | + |
| 111 | +### Installing from NPM |
| 112 | + |
| 113 | +So, if you're not scared off yet, and still want to use PDF Assembler in your project, it's pretty simple. |
| 114 | + |
| 115 | +```shell |
| 116 | +npm install pdfassembler |
| 117 | +``` |
| 118 | + |
| 119 | +Next, import pdfassembler in your project, like so: |
| 120 | + |
| 121 | +```javascript |
| 122 | +const PDFAssembler = require('pdfassembler').PDFAssembler; |
| 123 | +``` |
| 124 | + |
| 125 | +or, in ES6: |
| 126 | + |
| 127 | +```javascript |
| 128 | +import { PDFAssembler } from 'pdfassembler'; |
| 129 | +``` |
| 130 | + |
| 131 | +### Loading a PDF |
| 132 | + |
| 133 | +To use PDF Assembler, you must create a new PDFAssembler instance and initialize it, either with your own PDF structure object: |
| 134 | +```javascript |
| 135 | +// helloWorldPdf = the pdf object defined above |
| 136 | +const newPdf = new PDFAssembler(helloWorldPdf); |
| 137 | +``` |
| 138 | + |
| 139 | +Or, by importing a binary PDF file: |
| 140 | +```javascript |
| 141 | +// binaryPDF = a Blob, File, ArrayBuffer, or TypedArray containing a PDF file |
| 142 | +const newPdf = new PDFAssembler(binaryPDF); |
| 143 | +``` |
| 144 | + |
| 145 | +### Editing the PDF object |
| 146 | + |
| 147 | +After you've created a new PDFAssembler instance, you can request a promise with the PDF structure object, and then edit it. |
| 148 | +(Some of PDF Assembler's actions are asynchronous, so it's necessary to use a promise to make sure the PDF is fully loaded before you edit it.) |
| 149 | + |
| 150 | +For example, here is how to edit a PDF to remove all but the first page: |
| 151 | +```javascript |
| 152 | +newPdf |
| 153 | +  .pdfObject() |
| 154 | +  .then(function(pdf) { |
| 155 | +    pdf['/Root']['/Pages']['/Kids'] = pdf['/Root']['/Pages']['/Kids'].slice(0, 1); |
| 156 | +  }); |
| 157 | +``` |
| 158 | + |
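The same kind of edit can be generalized to keeping or reordering any set of pages. The sketch below uses a hypothetical `selectPages` helper (not part of PDF Assembler) that works on a plain structure object, and it assumes the page tree is already one flat array, as described for imported documents. It also updates '/Count' defensively; the example above does not, so the library may well recompute it during assembly.

```javascript
// Hypothetical helper for illustration only (not PDF Assembler's API).
// Keeps only the pages at the given zero-based indices, in the given order.
function selectPages(pdf, indices) {
  const pageTree = pdf['/Root']['/Pages'];
  const kids = pageTree['/Kids'];
  pageTree['/Kids'] = indices.map(function (i) {
    if (!Number.isInteger(i) || i < 0 || i >= kids.length) {
      throw new RangeError('No page at index ' + i);
    }
    return kids[i];
  });
  pageTree['/Count'] = pageTree['/Kids'].length; // defensive; see note above
  return pdf;
}

// A stand-in structure object with three pages of different sizes.
const examplePdf = {
  '/Root': {
    '/Type': '/Catalog',
    '/Pages': {
      '/Type': '/Pages',
      '/Count': 3,
      '/Kids': [
        { '/Type': '/Page', '/MediaBox': [ 0, 0, 612, 792 ] },
        { '/Type': '/Page', '/MediaBox': [ 0, 0, 595, 842 ] },
        { '/Type': '/Page', '/MediaBox': [ 0, 0, 792, 612 ] }
      ]
    }
  }
};
const firstPage = examplePdf['/Root']['/Pages']['/Kids'][0];
const thirdPage = examplePdf['/Root']['/Pages']['/Kids'][2];

// Keep the third page followed by the first page; drop the second.
selectPages(examplePdf, [ 2, 0 ]);
console.log(examplePdf['/Root']['/Pages']['/Kids'][0] === thirdPage); // true
```

With a real document, a helper like this would be called inside the `.then(function(pdf) { ... })` callback shown above, once the structure object has loaded.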
| 159 | +### Problems with outlines and internal references |
| 160 | + |
| 161 | +PDF Assembler does a good job managing page contents, and will automatically discard unused contents from deleted pages, while still retaining any contents used on other pages. However, if a PDF contains an outline or internal references that refer to a deleted page, those will cause errors in the assembled PDF file. (The PDF may still open and display, but the PDF reader will probably show an error message.) As a somewhat crude (and hopefully temporary) solution for this, PDF Assembler provides a function for removing all non-printable data from the root catalog, like so: |
| 162 | + |
| 163 | +```javascript |
| 164 | +newPdf.removeRootEntries(); |
| 165 | +``` |
| 166 | + |
| 167 | +The trade-off is that after running removeRootEntries(), your assembled PDF is less likely to have errors, and may also be smaller in size, but will also not have any outline or other non-printing information available in the original PDF. |
| 168 | + |
| 169 | +### Assembling a new PDF file from the PDF structure object |
| 170 | + |
| 171 | +After editing, call assemblePdf() with a name for your new PDF, and PDF Assembler will assemble your PDF structure object and return a promise for a [File](https://developer.mozilla.org/en-US/docs/Web/API/File) containing your PDF, ready to download or save or whatever you want. |
| 172 | + |
| 173 | +For example, here's how to assemble a PDF and use [file-saver](https://www.npmjs.com/package/file-saver) to save it: |
| 174 | +```javascript |
| 175 | +const fileSaver = require('file-saver'); |
| 176 | +// ... |
| 177 | +newPdf |
| 178 | +  .assemblePdf('assembled-output-file.pdf') |
| 179 | +  .then(function(pdfFile) { |
| 180 | +    fileSaver.saveAs(pdfFile, 'assembled-output-file.pdf'); |
| 181 | +  }); |
| 182 | +``` |
| 183 | + |
| 184 | +### PDF Assembler options |
| 185 | + |
| 186 | +PDF Assembler has a few options that will change its behavior. All options can be set any time after you have created a new PDFAssembler instance and before you have assembled your final PDF, like so: |
| 187 | + |
| 188 | +```javascript |
| 189 | +newPdf.compress = false; |
| 190 | +newPdf.indent = true; |
| 191 | +``` |
| 192 | + |
| 193 | +| option | default | description | |
| 194 | +|---------------|---------|---------------| |
| 195 | +| indent | false | Indents output to make it easier to read if you open the PDF in a text editor to look at the structure. Accepts a String or Number, similar to the space parameter in [JSON.stringify](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON/stringify). |
| 196 | +| compress | true | Compresses streams. |
| 197 | +| groupPages | true | Groups pages. |
| 198 | +| pageGroupSize | 16 | Size of largest page group. |