diff --git a/docs/dev/cmaps.md b/docs/dev/cmaps.md
new file mode 100644
index 000000000..bef86dcff
--- /dev/null
+++ b/docs/dev/cmaps.md
@@ -0,0 +1,52 @@
+# CMaps
+
+Looking at the cmap of "crazyones":
+
+```bash
+pdftk crazyones.pdf output crazyones-uncomp.pdf uncompress
+```
+
+You can see this:
+
+```text
+begincmap
+/CMapName /T1Encoding-UTF16 def
+/CMapType 2 def
+/CIDSystemInfo <<
+  /Registry (Adobe)
+  /Ordering (UCS)
+  /Supplement 0
+>> def
+1 begincodespacerange
+<00> <FF>
+endcodespacerange
+1 beginbfchar
+<1B> <FB00>
+endbfchar
+endcmap
+CMapName currentdict /CMap defineresource pop
+```
+
+## codespacerange
+
+A codespacerange maps a complete sequence of bytes to a range of Unicode glyphs.
+It defines a starting point:
+
+```text
+1 beginbfchar
+<1B> <FB00>
+```
+
+That means that `1B` (hex for 27) maps to the Unicode character [`FB00`](https://unicode-table.com/en/FB00/) - the ligature ff (two lowercase f's).
+
+The two numbers in `begincodespacerange` mean that it starts with an offset of
+0 (hence `1B ➜ FB00`) and goes up to an offset of FF (dec: 255), hence
+1B+FF = 11A (dec: 282) ➜ [FBFF](https://www.compart.com/de/unicode/U+FBFF).
+
+Within the text stream, there is
+
+```text
+(The)-342(mis\034ts.)
+```
+
+`\034` is octal for 28 decimal.
diff --git a/docs/dev/pdf-format.md b/docs/dev/pdf-format.md
index a1b588b28..30d8a0ef6 100644
--- a/docs/dev/pdf-format.md
+++ b/docs/dev/pdf-format.md
@@ -84,7 +84,7 @@ startxref 1234
 
 Let's go through it:
 
-* `trailer <<` indicates that the *trailer dictionary` starts. It ends with `>>`.
+* `trailer <<` indicates that the *trailer dictionary* starts. It ends with `>>`.
 * `startxref` is a keyword followed by the byte-location of the `xref` keyword.
   As the trailer is always at the bottom of the file, this allows readers to
   quickly find the xref table.
@@ -99,3 +99,15 @@ Table 3.13 of the PDF Reference 1.7, e.g. `/Root` and `/Size` (both are required
     * `R` is the keyword that indicates that the object is a reference to the
       catalog dictionary.
 * `/Size` (integer) contains the total number of entries in the files xref table.
+
+
+## Reading PDF files
+
+Most PDF files are compressed. If you want to read them, first uncompress them:
+
+```bash
+pdftk crazyones.pdf output crazyones-uncomp.pdf uncompress
+```
+
+Then rename `crazyones-uncomp.pdf` to `crazyones-uncomp.txt` and open it in
+your favorite IDE / text editor.
diff --git a/docs/index.rst b/docs/index.rst
index 4065e7b16..90db760e4 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -54,6 +54,7 @@ You can contribute to `PyPDF2 on Github `_.
 
    dev/intro
    dev/pdf-format
+   dev/cmaps
 
 .. toctree::
    :caption: About PyPDF2
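
Note on the new `cmaps.md` page: the arithmetic it walks through (octal `\034` ➜ decimal 28, then an offset from `1B ➜ FB00`) can be checked with a few lines of Python. This is only a sketch of the reading given in that page, not PyPDF2 code. The names `BASE_CODE`, `BASE_UNICODE`, `ligature_for` and `expand_octal_escapes` are hypothetical, and the assumption that codes after `1B` map linearly onto the code points after `FB00` comes from the page's own description rather than from the PDF specification.

```python
import re

# Values taken from the cmaps.md text: code 1B is stated to map to U+FB00.
BASE_CODE = 0x1B
BASE_UNICODE = 0xFB00


def ligature_for(octal_escape: str) -> str:
    """Map an octal escape such as "034" to a character via the offset reading."""
    code = int(octal_escape, 8)        # "034" (octal) -> 28 (decimal) -> 0x1C
    offset = code - BASE_CODE          # distance from the bfchar source code
    return chr(BASE_UNICODE + offset)  # assumption: the mapping is linear


def expand_octal_escapes(literal: str) -> str:
    """Replace each three-digit octal escape in a literal with its mapped character."""
    return re.sub(r"\\([0-7]{3})", lambda m: ligature_for(m.group(1)), literal)


# Show the code point that \034 resolves to under this reading.
print(f"U+{ord(ligature_for('034')):04X}")
```

Under this reading, `expand_octal_escapes(r"mis\034ts.")` replaces the escape with the single "fi" ligature character, so the text-stream fragment reads as "misfits.".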