How to extract contents from PDF
PDF, just like Excel, is a widely used file format and is everywhere.
Classic examples of important official documents published only as PDF:
- The Paris Climate Agreement text was published as PDF. Some of the tools described here – plus the usual blood, sweat and tears – were used to turn it back into usable HTML for our Paris COP21 Climate Treaty Texts site.
- China's National Health and Family Planning Commission (卫计委) has published many health information standards, all of them as PDF files.
We can broadly divide PDFs into native PDFs and scanned PDFs. To be clear, a native PDF contains no characters inside images, just text. But be careful: you should treat a PDF with embedded fonts like a scanned PDF, because embedded fonts cannot be perfectly extracted. To verify whether a font is embedded, open the PDF with Acrobat Reader, copy some text and paste it into another application such as Word or Notepad; if the text is not recognized, the font is embedded. Alternatively, you can use poppler-utils to find out.
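A quick, rough way to tell the two cases apart programmatically is to check whether any text layer can be pulled out at all. The sketch below is only a heuristic, built on pdfminer.six (covered later); the `looks_native` helper name, the file name and the character threshold are all assumptions to adapt:

```python
# Rough heuristic, not a definitive test: if almost no text can be extracted,
# the PDF is probably image-only (scanned) or its fonts cannot be decoded,
# so it should be routed to OCR instead.
# Requires: pip install pdfminer.six
from pdfminer.high_level import extract_text

def looks_native(path, min_chars=100):
    """Return True if the first few pages yield a non-trivial text layer."""
    text = extract_text(path, maxpages=5)
    return len(text.strip()) >= min_chars

if __name__ == "__main__":
    print("native" if looks_native("sample.pdf") else "treat as scanned")
```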
For scanned PDFs, all you can do is use OCR, and every existing OCR engine is sensitive to poor-quality scans. To improve OCR recognition quality you can:
- Rescan your document at a higher resolution.
- Try to remove any handwritten text, watermarks, etc.
- Improve your pre-processing, with steps such as noise removal, rotation correction and deskewing (see the sketch after this list).
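As a rough illustration of that pre-processing step, here is a minimal sketch using Pillow and pytesseract (neither is prescribed above; the file name, filter sizes, threshold and skew angle are all assumptions to tune for your scans):

```python
# Minimal pre-processing sketch before OCR; values are illustrative only.
# Requires: pip install pillow pytesseract (plus a tesseract install).
from PIL import Image, ImageFilter, ImageOps
import pytesseract

img = Image.open("page.png")                          # one exported scan page
img = ImageOps.grayscale(img)                         # drop colour noise
img = img.resize((img.width * 2, img.height * 2))     # upscale a low-res scan
img = img.filter(ImageFilter.MedianFilter(size=3))    # remove speckle noise
img = img.point(lambda px: 255 if px > 160 else 0)    # crude binarisation
img = img.rotate(-1.5, expand=True, fillcolor=255)    # undo a known skew angle

print(pytesseract.image_to_string(img, lang="eng"))
```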
Depending on the data you want, the task falls into the following categories:
- Extracting text from native PDF
- Extracting tables from native PDF
- Extracting text from scanned PDFs where the content is not text but is images (for example, scans)
- Extracting tables from scanned PDFs where the content is not text but is images (for example, scans)
The first two are a generic topic where we deal with native PDFs without embedded fonts. The last two cases are really a job for OCR (optical character recognition), so we are going to largely ignore them here; we may do a follow-up post on this.
- PDFMiner - PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.
  - Pure Python.
  - In our trials PDFMiner has performed excellently and we rate it as one of the best tools out there (see the sketch below).
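For example, here is a minimal sketch of the kind of positional text access PDFMiner offers, using the maintained pdfminer.six fork; the file name is a placeholder:

```python
# Print each text block with its position; a small sketch, not a full pipeline.
# Requires: pip install pdfminer.six
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

for page_layout in extract_pages("report.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            x0, y0, x1, y1 = element.bbox   # PDF points, origin at bottom-left
            print(f"({x0:.0f}, {y0:.0f}) {element.get_text().strip()}")
```

The same package also ships a pdf2txt.py command-line tool for one-off conversions to plain text or HTML.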
- pdftohtml - pdftohtml is a utility which converts PDF files into HTML and XML formats. Based on xpdf. One of the better tools for tables, but we have found PDFMiner somewhat better of late. Command-line, Linux.
- pdftoxml - command-line utility to convert PDF to XML, built on poppler.
- docsplit - part of DocumentCloud. Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text (via OCR if necessary), page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages…).
- pypdf2xml - convert PDF to XML. Built on pdfminer. Started as an alternative to poppler's pdftoxml, which didn't properly decode CID Type2 fonts in PDFs.
- pdf2htmlEX - convert PDF to HTML without losing text or format. C++. Fast. Primarily focused on producing HTML that exactly resembles the original PDF; of limited use for straightforward text extraction, as it generates CSS-heavy HTML that replicates the exact look of a PDF document.
- pdf.js - you probably want a fork like pdf2json or node-pdfreader that integrates this better with Node. Not tried this on tables though…
* Max Ogden has this list of Node libraries and tools for working with PDFs: [https://gist.github.com/maxogden/5842859](https://gist.github.com/maxogden/5842859)
* Here's a gist showing how to use pdf2json: [https://gist.github.com/rgrp/5944247](https://gist.github.com/rgrp/5944247)
- Apache Tika - Java library for extracting metadata and content from many document types, including PDF.
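If you would rather drive Tika from Python than from Java, one common route is the third-party tika package, which starts a local Tika server behind the scenes. A minimal sketch (assuming Java and the package are installed; the file name is a placeholder):

```python
# Extract metadata and plain text from a PDF via Apache Tika.
# Requires: pip install tika (and a Java runtime for the Tika server).
from tika import parser

parsed = parser.from_file("document.pdf")
print(parsed["metadata"].get("Content-Type"))   # e.g. application/pdf
print((parsed["content"] or "")[:500])          # first 500 chars of the text
```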
- Apache PDFBox - Java library specifically for creating, manipulating and getting content from PDFs.
- Tabula - open-source, designed specifically for tabular data. Now easy to install. Ruby-based (see the tabula-py sketch after this list).
- https://github.com/okfn/pdftables - open-source. Originally created by ScraperWiki; the original is now closed-source and powers PDFTables, so this is a fork.
- pdftohtml - one of the better tools for tables, but we have not used it for a while.
- https://github.com/liberit/scraptils/blob/master/scraptils/tools/pdf2csv.py - AGPLv3+, Python; scraptils has other useful tools as well. pdf2csv needs pdfminer==20110515.
- Using scraperwiki + pdftoxml - see this recent tutorial Get Started With Scraping – Extracting Simple Tables from PDF Documents
- pdftabextract - a set of tools for extracting tables from PDF files, helping to do data mining on (OCR-processed) scanned documents. Documentation: https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-pdfs-using-pdftabextract-to-liberate-tabular-data-from-scanned-documents/
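As a small example of the Tabula route mentioned above, here is a sketch using tabula-py, a Python wrapper around the same Java engine that powers Tabula (the wrapper itself is not mentioned above; the file name and page range are placeholders):

```python
# Pull tables out of a native PDF into CSV files; a sketch, not a full pipeline.
# Requires: pip install tabula-py (plus a Java runtime).
import tabula

# Recent tabula-py versions return a list of pandas DataFrames, one per table.
tables = tabula.read_pdf("tables.pdf", pages="1-3", lattice=True)
for i, df in enumerate(tables):
    df.to_csv(f"table_{i}.csv", index=False)
```

lattice=True assumes the tables have ruling lines; for whitespace-separated tables, stream mode tends to work better.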
- http://givemetext.okfnlabs.org/ - Give me Text is a free, easy-to-use, open-source web service that extracts text from PDFs and other documents using Apache Tika (and was built by Labs member Matt Fullerton).
- http://pdfx.cs.man.ac.uk/ - has a nice command-line interface.
* Is this open? Says at [bottom of usage](http://pdfx.cs.man.ac.uk/usage) that it is powered by http://www.utopiadocs.com/
- Note that as of 2016 this seems more focused on conversion to structured XML for scientific articles but may still be useful
- ScraperWiki - https://views.scraperwiki.com/run/pdf-to-html-preview-1/ and this tutorial - no longer working as of 2016.
There are many online services – just do a search – so we do not propose a comprehensive list. Two that we have tried and that seem promising are:
- http://www.newocr.com/ - free, with an API; a very bare-bones site but quite good results based on our limited testing.
- https://pdftables.com/ - pay-per-page service focused on tabular data extraction from the folks at ScraperWiki
We also note that Google App Engine used to offer this, but unfortunately it seems to have been discontinued.
pdfsandwich generates "sandwich" OCR PDF files: PDF files which contain only images (no text) are processed by optical character recognition (OCR), and the recognized text is added to each page invisibly "behind" the images.
pdfsandwich is a command-line tool intended for OCRing scanned books or journals. It is able to recognize the page layout even for multi-column text.
Essentially, pdfsandwich is a wrapper script which calls the following binaries: unpaper (since version 0.0.9), convert, gs, hocr2pdf (for tesseract prior to version 3.03), and tesseract. It is known to run on Unix systems and has been tested on Linux and MacOS X. It supports parallel processing on multiprocessor systems.
While pdfsandwich works with any version of tesseract from version 3.0 on, tesseract 3.03 or later is recommended for best performance. By default, pdfsandwich runs unpaper to enhance the readability of scanned pages and to improve OCR. For instance, slightly rotated pages are automatically straightened and dark edges removed. For optimally scanned pdf files, this can be switched off by option -nopreproc to speed up processing.
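A minimal way to script it (pdfsandwich itself is a plain command-line program; the file name is a placeholder, and the _ocr output suffix is its documented default, but check your version's man page):

```python
# Run pdfsandwich on a scanned PDF from Python; a thin wrapper, nothing more.
# Assumes pdfsandwich and the relevant tesseract language pack are installed.
import subprocess

subprocess.run(
    ["pdfsandwich", "-lang", "eng", "scan.pdf"],  # add "-nopreproc" to skip unpaper
    check=True,
)
# Expected result: scan_ocr.pdf with an invisible text layer behind the images.
```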
My work-around is to save the PDF as a lossless or near-lossless image (such as TIFF), then create a new PDF from the image and run OCR. That way I lose no clarity or sharpness in the PDF image and get accurate OCR content that can be copied and pasted. And yes, lots of folks do something similar with screenshots from protected PDFs to grab all the text without retyping it. Simple non-expert scripts (such as Tornado's "Do It Again" freeware) and PDF-generating software make it easy to process hundreds of pages quickly and accurately – at least, as accurately as OCR from relatively high-resolution images can be, as opposed to screenshots captured at very low spatial resolution relative to the original document.
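Here is a rough sketch of that render-then-OCR idea in Python, using pdf2image and pytesseract (both are suggestions rather than tools named above; DPI, language and file names are assumptions):

```python
# Render each PDF page to a high-resolution image, then OCR the images.
# Requires: pip install pdf2image pytesseract (plus poppler and tesseract).
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("input.pdf", dpi=300)          # list of PIL images
text = "\n\f\n".join(pytesseract.image_to_string(p, lang="eng") for p in pages)

with open("extracted.txt", "w", encoding="utf-8") as fh:
    fh.write(text)
```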
You can use the pdffonts command line utility to get a quick-shot analysis of the fonts used by a PDF.
Here is an example output which demonstrates where a problem for text extraction is very likely to occur. It uses one of the hand-coded PDF files from a GitHub repository that was created to provide well-commented PDF sample files which can easily be opened in a text editor:
    $ pdffonts textextract-bad2.pdf
    name                   type      encoding  emb sub uni  object ID
    BAAAAA+Helvetica       TrueType  WinAnsi   yes yes yes      12  0
    CAAAAA+Helvetica-Bold  TrueType  WinAnsi   yes yes no       13  0
How to interpret this table?
The above PDF file uses two subsetted fonts (as indicated by the BAAAAA+ and CAAAAA+ prefixes to their names, as well as by the yes entries in the sub column): Helvetica and Helvetica-Bold. Both fonts are of type TrueType. Both fonts use a WinAnsi encoding (a font encoding maps char identifiers used in the PDF source code to the glyphs that should be drawn). However, only for font /Helvetica is there a /ToUnicode table available inside the PDF (for /Helvetica-Bold there is none), as indicated by the yes/no entries in the uni column.
The /ToUnicode table is required to provide a reverse mapping from character identifiers/codes to characters.
A missing /ToUnicode table for a specific font is almost always a sure indicator that text strings using this font cannot be extracted or copied'n'pasted from the PDF. (Even if a /ToUnicode table is there, text extraction may still pose a problem, because this table may be damaged, incorrect or incomplete -- as seen in many real-world PDF files, and as also demonstrated by a few companion files in the above linked GitHub repository.)
References: