SILVERCODERS has worked hard to make a wonderful utility. Unfortunately, it was made quite a long time ago and needs some modern love.
As of 07/14/2020 there were a number of issues with the DocToText distributed on the SILVERCODERS website and SourceForge:
- There is no version control available
- The last release was 6 years ago
- The source doesn't compile out of the box. Why?
  - The 3rd party Makefile does not download shasums
  - There are pointer-to-non-pointer comparisons
  - cxx11 ABI compatibility issues
  - Duplicate exception symbols
  - A missing return statement causes a segfault
- The binaries are non-relocatable.
  - This is fine when using the distributed doctotext executable, where we can easily set DYLD_LIBRARY_PATH, LD_LIBRARY_PATH, etc. Creating another program that links against doctotext is a different story; see the next point.
  - On Linux and OSX the distributed shared libraries will not be loaded properly unless they are placed in system locations. This prevents anyone who is creating a library that links the doctotext shared lib from distributing it as a standalone package, because the shared libraries that doctotext.{dll,dylib,so} loads will also have to be placed in system locations. When building, for example, a Python extension, it is already a challenge to link against and redistribute a chain of shared libs. Having to fix the rpath entries on OSX and Linux first is much more work and not a maintainable solution for future releases.
  - There is some manual usage of install_name_tool on Mac to make the dylibs redistributable (at least with respect to the executable path); however, the distributed binaries do not have the @executable_path/ rpath embedded, as would be expected if these scripts had been run.
- There are memory leaks in the distributed OLE reader
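Two of the compile problems listed above, the pointer-to-non-pointer comparisons and the missing return, tend to follow a recognizable pattern. The sketch below shows what such code typically looks like and one way to fix it. The function names (`is_empty`, `count_spaces`) are hypothetical and are not taken from the DocToText sources; the commented-out lines only illustrate the old pattern.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not from the DocToText sources.
// Older code sometimes compared a pointer against a character literal:
//     if (text == '\0') ...
// Recent compilers typically reject this, because under the current C++
// rules '\0' is not a null pointer constant.
// The fix is to spell out what was meant: test the pointer itself,
// then the first character it points to.
bool is_empty(const char* text)
{
    return text == nullptr || *text == '\0';
}

// Hypothetical helper, not from the DocToText sources.
// A non-void function that reaches its closing brace without a return
// statement has undefined behavior, and optimized builds may crash.
// Returning a value on every path fixes it.
std::size_t count_spaces(const std::string& text)
{
    std::size_t count = 0;
    for (char c : text)
    {
        if (c == ' ')
            ++count;
    }
    return count; // the old pattern omitted this line
}
```

Compiling with warnings enabled (for example `-Wall` on GCC, which includes `-Wreturn-type`) is one way to find the remaining instances of the second pattern.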
Ensure the following are installed and on the PATH:
- doxygen
- mingw-w64 with SJLJ exception handling
  - this must come on the PATH before any other mingw installations, or things can break
- GNU make (the one from any mingw installation will do)
The most recent compilation was done in a git-bash shell using mingw64.
Building on Mac will take some extra work. My initial attempt was not a "just compile it and ship it over morning coffee" task. It appears that the 3rdparty libraries like wv2 are compiled against libstdc++, whereas the standard on OSX is now libc++. I no longer have the requisite headers available on my machine to compile against libstdc++, since support for it was dropped as of Xcode 10. This leads to:
- Updating the 3rdparty libraries. Most of them should be straightforward, but wv2 will be a bit more of a lift. Normally wv2 requires gsf, which in turn requires glib; however, SILVERCODERS has patched their ancient version (0.2.3) to no longer rely on gsf+glib. Additionally, the patched version takes a different interface (the custom ThreadSafeOLEStorage class as opposed to buffers and strings). In order to preserve the distributable nature of the package, gsf and glib will need to be repackaged along with a modern version of wv2. The new classes of wv2 (OLEStorage and OLEStreamReader, which replace AbstractOLEStorage and AbstractOLEStreamReader) will need to be used. Finally, the new classes cause other glib/gsf-dependent changes, such as stream changes for compatibility with GsfInput. I hope I've missed a simpler alternative.
- For now I'll probably just make a custom installation of GNU gcc on my Mac.
TODO
From the DocToText website http://silvercoders.com/en/products/doctotext/:
SILVERCODERS DocToText is a powerful utility that can convert documents in many formats to plain text. The package, available to users for free on open source GPL license, includes console application and C/C++ library, that allows embedding text extraction mechanism into other application.
The utility supports MS Office binary formats: MS Word (DOC), MS Excel (XLS, XLSB), MS PowerPoint (PPT), Rich Text Format (RTF), OpenDocument (also known as ODF and ISO/IEC 26300, full name: OASIS Open Document Format for Office Applications): text documents (ODT), spreadsheets (ODS), presentations (ODP), graphics (ODG), Office Open XML (ISO/IEC 29500, also called OOXML, OpenXML or MSOOXML) documents: MS Word (DOCX), MS Excel (XLSX), MS PowerPoint (PPTX), iWork formats (PAGES, NUMBERS, KEYNOTE), OpenDocument Flat XML formats (FODP, FODS, FODT), Portable Document Format (PDF), Email files (EML) and HyperText Markup Language (HTML).
Extracting plain text from doc, xls, ppt, rtf, odt, ods, odp, odg, docx, xlsx, pptx, pages, numbers, keynote, fodp, fods, fodt, pdf, eml and html files can be used for a lot of things like searching, indexing or archiving. DocToText can be also used as a fast console viewer.
DocToText can extract text not only from document body but also from annotations (comments) embedded in odt, doc, docx or rtf files and read metadata like author, last modification date or number of pages.
Complex documents? Other utilities gave up? MS Excel spreadsheet embedded in MS Word document? Charset detection required? OpenDocument formats OLE? No problem.
DocToText is able to convert corrupted OpenDocument and Office Open XML documents. It can be used to recover text even if other recovery methods failed. If you need help with this kind of issues see our document recovery services.
We also offer the possibility to use the library in commercial applications, with full technical support. The utility is constantly used and tested on thousands of documents by customers all around the world. If interested, please contact us for details.
/****************************************************************************
**
** DocToText - Converts DOC, XLS, XLSB, PPT, RTF, ODF (ODT, ODS, ODP),
** OOXML (DOCX, XLSX, PPTX), iWork (PAGES, NUMBERS, KEYNOTE),
** ODFXML (FODP, FODS, FODT), PDF, EML and HTML documents to plain text.
** Extracts metadata and annotations.
**
** Copyright (c) 2006-2013, SILVERCODERS(R)
** http://silvercoders.com
**
** Project homepage: http://silvercoders.com/en/products/doctotext
**
** This program may be distributed and/or modified under the terms of the
** GNU General Public License version 2 as published by the Free Software
** Foundation and appearing in the file COPYING.GPL included in the
** packaging of this file.
**
** Please remember that any attempt to workaround the GNU General Public
** License using wrappers, pipes, client/server protocols, and so on
** is considered as license violation. If your program, published on license
** other than GNU General Public License version 2, calls some part of this
** code directly or indirectly, you have to buy commercial license.
** If you do not like our point of view, simply do not use the product.
**
** Licensees holding valid commercial license for this product
** may use this file in accordance with the license published by
** SILVERCODERS and appearing in the file COPYING.COM
**
** This program is distributed in the hope that it will be useful,
** but WITHOUT ANY WARRANTY; without even the implied warranty of
** MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
**
*****************************************************************************/