Extracting text from a document

The GetStructuredTextPage method of the MuPDFDocument class makes it possible to obtain a "structured text" representation of each page of the document. This consists of a MuPDFStructuredTextPage object, which is a collection of 0 or more MuPDFStructuredTextBlocks. These can be used to extract text from a document.

Each MuPDFStructuredTextBlock represents either an image or a block of text, typically a paragraph (though there is no guarantee that this is the case). MuPDFStructuredTextBlocks are themselves collections of MuPDFStructuredTextLines, and each line is a collection of MuPDFStructuredTextCharacters (a block representing an image contains a single line with a single character).

MuPDFStructuredTextBlocks and MuPDFStructuredTextLines have a BoundingBox property that defines a rectangle (in page units) that bounds the contents of the block/line in the page. Similarly, MuPDFStructuredTextCharacters have a BoundingQuad (rather than a Rectangle, this is a Quad, i.e. a quadrilateral defined by its four vertices, which may or may not be a rectangle). These can be used e.g. to highlight regions of text in the page.
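
The block/line hierarchy described above can be walked with nested loops. The following is a minimal sketch, not a definitive implementation: the file name "document.pdf" is a placeholder, and the Text property of the line and the X0/Y0/X1/Y1 properties of the bounding rectangle are assumptions that should be checked against the version of MuPDFCore you are using.

```csharp
using System;
using MuPDFCore;

// "document.pdf" is a placeholder path.
using MuPDFContext context = new MuPDFContext();
using MuPDFDocument document = new MuPDFDocument(context, "document.pdf");

// Structured text representation of the first page (page indices start at 0).
MuPDFStructuredTextPage page = document.GetStructuredTextPage(0);

foreach (MuPDFStructuredTextBlock block in page)
{
    // BoundingBox is expressed in page units.
    // Assumption: the rectangle exposes X0, Y0, X1 and Y1.
    Rectangle box = block.BoundingBox;
    Console.WriteLine($"Block at ({box.X0}, {box.Y0}) - ({box.X1}, {box.Y1})");

    foreach (MuPDFStructuredTextLine line in block)
    {
        // Assumption: the line exposes its contents through a Text property.
        // A block representing an image contains a single line with a single character.
        Console.WriteLine($"  {line.Text}");
    }
}
```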

The MuPDFStructuredTextPage also has methods to determine which character contains or is closest to a specified point (useful, for example, to determine on which character the user clicked), to obtain a list of shapes that encompass a specified range of text, and to perform text searches using regular expressions.
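
As a rough illustration of these methods, the sketch below runs a regular expression search, collects the shapes covering each match, and looks up the character at (or closest to) a point. The method names Search, GetText, GetHighlightQuads, GetHitAddress and GetClosestHitAddress, their parameters, and the PointF constructor are assumptions based on the description above; verify them against the API of your MuPDFCore version. The file name, the search pattern and the coordinates are placeholders.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using MuPDFCore;

// "document.pdf" is a placeholder path.
using MuPDFContext context = new MuPDFContext();
using MuPDFDocument document = new MuPDFDocument(context, "document.pdf");
MuPDFStructuredTextPage page = document.GetStructuredTextPage(0);

// Search the page with a regular expression (the pattern is just an example).
Regex pattern = new Regex("invoice", RegexOptions.IgnoreCase);

foreach (var span in page.Search(pattern))
{
    // Assumption: GetText returns the text covered by the span, and
    // GetHighlightQuads returns the quads that encompass it
    // (the second argument controls whether image blocks are included).
    string text = page.GetText(span);
    int quadCount = page.GetHighlightQuads(span, false).Count();
    Console.WriteLine($"Found \"{text}\", covered by {quadCount} quad(s)");
}

// Hit testing, e.g. to find the character on which the user clicked.
// The point is in page units; (100, 100) is an arbitrary example.
PointF clickPoint = new PointF(100, 100);

// Assumption: returns null when no character contains the point.
var hit = page.GetHitAddress(clickPoint, false);

// Assumption: returns the address of the character closest to the point.
var closest = page.GetClosestHitAddress(clickPoint, false);

Console.WriteLine(hit != null ? "The point is on a character." : "The point is not on any character.");
```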

The order of the blocks in the page (which affects the definition of a "range" of text and search operations) is the same as returned by the underlying MuPDF library, which takes it from the order in which the text is drawn in the source file; it may therefore differ from the reading order. The blocks can be reordered using the Array.Sort method on the StructuredTextBlocks array contained in the MuPDFStructuredTextPage (lines within blocks and characters within lines can be reordered likewise).
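
For example, blocks could be sorted top to bottom and then left to right using their bounding boxes. This is only a sketch of one possible ordering, assuming that the bounding rectangle exposes X0 and Y0 and that Y grows downwards in page units; multi-column layouts would need a more careful comparison. The file name is a placeholder.

```csharp
using System;
using MuPDFCore;

// "document.pdf" is a placeholder path.
using MuPDFContext context = new MuPDFContext();
using MuPDFDocument document = new MuPDFDocument(context, "document.pdf");
MuPDFStructuredTextPage page = document.GetStructuredTextPage(0);

// Sort the blocks in place: first by the top edge of the bounding box,
// then by the left edge for blocks that start at the same height.
// Assumption: Rectangle exposes X0 (left) and Y0 (top).
Array.Sort(page.StructuredTextBlocks, (a, b) =>
{
    int byTop = a.BoundingBox.Y0.CompareTo(b.BoundingBox.Y0);
    return byTop != 0 ? byTop : a.BoundingBox.X0.CompareTo(b.BoundingBox.X0);
});
```

Text ranges and searches performed on the page after sorting will follow the new order.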

Optical Character Recognition (OCR) using Tesseract

MuPDF 1.18+ (embedded in MuPDFCore 1.3.0+) adds support for OCR using the Tesseract library. To access this feature in MuPDFCore, you can use one of the overloads of GetStructuredTextPage that takes a TesseractLanguage argument specifying the language to use for the OCR. This will run the OCR and return a MuPDFStructuredTextPage containing the character information obtained by Tesseract, which can be used like any other structured text page. Depending on the model being used, the OCR step can take a relatively long time; therefore, the MuPDFDocument class also implements a GetStructuredTextPageAsync method, which does the same thing asynchronously. The GetStructuredTextPageAsync method also has optional parameters to report the OCR progress and to cancel its execution.
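
A possible way to call the asynchronous method is sketched below. The names of the optional parameters (cancellationToken and progress), the OCRProgressInfo type and its Progress property are assumptions that should be checked against the signature in your MuPDFCore version; the file name and the two-minute timeout are placeholders.

```csharp
using System;
using System.Threading;
using MuPDFCore;

// "scan.pdf" is a placeholder path to a document containing scanned pages.
using MuPDFContext context = new MuPDFContext();
using MuPDFDocument document = new MuPDFDocument(context, "scan.pdf");

// Cancel the OCR if it takes longer than two minutes (arbitrary timeout).
using CancellationTokenSource cts = new CancellationTokenSource(TimeSpan.FromMinutes(2));

// Assumption: progress is reported as OCRProgressInfo, with a Progress value between 0 and 1.
Progress<OCRProgressInfo> progress = new Progress<OCRProgressInfo>(
    info => Console.WriteLine($"OCR progress: {info.Progress:P0}"));

// Assumption: the optional parameters are named cancellationToken and progress.
MuPDFStructuredTextPage page = await document.GetStructuredTextPageAsync(
    0,
    new TesseractLanguage(TesseractLanguage.Fast.Eng),
    cancellationToken: cts.Token,
    progress: progress);

Console.WriteLine($"The OCR produced {page.Count} block(s).");
```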

Objects of the TesseractLanguage class contain information used to locate the trained language model file that is used by Tesseract. Normally, when using Tesseract, you would have to ensure that the trained language model files are available on the user's computer; however, this class implements some "clever" logic to download the necessary files on demand.

In general, MuPDF provides Tesseract with a "language name" (e.g. "eng"). Tesseract then looks for a file called eng.traineddata either in the folder specified by the TESSDATA_PREFIX environment variable, or, if the variable is not defined, in a subfolder of the current working directory called tessdata. MuPDFCore manipulates the value of TESSDATA_PREFIX (at the process level) and the language name in order to specify the language file.

The TesseractLanguage class has multiple constructors:

TesseractLanguage(string prefix, string language): this constructor is used to directly specify the value of TESSDATA_PREFIX and the language name. The library does not process these in any way. If prefix is null, the value of TESSDATA_PREFIX is not changed, and Tesseract uses the system value.
TesseractLanguage(string fileName): with this constructor, you can directly specify the path to a trained language model file. You can obtain such a file from the tessdata_fast repository or from the tessdata_best repository. If the file does not have a .traineddata extension, it will be copied to a temporary location.
TesseractLanguage(Fast language, bool useAnyCached = false)
TesseractLanguage(FastScript language, bool useAnyCached = false)
TesseractLanguage(Best language, bool useAnyCached = false)
TesseractLanguage(BestScript language, bool useAnyCached = false)

With these constructors, you can specify a language from the list of available languages defined in the TesseractLanguage.Fast, TesseractLanguage.FastScript, TesseractLanguage.Best, and TesseractLanguage.BestScript enums.
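
The constructors listed above could be used as follows. This is an illustrative sketch: the folder and file paths are placeholders, and the enum-based constructors may need an Internet connection if the language file is not already available.

```csharp
using MuPDFCore;

// Explicit TESSDATA_PREFIX and language name (the folder is a placeholder).
TesseractLanguage explicitPrefix = new TesseractLanguage("/usr/share/tessdata", "eng");

// A null prefix leaves TESSDATA_PREFIX unchanged, so Tesseract uses the system value.
TesseractLanguage systemPrefix = new TesseractLanguage(null, "eng");

// Path to a trained language model file that you provide (placeholder path).
TesseractLanguage fromFile = new TesseractLanguage("models/eng.traineddata");

// Language chosen from one of the enums.
TesseractLanguage fastEnglish = new TesseractLanguage(TesseractLanguage.Fast.Eng);

// With useAnyCached set to true, an already available "fast" model can be used
// in place of the "best" one (and vice versa).
TesseractLanguage bestOrCached = new TesseractLanguage(TesseractLanguage.Best.Eng, useAnyCached: true);
```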

MuPDFCore will then look for the trained model file corresponding to the selected language, relative to the path of the executable, in a folder called tessdata/fast and then in a folder called fast (or best, depending on the overload; for the overloads taking a script name, it looks in tessdata/fast/script or fast/script instead).

If the language file is not found in either of these folders, it then looks for it in a subfolder called tessdata/fast in Environment.SpecialFolder.LocalApplicationData. If the optional argument useAnyCached is true, it also looks for the language file in the same folder as the executable, and then in the best (or fast) subfolders. In this case, for example, if the language file for TesseractLanguage.Fast.Eng is not available, but the file for TesseractLanguage.Best.Eng is available, the latter will be used.

Finally, if the language file could not be found in any of the possible paths, MuPDFCore will download it from the appropriate repository and place it in the appropriate subfolder of the tessdata folder in Environment.SpecialFolder.LocalApplicationData. The file will then be reused as necessary.

The TESSDATA_PREFIX and the language name will then be set according to where the file was found.

This means that if you use one of these constructors you do not have to worry about the language files being installed in the right place; as long as the user has an Internet connection, the library will download the language files as necessary.
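
To check where a language file would be picked up from, the default lookup order described above (without useAnyCached) can be reproduced with a small helper. This is not the library's code, only a sketch under stated assumptions: CandidateModelPaths is a hypothetical helper, AppContext.BaseDirectory is assumed to correspond to "the path of the executable", and the file name is assumed to be the language name followed by .traineddata.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not part of MuPDFCore): lists, in order, the locations
// described above for a language model, e.g. quality "fast" and language "eng".
static List<string> CandidateModelPaths(string quality, string languageName)
{
    string fileName = languageName + ".traineddata";

    // Assumption: the folder containing the executable.
    string exeDir = AppContext.BaseDirectory;
    string localAppData = Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData);

    return new List<string>
    {
        // 1. tessdata/fast (or tessdata/best) next to the executable.
        Path.Combine(exeDir, "tessdata", quality, fileName),
        // 2. fast (or best) next to the executable.
        Path.Combine(exeDir, quality, fileName),
        // 3. tessdata/fast (or tessdata/best) in LocalApplicationData,
        //    which is also where a downloaded file would be placed.
        Path.Combine(localAppData, "tessdata", quality, fileName),
    };
}

// Report which of the candidate files currently exist.
foreach (string candidate in CandidateModelPaths("fast", "eng"))
{
    Console.WriteLine($"{(File.Exists(candidate) ? "found  " : "missing")} {candidate}");
}
```

Placing the language file in one of the first two locations when deploying the application avoids the download on first use.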
