Extracting text from a document
The `GetStructuredTextPage` method of the `MuPDFDocument` class makes it possible to obtain a "structured text" representation of each page of the document. This consists of a `MuPDFStructuredTextPage` object, which is a collection of zero or more `MuPDFStructuredTextBlock`s. These can be used to extract text from a document.
Each `MuPDFStructuredTextBlock` represents either an image or a block of text, typically a paragraph (though there is no guarantee that this is the case). `MuPDFStructuredTextBlock`s are themselves collections of `MuPDFStructuredTextLine`s, and each line is a collection of `MuPDFStructuredTextCharacter`s (a block representing an image contains a single line with a single character).
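The block/line/character hierarchy described above can be walked with nested loops. The following is a minimal sketch rather than a definitive recipe: `document.pdf` is a placeholder path, and the `Type` and `Character` members are written as I recall them from the MuPDFCore API, so check them against the version you are using.

```csharp
using System;
using System.Text;
using MuPDFCore;

// "document.pdf" is a placeholder; replace it with a real file.
using MuPDFContext context = new MuPDFContext();
using MuPDFDocument document = new MuPDFDocument(context, "document.pdf");

// Structured text representation of the first page (page indices start at 0).
MuPDFStructuredTextPage page = document.GetStructuredTextPage(0);

StringBuilder text = new StringBuilder();

foreach (MuPDFStructuredTextBlock block in page.StructuredTextBlocks)
{
    // Assumption: image blocks are identified by the Type property.
    if (block.Type == MuPDFStructuredTextBlock.Types.Image)
    {
        continue;
    }

    foreach (MuPDFStructuredTextLine line in block)
    {
        foreach (MuPDFStructuredTextCharacter character in line)
        {
            // Assumption: the Character property holds the text of the character.
            text.Append(character.Character);
        }

        text.AppendLine();
    }

    // Blank line between blocks, which are typically paragraphs.
    text.AppendLine();
}

Console.WriteLine(text.ToString());
```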
`MuPDFStructuredTextBlock`s and `MuPDFStructuredTextLine`s have a `BoundingBox` property that defines a rectangle (in page units) bounding the contents of the block or line in the page. Similarly, `MuPDFStructuredTextCharacter`s have a `BoundingQuad` (rather than a `Rectangle`, this is a `Quad`, i.e. a quadrilateral defined by its four vertices, which may or may not be a rectangle). These can be used, for example, to highlight regions of text in the page.
The `MuPDFStructuredTextPage` also has methods to determine which character contains or is closest to a specified point (useful, for example, to determine which character the user clicked), to obtain a list of shapes that encompass a specified range of text, and to perform text searches using regular expressions.
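As a rough illustration of these three capabilities, the sketch below searches a page with a regular expression, collects the shapes covering each match, and performs a hit test. The method names (`Search`, `GetText`, `GetHighlightQuads`, `GetHitAddress`, `GetClosestHitAddress`) and their parameters do not appear in the text above; they are written from memory of the MuPDFCore API and should be treated as assumptions to verify. The file path, the search pattern and the point coordinates are placeholders.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using MuPDFCore;

using MuPDFContext context = new MuPDFContext();
using MuPDFDocument document = new MuPDFDocument(context, "document.pdf");
MuPDFStructuredTextPage page = document.GetStructuredTextPage(0);

// Regular expression search (assumed method: Search).
Regex needle = new Regex("invoice", RegexOptions.IgnoreCase);

foreach (MuPDFStructuredTextAddressSpan match in page.Search(needle))
{
    // Assumed method: GetText returns the text covered by the span.
    Console.WriteLine("Match: " + page.GetText(match));

    // Assumed method: GetHighlightQuads returns the shapes covering the span;
    // the second argument states whether image blocks are included.
    int quadCount = page.GetHighlightQuads(match, false).Count();
    Console.WriteLine("  covered by " + quadCount + " quad(s)");
}

// Hit testing with a point in page units (placeholder coordinates).
PointF point = new PointF(100, 200);

// Assumed methods: GetHitAddress returns null when no character contains the
// point, while GetClosestHitAddress falls back to the nearest character.
var hit = page.GetHitAddress(point, false);
var closest = page.GetClosestHitAddress(point, false);

Console.WriteLine(hit != null ? "The point is inside a character." : "No character contains the point.");
Console.WriteLine(closest != null ? "A closest character was found." : "The page contains no characters.");
```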
The order of the blocks in the page (which affects the definition of a "range" of text and search operations) is the one returned by the underlying MuPDF library, which follows the order in which the text is drawn in the source file; it may therefore differ from the reading order. The blocks can be reordered using the `Array.Sort` method on the `StructuredTextBlocks` array contained in the `MuPDFStructuredTextPage` (lines within blocks and characters within lines can be reordered likewise).
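One possible reordering is top to bottom, then left to right, based on each block's `BoundingBox`. This is only a sketch: it assumes that the rectangle exposes `X0` and `Y0` members for its top-left corner and that the vertical coordinate grows downwards; multi-column layouts would need a smarter comparison.

```csharp
using System;
using MuPDFCore;

using MuPDFContext context = new MuPDFContext();
using MuPDFDocument document = new MuPDFDocument(context, "document.pdf");
MuPDFStructuredTextPage page = document.GetStructuredTextPage(0);

// Sort the blocks in place: first by vertical position, then by horizontal position.
Array.Sort(page.StructuredTextBlocks, (a, b) =>
{
    int byRow = a.BoundingBox.Y0.CompareTo(b.BoundingBox.Y0);
    return byRow != 0 ? byRow : a.BoundingBox.X0.CompareTo(b.BoundingBox.X0);
});
```

Because ranges and searches depend on the block order, any reordering should happen before calling those methods.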
MuPDF 1.18+ (embedded in MuPDFCore 1.3.0+) adds support for OCR using the Tesseract library. To access this feature in MuPDFCore, use one of the overloads of `GetStructuredTextPage` that take a `TesseractLanguage` argument specifying the language to use for the OCR. This runs the OCR and returns a `MuPDFStructuredTextPage` containing the character information obtained by Tesseract, which can be used like any other structured text page. Depending on the model being used, the OCR step can take a relatively long time; therefore, the `MuPDFDocument` class also implements a `GetStructuredTextPageAsync` method, which does the same thing asynchronously. The `GetStructuredTextPageAsync` method also has optional parameters to report the OCR progress and to cancel its execution.
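The sketch below shows both the synchronous and the asynchronous OCR calls. The names of the optional parameters (`cancellationToken`, `progress`), the `OCRProgressInfo` type and its `Progress` property (taken here to be a fraction between 0 and 1) are recalled from the MuPDFCore API rather than stated above, so treat them as assumptions; `scan.pdf` and the two-minute timeout are placeholders.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using MuPDFCore;

using MuPDFContext context = new MuPDFContext();
using MuPDFDocument document = new MuPDFDocument(context, "scan.pdf");

TesseractLanguage english = new TesseractLanguage(TesseractLanguage.Fast.Eng);

// Synchronous OCR: blocks the calling thread until Tesseract has finished.
MuPDFStructuredTextPage ocrPage = document.GetStructuredTextPage(0, english);

// Asynchronous OCR with progress reporting and cancellation.
using CancellationTokenSource cancellation = new CancellationTokenSource(TimeSpan.FromMinutes(2));

// Assumption: OCRProgressInfo.Progress is a fraction between 0 and 1.
Progress<OCRProgressInfo> progress = new Progress<OCRProgressInfo>(
    info => Console.WriteLine($"OCR progress: {info.Progress:P0}"));

try
{
    // Assumption: the optional parameters are named cancellationToken and progress.
    MuPDFStructuredTextPage ocrPageAsync = await document.GetStructuredTextPageAsync(
        0, english, cancellationToken: cancellation.Token, progress: progress);
}
catch (OperationCanceledException)
{
    Console.WriteLine("The OCR was cancelled before it completed.");
}
```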
Objects of the `TesseractLanguage` class contain the information used to locate the trained language model file that Tesseract needs. Normally, when using Tesseract, you would have to ensure that the trained language model files are available on the user's computer; however, this class implements some "clever" logic to download the necessary files on demand.
In general, MuPDF provides Tesseract with a "language name" (e.g. `"eng"`). Tesseract then looks for a file called `eng.traineddata` either in the folder specified by the `TESSDATA_PREFIX` environment variable or, if the variable is not defined, in a subfolder of the current working directory called `tessdata`. MuPDFCore manipulates the value of `TESSDATA_PREFIX` (at the process level) and the language name in order to specify the language file.
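The lookup rule described above can be mirrored in a few lines. This is an illustration of the rule only, not Tesseract's or MuPDFCore's actual code; `TessdataLookup` and `ExpectedModelPath` are hypothetical names introduced for the example.

```csharp
using System;
using System.IO;

public static class TessdataLookup
{
    // Returns the path where the model file for the given language name is expected.
    public static string ExpectedModelPath(string languageName, string tessdataPrefix, string workingDirectory)
    {
        // Use TESSDATA_PREFIX when it is defined; otherwise fall back to the
        // "tessdata" subfolder of the working directory.
        string folder = string.IsNullOrEmpty(tessdataPrefix)
            ? Path.Combine(workingDirectory, "tessdata")
            : tessdataPrefix;

        return Path.Combine(folder, languageName + ".traineddata");
    }
}
```

For example, `TessdataLookup.ExpectedModelPath("eng", Environment.GetEnvironmentVariable("TESSDATA_PREFIX"), Environment.CurrentDirectory)` gives the location that would be checked for the current process.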
The `TesseractLanguage` class has multiple constructors:
- `TesseractLanguage(string prefix, string language)`: this constructor is used to directly specify the value of `TESSDATA_PREFIX` and the language name. The library does not process these in any way. If `prefix` is `null`, the value of `TESSDATA_PREFIX` is not changed, and Tesseract uses the system value.
- `TesseractLanguage(string fileName)`: with this constructor, you can directly specify the path to a trained language model file. You can obtain such a file from the tessdata_fast repository or from the tessdata_best repository. If the file does not have a `.traineddata` extension, it will be copied to a temporary location.
- `TesseractLanguage(Fast language, bool useAnyCached = false)`
  `TesseractLanguage(FastScript language, bool useAnyCached = false)`
  `TesseractLanguage(Best language, bool useAnyCached = false)`
  `TesseractLanguage(BestScript language, bool useAnyCached = false)`

  With these constructors, you can specify a language from the list of available languages defined in the `TesseractLanguage.Fast`, `TesseractLanguage.FastScript`, `TesseractLanguage.Best`, and `TesseractLanguage.BestScript` enums.

  MuPDFCore will then look for the trained model file corresponding to the selected language, relative to the path of the executable, in a folder called `tessdata/fast` and then in a folder called `fast` (or `best`, depending on the overload; for the overloads taking a script name, it looks in `tessdata/fast/script` or `fast/script` instead).

  If the language file is not found in either of these folders, it then looks for it in a subfolder called `tessdata/fast` in `Environment.SpecialFolder.LocalApplicationData`. If the optional argument `useAnyCached` is `true`, it also looks for the language file in the same folder as the executable, and then in the `best` (or `fast`) subfolders. In this case, for example, if the language file for `TesseractLanguage.Fast.Eng` is not available, but the file for `TesseractLanguage.Best.Eng` is available, the latter will be used.

  Finally, if the language file could not be found in any of the possible paths, MuPDFCore will download it from the appropriate repository and place it in the appropriate subfolder of the `tessdata` folder in `Environment.SpecialFolder.LocalApplicationData`. The file will then be reused as necessary.

  The `TESSDATA_PREFIX` and language name will then be set according to where the file was located.

  This means that if you use one of these constructors, you do not have to worry about the language files being installed in the right place; as long as the user has an Internet connection, the library will download the language files as necessary.
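The three families of constructors listed above could be used as follows. This is a sketch with placeholder paths (`/path/to/tessdata`, `/path/to/eng.traineddata`); note that, given the description above, the enum-based constructors may search the local folders and download the model file when they are invoked.

```csharp
using MuPDFCore;

// 1. Explicit TESSDATA_PREFIX value and language name, passed through unchanged.
TesseractLanguage fromPrefix = new TesseractLanguage("/path/to/tessdata", "eng");

// 2. Direct path to a trained language model file.
TesseractLanguage fromFile = new TesseractLanguage("/path/to/eng.traineddata");

// 3. Language chosen from the enums; the file is located or downloaded as needed.
TesseractLanguage fastEnglish = new TesseractLanguage(TesseractLanguage.Fast.Eng);

// With useAnyCached set to true, an already available model of the other kind
// (here, a "fast" model in place of the "best" one) is accepted as well.
TesseractLanguage anyEnglish = new TesseractLanguage(TesseractLanguage.Best.Eng, useAnyCached: true);
```

Any of these objects can then be passed to the `GetStructuredTextPage` overloads that take a `TesseractLanguage` argument.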