Releases: kermitt2/pdfalto

Version 0.4

11 Apr 16:06
2ef1c4a

New in version 0.4 (apart from various bug fixes):

  • support for the xpdf language support packages, which provide language-specific fonts for Arabic, Simplified Chinese, Japanese, etc.; the packages are pre-installed locally and portable

  • refined line number detection, and fixed a bug that could result in numbers randomly missing from the ALTO output

  • update to xpdf-4.03

  • fix a character spacing issue caused by an invalid rotation condition

  • update dependencies and the dependency install script

Version 0.3

22 Aug 21:56
3216284

New in version 0.3:

  • line number detection: line numbers (typically added to manuscripts/preprints for review) are now identified specifically and no longer mixed with the rest of the text content; they are grouped in a separate block or, optionally, omitted from the ALTO file (noLineNumbers option)

  • removal of the -blocks option: block information is now always returned to ensure ALTO validation (<TextBlock> element)

  • bug fixes for reading order

  • fix XMax and YMax values that could be incorrectly set to 0 in the coordinates of blocks containing only one line
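
Since block information is now always present, a consumer can rely on <TextBlock> elements when reading the output. The following is a minimal Python sketch of that idea, not part of pdfalto itself. The XML fragment is hand-written for illustration (it is not actual pdfalto output), the ALTO v3 namespace is an assumption, and the helper name text_blocks is hypothetical. It also shows a simple sanity check for zero-sized blocks, the kind of symptom the single-line coordinate fix addresses.

```python
import xml.etree.ElementTree as ET

# Assumption: the file uses the ALTO v3 namespace; adjust if yours differs.
ALTO_NS = "http://www.loc.gov/standards/alto/ns-v3#"

# Hand-written, hypothetical ALTO fragment for illustration only.
SAMPLE = f"""<alto xmlns="{ALTO_NS}">
  <Layout>
    <Page ID="Page1" PHYSICAL_IMG_NR="1" WIDTH="595" HEIGHT="842">
      <PrintSpace>
        <TextBlock ID="b1" HPOS="72" VPOS="90" WIDTH="60" HEIGHT="12">
          <TextLine HPOS="72" VPOS="90" WIDTH="60" HEIGHT="12">
            <String CONTENT="Single" HPOS="72" VPOS="90" WIDTH="36" HEIGHT="12"/>
            <String CONTENT="line" HPOS="112" VPOS="90" WIDTH="20" HEIGHT="12"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="b2" HPOS="72" VPOS="120" WIDTH="80" HEIGHT="26">
          <TextLine HPOS="72" VPOS="120" WIDTH="80" HEIGHT="12">
            <String CONTENT="Two" HPOS="72" VPOS="120" WIDTH="24" HEIGHT="12"/>
          </TextLine>
          <TextLine HPOS="72" VPOS="134" WIDTH="80" HEIGHT="12">
            <String CONTENT="lines" HPOS="72" VPOS="134" WIDTH="30" HEIGHT="12"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def text_blocks(alto_xml):
    """Return one dict per TextBlock with its ID, size, and joined words."""
    root = ET.fromstring(alto_xml)
    ns = {"a": ALTO_NS}
    blocks = []
    for block in root.iterfind(".//a:TextBlock", ns):
        words = [s.get("CONTENT", "") for s in block.iterfind(".//a:String", ns)]
        blocks.append({
            "id": block.get("ID"),
            "width": float(block.get("WIDTH", 0)),
            "height": float(block.get("HEIGHT", 0)),
            "text": " ".join(words),
        })
    return blocks


blocks = text_blocks(SAMPLE)
# Flag blocks whose extent collapsed to zero.
suspicious = [b["id"] for b in blocks if b["width"] == 0 or b["height"] == 0]
print(len(blocks), suspicious)
```

To run the same check on a real file, read it into a string and pass it to text_blocks in place of SAMPLE.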

Version 0.2

17 Oct 09:29
dea10fd

New in version 0.2:

  • support Unicode composition of characters
  • generalize reading order to all blocks (it was limited to the blocks of the first page)
  • use subscript/superscript text font style attribute
  • use SVG as a format for vectorial images
  • propagate unresolved character Unicode values (free Unicode range for embedded fonts) as encoded special characters in ALTO (the so-called "placeholder" approach)
  • generate metadata information in a separate XML file (as the ALTO schema does not support it)
  • use the latest version of xpdf, version 4.00
  • add cmake
  • ALTO output replaces the custom Xerox XML format

Note: this release was used for Grobid release 0.5.6
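
To illustrate two items from the 0.2 list, here is a small Python sketch using only the standard library. It shows the general notion of Unicode composition (a base letter plus a combining mark merged into one precomposed character) and a check for private-use code points. It does not reproduce pdfalto's internal logic; reading the "free Unicode range" as the Unicode private-use category is an assumption, and the function names are hypothetical.

```python
import unicodedata


def compose(text):
    """Merge base characters and combining marks using NFC normalization."""
    return unicodedata.normalize("NFC", text)


def is_private_use(ch):
    """True if the character is in a Unicode private-use range (category Co).

    Assumption: this is what the notes call the "free Unicode range".
    """
    return unicodedata.category(ch) == "Co"


decomposed = "e\u0301"          # 'e' followed by COMBINING ACUTE ACCENT
composed = compose(decomposed)  # single code point U+00E9
print(len(decomposed), len(composed))  # prints "2 1"

# A consumer of the output could flag placeholder characters like this.
sample = "A\ue000B"
placeholders = [hex(ord(c)) for c in sample if is_private_use(c)]
print(placeholders)  # prints "['0xe000']"
```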