`appjsonify` defines five classes (`Token`, `Line`, `Paragraph`, `Page`, and `Document`) to structure one or more documents.
The `Token` class holds a token (word) together with its position and surface information, such as its bounding box coordinates, font size, and font name.
- `token` (`str`): A token string.
- `pos` (`tuple[int]`): A tuple of bounding box information ($x_0$, $y_0$, $x_1$, $y_1$). The bounding box values range from 0 to 1000.
- `font_size` (`float`): Font size obtained using `pdfplumber`.
- `font_name` (`str`): Font name obtained using `pdfplumber`.
- `meta` (`dict`): A meta dictionary used to save supplementary information.
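The fields above can be mirrored in a small dataclass to show what a token record might look like. This is a minimal sketch, not the library's own class: `TokenSketch` and `normalize_bbox` are hypothetical names, and scaling page coordinates by the page width and height is only one assumed way of arriving at values in the 0 to 1000 range.

```python
from dataclasses import dataclass, field


@dataclass
class TokenSketch:
    """Illustrative stand-in mirroring the documented Token fields."""
    token: str
    pos: tuple[int, int, int, int]  # (x0, y0, x1, y1), each in 0..1000
    font_size: float
    font_name: str
    meta: dict = field(default_factory=dict)


def normalize_bbox(x0, y0, x1, y1, page_width, page_height, scale=1000):
    """Map absolute page coordinates onto a 0..scale integer grid.

    Assumed scheme: x is scaled by the page width, y by the page height.
    """
    return (
        round(scale * x0 / page_width),
        round(scale * y0 / page_height),
        round(scale * x1 / page_width),
        round(scale * y1 / page_height),
    )


# A word on a US Letter page (612 x 792 points).
tok = TokenSketch(
    token="Hello",
    pos=normalize_bbox(153, 198, 306, 216, page_width=612, page_height=792),
    font_size=10.0,
    font_name="Times-Roman",
)
print(tok.pos)
```

Normalizing to a fixed grid makes positions comparable across pages of different physical sizes, which is presumably why the documented range is independent of the page dimensions.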
The `Line` class holds a line element together with its position and surface information, such as its bounding box coordinates, font size, and font name, as well as a list of tokens.
A `Line` consists of a list of `Token` instances, i.e. a line instance must be formed from its corresponding `Token` instances.
- `line` (`str`): A line string.
- `pos` (`tuple[int]`): A tuple of bounding box information ($x_0$, $y_0$, $x_1$, $y_1$). The bounding box values range from 0 to 1000.
- `font_size` (`float`): Font size obtained using `pdfplumber`.
- `font_name` (`str`): Font name obtained using `pdfplumber`.
- `tokens` (`list[Token]`): A list of `Token` instances that form a line.
- `meta` (`dict`): A meta dictionary used to save supplementary information.
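Because a line must be formed from its tokens, the line-level fields can be derived from the token-level ones. The sketch below illustrates one way to do that. `TokenSketch`, `LineSketch`, and `line_from_tokens` are hypothetical names, and the aggregation rules (joining tokens with spaces, taking the union of the bounding boxes, picking the most frequent font) are assumptions rather than the library's documented behavior.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TokenSketch:
    token: str
    pos: tuple[int, int, int, int]
    font_size: float
    font_name: str
    meta: dict = field(default_factory=dict)


@dataclass
class LineSketch:
    line: str
    pos: tuple[int, int, int, int]
    font_size: float
    font_name: str
    tokens: list[TokenSketch]
    meta: dict = field(default_factory=dict)


def line_from_tokens(tokens: list[TokenSketch]) -> LineSketch:
    """Build a line from its tokens using the assumed aggregation rules."""
    if not tokens:
        raise ValueError("a line needs at least one token")
    # Union of the token boxes: smallest top-left, largest bottom-right.
    x0s, y0s, x1s, y1s = zip(*(t.pos for t in tokens))
    # Most frequent (font_size, font_name) pair among the tokens.
    (font_size, font_name), _ = Counter(
        (t.font_size, t.font_name) for t in tokens
    ).most_common(1)[0]
    return LineSketch(
        line=" ".join(t.token for t in tokens),
        pos=(min(x0s), min(y0s), max(x1s), max(y1s)),
        font_size=font_size,
        font_name=font_name,
        tokens=list(tokens),
    )


tokens = [
    TokenSketch("Hello", (100, 100, 180, 115), 10.0, "Times-Roman"),
    TokenSketch("world", (190, 101, 260, 116), 10.0, "Times-Roman"),
]
line = line_from_tokens(tokens)
print(line.line, line.pos)
```

The same pattern would apply one level up, with a paragraph aggregated from its lines.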
The `Paragraph` class holds a paragraph element together with its position and surface information, such as its bounding box coordinates, font size, and font name, as well as a list of lines.
A `Paragraph` consists of a list of `Line` instances, i.e. a paragraph instance must be formed from its corresponding `Line` instances.
- `paragraph` (`str`): A paragraph string.
- `pos` (`tuple[int]`): A tuple of bounding box information ($x_0$, $y_0$, $x_1$, $y_1$). The bounding box values range from 0 to 1000.
- `font_size` (`float`): Font size obtained using `pdfplumber`.
- `font_name` (`str`): Font name obtained using `pdfplumber`.
- `lines` (`list[Line]`): A list of `Line` instances that form a paragraph.
- `meta` (`dict`): A meta dictionary used to save supplementary information.
The `Page` class holds the lists of `Paragraph`, `Line`, and `Token` instances for a single PDF page, as well as the page's meta dictionary.
- `paragraphs` (`list[Paragraph]`): A list of `Paragraph` instances that form a `Page`.
- `lines` (`list[Line]`): A list of `Line` instances that form a `Page`.
- `tokens` (`list[Token]`): A list of `Token` instances that form a `Page`.
- `meta` (`dict`): A meta dictionary used to save supplementary information.
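A page therefore exposes the same content at three granularities. The sketch below shows how the three lists could relate to one another. The class names and `page_from_paragraphs` are hypothetical, the classes are trimmed to the fields needed here (plain strings stand in for token instances), and deriving the flat lists from the paragraphs is an assumption made for illustration, not a description of how the library populates a page.

```python
from dataclasses import dataclass, field


@dataclass
class LineSketch:
    line: str
    tokens: list[str]  # simplified: plain strings stand in for Token instances


@dataclass
class ParagraphSketch:
    paragraph: str
    lines: list[LineSketch]


@dataclass
class PageSketch:
    paragraphs: list[ParagraphSketch]
    lines: list[LineSketch]
    tokens: list[str]
    meta: dict = field(default_factory=dict)


def page_from_paragraphs(paragraphs: list[ParagraphSketch]) -> PageSketch:
    """Flatten paragraphs into the three parallel lists a page holds."""
    lines = [line for para in paragraphs for line in para.lines]
    tokens = [tok for line in lines for tok in line.tokens]
    return PageSketch(paragraphs=list(paragraphs), lines=lines, tokens=tokens)


first = LineSketch("Deep learning", ["Deep", "learning"])
second = LineSketch("is fun.", ["is", "fun."])
para = ParagraphSketch("Deep learning is fun.", [first, second])
page = page_from_paragraphs([para])
print(len(page.paragraphs), len(page.lines), len(page.tokens))
```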
The `Document` class holds the path to an original PDF file, its `Page` instances, its formatted `Paragraph` instances, and a meta dictionary.
This is the core class used in `BaseRunner`: every processing module must take a list of `Document` instances (`list[Document]`) as the first input parameter of its `execute` method and return the processed `list[Document]`.
- `input_path` (`Path`): A path to a PDF file.
- `pages` (`list[Page]`): A list of `Page` instances.
- `formatted_paragraphs` (`list[Paragraph]`): A list of `Paragraph` instances.
- `meta` (`dict`): A meta dictionary used to save supplementary information such as footers and captions.
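The `execute` contract described above (documents in, processed documents out) can be illustrated with a toy module. This is a sketch under stated assumptions: `DocumentSketch` and `PageCounter` are hypothetical names, the real module base class and any extra `execute` parameters are not shown here, and the `num_pages` meta key is invented for the example.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class DocumentSketch:
    """Illustrative stand-in mirroring the documented Document fields."""
    input_path: Path
    pages: list = field(default_factory=list)
    formatted_paragraphs: list = field(default_factory=list)
    meta: dict = field(default_factory=dict)


class PageCounter:
    """Hypothetical processing module: records each document's page count."""

    def execute(
        self, documents: list[DocumentSketch], **kwargs
    ) -> list[DocumentSketch]:
        # The document list is the first parameter and is also the return value.
        for doc in documents:
            doc.meta["num_pages"] = len(doc.pages)
        return documents


# Strings stand in for Page instances to keep the example short.
docs = [DocumentSketch(input_path=Path("paper.pdf"), pages=["page-1", "page-2"])]
docs = PageCounter().execute(docs)
print(docs[0].meta)
```

Keeping the same input and output type for every module is what would let a runner chain modules in sequence, each one receiving the list returned by the previous one.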