Document Handling in `appjsonify`

appjsonify defines five classes (Token, Line, Paragraph, Page, and Document) to structure the contents of one or more documents.

`Token`

This class holds a token (word) together with its position and surface information: its bounding box coordinates, font size, and font name.

`init`

Inputs

token (str): A token string.
pos (tuple[int]): A tuple of bounding box information ($x_0$, $y_0$, $x_1$, $y_1$). The bounding box values range from 0 to 1000.
font_size (float): Font size obtained using pdfplumber.
font_name (str): Font name obtained using pdfplumber.
meta (dict): A meta dictionary used to save supplementary information.
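The inputs above can be illustrated with a minimal sketch. The `Token` dataclass below is a hypothetical stand-in that only mirrors the documented constructor arguments; it is not appjsonify's own class. The `normalize_bbox` helper is likewise an assumption: it shows one plausible way to map absolute page coordinates (such as those reported by pdfplumber) into the 0 to 1000 range, and the library may compute `pos` differently.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """Hypothetical stand-in mirroring the documented inputs of Token."""
    token: str
    pos: tuple[int, int, int, int]
    font_size: float
    font_name: str
    meta: dict = field(default_factory=dict)


def normalize_bbox(x0, y0, x1, y1, page_width, page_height):
    """Scale absolute coordinates into the 0-1000 range (assumed scheme)."""
    return (
        round(1000 * x0 / page_width),
        round(1000 * y0 / page_height),
        round(1000 * x1 / page_width),
        round(1000 * y1 / page_height),
    )


# A word on a 612 x 792 pt page; the coordinates and font are made up.
pos = normalize_bbox(153, 198, 306, 237.6, page_width=612, page_height=792)
tok = Token(token="Hello", pos=pos, font_size=10.0, font_name="Helvetica")
print(tok.pos)  # (250, 250, 500, 300)
```

Keeping `meta` as a separate dictionary with an empty default means supplementary information can be attached later without changing the constructor signature.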

`Line`

This class holds a line element, its position and surface information (bounding box coordinates, font size, and font name), and a list of tokens. A Line consists of a list of Token instances, i.e. a Line instance must be built from its corresponding Token instances.

`init`

Inputs

line (str): A line string.
pos (tuple[int]): A tuple of bounding box information ($x_0$, $y_0$, $x_1$, $y_1$). The bounding box values range from 0 to 1000.
font_size (float): Font size obtained using pdfplumber.
font_name (str): Font name obtained using pdfplumber.
tokens (list[Token]): A list of Token instances that form a line.
meta (dict): A meta dictionary used to save supplementary information.
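As a rough sketch of how a Line relates to its tokens, the example below builds a line from a list of tokens. Both dataclasses are hypothetical stand-ins that mirror the documented inputs, and `line_from_tokens` is an illustrative helper, not part of appjsonify. It assumes the line's bounding box is the smallest box enclosing all token boxes and that the first token's font represents the line.

```python
from dataclasses import dataclass, field


@dataclass
class Token:  # hypothetical stand-in
    token: str
    pos: tuple[int, int, int, int]
    font_size: float
    font_name: str
    meta: dict = field(default_factory=dict)


@dataclass
class Line:  # hypothetical stand-in
    line: str
    pos: tuple[int, int, int, int]
    font_size: float
    font_name: str
    tokens: list[Token]
    meta: dict = field(default_factory=dict)


def line_from_tokens(tokens: list[Token]) -> Line:
    """Build a Line whose box encloses all of its tokens (illustrative)."""
    if not tokens:
        raise ValueError("a line needs at least one token")
    return Line(
        line=" ".join(t.token for t in tokens),
        pos=(
            min(t.pos[0] for t in tokens),  # leftmost x0
            min(t.pos[1] for t in tokens),  # topmost y0
            max(t.pos[2] for t in tokens),  # rightmost x1
            max(t.pos[3] for t in tokens),  # bottommost y1
        ),
        font_size=tokens[0].font_size,  # assumption: first token's font
        font_name=tokens[0].font_name,
        tokens=list(tokens),
    )


tokens = [
    Token("Document", (100, 200, 260, 215), 12.0, "Times-Bold"),
    Token("handling", (270, 201, 400, 216), 12.0, "Times-Bold"),
]
line = line_from_tokens(tokens)
print(line.line, line.pos)  # Document handling (100, 200, 400, 216)
```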

`Paragraph`

This class holds a paragraph element, its position and surface information (bounding box coordinates, font size, and font name), and a list of lines. A Paragraph consists of a list of Line instances, i.e. a Paragraph instance must be built from its corresponding Line instances.

`init`

Inputs

paragraph (str): A paragraph string.
pos (tuple[int]): A tuple of bounding box information ($x_0$, $y_0$, $x_1$, $y_1$). The bounding box values range from 0 to 1000.
font_size (float): Font size obtained using pdfplumber.
font_name (str): Font name obtained using pdfplumber.
lines (list[Line]): A list of Line instances that form a paragraph.
meta (dict): A meta dictionary used to save supplementary information.
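A paragraph has a single `font_size` and `font_name` even though its lines may differ. The sketch below shows one possible way to choose them: take the font that occurs on the most lines. This heuristic is an assumption made for illustration, as are the stand-in dataclasses and the `paragraph_from_lines` helper; none of them are taken from appjsonify.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Line:  # hypothetical stand-in; `tokens` would hold Token instances
    line: str
    pos: tuple[int, int, int, int]
    font_size: float
    font_name: str
    tokens: list
    meta: dict = field(default_factory=dict)


@dataclass
class Paragraph:  # hypothetical stand-in
    paragraph: str
    pos: tuple[int, int, int, int]
    font_size: float
    font_name: str
    lines: list[Line]
    meta: dict = field(default_factory=dict)


def paragraph_from_lines(lines: list[Line]) -> Paragraph:
    """Merge lines into one Paragraph using the most frequent font (assumed)."""
    if not lines:
        raise ValueError("a paragraph needs at least one line")
    fonts = Counter((ln.font_size, ln.font_name) for ln in lines)
    (font_size, font_name), _count = fonts.most_common(1)[0]
    return Paragraph(
        paragraph=" ".join(ln.line for ln in lines),
        pos=(
            min(ln.pos[0] for ln in lines),
            min(ln.pos[1] for ln in lines),
            max(ln.pos[2] for ln in lines),
            max(ln.pos[3] for ln in lines),
        ),
        font_size=font_size,
        font_name=font_name,
        lines=list(lines),
    )


# Tokens are left empty here only to keep the sketch short.
lines = [
    Line("A paragraph spans", (100, 100, 480, 115), 10.0, "Times-Roman", tokens=[]),
    Line("several lines of", (100, 118, 470, 133), 10.0, "Times-Roman", tokens=[]),
    Line("text.", (100, 136, 300, 151), 10.0, "Times-Italic", tokens=[]),
]
para = paragraph_from_lines(lines)
print(para.font_name, para.pos)  # Times-Roman (100, 100, 480, 151)
```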

`Page`

This class holds lists of Paragraph, Line, and Token instances per PDF page, as well as its meta dictionary.

`init`

Inputs

paragraphs (list[Paragraph]): A list of Paragraph instances that form a Page.
lines (list[Line]): A list of Line instances that form a Page.
tokens (list[Token]): A list of Token instances that form a Page.
meta (dict): A meta dictionary used to save supplementary information.
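The three lists can be read as views of the same page at different granularities. The sketch below assumes exactly that and derives `lines` and `tokens` by flattening the paragraphs. The `Page` dataclass is a hypothetical stand-in, `page_from_paragraphs` is an illustrative helper, and the `page_n` meta key is invented for the example. In the library itself the lists may be filled at different processing stages.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace


@dataclass
class Page:  # hypothetical stand-in mirroring the documented inputs
    paragraphs: list
    lines: list
    tokens: list
    meta: dict = field(default_factory=dict)


def page_from_paragraphs(paragraphs, meta=None) -> Page:
    """Flatten paragraphs into page-level line and token lists (assumed)."""
    lines = [ln for p in paragraphs for ln in p.lines]
    tokens = [t for ln in lines for t in ln.tokens]
    return Page(
        paragraphs=list(paragraphs),
        lines=lines,
        tokens=tokens,
        meta=dict(meta or {}),
    )


# SimpleNamespace objects stand in for Token, Line, and Paragraph instances.
t1, t2, t3 = (SimpleNamespace(token=w) for w in ("a", "b", "c"))
l1 = SimpleNamespace(line="a b", tokens=[t1, t2])
l2 = SimpleNamespace(line="c", tokens=[t3])
p1 = SimpleNamespace(paragraph="a b c", lines=[l1, l2])

page = page_from_paragraphs([p1], meta={"page_n": 1})  # hypothetical meta key
print([t.token for t in page.tokens])  # ['a', 'b', 'c']
```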

`Document`

This class holds the path to an original PDF file, its Page instances, formatted Paragraph instances, and a meta dictionary. It is the core class used in BaseRunner: every processing module must take a list of Document instances (list[Document]) as the first input parameter of its execute method and return the processed list[Document].

`init`

Inputs

input_path (Path): A path to a PDF file.
pages (list[Page]): A list of Page instances.
formatted_paragraphs (list[Paragraph]): A list of Paragraph instances.
meta (dict): A meta dictionary used to save supplementary information such as footers and captions.
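The module contract described above, where `execute` takes `list[Document]` first and returns `list[Document]`, can be sketched as follows. The `Document` dataclass is a hypothetical stand-in mirroring the documented inputs, and `CountPagesModule` with its `n_pages` meta key is invented for illustration. How BaseRunner registers and calls modules is not shown here, and real modules may accept further parameters.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:  # hypothetical stand-in mirroring the documented inputs
    input_path: Path
    pages: list
    formatted_paragraphs: list
    meta: dict = field(default_factory=dict)


class CountPagesModule:
    """Hypothetical processing module following the documented contract."""

    def execute(self, documents: list[Document], **kwargs) -> list[Document]:
        # Record the page count of each document in its meta dictionary.
        for doc in documents:
            doc.meta["n_pages"] = len(doc.pages)
        return documents


# Placeholder objects stand in for Page instances.
docs = [
    Document(
        input_path=Path("paper.pdf"),
        pages=[object(), object()],
        formatted_paragraphs=[],
    )
]
processed = CountPagesModule().execute(docs)
print(processed[0].meta)  # {'n_pages': 2}
```

Because every module shares this input and output type, modules written this way can be chained by passing the returned list to the next module's `execute`.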
