| 1 | -# :card_box: Document Model Python |
| 1 | + |
| 4 | + |
| 5 | +# Parse Document Model (Python) |
| 6 | + |
| 7 | +**Parse Document Model** (Python) provides Pydantic models for representing text documents using a hierarchical model. |
| 8 | +This library allows you to define documents as a hierarchy of (specialised) nodes where each node can represent a document, page, text, heading, body, and more. |
| 9 | + |
| 10 | +These models aim to preserve the underlying structure of text documents for further processing, such as creating a table of contents or transforming between formats, e.g. converting a parsed PDF to Markdown. |
| 11 | + |
| 12 | +- **Hierarchical structure**: The document is modelled as a hierarchy of nodes. Each node can represent a part of the |
| 13 | +document, such as the document itself, a page, or a piece of text. |
| 14 | +- **Rich text support**: Nodes can represent not only the content but also the formatting (e.g. bold, italic) applied to the text. |
| 15 | +- **Attributes**: Each node can have attributes that provide additional information such as page number, |
| 16 | +bounding box, etc. |
| 17 | +- **Built-in validation and types**: Built with [`Pydantic`](https://docs.pydantic.dev/latest/), ensuring type safety, validation and effortless creation of complex document structures. |
| 18 | + |
| 19 | + |
| 20 | +**Requirements** |
| 21 | + |
| 22 | +- Python 3.12 or above (Python 3.9, 3.10 and 3.11 are supported on a best-effort basis). |
| 23 | + |
| 24 | + |
| 25 | +**Next steps** |
| 26 | + |
| 27 | +- [Explore the document model](#document-model-overview) |
| 28 | +- [Install the library and use the models](#getting-started) |
| 29 | + |
| 30 | + |
| 31 | +## Document Model Overview |
| 32 | + |
| 33 | +We want to represent the document structure as a hierarchy so that the inherent structure is preserved when chapters, sections and headings are used. Consider a generic document with two pages, each containing one heading and one paragraph of text. The resulting representation might be the following. |
| 34 | + |
| 35 | +``` |
| 36 | +Document |
| 37 | + ├─Page |
| 38 | + │ ├─Text (category: heading) |
| 39 | + │ └─Text (category: body) |
| 40 | + └─Page |
| 41 | + ├─Text (category: heading) |
| 42 | + └─Text (category: body) |
| 43 | +``` |
| 44 | + |
| 45 | +At a glance you can see the structure: the document is composed of two pages, each with one heading and one body paragraph. To achieve this, we defined a hierarchy around the concept of a Node, like a node in a graph. |
| 46 | + |
| 47 | +### Node types |
| 48 | + |
| 49 | +```mermaid |
| 50 | +classDiagram |
| 51 | + class Node |
| 52 | + Node <|-- StructuredNode |
| 53 | + Node <|-- Text |
| 54 | + StructuredNode <|-- Document |
| 55 | + StructuredNode <|-- Page |
| 56 | +``` |
| 57 | + |
| 58 | + |
| 59 | +#### 1. **Node** (Base Class) |
| 60 | + |
| 61 | +This is the abstract class from which all other nodes inherit. |
| 62 | + |
| 63 | +Each node has: |
| 64 | + |
| 65 | +- `category`: The type of the node (e.g., `doc`, `page`, `heading`). |
| 66 | +- `attributes`: Optional field to attach extra data to a node. See [Attributes](#attributes). |
| 67 | + |
| 68 | +#### 2. **StructuredNode** |
| 69 | + |
| 70 | +This extends [`Node`](#1-node-base-class). It represents the hierarchy as a node whose content is a list of other nodes, such as [`Document`](#3-document) and [`Page`](#4-page). |
| 71 | + |
| 72 | +- `content`: List of `Node`. |
| 73 | + |
| 74 | + |
| 75 | +#### 3. **Document** |
| 76 | + |
| 77 | +This is the root node of a document. |
| 78 | + |
| 79 | +- `category`: Always set to `"doc"`. |
| 80 | +- `attributes`: Document-wide attributes can be set here. |
| 81 | +- `content`: List of [`Page`](#4-page) nodes that form the document. |
| 82 | + |
| 83 | +#### 4. **Page** |
| 84 | + |
| 85 | +Represents a page in the document: |
| 86 | + |
| 87 | +- `category`: Always set to `"page"`. |
| 88 | +- `attributes`: Can contain metadata like page number. |
| 89 | +- `content`: List of [`Text`](#5-text) nodes on the page. |
| 90 | + |
| 91 | +#### 5. **Text** |
| 92 | + |
| 93 | +This node represents a paragraph, a heading or any other text within the document. |
| 94 | + |
| 95 | +- `category`: The type of text, e.g. `"heading"` or `"body"`. |
| 96 | +- `content`: A string representing the textual content. |
| 97 | +- `marks`: List of [marks](#marks) applied to the text, such as bold, italic, etc. |
| 98 | +- `attributes`: Can contain metadata such as the bounding box describing where this portion of text is located on the page. |
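
The node types above can be sketched with standard-library dataclasses to show how the hierarchy supports tasks such as building a table of contents. This is a simplified stand-in and not the library's Pydantic models: the class bodies and the `collect_headings` helper are assumptions made for illustration, and `marks` and `attributes` are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Stand-in for the base node: every node carries a category."""
    category: str


@dataclass
class Text(Node):
    """Leaf node whose content is a string."""
    content: str = ""


@dataclass
class StructuredNode(Node):
    """Node whose content is a list of other nodes (document, page)."""
    content: list[Node] = field(default_factory=list)


def collect_headings(node: Node) -> list[str]:
    """Hypothetical helper: walk the tree depth-first and return heading texts."""
    if isinstance(node, Text):
        return [node.content] if node.category == "heading" else []
    if isinstance(node, StructuredNode):
        headings: list[str] = []
        for child in node.content:
            headings.extend(collect_headings(child))
        return headings
    return []


# Two pages, each with one heading and one body paragraph.
doc = StructuredNode(
    category="doc",
    content=[
        StructuredNode(
            category="page",
            content=[
                Text(category="heading", content="Introduction"),
                Text(category="body", content="First paragraph."),
            ],
        ),
        StructuredNode(
            category="page",
            content=[
                Text(category="heading", content="Conclusion"),
                Text(category="body", content="Last paragraph."),
            ],
        ),
    ],
)

print(collect_headings(doc))  # ['Introduction', 'Conclusion']
```

Because every structured node exposes its children through `content`, the same recursive walk works regardless of how deep the hierarchy goes.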
| 99 | + |
| 100 | + |
| 101 | + |
| 102 | +### Marks |
| 103 | + |
| 104 | +Marks are used to add style or functionality to the text within a [`Text`](#5-text) node. |
| 105 | +For example, bold text, italic text, links and custom styles such as font or colour. |
| 106 | + |
| 107 | +**Mark Types** |
| 108 | + |
| 109 | +- `Bold`: Represents bold text. |
| 110 | +- `Italic`: Represents italic text. |
| 111 | +- `TextStyle`: Allows customisation of font and colour. |
| 112 | +- `Link`: Represents a hyperlink. |
| 113 | + |
| 114 | +Marks are validated and enforced with the help of `Pydantic` model validators. |
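
To illustrate what validating a mark can look like, here is a minimal sketch that swaps Pydantic model validators for a standard-library dataclass with `__post_init__`. The category names, the `url` field and the rules themselves are assumptions for illustration, not the library's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical category names, loosely mirroring the mark types listed above.
ALLOWED_MARKS = {"bold", "italic", "textStyle", "link"}


@dataclass
class Mark:
    category: str
    url: Optional[str] = None  # hypothetical field, only meaningful for links

    def __post_init__(self) -> None:
        # Reject categories outside the known set.
        if self.category not in ALLOWED_MARKS:
            raise ValueError(f"unknown mark category: {self.category!r}")
        # A link without a target is not useful, so require a url.
        if self.category == "link" and not self.url:
            raise ValueError("a link mark needs a url")
        # Other marks must not carry a url.
        if self.category != "link" and self.url is not None:
            raise ValueError("only link marks may carry a url")


bold = Mark(category="bold")
link = Mark(category="link", url="https://oneofftech.de")
```

Constructing `Mark(category="link")` without a `url`, or a mark with an unknown category, raises `ValueError` in this sketch.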
| 115 | + |
| 116 | +### Attributes |
| 117 | + |
| 118 | +Attributes are optional fields that can store additional information for each node. Some predefined attributes are: |
| 119 | + |
| 120 | +- `DocumentAttributes`: General attributes for the document (currently reserved for the future). |
| 121 | +- `PageAttributes`: Page-specific attributes, such as the page number. |
| 122 | +- `TextAttributes`: Text-related attributes, such as bounding boxes. |
| 123 | +- `BoundingBox`: A box that specifies the position of a piece of text on the page. |
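
As a rough illustration of how such attributes might be used, the sketch below attaches bounding boxes to a piece of text and derives the pages it spans. It uses standard-library dataclasses rather than the library's models, and the field names (`min_x`, `min_y`, `max_x`, `max_y`, `page`, `bounding_box`) and the `pages_spanned` helper are assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BoundingBox:
    # Field names are assumptions chosen for illustration.
    min_x: float
    min_y: float
    max_x: float
    max_y: float
    page: int


@dataclass
class TextAttributes:
    # A text can have several boxes, e.g. when a paragraph crosses a page break.
    bounding_box: list[BoundingBox] = field(default_factory=list)


def pages_spanned(attributes: TextAttributes) -> list[int]:
    """Hypothetical helper: sorted, de-duplicated page numbers of all boxes."""
    return sorted({box.page for box in attributes.bounding_box})


attrs = TextAttributes(
    bounding_box=[
        BoundingBox(min_x=0.1, min_y=0.8, max_x=0.9, max_y=0.95, page=1),
        BoundingBox(min_x=0.1, min_y=0.05, max_x=0.9, max_y=0.2, page=2),
    ]
)

print(pages_spanned(attrs))  # [1, 2]
```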
| 124 | + |
| 125 | + |
| 126 | +## Getting started |
| 127 | + |
| 128 | +### Installation |
| 129 | + |
| 130 | +Parse Document Model is distributed via PyPI. You can install it with `pip`. |
| 131 | + |
| 132 | +```bash |
| 133 | +pip install parse-document-model-python |
| 134 | +``` |
| 135 | + |
| 136 | +### Quick Example |
| 137 | + |
| 138 | +Here’s how you can represent a simple document with one page and some text: |
| 139 | + |
| 140 | +```python |
| 141 | +from document_model_python.document import Document, Page, Text |
| 142 | + |
| 143 | +doc = Document( |
| 144 | + category="doc", |
| 145 | + content=[ |
| 146 | + Page( |
| 147 | + category="page", |
| 148 | + content=[ |
| 149 | + Text( |
| 150 | + category="heading", |
| 151 | + content="Welcome to parse-document-model-python", |
| 152 | + marks=["bold"] |
| 153 | + ), |
| 154 | + Text( |
| 155 | + category="body", |
| 156 | + content="This is an example text using the document model." |
| 157 | + ) |
| 158 | + ] |
| 159 | + ) |
| 160 | + ] |
| 161 | +) |
| 162 | +``` |
| 163 | + |
| 164 | +## Testing |
| 165 | + |
| 166 | +Parse Document Model is tested using [pytest](https://docs.pytest.org/en/stable/). Tests run for each commit and pull request. |
| 167 | + |
| 168 | +Install the dependencies. |
| 169 | + |
| 170 | +```bash |
| 171 | +pip install -r requirements.txt -r requirements-dev.txt |
| 172 | +``` |
| 173 | + |
| 174 | +Execute the test suite. |
| 175 | + |
| 176 | +```bash |
| 177 | +pytest |
| 178 | +``` |
| 179 | + |
| 180 | + |
| 181 | +## Contributing |
| 182 | + |
| 183 | +Thank you for considering contributing to the Parse Document Model! The contribution guide can be found in the [CONTRIBUTING.md](./.github/CONTRIBUTING.md) file. |
| 184 | + |
| 185 | +> [!NOTE] |
| 186 | +> Consider opening a [discussion](https://github.com/OneOffTech/parse-document-model-python/discussions) before submitting a pull request with changes to the model structures. |
| 187 | + |
| 188 | +## Security Vulnerabilities |
| 189 | + |
| 190 | +Please review [our security policy](./.github/SECURITY.md) on how to report security vulnerabilities. |
| 191 | + |
| 192 | +## Credits |
| 193 | + |
| 194 | +- [OneOffTech](https://github.com/OneOffTech) |
| 195 | +- [All Contributors](../../contributors) |
| 196 | + |
| 197 | +## Supporters |
| 198 | + |
| 199 | +The project is provided and supported by [OneOff-Tech (UG)](https://oneofftech.de). |
| 200 | + |
| 201 | +<p align="left"><a href="https://oneofftech.de" target="_blank"><img src="https://raw.githubusercontent.com/OneOffTech/.github/main/art/oneofftech-logo.svg" width="200"></a></p> |
| 202 | + |
| 203 | +## Acknowledgements |
| 204 | + |
| 205 | +The format and structure take inspiration from [ProseMirror](https://prosemirror.net/docs/ref/#model.Document_Schema). |
| 206 | + |
| 207 | +## License |
| 208 | + |
| 209 | +The MIT License (MIT). Please see [License File](LICENSE.md) for more information. |