Commit a2e4cd1

Add Document Models (#1)
* Add base models
* Add unit tests
* Add setup.py and requirements-dev.txt
* Update README.md
* Add basic dev container configuration

---------

Co-authored-by: Alessio Vertemati <alessio.vertemati@gmail.com>
1 parent bdb2391 commit a2e4cd1

17 files changed, with 850 additions and 2 deletions

.devcontainer/devcontainer.json

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
// For format details, see https://aka.ms/devcontainer.json. For config options, see the
// README at: https://github.com/devcontainers/templates/tree/main/src/python
{
    "name": "Python 3",
    // Or use a Dockerfile or Docker Compose file. More info: https://containers.dev/guide/dockerfile
    "image": "mcr.microsoft.com/devcontainers/python:1-3.12-bullseye",

    // Features to add to the dev container. More info: https://containers.dev/features.
    // "features": {},

    // Use 'postCreateCommand' to run commands after the container is created.
    "postCreateCommand": "pip3 install --user -r requirements.txt -r requirements-dev.txt"

    // Configure tool-specific properties.
    // "customizations": {}
}

.github/CONTRIBUTING.md

Lines changed: 54 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,54 @@
1+
# Contributing
2+
3+
Contributions are **welcome** and will be fully **credited**.
4+
5+
Please read and understand the contribution guide before creating an issue or pull request.
6+
7+
## Etiquette
8+
9+
This project is open source, and as such, the maintainers give their free time to build and maintain the source code held within. They make the code freely available in the hope that it will be of use to other developers. It would be extremely unfair for them to suffer abuse or anger for their hard work.
10+
11+
Please be considerate towards maintainers when raising issues or presenting pull requests. Let's show the
12+
world that developers are civilized and selfless people.
13+
14+
It's the duty of the maintainer to ensure that all submissions to the project are of sufficient
15+
quality to benefit the project. Many developers have different skillsets, strengths, and weaknesses. Respect the maintainer's decision, and do not be upset or abusive if your submission is not used.
16+
17+
## Viability
18+
19+
When requesting or submitting new features, first consider whether it might be useful to others. Open
20+
source projects are used by many developers, who may have entirely different needs to your own. Think about
21+
whether or not your feature is likely to be used by other users of the project.
22+
23+
## Procedure
24+
25+
> [!NOTE]
26+
> Issue tracking is not currently enabled for this repository. We are organising it.
27+
28+
Before filing an issue:
29+
30+
- Attempt to replicate the problem, to ensure that it wasn't a coincidental incident.
31+
- Check to make sure your feature suggestion isn't already present within the project.
32+
- Check the pull requests tab to ensure that the bug doesn't have a fix in progress.
33+
- Check the pull requests tab to ensure that the feature isn't already in progress.
34+
35+
Before submitting a pull request:
36+
37+
- Check the codebase to ensure that your feature doesn't already exist.
38+
- Check the pull requests to ensure that another person hasn't already submitted the feature or fix.
39+
40+
## Requirements
41+
42+
If the project maintainer has any additional requirements, you will find them listed here.
43+
44+
- **Add tests!** - Your patch won't be accepted if it doesn't have tests.
45+
46+
- **Document any change in behaviour** - Make sure the `README.md` and any other relevant documentation are kept up-to-date.
47+
48+
- **Consider our release cycle** - We try to follow [SemVer v2.0.0](https://semver.org/). Randomly breaking public APIs is not an option.
49+
50+
- **One pull request per feature** - If you want to do more than one thing, send multiple pull requests.
51+
52+
- **Send coherent history** - Make sure each individual commit in your pull request is meaningful. If you had to make multiple intermediate commits while developing, please [squash them](https://www.git-scm.com/book/en/v2/Git-Tools-Rewriting-History#Changing-Multiple-Commit-Messages) before submitting.
53+
54+
**Happy coding**!

.github/SECURITY.md

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
# Security Policy
2+
3+
If you discover any security-related issues, please email security@oneofftech.xyz instead of using the discussions or the issue tracker.

LICENSE

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2024 Andrea Ponti
+Copyright (c) OneOffTech <info@oneofftech.xyz>
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

README.md

Lines changed: 209 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1,209 @@
1-
# :card_box: Document Model Python
1+
![pypi](https://img.shields.io/pypi/v/parse-document-model-python.svg)
2+
[![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://docs.pydantic.dev/latest/contributing/#badges)
3+
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
4+
5+
# Parse Document Model (Python)
6+
7+
**Parse Document Model** (Python) provides Pydantic models for representing text documents using a hierarchical model.
8+
This library allows you to define documents as a hierarchy of (specialised) nodes where each node can represent a document, page, text, heading, body, and more.
9+
10+
These models aim to preserve the underlying structure of text documents for further processing, such as creating a table of contents or transforming between formats, e.g. converting a parsed PDF to Markdown.
11+
12+
- **Hierarchical structure**: The document is modelled as a hierarchy of nodes. Each node can represent a part of the
13+
document, such as a page or a span of text.
14+
- **Rich text support**: Nodes can represent not only the content but also the formatting (e.g. bold, italic) applied to the text.
15+
- **Attributes**: Each node can have attributes that provide additional information such as page number,
16+
bounding box, etc.
17+
- **Built-in validation and types**: Built with [`Pydantic`](https://docs.pydantic.dev/latest/), ensuring type safety, validation and effortless creation of complex document structures.
18+
19+
20+
**Requirements**
21+
22+
- Python 3.12 or above (Python 3.9, 3.10 and 3.11 are supported on a best-effort basis).
23+
24+
25+
**Next steps**
26+
27+
- [Explore the document model](#document-model-overview)
28+
- [Install the library and use the models](#getting-started)
29+
30+
31+
## Document Model Overview
32+
33+
We represent the document structure as a hierarchy so that the inherent structure is preserved when chapters, sections and headings are used. Consider a generic document with two pages, each with one heading and one paragraph of text. The resulting representation might look like this:
34+
35+
```
Document
 ├─Page
 │  ├─Text (category: heading)
 │  └─Text (category: body)
 └─Page
    ├─Text (category: heading)
    └─Text (category: body)
```
44+
45+
At a glance you can see the structure: the document is composed of two pages, each containing a heading and a body paragraph. To make this possible, the hierarchy is built around the concept of a Node, like a node in a graph.
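As a rough illustration of why the hierarchy is useful, the sketch below walks a tree like the one above and collects its headings, which is the first step towards a table of contents. It uses plain dictionaries with `category` and `content` keys as a stand-in for the library's models. That dictionary layout, the sample texts and the `collect_headings` helper are assumptions made for this example and are not part of the library.

```python
def collect_headings(node):
    """Return the text of every heading found under `node`, in document order."""
    content = node.get("content")
    if isinstance(content, str):
        # Leaf text node: keep it only when it is a heading.
        return [content] if node.get("category") == "heading" else []
    headings = []
    for child in content or []:
        headings.extend(collect_headings(child))
    return headings


# Hypothetical two-page document mirroring the tree above.
document = {
    "category": "doc",
    "content": [
        {"category": "page", "content": [
            {"category": "heading", "content": "Introduction"},
            {"category": "body", "content": "First paragraph."},
        ]},
        {"category": "page", "content": [
            {"category": "heading", "content": "Conclusion"},
            {"category": "body", "content": "Last paragraph."},
        ]},
    ],
}

print(collect_headings(document))
```

Because every node exposes the same two keys, the traversal does not need to know how deep the tree is or which node types it contains.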
46+
47+
### Node types
48+
49+
```mermaid
classDiagram
    class Node
    Node <|-- StructuredNode
    Node <|-- Text
    StructuredNode <|-- Document
    StructuredNode <|-- Page
```
57+
58+
59+
#### 1. **Node** (Base Class)
60+
61+
This is the abstract class from which all other nodes inherit.
62+
63+
Each node has:
64+
65+
- `category`: The type of the node (e.g., `doc`, `page`, `heading`).
66+
- `attributes`: Optional field to attach extra data to a node. See [Attributes](#attributes).
67+
68+
#### 2. **StructuredNode**
69+
70+
This extends the [`Node`](#1-node-base-class). It is used to represent the hierarchy as a node whose content is a list of other nodes, such as [`Document`](#3-document) and [`Page`](#4-page).
71+
72+
- `content`: List of `Node`.
73+
74+
75+
#### 3. **Document**
76+
77+
This is the root node of a document.
78+
79+
- `category`: Always set to `"doc"`.
80+
- `attributes`: Document-wide attributes can be set here.
81+
- `content`: List of [`Page`](#4-page) nodes that form the document.
82+
83+
#### 4. **Page**
84+
85+
Represents a page in the document:
86+
87+
- `category`: Always set to `"page"`.
88+
- `attributes`: Can contain metadata like page number.
89+
- `content`: List of [`Text`](#5-text) nodes on the page.
90+
91+
#### 5. **Text**
92+
93+
This node represents a paragraph, a heading or any text within the document.
94+
95+
- `category`: The type of text, such as `"heading"` or `"body"`.
96+
- `content`: A string representing the textual content.
97+
- `marks`: List of [marks](#marks) applied to the text, such as bold, italic, etc.
98+
- `attributes`: Can contain metadata like the bounding box representing where this portion of text is located in the page.
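The node types described above could be expressed with Pydantic roughly as follows. This is a simplified sketch written for this README, not the library's actual source: the class and field names follow the descriptions above, while `attributes` is reduced to an optional `dict` and `marks` to a list of strings, both of which are assumptions made to keep the example short.

```python
from typing import Literal, Optional

from pydantic import BaseModel, ValidationError


class Node(BaseModel):
    category: str
    # Simplified assumption: a plain dict instead of typed attribute models.
    attributes: Optional[dict] = None


class Text(Node):
    content: str
    # Simplified assumption: marks are plain strings in this sketch.
    marks: list[str] = []


class StructuredNode(Node):
    content: list[Node]


class Page(StructuredNode):
    category: Literal["page"] = "page"
    content: list[Text]


class Document(StructuredNode):
    category: Literal["doc"] = "doc"
    content: list[Page]


doc = Document(content=[Page(content=[Text(category="heading", content="Title")])])
print(doc.category, doc.content[0].category, doc.content[0].content[0].category)

# A fixed category is enforced at validation time.
try:
    Document(category="page", content=[])
except ValidationError:
    print("a Document only accepts the 'doc' category")
```

Narrowing `content` in each subclass is what keeps pages out of text nodes and text out of the document root. The `Literal` types pin `category` to the fixed values listed above.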
99+
100+
101+
102+
### Marks
103+
104+
Marks are used to add style or functionality to the text within a [`Text`](#5-text) node.
105+
For example, bold text, italic text, links and custom styles such as font or colour.
106+
107+
**Mark Types**
108+
109+
- `Bold`: Represents bold text.
110+
- `Italic`: Represents italic text.
111+
- `TextStyle`: Allows customization of font and color.
112+
- `Link`: Represents a hyperlink.
113+
114+
Marks are validated and enforced with the help of `Pydantic` model validators.
115+
116+
### Attributes
117+
118+
Attributes are optional fields that can store additional information for each node. Some predefined attributes are:
119+
120+
- `DocumentAttributes`: General attributes for the document (currently reserved for the future).
121+
- `PageAttributes`: Specific page related attributes, such as the page number.
122+
- `TextAttributes`: Text related attributes, such as bounding boxes.
123+
- `BoundingBox`: A box that specifies the position of a text in the page.
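To make the attribute models more concrete, the sketch below copies the `BoundingBox` and `TextAttributes` fields from `parse_document_model/attributes.py` so that it runs without the package installed. The abstract `Attributes` base class is omitted for brevity and the coordinate values are invented for the example. The source does not state which unit or origin the coordinates use.

```python
from pydantic import BaseModel, ValidationError


class BoundingBox(BaseModel):
    min_x: float
    min_y: float
    max_x: float
    max_y: float
    page: int


class TextAttributes(BaseModel):
    bounding_box: list[BoundingBox] = []


# Plain dicts are validated and converted into BoundingBox instances.
attrs = TextAttributes(
    bounding_box=[{"min_x": 10, "min_y": 20, "max_x": 110.5, "max_y": 42, "page": 1}]
)
box = attrs.bounding_box[0]
print(box.max_x - box.min_x)  # width of the box

# Values of the wrong type are rejected.
try:
    BoundingBox(min_x=0, min_y=0, max_x=1, max_y=1, page="first")
except ValidationError:
    print("page must be an integer")
```

Note that `bounding_box` is a list, so a text node that spans several lines or pages can carry more than one box.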
124+
125+
126+
## Getting started
127+
128+
### Installation
129+
130+
Parse Document Model is distributed via PyPI. You can install it with `pip`.
131+
132+
```bash
pip install parse-document-model-python
```
135+
136+
### Quick Example
137+
138+
Here’s how you can represent a simple document with one page and some text:
139+
140+
```python
from parse_document_model.document import Document, Page, Text

doc = Document(
    category="doc",
    content=[
        Page(
            category="page",
            content=[
                Text(
                    category="heading",
                    content="Welcome to parse-document-model-python",
                    marks=["bold"]
                ),
                Text(
                    category="body",
                    content="This is an example text using the document model."
                )
            ]
        )
    ]
)
```
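Once a document is built, it can be traversed for further processing, for example to produce Markdown. The sketch below is one possible approach and relies only on the `category` and `content` fields described in this README. The `to_markdown` helper is hypothetical, it ignores marks and heading levels, and it uses `SimpleNamespace` stand-ins so that it runs without the package installed.

```python
from types import SimpleNamespace


def to_markdown(document):
    """Render a document-like object as Markdown, one block per text node."""
    blocks = []
    for page in document.content:
        for text in page.content:
            if text.category == "heading":
                blocks.append(f"# {text.content}")
            else:
                blocks.append(text.content)
    return "\n\n".join(blocks)


# Stand-in objects exposing the same attribute names as the models.
sample = SimpleNamespace(category="doc", content=[
    SimpleNamespace(category="page", content=[
        SimpleNamespace(category="heading", content="Welcome"),
        SimpleNamespace(category="body", content="This is an example text."),
    ]),
])

print(to_markdown(sample))
```

Assuming the real models expose `category` and `content` as attributes, the same function should accept the `doc` object from the quick example.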
163+
164+
## Testing
165+
166+
Parse Document Model is tested using [pytest](https://docs.pytest.org/en/stable/). Tests run for each commit and pull request.
167+
168+
Install the dependencies.
169+
170+
```bash
pip install -r requirements.txt -r requirements-dev.txt
```
173+
174+
Execute the test suite.
175+
176+
```bash
pytest
```
179+
180+
181+
## Contributing
182+
183+
Thank you for considering contributing to the Parse Document Model! The contribution guide can be found in the [CONTRIBUTING.md](./.github/CONTRIBUTING.md) file.
184+
185+
> [!NOTE]
186+
> Consider opening a [discussion](https://github.com/OneOffTech/parse-document-model-python/discussions) before submitting a pull request with changes to the model structures.
187+
188+
## Security Vulnerabilities
189+
190+
Please review [our security policy](./.github/SECURITY.md) on how to report security vulnerabilities.
191+
192+
## Credits
193+
194+
- [OneOffTech](https://github.com/OneOffTech)
195+
- [All Contributors](../../contributors)
196+
197+
## Supporters
198+
199+
The project is provided and supported by [OneOff-Tech (UG)](https://oneofftech.de).
200+
201+
<p align="left"><a href="https://oneofftech.de" target="_blank"><img src="https://raw.githubusercontent.com/OneOffTech/.github/main/art/oneofftech-logo.svg" width="200"></a></p>
202+
203+
## Acknowledgements
204+
205+
The format and structure take inspiration from [ProseMirror](https://prosemirror.net/docs/ref/#model.Document_Schema).
206+
207+
## License
208+
209+
The MIT License (MIT). Please see [License File](LICENSE) for more information.

parse_document_model/__init__.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
from .document import Document, Page

parse_document_model/attributes.py

Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,27 @@
from abc import ABC

from pydantic import BaseModel


class BoundingBox(BaseModel):
    min_x: float
    min_y: float
    max_x: float
    max_y: float
    page: int


class Attributes(BaseModel, ABC):
    pass


class DocumentAttributes(Attributes):
    pass


class PageAttributes(Attributes):
    page: int


class TextAttributes(Attributes):
    bounding_box: list[BoundingBox] = []
