Commit dd14dd3

v0.1.0 schemas (#1)
* Commit files from the PR in research repo
* Move scripts around
* Add test attempt 1
* Add test attempt 2
* Add test attempt 3
* Add test attempt 4
* Add inline equation schema
* Add image block, wip
* wip
* Image block passes test
* Make test return 1 if at least one failed
* Add bulleted_list_item, numbered_list_item, to_do
* Add equation block
* Add quote block
* Add table, table_row, divider
* Add column, column_list, update block schema
* Entire page validation works
* Page schema works
1 parent 4c4c306 commit dd14dd3

97 files changed, +15586 -1 lines changed

.flake8

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
+[flake8]
+max-line-length = 88
+max-complexity = 15
+extend-ignore =
+    # E101: Indentation contains mixed spaces and tabs
+    E101
+    # E111: Indentation is not a multiple of four
+    E111
+    # E112: Expected an indented block
+    E112
+    # E113: Unexpected indentation
+    E113
+    # E114: Indentation is not a multiple of four (comment)
+    E114
+    # E115: Expected an indented block (comment)
+    E115
+    # E116: Unexpected indentation (comment)
+    E116
+    # E117: Over-indented
+    E117
+    # E121: Continuation line under-indented for hanging indent
+    E121
+    # E122: Continuation line missing indentation or outdented
+    E122
+    # E123: Closing bracket does not match indentation of opening bracket's line
+    E123
+    # E124: Closing bracket does not match visual indentation
+    E124
+    # E125: Continuation line with same indent as next logical line
+    E125
+    # E126: Continuation line over-indented for hanging indent
+    E126
+    # E127: Continuation line over-indented for visual indent
+    E127
+    # E128: Continuation line under-indented for visual indent
+    E128
+    # E129: Visually indented line with same indent as next logical line
+    E129
+    # E131: Continuation line unaligned for hanging indent
+    E131
+    # E133: Closing bracket is missing indentation
+    E133
+    # E201: Whitespace after '('
+    E201,
+    # E202: Whitespace before ')'
+    E202,
+    # E203: Whitespace before ':'
+    E203,
+    # E211: Whitespace before '('
+    E211,
+    # E221: Multiple spaces before operator
+    E221,
+    # E222: Multiple spaces after operator
+    E222,
+    # E223: Tab before operator
+    E223,
+    # E224: Tab after operator
+    E224,
+    # E225: Missing whitespace around operator
+    E225,
+    # E226: Missing whitespace around arithmetic operator
+    E226,
+    # E227: Missing whitespace around bitwise or shift operator
+    E227,
+    # E228: Missing whitespace around modulo operator
+    E228,
+    # E231: Missing whitespace after ',', ';', or ':'
+    E231,
+    # E241: Multiple spaces after ','
+    E241,
+    # E242: Tab after ','
+    E242,
+    # E251: Unexpected spaces around keyword / parameter equals
+    E251,
+    # E261: At least two spaces before inline comment
+    E261,
+    # E262: Inline comment should start with '# '
+    E262,
+    # E265: Block comment should start with '# '
+    E265,
+    # E266: Too many leading '#' for block comment
+    E266,
+    # E271: Multiple spaces after keyword
+    E271,
+    # E272: Multiple spaces before keyword
+    E272,
+    # E273: Tab after keyword
+    E273,
+    # E274: Tab before keyword
+    E274,
+    # E275: Missing whitespace after keyword
+    E275,
+    # E301: Expected 1 blank line, found 0
+    E301,
+    # E302: Expected 2 blank lines, found 0
+    E302,
+    # E303: Too many blank lines (3)
+    E303,
+    # E304: Blank lines found after function decorator
+    E304,
+    # E305: Expected 2 blank lines after end of function or class
+    E305,
+    # E306: Expected 1 blank line before a nested definition
+    E306,
+    # E401: Multiple imports on one line
+    E401,
+    # E704: multiple statements on one line (def)
+    E704,
+    # E203: whitespace before ':'
+    E203,
+    # W191: Indentation contains tabs
+    W191,
+    # W291: Trailing whitespace
+    W291,
+    # W292: No newline at end of file
+    W292,
+    # W293: Blank line contains whitespace
+    W293,
+    # W391: Blank line at end of file
+    W391,
+    # W503: line break before binary operator
+    W503,
+    # W504: line break after binary operator
+    W504,
+    # F401: imported but unused
+    F401,
+    # F841: local variable is assigned to but never used
+    F841

.github/workflows/test.yaml

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+name: Test and Validate
+
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+    branches:
+      - main
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Check out repository
+        uses: actions/checkout@v2
+
+      - name: Set up Python
+        uses: actions/setup-python@v2
+        with:
+          python-version: '3.11'
+
+      - name: Install Poetry
+        uses: snok/install-poetry@v1
+        with:
+          version: 1.5.0
+          virtualenvs-create: true
+          virtualenvs-in-project: true
+
+      - name: Load cached venv
+        id: cached-poetry-dependencies
+        uses: actions/cache@v2
+        with:
+          path: .venv
+          key: venv-${{ runner.os }}-${{ hashFiles('**/poetry.lock') }}
+
+      - name: Install dependencies
+        if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
+        run: poetry install --no-interaction
+
+      - name: Run tests
+        run: |
+          source .venv/bin/activate
+          python tests/run_validation_tests.py schema
+
+      # - name: Upload test results
+      #   uses: actions/upload-artifact@v2
+      #   with:
+      #     name: test-results
+      #     path: test-results # Adjust this path if your tests output results to a different directory
+
+

.gitignore

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+.env
+.DS_Store
+*.pdf
+*.png
+__pycache__

.python-version

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+3.11

README.md

Lines changed: 15 additions & 1 deletion
@@ -1,2 +1,16 @@
 # JSON-DOC
-JSON-DOC is a block based document file format and data model
+
+JSON-DOC is a simple and flexible format for storing structured content in JSON files. It is designed to support a wide variety of content types and use cases, such as paragraphs, headings, lists, tables, images, code blocks, HTML and more.
+
+JSON-DOC is an attempt to standardize the data model used by [Notion](https://notion.so).
+
+## Features
+
+- Documents are represented as a list of blocks
+- Each block is a JSON object
+- A unique identifier for each block by hashing RFC 8785 Canonical JSON
+- Support for nested blocks
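A minimal sketch of the identifier feature listed above, not part of this commit. The helper name `block_id` and the choice of SHA-256 are assumptions, since the README names neither a function nor a hash algorithm. The serialization only approximates RFC 8785: sorted keys and compact separators match the canonical form for blocks with ASCII keys and string, integer, boolean, or null values, while floats and non-ASCII keys would need a full RFC 8785 implementation.

```python
import hashlib
import json


def block_id(block: dict) -> str:
    """Hypothetical helper: derive an ID from a block's canonical JSON form."""
    # Sorted keys plus compact separators approximate RFC 8785 for simple
    # blocks (ASCII keys; string, integer, boolean, or null values).
    canonical = json.dumps(
        block, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    # SHA-256 is an assumption; the README does not name a hash function.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Key order does not affect the resulting ID.
a = block_id({"type": "paragraph", "paragraph": {"rich_text": []}})
b = block_id({"paragraph": {"rich_text": []}, "type": "paragraph"})
```

Because the hash is computed over the canonical form, two blocks with the same content but different key order receive the same identifier.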
+
+## Motivation
+
+TBD

docs/notes-on-notion.md

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
+# Reverse Engineering Notion Data Model and API
+
+## UUIDs
+
+Notion uses UUIDs (v4) for the ID of each object. We could possibly improve on this by:
+
+- Using TypeIDs: Improves readability and attribution of IDs
+- Using an ID format that is more efficient for database indices.
+
+## Blocks
+
+A `Block` is (literally) the primary building block of documents in Notion.
+
+See: https://developers.notion.com/reference/block
+
+A `Block` is a container that allows stacking and nesting of various content types that Notion supports. In that way, it is a meta-object. It does not contain content itself, but it represents the relationship between content objects.
+
+An example Block of type `child_database`:
+
+```json
+{
+  "object": "block",
+  "id": "91589676-9cab-40dd-8ace-52f31a225d0a",
+  "parent": {
+    "type": "page_id",
+    "page_id": "8d7dbc6b-5c55-4589-826c-1352450db04e"
+  },
+  "created_by": {
+    "object": "user",
+    "id": "b9eb2a95-ab37-462d-b6ff-ff84080051f0"
+  },
+  "created_time": "2024-05-28T20:28:00.000Z",
+  "last_edited_time": "2024-05-28T20:29:00.000Z",
+  "last_edited_by": {
+    "object": "user",
+    "id": "b9eb2a95-ab37-462d-b6ff-ff84080051f0"
+  },
+  "has_children": false,
+  "archived": false,
+  "in_trash": false,
+  "type": "child_database",
+  "child_database": {
+    "title": "Example database"
+  }
+}
+```
+
+### `type` field
+
+The `type` field specifies what kind of content a block represents. The content is then contained in the corresponding field of the block object. For example, if the `type` field is `code`, the content is in the `code` field.
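The lookup described in the `type` field note can be sketched in a few lines, not part of this commit. The helper name `block_content` is hypothetical; the example dictionary is a trimmed copy of the `child_database` block shown earlier in the notes.

```python
def block_content(block: dict) -> dict:
    """Return the type-specific payload of a Notion-style block."""
    block_type = block["type"]
    # The payload is stored under a key named after the block's type.
    return block[block_type]


example = {
    "object": "block",
    "type": "child_database",
    "child_database": {"title": "Example database"},
}
title = block_content(example)["title"]
```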
+
+
+## Pages
+
+A `Page` is not a block, but a container for blocks. Pages can exist independently and contain other pages and blocks, creating a hierarchical structure. Blocks exist within pages (or other blocks) and do not have the capability to contain pages.
+
+```json
+{
+  "id": "8d7dbc6b5c554589826c1352450db04e",
+  "type": "page",
+  "properties": {...},
+  "children": [...]
+}
+```
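A minimal sketch of walking the hierarchy described in the Pages note, not part of this commit. It assumes nested blocks carry their own `children` list, mirroring the page object above; the function name `iter_blocks` and the sample page contents are illustrative only.

```python
from typing import Iterator


def iter_blocks(node: dict, depth: int = 0) -> Iterator[tuple[int, dict]]:
    """Yield (depth, block) pairs for every block nested under a page or block."""
    for child in node.get("children", []):
        yield depth, child
        # Assumption: nested blocks carry their own "children" list.
        yield from iter_blocks(child, depth + 1)


page = {
    "id": "8d7dbc6b5c554589826c1352450db04e",
    "type": "page",
    "properties": {},
    "children": [
        {"type": "heading_1", "heading_1": {}},
        {
            "type": "column_list",
            "column_list": {},
            "children": [{"type": "column", "column": {}}],
        },
    ],
}
outline = [(depth, block["type"]) for depth, block in iter_blocks(page)]
```

The traversal is depth-first, so each block is followed immediately by its descendants.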

docs/roadmap.md

Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
+---
+author: "Onur Solmaz<onur@textcortex.com>"
+date: 2024-08-01
+title: "JSON-DOC"
+---
+
+# JSON-DOC Implementation Roadmap
+
+- [ ] Create JSONSchema for each block type.
+- [ ] Implement converters into JSON-DOC
+  - [ ] Multimodal-LLM based PDF/raster image -> JSON-DOC (Most important)
+  - [ ] HTML -> JSON-DOC
+  - [ ] DOCX -> JSON-DOC
+  - [ ] XLSX -> JSON-DOC
+  - [ ] PPTX -> JSON-DOC
+  - [ ] CSV -> JSON-DOC
+  - [ ] Google Docs -> JSON-DOC (lower priority compared to DOCX)
+  - [ ] Google Sheets -> JSON-DOC
+  - [ ] Google Slides -> JSON-DOC
+- [ ] Implement converters from JSON-DOC
+  - [ ] JSON-DOC -> Markdown/plain text with tabular metadata for injecting into LLM context.
+  - [ ] Ability to reference, extract and render a certain table range. (Important for scrolling in spreadsheets)
+- [ ] Frontend for JSON-DOC
+  - [ ] JavaScript renderer for JSON-DOC to render it in the browser.
+
+# JSON-DOC Schema
+
+We will implement a JSONSchema for a Notion page and each block type.
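To illustrate what a per-block-type schema could look like, here is a minimal sketch, not part of this commit. `PARAGRAPH_SCHEMA` is a hypothetical schema for a paragraph block and may differ from the committed one. The `is_valid` function is a hand-rolled stand-in that understands only the few keywords used here (`type`, `const`, `required`, `properties`); a real setup would use a full JSON Schema validator.

```python
# Hypothetical schema for a paragraph block; the committed schema may differ.
PARAGRAPH_SCHEMA = {
    "type": "object",
    "required": ["object", "type", "paragraph"],
    "properties": {
        "object": {"const": "block"},
        "type": {"const": "paragraph"},
        "paragraph": {
            "type": "object",
            "required": ["rich_text"],
            "properties": {"rich_text": {"type": "array"}},
        },
    },
}

_JSON_TYPES = {"object": dict, "array": list, "string": str}


def is_valid(instance, schema: dict) -> bool:
    """Check `instance` against the small keyword subset used above."""
    if "const" in schema and instance != schema["const"]:
        return False
    expected = schema.get("type")
    if expected is not None and not isinstance(instance, _JSON_TYPES[expected]):
        return False
    if isinstance(instance, dict):
        # Every required key must be present.
        if any(key not in instance for key in schema.get("required", [])):
            return False
        # Present keys are checked recursively against their subschemas.
        for key, subschema in schema.get("properties", {}).items():
            if key in instance and not is_valid(instance[key], subschema):
                return False
    return True


good = {"object": "block", "type": "paragraph", "paragraph": {"rich_text": []}}
bad = {"object": "block", "type": "paragraph", "paragraph": {}}
```

Pinning `type` with `const` lets each block type have its own schema, which a block-level schema could then combine.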
+
+## Page
+
+- [x] Page block
+
+See https://developers.notion.com/reference/block for the authoritative Notion specification.
+
+## Blocks
+
+### Rich text (See https://developers.notion.com/reference/rich-text)
+
+These are not "official" blocks, but exist under the `rich_text` key in some blocks.
+
+- [x] `type: text`
+- [x] `type: equation`
+  - Inline equations.
+  - Will be rendered using KaTeX on the client side.
+- [ ] ~~`type: mention`~~
+  - Won't implement for now
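A minimal sketch of how the two supported rich text types could be flattened to a string, not part of this commit. The field names (`text.content`, `equation.expression`) follow the Notion rich text reference linked above; the function name `rich_text_to_plain` and the `$...$` delimiters for inline equations are assumptions.

```python
def rich_text_to_plain(rich_text: list[dict]) -> str:
    """Flatten a `rich_text` array into a plain string."""
    parts = []
    for item in rich_text:
        if item["type"] == "text":
            parts.append(item["text"]["content"])
        elif item["type"] == "equation":
            # Assumption: wrap inline equations in `$` so a KaTeX-style
            # renderer can locate them later.
            parts.append(f"${item['equation']['expression']}$")
        # Other types (for example `mention`) are skipped, as in the roadmap.
    return "".join(parts)


sample = [
    {"type": "text", "text": {"content": "Energy: "}},
    {"type": "equation", "equation": {"expression": "E = mc^2"}},
]
plain = rich_text_to_plain(sample)
```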
+
+### Other text-type blocks
+
+- [x] `type: paragraph`
+- [x] `type: heading_1`
+- [x] `type: heading_2`
+- [x] `type: heading_3`
+- [x] `type: code`
+- [x] `type: equation`
+  - Block-level equations.
+- [x] `type: quote`
+- [ ] ~~`type: callout`~~
+  - Won't implement for now
+
+### List item blocks
+
+- [x] `type: bulleted_list_item`
+- [x] `type: numbered_list_item`
+- [x] `type: to_do`
+
+### Table blocks
+
+- [x] `type: table`
+- [x] `type: table_row`
+
+### Non-text blocks
+
+- [x] `type: image`
+- [ ] ~~`type: file`~~
+  - Won't implement for now
+- [ ] ~~`type: pdf`~~
+  - Won't implement for now
+- [ ] ~~`type: embed`~~
+  - Won't implement for now
+- [ ] ~~`type: video`~~
+  - Won't implement for now
+
+
+### Page/Container type blocks
+
+- [x] `type: column`
+- [x] `type: column_list`
+- [ ] `type: table_of_contents`
+- [ ] `type: child_page`
+- [x] `type: divider`
+  - Might implement, might not be necessary for the current document conversion use case
+- [ ] `type: synced_block`
+  - Might implement, ditto
+- [ ] ~~`type: toggle`~~
+  - Won't implement
+
+### Link-related blocks
+
+- [ ] ~~`type: link_preview`~~
+  - Won't implement
+- [ ] ~~`type: link_to_page`~~
+  - Won't implement
+- [ ] ~~`type: bookmark`~~
+  - Won't implement
+
+### Notion-specific blocks
+
+- [ ] `type: child_database`
+- [ ] ~~`type: breadcrumb`~~
+  - Won't implement
+- [ ] ~~`type: unsupported`~~
+  - Meta block, not needed
+
+### Deprecated blocks
+
+- ~~`type: template`~~
