Add HTML -> JSON-DOC converter #3

Merged
merged 45 commits into from
Sep 12, 2024
4a23c57
Add converter script from Markdownify
osolmaz Aug 21, 2024
da4c811
Add HTML example
osolmaz Sep 3, 2024
665bba5
Minor
osolmaz Sep 3, 2024
f5aff3b
Add html converstion test, wip
osolmaz Sep 3, 2024
833b623
Add nested paragraphs
osolmaz Sep 3, 2024
852918d
Correct children relationships
osolmaz Sep 4, 2024
0e6ee44
Minor
osolmaz Sep 4, 2024
61c2dc3
Time tests
osolmaz Sep 4, 2024
4c127df
Implement ALLOWED_CHILDREN_BLOCK_TYPES
osolmaz Sep 4, 2024
06a105b
wip
osolmaz Sep 4, 2024
d1144dc
Default const values, create mermaid diagram for HTML
osolmaz Sep 5, 2024
5ba5a4e
wip
osolmaz Sep 5, 2024
062c913
Basic example works
osolmaz Sep 5, 2024
b2b10a5
Add note
osolmaz Sep 5, 2024
8a4c8ea
Add more doc
osolmaz Sep 6, 2024
126ba9c
Cleanup
osolmaz Sep 6, 2024
9557c76
Add JSON-DOC to Markdown converter, wip
osolmaz Sep 6, 2024
e8ccdea
Convert more blocks into markdown, wip
osolmaz Sep 6, 2024
73b3128
wip
osolmaz Sep 6, 2024
afe4c34
Implement reconcile_to_rich_text()
osolmaz Sep 7, 2024
ac75505
Add converter script
osolmaz Sep 8, 2024
b1d1267
Added 1 html to jsondoc test example
osolmaz Sep 8, 2024
7c817a9
Add another test
osolmaz Sep 8, 2024
3364f79
Implement reconcile_to_block(), wip
osolmaz Sep 9, 2024
b31f217
Minor
osolmaz Sep 9, 2024
f8ecf32
Implement table support
osolmaz Sep 10, 2024
d2b8411
Add <br> support
osolmaz Sep 10, 2024
914f23c
Add <ul> and <ol> support
osolmaz Sep 10, 2024
3b5d92d
Implement create_page()
osolmaz Sep 11, 2024
75363de
Handle table captions
osolmaz Sep 11, 2024
b0f79b1
Can convert html_all_elements.html
osolmaz Sep 11, 2024
f1b9d2f
Handle image captions
osolmaz Sep 11, 2024
8e6d922
Minor
osolmaz Sep 11, 2024
27087eb
Minor
osolmaz Sep 11, 2024
af521c0
Remove anthropic from deps
osolmaz Sep 11, 2024
488b8f6
Add test for <a> element, cleanup
osolmaz Sep 11, 2024
73d972c
Cleanup
osolmaz Sep 11, 2024
78accbe
Cleanup
osolmaz Sep 11, 2024
493e0e9
Minor
osolmaz Sep 11, 2024
bdc82b4
Rename docs
osolmaz Sep 12, 2024
06c72f9
Add Pypandoc for conversion from other formats
osolmaz Sep 12, 2024
c8e196e
Improve converter script logic
osolmaz Sep 12, 2024
9c69bba
Update doc
osolmaz Sep 12, 2024
91df280
Minor
osolmaz Sep 12, 2024
92d9727
Add test to workflow
osolmaz Sep 12, 2024

Add Pypandoc for conversion from other formats
osolmaz committed Sep 12, 2024
commit 06c72f91d72d9daf2e61c224c7075ada37db883c
233 changes: 233 additions & 0 deletions examples/markdown/markdown_syntax_ex1.md
@@ -0,0 +1,233 @@
# h1 Heading
## h2 Heading
### h3 Heading
#### h4 Heading
##### h5 Heading
###### h6 Heading


## Horizontal Rules

___

---

***


## Typographic replacements

Enable typographer option to see result.

(c) (C) (r) (R) (tm) (TM) (p) (P) +-

test.. test... test..... test?..... test!....

!!!!!! ???? ,, -- ---

"Smartypants, double quotes" and 'single quotes'


## Emphasis

**This is bold text**

__This is bold text__

*This is italic text*

_This is italic text_

~~Strikethrough~~


## Blockquotes


> Blockquotes can also be nested...
>> ...by using additional greater-than signs right next to each other...
> > > ...or with spaces between arrows.


## Lists

Unordered

+ Create a list by starting a line with `+`, `-`, or `*`
+ Sub-lists are made by indenting 2 spaces:
  - Marker character change forces new list start:
    * Ac tristique libero volutpat at
    + Facilisis in pretium nisl aliquet
    - Nulla volutpat aliquam velit
+ Very easy!

Ordered

1. Lorem ipsum dolor sit amet
2. Consectetur adipiscing elit
3. Integer molestie lorem at massa


1. You can use sequential numbers...
1. ...or keep all the numbers as `1.`

Start numbering with offset:

57. foo
1. bar


## Code

Inline `code`

Indented code

    // Some comments
    line 1 of code
    line 2 of code
    line 3 of code


Block code "fences"

```
Sample text here...
```

Syntax highlighting

``` js
var foo = function (bar) {
  return bar++;
};

console.log(foo(5));
```

## Tables

| Option | Description |
| ------ | ----------- |
| data | path to data files to supply the data that will be passed into templates. |
| engine | engine to be used for processing templates. Handlebars is the default. |
| ext | extension to be used for dest files. |

Right aligned columns

| Option | Description |
| ------:| -----------:|
| data | path to data files to supply the data that will be passed into templates. |
| engine | engine to be used for processing templates. Handlebars is the default. |
| ext | extension to be used for dest files. |


## Links

[link text](http://dev.nodeca.com)

[link with title](http://nodeca.github.io/pica/demo/ "title text!")

Autoconverted link https://github.com/nodeca/pica (enable linkify to see)


## Images

![Minion](https://octodex.github.com/images/minion.png)
![Stormtroopocat](https://octodex.github.com/images/stormtroopocat.jpg "The Stormtroopocat")

Like links, Images also have a footnote style syntax

![Alt text][id]

With a reference later in the document defining the URL location:

[id]: https://octodex.github.com/images/dojocat.jpg "The Dojocat"


## Plugins

The killer feature of `markdown-it` is very effective support of
[syntax plugins](https://www.npmjs.org/browse/keyword/markdown-it-plugin).


### [Emojies](https://github.com/markdown-it/markdown-it-emoji)

> Classic markup: :wink: :cry: :laughing: :yum:
>
> Shortcuts (emoticons): :-) :-( 8-) ;)

see [how to change output](https://github.com/markdown-it/markdown-it-emoji#change-output) with twemoji.


### [Subscript](https://github.com/markdown-it/markdown-it-sub) / [Superscript](https://github.com/markdown-it/markdown-it-sup)

- 19^th^
- H~2~O


### [\<ins>](https://github.com/markdown-it/markdown-it-ins)

++Inserted text++


### [\<mark>](https://github.com/markdown-it/markdown-it-mark)

==Marked text==


### [Footnotes](https://github.com/markdown-it/markdown-it-footnote)

Footnote 1 link[^first].

Footnote 2 link[^second].

Inline footnote^[Text of inline footnote] definition.

Duplicated footnote reference[^second].

[^first]: Footnote **can have markup**

    and multiple paragraphs.

[^second]: Footnote text.


### [Definition lists](https://github.com/markdown-it/markdown-it-deflist)

Term 1

: Definition 1
with lazy continuation.

Term 2 with *inline markup*

: Definition 2

        { some code, part of Definition 2 }

    Third paragraph of definition 2.

_Compact style:_

Term 1
~ Definition 1

Term 2
~ Definition 2a
~ Definition 2b


### [Abbreviations](https://github.com/markdown-it/markdown-it-abbr)

This is HTML abbreviation example.

It converts "HTML", but keep intact partial entries like "xxxHTMLyyy" and so on.

*[HTML]: Hyper Text Markup Language

### [Custom containers](https://github.com/markdown-it/markdown-it-container)

::: warning
*here be dragons*
:::
48 changes: 20 additions & 28 deletions jsondoc/bin/convert_jsondoc.py
@@ -1,30 +1,27 @@
 import argparse
-import json
+import pypandoc

 from jsondoc.convert.html import html_to_jsondoc
 from jsondoc.convert.markdown import jsondoc_to_markdown
-from jsondoc.serialize import jsondoc_dump_json, load_jsondoc
+from jsondoc.serialize import jsondoc_dump_json


 def convert_to_jsondoc(input_file, output_file=None, indent=None):
-    # Read the input file
-    with open(input_file, "r") as file:
-        content = file.read()
-
     # Determine the file type based on extension
     file_extension = input_file.split(".")[-1].lower()

     if file_extension in ["html", "htm"]:
-        # Convert HTML to jsondoc
-        jsondoc = html_to_jsondoc(content)
-    elif file_extension in ["md", "markdown"]:
-        # For markdown, we'll first convert to HTML, then to jsondoc
-        # This is a placeholder as we don't have a direct markdown to jsondoc converter
-        html_content = markdown_to_html(
-            content
-        )  # You'll need to implement this function
-        jsondoc = html_to_jsondoc(html_content)
+        with open(input_file, "r") as file:
+            html_content = file.read()
     else:
-        raise ValueError(f"Unsupported file type: {file_extension}")
+        try:
+            html_content = pypandoc.convert_file(input_file, "html")
+        except RuntimeError as e:
+            raise ValueError(f"File type not supported for conversion: {input_file}")
+
+    jsondoc = html_to_jsondoc(html_content)

     # Serialize the jsondoc
     serialized_jsondoc = jsondoc_dump_json(jsondoc, indent=indent)
@@ -40,15 +37,6 @@ def convert_to_jsondoc(input_file, output_file=None, indent=None):
     # print(jsondoc_to_markdown(jsondoc))


-def markdown_to_html(markdown_content):
-    # Placeholder function for markdown to HTML conversion
-    # You'll need to implement this using a markdown library
-    # For example, you could use the `markdown` library:
-    # import markdown
-    # return markdown.markdown(markdown_content)
-    raise NotImplementedError("Markdown to HTML conversion not implemented")
-
-
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="Convert files to jsondoc format")
     parser.add_argument("input_file", help="Path to the input file")
@@ -66,8 +54,12 @@ def markdown_to_html(markdown_content):
     )
     args = parser.parse_args()

-    convert_to_jsondoc(
-        args.input_file,
-        args.output_file,
-        indent=args.indent,
-    )
+    try:
+        convert_to_jsondoc(
+            args.input_file,
+            args.output_file,
+            indent=args.indent,
+        )
+    except ValueError as e:
+        print(e)
+        exit(1)
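
The dispatch in this commit (read HTML directly, hand everything else to Pandoc, turn a conversion failure into a `ValueError`) can be sketched in isolation. The version below is only an illustration, not the code in the PR: `read_as_html`, `main`, `HTML_SUFFIXES`, and the injected `convert_file` parameter are hypothetical names introduced here so the logic runs without Pandoc installed. In the real script the callable would presumably be `pypandoc.convert_file`. The sketch also chains the original exception with `from e`, writes the error to stderr, and uses `pathlib` for the extension check, none of which the commit itself does.

```python
import sys
from pathlib import Path
from typing import Callable, Sequence

# Hypothetical constant: suffixes that are read as-is instead of being converted.
HTML_SUFFIXES = {".html", ".htm"}


def read_as_html(input_file: str, convert_file: Callable[[str, str], str]) -> str:
    """Return HTML for input_file, converting non-HTML inputs via convert_file.

    convert_file is an assumption of this sketch: any callable shaped like
    pypandoc.convert_file(source_file, to) that raises RuntimeError on failure.
    """
    path = Path(input_file)
    if path.suffix.lower() in HTML_SUFFIXES:
        return path.read_text(encoding="utf-8")
    try:
        return convert_file(str(path), "html")
    except RuntimeError as e:
        # Chain the cause so the underlying converter message is not lost.
        raise ValueError(
            f"File type not supported for conversion: {input_file}"
        ) from e


def main(argv: Sequence[str], convert_file: Callable[[str, str], str]) -> int:
    """Tiny CLI wrapper: report conversion errors on stderr, return an exit code."""
    try:
        html = read_as_html(argv[0], convert_file)
    except ValueError as e:
        print(e, file=sys.stderr)
        return 1
    print(html)
    return 0
```

Passing the converter in as an argument is a design choice for testability only; a script could equally call `sys.exit(main(sys.argv[1:], pypandoc.convert_file))` at the bottom.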
13 changes: 12 additions & 1 deletion poetry.lock

1 change: 1 addition & 0 deletions pyproject.toml
@@ -20,6 +20,7 @@ pydantic = "^2.7.2"
 thefuzz = "^0.22.1"
 jsonschema = "^4.23.0"
 bs4 = "^0.0.2"
+pypandoc = "^1.13"


 [tool.poetry.group.dev.dependencies]
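
One practical consequence of this dependency: as far as I know, the plain `pypandoc` package is a wrapper and expects a `pandoc` executable to be installed separately. A pre-flight check along the following lines could make a missing binary easier to diagnose. This is a sketch under that assumption; `find_pandoc` and `require_pandoc` are hypothetical helpers, not part of this PR or of pypandoc, and a plain `PATH` lookup is a simplification of however pypandoc actually locates the binary.

```python
import shutil
from typing import Callable, Optional


def find_pandoc(which: Callable[[str], Optional[str]] = shutil.which) -> Optional[str]:
    """Return the path of the pandoc executable on PATH, or None if absent."""
    return which("pandoc")


def require_pandoc(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Return the pandoc path or raise a descriptive error (hypothetical helper)."""
    path = find_pandoc(which)
    if path is None:
        raise RuntimeError(
            "pandoc executable not found on PATH; non-HTML inputs cannot be converted"
        )
    return path
```

The `which` parameter exists only so the lookup can be replaced in tests; normal callers would use the default.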