Commit f333d7f

feat: Json elements to HTML converter (#3936)
## NOTE
`test_unstructured_ingest/expected-structured-output-html` contains all the test HTML fixtures. The JSON files from which these fixtures are generated were taken from `test_unstructured_ingest/expected-structured-output`.
1 parent 43b682a commit f333d7f

186 files changed: +187279 -9 lines changed

.github/workflows/ci.yml

Lines changed: 26 additions & 0 deletions
@@ -319,6 +319,32 @@ jobs:
           make install-ingest
           ./test_unstructured_ingest/test-ingest-src.sh
 
+  test_json_to_html:
+    strategy:
+      matrix:
+        python-version: ["3.9","3.10"]
+    runs-on: ubuntu-latest-m
+    needs: [setup, lint]
+    steps:
+    - uses: 'actions/checkout@v4'
+    - name: Set up Python ${{ matrix.python-version }}
+      uses: actions/setup-python@v5
+      with:
+        python-version: ${{ matrix.python-version }}
+    - name: Get full Python version
+      id: full-python-version
+      run: echo version=$(python -c "import sys; print('-'.join(str(v) for v in sys.version_info))") >> $GITHUB_OUTPUT
+    - name: Setup virtual environment
+      uses: ./.github/actions/base-cache
+      with:
+        python-version: ${{ matrix.python-version }}
+    - name: Test HTML fixtures
+      env:
+        OVERWRITE_FIXTURES: "false"
+        PYTHONPATH: ${{ github.workspace }}
+      run: |
+        source .venv/bin/activate
+        ./test_unstructured_ingest/check-diff-expected-output-html.sh
 
   test_unstructured_api_unit:
     strategy:
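The "Get full Python version" step above joins every field of `sys.version_info` with hyphens and appends the result to `$GITHUB_OUTPUT` as `version=...`. A minimal sketch of the string it produces follows; the `VersionInfo` stand-in and the `full_python_version` helper are hypothetical names used here only so the output is reproducible, and are not part of the workflow.

```python
from collections import namedtuple

# Hypothetical stand-in for sys.version_info, so the result does not depend
# on the interpreter running this sketch.
VersionInfo = namedtuple("VersionInfo", "major minor micro releaselevel serial")


def full_python_version(version_info) -> str:
    """Join every field of a version_info-like tuple with '-'."""
    return "-".join(str(v) for v in version_info)


example = full_python_version(VersionInfo(3, 10, 12, "final", 0))
# The workflow step appends a line of this shape to $GITHUB_OUTPUT.
print(f"version={example}")  # prints "version=3-10-12-final-0"
```

Passing the real `sys.version_info` to the helper gives the value for the running interpreter.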

.github/workflows/ingest-test-fixtures-update-pr.yml

Lines changed: 11 additions & 7 deletions
@@ -16,10 +16,10 @@ jobs:
1616
env:
1717
NLTK_DATA: ${{ github.workspace }}/nltk_data
1818
steps:
19-
- uses: actions/checkout@v3
20-
- uses: ./.github/actions/base-cache
21-
with:
22-
python-version: ${{ env.PYTHON_VERSION }}
19+
- uses: actions/checkout@v3
20+
- uses: ./.github/actions/base-cache
21+
with:
22+
python-version: ${{ env.PYTHON_VERSION }}
2323

2424
setup_ingest:
2525
runs-on: ubuntu-latest
@@ -31,14 +31,14 @@ jobs:
3131
- uses: ./.github/actions/base-ingest-cache
3232
with:
3333
python-version: ${{ env.PYTHON_VERSION }}
34-
check-only: 'true'
34+
check-only: "true"
3535

3636
update-fixtures-and-pr:
3737
runs-on: ubuntu-latest-m
3838
needs: [setup_ingest]
3939
steps:
4040
# actions/checkout MUST come before auth
41-
- uses: 'actions/checkout@v4'
41+
- uses: "actions/checkout@v4"
4242
- name: Set up Python ${{ env.PYTHON_VERSION }}
4343
uses: actions/setup-python@v5
4444
with:
@@ -53,7 +53,7 @@ jobs:
5353
- name: Setup docker-compose
5454
uses: KengoTODA/actions-setup-docker-compose@v1
5555
with:
56-
version: '2.22.0'
56+
version: "2.22.0"
5757
- name: Update test fixtures
5858
env:
5959
AIRTABLE_PERSONAL_ACCESS_TOKEN: ${{ secrets.AIRTABLE_PERSONAL_ACCESS_TOKEN }}
@@ -111,6 +111,10 @@ jobs:
111111
tesseract --version
112112
python -m nltk.downloader punkt_tab averaged_perceptron_tagger_eng
113113
./test_unstructured_ingest/test-ingest-src.sh
114+
- name: Update HTML fixtures
115+
run: |
116+
source .venv/bin/activate
117+
make html-fixtures-update
114118
115119
- name: Save branch name to environment file
116120
id: branch

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -204,6 +204,7 @@ example-docs/*_images
 examples/**/output/
 
 outputdiff.txt
+outputhtmldiff.txt
 metricsdiff.txt
 
 # analysis

CHANGELOG.md

Lines changed: 3 additions & 1 deletion
@@ -1,11 +1,13 @@
-## 0.16.24-dev2
+## 0.16.24-dev3
 
 ### Enhancements
 
 - **`extract_image_block_types` now also works for CamelCase element type names**. Previously `NarrativeText` and similar CamelCase element types couldn't be extracted using the mentioned parameter in `partition`. Now figures for those elements can be extracted like `Image` and `Table` elements
 
 ### Features
 
+- **Add JSON elements to HTML converter** - Converts JSON elements file into an HTML file.
+
 ### Fixes
 
 ## 0.16.23

Makefile

Lines changed: 9 additions & 0 deletions
@@ -327,3 +327,12 @@ docker-jupyter-notebook:
 .PHONY: run-jupyter
 run-jupyter:
 	PYTHONPATH=$(realpath .) JUPYTER_PATH=$(realpath .) jupyter-notebook --NotebookApp.token='' --NotebookApp.password=''
+
+
+###########
+# Other #
+###########
+
+.PHONY: html-fixtures-update
+html-fixtures-update:
+	test_unstructured_ingest/structured-json-to-html.sh test_unstructured_ingest/expected-structured-output-html

scripts/html/elements_json_to_html.py

Lines changed: 66 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,66 @@
1+
import argparse
2+
import logging
3+
import os
4+
from pathlib import Path
5+
6+
from unstructured.partition.html.convert import elements_to_html
7+
from unstructured.staging.base import elements_from_json
8+
9+
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
10+
logger = logging.getLogger(__name__)
11+
12+
13+
def json_to_html(
14+
filepath: Path, outdir: Path, exclude_binary_image_data: bool, no_group_by_page: bool
15+
):
16+
logger.info("Processing: %s", filepath)
17+
elements = elements_from_json(str(filepath))
18+
elements_html = elements_to_html(elements, exclude_binary_image_data, no_group_by_page)
19+
20+
outpath = outdir / filepath.with_suffix(".html").name
21+
os.makedirs(outpath.parent, exist_ok=True)
22+
with open(outpath, "w+") as f:
23+
f.write(elements_html)
24+
logger.info("HTML rendered and saved to: %s", outpath)
25+
26+
27+
def multiple_json_to_html(
28+
path: Path, outdir: Path, exclude_binary_image_data: bool, no_group_by_page: bool
29+
):
30+
for root, _, files in os.walk(path):
31+
for file in files:
32+
if file.endswith(".json"):
33+
json_file_path = Path(root) / file
34+
outpath = outdir / json_file_path.relative_to(path).parent
35+
json_to_html(json_file_path, outpath, exclude_binary_image_data, no_group_by_page)
36+
37+
38+
def main():
39+
parser = argparse.ArgumentParser(description="Convert JSON elements to HTML.")
40+
parser.add_argument(
41+
"filepath",
42+
type=str,
43+
help="""Path to the JSON file or directory containing elements.
44+
If given directory it will convert all JSON files in directory
45+
and all sub-directories.""",
46+
)
47+
parser.add_argument(
48+
"--outdir", type=str, help="Output directory for the HTML file.", default=""
49+
)
50+
parser.add_argument(
51+
"--exclude-img", action="store_true", help="Exclude binary image data from the HTML."
52+
)
53+
parser.add_argument("--no-group", action="store_true", help="Don't group elements by pages.")
54+
args = parser.parse_args()
55+
56+
filepath = Path(args.filepath)
57+
outdir = Path(args.outdir)
58+
59+
if filepath.is_file():
60+
json_to_html(filepath, outdir, args.exclude_img, args.no_group)
61+
else:
62+
multiple_json_to_html(filepath, outdir, args.exclude_img, args.no_group)
63+
64+
65+
if __name__ == "__main__":
66+
main()
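
In directory mode the script mirrors the input tree under `--outdir` and swaps only the final suffix for `.html`. The sketch below isolates that path arithmetic from `multiple_json_to_html` and `json_to_html`. The helper name `html_output_path` and the example paths are hypothetical, and `PurePosixPath` is used instead of `Path` only so the result is the same on every platform.

```python
from pathlib import PurePosixPath


def html_output_path(
    json_file: PurePosixPath, src_root: PurePosixPath, outdir: PurePosixPath
) -> PurePosixPath:
    """Compute where the HTML for json_file would be written."""
    # Same expression as in multiple_json_to_html: keep the sub-directory
    # of the input file relative to the walked root.
    relative_dir = json_file.relative_to(src_root).parent
    # Same expression as in json_to_html: only the last suffix is replaced.
    return outdir / relative_dir / json_file.with_suffix(".html").name


root = PurePosixPath("in")
out = PurePosixPath("out")
nested = html_output_path(PurePosixPath("in/pdf/report.pdf.json"), root, out)
top = html_output_path(PurePosixPath("in/notes.json"), root, out)
print(nested)  # prints "out/pdf/report.pdf.html"
print(top)     # prints "out/notes.html"
```

Based on the argument parser in the script, an invocation would look like `python scripts/html/elements_json_to_html.py <file-or-directory> --outdir <dir>`, optionally with `--exclude-img` and `--no-group`. Because `--outdir` defaults to an empty string, omitting it writes relative to the current working directory.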
