feat: add partition_epub function (Unstructured-IO#364)
* add pypandoc dependency

* added epub partitioner and file conversion

* test for partition_epub

* tests for file conversion

* add epub to filetype detection

* added epub to auto partition

* update bricks docs

* updated installing docs

* changelog and version

* add pandoc to dependencies

* add pandoc to debian dependencies

* linting, linting, linting

* typo fix

* typo fix

* file conversion type hints

* more type hints

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
MthwRobinson and qued authored Mar 14, 2023
1 parent aa49462 commit e43cb0e
Showing 18 changed files with 206 additions and 7 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -105,7 +105,7 @@ jobs:
source .venv/bin/activate
make install-detectron2
sudo apt-get update
-sudo apt-get install -y libmagic-dev poppler-utils tesseract-ocr libreoffice
+sudo apt-get install -y libmagic-dev poppler-utils tesseract-ocr libreoffice pandoc
make test
make check-coverage
make install-ingest-s3
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,4 @@
-## 0.5.4-dev7
+## 0.5.4

### Enhancements

@@ -21,6 +21,7 @@

* Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting
from `FsspecConnector`
* Add `partition_epub` for partitioning e-books in EPUB3 format.

### Fixes

4 changes: 2 additions & 2 deletions README.md
@@ -110,7 +110,7 @@ file to ensure your code matches the formatting and linting standards used in `u
If you'd prefer not having code changes auto-tidied before every commit, you can use `make check` to see
whether any linting or formatting changes should be applied, and `make tidy` to apply them.

-If using the optional `pre-commit`, you'll just need to install the hooks with `pre-commit install` since the
+If using the optional `pre-commit`, you'll just need to install the hooks with `pre-commit install` since the
`pre-commit` package is installed as part of `make install` mentioned above. Finally, if you decided to use `pre-commit`
you can also uninstall the hooks with `pre-commit uninstall`.

@@ -119,7 +119,7 @@ you can also uninstall the hooks with `pre-commit uninstall`.
You can run this [Colab notebook](https://colab.research.google.com/drive/1U8VCjY2-x8c6y5TYMbSFtQGlQVFHCVIW) to run the examples below.

The following examples show how to get started with the `unstructured` library.
-You can parse **TXT**, **HTML**, **PDF**, **EML**, **DOC**, **DOCX**, **PPT**, **PPTX**, **JPG**,
+You can parse **TXT**, **HTML**, **PDF**, **EML**, **EPUB**, **DOC**, **DOCX**, **PPT**, **PPTX**, **JPG**,
and **PNG** documents with one line of code!
<br></br>
See our [documentation page](https://unstructured-io.github.io/unstructured) for a full description
37 changes: 36 additions & 1 deletion docs/source/bricks.rst
@@ -82,7 +82,7 @@ If you call the ``partition`` function, ``unstructured`` will attempt to detect
file type and route it to the appropriate partitioning brick. All partitioning bricks
called within ``partition`` are called using the default kwargs. Use the document-type
specific bricks if you need to apply non-default settings.
-``partition`` currently supports ``.docx``, ``.doc``, ``.pptx``, ``.ppt``, ``.eml``, ``.html``, ``.pdf``,
+``partition`` currently supports ``.docx``, ``.doc``, ``.pptx``, ``.ppt``, ``.eml``, ``.epub``, ``.html``, ``.pdf``,
``.png``, ``.jpg``, and ``.txt`` files.
If you set the ``include_page_breaks`` kwarg to ``True``, the output will include page breaks. This is only supported for ``.pptx``, ``.html``, ``.pdf``,
``.png``, and ``.jpg``.
@@ -306,6 +306,41 @@ Examples:
  elements = partition_email(text=text, include_headers=True)


``partition_epub``
---------------------

The ``partition_epub`` function processes e-books in EPUB3 format. The function
first converts the document to HTML using ``pandoc`` and then calls ``partition_html``.
You'll need `pandoc <https://pandoc.org/installing.html>`_ installed on your system
to use ``partition_epub``.


Examples:

.. code:: python

  from unstructured.partition.epub import partition_epub

  elements = partition_epub(filename="example-docs/winter-sports.epub")


``partition_md``
---------------------

The ``partition_md`` function provides the ability to parse markdown files. The
following workflow shows how to use ``partition_md``.


Examples:

.. code:: python

  from unstructured.partition.md import partition_md

  elements = partition_md(filename="README.md")


``partition_text``
---------------------

1 change: 1 addition & 0 deletions docs/source/installing.rst
@@ -15,6 +15,7 @@ installation.
* ``poppler-utils`` (images and PDFs)
* ``tesseract-ocr`` (images and PDFs)
* ``libreoffice`` (MS Office docs)
* ``pandoc`` (EPUBs)

* If you are parsing PDFs, run the following to install the ``detectron2`` model, which ``unstructured`` uses for layout detection:
* ``pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"``
Binary file added example-docs/winter-sports.epub
Binary file not shown.
5 changes: 5 additions & 0 deletions requirements/base.txt
@@ -4,6 +4,9 @@
#
# pip-compile --output-file=requirements/base.txt
#
--extra-index-url https://pypi.ngc.nvidia.com
--trusted-host pypi.ngc.nvidia.com

anyio==3.6.2
# via httpcore
argilla==1.4.0
@@ -72,6 +75,8 @@ pydantic==1.10.6
# via argilla
pygments==2.14.0
# via rich
pypandoc==1.11
# via unstructured (setup.py)
python-dateutil==2.8.2
# via pandas
python-docx==0.8.11
2 changes: 1 addition & 1 deletion scripts/setup_ubuntu.sh
@@ -84,7 +84,7 @@ $sudo $pac install -y poppler-utils

#### Tesseract
# Install tesseract as well as Russian language
-$sudo $pac install -y tesseract-ocr libtesseract-dev tesseract-ocr-rus libreoffice
+$sudo $pac install -y tesseract-ocr libtesseract-dev tesseract-ocr-rus libreoffice pandoc

#### libmagic
$sudo $pac install -y libmagic-dev
1 change: 1 addition & 0 deletions setup.py
@@ -56,6 +56,7 @@
"openpyxl",
"pandas",
"pillow",
"pypandoc",
"python-docx",
"python-pptx",
"python-magic",
23 changes: 23 additions & 0 deletions test_unstructured/file_utils/test_file_conversion.py
@@ -0,0 +1,23 @@
import os
import pathlib
from unittest.mock import patch

import pypandoc
import pytest

from unstructured.file_utils.file_conversion import convert_file_to_text

DIRECTORY = pathlib.Path(__file__).parent.resolve()


def test_convert_file_to_text():
    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "winter-sports.epub")
    html_text = convert_file_to_text(filename, source_format="epub", target_format="html")
    assert html_text.startswith("<p>")


def test_convert_to_file_raises_if_pandoc_not_available():
    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "winter-sports.epub")
    with patch.object(pypandoc, "convert_file", side_effect=FileNotFoundError):
        with pytest.raises(FileNotFoundError):
            convert_file_to_text(filename, source_format="epub", target_format="html")
3 changes: 3 additions & 0 deletions test_unstructured/file_utils/test_filetype.py
@@ -30,6 +30,7 @@
("fake-html.html", FileType.HTML),
("unsupported/fake-excel.xlsx", FileType.XLSX),
("fake-power-point.pptx", FileType.PPTX),
("winter-sports.epub", FileType.EPUB),
],
)
def test_detect_filetype_from_filename(file, expected):
Expand All @@ -50,6 +51,7 @@ def test_detect_filetype_from_filename(file, expected):
("fake-html.html", FileType.HTML),
("unsupported/fake-excel.xlsx", FileType.XLSX),
("fake-power-point.pptx", FileType.PPTX),
("winter-sports.epub", FileType.EPUB),
],
)
def test_detect_filetype_from_filename_with_extension(monkeypatch, file, expected):
@@ -73,6 +75,7 @@ def test_detect_filetype_from_filename_with_extension(monkeypatch, file, expecte
("fake-html.html", FileType.HTML),
("unsupported/fake-excel.xlsx", FileType.XLSX),
("fake-power-point.pptx", FileType.PPTX),
("winter-sports.epub", FileType.EPUB),
],
)
def test_detect_filetype_from_file(file, expected):
15 changes: 15 additions & 0 deletions test_unstructured/partition/test_auto.py
@@ -277,3 +277,18 @@ def test_auto_with_page_breaks():
    filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper-fast.pdf")
    elements = partition(filename=filename, include_page_breaks=True)
    assert PageBreak() in elements


def test_auto_partition_epub_from_filename():
    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "winter-sports.epub")
    elements = partition(filename=filename)
    assert len(elements) > 0
    assert elements[0].text.startswith("The Project Gutenberg eBook of Winter Sports")


def test_auto_partition_epub_from_file():
    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "winter-sports.epub")
    with open(filename, "rb") as f:
        elements = partition(file=f)
    assert len(elements) > 0
    assert elements[0].text.startswith("The Project Gutenberg eBook of Winter Sports")
21 changes: 21 additions & 0 deletions test_unstructured/partition/test_epub.py
@@ -0,0 +1,21 @@
import os
import pathlib

from unstructured.partition.epub import partition_epub

DIRECTORY = pathlib.Path(__file__).parent.resolve()


def test_partition_epub_from_filename():
    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "winter-sports.epub")
    elements = partition_epub(filename=filename)
    assert len(elements) > 0
    assert elements[0].text.startswith("The Project Gutenberg eBook of Winter Sports")


def test_partition_epub_from_file():
    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "winter-sports.epub")
    with open(filename, "rb") as f:
        elements = partition_epub(file=f)
    assert len(elements) > 0
    assert elements[0].text.startswith("The Project Gutenberg eBook of Winter Sports")
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.5.4-dev7" # pragma: no cover
+__version__ = "0.5.4" # pragma: no cover
49 changes: 49 additions & 0 deletions unstructured/file_utils/file_conversion.py
@@ -0,0 +1,49 @@
import tempfile
from typing import IO, Optional

import pypandoc

from unstructured.partition.common import exactly_one


def convert_file_to_text(filename: str, source_format: str, target_format: str) -> str:
    """Uses pandoc to convert the source document to a raw text string."""
    try:
        # Pass the requested formats through to pandoc.
        text = pypandoc.convert_file(filename, target_format, format=source_format)
    except FileNotFoundError as err:
        msg = (
            "Error converting the file to text. Ensure you have the pandoc "
            "package installed on your system. Install instructions are available at "
            "https://pandoc.org/installing.html. The original exception text was:\n"
            f"{err}"
        )
        raise FileNotFoundError(msg)

    return text


def convert_epub_to_html(
    filename: Optional[str] = None,
    file: Optional[IO] = None,
) -> str:
    """Converts an EPUB document to HTML raw text. Enables an EPUB document to be
    processed using the partition_html function."""
    exactly_one(filename=filename, file=file)

    if file is not None:
        # pandoc reads from a path, so the file-like object is written to disk first.
        tmp = tempfile.NamedTemporaryFile(delete=False)
        tmp.write(file.read())
        tmp.close()
        html_text = convert_file_to_text(
            filename=tmp.name,
            source_format="epub",
            target_format="html",
        )
    elif filename is not None:
        html_text = convert_file_to_text(
            filename=filename,
            source_format="epub",
            target_format="html",
        )

    return html_text
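
`convert_epub_to_html` writes the uploaded file to a `NamedTemporaryFile(delete=False)` and does not remove it afterwards. One possible way to tidy that up is sketched below. The helper name `temporary_copy` is hypothetical and not part of this commit; it only assumes that the input is a binary file-like object.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import IO, Iterator


@contextmanager
def temporary_copy(file: IO[bytes], suffix: str = "") -> Iterator[str]:
    """Copy a binary file-like object to a named temporary file and yield its path.

    Hypothetical helper: the temporary file is removed when the context exits,
    even if the body raises.
    """
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=suffix)
    try:
        tmp.write(file.read())
        tmp.close()  # close before handing the path to another process
        yield tmp.name
    finally:
        tmp.close()  # harmless if already closed
        os.remove(tmp.name)
```

With such a helper, the `file is not None` branch could wrap its call to `convert_file_to_text` in `with temporary_copy(file) as path:` so that nothing is left behind in the temp directory.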
10 changes: 10 additions & 0 deletions unstructured/file_utils/filetype.py
@@ -47,6 +47,11 @@
    "text/x-markdown",
]

EPUB_MIME_TYPES = [
    "application/epub",
    "application/epub+zip",
]

# NOTE(robinson) - .docx/.xlsx files are actually zip files with a .docx/.xlsx extension.
# If the MIME type is application/octet-stream, we check if it's a .docx/.xlsx file by
# looking for expected filenames within the zip file.
@@ -94,6 +99,7 @@ class FileType(Enum):
    HTML = 50
    XML = 51
    MD = 52
    EPUB = 53

    # Compressed Types
    ZIP = 60
@@ -123,6 +129,7 @@ def __lt__(self, other):
    ".ppt": FileType.PPT,
    ".rtf": FileType.RTF,
    ".json": FileType.JSON,
    ".epub": FileType.EPUB,
}


@@ -180,6 +187,9 @@ def detect_filetype(
        # NOTE - I am not sure whether libmagic ever returns these mimetypes.
        return FileType.MD

    elif mime_type in EPUB_MIME_TYPES:
        return FileType.EPUB

    elif mime_type in TXT_MIME_TYPES:
        if extension and extension == ".eml":
            return FileType.EML
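
The hunk above relies on the MIME type reported by libmagic. As a complementary sketch that is not part of this commit, an EPUB container can also be recognized from its contents: it is a zip archive with a `mimetype` entry holding `application/epub+zip`. The function name `looks_like_epub` is hypothetical.

```python
import io
import zipfile
from typing import IO

EPUB_MIMETYPE = b"application/epub+zip"


def looks_like_epub(file: IO[bytes]) -> bool:
    """Return True if the stream is a zip archive whose `mimetype` entry says EPUB.

    Hypothetical helper; the stream position is restored before returning.
    """
    start = file.tell()
    try:
        with zipfile.ZipFile(file) as archive:
            return archive.read("mimetype").strip() == EPUB_MIMETYPE
    except (zipfile.BadZipFile, KeyError):
        # Not a zip archive, or a zip archive without a `mimetype` entry.
        return False
    finally:
        file.seek(start)
```

A check like this could serve as a fallback when libmagic reports a generic type such as `application/zip` or `application/octet-stream`, similar in spirit to the existing .docx/.xlsx handling described in the NOTE above.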
3 changes: 3 additions & 0 deletions unstructured/partition/auto.py
@@ -4,6 +4,7 @@
from unstructured.partition.doc import partition_doc
from unstructured.partition.docx import partition_docx
from unstructured.partition.email import partition_email
from unstructured.partition.epub import partition_epub
from unstructured.partition.html import partition_html
from unstructured.partition.image import partition_image
from unstructured.partition.json import partition_json
@@ -59,6 +60,8 @@ def partition(
            include_page_breaks=include_page_breaks,
            encoding=encoding,
        )
    elif filetype == FileType.EPUB:
        return partition_epub(filename=filename, file=file, include_page_breaks=include_page_breaks)
    elif filetype == FileType.MD:
        return partition_md(filename=filename, file=file, include_page_breaks=include_page_breaks)
    elif filetype == FileType.PDF:
32 changes: 32 additions & 0 deletions unstructured/partition/epub.py
@@ -0,0 +1,32 @@
from typing import IO, List, Optional

from unstructured.documents.elements import Element
from unstructured.file_utils.file_conversion import convert_epub_to_html
from unstructured.partition.html import partition_html


def partition_epub(
    filename: Optional[str] = None,
    file: Optional[IO] = None,
    include_page_breaks: bool = False,
) -> List[Element]:
    """Partitions an EPUB document. The document is first converted to HTML and then
    partitioned using partition_html.

    Parameters
    ----------
    filename
        A string defining the target filename path.
    file
        A file-like object using "rb" mode --> open(filename, "rb").
    include_page_breaks
        If True, the output will include page breaks if the filetype supports it
    """
    html_text = convert_epub_to_html(filename=filename, file=file)
    # NOTE(robinson) - pypandoc returns a text string with unicode encoding
    # ref: https://github.com/JessicaTegner/pypandoc#usage
    return partition_html(
        text=html_text,
        include_page_breaks=include_page_breaks,
        encoding="unicode",
    )
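
Because `partition_epub` depends on a system-level pandoc install, callers may want to check for it up front rather than wait for the `FileNotFoundError` raised during conversion. A minimal sketch follows, assuming pandoc is resolved through `PATH`. pypandoc may locate the binary by other means, so treat this as an approximation. The names `pandoc_available` and `require_pandoc` are hypothetical.

```python
import shutil


def pandoc_available() -> bool:
    """Return True if a `pandoc` executable can be found on PATH."""
    return shutil.which("pandoc") is not None


def require_pandoc() -> None:
    """Raise a descriptive error if pandoc cannot be found on PATH."""
    if not pandoc_available():
        raise FileNotFoundError(
            "pandoc was not found on PATH. "
            "Install instructions: https://pandoc.org/installing.html"
        )
```

An application could call `require_pandoc()` once at startup, or skip EPUB inputs when `pandoc_available()` returns `False`.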
