chore: merge latests updates from main (#126)
* feat: extract metadata from `.docx`, `.xlsx`, and `.jpg` (#113)

* add python-docx dependency

* added function for extracting metadata from word documents

* add openpyxl

* added get_jpg_metadata; fixed typing

* bump changelog

* added pillow to dependencies

* build(deps): Bump transformers from 4.23.1 to 4.25.1 in /requirements (#114)

Bumps [transformers](https://github.com/huggingface/transformers) from 4.23.1 to 4.25.1.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](huggingface/transformers@v4.23.1...v4.25.1)

---
updated-dependencies:
- dependency-name: transformers
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): Bump pip-tools from 6.10.0 to 6.12.1 in /requirements (#115)

Bumps [pip-tools](https://github.com/jazzband/pip-tools) from 6.10.0 to 6.12.1.
- [Release notes](https://github.com/jazzband/pip-tools/releases)
- [Changelog](https://github.com/jazzband/pip-tools/blob/main/CHANGELOG.md)
- [Commits](jazzband/pip-tools@6.10.0...6.12.1)

---
updated-dependencies:
- dependency-name: pip-tools
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>

* build(deps): Bump torch from 1.13.0 to 1.13.1 in /requirements (#117)

Bumps [torch](https://github.com/pytorch/pytorch) from 1.13.0 to 1.13.1.
- [Release notes](https://github.com/pytorch/pytorch/releases)
- [Changelog](https://github.com/pytorch/pytorch/blob/master/RELEASE.md)
- [Commits](pytorch/pytorch@v1.13.0...v1.13.1)

---
updated-dependencies:
- dependency-name: torch
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>

* build(deps): Bump lxml from 4.9.1 to 4.9.2 in /requirements (#118)

Bumps [lxml](https://github.com/lxml/lxml) from 4.9.1 to 4.9.2.
- [Release notes](https://github.com/lxml/lxml/releases)
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt)
- [Commits](lxml/lxml@lxml-4.9.1...lxml-4.9.2)

---
updated-dependencies:
- dependency-name: lxml
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): Bump mypy from 0.990 to 0.991 in /requirements (#123)

Bumps [mypy](https://github.com/python/mypy) from 0.990 to 0.991.
- [Release notes](https://github.com/python/mypy/releases)
- [Commits](python/mypy@v0.990...v0.991)

---
updated-dependencies:
- dependency-name: mypy
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): Bump packaging from 21.3 to 22.0 in /requirements (#119)

Bumps [packaging](https://github.com/pypa/packaging) from 21.3 to 22.0.
- [Release notes](https://github.com/pypa/packaging/releases)
- [Changelog](https://github.com/pypa/packaging/blob/main/CHANGELOG.rst)
- [Commits](pypa/packaging@21.3...22.0)

---
updated-dependencies:
- dependency-name: packaging
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>

* build(deps): Bump pytz from 2022.6 to 2022.7 in /requirements (#122)

Bumps [pytz](https://github.com/stub42/pytz) from 2022.6 to 2022.7.
- [Release notes](https://github.com/stub42/pytz/releases)
- [Commits](stub42/pytz@release_2022.6...release_2022.7)

---
updated-dependencies:
- dependency-name: pytz
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): Bump pillow from 9.3.0 to 9.4.0 in /requirements (#120)

Bumps [pillow](https://github.com/python-pillow/Pillow) from 9.3.0 to 9.4.0.
- [Release notes](https://github.com/python-pillow/Pillow/releases)
- [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst)
- [Commits](python-pillow/Pillow@9.3.0...9.4.0)

---
updated-dependencies:
- dependency-name: pillow
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: Add `extract_attachment_info` (#112)

* Adds function to extract attachments and their metadata from eml files

* feat: helper functions to identify and extract phone numbers (#124)

* added pattern for finding phone numbers

* added cleaning brick for extracting phone numbers

* add docs

* changelog and bump version

* switch to us phone numbers

* bump dev version

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mallori Harrell <6825104+mallorih@users.noreply.github.com>
4 people authored Jan 4, 2023
1 parent e0a76ef commit c60fd1f
Showing 25 changed files with 570 additions and 69 deletions.
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -1,9 +1,13 @@
## 0.3.5-dev2
## 0.3.5-dev5

* Add new pattern to recognize plain text dash bullets
* Add test for bullet patterns
* Fix for `partition_html` that allows for processing `div` tags that have both text and child
elements
* Add ability to extract document metadata from `.docx`, `.xlsx`, and `.jpg` files.
* Helper functions for identifying and extracting phone numbers
* Add new function `extract_attachment_info` that extracts and decodes the attachments
of an email.

## 0.3.4

6 changes: 1 additions & 5 deletions docs/requirements.txt
@@ -1,5 +1,5 @@
#
# This file is autogenerated by pip-compile with python 3.8
# This file is autogenerated by pip-compile with python 3.10
# To update, run:
#
# pip-compile requirements/build.in
@@ -22,8 +22,6 @@ idna==3.4
# via requests
imagesize==1.4.1
# via sphinx
importlib-metadata==5.0.0
# via sphinx
jinja2==3.1.2
# via sphinx
markupsafe==2.1.1
@@ -60,5 +58,3 @@ sphinxcontrib-serializinghtml==1.1.5
# via sphinx
urllib3==1.26.12
# via requests
zipp==3.10.0
# via importlib-metadata
47 changes: 47 additions & 0 deletions docs/source/bricks.rst
@@ -77,6 +77,23 @@ Examples:
text = f.read()
elements = partition_email(text=text)
``extract_attachment_info``
---------------------------

The ``extract_attachment_info`` function takes an ``email.message.Message`` object
as input and returns a list of dictionaries containing the attachment information,
such as ``filename``, ``size``, ``payload``, etc. Each attachment is saved to ``output_dir``
if it is specified.

.. code:: python

  import email
  from unstructured.partition.email import extract_attachment_info

  with open("example-docs/fake-email-attachment.eml", "r") as f:
      msg = email.message_from_file(f)
  attachment_info = extract_attachment_info(msg, output_dir="example-docs")

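
The kind of information described above can be gathered with the standard library alone. The sketch below is not the implementation in ``unstructured``: ``sketch_attachment_info`` is a hypothetical helper, and its dictionary keys only mirror the names mentioned in the text (``filename``, ``size``, ``payload``). The message is built in memory so the example runs without any files on disk.

```python
import email
from email.message import EmailMessage, Message
from pathlib import Path
from typing import Dict, List, Optional, Union


def sketch_attachment_info(
    message: Message, output_dir: Optional[str] = None
) -> List[Dict[str, Union[str, int, bytes]]]:
    """Hypothetical stand-in: collect attachment info from an email message."""
    attachments: List[Dict[str, Union[str, int, bytes]]] = []
    for part in message.walk():
        # Only parts marked "Content-Disposition: attachment" are collected.
        if part.get_content_disposition() != "attachment":
            continue
        # decode=True undoes the Content-Transfer-Encoding (e.g. base64).
        payload = part.get_payload(decode=True) or b""
        filename = part.get_filename() or "unnamed-attachment"
        attachments.append(
            {"filename": filename, "size": len(payload), "payload": payload}
        )
        if output_dir is not None:
            # Path(filename).name drops directory components from the header value.
            (Path(output_dir) / Path(filename).name).write_bytes(payload)
    return attachments


# Build a small multipart message with one attachment.
msg = EmailMessage()
msg["Subject"] = "Fake email with attachment"
msg.set_content("Here's the attachment!")
msg.add_attachment(
    b"Hey this is a fake attachment!",
    maintype="application",
    subtype="octet-stream",
    filename="fake-attachment.txt",
)

# Round-trip through bytes to mimic parsing an .eml file.
info = sketch_attachment_info(email.message_from_bytes(msg.as_bytes()))
print(info[0]["filename"], info[0]["size"])
```

Leaving ``output_dir`` as ``None`` keeps the sketch side-effect free, which matches the "if specified" behavior described in the text.
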
``is_bulleted_text``
----------------------
@@ -162,6 +179,21 @@ Examples:
is_possible_title(example_3, sentence_min_length=5)
``contains_us_phone_number``
----------------------------

Checks to see if a section of text contains a US phone number.

Examples:

.. code:: python

  from unstructured.partition.text_type import contains_us_phone_number

  # Returns True because the text includes a phone number
  contains_us_phone_number("Phone number: 215-867-5309")

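
A check like this is commonly built on a regular expression. The following is a minimal sketch under that assumption: the pattern ``US_PHONE_SKETCH`` and the function ``sketch_contains_us_phone_number`` are hypothetical, and the pattern actually added in this commit may accept a different set of formats.

```python
import re

# Assumed pattern for illustration only: optional "+1" country code, a
# three-digit area code (optionally in parentheses), then 3 + 4 digits,
# with spaces, dots, or dashes allowed as separators.
US_PHONE_SKETCH = re.compile(
    r"(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"
)


def sketch_contains_us_phone_number(text: str) -> bool:
    """Return True if the assumed pattern matches anywhere in the text."""
    return US_PHONE_SKETCH.search(text) is not None


has_number = sketch_contains_us_phone_number("Phone number: 215-867-5309")
no_number = sketch_contains_us_phone_number("No digits in this sentence")
print(has_number, no_number)
```
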
``contains_verb``
-----------------

@@ -471,6 +503,21 @@ Examples:
extract_text_after(text, r"SPEAKER \d{1}:")
``extract_us_phone_number``
---------------------------

Extracts a US phone number from a section of text.

Examples:

.. code:: python

  from unstructured.cleaners.extract import extract_us_phone_number

  # Returns "215-867-5309"
  extract_us_phone_number("Phone number: 215-867-5309")

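
Extraction differs from detection only in returning the matched substring. The sketch below assumes a simple regular expression and assumes that an empty string is returned when nothing matches; both the name ``sketch_extract_us_phone_number`` and that fallback are illustrative choices, not confirmed behavior of the library function.

```python
import re

# Assumed pattern for illustration only; the library's pattern may differ.
PHONE_SKETCH = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def sketch_extract_us_phone_number(text: str) -> str:
    """Return the first substring matching the assumed pattern, else ""."""
    match = PHONE_SKETCH.search(text)
    # group(0) is the full matched text, without the surrounding words.
    return match.group(0) if match else ""


found = sketch_extract_us_phone_number("Phone number: 215-867-5309")
missing = sketch_extract_us_phone_number("No phone number here")
print(repr(found), repr(missing))
```
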
``translate_text``
------------------

36 changes: 36 additions & 0 deletions docs/source/examples.rst
@@ -234,3 +234,39 @@ for reading in a document with an XSLT stylesheet is as follows:
If you read from a stylesheet ``HTMLDocument`` will use the ``etree.XMLParser`` by default
instead of the ``etree.HTMLParser`` because ``HTMLDocument`` assumes you want to convert
your raw XML to HTML.


##################################
Extracting Metadata from Documents
##################################

The ``unstructured`` library includes utilities for extracting metadata from
documents. Currently, there is support for extracting metadata from ``.docx``,
``.xlsx``, and ``.jpg`` documents. When you call these functions, the return type
is a ``Metadata`` data class that you can convert to a dictionary by calling the
``to_dict()`` method. If you extract metadata from a ``.jpg`` document, the output
will include EXIF metadata in the ``exif_data`` attribute, if it is available.
Here is an example of how to use the metadata extraction functionality:


.. code:: python

  from unstructured.file_utils.metadata import get_jpg_metadata

  filename = "example-docs/example.jpg"
  metadata = get_jpg_metadata(filename=filename)

You can also pass in a file-like object with:

.. code:: python

  from unstructured.file_utils.metadata import get_jpg_metadata

  filename = "example-docs/example.jpg"
  with open(filename, "rb") as f:
      metadata = get_jpg_metadata(file=f)

To extract metadata from ``.docx`` or ``.xlsx``, use ``get_docx_metadata`` and
``get_xlsx_metadata``. The interfaces are the same as ``get_jpg_metadata``.
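
To make the ``filename``/``file`` interface and the ``to_dict()`` conversion concrete, here is a rough standard-library sketch. It is not how ``unstructured`` does it (this commit adds ``python-docx`` for that purpose). It relies on the fact that a ``.docx`` file is a ZIP package whose core properties live in ``docProps/core.xml``. The names ``SketchMetadata`` and ``sketch_docx_metadata`` are hypothetical, and only two fields are read.

```python
import io
import xml.etree.ElementTree as ET
import zipfile
from dataclasses import asdict, dataclass
from typing import BinaryIO, Optional

# XML namespaces used by docProps/core.xml in Office Open XML packages.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}


@dataclass
class SketchMetadata:
    """Hypothetical stand-in for the library's ``Metadata`` data class."""

    author: str = ""
    title: str = ""

    def to_dict(self) -> dict:
        return asdict(self)


def sketch_docx_metadata(
    filename: Optional[str] = None, file: Optional[BinaryIO] = None
) -> SketchMetadata:
    """Read the author and title from a .docx given a path or a file-like object."""
    source = filename if filename is not None else file
    if source is None:
        raise ValueError("Pass either filename or file.")
    # zipfile.ZipFile accepts both a path and a seekable binary file object.
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    return SketchMetadata(
        author=root.findtext("dc:creator", default="", namespaces=NS),
        title=root.findtext("dc:title", default="", namespaces=NS),
    )


# Build a minimal package in memory so the example needs no file on disk.
core_xml = (
    f'<cp:coreProperties xmlns:cp="{NS["cp"]}" xmlns:dc="{NS["dc"]}">'
    "<dc:title>Fake Doc</dc:title><dc:creator>Jane Doe</dc:creator>"
    "</cp:coreProperties>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("docProps/core.xml", core_xml)
buffer.seek(0)

metadata = sketch_docx_metadata(file=buffer)
print(metadata.to_dict())
```
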
Binary file added example-docs/example.jpg
50 changes: 50 additions & 0 deletions example-docs/fake-email-attachment.eml
@@ -0,0 +1,50 @@
MIME-Version: 1.0
Date: Fri, 23 Dec 2022 12:08:48 -0600
Message-ID: <CAPgNNXSzLVJ-d1OCX_TjFgJU7ugtQrjFybPtAMmmYZzphxNFYg@mail.gmail.com>
Subject: Fake email with attachment
From: Mallori Harrell <mallori@unstructured.io>
To: Mallori Harrell <mallori@unstructured.io>
Content-Type: multipart/mixed; boundary="0000000000005d654405f082adb7"

--0000000000005d654405f082adb7
Content-Type: multipart/alternative; boundary="0000000000005d654205f082adb5"
--0000000000005d654205f082adb5
Content-Type: text/plain; charset="UTF-8"
Hello!
Here's the attachments!
It includes:
- Lots of whitespace
- Little to no content
- and is a quick read
Best,
Mallori
--0000000000005d654205f082adb5
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">Hello!=C2=A0<div><br></div><div>Here&#39;s the attachments=
!</div><div><br></div><div>It includes:</div><div><ul><li style=3D"margin-l=
eft:15px">Lots of whitespace</li><li style=3D"margin-left:15px">Little=C2=
=A0to no content</li><li style=3D"margin-left:15px">and is a quick read</li=
></ul><div>Best,</div></div><div><br></div><div>Mallori</div><div dir=3D"lt=
r" class=3D"gmail_signature" data-smartmail=3D"gmail_signature"><div dir=3D=
"ltr"><div><div><br></div></div></div></div></div>

--0000000000005d654205f082adb5--
--0000000000005d654405f082adb7
Content-Type: text/plain; charset="US-ASCII"; name="fake-attachment.txt"
Content-Disposition: attachment; filename="fake-attachment.txt"
Content-Transfer-Encoding: base64
X-Attachment-Id: f_lc0tto5j0
Content-ID: <f_lc0tto5j0>
SGV5IHRoaXMgaXMgYSBmYWtlIGF0dGFjaG1lbnQh
--0000000000005d654405f082adb7--
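
The attachment part above declares ``Content-Transfer-Encoding: base64``, so its body can be checked by hand with the standard library. This is only a quick verification of the test fixture, not part of the commit:

```python
import base64

# The base64 body of the "fake-attachment.txt" part, copied from the .eml above.
encoded = "SGV5IHRoaXMgaXMgYSBmYWtlIGF0dGFjaG1lbnQh"

# The part declares charset US-ASCII, so decode the bytes as ASCII text.
decoded = base64.b64decode(encoded).decode("ascii")
print(decoded)  # Hey this is a fake attachment!
```
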
Binary file added example-docs/fake-excel.xlsx
Binary file not shown.
Binary file added example-docs/fake.docx
Binary file not shown.
20 changes: 15 additions & 5 deletions requirements/base.txt
@@ -1,5 +1,5 @@
#
# This file is autogenerated by pip-compile with python 3.8
# This file is autogenerated by pip-compile with python 3.10
# To update, run:
#
# pip-compile --output-file=requirements/base.txt
@@ -16,6 +16,8 @@ click==8.1.3
# via nltk
deprecated==1.2.13
# via argilla
et-xmlfile==1.1.0
# via openpyxl
h11==0.9.0
# via httpcore
httpcore==0.11.1
@@ -26,8 +28,10 @@ idna==3.4
# via rfc3986
joblib==1.2.0
# via nltk
lxml==4.9.1
# via unstructured (setup.py)
lxml==4.9.2
# via
# python-docx
# unstructured (setup.py)
monotonic==1.6
# via argilla
nltk==3.7
@@ -36,17 +40,23 @@ numpy==1.23.5
# via
# argilla
# pandas
packaging==21.3
openpyxl==3.0.10
# via unstructured (setup.py)
packaging==22.0
# via argilla
pandas==1.5.2
# via argilla
pillow==9.4.0
# via unstructured (setup.py)
pydantic==1.10.2
# via argilla
pyparsing==3.0.9
# via packaging
python-dateutil==2.8.2
# via pandas
pytz==2022.6
python-docx==0.8.11
# via unstructured (setup.py)
pytz==2022.7
# via pandas
regex==2022.10.31
# via nltk
6 changes: 1 addition & 5 deletions requirements/build.txt
@@ -1,5 +1,5 @@
#
# This file is autogenerated by pip-compile with python 3.8
# This file is autogenerated by pip-compile with python 3.10
# To update, run:
#
# pip-compile requirements/build.in
@@ -22,8 +22,6 @@ idna==3.4
# via requests
imagesize==1.4.1
# via sphinx
importlib-metadata==5.0.0
# via sphinx
jinja2==3.1.2
# via sphinx
markupsafe==2.1.1
@@ -60,5 +58,3 @@ sphinxcontrib-serializinghtml==1.1.5
# via sphinx
urllib3==1.26.12
# via requests
zipp==3.10.0
# via importlib-metadata
30 changes: 6 additions & 24 deletions requirements/dev.txt
@@ -1,13 +1,9 @@
#
# This file is autogenerated by pip-compile with python 3.8
# This file is autogenerated by pip-compile with python 3.10
# To update, run:
#
# pip-compile requirements/dev.in
#
appnope==0.1.3
# via
# ipykernel
# ipython
argon2-cffi==21.3.0
# via notebook
argon2-cffi-bindings==21.2.0
@@ -40,10 +36,6 @@ executing==1.0.0
# via stack-data
fastjsonschema==2.16.2
# via nbformat
importlib-metadata==5.0.0
# via nbconvert
importlib-resources==5.10.0
# via jsonschema
ipykernel==6.15.3
# via
# ipywidgets
@@ -53,7 +45,7 @@ ipykernel==6.15.3
# qtconsole
ipython==8.6.0
# via
# -r requirements/dev.in
# -r dev.in
# ipykernel
# ipywidgets
# jupyter-console
@@ -72,7 +64,7 @@ jinja2==3.1.2
jsonschema==4.16.0
# via nbformat
jupyter==1.0.0
# via -r requirements/dev.in
# via -r dev.in
jupyter-client==7.3.5
# via
# ipykernel
@@ -84,7 +76,7 @@ jupyter-core==5.1.0
# via jupyter
jupyter-core==5.1.0
# via
# -r requirements/dev.in
# -r dev.in
# jupyter-client
# nbconvert
# nbformat
@@ -141,10 +133,8 @@ pexpect==4.8.0
# via ipython
pickleshare==0.7.5
# via ipython
pip-tools==6.10.0
# via -r requirements/dev.in
pkgutil-resolve-name==1.3.10
# via jsonschema
pip-tools==6.12.1
# via -r dev.in
platformdirs==2.5.4
# via jupyter-core
prometheus-client==0.14.1
@@ -200,10 +190,6 @@ terminado==0.15.0
# via notebook
tinycss2==1.1.1
# via nbconvert
tomli==2.0.1
# via
# build
# pep517
tornado==6.2
# via
# ipykernel
@@ -233,10 +219,6 @@ wheel==0.37.1
# via pip-tools
widgetsnbextension==4.0.3
# via ipywidgets
zipp==3.10.0
# via
# importlib-metadata
# importlib-resources

# The following packages are considered to be unsafe in a requirements file:
# pip