feat: Add extract_attachment_info #112

Merged
merged 16 commits into from
Jan 3, 2023

Conversation

@mallorih mallorih commented Dec 23, 2022

This is a draft PR until the previous PR is merged.

This PR adds the ability to extract attachment information and payloads from an email and either save them to disk or store them in a Python object.

Testing

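  import email
  from unstructured.partition.email import extract_attachment_info

  # Parse the example .eml file into an email.message.Message object
  with open("example-docs/fake-email-attachment.eml", "r") as f:
      msg = email.message_from_file(f)
  # Extract attachment info; output_dir is where attachments are saved to disk
  attachment_info = extract_attachment_info(msg, output_dir="example-docs")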
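
For readers without the example file, here is a minimal sketch of the general technique using only the standard library `email` package. It is not the implementation from this PR: `sketch_extract_attachments` and `build_demo_message` are hypothetical names, and the returned list of `filename`/`payload` dicts is an assumption made for illustration.

```python
import os
import tempfile
from email.message import EmailMessage, Message
from typing import Dict, List, Optional, Union


def sketch_extract_attachments(
    msg: Message, output_dir: Optional[str] = None
) -> List[Dict[str, Union[str, bytes]]]:
    """Hypothetical helper: collect the filename and payload of each attachment."""
    attachments: List[Dict[str, Union[str, bytes]]] = []
    for part in msg.walk():
        # Only parts with "Content-Disposition: attachment" are treated as attachments.
        if part.get_content_disposition() != "attachment":
            continue
        filename = part.get_filename() or "unnamed-attachment"
        # decode=True undoes the transfer encoding (e.g. base64); it returns
        # None for multipart parts, so fall back to empty bytes.
        payload = part.get_payload(decode=True) or b""
        attachments.append({"filename": filename, "payload": payload})
        if output_dir is not None:
            # basename() drops any directory components in the attachment name.
            path = os.path.join(output_dir, os.path.basename(filename))
            with open(path, "wb") as f:
                f.write(payload)
    return attachments


def build_demo_message() -> EmailMessage:
    """Build a small in-memory email with one binary attachment (demo data)."""
    msg = EmailMessage()
    msg["Subject"] = "Demo"
    msg.set_content("See attached.")
    msg.add_attachment(
        b"hello",
        maintype="application",
        subtype="octet-stream",
        filename="note.bin",
    )
    return msg


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        info = sketch_extract_attachments(build_demo_message(), output_dir=tmp_dir)
        print([item["filename"] for item in info])
```

Passing `output_dir=None` keeps the payloads in memory only, which mirrors the "save to disk or store in a Python object" choice described in the PR summary.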

@MthwRobinson MthwRobinson left a comment

Implementation looks good! Just one naming suggestion. Can we also add a section to the docs showing how to deal with attachments?

unstructured/partition/email.py (outdated, resolved)

@MthwRobinson MthwRobinson left a comment

Looks good! LGTM after the following changes. Also made one small docs suggestion. Feel free to merge without another review once these items are in:

  • Remove DRAFT from the PR title
  • Add a testing section to the PR summary with test code (can be the same snippet from the docs)
  • Run make tidy to fix the lint job

docs/source/bricks.rst (outdated, resolved)
@mallorih mallorih changed the title DRAFT: Add partition_attachment_info feat: Add partition_attachment_info Jan 3, 2023
@mallorih mallorih changed the title feat: Add partition_attachment_info feat: Add extract_attachment_info Jan 3, 2023
@mallorih mallorih merged commit 509ad49 into main Jan 3, 2023
@mallorih mallorih deleted the extract-attachment branch January 3, 2023 17:41
LaverdeS added a commit that referenced this pull request Jan 4, 2023
* feat: extract metadata from `.docx`, `.xlsx`, and `.jpg` (#113)

* add python-docx dependency

* added function for extracting metadata from word documents

* add openpyxl

* added get_jpg_metadata; fixed typing

* bump changelog

* added pillow to dependencies

* build(deps): Bump transformers from 4.23.1 to 4.25.1 in /requirements (#114)

Bumps [transformers](https://github.com/huggingface/transformers) from 4.23.1 to 4.25.1.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](huggingface/transformers@v4.23.1...v4.25.1)

---
updated-dependencies:
- dependency-name: transformers
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): Bump pip-tools from 6.10.0 to 6.12.1 in /requirements (#115)

Bumps [pip-tools](https://github.com/jazzband/pip-tools) from 6.10.0 to 6.12.1.
- [Release notes](https://github.com/jazzband/pip-tools/releases)
- [Changelog](https://github.com/jazzband/pip-tools/blob/main/CHANGELOG.md)
- [Commits](jazzband/pip-tools@6.10.0...6.12.1)

---
updated-dependencies:
- dependency-name: pip-tools
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>

* build(deps): Bump torch from 1.13.0 to 1.13.1 in /requirements (#117)

Bumps [torch](https://github.com/pytorch/pytorch) from 1.13.0 to 1.13.1.
- [Release notes](https://github.com/pytorch/pytorch/releases)
- [Changelog](https://github.com/pytorch/pytorch/blob/master/RELEASE.md)
- [Commits](pytorch/pytorch@v1.13.0...v1.13.1)

---
updated-dependencies:
- dependency-name: torch
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>

* build(deps): Bump lxml from 4.9.1 to 4.9.2 in /requirements (#118)

Bumps [lxml](https://github.com/lxml/lxml) from 4.9.1 to 4.9.2.
- [Release notes](https://github.com/lxml/lxml/releases)
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt)
- [Commits](lxml/lxml@lxml-4.9.1...lxml-4.9.2)

---
updated-dependencies:
- dependency-name: lxml
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): Bump mypy from 0.990 to 0.991 in /requirements (#123)

Bumps [mypy](https://github.com/python/mypy) from 0.990 to 0.991.
- [Release notes](https://github.com/python/mypy/releases)
- [Commits](python/mypy@v0.990...v0.991)

---
updated-dependencies:
- dependency-name: mypy
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): Bump packaging from 21.3 to 22.0 in /requirements (#119)

Bumps [packaging](https://github.com/pypa/packaging) from 21.3 to 22.0.
- [Release notes](https://github.com/pypa/packaging/releases)
- [Changelog](https://github.com/pypa/packaging/blob/main/CHANGELOG.rst)
- [Commits](pypa/packaging@21.3...22.0)

---
updated-dependencies:
- dependency-name: packaging
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>

* build(deps): Bump pytz from 2022.6 to 2022.7 in /requirements (#122)

Bumps [pytz](https://github.com/stub42/pytz) from 2022.6 to 2022.7.
- [Release notes](https://github.com/stub42/pytz/releases)
- [Commits](stub42/pytz@release_2022.6...release_2022.7)

---
updated-dependencies:
- dependency-name: pytz
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): Bump pillow from 9.3.0 to 9.4.0 in /requirements (#120)

Bumps [pillow](https://github.com/python-pillow/Pillow) from 9.3.0 to 9.4.0.
- [Release notes](https://github.com/python-pillow/Pillow/releases)
- [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst)
- [Commits](python-pillow/Pillow@9.3.0...9.4.0)

---
updated-dependencies:
- dependency-name: pillow
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: Add `extract_attachment_info` (#112)

* Adds function to extract attachments and their metadata from eml files

* feat: helper functions to identify and extract phone numbers (#124)

* added pattern for finding phone numbers

* added cleaning brick for extracting phone numbers

* add docs

* changelog and bump version

* switch to us phone numbers

* bump dev version

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mallori Harrell <6825104+mallorih@users.noreply.github.com>