feat: Add extract_attachment_info
#112
Implementation looks good! Just one naming suggestion. Can we also add a section to the docs showing how to deal with attachments?
Looks good! LGTM after the following changes. Also made one small docs suggestion. Feel free to merge without another review once these items are in:
- Remove `DRAFT` from the PR title
- Add a testing section to the PR summary with test code (can be the same snippet from the docs)
- Run `make tidy` to fix the lint job
partition_attachment_info
extract_attachment_info
* feat: extract metadata from `.docx`, `.xlsx`, and `.jpg` (#113)
  * add python-docx dependency
  * added function for extracting metadata from word documents
  * add openpyxl
  * added get_jpg_metadata; fixed typing
  * bump changelog
  * added pillow to dependencies
* build(deps): Bump transformers from 4.23.1 to 4.25.1 in /requirements (#114)
  Bumps [transformers](https://github.com/huggingface/transformers) from 4.23.1 to 4.25.1.
  - [Release notes](https://github.com/huggingface/transformers/releases)
  - [Commits](huggingface/transformers@v4.23.1...v4.25.1)
  ---
  updated-dependencies:
  - dependency-name: transformers
    dependency-type: direct:production
    update-type: version-update:semver-minor
  ...
  Signed-off-by: dependabot[bot] <support@github.com>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* build(deps-dev): Bump pip-tools from 6.10.0 to 6.12.1 in /requirements (#115)
  Bumps [pip-tools](https://github.com/jazzband/pip-tools) from 6.10.0 to 6.12.1.
  - [Release notes](https://github.com/jazzband/pip-tools/releases)
  - [Changelog](https://github.com/jazzband/pip-tools/blob/main/CHANGELOG.md)
  - [Commits](jazzband/pip-tools@6.10.0...6.12.1)
  ---
  updated-dependencies:
  - dependency-name: pip-tools
    dependency-type: direct:development
    update-type: version-update:semver-minor
  ...
  Signed-off-by: dependabot[bot] <support@github.com>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
  Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
* build(deps): Bump torch from 1.13.0 to 1.13.1 in /requirements (#117)
  Bumps [torch](https://github.com/pytorch/pytorch) from 1.13.0 to 1.13.1.
  - [Release notes](https://github.com/pytorch/pytorch/releases)
  - [Changelog](https://github.com/pytorch/pytorch/blob/master/RELEASE.md)
  - [Commits](pytorch/pytorch@v1.13.0...v1.13.1)
  ---
  updated-dependencies:
  - dependency-name: torch
    dependency-type: direct:production
    update-type: version-update:semver-patch
  ...
  Signed-off-by: dependabot[bot] <support@github.com>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
  Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
* build(deps): Bump lxml from 4.9.1 to 4.9.2 in /requirements (#118)
  Bumps [lxml](https://github.com/lxml/lxml) from 4.9.1 to 4.9.2.
  - [Release notes](https://github.com/lxml/lxml/releases)
  - [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt)
  - [Commits](lxml/lxml@lxml-4.9.1...lxml-4.9.2)
  ---
  updated-dependencies:
  - dependency-name: lxml
    dependency-type: direct:production
    update-type: version-update:semver-patch
  ...
  Signed-off-by: dependabot[bot] <support@github.com>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* build(deps): Bump mypy from 0.990 to 0.991 in /requirements (#123)
  Bumps [mypy](https://github.com/python/mypy) from 0.990 to 0.991.
  - [Release notes](https://github.com/python/mypy/releases)
  - [Commits](python/mypy@v0.990...v0.991)
  ---
  updated-dependencies:
  - dependency-name: mypy
    dependency-type: direct:production
    update-type: version-update:semver-minor
  ...
  Signed-off-by: dependabot[bot] <support@github.com>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* build(deps): Bump packaging from 21.3 to 22.0 in /requirements (#119)
  Bumps [packaging](https://github.com/pypa/packaging) from 21.3 to 22.0.
  - [Release notes](https://github.com/pypa/packaging/releases)
  - [Changelog](https://github.com/pypa/packaging/blob/main/CHANGELOG.rst)
  - [Commits](pypa/packaging@21.3...22.0)
  ---
  updated-dependencies:
  - dependency-name: packaging
    dependency-type: direct:production
    update-type: version-update:semver-major
  ...
  Signed-off-by: dependabot[bot] <support@github.com>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
  Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
* build(deps): Bump pytz from 2022.6 to 2022.7 in /requirements (#122)
  Bumps [pytz](https://github.com/stub42/pytz) from 2022.6 to 2022.7.
  - [Release notes](https://github.com/stub42/pytz/releases)
  - [Commits](stub42/pytz@release_2022.6...release_2022.7)
  ---
  updated-dependencies:
  - dependency-name: pytz
    dependency-type: direct:production
    update-type: version-update:semver-minor
  ...
  Signed-off-by: dependabot[bot] <support@github.com>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* build(deps): Bump pillow from 9.3.0 to 9.4.0 in /requirements (#120)
  Bumps [pillow](https://github.com/python-pillow/Pillow) from 9.3.0 to 9.4.0.
  - [Release notes](https://github.com/python-pillow/Pillow/releases)
  - [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst)
  - [Commits](python-pillow/Pillow@9.3.0...9.4.0)
  ---
  updated-dependencies:
  - dependency-name: pillow
    dependency-type: direct:production
    update-type: version-update:semver-minor
  ...
  Signed-off-by: dependabot[bot] <support@github.com>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* feat: Add `extract_attachment_info` (#112)
  * Adds function to extract attachments and their metadata from eml files
* feat: helper functions to identify and extract phone numbers (#124)
  * added pattern for finding phone numbers
  * added cleaning brick for extracting phone numbers
  * add docs
  * changelog and bump version
  * switch to us phone numbers
  * bump dev version

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mallori Harrell <6825104+mallorih@users.noreply.github.com>
This is a draft PR until the previous PR is merged.
This PR adds the ability to extract attachment information and payloads from an email, and to either save the attachments to disk or keep them in a Python object.
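To illustrate the idea, here is a rough sketch using only the Python standard library's `email` package. It is not the code from this PR: the function name `sketch_extract_attachment_info`, the helper `build_example_message`, the returned dictionary keys, and the `output_dir` parameter are all assumptions made up for this example, and the real `extract_attachment_info` may differ in signature and behavior.

```python
import os
import tempfile
from email.message import EmailMessage
from typing import Dict, List, Optional


def sketch_extract_attachment_info(
    message: EmailMessage, output_dir: Optional[str] = None
) -> List[Dict[str, object]]:
    """Hypothetical stand-in, NOT the function added in this PR.

    Collects the filename, content type, and decoded payload of each
    attachment, and optionally writes each payload to output_dir.
    """
    attachments: List[Dict[str, object]] = []
    for part in message.iter_attachments():
        filename = part.get_filename()
        # decode=True undoes the transfer encoding (e.g. base64) and returns bytes
        payload = part.get_payload(decode=True)
        attachments.append(
            {
                "filename": filename,
                "content_type": part.get_content_type(),
                "payload": payload,
            }
        )
        if output_dir and filename and payload is not None:
            # basename() guards against path components in the filename
            path = os.path.join(output_dir, os.path.basename(filename))
            with open(path, "wb") as f:
                f.write(payload)
    return attachments


def build_example_message() -> EmailMessage:
    """Builds a small in-memory email with one attachment (illustrative data)."""
    msg = EmailMessage()
    msg["Subject"] = "Example"
    msg["From"] = "sender@example.com"
    msg["To"] = "recipient@example.com"
    msg.set_content("See attached.")
    msg.add_attachment(
        b"hello attachment",
        maintype="application",
        subtype="octet-stream",
        filename="fake-attachment.txt",
    )
    return msg


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        info = sketch_extract_attachment_info(build_example_message(), output_dir=tmp)
        print(info[0]["filename"], os.listdir(tmp))
```

Passing `output_dir` covers the "save it to disk" case, while omitting it leaves the payloads only in the returned list, which corresponds to the "store in a Python object" case.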
Testing