Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Added EmailElement for email documents #103

Merged
merged 41 commits into from
Dec 21, 2022
Merged

feat: Added EmailElement for email documents #103

merged 41 commits into from
Dec 21, 2022

Conversation

mallorih
Copy link
Contributor

Summary

Adds an EmailElement class to unstructured.documents to make it easier to parse email documents into elements. Also added test that are basically the same as the tests for Text class but modified for the new Name class that is apart of the email_elements.

Copy link
Contributor

@MthwRobinson MthwRobinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, just a few small suggestions. Also, started to pull together some code that parses out the text/html content from the email file (link on the branch below). We can think through how we might want to work that in with your new data structure.

https://github.com/Unstructured-IO/unstructured/blob/robinson/partition-eml/unstructured/partition/email.py

self.name = cleaned_name


class BodyText(Text):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this List[Text], that way we can partition out NarrativeText, ListItem, etc?

CHANGELOG.md Outdated Show resolved Hide resolved
unstructured/documents/email_elements.py Show resolved Hide resolved
Copy link
Contributor

@MthwRobinson MthwRobinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One questions on attachments, otherwise this is looking good.


Attachment:

{self.attachment}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the attachments wind up being files? If so would it make more sense to print the filenames of the attachments instead of printing the attachment itself? Just so we don't wind up printing out the bytes for the attachment.

Copy link
Contributor

@MthwRobinson MthwRobinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some initial comments. I know this one is still in flight, just a reminder to add the following to the final PR:

  • Sphinx docs on any new bricks
  • Unit tests
  • A section in the README showing how to use partition_text.

unstructured/partition/text.py Outdated Show resolved Hide resolved
file
A file-like object using "r" mode --> open(filename, "r").
text
The string representation of the .eml document.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of the .eml document -> of a plain text document


if filename is not None and not file and not text:
with open(filename, "r") as f:
msg = email.message_from_file(f)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should all deal with plain text .txt documents rather than emails. And we'll use it (1) when we want to partition a plain text document and (2) if your partitioning an email and choose to process the text/plain content.

content_map: Dict[str, str] = {
part.get_content_type(): part.get_payload() for part in msg.walk()
}
content = content_map.get("text/plain", "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should go in partition_email and we'll call partition_text from there if the user decides to process the text/plain content.

if not content:
raise ValueError("text/plain content not found in email")

content = re.split(r"\n\n\n|\n\n|\n", content)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd make split_by_paragraph a separate helper function and also \r and other line-ending variants a well. The links below provide some background on that.

CHANGELOG.md Outdated
@@ -1,3 +1,7 @@
## 0.3.4-dev2

* Add
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can add your bullet under 0.3.3-dev1 and then change the version to 0.3.3-dev2. We'll move to 0.3.4-devx once we do the 0.3.3 release.

Copy link
Contributor

@MthwRobinson MthwRobinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, just a couple of questions related to attachments. Can approve if we want to spin attachments off and deal with it in a separate PR.

CHANGELOG.md Outdated
@@ -1,8 +1,9 @@
## 0.3.3-dev1
## 0.3.3-dev2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be 0.3.5-dev0 now since we just did a 0.3.4 release


Attachment:

{self.attachment_name}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a list of attachments in case there are multiple attachments?

self.body = body
self.received_info: ReceivedInfo
self.meta_data: MetaData
self.attachment: Attachment
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar here, should this be self.attachments: List[Attachment] in case there are multiple?

self.received_info: ReceivedInfo
self.meta_data: MetaData
self.attachment: Attachment
self.attachment_name: Attachment
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a Attachment.name attribute instead of creating an extra attribute on Email for the attachment name?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually name is already inherited so I think we can probably just drop this attribute

pass


class Attachment(Name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should attachment also have a bytes or file-like attribute that contains the actual attachment? If we don't have code to deal with attachments yet, we can also spin attachments off and deal with it in a separate PR.

@@ -9,7 +9,7 @@

from lxml import etree

from unstructured.logger import get_logger
from unstructured.logger import logger
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know why these files are showing up in your diffs? They look the same as what's on main

Copy link
Contributor

@MthwRobinson MthwRobinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@mallorih mallorih merged commit e0a76ef into main Dec 21, 2022
@mallorih mallorih deleted the email-element branch December 21, 2022 22:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants