-
Notifications
You must be signed in to change notification settings - Fork 1.1k
feat: Added EmailElement for email documents
#103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
MthwRobinson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, just a few small suggestions. Also, started to pull together some code that parses out the text/html content from the email file (link on the branch below). We can think through how we might want to work that in with your new data structure.
| self.name = cleaned_name | ||
|
|
||
|
|
||
| class BodyText(Text): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this List[Text], that way we can partition out NarrativeText, ListItem, etc?
MthwRobinson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One questions on attachments, otherwise this is looking good.
| Attachment: | ||
| {self.attachment} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the attachments wind up being files? If so would it make more sense to print the filenames of the attachments instead of printing the attachment itself? Just so we don't wind up printing out the bytes for the attachment.
MthwRobinson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some initial comments. I know this one is still in flight, just a reminder to add the following to the final PR:
- Sphinx docs on any new bricks
- Unit tests
- A section in the
READMEshowing how to usepartition_text.
unstructured/partition/text.py
Outdated
| file | ||
| A file-like object using "r" mode --> open(filename, "r"). | ||
| text | ||
| The string representation of the .eml document. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of the .eml document -> of a plain text document
unstructured/partition/text.py
Outdated
|
|
||
| if filename is not None and not file and not text: | ||
| with open(filename, "r") as f: | ||
| msg = email.message_from_file(f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should all deal with plain text .txt documents rather than emails. And we'll use it (1) when we want to partition a plain text document and (2) if your partitioning an email and choose to process the text/plain content.
unstructured/partition/text.py
Outdated
| content_map: Dict[str, str] = { | ||
| part.get_content_type(): part.get_payload() for part in msg.walk() | ||
| } | ||
| content = content_map.get("text/plain", "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should go in partition_email and we'll call partition_text from there if the user decides to process the text/plain content.
unstructured/partition/text.py
Outdated
| if not content: | ||
| raise ValueError("text/plain content not found in email") | ||
|
|
||
| content = re.split(r"\n\n\n|\n\n|\n", content) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd make split_by_paragraph a separate helper function and also \r and other line-ending variants a well. The links below provide some background on that.
CHANGELOG.md
Outdated
| @@ -1,3 +1,7 @@ | |||
| ## 0.3.4-dev2 | |||
|
|
|||
| * Add | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can add your bullet under 0.3.3-dev1 and then change the version to 0.3.3-dev2. We'll move to 0.3.4-devx once we do the 0.3.3 release.
MthwRobinson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, just a couple of questions related to attachments. Can approve if we want to spin attachments off and deal with it in a separate PR.
CHANGELOG.md
Outdated
| @@ -1,8 +1,9 @@ | |||
| ## 0.3.3-dev1 | |||
| ## 0.3.3-dev2 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be 0.3.5-dev0 now since we just did a 0.3.4 release
| Attachment: | ||
| {self.attachment_name} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a list of attachments in case there are multiple attachments?
| self.body = body | ||
| self.received_info: ReceivedInfo | ||
| self.meta_data: MetaData | ||
| self.attachment: Attachment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar here, should this be self.attachments: List[Attachment] in case there are multiple?
| self.received_info: ReceivedInfo | ||
| self.meta_data: MetaData | ||
| self.attachment: Attachment | ||
| self.attachment_name: Attachment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a Attachment.name attribute instead of creating an extra attribute on Email for the attachment name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually name is already inherited so I think we can probably just drop this attribute
| pass | ||
|
|
||
|
|
||
| class Attachment(Name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should attachment also have a bytes or file-like attribute that contains the actual attachment? If we don't have code to deal with attachments yet, we can also spin attachments off and deal with it in a separate PR.
| from lxml import etree | ||
|
|
||
| from unstructured.logger import get_logger | ||
| from unstructured.logger import logger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know why these files are showing up in your diffs? They look the same as what's on main
MthwRobinson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Summary
Adds an
EmailElementclass tounstructured.documentsto make it easier to parse email documents into elements. Also added test that are basically the same as the tests forTextclass but modified for the newNameclass that is apart of theemail_elements.