-
Notifications
You must be signed in to change notification settings - Fork 809
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Added EmailElement
for email documents
#103
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, just a few small suggestions. Also, started to pull together some code that parses out the text/html
content from the email file (link on the branch below). We can think through how we might want to work that in with your new data structure.
self.name = cleaned_name | ||
|
||
|
||
class BodyText(Text): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this List[Text]
, that way we can partition out NarrativeText
, ListItem
, etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One questions on attachments, otherwise this is looking good.
|
||
Attachment: | ||
|
||
{self.attachment} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the attachments wind up being files? If so would it make more sense to print the filenames of the attachments instead of printing the attachment itself? Just so we don't wind up printing out the bytes for the attachment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some initial comments. I know this one is still in flight, just a reminder to add the following to the final PR:
- Sphinx docs on any new bricks
- Unit tests
- A section in the
README
showing how to usepartition_text
.
unstructured/partition/text.py
Outdated
file | ||
A file-like object using "r" mode --> open(filename, "r"). | ||
text | ||
The string representation of the .eml document. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of the .eml document
-> of a plain text document
unstructured/partition/text.py
Outdated
|
||
if filename is not None and not file and not text: | ||
with open(filename, "r") as f: | ||
msg = email.message_from_file(f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should all deal with plain text .txt
documents rather than emails. And we'll use it (1) when we want to partition a plain text document and (2) if your partitioning an email and choose to process the text/plain
content.
unstructured/partition/text.py
Outdated
content_map: Dict[str, str] = { | ||
part.get_content_type(): part.get_payload() for part in msg.walk() | ||
} | ||
content = content_map.get("text/plain", "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should go in partition_email
and we'll call partition_text
from there if the user decides to process the text/plain
content.
unstructured/partition/text.py
Outdated
if not content: | ||
raise ValueError("text/plain content not found in email") | ||
|
||
content = re.split(r"\n\n\n|\n\n|\n", content) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd make split_by_paragraph
a separate helper function and also \r
and other line-ending variants a well. The links below provide some background on that.
CHANGELOG.md
Outdated
@@ -1,3 +1,7 @@ | |||
## 0.3.4-dev2 | |||
|
|||
* Add |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can add your bullet under 0.3.3-dev1
and then change the version to 0.3.3-dev2
. We'll move to 0.3.4-devx
once we do the 0.3.3
release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, just a couple of questions related to attachments. Can approve if we want to spin attachments off and deal with it in a separate PR.
CHANGELOG.md
Outdated
@@ -1,8 +1,9 @@ | |||
## 0.3.3-dev1 | |||
## 0.3.3-dev2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be 0.3.5-dev0
now since we just did a 0.3.4
release
|
||
Attachment: | ||
|
||
{self.attachment_name} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a list of attachments in case there are multiple attachments?
self.body = body | ||
self.received_info: ReceivedInfo | ||
self.meta_data: MetaData | ||
self.attachment: Attachment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar here, should this be self.attachments: List[Attachment]
in case there are multiple?
self.received_info: ReceivedInfo | ||
self.meta_data: MetaData | ||
self.attachment: Attachment | ||
self.attachment_name: Attachment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a Attachment.name
attribute instead of creating an extra attribute on Email
for the attachment name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually name
is already inherited so I think we can probably just drop this attribute
pass | ||
|
||
|
||
class Attachment(Name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should attachment also have a bytes
or file-like attribute that contains the actual attachment? If we don't have code to deal with attachments yet, we can also spin attachments off and deal with it in a separate PR.
unstructured/documents/html.py
Outdated
@@ -9,7 +9,7 @@ | |||
|
|||
from lxml import etree | |||
|
|||
from unstructured.logger import get_logger | |||
from unstructured.logger import logger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know why these files are showing up in your diffs? They look the same as what's on main
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Summary
Adds an
EmailElement
class tounstructured.documents
to make it easier to parse email documents into elements. Also added test that are basically the same as the tests forText
class but modified for the newName
class that is apart of theemail_elements
.