feat: Added `EmailElement` for email documents #103

mallorih · 2022-12-16T21:53:33Z

Summary

Adds an EmailElement class to unstructured.documents to make it easier to parse email documents into elements. Also adds tests that are essentially the same as the tests for the Text class, modified for the new Name class that is part of email_elements.

MthwRobinson

Looking good, just a few small suggestions. Also, started to pull together some code that parses out the text/html content from the email file (link on the branch below). We can think through how we might want to work that in with your new data structure.

https://github.com/Unstructured-IO/unstructured/blob/robinson/partition-eml/unstructured/partition/email.py

MthwRobinson · 2022-12-16T21:56:03Z

unstructured/documents/email_elements.py

+        self.name = cleaned_name
+
+
+class BodyText(Text):


Can we make this List[Text], that way we can partition out NarrativeText, ListItem, etc?
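
A rough sketch of what the List[Text] suggestion could look like. The Text, NarrativeText and ListItem classes below are simplified stand-ins written for illustration, not the library's actual definitions, and the `__str__` behavior is an assumption.

```python
from typing import List


class Text:
    """Simplified stand-in for the library's Text element (illustration only)."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __str__(self) -> str:
        return self.text


class NarrativeText(Text):
    """Stand-in for a prose paragraph element."""


class ListItem(Text):
    """Stand-in for a bulleted or numbered list entry."""


class BodyText(List[Text]):
    """Email body held as a list of elements rather than one block of text."""

    def __str__(self) -> str:
        # Join the string form of each element, one per line.
        return "\n".join(str(element) for element in self)


body = BodyText(
    [
        NarrativeText("Hi team, two updates:"),
        ListItem("Tests added"),
        ListItem("Docs pending"),
    ]
)

# Because each element keeps its own type, callers can filter by kind.
list_items = [element for element in body if isinstance(element, ListItem)]
```

Keeping the body as a list means downstream code can pick out narrative text or list items without re-parsing a single string.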

CHANGELOG.md

unstructured/documents/email_elements.py

MthwRobinson

One question on attachments, otherwise this is looking good.

MthwRobinson · 2022-12-19T15:51:57Z

unstructured/documents/email_elements.py

+
+        Attachment:
+
+        {self.attachment}


Could the attachments wind up being files? If so would it make more sense to print the filenames of the attachments instead of printing the attachment itself? Just so we don't wind up printing out the bytes for the attachment.
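
One way to get at the filenames without touching the attachment bytes, sketched with the standard library only. The helper name `attachment_filenames` is hypothetical and the message is built in memory purely for the example. Note that `iter_attachments` is only available on `EmailMessage` objects, so a message parsed with `email.message_from_file` would need `policy=email.policy.default` for this to apply.

```python
from email.message import EmailMessage
from typing import List


def attachment_filenames(msg: EmailMessage) -> List[str]:
    """Hypothetical helper: return attachment filenames, skipping unnamed parts."""
    names = []
    for part in msg.iter_attachments():
        filename = part.get_filename()
        if filename is not None:
            names.append(filename)
    return names


# Build a small message with one binary attachment for demonstration.
msg = EmailMessage()
msg["Subject"] = "Quarterly report"
msg.set_content("See attached.")
msg.add_attachment(
    b"placeholder bytes",
    maintype="application",
    subtype="pdf",
    filename="report.pdf",
)

# Print the names only, never the raw payload.
summary = "Attachment:\n" + "\n".join(attachment_filenames(msg))
```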

MthwRobinson

Added some initial comments. I know this one is still in flight, just a reminder to add the following to the final PR:

Sphinx docs on any new bricks
Unit tests
A section in the README showing how to use partition_text.

unstructured/partition/text.py

MthwRobinson · 2022-12-19T19:34:12Z

unstructured/partition/text.py

+    file
+        A file-like object using "r" mode --> open(filename, "r").
+    text
+        The string representation of the .eml document.


of the .eml document -> of a plain text document

MthwRobinson · 2022-12-19T19:36:07Z

unstructured/partition/text.py

+
+    if filename is not None and not file and not text:
+        with open(filename, "r") as f:
+            msg = email.message_from_file(f)


These should all deal with plain text .txt documents rather than emails. We'll use it (1) when we want to partition a plain text document and (2) if you're partitioning an email and choose to process the text/plain content.

MthwRobinson · 2022-12-19T19:37:09Z

unstructured/partition/text.py

+    content_map: Dict[str, str] = {
+        part.get_content_type(): part.get_payload() for part in msg.walk()
+    }
+    content = content_map.get("text/plain", "")


This should go in partition_email and we'll call partition_text from there if the user decides to process the text/plain content.
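
A minimal sketch of that split of responsibilities, assuming string input only. Both function bodies are illustrative stand-ins rather than the library's implementation: `partition_text` here just returns paragraph strings, and the payload is read without decoding any transfer encoding.

```python
import email
from typing import Dict, List


def partition_text(text: str) -> List[str]:
    """Stand-in for the plain text partitioner: return non-empty paragraphs."""
    return [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]


def partition_email(text: str) -> List[str]:
    """Parse the email, pull out text/plain, and delegate to partition_text."""
    msg = email.message_from_string(text)
    # Skip multipart containers, whose payload is a list of sub-parts.
    content_map: Dict[str, str] = {
        part.get_content_type(): part.get_payload()
        for part in msg.walk()
        if not part.is_multipart()
    }
    content = content_map.get("text/plain", "")
    if not content:
        raise ValueError("text/plain content not found in email")
    return partition_text(content)


raw_email = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: Hello\n"
    "Content-Type: text/plain\n"
    "\n"
    "First paragraph.\n"
    "\n"
    "Second paragraph.\n"
)
elements = partition_email(raw_email)
```

With this layout the email-specific parsing lives in one place and the text partitioner never needs to know about .eml files.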

MthwRobinson · 2022-12-19T19:40:03Z

unstructured/partition/text.py

+    if not content:
+        raise ValueError("text/plain content not found in email")
+
+    content = re.split(r"\n\n\n|\n\n|\n", content)


I'd make split_by_paragraph a separate helper function and also handle \r and other line-ending variants as well. The links below provide some background on that.

https://stackoverflow.com/questions/20056306/match-linebreaks-n-or-r-n

https://stackoverflow.com/questions/1761051/difference-between-n-and-r
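
A possible shape for that helper, as a sketch. The regex and the choice to drop whitespace-only chunks are assumptions, not the code that ended up in the PR.

```python
import re
from typing import List

# One or more consecutive line endings: \r\n (Windows), \r (old Mac), \n (Unix).
# \r\n is listed first so a Windows line ending is consumed as a single unit.
LINE_BREAK_RE = re.compile(r"(?:\r\n|\r|\n)+")


def split_by_paragraph(content: str) -> List[str]:
    """Split text on any run of line breaks and drop empty chunks."""
    return [chunk for chunk in LINE_BREAK_RE.split(content) if chunk.strip()]


paragraphs = split_by_paragraph("Dear all,\r\n\r\nFirst point.\nSecond point.\rRegards")
```

Collapsing a run of line breaks into a single split point also avoids the empty strings that a fixed alternation such as `\n\n\n|\n\n|\n` produces when four or more newlines appear in a row.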

MthwRobinson · 2022-12-19T19:49:13Z

CHANGELOG.md

@@ -1,3 +1,7 @@
+## 0.3.4-dev2
+
+* Add 


You can add your bullet under 0.3.3-dev1 and then change the version to 0.3.3-dev2. We'll move to 0.3.4-devx once we do the 0.3.3 release.

MthwRobinson

Looking good, just a couple of questions related to attachments. Can approve if we want to spin attachments off and deal with it in a separate PR.

MthwRobinson · 2022-12-21T15:35:28Z

CHANGELOG.md

@@ -1,8 +1,9 @@
-## 0.3.3-dev1
+## 0.3.3-dev2


This should be 0.3.5-dev0 now since we just did a 0.3.4 release

MthwRobinson · 2022-12-21T15:37:10Z

unstructured/documents/email_elements.py

+
+        Attachment:
+
+        {self.attachment_name}


Should this be a list of attachments in case there are multiple attachments?

MthwRobinson · 2022-12-21T15:37:35Z

unstructured/documents/email_elements.py

+        self.body = body
+        self.received_info: ReceivedInfo
+        self.meta_data: MetaData
+        self.attachment: Attachment


Similar here, should this be self.attachments: List[Attachment] in case there are multiple?

MthwRobinson · 2022-12-21T15:38:35Z

unstructured/documents/email_elements.py

+        self.received_info: ReceivedInfo
+        self.meta_data: MetaData
+        self.attachment: Attachment
+        self.attachment_name: Attachment


Can we have an Attachment.name attribute instead of creating an extra attribute on Email for the attachment name?

Actually name is already inherited so I think we can probably just drop this attribute

MthwRobinson · 2022-12-21T15:40:33Z

unstructured/documents/email_elements.py

+    pass
+
+
+class Attachment(Name):


Should attachment also have a bytes or file-like attribute that contains the actual attachment? If we don't have code to deal with attachments yet, we can also spin attachments off and deal with it in a separate PR.
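
Pulling the attachment comments together, one possible layout is sketched below. The Name base class is a simplified stand-in, and the `payload` attribute and `attachment_summary` method are hypothetical names used only for illustration.

```python
from typing import List, Optional


class Name:
    """Simplified stand-in for the PR's Name element: carries a name attribute."""

    def __init__(self, name: str) -> None:
        self.name = name


class Attachment(Name):
    """Inherits name from Name; payload (hypothetical) optionally holds raw bytes."""

    def __init__(self, name: str, payload: Optional[bytes] = None) -> None:
        super().__init__(name)
        self.payload = payload


class Email:
    """Holds a list of attachments so multiple files are supported."""

    def __init__(self, attachments: Optional[List[Attachment]] = None) -> None:
        self.attachments: List[Attachment] = attachments or []

    def attachment_summary(self) -> str:
        # Show names only so raw bytes never end up in printed output.
        names = "\n".join(attachment.name for attachment in self.attachments)
        return f"Attachment:\n{names}"


message = Email(
    [Attachment("report.pdf", b"placeholder bytes"), Attachment("logo.png")]
)
summary_text = message.attachment_summary()
```

Because the name lives on Attachment, Email needs no separate attachment_name attribute.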

MthwRobinson · 2022-12-21T21:49:34Z

unstructured/documents/html.py

@@ -9,7 +9,7 @@

 from lxml import etree

-from unstructured.logger import get_logger
+from unstructured.logger import logger


Do you know why these files are showing up in your diffs? They look the same as what's on main

MthwRobinson

LGTM!

Mallori Harrell added 8 commits December 16, 2022 14:27

new data structure, updated CHANGELOG and __version__.py

eb73930

removed code

19beabd

adding tests

7465689

added test

daae379

fixed import statement

9fa901a

linter

a7106d5

remove unused import statement

2b5a6ce

fixed syntax

b0ea522

MthwRobinson suggested changes Dec 16, 2022

added Email class

909b975

MthwRobinson reviewed Dec 19, 2022

Mallori Harrell added 3 commits December 19, 2022 12:24

merge conflicts

6730fcf

changelog and new partition function

cc27de8

partition text

47cb1ca

MthwRobinson suggested changes Dec 19, 2022

remove partition_text

3799c6b

MthwRobinson reviewed Dec 19, 2022

Mallori Harrell and others added 6 commits December 19, 2022 15:42

updated comments and added attachment name variable

e9ed6b4

updated changelog

fe9bea7

linter

f15048b

formatting issues

56d10ef

version

3389a1b

Merge branch 'main' into email-element

1b7815f

mallorih requested a review from MthwRobinson December 19, 2022 23:38

MthwRobinson suggested changes Dec 21, 2022

Mallori Harrell added 5 commits December 21, 2022 15:16

changed attachment

ce05243

Merge branch 'email-element' of https://github.com/Unstructured-IO/un…

095de59

…structured into email-element

new data structure, updated CHANGELOG and __version__.py

310fdaf

removed code

5181591

adding tests

1b65c3f

Mallori Harrell added 15 commits December 21, 2022 15:25

added test

32eaada

fixed import statement

d358ccd

linter

572be70

remove unused import statement

65f110f

fixed syntax

e582c2c

added Email class

bdd41a2

changelog and new partition function

f53c812

partition text

9365781

remove partition_text

cd3c238

updated comments and added attachment name variable

75cb27c

updated changelog

5466a08

linter

bf80f08

formatting issues

f128fea

changed attachment

ea7ad94

merge conflicts

42362ed

MthwRobinson reviewed Dec 21, 2022

MthwRobinson approved these changes Dec 21, 2022

Mallori Harrell added 2 commits December 21, 2022 15:55

Merge branch 'main' of https://github.com/Unstructured-IO/unstructured …

764187f

…into email-element

changelog and version

bf9875c

mallorih merged commit e0a76ef into main Dec 21, 2022

mallorih deleted the email-element branch December 21, 2022 22:03
