Merged
151 changes: 78 additions & 73 deletions README.md
@@ -1,73 +1,78 @@
# Automated-Email
> Building an email reader that classifies whether an email is important or not

### Summary

Every day, our department gets over 500 Google Alerts about Colby alumni. Google Alerts are emails containing links to articles that match the names of our constituents. The goal of this project is to build a classifier that identifies whether the person mentioned in an article is affiliated with Colby.

My model currently achieves an accuracy of 92%.

The process breaks down into four phases: retrieving the email, scraping the websites linked in the alert, classifying the email based on extracted features, and finally displaying the results in a graphical user interface (GUI).
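As a sketch of how the four phases chain together (the function names here are illustrative placeholders, not the project's actual API):

```python
# Sketch of the four-phase pipeline. The phase functions are passed in as
# callables; every name here is illustrative, not the project's actual API.
def run_pipeline(fetch_emails, scrape_article, score_article, display):
    results = []
    for email in fetch_emails():
        # a single Google Alert may carry several article links,
        # so each link is processed separately
        for link in email["links"]:
            text = scrape_article(link)
            results.append(score_article(text))
    display(results)
    return results
```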

### Retrieving the Email

To connect Python to Gmail, I used the imaplib module. All the methods I use to communicate with my inbox are in the EmailReader.py file.
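A minimal sketch of what such an imaplib connection can look like; the sender address, mailbox name, and helper names are my assumptions, not necessarily what EmailReader.py actually does:

```python
import imaplib

def alert_query(sender="googlealerts-noreply@google.com"):
    """Build IMAP SEARCH criteria for unread mail from a sender.
    The sender address is an assumed default, not confirmed by the project."""
    return f'(UNSEEN FROM "{sender}")'

def fetch_alert_ids(user, password, host="imap.gmail.com"):
    """Log in over SSL and return the message ids of unread alert emails."""
    conn = imaplib.IMAP4_SSL(host)
    conn.login(user, password)
    conn.select("INBOX")
    # SEARCH returns data like [b'1 2 3']; split it into individual ids
    _, data = conn.search(None, alert_query())
    return data[0].split()
```

Each returned id can then be passed to `conn.fetch` to retrieve the message body for the scraping phase.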

### Web Scraping

To collect information about each article, I used the requests and BeautifulSoup packages to scrape websites. Because some emails contain multiple links, I treat each link separately and merge the results in the classification phase. All the scraping methods can be found in scraper.py.
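The internals of scraper.py aren't shown here, but a minimal requests + BeautifulSoup sketch of the per-link step might look like this (function names are illustrative):

```python
from bs4 import BeautifulSoup

def extract_words(html):
    """Strip tags and return the lower-cased words of a page."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator=" ").lower().split()

def scrape(url, timeout=10):
    """Fetch one article link from an alert and return its words."""
    import requests  # imported here so the text helper works without network deps
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    return extract_words(resp.text)
```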

### Classification

After retrieving the words from the article, I devised three scoring metrics:
- Occupation Score
- Occupation Score Adjusted
- Colby Score
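The README names these metrics but not their formulas. One plausible reading, treating each metric as a keyword count (both term sets below are made up for illustration):

```python
# Illustrative term sets -- NOT the project's actual keyword lists.
OCCUPATION_TERMS = {"professor", "engineer", "ceo", "doctor"}
COLBY_TERMS = {"colby", "waterville", "mules"}

def occupation_score(words):
    """Count occupation-related keywords in the article."""
    return sum(w in OCCUPATION_TERMS for w in words)

def occupation_score_adjusted(words):
    """Normalize the occupation score by article length."""
    return occupation_score(words) / len(words) if words else 0.0

def colby_score(words):
    """Count Colby-related keywords; per the 1.2.0 changelog,
    only words of at least 3 characters are scored."""
    return sum(w in COLBY_TERMS for w in words if len(w) >= 3)
```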

### Frontend GUI

I also built a GUI that displays the model's results:


<div style="display: block; float: left">
<img src="visualization/GUI.png" width="400" height="250">
</div>


## Future Endeavors

I have been studying natural language processing on my own, which I hope to incorporate into this project to increase the accuracy.

## Changelog

Beta version completed

12-5-18

- Added sorting capabilities to the dataframe
- Changed the master df to no longer carry an extra index column

Version 1.2.0

WINDOWS VERSION

- basic GUI completed
- scoring accuracy of 92%
- features 3 scoring metrics
- original text words part of dataset
- preliminary analysis of text data from links (e.g., word lengths)

- changed double click to right click for letting the user access links
- added pathlib as a dependency to handle paths on both Mac and Windows
- changed the naming convention for logs from colons (:) to periods (.)
- removed SigAlarm because it doesn't work on Windows; threading should be used instead

- Fixed bug (binds not working): set focus AFTER displaying the graph

- Changed scoring metric: words must now be at least 3 characters long


Version 1.3.0
10-18-18

WINDOWS VERSION

- decoded email id into string data type
  - it will be encoded back to bytes later
- made constituent id an int data type
- added a GUI option to enable automation and set the threshold
3 changes: 1 addition & 2 deletions scraper.py
@@ -25,7 +25,7 @@ def __init__(self):

try:
path = 'datasets/OrganizationRelationships_NickNamesAdded_5.24.2018.csv'
self.constituents_df = pd.read_csv(path, index_col=0, low_memory=False)
self.constituents_df = pd.read_csv(path, low_memory=False)
except FileNotFoundError:
warnings.warn('unable to find Constituents data. Please use set_constituents_path to locate the datafile')

@@ -433,7 +433,6 @@ def create_scores_data(self, df, label=None, split_up_links=False):
df[['Occupation score', 'Occupation score adjusted', 'Colby score']] = pd.DataFrame(scores, index=df.index)
df['constituent_id'] = constituent_id

print(df)
print('finished adding scores')

# if given a label (for training) add label as a column
61 changes: 51 additions & 10 deletions tkinter-skeleton.py
@@ -319,9 +319,15 @@ def setBindings(self):
# binds the logs listbox to onselect
self.logs_lbox.bind('<<ListboxSelect>>', self.onselect)

# binds double click to bottom table
# binds double click to doubleClick function
self.root.bind('<Double-Button-1>', self.doubleClick)

# binds left click to handleLeftMouseClick function
self.root.bind('<Button-1>', self.handleLeftMouseClick)

# binds right click to handleRightMouseClick function
self.root.bind('<Button-3>', self.handleRightMouseClick)

# binds control-e to switch the label of an element in the bottom table
self.bottomFrame.bind('<Control-e>', self.switchLabel)
self.bottomFrame.bind('<Control-w>', self.switchMovedState)
@@ -332,7 +338,6 @@ def handleQuit(self, event=None):
print('Terminating')
self.root.destroy()


################################ Build Tables and Graphs to Display ############################

def buildBottomTable(self):
@@ -342,6 +347,8 @@ def buildBottomTable(self):

# delete previous tables, if ones exist
self.refreshFrame(self.bottomFrame)
# this isn't necessary, only to prevent left shifting of the main frames
self.refreshFrame(self.rightmainframe)

self.tree = ttk.Treeview(self.bottomFrame)

@@ -376,11 +383,14 @@ def process_text(string, length=50, total_string_size=100):
string = string[:total_string_size]
return '\n'.join(textwrap.wrap(string, length)) + '...'

# dictionary that shortens the long column names from the main dataframe for display in the treeview
self.tree_column_shortened = {'first_name': 'first',
'last_name': 'last',
'time': 'date',
'constituent_id': 'id'}

# preprocesses the dataframe
df = df.rename(index=int, columns={'first_name': 'first',
'last_name': 'last',
'time': 'date',
'constituent_id': 'id'}) # make the columns shorter
df = df.rename(index=int, columns=self.tree_column_shortened) # make the columns shorter
df['text'] = df['text'].apply(process_text) # limits the characters in the text column
df['confidence'] = df['confidence'].apply(lambda x: np.around(x, 3)) # rounds the confidence
df['date'] = df['date'].apply(lambda x: x.split()[0]) # only shows the date and not the time
@@ -417,7 +427,6 @@ def buildScoresTable(self, curItem):
'''
passes in the current selected treeview row to display the score table
'''
print(curItem)
row_num = self.tree.item(curItem)['text']
scores = self.df[['Occupation score', 'Occupation score adjusted', 'Colby score']].iloc[row_num]

@@ -538,7 +547,6 @@ def embedChart(self, master, fig=None, side=tk.TOP, legends=None, title=None):
def onselect(self, event):
w = event.widget


# for when the logs listbox is selected
if w == self.logs_lbox:
try:
@@ -563,6 +571,7 @@ def onselect(self, event):

# for when the bottom table is selected
elif w == self.tree:

self.refreshFrame(self.rightmainframe)

# displays the scores graph
@@ -583,18 +592,50 @@ def onselect(self, event):
# builds the label that displays the constituent's occupation
self.buildOccsLabel(curItem)

def handleLeftMouseClick(self, event=None):
w = event.widget
x, y = event.x, event.y

# if the user left clicks the heading of a tree -> sort it by that order
if w == self.tree and self.tree.identify_region(x, y) == 'heading':
self.sort_dataframe(x, ascending=True)

def doubleClick(self, event):
def handleRightMouseClick(self, event=None):
w = event.widget
x, y = event.x, event.y

if w == self.tree:
print(w, self.tree.identify_region(x, y))
# if the user right clicks the heading of a tree -> sort that column in descending order
if w == self.tree and self.tree.identify_region(x, y) == 'heading':
self.sort_dataframe(x, ascending=False)

def doubleClick(self, event=None):
w = event.widget
x, y = event.x, event.y

# Double click to access a link inside the bottom_tree widget
if w == self.tree and self.tree.identify_region(x, y) == 'cell':
print('double clicked')

# selects scores from the current row
curItem = self.tree.focus()

self.openUrl(curItem)


# sorts the dataframe by the column name
# prereq: the user clicked a heading on self.tree
def sort_dataframe(self, mouseX, ascending):
label = self.tree.heading(self.tree.identify_column(mouseX))['text']

# if a label was shortened, recover the original column name from the shortening dictionary
for k, v in self.tree_column_shortened.items():
if label == v:
label = k

self.df = self.df.sort_values([label], ascending=ascending).reset_index(drop=True)
self.buildBottomTable()

# opens the url of the currently selected item
def openUrl(self, curItem):
row_num = self.tree.item(curItem)['text']