added summarization to blurb #57

studentbrad · 2020-01-17T04:08:49Z

Blurb Summarization 🥇

We needed a way to format and shorten the blurb.
For this I added summarization from gensim.
gensim can produce quick summaries capped at a designated number of words.

Context of change

Software (software that runs on the PC)
Library (library that runs on the PC)
Tool (tool that assists coding development)
Other

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Manual testing on my PC.
Running the JobFunnel software and checking the .csv, I can see that the summarization does in fact work.

Checklist:

I have performed a self-review of my own code.
I have commented my code, particularly in hard-to-understand areas.
I have made corresponding changes to the documentation.
My changes generate no new warnings.
I have added tests that prove my fix is effective or that my feature works.
New and existing unit tests pass locally with my changes.
Any dependent changes have been merged and published in downstream modules.

bunsenmurder · 2020-01-17T06:28:44Z

I really like this idea, but we should definitely be doing exploratory data analyses and sharing the results before implementing features that use machine learning like this one. Implementing this would reduce our duplicate filter accuracy a lot. We would need to save all of our full blurbs in a separate file from the master list for this to work without affecting duplicate filtering.

PaulMcInnis · 2020-01-17T13:45:10Z

love this feature @studentbrad !

I agree with @bunsenmurder , we should retain the complete text somewhere so that the data can be used for other things including the similarity filter.

Perhaps we can just have blurb be the shortened text and add a new column to store the complete scraped text?

This greatly improves usability when reading through job postings. We can always just hide the column of raw/scraped text, though storing it elsewhere would be a cleaner option.

studentbrad · 2020-01-17T22:03:15Z

@bunsenmurder @PaulMcInnis I have taken your advice, but this may be harder to implement than I thought. I want to preserve what is now called the description (aka job['description']). However, I do not want this in the .csv; I only want the blurb. This conflicts with how we use the .csv: it is loaded as a job dictionary, so any column not included in the .csv (here, the description) is left blank. I am unsure how to handle this because we need to parse the .csv to update statuses. It can be done using a combination of .csv and pickle parsing, but that complicates the project. Thoughts?

markkvdb · 2020-01-17T22:12:14Z

Maybe something like a relational database: make an ID for every job posting and store the original text under this ID, linked to the job.
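
A minimal sketch of such an ID-keyed description store, using Python's built-in sqlite3 module. The table name `descriptions`, its column names, and the helper functions are hypothetical and not part of JobFunnel; this only illustrates keeping the full text out of the .csv while still being able to look it up by job ID.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the description store. Table/column names are assumptions."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS descriptions ("
        "job_id TEXT PRIMARY KEY, description TEXT NOT NULL)"
    )
    return conn


def save_description(conn, job_id, description):
    """Store the full scraped text for one job, overwriting any earlier row."""
    conn.execute(
        "INSERT OR REPLACE INTO descriptions (job_id, description) VALUES (?, ?)",
        (job_id, description),
    )
    conn.commit()


def load_description(conn, job_id):
    """Return the stored text for job_id, or None if nothing was stored."""
    row = conn.execute(
        "SELECT description FROM descriptions WHERE job_id = ?", (job_id,)
    ).fetchone()
    return row[0] if row else None
```

With this layout the master list keeps only the blurb column, and a missing row (`None`) signals a job that predates the store.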

bunsenmurder · 2020-01-17T23:00:25Z

The idea that @markkvdb had could work, as we could save every job description per id in a master 'database' file.

Then before running our similarity filter, we match jobs in the master list to our database and replace the blurb with the full description in our dictionary object. Then after the similarity filter runs, we just apply the summarizer on the final product.
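
The order of operations described above could be sketched roughly as follows. This assumes the master list is a dict of job dicts keyed by job ID and that the stored descriptions are a plain dict keyed the same way; `similarity_filter` and `summarizer` are placeholder callables passed in by the caller, not JobFunnel's actual functions.

```python
def restore_descriptions(jobs, descriptions):
    """Return copies of the jobs with each blurb swapped for its stored full text.

    Jobs without a stored description keep their existing blurb.
    """
    restored = {}
    for job_id, job in jobs.items():
        full = dict(job)  # shallow copy so the master list is not modified
        full["blurb"] = descriptions.get(job_id, job["blurb"])
        restored[job_id] = full
    return restored


def summarize_blurbs(jobs, summarizer):
    """Replace each blurb with its summary, in place."""
    for job in jobs.values():
        job["blurb"] = summarizer(job["blurb"])
    return jobs


def run_pipeline(jobs, descriptions, similarity_filter, summarizer):
    """Filter on full text first, then summarize only the jobs that survive."""
    full_jobs = restore_descriptions(jobs, descriptions)
    kept = similarity_filter(full_jobs)
    return summarize_blurbs(kept, summarizer)
```

Running the filter before the summarizer means duplicate detection always sees the complete text, which is the accuracy concern raised earlier in this thread.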

PaulMcInnis

This makes sense, we should think about moving the complete descriptions into some kind of database/separate csv in the future.

One thing - will this break backwards compatibility for existing master lists?

PaulMcInnis · 2020-01-21T15:44:43Z

jobfunnel/__init__.py

@@ -1 +1 @@
-__version__ = '2.1.0'
+__version__ = '2.1.1'


Will need to up this if we merge the pip fix first

studentbrad · 2020-01-21T17:23:02Z

I can maintain backward compatibility. The description will be stored elsewhere, but the blurb will become a summarised version of the description in masterlist.csv. Sometimes the description cannot be summarised; in this case, the description becomes the blurb, i.e. job['blurb'] = job['description']. Maintaining backward compatibility works with the opposite logic: if we read masterlist.csv and find no description stored elsewhere for that job ID, then the blurb becomes the description, i.e. job['description'] = job['blurb'].
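
The two fallbacks can be sketched as below. The `job['id']` key, the function names, the default `word_count`, and the `stored_descriptions` dict are assumptions for illustration; `summarizer` is any callable that accepts a `word_count` keyword, and the sketch treats a raised `ValueError` or an empty result as "cannot be summarised".

```python
def make_blurb(description, summarizer, word_count=50):
    """Summarise the description, falling back to the full text on failure."""
    try:
        summary = summarizer(description, word_count=word_count)
    except ValueError:
        # Assumed failure mode: the summarizer rejects text that is too short
        summary = ""
    return summary if summary else description


def attach_description(job, stored_descriptions):
    """Backward compatibility: a job with no stored description reuses its blurb."""
    job_id = job["id"]  # hypothetical key name for the job ID
    job["description"] = stored_descriptions.get(job_id, job["blurb"])
    return job
```

As far as I know, gensim 3.x's `gensim.summarization.summarize` takes a `word_count` keyword and could be passed in as `summarizer`, but that should be checked against the installed version.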

studentbrad · 2020-01-22T04:18:54Z

I will close this PR for now until I have found a solution. Anyone is free to make suggestions in the meantime.

added summarization to blurb

183e899

studentbrad requested review from bunsenmurder, PaulMcInnis and markkvdb January 17, 2020 04:30

studentbrad added the enhancement label Jan 17, 2020

added job description field

a203211

PaulMcInnis requested changes Jan 21, 2020

studentbrad closed this Jan 22, 2020

PaulMcInnis mentioned this pull request Sep 13, 2020

Apply NLP to condense description #94

Closed
