added summarization to blurb #57
I really like this idea, but we should definitely be doing exploratory data analyses and sharing the results before implementing features that use machine learning like this one. Implementing this would reduce our duplicate filter accuracy a lot. We would need to save all of our full blurbs in a separate file from the master list for this to work without affecting duplicate filtering.
love this feature @studentbrad ! I agree with @bunsenmurder , we should retain the complete text somewhere so that the data can be used for other things, including the similarity filter. Perhaps we can just have blurb be the shortened text and add a new column to store the complete scraped text? This greatly improves usability when reading thru job postings, and we can always just hide the column of raw/scraped text, though storing it elsewhere would be a cleaner option.
@bunsenmurder @PaulMcInnis I have taken your advice, but this may be harder to implement than I thought. I want to preserve what is now called the
Maybe something like a relational database. Give every job posting an ID and store the original text under that ID, linked to the job.
The idea that @markkvdb had could work, as we could save every job description per Then before running our similarity filter, we match jobs in the master list to our database and replace the blurb with the full description in our dictionary object. Then after the similarity filter runs, we just apply the summarizer on the final product.
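A rough sketch of that flow, in case it helps the discussion. Everything here is an assumption for illustration: a plain dict stands in for the database, the field names `id`, `description` and `blurb` are hypothetical, and the summarizer is passed in as any callable rather than tied to a specific library. It is not JobFunnel's actual code.

```python
from typing import Callable, Dict, List

Job = Dict[str, str]


def store_descriptions(jobs: List[Job], store: Dict[str, str]) -> None:
    """Save each scraped job's full description under its id."""
    for job in jobs:
        store[job["id"]] = job["description"]


def restore_descriptions(master_list: List[Job], store: Dict[str, str]) -> None:
    """Swap each blurb for the stored full description, when one exists."""
    for job in master_list:
        # Jobs missing from the store keep whatever blurb they already have.
        job["blurb"] = store.get(job["id"], job["blurb"])


def summarise_blurbs(master_list: List[Job], summariser: Callable[[str], str]) -> None:
    """Shorten every blurb; meant to run only after the similarity filter."""
    for job in master_list:
        job["blurb"] = summariser(job["blurb"])
```

The intended order would be `restore_descriptions`, then the existing similarity filter on the full text, then `summarise_blurbs` on whatever survives the filter.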
This makes sense; we should think about moving the complete descriptions into some kind of database or separate CSV in the future.
One thing - will this break backwards compatibility for existing master lists?
@@ -1 +1 @@
-__version__ = '2.1.0'
+__version__ = '2.1.1'
Will need to bump this if we merge the pip fix first.
I can maintain backward compatibility. The description will be stored elsewhere but the blurb will become a summarised version of the description in masterlist.csv. Sometimes the description cannot be summarised. In this case, the description becomes the blurb aka.
I will close this PR for now until I have found a solution. Anyone is free to make suggestions in the meantime.
Blurb Summarization 🥇
We needed some way of formatting and shortening the blurb.
For this I added summarization from gensim.
gensim can produce a quick summary capped at a designated number of words.
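For reference, a minimal sketch of what such a call can look like. It assumes gensim older than 4.0, since the `gensim.summarization` module was removed in 4.0. The helper name `shorten_blurb` and the default of 50 words are made up for illustration and are not taken from this PR's diff. When the text cannot be summarised, the full description is returned unchanged.

```python
try:
    # gensim.summarization only exists in gensim < 4.0.
    from gensim.summarization import summarize
except ImportError:
    summarize = None


def shorten_blurb(description: str, word_count: int = 50) -> str:
    """Return a summary of roughly `word_count` words, or the original text."""
    if summarize is None:
        # No summarizer available: keep the full description.
        return description
    try:
        summary = summarize(description, word_count=word_count)
    except ValueError:
        # gensim raises ValueError when the input has only one sentence.
        return description
    # An empty summary means nothing could be extracted; keep the original.
    return summary or description
```

Falling back to the unmodified description keeps the blurb column populated even for very short postings.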
Context of change
Type of change
How Has This Been Tested?
Running the JobFunnel software and checking the .csv, I can see that the summarization does in fact work.
Checklist: