added summarization to blurb #57

Closed
wants to merge 2 commits into from

studentbrad
Collaborator

@studentbrad studentbrad commented Jan 17, 2020

Blurb Summarization 🥇

We needed some way of formatting and shortening the blurb.
For this I added summarization from gensim.
gensim can produce quick extractive summaries capped at a designated number of words.

Context of change

  • Software (software that runs on the PC)
  • Library (library that runs on the PC)
  • Tool (tool that assists coding development)
  • Other

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Manual testing on my PC.
    Running the JobFunnel software and checking the .csv, I can see that the summarization does in fact work.

Checklist:

  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • Any dependent changes have been merged and published in downstream modules.

@bunsenmurder
Collaborator

bunsenmurder commented Jan 17, 2020

I really like this idea, but we should definitely be doing exploratory data analyses and sharing the results before implementing features that use machine learning like this one. Implementing this would reduce our duplicate filter accuracy a lot. We would need to save all of our full blurbs in a separate file from the master list for this to work without affecting duplicate filtering.

@PaulMcInnis
Owner

love this feature @studentbrad !

I agree with @bunsenmurder , we should retain the complete text somewhere so that the data can be used for other things including the similarity filter.

Perhaps we can just have blurb be the shortened text and add a new column to store the complete scraped text?

This greatly improves usability when reading through job postings. We can always just hide the column of raw/scraped text, though storing it elsewhere would be a cleaner option.

@studentbrad
Collaborator Author

@bunsenmurder @PaulMcInnis I have taken your advice, but this may be harder to implement than I thought. I want to preserve what is now called the description (job['description']), but I do not want it in the .csv; I only want the blurb there. This conflicts with how we use the .csv: it is loaded as a job dictionary, so any column not included in the .csv (here, the description) is left blank. I am unsure how to handle this, because we need to parse the .csv to update statuses. It can be done using a combination of .csv and pickle parsing, but that complicates the project. Thoughts?

@markkvdb
Collaborator

Maybe something like a relational database: make an ID for every job posting and store the original text under that ID, linked to the job.
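
As a rough illustration of that idea, here is a minimal sketch using Python's built-in sqlite3 module. The table name `descriptions`, its columns, and the helper names are assumptions made up for this example; nothing like this exists in the project yet.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a hypothetical description store keyed by job ID."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS descriptions ("
        "job_id TEXT PRIMARY KEY, description TEXT NOT NULL)"
    )
    return conn


def put_description(conn, job_id, description):
    """Insert the full scraped text for a job, replacing any earlier copy."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO descriptions (job_id, description) "
            "VALUES (?, ?)",
            (job_id, description),
        )


def get_description(conn, job_id):
    """Return the stored text for job_id, or None if nothing is stored."""
    row = conn.execute(
        "SELECT description FROM descriptions WHERE job_id = ?", (job_id,)
    ).fetchone()
    return row[0] if row else None
```

The .csv would then only need the job ID and the short blurb; the full text is looked up by ID when it is needed.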

@bunsenmurder
Collaborator

bunsenmurder commented Jan 17, 2020

The idea that @markkvdb had could work, as we could save every job description per id in a master 'database' file.

Then before running our similarity filter, we match jobs in the master list to our database and replace the blurb with the full description in our dictionary object. Then after the similarity filter runs, we just apply the summarizer on the final product.
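
The ordering described above could be sketched roughly as follows. This assumes jobs are plain dicts with `id` and `blurb` keys and that the stored descriptions are available as an ID-to-text mapping; the function names and the `first_words` stand-in summarizer are hypothetical and are not the gensim summarizer from this PR.

```python
def restore_descriptions(jobs, store):
    """Before filtering: replace each blurb with its stored full description.

    Jobs with no stored description keep whatever blurb they already have.
    """
    for job in jobs:
        job["blurb"] = store.get(job["id"], job["blurb"])
    return jobs


def summarize_blurbs(jobs, summarize):
    """After filtering: shorten every blurb with the supplied summarizer."""
    for job in jobs:
        job["blurb"] = summarize(job["blurb"])
    return jobs


def first_words(text, limit=5):
    """Stand-in summarizer (not gensim): keep only the first `limit` words."""
    return " ".join(text.split()[:limit])
```

The similarity filter would run between the two calls, so it always compares full descriptions rather than summaries.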

Owner

@PaulMcInnis PaulMcInnis left a comment

This makes sense, we should think about moving the complete descriptions into some kind of database/separate csv in the future.

One thing - will this break backwards compatibility for existing master lists?

@@ -1 +1 @@
-__version__ = '2.1.0'
+__version__ = '2.1.1'
Owner

Will need to up this if we merge the pip fix first

@studentbrad
Collaborator Author

studentbrad commented Jan 21, 2020

I can maintain backward compatibility. The description will be stored elsewhere, and the blurb will become a summarised version of the description in masterlist.csv. Sometimes the description cannot be summarised; in that case the description becomes the blurb, i.e. job['blurb'] = job['description']. Backward compatibility works with the opposite logic: if we read masterlist.csv and find no description stored elsewhere for that job ID, the blurb becomes the description, i.e. job['description'] = job['blurb'].
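
A minimal sketch of both fallbacks, under a few assumptions: the summarizer is passed in as a callable (with gensim 3.x this could be a thin wrapper around `gensim.summarization.summarize`), it signals failure either by raising `ValueError` or by returning an empty string, and stored descriptions are available as an ID-to-text mapping. The names `make_blurb` and `load_description` are hypothetical.

```python
def make_blurb(description, summarize):
    """Return a summary of `description`, falling back to the full text.

    Assumption: `summarize` raises ValueError or returns "" when the text
    cannot be summarised (for example, when it is too short).
    """
    try:
        summary = summarize(description)
    except ValueError:
        summary = ""
    return summary if summary else description


def load_description(job, store):
    """Backward-compatible lookup for a job read from masterlist.csv.

    If no description is stored for this job ID (an older master list),
    the existing blurb is treated as the description.
    """
    return store.get(job["id"], job["blurb"])
```

With this shape, older master lists keep working unchanged: their blurbs are already the full text, so the lookup simply returns them.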

@studentbrad
Collaborator Author

I will close this PR for now until I have found a solution. Anyone is free to make suggestions in the meantime.

4 participants