Adding backend and client code #103

Open · wants to merge 28 commits into base: main

Commits (28)
df74b33
Project Proposal Added
OjasviAgarwal Oct 23, 2022
0edcc29
Added updated project proposal
OjasviAgarwal Oct 23, 2022
2c689fb
Delete Job Recommendation System.pdf
OjasviAgarwal Oct 23, 2022
ac6f757
Update 3
OjasviAgarwal Oct 23, 2022
cab33e1
Delete Job Recommendation System (2).pdf
OjasviAgarwal Oct 23, 2022
218d5ec
Add files via upload
OjasviAgarwal Nov 12, 2022
f9d6b2f
Add files via upload
OjasviAgarwal Nov 12, 2022
4a9045a
Delete TIS MidTerm Report (1).pdf
OjasviAgarwal Nov 12, 2022
57a4080
adding linkedin web scraper
Nov 12, 2022
248304f
adding client and backend files
Dec 7, 2022
e0c9967
updated requirements.txt
dummyuser2j Dec 7, 2022
eacdaef
removed comments
Dec 7, 2022
e3e4bc0
Merge branch 'feature/add_web_scrapper'
dummyuser2j Dec 7, 2022
a15f982
Merge branch 'feature/add_web_scrapper'
dummyuser2j Dec 7, 2022
5f5a730
Location parameter removed
Dec 7, 2022
6b56280
update frontend and remove duplicates
Dec 7, 2022
1ea301d
merging code with latest frontend
Dec 7, 2022
ef96571
Merge branch 'feature/add_web_scrapper' of https://github.com/OjasviA…
dummyuser2j Dec 7, 2022
85214ae
Merge pull request #2 from OjasviAgarwal/feature/add_web_scrapper
OjasviAgarwal Dec 7, 2022
1917a28
Merge branch 'main' of https://github.com/OjasviAgarwal/JobRecommenda…
dummyuser2j Dec 7, 2022
7dc0d33
update readme
dummyuser2j Dec 7, 2022
17d5925
updating github link
Dec 7, 2022
770c526
Merge branch 'main' of https://github.com/OjasviAgarwal/JobRecommenda…
Dec 7, 2022
4b4b6b8
update readme part 2
dummyuser2j Dec 7, 2022
313726a
Merge branch 'main' of https://github.com/OjasviAgarwal/JobRecommenda…
dummyuser2j Dec 7, 2022
68ce33d
Update README.md
OjasviAgarwal Dec 7, 2022
267d933
Add files via upload
OjasviAgarwal Dec 7, 2022
f52f55d
Final project documentation added
OjasviAgarwal Dec 7, 2022
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
venv
__pycache__
Binary file added Job Recommender Project Documentation.pdf
Binary file not shown.
37 changes: 35 additions & 2 deletions README.md
@@ -1,3 +1,36 @@
# CourseProject
# JobRecommendationSystem

The topic of our project is ‘Job Recommendation System’. As students ourselves, we know how difficult it is to find the right jobs based on our resumes: currently we end up reading through most job descriptions and manually checking whether the skills they mention match our own. We solve this problem by recommending jobs based on a resume uploaded by the user, so the manual keyword matching of skills is no longer needed.

Demo Presentation and Video: https://drive.google.com/file/d/1jN_jI4-0qTC7cz_S_diDNiNH6HTBZ1zq/view?usp=share_link


# Environment Setup:

- Node.js: go to https://nodejs.org/en/ and download version 18.12.1 LTS
- Python 3.0+
- virtualenv (optional)

# Procedure to Run Frontend

In a dedicated terminal:

    cd client
    npm install
    npm start

Then open http://localhost:3001/ in a browser.


# Procedure to Run Backend

In a separate dedicated terminal:

    cd backend
    pip install -r requirements.txt
    python server.py

If needed, install these packages separately to avoid issues while running the project:

    pip install nltk
    pip install PyPDF2



Please fork this repository and paste the github link of your fork on Microsoft CMT. Detailed instructions are on Coursera under Week 1: Course Project Overview/Week 9 Activities.
Binary file added TIS Project Progress Report.pdf
Binary file not shown.
2 changes: 2 additions & 0 deletions backend/.gitignore
@@ -0,0 +1,2 @@
venv
__pycache__
Binary file added backend/Job Recommendation System.pdf
Binary file not shown.
1,075 changes: 1,075 additions & 0 deletions backend/LinkedinJobs.csv

Large diffs are not rendered by default.

63 changes: 63 additions & 0 deletions backend/README.md
@@ -0,0 +1,63 @@
# Job Recommendation By Skill Match

A project to create a simple skill-keyword-based job recommendation engine that matches keywords from a resume to job descriptions.

## Install

### Install virtualenv

virtualenv is a tool for creating isolated Python environments. Think of it as a cleanroom, isolated from other versions of Python and its libraries.

Enter this command into a terminal:

    sudo pip install virtualenv

or, if you get an error:

    sudo -H pip install virtualenv

### Start virtualenv

Navigate to where you want to store your code and create a new directory:

    mkdir my_project && cd my_project

Inside the `my_project` folder, create a new virtualenv:

    virtualenv env

Activate the virtualenv:

    source env/bin/activate

This project requires **Python 3.0+** and the following Python libraries:

- [NumPy](http://www.numpy.org/)
- [Pandas](http://pandas.pydata.org)
- [NLTK Stopwords](https://www.nltk.org/book/ch02.html)
- [Selenium](https://www.seleniumhq.org/)
- [PyPDF2](https://pythonhosted.org/PyPDF2/)

Install them all at once with `pip install -r requirements.txt`.

## Code

Code is provided in
- job_recommendation.py
- linkedin_scrapper.py
- skill_keyword_match.py
- web_scrapper.py

## Run

In a terminal or command window, navigate to the top-level project directory `TIS_Job_Project/` and run one of the following commands:

Search and match jobs in all cities:
```python indeed_job_recommendation.py```

Search and match jobs in one city, e.g. Vancouver,BC:
```python indeed_job_recommendation.py Vancouver,BC```

When it finishes successfully, it will print 'File of recommended jobs saved'.

## Data
Data collected from TBD


Binary file added backend/chromedriver
Binary file not shown.
10 changes: 10 additions & 0 deletions backend/config.py
@@ -0,0 +1,10 @@
# -*- coding: utf-8 -*-
JOBS_LINKS_JSON_FILE = r'./data/indeed_jobs_links.json'
LINKED_JOBS_INFO_CSV_FILE = "LinkedinJobs.csv"
JOBS_INFO_JSON_FILE = r'./data/indeed_jobs_info.json'
RECOMMENDED_JOBS_FILE = r'./data/recommended_jobs'
WEBDRIVER_PATH = r'D:\chromedriver\chromedriver.exe'
JOB_LOCATIONS = ['Vancouver,BC', 'Toronto,ON', 'Montréal,QC', 'Ottawa,ON', 'Calgary,AB', 'Edmonton,AB']
JOB_SEARCH_WORDS = '"data scientist"+OR+"data engineer"+OR+"data analyst"'
DAY_RANGE = 30
SAMPLE_RESUME_PDF = r'./data/test_docs/Resume.pdf'
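For context, `web_scrapper.py` is imported by `job_recommendation.py` but is not included in this diff, so the sketch below is only a hypothetical illustration of how these settings might combine into an Indeed search URL, not the PR's actual code:

```python
# Hypothetical sketch only: the real URL construction lives in web_scrapper.py,
# which is not part of this diff, and may differ from this.
import config

url = ('https://ca.indeed.com/jobs?q=' + config.JOB_SEARCH_WORDS +
       '&l=' + config.JOB_LOCATIONS[0] +
       '&fromage=' + str(config.DAY_RANGE))
print(url)  # the configured roles in Vancouver,BC, posted within the last 30 days
```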
Binary file added backend/data/Ojasvi_Agarwal_Resume.pdf
Binary file not shown.
2,938 changes: 2,938 additions & 0 deletions backend/data/indeed_jobs_info.json

Large diffs are not rendered by default.

1,470 changes: 1,470 additions & 0 deletions backend/data/indeed_jobs_links.json

Large diffs are not rendered by default.

283 changes: 283 additions & 0 deletions backend/data/recommended_jobs.csv

Large diffs are not rendered by default.

169 changes: 169 additions & 0 deletions backend/data/recommended_jobsVancouver,BC.csv

Large diffs are not rendered by default.

Binary file added backend/data/test_docs/Resume.pdf
Binary file not shown.
17 changes: 17 additions & 0 deletions backend/job_recommendation.py
@@ -0,0 +1,17 @@
# -*- coding: utf-8 -*-
import sys
import config, web_scrapper
from skill_keyword_match import skill_keyword_match
import nltk
nltk.download('stopwords')

def main(location=''):
    # An empty location string searches across all configured cities
    jobs_info = web_scrapper.get_jobs_info(location)
    skill_match = skill_keyword_match(jobs_info)
    skill_match.extract_jobs_keywords()
    resume_skills = skill_match.extract_resume_keywords(config.SAMPLE_RESUME_PDF)
    top_job_matches = skill_match.cal_similarity(resume_skills.index, location)
    top_job_matches.drop_duplicates(subset=['location', 'company', 'title'], inplace=True)
    print('File of recommended jobs saved')
    return top_job_matches

if __name__ == '__main__':
    # Optional command-line argument: a single city, e.g. Vancouver,BC
    main(sys.argv[1] if len(sys.argv) > 1 else '')
70 changes: 70 additions & 0 deletions backend/linkedin_scrapper.py
@@ -0,0 +1,70 @@
import os
import logging
import config
from linkedin_jobs_scraper import LinkedinScraper
from linkedin_jobs_scraper.events import Events, EventData, EventMetrics
from linkedin_jobs_scraper.query import Query, QueryOptions, QueryFilters
from linkedin_jobs_scraper.filters import RelevanceFilters, TimeFilters, TypeFilters, ExperienceLevelFilters, RemoteFilters

import pandas as pd

chrome_driver_path = os.path.join(os.path.dirname(__file__), "chromedriver")
jobs_data = []

logging.basicConfig(level=logging.DEBUG)

def on_data(data: EventData):
    # Collect each scraped job posting into a list of dicts
    print('[ON_DATA]', data.title, data.company, data.company_link, data.date, data.link, data.insights, len(data.description))
    jobs_data.append({'link': data.link, 'location': data.location, 'title': data.title, 'company': data.company, 'salary': '', 'desc': data.description})

def on_metrics(metrics: EventMetrics):
    print('[ON_METRICS]', str(metrics))

def on_error(error):
    print('[ON_ERROR]', error)

def on_end():
    print('[ON_END]')

scraper = LinkedinScraper(
    chrome_executable_path=chrome_driver_path,
    chrome_options=None,
    headless=True,
    max_workers=1,
    slow_mo=1.5,
    page_load_timeout=20
)

scraper.on(Events.DATA, on_data)
scraper.on(Events.ERROR, on_error)
scraper.on(Events.END, on_end)

queries = [
    Query(
        query='Software Developer',
        options=QueryOptions(
            locations=['United States', 'Canada'],
            apply_link=True,
            limit=5,
            filters=QueryFilters(
                relevance=RelevanceFilters.RECENT,
                time=TimeFilters.MONTH,
                type=[TypeFilters.FULL_TIME, TypeFilters.INTERNSHIP],
                experience=None,
            )
        )
    ),
]

def web_scrape():
    # Run the scraper and persist the results so later calls can reuse the CSV
    scraper.run(queries)
    df = pd.DataFrame(jobs_data, columns=['link', 'location', 'title', 'company', 'salary', 'desc'])
    df.to_csv("LinkedinJobs.csv", index=False)
    return df

def get_linkedin_jobs_info():
    # Reuse the cached CSV when it exists; otherwise scrape LinkedIn first
    exists = os.path.isfile(config.LINKED_JOBS_INFO_CSV_FILE)
    if exists:
        df = pd.read_csv(config.LINKED_JOBS_INFO_CSV_FILE)
    else:
        df = web_scrape()
    return df
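A quick way to exercise the caching logic above (a minimal sketch, assuming the bundled `chromedriver` binary matches the locally installed Chrome):

```python
# First call scrapes LinkedIn and writes LinkedinJobs.csv;
# subsequent calls just read the cached CSV.
import linkedin_scrapper

df = linkedin_scrapper.get_linkedin_jobs_info()
print(df[['title', 'company', 'location']].head())
```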
29 changes: 29 additions & 0 deletions backend/requirements.txt
@@ -0,0 +1,29 @@
async-generator==1.10
attrs==22.1.0
beautifulsoup4==4.11.1
certifi==2022.9.24
charset-normalizer==2.1.1
exceptiongroup==1.0.0
flask_cors==3.0.10
h11==0.14.0
idna==3.4
linkedin-jobs-scraper==1.15.4
nltk==3.7
numpy==1.23.4
outcome==1.2.0
pandas==1.5.1
PyPDF2==2.11.2
PySocks==1.7.1
python-dateutil==2.8.2
pytz==2022.6
requests==2.28.1
selenium==3.141.0
six==1.16.0
sniffio==1.3.0
sortedcontainers==2.4.0
soupsieve==2.3.2.post1
trio==0.22.0
trio-websocket==0.9.2
urllib3==1.26.12
websocket-client==0.59.0
wsproto==1.2.0
40 changes: 40 additions & 0 deletions backend/server.py
@@ -0,0 +1,40 @@
# Flask code to handle the API requests
import os
from flask import Flask, flash, request, redirect, url_for, session
from werkzeug.utils import secure_filename
from flask_cors import CORS, cross_origin
import logging
from job_recommendation import main

logging.basicConfig(level=logging.INFO)

logger = logging.getLogger('HELLO WORLD')

UPLOAD_FOLDER = './data/'
ALLOWED_EXTENSIONS = set(['txt', 'pdf', 'png', 'jpg', 'jpeg', 'gif'])

app = Flask(__name__)
cors = CORS(app)
app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER
app.config['CORS_HEADERS'] = 'Content-Type'

@app.route('/upload', methods=['POST'])
@cross_origin()
def fileUpload():
    # Save the uploaded resume under ./data/test_docs/Resume.pdf
    target = os.path.join(UPLOAD_FOLDER, 'test_docs')
    if not os.path.isdir(target):
        os.mkdir(target)
    logger.info("welcome to upload")
    file = request.files['file']
    filename = secure_filename("Resume.pdf")
    destination = "/".join([target, filename])
    file.save(destination)
    session['uploadFilePath'] = destination
    # Run the recommendation pipeline and return the matches as JSON
    df = main()
    data = df.to_json(orient='records')
    response = data
    return response

if __name__ == "__main__":
    app.secret_key = os.urandom(24)
    app.run(debug=True, host="0.0.0.0", port=4000, use_reloader=False)
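Once the server is running, the endpoint can be smoke-tested from Python. A minimal sketch, assuming a `Resume.pdf` in the current directory (`requests` is already pinned in `requirements.txt`):

```python
import requests

# POST the resume as multipart/form-data to the upload endpoint
with open('Resume.pdf', 'rb') as f:
    resp = requests.post('http://localhost:4000/upload', files={'file': f})

# The server replies with the recommended jobs as JSON records
print(resp.json()[:3])
```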
97 changes: 97 additions & 0 deletions backend/skill_keyword_match.py
@@ -0,0 +1,97 @@
# -*- coding: utf-8 -*-
import numpy as np
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from collections import Counter
import pandas as pd
import PyPDF2
import config
import linkedin_scrapper

# Dictionaries of skill and education keywords to look for in resumes and job descriptions
program_languages = ['bash','r','python','java','c++','ruby','perl','matlab','javascript','scala','php']
analysis_software = ['excel','tableau','sas','spss','d3','saas','pandas','numpy','scipy','sps','spotfire','scikit','splunk','power','h2o']
ml_framework = ['pytorch','tensorflow','caffe','caffe2','cntk','mxnet','paddle','keras','bigdl']
bigdata_tool = ['hadoop','mapreduce','spark','pig','hive','shark','oozie','zookeeper','flume','mahout','etl']
ml_platform = ['aws','azure','google','ibm']
methodology = ['agile','devops','scrum']
databases = ['sql','nosql','hbase','cassandra','mongodb','mysql','mssql','postgresql','oracle','rdbms','bigquery']
overall_skills_dict = program_languages + analysis_software + ml_framework + bigdata_tool + databases + ml_platform + methodology
education = ['master','phd','undergraduate','bachelor','mba']
overall_dict = overall_skills_dict + education
jobs_info_df = pd.DataFrame()

class skill_keyword_match:
    def __init__(self, jobs_list):
        # Combine the scraped Indeed jobs with the cached LinkedIn jobs
        self.jobs_info_df = pd.DataFrame(jobs_list)
        linkedin_df = linkedin_scrapper.get_linkedin_jobs_info()
        self.jobs_info_df = self.jobs_info_df.append(linkedin_df, ignore_index=True)

    def keywords_extract(self, text):
        # Keep letters plus '+' and '3' so tokens like 'c++' and 'd3' survive
        text = re.sub("[^a-zA-Z+3]", " ", text)
        text = text.lower().split()
        stops = set(stopwords.words("english"))
        text = [w for w in text if not w in stops]
        text = list(set(text))
        keywords = [str(word) for word in text if word in overall_dict]
        return keywords

    def keywords_count(self, keywords, counter):
        # Build a one-column frequency table for the given keywords
        keyword_count = pd.DataFrame(columns=['Freq'])
        for each_word in keywords:
            keyword_count.loc[each_word] = {'Freq': counter[each_word]}
        return keyword_count

    def get_cosine_similarity_bit_vector(self, x, y):
        # Cosine similarity between two keyword lists encoded as bit vectors
        l1 = []
        l2 = []
        if len(x) == 0 or len(y) == 0:
            return 0
        rvector = list(set().union(x, y))
        for w in rvector:
            l1.append(1 if w in x else 0)
            l2.append(1 if w in y else 0)
        c = 0
        # cosine formula: dot(l1, l2) / sqrt(sum(l1) * sum(l2))
        for i in range(len(rvector)):
            c += l1[i] * l2[i]
        cosine = c / float((sum(l1) * sum(l2)) ** 0.5)
        return cosine

    def cal_similarity(self, resume_keywords, location=None):
        num_jobs_return = 10
        similarity_cosine = []
        # Filter by location only when one was supplied
        j_info = self.jobs_info_df.loc[self.jobs_info_df['location'] == location].copy() if location else self.jobs_info_df.copy()
        if j_info.shape[0] < num_jobs_return:
            num_jobs_return = j_info.shape[0]
        for job_skills in j_info['keywords']:
            similarity_cosine.append(self.get_cosine_similarity_bit_vector(resume_keywords.tolist(), job_skills))
        j_info['similarity_cosine'] = similarity_cosine
        top_match_based_on_cosine = j_info.sort_values(by='similarity_cosine', ascending=False).head(num_jobs_return)
        return top_match_based_on_cosine

    def extract_jobs_keywords(self):
        self.jobs_info_df['keywords'] = [self.keywords_extract(job_desc) for job_desc in self.jobs_info_df['desc']]

    def extract_resume_keywords(self, resume_pdf):
        # Extract skill keywords from each page of the resume PDF and count them
        resume_file = open(resume_pdf, 'rb')
        resume_reader = PyPDF2.PdfFileReader(resume_file)
        resume_content = [resume_reader.getPage(x).extractText() for x in range(resume_reader.numPages)]
        resume_keywords = [self.keywords_extract(page) for page in resume_content]
        resume_freq = Counter()
        for page_keywords in resume_keywords:
            resume_freq.update(page_keywords)
        resume_skills = self.keywords_count(overall_skills_dict, resume_freq)
        return resume_skills[resume_skills['Freq'] > 0]
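To make the bit-vector cosine in `get_cosine_similarity_bit_vector` concrete, here is a small hand-computed example, written independently of the class:

```python
from math import sqrt

resume_kw = ['python', 'sql']
job_kw = ['python', 'sql', 'java']

# union -> ['java', 'python', 'sql']; resume bits [0, 1, 1]; job bits [1, 1, 1]
union = sorted(set(resume_kw) | set(job_kw))
v1 = [1 if w in resume_kw else 0 for w in union]
v2 = [1 if w in job_kw else 0 for w in union]

# dot product = 2, so cosine = 2 / sqrt(2 * 3) ~= 0.816
dot = sum(a * b for a, b in zip(v1, v2))
print(dot / sqrt(sum(v1) * sum(v2)))
```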