This repo provides a toolset for scraping job listings from various job boards, analyzing their content, and extracting valuable insights for jobseekers. The repo is setup to extract job data like skills, accessibility, sentiment, popular interview using tools and more.
python -m venv <your_environment_name>
pip install -r requirements.txt
If you just want a live demo:
Scraping: Collect job listings from sites like LinkedIn, Indeed and Greenhouse and save as .csv files. (Located in ./scraper.py)
Analysis: Process job descriptions using spaCy to extract key skills, accessibility keywords, interview prep indicators, sentiment, and named entities. (Located in ./spacy_nre_analyzer.py)
Demo Notebook: A Jupyter notebook showcasing an example dataframe obtained from scraping, followed by dataframe processing using the analysis functions. (Located in ./juypter_notebooks/dataframe_notes.ipynb)
Scrape Job Listings: Use scraper.py to gather data from popular job boards based on keywords, locations and job titles. You can modify these query links however the job board will allow you to.
Analyze Job Data: Use spacy_nre_analyzer.py to process the scraped job data and extract useful insights like sentiment, skills, and job accessibility.
Add your own keywords: in spacey_nre_analyzer.py there are variables containing lists of desired keywords
skills_keywords = [
# Web Development
'react', 'angular', 'vue.js', 'svelte', 'ember.js', 'backbone.js', 'litElement',
'node.js', 'express', 'django', 'flask', 'ruby on rails', 'spring', 'laravel', 'asp.net', 'fastapi',
'javascript', 'typescript', 'html', 'css', 'python', 'ruby', 'php', 'java', 'c#', 'go', 'kotlin', 'swift', 'rust',
'bootstrap', 'tailwind css', 'bulma', 'material ui', 'foundation', 'sass', 'less',
'jquery', 'webpack', 'babel', 'gulp', 'grunt', 'nginx', 'apache', 'varnish', '...' ]
Modify these however you want for your job queries
Explore Example: Open dataframe_notes.ipynb to see an end-to-end example of scraping and analysis, including data preprocessing and output generation.
This project is ideal for jobseekers who'd like more tools to explore job market trends and for aspiring devs interested in data analysis applications.
search_url = "https://www.linkedin.com/jobs/search/?keywords=Web%20Developer&location=United%20States"
df = scrape_linkedin_jobs(
driver,
url=search_url,
scrolls=10,
max_jobs=5,
csv_path="./csvs/linkedin_ds_jobs.csv"
)
-
Description: The job title extracted from the listing (e.g., Data Scientist, Software Engineer).
Purpose: Helps identify the role being advertised.
-
Description: The company name offering the job (e.g., Netflix, Google).
Purpose: Allows job seekers to filter by their preferred companies.
-
Description: The location of the job (e.g., New York, Remote).
Purpose: Helps job seekers target specific geographic regions.
-
Description: The direct URL link to the job listing on LinkedIn.
Purpose: Direct access for more details and to apply.
-
Description: A summary of the job's responsibilities and requirements.
Purpose: Provides the key information about the role, tasks, and expectations.
-
Description: Extracted skills and technologies mentioned in the job description (e.g., Python, SQL).
Purpose: Helps job seekers match their skill set to job requirements.
-
Description: Keywords indicating job accessibility for diverse jobseekers (e.g., "No degree required", "Senior role").
Purpose: Identifies roles suitable for early-career professionals or those with non-traditional qualifications.
-
Description: Mentions related to the interview process (e.g., "Leetcode", "Hackerrank", "Portfolio projects").
Purpose: Provides insight into the type of preparation needed for the interview.
-
Description: Language analysis of the job description, a score that represents subjective ambiguity. (Subjective, Neutral, Objective)
Purpose: Often an indicator of weighted emphasis on 'soft skills' used in the job description section. While it is the habit of most job postings to employ this type of language, it may prove helpful to check for postings that clearly outline the job's needs clearly and succinctly.
-
Description: Sentiment analysis of the job description, showing its overall positivity or negativity (e.g., Positive, Neutral).
Purpose: Helps job seekers gauge the tone of the company’s work environment.
-
Description: Extracted entities such as company names, locations, technologies, and more.
Purpose: Highlights important information such as specific tools, locations, and organizations in the job post.
Title | Company | Location | URL | Job Description | Skills | Accessibility | Interview | Sentiment | NER | |
---|---|---|---|---|---|---|---|---|---|---|
4 | Junior Web Developer | IntelliBridge | Washington, DC | https://www.linkedin.com/jobs/view/junior-web-... | Overview\n\nIntelliBridge is an award-winning ... | [] | ['collaborative', 'junior web developer'] | ['collaboration'] | ('Positive', 'Objective') | [{'PRODUCT': '401k'}, {'ORG': 'pto &'}, {'ORG'... |
0 | Full Stack - Software Engineer (L5) - Platform... | Netflix | Los Gatos, CA | https://www.linkedin.com/jobs/view/full-stack-... | Netflix is one of the world's leading entertai... | ['aws'] | [] | [] | ('Positive', 'Objective') | [{'GPE': 'netflix'}, {'GPE': 'netflix'}, {'GPE... |
1 | Frontend Engineer | Porter | New York, NY | https://www.linkedin.com/jobs/view/frontend-en... | The Role\n\nWe're looking for an experienced f... | ['https', 'github'] | [] | ['github'] | ('Positive', 'Subjective') | [{'ORG': 'justin'}] |
3 | Frontend Developer | Adapty.io | United States | https://www.linkedin.com/jobs/view/frontend-de... | Adapty is a revenue management platform for mo... | [] | [] | [] | ('Positive', 'Objective') | [{'MONEY': '$1.4 billion'}] |
2 | Front-End / Full-Stack Developer | Corsair | Milpitas, CA | https://www.linkedin.com/jobs/view/front-end-f... | Job Description\n\nWe are a fast-growing eComm... | ['react', 'tailwind css', 'css', 'typescript'] | ['css'] | [] | ('Neutral', 'Objective') | [{'ORG': 'cms'}, {'ORG': 'cms'}, {'GPE': 'milp... |
Subjectivity reflects the presence of opinions or biases in the language, while sentiment analysis gauges the overall tone (positive, negative, or neutral).
- What it is: Subjectivity includes language that’s open to interpretation, not based on objective facts.
- Examples:
- "Highly motivated and passionate individual" (subjective)
- "Fast-paced and dynamic team" (subjective)
- "Must be a team player" (subjective)
- Subjectivity can make a job description unclear, potentially leading to misinterpretations or attracting candidates who don't truly fit the role.
- This model reflects high levels of Subjectivity with a Positive rating and Objectivity with a Negative rating. Hope that's not confusing!
- What it is: Sentiment analysis identifies the emotional tone of the job description.
- What it reveals:
- Tone: Positive, negative, or neutral.
- Emotional Valence: Specific feelings associated with the job.
- Company Culture: Insight into the org's temperament/values.
- Uses:
- Candidate Matching: Aligns candidates with job postings based on tone.
This model collects the following labels from scraped job listings:
['ORG', 'GPE', 'PRODUCT', 'MONEY']
- ORG - Organizations: Business, Government, Agencies
- GPE - Geopolitical Entities: Countries, States, Provinces
- PRODUCT - Objects, vehicles, foods, etc. (Not services.)
- MONEY - Quantifiable currency amounts ($1,000,000 - 773 300 GBP)
More NERs:
- PERSON: People, including fictional.
- NORP: Nationalities or religious or political groups.
- FAC: Buildings, airports, highways, bridges, etc.
- ORG: Companies, agencies, institutions, etc.
- GPE: Countries, cities, states.
- LOC: Non-GPE locations, mountain ranges, bodies of water.
- PRODUCT: Objects, vehicles, foods, etc. (Not services.)
- EVENT: Named hurricanes, battles, wars, sports events, etc.
- WORK_OF_ART: Titles of books, songs, etc.
- LAW: Named documents made into laws.
- LANGUAGE: Any named language.
- DATE: Absolute or relative dates or periods.
- TIME: Times smaller than a day.
- PERCENT: Percentage, including ”%“.
- MONEY: Monetary values, including unit.
- QUANTITY: Measurements, as of weight or distance.
- ORDINAL: “first”, “second”, etc.
- CARDINAL: Numerals that do not fall under another type.