Recommending Paths to Occupations: An agile data science approach

Actionable Research Question: What are the strongest predictors of a person's occupation, among the features an individual can control that are included in the 2017 U.S. Census Bureau data?

The "Freewill" Features (things people can control):

  • Undergraduate College Major

  • Educational Attainment

  • Location

  • Military Service

  • English Fluency

  • Precise Marital Status

  • Number of Times Married

  • Relationship within Household*

  • Working in Private vs Public Sector vs Self-Employed

Many projects have explored the effects of these features on individuals' reported income. I'm looking at individuals' occupation instead, since occupation also contributes to well-being, especially when the job is a good fit for the individual. The end goal for this project would be a web app where young adults could enter their desired occupation and see its most predictive features, as a suggestion of paths others have successfully taken to that occupation. This is proposed as "tried and true" rather than optimal. They could also enter their demographics and receive tailored resources based on their likelihood of achieving that occupation, which would be most impactful for disadvantaged individuals who want prestigious occupations. The best way to change entrenched patterns is to focus on what you have the power to change.


Coded in Python with pandas, scikit-learn, and matplotlib. The main challenges were grid-search runtime, minimal gains from tuning, and feature selection. Along the way I learned to run pipelines and fire up a model in Jupyter on an AWS EC2 instance in 7 minutes.

The American Community Survey Public Use Microdata Sample (PUMS) provides anonymized individual-level data covering roughly 1% of each state's population every year, which is incredible. The tables have around 200 features on these individuals, including demographic information such as disability status, insurance, income, language spoken, citizenship, working hours, and more. See the Data Dictionary starting from row 1145 for examples of the features.
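To make this concrete, here is a minimal sketch of loading the 2017 1-year person file for California with pandas. The column names follow the PUMS data dictionary, but the file name and the exact set of columns are assumptions and should be checked against the release you download.

```python
import pandas as pd

# Columns of interest from the PUMS person file (names per the ACS data
# dictionary; verify against the release you are using).
cols = [
    "AGEP",   # age
    "SCHL",   # educational attainment
    "FOD1P",  # field of degree (first undergraduate major)
    "OCCP",   # occupation code
    "PUMA",   # Public Use Microdata Area
    "PWGTP",  # person weight
    "MIL",    # military service
    "ENG",    # English fluency
    "MAR",    # marital status
    "COW",    # class of worker (private / public / self-employed)
]

# 2017 1-year person file for California (file name may differ by release).
pums = pd.read_csv("psam_p06.csv", usecols=cols, dtype={"PUMA": str})
print(pums.shape)
```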

As an agile data science approach, I defined a minimum viable product to shorten the CRISP-DM cycle as much as possible, so I could assess whether the project is worthwhile before cleaning data or tuning models that do not suit the objective. This involved selecting the latest survey for a single state (California). I also limited the dataset to younger respondents, who started their careers within the last 20 years, so that the paths reflected would not be too outdated.
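As a sketch of that MVP filter, building on the loading example above; the age cutoff here is illustrative, not the exact one used:

```python
# Keep respondents young enough that their paths into an occupation are at
# most roughly 20 years old; 18-40 is an illustrative cutoff, not the exact one.
young = pums[(pums["AGEP"] >= 18) & (pums["AGEP"] <= 40)].copy()

# Drop people with no occupation code (never worked, or missing).
young = young.dropna(subset=["OCCP"])
print(len(young), "young Californians with an occupation")
```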

I hypothesized that I would find the most signal by looking first at the predictive power of the education features on each of the 2010 Standard Occupational Classification's (SOC) major occupation groups. I examined the homogeneity of the major occupation groups to gauge whether they could be meaningfully predicted. For example, I was concerned that 'Management' included both retail managers and executives, which are disparate, but the SOC groups first-line supervisors with their related occupations. For this reason I decided not to cluster a new target to represent occupation on this first pass, since this target is adequate. Here is the distribution:

[Figure: distribution of major occupation groups among young Californians]

The violin plot below compares educational attainment and major occupation group. The x-axis is roughly ordinal; I will later convert attainment to an approximately ratio scale where one unit of x equals one year of schooling.

[Figure: violin plot of educational attainment by major occupation group]
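A rough sketch of that attainment-to-years conversion, assuming the `young` frame from the earlier sketches; the SCHL codes and year equivalents below are illustrative and should be verified against the data dictionary:

```python
# Illustrative mapping from selected SCHL attainment codes to approximate
# years of schooling; check the codes against the PUMS data dictionary.
schl_to_years = {
    16: 12,  # regular high school diploma
    17: 12,  # GED or alternative credential
    20: 14,  # associate's degree
    21: 16,  # bachelor's degree
    22: 18,  # master's degree
    23: 19,  # professional degree beyond a bachelor's
    24: 21,  # doctorate
}
young["EDU_YEARS"] = young["SCHL"].map(schl_to_years)
```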

Clearly some occupations involve more schooling, as expected. Here’s how I scoped the project down:

[Figure: process flow for scoping down the project]

I ran a "dumb" model with a random input for its only feature to establish a baseline on which to improve, which produced a test f1 score of .07. Next an untuned random forest model for predicting occupations based only on the education features produced .21. Given the breakdown below, education did not appear to be predictive of certain occupations like sales and healthcare support, while it was predictive of Engineering and Computer Science and Math occupations. I chose F1-score as a metric which describes the balance between my true positive rate and my precision. I was predicting computer professional(CP) vs not a CP: 'How many of all the CPs did I identify', balanced with 'how many that I identified as CPs were really CPs'? This can be advantageous over accuracy as a metric for imbalanced classes, which I demonstrate having in the chart of occupation groups, since I can achieve 96.4% accuracy simply by predicting that no one is a CP, however my f1-score would be low. I ended up increasing accuracy by half a percent over that.

[Figure: model metrics broken down by occupation group]
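A minimal sketch of that baseline-versus-forest comparison on the computer professional target. Here the "dumb" model is sketched with scikit-learn's DummyClassifier rather than a literal random column, and `is_cp` is a hypothetical 0/1 column derived from the occupation codes:

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# One-hot encode the education features; is_cp is a hypothetical binary target
# (1 = computer professional) built from the OCCP codes.
X = pd.get_dummies(young[["SCHL", "FOD1P"]].astype("category"))
y = young["is_cp"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# "Dumb" baseline with no real signal.
baseline = DummyClassifier(strategy="stratified", random_state=0).fit(X_train, y_train)
print("baseline f1:", f1_score(y_test, baseline.predict(X_test)))

# Untuned random forest on the education features only.
forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)
pred = forest.predict(X_test)
print("forest f1:", f1_score(y_test, pred))
print("forest accuracy:", accuracy_score(y_test, pred))
```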

Since the Computer and Math occupation group looked promising, I examined it as a target before pursuing the full project. Of eight different classification models, tree-based learners scored best across metrics. I found the majority of the predictive power in the education variables of undergraduate major and educational attainment (left curve) versus all the freewill features (right), although location may be nearly as important as having a CS degree.

[Figures: ROC curves for the education features alone (left) and all freewill features (right)]
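A sketch of how the two feature sets could be compared, assuming `X_train`/`X_test` now hold one-hot encodings of all the freewill features, and that `education_cols` and `freewill_cols` (both hypothetical names) list the columns in each group:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

feature_sets = {"education only": education_cols, "all freewill": freewill_cols}
for name, cols in feature_sets.items():
    model = GradientBoostingClassifier(random_state=0).fit(X_train[cols], y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test[cols])[:, 1])
    print(f"{name}: ROC AUC = {auc:.3f}")
```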

[Figure: partial dependence plots from the gradient boosting model]

These partial dependence plots from the gradient boosting model show the features (mainly undergraduate degree fields) that are important to the model. The x-axis shows how each feature varies, and the y-axis shows the log-likelihood of the target class being positive (the person being a computer professional). In addition to these, working in the for-profit sector and English fluency also appeared to matter. As next steps I may improve dimensionality reduction, engineer more features, and create a women-only model that can take into account features like children, which could not be part of the current dataset since they would cause leakage of gender as a feature.
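For reference, plots like these can be produced with scikit-learn's inspection module; this is only a sketch, reusing the hypothetical training data from the earlier examples (PartialDependenceDisplay requires a reasonably recent scikit-learn, while older versions expose plot_partial_dependence instead):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay

# Fit gradient boosting on the freewill features and plot partial dependence
# for its most important ones.
gb = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
top_idx = np.argsort(gb.feature_importances_)[::-1][:6]

PartialDependenceDisplay.from_estimator(gb, X_test, features=top_idx.tolist())
plt.tight_layout()
plt.show()
```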

I also lightly tuned a model for the healthcare practitioner occupation group (mainly highly educated professionals, as opposed to healthcare support), another promising one. My F1 score was .378, with an area under the curve of .62, which is not great; however, I expect I could push this higher with more work. Partial dependence plots show related majors and professional, associate's, and doctorate degrees having an impact. A second undergraduate major in nursing makes a striking appearance, worth investigating. This time location was not distinguishing. The first nine plots are shown below.

[Figure: partial dependence plots for the healthcare practitioner model (first nine features)]
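A sketch of what a light tuning pass might look like, with a deliberately small grid since grid-search runtime was a bottleneck; `y_train_hc` is a hypothetical binary target for the healthcare practitioner group:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="f1",
    cv=3,
    n_jobs=-1,  # parallelize across cores, e.g. on the EC2 instance
)
search.fit(X_train, y_train_hc)
print(search.best_params_, search.best_score_)
```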

I did not get as much signal from the non-education features as I had hoped for this initial occupation. For occupations that education does not predict well, my signal may be weak. However, I have enough confidence to proceed to adding more data from earlier years and other states. I am especially optimistic about location and about creating a heatmap based on the Public Use Microdata Areas (PUMAs). Each PUMA contains around 100k people, so I could show per-capita rates, allowing a map that does not simply point out cities. It may also be useful for identifying shortages or surpluses of an occupation.
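A sketch of the per-capita calculation by PUMA, using the PUMS person weights so the rates reflect the population rather than the raw sample; `is_cp` is the hypothetical target column from earlier:

```python
# Weighted per-capita rate of computer professionals by PUMA.
by_puma = (
    young.assign(cp_weight=young["is_cp"] * young["PWGTP"])
         .groupby("PUMA")[["cp_weight", "PWGTP"]]
         .sum()
)
by_puma["cp_per_capita"] = by_puma["cp_weight"] / by_puma["PWGTP"]
print(by_puma["cp_per_capita"].sort_values(ascending=False).head())
```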

With feedback, I revised the proposed functionality of the web app: users enter their desired occupation and their demographic information, and I run a model for them that outputs, but may not display, a predicted likelihood of that person holding that occupation. I then present appropriate resources based on the likelihood, for example more supportive versus more accelerated ones, and recommend actions corresponding to the features the individual can control.

[Figure: website wireframe]
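A minimal sketch of the likelihood-to-resources step; the thresholds and resource descriptions are placeholders for illustration, not a final design:

```python
def recommend_resources(probability: float) -> str:
    """Map a predicted likelihood to a resource tier (placeholder thresholds)."""
    if probability < 0.2:
        return "supportive resources (mentoring, prerequisite coursework, funding)"
    if probability < 0.6:
        return "targeted resources (degree and certification paths, relocation info)"
    return "accelerated resources (internships, job boards, networking)"

# The web app would call the trained model behind the scenes, e.g.:
# likelihood = model.predict_proba(user_features)[0, 1]
print(recommend_resources(0.15))
```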

*For each individual, I have their relationship to the survey respondent, including family relationships like stepson and non-family ones like housemate or boarder. This feature needs some engineering. I surmise that someone can change their living situation by moving into a different household, which is what allows this to count as a 'freewill' feature.
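One way this feature could be engineered is to collapse the detailed relationship codes into coarse living-situation groups; the code values below are placeholders, and the real values must come from the PUMS data dictionary:

```python
# Placeholder groupings of relationship-to-reference-person codes; verify the
# actual code values against the data dictionary before trusting this mapping.
family_codes = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 14}
nonfamily_codes = {11, 12, 15}

def living_situation(relp: int) -> str:
    if relp in family_codes:
        return "family household"
    if relp in nonfamily_codes:
        return "non-family household"
    return "group quarters or other"

young["LIVING_SITUATION"] = young["RELP"].apply(living_situation)
```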
