This is an expansion of dsb318-group4 (see repo: dsb318-group4), in which we collaborated to predict high school graduation rates in CA from other trends (e.g., poverty rate, availability of e-cigarettes). Collaboration between Eli and Emily.

emilyksanders/CA_dropout_rates

It Takes a Village

An exploration of county-level predictors of high school dropout rates in California

Eli Winton and Emily K. Sanders
with contributions by Radha Mohanty
Original version presented for: DSB-318, Project 4, Group 4, May 17, 2024

A table of contents for the repository is available at the bottom of this README.

Problem Statement

Not obtaining a high school diploma carries serious detrimental consequences for a child's whole life. We will investigate how various social conditions predict high school dropout rates at the county level in California using regression algorithms and data retrieved from government sources.

Target Audience

We have been hired by the Coalition for Unprecedented Transformative Education for California Teachers and Students, an umbrella organization for non-governmental organizations (NGOs) in California, to assess factors that may contribute to unequal opportunity for educational attainment across regions of the state. These NGOs will use our findings and recommendations to shape their own policy goals and political advocacy.

Background and Purpose

There are many social determinants of educational success beyond any child’s merit or self-discipline, and children who do not obtain high school diplomas often face social and economic hardships for the rest of their lives. Therefore, at the request of our commissioners, we will use publicly available data about a variety of social conditions across California counties to predict the dropout rate within a 4-year adjusted graduation cohort. This will allow our commissioners to become more informed about the disparities in educational outcomes across the state and better advocate for increasing students' opportunities for success by decreasing the dropout rate.

If a suitable predictive model can be created, it will allow advocates to anticipate the county-level dropout rate if current trends continue, and plan their community efforts and political advocacy accordingly. For example, if Modoc County were predicted to have a high dropout rate whereas Lassen County were predicted to have a low one, the North-Eastern Alliance for Transforming the Californian Technology Sector, a group of innovators who seek to inspire underprivileged young people to pursue careers in the technology industry, may choose to focus their efforts on Modoc County, where they are more needed. Furthermore, if a model suitable for interpretation can be created, it will allow us and our commissioners to evaluate which social determinants may be the most influential on the rate of degree acquisition, and make decisions accordingly. In this case, for example, the Glenn Open Opportunity Society for Education, a civic organization in Glenn County dedicated to mentoring high school students, may choose to prioritize students with certain characteristics to be matched with members for 1-on-1 mentoring. In either case, political advocacy groups in the coalition, such as Educators Dare to Dream of Immigration Equity, can use our findings to advance their goals at the statehouse.

Metric of Success

We will assess our models using R-squared, a measurement of how much of the variance in the dropout rate across counties can be explained by our model, and Root Mean Squared Error (RMSE), a measurement of how far our model's estimates, based on the predictor variables, fall from the dropout rates actually observed in the population. Although we will present the best model(s) we find, whatever they may be, we aspire to find model(s) with R-squared scores upward of 80%, and/or RMSEs equal to or less than 2 standard deviations of dropout rates; that is, an RMSE that does not exceed 18.1326.
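As a minimal sketch of how these two metrics and the stated targets could be computed, the snippet below uses plain numpy. The function names and the four-county toy values are hypothetical and are not taken from our notebooks; only the 80% and 18.1326 thresholds come from the text above.

```python
import numpy as np

R2_TARGET = 0.80        # "upward of 80%", from the text
RMSE_TARGET = 18.1326   # two standard deviations of dropout rates, from the text


def r_squared(y_true, y_pred):
    """Share of the variance in y_true that the predictions explain."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def meets_targets(y_true, y_pred):
    """Report each target separately, since the goal is phrased as and/or."""
    return {
        "r_squared": r_squared(y_true, y_pred) >= R2_TARGET,
        "rmse": rmse(y_true, y_pred) <= RMSE_TARGET,
    }


# Hypothetical dropout rates (percent) for four counties, for illustration only
observed = [8.0, 10.0, 12.0, 20.0]
predicted = [9.0, 10.0, 11.0, 18.0]
result = meets_targets(observed, predicted)
```

Both functions should be applied on the same scale as the target; if the model is fit on a transformed dropout rate, predictions would need to be converted back before comparing the RMSE with the 18.1326 threshold.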

Deliverables

  • A written report of our procedure, findings, and recommendations for action and future work. A full table of contents is available at the bottom of this README.
  • Slides from the presentation of this report to the Spring 2024 Meeting of the Coalition for Unprecedented Transformative Education for California Teachers and Students.

Apparatus

This analysis was primarily conducted in python, and partially in Excel. The developers of python have graciously made it open source; Excel is a Microsoft product. Each collaborator used their python development environment of choice; we expect that anyone reproducing our work would be able to do the same. Within python, we used several relevant modules, most notably pandas, sklearn, numpy, matplotlib.pyplot, and seaborn. At least one collaborator also used os, datetime, and string. All of these python modules are open source.

Method

We collected multiple publicly available datasets about county-level social conditions in California for use in our analysis. We retrieved the majority of these datasets from data.world, where they had been compiled after originating with the California Health and Human Services Agency. We also retrieved a few datasets from the California Employment Development Department, the California Department of Education, and KidsData.org, where we found neatly packaged US Census data. For each dataset, we downloaded the file to one of our local computers, then used python and/or Excel to reshape it for our needs, reduce dimensionality (rows and columns) wherever necessary, and perform any necessary data cleaning (i.e., attending to missing values). In some cases, we engineered new features out of existing ones; for example, we combined two datasets about the GINI coefficient, each with several missing values, into one GINI feature with no missing values. Whenever necessary, we imputed missing values; more details are provided on this in the notebooks. At the end of this process, we had assembled a dataset with one row for each of 57 out of California's 58 counties. (Because Alpine County has very few residents, too much of its data were suppressed for it to be usable in the analysis.) In the interest of creating reproducible code, we used a random seed of 31 throughout.
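To illustrate the kind of feature engineering described above, here is a minimal sketch of merging two partially missing GINI sources into one complete feature with pandas. The county values are made up, and the rule shown (prefer the first source, fall back to the second) is an assumption; the rule we actually applied is documented in the notebooks and may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical values: two GINI sources indexed by county, each with gaps
gini_a = pd.Series({"Butte": 0.46, "Glenn": np.nan, "Lassen": 0.41})
gini_b = pd.Series({"Butte": 0.47, "Glenn": 0.43, "Lassen": np.nan})

# Prefer source A; fall back to source B wherever A is missing
gini = gini_a.combine_first(gini_b)

# Confirm that the combined feature has no remaining gaps
remaining_missing = int(gini.isna().sum())
```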

Exploratory Data Analysis

We began our analyses by creating histograms to explore the distribution of each variable, and scatterplots to explore each variable's relationship with dropout rates. These graphs quickly revealed that none of our variables were normally distributed (bell-shaped), and that the majority of them were severely right-skewed (L-shaped). Nearly every predictor variable had at least one value (usually the one for Los Angeles County) that was orders of magnitude greater than the others. The dropout rate also suffered from outliers; most counties had a rate around 10%, some approaching 20%, but Inyo County and Nevada County had dropout rates of nearly 50%. A non-normal and/or skewed distribution in any variable can be a serious obstacle to fitting a high-performing model; in all variables, it can be nearly impossible. Distribution problems can occasionally be resolved by transforming the original values. For the time being, we prioritized attempting to normalize our target variable of dropout rates. After trying several approaches, we settled on a logarithmic transformation. Because log(0) is undefined, we added 0.0001 to each dropout rate before transforming them.
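The transformation described above can be sketched as follows. The 0.0001 offset comes from the text; the use of the natural logarithm, the function names, and the sample rates are assumptions made for illustration.

```python
import numpy as np

OFFSET = 0.0001  # constant added before taking the log, as stated in the text


def log_transform(rates):
    """Shift the rates by OFFSET, then take the natural log (base assumed)."""
    return np.log(np.asarray(rates, dtype=float) + OFFSET)


def inverse_log_transform(log_rates):
    """Undo log_transform so predictions can be read as dropout rates again."""
    return np.exp(np.asarray(log_rates, dtype=float)) - OFFSET


# Hypothetical rates, including a zero that would break a plain log
rates = np.array([0.0, 9.5, 48.0])
log_rates = log_transform(rates)
recovered = inverse_log_transform(log_rates)
```

Keeping the inverse next to the forward transform makes it easier to report model error in the original units.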

Our next step was to create a correlation matrix between all variables, which created further concerns about the likelihood of finding a quality model. Most correlations were near 0, indicating that these predictors may not provide much utility in a model. We attempted to improve performance by creating interaction terms; that is, features in the model that account for intersectional effects of multiple variables, rather than treating each variable in isolation. This did improve the correlations somewhat, and we retained many of these interaction terms in the models we fit; however, their predictive power remained significantly lower than we had hoped to achieve.
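A minimal sketch of this step, assuming pairwise products as the interaction terms and Pearson correlation as the measure. The data are synthetic and the column names are hypothetical stand-ins, so the resulting correlations say nothing about the real counties.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(31)  # the seed used throughout the project

# Synthetic stand-in for the 57-county dataset; column names are hypothetical
df = pd.DataFrame({
    "poverty_rate": rng.uniform(5, 25, size=57),
    "ecig_store_pct": rng.uniform(20, 90, size=57),
    "unemployment": rng.uniform(2, 15, size=57),
})
df["log_dropout"] = rng.normal(2.3, 0.5, size=57)

# Interaction terms: the product of every pair of predictors
predictors = [c for c in df.columns if c != "log_dropout"]
for a, b in combinations(predictors, 2):
    df[f"{a}_x_{b}"] = df[a] * df[b]

# Correlation of every feature with the target, strongest first
corr_with_target = (
    df.corr()["log_dropout"]
    .drop("log_dropout")
    .sort_values(key=abs, ascending=False)
)
```

Sorting by absolute value keeps strong negative correlates near the top, which a plain descending sort would push to the bottom.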

Modeling

Because of the concerns raised in exploratory data analysis, we decided to run three simple models on the top ten most correlated columns after polynomial feature engineering. The first model we ran was a random forest, which gave us an R² training score of 0.83 and an R² testing score of 0.13. This score suggested that our model was overfit. We then applied a standard scaler and ran a Lasso regression, cross-validating over a range of different alphas. This process indicated that 100 was the best alpha, but the model produced an R² score of 0, indicating poor performance. Finally, we attempted some clustering and ran a for loop over a range of 1 to 40 clusters. It appeared that around 13 clusters yielded the best results, with a silhouette score of 0.24, indicating that the clusters are weakly defined. With results this poor on the initial metrics,
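The three approaches described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not a copy of our notebooks: the data are random noise shaped like our dataset (57 rows, 10 features), and the alpha grid, train/test split, and choice of k-means for the clustering are assumed. The cluster loop starts at 2 because the silhouette score is undefined for a single cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 31  # the seed used throughout the project
rng = np.random.default_rng(SEED)

# Synthetic stand-in: 57 counties, 10 features, a log-dropout-like target
X = rng.normal(size=(57, 10))
y = rng.normal(size=57)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=SEED)

# 1. Random forest: a large train/test gap in R-squared signals overfitting
forest = RandomForestRegressor(random_state=SEED).fit(X_train, y_train)
forest_r2 = {
    "train": forest.score(X_train, y_train),
    "test": forest.score(X_test, y_test),
}

# 2. Scaled Lasso, cross-validated over an assumed grid of alphas
lasso = make_pipeline(
    StandardScaler(),
    LassoCV(alphas=np.logspace(-3, 2, 20), cv=5),
)
lasso.fit(X_train, y_train)
best_alpha = lasso.named_steps["lassocv"].alpha_
lasso_r2 = lasso.score(X_test, y_test)

# 3. K-means over a range of cluster counts, scored by silhouette
X_scaled = StandardScaler().fit_transform(X)
silhouettes = {}
for k in range(2, 41):
    labels = KMeans(n_clusters=k, n_init=10, random_state=SEED).fit_predict(X_scaled)
    silhouettes[k] = silhouette_score(X_scaled, labels)
best_k = max(silhouettes, key=silhouettes.get)
```

With only 57 rows, the held-out set contains about 15 counties, so any single test R-squared is noisy; repeated cross-validation would give a steadier estimate.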

Conclusions and Recommendations

We are unfortunately not able to present a useful model at this time. The irregularities in the data, combined with the small number of observations (n=57), precluded a quality fit. We hope to improve model performance in future iterations by implementing better controls for population size and outliers. If it is agreeable to our commissioners, we believe we may be able to fit a better model on a smaller unit of analysis, such as city or school district, because this would yield a greater number of observations (rows). Alternatively, we could exclude outlier counties from the analyses; Los Angeles County in particular operates on a much larger scale than the others, and may be best evaluated on its own terms. The practicality and ethics of doing either of these things, however, are questions only the commissioners can answer. Only they can tell us whether they have the capacity to act on more granular information, and/or whether they are comfortable excluding counties on the basis of their size.

If the commissioners desire to continue searching for a county-level model, we would recommend exploring more options for normalizing the data we have and implementing better controls for population size. We would also continue refining the features we include in the model, ideally adding several features that we hypothesize could be useful predictors, but that were not readily available within the timeframe we had; namely, demographic data such as race, migrant populations, and languages spoken, county financial data such as tax revenue and per-pupil spending, and additional educational data such as average class size, teachers' years of experience, and student absenteeism.

Although our models were resoundingly unsuccessful, we can offer one tentative recommendation based on our correlation matrix. We believe that there is some relationship at play between the availability of e-cigarettes and the dropout rate. The 7 strongest correlates of the log-dropout rate were interaction terms containing the percentage of tobacco stores that sold e-cigarettes (0.3374 < r < 0.5622). Furthermore, all of these correlations were positive, indicating that dropout rates rise and fall in the same direction as the rate of tobacco stores that sell e-cigarettes. Although correlation does not imply causation, and the mechanism behind this correlation - assuming it is not merely due to chance - is surely complex, it does offer a potential avenue for immediate action. The prevalence of e-cigarette vendors may be a useful indicator of potential dropout risk, and NGOs within the coalition with expertise on the matter may be able to divine more meaning from the relationship than we can. In any case, coalition members may be able to capitalize on California's recent $175.8 million settlement with Juul, which state officials say will be used "to fund research, education and enforcement efforts related to e-cigarettes," by tying together the goals of decreasing tobacco use and increasing high school completion rates.

In conclusion, we are honored to have had the opportunity to contribute to such an important cause, and we look forward to continuing our partnership with the Coalition for Unprecedented Transformative Education for California Teachers and Students.

Table of Contents

| Folder | Contents | Description |
| --- | --- | --- |
| (none) | readme | This is the current file. |
| (none) | ca_dropout_rate_presentation.pdf | The presentation slides that accompany this report. |
| /01_notebooks | | This folder contains a collection of Jupyter notebooks, in which readers can find examples of the python code we used to run our analyses, as well as brief written explanations. These notebooks are named in the order we recommend reading them. |
| /02_data | | This folder contains data files. The overall data file that we based our analysis on is loose in the folder. The subfolders contain the original datasets and data dictionaries as retrieved from the sources, as well as our cleaned versions of those datasets. |
| /02_data | ca_dropout_and_predictors.csv | This is the dataset we based our analyses on. It was constructed from the multiple datasets retrieved from the sources. |
| /02_data | /01_original_datasets | This folder contains the individual CSV files corresponding to different variables as retrieved from the sources. Note that we have only included the files we used; most of them were originally part of a larger package of files available at the source. |
| /02_data | /02_cleaned_datasets | This folder contains the individual CSV files corresponding to different variables as they appeared after we cleaned and streamlined them. Note that we made a few further edits as needed between these versions and the final analysis dataset. These edits should be readily apparent. |
| /02_data | /03_output | This folder contains any files produced within the notebooks. |
| /02_data | /04_data_dictionaries | This folder contains our data dictionary, as well as the original data dictionaries that came with the datasets, whenever available. Note that not all datasets came with a data dictionary, and that not all of the variables in these data dictionaries remained in our final dataset. |
| /03_images | | This folder contains images, mostly graphs, that support the report. |
