Isaiah Jenkins
This project analyzes the "How Couples Meet and Stay Together" dataset (2017–2022) to predict relationship duration using linear regression models. The study leverages a nationally representative survey of 4,002 American adults, with 3,009 reporting a spouse or romantic partner. The analysis focuses on key features such as age, income, employment status, past partner history, and relationship quality to identify factors influencing long-lasting relationships. Python libraries including Pandas, NumPy, Scikit-learn, Matplotlib, and Seaborn were used for data processing, modeling, and visualization.
The dataset, sourced from the "How Couples Meet and Stay Together" study, contains 726 features capturing demographic and relationship data across multiple waves (2017–2022). This project focuses on Wave 3 (2022) data, narrowed to 10 relevant features, including:
- Key Features: w3_ppage (age), w3_ppincimp (income), w3_ppwork (employment status), w3_past_partners_gender_1/2/3 (past partner history), w3_relatives (number of relatives), w3_weekly_sex_frequency, w3_rel_qual (relationship quality), and w3_relationship_duration_yrs (target variable).
- Preprocessing: Handled missing values, encoded categorical variables, and filtered to 1,026 records with complete data for analysis.
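The column selection and complete-case filtering described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes the `1/2/3` shorthand expands to three separate columns, that the CSV uses exactly these column names, and that "complete data" means dropping any row with a missing value. The function name `select_complete_cases` and the toy frame are hypothetical.

```python
import pandas as pd

# Wave 3 columns named in this README; the _1/_2/_3 expansion is an assumption.
FEATURES = [
    "w3_ppage", "w3_ppincimp", "w3_ppwork",
    "w3_past_partners_gender_1", "w3_past_partners_gender_2",
    "w3_past_partners_gender_3",
    "w3_relatives", "w3_weekly_sex_frequency", "w3_rel_qual",
]
TARGET = "w3_relationship_duration_yrs"


def select_complete_cases(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the selected columns and drop rows with any missing value."""
    columns = FEATURES + [TARGET]
    return df[columns].dropna().reset_index(drop=True)


# Toy stand-in for the real survey file (values are made up).
toy = pd.DataFrame({col: [1.0, 2.0, 3.0] for col in FEATURES + [TARGET]})
toy["other_wave_column"] = [0.0, 0.0, 0.0]   # unrelated column, gets dropped
toy.loc[1, "w3_ppincimp"] = float("nan")     # one incomplete record

clean = select_complete_cases(toy)
print(clean.shape)  # (2, 10)
```

With the real data, the same function would be called on `pd.read_csv("data/HCMST_2017_to_2022.csv")`.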
The analysis included:
- Data Exploration: Examined dataset structure, identified missing values, and computed descriptive statistics.
- Feature Engineering: Encoded categorical variables (e.g., income, employment, relationship quality) using one-hot encoding, resulting in 40 features.
- Modeling:
- Baseline Linear Regression: Achieved train R² of 0.635 and test R² of 0.572.
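- Polynomial Regression: Degree 2 and 4 models yielded negative R² scores (-3.21e6 and -3113.74, respectively), meaning they predicted worse than a constant mean baseline. This indicates severe overfitting or numerical instability, not underfitting.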
- Lasso Regression: Applied regularization, achieving a test R² of 0.583, with key features like age and income showing influence.
- Visualizations: Used box plots and other visualizations to inspect feature distributions and relationships.
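As a rough sketch of the encoding and modeling steps listed above, the snippet below one-hot encodes categorical columns and compares a baseline linear regression with Lasso on train and test R². The data is synthetic and the category labels, the 80/20 split, `random_state`, and `alpha=0.1` are all assumptions chosen for illustration, so the scores it produces are unrelated to the results reported here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for a few Wave 3 columns (category labels are made up).
toy = pd.DataFrame({
    "w3_ppage": rng.integers(18, 80, size=n),
    "w3_ppwork": rng.choice(["working", "not working", "retired"], size=n),
    "w3_rel_qual": rng.choice(["excellent", "good", "fair"], size=n),
})
# Synthetic target: duration grows with age, plus noise.
toy["w3_relationship_duration_yrs"] = (
    0.5 * (toy["w3_ppage"] - 18) + rng.normal(0, 5, size=n)
)

# One-hot encode the categorical columns; numeric columns pass through.
X = pd.get_dummies(
    toy.drop(columns="w3_relationship_duration_yrs"),
    columns=["w3_ppwork", "w3_rel_qual"],
    dtype=float,
)
y = toy["w3_relationship_duration_yrs"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit each model and record (train R², test R²).
scores = {}
for name, model in {"linear": LinearRegression(), "lasso": Lasso(alpha=0.1)}.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))

for name, (train_r2, test_r2) in scores.items():
    print(f"{name}: train R2 = {train_r2:.3f}, test R2 = {test_r2:.3f}")
```

Comparing train and test R² side by side, as in the baseline results above, is a quick way to spot a model that fits the training data but does not generalize.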
- Model Performance: Lasso achieved the highest test R² (0.583), marginally above the baseline linear regression (0.572), while the polynomial models failed to generalize. All models were limited by inconsistent relationship duration data (e.g., excellent relationship quality reported for both short and long durations).
- Feature Insights: Age and income were significant predictors, but inconsistencies in the target variable limited model accuracy.
- Challenges: The dataset's complexity and inconsistencies in outcome variables hindered robust predictions.
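One possible contributor to the negative polynomial R² scores is the size of the expanded feature space. The sketch below counts the terms a full polynomial expansion would create, assuming (this is not confirmed here) that the expansion was applied to all 40 encoded features. The helper `n_poly_terms` is hypothetical; it counts all monomials up to the given degree, including the constant term.

```python
from math import comb


def n_poly_terms(n_features: int, degree: int) -> int:
    """Count monomials of total degree <= `degree` in `n_features` variables,
    including the constant term."""
    return comb(n_features + degree, degree)


N_FEATURES = 40   # encoded feature count reported in this README
N_RECORDS = 1026  # complete records reported in this README

for degree in (1, 2, 4):
    terms = n_poly_terms(N_FEATURES, degree)
    print(f"degree {degree}: {terms} terms for {N_RECORDS} records")
# degree 1: 41 terms for 1026 records
# degree 2: 861 terms for 1026 records
# degree 4: 135751 terms for 1026 records
```

Under that assumption, the degree 4 expansion has far more terms than records, so an unregularized fit can match the training data exactly and still predict poorly on held-out data.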
To run this project, install the required dependencies:
pip install pandas numpy matplotlib seaborn scikit-learn
Download the HCMST_2017_to_2022.csv dataset and place it in the data/ directory.
- Clone the repository:
  git clone https://github.com/your-username/couples-regression.git
  cd couples-regression
- Set up the dataset: place HCMST_2017_to_2022.csv in the data/ directory.
- Run the Jupyter Notebook:
  jupyter notebook Stay_Together_Regression.ipynb
- Follow the notebook to explore data, train models, and review results.
- Expand Dataset: Incorporate data from Waves 1 and 2 to increase sample size and feature diversity.
- Refine Features: Select more consistent outcome variables and explore additional features (e.g., education, shared interests).
- Model Improvements: Revisit polynomial regression with tuned hyperparameters and explore non-linear models (e.g., decision trees, random forests).
- Alternative Datasets: Consider datasets with more consistent relationship duration metrics for improved predictive accuracy.
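As a starting point for the non-linear models mentioned above, the following sketch scores a random forest with 5-fold cross-validation. It runs on synthetic data with a step-shaped target and is not a result from this project. The hyperparameters (`n_estimators=100`, `max_depth=4`) are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic data: a single age-like feature with a step-shaped target,
# a pattern a straight line cannot capture well.
X = rng.uniform(18, 80, size=(300, 1))
y = np.where(X[:, 0] < 40, 5.0, 25.0) + rng.normal(0, 2, size=300)

forest = RandomForestRegressor(n_estimators=100, max_depth=4, random_state=0)

# Cross-validated R² gives a more stable estimate than a single split.
cv_r2 = cross_val_score(forest, X, y, cv=5, scoring="r2")
print(f"mean CV R2 = {cv_r2.mean():.3f} (+/- {cv_r2.std():.3f})")
```

Cross-validation is also a reasonable way to choose the polynomial degree or the Lasso penalty when revisiting those models.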