
Analyzed the "How Couples Meet and Stay Together" dataset (2017–2022) to predict relationship duration using linear regression models.


Jenkins1128/StayTogetherRegression


How Couples Meet and Stay Together Regression

Isaiah Jenkins

Project Overview

This project analyzes the "How Couples Meet and Stay Together" dataset (2017–2022) to predict relationship duration using linear regression models. The study leverages a nationally representative survey of 4,002 American adults, with 3,009 reporting a spouse or romantic partner. The analysis focuses on key features such as age, income, employment status, past partner history, and relationship quality to identify factors influencing long-lasting relationships. Python libraries including Pandas, NumPy, Scikit-learn, Matplotlib, and Seaborn were used for data processing, modeling, and visualization.

Dataset

The dataset, sourced from the "How Couples Meet and Stay Together" study, contains 726 features capturing demographic and relationship data across multiple waves (2017–2022). This project uses Wave 3 (2022) data, narrowed to 10 relevant columns and preprocessed as follows:

  • Key Features: w3_ppage (age), w3_ppincimp (income), w3_ppwork (employment status), w3_past_partners_gender_1/2/3 (past partner history), w3_relatives (number of relatives), w3_weekly_sex_frequency, w3_rel_qual (relationship quality), and w3_relationship_duration_yrs (target variable).
  • Preprocessing: Handled missing values, encoded categorical variables, and filtered to 1,026 records with complete data for analysis.
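
The column selection and complete-case filtering described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes that `w3_past_partners_gender_1/2/3` expands to three columns with `_1`, `_2`, `_3` suffixes, and it uses a tiny made-up frame in place of the real CSV. The helper name `select_complete_rows` is hypothetical.

```python
import numpy as np
import pandas as pd

# Column names follow the feature list above; the three past-partner
# columns are an assumed expansion of "w3_past_partners_gender_1/2/3".
FEATURES = [
    "w3_ppage", "w3_ppincimp", "w3_ppwork",
    "w3_past_partners_gender_1",
    "w3_past_partners_gender_2",
    "w3_past_partners_gender_3",
    "w3_relatives", "w3_weekly_sex_frequency", "w3_rel_qual",
]
TARGET = "w3_relationship_duration_yrs"


def select_complete_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the chosen columns and drop rows with any missing value."""
    return df[FEATURES + [TARGET]].dropna().reset_index(drop=True)


# Made-up stand-in for the survey data: the third row is entirely missing.
demo = pd.DataFrame({col: [1.0, 2.0, np.nan] for col in FEATURES + [TARGET]})
demo["unused_column"] = 0  # an extra column that the selection should drop

clean = select_complete_rows(demo)
print(clean.shape)
```

With the real file, the same function would presumably be applied to the frame returned by `pd.read_csv("data/HCMST_2017_to_2022.csv")`.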

Analysis

The analysis included:

  1. Data Exploration: Examined dataset structure, identified missing values, and computed descriptive statistics.
  2. Feature Engineering: Encoded categorical variables (e.g., income, employment, relationship quality) using one-hot encoding, resulting in 40 features.
  3. Modeling:
    • Baseline Linear Regression: Achieved train R² of 0.635 and test R² of 0.572.
    • Polynomial Regression: Degree 2 and 4 models yielded negative R² scores (-3.21e6 and -3113.74, respectively), indicating severe overfitting: the models predicted worse than always guessing the mean duration.
    • Lasso Regression: Applied regularization, achieving a test R² of 0.583, with key features like age and income showing influence.
  4. Visualizations: Used box plots and other visualizations to inspect feature distributions and relationships.
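
The encoding and modeling steps above can be sketched along these lines. This is a hedged illustration on synthetic data, not the survey and not the notebook's code. The column names (`age`, `income`, `rel_qual`), the feature scaling step, the 80/20 split, and the Lasso `alpha` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in data (NOT the HCMST survey): one numeric column
# and two categorical columns with hypothetical names and levels.
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.choice(["low", "mid", "high"], size=n),
    "rel_qual": rng.choice(["poor", "fair", "good", "excellent"], size=n),
})
# Target loosely tied to age plus noise, purely for illustration.
y = 0.5 * df["age"] - 8 + rng.normal(0, 5, size=n)

# One-hot encode the categorical columns.
X = pd.get_dummies(df, columns=["income", "rel_qual"], dtype=float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling and alpha=0.1 are assumptions, not the notebook's settings.
models = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "poly_deg2": make_pipeline(
        StandardScaler(), PolynomialFeatures(degree=2), LinearRegression()
    ),
    "lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
}

# Fit each model and record (train R², test R²).
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train R2 = {scores[name][0]:.3f}, test R2 = {scores[name][1]:.3f}")
```

Comparing train and test R² side by side for each model is what exposes overfitting: a polynomial model can score higher on the training split while scoring far lower on the test split.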

Key Findings

  • Model Performance: Lasso regression achieved the highest test R² (0.583), narrowly ahead of the baseline linear regression (0.572), while the polynomial models failed to generalize. All models were limited by inconsistent relationship duration data (e.g., excellent relationship quality reported for both short and long durations).
  • Feature Insights: Age and income were influential predictors, but inconsistencies in the target variable limited model accuracy.
  • Challenges: The dataset's complexity and inconsistencies in outcome variables hindered robust predictions.
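
The negative R² values reported for the polynomial models can look surprising, since R² is often described as ranging from 0 to 1. On data the model was not fit to, R² compares the model's squared error with that of always predicting the mean, so it drops below zero whenever the model does worse than that constant guess. A small sketch with made-up numbers:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up target values, for illustration only.
y_true = np.array([2.0, 5.0, 9.0, 14.0])

# Predicting the mean for every sample gives an R² of 0.
mean_pred = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, mean_pred)

# Predictions far from the truth give a negative R².
bad_pred = np.array([30.0, -20.0, 40.0, -10.0])
r2_bad = r2_score(y_true, bad_pred)

print(r2_mean, r2_bad)
```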

Installation

To run this project, install the required dependencies:

pip install pandas numpy matplotlib seaborn scikit-learn

Download the HCMST_2017_to_2022.csv dataset and place it in the data/ directory.

Usage

  1. Clone the repository:

    git clone https://github.com/Jenkins1128/StayTogetherRegression.git
    cd StayTogetherRegression
  2. Set up the dataset:

    • Place HCMST_2017_to_2022.csv in the data/ directory.
  3. Run the Jupyter Notebook:

    jupyter notebook Stay_Together_Regression.ipynb
  4. Follow the notebook to explore data, train models, and review results.

Next Steps

  • Expand Dataset: Incorporate data from Waves 1 and 2 to increase sample size and feature diversity.
  • Refine Features: Select more consistent outcome variables and explore additional features (e.g., education, shared interests).
  • Model Improvements: Revisit polynomial regression with tuned hyperparameters and explore non-linear models (e.g., decision trees, random forests).
  • Alternative Datasets: Consider datasets with more consistent relationship duration metrics for improved predictive accuracy.
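
As a starting point for the non-linear models mentioned above, a random forest could be evaluated with cross-validation along these lines. This is only a sketch on synthetic data. The feature matrix, the target function, and the hyperparameters are assumptions and have not been run against the HCMST data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Synthetic data with a deliberately non-linear target (illustration only).
X = rng.uniform(0, 1, size=(200, 3))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, size=200)

# n_estimators=100 is an arbitrary choice; tune it on real data.
forest = RandomForestRegressor(n_estimators=100, random_state=0)

# 5-fold cross-validated R² is less sensitive to one split than a single test score.
cv_r2 = cross_val_score(forest, X, y, cv=5, scoring="r2")
print(cv_r2.mean())
```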
