WiDS 2020 Datathon

This repository contains code for the Women in Data Science (WiDS) 2020 Datathon, completed by the San Jose State University team of Sonia Meyer and Emma Hendry. The code preprocesses the data, selects features, and trains a logistic regression model on the dataset.

The code is organized into several sections:

Module Import and Data Reading: Imports necessary modules and reads the training and unlabeled data from CSV files.
Handling Missing Values: Drops columns with more than 75% missing values and removes irrelevant columns.
Removing Collinear Variables: Removes variables that are highly correlated with one another.
Feature Selection: Selects the 20 variables most correlated with the target and looks up their descriptions in the data dictionary, then chooses a subset of variables for training the model.
Data Imputation: Imputes missing values with the median value for the selected variables.
Training the Model: Trains a logistic regression model on the imputed data and evaluates its performance.
Cross-Validation: Performs cross-validation on the trained model to assess its average accuracy.
Receiver Operating Characteristic (ROC) Curve: Plots the ROC curve and calculates the Area Under the Curve (AUC) to evaluate the model's performance.
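
The sections above can be sketched end to end as follows. This is a minimal illustration rather than the notebook's actual code: it runs on a small synthetic table instead of training_v2.csv, every column name is hypothetical, and the 0.9 collinearity cutoff and top-2 selection (standing in for the top 20) are assumptions chosen so the example stays small.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the training data; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
df["x1_copy"] = df["x1"] + rng.normal(scale=0.01, size=n)  # collinear with x1
df["mostly_missing"] = np.where(rng.random(n) < 0.9, np.nan, 1.0)
noise = rng.normal(scale=0.5, size=n)
df["target"] = (df["x1"] + 0.5 * df["x2"] + noise > 0).astype(int)
df.loc[rng.random(n) < 0.1, "x2"] = np.nan  # scatter some missing values

# Handling missing values: drop columns with more than 75% missing entries.
missing_frac = df.isna().mean()
df = df.drop(columns=missing_frac[missing_frac > 0.75].index)

# Removing collinear variables: for each highly correlated pair, drop the
# later column. The 0.9 cutoff is an assumption, not taken from the notebook.
features = df.drop(columns="target")
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
features = features.drop(columns=to_drop)

# Feature selection: keep the variables most correlated with the target.
top_k = 2  # the README describes a top 20; 2 suits this toy table
target_corr = features.corrwith(df["target"]).abs().sort_values(ascending=False)
selected = target_corr.head(top_k).index.tolist()

# Data imputation: fill remaining gaps with each column's median.
X = features[selected].fillna(features[selected].median())
y = df["target"]

# Training the model: logistic regression, scored on a held-out split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

# Cross-validation: average accuracy over 5 folds.
cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy"
)

# ROC curve and AUC computed from the predicted probabilities.
probs = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
auc = roc_auc_score(y_test, probs)

print(f"dropped as collinear: {to_drop}")
print(f"selected features: {selected}")
print(f"hold-out accuracy: {accuracy:.3f}")
print(f"mean CV accuracy: {cv_scores.mean():.3f}")
print(f"AUC: {auc:.3f}")
```

To draw the curve itself, the fpr and tpr arrays can be passed to a plotting library such as matplotlib. Applying the same steps to the real data would start from pd.read_csv("training_v2.csv") in place of the synthetic frame.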

The repository contains the following files:

README.md
WiDS Datathon 2020 Dictionary.csv
samplesubmission.csv
sjsu_wids_solutions.csv
solution_template.csv
training_v2.csv
unlabeled.csv
wids2020.ipynb

soniawmeyer/wids2020

Folders and files

Latest commit

History

Repository files navigation

WIDS 2020 Datathon

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages