This repository contains code for training logistic regression, random forest, and gradient boosting models to predict tournament results. The implementation supports both men's and women's tournaments.
- Mount Google Drive to access datasets.
- Load and preprocess data.
- Train a logistic regression model using LogisticRegression from sklearn.
- Calibrate the model using CalibratedClassifierCV.
- Generate predictions for tournament teams.
- Save results in a submission file.
- Loads data from Google Drive.
- Performs feature engineering (calculates score difference as a feature).
- Splits data into training and testing sets.
- Uses a pipeline with standardization and logistic regression.
- Calibrates the model for better probability estimates.
- Generates win probability predictions for each team.
- Creates a CSV submission file for tournament predictions.
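
The logistic regression steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the repository's actual code: the column names (`team_a_avg_points`, `team_b_avg_points`, `score_diff`, `team_a_won`), the split ratio, and the calibration settings are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real game data (all column names are hypothetical).
rng = np.random.default_rng(0)
n_games = 400
games = pd.DataFrame({
    "team_a_avg_points": rng.normal(70, 8, n_games),
    "team_b_avg_points": rng.normal(70, 8, n_games),
})

# Feature engineering: score difference between the two teams.
games["score_diff"] = games["team_a_avg_points"] - games["team_b_avg_points"]

# Synthetic label: team A wins more often when its score difference is larger.
win_chance = 1 / (1 + np.exp(-games["score_diff"] / 5))
games["team_a_won"] = (rng.random(n_games) < win_chance).astype(int)

X = games[["score_diff"]]
y = games["team_a_won"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Pipeline: standardize the feature, then fit logistic regression.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())

# Wrap the pipeline in a calibrator (sigmoid calibration with 5-fold CV is an assumption).
calibrated = CalibratedClassifierCV(pipeline, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

# Probability that team A wins each held-out game.
win_probabilities = calibrated.predict_proba(X_test)[:, 1]
```

Placing the scaler inside the pipeline means it is refit on each calibration fold, so no statistics from held-out folds leak into training.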
- Load preprocessed features from logistic regression section.
- Train a RandomForestClassifier.
- Calibrate using CalibratedClassifierCV.
- Predict tournament outcomes.
NameError: name 'X' is not defined, due to missing feature definitions.
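
One way to avoid this kind of NameError is to define the feature matrix in the same scope before the classifier is fitted. The sketch below assumes a single synthetic score-difference feature. The data, the hyperparameters, and the isotonic calibration method are illustrative choices, not values taken from the repository.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic feature and label (hypothetical stand-ins for the preprocessed data).
rng = np.random.default_rng(1)
score_diff = rng.normal(0, 10, 300)

# X and y are defined explicitly here, before any model code refers to them.
X = score_diff.reshape(-1, 1)
y = (score_diff + rng.normal(0, 5, 300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# Random forest wrapped in a calibrator (settings are assumptions).
forest = RandomForestClassifier(n_estimators=100, random_state=1)
calibrated_forest = CalibratedClassifierCV(forest, method="isotonic", cv=3)
calibrated_forest.fit(X_train, y_train)

# Calibrated win probabilities for the held-out rows.
forest_probabilities = calibrated_forest.predict_proba(X_test)[:, 1]
```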
- Define a LGBMClassifier model.
- Calibrate with CalibratedClassifierCV.
- Train and predict probabilities for tournament teams.
NameError: name 'X_train' is not defined, due to missing variable assignment.
- Define a parameter grid for hyperparameter tuning.
- Use GridSearchCV with brier_score_loss to optimize model performance.
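
The tuning step could be sketched as below. This README does not say which estimator is tuned or which parameters are searched, so the random forest and the small grid here are assumptions. The example uses scikit-learn's built-in `"neg_brier_score"` scorer, which negates `brier_score_loss` so that higher is better, as `GridSearchCV` expects.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic feature and label (hypothetical stand-ins for the real features).
rng = np.random.default_rng(3)
score_diff = rng.normal(0, 10, 300)
X = score_diff.reshape(-1, 1)
y = (score_diff + rng.normal(0, 5, 300) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=3, stratify=y
)

# Hypothetical parameter grid; the real grid is not given in this README.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 4]}

search = GridSearchCV(
    RandomForestClassifier(random_state=3),
    param_grid,
    scoring="neg_brier_score",  # negated Brier score: closer to 0 is better
    cv=3,
)
search.fit(X_train, y_train)

best_params = search.best_params_

# Brier score of the refitted best model on held-out data (lower is better).
test_brier = brier_score_loss(y_test, search.predict_proba(X_test)[:, 1])
```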
A function create_submission_file_2025() generates a CSV file containing predictions for matchups in the 2025 NCAA tournament. It combines results from men's and women's tournaments.
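
To illustrate the idea, here is a hedged sketch of how such a combined submission could be assembled. It is not the body of `create_submission_file_2025()`. The helper `build_submission`, the `ID` and `Pred` column names, the `season_lowid_highid` identifier format, the team IDs, and the constant placeholder model are all hypothetical.

```python
import io
from itertools import combinations

import pandas as pd


def build_submission(men_team_ids, women_team_ids, predict_win_probability, season=2025):
    """Build one table of pairwise predictions covering both tournaments."""
    rows = []
    for team_ids in (men_team_ids, women_team_ids):
        # Every unordered pair within a tournament, lower team ID first.
        for low_id, high_id in combinations(sorted(team_ids), 2):
            rows.append({
                "ID": f"{season}_{low_id}_{high_id}",
                "Pred": predict_win_probability(low_id, high_id),
            })
    return pd.DataFrame(rows, columns=["ID", "Pred"])


def placeholder_model(team_a, team_b):
    # Stand-in for a trained, calibrated model: always predicts a coin flip.
    return 0.5


submission = build_submission([1101, 1102, 1103], [3101, 3102], placeholder_model)

# Write to an in-memory buffer here; a real run would pass a file path instead.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Keeping the men's and women's pairs in separate loops ensures that no cross-tournament matchups end up in the file.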