This repository contains the solution for the second day's problem of the IOAI TST (Team Selection Test) in Kazakhstan. The objective was to cluster football (soccer) players based on their in-game attributes.
Kaggle Competition Link: https://www.kaggle.com/competitions/tst-day-2
The task was to group football players into clusters based on their various in-game statistics (e.g., passing, shooting, defense). A key challenge was that players could have multiple positions (e.g., {CM, CAM}), and the clustering needed to reflect these overlapping positional roles.
The clustering performance was evaluated using the B-Cubed F1 score for multi-positional clustering. This metric assesses how well players with overlapping positions are grouped together.
Our solution achieved a score of 0.534 on this metric.
The approach taken involves several steps:
- Data Loading: Reading the player data and the sample submission file.
- Feature Engineering (Meta-features): Creating aggregated "meta-features" from raw player statistics to represent broader skill sets (e.g., `attacking_skill`, `passing_ability`).
- Goalkeeper Identification: Separately identifying goalkeepers, as their skill sets are distinctly different from those of outfield players.
- Data Splitting: Dividing the dataset into goalkeepers and outfield players.
- Preprocessing: Applying imputation for missing values and standardization to scale features.
- Optimal Cluster Determination: Using the Silhouette Score to find an optimal number of clusters for outfield players.
- Clustering: Applying Gaussian Mixture Models (GMM) for clustering both outfield players and goalkeepers separately.
- Cluster ID Consolidation: Assigning unique cluster IDs to both groups.
- Submission Generation: Merging the clustered data with the sample submission format.
The provided notebook `batyr-yerdenov-2.2.ipynb` (intended to be run in a Jupyter/Kaggle environment) implements the following:
- Import Libraries: Essential libraries such as `pandas`, `numpy`, `sklearn.preprocessing`, `sklearn.impute`, `sklearn.mixture`, and `sklearn.metrics`.
- Data Loading:

```python
df = pd.read_csv("/kaggle/input/tst-day-2/train.csv")
sample = pd.read_csv("/kaggle/input/tst-day-2/sample_submission.csv")
```
- Meta-feature Engineering: New features are created by averaging relevant raw attributes. Examples include `attacking_skill`, `passing_ability`, `dribble_mobility`, `pace`, `defense_skill`, `physicality`, `set_piece_specialist`, `goalkeeper_score`, `composure_score`, `offensive_support`, `attack_support`, and `defending_positioning`.
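  The exact attribute groupings live in the notebook; the sketch below only illustrates the averaging pattern, and the raw column names (other than the goalkeeper attributes listed further down) are assumptions.

  ```python
  # Minimal sketch of the meta-feature pattern: each meta-feature is the row-wise
  # mean of a group of raw attributes. Raw column names here are assumptions.
  df["attacking_skill"] = df[["finishing", "shot_power", "volleys"]].mean(axis=1)
  df["passing_ability"] = df[["short_passing", "long_passing", "vision"]].mean(axis=1)
  df["goalkeeper_score"] = df[["gk_diving", "gk_handling", "gk_kicking",
                               "gk_positioning", "gk_reflexes"]].mean(axis=1)
  ```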
- Goalkeeper Identification (`is_gk`): A boolean column `is_gk` is created. A player is classified as a goalkeeper if all of their goalkeeper-specific attributes (`gk_diving`, `gk_handling`, `gk_kicking`, `gk_positioning`, `gk_reflexes`) are above a threshold of 40.
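  A minimal sketch of this rule, assuming the goalkeeper attribute columns are named exactly as listed above:

  ```python
  gk_cols = ["gk_diving", "gk_handling", "gk_kicking", "gk_positioning", "gk_reflexes"]
  df["is_gk"] = (df[gk_cols] > 40).all(axis=1)  # True only if every GK attribute exceeds 40
  ```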
- Data Splitting: The DataFrame `df` is split into `gk_df` (goalkeepers) and `field_df` (outfield players).
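  A short sketch of the split using the `is_gk` flag:

  ```python
  gk_df = df[df["is_gk"]].copy()      # goalkeepers
  field_df = df[~df["is_gk"]].copy()  # outfield players
  ```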
- Feature Selection for Clustering: `features` is a list of meta-features used for clustering outfield players; `gk_features` is a list containing `goalkeeper_score` for clustering goalkeepers.
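  The exact contents of `features` are defined in the notebook; the list below is only a plausible reconstruction from the meta-features named earlier (which of them are actually included is an assumption), while `gk_features` matches the description:

  ```python
  # Assumed composition of the outfield feature list; the notebook defines the exact set.
  features = ["attacking_skill", "passing_ability", "dribble_mobility", "pace",
              "defense_skill", "physicality", "set_piece_specialist",
              "composure_score", "offensive_support", "attack_support",
              "defending_positioning"]
  gk_features = ["goalkeeper_score"]
  ```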
- Preprocessing Function (`preprocess`):

```python
def preprocess(X):
    X = SimpleImputer(strategy="mean").fit_transform(X)  # impute missing values with the column mean
    X = StandardScaler().fit_transform(X)                # scale features to zero mean and unit variance
    return X
```
- Optimal Cluster Determination for Outfield Players: A loop iterates from 5 to 14 clusters, fitting a `GaussianMixture` model and calculating the `silhouette_score`. The number of clusters yielding the highest silhouette score is chosen as `best_k` (found to be 6 in this run).
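  A sketch of this selection loop, assuming the preprocessed outfield matrix is named `X_field` (the fixed `random_state` is an assumption added for reproducibility):

  ```python
  from sklearn.metrics import silhouette_score
  from sklearn.mixture import GaussianMixture

  best_k, best_score = None, -1.0
  for k in range(5, 15):  # candidate cluster counts: 5 to 14
      labels = GaussianMixture(n_components=k, random_state=42).fit_predict(X_field)
      score = silhouette_score(X_field, labels)
      if score > best_score:
          best_k, best_score = k, score
  print(best_k)  # reported as 6 in this run
  ```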
- Clustering with GMM: `gmm_field` is a Gaussian Mixture Model fitted to `X_field` with `best_k` components; `gmm_gk` is a Gaussian Mixture Model fitted to `X_gk` with 1 component (assuming goalkeepers form a single, distinct cluster).
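  A sketch of the two fits, reusing the `preprocess` helper and feature lists from the earlier steps (the `random_state` is again an assumption):

  ```python
  X_field = preprocess(field_df[features])
  X_gk = preprocess(gk_df[gk_features])

  gmm_field = GaussianMixture(n_components=best_k, random_state=42).fit(X_field)
  gmm_gk = GaussianMixture(n_components=1, random_state=42).fit(X_gk)
  ```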
- Assigning Cluster Labels: Cluster labels are predicted for both groups. Goalkeeper cluster IDs are offset by `best_k` to ensure unique IDs across the entire dataset.
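  A sketch of the labelling and offsetting step:

  ```python
  field_df["cluster"] = gmm_field.predict(X_field)
  gk_df["cluster"] = gmm_gk.predict(X_gk) + best_k  # offset keeps goalkeeper cluster IDs unique
  ```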
- Combining Results: The clustered `field_df` and `gk_df` are concatenated and sorted by player ID.
- Submission Generation: The final `submission.csv` file is created by merging the player IDs and their assigned `cluster` with `sample_submission.csv`.
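  A sketch of these last two steps, assuming the ID column is called `player_id` (the actual column name in the competition files may differ):

  ```python
  # Combine the two clustered groups and build the submission file.
  result = pd.concat([field_df, gk_df]).sort_values("player_id")
  submission = sample[["player_id"]].merge(result[["player_id", "cluster"]],
                                           on="player_id", how="left")
  submission.to_csv("submission.csv", index=False)
  ```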
The solution achieved a B-Cubed F1 score of 0.534.
- `pandas`
- `numpy`
- `scikit-learn` (for `StandardScaler`, `SimpleImputer`, `GaussianMixture`, `silhouette_score`)
You can install these dependencies using pip:
```bash
pip install pandas numpy scikit-learn
```

- Download the data: Obtain `train.csv` and `sample_submission.csv` from the competition page (https://www.kaggle.com/competitions/tst-day-2) and place them in the specified path (`/kaggle/input/tst-day-2/`). If running locally, adjust the paths accordingly.
- Run the Jupyter Notebook: Open and run the `batyr-yerdenov-2.2.ipynb` notebook.
- Generate Submission: The notebook will automatically generate a `submission.csv` file in the same directory where it is executed. This file contains player IDs and their predicted cluster assignments.