Hello! This is my third data science project. After playing around with supervised learning in my previous projects, I decided to explore unsupervised learning, particularly clustering. Here, I use an aviation dataset to create a customer segmentation, analyze the customer profiles in the resulting clusters, and give business recommendations accordingly. Although I am not an expert in the aviation industry, this dataset is still very interesting and fun to play around with. Enjoy!
This repo contains this README, a Python notebook called Airline Customer Segmentation.ipynb, a set of slides in PDF format called 'Airline Customer Segmentation - slides.pdf' that highlights the most important aspects of this project, and the dataset flight.csv in the data directory.
I use the Airline Customer Segmentation dataset from Kaggle. It is still unclear to me where this dataset originally comes from, but I notice that it has been circulating around the internet for at least 4 years. The content of this dataset is as follows:
Basic customer information:
MEMBER_NO: Membership card number (ID)
FFP_DATE: Membership join date
FIRST_FLIGHT_DATE: First flight date
GENDER: Gender
FFP_TIER: Membership card level
WORK_CITY: The city where the customer works
WORK_PROVINCE: The province where the customer works
WORK_COUNTRY: The country where the customer works
AGE: Age

Flight information:
LOAD_TIME: The end time of the observation window (observation window: time period of observation)
FLIGHT_COUNT: Number of flights in the observation window
SUM_YR_1: Fare revenue
SUM_YR_2: Votes prices
SEG_KM_SUM: Total flight kilometers in the observation window
LAST_FLIGHT_DATE: Last flight date
LAST_TO_END: The time from the last flight to the end of the observation window
AVG_INTERVAL: Average flight time interval
MAX_INTERVAL: Maximum flight interval
avg_discount: Average discount rate

Integral information:
BP_SUM: Total basic integral
EXCHANGE_COUNT: Number of points exchanged
Points_Sum: Total cumulative points
Point_NotFlight: Points not used by the customer
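As a rough illustration of how the columns above could be read into pandas, here is a minimal sketch. The helper name `load_flights` is hypothetical, the in-memory sample stands in for data/flight.csv with made-up values, and coercing unparseable dates to missing values is an assumption about how one might clean the data, not necessarily what the notebook does.

```python
import io

import pandas as pd

# Date-like columns from the data dictionary above
DATE_COLUMNS = ["FFP_DATE", "FIRST_FLIGHT_DATE", "LOAD_TIME", "LAST_FLIGHT_DATE"]


def load_flights(source):
    """Read the flight data and convert the date columns to datetimes.

    `source` can be a path such as "data/flight.csv" or a file-like object.
    """
    df = pd.read_csv(source)
    for col in DATE_COLUMNS:
        if col in df.columns:
            # errors="coerce" turns unparseable values into NaT instead of raising
            df[col] = pd.to_datetime(df[col], errors="coerce")
    return df


# Tiny in-memory sample standing in for the real file (values are made up)
sample_csv = io.StringIO(
    "MEMBER_NO,FFP_DATE,LOAD_TIME,LAST_FLIGHT_DATE,FLIGHT_COUNT\n"
    "1,2006/11/02,2014/03/31,2014/03/31,210\n"
    "2,2007/02/19,2014/03/31,not-a-date,140\n"
)
flights = load_flights(sample_csv)
print(flights.dtypes)
```

With the real file, the call would be `load_flights("data/flight.csv")`.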
In this project, I use the LRFMC model, which is commonly used for aviation datasets, to create the segmentation. I use the K-means algorithm to segment the customers, and I employ the elbow method and the silhouette score to determine the optimal number of clusters (k-value). I characterize each cluster by its typical LRFMC values, for which I take the median.
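The workflow above can be sketched roughly as follows. The column-to-feature mapping in `build_lrfmc` is an assumption based on the usual reading of LRFMC (L: length of membership, R: recency, F: frequency, M: mileage, C: discount coefficient); the notebook may define the features differently. The function names and the synthetic demo data are hypothetical, and the sketch uses scikit-learn's `silhouette_score` directly rather than the yellowbrick visualizer.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def build_lrfmc(df):
    """Assumed mapping from dataset columns to the five LRFMC features."""
    out = pd.DataFrame(index=df.index)
    # L: membership length in days, from join date to end of observation window
    out["L"] = (pd.to_datetime(df["LOAD_TIME"]) - pd.to_datetime(df["FFP_DATE"])).dt.days
    out["R"] = df["LAST_TO_END"]    # recency
    out["F"] = df["FLIGHT_COUNT"]   # frequency
    out["M"] = df["SEG_KM_SUM"]     # mileage
    out["C"] = df["avg_discount"]   # discount coefficient
    return out.dropna()


def scan_k(features, k_values, seed=0):
    """Fit K-means for each k; collect inertia (elbow method) and silhouette score."""
    X = StandardScaler().fit_transform(features)  # put features on a common scale
    rows = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        rows.append({
            "k": k,
            "inertia": model.inertia_,
            "silhouette": silhouette_score(X, model.labels_),
        })
    return pd.DataFrame(rows).set_index("k")


# Synthetic stand-in for the real data (random values, for illustration only)
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "LOAD_TIME": pd.Timestamp("2014-03-31"),
    "LAST_TO_END": rng.integers(1, 730, n),
    "FLIGHT_COUNT": rng.integers(2, 200, n),
    "SEG_KM_SUM": rng.integers(500, 300000, n),
    "avg_discount": rng.uniform(0.3, 1.5, n),
})
demo["FFP_DATE"] = demo["LOAD_TIME"] - pd.to_timedelta(rng.integers(100, 3000, n), unit="D")

features = build_lrfmc(demo)
scores = scan_k(features, range(2, 9))
print(scores)
```

Reading the result: the elbow method looks for the k after which inertia stops dropping sharply, while a higher silhouette score suggests better separated clusters.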
This project requires the standard numpy, pandas, matplotlib, seaborn, and sklearn packages. In addition, it uses yellowbrick to visualize the silhouette score and plotly to create radar charts, so you might need to install these two packages beforehand if you want to run the notebook.
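If you are unsure which of these packages are already available, a small check like the one below can help. This is only a convenience sketch and not part of the notebook. Note that `sklearn` is the import name, while the package is installed as scikit-learn.

```python
from importlib.util import find_spec

# Import names of the packages listed above
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn", "sklearn", "yellowbrick", "plotly"]

# find_spec returns None when a top-level package cannot be found
missing = [name for name in REQUIRED if find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```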
I find that k=6 gives the best segmentation based on the elbow method and the silhouette score. I also try k=5 and k=7 to see what information I gain or lose. With only 5 clusters, we lose the group of potential VIP members, so it is not optimal. On the other hand, using 7 clusters adds a cluster with mostly redundant characteristics, so it is not worth the additional effort.
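A comparison like this can be made by looking at the per-cluster medians for each candidate k. The sketch below shows one way to do that, assuming a feature table with columns L, R, F, M, and C. The helper `cluster_medians` is hypothetical, and the data is random, so the resulting profiles say nothing about the real clusters.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster_medians(features, k, seed=0):
    """Cluster standardized features with K-means; return per-cluster medians and sizes."""
    X = StandardScaler().fit_transform(features)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    grouped = features.groupby(labels)
    profile = grouped.median()        # typical (median) value per feature, original units
    profile["size"] = grouped.size()  # number of customers in each cluster
    return profile


# Random stand-in for the LRFMC feature table (illustration only)
rng = np.random.default_rng(1)
n = 300
lrfmc = pd.DataFrame({
    "L": rng.integers(100, 3000, n),
    "R": rng.integers(1, 730, n),
    "F": rng.integers(2, 200, n),
    "M": rng.integers(500, 300000, n),
    "C": rng.uniform(0.3, 1.5, n),
})

# Compare the candidate values of k side by side
profiles = {k: cluster_medians(lrfmc, k) for k in (5, 6, 7)}
for k, profile in profiles.items():
    print(f"k={k}")
    print(profile.round(2))
```

Going from one k to the next, a new cluster whose medians closely resemble an existing one suggests redundancy, while a distinct profile that disappears suggests lost information.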