- ABSTRACT: This project investigates the features that drive resale premiums on StockX and their predictive power, using feature engineering and an external popularity index for different brands.
- ASSUMPTION: Hot sneakers show little seasonality, so this project does not cover time series analysis.
- DATA: StockX Data Challenge 2019; Demographic Data 2020
- MODEL: Tree-based and Linear Regression

Tree-based | Linear |
---|---|
Random Forest | Lasso |
XGBoost | SVM (Linear Kernel) |

- RESULT: We applied our strategy to identify undervalued shoes through March 2020; by the end of March 2020, the prices of these shoes had risen to within 80% of our predictions.
Out-of-Sample Results:
On the right is the famous Red Nike Yeezy. Its retail price is $250, and the latest resale price is $6,200, a markup of nearly 2400%. The high resale premium of this pair is not an isolated event. The once-niche sneaker resale market has grown into a $2 billion market and is projected to reach $6 billion by 2025. Within this market, StockX is one of the largest platforms. The website operates like a stock exchange: users place bid or ask prices, and a deal is made whenever there's a match. What makes the platform valuable for us is that it offers transparent and actionable data. Using this data, we want to build a predictive model that identifies undervalued sneakers, which resellers can buy now and sell at a higher price later.
We used public data offered through the StockX Data Contest, which consists of 99,956 transactions from 2017 to 2019. The dataset covers two brands, Yeezy and Nike Off-White, and over 50 different styles.
To expand the feature set, we also manually collected the colorway and number of sales from the StockX website. We then converted style and color into dummy variables and applied frequency encoding to sneaker size, since common shoe sizes sell at a higher premium. For modeling purposes, our target variable is the price premium, defined as the price markup over the retail price, and our input variables are days since release, style, colorway, size, and number of sales.
Table: Initial Data Preprocessing
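
The preprocessing described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the sample rows and every column name (`style`, `color`, `shoe_size`, `retail_price`, `sale_price`) are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample of transactions; column names are illustrative only.
sales = pd.DataFrame({
    "style": ["air-jordan", "blazer", "yeezy-350", "yeezy-350"],
    "color": ["White", "White", "Black", "Tan"],
    "shoe_size": [10.0, 10.0, 9.0, 10.0],
    "retail_price": [160.0, 130.0, 220.0, 220.0],
    "sale_price": [800.0, 390.0, 330.0, 440.0],
})

# Target: price premium = markup over retail, as a fraction of the retail price.
sales["price_premium"] = (sales["sale_price"] - sales["retail_price"]) / sales["retail_price"]

# Frequency encoding: replace each size by its share of all transactions.
size_freq = sales["shoe_size"].value_counts(normalize=True)
sales["size_freq"] = sales["shoe_size"].map(size_freq)

# Dummy variables for the categorical style and color columns.
features = pd.get_dummies(sales[["style", "color", "size_freq"]], columns=["style", "color"])
print(features.columns.tolist())
```

Frequency encoding keeps size as a single numeric column, so common sizes receive larger values without adding one dummy column per size.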
Looking at the price premium, we can easily see that it is heavily skewed to the right. Most resale transactions were marked up between 0% and 500%, yet some extreme transactions had markups of over 2000%. We therefore first conduct anomaly detection to examine the points on the very high end and find common features among them.
Since the target variable is non-negative and heavily right-skewed, we take its logarithm to obtain more robust outlier detection results.
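
As a small illustration of why the log transform helps, the sketch below compares the skewness of a made-up premium series before and after the transform. The values are hypothetical, and `np.log1p` (which stays defined at a premium of zero) is an assumption; the text only says that a logarithm is taken.

```python
import numpy as np
import pandas as pd

# Hypothetical price premiums (as fractions of retail), with one extreme value.
premium = pd.Series([0.2, 0.5, 0.7, 1.0, 1.5, 3.0, 24.0])

# log1p(x) = log(1 + x) is defined at x = 0 and compresses the right tail.
log_premium = np.log1p(premium)

print("skew before:", premium.skew())
print("skew after: ", log_premium.skew())
```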
- Step 1: Train an isolation forest on the target value and use its decision rules to find outliers.

    from sklearn.ensemble import IsolationForest

    # Flag roughly 5% of the transactions as anomalies based on the target value
    model = IsolationForest(n_estimators=100, max_samples='auto', contamination=0.05, max_features=1.0)
    model.fit(y[['Pct_change']])
    y['scores'] = model.decision_function(y[['Pct_change']])  # lower score = more anomalous
    y['anomaly'] = model.predict(y[['Pct_change']])  # -1 = anomaly, 1 = normal

- Step 2: Create anomaly lists and compare them with the non-anomaly points

Most anomaly points lie on the right tail of the distribution, and their cut-off (using the median) is approximately exp(5). This is a crucial indicator: if our prediction exceeds a premium of roughly 100 times, the point is very likely an outlier, and some statistically important features underlie that pair of shoes.

Metric | Whole | Normal | Anomaly |
---|---|---|---|
Mean | 1.25 | 1.03 | 5.40 |
Median | 0.70 | 0.68 | 5.00 |

- Step 3: Explore anomaly points
Grouping the anomaly points by three features (brand, color, and region) shows which features are heavily weighted in our dataset.
Style: The top three styles that saw extreme resale prices are Air Jordan, Presto, and Blazer. Note that these are all Nike sneakers. Among them, Air Jordan saw the highest price premium, at over 2000%. Yeezy sneakers followed in fourth place.
Color: White is the dominant color feature. There are two hypotheses for why its resale prices are high:
- Most sneakers are white;
- White is indeed a significant feature.
To test these hypotheses, we will further examine which specific sneakers contributed the most.
Region: California and New York saw the highest overall price premium. However, this does not mean these two states have the highest per capita premium. See the per capita analysis further below.
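
The grouping in Step 3 could look roughly like the sketch below. The data frame, its column names (`style`, `color`, `region`, `log_premium`), and its values are hypothetical; only the `-1`/`1` anomaly labels follow the convention of scikit-learn's `IsolationForest.predict`.

```python
import pandas as pd

# Hypothetical scored transactions; `anomaly` follows the IsolationForest
# convention (-1 = anomaly, 1 = normal). Column names are illustrative.
scored = pd.DataFrame({
    "style": ["air-jordan", "air-jordan", "presto", "yeezy-350", "blazer", "yeezy-350"],
    "color": ["White", "Red", "White", "Black", "White", "Tan"],
    "region": ["California", "New York", "California", "Oregon", "New York", "Texas"],
    "log_premium": [5.6, 5.2, 5.0, 0.6, 4.9, 0.8],
    "anomaly": [-1, -1, -1, 1, -1, 1],
})

anomalies = scored[scored["anomaly"] == -1]

# Count anomalies and take their median log premium per level of each feature.
summaries = {
    col: anomalies.groupby(col)["log_premium"]
    .agg(["count", "median"])
    .sort_values("count", ascending=False)
    for col in ["style", "color", "region"]
}
print(summaries["color"])
```

Comparing these counts with the same counts over all transactions is one way to separate the two hypotheses about white: a color that is merely common should be over-represented in both groups.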
Statistically, the best time to resell is 3 to 5 weeks before the release date (day 0). The worst time to resell is the first 9 weeks after release, when the market is saturated. After that, as availability in the market declines, buyers are willing to pay higher premiums.
- Time effect on Top 3 Nike Brands

The best-selling Nike styles are Air Jordan, Presto, and Zoom. Among these sneakers, Air Jordan keeps the highest premium even long after release. For all three, the golden period for high price premiums comes after day 200. White has a relatively competitive premium compared with other colors.
- Time effect on Yeezy

It is worth noting that different Yeezy colors show different time patterns. Grey and tan lag white and black by roughly 200 days.
Our dataset consists of two major brands: Yeezy and Nike Off-White. For Yeezy, basic colors including black, white, and grey show constant growth, while bolder colors like orange start high but decline over time. For Off-White, red is the most popular color.
By number of sales, California ranks first and Oregon ranks third. However, when we look at the share of the population that purchases from StockX, Oregon comes out on top with nearly 2 transactions per 1,000 people. Hence, StockX might want to run more promotions in Oregon; keep in mind, though, that as resellers we cannot control the sales region.
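
The per capita comparison can be sketched as a simple join between sales counts and state populations. All numbers below are made up for illustration and are not taken from the StockX or demographic data; the column names are assumptions as well.

```python
import pandas as pd

# Hypothetical transaction counts and populations; the numbers are made up
# for illustration and are not taken from the project's data.
state_sales = pd.DataFrame({
    "state": ["California", "New York", "Oregon"],
    "transactions": [20000, 16000, 8000],
})
population = pd.DataFrame({
    "state": ["California", "New York", "Oregon"],
    "population": [40_000_000, 20_000_000, 4_000_000],
})

# Join on state, then normalize the counts to transactions per 1,000 people.
per_capita = state_sales.merge(population, on="state")
per_capita["per_1000"] = per_capita["transactions"] / per_capita["population"] * 1000
ranked = per_capita.sort_values("per_1000", ascending=False).reset_index(drop=True)
print(ranked)
```

In this toy example the state with the fewest transactions ranks first once the counts are normalized, which mirrors the reversal described for Oregon.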
To recap the features in our dataset: the target variable is the price premium, calculated as the percentage change of the sale price over the retail price, and the input features are days since release, brand, region, colorway, and number of sales, adding up to 31 variables.
In our project, we tried two types of machine learning models: linear and tree-based regressions. We carefully assessed the performance of each model using 5-fold cross validation:
- we used the same train-test split and the same seed for cross validation
- we used scikit-learn's GridSearchCV to tune all models
- we used R2 as our evaluation metric
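
A minimal sketch of this evaluation setup is shown below, assuming scikit-learn. The data is synthetic, and the seed value, the split ratio, and the use of shuffled `KFold` are assumptions; the project states only that the split and seed were shared across models.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

SEED = 42  # hypothetical seed; the project's actual value is not stated

# Synthetic stand-in data: 200 rows, 5 features, linear signal plus noise.
rng = np.random.default_rng(SEED)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=200)

# One split and one set of folds, reused for every model being compared.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)
folds = KFold(n_splits=5, shuffle=True, random_state=SEED)

search = GridSearchCV(Lasso(), {"alpha": [1e-3, 1e-2, 1e-1]}, cv=folds, scoring=make_scorer(r2_score))
search.fit(X_train, y_train)
print(search.best_params_, round(search.score(X_test, y_test), 3))
```

Passing the same `folds` object to each `GridSearchCV` call means every model is scored on identical folds, so differences in R2 come from the models rather than from the partitioning.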
- Linear Model - Lasso

According to our exploratory data analysis, certain styles and colors (e.g. white) have an outsized influence. To guard against outliers and overfitting, we used the Lasso model for its regularization.

    from sklearn.linear_model import Lasso
    from sklearn.metrics import make_scorer, r2_score
    from sklearn.model_selection import GridSearchCV

    lasso = Lasso()
    # Candidate shrinkage strengths on a log scale
    parameters = {'alpha': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3, 1e4, 1e5]}
    r2 = make_scorer(r2_score, greater_is_better=True)
    clf = GridSearchCV(lasso, parameters, cv=5, scoring=r2)

To choose the optimal shrinkage parameter, we used cross validation, and the optimal shrinkage coefficient is
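- Linear Model - SVM

    from sklearn import preprocessing
    from sklearn.metrics import make_scorer, r2_score
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import LinearSVR

    # Standardize the features before fitting the linear SVR
    x_train = preprocessing.scale(x_train)
    linearsvr = LinearSVR(tol=0.01)
    parameters = {'C': [0.01, 0.1, 1, 10, 100]}
    r2 = make_scorer(r2_score, greater_is_better=True)
    clf = GridSearchCV(linearsvr, parameters, cv=5, scoring=r2)
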
- Tree-based - XGBoost

XGBoost comes in handy when a dataset contains NaN values. In this project, we tuned the parameters of an XGBoost regression tree to generate predictions on our dataset.

    import xgboost as xgb
    from sklearn.metrics import make_scorer, r2_score
    from sklearn.model_selection import GridSearchCV

    # Search grid for the XGBoost regressor
    params = {
        'colsample_bytree': [i/10. for i in range(8,11)],
        'subsample': [i/10. for i in range(8,11)],
        'eta': [.3, .4, .5],
        'max_depth': list(range(3,6)),
        'min_child_weight': list(range(4,7)),
        'eval_metric': ['rmse'],
        'objective': ['reg:squarederror'],
    }
    xg_reg = xgb.XGBRegressor()
    r2 = make_scorer(r2_score, greater_is_better=True)
    clf = GridSearchCV(xg_reg, params, cv=5, scoring=r2)

In hyper-parameter selection, `max_depth` and `min_child_weight` are limited by the number of attributes; in our case, we chose a range whose combinations stay below the variable count. Feature selection in the XGBoost regression tree is achieved by grid searching `colsample_bytree` and `subsample`. Although we limited `max_depth` and `min_child_weight`, the optimal parameters are 4 and 5, which yields a combination larger than the number of variables. The model is therefore prone to overfitting in this case.

Model | R2 score | MSE | Pros | Cons |
---|---|---|---|---|
Lasso | 0.78 | 0.51 | Easy to interpret, handles overfitting, handles multicollinearity | Weak against outliers |
SVM | 0.75 | 0.39 | Memory efficient, fast, accurate in high dimensions | Overfitting |
XGBoost | 0.96 | 0.04 | High R2, clear tree structure | Overfitting |

According to our modeling results, XGBoost has the highest R2. However, we consider the XGBoost model overfitted. The other two models, Lasso and SVM, are comparable in terms of predictive power. Further, the Lasso model is more interpretable, as it lets us easily examine the constant effect of each predictor variable on price premiums. Taking both predictive power and model interpretability into account, we decided to proceed with the Lasso model. Below is a summary of our fitted Lasso model.
The Lasso coefficients are shown above. Orange blocks represent negative coefficients, and blue blocks represent positive coefficients. In terms of styles, Air Jordan is the most lucrative, with an additional price premium of 324%. In terms of colors, red and tan boost the price premium, while blue seems to hurt resale prices. Further, as the number of sales increases, the price premium goes down.
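
For readers who want to produce a coefficient summary like this themselves, the sketch below fits a Lasso model on synthetic data and pairs each coefficient with its feature name. The feature names, the data, and the `alpha` value are all hypothetical and do not reproduce the project's fitted model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic stand-in data with hypothetical feature names.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "style_air-jordan": rng.integers(0, 2, 300),
    "color_Blue": rng.integers(0, 2, 300),
    "number_of_sales": rng.normal(size=300),
})
y = (3.0 * X["style_air-jordan"] - 0.5 * X["color_Blue"]
     - 0.2 * X["number_of_sales"] + rng.normal(scale=0.05, size=300))

model = Lasso(alpha=0.01).fit(X, y)

# Pair each coefficient with its feature name; the sign gives the direction
# of the effect on the target.
coefs = pd.Series(model.coef_, index=X.columns).sort_values(ascending=False)
print(coefs)
```

Because Lasso shrinks coefficients toward zero, the fitted values come out slightly smaller in magnitude than the ones used to generate the data.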
According to our prediction model, sneakers of certain styles (e.g. Air Jordan 1 and Blazer) and of certain colors (e.g. red and tan) are the most investable. Based on this knowledge, we looked for undervalued sneakers that we can invest in now and sell at a higher price later. We list some proposed candidates below.
The first pair listed is the Air Jordan 1 by Travis Scott, released in May 2019. If you are a sneakerhead, you probably already know that this is a popular pair. The question is whether, given the current resale price, the price could go even higher. The answer is yes: this pair is selling at a 500% markup as of early April, while our model predicts a markup of 800%. As such, there is an arbitrage opportunity here, which would turn into a profit of over $500 on a single pair. The other candidates follow similarly.

Sneaker | Feature | Image |
---|---|---|
Air Jordan 1 Retro High Travis Scott | Features: air-jordan(3.24), Tan/Brown(1.83) Current Price Premium: 497.4% Predicted Price Premium: 802.0% | |
Blazer Mid 77 Vintage Slam Jam | Features: blazer(0.58), White(0.71) Current Price Premium: 381.0% Predicted Price Premium: 455.5% | |
Yeezy Boost 350 V2 Tail Light | Features: blazer(0.58), White(0.71) Current Price Premium: 381.0% Predicted Price Premium: 455.5% | |

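
The gap between a current and a predicted premium can be turned into a dollar figure with a short calculation. The sketch below uses the two premiums listed for the first candidate; the retail price of $175 is a hypothetical placeholder, since the table does not list one, and `implied_price` is a helper introduced only for this example.

```python
def implied_price(retail_price: float, premium_pct: float) -> float:
    """Resale price implied by a premium given in percent of the retail price."""
    return retail_price * (1 + premium_pct / 100)

# Premiums come from the candidates table; the retail price is a
# hypothetical placeholder, since the table does not list one.
retail_price = 175.0
current_pct, predicted_pct = 497.4, 802.0

gap = implied_price(retail_price, predicted_pct) - implied_price(retail_price, current_pct)
print(f"Implied upside per pair: ${gap:.2f}")
```

Under that assumed retail price the implied upside is about $533 per pair, before fees and shipping, which the sketch ignores.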
1. Time Limitation: The data we used only covers sales from 2017 to 2019. Because the sneaker resale market keeps changing, the model's predictive power may become less stable further into the future.
2. Brand Limitation: Since our data only covers two major brands, Yeezy and Nike Off-White, it might not be appropriate to generalize our predictive model to other sneaker brands.