This repository contains the solution that ranked in the top 2% of the NYC Taxi Fare Prediction competition on Kaggle.
- remove null records
- remove records whose locations fall outside the coordinate range of the test data
- remove data points located in the sea
- eliminate outliers based on the fare distribution
- the new feature `cluster` is added.
  - during data exploration, I found that the fare/distance ratio varies by location, so I added a new categorical feature that identifies the area of the pickup location and the dropoff location
  - I used HDBSCAN to fit the clustering model, then used this model to predict the area of each record.
- the new feature `distance` is added
- the new feature `distance to airport` is added
- categorical data are converted to float32 to prevent a memory surge in the LightGBM Python package. (The library converts all data to float, so if the data is integer, a new copy of the data is created.)
- LightGBM is used, and the model was trained on an Amazon EC2 instance
- With this model, the test score is 2.85311
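
The cleaning steps listed above could look roughly like the following pandas sketch. It is not the code from this repository: the column names follow the competition's CSV layout as I recall it, and the bounding box and fare limits are made-up placeholders (the real limits would be read from the test data and the fare distribution). The sea filter is left out because it needs a land mask.

```python
import pandas as pd

# Hypothetical bounding box; in practice it would be derived from the test set.
BOUNDS = {"lon_min": -74.3, "lon_max": -72.9, "lat_min": 40.5, "lat_max": 41.8}


def clean(df, bounds=BOUNDS, fare_min=0.0, fare_max=500.0):
    """Drop null rows, out-of-range coordinates, and fares outside (fare_min, fare_max]."""
    df = df.dropna()

    # Keep only rows whose pickup and dropoff points are both inside the box.
    in_range = pd.Series(True, index=df.index)
    for prefix in ("pickup", "dropoff"):
        in_range &= df[f"{prefix}_longitude"].between(bounds["lon_min"], bounds["lon_max"])
        in_range &= df[f"{prefix}_latitude"].between(bounds["lat_min"], bounds["lat_max"])
    df = df[in_range]

    # Placeholder outlier rule: fixed thresholds instead of a fitted distribution.
    return df[(df["fare_amount"] > fare_min) & (df["fare_amount"] <= fare_max)]


sample = pd.DataFrame({
    "fare_amount": [10.0, None, 12.0, -5.0],
    "pickup_longitude": [-73.98, -73.98, 0.0, -73.98],
    "pickup_latitude": [40.75, 40.75, 0.0, 40.75],
    "dropoff_longitude": [-73.95, -73.95, -73.95, -73.95],
    "dropoff_latitude": [40.78, 40.78, 40.78, 40.78],
})
cleaned = clean(sample)
```

In the sample, the row with a missing fare, the row with a pickup at (0, 0), and the row with a negative fare are each dropped by a different rule.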
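
The `distance` and `distance to airport` features and the float32 conversion can be sketched as below. This is an illustration under assumptions, not the repository's implementation: the notes above do not say which distance metric or which airports were used, so the haversine formula and approximate JFK coordinates are my choices, and `dropoff_to_jfk` and `pickup_cluster` are hypothetical column names.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0
# Approximate JFK coordinates (latitude, longitude); an illustrative choice of airport.
JFK = (40.6413, -73.7781)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres; accepts scalars or pandas Series."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def add_distance_features(df):
    """Return a copy with trip distance and dropoff-to-airport distance columns."""
    df = df.copy()
    df["distance"] = haversine_km(df["pickup_latitude"], df["pickup_longitude"],
                                  df["dropoff_latitude"], df["dropoff_longitude"])
    df["dropoff_to_jfk"] = haversine_km(df["dropoff_latitude"], df["dropoff_longitude"], *JFK)
    return df


def categoricals_to_float32(df, columns):
    """Cast integer-coded categorical columns to float32 before building the training set."""
    return df.astype({col: np.float32 for col in columns})


trips = pd.DataFrame({
    "pickup_latitude": [40.75, 40.6413],
    "pickup_longitude": [-73.98, -73.7781],
    "dropoff_latitude": [40.6413, 40.6413],
    "dropoff_longitude": [-73.7781, -73.7781],
    "pickup_cluster": [3, 7],  # hypothetical integer cluster labels
})
features = categoricals_to_float32(add_distance_features(trips), ["pickup_cluster"])
```

Casting up front means the float data is created once, in a compact dtype, rather than as an additional copy during training.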