Analyzed the marathon runners’ split finished times to determine top finishers as per the success ratio of the country. This analysis helped to find out the strategies used by top finishers (Positive or Negative split) to complete the marathon effectively.
Their are two sets of Datasets to complete this project Boston and Moscow Marathon.
- marathon_results_2017.csv
Columns : {Bib ,Name, Age, M/F, City, State, Country, Citizen, 5K, 10K, 15K, 20K, Half, 25K, 30K, 35K, 40K, Pace, Proj Time, Official Time, Overall, Gender, Division}
Data : {11, Kirui, Geoffrey, 24, M, Keringet, KEN, 0:15:25, 0:30:28, 0:45:44, 1:01:15, 1:04:35, 1:16:59, 1:33:01, 1:48:19, 2:02:53, 0:04:57, - 2:09:37, 1, 1, 1}
- 1_full_results_mm_2018.csv
Columns : {Bib, finish_time_sec, finish_time_result, race, pace_sec, pace(minpkm), pace(kmph), half_pace_sec, half_pace(minpkm), half_pace(kmph), gender_en, agev name_en, location_city_ru, location_city_en, country_code_alpha_3, flag_DNF, flag_all_split_exist, race_uniform_index}
Data : {1, 8911, 2h 28min 31sec, 42.195 km, 211.1861595, 3:31 min/km 17.0 km/h, 208.3185212, 3:28 min/km, 17.3 km/h, Female, 30, Sardana Trofimova, –Ø–∫—É—Ç—Å–∫, Yakutsk, RUS, 0, 1, 0.000132899}
- 1_split_results_mm_2018.csv
Columns : {bib, split_name, split, split_time_sec, split_time_result, split_pace_sec, split_pace(minpkm), split_pace(kmph), split_uniform_index}
Data : {11, Kirui, Geoffrey, 24, M, Keringet, KEN, 0:15:25, 0:30:28, 0:45:44, 1:01:15, 1:04:35, 1:16:59, 1:33:01, 1:48:19, 2:02:53, 0:04:57, - 2:09:37, 1, 1, 11}
%autosave 0
Autosave disabled
# print all the outputs in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from datetime import datetime
import numpy as np
import pandas as pd
import seaborn as sns
import pandas as pd
%pylab inline
import sklearn as sk
import sklearn.tree as tree
from IPython.display import Image
import pydotplus
Populating the interactive namespace from numpy and matplotlib
/opt/anaconda3/lib/python3.8/site-packages/IPython/core/magics/pylab.py:159: UserWarning: pylab import has clobbered these variables: ['datetime']
`%matplotlib` prevents importing * from pylab and numpy
warn("pylab import has clobbered these variables: %s" % clobbered +
import warnings
#suppress all future warning
warnings.filterwarnings('ignore')
#see only one time warning
#warnings.filterwarnings(action='once')
- df_boston is a dataset of all athelets in boston marathon. Each row is represent attribute of an individual athelete.
- df_moscow is a dataset of all athelets in Moscow marathon. Each row is represent attribute of an individual athelete.
Important columns in df_boston and df_moscow are:
- BibNo: Unique Number given to each athelet in marathon.
- Age: Age of each Athelet.
- Gender: Gender of each Athelet.
- Country: The Country which each athelet belongs to
- Finish_Time: Time taken in seconds by each athelet to complete the race
- Final_10k_sec: Time taken in seconds by each athelet to complete 10k distance of the race
- Final_Half_sec: Time taken in seconds by each athelet to complete half of the race
- Final_40k_sec: Time taken in seconds by each athelet to complete 40k distance of the race
- Overall_Ranking: Ranking with respect to Finishing time of each athelet
- Gender_Ranking: Ranking with respect to Gender of Athelet.
- Division_Ranking: Ranking with repect to Age group of Athelet.
df_boston_raw = pd.read_csv('marathon_results_2017.csv')
df_moscow_full_results = pd.read_csv('1_full_results_mm_2018.csv')
df_moscow_split_results = pd.read_csv('1_split_results_mm_2018.csv')
pd.set_option('display.max_columns',30)
df_boston_raw.head()
Unnamed: 0 | Bib | Name | Age | M/F | City | State | Country | Citizen | Unnamed: 9 | 5K | 10K | 15K | 20K | Half | 25K | 30K | 35K | 40K | Pace | Proj Time | Official Time | Overall | Gender | Division | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 11 | Kirui, Geoffrey | 24 | M | Keringet | NaN | KEN | NaN | NaN | 0:15:25 | 0:30:28 | 0:45:44 | 1:01:15 | 1:04:35 | 1:16:59 | 1:33:01 | 1:48:19 | 2:02:53 | 0:04:57 | - | 2:09:37 | 1 | 1 | 1 |
1 | 1 | 17 | Rupp, Galen | 30 | M | Portland | OR | USA | NaN | NaN | 0:15:24 | 0:30:27 | 0:45:44 | 1:01:15 | 1:04:35 | 1:16:59 | 1:33:01 | 1:48:19 | 2:03:14 | 0:04:58 | - | 2:09:58 | 2 | 2 | 2 |
2 | 2 | 23 | Osako, Suguru | 25 | M | Machida-City | NaN | JPN | NaN | NaN | 0:15:25 | 0:30:29 | 0:45:44 | 1:01:16 | 1:04:36 | 1:17:00 | 1:33:01 | 1:48:31 | 2:03:38 | 0:04:59 | - | 2:10:28 | 3 | 3 | 3 |
3 | 3 | 21 | Biwott, Shadrack | 32 | M | Mammoth Lakes | CA | USA | NaN | NaN | 0:15:25 | 0:30:29 | 0:45:44 | 1:01:19 | 1:04:45 | 1:17:00 | 1:33:01 | 1:48:58 | 2:04:35 | 0:05:03 | - | 2:12:08 | 4 | 4 | 4 |
4 | 4 | 9 | Chebet, Wilson | 31 | M | Marakwet | NaN | KEN | NaN | NaN | 0:15:25 | 0:30:28 | 0:45:44 | 1:01:15 | 1:04:35 | 1:16:59 | 1:33:01 | 1:48:41 | 2:05:00 | 0:05:04 | - | 2:12:35 | 5 | 5 | 5 |
df_moscow_full_results.head()
bib | finish_time_sec | finish_time_result | race | pace_sec | pace(minpkm) | pace(kmph) | half_pace_sec | half_pace(minpkm) | half_pace(kmph) | gender_en | age | name_en | location_city_ru | location_city_en | country_code_alpha_3 | flag_DNF | flag_all_split_exist | race_uniform_index | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 8911.0 | 2h 28min 31sec | 42.195 km | 211.186160 | 3:31 min/km | 17.0 km/h | 208.318521 | 3:28 min/km | 17.3 km/h | Female | 30 | Sardana Trofimova | Якутск | Yakutsk | RUS | 0 | 1 | 0.000133 |
1 | 2 | 9308.0 | 2h 35min 08sec | 42.195 km | 220.594857 | 3:40 min/km | 16.3 km/h | 212.157839 | 3:32 min/km | 17.0 km/h | Female | 35 | Tat'yana Arkhipova | Чебоксары | Cheboksary | RUS | 0 | 1 | 0.002556 |
2 | 3 | 8122.0 | 2h 15min 22sec | 42.195 km | 192.487261 | 3:12 min/km | 18.7 km/h | 189.880318 | 3:09 min/km | 19.0 km/h | Male | 31 | Stepan Kiselev | Казань | Kazan' | RUS | 0 | 1 | 0.000397 |
3 | 4 | 8467.0 | 2h 21min 07sec | 42.195 km | 200.663586 | 3:20 min/km | 17.9 km/h | 186.894182 | 3:06 min/km | 19.3 km/h | Male | 36 | Dmitriy Safronov | Москва | Moskva | RUS | 0 | 1 | 0.009003 |
4 | 5 | 8738.0 | 2h 25min 38sec | 42.195 km | 207.086148 | 3:27 min/km | 17.4 km/h | 194.904610 | 3:14 min/km | 18.5 km/h | Male | 42 | Grigoriy Andreev | Уфа | Ufa | RUS | 0 | 1 | 0.006439 |
df_boston_raw.shape
(26410, 25)
df_boston_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26410 entries, 0 to 26409
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 26410 non-null int64
1 Bib 26410 non-null object
2 Name 26410 non-null object
3 Age 26410 non-null int64
4 M/F 26410 non-null object
5 City 26410 non-null object
6 State 22815 non-null object
7 Country 26410 non-null object
8 Citizen 1254 non-null object
9 Unnamed: 9 91 non-null object
10 5K 26410 non-null object
11 10K 26410 non-null object
12 15K 26410 non-null object
13 20K 26410 non-null object
14 Half 26410 non-null object
15 25K 26410 non-null object
16 30K 26410 non-null object
17 35K 26410 non-null object
18 40K 26410 non-null object
19 Pace 26410 non-null object
20 Proj Time 26410 non-null object
21 Official Time 26410 non-null object
22 Overall 26410 non-null int64
23 Gender 26410 non-null int64
24 Division 26410 non-null int64
dtypes: int64(5), object(20)
memory usage: 5.0+ MB
df_boston_raw.isna().sum()
Unnamed: 0 0
Bib 0
Name 0
Age 0
M/F 0
City 0
State 3595
Country 0
Citizen 25156
Unnamed: 9 26319
5K 0
10K 0
15K 0
20K 0
Half 0
25K 0
30K 0
35K 0
40K 0
Pace 0
Proj Time 0
Official Time 0
Overall 0
Gender 0
Division 0
dtype: int64
df_moscow_full_results.shape
(21662, 19)
df_moscow_split_results.shape
(104438, 9)
df_moscow_full_results.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21662 entries, 0 to 21661
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 bib 21662 non-null int64
1 finish_time_sec 21385 non-null float64
2 finish_time_result 21385 non-null object
3 race 21662 non-null object
4 pace_sec 21385 non-null float64
5 pace(minpkm) 21385 non-null object
6 pace(kmph) 21385 non-null object
7 half_pace_sec 21380 non-null float64
8 half_pace(minpkm) 21380 non-null object
9 half_pace(kmph) 21385 non-null object
10 gender_en 21662 non-null object
11 age 21662 non-null int64
12 name_en 21662 non-null object
13 location_city_ru 21662 non-null object
14 location_city_en 21662 non-null object
15 country_code_alpha_3 21662 non-null object
16 flag_DNF 21662 non-null int64
17 flag_all_split_exist 21662 non-null int64
18 race_uniform_index 21289 non-null float64
dtypes: float64(4), int64(4), object(11)
memory usage: 3.1+ MB
df_moscow_full_results.isna().sum()
bib 0
finish_time_sec 277
finish_time_result 277
race 0
pace_sec 277
pace(minpkm) 277
pace(kmph) 277
half_pace_sec 282
half_pace(minpkm) 282
half_pace(kmph) 277
gender_en 0
age 0
name_en 0
location_city_ru 0
location_city_en 0
country_code_alpha_3 0
flag_DNF 0
flag_all_split_exist 0
race_uniform_index 373
dtype: int64
Drop the unused columns and standardised the column names by renaming as per the both datasets in Boston Dataframe
df_boston_raw.drop(columns=['Unnamed: 0','Citizen','Unnamed: 9','Proj Time'],axis=1,inplace=True)
df_boston_raw.rename(columns={'Bib':'BibNo','M/F':'Gender','Pace':'Pace_minpermile','Official Time':'Finish_Time',\
'Overall':'Overall_ranking','Gender':'Gender_ranking','Division':'Division_ranking'},inplace=True)
df_boston_raw.replace(to_replace='M', value=1.0, inplace=True)
df_boston_raw.replace(to_replace='F', value=0.0, inplace=True)
df_boston_raw.head()
BibNo | Name | Age | Gender | City | State | Country | 5K | 10K | 15K | 20K | Half | 25K | 30K | 35K | 40K | Pace_minpermile | Finish_Time | Overall_ranking | Gender_ranking | Division_ranking | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 11 | Kirui, Geoffrey | 24 | 1.0 | Keringet | NaN | KEN | 0:15:25 | 0:30:28 | 0:45:44 | 1:01:15 | 1:04:35 | 1:16:59 | 1:33:01 | 1:48:19 | 2:02:53 | 0:04:57 | 2:09:37 | 1 | 1 | 1 |
1 | 17 | Rupp, Galen | 30 | 1.0 | Portland | OR | USA | 0:15:24 | 0:30:27 | 0:45:44 | 1:01:15 | 1:04:35 | 1:16:59 | 1:33:01 | 1:48:19 | 2:03:14 | 0:04:58 | 2:09:58 | 2 | 2 | 2 |
2 | 23 | Osako, Suguru | 25 | 1.0 | Machida-City | NaN | JPN | 0:15:25 | 0:30:29 | 0:45:44 | 1:01:16 | 1:04:36 | 1:17:00 | 1:33:01 | 1:48:31 | 2:03:38 | 0:04:59 | 2:10:28 | 3 | 3 | 3 |
3 | 21 | Biwott, Shadrack | 32 | 1.0 | Mammoth Lakes | CA | USA | 0:15:25 | 0:30:29 | 0:45:44 | 1:01:19 | 1:04:45 | 1:17:00 | 1:33:01 | 1:48:58 | 2:04:35 | 0:05:03 | 2:12:08 | 4 | 4 | 4 |
4 | 9 | Chebet, Wilson | 31 | 1.0 | Marakwet | NaN | KEN | 0:15:25 | 0:30:28 | 0:45:44 | 1:01:15 | 1:04:35 | 1:16:59 | 1:33:01 | 1:48:41 | 2:05:00 | 0:05:04 | 2:12:35 | 5 | 5 | 5 |
Drop the unused columns and standardised the column names by renaming as per the both datasets in Moscow Dataframe
df_moscow_full_results.drop(['finish_time_result','location_city_ru','flag_all_split_exist',\
'race','pace(kmph)','half_pace(kmph)','location_city_ru','flag_DNF',\
'flag_all_split_exist','race_uniform_index'],axis=1,inplace=True)
df_moscow_full_results.rename(columns ={'bib':'BibNo','pace_sec':'Pace_Sec','finish_time_sec':'Finish_Time_Sec',\
'pace(minpkm)':'Pace_Minpkm','half_pace_sec':'Half_Pace_Sec',\
'half_pace(minpkm)':'Half_Pace_Minpkm','gender_en':'Gender',\
'age':'Age','name_en':'Name','location_city_en':'City','country_code_alpha_3':'Country'},inplace=True)
df_moscow_split_results.drop(['split_pace(minpkm)','split_pace(kmph)','split_uniform_index','split'],axis=1,inplace=True)
df_moscow_split_results.head()
bib | split_name | split_time_sec | split_time_result | split_pace_sec | |
---|---|---|---|---|---|
0 | 1 | 5 km | 1057 | 17min 37sec | 211.400000 |
1 | 1 | 10 km | 2112 | 35min 12sec | 211.000000 |
2 | 1 | 15 km | 3149 | 52min 29sec | 207.400000 |
3 | 1 | Half marathon | 4395 | 1h 13min 15sec | 204.346043 |
4 | 1 | 25 km | 5258 | 1h 27min 38sec | 221.140295 |
df_moscow_split_results.rename (columns={'bib':'BibNo','split_name':'Split_Name','split_time_sec':'Split_Time_Sec',\
'split_time_result':'Split_Time_Result','split_pace_sec':'Split_Pace_Sec'},inplace=True)
By using pivot table to make split_name values as column in Moscow_Split_Results and Moscow_Full_Results Dataframe
df_moscow_split_results['Marathon_Split'] = ((df_moscow_split_results.Split_Name == ('10 km'))\
|(df_moscow_split_results.Split_Name == ('Half marathon'))\
|(df_moscow_split_results.Split_Name == ('Marathon')))
df_temp_pivot = df_moscow_split_results.pivot_table(index='BibNo', columns='Split_Name', values='Split_Time_Sec')
df_temp_pivot.head()
Split_Name | 10 km | 15 km | 25 km | 30 km | 35 km | 40 km | 5 km | Half marathon | Marathon |
---|---|---|---|---|---|---|---|---|---|
BibNo | |||||||||
1 | 2112.0 | 3149.0 | 5258.0 | 6321.0 | 7362.0 | 8442.0 | 1057.0 | 4395.0 | 8911.0 |
2 | 2115.0 | 3188.0 | 5386.0 | 6508.0 | 7608.0 | 8784.0 | 1059.0 | 4476.0 | 9308.0 |
3 | 1899.0 | 2867.0 | 4796.0 | 5786.0 | 6632.0 | 7703.0 | 947.0 | 4006.0 | 8122.0 |
4 | 1876.0 | 2814.0 | 4752.0 | 5780.0 | 6827.0 | 7967.0 | 945.0 | 3943.0 | 8467.0 |
5 | 1958.0 | 2933.0 | 4949.0 | 6015.0 | 7080.0 | 8235.0 | 976.0 | 4112.0 | 8738.0 |
df_temp_pivot.drop(['15 km','25 km','30 km','35 km','40 km','5 km'],axis=1,inplace=True)
df_temp_pivot.set_index='BibNo'
Used the Outer Merge on Pivot table and Moscow_Full_Result to to get the split name values as column in Moscow_full_result
df_Moscow = df_moscow_full_results.merge(df_temp_pivot,how='outer',right_index=True,left_on='BibNo')
df_Moscow = df_moscow_full_results.merge(df_temp_pivot,how='outer',right_index=True,left_on='BibNo')
df_Moscow.head()
BibNo | Finish_Time_Sec | Pace_Sec | Pace_Minpkm | Half_Pace_Sec | Half_Pace_Minpkm | Gender | Age | Name | City | Country | 10 km | Half marathon | Marathon | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 8911.0 | 211.186160 | 3:31 min/km | 208.318521 | 3:28 min/km | Female | 30 | Sardana Trofimova | Yakutsk | RUS | 2112.0 | 4395.0 | 8911.0 |
1 | 2 | 9308.0 | 220.594857 | 3:40 min/km | 212.157839 | 3:32 min/km | Female | 35 | Tat'yana Arkhipova | Cheboksary | RUS | 2115.0 | 4476.0 | 9308.0 |
2 | 3 | 8122.0 | 192.487261 | 3:12 min/km | 189.880318 | 3:09 min/km | Male | 31 | Stepan Kiselev | Kazan' | RUS | 1899.0 | 4006.0 | 8122.0 |
3 | 4 | 8467.0 | 200.663586 | 3:20 min/km | 186.894182 | 3:06 min/km | Male | 36 | Dmitriy Safronov | Moskva | RUS | 1876.0 | 3943.0 | 8467.0 |
4 | 5 | 8738.0 | 207.086148 | 3:27 min/km | 194.904610 | 3:14 min/km | Male | 42 | Grigoriy Andreev | Ufa | RUS | 1958.0 | 4112.0 | 8738.0 |
df_Moscow.sort_values(by='Finish_Time_Sec')
BibNo | Finish_Time_Sec | Pace_Sec | Pace_Minpkm | Half_Pace_Sec | Half_Pace_Minpkm | Gender | Age | Name | City | Country | 10 km | Half marathon | Marathon | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8982 | 17006 | 1728.0 | 172.8 | 2:52 min/km | 175.8 | 2:55 min/km | Male | 26 | Vladimir Nikitin | Perm' | RUS | 1728.0 | NaN | NaN |
8991 | 17018 | 1826.0 | 182.6 | 3:02 min/km | 183.6 | 3:03 min/km | Male | 29 | Denis Vasil'ev | Sankt-Peterburg | RUS | 1826.0 | NaN | NaN |
21362 | 32745 | 1834.0 | 183.4 | 3:03 min/km | 185.6 | 3:05 min/km | Male | 30 | Artur Burtsev | Sankt-Peterburg | RUS | 1834.0 | NaN | NaN |
8988 | 17013 | 1869.0 | 186.9 | 3:06 min/km | 186.6 | 3:06 min/km | Male | 24 | Ivan Panferov | Moskva | RUS | 1869.0 | NaN | NaN |
21114 | 32472 | 1869.0 | 186.9 | 3:06 min/km | 188.2 | 3:08 min/km | Male | 22 | Nikita Kotenev | Minsk | BLR | 1869.0 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8954 | 11201 | NaN | NaN | NaN | NaN | NaN | Male | 30 | Murtuz Ramazanov | Moskva | RUS | 4493.0 | 10282.0 | NaN |
8960 | 11209 | NaN | NaN | NaN | NaN | NaN | Female | 37 | Ekaterina Andreeva | Moskva | RUS | NaN | NaN | NaN |
8961 | 11210 | NaN | NaN | NaN | NaN | NaN | Male | 28 | Sergey Nikitin | Moskva | RUS | NaN | NaN | NaN |
8973 | 15013 | NaN | NaN | NaN | NaN | NaN | Male | 32 | Mikhail Maksimov | Sankt-Peterburg | RUS | 1957.0 | 4086.0 | NaN |
16208 | 26352 | NaN | NaN | NaN | NaN | NaN | Female | 29 | Lyudmila Tret'yakova | Moskva | RUS | NaN | NaN | NaN |
21662 rows × 14 columns
df_Moscow['Marathon'].isna().sum()
12960
Drop the NAN values Moscow Dataframe from Marathon column which shows that person has not completed the marathon
df_Moscow.dropna(how='any',subset=['Marathon'],inplace=True)
df_new_Moscow= df_Moscow.sort_values(by='Finish_Time_Sec')
df_new_Moscow.head()
BibNo | Finish_Time_Sec | Pace_Sec | Pace_Minpkm | Half_Pace_Sec | Half_Pace_Minpkm | Gender | Age | Name | City | Country | 10 km | Half marathon | Marathon | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 3 | 8122.0 | 192.487261 | 3:12 min/km | 189.880318 | 3:09 min/km | Male | 31 | Stepan Kiselev | Kazan' | RUS | 1899.0 | 4006.0 | 8122.0 |
10 | 12 | 8161.0 | 193.411542 | 3:13 min/km | 189.738121 | 3:09 min/km | Male | 32 | Aleksey Troshkin | Saransk | RUS | 1905.0 | 4003.0 | 8161.0 |
8 | 10 | 8394.0 | 198.933523 | 3:18 min/km | 188.268752 | 3:08 min/km | Male | 27 | Artem Aplachkin | Barnaul | RUS | 1879.0 | 3972.0 | 8394.0 |
8968 | 15008 | 8416.0 | 199.454912 | 3:19 min/km | 195.094205 | 3:15 min/km | Male | 29 | Sergey Popov | Voronezh | RUS | 1960.0 | 4116.0 | 8416.0 |
3 | 4 | 8467.0 | 200.663586 | 3:20 min/km | 186.894182 | 3:06 min/km | Male | 36 | Dmitriy Safronov | Moskva | RUS | 1876.0 | 3943.0 | 8467.0 |
df_new_Moscow = df_new_Moscow.reset_index()
df_new_Moscow['Overall_Ranking'] = df_new_Moscow.index + 1
df_new_Moscow
index | BibNo | Finish_Time_Sec | Pace_Sec | Pace_Minpkm | Half_Pace_Sec | Half_Pace_Minpkm | Gender | Age | Name | City | Country | 10 km | Half marathon | Marathon | Overall_Ranking | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 3 | 8122.0 | 192.487261 | 3:12 min/km | 189.880318 | 3:09 min/km | Male | 31 | Stepan Kiselev | Kazan' | RUS | 1899.0 | 4006.0 | 8122.0 | 1 |
1 | 10 | 12 | 8161.0 | 193.411542 | 3:13 min/km | 189.738121 | 3:09 min/km | Male | 32 | Aleksey Troshkin | Saransk | RUS | 1905.0 | 4003.0 | 8161.0 | 2 |
2 | 8 | 10 | 8394.0 | 198.933523 | 3:18 min/km | 188.268752 | 3:08 min/km | Male | 27 | Artem Aplachkin | Barnaul | RUS | 1879.0 | 3972.0 | 8394.0 | 3 |
3 | 8968 | 15008 | 8416.0 | 199.454912 | 3:19 min/km | 195.094205 | 3:15 min/km | Male | 29 | Sergey Popov | Voronezh | RUS | 1960.0 | 4116.0 | 8416.0 | 4 |
4 | 3 | 4 | 8467.0 | 200.663586 | 3:20 min/km | 186.894182 | 3:06 min/km | Male | 36 | Dmitriy Safronov | Moskva | RUS | 1876.0 | 3943.0 | 8467.0 | 5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8697 | 1297 | 1818 | 22102.0 | 523.806138 | 8:43 min/km | 492.617609 | 8:12 min/km | Male | 64 | Songkran Amornsaengthong | Moskva | RUS | 4984.0 | 10393.0 | 22102.0 | 8698 |
8698 | 8578 | 10770 | 22178.0 | 525.607299 | 8:45 min/km | 559.450172 | 9:19 min/km | Male | 36 | Aleksey Popov | Moskva | RUS | 5956.0 | 11803.0 | 22178.0 | 8699 |
8699 | 3199 | 4301 | 22183.0 | 525.725797 | 8:45 min/km | 503.471975 | 8:23 min/km | Male | 26 | Danila Malykhin | Moskva | RUS | 4862.0 | 10622.0 | 22183.0 | 8700 |
8700 | 6340 | 8204 | 22291.0 | 528.285342 | 8:48 min/km | 383.315559 | 6:23 min/km | Male | 63 | Chi_Wan Yu | Hong Kong | HKG | 3704.0 | 8087.0 | 22291.0 | 8701 |
8701 | 2526 | 3462 | 22672.0 | 537.314848 | 8:57 min/km | 493.518189 | 8:13 min/km | Male | 25 | Artem Prasolov | Moskva | RUS | 4607.0 | 10412.0 | 22672.0 | 8702 |
8702 rows × 16 columns
df_new_Moscow.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8702 entries, 0 to 8701
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 8702 non-null int64
1 BibNo 8702 non-null int64
2 Finish_Time_Sec 8702 non-null float64
3 Pace_Sec 8702 non-null float64
4 Pace_Minpkm 8702 non-null object
5 Half_Pace_Sec 8702 non-null float64
6 Half_Pace_Minpkm 8702 non-null object
7 Gender 8702 non-null object
8 Age 8702 non-null int64
9 Name 8702 non-null object
10 City 8702 non-null object
11 Country 8702 non-null object
12 10 km 8701 non-null float64
13 Half marathon 8702 non-null float64
14 Marathon 8702 non-null float64
15 Overall_Ranking 8702 non-null int64
dtypes: float64(6), int64(4), object(6)
memory usage: 1.1+ MB
df_new_Moscow.drop(columns=['index'],axis=1,inplace=True)
df_new_Moscow.head(20)
BibNo | Finish_Time_Sec | Pace_Sec | Pace_Minpkm | Half_Pace_Sec | Half_Pace_Minpkm | Gender | Age | Name | City | Country | 10 km | Half marathon | Marathon | Overall_Ranking | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | 8122.0 | 192.487261 | 3:12 min/km | 189.880318 | 3:09 min/km | Male | 31 | Stepan Kiselev | Kazan' | RUS | 1899.0 | 4006.0 | 8122.0 | 1 |
1 | 12 | 8161.0 | 193.411542 | 3:13 min/km | 189.738121 | 3:09 min/km | Male | 32 | Aleksey Troshkin | Saransk | RUS | 1905.0 | 4003.0 | 8161.0 | 2 |
2 | 10 | 8394.0 | 198.933523 | 3:18 min/km | 188.268752 | 3:08 min/km | Male | 27 | Artem Aplachkin | Barnaul | RUS | 1879.0 | 3972.0 | 8394.0 | 3 |
3 | 15008 | 8416.0 | 199.454912 | 3:19 min/km | 195.094205 | 3:15 min/km | Male | 29 | Sergey Popov | Voronezh | RUS | 1960.0 | 4116.0 | 8416.0 | 4 |
4 | 4 | 8467.0 | 200.663586 | 3:20 min/km | 186.894182 | 3:06 min/km | Male | 36 | Dmitriy Safronov | Moskva | RUS | 1876.0 | 3943.0 | 8467.0 | 5 |
5 | 7 | 8513.0 | 201.753762 | 3:21 min/km | 194.478019 | 3:14 min/km | Male | 33 | Mikhail Kul'kov | Khanty-Mansiysk | RUS | 1959.0 | 4103.0 | 8513.0 | 6 |
6 | 14 | 8601.0 | 203.839317 | 3:23 min/km | 197.416755 | 3:17 min/km | Male | 31 | Viktor Ugarov | Moskva | RUS | 1976.0 | 4165.0 | 8601.0 | 7 |
7 | 8 | 8670.0 | 205.474582 | 3:25 min/km | 195.094205 | 3:15 min/km | Male | 30 | Oleg Grigor'ev | Moskva | RUS | 1958.0 | 4116.0 | 8670.0 | 8 |
8 | 10522 | 8690.0 | 205.948572 | 3:25 min/km | 201.493068 | 3:21 min/km | Male | 26 | Aleksey Shirshov | Chkalov | RUS | 1978.0 | 4251.0 | 8690.0 | 9 |
9 | 5 | 8738.0 | 207.086148 | 3:27 min/km | 194.904610 | 3:14 min/km | Male | 42 | Grigoriy Andreev | Ufa | RUS | 1958.0 | 4112.0 | 8738.0 | 10 |
10 | 33 | 8800.0 | 208.555516 | 3:28 min/km | 192.250267 | 3:12 min/km | Male | 27 | Iskander Yadgarov | Moskva | RUS | 1943.0 | 4056.0 | 8800.0 | 11 |
11 | 15010 | 8808.0 | 208.745112 | 3:28 min/km | 199.597109 | 3:19 min/km | Male | 32 | Andrey Smirnov | Perm' | RUS | 1972.0 | 4211.0 | 8808.0 | 12 |
12 | 13 | 8852.0 | 209.787890 | 3:29 min/km | 208.223723 | 3:28 min/km | Male | 25 | Mikhail Zvyagintsev | Biryuch | RUS | 2111.0 | 4393.0 | 8852.0 | 13 |
13 | 8422 | 8856.0 | 209.882688 | 3:29 min/km | 208.223723 | 3:28 min/km | Male | 43 | Andrey Klinov | Ryazan' | RUS | 2110.0 | 4393.0 | 8856.0 | 14 |
14 | 11 | 8856.0 | 209.882688 | 3:29 min/km | 201.540467 | 3:21 min/km | Male | 28 | Aleksandr Krotovich | Sankt-Peterburg | RUS | 1996.0 | 4252.0 | 8856.0 | 15 |
15 | 10709 | 8873.0 | 210.285579 | 3:30 min/km | 208.223723 | 3:28 min/km | Male | 38 | Yipeng Li | Beijing | CHN | 2111.0 | 4393.0 | 8873.0 | 16 |
16 | 15006 | 8890.0 | 210.688470 | 3:30 min/km | 208.271122 | 3:28 min/km | Male | 34 | Denis Korablev | Salekhard | RUS | 2111.0 | 4394.0 | 8890.0 | 17 |
17 | 1 | 8911.0 | 211.186160 | 3:31 min/km | 208.318521 | 3:28 min/km | Female | 30 | Sardana Trofimova | Yakutsk | RUS | 2112.0 | 4395.0 | 8911.0 | 18 |
18 | 22 | 8912.0 | 211.209859 | 3:31 min/km | 208.176324 | 3:28 min/km | Male | 35 | Renat Kashapov | Kazan' | RUS | 2110.0 | 4392.0 | 8912.0 | 19 |
19 | 15012 | 8952.0 | 212.157839 | 3:32 min/km | 208.271122 | 3:28 min/km | Male | 36 | Dmitriy Shilkin | Saratov | RUS | 2111.0 | 4394.0 | 8952.0 | 20 |
df_Boston=df_boston_raw.copy()
df_Boston.head()
BibNo | Name | Age | Gender | City | State | Country | 5K | 10K | 15K | 20K | Half | 25K | 30K | 35K | 40K | Pace_minpermile | Finish_Time | Overall_ranking | Gender_ranking | Division_ranking | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 11 | Kirui, Geoffrey | 24 | 1.0 | Keringet | NaN | KEN | 0:15:25 | 0:30:28 | 0:45:44 | 1:01:15 | 1:04:35 | 1:16:59 | 1:33:01 | 1:48:19 | 2:02:53 | 0:04:57 | 2:09:37 | 1 | 1 | 1 |
1 | 17 | Rupp, Galen | 30 | 1.0 | Portland | OR | USA | 0:15:24 | 0:30:27 | 0:45:44 | 1:01:15 | 1:04:35 | 1:16:59 | 1:33:01 | 1:48:19 | 2:03:14 | 0:04:58 | 2:09:58 | 2 | 2 | 2 |
2 | 23 | Osako, Suguru | 25 | 1.0 | Machida-City | NaN | JPN | 0:15:25 | 0:30:29 | 0:45:44 | 1:01:16 | 1:04:36 | 1:17:00 | 1:33:01 | 1:48:31 | 2:03:38 | 0:04:59 | 2:10:28 | 3 | 3 | 3 |
3 | 21 | Biwott, Shadrack | 32 | 1.0 | Mammoth Lakes | CA | USA | 0:15:25 | 0:30:29 | 0:45:44 | 1:01:19 | 1:04:45 | 1:17:00 | 1:33:01 | 1:48:58 | 2:04:35 | 0:05:03 | 2:12:08 | 4 | 4 | 4 |
4 | 9 | Chebet, Wilson | 31 | 1.0 | Marakwet | NaN | KEN | 0:15:25 | 0:30:28 | 0:45:44 | 1:01:15 | 1:04:35 | 1:16:59 | 1:33:01 | 1:48:41 | 2:05:00 | 0:05:04 | 2:12:35 | 5 | 5 | 5 |
Converted the finish time from Hour, Min & Sec format to seconds only to make it Final 10K, Half, 40k and Finish Time in Boston dataframe
df_Boston['FTime_10K_hour']= df_Boston['10K'].str.split(':').str[0]
df_Boston['FTime_10K_min']= df_Boston['10K'].str.split(':').str[1]
df_Boston['FTime_10K_sec']= df_Boston['10K'].str.split(':',n=2).str[-1]
df_Boston.FTime_10K_sec.fillna(0,inplace=True)
df_Boston.FTime_10K_min.fillna(0,inplace=True)
df_Boston.FTime_10K_hour.fillna(0,inplace=True)
df_Boston.FTime_10K_sec.replace('-',0,inplace=True)
df_Boston.FTime_10K_hour.replace('-',0,inplace=True)
final1=df_Boston['FTime_10K_hour'].astype(int) *3600
final2=(df_Boston['FTime_10K_min'].astype(int)) * 60
final3=(df_Boston['FTime_10K_sec'].astype(int))
df_Boston['Final_10k_sec'] = final1 + final2 + final3
df_Boston.drop(columns =['FTime_10K_hour','FTime_10K_min','FTime_10K_sec'],inplace =True)
df_Boston['FTime_Half_hour']= df_Boston['Half'].str.split(':').str[0]
df_Boston['FTime_Half_min']= df_Boston['Half'].str.split(':').str[1]
df_Boston['FTime_Half_sec']= df_Boston['Half'].str.split(':',n=2).str[-1]
df_Boston.FTime_Half_sec.fillna(0,inplace=True)
df_Boston.FTime_Half_min.fillna(0,inplace=True)
df_Boston.FTime_Half_hour.fillna(0,inplace=True)
df_Boston.FTime_Half_sec.replace('-',0,inplace=True)
df_Boston.FTime_Half_hour.replace('-',0,inplace=True)
final1=df_Boston['FTime_Half_hour'].astype(int) *3600
final2=(df_Boston['FTime_Half_min'].astype(int)) * 60
final3=(df_Boston['FTime_Half_sec'].astype(int))
df_Boston['Final_Half_sec'] = final1 + final2 + final3
df_Boston.drop(columns =['FTime_Half_hour','FTime_Half_min','FTime_Half_sec'],inplace =True)
df_Boston['FTime_40K_hour']= df_Boston['40K'].str.split(':').str[0]
df_Boston['FTime_40K_min']= df_Boston['40K'].str.split(':').str[1]
df_Boston['FTime_40K_sec']= df_Boston['40K'].str.split(':',n=2).str[-1] #To understand n=2 used
df_Boston.FTime_40K_sec.fillna(0,inplace=True)
df_Boston.FTime_40K_min.fillna(0,inplace=True)
df_Boston.FTime_40K_hour.fillna(0,inplace=True)
df_Boston.FTime_40K_sec.replace('-',0,inplace=True)
df_Boston.FTime_40K_hour.replace('-',0,inplace=True)
final1=df_Boston['FTime_40K_hour'].astype(int) *3600
final2=(df_Boston['FTime_40K_min'].astype(int)) * 60
final3=(df_Boston['FTime_40K_sec'].astype(int))
df_Boston['Final_40k_sec'] = final1 + final2 + final3
df_Boston.drop(columns =['FTime_40K_hour','FTime_40K_min','FTime_40K_sec'],inplace =True)
df_Boston['FTime_hour']= df_Boston['Finish_Time'].str.split(':').str[0]
df_Boston['FTime_min']= df_Boston['Finish_Time'].str.split(':').str[1]
df_Boston['FTime_sec']= df_Boston['Finish_Time'].str.split(':',n=2).str[-1] #To understand n=2 used
df_Boston.FTime_sec.fillna(0,inplace=True)
df_Boston.FTime_min.fillna(0,inplace=True)
df_Boston.FTime_hour.fillna(0,inplace=True)
df_Boston.FTime_sec.replace('-',0,inplace=True)
df_Boston.FTime_hour.replace('-',0,inplace=True)
final1=df_Boston['FTime_hour'].astype(int) *3600
final2=(df_Boston['FTime_min'].astype(int)) * 60
final3=(df_Boston['FTime_sec'].astype(int))
df_Boston['FinishTime_sec'] = final1 + final2 + final3
df_Boston.drop(columns =['FTime_hour','FTime_min','FTime_sec'],inplace =True)
df_Boston.drop(columns=['5K','15K','20K','25K','30K','35K'],inplace=True)
df_Boston.head()
BibNo | Name | Age | Gender | City | State | Country | 10K | Half | 40K | Pace_minpermile | Finish_Time | Overall_ranking | Gender_ranking | Division_ranking | Final_10k_sec | Final_Half_sec | Final_40k_sec | FinishTime_sec | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 11 | Kirui, Geoffrey | 24 | 1.0 | Keringet | NaN | KEN | 0:30:28 | 1:04:35 | 2:02:53 | 0:04:57 | 2:09:37 | 1 | 1 | 1 | 1828 | 3875 | 7373 | 7777 |
1 | 17 | Rupp, Galen | 30 | 1.0 | Portland | OR | USA | 0:30:27 | 1:04:35 | 2:03:14 | 0:04:58 | 2:09:58 | 2 | 2 | 2 | 1827 | 3875 | 7394 | 7798 |
2 | 23 | Osako, Suguru | 25 | 1.0 | Machida-City | NaN | JPN | 0:30:29 | 1:04:36 | 2:03:38 | 0:04:59 | 2:10:28 | 3 | 3 | 3 | 1829 | 3876 | 7418 | 7828 |
3 | 21 | Biwott, Shadrack | 32 | 1.0 | Mammoth Lakes | CA | USA | 0:30:29 | 1:04:45 | 2:04:35 | 0:05:03 | 2:12:08 | 4 | 4 | 4 | 1829 | 3885 | 7475 | 7928 |
4 | 9 | Chebet, Wilson | 31 | 1.0 | Marakwet | NaN | KEN | 0:30:28 | 1:04:35 | 2:05:00 | 0:05:04 | 2:12:35 | 5 | 5 | 5 | 1828 | 3875 | 7500 | 7955 |
df_Boston.isna().sum()
BibNo 0
Name 0
Age 0
Gender 0
City 0
State 3595
Country 0
10K 0
Half 0
40K 0
Pace_minpermile 0
Finish_Time 0
Overall_ranking 0
Gender_ranking 0
Division_ranking 0
Final_10k_sec 0
Final_Half_sec 0
Final_40k_sec 0
FinishTime_sec 0
dtype: int64
df_Boston.shape
(26410, 19)
len(df_Boston.State)
26410
Creating merged dataset for Boston and Moscow, keeping only relevant columns and sorting them by overall ranking.
df_Boston_merge=df_Boston.copy()
df_Moscow_merge=df_new_Moscow.copy()
Created isBoston column in Boston dataframe assigned value 1 which means after merging data row with value 1.0 are from Boston Marathon
df_Boston_merge['isBoston'] = '1.0'
df_Moscow_merge.drop(columns=['Finish_Time_Sec'],inplace=True)
df_Moscow_merge.drop(columns=['BibNo'],axis=1,inplace=True)
df_Boston_merge.drop(columns=['BibNo'],axis=1,inplace=True)
df_Moscow_merge.rename(columns={'10 km':'Final_10k_sec', 'Half marathon':'Final_Half_sec', \
'Marathon':'FinishTime_sec','Overall_Ranking':'Overall_ranking'}, inplace=True)
df_Moscow_merge.replace(to_replace='Male', value=1.0, inplace=True)
df_Moscow_merge.replace(to_replace='Female', value=0.0, inplace=True)
Created isBoston column in Moscow dataframe assigned value 0 which means after merging data row with value 0.0 are from Moscow Marathon
df_Moscow_merge['isBoston'] = '0.0'
df_Moscow_merge.drop(columns=['Pace_Sec', 'Pace_Minpkm', 'Half_Pace_Sec', 'Half_Pace_Minpkm'], axis=1, inplace=True)
df_Moscow_merge.head(1)
Gender | Age | Name | City | Country | Final_10k_sec | Final_Half_sec | FinishTime_sec | Overall_ranking | isBoston | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 31 | Stepan Kiselev | Kazan' | RUS | 1899.0 | 4006.0 | 8122.0 | 1 | 0.0 |
df_Boston_merge.drop(columns=['Final_40k_sec', 'State', '10K', 'Half', '40K','Pace_minpermile','Finish_Time','Gender_ranking','Division_ranking'], axis=1, inplace=True)
df_Boston_merge.head(1)
Name | Age | Gender | City | Country | Overall_ranking | Final_10k_sec | Final_Half_sec | FinishTime_sec | isBoston | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Kirui, Geoffrey | 24 | 1.0 | Keringet | KEN | 1 | 1828 | 3875 | 7777 | 1.0 |
Merge Boston and Moscow dataframe to get the insite about each athlete and sort them based on overall ranking
df_merged_bm = df_Boston_merge.merge(df_Moscow_merge,how='outer', left_on=['Name', 'City', 'Overall_ranking', 'FinishTime_sec', 'Final_Half_sec', 'Final_10k_sec', 'Country', 'Gender', 'Age', 'isBoston'],\
right_on=['Name', 'City', 'Overall_ranking', 'FinishTime_sec', 'Final_Half_sec', 'Final_10k_sec', 'Country', 'Gender', 'Age', 'isBoston'])
df_merged_bm = df_merged_bm.sort_values(by='Overall_ranking')
df_merged_bm.head(1)
Name | Age | Gender | City | Country | Overall_ranking | Final_10k_sec | Final_Half_sec | FinishTime_sec | isBoston | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Kirui, Geoffrey | 24 | 1.0 | Keringet | KEN | 1 | 1828.0 | 3875.0 | 7777.0 | 1.0 |
- Success rate of a country - ratio of number of participants of a country in top 100 / total number of participants of that country.
- Summary - Amongst 91 participating countries, Kenya has a success ratio of 100% in top 100 runners in Boston Marathon.
- Summary - Amongst 53 participating countries, Thailand has a success ratio of 70% in top 1000 runners in Moscow Marathon.
Get the total number of countries participating in the run.
df_country = df_Boston.groupby('Country').size()
len(df_country)
91
Find the countries of the runners who have finished in top 100.
df_top100_country=df_Boston[:100].groupby('Country').size()
len(df_top100_country)
14
Calculate the success ratio which is the number of runners in top 100 divide by the total number of runners from that country.
df_Boston_countries_success_rate = ((df_top100_country/df_country)* 100).nlargest(14) # 14 different countries in the top 100 as shown above.
df_Boston_countries_success_rate
Country
BDI 100.000000
BRN 100.000000
KEN 100.000000
ZIM 100.000000
ETH 80.000000
JPN 1.764706
NED 1.282051
IRL 1.149425
MEX 0.701754
ITA 0.606061
BRA 0.487805
USA 0.338983
GBR 0.235294
CAN 0.213904
dtype: float64
df_country['BDI']
1
df_country['BRN']
1
df_country['KEN']
8
df_country['ZIM']
1
df_country['ETH']
5
Plot of top 100 finishers countries.
sns.countplot(y='Country',data =df_Boston[:100])
<matplotlib.axes._subplots.AxesSubplot at 0xfef4b00>
Boston_success_df = df_Boston_countries_success_rate.to_frame().reset_index()
Boston_success_df.rename(columns={0:'success_rate'}, inplace=True)
sns.catplot(x='success_rate',y='Country', data=Boston_success_df,aspect=3,kind='bar')
<seaborn.axisgrid.FacetGrid at 0x10158080>
Eliminate the 3 countries - BDI, BRN and ZIM since they have only 1 participant. Plot for ETH and KEN. Show the success ratio plot as well. Show values of each bar in the graph.
Doing the same analysis for Moscow and get the total number of countries participating in the run
df_country_moscow=df_Moscow.groupby('Country').size()
df_top1000_country_moscow=df_Moscow[:1000].groupby('Country').size()
df_top1000_country_moscow
Country
CHN 1
HKG 10
RUS 903
THA 86
dtype: int64
success_rate_by_country_moscow =((df_top1000_country_moscow/df_country_moscow)* 100).nlargest(4)
success_rate_by_country_moscow
Country
THA 69.354839
HKG 34.482759
RUS 11.134402
CHN 1.960784
dtype: float64
Plot of top 1000 finishers by countries
sns.countplot(y='Country',data =df_Moscow[:1000])
<matplotlib.axes._subplots.AxesSubplot at 0xf2aa160>
Moscow_success_df = success_rate_by_country_moscow.to_frame().reset_index()
Moscow_success_df.rename(columns={0:'success_rate'}, inplace=True)
sns.catplot(x='success_rate',y='Country', data=Moscow_success_df,aspect=3,kind='bar')
<seaborn.axisgrid.FacetGrid at 0x10196320>
Doing the same analysis for merged data. This will show success rate for the combined two runs.
df_merged_bm_country=df_merged_bm.groupby('Country').size()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-1-b3695a4b09ad> in <module>
----> 1 df_merged_bm_country=df_merged_bm.groupby('Country').size()
NameError: name 'df_merged_bm' is not defined
df_merged_bm_country_500=df_merged_bm[:5000].groupby('Country').size()
df_merged_bm_success_rate =((df_merged_bm_country_500/df_merged_bm_country)* 100).nlargest(20) #expected finding Kenya has highest success rate
df_merged_bm_success_rate
Country
BDI 100.000000
BRN 100.000000
ETH 100.000000
IRN 100.000000
KEN 100.000000
MKD 100.000000
ZIM 100.000000
CHE 60.000000
LVA 60.000000
EGY 50.000000
PAN 50.000000
UZB 50.000000
DNK 42.857143
UKR 37.500000
BLR 35.555556
BER 33.333333
CAY 33.333333
CRC 33.333333
DOM 33.333333
GUA 33.333333
dtype: float64
df_merged_bm_success_rate_df = df_merged_bm_success_rate.to_frame().reset_index()
df_merged_bm_success_rate_df.rename(columns={0:'success_rate'}, inplace=True)
#df_merged_bm_success_rate_df.head()
sns.catplot(x='success_rate',y='Country', data=df_merged_bm_success_rate_df,aspect=2,kind='bar')
<seaborn.axisgrid.FacetGrid at 0x10184a20>
df_merged_bm_country = df_merged_bm.groupby('Country').count()
df_merged_bm_country[df_merged_bm_country > 100]
Name | Age | Gender | City | Overall_ranking | Final_10k_sec | Final_Half_sec | FinishTime_sec | isBoston | |
---|---|---|---|---|---|---|---|---|---|
Country | |||||||||
- | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ALG | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
AND | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ARE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ARG | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
AUS | 195.0 | 195.0 | 195.0 | 195.0 | 195.0 | 195.0 | 195.0 | 195.0 | 195.0 |
AUT | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
AZE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
BAR | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
BDI | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
BEL | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
BER | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
BIH | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
BLR | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
BRA | 206.0 | 206.0 | 206.0 | 206.0 | 206.0 | 206.0 | 206.0 | 206.0 | 206.0 |
BRN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
BUL | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
CAN | 1871.0 | 1871.0 | 1871.0 | 1871.0 | 1871.0 | 1871.0 | 1871.0 | 1871.0 | 1871.0 |
CAY | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
CHE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
CHI | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
CHN | 293.0 | 293.0 | 293.0 | 293.0 | 293.0 | 293.0 | 293.0 | 293.0 | 293.0 |
COL | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
CRC | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
CRO | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
CZE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
DEN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
DEU | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
DNK | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
DOM | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
PER | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
PHI | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
POL | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
POR | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
PRT | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ROU | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
RSA | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
RUS | 8139.0 | 8139.0 | 8139.0 | 8139.0 | 8139.0 | 8138.0 | 8139.0 | 8139.0 | 8139.0 |
SEN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
SIN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
SLO | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
SMR | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
SRB | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
SUI | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
SVK | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
SVN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
SWE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
TCA | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
THA | 129.0 | 129.0 | 129.0 | 129.0 | 129.0 | 129.0 | 129.0 | 129.0 | 129.0 |
TRI | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
TUR | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
TWN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
UAE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
UKR | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
URU | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
USA | 20945.0 | 20945.0 | 20945.0 | 20945.0 | 20945.0 | 20945.0 | 20945.0 | 20945.0 | 20945.0 |
UZB | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
VEN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ZAF | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ZIM | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
113 rows × 9 columns
Show agreegated result of boston and moscow success rate in one graph. Still on different datasets.
Boston_success_df = df_Boston_countries_success_rate.to_frame().reset_index()
Boston_success_df.rename(columns={0:'success_rate'}, inplace=True)
Boston_success_df['isBoston'] = '1.0'
#Boston_success_df.head(1)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-2-706243f5f236> in <module>
----> 1 Boston_success_df = df_Boston_countries_success_rate.to_frame().reset_index()
2 Boston_success_df.rename(columns={0:'success_rate'}, inplace=True)
3 Boston_success_df['isBoston'] = '1.0'
4 #Boston_success_df.head(1)
NameError: name 'df_Boston_countries_success_rate' is not defined
Moscow_success_df = success_rate_by_country_moscow.to_frame().reset_index()
Moscow_success_df.rename(columns={0:'success_rate'}, inplace=True)
Moscow_success_df['isBoston'] = '0.0'
#Moscow_success_df.head(1)
df_merge_success_rate = Boston_success_df.merge(Moscow_success_df,how='outer', left_on=['Country', 'success_rate', 'isBoston'],\
right_on=['Country', 'success_rate', 'isBoston'])
df_merge_success_rate = df_merge_success_rate.sort_values(by='success_rate', ascending=False)
sns.catplot(x='success_rate',y='Country',hue='isBoston',data=df_merge_success_rate,aspect=3,kind='bar')
<seaborn.axisgrid.FacetGrid at 0x536c128>
- To perform well in long run marathons athletes should follow the practices what Kenya and Thailand athletes have implemented. As Kenya athletes have a dominance in the olympics and most recently Thailand is seeing the hobby jogging boom.
- Kenya and Thailand athletes are used to perform in severe conditions, so they have the capacity to manage well in Boston and Moscow Marathon
- Kenya as a country has the most successful representation in the run. The runners from Kenya are really fast and they are amongst the winners as well. In order to make the run more competitive and atttractive sponsors and viewers, we should make sure to have good representation from Kenya and top Kenyan players. This will drive more advertising and sponsorship money to the run. Don't just focus on local talent but rather go for the bigger talent from abroad.
- Summary - Negative split is not an effective nor popular strategy for Boston Marathon.
- Background - Negative split is a very popular and common strategy where a runner runs the second half of a marathon
- Faster than the first half.
- After doing the analysis, we observed that to perfrom well in Marathon athletes should use the even split strategy that is athlete should keep his/her pace consistent throughout the marathon instead of conserving the energy for first or second split of the marathon
- It is advisable to athlete to practice their chosen strategy in training to help set the realistic expectation and to learn how their body is going to respond.
- Summary - Middle aged athletes are faster as compare to younger aged athletes.
- It is advisible to younger age athletes to focus on improving the endurance
import matplotlib.pyplot as plt
df_Boston['boston_split_ratio'] = (df_Boston['FinishTime_sec'] - df_Boston['Final_Half_sec'])/(df_Boston['Final_Half_sec'])
df_split = df_Boston[df_Boston.Final_Half_sec > 0]
len(df_split[df_split.boston_split_ratio < 1.0])
811
len(df_Boston[df_Boston.boston_split_ratio < 1.0]) / len(df_Boston) * 100
3.070806512684589
Proves that mature people use negative split and not young.
df_split[df_split.boston_split_ratio < 1.0].groupby(['Gender']).size()
Gender
0.0 377
1.0 434
dtype: int64
sns.kdeplot(df_split.boston_split_ratio)
plt.xlim([0.7,1.7])
plt.xlabel('Boston Split Ratio')
plt.ylabel('Runner Density')
plt.title('Split Distribution (Negative split when < 1)')
<matplotlib.axes._subplots.AxesSubplot at 0xf291c50>
(0.7, 1.7)
Text(0.5, 0, 'Boston Split Ratio')
Text(0, 0.5, 'Runner Density')
Text(0.5, 1.0, 'Split Distribution (Negative split when < 1)')
plt.plot(df_split.Overall_ranking,df_split.boston_split_ratio,'o', alpha = 0.2)
plt.ylim([0.5,3])
plt.xlabel('Overall Ranking')
plt.ylabel('Split Ratio')
plt.title('Split and performance')
[<matplotlib.lines.Line2D at 0x100b5a58>]
(0.5, 3)
Text(0.5, 0, 'Overall Ranking')
Text(0, 0.5, 'Split Ratio')
Text(0.5, 1.0, 'Split and performance')
Till here, we should show how many have positive split and how many have negative split. Show them by age and gender. Plot split ratio of each runner by finish time. Variance in split ratio by age? by gender?
Derive split of top 100 from the above dataframe to optimize.
dfBoston_100 = df_Boston[:100]
dfBoston_100['split_ratio_100'] = (dfBoston_100['FinishTime_sec'] - dfBoston_100['Final_Half_sec'])/(dfBoston_100['Final_Half_sec'])
df_split_100 = dfBoston_100[dfBoston_100.Final_Half_sec > 0]
Top 10 runners have not used negative split.
The below chart shows that successful countries didn't solely use negative split strategy.
plt.plot(df_split_100.Overall_ranking,df_split_100.split_ratio_100,'o', alpha = 0.2)
plt.ylim([0.8,1.4])
plt.xlabel('Overall Rank')
plt.ylabel('Split')
plt.title('Split and performance')
[<matplotlib.lines.Line2D at 0x10113b38>]
(0.8, 1.4)
Text(0.5, 0, 'Overall Rank')
Text(0, 0.5, 'Split')
Text(0.5, 1.0, 'Split and performance')
# Till here, we should show how many have +ve split and how many have negative split. Show them by age and gender.
# plot split ratio of each runner by finish time. split ratio on y axis and finish time on x.
# variance in split ratio by age? by gender?
plt.plot(df_split.Overall_ranking[(df_split.Country is not 'KEN') & (df_split.Overall_ranking<100)], df_split.boston_split_ratio[(df_split.Country is not 'KEN') & (df_split.Overall_ranking<100)],'o')
plt.plot(df_split.Overall_ranking[(df_split.Country == 'KEN') & (df_split.Overall_ranking<100)], df_split.boston_split_ratio[(df_split.Country == 'KEN') & (df_split.Overall_ranking<100)],'o', color = 'r')
plt.xlabel('Overall Rank')
plt.ylabel('Split Ratio')
plt.legend(['Others','Kenya'])
[<matplotlib.lines.Line2D at 0x13988da0>]
[<matplotlib.lines.Line2D at 0x13994438>]
Text(0.5, 0, 'Overall Rank')
Text(0, 0.5, 'Split Ratio')
<matplotlib.legend.Legend at 0x13988ef0>
#doing same analysis for moscow.
import matplotlib.pyplot as plt
df_new_Moscow['moscow_split_ratio'] = (df_new_Moscow['Marathon'] - df_new_Moscow['Half marathon'])/(df_new_Moscow['Half marathon'])
# count number of positive split and number of negative split. Plot it.
sns.kdeplot(df_new_Moscow.moscow_split_ratio)
plt.xlim([0.7,1.8])
plt.xlabel('Split Ratio')
plt.title('Split Distribution (Negative split when < 1)')
# what does y axis represent? is it count?
<matplotlib.axes._subplots.AxesSubplot at 0x139cc588>
(0.7, 1.8)
Text(0.5, 0, 'Split Ratio')
Text(0.5, 1.0, 'Split Distribution (Negative split when < 1)')
plt.plot(df_new_Moscow.Overall_Ranking,df_new_Moscow.moscow_split_ratio,'o', alpha = 0.2)
plt.ylim([0.5,2.2])
plt.xlabel('Overall_Ranking')
plt.ylabel('Split Ratio')
plt.title('Split Ratio and performance')
[<matplotlib.lines.Line2D at 0x13c8d160>]
(0.5, 2.2)
Text(0.5, 0, 'Overall_Ranking')
Text(0, 0.5, 'Split Ratio')
Text(0.5, 1.0, 'Split Ratio and performance')
len(df_new_Moscow[df_new_Moscow.moscow_split_ratio < 1.0])
935
len(df_new_Moscow[df_new_Moscow.moscow_split_ratio < 1.0]) / len(df_new_Moscow) * 100
10.744656400827395
Derive split of top 100 from the above df to optimize.
dfMoscow_100 = df_new_Moscow[:100]
dfMoscow_100['split_ratio_100'] = (dfMoscow_100['Marathon'] - dfMoscow_100['Half marathon'])/(dfMoscow_100['Half marathon'])
df_split_moscow_100 = dfMoscow_100
sns.kdeplot(df_split_moscow_100.split_ratio_100)
plt.xlim([0.7,1.7])
plt.xlabel('Split Ratio')
plt.title('Split Distribution (Negative split when < 1)')
<matplotlib.axes._subplots.AxesSubplot at 0x13c991d0>
(0.7, 1.7)
Text(0.5, 0, 'Split Ratio')
Text(0.5, 1.0, 'Split Distribution (Negative split when < 1)')
plt.plot(df_split_moscow_100.Overall_Ranking,df_split_moscow_100.split_ratio_100,'o', alpha = 0.2)
plt.ylim([0.9,1.3])
plt.xlabel('Overall Rank')
plt.ylabel('Split')
plt.title('Split and performance')
[<matplotlib.lines.Line2D at 0x14ed58d0>]
(0.9, 1.3)
Text(0.5, 0, 'Overall Rank')
Text(0, 0.5, 'Split')
Text(0.5, 1.0, 'Split and performance')
len(df_split_moscow_100)
100
Count number of positive split and number of negative split. Plot it.
import matplotlib.pyplot as plt
df_merged_bm['split_ratio'] = (df_merged_bm['FinishTime_sec'] - df_merged_bm['Final_Half_sec'])/(df_merged_bm['Final_Half_sec'])
type(df_split.boston_split_ratio)
pandas.core.series.Series
f, ax = plt.subplots(figsize=(8, 8))
ax = sns.kdeplot(df_split.boston_split_ratio, legend=True)
ax = sns.kdeplot(df_new_Moscow.moscow_split_ratio, legend=True)
#sns.lineplot(x='split_ratio',y='FinishTime_sec',hue='isBoston',data=df_Boston,aspect=2,kind='point')
plt.xlim([0.2,2.8])
plt.xlabel('Split Ratio')
plt.ylabel('Runner Density')
plt.title('Split Distribution (Negative split when < 1)')
# what does y axis represent? is it count?
(0.2, 2.8)
Text(0.5, 0, 'Split Ratio')
Text(0, 0.5, 'Runner Density')
Text(0.5, 1.0, 'Split Distribution (Negative split when < 1)')
plt.plot(df_merged_bm.Overall_ranking,df_merged_bm.split_ratio,'o', alpha = 0.2)
plt.ylim([0.5,2.2])
plt.xlabel('Overall_Ranking')
plt.ylabel('Split')
plt.title('Split and performance')
[<matplotlib.lines.Line2D at 0x14fa7be0>]
(0.5, 2.2)
Text(0.5, 0, 'Overall_Ranking')
Text(0, 0.5, 'Split')
Text(0.5, 1.0, 'Split and performance')
df_Boston['binned_age1']= pd.cut(df_Boston.Age,bins=[15,25,35,45,55,70])
df_Boston[df_Boston.Gender == 1.0].Age.describe() # male dataset details. #REMOVE!!!
count 14438.000000
mean 44.772475
std 11.533255
min 18.000000
25% 36.000000
50% 45.000000
75% 53.000000
max 83.000000
Name: Age, dtype: float64
df_Boston[df_Boston.Gender == 0.0].Age.describe() #REMOVE!!!
count 11972.000000
mean 39.952974
std 10.703554
min 18.000000
25% 31.000000
50% 40.000000
75% 48.000000
max 84.000000
Name: Age, dtype: float64
Unexpected as middle aged person more participated as compared to young.
df_Boston.groupby('Age').size().nlargest(5)
Age
45 1127
46 976
40 926
47 843
50 815
dtype: int64
df_Boston.groupby('Gender').size()
Gender
0.0 11972
1.0 14438
dtype: int64
df_Boston.groupby(['Gender','binned_age1']).size()
Gender binned_age1
0.0 (15, 25] 1067
(25, 35] 3350
(35, 45] 3780
(45, 55] 2823
(55, 70] 932
1.0 (15, 25] 691
(25, 35] 2710
(35, 45] 3991
(45, 55] 4316
(55, 70] 2623
dtype: int64
sns.countplot(x='binned_age1',hue='Gender',data=df_Boston)
<matplotlib.axes._subplots.AxesSubplot at 0x14fe5978>
df_Boston['binned_age2']= pd.cut(df_Boston.Age,bins=[15,20,25,29,35,40,45,50,70])
sns.factorplot(x='binned_age2',y='FinishTime_sec',hue='Gender',data=df_Boston,aspect=2,kind='point')
<seaborn.axisgrid.FacetGrid at 0x14f8a240>
Older and mature people perform the best and are the faster marathon runner. Older and mature people also participated more than young people and are more health considerate.
df_new_Moscow.head()
BibNo | Finish_Time_Sec | Pace_Sec | Pace_Minpkm | Half_Pace_Sec | Half_Pace_Minpkm | Gender | Age | Name | City | Country | 10 km | Half marathon | Marathon | Overall_Ranking | moscow_split_ratio | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | 8122.0 | 192.487261 | 3:12 min/km | 189.880318 | 3:09 min/km | Male | 31 | Stepan Kiselev | Kazan' | RUS | 1899.0 | 4006.0 | 8122.0 | 1 | 1.027459 |
1 | 12 | 8161.0 | 193.411542 | 3:13 min/km | 189.738121 | 3:09 min/km | Male | 32 | Aleksey Troshkin | Saransk | RUS | 1905.0 | 4003.0 | 8161.0 | 2 | 1.038721 |
2 | 10 | 8394.0 | 198.933523 | 3:18 min/km | 188.268752 | 3:08 min/km | Male | 27 | Artem Aplachkin | Barnaul | RUS | 1879.0 | 3972.0 | 8394.0 | 3 | 1.113293 |
3 | 15008 | 8416.0 | 199.454912 | 3:19 min/km | 195.094205 | 3:15 min/km | Male | 29 | Sergey Popov | Voronezh | RUS | 1960.0 | 4116.0 | 8416.0 | 4 | 1.044704 |
4 | 4 | 8467.0 | 200.663586 | 3:20 min/km | 186.894182 | 3:06 min/km | Male | 36 | Dmitriy Safronov | Moskva | RUS | 1876.0 | 3943.0 | 8467.0 | 5 | 1.147350 |
df_new_Moscow['binned_age1']= pd.cut(df_new_Moscow.Age,bins=[15,25,30,35,40,50,70])
df_new_Moscow.groupby('Age').size().nlargest(10)
Age
31 485
30 482
33 468
32 457
34 450
35 434
29 408
28 372
36 360
37 340
dtype: int64
df_new_Moscow.groupby(['Gender']).size()
Gender
Female 1491
Male 7211
dtype: int64
df_new_Moscow.groupby(['Gender','binned_age1']).size()
Gender binned_age1
Female (15, 25] 183
(25, 30] 403
(30, 35] 395
(35, 40] 238
(40, 50] 207
(50, 70] 63
Male (15, 25] 530
(25, 30] 1401
(30, 35] 1899
(35, 40] 1335
(40, 50] 1409
(50, 70] 616
dtype: int64
sns.countplot(x='binned_age1',hue='Gender',data=df_new_Moscow) # color is opposite for male/female as compared to other graphs.
<matplotlib.axes._subplots.AxesSubplot at 0x16cc04a8>
df_new_Moscow['binned_age2']= pd.cut(df_new_Moscow.Age,bins=[20,25,30,35,40,45,50,70])
sns.factorplot(x='binned_age2',y='Marathon',hue='Gender',data=df_new_Moscow,aspect=3,kind='point')
<seaborn.axisgrid.FacetGrid at 0x16cc0908>
Analysis on combined dataset.
df_merged_bm['binned_age1']= pd.cut(df_merged_bm.Age,bins=[15,25,30,35,40,45,50,60,70])
df_merged_bm.groupby('Age').size().nlargest(10)
Age
45 1275
35 1234
40 1193
46 1150
30 1111
36 1093
31 1068
33 1059
29 1044
41 1034
dtype: int64
df_merged_bm.groupby(['Gender']).size()
Gender
0.0 13463
1.0 21649
dtype: int64
df_merged_bm.groupby(['Gender','binned_age1']).size()
Gender binned_age1
0.0 (15, 25] 1250
(25, 30] 2187
(30, 35] 1961
(35, 40] 2126
(40, 45] 2026
(45, 50] 1774
(50, 60] 1763
(60, 70] 354
1.0 (15, 25] 1221
(25, 30] 2600
(30, 35] 3410
(35, 40] 3175
(40, 45] 3000
(45, 50] 2963
(50, 60] 3834
(60, 70] 1318
dtype: int64
Age wise participation across boston and moscow
sns.countplot(x='binned_age1',hue='isBoston',data=df_merged_bm)
<matplotlib.axes._subplots.AxesSubplot at 0x13c5cf28>
sns.countplot(x='binned_age1',hue='isBoston',data=df_merged_bm[df_merged_bm.Gender == 1.0])
<matplotlib.axes._subplots.AxesSubplot at 0x15239f60>
Age across boston and moscow for binned ages BUT only for females
sns.countplot(x='binned_age1',hue='isBoston',data=df_merged_bm[df_merged_bm.Gender == 0.0])
<matplotlib.axes._subplots.AxesSubplot at 0x16344cf8>
#df_new_Moscow[(df_new_Moscow.Age > 45) & (df_new_Moscowdd.Gender == 'Female')][['Age', 'Marathon']]
Binned age for combined df
df_merged_bm['binned_age']=pd.cut(df_merged_bm.Age,bins=[20,25,30,35,40,45,50,70])
Plot to show finish time across both marathons for binned ages.
sns.factorplot(x='binned_age',y='FinishTime_sec',hue='isBoston',data=df_merged_bm,aspect=2,kind='point')
<seaborn.axisgrid.FacetGrid at 0x1525f668>
Finish time across boston and moscow for binned ages BUT only for males.
sns.factorplot(x='binned_age',y='FinishTime_sec',hue='isBoston',data=df_merged_bm[df_merged_bm.Gender == 1.0],aspect=2,kind='point')
<seaborn.axisgrid.FacetGrid at 0x16761da0>
Finish time across boston and moscow for binned ages BUT only for females.
sns.factorplot(x='binned_age',y='FinishTime_sec',hue='isBoston',data=df_merged_bm[df_merged_bm.Gender == 0.0],aspect=2,kind='point')
<seaborn.axisgrid.FacetGrid at 0x16e1efd0>
df_Boston.head()
BibNo | Name | Age | Gender | City | State | Country | 10K | Half | 40K | Pace_minpermile | Finish_Time | Overall_ranking | Gender_ranking | Division_ranking | Final_10k_sec | Final_Half_sec | Final_40k_sec | FinishTime_sec | boston_split_ratio | binned_age1 | binned_age2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 11 | Kirui, Geoffrey | 24 | 1.0 | Keringet | NaN | KEN | 0:30:28 | 1:04:35 | 2:02:53 | 0:04:57 | 2:09:37 | 1 | 1 | 1 | 1828 | 3875 | 7373 | 7777 | 1.006968 | (15, 25] | (20, 25] |
1 | 17 | Rupp, Galen | 30 | 1.0 | Portland | OR | USA | 0:30:27 | 1:04:35 | 2:03:14 | 0:04:58 | 2:09:58 | 2 | 2 | 2 | 1827 | 3875 | 7394 | 7798 | 1.012387 | (25, 35] | (29, 35] |
2 | 23 | Osako, Suguru | 25 | 1.0 | Machida-City | NaN | JPN | 0:30:29 | 1:04:36 | 2:03:38 | 0:04:59 | 2:10:28 | 3 | 3 | 3 | 1829 | 3876 | 7418 | 7828 | 1.019608 | (15, 25] | (20, 25] |
3 | 21 | Biwott, Shadrack | 32 | 1.0 | Mammoth Lakes | CA | USA | 0:30:29 | 1:04:45 | 2:04:35 | 0:05:03 | 2:12:08 | 4 | 4 | 4 | 1829 | 3885 | 7475 | 7928 | 1.040669 | (25, 35] | (29, 35] |
4 | 9 | Chebet, Wilson | 31 | 1.0 | Marakwet | NaN | KEN | 0:30:28 | 1:04:35 | 2:05:00 | 0:05:04 | 2:12:35 | 5 | 5 | 5 | 1828 | 3875 | 7500 | 7955 | 1.052903 | (25, 35] | (29, 35] |
dt = tree.DecisionTreeRegressor(max_depth=2)
df_bos_dec_tree = df_Boston.copy()
df_bos_dec_tree = pd.get_dummies(data=df_bos_dec_tree,columns=['Country'])
X= df_bos_dec_tree.drop(columns=['BibNo','Age','City','State','10K','40K','Pace_minpermile','Name',\
'Finish_Time','Overall_ranking','Gender_ranking','boston_split_ratio',\
'Division_ranking','Half','Final_40k_sec','binned_age2',\
'binned_age1'],axis=1)
X.head()
X.columns
Gender | Final_10k_sec | Final_Half_sec | FinishTime_sec | Country_ALG | Country_AND | Country_ARG | Country_AUS | Country_AUT | Country_BAR | Country_BDI | Country_BEL | Country_BER | Country_BRA | Country_BRN | ... | Country_SRB | Country_SUI | Country_SVK | Country_SWE | Country_TCA | Country_THA | Country_TRI | Country_TUR | Country_TWN | Country_UAE | Country_UKR | Country_URU | Country_USA | Country_VEN | Country_ZIM | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1828 | 3875 | 7777 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1.0 | 1827 | 3875 | 7798 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 1.0 | 1829 | 3876 | 7828 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 1.0 | 1829 | 3885 | 7928 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 1.0 | 1828 | 3875 | 7955 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 95 columns
Index(['Gender', 'Final_10k_sec', 'Final_Half_sec', 'FinishTime_sec',
'Country_ALG', 'Country_AND', 'Country_ARG', 'Country_AUS',
'Country_AUT', 'Country_BAR', 'Country_BDI', 'Country_BEL',
'Country_BER', 'Country_BRA', 'Country_BRN', 'Country_BUL',
'Country_CAN', 'Country_CAY', 'Country_CHI', 'Country_CHN',
'Country_COL', 'Country_CRC', 'Country_CRO', 'Country_CZE',
'Country_DEN', 'Country_DOM', 'Country_ECU', 'Country_EGY',
'Country_ESA', 'Country_ESP', 'Country_EST', 'Country_ETH',
'Country_FIN', 'Country_FLK', 'Country_FRA', 'Country_GBR',
'Country_GER', 'Country_GRE', 'Country_GRN', 'Country_GUA',
'Country_HKG', 'Country_HON', 'Country_HUN', 'Country_INA',
'Country_IND', 'Country_IRL', 'Country_ISL', 'Country_ISR',
'Country_ITA', 'Country_JAM', 'Country_JPN', 'Country_KEN',
'Country_KOR', 'Country_KSA', 'Country_KUW', 'Country_LAT',
'Country_LTU', 'Country_LUX', 'Country_MAR', 'Country_MAS',
'Country_MEX', 'Country_MGL', 'Country_MLT', 'Country_NCA',
'Country_NED', 'Country_NGR', 'Country_NOR', 'Country_NZL',
'Country_PAN', 'Country_PAR', 'Country_PER', 'Country_PHI',
'Country_POL', 'Country_POR', 'Country_ROU', 'Country_RSA',
'Country_RUS', 'Country_SIN', 'Country_SLO', 'Country_SMR',
'Country_SRB', 'Country_SUI', 'Country_SVK', 'Country_SWE',
'Country_TCA', 'Country_THA', 'Country_TRI', 'Country_TUR',
'Country_TWN', 'Country_UAE', 'Country_UKR', 'Country_URU',
'Country_USA', 'Country_VEN', 'Country_ZIM'],
dtype='object')
Y=df_bos_dec_tree.Age
dt.fit(X,Y)
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
import os
os.environ["PATH"] += os.pathsep + r'C:\Users\MIH\Anaconda3\Library\bin\graphviz'
# This code will visualize a decision tree dt, trained with the attributes in X and the class labels in Y
dt_feature_names = list(X.columns)
dt_target_names = [str(s) for s in Y.unique()]
tree.export_graphviz(dt, out_file='tree.dot',
feature_names=dt_feature_names, class_names=dt_target_names,
filled=True)
graph = pydotplus.graph_from_dot_file('tree.dot')
Image(graph.create_png())
from sklearn import linear_model
regLasso = linear_model.Lasso()
regLasso.fit(X,Y)
Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute=False, random_state=None,
selection='cyclic', tol=0.0001, warm_start=False)
regLasso.coef_
array([ 2.45946604e+00, 1.54926501e-03, 6.73233896e-04, 5.91283677e-04,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, -0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, -0.00000000e+00, 0.00000000e+00,
0.00000000e+00, -0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, -0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, -0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, -0.00000000e+00,
0.00000000e+00, -0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, -0.00000000e+00, 0.00000000e+00, -0.00000000e+00,
0.00000000e+00, -0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, -0.00000000e+00,
0.00000000e+00, 0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
-0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
0.00000000e+00, -0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
-0.00000000e+00, 0.00000000e+00, 0.00000000e+00, -0.00000000e+00,
-0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, -0.00000000e+00, 0.00000000e+00,
-0.00000000e+00, 0.00000000e+00, -0.00000000e+00])
from sklearn.cluster import KMeans
clu = KMeans(n_clusters=2, random_state=0)
clu
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=0, tol=0.0001, verbose=0)
clu.fit(X)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=0, tol=0.0001, verbose=0)
clu.labels_
array([1, 1, 1, ..., 0, 0, 0])
df2 = X.copy()
df2['cluster']=clu.labels_
df2.groupby('cluster').mean()
Gender | Final_10k_sec | Final_Half_sec | FinishTime_sec | Country_ALG | Country_AND | Country_ARG | Country_AUS | Country_AUT | Country_BAR | Country_BDI | Country_BEL | Country_BER | Country_BRA | Country_BRN | ... | Country_SRB | Country_SUI | Country_SVK | Country_SWE | Country_TCA | Country_THA | Country_TRI | Country_TUR | Country_TWN | Country_UAE | Country_UKR | Country_URU | Country_USA | Country_VEN | Country_ZIM | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
cluster | |||||||||||||||||||||||||||||||
0 | 0.432154 | 3581.235880 | 7843.994727 | 17233.904031 | 0.000000 | 0.000000 | 0.000586 | 0.004921 | 0.001055 | 0.000000 | 0.000000 | 0.001758 | 0.000117 | 0.004570 | 0.000000 | ... | 0.000000 | 0.003984 | 0.000234 | 0.001875 | 0.000000 | 0.000234 | 0.000000 | 0.000117 | 0.000937 | 0.000234 | 0.000234 | 0.000469 | 0.825521 | 0.000000 | 0.000000 |
1 | 0.601365 | 2821.851253 | 6044.298221 | 12875.059297 | 0.000112 | 0.000056 | 0.001734 | 0.008335 | 0.000727 | 0.000224 | 0.000056 | 0.001510 | 0.000112 | 0.009286 | 0.000056 | ... | 0.000056 | 0.003524 | 0.000224 | 0.002517 | 0.000056 | 0.000168 | 0.000056 | 0.000224 | 0.002573 | 0.000280 | 0.000112 | 0.000056 | 0.777579 | 0.000336 | 0.000056 |
2 rows × 95 columns