Skip to content

Working with Python Pandas and Matplotlib. Analyzed the marathon runners’ split finished times to determine top finishers as per the success ratio of the country. This analysis helped determine the strategies used by top finishers (Positive or Negative split) to complete the marathon effectively.

Notifications You must be signed in to change notification settings

mohitcpatil/Boston-Marathon-Analysis-Python-Pandas

Repository files navigation

Boston-Moscow Marathon Analysis

Analyzed the marathon runners’ split finished times to determine top finishers as per the success ratio of the country. This analysis helped to find out the strategies used by top finishers (Positive or Negative split) to complete the marathon effectively.

Project Datasets

Their are two sets of Datasets to complete this project Boston and Moscow Marathon.

1. Boston Marathon Results 2017

  • marathon_results_2017.csv

Columns : {Bib ,Name, Age, M/F, City, State, Country, Citizen, 5K, 10K, 15K, 20K, Half, 25K, 30K, 35K, 40K, Pace, Proj Time, Official Time, Overall, Gender, Division}

Data : {11, Kirui, Geoffrey, 24, M, Keringet, KEN, 0:15:25, 0:30:28, 0:45:44, 1:01:15, 1:04:35, 1:16:59, 1:33:01, 1:48:19, 2:02:53, 0:04:57, - 2:09:37, 1, 1, 1}

2. Moscow Marathon Full Results 2018

  • 1_full_results_mm_2018.csv

Columns : {Bib, finish_time_sec, finish_time_result, race, pace_sec, pace(minpkm), pace(kmph), half_pace_sec, half_pace(minpkm), half_pace(kmph), gender_en, agev name_en, location_city_ru, location_city_en, country_code_alpha_3, flag_DNF, flag_all_split_exist, race_uniform_index}

Data : {1, 8911, 2h 28min 31sec, 42.195 km, 211.1861595, 3:31 min/km 17.0 km/h, 208.3185212, 3:28 min/km, 17.3 km/h, Female, 30, Sardana Trofimova, –Ø–∫—É—Ç—Å–∫, Yakutsk, RUS, 0, 1, 0.000132899}

  • 1_split_results_mm_2018.csv

Columns : {bib, split_name, split, split_time_sec, split_time_result, split_pace_sec, split_pace(minpkm), split_pace(kmph), split_uniform_index}

Data : {11, Kirui, Geoffrey, 24, M, Keringet, KEN, 0:15:25, 0:30:28, 0:45:44, 1:01:15, 1:04:35, 1:16:59, 1:33:01, 1:48:19, 2:02:53, 0:04:57, - 2:09:37, 1, 1, 11}

Jupyter Notebook with 3 Interesting findings

%autosave 0
Autosave disabled
# print all the outputs in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from datetime import datetime
import numpy as np
import pandas as pd
import seaborn as sns
import pandas as pd
%pylab inline

import sklearn as sk
import sklearn.tree as tree
from IPython.display import Image  
import pydotplus
Populating the interactive namespace from numpy and matplotlib


/opt/anaconda3/lib/python3.8/site-packages/IPython/core/magics/pylab.py:159: UserWarning: pylab import has clobbered these variables: ['datetime']
`%matplotlib` prevents importing * from pylab and numpy
  warn("pylab import has clobbered these variables: %s"  % clobbered +
import warnings
#suppress all future warning
warnings.filterwarnings('ignore')
#see only one time warning
#warnings.filterwarnings(action='once')

1. Data Set Description

  • df_boston is a dataset of all athelets in boston marathon. Each row is represent attribute of an individual athelete.
  • df_moscow is a dataset of all athelets in Moscow marathon. Each row is represent attribute of an individual athelete.

Important columns in df_boston and df_moscow are:

  1. BibNo: Unique Number given to each athelet in marathon.
  2. Age: Age of each Athelet.
  3. Gender: Gender of each Athelet.
  4. Country: The Country which each athelet belongs to
  5. Finish_Time: Time taken in seconds by each athelet to complete the race
  6. Final_10k_sec: Time taken in seconds by each athelet to complete 10k distance of the race
  7. Final_Half_sec: Time taken in seconds by each athelet to complete half of the race
  8. Final_40k_sec: Time taken in seconds by each athelet to complete 40k distance of the race
  9. Overall_Ranking: Ranking with respect to Finishing time of each athelet
  10. Gender_Ranking: Ranking with respect to Gender of Athelet.
  11. Division_Ranking: Ranking with repect to Age group of Athelet.
df_boston_raw = pd.read_csv('marathon_results_2017.csv')
df_moscow_full_results  = pd.read_csv('1_full_results_mm_2018.csv')
df_moscow_split_results = pd.read_csv('1_split_results_mm_2018.csv')
pd.set_option('display.max_columns',30)
df_boston_raw.head()
Unnamed: 0 Bib Name Age M/F City State Country Citizen Unnamed: 9 5K 10K 15K 20K Half 25K 30K 35K 40K Pace Proj Time Official Time Overall Gender Division
0 0 11 Kirui, Geoffrey 24 M Keringet NaN KEN NaN NaN 0:15:25 0:30:28 0:45:44 1:01:15 1:04:35 1:16:59 1:33:01 1:48:19 2:02:53 0:04:57 - 2:09:37 1 1 1
1 1 17 Rupp, Galen 30 M Portland OR USA NaN NaN 0:15:24 0:30:27 0:45:44 1:01:15 1:04:35 1:16:59 1:33:01 1:48:19 2:03:14 0:04:58 - 2:09:58 2 2 2
2 2 23 Osako, Suguru 25 M Machida-City NaN JPN NaN NaN 0:15:25 0:30:29 0:45:44 1:01:16 1:04:36 1:17:00 1:33:01 1:48:31 2:03:38 0:04:59 - 2:10:28 3 3 3
3 3 21 Biwott, Shadrack 32 M Mammoth Lakes CA USA NaN NaN 0:15:25 0:30:29 0:45:44 1:01:19 1:04:45 1:17:00 1:33:01 1:48:58 2:04:35 0:05:03 - 2:12:08 4 4 4
4 4 9 Chebet, Wilson 31 M Marakwet NaN KEN NaN NaN 0:15:25 0:30:28 0:45:44 1:01:15 1:04:35 1:16:59 1:33:01 1:48:41 2:05:00 0:05:04 - 2:12:35 5 5 5
df_moscow_full_results.head()
bib finish_time_sec finish_time_result race pace_sec pace(minpkm) pace(kmph) half_pace_sec half_pace(minpkm) half_pace(kmph) gender_en age name_en location_city_ru location_city_en country_code_alpha_3 flag_DNF flag_all_split_exist race_uniform_index
0 1 8911.0 2h 28min 31sec 42.195 km 211.186160 3:31 min/km 17.0 km/h 208.318521 3:28 min/km 17.3 km/h Female 30 Sardana Trofimova Якутск Yakutsk RUS 0 1 0.000133
1 2 9308.0 2h 35min 08sec 42.195 km 220.594857 3:40 min/km 16.3 km/h 212.157839 3:32 min/km 17.0 km/h Female 35 Tat'yana Arkhipova Чебоксары Cheboksary RUS 0 1 0.002556
2 3 8122.0 2h 15min 22sec 42.195 km 192.487261 3:12 min/km 18.7 km/h 189.880318 3:09 min/km 19.0 km/h Male 31 Stepan Kiselev Казань Kazan' RUS 0 1 0.000397
3 4 8467.0 2h 21min 07sec 42.195 km 200.663586 3:20 min/km 17.9 km/h 186.894182 3:06 min/km 19.3 km/h Male 36 Dmitriy Safronov Москва Moskva RUS 0 1 0.009003
4 5 8738.0 2h 25min 38sec 42.195 km 207.086148 3:27 min/km 17.4 km/h 194.904610 3:14 min/km 18.5 km/h Male 42 Grigoriy Andreev Уфа Ufa RUS 0 1 0.006439
df_boston_raw.shape
(26410, 25)
df_boston_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26410 entries, 0 to 26409
Data columns (total 25 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Unnamed: 0     26410 non-null  int64 
 1   Bib            26410 non-null  object
 2   Name           26410 non-null  object
 3   Age            26410 non-null  int64 
 4   M/F            26410 non-null  object
 5   City           26410 non-null  object
 6   State          22815 non-null  object
 7   Country        26410 non-null  object
 8   Citizen        1254 non-null   object
 9   Unnamed: 9     91 non-null     object
 10  5K             26410 non-null  object
 11  10K            26410 non-null  object
 12  15K            26410 non-null  object
 13  20K            26410 non-null  object
 14  Half           26410 non-null  object
 15  25K            26410 non-null  object
 16  30K            26410 non-null  object
 17  35K            26410 non-null  object
 18  40K            26410 non-null  object
 19  Pace           26410 non-null  object
 20  Proj Time      26410 non-null  object
 21  Official Time  26410 non-null  object
 22  Overall        26410 non-null  int64 
 23  Gender         26410 non-null  int64 
 24  Division       26410 non-null  int64 
dtypes: int64(5), object(20)
memory usage: 5.0+ MB
df_boston_raw.isna().sum()
Unnamed: 0           0
Bib                  0
Name                 0
Age                  0
M/F                  0
City                 0
State             3595
Country              0
Citizen          25156
Unnamed: 9       26319
5K                   0
10K                  0
15K                  0
20K                  0
Half                 0
25K                  0
30K                  0
35K                  0
40K                  0
Pace                 0
Proj Time            0
Official Time        0
Overall              0
Gender               0
Division             0
dtype: int64
df_moscow_full_results.shape
(21662, 19)
df_moscow_split_results.shape
(104438, 9)
df_moscow_full_results.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21662 entries, 0 to 21661
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   bib                   21662 non-null  int64  
 1   finish_time_sec       21385 non-null  float64
 2   finish_time_result    21385 non-null  object 
 3   race                  21662 non-null  object 
 4   pace_sec              21385 non-null  float64
 5   pace(minpkm)          21385 non-null  object 
 6   pace(kmph)            21385 non-null  object 
 7   half_pace_sec         21380 non-null  float64
 8   half_pace(minpkm)     21380 non-null  object 
 9   half_pace(kmph)       21385 non-null  object 
 10  gender_en             21662 non-null  object 
 11  age                   21662 non-null  int64  
 12  name_en               21662 non-null  object 
 13  location_city_ru      21662 non-null  object 
 14  location_city_en      21662 non-null  object 
 15  country_code_alpha_3  21662 non-null  object 
 16  flag_DNF              21662 non-null  int64  
 17  flag_all_split_exist  21662 non-null  int64  
 18  race_uniform_index    21289 non-null  float64
dtypes: float64(4), int64(4), object(11)
memory usage: 3.1+ MB
df_moscow_full_results.isna().sum()
bib                       0
finish_time_sec         277
finish_time_result      277
race                      0
pace_sec                277
pace(minpkm)            277
pace(kmph)              277
half_pace_sec           282
half_pace(minpkm)       282
half_pace(kmph)         277
gender_en                 0
age                       0
name_en                   0
location_city_ru          0
location_city_en          0
country_code_alpha_3      0
flag_DNF                  0
flag_all_split_exist      0
race_uniform_index      373
dtype: int64

Drop the unused columns and standardised the column names by renaming as per the both datasets in Boston Dataframe

df_boston_raw.drop(columns=['Unnamed: 0','Citizen','Unnamed: 9','Proj Time'],axis=1,inplace=True)
df_boston_raw.rename(columns={'Bib':'BibNo','M/F':'Gender','Pace':'Pace_minpermile','Official Time':'Finish_Time',\
                   'Overall':'Overall_ranking','Gender':'Gender_ranking','Division':'Division_ranking'},inplace=True)

Replaced Male and Female with values 0 and 1 in Boston Dataframe

df_boston_raw.replace(to_replace='M', value=1.0, inplace=True)
df_boston_raw.replace(to_replace='F', value=0.0, inplace=True)
df_boston_raw.head()
BibNo Name Age Gender City State Country 5K 10K 15K 20K Half 25K 30K 35K 40K Pace_minpermile Finish_Time Overall_ranking Gender_ranking Division_ranking
0 11 Kirui, Geoffrey 24 1.0 Keringet NaN KEN 0:15:25 0:30:28 0:45:44 1:01:15 1:04:35 1:16:59 1:33:01 1:48:19 2:02:53 0:04:57 2:09:37 1 1 1
1 17 Rupp, Galen 30 1.0 Portland OR USA 0:15:24 0:30:27 0:45:44 1:01:15 1:04:35 1:16:59 1:33:01 1:48:19 2:03:14 0:04:58 2:09:58 2 2 2
2 23 Osako, Suguru 25 1.0 Machida-City NaN JPN 0:15:25 0:30:29 0:45:44 1:01:16 1:04:36 1:17:00 1:33:01 1:48:31 2:03:38 0:04:59 2:10:28 3 3 3
3 21 Biwott, Shadrack 32 1.0 Mammoth Lakes CA USA 0:15:25 0:30:29 0:45:44 1:01:19 1:04:45 1:17:00 1:33:01 1:48:58 2:04:35 0:05:03 2:12:08 4 4 4
4 9 Chebet, Wilson 31 1.0 Marakwet NaN KEN 0:15:25 0:30:28 0:45:44 1:01:15 1:04:35 1:16:59 1:33:01 1:48:41 2:05:00 0:05:04 2:12:35 5 5 5

Drop the unused columns and standardised the column names by renaming as per the both datasets in Moscow Dataframe

df_moscow_full_results.drop(['finish_time_result','location_city_ru','flag_all_split_exist',\
                      'race','pace(kmph)','half_pace(kmph)','location_city_ru','flag_DNF',\
                      'flag_all_split_exist','race_uniform_index'],axis=1,inplace=True)
df_moscow_full_results.rename(columns ={'bib':'BibNo','pace_sec':'Pace_Sec','finish_time_sec':'Finish_Time_Sec',\
                    'pace(minpkm)':'Pace_Minpkm','half_pace_sec':'Half_Pace_Sec',\
                    'half_pace(minpkm)':'Half_Pace_Minpkm','gender_en':'Gender',\
                    'age':'Age','name_en':'Name','location_city_en':'City','country_code_alpha_3':'Country'},inplace=True)
df_moscow_split_results.drop(['split_pace(minpkm)','split_pace(kmph)','split_uniform_index','split'],axis=1,inplace=True)
df_moscow_split_results.head()
bib split_name split_time_sec split_time_result split_pace_sec
0 1 5 km 1057 17min 37sec 211.400000
1 1 10 km 2112 35min 12sec 211.000000
2 1 15 km 3149 52min 29sec 207.400000
3 1 Half marathon 4395 1h 13min 15sec 204.346043
4 1 25 km 5258 1h 27min 38sec 221.140295
df_moscow_split_results.rename (columns={'bib':'BibNo','split_name':'Split_Name','split_time_sec':'Split_Time_Sec',\
                    'split_time_result':'Split_Time_Result','split_pace_sec':'Split_Pace_Sec'},inplace=True)

By using pivot table to make split_name values as column in Moscow_Split_Results and Moscow_Full_Results Dataframe

df_moscow_split_results['Marathon_Split'] = ((df_moscow_split_results.Split_Name == ('10 km'))\
                                             |(df_moscow_split_results.Split_Name == ('Half marathon'))\
                                             |(df_moscow_split_results.Split_Name == ('Marathon')))
df_temp_pivot = df_moscow_split_results.pivot_table(index='BibNo', columns='Split_Name', values='Split_Time_Sec')
df_temp_pivot.head()
Split_Name 10 km 15 km 25 km 30 km 35 km 40 km 5 km Half marathon Marathon
BibNo
1 2112.0 3149.0 5258.0 6321.0 7362.0 8442.0 1057.0 4395.0 8911.0
2 2115.0 3188.0 5386.0 6508.0 7608.0 8784.0 1059.0 4476.0 9308.0
3 1899.0 2867.0 4796.0 5786.0 6632.0 7703.0 947.0 4006.0 8122.0
4 1876.0 2814.0 4752.0 5780.0 6827.0 7967.0 945.0 3943.0 8467.0
5 1958.0 2933.0 4949.0 6015.0 7080.0 8235.0 976.0 4112.0 8738.0
df_temp_pivot.drop(['15 km','25 km','30 km','35 km','40 km','5 km'],axis=1,inplace=True)
df_temp_pivot.set_index='BibNo'

Used the Outer Merge on Pivot table and Moscow_Full_Result to to get the split name values as column in Moscow_full_result

df_Moscow = df_moscow_full_results.merge(df_temp_pivot,how='outer',right_index=True,left_on='BibNo')
df_Moscow = df_moscow_full_results.merge(df_temp_pivot,how='outer',right_index=True,left_on='BibNo')
df_Moscow.head()
BibNo Finish_Time_Sec Pace_Sec Pace_Minpkm Half_Pace_Sec Half_Pace_Minpkm Gender Age Name City Country 10 km Half marathon Marathon
0 1 8911.0 211.186160 3:31 min/km 208.318521 3:28 min/km Female 30 Sardana Trofimova Yakutsk RUS 2112.0 4395.0 8911.0
1 2 9308.0 220.594857 3:40 min/km 212.157839 3:32 min/km Female 35 Tat'yana Arkhipova Cheboksary RUS 2115.0 4476.0 9308.0
2 3 8122.0 192.487261 3:12 min/km 189.880318 3:09 min/km Male 31 Stepan Kiselev Kazan' RUS 1899.0 4006.0 8122.0
3 4 8467.0 200.663586 3:20 min/km 186.894182 3:06 min/km Male 36 Dmitriy Safronov Moskva RUS 1876.0 3943.0 8467.0
4 5 8738.0 207.086148 3:27 min/km 194.904610 3:14 min/km Male 42 Grigoriy Andreev Ufa RUS 1958.0 4112.0 8738.0

Sorted the values based on Finish time seconds for overall ranking of athletes in Moscow dataframe

df_Moscow.sort_values(by='Finish_Time_Sec')
BibNo Finish_Time_Sec Pace_Sec Pace_Minpkm Half_Pace_Sec Half_Pace_Minpkm Gender Age Name City Country 10 km Half marathon Marathon
8982 17006 1728.0 172.8 2:52 min/km 175.8 2:55 min/km Male 26 Vladimir Nikitin Perm' RUS 1728.0 NaN NaN
8991 17018 1826.0 182.6 3:02 min/km 183.6 3:03 min/km Male 29 Denis Vasil'ev Sankt-Peterburg RUS 1826.0 NaN NaN
21362 32745 1834.0 183.4 3:03 min/km 185.6 3:05 min/km Male 30 Artur Burtsev Sankt-Peterburg RUS 1834.0 NaN NaN
8988 17013 1869.0 186.9 3:06 min/km 186.6 3:06 min/km Male 24 Ivan Panferov Moskva RUS 1869.0 NaN NaN
21114 32472 1869.0 186.9 3:06 min/km 188.2 3:08 min/km Male 22 Nikita Kotenev Minsk BLR 1869.0 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8954 11201 NaN NaN NaN NaN NaN Male 30 Murtuz Ramazanov Moskva RUS 4493.0 10282.0 NaN
8960 11209 NaN NaN NaN NaN NaN Female 37 Ekaterina Andreeva Moskva RUS NaN NaN NaN
8961 11210 NaN NaN NaN NaN NaN Male 28 Sergey Nikitin Moskva RUS NaN NaN NaN
8973 15013 NaN NaN NaN NaN NaN Male 32 Mikhail Maksimov Sankt-Peterburg RUS 1957.0 4086.0 NaN
16208 26352 NaN NaN NaN NaN NaN Female 29 Lyudmila Tret'yakova Moskva RUS NaN NaN NaN

21662 rows × 14 columns

df_Moscow['Marathon'].isna().sum()
12960

Drop the NAN values Moscow Dataframe from Marathon column which shows that person has not completed the marathon

df_Moscow.dropna(how='any',subset=['Marathon'],inplace=True)
df_new_Moscow= df_Moscow.sort_values(by='Finish_Time_Sec')
df_new_Moscow.head()
BibNo Finish_Time_Sec Pace_Sec Pace_Minpkm Half_Pace_Sec Half_Pace_Minpkm Gender Age Name City Country 10 km Half marathon Marathon
2 3 8122.0 192.487261 3:12 min/km 189.880318 3:09 min/km Male 31 Stepan Kiselev Kazan' RUS 1899.0 4006.0 8122.0
10 12 8161.0 193.411542 3:13 min/km 189.738121 3:09 min/km Male 32 Aleksey Troshkin Saransk RUS 1905.0 4003.0 8161.0
8 10 8394.0 198.933523 3:18 min/km 188.268752 3:08 min/km Male 27 Artem Aplachkin Barnaul RUS 1879.0 3972.0 8394.0
8968 15008 8416.0 199.454912 3:19 min/km 195.094205 3:15 min/km Male 29 Sergey Popov Voronezh RUS 1960.0 4116.0 8416.0
3 4 8467.0 200.663586 3:20 min/km 186.894182 3:06 min/km Male 36 Dmitriy Safronov Moskva RUS 1876.0 3943.0 8467.0

Added the new column overall ranking calculated based on Finish time seconds in Moscow dataframe

df_new_Moscow = df_new_Moscow.reset_index()
df_new_Moscow['Overall_Ranking'] = df_new_Moscow.index + 1
df_new_Moscow
index BibNo Finish_Time_Sec Pace_Sec Pace_Minpkm Half_Pace_Sec Half_Pace_Minpkm Gender Age Name City Country 10 km Half marathon Marathon Overall_Ranking
0 2 3 8122.0 192.487261 3:12 min/km 189.880318 3:09 min/km Male 31 Stepan Kiselev Kazan' RUS 1899.0 4006.0 8122.0 1
1 10 12 8161.0 193.411542 3:13 min/km 189.738121 3:09 min/km Male 32 Aleksey Troshkin Saransk RUS 1905.0 4003.0 8161.0 2
2 8 10 8394.0 198.933523 3:18 min/km 188.268752 3:08 min/km Male 27 Artem Aplachkin Barnaul RUS 1879.0 3972.0 8394.0 3
3 8968 15008 8416.0 199.454912 3:19 min/km 195.094205 3:15 min/km Male 29 Sergey Popov Voronezh RUS 1960.0 4116.0 8416.0 4
4 3 4 8467.0 200.663586 3:20 min/km 186.894182 3:06 min/km Male 36 Dmitriy Safronov Moskva RUS 1876.0 3943.0 8467.0 5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8697 1297 1818 22102.0 523.806138 8:43 min/km 492.617609 8:12 min/km Male 64 Songkran Amornsaengthong Moskva RUS 4984.0 10393.0 22102.0 8698
8698 8578 10770 22178.0 525.607299 8:45 min/km 559.450172 9:19 min/km Male 36 Aleksey Popov Moskva RUS 5956.0 11803.0 22178.0 8699
8699 3199 4301 22183.0 525.725797 8:45 min/km 503.471975 8:23 min/km Male 26 Danila Malykhin Moskva RUS 4862.0 10622.0 22183.0 8700
8700 6340 8204 22291.0 528.285342 8:48 min/km 383.315559 6:23 min/km Male 63 Chi_Wan Yu Hong Kong HKG 3704.0 8087.0 22291.0 8701
8701 2526 3462 22672.0 537.314848 8:57 min/km 493.518189 8:13 min/km Male 25 Artem Prasolov Moskva RUS 4607.0 10412.0 22672.0 8702

8702 rows × 16 columns

df_new_Moscow.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8702 entries, 0 to 8701
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   index             8702 non-null   int64  
 1   BibNo             8702 non-null   int64  
 2   Finish_Time_Sec   8702 non-null   float64
 3   Pace_Sec          8702 non-null   float64
 4   Pace_Minpkm       8702 non-null   object 
 5   Half_Pace_Sec     8702 non-null   float64
 6   Half_Pace_Minpkm  8702 non-null   object 
 7   Gender            8702 non-null   object 
 8   Age               8702 non-null   int64  
 9   Name              8702 non-null   object 
 10  City              8702 non-null   object 
 11  Country           8702 non-null   object 
 12  10 km             8701 non-null   float64
 13  Half marathon     8702 non-null   float64
 14  Marathon          8702 non-null   float64
 15  Overall_Ranking   8702 non-null   int64  
dtypes: float64(6), int64(4), object(6)
memory usage: 1.1+ MB
df_new_Moscow.drop(columns=['index'],axis=1,inplace=True)
df_new_Moscow.head(20)
BibNo Finish_Time_Sec Pace_Sec Pace_Minpkm Half_Pace_Sec Half_Pace_Minpkm Gender Age Name City Country 10 km Half marathon Marathon Overall_Ranking
0 3 8122.0 192.487261 3:12 min/km 189.880318 3:09 min/km Male 31 Stepan Kiselev Kazan' RUS 1899.0 4006.0 8122.0 1
1 12 8161.0 193.411542 3:13 min/km 189.738121 3:09 min/km Male 32 Aleksey Troshkin Saransk RUS 1905.0 4003.0 8161.0 2
2 10 8394.0 198.933523 3:18 min/km 188.268752 3:08 min/km Male 27 Artem Aplachkin Barnaul RUS 1879.0 3972.0 8394.0 3
3 15008 8416.0 199.454912 3:19 min/km 195.094205 3:15 min/km Male 29 Sergey Popov Voronezh RUS 1960.0 4116.0 8416.0 4
4 4 8467.0 200.663586 3:20 min/km 186.894182 3:06 min/km Male 36 Dmitriy Safronov Moskva RUS 1876.0 3943.0 8467.0 5
5 7 8513.0 201.753762 3:21 min/km 194.478019 3:14 min/km Male 33 Mikhail Kul'kov Khanty-Mansiysk RUS 1959.0 4103.0 8513.0 6
6 14 8601.0 203.839317 3:23 min/km 197.416755 3:17 min/km Male 31 Viktor Ugarov Moskva RUS 1976.0 4165.0 8601.0 7
7 8 8670.0 205.474582 3:25 min/km 195.094205 3:15 min/km Male 30 Oleg Grigor'ev Moskva RUS 1958.0 4116.0 8670.0 8
8 10522 8690.0 205.948572 3:25 min/km 201.493068 3:21 min/km Male 26 Aleksey Shirshov Chkalov RUS 1978.0 4251.0 8690.0 9
9 5 8738.0 207.086148 3:27 min/km 194.904610 3:14 min/km Male 42 Grigoriy Andreev Ufa RUS 1958.0 4112.0 8738.0 10
10 33 8800.0 208.555516 3:28 min/km 192.250267 3:12 min/km Male 27 Iskander Yadgarov Moskva RUS 1943.0 4056.0 8800.0 11
11 15010 8808.0 208.745112 3:28 min/km 199.597109 3:19 min/km Male 32 Andrey Smirnov Perm' RUS 1972.0 4211.0 8808.0 12
12 13 8852.0 209.787890 3:29 min/km 208.223723 3:28 min/km Male 25 Mikhail Zvyagintsev Biryuch RUS 2111.0 4393.0 8852.0 13
13 8422 8856.0 209.882688 3:29 min/km 208.223723 3:28 min/km Male 43 Andrey Klinov Ryazan' RUS 2110.0 4393.0 8856.0 14
14 11 8856.0 209.882688 3:29 min/km 201.540467 3:21 min/km Male 28 Aleksandr Krotovich Sankt-Peterburg RUS 1996.0 4252.0 8856.0 15
15 10709 8873.0 210.285579 3:30 min/km 208.223723 3:28 min/km Male 38 Yipeng Li Beijing CHN 2111.0 4393.0 8873.0 16
16 15006 8890.0 210.688470 3:30 min/km 208.271122 3:28 min/km Male 34 Denis Korablev Salekhard RUS 2111.0 4394.0 8890.0 17
17 1 8911.0 211.186160 3:31 min/km 208.318521 3:28 min/km Female 30 Sardana Trofimova Yakutsk RUS 2112.0 4395.0 8911.0 18
18 22 8912.0 211.209859 3:31 min/km 208.176324 3:28 min/km Male 35 Renat Kashapov Kazan' RUS 2110.0 4392.0 8912.0 19
19 15012 8952.0 212.157839 3:32 min/km 208.271122 3:28 min/km Male 36 Dmitriy Shilkin Saratov RUS 2111.0 4394.0 8952.0 20
df_Boston=df_boston_raw.copy()
df_Boston.head()
BibNo Name Age Gender City State Country 5K 10K 15K 20K Half 25K 30K 35K 40K Pace_minpermile Finish_Time Overall_ranking Gender_ranking Division_ranking
0 11 Kirui, Geoffrey 24 1.0 Keringet NaN KEN 0:15:25 0:30:28 0:45:44 1:01:15 1:04:35 1:16:59 1:33:01 1:48:19 2:02:53 0:04:57 2:09:37 1 1 1
1 17 Rupp, Galen 30 1.0 Portland OR USA 0:15:24 0:30:27 0:45:44 1:01:15 1:04:35 1:16:59 1:33:01 1:48:19 2:03:14 0:04:58 2:09:58 2 2 2
2 23 Osako, Suguru 25 1.0 Machida-City NaN JPN 0:15:25 0:30:29 0:45:44 1:01:16 1:04:36 1:17:00 1:33:01 1:48:31 2:03:38 0:04:59 2:10:28 3 3 3
3 21 Biwott, Shadrack 32 1.0 Mammoth Lakes CA USA 0:15:25 0:30:29 0:45:44 1:01:19 1:04:45 1:17:00 1:33:01 1:48:58 2:04:35 0:05:03 2:12:08 4 4 4
4 9 Chebet, Wilson 31 1.0 Marakwet NaN KEN 0:15:25 0:30:28 0:45:44 1:01:15 1:04:35 1:16:59 1:33:01 1:48:41 2:05:00 0:05:04 2:12:35 5 5 5

Converted the finish time from Hour, Min & Sec format to seconds only to make it Final 10K, Half, 40k and Finish Time in Boston dataframe

df_Boston['FTime_10K_hour']= df_Boston['10K'].str.split(':').str[0]
df_Boston['FTime_10K_min']= df_Boston['10K'].str.split(':').str[1]
df_Boston['FTime_10K_sec']= df_Boston['10K'].str.split(':',n=2).str[-1]
df_Boston.FTime_10K_sec.fillna(0,inplace=True)
df_Boston.FTime_10K_min.fillna(0,inplace=True)
df_Boston.FTime_10K_hour.fillna(0,inplace=True)
df_Boston.FTime_10K_sec.replace('-',0,inplace=True)
df_Boston.FTime_10K_hour.replace('-',0,inplace=True)
final1=df_Boston['FTime_10K_hour'].astype(int) *3600
final2=(df_Boston['FTime_10K_min'].astype(int)) * 60 
final3=(df_Boston['FTime_10K_sec'].astype(int))
df_Boston['Final_10k_sec'] = final1 + final2 + final3
df_Boston.drop(columns =['FTime_10K_hour','FTime_10K_min','FTime_10K_sec'],inplace =True)
df_Boston['FTime_Half_hour']= df_Boston['Half'].str.split(':').str[0]
df_Boston['FTime_Half_min']= df_Boston['Half'].str.split(':').str[1]
df_Boston['FTime_Half_sec']= df_Boston['Half'].str.split(':',n=2).str[-1]
df_Boston.FTime_Half_sec.fillna(0,inplace=True)
df_Boston.FTime_Half_min.fillna(0,inplace=True)
df_Boston.FTime_Half_hour.fillna(0,inplace=True)
df_Boston.FTime_Half_sec.replace('-',0,inplace=True)
df_Boston.FTime_Half_hour.replace('-',0,inplace=True)
final1=df_Boston['FTime_Half_hour'].astype(int) *3600
final2=(df_Boston['FTime_Half_min'].astype(int)) * 60 
final3=(df_Boston['FTime_Half_sec'].astype(int))
df_Boston['Final_Half_sec'] = final1 + final2 + final3
df_Boston.drop(columns =['FTime_Half_hour','FTime_Half_min','FTime_Half_sec'],inplace =True)
df_Boston['FTime_40K_hour']= df_Boston['40K'].str.split(':').str[0]
df_Boston['FTime_40K_min']= df_Boston['40K'].str.split(':').str[1]
df_Boston['FTime_40K_sec']= df_Boston['40K'].str.split(':',n=2).str[-1] #To understand n=2 used
df_Boston.FTime_40K_sec.fillna(0,inplace=True)
df_Boston.FTime_40K_min.fillna(0,inplace=True)
df_Boston.FTime_40K_hour.fillna(0,inplace=True)
df_Boston.FTime_40K_sec.replace('-',0,inplace=True)
df_Boston.FTime_40K_hour.replace('-',0,inplace=True)
final1=df_Boston['FTime_40K_hour'].astype(int) *3600
final2=(df_Boston['FTime_40K_min'].astype(int)) * 60 
final3=(df_Boston['FTime_40K_sec'].astype(int))
df_Boston['Final_40k_sec'] = final1 + final2 + final3
df_Boston.drop(columns =['FTime_40K_hour','FTime_40K_min','FTime_40K_sec'],inplace =True)
df_Boston['FTime_hour']= df_Boston['Finish_Time'].str.split(':').str[0]
df_Boston['FTime_min']= df_Boston['Finish_Time'].str.split(':').str[1]
df_Boston['FTime_sec']= df_Boston['Finish_Time'].str.split(':',n=2).str[-1] #To understand n=2 used
df_Boston.FTime_sec.fillna(0,inplace=True)
df_Boston.FTime_min.fillna(0,inplace=True)
df_Boston.FTime_hour.fillna(0,inplace=True)
df_Boston.FTime_sec.replace('-',0,inplace=True)
df_Boston.FTime_hour.replace('-',0,inplace=True)
final1=df_Boston['FTime_hour'].astype(int) *3600
final2=(df_Boston['FTime_min'].astype(int)) * 60 
final3=(df_Boston['FTime_sec'].astype(int))
df_Boston['FinishTime_sec'] = final1 + final2 + final3
df_Boston.drop(columns =['FTime_hour','FTime_min','FTime_sec'],inplace =True)
df_Boston.drop(columns=['5K','15K','20K','25K','30K','35K'],inplace=True)
df_Boston.head()
BibNo Name Age Gender City State Country 10K Half 40K Pace_minpermile Finish_Time Overall_ranking Gender_ranking Division_ranking Final_10k_sec Final_Half_sec Final_40k_sec FinishTime_sec
0 11 Kirui, Geoffrey 24 1.0 Keringet NaN KEN 0:30:28 1:04:35 2:02:53 0:04:57 2:09:37 1 1 1 1828 3875 7373 7777
1 17 Rupp, Galen 30 1.0 Portland OR USA 0:30:27 1:04:35 2:03:14 0:04:58 2:09:58 2 2 2 1827 3875 7394 7798
2 23 Osako, Suguru 25 1.0 Machida-City NaN JPN 0:30:29 1:04:36 2:03:38 0:04:59 2:10:28 3 3 3 1829 3876 7418 7828
3 21 Biwott, Shadrack 32 1.0 Mammoth Lakes CA USA 0:30:29 1:04:45 2:04:35 0:05:03 2:12:08 4 4 4 1829 3885 7475 7928
4 9 Chebet, Wilson 31 1.0 Marakwet NaN KEN 0:30:28 1:04:35 2:05:00 0:05:04 2:12:35 5 5 5 1828 3875 7500 7955
df_Boston.isna().sum()
BibNo                  0
Name                   0
Age                    0
Gender                 0
City                   0
State               3595
Country                0
10K                    0
Half                   0
40K                    0
Pace_minpermile        0
Finish_Time            0
Overall_ranking        0
Gender_ranking         0
Division_ranking       0
Final_10k_sec          0
Final_Half_sec         0
Final_40k_sec          0
FinishTime_sec         0
dtype: int64
df_Boston.shape
(26410, 19)
len(df_Boston.State)
26410

Creating merged dataset for Boston and Moscow, keeping only relevant columns and sorting them by overall ranking.

df_Boston_merge=df_Boston.copy()
df_Moscow_merge=df_new_Moscow.copy()

Created isBoston column in Boston dataframe assigned value 1 which means after merging data row with value 1.0 are from Boston Marathon

df_Boston_merge['isBoston'] = '1.0'
df_Moscow_merge.drop(columns=['Finish_Time_Sec'],inplace=True)
df_Moscow_merge.drop(columns=['BibNo'],axis=1,inplace=True)
df_Boston_merge.drop(columns=['BibNo'],axis=1,inplace=True)
df_Moscow_merge.rename(columns={'10 km':'Final_10k_sec', 'Half marathon':'Final_Half_sec', \
                                'Marathon':'FinishTime_sec','Overall_Ranking':'Overall_ranking'}, inplace=True)
df_Moscow_merge.replace(to_replace='Male', value=1.0, inplace=True)
df_Moscow_merge.replace(to_replace='Female', value=0.0, inplace=True)

Created isBoston column in Moscow dataframe assigned value 0 which means after merging data row with value 0.0 are from Moscow Marathon

df_Moscow_merge['isBoston'] = '0.0'

Drop the unused columns from Boston and Moscow

df_Moscow_merge.drop(columns=['Pace_Sec', 'Pace_Minpkm', 'Half_Pace_Sec', 'Half_Pace_Minpkm'], axis=1, inplace=True)
df_Moscow_merge.head(1)
Gender Age Name City Country Final_10k_sec Final_Half_sec FinishTime_sec Overall_ranking isBoston
0 1.0 31 Stepan Kiselev Kazan' RUS 1899.0 4006.0 8122.0 1 0.0
df_Boston_merge.drop(columns=['Final_40k_sec', 'State', '10K', 'Half', '40K','Pace_minpermile','Finish_Time','Gender_ranking','Division_ranking'], axis=1, inplace=True)
df_Boston_merge.head(1)
Name Age Gender City Country Overall_ranking Final_10k_sec Final_Half_sec FinishTime_sec isBoston
0 Kirui, Geoffrey 24 1.0 Keringet KEN 1 1828 3875 7777 1.0

Merge Boston and Moscow dataframe to get the insite about each athlete and sort them based on overall ranking

df_merged_bm = df_Boston_merge.merge(df_Moscow_merge,how='outer', left_on=['Name', 'City', 'Overall_ranking', 'FinishTime_sec', 'Final_Half_sec', 'Final_10k_sec', 'Country', 'Gender', 'Age', 'isBoston'],\
                                     right_on=['Name', 'City', 'Overall_ranking', 'FinishTime_sec', 'Final_Half_sec', 'Final_10k_sec', 'Country', 'Gender', 'Age', 'isBoston'])
df_merged_bm = df_merged_bm.sort_values(by='Overall_ranking')
df_merged_bm.head(1)
Name Age Gender City Country Overall_ranking Final_10k_sec Final_Half_sec FinishTime_sec isBoston
0 Kirui, Geoffrey 24 1.0 Keringet KEN 1 1828.0 3875.0 7777.0 1.0

Interesting Finding 1 Summary

  • Success rate of a country - ratio of number of participants of a country in top 100 / total number of participants of that country.
  • Summary - Amongst 91 participating countries, Kenya has a success ratio of 100% in top 100 runners in Boston Marathon.
  • Summary - Amongst 53 participating countries, Thailand has a success ratio of 70% in top 1000 runners in Moscow Marathon.

Get the total number of countries participating in the run.

df_country = df_Boston.groupby('Country').size()
len(df_country)
91

Find the countries of the runners who have finished in top 100.

df_top100_country=df_Boston[:100].groupby('Country').size()
len(df_top100_country)
14

Calculate the success ratio which is the number of runners in top 100 divide by the total number of runners from that country.

df_Boston_countries_success_rate = ((df_top100_country/df_country)* 100).nlargest(14)  # 14 different countries in the top 100 as shown above.
df_Boston_countries_success_rate
Country
BDI    100.000000
BRN    100.000000
KEN    100.000000
ZIM    100.000000
ETH     80.000000
JPN      1.764706
NED      1.282051
IRL      1.149425
MEX      0.701754
ITA      0.606061
BRA      0.487805
USA      0.338983
GBR      0.235294
CAN      0.213904
dtype: float64
df_country['BDI'] 
1
df_country['BRN'] 
1
df_country['KEN']  
8
df_country['ZIM']  
1
df_country['ETH']
5

Plot of top 100 finishers countries.

sns.countplot(y='Country',data =df_Boston[:100])   
<matplotlib.axes._subplots.AxesSubplot at 0xfef4b00>

png

Boston_success_df = df_Boston_countries_success_rate.to_frame().reset_index()
Boston_success_df.rename(columns={0:'success_rate'}, inplace=True)
sns.catplot(x='success_rate',y='Country', data=Boston_success_df,aspect=3,kind='bar')
<seaborn.axisgrid.FacetGrid at 0x10158080>

png

Eliminate the 3 countries - BDI, BRN and ZIM since they have only 1 participant. Plot for ETH and KEN. Show the success ratio plot as well. Show values of each bar in the graph.

Doing the same analysis for Moscow and get the total number of countries participating in the run

df_country_moscow=df_Moscow.groupby('Country').size()
df_top1000_country_moscow=df_Moscow[:1000].groupby('Country').size()
df_top1000_country_moscow
Country
CHN      1
HKG     10
RUS    903
THA     86
dtype: int64
success_rate_by_country_moscow =((df_top1000_country_moscow/df_country_moscow)* 100).nlargest(4)
success_rate_by_country_moscow
Country
THA    69.354839
HKG    34.482759
RUS    11.134402
CHN     1.960784
dtype: float64

Plot of top 1000 finishers by countries

sns.countplot(y='Country',data =df_Moscow[:1000])   
<matplotlib.axes._subplots.AxesSubplot at 0xf2aa160>

png

Moscow_success_df = success_rate_by_country_moscow.to_frame().reset_index()
Moscow_success_df.rename(columns={0:'success_rate'}, inplace=True)
sns.catplot(x='success_rate',y='Country', data=Moscow_success_df,aspect=3,kind='bar')
<seaborn.axisgrid.FacetGrid at 0x10196320>

png

Doing the same analysis for merged data. This will show success rate for the combined two runs.

df_merged_bm_country=df_merged_bm.groupby('Country').size()
---------------------------------------------------------------------------

NameError                                 Traceback (most recent call last)

<ipython-input-1-b3695a4b09ad> in <module>
----> 1 df_merged_bm_country=df_merged_bm.groupby('Country').size()


NameError: name 'df_merged_bm' is not defined
df_merged_bm_country_500=df_merged_bm[:5000].groupby('Country').size()
df_merged_bm_success_rate =((df_merged_bm_country_500/df_merged_bm_country)* 100).nlargest(20) #expected finding Kenya has highest success rate
df_merged_bm_success_rate
Country
BDI    100.000000
BRN    100.000000
ETH    100.000000
IRN    100.000000
KEN    100.000000
MKD    100.000000
ZIM    100.000000
CHE     60.000000
LVA     60.000000
EGY     50.000000
PAN     50.000000
UZB     50.000000
DNK     42.857143
UKR     37.500000
BLR     35.555556
BER     33.333333
CAY     33.333333
CRC     33.333333
DOM     33.333333
GUA     33.333333
dtype: float64
df_merged_bm_success_rate_df = df_merged_bm_success_rate.to_frame().reset_index()
df_merged_bm_success_rate_df.rename(columns={0:'success_rate'}, inplace=True)
#df_merged_bm_success_rate_df.head()
sns.catplot(x='success_rate',y='Country', data=df_merged_bm_success_rate_df,aspect=2,kind='bar')
<seaborn.axisgrid.FacetGrid at 0x10184a20>

png

df_merged_bm_country = df_merged_bm.groupby('Country').count()
df_merged_bm_country[df_merged_bm_country > 100]
Name Age Gender City Overall_ranking Final_10k_sec Final_Half_sec FinishTime_sec isBoston
Country
- NaN NaN NaN NaN NaN NaN NaN NaN NaN
ALG NaN NaN NaN NaN NaN NaN NaN NaN NaN
AND NaN NaN NaN NaN NaN NaN NaN NaN NaN
ARE NaN NaN NaN NaN NaN NaN NaN NaN NaN
ARG NaN NaN NaN NaN NaN NaN NaN NaN NaN
AUS 195.0 195.0 195.0 195.0 195.0 195.0 195.0 195.0 195.0
AUT NaN NaN NaN NaN NaN NaN NaN NaN NaN
AZE NaN NaN NaN NaN NaN NaN NaN NaN NaN
BAR NaN NaN NaN NaN NaN NaN NaN NaN NaN
BDI NaN NaN NaN NaN NaN NaN NaN NaN NaN
BEL NaN NaN NaN NaN NaN NaN NaN NaN NaN
BER NaN NaN NaN NaN NaN NaN NaN NaN NaN
BIH NaN NaN NaN NaN NaN NaN NaN NaN NaN
BLR NaN NaN NaN NaN NaN NaN NaN NaN NaN
BRA 206.0 206.0 206.0 206.0 206.0 206.0 206.0 206.0 206.0
BRN NaN NaN NaN NaN NaN NaN NaN NaN NaN
BUL NaN NaN NaN NaN NaN NaN NaN NaN NaN
CAN 1871.0 1871.0 1871.0 1871.0 1871.0 1871.0 1871.0 1871.0 1871.0
CAY NaN NaN NaN NaN NaN NaN NaN NaN NaN
CHE NaN NaN NaN NaN NaN NaN NaN NaN NaN
CHI NaN NaN NaN NaN NaN NaN NaN NaN NaN
CHN 293.0 293.0 293.0 293.0 293.0 293.0 293.0 293.0 293.0
COL NaN NaN NaN NaN NaN NaN NaN NaN NaN
CRC NaN NaN NaN NaN NaN NaN NaN NaN NaN
CRO NaN NaN NaN NaN NaN NaN NaN NaN NaN
CZE NaN NaN NaN NaN NaN NaN NaN NaN NaN
DEN NaN NaN NaN NaN NaN NaN NaN NaN NaN
DEU NaN NaN NaN NaN NaN NaN NaN NaN NaN
DNK NaN NaN NaN NaN NaN NaN NaN NaN NaN
DOM NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ...
PER NaN NaN NaN NaN NaN NaN NaN NaN NaN
PHI NaN NaN NaN NaN NaN NaN NaN NaN NaN
POL NaN NaN NaN NaN NaN NaN NaN NaN NaN
POR NaN NaN NaN NaN NaN NaN NaN NaN NaN
PRT NaN NaN NaN NaN NaN NaN NaN NaN NaN
ROU NaN NaN NaN NaN NaN NaN NaN NaN NaN
RSA NaN NaN NaN NaN NaN NaN NaN NaN NaN
RUS 8139.0 8139.0 8139.0 8139.0 8139.0 8138.0 8139.0 8139.0 8139.0
SEN NaN NaN NaN NaN NaN NaN NaN NaN NaN
SIN NaN NaN NaN NaN NaN NaN NaN NaN NaN
SLO NaN NaN NaN NaN NaN NaN NaN NaN NaN
SMR NaN NaN NaN NaN NaN NaN NaN NaN NaN
SRB NaN NaN NaN NaN NaN NaN NaN NaN NaN
SUI NaN NaN NaN NaN NaN NaN NaN NaN NaN
SVK NaN NaN NaN NaN NaN NaN NaN NaN NaN
SVN NaN NaN NaN NaN NaN NaN NaN NaN NaN
SWE NaN NaN NaN NaN NaN NaN NaN NaN NaN
TCA NaN NaN NaN NaN NaN NaN NaN NaN NaN
THA 129.0 129.0 129.0 129.0 129.0 129.0 129.0 129.0 129.0
TRI NaN NaN NaN NaN NaN NaN NaN NaN NaN
TUR NaN NaN NaN NaN NaN NaN NaN NaN NaN
TWN NaN NaN NaN NaN NaN NaN NaN NaN NaN
UAE NaN NaN NaN NaN NaN NaN NaN NaN NaN
UKR NaN NaN NaN NaN NaN NaN NaN NaN NaN
URU NaN NaN NaN NaN NaN NaN NaN NaN NaN
USA 20945.0 20945.0 20945.0 20945.0 20945.0 20945.0 20945.0 20945.0 20945.0
UZB NaN NaN NaN NaN NaN NaN NaN NaN NaN
VEN NaN NaN NaN NaN NaN NaN NaN NaN NaN
ZAF NaN NaN NaN NaN NaN NaN NaN NaN NaN
ZIM NaN NaN NaN NaN NaN NaN NaN NaN NaN

113 rows × 9 columns

Show agreegated result of boston and moscow success rate in one graph. Still on different datasets.

Boston_success_df = df_Boston_countries_success_rate.to_frame().reset_index()
Boston_success_df.rename(columns={0:'success_rate'}, inplace=True)
Boston_success_df['isBoston'] = '1.0'
#Boston_success_df.head(1)
---------------------------------------------------------------------------

NameError                                 Traceback (most recent call last)

<ipython-input-2-706243f5f236> in <module>
----> 1 Boston_success_df = df_Boston_countries_success_rate.to_frame().reset_index()
      2 Boston_success_df.rename(columns={0:'success_rate'}, inplace=True)
      3 Boston_success_df['isBoston'] = '1.0'
      4 #Boston_success_df.head(1)


NameError: name 'df_Boston_countries_success_rate' is not defined
Moscow_success_df = success_rate_by_country_moscow.to_frame().reset_index()
Moscow_success_df.rename(columns={0:'success_rate'}, inplace=True)
Moscow_success_df['isBoston'] = '0.0'
#Moscow_success_df.head(1)
df_merge_success_rate = Boston_success_df.merge(Moscow_success_df,how='outer', left_on=['Country', 'success_rate', 'isBoston'],\
                                     right_on=['Country', 'success_rate', 'isBoston'])
df_merge_success_rate = df_merge_success_rate.sort_values(by='success_rate', ascending=False)
sns.catplot(x='success_rate',y='Country',hue='isBoston',data=df_merge_success_rate,aspect=3,kind='bar')
<seaborn.axisgrid.FacetGrid at 0x536c128>

png

Managerial Insights

  • To perform well in long run marathons athletes should follow the practices what Kenya and Thailand athletes have implemented. As Kenya athletes have a dominance in the olympics and most recently Thailand is seeing the hobby jogging boom.
  • Kenya and Thailand athletes are used to perform in severe conditions, so they have the capacity to manage well in Boston and Moscow Marathon
  • Kenya as a country has the most successful representation in the run. The runners from Kenya are really fast and they are amongst the winners as well. In order to make the run more competitive and atttractive sponsors and viewers, we should make sure to have good representation from Kenya and top Kenyan players. This will drive more advertising and sponsorship money to the run. Don't just focus on local talent but rather go for the bigger talent from abroad.

Interesting Finding 2 Summary

  • Summary - Negative split is not an effective nor popular strategy for Boston Marathon.
  • Background - Negative split is a very popular and common strategy where a runner runs the second half of a marathon
  • Faster than the first half.
  • import matplotlib.pyplot as plt
    df_Boston['boston_split_ratio'] = (df_Boston['FinishTime_sec'] - df_Boston['Final_Half_sec'])/(df_Boston['Final_Half_sec'])
    df_split = df_Boston[df_Boston.Final_Half_sec > 0]
    len(df_split[df_split.boston_split_ratio < 1.0])
    811
    
    len(df_Boston[df_Boston.boston_split_ratio < 1.0]) / len(df_Boston) * 100
    3.070806512684589
    

    Proves that mature people use negative split and not young.

    df_split[df_split.boston_split_ratio < 1.0].groupby(['Gender']).size() 
    Gender
    0.0    377
    1.0    434
    dtype: int64
    
    sns.kdeplot(df_split.boston_split_ratio)
    plt.xlim([0.7,1.7])
    plt.xlabel('Boston Split Ratio')
    plt.ylabel('Runner Density')
    plt.title('Split Distribution (Negative split when < 1)')
    <matplotlib.axes._subplots.AxesSubplot at 0xf291c50>
    
    
    
    
    
    
    (0.7, 1.7)
    
    
    
    
    
    
    Text(0.5, 0, 'Boston Split Ratio')
    
    
    
    
    
    
    Text(0, 0.5, 'Runner Density')
    
    
    
    
    
    
    Text(0.5, 1.0, 'Split Distribution (Negative split when < 1)')
    

    png

    plt.plot(df_split.Overall_ranking,df_split.boston_split_ratio,'o', alpha = 0.2)
    plt.ylim([0.5,3])
    plt.xlabel('Overall Ranking')
    plt.ylabel('Split Ratio')
    plt.title('Split and performance')
    [<matplotlib.lines.Line2D at 0x100b5a58>]
    
    
    
    
    
    
    (0.5, 3)
    
    
    
    
    
    
    Text(0.5, 0, 'Overall Ranking')
    
    
    
    
    
    
    Text(0, 0.5, 'Split Ratio')
    
    
    
    
    
    
    Text(0.5, 1.0, 'Split and performance')
    

    png

    Till here, we should show how many have positive split and how many have negative split. Show them by age and gender. Plot split ratio of each runner by finish time. Variance in split ratio by age? by gender?

    Derive split of top 100 from the above dataframe to optimize.

    dfBoston_100 = df_Boston[:100]
    dfBoston_100['split_ratio_100'] = (dfBoston_100['FinishTime_sec'] - dfBoston_100['Final_Half_sec'])/(dfBoston_100['Final_Half_sec'])
    df_split_100 = dfBoston_100[dfBoston_100.Final_Half_sec > 0]

    Top 10 runners have not used negative split.

    The below chart shows that successful countries didn't solely use negative split strategy.

    plt.plot(df_split_100.Overall_ranking,df_split_100.split_ratio_100,'o', alpha = 0.2)
    plt.ylim([0.8,1.4])
    plt.xlabel('Overall Rank')
    plt.ylabel('Split')
    plt.title('Split and performance') 
    [<matplotlib.lines.Line2D at 0x10113b38>]
    
    
    
    
    
    
    (0.8, 1.4)
    
    
    
    
    
    
    Text(0.5, 0, 'Overall Rank')
    
    
    
    
    
    
    Text(0, 0.5, 'Split')
    
    
    
    
    
    
    Text(0.5, 1.0, 'Split and performance')
    

    png

    # Till here, we should show how many have +ve split and how many have negative split. Show them by age and gender.
    # plot split ratio of each runner by finish time.  split ratio on y axis and finish time on x.
    # variance in split ratio by age? by gender?
    plt.plot(df_split.Overall_ranking[(df_split.Country is not 'KEN') & (df_split.Overall_ranking<100)], df_split.boston_split_ratio[(df_split.Country is not 'KEN') & (df_split.Overall_ranking<100)],'o')
    
    plt.plot(df_split.Overall_ranking[(df_split.Country == 'KEN') & (df_split.Overall_ranking<100)], df_split.boston_split_ratio[(df_split.Country == 'KEN')  & (df_split.Overall_ranking<100)],'o', color = 'r')
    plt.xlabel('Overall Rank')
    plt.ylabel('Split Ratio')
    plt.legend(['Others','Kenya'])
    [<matplotlib.lines.Line2D at 0x13988da0>]
    
    
    
    
    
    
    [<matplotlib.lines.Line2D at 0x13994438>]
    
    
    
    
    
    
    Text(0.5, 0, 'Overall Rank')
    
    
    
    
    
    
    Text(0, 0.5, 'Split Ratio')
    
    
    
    
    
    
    <matplotlib.legend.Legend at 0x13988ef0>
    

    png

    #doing same analysis for moscow.
    import matplotlib.pyplot as plt
    df_new_Moscow['moscow_split_ratio'] = (df_new_Moscow['Marathon'] - df_new_Moscow['Half marathon'])/(df_new_Moscow['Half marathon'])
    # count number of positive split and number of negative split. Plot it.
    sns.kdeplot(df_new_Moscow.moscow_split_ratio)
    plt.xlim([0.7,1.8])
    plt.xlabel('Split Ratio')
    plt.title('Split Distribution (Negative split when < 1)')
    # what does y axis represent? is it count?
    <matplotlib.axes._subplots.AxesSubplot at 0x139cc588>
    
    
    
    
    
    
    (0.7, 1.8)
    
    
    
    
    
    
    Text(0.5, 0, 'Split Ratio')
    
    
    
    
    
    
    Text(0.5, 1.0, 'Split Distribution (Negative split when < 1)')
    

    png

    plt.plot(df_new_Moscow.Overall_Ranking,df_new_Moscow.moscow_split_ratio,'o', alpha = 0.2)
    plt.ylim([0.5,2.2])
    plt.xlabel('Overall_Ranking')
    plt.ylabel('Split Ratio')
    plt.title('Split Ratio and performance')
    [<matplotlib.lines.Line2D at 0x13c8d160>]
    
    
    
    
    
    
    (0.5, 2.2)
    
    
    
    
    
    
    Text(0.5, 0, 'Overall_Ranking')
    
    
    
    
    
    
    Text(0, 0.5, 'Split Ratio')
    
    
    
    
    
    
    Text(0.5, 1.0, 'Split Ratio and performance')
    

    png

    len(df_new_Moscow[df_new_Moscow.moscow_split_ratio < 1.0])
    935
    
    len(df_new_Moscow[df_new_Moscow.moscow_split_ratio < 1.0]) / len(df_new_Moscow) * 100
    10.744656400827395
    

    Derive split of top 100 from the above df to optimize.

    dfMoscow_100 = df_new_Moscow[:100]
    dfMoscow_100['split_ratio_100'] = (dfMoscow_100['Marathon'] - dfMoscow_100['Half marathon'])/(dfMoscow_100['Half marathon'])
    df_split_moscow_100 = dfMoscow_100
    sns.kdeplot(df_split_moscow_100.split_ratio_100)
    plt.xlim([0.7,1.7])
    plt.xlabel('Split Ratio')
    plt.title('Split Distribution (Negative split when < 1)')
    <matplotlib.axes._subplots.AxesSubplot at 0x13c991d0>
    
    
    
    
    
    
    (0.7, 1.7)
    
    
    
    
    
    
    Text(0.5, 0, 'Split Ratio')
    
    
    
    
    
    
    Text(0.5, 1.0, 'Split Distribution (Negative split when < 1)')
    

    png

    plt.plot(df_split_moscow_100.Overall_Ranking,df_split_moscow_100.split_ratio_100,'o', alpha = 0.2)
    plt.ylim([0.9,1.3])
    plt.xlabel('Overall Rank')
    plt.ylabel('Split')
    plt.title('Split and performance')
    [<matplotlib.lines.Line2D at 0x14ed58d0>]
    
    
    
    
    
    
    (0.9, 1.3)
    
    
    
    
    
    
    Text(0.5, 0, 'Overall Rank')
    
    
    
    
    
    
    Text(0, 0.5, 'Split')
    
    
    
    
    
    
    Text(0.5, 1.0, 'Split and performance')
    

    png

    len(df_split_moscow_100)
    100
    

    Count number of positive split and number of negative split. Plot it.

    import matplotlib.pyplot as plt
    df_merged_bm['split_ratio'] = (df_merged_bm['FinishTime_sec'] - df_merged_bm['Final_Half_sec'])/(df_merged_bm['Final_Half_sec'])
    type(df_split.boston_split_ratio)
    pandas.core.series.Series
    
    f, ax = plt.subplots(figsize=(8, 8))
    ax = sns.kdeplot(df_split.boston_split_ratio, legend=True)
    ax = sns.kdeplot(df_new_Moscow.moscow_split_ratio, legend=True)
    #sns.lineplot(x='split_ratio',y='FinishTime_sec',hue='isBoston',data=df_Boston,aspect=2,kind='point')
    plt.xlim([0.2,2.8])
    plt.xlabel('Split Ratio')
    plt.ylabel('Runner Density')
    plt.title('Split Distribution (Negative split when < 1)')
    # what does y axis represent? is it count?
    (0.2, 2.8)
    
    
    
    
    
    
    Text(0.5, 0, 'Split Ratio')
    
    
    
    
    
    
    Text(0, 0.5, 'Runner Density')
    
    
    
    
    
    
    Text(0.5, 1.0, 'Split Distribution (Negative split when < 1)')
    

    png

    plt.plot(df_merged_bm.Overall_ranking,df_merged_bm.split_ratio,'o', alpha = 0.2)
    plt.ylim([0.5,2.2])
    plt.xlabel('Overall_Ranking')
    plt.ylabel('Split')
    plt.title('Split and performance')
    [<matplotlib.lines.Line2D at 0x14fa7be0>]
    
    
    
    
    
    
    (0.5, 2.2)
    
    
    
    
    
    
    Text(0.5, 0, 'Overall_Ranking')
    
    
    
    
    
    
    Text(0, 0.5, 'Split')
    
    
    
    
    
    
    Text(0.5, 1.0, 'Split and performance')
    

    png

    Managerial Insights

    • After doing the analysis, we observed that to perfrom well in Marathon athletes should use the even split strategy that is athlete should keep his/her pace consistent throughout the marathon instead of conserving the energy for first or second split of the marathon
    • It is advisable to athlete to practice their chosen strategy in training to help set the realistic expectation and to learn how their body is going to respond.

    Interesting Finding 3 Summary

    • Summary - Middle aged athletes are faster as compare to younger aged athletes.
    df_Boston['binned_age1']= pd.cut(df_Boston.Age,bins=[15,25,35,45,55,70])
    df_Boston[df_Boston.Gender == 1.0].Age.describe()      # male dataset details.    #REMOVE!!!
    count    14438.000000
    mean        44.772475
    std         11.533255
    min         18.000000
    25%         36.000000
    50%         45.000000
    75%         53.000000
    max         83.000000
    Name: Age, dtype: float64
    
    df_Boston[df_Boston.Gender == 0.0].Age.describe() #REMOVE!!!
    count    11972.000000
    mean        39.952974
    std         10.703554
    min         18.000000
    25%         31.000000
    50%         40.000000
    75%         48.000000
    max         84.000000
    Name: Age, dtype: float64
    

    Unexpected as middle aged person more participated as compared to young.

    df_Boston.groupby('Age').size().nlargest(5) 
    Age
    45    1127
    46     976
    40     926
    47     843
    50     815
    dtype: int64
    
    df_Boston.groupby('Gender').size()
    Gender
    0.0    11972
    1.0    14438
    dtype: int64
    
    df_Boston.groupby(['Gender','binned_age1']).size()
    Gender  binned_age1
    0.0     (15, 25]       1067
            (25, 35]       3350
            (35, 45]       3780
            (45, 55]       2823
            (55, 70]        932
    1.0     (15, 25]        691
            (25, 35]       2710
            (35, 45]       3991
            (45, 55]       4316
            (55, 70]       2623
    dtype: int64
    
    sns.countplot(x='binned_age1',hue='Gender',data=df_Boston)
    <matplotlib.axes._subplots.AxesSubplot at 0x14fe5978>
    

    png

    df_Boston['binned_age2']= pd.cut(df_Boston.Age,bins=[15,20,25,29,35,40,45,50,70])
    sns.factorplot(x='binned_age2',y='FinishTime_sec',hue='Gender',data=df_Boston,aspect=2,kind='point')
    <seaborn.axisgrid.FacetGrid at 0x14f8a240>
    

    png

    Older and mature people perform the best and are the faster marathon runner. Older and mature people also participated more than young people and are more health considerate.

    df_new_Moscow.head()
    BibNo Finish_Time_Sec Pace_Sec Pace_Minpkm Half_Pace_Sec Half_Pace_Minpkm Gender Age Name City Country 10 km Half marathon Marathon Overall_Ranking moscow_split_ratio
    0 3 8122.0 192.487261 3:12 min/km 189.880318 3:09 min/km Male 31 Stepan Kiselev Kazan' RUS 1899.0 4006.0 8122.0 1 1.027459
    1 12 8161.0 193.411542 3:13 min/km 189.738121 3:09 min/km Male 32 Aleksey Troshkin Saransk RUS 1905.0 4003.0 8161.0 2 1.038721
    2 10 8394.0 198.933523 3:18 min/km 188.268752 3:08 min/km Male 27 Artem Aplachkin Barnaul RUS 1879.0 3972.0 8394.0 3 1.113293
    3 15008 8416.0 199.454912 3:19 min/km 195.094205 3:15 min/km Male 29 Sergey Popov Voronezh RUS 1960.0 4116.0 8416.0 4 1.044704
    4 4 8467.0 200.663586 3:20 min/km 186.894182 3:06 min/km Male 36 Dmitriy Safronov Moskva RUS 1876.0 3943.0 8467.0 5 1.147350
    df_new_Moscow['binned_age1']= pd.cut(df_new_Moscow.Age,bins=[15,25,30,35,40,50,70])
    df_new_Moscow.groupby('Age').size().nlargest(10)
    Age
    31    485
    30    482
    33    468
    32    457
    34    450
    35    434
    29    408
    28    372
    36    360
    37    340
    dtype: int64
    
    df_new_Moscow.groupby(['Gender']).size()
    Gender
    Female    1491
    Male      7211
    dtype: int64
    
    df_new_Moscow.groupby(['Gender','binned_age1']).size()
    Gender  binned_age1
    Female  (15, 25]        183
            (25, 30]        403
            (30, 35]        395
            (35, 40]        238
            (40, 50]        207
            (50, 70]         63
    Male    (15, 25]        530
            (25, 30]       1401
            (30, 35]       1899
            (35, 40]       1335
            (40, 50]       1409
            (50, 70]        616
    dtype: int64
    
    sns.countplot(x='binned_age1',hue='Gender',data=df_new_Moscow)  # color is opposite for male/female as compared to other graphs.
    <matplotlib.axes._subplots.AxesSubplot at 0x16cc04a8>
    

    png

    df_new_Moscow['binned_age2']= pd.cut(df_new_Moscow.Age,bins=[20,25,30,35,40,45,50,70])
    sns.factorplot(x='binned_age2',y='Marathon',hue='Gender',data=df_new_Moscow,aspect=3,kind='point')
    <seaborn.axisgrid.FacetGrid at 0x16cc0908>
    

    png

    Analysis on combined dataset.

    df_merged_bm['binned_age1']= pd.cut(df_merged_bm.Age,bins=[15,25,30,35,40,45,50,60,70])
    df_merged_bm.groupby('Age').size().nlargest(10)
    Age
    45    1275
    35    1234
    40    1193
    46    1150
    30    1111
    36    1093
    31    1068
    33    1059
    29    1044
    41    1034
    dtype: int64
    
    df_merged_bm.groupby(['Gender']).size()
    Gender
    0.0    13463
    1.0    21649
    dtype: int64
    
    df_merged_bm.groupby(['Gender','binned_age1']).size()
    Gender  binned_age1
    0.0     (15, 25]       1250
            (25, 30]       2187
            (30, 35]       1961
            (35, 40]       2126
            (40, 45]       2026
            (45, 50]       1774
            (50, 60]       1763
            (60, 70]        354
    1.0     (15, 25]       1221
            (25, 30]       2600
            (30, 35]       3410
            (35, 40]       3175
            (40, 45]       3000
            (45, 50]       2963
            (50, 60]       3834
            (60, 70]       1318
    dtype: int64
    

    Age wise participation across boston and moscow

    sns.countplot(x='binned_age1',hue='isBoston',data=df_merged_bm)
    <matplotlib.axes._subplots.AxesSubplot at 0x13c5cf28>
    

    png

    sns.countplot(x='binned_age1',hue='isBoston',data=df_merged_bm[df_merged_bm.Gender == 1.0])
    <matplotlib.axes._subplots.AxesSubplot at 0x15239f60>
    

    png

    Age across boston and moscow for binned ages BUT only for females

    sns.countplot(x='binned_age1',hue='isBoston',data=df_merged_bm[df_merged_bm.Gender == 0.0])
    <matplotlib.axes._subplots.AxesSubplot at 0x16344cf8>
    

    png

    #df_new_Moscow[(df_new_Moscow.Age > 45) & (df_new_Moscowdd.Gender == 'Female')][['Age', 'Marathon']]

    Binned age for combined df

    df_merged_bm['binned_age']=pd.cut(df_merged_bm.Age,bins=[20,25,30,35,40,45,50,70])

    Plot to show finish time across both marathons for binned ages.

    sns.factorplot(x='binned_age',y='FinishTime_sec',hue='isBoston',data=df_merged_bm,aspect=2,kind='point')
    <seaborn.axisgrid.FacetGrid at 0x1525f668>
    

    png

    Finish time across boston and moscow for binned ages BUT only for males.

    sns.factorplot(x='binned_age',y='FinishTime_sec',hue='isBoston',data=df_merged_bm[df_merged_bm.Gender == 1.0],aspect=2,kind='point')
    <seaborn.axisgrid.FacetGrid at 0x16761da0>
    

    png

    Finish time across boston and moscow for binned ages BUT only for females.

    sns.factorplot(x='binned_age',y='FinishTime_sec',hue='isBoston',data=df_merged_bm[df_merged_bm.Gender == 0.0],aspect=2,kind='point')
    <seaborn.axisgrid.FacetGrid at 0x16e1efd0>
    

    png

    Managerial Insights

    • It is advisible to younger age athletes to focus on improving the endurance

    Decision Tree

    df_Boston.head()
    BibNo Name Age Gender City State Country 10K Half 40K Pace_minpermile Finish_Time Overall_ranking Gender_ranking Division_ranking Final_10k_sec Final_Half_sec Final_40k_sec FinishTime_sec boston_split_ratio binned_age1 binned_age2
    0 11 Kirui, Geoffrey 24 1.0 Keringet NaN KEN 0:30:28 1:04:35 2:02:53 0:04:57 2:09:37 1 1 1 1828 3875 7373 7777 1.006968 (15, 25] (20, 25]
    1 17 Rupp, Galen 30 1.0 Portland OR USA 0:30:27 1:04:35 2:03:14 0:04:58 2:09:58 2 2 2 1827 3875 7394 7798 1.012387 (25, 35] (29, 35]
    2 23 Osako, Suguru 25 1.0 Machida-City NaN JPN 0:30:29 1:04:36 2:03:38 0:04:59 2:10:28 3 3 3 1829 3876 7418 7828 1.019608 (15, 25] (20, 25]
    3 21 Biwott, Shadrack 32 1.0 Mammoth Lakes CA USA 0:30:29 1:04:45 2:04:35 0:05:03 2:12:08 4 4 4 1829 3885 7475 7928 1.040669 (25, 35] (29, 35]
    4 9 Chebet, Wilson 31 1.0 Marakwet NaN KEN 0:30:28 1:04:35 2:05:00 0:05:04 2:12:35 5 5 5 1828 3875 7500 7955 1.052903 (25, 35] (29, 35]
    dt = tree.DecisionTreeRegressor(max_depth=2)
    df_bos_dec_tree = df_Boston.copy()
    df_bos_dec_tree = pd.get_dummies(data=df_bos_dec_tree,columns=['Country'])
    X= df_bos_dec_tree.drop(columns=['BibNo','Age','City','State','10K','40K','Pace_minpermile','Name',\
                                  'Finish_Time','Overall_ranking','Gender_ranking','boston_split_ratio',\
                                     'Division_ranking','Half','Final_40k_sec','binned_age2',\
                                    'binned_age1'],axis=1)
    X.head()
    X.columns
    Gender Final_10k_sec Final_Half_sec FinishTime_sec Country_ALG Country_AND Country_ARG Country_AUS Country_AUT Country_BAR Country_BDI Country_BEL Country_BER Country_BRA Country_BRN ... Country_SRB Country_SUI Country_SVK Country_SWE Country_TCA Country_THA Country_TRI Country_TUR Country_TWN Country_UAE Country_UKR Country_URU Country_USA Country_VEN Country_ZIM
    0 1.0 1828 3875 7777 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    1 1.0 1827 3875 7798 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
    2 1.0 1829 3876 7828 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    3 1.0 1829 3885 7928 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
    4 1.0 1828 3875 7955 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

    5 rows × 95 columns

    Index(['Gender', 'Final_10k_sec', 'Final_Half_sec', 'FinishTime_sec',
           'Country_ALG', 'Country_AND', 'Country_ARG', 'Country_AUS',
           'Country_AUT', 'Country_BAR', 'Country_BDI', 'Country_BEL',
           'Country_BER', 'Country_BRA', 'Country_BRN', 'Country_BUL',
           'Country_CAN', 'Country_CAY', 'Country_CHI', 'Country_CHN',
           'Country_COL', 'Country_CRC', 'Country_CRO', 'Country_CZE',
           'Country_DEN', 'Country_DOM', 'Country_ECU', 'Country_EGY',
           'Country_ESA', 'Country_ESP', 'Country_EST', 'Country_ETH',
           'Country_FIN', 'Country_FLK', 'Country_FRA', 'Country_GBR',
           'Country_GER', 'Country_GRE', 'Country_GRN', 'Country_GUA',
           'Country_HKG', 'Country_HON', 'Country_HUN', 'Country_INA',
           'Country_IND', 'Country_IRL', 'Country_ISL', 'Country_ISR',
           'Country_ITA', 'Country_JAM', 'Country_JPN', 'Country_KEN',
           'Country_KOR', 'Country_KSA', 'Country_KUW', 'Country_LAT',
           'Country_LTU', 'Country_LUX', 'Country_MAR', 'Country_MAS',
           'Country_MEX', 'Country_MGL', 'Country_MLT', 'Country_NCA',
           'Country_NED', 'Country_NGR', 'Country_NOR', 'Country_NZL',
           'Country_PAN', 'Country_PAR', 'Country_PER', 'Country_PHI',
           'Country_POL', 'Country_POR', 'Country_ROU', 'Country_RSA',
           'Country_RUS', 'Country_SIN', 'Country_SLO', 'Country_SMR',
           'Country_SRB', 'Country_SUI', 'Country_SVK', 'Country_SWE',
           'Country_TCA', 'Country_THA', 'Country_TRI', 'Country_TUR',
           'Country_TWN', 'Country_UAE', 'Country_UKR', 'Country_URU',
           'Country_USA', 'Country_VEN', 'Country_ZIM'],
          dtype='object')
    
    Y=df_bos_dec_tree.Age
    dt.fit(X,Y)
    DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
                          max_leaf_nodes=None, min_impurity_decrease=0.0,
                          min_impurity_split=None, min_samples_leaf=1,
                          min_samples_split=2, min_weight_fraction_leaf=0.0,
                          presort=False, random_state=None, splitter='best')
    
    import os
    os.environ["PATH"] += os.pathsep + r'C:\Users\MIH\Anaconda3\Library\bin\graphviz'
    # This code will visualize a decision tree dt, trained with the attributes in X and the class labels in Y
    dt_feature_names = list(X.columns)
    dt_target_names = [str(s) for s in Y.unique()]
    tree.export_graphviz(dt, out_file='tree.dot', 
        feature_names=dt_feature_names, class_names=dt_target_names,
        filled=True)  
    graph = pydotplus.graph_from_dot_file('tree.dot')
    Image(graph.create_png())

    png

    LassoRegressor

    from sklearn import linear_model
    regLasso = linear_model.Lasso()
    regLasso.fit(X,Y)
    Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
          normalize=False, positive=False, precompute=False, random_state=None,
          selection='cyclic', tol=0.0001, warm_start=False)
    
    regLasso.coef_
    array([ 2.45946604e+00,  1.54926501e-03,  6.73233896e-04,  5.91283677e-04,
            0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
            0.00000000e+00,  0.00000000e+00, -0.00000000e+00,  0.00000000e+00,
            0.00000000e+00,  0.00000000e+00, -0.00000000e+00,  0.00000000e+00,
            0.00000000e+00, -0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
            0.00000000e+00, -0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
            0.00000000e+00,  0.00000000e+00,  0.00000000e+00, -0.00000000e+00,
            0.00000000e+00,  0.00000000e+00,  0.00000000e+00, -0.00000000e+00,
            0.00000000e+00, -0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
            0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
            0.00000000e+00, -0.00000000e+00,  0.00000000e+00, -0.00000000e+00,
            0.00000000e+00, -0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
            0.00000000e+00,  0.00000000e+00,  0.00000000e+00, -0.00000000e+00,
            0.00000000e+00,  0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
           -0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
            0.00000000e+00,  0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
            0.00000000e+00, -0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
            0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
            0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
           -0.00000000e+00,  0.00000000e+00,  0.00000000e+00, -0.00000000e+00,
           -0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
            0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
            0.00000000e+00,  0.00000000e+00, -0.00000000e+00,  0.00000000e+00,
           -0.00000000e+00,  0.00000000e+00, -0.00000000e+00])
    

    KMeans Clustering

    from sklearn.cluster import KMeans
    clu = KMeans(n_clusters=2, random_state=0)
    clu
    KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
           n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
           random_state=0, tol=0.0001, verbose=0)
    
    clu.fit(X)
    KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
           n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
           random_state=0, tol=0.0001, verbose=0)
    
    clu.labels_
    array([1, 1, 1, ..., 0, 0, 0])
    
    df2 = X.copy()
    df2['cluster']=clu.labels_
    df2.groupby('cluster').mean()
    Gender Final_10k_sec Final_Half_sec FinishTime_sec Country_ALG Country_AND Country_ARG Country_AUS Country_AUT Country_BAR Country_BDI Country_BEL Country_BER Country_BRA Country_BRN ... Country_SRB Country_SUI Country_SVK Country_SWE Country_TCA Country_THA Country_TRI Country_TUR Country_TWN Country_UAE Country_UKR Country_URU Country_USA Country_VEN Country_ZIM
    cluster
    0 0.432154 3581.235880 7843.994727 17233.904031 0.000000 0.000000 0.000586 0.004921 0.001055 0.000000 0.000000 0.001758 0.000117 0.004570 0.000000 ... 0.000000 0.003984 0.000234 0.001875 0.000000 0.000234 0.000000 0.000117 0.000937 0.000234 0.000234 0.000469 0.825521 0.000000 0.000000
    1 0.601365 2821.851253 6044.298221 12875.059297 0.000112 0.000056 0.001734 0.008335 0.000727 0.000224 0.000056 0.001510 0.000112 0.009286 0.000056 ... 0.000056 0.003524 0.000224 0.002517 0.000056 0.000168 0.000056 0.000224 0.002573 0.000280 0.000112 0.000056 0.777579 0.000336 0.000056

    2 rows × 95 columns

About

Working with Python Pandas and Matplotlib. Analyzed the marathon runners’ split finished times to determine top finishers as per the success ratio of the country. This analysis helped determine the strategies used by top finishers (Positive or Negative split) to complete the marathon effectively.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published