Boston-Moscow Marathon Analysis

Analyzed the marathon runners’ split finished times to determine top finishers as per the success ratio of the country. This analysis helped to find out the strategies used by top finishers (Positive or Negative split) to complete the marathon effectively.

Project Datasets

Their are two sets of Datasets to complete this project Boston and Moscow Marathon.

1. Boston Marathon Results 2017

marathon_results_2017.csv

Columns : {Bib ,Name, Age, M/F, City, State, Country, Citizen, 5K, 10K, 15K, 20K, Half, 25K, 30K, 35K, 40K, Pace, Proj Time, Official Time, Overall, Gender, Division}

Data : {11, Kirui, Geoffrey, 24, M, Keringet, KEN, 0:15:25, 0:30:28, 0:45:44, 1:01:15, 1:04:35, 1:16:59, 1:33:01, 1:48:19, 2:02:53, 0:04:57, - 2:09:37, 1, 1, 1}

2. Moscow Marathon Full Results 2018

1_full_results_mm_2018.csv

Columns : {Bib, finish_time_sec, finish_time_result, race, pace_sec, pace(minpkm), pace(kmph), half_pace_sec, half_pace(minpkm), half_pace(kmph), gender_en, agev name_en, location_city_ru, location_city_en, country_code_alpha_3, flag_DNF, flag_all_split_exist, race_uniform_index}

Data : {1, 8911, 2h 28min 31sec, 42.195 km, 211.1861595, 3:31 min/km 17.0 km/h, 208.3185212, 3:28 min/km, 17.3 km/h, Female, 30, Sardana Trofimova, –Ø–∫—É—Ç—Å–∫, Yakutsk, RUS, 0, 1, 0.000132899}

1_split_results_mm_2018.csv

Columns : {bib, split_name, split, split_time_sec, split_time_result, split_pace_sec, split_pace(minpkm), split_pace(kmph), split_uniform_index}

Data : {11, Kirui, Geoffrey, 24, M, Keringet, KEN, 0:15:25, 0:30:28, 0:45:44, 1:01:15, 1:04:35, 1:16:59, 1:33:01, 1:48:19, 2:02:53, 0:04:57, - 2:09:37, 1, 1, 11}

Jupyter Notebook with 3 Interesting findings

%autosave 0

Autosave disabled

# print all the outputs in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from datetime import datetime
import numpy as np
import pandas as pd
import seaborn as sns
import pandas as pd
%pylab inline

import sklearn as sk
import sklearn.tree as tree
from IPython.display import Image  
import pydotplus

Populating the interactive namespace from numpy and matplotlib


/opt/anaconda3/lib/python3.8/site-packages/IPython/core/magics/pylab.py:159: UserWarning: pylab import has clobbered these variables: ['datetime']
`%matplotlib` prevents importing * from pylab and numpy
  warn("pylab import has clobbered these variables: %s"  % clobbered +

import warnings
#suppress all future warning
warnings.filterwarnings('ignore')
#see only one time warning
#warnings.filterwarnings(action='once')

1. Data Set Description

df_boston is a dataset of all athelets in boston marathon. Each row is represent attribute of an individual athelete.
df_moscow is a dataset of all athelets in Moscow marathon. Each row is represent attribute of an individual athelete.

Important columns in df_boston and df_moscow are:

BibNo: Unique Number given to each athelet in marathon.
Age: Age of each Athelet.
Gender: Gender of each Athelet.
Country: The Country which each athelet belongs to
Finish_Time: Time taken in seconds by each athelet to complete the race
Final_10k_sec: Time taken in seconds by each athelet to complete 10k distance of the race
Final_Half_sec: Time taken in seconds by each athelet to complete half of the race
Final_40k_sec: Time taken in seconds by each athelet to complete 40k distance of the race
Overall_Ranking: Ranking with respect to Finishing time of each athelet
Gender_Ranking: Ranking with respect to Gender of Athelet.
Division_Ranking: Ranking with repect to Age group of Athelet.

df_boston_raw = pd.read_csv('marathon_results_2017.csv')

df_moscow_full_results  = pd.read_csv('1_full_results_mm_2018.csv')

df_moscow_split_results = pd.read_csv('1_split_results_mm_2018.csv')

pd.set_option('display.max_columns',30)

df_boston_raw.head()

	Unnamed: 0	Bib	Name	Age	M/F	City	State	Country	Citizen	Unnamed: 9	5K	10K	15K	20K	Half	25K	30K	35K	40K	Pace	Proj Time	Official Time	Overall	Gender	Division
0	0	11	Kirui, Geoffrey	24	M	Keringet	NaN	KEN	NaN	NaN	0:15:25	0:30:28	0:45:44	1:01:15	1:04:35	1:16:59	1:33:01	1:48:19	2:02:53	0:04:57	-	2:09:37	1	1	1
1	1	17	Rupp, Galen	30	M	Portland	OR	USA	NaN	NaN	0:15:24	0:30:27	0:45:44	1:01:15	1:04:35	1:16:59	1:33:01	1:48:19	2:03:14	0:04:58	-	2:09:58	2	2	2
2	2	23	Osako, Suguru	25	M	Machida-City	NaN	JPN	NaN	NaN	0:15:25	0:30:29	0:45:44	1:01:16	1:04:36	1:17:00	1:33:01	1:48:31	2:03:38	0:04:59	-	2:10:28	3	3	3
3	3	21	Biwott, Shadrack	32	M	Mammoth Lakes	CA	USA	NaN	NaN	0:15:25	0:30:29	0:45:44	1:01:19	1:04:45	1:17:00	1:33:01	1:48:58	2:04:35	0:05:03	-	2:12:08	4	4	4
4	4	9	Chebet, Wilson	31	M	Marakwet	NaN	KEN	NaN	NaN	0:15:25	0:30:28	0:45:44	1:01:15	1:04:35	1:16:59	1:33:01	1:48:41	2:05:00	0:05:04	-	2:12:35	5	5	5

df_moscow_full_results.head()

	bib	finish_time_sec	finish_time_result	race	pace_sec	pace(minpkm)	pace(kmph)	half_pace_sec	half_pace(minpkm)	half_pace(kmph)	gender_en	age	name_en	location_city_ru	location_city_en	country_code_alpha_3	flag_all_split_exist	race_uniform_index
0	1	8911.0	2h 28min 31sec	42.195 km	211.186160	3:31 min/km	17.0 km/h	208.318521	3:28 min/km	17.3 km/h	Female	30	Sardana Trofimova	Якутск	Yakutsk	RUS	1	0.000133
1	2	9308.0	2h 35min 08sec	42.195 km	220.594857	3:40 min/km	16.3 km/h	212.157839	3:32 min/km	17.0 km/h	Female	35	Tat'yana Arkhipova	Чебоксары	Cheboksary	RUS	1	0.002556
2	3	8122.0	2h 15min 22sec	42.195 km	192.487261	3:12 min/km	18.7 km/h	189.880318	3:09 min/km	19.0 km/h	Male	31	Stepan Kiselev	Казань	Kazan'	RUS	1	0.000397
3	4	8467.0	2h 21min 07sec	42.195 km	200.663586	3:20 min/km	17.9 km/h	186.894182	3:06 min/km	19.3 km/h	Male	36	Dmitriy Safronov	Москва	Moskva	RUS	1	0.009003
4	5	8738.0	2h 25min 38sec	42.195 km	207.086148	3:27 min/km	17.4 km/h	194.904610	3:14 min/km	18.5 km/h	Male	42	Grigoriy Andreev	Уфа	Ufa	RUS	1	0.006439

df_boston_raw.shape

(26410, 25)

df_boston_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26410 entries, 0 to 26409
Data columns (total 25 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Unnamed: 0     26410 non-null  int64 
 1   Bib            26410 non-null  object
 2   Name           26410 non-null  object
 3   Age            26410 non-null  int64 
 4   M/F            26410 non-null  object
 5   City           26410 non-null  object
 6   State          22815 non-null  object
 7   Country        26410 non-null  object
 8   Citizen        1254 non-null   object
 9   Unnamed: 9     91 non-null     object
 10  5K             26410 non-null  object
 11  10K            26410 non-null  object
 12  15K            26410 non-null  object
 13  20K            26410 non-null  object
 14  Half           26410 non-null  object
 15  25K            26410 non-null  object
 16  30K            26410 non-null  object
 17  35K            26410 non-null  object
 18  40K            26410 non-null  object
 19  Pace           26410 non-null  object
 20  Proj Time      26410 non-null  object
 21  Official Time  26410 non-null  object
 22  Overall        26410 non-null  int64 
 23  Gender         26410 non-null  int64 
 24  Division       26410 non-null  int64 
dtypes: int64(5), object(20)
memory usage: 5.0+ MB

df_boston_raw.isna().sum()

Unnamed: 0           0
Bib                  0
Name                 0
Age                  0
M/F                  0
City                 0
State             3595
Country              0
Citizen          25156
Unnamed: 9       26319
5K                   0
10K                  0
15K                  0
20K                  0
Half                 0
25K                  0
30K                  0
35K                  0
40K                  0
Pace                 0
Proj Time            0
Official Time        0
Overall              0
Gender               0
Division             0
dtype: int64

df_moscow_full_results.shape

(21662, 19)

df_moscow_split_results.shape

(104438, 9)

df_moscow_full_results.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21662 entries, 0 to 21661
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   bib                   21662 non-null  int64  
 1   finish_time_sec       21385 non-null  float64
 2   finish_time_result    21385 non-null  object 
 3   race                  21662 non-null  object 
 4   pace_sec              21385 non-null  float64
 5   pace(minpkm)          21385 non-null  object 
 6   pace(kmph)            21385 non-null  object 
 7   half_pace_sec         21380 non-null  float64
 8   half_pace(minpkm)     21380 non-null  object 
 9   half_pace(kmph)       21385 non-null  object 
 10  gender_en             21662 non-null  object 
 11  age                   21662 non-null  int64  
 12  name_en               21662 non-null  object 
 13  location_city_ru      21662 non-null  object 
 14  location_city_en      21662 non-null  object 
 15  country_code_alpha_3  21662 non-null  object 
 16  flag_DNF              21662 non-null  int64  
 17  flag_all_split_exist  21662 non-null  int64  
 18  race_uniform_index    21289 non-null  float64
dtypes: float64(4), int64(4), object(11)
memory usage: 3.1+ MB

df_moscow_full_results.isna().sum()

bib                       0
finish_time_sec         277
finish_time_result      277
race                      0
pace_sec                277
pace(minpkm)            277
pace(kmph)              277
half_pace_sec           282
half_pace(minpkm)       282
half_pace(kmph)         277
gender_en                 0
age                       0
name_en                   0
location_city_ru          0
location_city_en          0
country_code_alpha_3      0
flag_DNF                  0
flag_all_split_exist      0
race_uniform_index      373
dtype: int64

Drop the unused columns and standardised the column names by renaming as per the both datasets in Boston Dataframe

df_boston_raw.drop(columns=['Unnamed: 0','Citizen','Unnamed: 9','Proj Time'],axis=1,inplace=True)

df_boston_raw.rename(columns={'Bib':'BibNo','M/F':'Gender','Pace':'Pace_minpermile','Official Time':'Finish_Time',\
                   'Overall':'Overall_ranking','Gender':'Gender_ranking','Division':'Division_ranking'},inplace=True)

Replaced Male and Female with values 0 and 1 in Boston Dataframe

df_boston_raw.replace(to_replace='M', value=1.0, inplace=True)

df_boston_raw.replace(to_replace='F', value=0.0, inplace=True)

df_boston_raw.head()

	BibNo	Name	Age	Gender	City	State	Country	5K	10K	15K	20K	Half	25K	30K	35K	40K	Pace_minpermile	Finish_Time	Overall_ranking	Gender_ranking	Division_ranking
0	11	Kirui, Geoffrey	24	1.0	Keringet	NaN	KEN	0:15:25	0:30:28	0:45:44	1:01:15	1:04:35	1:16:59	1:33:01	1:48:19	2:02:53	0:04:57	2:09:37	1	1	1
1	17	Rupp, Galen	30	1.0	Portland	OR	USA	0:15:24	0:30:27	0:45:44	1:01:15	1:04:35	1:16:59	1:33:01	1:48:19	2:03:14	0:04:58	2:09:58	2	2	2
2	23	Osako, Suguru	25	1.0	Machida-City	NaN	JPN	0:15:25	0:30:29	0:45:44	1:01:16	1:04:36	1:17:00	1:33:01	1:48:31	2:03:38	0:04:59	2:10:28	3	3	3
3	21	Biwott, Shadrack	32	1.0	Mammoth Lakes	CA	USA	0:15:25	0:30:29	0:45:44	1:01:19	1:04:45	1:17:00	1:33:01	1:48:58	2:04:35	0:05:03	2:12:08	4	4	4
4	9	Chebet, Wilson	31	1.0	Marakwet	NaN	KEN	0:15:25	0:30:28	0:45:44	1:01:15	1:04:35	1:16:59	1:33:01	1:48:41	2:05:00	0:05:04	2:12:35	5	5	5

Drop the unused columns and standardised the column names by renaming as per the both datasets in Moscow Dataframe

df_moscow_full_results.drop(['finish_time_result','location_city_ru','flag_all_split_exist',\
                      'race','pace(kmph)','half_pace(kmph)','location_city_ru','flag_DNF',\
                      'flag_all_split_exist','race_uniform_index'],axis=1,inplace=True)

df_moscow_full_results.rename(columns ={'bib':'BibNo','pace_sec':'Pace_Sec','finish_time_sec':'Finish_Time_Sec',\
                    'pace(minpkm)':'Pace_Minpkm','half_pace_sec':'Half_Pace_Sec',\
                    'half_pace(minpkm)':'Half_Pace_Minpkm','gender_en':'Gender',\
                    'age':'Age','name_en':'Name','location_city_en':'City','country_code_alpha_3':'Country'},inplace=True)

df_moscow_split_results.drop(['split_pace(minpkm)','split_pace(kmph)','split_uniform_index','split'],axis=1,inplace=True)

df_moscow_split_results.head()

	bib	split_name	split_time_sec	split_time_result	split_pace_sec
0	1	5 km	1057	17min 37sec	211.400000
1	1	10 km	2112	35min 12sec	211.000000
2	1	15 km	3149	52min 29sec	207.400000
3	1	Half marathon	4395	1h 13min 15sec	204.346043
4	1	25 km	5258	1h 27min 38sec	221.140295

df_moscow_split_results.rename (columns={'bib':'BibNo','split_name':'Split_Name','split_time_sec':'Split_Time_Sec',\
                    'split_time_result':'Split_Time_Result','split_pace_sec':'Split_Pace_Sec'},inplace=True)

By using pivot table to make split_name values as column in Moscow_Split_Results and Moscow_Full_Results Dataframe

df_moscow_split_results['Marathon_Split'] = ((df_moscow_split_results.Split_Name == ('10 km'))\
                                             |(df_moscow_split_results.Split_Name == ('Half marathon'))\
                                             |(df_moscow_split_results.Split_Name == ('Marathon')))

df_temp_pivot = df_moscow_split_results.pivot_table(index='BibNo', columns='Split_Name', values='Split_Time_Sec')

df_temp_pivot.head()

Split_Name	10 km	15 km	25 km	30 km	35 km	40 km	5 km	Half marathon	Marathon
BibNo
1	2112.0	3149.0	5258.0	6321.0	7362.0	8442.0	1057.0	4395.0	8911.0
2	2115.0	3188.0	5386.0	6508.0	7608.0	8784.0	1059.0	4476.0	9308.0
3	1899.0	2867.0	4796.0	5786.0	6632.0	7703.0	947.0	4006.0	8122.0
4	1876.0	2814.0	4752.0	5780.0	6827.0	7967.0	945.0	3943.0	8467.0
5	1958.0	2933.0	4949.0	6015.0	7080.0	8235.0	976.0	4112.0	8738.0

df_temp_pivot.drop(['15 km','25 km','30 km','35 km','40 km','5 km'],axis=1,inplace=True)

df_temp_pivot.set_index='BibNo'

Used the Outer Merge on Pivot table and Moscow_Full_Result to to get the split name values as column in Moscow_full_result

df_Moscow = df_moscow_full_results.merge(df_temp_pivot,how='outer',right_index=True,left_on='BibNo')

df_Moscow = df_moscow_full_results.merge(df_temp_pivot,how='outer',right_index=True,left_on='BibNo')

df_Moscow.head()

	BibNo	Finish_Time_Sec	Pace_Sec	Pace_Minpkm	Half_Pace_Sec	Half_Pace_Minpkm	Gender	Age	Name	City	Country	10 km	Half marathon	Marathon
0	1	8911.0	211.186160	3:31 min/km	208.318521	3:28 min/km	Female	30	Sardana Trofimova	Yakutsk	RUS	2112.0	4395.0	8911.0
1	2	9308.0	220.594857	3:40 min/km	212.157839	3:32 min/km	Female	35	Tat'yana Arkhipova	Cheboksary	RUS	2115.0	4476.0	9308.0
2	3	8122.0	192.487261	3:12 min/km	189.880318	3:09 min/km	Male	31	Stepan Kiselev	Kazan'	RUS	1899.0	4006.0	8122.0
3	4	8467.0	200.663586	3:20 min/km	186.894182	3:06 min/km	Male	36	Dmitriy Safronov	Moskva	RUS	1876.0	3943.0	8467.0
4	5	8738.0	207.086148	3:27 min/km	194.904610	3:14 min/km	Male	42	Grigoriy Andreev	Ufa	RUS	1958.0	4112.0	8738.0

Sorted the values based on Finish time seconds for overall ranking of athletes in Moscow dataframe

df_Moscow.sort_values(by='Finish_Time_Sec')

	BibNo	Finish_Time_Sec	Pace_Sec	Pace_Minpkm	Half_Pace_Sec	Half_Pace_Minpkm	Gender	Age	Name	City	Country	10 km	Half marathon	Marathon
8982	17006	1728.0	172.8	2:52 min/km	175.8	2:55 min/km	Male	26	Vladimir Nikitin	Perm'	RUS	1728.0	NaN	NaN
8991	17018	1826.0	182.6	3:02 min/km	183.6	3:03 min/km	Male	29	Denis Vasil'ev	Sankt-Peterburg	RUS	1826.0	NaN	NaN
21362	32745	1834.0	183.4	3:03 min/km	185.6	3:05 min/km	Male	30	Artur Burtsev	Sankt-Peterburg	RUS	1834.0	NaN	NaN
8988	17013	1869.0	186.9	3:06 min/km	186.6	3:06 min/km	Male	24	Ivan Panferov	Moskva	RUS	1869.0	NaN	NaN
21114	32472	1869.0	186.9	3:06 min/km	188.2	3:08 min/km	Male	22	Nikita Kotenev	Minsk	BLR	1869.0	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
8954	11201	NaN	NaN	NaN	NaN	NaN	Male	30	Murtuz Ramazanov	Moskva	RUS	4493.0	10282.0	NaN
8960	11209	NaN	NaN	NaN	NaN	NaN	Female	37	Ekaterina Andreeva	Moskva	RUS	NaN	NaN	NaN
8961	11210	NaN	NaN	NaN	NaN	NaN	Male	28	Sergey Nikitin	Moskva	RUS	NaN	NaN	NaN
8973	15013	NaN	NaN	NaN	NaN	NaN	Male	32	Mikhail Maksimov	Sankt-Peterburg	RUS	1957.0	4086.0	NaN
16208	26352	NaN	NaN	NaN	NaN	NaN	Female	29	Lyudmila Tret'yakova	Moskva	RUS	NaN	NaN	NaN

21662 rows × 14 columns

df_Moscow['Marathon'].isna().sum()

Drop the NAN values Moscow Dataframe from Marathon column which shows that person has not completed the marathon

df_Moscow.dropna(how='any',subset=['Marathon'],inplace=True)

df_new_Moscow= df_Moscow.sort_values(by='Finish_Time_Sec')

df_new_Moscow.head()

	BibNo	Finish_Time_Sec	Pace_Sec	Pace_Minpkm	Half_Pace_Sec	Half_Pace_Minpkm	Gender	Age	Name	City	Country	10 km	Half marathon	Marathon
2	3	8122.0	192.487261	3:12 min/km	189.880318	3:09 min/km	Male	31	Stepan Kiselev	Kazan'	RUS	1899.0	4006.0	8122.0
10	12	8161.0	193.411542	3:13 min/km	189.738121	3:09 min/km	Male	32	Aleksey Troshkin	Saransk	RUS	1905.0	4003.0	8161.0
8	10	8394.0	198.933523	3:18 min/km	188.268752	3:08 min/km	Male	27	Artem Aplachkin	Barnaul	RUS	1879.0	3972.0	8394.0
8968	15008	8416.0	199.454912	3:19 min/km	195.094205	3:15 min/km	Male	29	Sergey Popov	Voronezh	RUS	1960.0	4116.0	8416.0
3	4	8467.0	200.663586	3:20 min/km	186.894182	3:06 min/km	Male	36	Dmitriy Safronov	Moskva	RUS	1876.0	3943.0	8467.0

Added the new column overall ranking calculated based on Finish time seconds in Moscow dataframe

df_new_Moscow = df_new_Moscow.reset_index()
df_new_Moscow['Overall_Ranking'] = df_new_Moscow.index + 1

df_new_Moscow

	index	BibNo	Finish_Time_Sec	Pace_Sec	Pace_Minpkm	Half_Pace_Sec	Half_Pace_Minpkm	Gender	Age	Name	City	Country	10 km	Half marathon	Marathon	Overall_Ranking
0	2	3	8122.0	192.487261	3:12 min/km	189.880318	3:09 min/km	Male	31	Stepan Kiselev	Kazan'	RUS	1899.0	4006.0	8122.0	1
1	10	12	8161.0	193.411542	3:13 min/km	189.738121	3:09 min/km	Male	32	Aleksey Troshkin	Saransk	RUS	1905.0	4003.0	8161.0	2
2	8	10	8394.0	198.933523	3:18 min/km	188.268752	3:08 min/km	Male	27	Artem Aplachkin	Barnaul	RUS	1879.0	3972.0	8394.0	3
3	8968	15008	8416.0	199.454912	3:19 min/km	195.094205	3:15 min/km	Male	29	Sergey Popov	Voronezh	RUS	1960.0	4116.0	8416.0	4
4	3	4	8467.0	200.663586	3:20 min/km	186.894182	3:06 min/km	Male	36	Dmitriy Safronov	Moskva	RUS	1876.0	3943.0	8467.0	5
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
8697	1297	1818	22102.0	523.806138	8:43 min/km	492.617609	8:12 min/km	Male	64	Songkran Amornsaengthong	Moskva	RUS	4984.0	10393.0	22102.0	8698
8698	8578	10770	22178.0	525.607299	8:45 min/km	559.450172	9:19 min/km	Male	36	Aleksey Popov	Moskva	RUS	5956.0	11803.0	22178.0	8699
8699	3199	4301	22183.0	525.725797	8:45 min/km	503.471975	8:23 min/km	Male	26	Danila Malykhin	Moskva	RUS	4862.0	10622.0	22183.0	8700
8700	6340	8204	22291.0	528.285342	8:48 min/km	383.315559	6:23 min/km	Male	63	Chi_Wan Yu	Hong Kong	HKG	3704.0	8087.0	22291.0	8701
8701	2526	3462	22672.0	537.314848	8:57 min/km	493.518189	8:13 min/km	Male	25	Artem Prasolov	Moskva	RUS	4607.0	10412.0	22672.0	8702

8702 rows × 16 columns

df_new_Moscow.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8702 entries, 0 to 8701
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   index             8702 non-null   int64  
 1   BibNo             8702 non-null   int64  
 2   Finish_Time_Sec   8702 non-null   float64
 3   Pace_Sec          8702 non-null   float64
 4   Pace_Minpkm       8702 non-null   object 
 5   Half_Pace_Sec     8702 non-null   float64
 6   Half_Pace_Minpkm  8702 non-null   object 
 7   Gender            8702 non-null   object 
 8   Age               8702 non-null   int64  
 9   Name              8702 non-null   object 
 10  City              8702 non-null   object 
 11  Country           8702 non-null   object 
 12  10 km             8701 non-null   float64
 13  Half marathon     8702 non-null   float64
 14  Marathon          8702 non-null   float64
 15  Overall_Ranking   8702 non-null   int64  
dtypes: float64(6), int64(4), object(6)
memory usage: 1.1+ MB

df_new_Moscow.drop(columns=['index'],axis=1,inplace=True)

df_new_Moscow.head(20)

	BibNo	Finish_Time_Sec	Pace_Sec	Pace_Minpkm	Half_Pace_Sec	Half_Pace_Minpkm	Gender	Age	Name	City	Country	10 km	Half marathon	Marathon	Overall_Ranking
0	3	8122.0	192.487261	3:12 min/km	189.880318	3:09 min/km	Male	31	Stepan Kiselev	Kazan'	RUS	1899.0	4006.0	8122.0	1
1	12	8161.0	193.411542	3:13 min/km	189.738121	3:09 min/km	Male	32	Aleksey Troshkin	Saransk	RUS	1905.0	4003.0	8161.0	2
2	10	8394.0	198.933523	3:18 min/km	188.268752	3:08 min/km	Male	27	Artem Aplachkin	Barnaul	RUS	1879.0	3972.0	8394.0	3
3	15008	8416.0	199.454912	3:19 min/km	195.094205	3:15 min/km	Male	29	Sergey Popov	Voronezh	RUS	1960.0	4116.0	8416.0	4
4	4	8467.0	200.663586	3:20 min/km	186.894182	3:06 min/km	Male	36	Dmitriy Safronov	Moskva	RUS	1876.0	3943.0	8467.0	5
5	7	8513.0	201.753762	3:21 min/km	194.478019	3:14 min/km	Male	33	Mikhail Kul'kov	Khanty-Mansiysk	RUS	1959.0	4103.0	8513.0	6
6	14	8601.0	203.839317	3:23 min/km	197.416755	3:17 min/km	Male	31	Viktor Ugarov	Moskva	RUS	1976.0	4165.0	8601.0	7
7	8	8670.0	205.474582	3:25 min/km	195.094205	3:15 min/km	Male	30	Oleg Grigor'ev	Moskva	RUS	1958.0	4116.0	8670.0	8
8	10522	8690.0	205.948572	3:25 min/km	201.493068	3:21 min/km	Male	26	Aleksey Shirshov	Chkalov	RUS	1978.0	4251.0	8690.0	9
9	5	8738.0	207.086148	3:27 min/km	194.904610	3:14 min/km	Male	42	Grigoriy Andreev	Ufa	RUS	1958.0	4112.0	8738.0	10
10	33	8800.0	208.555516	3:28 min/km	192.250267	3:12 min/km	Male	27	Iskander Yadgarov	Moskva	RUS	1943.0	4056.0	8800.0	11
11	15010	8808.0	208.745112	3:28 min/km	199.597109	3:19 min/km	Male	32	Andrey Smirnov	Perm'	RUS	1972.0	4211.0	8808.0	12
12	13	8852.0	209.787890	3:29 min/km	208.223723	3:28 min/km	Male	25	Mikhail Zvyagintsev	Biryuch	RUS	2111.0	4393.0	8852.0	13
13	8422	8856.0	209.882688	3:29 min/km	208.223723	3:28 min/km	Male	43	Andrey Klinov	Ryazan'	RUS	2110.0	4393.0	8856.0	14
14	11	8856.0	209.882688	3:29 min/km	201.540467	3:21 min/km	Male	28	Aleksandr Krotovich	Sankt-Peterburg	RUS	1996.0	4252.0	8856.0	15
15	10709	8873.0	210.285579	3:30 min/km	208.223723	3:28 min/km	Male	38	Yipeng Li	Beijing	CHN	2111.0	4393.0	8873.0	16
16	15006	8890.0	210.688470	3:30 min/km	208.271122	3:28 min/km	Male	34	Denis Korablev	Salekhard	RUS	2111.0	4394.0	8890.0	17
17	1	8911.0	211.186160	3:31 min/km	208.318521	3:28 min/km	Female	30	Sardana Trofimova	Yakutsk	RUS	2112.0	4395.0	8911.0	18
18	22	8912.0	211.209859	3:31 min/km	208.176324	3:28 min/km	Male	35	Renat Kashapov	Kazan'	RUS	2110.0	4392.0	8912.0	19
19	15012	8952.0	212.157839	3:32 min/km	208.271122	3:28 min/km	Male	36	Dmitriy Shilkin	Saratov	RUS	2111.0	4394.0	8952.0	20

df_Boston=df_boston_raw.copy()
df_Boston.head()

	BibNo	Name	Age	Gender	City	State	Country	5K	10K	15K	20K	Half	25K	30K	35K	40K	Pace_minpermile	Finish_Time	Overall_ranking	Gender_ranking	Division_ranking
0	11	Kirui, Geoffrey	24	1.0	Keringet	NaN	KEN	0:15:25	0:30:28	0:45:44	1:01:15	1:04:35	1:16:59	1:33:01	1:48:19	2:02:53	0:04:57	2:09:37	1	1	1
1	17	Rupp, Galen	30	1.0	Portland	OR	USA	0:15:24	0:30:27	0:45:44	1:01:15	1:04:35	1:16:59	1:33:01	1:48:19	2:03:14	0:04:58	2:09:58	2	2	2
2	23	Osako, Suguru	25	1.0	Machida-City	NaN	JPN	0:15:25	0:30:29	0:45:44	1:01:16	1:04:36	1:17:00	1:33:01	1:48:31	2:03:38	0:04:59	2:10:28	3	3	3
3	21	Biwott, Shadrack	32	1.0	Mammoth Lakes	CA	USA	0:15:25	0:30:29	0:45:44	1:01:19	1:04:45	1:17:00	1:33:01	1:48:58	2:04:35	0:05:03	2:12:08	4	4	4
4	9	Chebet, Wilson	31	1.0	Marakwet	NaN	KEN	0:15:25	0:30:28	0:45:44	1:01:15	1:04:35	1:16:59	1:33:01	1:48:41	2:05:00	0:05:04	2:12:35	5	5	5

Converted the finish time from Hour, Min & Sec format to seconds only to make it Final 10K, Half, 40k and Finish Time in Boston dataframe

df_Boston['FTime_10K_hour']= df_Boston['10K'].str.split(':').str[0]
df_Boston['FTime_10K_min']= df_Boston['10K'].str.split(':').str[1]
df_Boston['FTime_10K_sec']= df_Boston['10K'].str.split(':',n=2).str[-1]
df_Boston.FTime_10K_sec.fillna(0,inplace=True)
df_Boston.FTime_10K_min.fillna(0,inplace=True)
df_Boston.FTime_10K_hour.fillna(0,inplace=True)
df_Boston.FTime_10K_sec.replace('-',0,inplace=True)
df_Boston.FTime_10K_hour.replace('-',0,inplace=True)
final1=df_Boston['FTime_10K_hour'].astype(int) *3600
final2=(df_Boston['FTime_10K_min'].astype(int)) * 60 
final3=(df_Boston['FTime_10K_sec'].astype(int))
df_Boston['Final_10k_sec'] = final1 + final2 + final3
df_Boston.drop(columns =['FTime_10K_hour','FTime_10K_min','FTime_10K_sec'],inplace =True)

df_Boston['FTime_Half_hour']= df_Boston['Half'].str.split(':').str[0]
df_Boston['FTime_Half_min']= df_Boston['Half'].str.split(':').str[1]
df_Boston['FTime_Half_sec']= df_Boston['Half'].str.split(':',n=2).str[-1]
df_Boston.FTime_Half_sec.fillna(0,inplace=True)
df_Boston.FTime_Half_min.fillna(0,inplace=True)
df_Boston.FTime_Half_hour.fillna(0,inplace=True)
df_Boston.FTime_Half_sec.replace('-',0,inplace=True)
df_Boston.FTime_Half_hour.replace('-',0,inplace=True)
final1=df_Boston['FTime_Half_hour'].astype(int) *3600
final2=(df_Boston['FTime_Half_min'].astype(int)) * 60 
final3=(df_Boston['FTime_Half_sec'].astype(int))
df_Boston['Final_Half_sec'] = final1 + final2 + final3
df_Boston.drop(columns =['FTime_Half_hour','FTime_Half_min','FTime_Half_sec'],inplace =True)

df_Boston['FTime_40K_hour']= df_Boston['40K'].str.split(':').str[0]
df_Boston['FTime_40K_min']= df_Boston['40K'].str.split(':').str[1]
df_Boston['FTime_40K_sec']= df_Boston['40K'].str.split(':',n=2).str[-1] #To understand n=2 used
df_Boston.FTime_40K_sec.fillna(0,inplace=True)
df_Boston.FTime_40K_min.fillna(0,inplace=True)
df_Boston.FTime_40K_hour.fillna(0,inplace=True)
df_Boston.FTime_40K_sec.replace('-',0,inplace=True)
df_Boston.FTime_40K_hour.replace('-',0,inplace=True)
final1=df_Boston['FTime_40K_hour'].astype(int) *3600
final2=(df_Boston['FTime_40K_min'].astype(int)) * 60 
final3=(df_Boston['FTime_40K_sec'].astype(int))
df_Boston['Final_40k_sec'] = final1 + final2 + final3
df_Boston.drop(columns =['FTime_40K_hour','FTime_40K_min','FTime_40K_sec'],inplace =True)

df_Boston['FTime_hour']= df_Boston['Finish_Time'].str.split(':').str[0]
df_Boston['FTime_min']= df_Boston['Finish_Time'].str.split(':').str[1]
df_Boston['FTime_sec']= df_Boston['Finish_Time'].str.split(':',n=2).str[-1] #To understand n=2 used
df_Boston.FTime_sec.fillna(0,inplace=True)
df_Boston.FTime_min.fillna(0,inplace=True)
df_Boston.FTime_hour.fillna(0,inplace=True)
df_Boston.FTime_sec.replace('-',0,inplace=True)
df_Boston.FTime_hour.replace('-',0,inplace=True)
final1=df_Boston['FTime_hour'].astype(int) *3600
final2=(df_Boston['FTime_min'].astype(int)) * 60 
final3=(df_Boston['FTime_sec'].astype(int))
df_Boston['FinishTime_sec'] = final1 + final2 + final3
df_Boston.drop(columns =['FTime_hour','FTime_min','FTime_sec'],inplace =True)

df_Boston.drop(columns=['5K','15K','20K','25K','30K','35K'],inplace=True)

df_Boston.head()

	BibNo	Name	Age	Gender	City	State	Country	10K	Half	40K	Pace_minpermile	Finish_Time	Overall_ranking	Gender_ranking	Division_ranking	Final_10k_sec	Final_Half_sec	Final_40k_sec	FinishTime_sec
0	11	Kirui, Geoffrey	24	1.0	Keringet	NaN	KEN	0:30:28	1:04:35	2:02:53	0:04:57	2:09:37	1	1	1	1828	3875	7373	7777
1	17	Rupp, Galen	30	1.0	Portland	OR	USA	0:30:27	1:04:35	2:03:14	0:04:58	2:09:58	2	2	2	1827	3875	7394	7798
2	23	Osako, Suguru	25	1.0	Machida-City	NaN	JPN	0:30:29	1:04:36	2:03:38	0:04:59	2:10:28	3	3	3	1829	3876	7418	7828
3	21	Biwott, Shadrack	32	1.0	Mammoth Lakes	CA	USA	0:30:29	1:04:45	2:04:35	0:05:03	2:12:08	4	4	4	1829	3885	7475	7928
4	9	Chebet, Wilson	31	1.0	Marakwet	NaN	KEN	0:30:28	1:04:35	2:05:00	0:05:04	2:12:35	5	5	5	1828	3875	7500	7955

df_Boston.isna().sum()

BibNo                  0
Name                   0
Age                    0
Gender                 0
City                   0
State               3595
Country                0
10K                    0
Half                   0
40K                    0
Pace_minpermile        0
Finish_Time            0
Overall_ranking        0
Gender_ranking         0
Division_ranking       0
Final_10k_sec          0
Final_Half_sec         0
Final_40k_sec          0
FinishTime_sec         0
dtype: int64

df_Boston.shape

(26410, 19)

len(df_Boston.State)

Creating merged dataset for Boston and Moscow, keeping only relevant columns and sorting them by overall ranking.

df_Boston_merge=df_Boston.copy()

df_Moscow_merge=df_new_Moscow.copy()

Created isBoston column in Boston dataframe assigned value 1 which means after merging data row with value 1.0 are from Boston Marathon

df_Boston_merge['isBoston'] = '1.0'

df_Moscow_merge.drop(columns=['Finish_Time_Sec'],inplace=True)

df_Moscow_merge.drop(columns=['BibNo'],axis=1,inplace=True)

df_Boston_merge.drop(columns=['BibNo'],axis=1,inplace=True)

df_Moscow_merge.rename(columns={'10 km':'Final_10k_sec', 'Half marathon':'Final_Half_sec', \
                                'Marathon':'FinishTime_sec','Overall_Ranking':'Overall_ranking'}, inplace=True)

df_Moscow_merge.replace(to_replace='Male', value=1.0, inplace=True)

df_Moscow_merge.replace(to_replace='Female', value=0.0, inplace=True)

Created isBoston column in Moscow dataframe assigned value 0 which means after merging data row with value 0.0 are from Moscow Marathon

df_Moscow_merge['isBoston'] = '0.0'

Drop the unused columns from Boston and Moscow

df_Moscow_merge.drop(columns=['Pace_Sec', 'Pace_Minpkm', 'Half_Pace_Sec', 'Half_Pace_Minpkm'], axis=1, inplace=True)

df_Moscow_merge.head(1)

	Gender	Age	Name	City	Country	Final_10k_sec	Final_Half_sec	FinishTime_sec	Overall_ranking	isBoston
0	1.0	31	Stepan Kiselev	Kazan'	RUS	1899.0	4006.0	8122.0	1	0.0

df_Boston_merge.drop(columns=['Final_40k_sec', 'State', '10K', 'Half', '40K','Pace_minpermile','Finish_Time','Gender_ranking','Division_ranking'], axis=1, inplace=True)

df_Boston_merge.head(1)

	Name	Age	Gender	City	Country	Overall_ranking	Final_10k_sec	Final_Half_sec	FinishTime_sec	isBoston
0	Kirui, Geoffrey	24	1.0	Keringet	KEN	1	1828	3875	7777	1.0

Merge Boston and Moscow dataframe to get the insite about each athlete and sort them based on overall ranking

df_merged_bm = df_Boston_merge.merge(df_Moscow_merge,how='outer', left_on=['Name', 'City', 'Overall_ranking', 'FinishTime_sec', 'Final_Half_sec', 'Final_10k_sec', 'Country', 'Gender', 'Age', 'isBoston'],\
                                     right_on=['Name', 'City', 'Overall_ranking', 'FinishTime_sec', 'Final_Half_sec', 'Final_10k_sec', 'Country', 'Gender', 'Age', 'isBoston'])

df_merged_bm = df_merged_bm.sort_values(by='Overall_ranking')
df_merged_bm.head(1)

	Name	Age	Gender	City	Country	Overall_ranking	Final_10k_sec	Final_Half_sec	FinishTime_sec	isBoston
0	Kirui, Geoffrey	24	1.0	Keringet	KEN	1	1828.0	3875.0	7777.0	1.0

Interesting Finding 1 Summary

Success rate of a country - ratio of number of participants of a country in top 100 / total number of participants of that country.
Summary - Amongst 91 participating countries, Kenya has a success ratio of 100% in top 100 runners in Boston Marathon.
Summary - Amongst 53 participating countries, Thailand has a success ratio of 70% in top 1000 runners in Moscow Marathon.

Get the total number of countries participating in the run.

df_country = df_Boston.groupby('Country').size()
len(df_country)

Find the countries of the runners who have finished in top 100.

df_top100_country=df_Boston[:100].groupby('Country').size()
len(df_top100_country)

Calculate the success ratio which is the number of runners in top 100 divide by the total number of runners from that country.

df_Boston_countries_success_rate = ((df_top100_country/df_country)* 100).nlargest(14)  # 14 different countries in the top 100 as shown above.
df_Boston_countries_success_rate

Country
BDI    100.000000
BRN    100.000000
KEN    100.000000
ZIM    100.000000
ETH     80.000000
JPN      1.764706
NED      1.282051
IRL      1.149425
MEX      0.701754
ITA      0.606061
BRA      0.487805
USA      0.338983
GBR      0.235294
CAN      0.213904
dtype: float64

df_country['BDI']

df_country['BRN']

df_country['KEN']

df_country['ZIM']

df_country['ETH']

Plot of top 100 finishers countries.

sns.countplot(y='Country',data =df_Boston[:100])

<matplotlib.axes._subplots.AxesSubplot at 0xfef4b00>

Boston_success_df = df_Boston_countries_success_rate.to_frame().reset_index()
Boston_success_df.rename(columns={0:'success_rate'}, inplace=True)

sns.catplot(x='success_rate',y='Country', data=Boston_success_df,aspect=3,kind='bar')

<seaborn.axisgrid.FacetGrid at 0x10158080>

Eliminate the 3 countries - BDI, BRN and ZIM since they have only 1 participant. Plot for ETH and KEN. Show the success ratio plot as well. Show values of each bar in the graph.

Doing the same analysis for Moscow and get the total number of countries participating in the run

df_country_moscow=df_Moscow.groupby('Country').size()

df_top1000_country_moscow=df_Moscow[:1000].groupby('Country').size()
df_top1000_country_moscow

Country
CHN      1
HKG     10
RUS    903
THA     86
dtype: int64

success_rate_by_country_moscow =((df_top1000_country_moscow/df_country_moscow)* 100).nlargest(4)
success_rate_by_country_moscow

Country
THA    69.354839
HKG    34.482759
RUS    11.134402
CHN     1.960784
dtype: float64

Plot of top 1000 finishers by countries

sns.countplot(y='Country',data =df_Moscow[:1000])

<matplotlib.axes._subplots.AxesSubplot at 0xf2aa160>

Moscow_success_df = success_rate_by_country_moscow.to_frame().reset_index()
Moscow_success_df.rename(columns={0:'success_rate'}, inplace=True)

sns.catplot(x='success_rate',y='Country', data=Moscow_success_df,aspect=3,kind='bar')

<seaborn.axisgrid.FacetGrid at 0x10196320>

Doing the same analysis for merged data. This will show success rate for the combined two runs.

df_merged_bm_country=df_merged_bm.groupby('Country').size()

---------------------------------------------------------------------------

NameError                                 Traceback (most recent call last)

<ipython-input-1-b3695a4b09ad> in <module>
----> 1 df_merged_bm_country=df_merged_bm.groupby('Country').size()


NameError: name 'df_merged_bm' is not defined

df_merged_bm_country_500=df_merged_bm[:5000].groupby('Country').size()

df_merged_bm_success_rate =((df_merged_bm_country_500/df_merged_bm_country)* 100).nlargest(20) #expected finding Kenya has highest success rate
df_merged_bm_success_rate

Country
BDI    100.000000
BRN    100.000000
ETH    100.000000
IRN    100.000000
KEN    100.000000
MKD    100.000000
ZIM    100.000000
CHE     60.000000
LVA     60.000000
EGY     50.000000
PAN     50.000000
UZB     50.000000
DNK     42.857143
UKR     37.500000
BLR     35.555556
BER     33.333333
CAY     33.333333
CRC     33.333333
DOM     33.333333
GUA     33.333333
dtype: float64

df_merged_bm_success_rate_df = df_merged_bm_success_rate.to_frame().reset_index()
df_merged_bm_success_rate_df.rename(columns={0:'success_rate'}, inplace=True)
#df_merged_bm_success_rate_df.head()

sns.catplot(x='success_rate',y='Country', data=df_merged_bm_success_rate_df,aspect=2,kind='bar')

<seaborn.axisgrid.FacetGrid at 0x10184a20>

df_merged_bm_country = df_merged_bm.groupby('Country').count()
df_merged_bm_country[df_merged_bm_country > 100]

	Name	Age	Gender	City	Overall_ranking	Final_10k_sec	Final_Half_sec	FinishTime_sec	isBoston
Country
-	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
ALG	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
AND	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
ARE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
ARG	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
AUS	195.0	195.0	195.0	195.0	195.0	195.0	195.0	195.0	195.0
AUT	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
AZE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
BAR	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
BDI	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
BEL	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
BER	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
BIH	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
BLR	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
BRA	206.0	206.0	206.0	206.0	206.0	206.0	206.0	206.0	206.0
BRN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
BUL	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
CAN	1871.0	1871.0	1871.0	1871.0	1871.0	1871.0	1871.0	1871.0	1871.0
CAY	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
CHE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
CHI	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
CHN	293.0	293.0	293.0	293.0	293.0	293.0	293.0	293.0	293.0
COL	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
CRC	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
CRO	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
CZE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
DEN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
DEU	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
DNK	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
DOM	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...
PER	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
PHI	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
POL	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
POR	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
PRT	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
ROU	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
RSA	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
RUS	8139.0	8139.0	8139.0	8139.0	8139.0	8138.0	8139.0	8139.0	8139.0
SEN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
SIN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
SLO	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
SMR	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
SRB	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
SUI	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
SVK	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
SVN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
SWE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
TCA	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
THA	129.0	129.0	129.0	129.0	129.0	129.0	129.0	129.0	129.0
TRI	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
TUR	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
TWN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
UAE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
UKR	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
URU	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
USA	20945.0	20945.0	20945.0	20945.0	20945.0	20945.0	20945.0	20945.0	20945.0
UZB	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
VEN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
ZAF	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
ZIM	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

113 rows × 9 columns

Show agreegated result of boston and moscow success rate in one graph. Still on different datasets.

Boston_success_df = df_Boston_countries_success_rate.to_frame().reset_index()
Boston_success_df.rename(columns={0:'success_rate'}, inplace=True)
Boston_success_df['isBoston'] = '1.0'
#Boston_success_df.head(1)

---------------------------------------------------------------------------

NameError                                 Traceback (most recent call last)

<ipython-input-2-706243f5f236> in <module>
----> 1 Boston_success_df = df_Boston_countries_success_rate.to_frame().reset_index()
      2 Boston_success_df.rename(columns={0:'success_rate'}, inplace=True)
      3 Boston_success_df['isBoston'] = '1.0'
      4 #Boston_success_df.head(1)


NameError: name 'df_Boston_countries_success_rate' is not defined

Moscow_success_df = success_rate_by_country_moscow.to_frame().reset_index()
Moscow_success_df.rename(columns={0:'success_rate'}, inplace=True)
Moscow_success_df['isBoston'] = '0.0'
#Moscow_success_df.head(1)

df_merge_success_rate = Boston_success_df.merge(Moscow_success_df,how='outer', left_on=['Country', 'success_rate', 'isBoston'],\
                                     right_on=['Country', 'success_rate', 'isBoston'])

df_merge_success_rate = df_merge_success_rate.sort_values(by='success_rate', ascending=False)

sns.catplot(x='success_rate',y='Country',hue='isBoston',data=df_merge_success_rate,aspect=3,kind='bar')

<seaborn.axisgrid.FacetGrid at 0x536c128>

Managerial Insights

To perform well in long run marathons athletes should follow the practices what Kenya and Thailand athletes have implemented. As Kenya athletes have a dominance in the olympics and most recently Thailand is seeing the hobby jogging boom.
Kenya and Thailand athletes are used to perform in severe conditions, so they have the capacity to manage well in Boston and Moscow Marathon
Kenya as a country has the most successful representation in the run. The runners from Kenya are really fast and they are amongst the winners as well. In order to make the run more competitive and atttractive sponsors and viewers, we should make sure to have good representation from Kenya and top Kenyan players. This will drive more advertising and sponsorship money to the run. Don't just focus on local talent but rather go for the bigger talent from abroad.

Interesting Finding 2 Summary

Summary - Negative split is not an effective nor popular strategy for Boston Marathon.
Background - Negative split is a very popular and common strategy where a runner runs the second half of a marathon
Faster than the first half.

import matplotlib.pyplot as plt
df_Boston['boston_split_ratio'] = (df_Boston['FinishTime_sec'] - df_Boston['Final_Half_sec'])/(df_Boston['Final_Half_sec'])
df_split = df_Boston[df_Boston.Final_Half_sec > 0]

len(df_split[df_split.boston_split_ratio < 1.0])

len(df_Boston[df_Boston.boston_split_ratio < 1.0]) / len(df_Boston) * 100

3.070806512684589

Proves that mature people use negative split and not young.

df_split[df_split.boston_split_ratio < 1.0].groupby(['Gender']).size()

Gender
0.0    377
1.0    434
dtype: int64

sns.kdeplot(df_split.boston_split_ratio)
plt.xlim([0.7,1.7])
plt.xlabel('Boston Split Ratio')
plt.ylabel('Runner Density')
plt.title('Split Distribution (Negative split when < 1)')

<matplotlib.axes._subplots.AxesSubplot at 0xf291c50>






(0.7, 1.7)






Text(0.5, 0, 'Boston Split Ratio')






Text(0, 0.5, 'Runner Density')






Text(0.5, 1.0, 'Split Distribution (Negative split when < 1)')

plt.plot(df_split.Overall_ranking,df_split.boston_split_ratio,'o', alpha = 0.2)
plt.ylim([0.5,3])
plt.xlabel('Overall Ranking')
plt.ylabel('Split Ratio')
plt.title('Split and performance')

[<matplotlib.lines.Line2D at 0x100b5a58>]






(0.5, 3)






Text(0.5, 0, 'Overall Ranking')






Text(0, 0.5, 'Split Ratio')






Text(0.5, 1.0, 'Split and performance')

Till here, we should show how many have positive split and how many have negative split. Show them by age and gender. Plot split ratio of each runner by finish time. Variance in split ratio by age? by gender?

Derive split of top 100 from the above dataframe to optimize.

dfBoston_100 = df_Boston[:100]
dfBoston_100['split_ratio_100'] = (dfBoston_100['FinishTime_sec'] - dfBoston_100['Final_Half_sec'])/(dfBoston_100['Final_Half_sec'])
df_split_100 = dfBoston_100[dfBoston_100.Final_Half_sec > 0]

Top 10 runners have not used negative split.

The below chart shows that successful countries didn't solely use negative split strategy.

plt.plot(df_split_100.Overall_ranking,df_split_100.split_ratio_100,'o', alpha = 0.2)
plt.ylim([0.8,1.4])
plt.xlabel('Overall Rank')
plt.ylabel('Split')
plt.title('Split and performance')

[<matplotlib.lines.Line2D at 0x10113b38>]






(0.8, 1.4)






Text(0.5, 0, 'Overall Rank')






Text(0, 0.5, 'Split')






Text(0.5, 1.0, 'Split and performance')

# Till here, we should show how many have +ve split and how many have negative split. Show them by age and gender.
# plot split ratio of each runner by finish time.  split ratio on y axis and finish time on x.
# variance in split ratio by age? by gender?

plt.plot(df_split.Overall_ranking[(df_split.Country is not 'KEN') & (df_split.Overall_ranking<100)], df_split.boston_split_ratio[(df_split.Country is not 'KEN') & (df_split.Overall_ranking<100)],'o')

plt.plot(df_split.Overall_ranking[(df_split.Country == 'KEN') & (df_split.Overall_ranking<100)], df_split.boston_split_ratio[(df_split.Country == 'KEN')  & (df_split.Overall_ranking<100)],'o', color = 'r')
plt.xlabel('Overall Rank')
plt.ylabel('Split Ratio')
plt.legend(['Others','Kenya'])

[<matplotlib.lines.Line2D at 0x13988da0>]






[<matplotlib.lines.Line2D at 0x13994438>]






Text(0.5, 0, 'Overall Rank')






Text(0, 0.5, 'Split Ratio')






<matplotlib.legend.Legend at 0x13988ef0>

#doing same analysis for moscow.

import matplotlib.pyplot as plt
df_new_Moscow['moscow_split_ratio'] = (df_new_Moscow['Marathon'] - df_new_Moscow['Half marathon'])/(df_new_Moscow['Half marathon'])
# count number of positive split and number of negative split. Plot it.

sns.kdeplot(df_new_Moscow.moscow_split_ratio)
plt.xlim([0.7,1.8])
plt.xlabel('Split Ratio')
plt.title('Split Distribution (Negative split when < 1)')
# what does y axis represent? is it count?

<matplotlib.axes._subplots.AxesSubplot at 0x139cc588>






(0.7, 1.8)






Text(0.5, 0, 'Split Ratio')






Text(0.5, 1.0, 'Split Distribution (Negative split when < 1)')

plt.plot(df_new_Moscow.Overall_Ranking,df_new_Moscow.moscow_split_ratio,'o', alpha = 0.2)
plt.ylim([0.5,2.2])
plt.xlabel('Overall_Ranking')
plt.ylabel('Split Ratio')
plt.title('Split Ratio and performance')

[<matplotlib.lines.Line2D at 0x13c8d160>]






(0.5, 2.2)






Text(0.5, 0, 'Overall_Ranking')






Text(0, 0.5, 'Split Ratio')






Text(0.5, 1.0, 'Split Ratio and performance')

len(df_new_Moscow[df_new_Moscow.moscow_split_ratio < 1.0])

len(df_new_Moscow[df_new_Moscow.moscow_split_ratio < 1.0]) / len(df_new_Moscow) * 100

10.744656400827395

Derive split of top 100 from the above df to optimize.

dfMoscow_100 = df_new_Moscow[:100]
dfMoscow_100['split_ratio_100'] = (dfMoscow_100['Marathon'] - dfMoscow_100['Half marathon'])/(dfMoscow_100['Half marathon'])
df_split_moscow_100 = dfMoscow_100
sns.kdeplot(df_split_moscow_100.split_ratio_100)
plt.xlim([0.7,1.7])
plt.xlabel('Split Ratio')
plt.title('Split Distribution (Negative split when < 1)')

<matplotlib.axes._subplots.AxesSubplot at 0x13c991d0>






(0.7, 1.7)






Text(0.5, 0, 'Split Ratio')






Text(0.5, 1.0, 'Split Distribution (Negative split when < 1)')

plt.plot(df_split_moscow_100.Overall_Ranking,df_split_moscow_100.split_ratio_100,'o', alpha = 0.2)
plt.ylim([0.9,1.3])
plt.xlabel('Overall Rank')
plt.ylabel('Split')
plt.title('Split and performance')

[<matplotlib.lines.Line2D at 0x14ed58d0>]






(0.9, 1.3)






Text(0.5, 0, 'Overall Rank')






Text(0, 0.5, 'Split')






Text(0.5, 1.0, 'Split and performance')

len(df_split_moscow_100)

Count number of positive split and number of negative split. Plot it.

import matplotlib.pyplot as plt
df_merged_bm['split_ratio'] = (df_merged_bm['FinishTime_sec'] - df_merged_bm['Final_Half_sec'])/(df_merged_bm['Final_Half_sec'])

type(df_split.boston_split_ratio)

pandas.core.series.Series

f, ax = plt.subplots(figsize=(8, 8))
ax = sns.kdeplot(df_split.boston_split_ratio, legend=True)
ax = sns.kdeplot(df_new_Moscow.moscow_split_ratio, legend=True)
#sns.lineplot(x='split_ratio',y='FinishTime_sec',hue='isBoston',data=df_Boston,aspect=2,kind='point')
plt.xlim([0.2,2.8])
plt.xlabel('Split Ratio')
plt.ylabel('Runner Density')
plt.title('Split Distribution (Negative split when < 1)')
# what does y axis represent? is it count?

(0.2, 2.8)






Text(0.5, 0, 'Split Ratio')






Text(0, 0.5, 'Runner Density')






Text(0.5, 1.0, 'Split Distribution (Negative split when < 1)')

plt.plot(df_merged_bm.Overall_ranking,df_merged_bm.split_ratio,'o', alpha = 0.2)
plt.ylim([0.5,2.2])
plt.xlabel('Overall_Ranking')
plt.ylabel('Split')
plt.title('Split and performance')

[<matplotlib.lines.Line2D at 0x14fa7be0>]






(0.5, 2.2)






Text(0.5, 0, 'Overall_Ranking')






Text(0, 0.5, 'Split')






Text(0.5, 1.0, 'Split and performance')

Managerial Insights

After doing the analysis, we observed that to perfrom well in Marathon athletes should use the even split strategy that is athlete should keep his/her pace consistent throughout the marathon instead of conserving the energy for first or second split of the marathon
It is advisable to athlete to practice their chosen strategy in training to help set the realistic expectation and to learn how their body is going to respond.

Interesting Finding 3 Summary

Summary - Middle aged athletes are faster as compare to younger aged athletes.

df_Boston['binned_age1']= pd.cut(df_Boston.Age,bins=[15,25,35,45,55,70])

df_Boston[df_Boston.Gender == 1.0].Age.describe()      # male dataset details.    #REMOVE!!!

count    14438.000000
mean        44.772475
std         11.533255
min         18.000000
25%         36.000000
50%         45.000000
75%         53.000000
max         83.000000
Name: Age, dtype: float64

df_Boston[df_Boston.Gender == 0.0].Age.describe() #REMOVE!!!

count    11972.000000
mean        39.952974
std         10.703554
min         18.000000
25%         31.000000
50%         40.000000
75%         48.000000
max         84.000000
Name: Age, dtype: float64

Unexpected as middle aged person more participated as compared to young.

df_Boston.groupby('Age').size().nlargest(5)

Age
45    1127
46     976
40     926
47     843
50     815
dtype: int64

df_Boston.groupby('Gender').size()

Gender
0.0    11972
1.0    14438
dtype: int64

df_Boston.groupby(['Gender','binned_age1']).size()

Gender  binned_age1
0.0     (15, 25]       1067
        (25, 35]       3350
        (35, 45]       3780
        (45, 55]       2823
        (55, 70]        932
1.0     (15, 25]        691
        (25, 35]       2710
        (35, 45]       3991
        (45, 55]       4316
        (55, 70]       2623
dtype: int64

sns.countplot(x='binned_age1',hue='Gender',data=df_Boston)

<matplotlib.axes._subplots.AxesSubplot at 0x14fe5978>

df_Boston['binned_age2']= pd.cut(df_Boston.Age,bins=[15,20,25,29,35,40,45,50,70])

sns.factorplot(x='binned_age2',y='FinishTime_sec',hue='Gender',data=df_Boston,aspect=2,kind='point')

<seaborn.axisgrid.FacetGrid at 0x14f8a240>

Older and mature people perform the best and are the faster marathon runner. Older and mature people also participated more than young people and are more health considerate.

df_new_Moscow.head()

	BibNo	Finish_Time_Sec	Pace_Sec	Pace_Minpkm	Half_Pace_Sec	Half_Pace_Minpkm	Gender	Age	Name	City	Country	10 km	Half marathon	Marathon	Overall_Ranking	moscow_split_ratio
0	3	8122.0	192.487261	3:12 min/km	189.880318	3:09 min/km	Male	31	Stepan Kiselev	Kazan'	RUS	1899.0	4006.0	8122.0	1	1.027459
1	12	8161.0	193.411542	3:13 min/km	189.738121	3:09 min/km	Male	32	Aleksey Troshkin	Saransk	RUS	1905.0	4003.0	8161.0	2	1.038721
2	10	8394.0	198.933523	3:18 min/km	188.268752	3:08 min/km	Male	27	Artem Aplachkin	Barnaul	RUS	1879.0	3972.0	8394.0	3	1.113293
3	15008	8416.0	199.454912	3:19 min/km	195.094205	3:15 min/km	Male	29	Sergey Popov	Voronezh	RUS	1960.0	4116.0	8416.0	4	1.044704
4	4	8467.0	200.663586	3:20 min/km	186.894182	3:06 min/km	Male	36	Dmitriy Safronov	Moskva	RUS	1876.0	3943.0	8467.0	5	1.147350

df_new_Moscow['binned_age1']= pd.cut(df_new_Moscow.Age,bins=[15,25,30,35,40,50,70])

df_new_Moscow.groupby('Age').size().nlargest(10)

Age
31    485
30    482
33    468
32    457
34    450
35    434
29    408
28    372
36    360
37    340
dtype: int64

df_new_Moscow.groupby(['Gender']).size()

Gender
Female    1491
Male      7211
dtype: int64

df_new_Moscow.groupby(['Gender','binned_age1']).size()

Gender  binned_age1
Female  (15, 25]        183
        (25, 30]        403
        (30, 35]        395
        (35, 40]        238
        (40, 50]        207
        (50, 70]         63
Male    (15, 25]        530
        (25, 30]       1401
        (30, 35]       1899
        (35, 40]       1335
        (40, 50]       1409
        (50, 70]        616
dtype: int64

sns.countplot(x='binned_age1',hue='Gender',data=df_new_Moscow)  # color is opposite for male/female as compared to other graphs.

<matplotlib.axes._subplots.AxesSubplot at 0x16cc04a8>

df_new_Moscow['binned_age2']= pd.cut(df_new_Moscow.Age,bins=[20,25,30,35,40,45,50,70])

sns.factorplot(x='binned_age2',y='Marathon',hue='Gender',data=df_new_Moscow,aspect=3,kind='point')

<seaborn.axisgrid.FacetGrid at 0x16cc0908>

Analysis on combined dataset.

df_merged_bm['binned_age1']= pd.cut(df_merged_bm.Age,bins=[15,25,30,35,40,45,50,60,70])

df_merged_bm.groupby('Age').size().nlargest(10)

Age
45    1275
35    1234
40    1193
46    1150
30    1111
36    1093
31    1068
33    1059
29    1044
41    1034
dtype: int64

df_merged_bm.groupby(['Gender']).size()

Gender
0.0    13463
1.0    21649
dtype: int64

df_merged_bm.groupby(['Gender','binned_age1']).size()

Gender  binned_age1
0.0     (15, 25]       1250
        (25, 30]       2187
        (30, 35]       1961
        (35, 40]       2126
        (40, 45]       2026
        (45, 50]       1774
        (50, 60]       1763
        (60, 70]        354
1.0     (15, 25]       1221
        (25, 30]       2600
        (30, 35]       3410
        (35, 40]       3175
        (40, 45]       3000
        (45, 50]       2963
        (50, 60]       3834
        (60, 70]       1318
dtype: int64

Age wise participation across boston and moscow

sns.countplot(x='binned_age1',hue='isBoston',data=df_merged_bm)

<matplotlib.axes._subplots.AxesSubplot at 0x13c5cf28>

sns.countplot(x='binned_age1',hue='isBoston',data=df_merged_bm[df_merged_bm.Gender == 1.0])

<matplotlib.axes._subplots.AxesSubplot at 0x15239f60>

Age across boston and moscow for binned ages BUT only for females

sns.countplot(x='binned_age1',hue='isBoston',data=df_merged_bm[df_merged_bm.Gender == 0.0])

<matplotlib.axes._subplots.AxesSubplot at 0x16344cf8>

#df_new_Moscow[(df_new_Moscow.Age > 45) & (df_new_Moscowdd.Gender == 'Female')][['Age', 'Marathon']]

Binned age for combined df

df_merged_bm['binned_age']=pd.cut(df_merged_bm.Age,bins=[20,25,30,35,40,45,50,70])

Plot to show finish time across both marathons for binned ages.

sns.factorplot(x='binned_age',y='FinishTime_sec',hue='isBoston',data=df_merged_bm,aspect=2,kind='point')

<seaborn.axisgrid.FacetGrid at 0x1525f668>

Finish time across boston and moscow for binned ages BUT only for males.

sns.factorplot(x='binned_age',y='FinishTime_sec',hue='isBoston',data=df_merged_bm[df_merged_bm.Gender == 1.0],aspect=2,kind='point')

<seaborn.axisgrid.FacetGrid at 0x16761da0>

Finish time across boston and moscow for binned ages BUT only for females.

sns.factorplot(x='binned_age',y='FinishTime_sec',hue='isBoston',data=df_merged_bm[df_merged_bm.Gender == 0.0],aspect=2,kind='point')

<seaborn.axisgrid.FacetGrid at 0x16e1efd0>

Managerial Insights

It is advisible to younger age athletes to focus on improving the endurance

Decision Tree

df_Boston.head()

	BibNo	Name	Age	Gender	City	State	Country	10K	Half	40K	Pace_minpermile	Finish_Time	Overall_ranking	Gender_ranking	Division_ranking	Final_10k_sec	Final_Half_sec	Final_40k_sec	FinishTime_sec	boston_split_ratio	binned_age1	binned_age2
0	11	Kirui, Geoffrey	24	1.0	Keringet	NaN	KEN	0:30:28	1:04:35	2:02:53	0:04:57	2:09:37	1	1	1	1828	3875	7373	7777	1.006968	(15, 25]	(20, 25]
1	17	Rupp, Galen	30	1.0	Portland	OR	USA	0:30:27	1:04:35	2:03:14	0:04:58	2:09:58	2	2	2	1827	3875	7394	7798	1.012387	(25, 35]	(29, 35]
2	23	Osako, Suguru	25	1.0	Machida-City	NaN	JPN	0:30:29	1:04:36	2:03:38	0:04:59	2:10:28	3	3	3	1829	3876	7418	7828	1.019608	(15, 25]	(20, 25]
3	21	Biwott, Shadrack	32	1.0	Mammoth Lakes	CA	USA	0:30:29	1:04:45	2:04:35	0:05:03	2:12:08	4	4	4	1829	3885	7475	7928	1.040669	(25, 35]	(29, 35]
4	9	Chebet, Wilson	31	1.0	Marakwet	NaN	KEN	0:30:28	1:04:35	2:05:00	0:05:04	2:12:35	5	5	5	1828	3875	7500	7955	1.052903	(25, 35]	(29, 35]

dt = tree.DecisionTreeRegressor(max_depth=2)

df_bos_dec_tree = df_Boston.copy()

df_bos_dec_tree = pd.get_dummies(data=df_bos_dec_tree,columns=['Country'])

X= df_bos_dec_tree.drop(columns=['BibNo','Age','City','State','10K','40K','Pace_minpermile','Name',\
                              'Finish_Time','Overall_ranking','Gender_ranking','boston_split_ratio',\
                                 'Division_ranking','Half','Final_40k_sec','binned_age2',\
                                'binned_age1'],axis=1)
X.head()
X.columns

	Gender	Final_10k_sec	Final_Half_sec	FinishTime_sec	...	Country_USA
0	1.0	1828	3875	7777	...	0
1	1.0	1827	3875	7798	...	1
2	1.0	1829	3876	7828	...	0
3	1.0	1829	3885	7928	...	1
4	1.0	1828	3875	7955	...	0

5 rows × 95 columns

Index(['Gender', 'Final_10k_sec', 'Final_Half_sec', 'FinishTime_sec',
       'Country_ALG', 'Country_AND', 'Country_ARG', 'Country_AUS',
       'Country_AUT', 'Country_BAR', 'Country_BDI', 'Country_BEL',
       'Country_BER', 'Country_BRA', 'Country_BRN', 'Country_BUL',
       'Country_CAN', 'Country_CAY', 'Country_CHI', 'Country_CHN',
       'Country_COL', 'Country_CRC', 'Country_CRO', 'Country_CZE',
       'Country_DEN', 'Country_DOM', 'Country_ECU', 'Country_EGY',
       'Country_ESA', 'Country_ESP', 'Country_EST', 'Country_ETH',
       'Country_FIN', 'Country_FLK', 'Country_FRA', 'Country_GBR',
       'Country_GER', 'Country_GRE', 'Country_GRN', 'Country_GUA',
       'Country_HKG', 'Country_HON', 'Country_HUN', 'Country_INA',
       'Country_IND', 'Country_IRL', 'Country_ISL', 'Country_ISR',
       'Country_ITA', 'Country_JAM', 'Country_JPN', 'Country_KEN',
       'Country_KOR', 'Country_KSA', 'Country_KUW', 'Country_LAT',
       'Country_LTU', 'Country_LUX', 'Country_MAR', 'Country_MAS',
       'Country_MEX', 'Country_MGL', 'Country_MLT', 'Country_NCA',
       'Country_NED', 'Country_NGR', 'Country_NOR', 'Country_NZL',
       'Country_PAN', 'Country_PAR', 'Country_PER', 'Country_PHI',
       'Country_POL', 'Country_POR', 'Country_ROU', 'Country_RSA',
       'Country_RUS', 'Country_SIN', 'Country_SLO', 'Country_SMR',
       'Country_SRB', 'Country_SUI', 'Country_SVK', 'Country_SWE',
       'Country_TCA', 'Country_THA', 'Country_TRI', 'Country_TUR',
       'Country_TWN', 'Country_UAE', 'Country_UKR', 'Country_URU',
       'Country_USA', 'Country_VEN', 'Country_ZIM'],
      dtype='object')

Y=df_bos_dec_tree.Age

dt.fit(X,Y)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

import os
os.environ["PATH"] += os.pathsep + r'C:\Users\MIH\Anaconda3\Library\bin\graphviz'

# This code will visualize a decision tree dt, trained with the attributes in X and the class labels in Y
dt_feature_names = list(X.columns)
dt_target_names = [str(s) for s in Y.unique()]
tree.export_graphviz(dt, out_file='tree.dot', 
    feature_names=dt_feature_names, class_names=dt_target_names,
    filled=True)  
graph = pydotplus.graph_from_dot_file('tree.dot')
Image(graph.create_png())

LassoRegressor

from sklearn import linear_model
regLasso = linear_model.Lasso()

regLasso.fit(X,Y)

Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)

regLasso.coef_

array([ 2.45946604e+00,  1.54926501e-03,  6.73233896e-04,  5.91283677e-04,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00, -0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00, -0.00000000e+00,  0.00000000e+00,
        0.00000000e+00, -0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00, -0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00, -0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00, -0.00000000e+00,
        0.00000000e+00, -0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00, -0.00000000e+00,  0.00000000e+00, -0.00000000e+00,
        0.00000000e+00, -0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00, -0.00000000e+00,
        0.00000000e+00,  0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
       -0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
        0.00000000e+00, -0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
       -0.00000000e+00,  0.00000000e+00,  0.00000000e+00, -0.00000000e+00,
       -0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00, -0.00000000e+00,  0.00000000e+00,
       -0.00000000e+00,  0.00000000e+00, -0.00000000e+00])

KMeans Clustering

from sklearn.cluster import KMeans

clu = KMeans(n_clusters=2, random_state=0)

clu

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)

clu.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)

clu.labels_

array([1, 1, 1, ..., 0, 0, 0])

df2 = X.copy()

df2['cluster']=clu.labels_

df2.groupby('cluster').mean()

	Gender	Final_10k_sec	Final_Half_sec	FinishTime_sec	Country_ALG	Country_AND	Country_ARG	Country_AUS	Country_AUT	Country_BAR	Country_BDI	Country_BEL	Country_BER	Country_BRA	Country_BRN	...	Country_SRB	Country_SUI	Country_SVK	Country_SWE	Country_TCA	Country_THA	Country_TRI	Country_TUR	Country_TWN	Country_UAE	Country_UKR	Country_URU	Country_USA	Country_VEN	Country_ZIM
cluster
0	0.432154	3581.235880	7843.994727	17233.904031	0.000000	0.000000	0.000586	0.004921	0.001055	0.000000	0.000000	0.001758	0.000117	0.004570	0.000000	...	0.000000	0.003984	0.000234	0.001875	0.000000	0.000234	0.000000	0.000117	0.000937	0.000234	0.000234	0.000469	0.825521	0.000000	0.000000
1	0.601365	2821.851253	6044.298221	12875.059297	0.000112	0.000056	0.001734	0.008335	0.000727	0.000224	0.000056	0.001510	0.000112	0.009286	0.000056	...	0.000056	0.003524	0.000224	0.002517	0.000056	0.000168	0.000056	0.000224	0.002573	0.000280	0.000112	0.000056	0.777579	0.000336	0.000056

2 rows × 95 columns

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
.ipynb_checkpoints		.ipynb_checkpoints
Boston2017_Jupyter_Notebook_files		Boston2017_Jupyter_Notebook_files
.DS_Store		.DS_Store
1_full_results_mm_2018.csv		1_full_results_mm_2018.csv
1_split_results_mm_2018.csv		1_split_results_mm_2018.csv
Boston2017 _Jupyter Notebook V2.ipynb		Boston2017 _Jupyter Notebook V2.ipynb
Boston2017 _Jupyter Notebook.ipynb		Boston2017 _Jupyter Notebook.ipynb
Boston2017_Jupyter_Notebook.md		Boston2017_Jupyter_Notebook.md
ReadMe.md		ReadMe.md
marathon_results_2017.csv		marathon_results_2017.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Boston-Moscow Marathon Analysis

Project Datasets

1. Boston Marathon Results 2017

2. Moscow Marathon Full Results 2018

Jupyter Notebook with 3 Interesting findings

1. Data Set Description

Drop the unused columns and standardised the column names by renaming as per the both datasets in Boston Dataframe

Replaced Male and Female with values 0 and 1 in Boston Dataframe

Drop the unused columns and standardised the column names by renaming as per the both datasets in Moscow Dataframe

By using pivot table to make split_name values as column in Moscow_Split_Results and Moscow_Full_Results Dataframe

Used the Outer Merge on Pivot table and Moscow_Full_Result to to get the split name values as column in Moscow_full_result

Sorted the values based on Finish time seconds for overall ranking of athletes in Moscow dataframe

Drop the NAN values Moscow Dataframe from Marathon column which shows that person has not completed the marathon

Added the new column overall ranking calculated based on Finish time seconds in Moscow dataframe

Converted the finish time from Hour, Min & Sec format to seconds only to make it Final 10K, Half, 40k and Finish Time in Boston dataframe

Creating merged dataset for Boston and Moscow, keeping only relevant columns and sorting them by overall ranking.

Created isBoston column in Boston dataframe assigned value 1 which means after merging data row with value 1.0 are from Boston Marathon

Created isBoston column in Moscow dataframe assigned value 0 which means after merging data row with value 0.0 are from Moscow Marathon

Drop the unused columns from Boston and Moscow

Merge Boston and Moscow dataframe to get the insite about each athlete and sort them based on overall ranking

Interesting Finding 1 Summary

Eliminate the 3 countries - BDI, BRN and ZIM since they have only 1 participant. Plot for ETH and KEN. Show the success ratio plot as well. Show values of each bar in the graph.

Managerial Insights

Interesting Finding 2 Summary

Managerial Insights

Interesting Finding 3 Summary

Managerial Insights

Decision Tree

LassoRegressor

KMeans Clustering

About

Uh oh!

Releases

Packages

Languages

mohitcpatil/Boston-Marathon-Analysis-Python-Pandas

Folders and files

Latest commit

History

Repository files navigation

Boston-Moscow Marathon Analysis

Project Datasets

1. Boston Marathon Results 2017

2. Moscow Marathon Full Results 2018

Jupyter Notebook with 3 Interesting findings

1. Data Set Description

Drop the unused columns and standardised the column names by renaming as per the both datasets in Boston Dataframe

Replaced Male and Female with values 0 and 1 in Boston Dataframe

Drop the unused columns and standardised the column names by renaming as per the both datasets in Moscow Dataframe

By using pivot table to make split_name values as column in Moscow_Split_Results and Moscow_Full_Results Dataframe

Used the Outer Merge on Pivot table and Moscow_Full_Result to to get the split name values as column in Moscow_full_result

Sorted the values based on Finish time seconds for overall ranking of athletes in Moscow dataframe

Drop the NAN values Moscow Dataframe from Marathon column which shows that person has not completed the marathon

Added the new column overall ranking calculated based on Finish time seconds in Moscow dataframe

Converted the finish time from Hour, Min & Sec format to seconds only to make it Final 10K, Half, 40k and Finish Time in Boston dataframe

Creating merged dataset for Boston and Moscow, keeping only relevant columns and sorting them by overall ranking.

Created isBoston column in Boston dataframe assigned value 1 which means after merging data row with value 1.0 are from Boston Marathon

Created isBoston column in Moscow dataframe assigned value 0 which means after merging data row with value 0.0 are from Moscow Marathon

Drop the unused columns from Boston and Moscow

Merge Boston and Moscow dataframe to get the insite about each athlete and sort them based on overall ranking

Interesting Finding 1 Summary

Eliminate the 3 countries - BDI, BRN and ZIM since they have only 1 participant. Plot for ETH and KEN. Show the success ratio plot as well. Show values of each bar in the graph.

Managerial Insights

Interesting Finding 2 Summary

Managerial Insights

Interesting Finding 3 Summary

Managerial Insights

Decision Tree

LassoRegressor

KMeans Clustering

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages