Erick Lu
March 31, 2020 - Jupyter Notebook
- Introduction
- Scrape roster information for each NBA team
- Scrape player stats for career averages
- Joining and cleaning the data
- Calculating statistics
- Conclusion
In this project, I use Python to “scrape” ESPN for stats on all the players in the NBA, clean and organize the data into a data science-friendly format, and calculate some interesting statistics. Web scraping is a useful technique for extracting data from websites that don’t offer formatted, raw data for download.
As an example, I will scrape the rosters of each team in the NBA for information such as player age, height, weight, and salary. I will also loop through each individual player's stats page and extract career averages such as points per game, free throw percentage, and more (current as of March 2020).
We can use this data to answer questions such as:
- Do factors such as age, height, weight, etc. correlate with player performance? (i.e. does height matter?)
- What is the average salary paid by each team in the NBA, and which player earns the most on each team?
- How much more do better players cost? Can we model the average price of hiring a player given his performance? If so, what is the cost per increase in points per game?
I've exported the data to a nicely organized csv file, accessible in the GitHub repo for this project, in case you would like to analyze it yourself. You can also run the python script scrape_nba_statistics.py to re-scrape ESPN for up-to-date data.
In the following sections, I will describe how to loop through ESPN page sources using urllib, extract information using re (regular expressions), organize player statistics in pandas DataFrames, and perform some simple modeling using scikit-learn.
We will first take a look at the structure of the website and figure out which web pages we need to scrape information from. The teams page at https://www.espn.com/nba/teams looks like the following:
This looks very promising. All the teams are listed on this page, which means that they can easily be extracted from the page source. Let’s take a look at the page source to see if we can find URLs for each team's roster:
It looks like URLs for each of the teams' rosters are listed in the page source with the following format: https://www.espn.com/nba/team/roster/_/name/team/team-name, as shown in the highlighted portion of the image above. Given that these all follow the same format, we can use regular expressions to pull out a list of all the team names from the page source, and then construct the roster URLs using the format above. Start by importing the urllib and re packages in Python:
import re
import urllib.request
from time import sleep
Now, let’s create a function that will extract all the team names from https://www.espn.com/nba/teams and construct roster URLs for each of the teams:
# Find the URL of each NBA team's roster page using a regular expression.
def build_team_urls():
    # Open the ESPN teams webpage and extract the abbreviation and name of each team.
    f = urllib.request.urlopen('https://www.espn.com/nba/teams')
    teams_source = f.read().decode('utf-8')
    # A raw string keeps the regex backslashes intact.
    teams = dict(re.findall(r'www\.espn\.com/nba/team/_/name/(\w+)/(.+?)",', teams_source))
    # Using the abbreviation and name of each team, build the URL of its roster page.
    roster_urls = []
    for key in teams.keys():
        # Each roster webpage follows this general pattern.
        roster_urls.append('https://www.espn.com/nba/team/roster/_/name/' + key + '/' + teams[key])
    return dict(zip(teams.values(), roster_urls))
rosters = build_team_urls()
rosters
{'atlanta-hawks': 'https://www.espn.com/nba/team/roster/_/name/atl/atlanta-hawks',
'boston-celtics': 'https://www.espn.com/nba/team/roster/_/name/bos/boston-celtics',
'brooklyn-nets': 'https://www.espn.com/nba/team/roster/_/name/bkn/brooklyn-nets',
'charlotte-hornets': 'https://www.espn.com/nba/team/roster/_/name/cha/charlotte-hornets',
'chicago-bulls': 'https://www.espn.com/nba/team/roster/_/name/chi/chicago-bulls',
'cleveland-cavaliers': 'https://www.espn.com/nba/team/roster/_/name/cle/cleveland-cavaliers',
'dallas-mavericks': 'https://www.espn.com/nba/team/roster/_/name/dal/dallas-mavericks',
'denver-nuggets': 'https://www.espn.com/nba/team/roster/_/name/den/denver-nuggets',
'detroit-pistons': 'https://www.espn.com/nba/team/roster/_/name/det/detroit-pistons',
'golden-state-warriors': 'https://www.espn.com/nba/team/roster/_/name/gs/golden-state-warriors',
'houston-rockets': 'https://www.espn.com/nba/team/roster/_/name/hou/houston-rockets',
'indiana-pacers': 'https://www.espn.com/nba/team/roster/_/name/ind/indiana-pacers',
'la-clippers': 'https://www.espn.com/nba/team/roster/_/name/lac/la-clippers',
'los-angeles-lakers': 'https://www.espn.com/nba/team/roster/_/name/lal/los-angeles-lakers',
'memphis-grizzlies': 'https://www.espn.com/nba/team/roster/_/name/mem/memphis-grizzlies',
'miami-heat': 'https://www.espn.com/nba/team/roster/_/name/mia/miami-heat',
'milwaukee-bucks': 'https://www.espn.com/nba/team/roster/_/name/mil/milwaukee-bucks',
'minnesota-timberwolves': 'https://www.espn.com/nba/team/roster/_/name/min/minnesota-timberwolves',
'new-orleans-pelicans': 'https://www.espn.com/nba/team/roster/_/name/no/new-orleans-pelicans',
'new-york-knicks': 'https://www.espn.com/nba/team/roster/_/name/ny/new-york-knicks',
'oklahoma-city-thunder': 'https://www.espn.com/nba/team/roster/_/name/okc/oklahoma-city-thunder',
'orlando-magic': 'https://www.espn.com/nba/team/roster/_/name/orl/orlando-magic',
'philadelphia-76ers': 'https://www.espn.com/nba/team/roster/_/name/phi/philadelphia-76ers',
'phoenix-suns': 'https://www.espn.com/nba/team/roster/_/name/phx/phoenix-suns',
'portland-trail-blazers': 'https://www.espn.com/nba/team/roster/_/name/por/portland-trail-blazers',
'sacramento-kings': 'https://www.espn.com/nba/team/roster/_/name/sac/sacramento-kings',
'san-antonio-spurs': 'https://www.espn.com/nba/team/roster/_/name/sa/san-antonio-spurs',
'toronto-raptors': 'https://www.espn.com/nba/team/roster/_/name/tor/toronto-raptors',
'utah-jazz': 'https://www.espn.com/nba/team/roster/_/name/utah/utah-jazz',
'washington-wizards': 'https://www.espn.com/nba/team/roster/_/name/wsh/washington-wizards'}
The function build_team_urls() returns a dictionary that matches team names with their corresponding roster URL. Given this information, we can systematically loop through all of the rosters and use regular expressions to extract player information for each team.
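To check the pattern without making a network request, the sketch below runs the same regular expression against a short, hand-written stand-in for the page source. The sample_source string is an assumption made for illustration, not a copy of ESPN's markup.

```python
import re

# Hypothetical stand-in for the teams page source (not real ESPN markup).
sample_source = (
    '"href":"https://www.espn.com/nba/team/_/name/gs/golden-state-warriors",'
    '"href":"https://www.espn.com/nba/team/_/name/bos/boston-celtics",'
)

# Same pattern as in build_team_urls(): capture the abbreviation and the team name.
team_regex = r'www\.espn\.com/nba/team/_/name/(\w+)/(.+?)",'
teams = dict(re.findall(team_regex, sample_source))

# Build each roster URL from the abbreviation and the team name.
roster_base = 'https://www.espn.com/nba/team/roster/_/name/'
sample_rosters = {name: roster_base + abbr + '/' + name for abbr, name in teams.items()}
print(sample_rosters)
```

Testing a pattern on a small fixed string like this makes it easier to notice when a change to the website's markup is the reason a scrape comes back empty.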
In order to figure out how to scrape the rosters, let’s take a look at the Golden State Warriors' roster page as an example:
Information for each player is nicely laid out in a table, meaning that the data is likely obtainable using regular expressions. Taking a look at the page source reveals that each player’s name and information are all provided in blocks of what appear to be JSON, highlighted below:
Given the standardized format of the data for each player, this information is indeed extractable using regular expressions. First, we should read in the roster webpage using urllib.request.urlopen:
url = "https://www.espn.com/nba/team/roster/_/name/gs/golden-state-warriors"
f = urllib.request.urlopen(url)
roster_source = f.read().decode('utf-8')
Then, we construct the regex that will return information for each of the players on the roster webpage.
# A raw string passes the backslashes to the regex engine unchanged.
player_regex = r'\{"name":"(\w+\s\w+)","href":"https?://www\.espn\.com/nba/player/.*?",(.*?)\}'
player_regex
player_info = re.findall(player_regex, roster_source)
player_info[0:4]
[('Ky Bowman',
'"uid":"s:40~l:46~a:4065635","guid":"d0ef63e951bb5f842b7357521697dc62","id":"4065635","height":"6\' 1\\"","weight":"187 lbs","age":22,"position":"PG","jersey":"12","salary":"$350,189","birthDate":"06/17/97","headshot":"https://a.espncdn.com/i/headshots/nba/players/full/4065635.png","lastName":"Ky Bowman","experience":0,"college":"Boston College"'),
('Marquese Chriss',
'"uid":"s:40~l:46~a:3907487","guid":"a320ecf1d6481b7518ddc1dc576c27b4","id":"3907487","height":"6\' 9\\"","weight":"240 lbs","age":22,"position":"C","jersey":"32","salary":"$654,469","birthDate":"07/02/97","headshot":"https://a.espncdn.com/i/headshots/nba/players/full/3907487.png","lastName":"Marquese Chriss","experience":3,"college":"Washington","birthPlace":"Sacramento, CA"'),
('Stephen Curry',
'"uid":"s:40~l:46~a:3975","guid":"5dda51f150c966e12026400b73f34fad","id":"3975","height":"6\' 3\\"","weight":"185 lbs","age":32,"position":"PG","jersey":"30","salary":"$40,231,758","birthDate":"03/14/88","headshot":"https://a.espncdn.com/i/headshots/nba/players/full/3975.png","lastName":"Stephen Curry","experience":10,"college":"Davidson","birthPlace":"Akron, OH"'),
('Draymond Green',
'"uid":"s:40~l:46~a:6589","guid":"de360720e41625f28a6bb5ff82616cb1","id":"6589","height":"6\' 6\\"","weight":"230 lbs","age":30,"position":"PF","jersey":"23","salary":"$18,539,130","birthDate":"03/04/90","headshot":"https://a.espncdn.com/i/headshots/nba/players/full/6589.png","lastName":"Draymond Green","experience":7,"college":"Michigan State","birthPlace":"Saginaw, MI"')]
As you can see, player_info is a list of tuples, in which each player name is paired with a set of information (height, weight, age, etc.) that is organized in JSON format. We can use the json package in Python to convert the information into a Python dictionary:
import json
draymond = json.loads("{"+player_info[3][1]+"}")
draymond
{'age': 30,
'birthDate': '03/04/90',
'birthPlace': 'Saginaw, MI',
'college': 'Michigan State',
'experience': 7,
'guid': 'de360720e41625f28a6bb5ff82616cb1',
'headshot': 'https://a.espncdn.com/i/headshots/nba/players/full/6589.png',
'height': '6\' 6"',
'id': '6589',
'jersey': '23',
'lastName': 'Draymond Green',
'position': 'PF',
'salary': '$18,539,130',
'uid': 's:40~l:46~a:6589',
'weight': '230 lbs'}
In the example above, all of the pertinent information for Draymond Green is now stored in a Python dictionary named draymond. Let's use the snippets of code above to construct a function which loops through each player in a given roster and scrapes their information:
def get_player_info(roster_url):
    f = urllib.request.urlopen(roster_url)
    roster_source = f.read().decode('utf-8')
    # Pause briefly between requests.
    sleep(0.5)
    player_regex = r'\{"name":"(\w+\s\w+)","href":"https?://www\.espn\.com/nba/player/.*?",(.*?)\}'
    player_info = re.findall(player_regex, roster_source)
    player_dict = dict()
    for player in player_info:
        # player[0] is the name; player[1] holds the remaining JSON fields without braces.
        player_dict[player[0]] = json.loads("{" + player[1] + "}")
    return player_dict
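One caveat about the name pattern: the group (\w+\s\w+) only matches names made of exactly two runs of word characters separated by whitespace, so a name containing a hyphen, apostrophe, or period is skipped. The sketch below compares it with one possible relaxation, ([^"]+), on a hand-written sample. The sample string and the player names in it are made up for illustration, and the relaxed pattern has not been checked against the live ESPN pages.

```python
import json
import re

# Hypothetical roster fragment in the same shape as the blocks shown above.
sample = (
    '{"name":"Jane Doe","href":"https://www.espn.com/nba/player/_/id/1/jane-doe",'
    '"id":"1","age":25},'
    '{"name":"Ann Smith-Jones","href":"https://www.espn.com/nba/player/_/id/2/ann-smith-jones",'
    '"id":"2","age":30}'
)

strict = r'\{"name":"(\w+\s\w+)","href":"https?://www\.espn\.com/nba/player/.*?",(.*?)\}'
relaxed = r'\{"name":"([^"]+)","href":"https?://www\.espn\.com/nba/player/.*?",(.*?)\}'

strict_names = [name for name, _ in re.findall(strict, sample)]
relaxed_players = {
    name: json.loads("{" + info + "}") for name, info in re.findall(relaxed, sample)
}

print(strict_names)           # the hyphenated name is not matched
print(list(relaxed_players))  # both names are matched
```

The outputs shown in the rest of this post come from the original two-word pattern, so a relaxed pattern would change the player counts.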
We can now loop through each team in rosters and run get_player_info(), storing the output in a dictionary called all_players:
all_players = dict()
for team in rosters.keys():
    print("Gathering player info for team: " + team)
    all_players[team] = get_player_info(rosters[team])
Gathering player info for team: boston-celtics
Gathering player info for team: brooklyn-nets
Gathering player info for team: new-york-knicks
Gathering player info for team: philadelphia-76ers
Gathering player info for team: toronto-raptors
Gathering player info for team: chicago-bulls
Gathering player info for team: cleveland-cavaliers
Gathering player info for team: detroit-pistons
Gathering player info for team: indiana-pacers
Gathering player info for team: milwaukee-bucks
Gathering player info for team: atlanta-hawks
Gathering player info for team: charlotte-hornets
Gathering player info for team: miami-heat
Gathering player info for team: orlando-magic
Gathering player info for team: washington-wizards
Gathering player info for team: denver-nuggets
Gathering player info for team: minnesota-timberwolves
Gathering player info for team: oklahoma-city-thunder
Gathering player info for team: portland-trail-blazers
Gathering player info for team: utah-jazz
Gathering player info for team: golden-state-warriors
Gathering player info for team: la-clippers
Gathering player info for team: los-angeles-lakers
Gathering player info for team: phoenix-suns
Gathering player info for team: sacramento-kings
Gathering player info for team: dallas-mavericks
Gathering player info for team: houston-rockets
Gathering player info for team: memphis-grizzlies
Gathering player info for team: new-orleans-pelicans
Gathering player info for team: san-antonio-spurs
After running this code, the all_players dictionary should be a dictionary of dictionaries of dictionaries. This sounds complicated, but let's walk through what it looks like. The first level of keys should correspond to teams:
all_players.keys()
dict_keys(['boston-celtics', 'brooklyn-nets', 'new-york-knicks', 'philadelphia-76ers', 'toronto-raptors', 'chicago-bulls', 'cleveland-cavaliers', 'detroit-pistons', 'indiana-pacers', 'milwaukee-bucks', 'atlanta-hawks', 'charlotte-hornets', 'miami-heat', 'orlando-magic', 'washington-wizards', 'denver-nuggets', 'minnesota-timberwolves', 'oklahoma-city-thunder', 'portland-trail-blazers', 'utah-jazz', 'golden-state-warriors', 'la-clippers', 'los-angeles-lakers', 'phoenix-suns', 'sacramento-kings', 'dallas-mavericks', 'houston-rockets', 'memphis-grizzlies', 'new-orleans-pelicans', 'san-antonio-spurs'])
Within a team, the keys should correspond to player names. Let's zoom in on the LA Lakers:
all_players["los-angeles-lakers"].keys()
dict_keys(['Kostas Antetokounmpo', 'Avery Bradley', 'Devontae Cacok', 'Alex Caruso', 'Quinn Cook', 'Anthony Davis', 'Jared Dudley', 'Danny Green', 'Dwight Howard', 'LeBron James', 'Kyle Kuzma', 'JaVale McGee', 'Markieff Morris', 'Rajon Rondo', 'Dion Waiters'])
Now we can choose which player to look at. Let's choose LeBron James as an example:
all_players["los-angeles-lakers"]["LeBron James"]
{'age': 35,
'birthDate': '12/30/84',
'birthPlace': 'Akron, OH',
'experience': 16,
'guid': '1f6592b3ff53d3218dc56038d48c1786',
'headshot': 'https://a.espncdn.com/i/headshots/nba/players/full/1966.png',
'height': '6\' 9"',
'id': '1966',
'jersey': '23',
'lastName': 'LeBron James',
'position': 'SF',
'salary': '$37,436,858',
'uid': 's:40~l:46~a:1966',
'weight': '250 lbs'}
A dictionary with information about LeBron James is returned. We can extract information even more precisely by specifying which field we are interested in. Let's get his salary:
all_players["los-angeles-lakers"]["LeBron James"]["salary"]
'$37,436,858'
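The three levels of nesting (team, then player, then field) can also be walked in a single pass. The sketch below uses a small mock dictionary with the same shape as all_players, with values copied from the outputs above, and flattens it into (team, player, salary) records. The name mock_players is introduced only for this illustration.

```python
# Small mock with the same team -> player -> field nesting as all_players.
mock_players = {
    "golden-state-warriors": {
        "Stephen Curry": {"age": 32, "salary": "$40,231,758"},
        "Draymond Green": {"age": 30, "salary": "$18,539,130"},
    },
    "los-angeles-lakers": {
        "LeBron James": {"age": 35, "salary": "$37,436,858"},
    },
}

# Walk all three levels and flatten them into (team, player, salary) records.
records = [
    (team, player, info.get("salary"))
    for team, roster in mock_players.items()
    for player, info in roster.items()
]
for record in records:
    print(record)
```

Using info.get("salary") rather than info["salary"] returns None instead of raising a KeyError when a player has no salary listed.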
In order to make data analysis easier, we can re-format this dictionary into a pandas DataFrame. The function pd.DataFrame.from_dict() can turn a dictionary of dictionaries into a pandas DataFrame, as demonstrated below:
import pandas as pd
gsw = pd.DataFrame.from_dict(all_players["golden-state-warriors"], orient = "index")
gsw
player | uid | guid | id | height | weight | age | position | jersey | salary | birthDate | headshot | lastName | experience | college | birthPlace | hand |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Alen Smailagic | s:40~l:46~a:4401415 | 6ed3f8924bfef2e70329ebd6a104ecae | 4401415 | 6' 10" | 215 lbs | 19 | PF | 6 | $898,310 | 08/18/00 | https://a.espncdn.com/i/headshots/nba/players/... | Alen Smailagic | 0 | NaN | NaN | NaN |
Andrew Wiggins | s:40~l:46~a:3059319 | 064c19d065276a21ca99fdfb296fe05d | 3059319 | 6' 7" | 197 lbs | 25 | SF | 22 | $27,504,630 | 02/23/95 | https://a.espncdn.com/i/headshots/nba/players/... | Andrew Wiggins | 5 | Kansas | Thornhill, ON | NaN |
Chasson Randle | s:40~l:46~a:2580898 | 71b7154a3d81842448b623ee3e65d586 | 2580898 | 6' 2" | 185 lbs | 27 | PG | 25 | NaN | 02/05/93 | https://a.espncdn.com/i/headshots/nba/players/... | Chasson Randle | 2 | Stanford | Rock Island, IL | NaN |
Damion Lee | s:40~l:46~a:2595209 | 41fafb6d47a66d8f79f94161918541a4 | 2595209 | 6' 5" | 210 lbs | 27 | SG | 1 | $842,327 | 10/21/92 | https://a.espncdn.com/i/headshots/nba/players/... | Damion Lee | 2 | Louisville | NaN | L |
Draymond Green | s:40~l:46~a:6589 | de360720e41625f28a6bb5ff82616cb1 | 6589 | 6' 6" | 230 lbs | 30 | PF | 23 | $18,539,130 | 03/04/90 | https://a.espncdn.com/i/headshots/nba/players/... | Draymond Green | 7 | Michigan State | Saginaw, MI | NaN |
Eric Paschall | s:40~l:46~a:3133817 | b67e5e0fa5cb209355845d165a49407e | 3133817 | 6' 6" | 255 lbs | 23 | PF | 7 | $898,310 | 11/04/96 | https://a.espncdn.com/i/headshots/nba/players/... | Eric Paschall | 0 | Villanova | North Tarrytown, NY | NaN |
Jordan Poole | s:40~l:46~a:4277956 | 4b0492b5a52f267fe84098ef6d2e2bdf | 4277956 | 6' 4" | 194 lbs | 20 | SG | 3 | $1,964,760 | 06/19/99 | https://a.espncdn.com/i/headshots/nba/players/... | Jordan Poole | 0 | Michigan | Milwaukee, WI | B |
Kevon Looney | s:40~l:46~a:3155535 | 10a8e77b877324c69966f0c4618caad6 | 3155535 | 6' 9" | 222 lbs | 24 | PF | 5 | $4,464,226 | 02/06/96 | https://a.espncdn.com/i/headshots/nba/players/... | Kevon Looney | 4 | UCLA | Milwaukee, WI | NaN |
Klay Thompson | s:40~l:46~a:6475 | 3411530a7ab7e8dce4f165d59a559520 | 6475 | 6' 6" | 215 lbs | 30 | SG | 11 | $32,742,000 | 02/08/90 | https://a.espncdn.com/i/headshots/nba/players/... | Klay Thompson | 8 | Washington State | Los Angeles, CA | NaN |
Ky Bowman | s:40~l:46~a:4065635 | d0ef63e951bb5f842b7357521697dc62 | 4065635 | 6' 1" | 187 lbs | 22 | PG | 12 | $350,189 | 06/17/97 | https://a.espncdn.com/i/headshots/nba/players/... | Ky Bowman | 0 | Boston College | NaN | NaN |
Marquese Chriss | s:40~l:46~a:3907487 | a320ecf1d6481b7518ddc1dc576c27b4 | 3907487 | 6' 9" | 240 lbs | 22 | C | 32 | $654,469 | 07/02/97 | https://a.espncdn.com/i/headshots/nba/players/... | Marquese Chriss | 3 | Washington | Sacramento, CA | NaN |
Mychal Mulder | s:40~l:46~a:3936298 | f5a46c489e9aee6a1a74f67f9494132f | 3936298 | 6' 4" | 184 lbs | 25 | G | 12 | NaN | 06/12/94 | https://a.espncdn.com/i/headshots/nba/players/... | Mychal Mulder | 0 | Kentucky | Toronto, ON | NaN |
Stephen Curry | s:40~l:46~a:3975 | 5dda51f150c966e12026400b73f34fad | 3975 | 6' 3" | 185 lbs | 32 | PG | 30 | $40,231,758 | 03/14/88 | https://a.espncdn.com/i/headshots/nba/players/... | Stephen Curry | 10 | Davidson | Akron, OH | NaN |
In the DataFrame above, each parameter (such as 'age' or 'salary') is organized in a column and each player is a row. This makes the data much easier to read and understand. Missing pieces of data are also filled in with null values: for example, Chasson Randle's salary information is missing from the website, so 'NaN' is automatically placed in the DataFrame.
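The NaN-filling behavior can be reproduced on a tiny example. The sketch below builds a two-player mock dictionary, using ages and the salary from the table above, in which one player has no salary key. The names mock_roster and mock_df are introduced only for this illustration.

```python
import pandas as pd

# Two players from the Warriors table; the second has no salary key,
# mirroring the entry whose salary is missing from the website.
mock_roster = {
    "Stephen Curry": {"age": 32, "salary": "$40,231,758"},
    "Chasson Randle": {"age": 27},
}

# orient="index" turns each outer key into a row and each inner key into a column.
mock_df = pd.DataFrame.from_dict(mock_roster, orient="index")
print(mock_df)
print(mock_df["salary"].isna())
```

Calling isna() on a column is a quick way to list which players are missing a given field before computing statistics on it.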
DataFrames allow us to quickly make calculations, sort players based on their stats, and compare stats between teams. To make a DataFrame containing data from all the teams, we will loop through each team in all_players, construct DataFrames, label them with a team column, and aggregate them into a single DataFrame called all_players_df.
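team_dfs = []
# Loop through each team, create a pandas DataFrame, and collect it in a list.
for team in all_players.keys():
    team_df = pd.DataFrame.from_dict(all_players[team], orient = "index")
    team_df['team'] = team
    team_dfs.append(team_df)
# DataFrame.append was removed in pandas 2.0, so combine the pieces with pd.concat.
# sort=True sorts the combined columns, matching the alphabetical order in the output below.
all_players_df = pd.concat(team_dfs, sort=True)
all_players_df.head(5)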
player | age | birthDate | birthPlace | college | experience | guid | hand | headshot | height | id | jersey | lastName | position | salary | team | uid | weight |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Brad Wanamaker | 30 | 07/25/89 | Philadelphia, PA | Pittsburgh | 1 | 5aad35bbbb760e3958107639266768ae | NaN | https://a.espncdn.com/i/headshots/nba/players/... | 6' 3" | 6507 | 9 | Brad Wanamaker | PG | $1,445,697 | boston-celtics | s:40~l:46~a:6507 | 210 lbs |
Carsen Edwards | 22 | 03/12/98 | Houston, TX | Purdue | 0 | 4b8ebdfd01221567925035c1e0d0c337 | NaN | https://a.espncdn.com/i/headshots/nba/players/... | 5' 11" | 4066407 | 4 | Carsen Edwards | PG | $1,228,026 | boston-celtics | s:40~l:46~a:4066407 | 200 lbs |
Daniel Theis | 27 | 04/04/92 | Germany | NaN | 2 | ce75206c087f83ace6f9a8e3efbd9671 | NaN | https://a.espncdn.com/i/headshots/nba/players/... | 6' 8" | 2451037 | 27 | Daniel Theis | C | $5,000,000 | boston-celtics | s:40~l:46~a:2451037 | 245 lbs |
Enes Kanter | 27 | 05/20/92 | Switzerland | Kentucky | 8 | 1e039b407b3daa6eeac69432aa6413fd | NaN | https://a.espncdn.com/i/headshots/nba/players/... | 6' 10" | 6447 | 11 | Enes Kanter | C | $4,767,000 | boston-celtics | s:40~l:46~a:6447 | 250 lbs |
Gordon Hayward | 30 | 03/23/90 | Indianapolis, IN | Butler | 9 | 56f675cb8f40a5aaee5f5747ec9099c5 | NaN | https://a.espncdn.com/i/headshots/nba/players/... | 6' 7" | 4249 | 20 | Gordon Hayward | SF | $32,700,690 | boston-celtics | s:40~l:46~a:4249 | 225 lbs |
Now, all_players_df is a DataFrame with all the players in the NBA categorized by team. It contains player information such as age, salary, height, weight, etc. I'll export this data to a csv file, in case you want to read it in and play around with it yourself.
all_players_df.to_csv("NBA_roster_info_all_players_mar2020.csv")
We also want to scrape data corresponding to the performance of each player, in terms of points per game, field goal percentage, rebounds per game, etc. Our goal is to append this information to all_players_df so that we can compare player performance with traits such as height, salary, etc. We can find performance stats at each player's personal page on ESPN:
We'll want to extract the career stats in the bottom row, which can be found in the highlighted section of the source code below:
In order to extract the information above for each player in our DataFrame, we can construct URLs for player stats pages using the id column. Fortunately, the URL is standardized and very easy to construct. For example, using the id value of 3975 for Stephen Curry, the URL to open would be: https://www.espn.com/nba/player/stats/_/id/3975. Below is an example of extracting his career stats using regexes:
url = "https://www.espn.com/nba/player/stats/_/id/3975"
f = urllib.request.urlopen(url)
sleep(0.3)
player_source = f.read().decode('utf-8')
# extract career stats using this regex
stats_regex = ('\[\"Career\",\"\",(.*?)\]\},\{\"ttl\"\:\"Regular Season Totals\"')
career_info = re.findall(stats_regex, player_source)
print(career_info)
['"699","693","34.3","8.1-17.1","47.6","3.6-8.2","43.5","3.7-4.0","90.6","0.7","3.8","4.5","6.6","0.2","1.7","2.5","3.1","23.5"']
We observe that some of the stats are compound and contain non-numerical symbols such as "-". In the example above, the value "3.7-4.0" is for the column "FT", which stands for "Free Throws Made-Attempted Per Game". We should split this up into two categories, "Free Throws Made (FTM)" and "Free Throws Attempted (FTA)", and do the same for field goals and 3-pointers. To do so, we can split each string on "-" and then un-nest the resulting list. We also need to convert the strings to floating point values.
from itertools import chain
career_info = career_info[0].replace("\"", "").split(",")
career_info = list(chain.from_iterable([i.split("-") for i in career_info]))
career_info = list(map(float,career_info))
print(career_info)
[699.0, 693.0, 34.3, 8.1, 17.1, 47.6, 3.6, 8.2, 43.5, 3.7, 4.0, 90.6, 0.7, 3.8, 4.5, 6.6, 0.2, 1.7, 2.5, 3.1, 23.5]
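The same steps can be wrapped in a small helper that also attaches a label to each value, which makes it easier to confirm that the made/attempted pairs ended up in the right columns. This is a sketch only: parse_career_stats and STAT_COLUMNS are names introduced here for illustration, and the helper assumes every value is non-negative, since a leading minus sign would also be split on "-".

```python
from itertools import chain

# Column labels, in the order the values appear after splitting.
STAT_COLUMNS = ["GP", "GS", "MIN", "FGM", "FGA", "FG%", "3PTM", "3PTA", "3P%",
                "FTM", "FTA", "FT%", "OR", "DR", "REB", "AST", "BLK", "STL",
                "PF", "TO", "PTS"]

def parse_career_stats(raw):
    """Turn the quoted, comma-separated career string into a {label: float} dict."""
    fields = raw.replace('"', '').split(',')
    # Split made-attempted pairs such as "3.7-4.0" into two values.
    values = [float(v) for v in chain.from_iterable(f.split('-') for f in fields)]
    if len(values) != len(STAT_COLUMNS):
        raise ValueError("expected %d values, got %d" % (len(STAT_COLUMNS), len(values)))
    return dict(zip(STAT_COLUMNS, values))

# The career string extracted for Stephen Curry above.
curry_raw = ('"699","693","34.3","8.1-17.1","47.6","3.6-8.2","43.5","3.7-4.0",'
             '"90.6","0.7","3.8","4.5","6.6","0.2","1.7","2.5","3.1","23.5"')
curry = parse_career_stats(curry_raw)
print(curry["FTM"], curry["FTA"], curry["PTS"])
```

The length check raises a ValueError if a page ever returns a different number of fields, rather than silently assigning values to the wrong labels.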
Now we can loop through each player in all_players_df, open their stats webpage, extract their career stats, and store the stats in a separate data frame called career_stats_df using the code below:
stat_columns = ["GP","GS","MIN","FGM", "FGA","FG%","3PTM","3PTA","3P%","FTM","FTA","FT%","OR","DR","REB","AST","BLK","STL","PF","TO","PTS"]
# Regex that extracts the career stats row (a raw string keeps the backslashes intact).
stats_regex = r'\["Career","",(.*?)\]\},\{"ttl":"Regular Season Totals"'
career_rows = []
for player_index in all_players_df.index:
    url = "https://www.espn.com/nba/player/stats/_/id/" + str(all_players_df.loc[player_index]['id'])
    f = urllib.request.urlopen(url)
    sleep(0.3)
    player_source = f.read().decode('utf-8')
    career_info = re.findall(stats_regex, player_source)
    try:
        # Convert the stats to a list of floats and store them as a row named after the player.
        career_info = career_info[0].replace("\"", "").split(",")
        career_info = list(chain.from_iterable([i.split("-") for i in career_info]))
        career_info = list(map(float, career_info))
        career_rows.append(pd.Series(career_info, index=stat_columns, name=player_index))
    except (IndexError, ValueError):
        # If no career stats were returned, the player was a rookie with no games played.
        print(player_index + " has no info, ", end = "")
# DataFrame.append was removed in pandas 2.0, so build the DataFrame from the collected rows.
career_stats_df = pd.DataFrame(career_rows, columns=stat_columns)
Some player webpages did not have career stats, which I found corresponded to rookies who had not played any games. This raised an error in the loop, so I used a try/except clause to skip those players and continue extracting stats for the remaining players. The stats are stored in the object career_stats_df:
career_stats_df.head(5)
player | GP | GS | MIN | FGM | FGA | FG% | 3PTM | 3PTA | 3P% | FTM | ... | FT% | OR | DR | REB | AST | BLK | STL | PF | TO | PTS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Brad Wanamaker | 99.0 | 1.0 | 15.7 | 1.9 | 4.3 | 44.0 | 0.5 | 1.3 | 38.0 | 1.3 | ... | 91.7 | 0.2 | 1.4 | 1.7 | 2.2 | 0.1 | 0.6 | 1.6 | 0.9 | 5.6 |
Carsen Edwards | 35.0 | 0.0 | 9.0 | 1.1 | 3.2 | 32.7 | 0.6 | 1.9 | 30.9 | 0.3 | ... | 84.6 | 0.2 | 1.1 | 1.2 | 0.6 | 0.1 | 0.3 | 0.9 | 0.4 | 3.0 |
Daniel Theis | 187.0 | 62.0 | 17.2 | 2.6 | 4.7 | 55.4 | 0.4 | 1.1 | 34.0 | 1.1 | ... | 75.3 | 1.6 | 3.1 | 4.7 | 1.2 | 0.9 | 0.5 | 2.8 | 0.7 | 6.7 |
Enes Kanter | 634.0 | 222.0 | 21.8 | 4.8 | 8.8 | 54.2 | 0.1 | 0.2 | 28.7 | 2.0 | ... | 77.6 | 2.9 | 4.7 | 7.6 | 0.9 | 0.5 | 0.4 | 2.2 | 1.5 | 11.6 |
Gordon Hayward | 634.0 | 472.0 | 30.8 | 5.2 | 11.6 | 45.1 | 1.3 | 3.6 | 36.6 | 3.5 | ... | 82.2 | 0.7 | 3.6 | 4.4 | 3.5 | 0.4 | 1.0 | 1.7 | 2.0 | 15.3 |
5 rows × 21 columns
The stats for each player are now organized in a neat DataFrame. Here is a legend for what each of the abbreviations means:
- GP: Games Played
- GS: Games Started
- MIN: Minutes Per Game
- FGM: Field Goals Made Per Game
- FGA: Field Goals Attempted Per Game
- FG%: Field Goal Percentage
- 3PTM: 3-Point Field Goals Made Per Game
- 3PTA: 3-Point Field Goals Attempted Per Game
- 3P%: 3-Point Field Goal Percentage
- FTM: Free Throws Made Per Game
- FTA: Free Throws Attempted Per Game
- FT%: Free Throw Percentage
- OR: Offensive Rebounds Per Game
- DR: Defensive Rebounds Per Game
- REB: Rebounds Per Game
- AST: Assists Per Game
- BLK: Blocks Per Game
- STL: Steals Per Game
- PF: Fouls Per Game
- TO: Turnovers Per Game
- PTS: Points Per Game
I'll also export these stats to a csv file:
career_stats_df.to_csv("NBA_player_stats_all_mar2020.csv")
We will now join career_stats_df with all_players_df, which will merge the content from both data frames based on rows that have the same index (player name). Players in all_players_df that are not included in career_stats_df will have NaN values for the joined columns.
all_stats_df = all_players_df.join(career_stats_df)
all_stats_df.head(5)
player | age | birthDate | birthPlace | college | experience | guid | hand | headshot | height | id | ... | FT% | OR | DR | REB | AST | BLK | STL | PF | TO | PTS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Brad Wanamaker | 30 | 07/25/89 | Philadelphia, PA | Pittsburgh | 1 | 5aad35bbbb760e3958107639266768ae | NaN | https://a.espncdn.com/i/headshots/nba/players/... | 6' 3" | 6507 | ... | 91.7 | 0.2 | 1.4 | 1.7 | 2.2 | 0.1 | 0.6 | 1.6 | 0.9 | 5.6 |
Carsen Edwards | 22 | 03/12/98 | Houston, TX | Purdue | 0 | 4b8ebdfd01221567925035c1e0d0c337 | NaN | https://a.espncdn.com/i/headshots/nba/players/... | 5' 11" | 4066407 | ... | 84.6 | 0.2 | 1.1 | 1.2 | 0.6 | 0.1 | 0.3 | 0.9 | 0.4 | 3.0 |
Daniel Theis | 27 | 04/04/92 | Germany | NaN | 2 | ce75206c087f83ace6f9a8e3efbd9671 | NaN | https://a.espncdn.com/i/headshots/nba/players/... | 6' 8" | 2451037 | ... | 75.3 | 1.6 | 3.1 | 4.7 | 1.2 | 0.9 | 0.5 | 2.8 | 0.7 | 6.7 |
Enes Kanter | 27 | 05/20/92 | Switzerland | Kentucky | 8 | 1e039b407b3daa6eeac69432aa6413fd | NaN | https://a.espncdn.com/i/headshots/nba/players/... | 6' 10" | 6447 | ... | 77.6 | 2.9 | 4.7 | 7.6 | 0.9 | 0.5 | 0.4 | 2.2 | 1.5 | 11.6 |
Gordon Hayward | 30 | 03/23/90 | Indianapolis, IN | Butler | 9 | 56f675cb8f40a5aaee5f5747ec9099c5 | NaN | https://a.espncdn.com/i/headshots/nba/players/... | 6' 7" | 4249 | ... | 82.2 | 0.7 | 3.6 | 4.4 | 3.5 | 0.4 | 1.0 | 1.7 | 2.0 | 15.3 |
5 rows × 38 columns
The performance stats have been added as columns on the right side of the DataFrame.
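The join behavior can be seen on a two-row mock. In the sketch below, info_df and stats_df are stand-ins for all_players_df and career_stats_df. The first player's values are copied from the tables above, while "Hypothetical Rookie" is a made-up entry with no career stats.

```python
import pandas as pd

# Mock roster info, indexed by player name.
info_df = pd.DataFrame(
    {"age": [30, 22], "team": ["boston-celtics", "boston-celtics"]},
    index=["Brad Wanamaker", "Hypothetical Rookie"],
)

# Mock career stats: only one of the two players has an entry.
stats_df = pd.DataFrame({"PTS": [5.6]}, index=["Brad Wanamaker"])

# DataFrame.join performs a left join on the index by default,
# so every row of info_df is kept and unmatched rows get NaN.
joined = info_df.join(stats_df)
print(joined)
```

Because the join key is the player name, it relies on names being unique across the league. Joining on the id column would be a more robust alternative, although that is not what the analysis here does.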
We notice that some of the columns which should contain numerical data, such as salary, height, and weight, are instead considered strings. This is because they contain non-numerical characters (such as '$' and ',' for salary). To be able to compute statistics on these columns, we need to convert them to numeric values.
We can convert salaries to numeric values by removing all non-numerical characters and converting to int using a list comprehension:
# before converting
all_stats_df['salary'].head(3)
Brad Wanamaker $1,445,697
Carsen Edwards $1,228,026
Daniel Theis $5,000,000
Name: salary, dtype: object
all_stats_df['salary']=[int(re.sub(r'[^\d.]+', '', s)) if isinstance(s, str) else s for s in all_stats_df['salary'].values]
# after converting
all_stats_df['salary'].head(3)
Brad Wanamaker 1445697.0
Carsen Edwards 1228026.0
Daniel Theis 5000000.0
Name: salary, dtype: float64
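Note that the converted column is float64 rather than int: any missing salary stays NaN, which is a float, so pandas stores the whole column as floats. The sketch below shows the same conversion as a standalone helper using only the standard library. parse_salary is a name introduced here for illustration, and unlike the list comprehension above it returns NaN for every non-string input.

```python
import math
import re

def parse_salary(value):
    """Convert a string such as '$1,445,697' to an int; return NaN for anything else."""
    if not isinstance(value, str):
        return math.nan
    digits = re.sub(r'[^\d]', '', value)
    return int(digits) if digits else math.nan

salaries = [parse_salary(s) for s in ['$1,445,697', '$5,000,000', float('nan')]]
print(salaries)
```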
Height is also provided in a non-numeric form, in feet plus inches (e.g. 6' 5"). We should convert this to a numeric form so that statistics can be calculated. To do so, we will write a small function, convert_height, that converts a string of feet plus inches into a numeric value of total inches.
def convert_height(height):
    split_height = height.split(" ")
    feet = float(split_height[0].replace("\'",""))
    inches = float(split_height[1].replace("\"",""))
    return (feet*12 + inches)
# before conversion
all_stats_df['height'].head(3)
Brad Wanamaker 6' 3"
Carsen Edwards 5' 11"
Daniel Theis 6' 8"
Name: height, dtype: object
all_stats_df['height'] = [convert_height(x) for x in all_stats_df['height']]
# after conversion
all_stats_df['height'].head(3)
Brad Wanamaker 75.0
Carsen Edwards 71.0
Daniel Theis 80.0
Name: height, dtype: float64
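convert_height assumes every entry is a string with a single space between feet and inches, so a missing height would raise an error. A more defensive variant could use a regular expression and return None when the input does not match. The sketch below is one way to do that; height_to_inches and HEIGHT_PATTERN are hypothetical names, not part of the original script.

```python
import re

# Feet, an apostrophe, optional whitespace, inches, and a double quote.
HEIGHT_PATTERN = re.compile(r"^\s*(\d+)'\s*(\d+)\"\s*$")

def height_to_inches(height):
    """Convert a string like 6' 3" to total inches; return None if it does not match."""
    if not isinstance(height, str):
        return None
    match = HEIGHT_PATTERN.match(height)
    if match is None:
        return None
    feet, inches = match.groups()
    return int(feet) * 12 + int(inches)

for raw in ["6' 3\"", "5' 11\"", "6' 8\"", None]:
    print(raw, "->", height_to_inches(raw))
```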
Weight is also a non-numerical field, because of the units listed (e.g. 'weight': '230 lbs'). We will simply strip off the units for each entry by splitting the string in half with split(" ") and taking the left side of the split.
# before conversion
all_stats_df['weight'].head(3)
Brad Wanamaker 210 lbs
Carsen Edwards 200 lbs
Daniel Theis 245 lbs
Name: weight, dtype: object
all_stats_df['weight'] = [float(x.split(" ")[0]) for x in all_stats_df['weight']]
# after conversion
all_stats_df['weight'].head(3)
Brad Wanamaker 210.0
Carsen Edwards 200.0
Daniel Theis 245.0
Name: weight, dtype: float64
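As an alternative to list comprehensions, pandas string methods can do this kind of cleaning on a whole column at once and leave missing values as NaN. The sketch below is illustrative only: the raw DataFrame is a mock, with the first and last rows copied from the outputs above and a made-up middle row whose salary is missing.

```python
import pandas as pd

# Mock columns in the same raw format as the scraped data.
raw = pd.DataFrame(
    {"salary": ["$1,445,697", None, "$5,000,000"],
     "weight": ["210 lbs", "200 lbs", "245 lbs"]},
    index=["Brad Wanamaker", "Hypothetical Player", "Daniel Theis"],
)

# Strip every non-digit character, then let pandas convert the strings to numbers.
salary = pd.to_numeric(raw["salary"].str.replace(r"[^\d]", "", regex=True))
# Keep only the number in front of the unit.
weight = pd.to_numeric(raw["weight"].str.split(" ").str[0])

print(salary)
print(weight)
```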
This should be the last of the values we have to convert to numeric. Now, we have a cleaned-up and joined dataset! I'll save the data to a csv file.
all_stats_df.to_csv("NBA_player_info_and_stats_joined_mar2020.csv")
If you want to read in the data at a later time, you can use read_csv() like so:
all_stats_df = pd.read_csv("NBA_player_info_and_stats_joined_mar2020.csv", index_col=0)
We can use the data we just compiled to calculate some statistics. Let's start by calculating average stats per team, using groupby() with mean() in pandas.
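# Calculate per-team means of the numeric columns and remove the irrelevant id and jersey columns.
# numeric_only=True and drop(columns=...) are needed on pandas 2.0 and later.
mean_df = all_stats_df.groupby('team').mean(numeric_only=True).drop(columns=['id','jersey'], errors='ignore')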
mean_df
team | age | experience | height | salary | weight | GP | GS | MIN | FGM | FGA | ... | FT% | OR | DR | REB | AST | BLK | STL | PF | TO | PTS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
atlanta-hawks | 25.538462 | 4.307692 | 79.000000 | 5.608001e+06 | 218.615385 | 314.153846 | 200.076923 | 22.215385 | 3.815385 | 8.146154 | ... | 74.423077 | 1.223077 | 3.184615 | 4.407692 | 2.323077 | 0.569231 | 0.669231 | 2.146154 | 1.430769 | 10.300000 |
boston-celtics | 25.000000 | 2.500000 | 77.687500 | 7.228917e+06 | 224.062500 | 218.000000 | 121.687500 | 17.718750 | 2.787500 | 6.212500 | ... | 76.050000 | 0.712500 | 2.356250 | 3.081250 | 1.562500 | 0.356250 | 0.612500 | 1.631250 | 1.006250 | 7.662500 |
brooklyn-nets | 25.882353 | 4.529412 | 78.529412 | 7.928121e+06 | 217.352941 | 310.823529 | 214.000000 | 20.905882 | 3.482353 | 7.652941 | ... | 71.711765 | 0.864706 | 2.970588 | 3.835294 | 2.047059 | 0.476471 | 0.641176 | 1.688235 | 1.288235 | 9.582353 |
charlotte-hornets | 24.266667 | 2.733333 | 78.733333 | 6.772335e+06 | 216.066667 | 203.400000 | 103.466667 | 19.566667 | 2.626667 | 5.973333 | ... | 75.086667 | 0.933333 | 2.773333 | 3.680000 | 1.506667 | 0.440000 | 0.626667 | 1.746667 | 1.000000 | 7.013333 |
chicago-bulls | 24.666667 | 2.533333 | 79.000000 | 5.392607e+06 | 217.266667 | 199.400000 | 105.666667 | 20.033333 | 3.100000 | 6.866667 | ... | 73.613333 | 0.793333 | 2.580000 | 3.360000 | 1.826667 | 0.326667 | 0.726667 | 1.773333 | 1.020000 | 8.180000 |
cleveland-cavaliers | 24.866667 | 2.933333 | 78.333333 | 8.744085e+06 | 224.333333 | 253.615385 | 165.538462 | 20.376923 | 3.184615 | 6.953846 | ... | 62.484615 | 1.246154 | 3.069231 | 4.300000 | 1.638462 | 0.353846 | 0.546154 | 1.692308 | 1.200000 | 8.407692 |
dallas-mavericks | 26.500000 | 3.250000 | 79.250000 | 7.593353e+06 | 220.333333 | 237.750000 | 105.666667 | 19.083333 | 3.233333 | 7.058333 | ... | 70.833333 | 0.791667 | 2.725000 | 3.533333 | 1.808333 | 0.466667 | 0.575000 | 1.566667 | 1.033333 | 8.933333 |
denver-nuggets | 25.928571 | 4.285714 | 79.285714 | 8.798127e+06 | 224.142857 | 347.153846 | 185.307692 | 20.976923 | 3.423077 | 7.330769 | ... | 75.776923 | 1.061538 | 2.946154 | 3.984615 | 2.030769 | 0.469231 | 0.700000 | 1.846154 | 1.107692 | 9.092308 |
detroit-pistons | 25.000000 | 3.411765 | 78.000000 | 6.505785e+06 | 208.529412 | 235.058824 | 132.294118 | 18.876471 | 2.911765 | 6.652941 | ... | 64.917647 | 0.711765 | 2.335294 | 3.029412 | 1.723529 | 0.341176 | 0.517647 | 1.635294 | 1.000000 | 7.900000 |
golden-state-warriors | 25.076923 | 3.153846 | 77.538462 | 1.173546e+07 | 209.153846 | 244.692308 | 187.692308 | 24.076923 | 4.023077 | 9.100000 | ... | 77.800000 | 0.792308 | 2.853846 | 3.661538 | 2.400000 | 0.384615 | 0.784615 | 2.115385 | 1.430769 | 11.123077 |
houston-rockets | 28.800000 | 7.266667 | 77.666667 | 7.617278e+06 | 212.933333 | 494.400000 | 332.266667 | 21.860000 | 3.353333 | 7.773333 | ... | 71.300000 | 0.820000 | 2.766667 | 3.600000 | 2.013333 | 0.406667 | 0.773333 | 1.953333 | 1.406667 | 9.706667 |
indiana-pacers | 25.250000 | 3.500000 | 78.500000 | 7.942772e+06 | 214.500000 | 251.166667 | 116.250000 | 19.841667 | 3.241667 | 7.150000 | ... | 76.050000 | 0.666667 | 2.725000 | 3.416667 | 1.716667 | 0.491667 | 0.625000 | 1.750000 | 1.066667 | 8.633333 |
la-clippers | 27.250000 | 5.312500 | 78.437500 | 7.520664e+06 | 217.500000 | 362.562500 | 185.875000 | 20.318750 | 3.337500 | 7.175000 | ... | 74.587500 | 1.018750 | 2.862500 | 3.893750 | 1.862500 | 0.406250 | 0.662500 | 1.856250 | 1.087500 | 9.112500 |
los-angeles-lakers | 29.133333 | 7.733333 | 78.666667 | 6.905793e+06 | 222.133333 | 579.928571 | 425.071429 | 25.071429 | 4.528571 | 9.514286 | ... | 70.414286 | 1.121429 | 3.707143 | 4.807143 | 2.507143 | 0.678571 | 0.914286 | 1.957143 | 1.614286 | 11.985714 |
memphis-grizzlies | 24.800000 | 2.866667 | 78.866667 | 6.021349e+06 | 219.066667 | 234.857143 | 100.000000 | 20.107143 | 3.200000 | 6.800000 | ... | 73.071429 | 0.935714 | 2.892857 | 3.821429 | 1.950000 | 0.421429 | 0.635714 | 1.692857 | 1.171429 | 8.328571 |
miami-heat | 27.000000 | 4.937500 | 79.000000 | 8.439172e+06 | 222.187500 | 381.266667 | 210.200000 | 22.126667 | 3.326667 | 7.326667 | ... | 71.233333 | 0.840000 | 2.980000 | 3.833333 | 2.013333 | 0.313333 | 0.713333 | 1.840000 | 1.140000 | 9.180000 |
milwaukee-bucks | 29.466667 | 7.666667 | 78.733333 | 7.836516e+06 | 227.200000 | 565.066667 | 371.266667 | 23.466667 | 3.820000 | 8.233333 | ... | 77.946667 | 0.973333 | 3.133333 | 4.120000 | 1.966667 | 0.513333 | 0.760000 | 1.953333 | 1.266667 | 10.413333 |
minnesota-timberwolves | 23.571429 | 2.285714 | 77.857143 | 4.248132e+06 | 216.214286 | 172.500000 | 56.000000 | 16.978571 | 2.442857 | 5.721429 | ... | 74.914286 | 0.657143 | 2.128571 | 2.764286 | 1.457143 | 0.350000 | 0.585714 | 1.507143 | 0.850000 | 6.535714 |
new-orleans-pelicans | 25.200000 | 3.400000 | 77.933333 | 7.031152e+06 | 221.000000 | 244.200000 | 150.066667 | 22.960000 | 3.766667 | 8.046667 | ... | 64.346667 | 1.040000 | 2.973333 | 4.020000 | 2.193333 | 0.506667 | 0.720000 | 2.020000 | 1.440000 | 9.946667 |
new-york-knicks | 25.066667 | 3.666667 | 78.266667 | 6.224969e+06 | 217.933333 | 290.357143 | 137.785714 | 21.492857 | 3.342857 | 7.371429 | ... | 72.535714 | 1.035714 | 2.857143 | 3.914286 | 1.914286 | 0.464286 | 0.657143 | 1.892857 | 1.214286 | 8.757143 |
oklahoma-city-thunder | 25.066667 | 3.866667 | 78.533333 | 9.111964e+06 | 217.000000 | 281.533333 | 187.466667 | 18.613333 | 2.560000 | 5.713333 | ... | 61.446667 | 0.793333 | 2.320000 | 3.106667 | 1.566667 | 0.400000 | 0.680000 | 1.673333 | 0.920000 | 6.973333 |
orlando-magic | 24.833333 | 3.166667 | 79.750000 | 9.398300e+06 | 219.083333 | 234.416667 | 136.916667 | 19.333333 | 3.016667 | 6.750000 | ... | 71.383333 | 0.966667 | 2.866667 | 3.841667 | 1.458333 | 0.525000 | 0.625000 | 1.575000 | 0.900000 | 7.816667 |
philadelphia-76ers | 25.714286 | 3.714286 | 78.642857 | 8.825191e+06 | 218.500000 | 263.214286 | 156.785714 | 20.685714 | 3.492857 | 7.550000 | ... | 67.207143 | 0.857143 | 3.164286 | 4.021429 | 2.107143 | 0.571429 | 0.671429 | 1.835714 | 1.314286 | 9.392857 |
phoenix-suns | 24.142857 | 2.357143 | 78.857143 | 5.896355e+06 | 213.142857 | 191.071429 | 101.357143 | 19.621429 | 3.092857 | 6.892857 | ... | 82.428571 | 0.907143 | 2.678571 | 3.578571 | 2.007143 | 0.314286 | 0.657143 | 1.764286 | 1.157143 | 8.385714 |
portland-trail-blazers | 25.400000 | 4.666667 | 80.000000 | 8.704394e+06 | 226.533333 | 336.066667 | 247.000000 | 20.186667 | 3.626667 | 7.993333 | ... | 70.940000 | 1.026667 | 3.040000 | 4.080000 | 1.640000 | 0.473333 | 0.620000 | 1.913333 | 1.240000 | 9.786667 |
sacramento-kings | 26.153846 | 3.769231 | 78.000000 | 7.316023e+06 | 214.538462 | 298.153846 | 142.615385 | 19.369231 | 3.238462 | 7.061538 | ... | 63.053846 | 0.784615 | 2.530769 | 3.323077 | 1.584615 | 0.353846 | 0.607692 | 1.653846 | 0.992308 | 8.492308 |
san-antonio-spurs | 26.200000 | 4.933333 | 78.600000 | 7.264785e+06 | 219.733333 | 371.866667 | 218.533333 | 18.746667 | 3.120000 | 7.026667 | ... | 66.606667 | 0.740000 | 2.493333 | 3.240000 | 1.753333 | 0.353333 | 0.553333 | 1.393333 | 0.946667 | 8.213333 |
toronto-raptors | 25.937500 | 3.562500 | 78.437500 | 7.590898e+06 | 214.937500 | 268.562500 | 169.625000 | 17.556250 | 2.606250 | 5.893750 | ... | 76.187500 | 0.712500 | 2.487500 | 3.200000 | 1.618750 | 0.450000 | 0.593750 | 1.668750 | 0.912500 | 7.143750 |
utah-jazz | 25.857143 | 3.785714 | 77.928571 | 8.142802e+06 | 219.928571 | 293.071429 | 170.214286 | 18.207143 | 3.057143 | 6.535714 | ... | 57.978571 | 0.800000 | 2.514286 | 3.328571 | 1.757143 | 0.371429 | 0.542857 | 1.457143 | 1.142857 | 8.185714 |
washington-wizards | 25.133333 | 3.333333 | 78.733333 | 7.772450e+06 | 216.733333 | 231.333333 | 109.000000 | 19.233333 | 3.080000 | 6.620000 | ... | 74.406667 | 0.853333 | 2.493333 | 3.346667 | 1.973333 | 0.366667 | 0.606667 | 1.920000 | 1.113333 | 8.360000 |
30 rows × 26 columns
As you can see, the index of the data frame that is returned corresponds to each individual team now, and the mean values are displayed for each of the columns with numerical values. To find the team with the highest averages for a specific stat, we can use the sort_values() function. Let's find the top 5 teams with the highest average salary:
mean_df.sort_values('salary', ascending=False).head(5)
team | age | experience | height | salary | weight | GP | GS | MIN | FGM | FGA | ... | FT% | OR | DR | REB | AST | BLK | STL | PF | TO | PTS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
golden-state-warriors | 25.076923 | 3.153846 | 77.538462 | 1.173546e+07 | 209.153846 | 244.692308 | 187.692308 | 24.076923 | 4.023077 | 9.100000 | ... | 77.800000 | 0.792308 | 2.853846 | 3.661538 | 2.400000 | 0.384615 | 0.784615 | 2.115385 | 1.430769 | 11.123077 |
orlando-magic | 24.833333 | 3.166667 | 79.750000 | 9.398300e+06 | 219.083333 | 234.416667 | 136.916667 | 19.333333 | 3.016667 | 6.750000 | ... | 71.383333 | 0.966667 | 2.866667 | 3.841667 | 1.458333 | 0.525000 | 0.625000 | 1.575000 | 0.900000 | 7.816667 |
oklahoma-city-thunder | 25.066667 | 3.866667 | 78.533333 | 9.111964e+06 | 217.000000 | 281.533333 | 187.466667 | 18.613333 | 2.560000 | 5.713333 | ... | 61.446667 | 0.793333 | 2.320000 | 3.106667 | 1.566667 | 0.400000 | 0.680000 | 1.673333 | 0.920000 | 6.973333 |
philadelphia-76ers | 25.714286 | 3.714286 | 78.642857 | 8.825191e+06 | 218.500000 | 263.214286 | 156.785714 | 20.685714 | 3.492857 | 7.550000 | ... | 67.207143 | 0.857143 | 3.164286 | 4.021429 | 2.107143 | 0.571429 | 0.671429 | 1.835714 | 1.314286 | 9.392857 |
denver-nuggets | 25.928571 | 4.285714 | 79.285714 | 8.798127e+06 | 224.142857 | 347.153846 | 185.307692 | 20.976923 | 3.423077 | 7.330769 | ... | 75.776923 | 1.061538 | 2.946154 | 3.984615 | 2.030769 | 0.469231 | 0.700000 | 1.846154 | 1.107692 | 9.092308 |
5 rows × 26 columns
Looks like the highest average salary is paid by the Golden State Warriors. Similarly, we can find the top 10 highest paid players by sorting all_stats_df on salary, then pulling out the top entries for the 'salary' and 'team' columns:
all_stats_df.sort_values('salary', ascending=False)[['salary','team']].head(10)
 | salary | team |
---|---|---|
Stephen Curry | 40231758.0 | golden-state-warriors |
Russell Westbrook | 38506482.0 | houston-rockets |
Chris Paul | 38506482.0 | oklahoma-city-thunder |
Kevin Durant | 38199000.0 | brooklyn-nets |
James Harden | 38199000.0 | houston-rockets |
John Wall | 38199000.0 | washington-wizards |
LeBron James | 37436858.0 | los-angeles-lakers |
Kyle Lowry | 34996296.0 | toronto-raptors |
Blake Griffin | 34449964.0 | detroit-pistons |
Kemba Walker | 32742000.0 | boston-celtics |
Stephen Curry is the highest paid player in the NBA with a whopping salary of $40,231,758, followed by Russell Westbrook and Chris Paul, who are tied at $38,506,482. We can continue to sift through the data this way for whatever piques our interest. Given how many different variables there are, we can write a small function to make things easier:
def top_n(df, category, n):
    # return the n rows with the largest values in `category`, along with each player's team
    return df.sort_values(category, ascending=False)[[category, 'team']].head(n)
This way, we can quickly identify the top n players for any given category in a DataFrame. Let's cycle through some stats of interest:
top_n(all_stats_df, 'PTS', 5)
 | PTS | team |
---|---|---|
LeBron James | 27.1 | los-angeles-lakers |
Kevin Durant | 27.0 | brooklyn-nets |
James Harden | 25.1 | houston-rockets |
Luka Doncic | 24.4 | dallas-mavericks |
Joel Embiid | 24.1 | philadelphia-76ers |
top_n(all_stats_df, 'REB', 5)
 | REB | team |
---|---|---|
Andre Drummond | 13.8 | cleveland-cavaliers |
Dwight Howard | 12.3 | los-angeles-lakers |
Hassan Whiteside | 11.8 | portland-trail-blazers |
Joel Embiid | 11.5 | philadelphia-76ers |
Kevin Love | 11.1 | cleveland-cavaliers |
top_n(all_stats_df, 'height', 5)
 | height | team |
---|---|---|
Tacko Fall | 89.0 | boston-celtics |
Boban Marjanovic | 88.0 | dallas-mavericks |
Kristaps Porzingis | 87.0 | dallas-mavericks |
Moses Brown | 86.0 | portland-trail-blazers |
Bol Bol | 86.0 | denver-nuggets |
top_n(all_stats_df, 'weight', 5)
 | weight | team |
---|---|---|
Tacko Fall | 311.0 | boston-celtics |
Jusuf Nurkic | 290.0 | portland-trail-blazers |
Boban Marjanovic | 290.0 | dallas-mavericks |
Nikola Jokic | 284.0 | denver-nuggets |
Zion Williamson | 284.0 | new-orleans-pelicans |
Interestingly, Tacko Fall of the Boston Celtics is both the tallest and the heaviest player in the NBA.
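One of the questions posed in the introduction was which player earns the most on each team. A possible way to answer it is to combine groupby() with idxmax(), sketched here on a toy subset built from the salary table above. The sketch assumes player names are unique in the index and that every team has at least one reported salary:

```python
import pandas as pd

# Toy subset using four rows from the top-10 salary table above.
toy_df = pd.DataFrame(
    {
        "salary": [40231758.0, 38506482.0, 38199000.0, 38199000.0],
        "team": [
            "golden-state-warriors",
            "houston-rockets",
            "houston-rockets",
            "brooklyn-nets",
        ],
    },
    index=["Stephen Curry", "Russell Westbrook", "James Harden", "Kevin Durant"],
)

# idxmax() returns the index label (the player name) of the largest salary within each team.
top_earner = toy_df.groupby("team")["salary"].idxmax()

# Look those players up to see salary and team side by side.
print(toy_df.loc[top_earner])
```

On the full dataset, the same two lines applied to all_stats_df should give one row per team.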
To get a high-level overview of how the statistics correlate with one another, we can generate a correlation matrix using corr() and plot it with matplotlib.
corr_matrix = all_stats_df.drop(columns=['id','jersey']).corr(numeric_only=True)
import matplotlib.pyplot as plt
f = plt.figure(figsize=(19, 15))
plt.matshow(corr_matrix, fignum=f.number)
plt.xticks(range(corr_matrix.shape[1]), corr_matrix.columns, fontsize=14, rotation=45, ha = 'left')
plt.yticks(range(corr_matrix.shape[1]), corr_matrix.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16);
We can learn a lot about how different statistics are associated with each other from this matrix, and also identify some interesting trends. For example:
- As expected, we observe that games played, age, and experience are highly correlated with one another. An older player will have played more games and have more years of experience in the NBA.
- As expected, height is most highly correlated with weight (i.e. taller players weigh more). The correlations of height and weight with the other statistics are similar as well.
- Aside from weight, height is highly correlated with offensive rebounds (OR) and blocks (BLK), which also makes sense. A taller player should be able to get these more easily.
- Points per game (PTS) is highly correlated with field goal and free throw attempts, which also makes sense since more shots generally mean more points. Interestingly, the correlation with the percentage made is low.
- One of the highest correlates with salary is points per game, which is one of the more important stats when it comes down to performance.
We can narrow in on correlations of interest by sorting the correlation matrix. Let's try sorting by salary and identifying the top correlates:
corr_matrix.sort_values('salary', ascending=False)['salary'].head(10)
salary 1.000000
PTS 0.712635
FTM 0.707054
GS 0.703869
FGM 0.699154
FGA 0.686631
FTA 0.681934
MIN 0.663697
TO 0.648611
STL 0.602140
Name: salary, dtype: float64
As we suspected, points per game (PTS) is most highly correlated with salary, followed by other point-related stats such as free throws made (FTM). Games started (GS) is also highly correlated with salary, which makes sense since highly-paid players are typically better and will be starters.
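For readers unfamiliar with what corr() reports: by default it computes the Pearson correlation coefficient for each pair of columns, which only measures linear association. The sketch below reproduces one coefficient by hand. The numbers are made up for illustration and are not taken from the scraped data:

```python
import numpy as np
import pandas as pd

# Made-up numbers for illustration only; not the scraped data.
toy = pd.DataFrame(
    {
        "PTS": [5.0, 10.0, 15.0, 20.0, 25.0],
        "salary": [2e6, 4e6, 9e6, 15e6, 30e6],
    }
)

# Pearson r = sum of products of centered values, divided by
# the square root of the product of the two sums of squares.
x = toy["PTS"] - toy["PTS"].mean()
y = toy["salary"] - toy["salary"].mean()
r_manual = (x * y).sum() / np.sqrt((x**2).sum() * (y**2).sum())

# DataFrame.corr() uses method="pearson" by default.
r_pandas = toy.corr().loc["PTS", "salary"]
print(r_manual, r_pandas)
```

The two values should agree up to floating-point error. Note that a high coefficient says nothing about the size of the effect, which is what the regression below tries to estimate.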
If we want to model how much more a player costs based on increases in points per game, an easy way is to use linear regression (OLS). To do so, we will use scikit-learn. The LinearRegression() estimator cannot handle null values, so we will first remove players that don't have reported salaries or PTS values:
from sklearn.linear_model import LinearRegression
# remove rows with null values for regression
reg_df = all_stats_df[['salary', 'PTS']].dropna()
Then, we will fit the model with the predictor variable (X) being PTS and the dependent variable (Y) being salary. We will set fit_intercept=False, which forces the fitted line through the origin, since players cannot be paid less than $0.00 or score fewer than 0 PTS:
X = reg_df['PTS'].values.reshape(-1,1)
Y = reg_df['salary'].values.reshape(-1,1)
reg = LinearRegression(fit_intercept=False).fit(X,Y)
y_pred = reg.predict(X)
plt.figure(figsize=(12, 6))
plt.scatter(X, Y)
plt.plot(X, y_pred, color='red')
plt.xlabel("Points per game (Career)")
plt.ylabel("Salary (2020)")
plt.title('Salary vs PTS - simple linear regression', fontsize=16);
Consistent with the positive correlation we calculated previously, a regression line with a positive slope is fitted to the data. We can extract the slope of the line by getting the coefficient using .coef_:
print(reg.coef_)
[[947619.16030932]]
This was only meant to be a demonstration of what could be done with the data that we scraped. Better models can definitely be generated, especially given the nature of the data. Just by looking at the fit above, we can see that the residuals will be heteroskedastic (their spread is not constant across the range of PTS). There are also a small number of players with high career points per game but low salaries in the bottom right corner of the plot, which are skewing the regression line.
With these caveats in mind, the value of the slope is ~947619.16. This suggests that for every additional point per game in a player's career average, the predicted salary increases by $947,619.16! Looks like making that free throw really does count.
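To get a feel for what fit_intercept=False does to the slope, here is a small sketch on synthetic data (generated with a random seed, not the scraped salaries). It checks the through-origin slope against its closed form, sum(x*y) / sum(x*x), and compares it with a fit that is allowed an intercept. The variable names and the data-generating numbers are assumptions made for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration; not the scraped salaries.
rng = np.random.default_rng(0)
pts = rng.uniform(1, 30, size=200)
salary = 1_000_000 * pts - 2_000_000 + rng.normal(0, 1_000_000, size=200)
X = pts.reshape(-1, 1)

# Through-origin fit, as in the regression above.
origin_fit = LinearRegression(fit_intercept=False).fit(X, salary)

# With no intercept, the least-squares slope has a closed form.
slope_closed_form = (pts * salary).sum() / (pts**2).sum()

# Fit with an intercept for comparison.
intercept_fit = LinearRegression().fit(X, salary)

print(origin_fit.coef_[0], slope_closed_form)
print(intercept_fit.coef_[0], intercept_fit.intercept_)
# score() returns R^2 measured on the same data for both models.
print(origin_fit.score(X, salary), intercept_fit.score(X, salary))
```

On the training data, the fit with an intercept can never have a lower R^2 than the through-origin fit, so comparing the two on the real data would be a quick way to see how much the zero-intercept assumption costs.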
Here, I used Python to scrape ESPN for statistics on all the players in the NBA using the urllib and re packages. Then, I used pandas and scikit-learn to organize the data and calculate some summary statistics.
I hope what you've learned from this project will help you out on your own web scraping quests. The techniques that I've outlined here should be broadly applicable for other websites. In general, webpages that link to subpages within the same site will construct their links in some sort of standardized pattern. If so, you can construct URLs for the subpages and loop through them as we have done here. Next time you find yourself flipping through a website and copy-pasting, consider trying to automate the process using Python!