NBA Analytics Using PySpark. Win ratio for back-to-back games, mean… | by Jin Cui | Apr, 2023


Photo by Emanuel Ekström on Unsplash

Just over a week ago I was watching an NBA game between the Milwaukee Bucks and the Boston Celtics. This was a match-up between the top 2 teams in the league, which many considered to be a prequel to the Eastern Conference finals. Being a big basketball and NBA fan myself, the game turned out rather disappointing as the Milwaukee Bucks lost to the Boston Celtics 140–99, a rare blow-out defeat for Milwaukee which holds the best (regular season) record in the 2022–2023 season.

Although this was out of character for Milwaukee, especially given it was a blow-out loss at home, the game's commentator alerted me to the fact that the Bucks were actually playing a back-to-back game, that is, a game played the day right after their previous game (in this instance, an away game at Indiana the previous day). In other words, fatigue may have played a role in their loss: back-to-back games are physically demanding for athletes, and the effect may have been exacerbated by the travel between games (from Indiana back to Milwaukee).

Looking at team schedules, NBA teams play a number of back-to-back games over the 82 games of a regular season. Have you ever wondered how teams fare in these games, and whether this changes when teams play on away or home courts? This article demonstrates one way of deriving these stats, which are typically not available in the public domain, using PySpark, a ready-to-use interface for Apache Spark in Python.

To determine the win ratio for back-to-back games, we’ll need a history of back-to-back games played by each NBA team as well as their results. Although these stats are available on the official NBA website and other community sites, they are not licensed for commercial use and as such, I have simulated a synthetic dataset which contains the following fields.

  • Date when the game was played
  • Team name for the home team
  • Team name for the away team, and
  • Score of the game, and corresponding outcome by home and away team

The table below shows a snippet of the synthetic dataset. You should be able to verify against the official NBA game schedule that these were not actual games.

Table 1: Synthetic game data. Table by author.

This section provides a step-by-step guide in Python on how to transform the above dataset into one which identifies whether a game played by a team is a back-to-back game and subsequently calculates the win ratio for these games for each team.

Step 1: Load packages and data

#Load required Python packages

import numpy as np
import pandas as pd

!pip install pyspark #Install PySpark
import pyspark
from pyspark.sql.window import Window #For use of Window Function
from pyspark.sql import functions as F #For use of Window Function
from pyspark.sql import SparkSession #For initiating PySpark API in Python

#Read in game.csv

path_games = "/directory/game_synthetic.csv" #Replace with your own directory and data
data_raw_games = pd.read_csv(path_games, encoding = 'ISO-8859-1')

Step 2: Format and create Date columns

#Format the 'game_date' column (if it was defaulted to string at ingestion)
#into Date format

data_raw_games['GAME_DATE'] = pd.to_datetime(data_raw_games['game_date'], \
format='%Y-%m-%d')

#Create a 'GAME_DATE_minus_ONE' column for each row

data_raw_games['GAME_DATE_minus_ONE'] = pd.DatetimeIndex(data_raw_games['GAME_DATE']) \
+ pd.DateOffset(-1)

The ‘GAME_DATE_minus_ONE’ column created above represents the previous calendar date for each game in the dataset. This is discussed in more detail later (in Step 4) and is used for identifying whether a game is a back-to-back game.
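Before moving on, here is a minimal pandas sketch of the same idea on a tiny made-up schedule (the team ID and dates are assumptions): look up each team's previous game date, then check whether the gap is exactly one day. Step 4 performs the equivalent lookup at scale with PySpark's lag function.

```python
import pandas as pd

# Tiny hypothetical schedule for one team (dates are made up)
games = pd.DataFrame({
    'TEAM_ID': [1, 1, 1],
    'GAME_DATE': pd.to_datetime(['2021-10-21', '2021-10-23', '2021-10-24']),
})

games = games.sort_values(['TEAM_ID', 'GAME_DATE'])

# Date of each team's previous game (pandas equivalent of a lag over a window)
games['GAME_DATE_PREV_GAME'] = games.groupby('TEAM_ID')['GAME_DATE'].shift(1)

# A game is back-to-back when the previous game was exactly one day earlier
games['Back_to_Back_FLAG'] = (
    games['GAME_DATE'] - games['GAME_DATE_PREV_GAME'] == pd.Timedelta(days=1)
).astype(int)
```

On this toy schedule, only the 24/10/2021 game is flagged, since it follows the 23/10/2021 game by one day.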

Step 3: Split dataset by team

As each row of the dataset is at a game level (i.e. it shows the result of a game between two teams), splitting is required to represent the result at a team level (i.e. splitting each row into two which represent the outcome of a game for each team). This can be achieved using the Python code below.

#Create two dataframes, one for results of home teams and 
#one for results of away teams, and merge at the end

data_games_frame_1 = data_raw_games.sort_values(['game_id'])
data_games_frame_2 = data_raw_games.sort_values(['game_id'])

data_games_frame_1['TEAM_ID'] = data_games_frame_1['team_id_home']
data_games_frame_2['TEAM_ID'] = data_games_frame_2['team_id_away']

data_games_frame_1['WIN_FLAG'] = (data_games_frame_1['win_loss_home'] == 'W')
data_games_frame_2['WIN_FLAG'] = (data_games_frame_2['win_loss_home'] != 'W')

data_games_frame_1['TEAM_NAME'] = data_games_frame_1['team_name_home']
data_games_frame_2['TEAM_NAME'] = data_games_frame_2['team_name_away']

data_games_frame_1['TEAM_NAME_OPP'] = data_games_frame_1['team_name_away']
data_games_frame_2['TEAM_NAME_OPP'] = data_games_frame_2['team_name_home']

data_games_frame_1['HOME_FLAG'] = 'Home'
data_games_frame_2['HOME_FLAG'] = 'Away'

#Merge the two dataframes above
data_games = pd.concat([data_games_frame_1, data_games_frame_2], axis = 0).drop(['team_id_home', 'team_id_away'], axis = 1)\
.sort_values(['game_id']).reset_index(drop = True)

Step 4: Return for each game the date when the team played its previous game

This is when PySpark comes in handy. In particular, we’ll be leveraging the lag function under the Window Functions in PySpark. In practice, as demonstrated in Table 2 below, the lag function provides access to an offset value of a column of choice. In this instance, it returns the date when Atlanta Hawks played its previous game relative to a current game, over a Window which shows a view of all the games played by the Atlanta Hawks.

For example, in the row of index 1, the Atlanta Hawks played the Cleveland Cavaliers on 23/10/2021 (the "current game") as shown in the ‘GAME_DATE’ column, and their previous game was against the Dallas Mavericks on 21/10/2021. The lag function returns that earlier date in the same row as the current game, in the ‘GAME_DATE_PREV_GAME’ column.

Table 2: Lag function demonstration. Table by author

When the ‘GAME_DATE_PREV_GAME’ column returned above equals the ‘GAME_DATE_minus_ONE’ column created under Step 2, the game is a back-to-back game (i.e. the date of the last game played is the calendar day immediately before the current game). This is the case for the rows of index 8 and 14 in Table 2 above, as the Atlanta Hawks played the Utah Jazz on 4/11/2021, one day after they played the Brooklyn Nets on 3/11/2021.

The Python code for returning the ‘GAME_DATE_PREV_GAME’ column, as well as flagging back-to-back games for all teams, is provided below.

#Select relevant columns from the dataset

col_spark = [

'GAME_DATE'
,'GAME_DATE_minus_ONE'
,'TEAM_ID'
,'TEAM_NAME'
,'TEAM_NAME_OPP'
,'HOME_FLAG'
,'WIN_FLAG'
,'SCORE'
,'season_id'

]

df_spark_feed = data_games[col_spark]

#Initiate PySpark session

spark_1 = SparkSession.builder.appName('app_1').getOrCreate()
df_1 = spark_1.createDataFrame(df_spark_feed)

#Create window by each team
Window_Team_by_Date = Window.partitionBy("TEAM_ID").orderBy("GAME_DATE")

#Return date of previous game using the lag function, then
#flag back-to-back games using a when statement
df_spark = df_1.withColumn("GAME_DATE_PREV_GAME",
                           F.lag("GAME_DATE", 1).over(Window_Team_by_Date)) \
               .withColumn("Back_to_Back_FLAG",
                           F.when(F.col("GAME_DATE_minus_ONE") == F.col("GAME_DATE_PREV_GAME"), 1)
                            .otherwise(0))

#Convert Spark dataframe to Pandas dataframe
df = df_spark.toPandas()

Step 5: Calculate win ratio for back-to-back games

#Select relevant columns

col = [
'TEAM_NAME'
,'TEAM_NAME_OPP'
,'GAME_DATE'
,'HOME_FLAG'
,'WIN_FLAG'

]

#Filter for back-to-back games
df_b2b_interim = df[df['Back_to_Back_FLAG'] == 1]

#Show selected columns only
df_b2b = df_b2b_interim[col].sort_values(['TEAM_NAME', 'GAME_DATE']).reset_index(drop = True)
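With the back-to-back games isolated, the win ratio itself is just the mean of the boolean ‘WIN_FLAG’ per team. A minimal sketch on a hypothetical two-team dataset (the team names and results below are made up, not the article's synthetic data):

```python
import pandas as pd

# Hypothetical back-to-back results for two teams
df_b2b = pd.DataFrame({
    'TEAM_NAME': ['Hawks', 'Hawks', 'Hawks', 'Magic', 'Magic'],
    'WIN_FLAG':  [True, False, False, True, True],
})

# The mean of a boolean column is the proportion of True values,
# i.e. the win ratio for back-to-back games
win_ratio_b2b = df_b2b.groupby('TEAM_NAME')['WIN_FLAG'].mean()
```

The same one-liner applied to the full df_b2b dataframe produces Table 3.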

What’s the win ratio for back-to-back games, by team?

Table 3: Win ratio of back-to-back games by team. Table by author

Based on the synthetic dataset, the win ratio for back-to-back games varied by team. The Houston Rockets had the lowest win ratio in back-to-back games (12.5%), followed by the Orlando Magic (14.8%).

Does it matter if the back-to-back game was played on away or home court?

Table 4: Win ratio of back-to-back games by team and home/away. Table by author

Based on the synthetic dataset, most teams in Table 4 above were more likely to win back-to-back games played on their home court rather than away (a sensible observation). The Brooklyn Nets, Chicago Bulls and Detroit Pistons were among the few exceptions.

Other splits can also be calculated, such as the win ratio of non-back-to-back games vs. back-to-back games, by grouping on the ‘Back_to_Back_FLAG’ column. A snippet of the output suggests teams were more likely to win non-back-to-back games (again a sensible observation, with a few exceptions).
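One way to produce such a split is to group on both the team and the flag (a sketch on made-up data; in the article's pipeline, df already carries the ‘Back_to_Back_FLAG’ column from Step 4):

```python
import pandas as pd

# Hypothetical games with the back-to-back flag already computed
df = pd.DataFrame({
    'TEAM_NAME': ['Hawks'] * 4 + ['Magic'] * 4,
    'Back_to_Back_FLAG': [0, 0, 1, 1, 0, 0, 1, 1],
    'WIN_FLAG': [True, True, False, True, True, False, False, False],
})

# Win ratio by team, with other games (0) and back-to-back games (1) as columns
win_ratio_split = (
    df.groupby(['TEAM_NAME', 'Back_to_Back_FLAG'])['WIN_FLAG']
      .mean()
      .unstack('Back_to_Back_FLAG')
)
```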

Table 5: Win ratio, back-to-back games vs. otherwise. Table by author

The PySpark session and associated Window Functions in Step 4 above can be further customized to return other game stats.

For example, if we would like to query the win ratio (of back-to-back games or otherwise) by season, simply partition the Window by both team and season ID, as below.

#Create window by team and season ID

Window_Team_by_Season = Window.partitionBy("TEAM_ID", "season_id").orderBy("GAME_DATE")

In addition, we all know the score line of an NBA game is volatile, but exactly how volatile? This can be measured by the standard deviation of scores, which again may not be available in the public domain. We can easily calculate it by bringing in the score (which is available in the dataset) and applying the avg and stddev Window Functions, which return the average and standard deviation over a pre-defined window.

As an example, if the standard deviation of an NBA score line is circa 20 points, then there is a roughly 68% chance that the score will fall within +/- 20 points of the average score line (assuming a Normal distribution, where about 68% of outcomes lie within one standard deviation of the mean).
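That one-standard-deviation figure can be sanity-checked by simulation (the mean of 112 points and standard deviation of 20 below are assumptions for illustration, not values from the dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 100,000 hypothetical score lines with an assumed mean and std dev
scores = rng.normal(loc=112, scale=20, size=100_000)

# Share of simulated scores within +/- one standard deviation of the mean
share_within_1sd = np.mean(np.abs(scores - 112) <= 20)
```

With this many samples, share_within_1sd lands close to the theoretical 0.68 for a Normal distribution.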

Example Python code for returning this stat is provided below.


spark_1 = SparkSession.builder.appName('app_1').getOrCreate()
df_1 = spark_1.createDataFrame(df_spark_feed)

#Partition by team and home/away so the stats cover each full group
#(adding an orderBy here would instead produce running statistics)
Window_Team = Window.partitionBy("TEAM_ID", "HOME_FLAG")
df_spark = df_1.withColumn("SCORE_AVG", F.avg("SCORE").over(Window_Team)) \
               .withColumn("SCORE_STD", F.stddev("SCORE").over(Window_Team))

df = df_spark.toPandas()
df.groupby(['TEAM_NAME', 'HOME_FLAG'])[["SCORE_AVG", "SCORE_STD"]].mean()

