Friday, June 14, 2013

Homecourt Advantage

Introduction

The regular season is an 82-game marathon for playoff seeds and homecourt advantage, with teams often fighting red in tooth and claw just for the privilege of hosting one more game in their own arena. But exactly how valuable is homecourt advantage? The modern consensus is that homecourt is worth roughly +3.2 points per game, a margin of victory that separates a 0.500 team with 41 wins from one with nearly 50 wins. Most of the time homecourt advantage is simply calculated from the disparity between road and home winning percentage or point differential, but perhaps there are some nuanced factors we're ignoring. Looking at 14 regular seasons, everything after the lockout season of 1999, this article presents a more accurate and comprehensive overview of homecourt advantage and related factors.

Methodology

I'm only looking at the regular season right now partly because it's easier to amass a large data-set that way and partly because later I want to investigate if homecourt advantage changes in the playoffs. Nevertheless, we're looking at nearly 17,000 games, and the impetus for the study was an article from Neil Paine about "real" homecourt advantage for every team.

He looked at the difference between a team's point differential at home versus the road, comparing it to the league average to find teams that were significantly above or below the mean. Utah and Denver had homecourt advantages of +6.17 and +5.47, respectively, compared to the average of 3.23. One may suggest that the high altitude of the Rockies makes for a built-in advantage, but, as he correctly pointed out, there's a little more going on here. Teams in the sparsely populated west, especially in the Rockies and northwest, host teams who have to fly further on average compared to, for example, New York, whose opponents sometimes only have to travel a couple hundred miles. The next two teams with the highest homecourt advantage rating are west of the Rockies, Golden State and Portland, although they are known for great crowds. Conversely, the teams with the worst homecourt ratings should either be near the cluster of densely located NBA teams in the northeast or in the "center of mass" of NBA locations in the central timezone. That's, in fact, exactly what happened: the Nets have the worst rating, followed by the Timberwolves, while the rest of the bottom ten largely consists of east coast and central teams.

Distance is calculated from a previously used method with a longitude/latitude arc formula from the arena's coordinates. This was fairly easy in most instances, but there were a handful of games at odd locations -- the Hornets played games in both Oklahoma and Louisiana in two different seasons, and there were a few games in London as well as two cities in Japan. The games in London and Japan, and the games immediately following, were discarded from the dataset. They were definitely outliers: teams have to travel nearly 6000 miles to reach Japan, and the distances were so extreme they were throwing off some of the model variables.
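
For readers who want to reproduce the distance step, here is a minimal sketch of a longitude/latitude arc calculation using the haversine formula. It is not necessarily the exact formula used for this study; the Earth radius constant and the rounded city coordinates below are assumptions for illustration, not the arena coordinates in the dataset.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius

def arc_distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine term; min() guards against tiny floating-point overshoot.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(a)))

# Rounded, illustrative city coordinates (assumptions, not arena locations).
denver = (39.75, -105.0)
new_york = (40.75, -74.0)
print(round(arc_distance_miles(*denver, *new_york)))
```

A team's TravelDistance for a game would then be this function applied to the previous arena and the current one.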

Along with travel distance, another factor is the number of days between games. Obviously, a team is at a disadvantage without a day of rest between games, but there's also an interaction here between travel distance and days of rest. The effect of distance is muted by a few days off to adjust, while if a team is without rest a thousand-mile journey is even more taxing. A previous study incorporated rest, but it did not factor in distance and only used one season of data.

Lastly, homecourt dummy variables are used for each NBA location. For example, the Grizzlies in Vancouver have a different variable than in Memphis. The homecourt location variable shows when a team's homecourt advantage is significantly different than average.

Model selection

I used linear regression with point differential as the response. Knowing each team's strength in point-differential terms (SRS), you get a predicted point differential for the match-up, and can then solve for the effect of homecourt advantage as well as other factors like travel distance and travel days. The basic model form is shown below:

SRSHome + HomeTravelDays + HomeDistance + HomeCourt = SRSAway + AwayTravelDays + AwayDistance


PointDiff = HomeCourt + (SRSHome - SRSAway) + HomeDays + AwayDays + HomeDistance + AwayDistance

The variables given above are just the basic forms. You can transform them by squaring them or taking the square root, and you can rearrange them. Travel days, or days between games, were adjusted so the default is one day between games; this is the case in over half of all games in a typical season. This improved the model's ability to explain point differential. The key variables are outlined below. (The function stepAIC in R was used to select the predictors for the model.) All variables have two versions, home and away, and most are transformed by squaring or taking the square root.

NoRest: when a team played the day before.

RestDays: the default is one day between games. A "RestDay" is an additional day of rest (e.g. two days between games would be one RestDay and four days would be three RestDays.)

TravelDistance: how far a team traveled from the previous game to the current one.

SixDayTravel: the sum of TravelDistance for games in the past six days.

SixDayTravelWtd: the same as SixDayTravel, but weighted by recency. Each game's TravelDistance is divided by the number of days from the present before summing (e.g. a game yesterday has its TravelDistance divided by two.)

CstGames: the number of consecutive games played at home or on the road.

TravelDistanceOneDay: distance traveled between games with no day of rest; 0 otherwise (when there's at least one day of rest.)

TravelDistanceP: the product of travel distance and consecutive games (e.g. for the fifth straight game, you multiply the travel distance of that game by five.)
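
The rest and travel definitions above can be sketched in code. This is a minimal illustration, not the script used for the study: the function names are hypothetical, and for SixDayTravelWtd it assumes the current game's trip counts at full weight (divisor 1), a game yesterday is divided by two, and so on back to six days.

```python
from datetime import date

def rest_features(prev_game, this_game):
    """Return (NoRest, RestDays) from the dates of a team's last two games."""
    days_between = (this_game - prev_game).days - 1  # full days off between games
    no_rest = 1 if days_between == 0 else 0
    # One day off is the default, so only extra days count as RestDays.
    rest_days = max(0, days_between - 1)
    return no_rest, rest_days

def six_day_travel_wtd(trips, today):
    """Recency-weighted travel over the past six days.

    trips: list of (game_date, travel_distance_miles) for one team.
    Assumption: divisor is days_ago + 1, so today's trip is undivided.
    """
    total = 0.0
    for game_date, miles in trips:
        days_ago = (today - game_date).days
        if 0 <= days_ago <= 6:
            total += miles / (days_ago + 1)
    return total

# Made-up schedule for illustration only.
trips = [(date(2013, 1, 1), 600.0), (date(2013, 1, 3), 400.0), (date(2013, 1, 4), 1000.0)]
print(rest_features(date(2013, 1, 3), date(2013, 1, 4)))  # back-to-back
print(rest_features(date(2013, 1, 1), date(2013, 1, 6)))  # four days between games
print(six_day_travel_wtd(trips, date(2013, 1, 4)))
```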

Results


A simple way to look at the travel distance question is to separate the games into categories. First a point disparity is calculated between the actual point margin of the game (home team's score minus the away team's score) and the expected point differential based on team strength (SRS, or simple rating system.) Note that the average value, somewhere around 3.2, is the average homecourt advantage typically calculated for the NBA. Table 1 shows these results broken down by the travel distance in miles of the home and away teams before the game. Important note: there were fewer than 100 games each for the bolded values. For example, when the home team and the away team both had to travel over 2000 miles to the game, there were only 18 such occurrences. The other bolded values have around 50 games each, while the other two categories on the bottom row had around 120 games. Ignoring the cases with small sample sizes, there's not much of a pattern. When the home team didn't travel at all, virtually nothing happened as the away team's travel distance increased.

Table 1: Point disparity matrix with travel distance
Home \ Away (d in miles) | 0 < d < 500 | 500 < d < 1000 | 1000 < d < 2000 | 2000 < d
No travel                | 3.51        | 3.52           | 3.14            | 3.36
0 < d < 500              | 3.54        | 2.80           | 2.28            | 3.99
500 < d < 1000           | 3.36        | 2.99           | 1.63            | 3.09
1000 < d < 2000          | 3.16        | 3.73           | 3.18            | 7.28
2000 < d                 | 3.45        | 1.29           | 2.70            | 1.88
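
A matrix like Table 1 can be produced by bucketing each game by both teams' travel distance and averaging the disparity in each cell. The sketch below is one way to do that; the dictionary keys and the handling of exact boundary values (e.g. exactly 500 miles) are assumptions, and the two games are made up for illustration.

```python
from collections import defaultdict

def distance_bucket(miles):
    """Map a travel distance in miles to a Table 1 row/column label."""
    if miles == 0:
        return "No travel"
    if miles < 500:
        return "0 < d < 500"
    if miles < 1000:
        return "500 < d < 1000"
    if miles < 2000:
        return "1000 < d < 2000"
    return "2000 < d"

def disparity_matrix(games):
    """Average (actual margin - expected margin) per (home, away) bucket.

    games: iterable of dicts with hypothetical keys home_pts, away_pts,
    home_srs, away_srs, home_miles, away_miles.
    Returns {(home_bucket, away_bucket): (mean_disparity, game_count)}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for g in games:
        actual = g["home_pts"] - g["away_pts"]
        expected = g["home_srs"] - g["away_srs"]
        key = (distance_bucket(g["home_miles"]), distance_bucket(g["away_miles"]))
        sums[key] += actual - expected
        counts[key] += 1
    return {key: (sums[key] / counts[key], counts[key]) for key in sums}

# Two made-up games, both with the home team not traveling.
games = [
    {"home_pts": 100, "away_pts": 95, "home_srs": 2.0, "away_srs": 1.0,
     "home_miles": 0, "away_miles": 300},
    {"home_pts": 98, "away_pts": 96, "home_srs": 0.0, "away_srs": -1.0,
     "home_miles": 0, "away_miles": 450},
]
matrix = disparity_matrix(games)
print(matrix)
```

Returning the game count alongside each mean makes the small-sample cells easy to spot.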


And if you're uneasy reading tables and don't believe the numbers, here are two graphs of every game with the point disparities and travel distances. Remember this is travel distance from the previous game to the current one (arena to arena.) A travel distance of zero is normally a home team staying in its home arena, with the exception of a few weird Los Angeles games in the Staples Center where the Clippers, for example, "travel" to the Lakers' arena from their own.



Using the same method, a similar table is created for rest days (or days between games.) Games with no rest have by far the biggest effect, while there's not much of a pattern elsewhere. There is no easily judged relationship between days of rest and point differential, but there is a suggestion of "rust" being a real factor. A more refined technique is needed.

Table 2: Point disparity matrix with days between games
Home \ Away    | No rest | 1 day | 2 days | 3 days | 4 or more days
No rest        | 3.77    | 1.06  | 0.83   | 1.28   | 2.03
1 day          | 4.64    | 2.93  | 3.26   | 2.02   | -1.27
2 days         | 4.72    | 2.47  | 3.03   | 4.46   | 2.20
3 days         | 4.38    | 2.75  | 3.12   | 0.59   | 1.49
4 or more days | 3.99    | 2.12  | 1.91   | 0.80   | 3.30


This is where a regression model comes into play. But despite 14 regular seasons' worth of data and accurate distances between team arenas, it was difficult to find distance as a significant predictor of the outcome of a game. For distance, you have to use a combination of variables, and this verges on model overfitting. When I started this study, I tried running a regression when I had only one season of data ready and parsed, and I was surprised to see travel distance was not significant, so I kept loading more and more data. Eventually I reached the season after the lockout (I sorta wanted to ignore that crazy season anyhow), and the pattern remained the same: distance is an unreliable predictor. Rest days and consecutive games were also unimportant, according to the results, with the exception of "NoRest" -- if you want a simple system to predict who wins and by how much, all you need is each team's point differential, homecourt, and whether or not the teams played the day before. Adding more convoluted variables like travel distance squared barely improves the model's ability to explain the results of the games.

Using a k=4 threshold for step regression (k basically controls the gate for when to drop a variable), a whole set of travel distance and rest variables is present. The only way travel distance was pertinent for away teams, by the way, is the combination in the table below. In most models only the home team's travel distance is important -- is this because homecourt advantage is taken away when the team has to travel across the country and gets fatigued? Also, the intercept basically stands for homecourt advantage, but because of the nature of the polynomial away variables it's not simple to figure out what homecourt advantage actually is.


Table 3: Homecourt model 1 with k=4 (step regression), st. error = 11.25, and adj. R^2 = 0.2359

Variable               | Coefficient | St. Error | p-value
Intercept              | 1.55        | 0.819     | 0.0579
Predicted point diff.  | 1.00        | 0.0142    | < 2e-16
HomeNoRest             | -1.16       | 0.253     | 4.08e-6
AwayNoRest             | 2.02        | 0.198     | < 2e-16
HomeRestDays^2         | -0.0794     | 0.0351    | 0.0239
HomeTravelDistance     | 0.00102     | 4.30e-4   | 0.0178
AwayTravelDistance     | -0.00523    | 0.00258   | 0.0429
AwayTravelDistance^0.5 | 0.180       | 0.0883    | 0.0412
AwayTravelDistance^2   | 1.12e-6     | 5.27e-7   | 0.0340
HomeSixDayTravelWtd    | -0.00113    | 3.80e-4   | 0.00303
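
To make the role of k concrete: for a linear model, R's stepwise selection scores each candidate model with n * log(RSS / n) + k * (number of parameters), and lower is better. So one extra variable survives only if it improves the fit term by more than k. The sketch below shows that comparison; the function names are hypothetical and the residual sums of squares are made-up numbers, not values from this dataset.

```python
import math

def penalized_criterion(rss, n_games, n_params, k):
    """n * log(RSS / n) + k * n_params; lower is better."""
    return n_games * math.log(rss / n_games) + k * n_params

def keep_variable(rss_without, rss_with, n_games, k):
    """True if adding one parameter lowers the criterion despite the k penalty."""
    improvement = n_games * math.log(rss_without / rss_with)
    return improvement > k

# Hypothetical fit: one extra variable trims the residual sum of squares slightly.
n_games = 17000
rss_without = 2_152_000.0
rss_with = 2_151_500.0

for k in (2, 3, 4, 5):
    print(k, keep_variable(rss_without, rss_with, n_games, k))
```

With k=2 the criterion is the ordinary AIC; raising k toward 4 or 5, as in the tables here, makes the gate stricter so only the stronger predictors remain.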

Increasing the k-value, a few variables drop out and we're left with more reliable predictors. Again, travel distance is not much of a problem for away teams. Perhaps this is due to the modern luxuries of air travel, where a thousand miles and a hundred miles are much the same. What guys really care about is at least a day of rest and playing on their own floor. But besides NoRest, the Home variables have pretty high standard errors relative to their coefficients, with the exception of the six-day weighted sum of travel distances, whose p-value is only 0.0029. The coefficients are in terms of point differential, so an intercept of 3.21 means homecourt is giving you 3.21 points over the away team, all other things being equal; and the away team having no rest is a two-point advantage for the home team. Those are powerful results, as even two or three points are worth a handful of wins over the course of a season.

Table 4: Homecourt model 2 with k=4.5 (step regression), st. error = 11.25, and adj. R^2 = 0.2357

Variable              | Coefficient | St. Error | p-value
Intercept             | 3.21        | 0.167     | < 2e-16
Predicted point diff. | 1.00        | 0.0142    | < 2e-16
HomeNoRest            | -1.17       | 0.253     | 3.80e-6
AwayNoRest            | 1.97        | 0.189     | < 2e-16
HomeRestDays^2        | -0.0773     | 0.0351    | 0.0277
HomeTravelDistance    | 0.00102     | 4.30e-4   | 0.0146
HomeSixDayTravelWtd   | -0.00102    | 3.80e-4   | 0.00293

Table 5 has the super-simplified version. There's not much of a difference in explaining the results of games, surprisingly; and remember the homecourt intercept is based on a model where the only other variables are the no-rest indicators. This reduces the homecourt advantage estimate, as most of the time it's given as something greater than 3.

Table 5: Homecourt model 3 with k=5 (step regression), st. error = 11.25, and adj. R^2 = 0.2353

Variable              | Coefficient | St. Error | p-value
Intercept             | 2.80        | 0.111     | < 2e-16
Predicted point diff. | 1.00        | 0.0142    | < 2e-16
HomeNoRest            | -1.32       | 0.243     | 6.38e-8
AwayNoRest            | 1.94        | 0.189     | < 2e-16
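
As a worked example, the Table 5 coefficients can be turned into a simple predictor of the home team's margin. This sketch assumes "Predicted point diff." is the home team's SRS minus the away team's SRS; the function name and the sample inputs are illustrative.

```python
def predicted_margin(home_srs, away_srs, home_no_rest, away_no_rest):
    """Expected home margin (points) using the Table 5 coefficients.

    home_no_rest / away_no_rest are 1 if that team played the day before, else 0.
    """
    intercept = 2.80           # baseline homecourt advantage
    srs_coef = 1.00            # coefficient on the SRS gap (assumed meaning)
    home_no_rest_coef = -1.32
    away_no_rest_coef = 1.94
    return (intercept
            + srs_coef * (home_srs - away_srs)
            + home_no_rest_coef * home_no_rest
            + away_no_rest_coef * away_no_rest)

# Two evenly matched teams under different rest situations.
print(predicted_margin(0.0, 0.0, 0, 0))  # both rested
print(predicted_margin(0.0, 0.0, 0, 1))  # away team on a back-to-back
print(predicted_margin(0.0, 0.0, 1, 0))  # home team on a back-to-back
```

For evenly matched teams this gives 2.80 points when both are rested, 4.74 when only the away team played the day before, and 1.48 when only the home team did.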

Team specific homecourt advantage

The leading research question that drove this study was whether or not certain teams had a significantly larger (or smaller) homecourt advantage. Denver is notorious for its mile-high altitude, but is that actually an advantage or is it how far teams have to travel to get there? Thanks to modern software, this question is easy to test and deceptively simple to implement. A dummy variable was given to every team location. If you're in Philadelphia, the PHI variable is 1 and 0 everywhere else, even when the arenas change. Throw in all 33 team location variables (Seattle, Vancouver, and New Jersey are included) with the previous variables, and it's close to 60 variables total -- thanks again to modern software and step regression for making this humanly feasible by selecting the significant variables in the right combination for me.
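
A minimal sketch of the dummy-variable encoding described above: each game gets one 0/1 indicator per team location, with a 1 only for the location hosting the game. The function name is hypothetical, and the list below is a small illustrative subset rather than all 33 locations.

```python
def location_dummies(home_location, locations):
    """One 0/1 indicator per location; 1 only for the location hosting the game."""
    return {loc: 1 if loc == home_location else 0 for loc in locations}

LOCATIONS = ["DEN", "PHI", "SAC", "UTA"]  # illustrative subset, not the full list
row = location_dummies("PHI", LOCATIONS)
print(row)
```

In the regression, each indicator's coefficient is then read as that location's homecourt advantage relative to the baseline intercept.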

Surprisingly, some team locations are significant, and yes that includes Denver along with the other high-altitude team in Utah. However, there's another team you would not expect: Sacramento. Yes, the team that almost moved is one of only three with a significantly better homecourt advantage than normal, even though the fans have had little to cheer about in recent years. What's interesting to note is that these cities are significant even though travel distance was a considered variable. A model attempted with k = 3, meaning fewer variables are rejected, included away travel distances, but the location variables had the same impact. There is definitely a real advantage in Denver, worth +1.5 points over the normal homecourt advantage, while Sacramento and Utah are borderline significant as well.

Table 6: Homecourt model 4 with k=4 (step regression), st. error = 11.24, and adj. R^2 = 0.2365

Variable              | Coefficient | St. Error | p-value
Intercept             | 3.09        | 0.169     | < 2e-16
Predicted point diff. | 1.00        | 0.0142    | < 2e-16
HomeNoRest            | -1.15       | 0.253     | 5.55e-6
AwayNoRest            | 1.97        | 0.189     | < 2e-16
HomeRestDays^2        | -0.0786     | 0.0351    | 0.0251
HomeTravelDistance    | 0.00104     | 4.29e-4   | 0.0150
HomeSixDayTravelWtd   | -0.00116    | 3.80e-4   | 0.00220
DEN                   | 1.54        | 0.485     | 0.00153
SAC                   | 1.16        | 0.484     | 0.0166
UTA                   | 1.10        | 0.486     | 0.0242

Increasing the k-penalty factor, everything drops out except days with no rest and location specific homecourt advantage. The team variables appear stable; the coefficients barely changed even when distance variables were dropped, and this was true of the k=3 model with away distance variables. As another sidenote, in the k=3 model Golden State was another location variable at +0.93 but the p-value was 5.7%, meaning it's just barely rejected by the common 5% threshold; basically, it's borderline significant, and it's arguable Golden State has a better than average homecourt advantage, commendable considering how often they've lost over the past 14 seasons. But unfortunately for Philadelphia, they had the only negative coefficient (-0.85) and a p-value of 8.0%.

Table 7: Homecourt model 5 with k=4.5 (step regression), st. error = 11.25, and adj. R^2 = 0.2361


Variable              | Coefficient | St. Error | p-value
Intercept             | 2.68        | 0.115     | < 2e-16
Predicted point diff. | 1.00        | 0.0142    | < 2e-16
HomeNoRest            | -1.30       | 0.244     | 8.70e-8
AwayNoRest            | 1.95        | 0.189     | < 2e-16
DEN                   | 1.50        | 0.485     | 0.00198
SAC                   | 1.16        | 0.484     | 0.0167
UTA                   | 1.07        | 0.486     | 0.0282

One famous unanswered question in the NBA is rust versus rest: is having too many days off before the next game a negative factor? Here's one such attempt at the thorny issue, although it only uses playoff games and looks at win percentage instead of point differential. Point differential is better for a study because it indicates team strength, and in cases with only a few games win percentage is a lot less reliable than point differential. If you read the model tables above thoroughly, you would have noticed the HomeRestDays^2 variable in some of the models -- and it's negative. For this variable, a team that played the day before or the day before that (just one day of rest) counts as a zero. Two days of rest? HomeRestDays is one. At six days of rest, HomeRestDays is five, and according to some of the models you lose roughly two points, which is a sizable amount for an NBA game. The squared transformation means that rust is only an effect at the extremes. And again, interestingly, only the home version of the variable is significant. However, long stretches before the next game are rare during the regular season, and in the next version of this study the playoffs will be loaded into the dataset for a more comprehensive study of rust versus rest.
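
The "roughly two points at six days of rest" figure can be checked directly from the HomeRestDays^2 coefficient. This small sketch uses the Table 6 value of -0.0786; the function name is hypothetical.

```python
HOME_REST_DAYS_SQ_COEF = -0.0786  # HomeRestDays^2 coefficient from Table 6

def rust_effect(days_off):
    """Point effect on the home team after `days_off` full days between games."""
    rest_days = max(0, days_off - 1)  # zero or one day off both count as zero
    return HOME_REST_DAYS_SQ_COEF * rest_days ** 2

for days_off in (1, 2, 4, 6):
    print(days_off, round(rust_effect(days_off), 3))
```

Six days off means five RestDays, and 25 times -0.0786 is about -1.97 points, while two days off costs under a tenth of a point, which is why the effect only matters at the extremes.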

Conclusion

Using 14 regular seasons of data and trying to predict the point margin of every game from the average adjusted point differential of each team along with a long set of other variables, travel distance does appear to be a significant predictor, but only weakly so. Surprisingly, the travel distance of the home team is more important, and the total amount of travel in the past week, weighted by recency, is a better predictor. For rest before games, having no rest (playing the day before) is a highly significant disadvantage, while there's some weak evidence that having too much rest has a negative effect -- i.e. rust. What homecourt advantage is depends on what other variables you consider. For example, in Table 6 homecourt advantage is about +3.1, other factors held equal, but this doesn't include the teams with higher than average advantages at home and the higher likelihood of away teams playing the night before.

There are certain teams, however, with a significantly larger than average homecourt advantage. Denver is roughly +1.5 points above the natural homecourt advantage with very high statistical significance, while both Utah and Sacramento have some significance (but not high) with around +1.1 points for hosting a game in their arena. Teams travel vast distances over North America in a matter of hours, traversing the continent to play a game the next day with modern aircraft and luxuries, causing an apparently complicated relationship between homecourt advantage and expectations -- but it's perhaps simpler than we think. Athletes like having at least one day of rest, and distance is only bothersome when it stunts your ability to enjoy your homecourt advantage. Rust is perhaps a real issue, but only at the extremes. And home is home: the comforts of playing in your own arena are genuine and potent, some arenas more than others.
