Tuesday, February 18, 2014

NBA Three-point Contest Simulation

All-star Weekend is over, but I worked overtime on a method to estimate a player's likelihood of winning the three-point contest. At first, I just wanted to find which factors correlated with contest success, but there's a frustrating lack of patterns within the contest results. So what can we learn from the past?

What translates into contest success

To set up the model, data is needed, obviously, but it's strange how difficult it is to find comprehensive results for every player and every round. The NBA's own website was missing every season from 2010 onward, and it had incorrect results for 2007. Thankfully, by cross-checking with this thorough website from Spain and several news sources, I was able to include every single round from 2000 to 2013. Why not before 2000? For one, the shortened line of '95 to '97 means the data from the mid-90s isn't directly comparable, and the league's three-point habits have changed drastically; comparing 1988 to 2010 is a fool's errand, basically.

After the data was captured, the model form was selected. Because it's count data (i.e. whole, non-negative numbers) and there's an upper limit on the score, a simple linear model can't be used. Instead, a beta regression model is used where the dependent variable (what you're trying to predict) is the proportion of total points scored in a round. Beta regression is preferred when the outcome is bounded between 0 and 1, which makes it useful for rates and proportions. (Specifically, it's the logit link model used in the betareg function from R.) For example, the all-time record of 25 by the old rules would mean a proportion of 25/30 or 0.833. The functional form is: y = exp(beta*x)/(1 + exp(beta*x)).
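To make the logit link concrete, here is a minimal Python sketch of the functional form above and its inverse. The function names are my own for illustration; this is not the betareg implementation, only the same transformation applied to the 25-out-of-30 example.

```python
import math

OLD_MAX_SCORE = 30  # maximum score under the old contest rules


def inverse_logit(eta):
    """y = exp(eta) / (1 + exp(eta)): maps any real number into (0, 1)."""
    return math.exp(eta) / (1.0 + math.exp(eta))


def logit(p):
    """Inverse of the function above: maps a proportion onto the real line."""
    return math.log(p / (1.0 - p))


# The all-time record of 25 expressed as a proportion of the maximum.
record_proportion = 25 / OLD_MAX_SCORE

# Moving to the linear-predictor scale and back recovers the score.
eta = logit(record_proportion)
recovered_score = OLD_MAX_SCORE * inverse_logit(eta)

print(f"proportion = {record_proportion:.3f}, linear predictor = {eta:.3f}")
print(f"recovered score = {recovered_score:.1f}")
```

Because the inverse logit can never reach 0 or 1, a model of this form cannot predict an impossible score below 0 or above the maximum, which is the reason a plain linear model was ruled out.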

The fun math aside, what variables predict the three-point contest scores? When I was collecting the data, I observed no patterns in the winners and the worst performers -- big name shooters like Bird and Peja have done well, but shooting specialists like Craig Hodges and Voshon Lenard excel too. Kevin Love surprisingly won a contest, so maybe star players have an advantage, but Curry and Durant were disappointing. Undeterred, I tested a variety of stats -- shooting percentages from the past three seasons (being careful to only look at the three-point percentage before the all-star break), three-point attempts per minute, usage rate (aka total shooting volume), height, returning champion, repeating participant, a dummy variable for the final round in case players "warm up," year, and interaction terms formed by multiplying, say, height by three-point attempts per minute in case there's an interaction effect.




The result? Unfortunately, only two variables were significant -- a weighted average of shooting percentage from the past three seasons(1), and a dummy variable for the final round. None of the other factors showed a significant relationship with the scores, as much as people like to state big guys have an advantage because they'll somehow get less tired heaving the shots due to their size. Thankfully, there was one interesting result with the shooting percentages -- the previous season is as important as the current one. One would think the more recent season is more relevant, though the previous season is a full one with more attempts. The likely explanation is a selection bias for the contest: the NBA chooses players based on their shooting percentages and number of attempts from the current season. Thus, players who are having an anomalous half-season are more likely to be chosen, and their "true" three-point talent level is generally lower than what their pre-all-star break percentage suggests.

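(1) (3*3P_year0 + 3*3P_year1 + 2*3P_year2) / (3*3PA_year0 + 3*3PA_year1 + 2*3PA_year2), where 3P is three-pointers made, 3PA is three-pointers attempted, year0 is the current season before the all-star break, and year1 and year2 are the two seasons before it.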
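The weighted average in footnote (1) is simple enough to write out in a few lines of Python. The player totals below are hypothetical numbers picked for easy arithmetic, not real data, and the function name is my own.

```python
def weighted_three_point_pct(makes, attempts, weights=(3, 3, 2)):
    """Weighted 3P% over three seasons, following footnote (1).

    makes and attempts are ordered (year0, year1, year2): the current
    season before the all-star break, then the two seasons before it.
    """
    weighted_makes = sum(w * m for w, m in zip(weights, makes))
    weighted_attempts = sum(w * a for w, a in zip(weights, attempts))
    return weighted_makes / weighted_attempts


# Hypothetical player: 45.0% on 200 attempts before the break,
# 37.5% on 400 attempts last season, 40.0% on 300 attempts before that.
makes = (90, 150, 120)
attempts = (200, 400, 300)

pct = weighted_three_point_pct(makes, attempts)
print(f"weighted 3P% = {pct:.3f}")
```

Since the weights multiply makes and attempts rather than the percentages themselves, a season with few attempts automatically counts for less, and the hot half-season here is pulled from 45% down to 40%.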

From 2006 to 2013, the average drop in contest participants' three-point percentage from before the all-star break to after it was 3.3% (among players with at least 40 attempts after the break) -- meaning a player is likely a noticeably worse shooter the rest of the season. However, there's a wrench in the selection bias: champions are invited back unless they're injured. This is probably the best comparison test since they are not selected because of their pre-break percentage. And among returning champions, the average difference is ... 0.02. Clearly, there's selection bias when only looking at a half season of stats.
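The selection effect described above is easy to reproduce with a toy simulation. This is a sketch under made-up assumptions, not a reanalysis of the real data: every number in it (league size, talent spread, attempts) is hypothetical, and each player's true talent is held fixed all season, so any drop after the break is pure selection.

```python
import random


def simulate_selection_bias(seed=0, n_players=400, n_selected=8,
                            pre_attempts=200, post_attempts=150):
    """Pick the best pre-break shooters and compare their two half-seasons."""
    rng = random.Random(seed)

    def shoot(talent, attempts):
        # Each attempt is an independent make with probability `talent`.
        makes = sum(1 for _ in range(attempts) if rng.random() < talent)
        return makes / attempts

    players = []
    for _ in range(n_players):
        talent = rng.gauss(0.36, 0.02)  # hypothetical true 3P talent
        pre = shoot(talent, pre_attempts)
        post = shoot(talent, post_attempts)
        players.append((pre, post))

    # "Invite" the players with the best pre-break percentages.
    selected = sorted(players, key=lambda p: p[0], reverse=True)[:n_selected]
    mean_pre = sum(p[0] for p in selected) / n_selected
    mean_post = sum(p[1] for p in selected) / n_selected
    return mean_pre, mean_post


mean_pre, mean_post = simulate_selection_bias()
print(f"selected players: pre-break {mean_pre:.3f}, post-break {mean_post:.3f}")
```

Nobody in this toy league gets worse after the break, yet the invited group's percentage still falls, because a hot half-season is part of what got them selected.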

And why am I including all this discussion? Because deciding the field of participants is important (it's a big televised event, and NBA players have to be scheduled and showcased), and it's apparent that looking only at a half season is unwise and does not lead to the best set of selections.

Another important observation is the inherent noisiness of the results. The "pseudo" R-squared from the model was 0.1163, which roughly equates to saying only 11.6% of the variance in scores is explained by the variables. For those unfamiliar with regression, that's a low number. Basically, determining the winner of a single round is close to a roll of the die. It's important when watching the contest or looking at past results to realize it's not telling you who the best shooter is, despite the advertisement. Luck and chance are huge factors.

If you're looking for other variables I missed that would explain the results, I'm listening but I considered many. Shooting style is one I've spent time on, but there doesn't appear to be a pattern either. Set shooters win? Beal nearly won and he jumps higher than most do in the contest, and Ray Allen won in 2001. And players with lower verticals like Peja, Pierce, and Bird have excelled as well.

Simulation model

With a formula in hand, the next step is simulating how rounds are won. There is no closed-form way to estimate a player's chance of winning, for several reasons: there's a great deal of variation in what every player will shoot, the finals have to be simulated based on who advances from the round before, tiebreaker rounds are possible, and which players are in your bracket affects your odds.

Considering those facts, the model was built by varying the coefficient for shooting percentage based on the standard deviation from the regression results. Essentially, this gives a "real world" set of varying results where Kevin Love can shoot 20 one round and 12 the next. With the logit link (beta) form, there are also realistic limits where a score of 25 or 26 is rare, and so are scores of 4 and 5. Out of the 90 first round scores from 2000 to 2013, there was only one score of 23 or higher (Arenas), or 1.1%. Running the simulation with 32,000 games played (4000 simulation seeds with 8 players), there were only 326 such games, which translates to 1.0%. A single case doesn't prove the model is reflecting real world conditions, so I plotted a histogram below showing how the first round in the real world and virtual compare.


Since the simulation used only the 2014 participants for their shooting percentages, it's not a perfect representation, but it's showing a reasonable spread of results. There are only 90 results in the "real world" first round, but the shape is starting to form a normal curve and one can see that rare events in the simulation are indeed rare in reality.
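For readers who want to try something similar, here is a rough Python sketch of this kind of simulation. It is not the model used for the numbers in this post: the intercept, coefficient, standard deviation, final-round bump, and player percentages are all hypothetical placeholders, and the bracket format (four players per bracket, each bracket winner advancing, ties re-shot among the tied leaders) is my simplifying assumption.

```python
import math
import random
from collections import Counter

MAX_SCORE = 30      # old-rules maximum
INTERCEPT = -1.5    # hypothetical, not a fitted value
SLOPE_MEAN = 4.0    # hypothetical coefficient on weighted 3P%
SLOPE_SD = 1.2      # hypothetical spread used to vary the coefficient
FINAL_BUMP = 0.1    # hypothetical effect of the final-round dummy


def simulate_score(pct, is_final, rng):
    """Draw a coefficient, push it through the logit link, scale to a score."""
    slope = rng.gauss(SLOPE_MEAN, SLOPE_SD)
    eta = INTERCEPT + slope * pct + (FINAL_BUMP if is_final else 0.0)
    proportion = 1.0 / (1.0 + math.exp(-eta))
    return round(MAX_SCORE * proportion)


def round_winner(players, is_final, rng):
    """players maps name -> weighted 3P%. Tied leaders re-shoot until one is left."""
    contenders = list(players)
    while len(contenders) > 1:
        scores = {n: simulate_score(players[n], is_final, rng) for n in contenders}
        best = max(scores.values())
        contenders = [n for n in contenders if scores[n] == best]
    return contenders[0]


def simulate_contest(brackets, rng):
    """One contest: each bracket sends its winner to the final."""
    finalists = {}
    for bracket in brackets:
        winner = round_winner(bracket, False, rng)
        finalists[winner] = bracket[winner]
    return round_winner(finalists, True, rng)


def win_probabilities(brackets, n_sims=4000, seed=0):
    rng = random.Random(seed)
    wins = Counter(simulate_contest(brackets, rng) for _ in range(n_sims))
    return {name: wins[name] / n_sims for b in brackets for name in b}


# Hypothetical field: names and percentages are made up.
brackets = [
    {"East A": 0.42, "East B": 0.39, "East C": 0.38, "East D": 0.36},
    {"West A": 0.44, "West B": 0.41, "West C": 0.40, "West D": 0.35},
]
odds = win_probabilities(brackets)
for name, p in sorted(odds.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {100 * p:.1f}%")
```

Even with placeholder numbers, the structure shows why a bracket matters: a player only has to beat the three others in the same bracket to reach the final, so the same shooter can have different odds depending on the draw.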

Based on the simulation results with 4000 random seeds for each of the first rounds and 32,000 for the finals (4000 separate simulations for each possibility of the 4000 first round seeds), the most likely winner was Stephen Curry at 24.6%, followed by Beal who was closely followed by Lillard and Belinelli. Love's odds are at 5 percent, but that's largely due to a terrible 2013. On the other hand, the metric is weighted by attempts, so an injury-plagued season is not as destructive. Afflalo also has low odds due to a low percentage the previous season. Beal's advantage is in an easier field; he's not necessarily better than Belinelli. While Curry disappointed again, his odds were not near 100%, of course, and the finals had two guys in the top four by this simulation.

Simulation odds for winning the 2014 three-point contest

Player            Odds (%)
Damian Lillard    13.7
Marco Belinelli   13.5
Kevin Love         5.1
Stephen Curry     24.6
Kyrie Irving      11.9
Joe Johnson       10.3
Bradley Beal      14.6
Arron Afflalo      6.3

Bovada (Vegas), by the way, provided an interesting set of odds. Curry was listed at 2 to 1, which implies a much better chance than I estimated, and there was a huge gap between him and the next closest competitor. I'm disappointed I didn't try this method sooner because Bovada listed Beal with the second worst odds of winning even though he's shooting 43% right now and shot 39% last season (and he was in an easier bracket). I would have labeled him the best value along with Belinelli, and since Belinelli ended up winning with Beal second, you could have put down 20 dollars on Belinelli and won 120.
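To compare a betting line with a simulated probability, the odds have to be converted first. The sketch below assumes "2 to 1" means fractional odds against (risk 1 to win 2) and ignores the bookmaker's margin; the function name is my own.

```python
def implied_probability(odds_against, stake=1):
    """Break-even win probability for fractional odds 'odds_against to stake'."""
    return stake / (odds_against + stake)


curry_implied = implied_probability(2)  # 2 to 1
curry_simulated = 0.246                 # from the simulation table above

print(f"implied by the line: {curry_implied:.1%}")
print(f"simulated:           {curry_simulated:.1%}")
```

Under that reading, the line prices Curry at about a one-in-three chance, noticeably higher than the simulated 24.6%, which is the sense in which the favorite looks overpriced.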

Alas, the one contest that didn't need an overhaul was changed, and the new maximum score is 34 because there is a moneyball-only rack you can place at your choosing. Since this increases the variability of the score, the odds should be closer together. I'll have to think of a way to emulate this behavior before the next contest.
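The arithmetic behind the new maximum, and a rough sense of how much the extra moneyballs add to the spread of scores, can be checked in a few lines of Python. The variance part is a simplification I'm assuming for illustration: every shot is independent and made with the same probability.

```python
RACKS = 5
BALLS_PER_RACK = 5
REGULAR_POINTS = 1
MONEYBALL_POINTS = 2

# A standard rack has four regular balls and one moneyball.
standard_rack = (BALLS_PER_RACK - 1) * REGULAR_POINTS + MONEYBALL_POINTS
moneyball_rack = BALLS_PER_RACK * MONEYBALL_POINTS

old_max = RACKS * standard_rack                         # five standard racks
new_max = (RACKS - 1) * standard_rack + moneyball_rack  # one rack swapped out


def score_variance(p, regular_balls, moneyballs):
    """Variance of the total score if every shot is an independent make with probability p."""
    per_shot = p * (1 - p)
    return per_shot * (regular_balls * REGULAR_POINTS ** 2
                       + moneyballs * MONEYBALL_POINTS ** 2)


old_var = score_variance(0.5, RACKS * (BALLS_PER_RACK - 1), RACKS)
new_var = score_variance(0.5, (RACKS - 1) * (BALLS_PER_RACK - 1),
                         (RACKS - 1) + BALLS_PER_RACK)

print(f"old max = {old_max}, new max = {new_max}")
print(f"score variance at 50% shooting: old = {old_var}, new = {new_var}")
```

Each regular ball converted to a moneyball counts four times as much toward the variance, so under these assumptions the new format widens the spread of scores even though the number of shots is unchanged.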

Under- and over-performing shooters

A regression output that's usually ignored is the list of residuals. A residual is the model's error for a single observation in the data (here, a single player's round), and it's usually squared or standardized. Residuals are a way to spot outliers and trends. In this context, they also show which players consistently do better or worse than their shooting percentages and the round would predict.

The table below includes every player with at least four rounds from 2000 to 2013 and their average model error. In this case, a positive error is good; they're over-performing in the contest based on their percentages. A negative error means they're shooting worse than you'd expect. Based on a robust sample of 11 rounds, two contests won, two others coming in second place, and outperforming the model by an average of 3.6 points, Peja may possess a little extra magic we can't quite capture by stats. Arenas is entirely buoyed by scoring a 23 in the first round and never repeating a similar feat, but it was only 4 rounds. Billups and Nash, unfortunately, never lived up to their reputations. Nash is currently 9th all-time in three-point percentage, but his average score was 14.

Residual errors (not squared) for players with at least 4 rounds, 2000 to 2013

Player               Total rounds   Average error (points)
Peja Stojakovic      11              3.6
Gilbert Arenas        4              3.0
Voshon Lenard         4              2.7
Daequan Cook          4              2.4
Jason Kapono          6              1.6
James Jones           4              1.4
Kevin Love            4              1.0
Quentin Richardson    4              0.9
Ray Allen            10              0.6
Kyle Korver           4              0.4
Wesley Person         6              0.4
Dirk Nowitzki        10              0.0
Kevin Durant          4              0.0
Paul Pierce           5             -0.4
Rashard Lewis         5             -1.6
Chauncey Billups      4             -1.8
Steve Nash            4             -2.0
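
The averaging behind a table like this one can be sketched in Python as follows. The rounds listed are hypothetical (made-up players, scores, and predictions), and the function name is my own; the point is only to show the actual-minus-predicted convention and the four-round cutoff.

```python
from collections import defaultdict


def average_errors(rounds, min_rounds=4):
    """rounds is an iterable of (player, actual_score, predicted_score).

    Returns {player: (total_rounds, average_error)} for players with at
    least min_rounds rounds. Error is actual minus predicted, so a
    positive value means the player beat the model.
    """
    errors = defaultdict(list)
    for player, actual, predicted in rounds:
        errors[player].append(actual - predicted)
    return {
        player: (len(errs), sum(errs) / len(errs))
        for player, errs in errors.items()
        if len(errs) >= min_rounds
    }


# Hypothetical rounds, not taken from the real contest data.
rounds = [
    ("Player A", 20, 16.0),
    ("Player A", 18, 16.5),
    ("Player A", 15, 16.0),
    ("Player A", 19, 17.5),
    ("Player B", 12, 15.0),
    ("Player B", 14, 15.5),
]

summary = average_errors(rounds)
for player, (n, avg) in summary.items():
    print(f"{player}: {n} rounds, average error {avg:+.1f}")
```

Player B drops out because two rounds is too small a sample, which is the same reason the table stops at players with four or more.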

Speaking of all-time great shooters who have not done well, I expect many people will bring up Curry, but he hasn't strictly been disappointing. Before the 2014 contest, his average score was 17.3 with an average error of 0.8; he actually did a little better than expected. He has been very consistent, so he's never had a huge score, but on average he does quite well. Translating his 16-point total from 2014 to the old rules with fewer moneyballs roughly equates to 14 points, which isn't a disaster. I'd say he hasn't underperformed based on his shooting skill. The problem, rather, is that people rate number-one-ranked players and teams far too highly relative to the field in most situations, especially in this contest where the results are noisy.
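
One simple way to translate a score between rule sets is to scale it by the ratio of the two maximums. That proportional approach is an assumption on my part rather than a documented conversion, but it lands on the same rough figure of 14.

```python
OLD_MAX = 30  # maximum under the old rules
NEW_MAX = 34  # maximum with the moneyball-only rack


def to_old_rules(new_rules_score):
    """Rescale a new-rules score so it is the same fraction of the old maximum."""
    return new_rules_score * OLD_MAX / NEW_MAX


curry_2014_old_rules = to_old_rules(16)
print(f"16 under the new rules is about {curry_2014_old_rules:.1f} under the old ones")
```

A more careful conversion would need to know which rack held the extra moneyballs and how many of them went in, which a single total doesn't reveal.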

If the league wants the best shooters possible, they can't ignore previous seasons of data about three-point accuracy. And I'd suggest letting three players play in the finals again because luck is too much of a factor already.

As a final note, I want to comment on the strategy of where to place the moneyball rack. Players are afraid of using it in the last corner because they fear they won't be able to finish the rack and could waste the extra points. This is not an ideal strategy for a few reasons. Besides how much closer the line is in the corner and how most players shoot better from that distance, players either finish rounds or have the clock expire as they reach and try to shoot the last ball. It's rare that a player leaves two or more balls unused, Joe Johnson notwithstanding. But no matter where you place the rack, the last ball will always be a moneyball, and if you're afraid of time expiring with two or more balls left you won't have a good chance at advancing anyway. (Although I'd prefer the old rules be reinstated because we have so few era-neutral basketball aspects to judge players. Adding moneyballs only increases luck and variability, and this contest already has that in spades.)

2 comments:

  1. Great writeup. I agree that there was no reason to "fix" the contest, but I do like the idea of a moneyball-only rack. Adds an element of strategy to the contest. Looking forward to any further research you might do on the topic.
