Monday, February 11, 2013

Introducing Rookie Projections

Introduction

Every year, basketball fans look forward to a fresh new batch of NBA rookies and scour their statistics to see what will amount to the future of the league. Some will post unbelievable stats like a 20-10 year or shoot the ball like only a few elite veterans can, but what does it mean for their future, and how much better can they get? Tyreke Evans averaged 20-5-5 after one year in college, but sadly his rookie year has been the highlight of his career. Deron Williams had a rough first season, having problems finding the net and hardly being the distributor in the Utah system that he would become. So it's unclear how one should interpret rookie stats. If you rebound at an elite level at such an early age, where is there room to improve? Or is it an indication you have lots of potential? This is the first part in a series of articles looking at the subject closely.

Methodology

Unfortunately, wanting to know how good someone will become means you need some objective measure for goodness. I don't think there is one true metric, so I'll try out the popular all-in-one metrics like Win Shares, PER, RAPM and maybe WARP. Looking at a player and thinking instantly, Oh, he's good, doesn't really translate into a statistical model; you need a number.

So the basis of this study will be translating a rookie metric, like a PER of 18, into a future "peak" PER through the use of indicator variables available in the rookie year -- age, for example, or free throws per field goal attempt.

The form of the model will be multivariate exponential, as shown below. Basically, it just means you stuff a bunch of variables into an exponential function, which allows for easy interpretation and (usually) a better fit than a straight-line linear regression.

Future value = Rookie value*exp(age + β1*x1 + β2*x2 + ... + βn*xn)

Where value can be PER, Win Shares or any other metric. Age is always important in projecting a rookie into the future, as competing as a teenager is quite different from competing as a 23-year-old with four years of college under his belt. Variables x1 through xn can be anything from height to offensive rebounds per minute. The beta (β) values are coefficients, giving weight to each variable and translating them so the variables can be used together. Some of the exploratory analysis was done by transforming this equation into linear form -- ln(Future value/Rookie value) = age + β1*x1 + β2*x2 + ... + βn*xn -- but in the refined analysis this will be done as nonlinear regression in a statistical program, minimizing the error on the general model.
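To make the two forms concrete, here is a minimal Python sketch of the exponential model and its linear transform. The function names (project, log_ratio), the coefficients and the player numbers are all made up for illustration, and the optional intercept is an assumption; this is not the fitted model.

```python
import math

def project(rookie_value, xs, betas, intercept=0.0):
    """Future value = rookie value * exp(intercept + sum(beta_i * x_i))."""
    linear_part = intercept + sum(b * x for b, x in zip(betas, xs))
    return rookie_value * math.exp(linear_part)

def log_ratio(rookie_value, future_value):
    """Left-hand side of the linear form: ln(future / rookie)."""
    return math.log(future_value / rookie_value)

# Hypothetical coefficients for (age - 18) and defensive rebounds per minute.
betas = [-0.05, 1.5]
# Hypothetical rookie: PER of 15, age 20, 0.18 defensive rebounds per minute.
rookie_per = 15.0
xs = [20 - 18, 0.18]

future_per = project(rookie_per, xs, betas)

# Taking the log of the ratio recovers the linear predictor exactly,
# which is why ordinary linear regression works for exploratory fits.
linear_part = sum(b * x for b, x in zip(betas, xs))
recovered = log_ratio(rookie_per, future_per)
```

Here recovered and linear_part agree up to floating-point error, so the exploratory linear fit and the exponential model describe the same relationship; they differ only in which error gets minimized.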

Although the exp (also read as e^) looks intimidating, exponential models allow for easy interpretation. With a tiny bit of mathematical untangling, you can pull out any variable from the model and calculate its effect as a percentage change in the metric. For example, with a coefficient of -0.1 for the variable age minus 18, the projection for a 22-year-old rookie can be calculated as: Rookie value*exp(age + xn) = Rookie value*exp(age)*exp(xn) = Rookie value*exp( -0.1*(22 - 18) )*exp(xn) = Rookie value*0.67*exp(xn). The xn just stands for all the other variables. What the rookie's age of 22 did was reduce his future value by 33 percent, compared to if he had been 18.
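The age example can be checked in a couple of lines of Python. The helper name percent_change is made up for this sketch; the coefficient of -0.1 and the ages are the ones from the example above.

```python
import math

def percent_change(coefficient, delta):
    """Percent change in the projected metric when one variable moves by delta."""
    return (math.exp(coefficient * delta) - 1.0) * 100.0

# Coefficient of -0.1 on (age - 18), rookie aged 22.
multiplier = math.exp(-0.1 * (22 - 18))
age_effect = percent_change(-0.1, 22 - 18)

print(round(multiplier, 2))  # 0.67
print(round(age_effect, 1))  # -33.0
```

Because the exponential of a sum is a product of exponentials, this multiplier is the same no matter what the other variables are.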

The data range

The most difficult part of locating the right data is deciding the endpoints. How far back into the NBA's past do you go? If you choose a wide range of seasons you get more data, but you're then looking at a league played much differently than today; if you choose a small range of seasons, you may not have enough samples to find significance of variables.

What I've done so far is load rookies with at least 500 minutes from seasons 2000 to 2005. I can't use more recent seasons because some of those players have yet to hit their prime years. I may include some of the late 90's seasons. I'll also check the robustness of using 500-minute seasons for samples; I may increase the cutoff point.

Another point is using peak seasons. What I've decided is to use the average metric score of three seasons at ages 26, 27 and 28 (right now I have age 25 in there, but I'll replace that). Generally, players peak at 27, but this is, obviously, not always true; it's best to think of this projection as, How good will this rookie be at around age 27? Additionally, what bodes well for a point guard may not bode well for a center, so I separated the players into three groups: point guards, wings and frontcourt players. Deciding who was a wing or a frontcourt player was somewhat arbitrary, especially with today's stretch 4's, but it was the best I could do.
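The sample selection and the peak-season average could look roughly like the Python sketch below. The record layout, the player entries and the function names are invented for illustration. How a season is labeled (starting or ending year) and what to do when a peak-age season is missing are not specified above, so those choices are assumptions here.

```python
from statistics import mean

MIN_MINUTES = 500                    # rookie-season minutes cutoff
ROOKIE_SEASONS = range(2000, 2006)   # seasons 2000 through 2005
PEAK_AGES = (26, 27, 28)             # planned peak window

def qualifies(rookie):
    """True if the rookie season is in range and meets the minutes cutoff."""
    return rookie["season"] in ROOKIE_SEASONS and rookie["minutes"] >= MIN_MINUTES

def peak_value(metric_by_age, peak_ages=PEAK_AGES):
    """Average the metric over the peak ages that exist; None if none do.

    Averaging only the available seasons is an assumption of this sketch.
    """
    values = [metric_by_age[age] for age in peak_ages if age in metric_by_age]
    return mean(values) if values else None

# Made-up players, not real data.
rookies = [
    {"name": "Player A", "season": 2003, "minutes": 1800,
     "per_by_age": {21: 14.0, 26: 18.0, 27: 20.0, 28: 19.0}},
    {"name": "Player B", "season": 2004, "minutes": 320,
     "per_by_age": {22: 11.0, 26: 12.0, 27: 12.5, 28: 13.0}},
    {"name": "Player C", "season": 1998, "minutes": 2100,
     "per_by_age": {20: 16.0, 26: 21.0, 27: 22.0, 28: 20.0}},
]

sample = [r for r in rookies if qualifies(r)]
peaks = {r["name"]: peak_value(r["per_by_age"]) for r in sample}
```

Only Player A survives both filters in this toy data: Player B falls under the minutes cutoff and Player C is outside the season range.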

Preliminary analysis

One important note is that I'm projecting the change of a player's current metric into the future. If there are certain indicators saying a player won't improve much, that doesn't mean a player isn't very good; it depends where the player is starting from. There are certain types of players who don't improve very much, like energy big men who rely on athleticism and crashing the boards, and there are certain types who do, like long-range shooters. This analysis will find the variables that allow one to predict these changes. A positive variable like shotblocking may be inversely related to improvement, but that doesn't mean shotblocking is negative; it just means the player has less of a chance of increasing his metric's score at a high rate.

Currently, I've spent most of my time adjusting the frontcourt player regression models, checking different variables, looking at patterns, testing for outliers, etc. (Eddie Griffin, sadly, had a tumultuous career and passed away before he even reached his peak age.) The problem so far is finding a larger set of explanatory variables; the models have fairly little predictive power. As one example, the table below shows the variables for a model with 58 players and an adjusted R^2 of 0.2217 for projecting PER. Age was more significant in other models, and it might be improved when I switch to looking at age 28 instead of 25, especially with the older rookies. Interestingly, TS% is inversely related to future PER, but it is a small effect. Going from a TS% of 50 to 55 means you'll lose 3.1 percent of your future PER. It's hardly earth-shattering, but it's near the typical significance level, where the p-value would be 0.05. What this could be saying is that if you start your career with low efficiency, you have more room to improve.

Defensive rebounds are more strongly correlated with improvement than offensive rebounds. I think this is reasonable when you imagine the player types who dominate offensive boards: typically, they're not part of the offensive set and vulture the boards off errant shots, standing close to the rim instead of, say, stretching the floor with a midrange shot. Going from a respectable 0.15 defensive boards per minute to a commendable 0.20 boosts your future PER by 14 percent. Turnovers are actually a positive sign going forward for a player, and I think this is for two reasons: it's something that's easier to fix, and it indicates that a player is trying on the offensive end. The effect on a rookie is surprisingly strong, as going from 0.04 turnovers a minute to 0.07 improves one's PER by 19 percent. One odd variable I found to be significant was FTA/OReb, where players who generate more free throws relative to offensive boards do not improve as much. I'm not quite sure why this is, but it may be highlighting energy big men who generate free throws inside but don't pick up offensive boards at a prodigious rate. Note that offensive rebounds or free throws separately were not significant.

Variable     Coefficient   P-value
Intercept    0.328         0.352
Age          -0.0218       0.0821
TS%          -0.0109       0.0816
DReb/min     2.03          0.0217
TO/min       4.78          0.0115
FTA/OReb     -0.112        0.0416
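For anyone who wants to play with the table, the Python sketch below turns a coefficient into a percentage change using the exponential interpretation from the methodology. It assumes the coefficients apply to the variables in the units named (TS% in percentage points, per-minute rates as raw per-minute values), which the table does not state. Under these assumptions the outputs need not match the percentages quoted above, which may come from a different run of the model.

```python
import math

# Coefficients copied from the table above.
COEFFICIENTS = {
    "Intercept": 0.328,
    "Age": -0.0218,
    "TS%": -0.0109,
    "DReb/min": 2.03,
    "TO/min": 4.78,
    "FTA/OReb": -0.112,
}

def percent_change(variable, start, end):
    """Percent change in projected PER when one variable moves from start to end.

    Assumes the variable enters the exponent as coefficient * value,
    in the units described in the lead-in (an assumption, not confirmed).
    """
    beta = COEFFICIENTS[variable]
    return (math.exp(beta * (end - start)) - 1.0) * 100.0

# Direction checks for the effects discussed in the text.
dreb_effect = percent_change("DReb/min", 0.15, 0.20)  # positive coefficient
to_effect = percent_change("TO/min", 0.04, 0.07)      # positive coefficient
ts_effect = percent_change("TS%", 50, 55)             # negative coefficient
```

The signs follow the table: more defensive rebounds and more turnovers per minute raise the projection, while a higher rookie TS% lowers it.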

For fun, I input Andre Drummond's numbers and found that his future projected PER was an astounding 29.5. Keep in mind this is an average projection of his seasons from ages 25 to 27, and a PER of 30 is historic; Hollinger describes it as a runaway MVP year. Drummond has an excellent rookie PER, he's a fantastic rebounder, he's 19 and his TS% is actually modest, due to his terrible foul shooting. Anthony Davis projects to "only" 24.2.

Future posts

In other articles, I'll refine the analysis and look at even more variables, including things like height, wingspan and TS% times Usage%. I may open the analysis to more seasons, but there's less information about seasons in the 90's. As a last note, I want to keep this analysis open to the public and I'll include my final scripts, which I'll probably write in the free program R. Much of the basketball stat community is owned by various entities, from teams to large websites, and much of the information is no longer free and open to the public but proprietary and closed. We're not tracking Soviet Russian troop movements here; it's just basketball.

2 comments:

  1. Obvious problem here is that the domain of many (most) player ratings models is the entire real line, and you have chosen to limit it to the positive real line by using an exponential function. Linear models > Non-linear models when signal/noise is small.

    Replies
    1. Well, one reason why I chose exp. is that I personally like the form, which is a stupid reason, but I also assumed most people used a linear model and wanted to see if there were any advantages to exp. I think I'll test out a few different model forms.

      Here's one issue: with linear, you assume an increase in a variable by a certain amount helps players with low PERs (11) and ones with high PERs (25, for instance) by the same amount. Is this true? Will better players increase at a faster rate or a slower rate? I'll look into the issue.

      Let me know if you have any other recommendations and I'll be sure to cite you. Thanks.
