Friday, February 20, 2015

NBA Analytics - What's in a Win?



Anyone who has taken an intro statistics course knows correlation does not imply causation, but it does suggest some relationship between variables, even if only because both share the same underlying cause. In honor of the All-Star break, I decided to pull some data from Basketball-Reference.com and throw together some correlation calculations and plots to see which team stats correlate with wins. This is not meant to be a scientifically rigorous study, and any conclusions or apparent results should be taken with a grain of salt.

If there's one thing that seems like it should correlate with wins, it's shooting percentage. After all, if your team tends to make more of its shots than other teams, you'll probably win a lot of games. Let's find out!

cor(nba$wins, nba$FGP.x)
## [1] 0.6897
[Plot: wins vs. field goal percentage]
As expected, field goal percentage shows a strong positive correlation with wins. The main shortcoming of raw field goal percentage as a measure of offense is that it ignores the fact that 3-point shots are worth 50% more than 2-point shots. Effective field goal percentage takes this difference into account:

cor(nba$wins, nba$eFGP)
## [1] 0.798
[Plot: wins vs. effective field goal percentage]
Effective field goal percentage has a correlation of nearly 0.80 with wins. It's almost surprising the correlation is not stronger, except that free throws are not included in eFG%. True shooting percentage does include free throws, and because it is computed from points scored it also reflects shot value, so I'd expect an association at least as strong. Let's find out.

cor(nba$wins, nba$TSP)
## [1] 0.8162
[Plot: wins vs. true shooting percentage]
True shooting percentage has a slightly stronger correlation to wins than effective field goal percentage. Interestingly, true shooting percentage is an even stronger indicator of wins than points per game, which has a correlation of 0.77 to wins.
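
For reference, both shooting metrics can be computed from box-score totals. What follows is a minimal sketch using the commonly published formulas; the function names, argument names, and the box-score line are hypothetical and made up for illustration, not taken from the `nba` data frame.

```r
# Hypothetical helpers; argument names are illustrative, not columns of `nba`.

# Effective field goal percentage: a made 3 counts as 1.5 made field goals.
efg_pct <- function(fg, fg3, fga) {
  (fg + 0.5 * fg3) / fga
}

# True shooting percentage: points per scoring attempt, where 0.44 * FTA
# is the conventional estimate of possessions that end in free throws.
ts_pct <- function(pts, fga, fta) {
  pts / (2 * (fga + 0.44 * fta))
}

# Made-up box-score line: 40 makes on 85 attempts, 10 of them threes,
# plus 18 made free throws on 24 attempts.
fg <- 40; fg3 <- 10; fga <- 85; ft <- 18; fta <- 24
pts <- 2 * (fg - fg3) + 3 * fg3 + ft

efg <- efg_pct(fg, fg3, fga)
ts  <- ts_pct(pts, fga, fta)
```

Because the numerator of true shooting percentage is points, a made 3 already counts for more than a made 2 in that metric.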

It should come as no surprise that margin of victory is very tightly tied to wins:

cor(nba$wins, nba$MOV)
## [1] 0.9677
[Plot: wins vs. margin of victory]
Thus far I have investigated stats that translate directly into points, which unsurprisingly are strongly related to wins. Now I'll look at some stats that do not necessarily translate as directly into points.

Total rebounding is only weakly correlated with wins:

cor(nba$wins, nba$TRB)
## [1] 0.2461
[Plot: wins vs. total rebounds]
Digging deeper, defensive rebounds show a moderate positive correlation at 0.45, while offensive rebounds actually have a negative correlation with wins at -0.21. The talking heads often make a big deal about the importance of offensive rebounding and second-chance points, but it seems offensive rebounding actually makes a team less likely to win! What gives?

This appears to be a classic case of what statistics folk call a confounding or lurking variable: a factor we haven't accounted for that influences the variables we are testing. What might the lurking variable be in this case? My intuition tells me teams that miss more shots have more opportunities for offensive rebounds. If poor-shooting teams tend to get more offensive rebounds, it would make a lot more sense to see offensive rebounding having this counter-intuitive negative correlation with wins. Let's check the numbers.

cor(nba$FGP.x, nba$ORB)
## [1] -0.4691
[Plot: field goal percentage vs. offensive rebounds]
Sure enough, there is a moderate negative correlation between field goal percentage and offensive rebounding. So while nobody doubts that offensive rebounds are a good thing if you can get them, it is better for the ball to go through the hoop in the first place.
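
One way to probe a suspected lurking variable is a partial correlation: regress both variables on the suspected confounder and correlate what is left over. The sketch below runs on synthetic data, not the NBA data set, and `partial_cor` is a hypothetical helper. It illustrates the idea rather than reproducing the analysis above.

```r
# Partial correlation of x and y after removing the linear effect of z.
# `partial_cor` is a hypothetical helper, not a function used in the post.
partial_cor <- function(x, y, z) {
  rx <- resid(lm(x ~ z))  # part of x not explained by z
  ry <- resid(lm(y ~ z))  # part of y not explained by z
  cor(rx, ry)
}

# Synthetic data (NOT NBA data): shooting drives both wins and offensive
# rebounds, and rebounds have no effect of their own on wins.
set.seed(1)
n <- 1000
shooting <- rnorm(n)
orb  <- -0.6 * shooting + rnorm(n)
wins <-  0.8 * shooting + rnorm(n)

raw_r     <- cor(wins, orb)                    # negative, induced by shooting
partial_r <- partial_cor(wins, orb, shooting)  # should be near zero
```

With the post's data frame the equivalent call would be `partial_cor(nba$wins, nba$ORB, nba$FGP.x)`. If the negative relationship shrinks toward zero once shooting is held fixed, that supports the lurking-variable explanation.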

To round out common team stats, let's check out assists, steals, blocks, turnovers, and personal fouls.

cor(nba$wins, nba$AST)
## [1] 0.6287
cor(nba$wins, nba$STL)
## [1] 0.2824
cor(nba$wins, nba$BLK)
## [1] 0.2925
cor(nba$wins, nba$TOV)
## [1] -0.2592
cor(nba$wins, nba$PF)
## [1] -0.2309

As we would expect, assists show a moderately strong correlation with wins, as each assist translates directly to points. Both steals and blocks show weak positive correlations with wins, while turnovers and fouls show weak negative correlations.
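
The five separate calls above can be collapsed into a single pass over column names. This is a minimal sketch: `cor_with` is a hypothetical helper, and the toy data frame is made up so the snippet runs on its own.

```r
# Correlate each column named in `stats` with the `target` column of `df`,
# returning a named vector sorted from most positive to most negative.
# `cor_with` is a hypothetical helper, not part of the original analysis.
cor_with <- function(df, target, stats) {
  r <- sapply(stats, function(s) cor(df[[target]], df[[s]]))
  sort(r, decreasing = TRUE)
}

# Toy data frame, made up for illustration only
toy <- data.frame(
  wins = c(20, 30, 41, 50, 60),
  AST  = c(19, 22, 21, 24, 26),
  TOV  = c(16, 15, 15, 13, 14)
)
r <- cor_with(toy, "wins", c("AST", "TOV"))
```

Applied to the post's data, the call would be `cor_with(nba, "wins", c("AST", "STL", "BLK", "TOV", "PF"))`, which keeps the stats in one ranked vector instead of five scattered outputs.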

Defense appears difficult to quantify. While defense is clearly important, steals and blocks are only weakly associated with scoring and wins. Since there's no stat that keeps track of contesting shots, opponent field goal percentage seems like the next best thing. Let's see how it correlates with wins.

cor(nba$wins, nba$O_FGP.x)
## [1] -0.6509
[Plot: wins vs. opponent field goal percentage]
The negative correlation between opponent shooting and wins is almost as large in magnitude as the positive correlation between field goal percentage and wins. To go a little deeper, let's investigate shooting percentage differential: the difference between a team's shooting percentage and that of its opponents.

cor(nba$wins, nba$FGP.x - nba$O_FGP.x)
## [1] 0.8306
[Plot: wins vs. field goal percentage differential]
Shooting percentage differential shows the strongest correlation to wins of any variable other than margin of victory. If you shoot a better percentage than your opponents more often than not, you are probably going to win a lot of games.
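
Correlations this close to one another (0.83 here versus 0.82 for true shooting percentage) are hard to rank without some sense of sampling error. A hedged sketch of how to get a confidence interval with base R's `cor.test` follows. The data are synthetic, and the sample size of 30 is an assumption chosen to match the number of NBA teams; the post does not say how many rows `nba` has.

```r
# Synthetic example (NOT NBA data): 30 observations with a strong linear
# relationship. The sample size is an assumption, not taken from the post.
set.seed(42)
x <- rnorm(30)
y <- 0.8 * x + rnorm(30, sd = 0.6)

ct <- cor.test(x, y)      # Pearson correlation by default
r  <- unname(ct$estimate) # same value cor(x, y) returns
ci <- ct$conf.int         # 95% interval based on Fisher's z transform
```

On the real data, `cor.test(nba$wins, nba$FGP.x - nba$O_FGP.x)$conf.int` would give the corresponding interval. If the intervals for two stats overlap heavily, the ordering between them should not be read as meaningful.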

Looking at the correlations between raw stats and wins might be interesting, but it doesn't necessarily translate into information teams can use strategically. A team can't just decide to shoot better. It can, however, adjust shot selection, which, in turn, affects shooting percentage, points, and wins. How does shot selection correlate with wins? Let's check the correlation between wins and the percentage of a team's shots taken from within 3 feet, from 3-10 feet, from 10-16 feet, and as long 2's.

cor(nba$wins, nba$FGA_3)
## [1] 0.04409
cor(nba$wins, nba$FGA_10)
## [1] -0.132
cor(nba$wins, nba$FGA_16)
## [1] -0.1531
cor(nba$wins, nba$FGA_Long2)
## [1] -0.3801
[Plot: wins vs. share of shots by distance]
The data show that the percentage of shots taken within 3 feet has near-zero correlation with wins, shots taken from 3 to 16 feet have a very weak negative correlation, and the percentage of long 2's a team takes (shots between 16 feet and the 3-point line) has a somewhat stronger negative correlation with wins. On the other hand, shooting more 3-point shots has a moderate positive correlation with wins:

cor(nba$wins, nba$FGA_3point)
## [1] 0.4506
[Plot: wins vs. 3-point attempts]
The Houston Rockets, a team known for embracing analytics, appear to be making an effort to shoot a lot of 3's and avoid long 2's. They also have a very good record at 36 wins and 17 losses. Does data analysis really provide a strategic edge in basketball, or is this simply coincidence? That question is subject to debate, and as with many questions in analytics, we won't really know until we have more data.

7 comments:

  1. This comment has been removed by the author.

  2. Hello, I found your blog through Class-Central. I saw that you had reviewed both 15.071x and Data Analytics and Statistical Inference from Duke University, and given them both great reviews.

    Now I am sort of torn between which one of these I should choose, it will be my first MOOC and I have no prior experience with statistics nor R but a good deal of experience with programming. Which one would you recommend I start with?

    1. It depends on what you want to learn. The Duke course is a straight-up intro to stats with some optional (recommended) assignments in R. Analytics Edge is a predictive modeling course where you mainly make machine learning models with R; it uses some statistical concepts, but doesn't focus on teaching statistics itself. I found Analytics Edge more interesting, but it is more work and it would go smoother if you knew a bit of R and stats ahead of time. Ideally, I would take the Duke course first and Analytics Edge second.

    2. Well, I want to learn data science. Initially I had intended to pursue the Data Science specialization track from Johns Hopkins University, but after reading your reviews of those courses I decided against that.

      Now from what I understand data science is a discipline of many skills including statistics. So I am trying to learn all of them eventually.

      Analytics Edge does mention that you need high school math as a prereq, and in your review you mention that it is not math heavy. I'm currently trying to brush up on my math skills using KhanAcademy but I haven't got that far yet (just mastered the Arithmetic and Early Math track, almost done with Pre-Algebra). So I'm wondering if it's a better idea for me to stay away from that one until my math is at a higher level.

      Many thanks for the replies. Your reviews have been very helpful to me.

    3. For data science you'll want to take statistics eventually, but as I recall the math in Analytics Edge is pretty basic, so I wouldn't avoid it for that reason. If I had to take one or the other, I'd take Analytics Edge, because there are many other intro stats courses around that are pretty good.

    4. Alright, I take it from your answer that some stats knowledge would be a helpful prereq for Analytics Edge then? Do you think just going through the stats track on KhanAcademy would be enough?

    5. As long as you know basic descriptive statistics like mean, median and variance you should be fine; they teach most of what you need to know.
