Up until now my posts on Behind the Scoreboard have been almost exclusively about baseball - but I do plan to cover other sports as well. Today's topic is college basketball, where conference tournaments are underway and soon the selection committee will be picking the bids for entry into the Big Dance.
An often-debated subject is the statistical formula of RPI, which, contrary to the title of this post, stands for Rating Percentage Index. As I'm sure you are aware, the formula is meant to be a non-biased rating of each Division I college basketball team. The formula is as follows:
RPI = .25*(WPCT) + .5*(Opponents' WPCT) + .25*(Opponents' Opponents' WPCT)
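For readers who want to plug in their own numbers, here is a minimal Python sketch of the formula above. The function name rpi and its argument names are my own, and the sample team is hypothetical; the sketch takes the three winning percentages as given and ignores any bookkeeping about how they are compiled.

```python
def rpi(wpct, opp_wpct, opp_opp_wpct):
    """Weighted sum from the formula above: 25% / 50% / 25%."""
    return 0.25 * wpct + 0.5 * opp_wpct + 0.25 * opp_opp_wpct


# Hypothetical team (not one from this post): .800 WPCT against
# .500 opponents whose own opponents are also .500.
example = rpi(0.800, 0.500, 0.500)
print(f"{example:.3f}")  # 0.575
```

Note that half of the rating comes from the opponents' record, which is the weighting the rest of this post takes issue with.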
I won't be the first to criticize the formula and I certainly won't be the last, but I want to approach it from a slightly different angle than I've heard before. One of the consistent criticisms you hear is that the formula punishes weak schedules too harshly - let's consider the following 3-team example to see if that's true:
For the sake of simplicity, let's assume that the opponents' opponents' WPCT is constant at .500 for each of the 3 teams.
Then take the following:
Team A had an .857 WPCT against a schedule of opponents with a .400 WPCT
Team B had a .700 WPCT against a schedule of opponents with a .500 WPCT
Team C had a .300 WPCT against a schedule of opponents with a .700 WPCT
Assuming that each of the teams is playing to its ability, which of the teams is the best? Think about it, then continue reading. RPI would have you believe that Teams B and C were tied for the best with an RPI of .550, while Team A was the worst with an RPI of .539. But is there a better, objective way to determine this?
In fact, yes. Bill James' Log5 rule, which is statistically sound and based on the theory of logistic regression (whether Bill knew it at the time or not), can be used to predict the probability of a victory when the two teams' true WPCTs are known. It puts the chance that a team with true WPCT a beats a team with true WPCT b at a(1-b) / (a(1-b) + b(1-a)). In this case we know the probability of victory (the team's actual WPCT) and the opponents' WPCT, so we can solve for the team's "true" (opponent-neutral) WPCT. For instance, for Team B it computes (rather intuitively) that if a team has won 70% of its games against .500 opponents, then its true opponent-neutral WPCT is .700. The other results are seen in the table below:
Team   WPCT   Opponents' WPCT   RPI    Opponent-neutral WPCT
A      .857   .400              .539   .800
B      .700   .500              .550   .700
C      .300   .700              .550   .500
As you can see, RPI gets it all wrong. In order to post an .857 WPCT against .400 opponents, Team A would need to have an opponent-neutral WPCT of .800, by far the highest of the 3 teams. While RPI ranks Team A last, it is easily the best team in the field. For comparison, to accomplish what Team C did, only an opponent-neutral WPCT of .500 is necessary. And although Team C and Team B are tied in RPI, Team B's opponent-neutral WPCT of .700 is far better than Team C's.
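The opponent-neutral figures above can be reproduced with a short Python sketch. It assumes, as the example does, that each team's whole schedule can be treated as a single opponent with the schedule's average WPCT. The function names log5 and opponent_neutral_wpct are my own labels, not anything from Bill James.

```python
def log5(team_wpct, opp_wpct):
    """Log5 chance that a team with true WPCT team_wpct beats opp_wpct."""
    num = team_wpct * (1 - opp_wpct)
    return num / (num + opp_wpct * (1 - team_wpct))


def opponent_neutral_wpct(observed_wpct, opp_wpct):
    """Invert Log5: the true WPCT that would produce observed_wpct
    against opponents with WPCT opp_wpct.

    In odds form, Log5 says odds(observed) = odds(true) / odds(opponent),
    so odds(true) = odds(observed) * odds(opponent).
    """
    num = observed_wpct * opp_wpct
    return num / (num + (1 - observed_wpct) * (1 - opp_wpct))


# (observed WPCT, opponents' WPCT) for the three teams in the example
teams = {"A": (0.857, 0.400), "B": (0.700, 0.500), "C": (0.300, 0.700)}

for name, (wpct, opp) in teams.items():
    true_wpct = opponent_neutral_wpct(wpct, opp)
    # Sanity check: feeding the answer back into Log5 recovers the record
    assert abs(log5(true_wpct, opp) - wpct) < 1e-9
    print(f"Team {name}: {true_wpct:.3f}")
# Team A: 0.800
# Team B: 0.700
# Team C: 0.500
```

Working in odds makes the inversion a single multiplication, which is why a strong record against weak opponents and a weak record against strong opponents can land on very different true WPCTs even when RPI calls them close.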
RPI is not only off, but WAY off even in this simple example. Its error is the equivalent of ranking the NBA's Philadelphia 76ers above the Los Angeles Lakers. It greatly overemphasizes strength of schedule and discounts the team's actual WPCT. So, yes, Virginia, there really is a bias against mid-major schools that play softer schedules but rack up a lot of wins.
Of course both WPCT and Opponents' WPCT (as well as Opponents' Opponents' WPCT) are important, but the formula simply adds the parts together where an interaction between them is necessary. A tough schedule means nothing if you don't win any games, while a high winning percentage means nothing if the wins are all against the college basketball equivalent of the Washington Generals. The formula is not merely wrong; its construction is not even in the right ballpark.
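To make the additive-versus-interaction point concrete, here is a rough Python sketch with two hypothetical teams that are not part of the example above: a winless team that played .900 opponents, and a .900 team that played .100 opponents. Opponents' opponents' WPCT is again assumed to be .500, and the inverted Log5 calculation stands in for "a formula with an interaction"; the function names are my own.

```python
def rpi(wpct, opp_wpct, opp_opp_wpct=0.5):
    """Additive RPI formula; opponents' opponents' WPCT assumed .500."""
    return 0.25 * wpct + 0.5 * opp_wpct + 0.25 * opp_opp_wpct


def opponent_neutral_wpct(observed_wpct, opp_wpct):
    """Inverted Log5: odds(true) = odds(observed) * odds(opponent)."""
    num = observed_wpct * opp_wpct
    return num / (num + (1 - observed_wpct) * (1 - opp_wpct))


# Hypothetical extremes: (observed WPCT, opponents' WPCT)
extremes = {
    "winless vs. tough schedule": (0.000, 0.900),
    "9-of-10 vs. cupcakes": (0.900, 0.100),
}

for label, (wpct, opp) in extremes.items():
    additive = rpi(wpct, opp)
    neutral = opponent_neutral_wpct(wpct, opp)
    print(f"{label}: RPI {additive:.3f}, opponent-neutral {neutral:.3f}")
# winless vs. tough schedule: RPI 0.575, opponent-neutral 0.000
# 9-of-10 vs. cupcakes: RPI 0.400, opponent-neutral 0.500
```

Because RPI only adds its pieces, the winless team collects schedule credit it never earned on the court, while the multiplicative odds form lets a zero in one factor zero out the result.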
There are other formulas and algorithms out there that are a lot more statistically sound than RPI, and it's too bad the selection committee isn't fed those numbers instead. You would think that over 200 institutions full of academics could find a better way to rank their basketball teams - but alas, somehow the basketball world is stuck with RPI.