How predictive is the O’s record on July 1 of their season-ending record?
I thought about this question a lot after reading Nate Silver’s The Signal and the Noise, especially Chapter 8, where he talks about the value of thinking probabilistically and how to use Bayes’s theorem as a tool. The theorem seems like such a natural way to think about whether a hypothesis is true, or whether a particular piece of evidence matters.
I am not a statistician, but the theorem seemed easy enough to grasp, and after thinking about it for a while, I came up with a way to use it in baseball. This article is the result of that thinking!
Using the Orioles’ 60 full seasons (since 1954), I have data to make the following statements:
- If the Orioles are at .500 or above on July 1, I estimate their chances of finishing at .500 or above at 94%.
- Conversely, if the Orioles are below .500 on July 1, I estimate their chances of finishing below .500 at 88%.
I looked at the team’s record on July 1 for a couple of reasons. One, it’s early enough in the season that it has some use; after all, if I tell you a team’s record on September 1, you may not care, because by then many teams’ records seem like foregone conclusions. Two, baseball-reference.com has a “Standings on any date” feature that makes doing research straightforward.
Showing the Work
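Bayes’s theorem can be written as: (xy)/(xy + z(1-x)). Here x is the initial estimate (the prior) that the hypothesis is true, y is the probability of seeing the evidence if the hypothesis is true, and z is the probability of seeing the evidence if the hypothesis is false. In the following two sections I assign values to x, y, and z based on historical fact (the O’s team record after games played on July 1 and at the end of the season, for all years 1954-2013).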
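For readers who prefer code to algebra, the formula can be sketched as a small Python function. This is a minimal sketch and not part of the original analysis: the function name `bayes_posterior`, the input checks, and the example inputs are all made up for illustration. Only x, y, and z come from the formula above.

```python
def bayes_posterior(x: float, y: float, z: float) -> float:
    """Return (x*y) / (x*y + z*(1 - x)).

    x: prior probability that the hypothesis is true
    y: probability of seeing the evidence if the hypothesis is true
    z: probability of seeing the evidence if the hypothesis is false
    """
    # Each input is a probability, so reject anything outside [0, 1].
    for name, value in (("x", x), ("y", y), ("z", z)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    denominator = x * y + z * (1 - x)
    if denominator == 0:
        raise ValueError("the evidence has zero probability under these inputs")
    return (x * y) / denominator


# Hypothetical inputs, chosen only to exercise the function.
print(round(bayes_posterior(0.5, 0.8, 0.2), 2))
```

The denominator is the overall chance of seeing the evidence at all, whether or not the hypothesis is true, which is why a large z drags the result down.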
If the O’s are at or above .500 on July 1, how likely is it that they will finish the season at or above .500?
For this question, Bayes’s theorem takes the following inputs:
- x: Initial estimate of how likely it is the Orioles will finish at .500 or above. Of the 60 full seasons examined, the O’s achieved this mark in 33 of them. So this is 33/60 or 0.55.
- y: Probability of the Orioles being at or above .500 on July 1 if we know that they will end up finishing that way. Of the 33 winning seasons, they were at or above .500 on July 1 in 30 seasons. 30/33 = 0.91.
An aside: of course, on July 1 we can’t fully know how the O’s will finish the season, so this is a guess, but you can see how it’s an educated one — based on prior O’s teams and not the alignment of the stars, the steadiness of Buck Showalter’s gaze, the “gut feel” of an announcer, or something else subjective.
- z: Probability of the Orioles being at or above .500 on July 1 if we know that they will end up finishing under .500. It turns out this happened twice: in 2005 and 2008, the team was at or above .500 on July 1 but finished with a losing record. 2/27 = 0.07. The denominator here is 27 because we are looking at losing seasons, not winning ones.
This number indicates that, as I suspected, being at or over .500 on July 1 is not perfectly predictive of whether the team will finish that way. If we had simply observed how many times the O’s were at or above .500 and stayed that way, we would be working with incomplete information.
The same caveat as above applies: we can’t ever know this for a fact, so we use history to guess. Except this time, we are guessing how likely it is that a team headed for a losing season is nonetheless at or above .500 on July 1.
- x = 0.55
- y = 0.91
- z = 0.07
Plugging in the numbers, we get 0.94. So we estimate that the vast, vast majority of the time, if the O’s are at or above .500 on July 1, they will finish the season that way. Good to know, but what if the O’s are under .500 on July 1? This is a separate, but obviously related, question.
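The arithmetic can be checked with a few lines of Python. This is only a sketch: the helper name `bayes_posterior` is my own, and the second calculation reuses the season counts quoted above (33 of 60, 30 of 33, 2 of 27) instead of the rounded values, to confirm that the rounding does not change the answer.

```python
def bayes_posterior(x, y, z):
    # (xy) / (xy + z(1 - x)), the formula from "Showing the Work"
    return (x * y) / (x * y + z * (1 - x))


# Inputs rounded to two decimals, as listed above.
rounded = bayes_posterior(0.55, 0.91, 0.07)

# The same inputs taken directly from the season counts.
from_counts = bayes_posterior(33 / 60, 30 / 33, 2 / 27)

print(f"rounded inputs: {rounded:.2f}")
print(f"season counts:  {from_counts:.2f}")
```

Both versions come out to 0.94 at two decimal places.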
If the O’s are under .500 on July 1, how likely is it they’ll finish that way?
Let’s define our terms again:
- x: Initial estimate of how likely it is the O’s will finish under .500. This is the complement of the previous prior, so 27/60 or 0.45.
- y: Probability of the O’s being under .500 on July 1 if we know (think, really) that they will finish that way. Of their 27 losing seasons, the O’s were under .500 on July 1 for 22 of them. 22/27 = 0.81.
- z: Probability of the O’s being under .500 on July 1 if we know that they will finish the season at or above .500. Sadly, the data show that this is rare. Only in three seasons (1957, 1975, and 1976) did the O’s improve from under .500 on July 1 to a winning record. 3/33 = 0.09.
- x = 0.45
- y = 0.81
- z = 0.09
Bayes’s theorem returns 0.88, saying there is an 88% chance that, if the O’s are under .500 on July 1, they will finish under .500 for the season.
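As a cross-check, here is the same calculation done with exact fractions built from the season counts above (27 of 60, 22 of 27, 3 of 33), so no rounding enters until the very end. This is a sketch under my own naming; `bayes_posterior` is a hypothetical helper, not something from a library.

```python
from fractions import Fraction


def bayes_posterior(x, y, z):
    # (xy) / (xy + z(1 - x)); works for floats and Fractions alike
    return (x * y) / (x * y + z * (1 - x))


x = Fraction(27, 60)  # prior: losing seasons out of all seasons
y = Fraction(22, 27)  # under .500 on July 1, given a losing finish
z = Fraction(3, 33)   # under .500 on July 1, given a winning finish

posterior = bayes_posterior(x, y, z)
print(posterior, float(posterior))
```

The exact result is 22/25, which is 0.88, matching the figure obtained from the rounded inputs.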
Discussing the Results
Intuitively, the results make sense. As each question’s prior shows, the O’s have roughly a 50/50 chance of finishing with a winning record. (Their run of success from ’69 to ’85 was mostly wiped out by the era of futility from ’98 to ’11.) Each question also shows strong evidence that the side of .500 the O’s are on come July 1 is the side they’ll finish on. Finally, each question shows only weak evidence that the O’s will reverse the position they hold on July 1. (I believe this is also why each question’s answer is very close to its component y variable.)
So there you have it. If the O’s are at or above .500 on July 1, it’s safe to say you can get excited! However, if they are below .500 on July 1, don’t expect them to turn it around. I don’t know whether this is unusual or whether it is true of other baseball teams. Answering that question is certainly possible but will require a lot of work. Along these lines, I hope to have additional Bayesian posts about the O’s (and maybe, eventually, other teams 😉) in due course.