Elo Ratings v. Machine Learning: Predicting Chess Game Results
Arpad Elo v. LightGBM
Arpad Elo is the inventor of the Elo rating system widely used in the chess world. LightGBM is the machine learning method I am going to use to slam dunk on his 1960s math. This article will be more technical than usual, so buckle up!
Much of this site is dedicated to predicting the results of chess games and tournaments. So, naturally, the first question we should ask is: how do we predict who will win a chess game? In comes Mr. Elo, and his famous rating system. Using Elo’s formulas we can calculate the expected score for each player in a chess game. Simulating chess tournaments also requires us to estimate the probability of a decisive result, and the probability of a draw. The expected score alone from Elo’s formulas is not enough. With some clever adjustments to Elo’s original formulas I found on Wismuth, we can estimate the win, draw, and lose probability for each player.
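For reference, Elo’s expected-score formula fits in a couple of lines of Python. The win/draw/loss split below is a hypothetical stand-in to show the shape of the idea, not the actual adjustment from Wismuth:

```python
def expected_score(white_elo, black_elo):
    """Elo's expected score for white: 1 for a win, 0.5 for a draw, 0 for a loss."""
    return 1 / (1 + 10 ** ((black_elo - white_elo) / 400))

# Hypothetical draw model for illustration only; the real Wismuth adjustment
# is more involved. Draws are most common between equals, so we shrink a
# base draw rate as the rating gap grows.
def win_draw_loss(white_elo, black_elo, base_draw=0.5):
    e = expected_score(white_elo, black_elo)
    p_draw = base_draw * max(0.0, 1 - abs(white_elo - black_elo) / 400)
    p_white_win = e - p_draw / 2  # expected score = P(win) + 0.5 * P(draw)
    p_black_win = 1 - e - p_draw / 2
    return p_white_win, p_draw, p_black_win
```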
Using these odds, we can simulate chess games and tournaments quite well. But, we can do better. Is the first move advantage for super GMs different than the rest of us mortals? We can answer this question, and many like it, with machine learning. LightGBM to the rescue.
What the hell is LightGBM?
It’s a really good way to take a bunch of historical data and predict something about the future (let’s say two players’ Elo ratings, and an upcoming chess game!). But how does it work? I’m glad you asked! The idea is actually quite simple:
- Build one simple model that isn’t very good (using decision trees).
- Find out where the first model didn’t predict things very well (by calculating the residual error of each prediction).
- Build another model to predict all the things you didn’t predict well last time (by fitting the next tree to those residual errors; that’s the “boosting”).
- Repeat, until the model is really good (a from-scratch sketch of this loop follows below).
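That loop is short enough to sketch from scratch. Here’s a toy version using shallow scikit-learn trees as the weak learners (this shows the idea, not LightGBM’s actual internals):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_fit(X, y, n_rounds=100, learning_rate=0.1):
    """Toy gradient boosting: each new tree is fit to the residuals of the ensemble so far."""
    prediction = np.full(len(y), y.mean())  # step 1: a simple (constant) model
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction           # step 2: where were we wrong?
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)  # step 3: model the mistakes
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)                   # step 4: repeat
    return trees
```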
Now you can call yourself a machine learning expert. Fortunately, we can use neatly curated Python libraries to do the hard work, but it’s still good to have an idea of what’s happening under the hood.
Now that we know a tiny bit about LightGBM, what are we going to do with this super easy to understand algorithm? Using Caissabase (thank you!!), I created a dataset of nearly 5 million over-the-board chess games. For each game, we will extract three pieces of information: white Elo, black Elo, and the result. We’ll pass this Elo data to LightGBM, ask it to come up with a way to predict the result of each game in the data, and compare it to the Elo formulas mentioned above.
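In code, that experiment might look something like the sketch below; the file name and column names are placeholders for my actual dataset, and `expected_score` is the helper from the earlier sketch:

```python
import lightgbm as lgb
import pandas as pd

# Placeholder columns: white_elo, black_elo, and white's score (1, 0.5, or 0)
games = pd.read_csv("games.csv")
X = games[["white_elo", "black_elo"]]
y = games["white_score"]

# Regressing on white's score keeps the output comparable to Elo's expected score.
# (In practice you'd also hold out a validation set.)
model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X, y)

games["ml_pred"] = model.predict(X)
games["elo_pred"] = expected_score(games["white_elo"], games["black_elo"])
```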
Is the Elo model or the machine learning model better?
Drum roll, please…
The machine learning model is better. Let’s take a look at some visualizations below that can help us understand how the predictions from the two approaches differ. First we’ll take a look at a “Model Disagreement Chart”, also known as a Double Lift Chart in actuarial literature. We start by sorting the data by model disagreement (the Elo model prediction divided by the ML model prediction). Then we group the data into 20 bins, and compare! The better model is the one that stays closest to the actual data. You’ll see below that the ML model is better than the Elo model.
Model Disagreement Chart
The one-sentence summary of this chart: when the Elo model’s predicted score for the white player is lower than the ML model’s, the Elo prediction is too low.
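Building the chart takes only a few lines, assuming the `games` frame from the sketch above with both models’ predictions attached:

```python
import pandas as pd

# Sort by disagreement (Elo prediction over ML prediction) and cut into 20 equal bins
games["disagreement"] = games["elo_pred"] / games["ml_pred"]
games["bin"] = pd.qcut(games["disagreement"].rank(method="first"), 20, labels=False)

# Within each bin, compare each model's average prediction to the actual average
# score; whichever model hugs the actuals more closely is the better one
double_lift = games.groupby("bin")[["elo_pred", "ml_pred", "white_score"]].mean()
print(double_lift)
```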
Another good way to evaluate which model is better is called a lift chart. When a model predicts white’s score to be low, is it actually low? We can explore this question visually by feeding the models made-up data and seeing what they predict. For two players with the same Elo, this graph shows us the expected score for the player with the white pieces.
Lift Chart
The Elo model predicts too low a score for white when that player is a big underdog, and too high a score when the white player is a favorite. The LightGBM model is performing well.
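Generating made-up data for this kind of check is straightforward. A sketch, assuming the chart sweeps white’s rating against a fixed opponent so it covers the underdog and favorite cases mentioned above:

```python
import pandas as pd

# Fix black at 2500 and sweep white's rating from big underdog to big favorite
grid = pd.DataFrame({"white_elo": range(2100, 2901, 25)})
grid["black_elo"] = 2500

grid["ml_pred"] = model.predict(grid[["white_elo", "black_elo"]])
grid["elo_pred"] = expected_score(grid["white_elo"], grid["black_elo"])
# Plotting both prediction columns against white_elo gives a chart like the one above
```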
These graphs, and some other model statistics I looked at, convince me that the LightGBM model is going to allow me to more accurately predict the results of individual games, and therefore more accurately predict the outcomes of chess tournaments. I’m curious if this has a material impact on the Grand Prix odds I have published - I’ll save that for another day.
It would be a letdown if I just told you all about this fancy new model and didn’t put it to use. Back to the question I asked above: is the first move advantage for white different for players of different Elo ratings? Yes!
First Move Advantage for White in Chess
For games with a higher average Elo rating, the white player has a slightly larger first move advantage. The advantage goes from 0.034 to 0.048 expected points, roughly a 40% increase in the first move advantage for Super GMs compared to players at about 2000 Elo.
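This chart falls out of the model almost for free: hand it two equal-rated players at each level, and anything above 0.5 expected points is first-move advantage. A sketch, reusing the `model` from earlier:

```python
import pandas as pd

# Equal-rated opponents at each level; the edge is whatever exceeds a "fair" 0.5
levels = pd.DataFrame({"white_elo": range(2000, 2801, 100)})
levels["black_elo"] = levels["white_elo"]
levels["first_move_edge"] = model.predict(levels[["white_elo", "black_elo"]]) - 0.5
```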
You’re still reading?
Many kudos are still in order for Arpad Elo - his system is still quite useful today, 60+ years later. Were you wondering why I didn’t use machine learning to build a new rating system? The main challenge in doing that would be curating a really high-quality dataset. Elo ratings from FIDE and USCF are effectively just that - high-quality (aggregated) data for each player. If there were a good way to get a high-quality database of all FIDE rated games ever played, we could do some damage with machine learning!
For now, the best I’ve come up with is using Elo ratings and machine learning together. Combining math from the 60s with math from today, we can make good predictions about chess games.