A very non-technical explanation of how AlphaGo Zero can teach itself to play Go so darn well

Published on November 26, 2017

For many years, the ancient Chinese game of Go was seen as the last major board game at which computer programs could not beat, or even seriously challenge, top human players. Google subsidiary DeepMind made headlines in March 2016 when their AlphaGo program defeated Lee Sedol, one of the greatest Go players in the world. It was an impressive achievement, but one that relied on enormous supercomputers and a massive training dataset of professional Go games.

A few weeks ago, AlphaGo made waves again with a breakthrough that allowed it to surpass the old system “without human knowledge.” Rather than study expert games, the system learned everything about Go simply by playing against itself. The new algorithm is much simpler, much faster, much more efficient, and much more effective than the version that defeated Lee Sedol.

The paper the team published to explain the new program is a bit puzzling, since it dwells more on the simplifications they made than on why the simplified version is so much more effective. After some digging, though, the key change to the algorithm turns out to be quite simple, and one that requires no technical background to understand.

How to play Go

Go is simple. Players take turns placing their stones on the board, attempting to surround their opponent’s stones and to encircle territory, thereby claiming it as their own. If a group of a player’s stones is entirely encircled by opposing stones, that group is captured by the opponent. The game has always been very difficult for computers because each turn presents so many choices of where to place a stone, and because it can be tough to tell just who is encircling whom.

The original AlphaGo

Imagine that AlphaGo is actually a team of three people, whom we’ll call Nora, Valerie, and Monty. Nora has studied lots and lots of professional games of Go, and has learned to guess where a master Go player would most likely place their stone in any given board situation. She has great intuition about the game, and when shown the board, she can immediately identify the most promising spots to place a stone. Valerie has also studied many professional games of Go, and she’s really good at telling who’s winning just by looking at the board. Monty doesn’t know much about strategy, but he’s really good at remembering and imagining configurations of Go stones on the board.

Together, Nora, Valerie, and Monty make a good team. When it’s their turn, Monty asks Nora for her top few most promising spots. For each of those, he then shows her the board as if they had played that stone, and asks which moves their opponent is most likely to respond with. For each of those, Monty asks Nora for her probable responses, and so on. Monty is really fast at showing Nora the boards, and really good at remembering her responses, so they can explore a lot of possibilities very quickly.

After Monty and Nora imagine a few steps down the tree of possible moves, Monty asks Valerie to assess how likely they are to win from each possible board they could reach. He then does some complicated mental math to synthesize Valerie’s assessments of those possible futures into a recommendation of the most promising spots to play right now, and the team places their stone in one of them.
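If it helps to see that look-ahead-and-synthesize loop as code, here is a toy sketch in Python. Every name in it (nora, valerie, monty_search, the tiny 3×3 board, the random stand-in judgments) is invented for illustration; the real AlphaGo used deep neural networks and Monte Carlo tree search over full Go positions, not this little minimax peek.

```python
import random

SIZE = 3  # a tiny toy board so the sketch actually runs

def legal_moves(board):
    """All empty points; a board is just a frozenset of filled points."""
    return [(r, c) for r in range(SIZE) for c in range(SIZE)
            if (r, c) not in board]

def play(board, move):
    """The board after a stone is placed at `move`."""
    return board | {move}

def nora(board):
    """Guess a few promising moves (a random stand-in for her intuition)."""
    moves = legal_moves(board)
    return random.sample(moves, min(3, len(moves)))

def valerie(board):
    """Estimate the chance of winning (another random stand-in)."""
    return random.random()

def monty_search(board, depth=2):
    """Score each of Nora's candidates by imagining a few turns ahead."""
    scores = {}
    for move in nora(board):
        next_board = play(board, move)
        if depth <= 1 or not legal_moves(next_board):
            scores[move] = valerie(next_board)
        else:
            # Assume the opponent picks their strongest reply; our
            # chance of winning is one minus theirs.
            replies = monty_search(next_board, depth - 1)
            scores[move] = 1.0 - max(replies.values())
    return scores

scores = monty_search(frozenset())
print("Most promising spot:", max(scores, key=scores.get))
```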

The team goes through this process before each of their turns. Every day, they play many games like this. Every night, Nora and Valerie go home with photos of each move they played that day, labeled with whether or not they ended up winning that game. Nora studies each move from their winning games, remembering it as a good move, and each move from their losing games, remembering it as a bad move; the studying makes her more likely to play the winning moves and less likely to play the losing ones. Valerie, meanwhile, studies the same photos to get better at judging who is ahead in positions like the ones they reached.
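Her nightly study session might look something like the following toy sketch. The lookup table and learning rate are made-up stand-ins; the real AlphaGo adjusted the weights of a neural network by gradient descent, but the spirit is the same: every move from a won game gets nudged up, every move from a lost game gets nudged down.

```python
from collections import defaultdict

# Nora's "intuition": a preference score per (board, move) pair.
# A lookup table standing in for a neural network's weights.
intuition = defaultdict(float)

def study_game(moves_played, won, lr=0.1):
    """Nudge every move from one game up if it was won, down if lost."""
    label = 1.0 if won else -1.0
    for board, move in moves_played:
        intuition[(board, move)] += lr * label

# One short winning game; the strings stand in for real positions.
study_game([("empty board", "play center"),
            ("board after two moves", "extend corner")], won=True)
```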

The new AlphaGo Zero

For AlphaGo Zero, the change that received the most attention is that Nora never studies any professional games of Go. In fact, when she starts out, she doesn’t even know the rules of the game, and just places stones at random. What’s more, Valerie isn’t around anymore, and Nora is asked both to predict promising moves and to assess who’s winning. And instead of three months of practice to master the game, Nora and Monty get only three days. From these changes alone, it’s hard to see how AlphaGo Zero could be so much better than the previous iteration.

The new key to success is very subtle. When Nora goes home, she doesn’t study whether the moves they played led to wins or losses; in fact, she doesn’t study the moves they played at all. Instead, for each turn, she compares the initial recommendations she gave Monty with the recommendations he came up with after exploring the tree of possibilities with her help.

On board A, Nora recommends that the spots marked in red and blue are the best options for them to place their black stone. Monty remembers the two middle boards, then asks Nora where she thinks the opponent will play the white stone on hypothetical board B, and he remembers the two hypothetical boards her suggestions would land them on. By evaluating many of these hypotheticals, Monty can come up with his own recommendations for where to play on board A.

As fast as Monty is, it takes him a while to imagine so many different possible futures. If Nora could come up with those same recommendations just by glancing at the board, her recommendations would make Monty’s search more useful, and would help him come up with even better recommendations. And if every night she studies better recommendations from Monty, her intuitions will become better, helping Monty make better recommendations, which she can study to build better intuition… a virtuous cycle.
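Here is a toy sketch of that new study step: each night, Nora pulls her own move probabilities toward the ones Monty produced after searching. The simple blending rule below is invented for illustration; the real AlphaGo Zero minimizes a cross-entropy loss between the network’s move probabilities and the search’s visit counts, but the effect is the same kind of pull.

```python
def study_position(nora_probs, monty_probs, lr=0.5):
    """Pull Nora's quick guesses a step toward Monty's searched ones.

    Both arguments map moves to probabilities; the blend rate `lr`
    is made up for illustration.
    """
    moves = set(nora_probs) | set(monty_probs)
    return {m: nora_probs.get(m, 0.0)
               + lr * (monty_probs.get(m, 0.0) - nora_probs.get(m, 0.0))
            for m in moves}

# Nora favored move "a", but Monty's search favored move "b"; after
# studying, her guesses land at roughly a: 0.45, b: 0.55.
print(study_position({"a": 0.7, "b": 0.3}, {"a": 0.2, "b": 0.8}))
```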

It turns out that this virtuous cycle between Nora and Monty makes Nora much more effective than she was when simply studying each of their moves as good or bad. It also generates far more study material. Successive positions within a game look almost identical, so back when she labeled moves as good or bad, Nora couldn’t study every move from a game: she’d become overly convinced that anything resembling that game was good or bad. Instead, she studied just one move from each game, which meant she, Valerie, and Monty had to play a ton of games to generate enough material. Now that she’s studying to predict Monty’s recommendations, she can study his recommendation for every single move, and they don’t need to play nearly as many games.
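The difference in study material is easy to see in a sketch; the games below are just placeholder strings, but the counts make the point.

```python
import random

# Placeholder games: each inner list stands for the positions played
# in one self-play game.
games = [["game1-move1", "game1-move2", "game1-move3"],
         ["game2-move1", "game2-move2"]]

# Old scheme: neighboring positions in a game look nearly identical
# and share one win/loss label, so only one per game is safe to study.
old_material = [random.choice(game) for game in games]   # 2 examples

# New scheme: every position carries its own search-derived target,
# so whole games become study material.
new_material = [pos for game in games for pos in game]   # 5 examples
```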

Of course, Nora and Monty aren’t real people. In AlphaGo, Nora is a neural network, and Monty is a Monte Carlo tree search. Both are well-known algorithms that people have worked with for years, and both were used in the original AlphaGo. A simple but brilliant change to the implementation of these well-known algorithms makes the program vastly more effective. And that sort of ingenuity, understanding a problem and suggesting a conceptual solution, is exactly what we still have no idea how to make a computer do.