Ideas - Brains

A very non-technical explanation of how AlphaGo Zero can teach itself to play Go so darn well

Published on November 26, 2017

For many years, the ancient Chinese game of Go was seen as the last major board game at which computer programs could not outcompete (or even challenge) top human players. Google subsidiary DeepMind made headlines in March 2016 when their AlphaGo program defeated Lee Sedol, one of the greatest Go players in the world. It was an impressive achievement, but one that relied on enormous supercomputers and a massive training dataset of professional Go games.

A few weeks ago, AlphaGo made waves again with a breakthrough that allowed it to surpass the old system “without human knowledge.” Rather than study expert games, the system learned everything about Go simply by playing against itself. The new algorithm is much simpler, much faster, much more efficient, and much more effective than the version that defeated Lee Sedol.

The paper that the team published explaining the new program is a bit puzzling, since it spends more time on the simplifications they made than on why the simplified version is so much more effective. After a bit of digging, however, the key change to the algorithm turns out to be quite simple - and one that requires no technical background to understand.

How to play Go

Go is simple. Players take turns placing their stones on the board, attempting to surround their opponent's stones and to encircle territory, thereby claiming it as their own. If a group of either player's stones is entirely encircled by opposing stones, that group is captured by the opponent. The game has always been very difficult for computers because each turn presents so many choices of where to place a stone, and because it can be tough to tell just who is encircling whom.
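For readers who want to see the capture rule concretely, here's a toy sketch in Python. Everything in it is illustrative - the board encoding, the function name, the 3x3 grid are all my own inventions, not any real Go engine - but the logic is the real rule: a group is captured when a flood fill finds that it touches no empty point (no "liberty").

```python
def group_and_liberties(board, start):
    """Flood-fill the group of stones containing `start`.

    `board` is a dict mapping (row, col) to "B", "W", or "." for empty;
    points absent from the dict are off the board. Returns the group and
    whether it touches any empty point (a liberty).
    """
    color = board[start]
    group, frontier, has_liberty = {start}, [start], False
    while frontier:
        r, c = frontier.pop()
        for neighbor in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if neighbor not in board:           # off the edge of the board
                continue
            if board[neighbor] == ".":
                has_liberty = True              # the group still breathes
            elif board[neighbor] == color and neighbor not in group:
                group.add(neighbor)
                frontier.append(neighbor)
    return group, has_liberty

# A lone white stone surrounded by black on a 3x3 board is captured.
board = {(r, c): "." for r in range(3) for c in range(3)}
board[(1, 1)] = "W"
for point in [(0, 1), (1, 0), (1, 2), (2, 1)]:
    board[point] = "B"
print(group_and_liberties(board, (1, 1)))  # ({(1, 1)}, False)
```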

The original AlphaGo

Imagine that AlphaGo is actually a team of three people, who we’ll call Nora, Valerie and Monty. Nora has studied lots and lots of professional games of Go, and learned to guess where a master Go player would most likely place their stone in any given board situation. She has great intuition about the game, and when shown the board, she can immediately identify the most promising spots to place a stone. Valerie has also studied many professional games of Go, and she’s really good at telling who’s winning, just by looking at the board. Monty doesn’t know much about strategy, but he’s really good at remembering and imagining configurations of Go stones on the board.

Together, Nora, Valerie, and Monty make a good team. When it’s their turn, Monty asks Nora for her top few most promising spots. For each of those, he then shows her the board as if they had played that stone, and asks which moves their opponent is most likely to respond with. For each of those, Monty asks Nora for her probable responses, and so on. Monty is really fast at showing Nora the boards, and really good at remembering her responses, so they can explore a lot of possibilities very quickly.

After Monty and Nora imagine a few steps down the tree of possible moves, Monty asks Valerie to assess how likely they are to win from each possible board that they could reach. He then does some complicated mental math to synthesize Valerie’s assessments of the possible future situations, and comes up with a recommendation of the most promising spots that they could play right now, and plays the stone for their turn in one of them.
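The division of labor described above can be sketched in code. This is a deliberately tiny caricature, not AlphaGo: the function names mirror the characters, the "board" is just a set of occupied spots on a 3x3 grid, and both Nora's suggestions and Valerie's assessments are random stand-ins for what real neural networks would provide.

```python
import random

random.seed(0)

def legal_moves(board):
    """The board is a frozenset of occupied spots on a tiny 3x3 grid."""
    return [spot for spot in range(9) if spot not in board]

def nora_suggest(board):
    """The policy: a few promising moves (random here, a network in reality)."""
    moves = legal_moves(board)
    return random.sample(moves, min(2, len(moves)))

def valerie_evaluate(board):
    """The value: chance that the player to move wins (random here)."""
    return random.random()

def monty_search(board, depth):
    """Imagine a few moves ahead using Nora's suggestions, then ask Valerie."""
    if depth == 0 or not legal_moves(board):
        return valerie_evaluate(board)
    # Each imagined reply belongs to the opponent, so their winning
    # chance is one minus ours; keep the best line for the side to move.
    return max(1 - monty_search(board | {move}, depth - 1)
               for move in nora_suggest(board))

def choose_move(board, depth=2):
    """Play the spot whose imagined futures look best after the search."""
    return max(legal_moves(board),
               key=lambda move: 1 - monty_search(board | {move}, depth - 1))

print(choose_move(frozenset()))  # one of the nine spots, 0 through 8
```

Real AlphaGo uses Monte Carlo tree search with visit counts rather than this minimax-style lookahead, but the shape is the same: the policy proposes, the search imagines, the value judges.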

The team does this process before each of their turns. Every day, they play many games like this. Every night, Nora and Valerie go home with photos of each move they played that day, labeled with whether or not they ended up winning that game. They study each move from their winning games, remembering it as a good move, and each move from their losing games, remembering it as a bad move. Studying makes Nora less likely to play the losing moves and more likely to play the winning moves.

The new AlphaGo Zero

For AlphaGo Zero, the change that received the most focus is that Nora never studies any professional games of Go. In fact, when she starts, she doesn’t even know the rules of the game, and just places stones at random. What’s more, Valerie isn’t around anymore, and Nora is asked to both predict promising moves and to assess who’s winning. And instead of three months of practice to master the game, Nora and Monty only have three days. Reading just these changes, it’s hard to see how AlphaGo Zero is so much better than the previous iteration.

The new key to success is very subtle. When Nora goes home, she doesn't study whether the moves they played led to wins or losses - in fact, she doesn't study the moves that they played at all. Instead, for each turn, she compares the initial recommendations that she gave to Monty with the recommendations that he came up with after exploring the tree of possibilities with her help.

On board A, Nora recommends that the spots marked in red and blue are the best options for them to place their black stone. Monty remembers the two middle boards, then asks Nora where she thinks the opponent will play the white stone on hypothetical board B, and he remembers the two hypothetical boards her suggestions would land them on. By evaluating many of these hypotheticals, Monty can come up with his own recommendations for where to play on board A.

As fast as Monty is, it takes him a while to imagine so many different possible futures. If Nora could come up with those same recommendations just by glancing at the board, her recommendations would make Monty’s search more useful, and would help him come up with even better recommendations. And if every night she studies better recommendations from Monty, her intuitions will become better, helping Monty make better recommendations, which she can study to build better intuition… a virtuous cycle.

It turns out that engaging in this virtuous cycle between Nora and Monty makes Nora much more effective than when she was simply studying each of their moves as good or bad. There's a second benefit, too. Back when she studied good and bad moves, she couldn't study every move from a game: because the board changes so little between moves, she'd become overly convinced that anything resembling that game was good or bad. Instead, she would study just one move from each game - so she, Valerie, and Monty had to play a ton of games to generate enough material to study. Now that she's studying to predict Monty's recommendations, she can study the recommendation for every move, and doesn't need to play nearly as many games to get lots of study material.
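The difference between the two study rules can be made concrete with a sketch. The data and names here are invented for illustration; real AlphaGo Zero trains a deep network on Go positions and on move-probability vectors produced by its search.

```python
import random

random.seed(1)

def original_study(games):
    """Old rule: one sampled move per game, labeled by the final result."""
    lessons = []
    for records, won in games:
        # Only one move per game, since neighboring positions are so
        # similar that studying all of them would over-fit to this game.
        board, played_move, _ = random.choice(records)
        lessons.append((board, played_move, 1.0 if won else 0.0))
    return lessons

def zero_study(games):
    """New rule: every position, labeled with the search's own probabilities."""
    lessons = []
    for records, won in games:
        for board, _, search_probs in records:
            # Nora learns to predict Monty's post-search recommendations
            # (and, since Valerie is gone, also who eventually won).
            lessons.append((board, search_probs, 1.0 if won else 0.0))
    return lessons

# One three-move game yields one lesson the old way, three the new way.
game = ([("board1", "D4", {"D4": 0.7}),
         ("board2", "Q16", {"Q16": 0.5}),
         ("board3", "C3", {"C3": 0.8})], True)
print(len(original_study([game])), len(zero_study([game])))  # 1 3
```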

Of course, Nora and Monty aren't real people. In AlphaGo, Nora is a neural network, and Monty is a Monte Carlo tree search. Both are well-known algorithms that people have been working with for years, and both were used in the original AlphaGo. A simple but brilliant change to the implementation of these well-known algorithms makes the program vastly more effective. And that sort of brilliant ingenuity - understanding a problem and suggesting a conceptual solution - is exactly what we still have no idea how to make a computer do.

Natural selection of artificial brains is why great AI will predate decent neuroscience

Published on August 24, 2016

There is a common complaint in the machine learning community today that nobody understands how the best algorithms work. Deep learning, the process of training enormous neural networks to recognize layered abstractions, has dominated all sorts of machine learning tasks, from character recognition to game playing and image captioning. While the original idea of neural networks was inspired decades ago by observed phenomena in brains, neural networks have diverged wildly from the structure of biological brains. Researchers searching for the best results on their machine learning tasks have dreamed up not only complex new architectures, but also all kinds of tips, tricks, and techniques to make their networks more effective.

The trouble is that many of these methods are found empirically, rather than theoretically. Whereas traditional machine learning algorithms are grounded in rigorous mathematical proofs, many popular techniques for achieving top tier results with deep neural networks have their theories and explanations given after the fact, attempting to justify why this thing they tried was so effective.

While this seems remarkable and frustrating to those who develop algorithms, it makes intuitive sense when you take a step back and consider that the same process of trial and error is exactly what led to the development of animal brains. Rather than the application of an elegant and well-reasoned mathematical theory, brains are the product of iterative improvements on whatever worked. Real brains are happy accidents layered on happy accidents - and that's exactly what artificial neural networks are becoming too. While animal brain architectures were produced over millions of years by accidental improvements naturally selected, neural network architectures are produced over years by thoughtful hypotheses naturally selected by their performance on machine learning tasks.

So what? There are a couple of key insights that follow from this. First, as long as so many researchers stay focused on the predictive power of their models - which they likely will, since breaking records on learning benchmarks makes for good papers - neural network practice will continue to run far ahead of what the theory can explain. Tips, tricks, and happy accidents will continue to compound on each other, producing better and better results without prior statistical proof.

Second, this means that we should expect a very powerful artificial intelligence to be engineered long before a comprehensive theory emerges to explain how it actually works, or - crucially - before a theory emerges to explain how a human brain actually works. If we can't invent a general theoretical mathematical framework for even our own creations, then we should expect one for the product of real evolution to be much further down the line.

Reverse engineering processors and brains

Published on March 15, 2015

In October 2014, Joshua Brown published an interesting opinion piece in Frontiers in Neuroscience. He told a parable about a group of neuroscientists who come upon a computer and, having no idea how it works, set about applying the techniques of cognitive neuroscience, neurophysiology, and neuropsychology to try to understand the computer's magic. They use EEG, MRI, diffusion tensor imaging, and lesion studies to learn the different functions of the hard drive, the RAM, the processor, and the video card. They learn about its interdependencies and even how to make crude repairs, but they never truly understand how the computer works.

The moral of the story is that neuroscience needs a strong mechanistic theory that can be used to understand how observed effects arise from the systems at work. The author advocates that computational modeling should be as foundational to a neuroscience education as neuroanatomy. But while it was only an analogy, the story about trying to reverse engineer a computer made me really wonder - could we even reverse engineer one of our own processors?

To get an idea of how hard reverse engineering would be, I turned to my friend Warren. He was the TA for my semiconductors class and is about to graduate with his doctorate in electrical engineering. When I asked him about the feasibility, he wrote:

For simple devices and circuits on silicon, it's trivial - we have the equipment, the SEMs, the TEMs, the AFMs; you can get surface images of components and cross-sectional images, no problem. People are also starting to use X-rays to look inside chips... Once you understand what the 20 or 30 different layers of the chip are, you can start to hack down through the layers (destructively) to try and identify the layout. Scaling up, though, it becomes infinitely more complicated.

The latest chips have thousands of pins, and without knowing what any of them do, there's an infinite number of tests you could do. Now, if you have some previous knowledge of, say, how a simpler chip is built, you might start to be able to identify functional regions - e.g. a region on a chip with a highly repetitive, dense pattern will invariably be memory. A region that is repeated a few times on a chip will, in a modern processor, likely be cores. But I think for the complex chip it's simply the scale of the number of devices - millions and millions of transistors - and isolating and disentangling them non-destructively is simply unviable, let alone without having any idea of the function that the circuits/devices perform!

In a nutshell, you would need many, many chips to test destructively in order to even come close to understanding the basic functionality. If I were to hazard a guess, I’d say deconstructing a chip of the complexity found in a modern computer, without some sort of map would be close to impossible.

Building this map of the physical architecture is akin to the work undertaken by connectomics researchers to digitally reconstruct circuits of neurons. But the physical layout is only a small component of a computer's magic - the hardware, without any of the software running on it. Many neuroscientists who employ a mechanistic, model-focused approach actually record the spiking patterns of individual neurons, either by measuring cell voltages with minuscule metal probes, or by inducing mutations that cause the neurons to glow when their voltage spikes. They add a stimulus, measure the behavior of the network, and try to model it. Imagine doing that with a processor. Measuring would be a nightmare - while brain currents are carried by micron-scale neurons, processor currents run on wires and gates only nanometers thick, scarcely probe-able. Perhaps infrared imaging could reveal areas with high current density, but you'd need a system that could read the values in less than a nanosecond, as a two gigahertz processor changes them two billion times per second.

If you could somehow read the binary numbers in the processor registers, you would have the foundation to reverse engineer the processor's binary instruction set. But the layered abstraction so key to complex ideas like programming would make it almost impossible to understand functionality from bytecode. Even a simple program written in an interpreted language like Python involves a complex multi-layered program (the interpreter) building data structures and shuffling values. A model that attempts to explain the behavior of bytecode based on the inputs and outputs of a Python application would need to account for so many layers of abstraction that the probe readings could very well appear chaotic.
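You can get a feel for this layering with Python's real `dis` module, which shows the bytecode the interpreter executes for a function (the exact opcodes vary between Python versions):

```python
import dis

def add_one(x):
    return x + 1

# Even this one-line function compiles to several stack-machine
# instructions - and those instructions are themselves executed by an
# interpreter, which the processor's own instructions run in turn.
dis.dis(add_one)
```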

Reverse engineering a modern processor would clearly be a herculean task. In fact, I would venture it would take more years of academic study and research to reverse engineer a processor than to engineer your own new architecture. Rather than trying to puzzle out the nuances of arbitrary decisions made along the way (endianness, anybody?), engineers would be free to make their own arbitrary decisions and move on. If the goal were to re-create something with the same abilities as a modern computer within a few decades of research, it seems far easier to simply take the big-picture ideas revealed by the curious specimen (transistors can be made from silicon, transistors can be used to make hierarchical logic systems, many classes of problems can be solved with the same generalized architecture) and use those as the foundation for creating a new theory and system.

So where does that leave neuroscience? The processor is the product of generations of engineers trying to find the most efficient, elegant, and consistent solutions. The brain is the product of orders of magnitude more generations of trial and error that went with whatever happened to work first. No doubt reverse engineering the brain would be dramatically more difficult than a computer, so the same heuristic above applies - rather than puzzle out the intricate machinations of the curious specimen, take the big picture ideas it reveals (networks with learned weights can encode high level ideas, many classes of problems can be solved with the same generalized architecture) and use those as the foundation for creating a new theory and system.

Parallel engineering can yield useful understandings of the system under study, but recreating the abilities of the brain is a goal in itself of artificial intelligence research. Neural networks that apply some of the big-picture ideas have been around for decades, but the recent explosion of success with convolutional neural networks has been driven by newly engineered architectures and mathematical formulations, not by neuroscience breakthroughs. Other neural network paradigms, like spike-driven and recursive networks, are still in their relative infancy because they're more difficult to get a handle on. However, they are founded on big-picture properties of the brain that are known to be effective, so I'm confident that as they receive more engineering attention, they will yield spectacular new results.

Playing sports with dedicated circuits

Published on January 11, 2015

I was recently playing ping-pong when I was struck by an interesting analogy between computational models and the learned reflexes of sports. When you play any physical game or sport involving speed, you rely on both reflexes and higher level thoughts. However, playing by thought and playing by reflex are very different in both their methods and their results. It's most clear in individual sports in which you must react quickly to your opponent's newest challenge while devising your own next strike, such as tennis. As the ball comes toward you, you have a fraction of a second to identify its position and trajectory, position yourself, coordinate the trajectory of your arm with that of the ball, and execute a strike at just the right angles and force.

Any athlete will tell you that when you're competing, you don't think about what you're doing. You just FEEL it. You watch the ball and your arms and legs and hands move and WHAM you hit the ball with just the spin you want. It's like there's a short circuit from your eyes to your arms.

But what if you don't have those kinds of reflexes yet? What if you're new to the sport? You play a lot and slowly develop them, but there can be real value in getting tips on what to focus on. Lessons and workshops from more experienced players are popular because they work. They tell you what to think about: what angle you want your racket to collide with the ball, the motion of the arm that can achieve that angle of collision reliably. By thinking about these tips as you play, you can get better fast.

But I think it's just as common an experience that, in the short term, tips like those will disrupt your play. You may hit some of those shots better, but just as likely your overall play will be a little off, the ball won't seem to go your way, and you'll give a worse account of yourself than if you had just focused up and played your game with the flawed technique you're comfortable with.

On the other side of the analogy, consider the contrast between hardware and software solutions to computing problems. The most basic tool of computation is the transistor, in logical terms just a pipe with an on-off switch. From it, you can build the fundamental logic gates - AND, OR, NOT, XOR, and the rest. These can be combined in clever ways to do things like add binary numbers, store the value of a bit, or control which signal goes through an intersection. Circuits like those can then be combined to achieve all sorts of specific goals, like controlling the timer and display of a microwave as you press its buttons.
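As a small illustration of that climb from gates to arithmetic, here's a sketch in Python. The gate functions are ordinary code rather than transistors, of course, but the composition mirrors how a real adder circuit is wired; the names and the 4-bit width are my own choices.

```python
# The fundamental gates, as one-bit functions.
def AND(a, b): return a & b
def OR(a, b):  return a | b
def XOR(a, b): return a ^ b

def full_adder(a, b, carry_in):
    """Combine gates to add three bits: returns (sum bit, carry bit)."""
    partial = XOR(a, b)
    total = XOR(partial, carry_in)
    carry_out = OR(AND(a, b), AND(partial, carry_in))
    return total, carry_out

def add_nibbles(x, y):
    """Chain four full adders to add two 4-bit numbers, like a real circuit."""
    carry, result = 0, 0
    for i in range(4):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result

print(add_nibbles(5, 6))  # 11
```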

But the most important logical circuit you can build is a general purpose processor. Equipped with tools to manipulate and store values, the processor acts according to instructions which are themselves data to be manipulated and stored. Rather than specify the operation in the wiring of the circuit, the processor can be programmed to do anything. This is the distinction between solutions in hardware, which use custom physical circuits, and software, which simply run programs on general purpose circuits. Software is obviously incredibly more flexible, allowing people to write simpler and simpler code to specify more and more complexity. It's the go-to solution for most computing problems.

But the flexibility comes at the price of speed. Though they're both unthinkably fast, the time required to execute a computation with a program on a processor can be orders of magnitude longer than with a purpose-built digital circuit. That's why people go to the expense of engineering application-specific integrated circuits for time-critical applications like audio processing. Internet routers use dedicated circuits to forward packets out the right wire instead of waiting on a CPU. It's also why most computers have a dedicated graphics processing unit (GPU) with specialized circuits to draw computer graphics in real time.

To draw the analogy, the brain seems to have capacity for both general-purpose, CPU-style circuits and specialized, GPU-style pipelines from input observations to output actions. The general-purpose thinking engine seems the more miraculous with its ability to perform symbolic reasoning, while the special-use pathways are indispensable for quick-twitch skills like riding a bike. For an activity like tennis, playing at a high level requires the competition to be executed almost entirely with dedicated pathways. But when you're trying to improve and you're thinking about applying tips expressed as verbal, symbolic concepts, your observations of the ball's motion have to pass through your general-purpose processor before it can tell your arm how to move, slowing you down and throwing off your game.

Luckily, the miracle of human learning is that if similar thoughts pass through the general networks enough, they begin to develop their own dedicated pipelines. With enough training time, you internalize the concepts and can play by reflex better than ever before.

Skepticism of skepticism - examining why I reject the supernatural

Published on November 19, 2014

Several weeks ago, I had a really thought-provoking conversation with a friend about beliefs in the supernatural. He was describing to us how he had come to believe quite seriously in the existence of angels after a series of compelling conversations with both friends in Virginia and new acquaintances in Nicaragua. There were several stories - people stopping on the side of the road because of lights in the sky and finding an angel descended, children cured of chronic illness by holy water from Lourdes, bright figures that appeared in a shadowy room and convinced an alcoholic to change his life.

My initial response was skepticism, but his conviction gave me pause. I don't believe that the people whose testimonials he cited were lying, or were anything less than completely convinced of the reality of their experiences. On the other hand, my skepticism was rooted in a sort of generic faith in the scientific method and the established corpus of theory. I've frequently heard people counter that on the basis of "I just believe that there are a lot of things that science still doesn't understand." I certainly can't refute that - our scientific consensus has had major errors identified within it time and time again. So why was I so confident that these people had to be wrong?

It made me start thinking harder about the things that "science still doesn't understand." I realized that there's actually a huge discrepancy between the confidence of different fields of science. Lots of foundational fields, like chemistry, nuclear physics, magnetism and electricity, are consistently jaw-dropping in their ability to understand and manipulate the physical world with the aid of mathematical models. Their work has been powerfully vindicated by the real-world power to leverage these models to create things as unbelievably complex as a modern processor or a nuclear reactor. Other fields like geology and astronomy can predict with high confidence the formation history and timeline of our star and our planet, and use that understanding to find and access marvels like fossil fuels. Studying these fields, it's consistently amazing to learn just how much science does understand.

That bears a sharp contrast to my experience beginning to study neuroscience in my final semesters at Princeton. What consistently astounded me was not how much, but how little science understands about the mechanisms of the brain. Consider the problem of understanding vision - how do people identify and track objects in their field of view? We have a picture from the big perspective - the light hits the retina, and the data is passed upstream to the primary visual cortex before heading to a host of further areas. But how it goes from a collection of firing photoreceptors to a high-level representation of objects is not even within the scope of current research. The cutting edge, such as this Nature article from 2014 or this Nature Neuroscience article from 2013, is simply trying to determine which types of retinal cells, or which parts of their dendrites, are responsible for direction selectivity - for determining which direction the field of view is moving. There is so much more that is not understood than is.

Now consider where the gaps are in our body of scientific understanding into which you could fit angels and other life-changing supernatural experiences. Is it easier to imagine that our understanding of aerodynamics is flawed and a human figure can fly under its own power, or that our understanding of subjective human experience and memory is flawed? Seemingly 'concrete' evidence like the healing effect of holy water is likewise well explained by the baffling complexity of the mind. Not only have placebos been well demonstrated to be extremely effective medicines, but it's also clear that the more involved a placebo treatment is, the more effective it is likely to be. (Magazine article, example study). It should come as no surprise that holy water would be a highly effective treatment for a true believer.

In sum, I realized my skepticism of the supernatural is rooted in recognizing that a person's experience is informed by the hypothetical objective world transpiring around them, but ultimately dictated by the unfathomable spaghetti of connections and double-crossings of perception, memory, and expectation, from whose biases, shortcuts, and fallacies we can, by nature, never be free.