Introduction to AlphaGo Zero
If you are interested in technologies like artificial intelligence (AI) and their application to sports and games, you may have heard of AlphaGo. In a 2015 match against Fan Hui, it became the first computer programme to defeat a professional human Go player.
In 2017, DeepMind, the creators of the programme, published a paper titled “Mastering the Game of Go without Human Knowledge” in the journal Nature. In it, the authors introduced AlphaGo Zero as “an algorithm based solely on reinforcement learning, without human data, guidance, or domain knowledge beyond game rules.”
With AlphaGo Zero, AlphaGo became its own teacher: a neural network was trained to predict AlphaGo's own move selections and the winner of its games. This trained network improved the strength of the tree search, which resulted in higher-quality move selection and stronger self-play in the next iteration.
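The cycle DeepMind describes can be pictured with a deliberately tiny sketch in Python. Here the network is collapsed into a single strength number and training is faked, so the code only illustrates the shape of the loop, not DeepMind's actual implementation:

```python
import random

# Toy stand-ins: real AlphaGo Zero trains a deep network on games
# produced by a network-guided tree search. Here the "network" is just
# a number and "training" simply nudges it upward.

def play_self_play_game(strength):
    # In this toy, a stronger player wins its games more often.
    return 1 if random.random() < 0.5 + strength else -1

def train(strength, games):
    # Pretend each batch of self-play games improves the player a little.
    return min(strength + 0.001 * len(games), 0.45)

def improvement_cycle(iterations=10, games_per_iteration=100):
    strength = 0.0  # day zero: nothing but the rules
    for _ in range(iterations):
        games = [play_self_play_game(strength) for _ in range(games_per_iteration)]
        strength = train(strength, games)  # better player -> better search -> better data
    return strength

print(improvement_cycle())
```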
Before taking a closer look at the programme, it is important to understand the differences between AlphaGo and AlphaGo Zero. DeepMind refers to the programme that defeated Fan Hui as AlphaGo Fan. According to them, AlphaGo Fan utilised two deep neural networks: a policy network that outputs move probabilities, and a value network that outputs a position evaluation.
The trained networks were combined with a Monte Carlo tree search to provide a lookahead search. Next came AlphaGo Lee, which used a similar approach and defeated Lee Sedol in March 2016.
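As a rough illustration of that two-network split, the sketch below builds a toy policy network and a toy value network in PyTorch. The layer sizes are placeholders; only the 48 hand-crafted input feature planes reported for AlphaGo's policy network and the two kinds of output correspond to the published description:

```python
import torch
import torch.nn as nn

BOARD = 19  # a 19x19 Go board

# Policy network: maps a position to a probability over the 361 points.
policy_net = nn.Sequential(
    nn.Conv2d(48, 64, 3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * BOARD * BOARD, BOARD * BOARD),
    nn.Softmax(dim=-1),
)

# Value network: maps a position to a single evaluation in [-1, 1].
value_net = nn.Sequential(
    nn.Conv2d(48, 64, 3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * BOARD * BOARD, 1),
    nn.Tanh(),
)

position = torch.zeros(1, 48, BOARD, BOARD)  # 48 hand-crafted feature planes
move_probs = policy_net(position)            # shape (1, 361)
evaluation = value_net(position)             # shape (1, 1)
```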
Key Differences
DeepMind states that there are several key differences that set AlphaGo Zero apart from AlphaGo Fan and AlphaGo Lee. The first is that AlphaGo Zero is trained solely by self-play reinforcement learning: the process starts with random play, without any supervision or use of human data.
AlphaGo Zero also uses only the black and white stones from the board as input features, and it uses a single neural network rather than separate policy and value networks.
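A minimal PyTorch sketch of such a two-headed network appears below. The real network is a much deeper residual tower; this shallow version only illustrates the raw stone-plane input (17 binary planes in the paper: eight past board states per colour plus a colour-to-play plane) and a shared trunk feeding both a policy head and a value head:

```python
import torch
import torch.nn as nn

BOARD = 19

class DualHeadNet(nn.Module):
    """A deliberately shallow stand-in for AlphaGo Zero's residual tower."""

    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(17, 64, 3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Policy head: logits over all board points plus the pass move.
        self.policy_head = nn.Linear(64 * BOARD * BOARD, BOARD * BOARD + 1)
        # Value head: a single position evaluation in [-1, 1].
        self.value_head = nn.Sequential(nn.Linear(64 * BOARD * BOARD, 1), nn.Tanh())

    def forward(self, stones):
        h = self.trunk(stones)
        return self.policy_head(h), self.value_head(h)

net = DualHeadNet()
policy_logits, value = net(torch.zeros(1, 17, BOARD, BOARD))
```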
The final difference between AlphaGo Zero and the previous two programmes is that AlphaGo Zero uses a simpler tree search that relies on this single neural network to evaluate positions and sample moves. No Monte Carlo rollouts are performed.
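The contrast between the two evaluation styles can be shown with toy stand-ins. Both functions below are random placeholders, not DeepMind's components; the point is that the Zero-style leaf evaluation is a single network call returning move priors and a value, with no game played out:

```python
import random

def rollout_value(position):
    # Older approach: play the position out to the end with a fast, weak
    # policy and use the final result as the leaf value.
    return random.choice([-1, 1])  # stands in for the played-out result

def network(position):
    # AlphaGo Zero's approach: one network call per leaf returns both
    # move priors (to expand the node) and a value estimate.
    priors = [random.random() for _ in range(362)]  # 361 points + pass
    total = sum(priors)
    return [p / total for p in priors], random.uniform(-1, 1)

leaf = None  # stands in for a board position
print(rollout_value(leaf))     # rollout style: result of a simulated game
priors, value = network(leaf)  # Zero style: one evaluation, no rollout
```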
The programme also uses a new reinforcement learning algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise, stable learning.
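Concretely, the paper trains the single network with a combined objective: the value output is regressed towards the actual game winner, the policy output is pushed towards the move probabilities produced by the search, and an L2 penalty regularises the weights. Below is a sketch of that loss on dummy tensors, assuming PyTorch:

```python
import torch

p_logits = torch.randn(1, 362)               # network policy output (361 points + pass)
v = torch.tanh(torch.randn(1, 1))            # network value output
pi = torch.softmax(torch.randn(1, 362), -1)  # search move probabilities (training target)
z = torch.tensor([[1.0]])                    # actual winner of the self-play game
c = 1e-4                                     # L2 regularisation constant
params = [p_logits]                          # placeholder for the network weights

value_loss = (z - v).pow(2).mean()                                    # (z - v)^2
policy_loss = -(pi * torch.log_softmax(p_logits, -1)).sum(-1).mean()  # -pi^T log p
l2_penalty = c * sum(w.pow(2).sum() for w in params)                  # c * ||theta||^2
loss = value_loss + policy_loss + l2_penalty
```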
System
Any machine learning consultancy would find the system behind AlphaGo Zero, and how it achieves the results it does, worth studying. According to DeepMind, on day zero the programme has no prior knowledge of the game and only has the basic rules as input.
On the third day, AlphaGo Zero surpassed the abilities of AlphaGo Lee, which won four out of five games against Lee Sedol in 2016.
After 21 days, the programme reached the level of AlphaGo Master. This is the version that defeated 60 top professional players online and also won three out of three games against world champion Ke Jie in 2017.
By the fortieth day, the programme had surpassed all other versions of AlphaGo.
Any AI app development company specialising in board games would also show interest in how this level of performance translates to the actual game. According to DeepMind, after three hours the programme played like a human beginner, greedily capturing as many stones as possible instead of pursuing a long-term strategy.
After nineteen hours, the programme had learnt the fundamentals of advanced strategies, and by seventy hours AlphaGo Zero was playing at a superhuman level.
Methods
To take a closer look at the infrastructure and system that permit the programme to play a challenging board game like Go, it is important to consider the different methods it implements.
Reinforcement learning is one of the key concepts that will be of interest to AI service providers. Reinforcement learning, or RL, looks at how intelligent agents should act in an environment in order to maximise their cumulative reward. It is one of the three basic paradigms in machine learning, alongside supervised learning and unsupervised learning.
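The loop an RL agent runs can be written in a few lines. Everything below (the environment, the policy, and the reward rule) is an invented toy; it only shows the cycle of acting, observing a reward, and accumulating the return the agent tries to maximise:

```python
import random

def environment_step(state, action):
    # Toy dynamics: the "correct" action earns a reward of 1.
    reward = 1.0 if action == state % 2 else 0.0
    return (state + 1) % 10, reward  # next state, reward

def policy(state):
    return random.choice([0, 1])  # a (deliberately bad) random policy

state, total_reward = 0, 0.0
for step in range(100):
    action = policy(state)
    state, reward = environment_step(state, action)
    total_reward += reward  # the cumulative reward RL seeks to maximise

print(total_reward)
```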
Self-play reinforcement learning had been applied to games such as chess, checkers, and Go before AlphaGo Zero. DeepMind uses this approach with AlphaGo Zero as well, following the formalism of alternating Markov games.
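In an alternating Markov game, the two players move strictly in turn, and because Go is zero-sum, a position's value for the player to move is the negation of its value for the opponent. The toy negamax recursion below, with random terminal outcomes, illustrates that formalism; it is not DeepMind's search:

```python
import random

def game_value(depth, to_play):
    # to_play is +1 (black) or -1 (white); the players alternate each ply.
    if depth == 0:
        outcome = random.choice([-1, 1])  # terminal result from black's view
        return outcome * to_play          # ...seen from the player to move
    # Each player picks the move that is best for them, i.e. the one whose
    # value is worst (negated) for the opponent who moves next.
    return max(-game_value(depth - 1, -to_play) for _ in range(2))

print(game_value(depth=4, to_play=1))
```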