10/22/17: Reinforcement Learning, On-Policy, Monte Carlo Tree Search, Go
DeepMind's AlphaGo is back in the news, but not for beating mere humans.
Introducing AlphaGo Zero, Silver et al. demonstrate that no domain knowledge is needed to train to superhuman performance. Rather, AlphaGo Zero shatters performance ceilings using only self-play. Beating earlier AlphaGo versions by 100 games to 0, Zero challenges our assumptions about learning. Here, I want to consider what distinguishes this advance from previous work.
Sutton and Barto's "Reinforcement Learning: An Introduction" provides an excellent reference, and I offer my own takeaways: RL in a Nutshell, Notes/Code Sutton & Barto
Some notable differences between Zero and other AlphaGo versions:

- No hand-crafted Go heuristics and no human game records: training is pure self-play.
- A single network both evaluates states and selects actions, where earlier versions kept separate models.
- MCTS refines the network during training, but its rollouts are ultimately approximated by the network itself.

Let's break these down in turn.
First, Go is interesting for its complexity. Each of the 19x19 board positions can be characterized as holding a white stone, a black stone, or no stone. Game rules alone do not constrain the action space enough to rely on unstructured search. In earlier versions, this challenge was mitigated using various hand-crafted heuristics to guide AlphaGo's action selection. In Zero, not even the most self-evident heuristics, like "Don't fill in your own eyes," have been built into the model.
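To put that complexity in numbers, a quick back-of-the-envelope bound (ignoring the legality of positions):

```python
# Naive upper bound on 19x19 board configurations: each of the 361
# intersections is black, white, or empty. Most of these configurations
# are not legal positions, but the count shows the scale of the problem.
configs = 3 ** (19 * 19)
digits = len(str(configs))  # 173 digits, i.e. roughly 10^172 configurations
```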
Similarly, earlier versions benefited from observing Go masters' game play. Zero reaches a new high Elo rating from first principles of game play, autodidactically!
Second, RL algorithms can be broadly characterized as on- or off-policy: on-policy methods evaluate and improve the same policy used to select actions, while off-policy methods learn about a target policy from behavior generated by another. Relatedly, earlier AlphaGo versions maintained separate models for evaluating states and selecting actions. Zero breaks from that work in that one model is used for both.
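To make the single-model idea concrete, here is a minimal NumPy sketch of a two-headed network: one shared trunk feeding a policy head (move probabilities) and a value head (expected outcome). The layer sizes and names are illustrative only, not Zero's actual residual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-headed network. Shapes are illustrative: a flattened (3, 19, 19)
# board observation in, 361 board moves + pass out of the policy head.
W_trunk = rng.normal(size=(3 * 19 * 19, 128)) * 0.01
W_policy = rng.normal(size=(128, 362)) * 0.01
W_value = rng.normal(size=(128, 1)) * 0.01

def forward(board_planes):
    x = board_planes.reshape(-1)          # flatten the (3, 19, 19) input
    h = np.maximum(0.0, x @ W_trunk)      # shared ReLU trunk
    logits = h @ W_policy
    p = np.exp(logits - logits.max())
    p /= p.sum()                          # policy head: softmax over moves
    v = np.tanh(h @ W_value)[0]           # value head: outcome in [-1, 1]
    return p, v
```

One set of trunk weights serves both heads, which is the structural change Zero made relative to earlier AlphaGo versions' separate policy and value networks.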
Finally, AlphaGo searches ahead for the best game-play trajectories using Monte Carlo Tree Search (MCTS). Zero uses these search methods in training to refine the neural network policy, but ultimately, MCTS rollouts are approximated by this network.
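For intuition on how the search balances the network's prior against exploration, here is a sketch of the PUCT scoring rule used in AlphaGo-style tree search to pick a move at each node (my paraphrase of the published formula, with made-up numbers):

```python
import numpy as np

def puct_scores(Q, N, P, c_puct=1.0):
    """Score each action by its value estimate Q plus an exploration bonus
    that grows with the prior P and shrinks with the visit count N."""
    total_visits = N.sum()
    U = c_puct * P * np.sqrt(total_visits) / (1.0 + N)
    return Q + U

# A high-prior, unvisited move outranks a heavily visited one with equal Q.
Q = np.zeros(3)
N = np.array([0.0, 0.0, 10.0])
P = np.array([0.1, 0.6, 0.3])
best = int(puct_scores(Q, N, P).argmax())  # selects action 1
```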
If you don't know Go, building up an environment for play can be daunting. However, OpenAI Gym offers a 19x19 board game environment with a convenient API for benchmarking performance against an agent. Looking through the source code, we see that this agent can follow a random policy or play using the Pachi Go engine.
We want to hack this a bit to allow for self-play, so we insert a flag called solo:
```python
def _step(self, action):
    solo = True
    if not solo:
        assert self.state.color == self.player_color
    ...
    if not solo:
        if not self.state.board.is_terminal:
            ...
    else:
        if not self.state.board.is_terminal:
            pass
```
Here I've linked a patch that implements the changes characterized above. The flag cuts out the step where an opponent plays, permitting us to choose the moves for both sides.
From here, we instantiate an environment and perform a few randomly selected actions as follows:
```python
from collections import deque

import gym

env = gym.make('Go19x19-v0')
board = env.reset()

def go_game():
    env.reset()
    episode_reward = 0
    game_hist = deque(maxlen=16)
    for step in range(100):
        plyr = step % 2
        rand_actn = env.action_space.sample()
        board, reward, done, viz = env.step(rand_actn)
        game_hist.appendleft(board)
        game_states = [pers[plyr] for pers in game_hist]
        game_states.append(plyr)
        episode_reward += reward
        if done:
            break
    yield episode_reward, game_states
```
The step method above eats an integer from 0 to 362. With 19x19 = 361 intersections, actions 0 through 360 place stones: 0 on A19, 1 on B19, ..., 360 on T1. The final two actions encode a pass and a resign.
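As a sanity check on that mapping, here is a small hypothetical helper (my own, not part of the gym API) converting actions to coordinates, assuming row-major order from the top-left corner and Go's convention of skipping the letter I in column labels:

```python
# 19 column letters; the letter I is omitted by Go convention.
COLS = "ABCDEFGHJKLMNOPQRST"

def action_to_coord(action):
    """Map an integer action to a board coordinate like 'A19' (assumed
    row-major encoding from the top-left; 361+ means pass/resign)."""
    if action >= 19 * 19:
        return "pass/resign"
    row, col = divmod(action, 19)
    return f"{COLS[col]}{19 - row}"
```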
board[0:3] is three binary planes: positions occupied by player 0, positions occupied by player 1, and unoccupied positions, respectively.
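Assuming that plane layout, a quick illustration with a synthetic observation:

```python
import numpy as np

# Synthetic observation in the assumed (3, 19, 19) layout:
# plane 0 = player 0's stones, plane 1 = player 1's, plane 2 = empty points.
obs = np.zeros((3, 19, 19), dtype=np.int8)
obs[2] = 1                    # start from an empty board
obs[0, 0, 0] = 1              # player 0 places a stone at A19...
obs[2, 0, 0] = 0              # ...so that point is no longer empty
stones = int(obs[0].sum() + obs[1].sum())  # 1 stone on the board
```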
Here, I have a snippet using ray to parallelize the computation of many trajectories.
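The linked snippet itself isn't reproduced here. As a rough stand-in for the same fan-out pattern, this sketch uses the stdlib's concurrent.futures with a dummy play_one_game in place of the full gym rollout; with Ray, the worker would instead be decorated with @ray.remote and the results gathered via ray.get().

```python
import random
from concurrent.futures import ThreadPoolExecutor

def play_one_game(seed):
    # Dummy stand-in for a full self-play rollout: returns a fake
    # terminal reward of -1 or 1 for the episode.
    rng = random.Random(seed)
    return rng.choice([-1, 1])

# Fan out many independent games; each trajectory is embarrassingly
# parallel, which is what makes this workload a good fit for Ray.
with ThreadPoolExecutor(max_workers=8) as pool:
    rewards = list(pool.map(play_one_game, range(32)))  # 32 trajectories
```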
Next time, we will consider how Zero "sees" the game of Go.