SmellsLikeML

AlphaGo Zero: The Environment

10/22/17

Reinforcement Learning, On-Policy, Monte Carlo Tree Search, Go

DeepMind's AlphaGo is back in the news but not for beating mere humans.

Introducing AlphaGo Zero, Silver et al. demonstrate that no domain knowledge is needed to reach superhuman performance. Rather, AlphaGo Zero shatters performance ceilings using only self-play. Beating the earlier AlphaGo version 100 games to 0, Zero challenges our assumptions about learning. Here, I want to consider what distinguishes this advance from previous work.

Sutton and Barto's "Reinforcement Learning: An Introduction" provides an excellent reference, and I offer my own takeaways in RL in a Nutshell, my notes and code for Sutton & Barto.

Some notable differences between Zero and other AlphaGo versions:

- No hand-crafted features or domain-specific heuristics
- No learning from human expert games; training is pure self-play
- A single neural network serving as both policy and value function
- Lookahead search guided by that network alone, with no fast rollouts

Let's break these down in turn.

First, Go is interesting for its complexity. Each of the 19x19 board positions can be characterized as holding a white stone, a black stone, or no stone. Game rules alone do not constrain the action space enough to rely on unstructured search. In earlier versions, this challenge was mitigated using various hand-crafted heuristics to guide AlphaGo's action selection. In Zero, not even the most self-evident heuristics, like "Don't fill in your own eyes," have been built into the model.
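
As a rough sense of scale, a naive upper bound on board configurations (ignoring legality) is 3^361. The back-of-the-envelope calculation below is my own illustration, not from the paper:

# Naive upper bound on 19x19 board configurations: 3 states per intersection.
# Almost all of these are illegal positions; this is only a scale illustration.
upper_bound = 3 ** (19 * 19)
print(f"~10^{len(str(upper_bound)) - 1} possible colorings of the board")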

Similarly, earlier versions benefited from observing Go masters' game play. Zero reaches a new high Elo rating from the first principles of game play alone, autodidactically!

Second, RL algorithms can be broadly characterized by whether learning is on- or off-policy, that is, whether the policy being improved is the same one generating the experience. Relatedly, earlier AlphaGo versions maintained separate models for evaluating states and for selecting actions; Zero breaks from that work in that a single network serves both roles.
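
To make the single-model idea concrete, here is a minimal, hypothetical sketch of a network with a shared trunk feeding both a policy head and a value head. PolicyValueNet, its layer sizes, and the action count are placeholder choices of mine, not the paper's architecture:

import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Sketch: one shared trunk, two heads -- move logits and a win/loss estimate."""
    def __init__(self, board_size=19, n_actions=19 * 19 + 2):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),  # 3 input planes, matching the gym observation
            nn.ReLU(),
            nn.Flatten(),
        )
        feats = 32 * board_size * board_size
        self.policy_head = nn.Linear(feats, n_actions)                   # logits over moves
        self.value_head = nn.Sequential(nn.Linear(feats, 1), nn.Tanh())  # value in [-1, 1]

    def forward(self, x):
        shared = self.trunk(x)
        return self.policy_head(shared), self.value_head(shared)

# logits, value = PolicyValueNet()(torch.zeros(1, 3, 19, 19))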

Finally, AlphaGo searches ahead for the best game-play trajectories. Zero still uses Monte Carlo Tree Search during training to refine the neural network's policy, but the fast rollouts of earlier versions are gone: the searched positions are evaluated by this same network.

Self-Play

If you don't know Go, building up an environment for play can be daunting. However, OpenAI Gym offers a 19x19 board game environment that provides a convenient API for benchmarking performance against a built-in agent. Looking through the source code, we see that this opponent can follow a random policy or play by the Pachi Go engine.

We want to hack this a bit to allow for self-play, so we insert a flag called solo:


def _step(self, action):
    solo = True  # when True, skip the built-in opponent so we control both colors
    if not solo:
        assert self.state.color == self.player_color
    ...
    if not solo:
        # the opponent (random or Pachi) only moves when solo mode is off
        if not self.state.board.is_terminal:
            ...

Here I've linked a patch that implements the changes described above. The flag cuts out the step where the built-in opponent plays, permitting us to choose both players' moves.

From here, we instantiate an environment and perform a few randomly selected actions as follows:

from collections import deque
import gym

env = gym.make('Go19x19-v0')
board = env.reset()

def go_game():
    # Random self-play: both moves are ours, alternating players each step.
    env.reset()
    episode_reward = 0
    game_hist = deque(maxlen=16)  # keep only the 16 most recent board observations
    for step in range(100):
        plyr = step % 2  # alternate between player 0 and player 1
        rand_actn = env.action_space.sample()
        board, reward, done, info = env.step(rand_actn)
        game_hist.appendleft(board)
        # pull the current player's stone plane from each stored board
        game_states = [pers[plyr] for pers in game_hist]
        game_states.append(plyr)  # tag the stack with the player to move
        episode_reward += reward
        yield episode_reward, game_states
        if done:
            break

The step method above eats an integer from 0 to 362. Action 0 puts a stone on A19, 1 on B19, ..., 360 on T1, while 361 is a pass and 362 resigns.
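
For illustration, here is a small helper of my own (not part of the gym API) that maps an action index back to a coordinate, assuming actions scan the board row by row from A19 and that the column letter I is skipped, per Go convention:

COLS = 'ABCDEFGHJKLMNOPQRST'  # 19 column letters; Go boards skip "I"

def action_to_coord(action, board_size=19):
    # Assumed mapping: 0..360 scan the board row by row, then pass and resign.
    if action == board_size ** 2:
        return 'pass'
    if action == board_size ** 2 + 1:
        return 'resign'
    row, col = divmod(action, board_size)
    return f'{COLS[col]}{board_size - row}'

print(action_to_coord(0), action_to_coord(360), action_to_coord(361))  # A19 T1 pass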

board[0], board[1], and board[2] are 19x19 planes marking the positions occupied by player 0, the positions occupied by player 1, and the unoccupied positions, respectively.
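
A quick sanity check on a fresh board; the expected values in the comments follow from that description and are assumptions on my part:

import numpy as np

board = env.reset()
print(board.shape)                                # expected: (3, 19, 19)
print([int(np.sum(board[i])) for i in range(3)])  # expected: [0, 0, 361] before any stones are placed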

Here, I have a snippet using ray to parallelize the computation of many trajectories.
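
As a minimal sketch of that kind of parallelization (my own illustration, not the linked snippet), assuming the patched solo environment is still registered as 'Go19x19-v0', Ray remote tasks can farm out independent games:

import gym
import ray

ray.init()

@ray.remote
def play_random_game(max_steps=100):
    # Each worker builds its own environment and plays one game of random self-play.
    env = gym.make('Go19x19-v0')
    env.reset()
    episode_reward = 0
    for _ in range(max_steps):
        _, reward, done, _ = env.step(env.action_space.sample())
        episode_reward += reward
        if done:
            break
    return episode_reward

# Launch eight games in parallel; ray.get blocks until all of them finish.
futures = [play_random_game.remote() for _ in range(8)]
print(ray.get(futures))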

Next time, we will consider how Zero "sees" the game of Go.