SmellsLikeML

AlphaGo Zero: Learn with Search

11/2/17

Having considered the Go environment and how AlphaGo Zero perceives game play through deep convnets, we now turn to how decisions are made.

AlphaGo Zero was designed with neural net 'heads' for policy and value calculations. Each head takes the convnet output and applies additional layers, learning its policy or value function through experienced gameplay.

The policy head applies a 2-filter 1x1 convolution with stride 1, batch normalization, and a relu activation, followed by a 362-unit dense layer whose logits are softmaxed into probabilities over the 361 board positions plus a pass, for something like:


import tensorflow as tf

# Policy head: 2-filter 1x1 convolution, batch norm, relu,
# then a 362-unit dense layer (361 board positions + pass).
policy_head = tf.layers.conv2d(net, filters=2,
                               kernel_size=1,
                               strides=1,
                               padding='same')
policy_head = tf.layers.batch_normalization(policy_head,
                                            training=True)
policy_head = tf.nn.relu(policy_head)
policy_head = tf.contrib.layers.flatten(policy_head)
policy_head = tf.layers.dense(policy_head, 362)  # logits
policy_head = tf.nn.softmax(policy_head)         # action probabilities
            

The value head uses a 1-filter 1x1 convolution with stride 1, batch normalization, and a relu activation; the result is flattened, passed through a 256-unit dense layer with relu, and reduced to a scalar with a tanh activation:


# Value head: 1-filter 1x1 convolution, batch norm, relu,
# flattened into a 256-unit dense layer and reduced to a tanh scalar.
value_head = tf.layers.conv2d(net, filters=1,
                              kernel_size=1,
                              strides=1,
                              padding='same')
value_head = tf.layers.batch_normalization(value_head,
                                           training=True)
value_head = tf.nn.relu(value_head)
value_head = tf.contrib.layers.flatten(value_head)
value_head = tf.layers.dense(value_head, 256)
value_head = tf.nn.relu(value_head)
value_head = tf.layers.dense(value_head, 1)  # scalar prediction
value_head = tf.nn.tanh(value_head)          # value in [-1, 1]
            

The MSE and cross-entropy losses from these heads are given equal weight in the overall loss. The policy head learns to approximate the move probabilities produced by MCTS, while the value head learns to estimate the likelihood of winning from a given board configuration.
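
As a rough sketch of how these losses might be combined (regularization omitted), assuming hypothetical placeholders mcts_probs for the MCTS visit-count distribution over all 362 actions and game_outcome for the final result in {-1, +1}:


mcts_probs = tf.placeholder(tf.float32, [None, 362])  # search probabilities
game_outcome = tf.placeholder(tf.float32, [None, 1])  # game result z

# Cross entropy between the search probabilities and the policy head output.
policy_loss = tf.reduce_mean(
    -tf.reduce_sum(mcts_probs * tf.log(policy_head + 1e-8), axis=1))

# Mean squared error between the predicted value and the game outcome.
value_loss = tf.reduce_mean(tf.square(value_head - game_outcome))

# Equal weighting of the two heads.
total_loss = policy_loss + value_loss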

In this way, the authors frame MCTS as a policy improvement operator, with the outcomes of search-guided self-play providing the evaluation. The neural network guides the search while learning to approximate its results. To encourage exploration, MCTS selects actions with an upper-confidence bound. This selection rule accounts for the uncertainty in our estimates: it tracks empirical visit counts and biases selection toward competitive actions whose value estimates are still uncertain because they have been visited less often.
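
A sketch of this kind of selection rule, in the spirit of the PUCT-style bound used in the paper; the names q_values, priors, visit_counts, and c_puct are illustrative, not the authors' implementation:


import numpy as np

def select_action(q_values, priors, visit_counts, c_puct=1.0):
    # q_values: mean value of each action from earlier simulations
    # priors: policy head probabilities for each action
    # visit_counts: how often each action has been explored so far
    total_visits = np.sum(visit_counts)
    # Exploration bonus shrinks as an action accumulates visits,
    # so rarely-tried but promising moves still get selected.
    u_values = c_puct * priors * np.sqrt(total_visits) / (1.0 + visit_counts)
    return np.argmax(q_values + u_values)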