SmellsLikeML

AlphaGo Zero: Seeing Go

10/22/17

Experts in games of strategy like chess are thought to maintain an internal representation of the board configuration, one that summarizes geometric and functional dependencies between pieces.

AlphaGo Zero leverages this insight, using the state of the art in computer vision to build its own representations. Binary arrays describe the game state from a player's perspective at a point in time, indicating which intersections are occupied. But Go forbids repeating a previous position, imposing a temporal constraint, so Zero considers the sequence of game states from both players' perspectives over the last 8 moves each, along with a binary variable C encoding the color to play. Symbolically, consider the stack of 17 feature planes:

$$ s_t = \left[ \mathcal{X}_t, \mathcal{Y}_t, ..., \mathcal{X}_{t-7}, \mathcal{Y}_{t-7}, C\right]$$

Here, from the perspectives of players X and Y, each plane is a binary 19x19 array whose coordinates indicate whether that player's stone occupies the intersection at time t.
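As a rough sketch (not AlphaGo Zero's actual preprocessing), we might assemble these planes in NumPy. Here black_history, white_history, and to_play are assumed names: lists of 19x19 binary arrays ordered most recent first, and the color to move.

import numpy as np

def build_input_planes(black_history, white_history, to_play, T=8):
    """Stack the last T board states for each player plus the color plane."""
    planes = []
    for t in range(T):
        # pad with empty boards when fewer than T moves have been played
        planes.append(black_history[t] if t < len(black_history)
                      else np.zeros((19, 19)))
        planes.append(white_history[t] if t < len(white_history)
                      else np.zeros((19, 19)))
    # constant plane C: 1 if black to play, 0 if white
    planes.append(np.full((19, 19), to_play))
    return np.stack(planes, axis=-1)   # shape (19, 19, 17)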

Input features are fed to towers built from convolution blocks characterized by 3x3 kernels, batch normalization, and ReLU activations, followed by 19 or 39 residual blocks which add skip connections. The features extracted from these towers drive policy/value iteration.

In TensorFlow, we might build these towers with something like:


import tensorflow as tf

def conv_block(inputs, idx, act=tf.nn.relu):
    with tf.name_scope('conv_block-{}'.format(idx)):
        out = tf.layers.conv2d(inputs, 256,
                               strides=1,
                               kernel_size=3,
                               padding='same')
        out = tf.layers.batch_normalization(out,
                                            training=True)
        return act(out)

def residual_block(inputs, idx):
    with tf.name_scope('res_block-{}'.format(idx)):
        out = conv_block(inputs, '{}-1'.format(idx))
        out = conv_block(out, '{}-2'.format(idx),
                         act=tf.identity)
        out = out + inputs   # skip connection
        return tf.nn.relu(out)

# 17 binary feature planes on the 19x19 board, as described above
features = tf.placeholder(tf.float32, [None, 19, 19, 17])

inpts = conv_block(features, 'in')
for i in range(19):   # or 39
    inpts = residual_block(inpts, i)

This architecture continues the trend toward removing pooling layers, using small kernels with many feature maps, and adding skip connections.

There are nearly 20K (3^9 = 19,683) configurations for a 3x3 patch of board, and Zero tries to learn the 256 most useful ways to look at each one. As layers are stacked, the effective receptive field increases and higher-level features can be learned.
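To make the receptive field claim concrete, each additional 3x3 convolution at stride 1 widens the effective receptive field by 2 intersections, so a handful of stacked blocks already sees the whole board. A back-of-the-envelope sketch, not anything from the paper:

def receptive_field(n_convs, kernel=3):
    # each stride-1 conv adds (kernel - 1) to the receptive field
    return 1 + n_convs * (kernel - 1)

print(receptive_field(9))    # 19 -- nine 3x3 convs already span the 19x19 board
print(receptive_field(39))   # 79 after the first conv block plus 19 residual blocks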

Residual connections help gradients backpropagate through the many layers.
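One way to see why: with the skip connection, each block computes $x + F(x)$, so the local gradient always carries an identity term that passes through the block unattenuated:

$$ \frac{\partial}{\partial x}\left( x + F(x) \right) = 1 + \frac{\partial F}{\partial x} $$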

On 19x19 input with sharp local differences, the decimation of pooling would likely prove catastrophic to learning. Pooling is nice for smoother, larger input that we want to downsample. Put another way, a Go board is characteristically smaller and "higher frequency" than the natural images prevalent in ImageNet.

Next time, we will consider the modules applied to learn game play policies.