10/08/17

We've found that, even after aggressively downsampling an image, enough high-level patterns remain to perform classification/regression. Generalizable models must learn the most salient underlying structures in the data to remain robust to noise/perturbations. I want to consider the autoencoder as a data exploration tool, aiming to inform the design of supervised learning experiments.

The autoencoder is a broad concept that includes classical ML algorithms like PCA as a special case. We train a model to approximate the identity function under some constraints. Common patterns include:

Sparse | Contractive | Denoising | Undercomplete |
---|---|---|---|
Augment loss with l1/l2 penalty on model weights | Augment loss with l2 penalty on model Jacobian | Corrupt input data, reconstruct w/o noise | Set dimension of hidden layer smaller than input |
Smaller/more sparse model weights | Smoother model | Improved robustness to noise | Dimensionality Reduction |

Each variant exploits a different prior assumption, offering the practitioner characteristically biased model flavors.

Below, we flatten each input image and feed it to an undercomplete, fully connected neural network that encodes the 1200-pixel raw input into a 32-dimensional space. During training, the loss is the MSE of the image reconstruction. The intermediate code at the bottleneck layer must therefore retain the patterns that matter most from the MSE perspective. Because the Speed Challenge training data features a single perspective, associating input neurons with pixel values is reasonable.
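To make the setup concrete, here is a minimal NumPy sketch of an undercomplete autoencoder with the 1200 → 32 bottleneck described above, trained on the MSE. It is not the original experiment code: the synthetic stand-in data, the single tanh encoder layer, the linear decoder, the Adam settings and every name (`encode`, `decode`, `loss_and_grads`) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed stand-in data: 256 flattened "images" of 1200 pixels that really
# live near an 8-dimensional subspace, plus a little noise.
n_images, n_pixels, code_dim = 256, 1200, 32
latent = rng.standard_normal((n_images, 8))
mixing = rng.standard_normal((8, n_pixels)) / np.sqrt(8)
X = 0.25 * latent @ mixing + 0.05 * rng.standard_normal((n_images, n_pixels))

# Encoder (1200 -> 32, tanh) and decoder (32 -> 1200, linear) parameters.
params = {
    "W_enc": rng.standard_normal((n_pixels, code_dim)) / np.sqrt(n_pixels),
    "b_enc": np.zeros(code_dim),
    "W_dec": rng.standard_normal((code_dim, n_pixels)) / np.sqrt(code_dim),
    "b_dec": np.zeros(n_pixels),
}

def encode(X, p):
    return np.tanh(X @ p["W_enc"] + p["b_enc"])

def decode(code, p):
    return code @ p["W_dec"] + p["b_dec"]

def loss_and_grads(X, p):
    """MSE reconstruction loss and its gradients via manual backprop."""
    code = encode(X, p)
    err = decode(code, p) - X
    loss = np.mean(err ** 2)
    d_recon = 2.0 * err / err.size            # dL/d(reconstruction)
    d_code = d_recon @ p["W_dec"].T           # back through the decoder
    d_pre = d_code * (1.0 - code ** 2)        # back through tanh
    grads = {
        "W_dec": code.T @ d_recon,
        "b_dec": d_recon.sum(axis=0),
        "W_enc": X.T @ d_pre,
        "b_enc": d_pre.sum(axis=0),
    }
    return loss, grads

# Full-batch Adam updates (hyperparameters are assumptions, not tuned).
lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
m = {name: np.zeros_like(w) for name, w in params.items()}
v = {name: np.zeros_like(w) for name, w in params.items()}
losses = []
for step in range(1, 301):
    loss, grads = loss_and_grads(X, params)
    losses.append(loss)
    for name in params:
        m[name] = beta1 * m[name] + (1 - beta1) * grads[name]
        v[name] = beta2 * v[name] + (1 - beta2) * grads[name] ** 2
        m_hat = m[name] / (1 - beta1 ** step)
        v_hat = v[name] / (1 - beta2 ** step)
        params[name] -= lr * m_hat / (np.sqrt(v_hat) + eps)

codes = encode(X, params)  # one 32-dimensional code per image
print(codes.shape, losses[0], losses[-1])
```

In a real run, `X` would hold the flattened, normalized frames, and a deep learning framework would replace the hand-written gradients.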

The existence of these encodings is a testament to the inherent simplicity of an image corpus and the efficiency of distributed representations. In other words, resolving an image to pixel values may be wasteful when there is a smaller, more dense representation available.

However, the decoder is optimized to reduce reconstruction error under the MSE, which tends to smooth out hard image gradients. The largest errors relate to MSE's bias toward a blurred version of the world:

For comparison, the best reconstructions feature smooth images, often well-represented in the training data.
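One way to pull out the best and worst reconstructions is to rank images by their per-image MSE. The sketch below assumes you already have arrays of originals and reconstructions; the function name `rank_by_reconstruction_error` is hypothetical, and the 3-tap blur is only a stand-in for an MSE-trained decoder. On this toy pair, the hard edge is penalized more than the smooth ramp.

```python
import numpy as np

def rank_by_reconstruction_error(images, reconstructions):
    """Return per-image MSE and indices sorted from best to worst."""
    images = np.asarray(images, dtype=float)
    reconstructions = np.asarray(reconstructions, dtype=float)
    flat_err = (images - reconstructions).reshape(len(images), -1)
    per_image_mse = np.mean(flat_err ** 2, axis=1)
    return per_image_mse, np.argsort(per_image_mse)

def blur(row):
    """3-tap moving average with edge padding (stand-in for a blurry decoder)."""
    padded = np.pad(row, 1, mode="edge")
    return np.convolve(padded, np.ones(3) / 3.0, mode="valid")

# Two 1-D "images": a smooth ramp and a hard step edge.
ramp = np.linspace(0.0, 1.0, 8)
edge = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
images = np.stack([ramp, edge])
blurred = np.stack([blur(row) for row in images])

mse, order = rank_by_reconstruction_error(images, blurred)
print("best:", order[0], "worst:", order[-1], "mse:", mse)
```

With real data, `order[:k]` and `order[-k:]` index the k best and k worst reconstructions for plotting.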

Projecting the encoded images with t-SNE reveals how saturation and obstruction of the horizon offer local discriminative power in the embedded space.

This offers some hint of the intrinsic dimensionality of the data as well as characteristics of training under MSE.
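A projection like this can be produced with scikit-learn's `TSNE`. The snippet is only a sketch: it assumes scikit-learn is installed, and it uses two synthetic clusters as a stand-in for the real 32-dimensional bottleneck codes. The perplexity value is an arbitrary choice, not the one used for the plot above.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Assumed stand-in for the (n_images, 32) bottleneck codes: two loose clusters.
codes = np.vstack([
    rng.normal(-2.0, 1.0, size=(60, 32)),
    rng.normal(+2.0, 1.0, size=(60, 32)),
])

# Project to 2-D; perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=20.0, init="pca", random_state=0)
embedding = tsne.fit_transform(codes)
print(embedding.shape)
```

Each row of `embedding` can then be scattered and colored by whatever attribute you suspect drives the structure, such as mean brightness.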

Suppose, from the beginning, we were interested in constraining the distribution of encoded points. Variational Autoencoders operate under the assumption that these codes should be Gaussian distributed. With this assumption, we trade MSE loss for one which can be decomposed into the sum of:

- a term driving the model toward high fidelity reconstructions
- a term encouraging the encoding to be more 'normal'
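The two terms can be written down directly. The sketch below assumes the common setup in which the encoder outputs a mean `mu` and log-variance `log_var` per latent dimension, the reconstruction term is a summed squared error, and the regularizer is the closed-form KL divergence from a diagonal Gaussian to a standard normal. The function name `vae_loss` is hypothetical, and the original model may weight or define the terms differently.

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    """Return (total, reconstruction, kl), each averaged over the batch."""
    x = np.asarray(x, dtype=float)
    x_recon = np.asarray(x_recon, dtype=float)
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    # Term 1: fidelity of the reconstruction (assumed squared error).
    reconstruction = np.sum((x - x_recon) ** 2, axis=1)
    # Term 2: KL( N(mu, exp(log_var)) || N(0, I) ), pulls codes toward 'normal'.
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)
    return np.mean(reconstruction + kl), np.mean(reconstruction), np.mean(kl)

# A perfect reconstruction whose code mean is off-center in one dimension.
x = np.array([[0.0, 1.0]])
total, rec, kl = vae_loss(x, x, mu=np.array([[1.0, 0.0]]), log_var=np.zeros((1, 2)))
print(total, rec, kl)
```

The KL term is zero exactly when `mu` is zero and `log_var` is zero, that is, when the encoded distribution already matches the standard normal.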

This model learns a mapping to a Gaussian distributed latent space. Then, to generate new images with a distribution similar to the training set, we simply sample the Gaussian and decode the sample with the VAE. Here, I have decoded 5 points sampled from an 80-dim Gaussian to generate realistic images.
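Generation then amounts to two lines: draw latent points and push them through the decoder. In the sketch below, `decode` is a hypothetical stand-in (a random linear map followed by a sigmoid), not the trained VAE decoder, and the 1200-pixel output size is an assumption; only the sampling step mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, n_pixels = 80, 1200  # n_pixels is an assumed output size

# Hypothetical decoder weights; a trained VAE decoder would go here.
W = rng.standard_normal((latent_dim, n_pixels)) / np.sqrt(latent_dim)

def decode(z):
    """Stand-in decoder mapping latent codes to pixel intensities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(z @ W)))

z = rng.standard_normal((5, latent_dim))  # 5 samples from an 80-dim standard normal
generated = decode(z)
print(generated.shape)
```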

We can even decode several points along the line segment spanning the encoding of two images to visualize the transition from one to another taking a shortcut through latent space.
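The walk through latent space is a plain linear interpolation between two codes. The helper below is a sketch with a hypothetical name, `interpolate_codes`; it returns the intermediate codes only, and each row would then be passed to the VAE decoder to render a frame of the transition.

```python
import numpy as np

def interpolate_codes(z_start, z_end, n_steps=8):
    """Return n_steps codes evenly spaced on the segment from z_start to z_end."""
    z_start = np.asarray(z_start, dtype=float)
    z_end = np.asarray(z_end, dtype=float)
    t = np.linspace(0.0, 1.0, n_steps)[:, None]  # column of mixing weights
    return (1.0 - t) * z_start + t * z_end

# Toy endpoints standing in for the encodings of two images.
path = interpolate_codes(np.zeros(80), np.ones(80), n_steps=5)
print(path.shape, path[:, 0])
```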

Here is code to perform the dimensionality reduction and generate plots you can adapt to your investigation.

For example, check out the VAE decodings trained on a corpus of favicon images.