Need for Speed II


Convnets, Computer Vision, Deep Learning, Regression

Using TF-Slim to fine-tune a checkpointed Inception-V3 model to classify raw images as one of the 2 apparent driving modes, we can achieve over 94% accuracy on a randomly sampled 30% validation set. Though this is an optimistic estimate, since the validation and training frames are highly correlated in this setup, high accuracy seems reasonable considering typical images from the early segment:
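The original fine-tuning used TF-Slim against a checkpoint; a roughly equivalent sketch in `tf.keras` (swapping TF-Slim for the Keras applications API; the 2-way head matches the two driving modes in the text, but the frozen trunk, pooling, and optimizer settings are illustrative assumptions):

```python
import tensorflow as tf

def build_mode_classifier(input_shape=(299, 299, 3), weights="imagenet"):
    # Load Inception-V3 pretrained weights, dropping the 1000-way ImageNet head.
    base = tf.keras.applications.InceptionV3(
        include_top=False, weights=weights,
        input_shape=input_shape, pooling="avg")
    base.trainable = False  # freeze the convolutional trunk to start

    # New 2-way head for the two apparent driving modes.
    logits = tf.keras.layers.Dense(2)(base.output)
    model = tf.keras.Model(base.input, logits)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-3),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"])
    return model
```

From here, training only the new head first and then unfreezing upper Inception blocks is a common fine-tuning schedule.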

and the latter half:

These examples portray the later video segment, with narrower streets and a generally more complex background scene. A fine-tuned image classifier might home in on visual cues like these to distinguish top-speed from stop-and-go driving.

This suggests the plausibility of learning more complex patterns from the video frames to refine a baseline which simply predicts the mode average.

Starting from the CNN classifier built before, I tweaked the model to perform regression under an MSE loss. I also modified the file to load all the training images into memory, which consumes around 10 GB of RAM.

Considering the complexity of the images and their size after cropping and resizing down, I chose to apply 3 layers of 5x5 kernels with a stride of 2 and no pooling. Flattening and feeding into 3 fully connected layers with dropout, I trained with the Adam optimizer at an initial learning rate of 1e-3 under exponential decay.
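The architecture above can be sketched in `tf.keras` (the three 5x5-stride-2 conv layers, three fully connected layers with dropout, and Adam at 1e-3 under exponential decay follow the text; the filter counts, FC widths, dropout rate, decay schedule constants, and input size are assumptions):

```python
import tensorflow as tf

def build_speed_regressor(input_shape=(120, 160, 3)):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        # Three 5x5 conv layers with stride 2, no pooling.
        tf.keras.layers.Conv2D(32, 5, strides=2, activation="relu"),
        tf.keras.layers.Conv2D(64, 5, strides=2, activation="relu"),
        tf.keras.layers.Conv2D(64, 5, strides=2, activation="relu"),
        tf.keras.layers.Flatten(),
        # Three fully connected layers with dropout.
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1),  # single regression output: speed
    ])
    # Adam with initial learning rate 1e-3 under exponential decay.
    lr = tf.keras.optimizers.schedules.ExponentialDecay(
        1e-3, decay_steps=1000, decay_rate=0.96)
    model.compile(optimizer=tf.keras.optimizers.Adam(lr), loss="mse")
    return model
```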

Optimistic validation estimates notwithstanding, the loss of the model (repo here) quickly drops well below the baseline of a crude classifier predicting the mode average.

This effort puts us solidly in range of the performance target. At this point, we have not yet reconciled modeling approaches which alternatively focus attention on reference points determined by Dense Optical Flow versus performing high level scene interpretation as when we fine-tuned an Inception V3 classifier. Motivated by this work, I want to explore models which incorporate more of the global information we discarded in our zeal for crops and transforms.

The authors describe a CNN architecture over two video streams at differing resolutions: one branch specializes in high-level context like color, the other in the high-frequency content of the center-cropped image. Factoring the input this way allowed the authors to exploit the camera's center bias in framing the subject across their training videos, reducing training time without sacrificing accuracy.

High level context of a scene remains even after aggressively downsampling to 30x40 images.

I used OpenCV's cv2.resize(img, (0, 0), fx=0.5**4, fy=0.5**4) to generate a whole folder of low-resolution images.

Then I passed the command-line argument path/to/low/res/images and set conv_layers to [(3,32,1),(3,32,1),(3,32,1)], after tweaking a line in the data utils to skip resizing/cropping.

This model also reaches validation error close to 3.

Some error analysis:

Post-processing yields small accuracy gains by clamping speeds predicted to be negative, since no negative examples occurred in the training data. Observe the strong model accuracy near 5 m/s.
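The clamping step is one line of NumPy; since true speeds are never negative, it can only shrink the error on affected examples (toy numbers for illustration):

```python
import numpy as np

actual = np.array([0.5, 0.2, 4.8])   # true speeds, all non-negative
pred = np.array([-1.3, -0.4, 5.0])   # raw model outputs

clipped = np.maximum(pred, 0.0)      # clamp negative speed predictions to zero

mse_raw = np.mean((pred - actual) ** 2)
mse_clipped = np.mean((clipped - actual) ** 2)
```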

Plotting predicted minus actual against actual speed shows that high speeds are even harder, and the biggest error contributions come from a tendency to underestimate them.

The gif above shows the validation images corresponding to the top 10 errors in increasing order. Interestingly, several images feature highway structures which obscure the horizon much like the trees and buildings of characteristically lower-speed residential sections. This coincides with large underestimates of speed, suggesting the model may have been fooled by the visual similarity.

For comparison, the model using optical flow produces errors like:

Despite the tight concentration of the scatter plot up to 15 m/s, this model also has difficulties near 0 m/s.

In contrast to the raw image model, this model shows a greater tendency to overestimate high speeds.

It is often a good idea to ensemble models which make different kinds of errors. Perhaps averaging predictions will reduce overall MSE.
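The averaging itself is a few lines; when the raw-image model underestimates where the optical-flow model overestimates, the errors partially cancel (synthetic numbers chosen to show the mechanics, not actual results):

```python
import numpy as np

actual = np.array([5.0, 15.0, 25.0])
pred_raw = actual + np.array([0.5, -1.0, -3.0])    # raw-image model: underestimates high speeds
pred_flow = actual + np.array([-0.5, 1.0, 2.0])    # optical-flow model: overestimates them

ensemble = (pred_raw + pred_flow) / 2.0            # simple average of predictions

def mse(pred):
    return np.mean((pred - actual) ** 2)
```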

Still another approach is to benefit from both perspectives using CNNs for feature extraction, concatenating before feeding to the fully connected layers.

With the benefit of both optical flow feature extraction and the visual context of raw images, I chose wider (2048, 1024, 256, 1) fully connected layers, training with early stopping to a validation loss of 1.6.
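A sketch of the fused architecture in `tf.keras`: two convolutional branches (optical flow and raw image) feeding concatenated features into the (2048, 1024, 256, 1) fully connected stack, with early stopping on validation loss. The FC widths follow the text; branch depths, filter counts, and input sizes are assumptions.

```python
import tensorflow as tf

def conv_branch(input_shape, name):
    # Small conv stack used as a feature extractor for one stream.
    inp = tf.keras.layers.Input(shape=input_shape, name=name)
    x = inp
    for filters in (32, 64):
        x = tf.keras.layers.Conv2D(filters, 5, strides=2, activation="relu")(x)
    return inp, tf.keras.layers.Flatten()(x)

def build_fusion_model(flow_shape=(120, 160, 3), raw_shape=(30, 40, 3)):
    flow_in, flow_feat = conv_branch(flow_shape, "optical_flow")
    raw_in, raw_feat = conv_branch(raw_shape, "raw_image")
    # Concatenate both feature vectors before the fully connected layers.
    x = tf.keras.layers.Concatenate()([flow_feat, raw_feat])
    # Wider fully connected stack from the text: 2048 -> 1024 -> 256 -> 1.
    for units in (2048, 1024, 256):
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    out = tf.keras.layers.Dense(1)(x)
    model = tf.keras.Model([flow_in, raw_in], out)
    model.compile(optimizer="adam", loss="mse")
    return model

# Early stopping on validation loss:
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)
# model.fit(..., validation_data=..., callbacks=[early_stop])
```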

This model still tends to predict negative velocities. The residuals for this model don't fan out as much for higher speeds as the previous models.

For our qualitative review, I plot the optical flow images corresponding to the largest errors.

This collection shows the tendency to underestimate speeds with overpasses in the background, though with smaller residual error.

While the combined model is an improvement, we have done little outside of Optical Flow to exploit the temporal regularities...