Need for Speed IV


Lessons in Overfitting

After adding a couple of additional vision problems to my portfolio and taking part in some Kaggle contests, I got up the courage to submit my speed challenge predictions to the folks at comma. After some delays around the holidays, I heard back on the final MSE...

Turns out I scored over 90!! Clearly the LSTM model was very overfit.

Even as I assembled the submission for scoring, I "eyeballed" the results by watching the test video, and the predictions did generally rise and fall with the apparent speed. However, despite applying a rolling mean for temporal smoothing, the predictions oscillated more widely than is physically plausible. A quick parameter sweep also revealed enough variance between predictions to undermine my confidence. Nonetheless, once you make contact with comma, a time constraint is in effect if you want the submission scored by one of their engineers, so I went ahead.
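As a hedged sketch of the kind of temporal smoothing mentioned above (the 25-frame window here is an illustrative assumption, not the value I actually used), a rolling mean over per-frame predictions might look like:

```python
import numpy as np

def rolling_mean(preds, window=25):
    """Smooth per-frame speed predictions with a centered rolling mean.

    Edge frames are averaged over whatever partial window fits.
    """
    preds = np.asarray(preds, dtype=float)
    half = window // 2
    smoothed = np.empty_like(preds)
    for i in range(len(preds)):
        lo = max(0, i - half)
        hi = min(len(preds), i + half + 1)
        smoothed[i] = preds[lo:hi].mean()
    return smoothed

# A noisy oscillating signal; smoothing should reduce its variance.
noisy = 15 + 3 * np.sin(np.arange(200))
smooth = rolling_mean(noisy, window=25)
```

This suppresses frame-to-frame jitter but, as noted above, cannot fix predictions whose oscillations are wider than the smoothing window.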

Besides the temporal irregularity of the predictions, another obvious problem was that the model failed to predict 0 m/s when the automobile was stopped at an intersection. It seems the motion of other cars through the field of view confounds the model, which picks up on features derived from optical flow vectors.

One natural remedy might be to investigate different architectures. Indeed, NVIDIA reports success in visual odometry using a simple convnet architecture. For the amount of training data available, adding LSTM layers as I have done can contribute to overparametrization and overfitting. Recall that I began with a simpler convnet architecture; I suspect that by further reducing model capacity, the LSTM layers may yet be put to greater effect.

I should also make greater use of the continuity assumption. At each instant, I should expect the speed to be only marginally higher or lower than the average over the preceding time window. By explicitly concatenating something like the average speed over the window preceding the frame onto the model's input, the task effectively reduces to predicting relative changes in speed.
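A minimal sketch of this feature/target construction (the 20-frame window is an assumption for illustration): feed the model the trailing mean speed and train it on the residual, i.e. the deviation from that mean.

```python
import numpy as np

def continuity_features(speeds, window=20):
    """Build (trailing-mean feature, residual target) pairs.

    For each frame t, the feature is the mean speed over the preceding
    `window` frames and the target is the deviation from that mean, so
    the model only has to predict a small relative change in speed.
    """
    speeds = np.asarray(speeds, dtype=float)
    feats, targets = [], []
    for t in range(window, len(speeds)):
        trailing_mean = speeds[t - window:t].mean()
        feats.append(trailing_mean)
        targets.append(speeds[t] - trailing_mean)
    return np.array(feats), np.array(targets)

# For a slowly varying speed signal, the residual targets stay small.
speeds = np.linspace(10.0, 12.0, 100)
feats, targets = continuity_features(speeds, window=20)
```

At inference time the trailing mean would have to come from the model's own previous predictions, which is a design choice this sketch glosses over.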

However, I believe the primary culprit here is a poor validation scheme. In fact, this was a chief concern from an early point, as evidenced by variation in the training curves. Specifically, I experimented with simply reserving the last 20% of the video for validation versus selecting 20% of frames at random. Both are probably a poor man's measure of the test error.

Selecting 20% of frames at random ignores the high degree of correlation between subsequent frames, leading to an overly optimistic validation loss. In other words, consecutive frames vary only slightly, so information leakage renders the validation loss a poor approximation to the generalization error.
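The leakage is easy to quantify with a toy experiment (frame counts and split sizes here are illustrative assumptions): under a random split, nearly every validation frame sits directly next to a training frame, while under a contiguous block split only the boundary frame does.

```python
import random

def frac_adjacent(train_idx, val_idx):
    """Fraction of validation frames within one frame of a training
    frame -- a crude proxy for train/validation information leakage."""
    train = set(train_idx)
    adjacent = sum(1 for v in val_idx if (v - 1) in train or (v + 1) in train)
    return adjacent / len(val_idx)

frames = list(range(1000))
random.seed(0)

# Random 20% split: validation frames are scattered through the video.
val_random = set(random.sample(frames, 200))
train_random = [f for f in frames if f not in val_random]

# Block split: hold out the last 20% as one contiguous segment.
val_block = frames[800:]
train_block = frames[:800]

print(frac_adjacent(train_random, val_random))  # typically around 0.95
print(frac_adjacent(train_block, val_block))    # 1/200: only the boundary
```

Since adjacent frames are near-duplicates, a high adjacency fraction means the model is effectively validated on frames it has already seen.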

On the other hand, chopping off the last N seconds implicitly assumes this segment represents the full variation of what we might encounter in general. Needless to say, this is a weak assumption.

I believe a better approach than either of the quick-and-dirty schemes discussed above is to methodically build a validation set that represents the different driving modes while selecting frames insulated from the training data. A reasonable scheme could work as follows:
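One hedged sketch of such a scheme (the segment length, buffer size, and speed-bin edges are all assumptions): split the video into contiguous segments, bin each segment by its mean speed as a rough proxy for driving mode, hold out whole segments from each bin, and discard a buffer of frames on either side of every held-out segment so near-duplicate neighbors cannot leak into training.

```python
import numpy as np

def build_validation_split(speeds, seg_len=100, buffer=10,
                           bins=(0.0, 5.0, 15.0, np.inf), val_per_bin=1):
    """Segment-level validation split, stratified by mean segment speed.

    Returns (train_idx, val_idx) as frame-index arrays. Frames within
    `buffer` frames of a validation segment are dropped from training.
    """
    n = len(speeds)
    speeds = np.asarray(speeds, dtype=float)
    segments = [np.arange(s, min(s + seg_len, n)) for s in range(0, n, seg_len)]

    # Group segments into "driving mode" bins by their mean speed.
    by_bin = {}
    for seg in segments:
        b = int(np.digitize(speeds[seg].mean(), bins))
        by_bin.setdefault(b, []).append(seg)

    # Hold out whole segments from each bin.
    rng = np.random.default_rng(0)
    val_segs = []
    for segs in by_bin.values():
        picks = rng.choice(len(segs), size=min(val_per_bin, len(segs)),
                           replace=False)
        val_segs.extend(segs[i] for i in picks)
    val_idx = np.concatenate(val_segs)

    # Exclude validation frames plus a buffer on each side from training.
    excluded = set(val_idx.tolist())
    for seg in val_segs:
        excluded.update(range(max(0, seg[0] - buffer),
                              min(n, seg[-1] + buffer + 1)))
    train_idx = np.array([i for i in range(n) if i not in excluded])
    return train_idx, val_idx
```

Holding out whole segments preserves the temporal correlation structure within the validation set, the buffer insulates it from training data, and the stratification forces stopped, slow, and highway driving to all be represented.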

There are also a number of results to work through in understanding the state of the art in computer vision for autonomous vehicles. Though the model's performance was very disappointing, I will take this as a signal that there is much more work to be done here. One shouldn't step up to the challenge with limited experience in vision problems and expect to blow away the competition. I will return to this problem after the next set of challenges I have planned for myself as I work on new skills in deep learning.