SmellsLikeML

Need for Speed

09/10/17

Visual Odometry, Dense Optical Flow, Autonomous Vehicles, Computer Vision

Progress in autonomous vehicles has been exciting. One SF group working on this problem is comma.ai. Their approach has been to hack newer-model cars, augmented with dashcams, rather than wait for auto manufacturers to get up to speed.

They've also posted a programming challenge. After extracting the archive, you'll find a README, two .mp4 videos of dashcam footage from driving around the SF Bay Area, and a text file of 20,400 speeds, one per frame of the training video. The objective is visual odometry: regress a speed from each image frame and generalize to the unlabeled test video.
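To make the setup concrete, here is a minimal sketch of walking the labeled training frames with OpenCV. It assumes the speeds live in a file named train.txt with one value per line, paired with train.mp4 frame by frame.

```python
import cv2
import numpy as np

# Sketch: iterate over train.mp4, pairing frame i with the i-th speed label.
# Assumes the labels file is named train.txt, one speed per line.
speeds = np.loadtxt('train.txt')
cap = cv2.VideoCapture('train.mp4')
for i, speed in enumerate(speeds):
    ok, frame = cap.read()      # BGR frame i, labeled with speeds[i]
    if not ok:
        break
    # ... preprocess / featurize the frame here ...
cap.release()
```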

From the README: "We will evaluate your test.txt using mean squared error. <10 is good. <5 is better. <3 is heart." Here is a little sample from the training video:

In each snippet, the bottom quarter of the frame, showing the car's dash, provides no information for predicting speed. Often the top of the image offers no reference points either. The eye is generally drawn to the center of the frame, especially the lane markers. Even the neighboring cars, moving alongside at some relative speed, produce perplexing perspective cues.

Above, I have aggressively cropped a frame to focus on these markers. Additionally, I have converted the 3-channel color image to grayscale, since only a small part of the color spectrum is represented once attention is focused on the asphalt.
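As a rough sketch, the crop and grayscale conversion look something like the following; the crop bounds here are illustrative, not the exact window used for the figures.

```python
import cv2

# Illustrative crop bounds; the exact window used for the figures may differ.
Y0, Y1, X0, X1 = 200, 370, 60, 580

def to_patch(frame):
    patch = frame[Y0:Y1, X0:X1]                      # drop the dash and the sky
    return cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY)   # 3-channel color -> grayscale
```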

Here, we have taken great liberty in discarding global context to focus attention on a patch in the immediate foreground framing the lane markers. We will treat that global context separately. Since speed is a rate of distance traveled over time, information from additional frames may help. Naively subtracting consecutive images produces unintelligible results.
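For reference, that subtraction is just an absolute difference of consecutive patches, reusing the hypothetical to_patch helper sketched above.

```python
import cv2

def naive_diff(prev_frame, curr_frame):
    # Absolute difference of consecutive grayscale patches; noisy, but the
    # lane-marker structure survives.
    return cv2.absdiff(to_patch(prev_frame), to_patch(curr_frame))
```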

This noisy image still retains the structure of the lane markers. I need something like displacement vectors between consecutive images... after googling for a bit, I stumbled onto the notion of "optical flow".

Optical flow uses geometric techniques to track reference points and produces a vector field from a pair of images. Rendering an image colored by the magnitude of these vectors emphasizes the points that move across frames. The result carries more resolution than such simple structure warrants, so I downsample by averaging over 3x3 blocks.
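A sketch of that computation on consecutive grayscale patches, using OpenCV's Farneback dense flow; Farneback and the parameters below are one standard choice, not necessarily the exact settings behind the figures.

```python
import cv2
import numpy as np

def flow_magnitude(prev_gray, curr_gray, block=3):
    # Dense optical flow between two grayscale patches (Farneback here is an
    # assumption; the parameters are the common OpenCV tutorial values).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)
    mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])

    # Downsample by averaging over block x block tiles (3x3 by default).
    h = (mag.shape[0] // block) * block
    w = (mag.shape[1] // block) * block
    mag = mag[:h, :w]
    return mag.reshape(h // block, block, w // block, block).mean(axis=(1, 3))
```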

Applied to the training video, I have something like this:

Dense optical flow seems to have succeeded in emphasizing the parts of the frame where the greatest change occurs. However, the continuity in time is rather weak, so we could attenuate some of the apparently spurious frame-to-frame changes.

The gif above appears smoother because I averaged the transformed images over 3 consecutive frames in time.
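Concretely, this is a sliding mean over the transformed frames with a window of 3; a sketch is below, which also averages the speed labels over the same window, anticipating the labeling described later.

```python
import numpy as np

def time_smooth(features, speeds, window=3):
    # Average the transformed frames (and their speed labels) over a sliding
    # window of `window` consecutive frames.
    X, y = [], []
    for i in range(len(features) - window + 1):
        X.append(np.mean(features[i:i + window], axis=0))
        y.append(np.mean(speeds[i:i + window]))
    return np.array(X), np.array(y)
```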

To see why averaging frames across time might be reasonable, consider the following plot of speed versus frame number for train.mp4:

Inspection reveals that the speed changes only marginally over spans of several frames. In fact, rounding each speed to the nearest of 0, 5, 10, ..., 25 yields an approximation that differs from the original labels by an MSE of 2.3. So one approach to reaching the target performance could be to build an excellent classifier. Here is what the distribution of speeds looks like:

From these last two plots, it seems reasonable to conclude that the training video samples different modes of driving. Namely, the segment begins with a freeway portion of smoother, higher-speed driving, while the second half features slower, stop-and-go driving through traffic and residential areas. Speeds near 15 m/s appear to act as a barrier between these modes. Consider a model which predicts the average speed for each of these two modes:

This coarse approximation coincides with the modes of the histogram above but introduces an MSE over 15. Together, these numbers suggest a baseline: an accurate classifier would need to be paired with something more refined than predicting each class's average or modal speed.
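Both baselines are easy to check directly against the labels; a sketch, again assuming the speeds are in train.txt:

```python
import numpy as np

speeds = np.loadtxt('train.txt')

# Baseline 1: snap each speed to the nearest of 0, 5, ..., 25 (roughly 2.3 MSE).
rounded = np.clip(5 * np.round(speeds / 5.0), 0, 25)
print('rounded-to-5 MSE:', np.mean((speeds - rounded) ** 2))

# Baseline 2: split at 15 m/s and predict each mode's mean speed (MSE over 15).
fast = speeds >= 15
two_mode = np.where(fast, speeds[fast].mean(), speeds[~fast].mean())
print('two-mode MSE:    ', np.mean((speeds - two_mode) ** 2))
```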

To get a sense of what we'll encounter in the 16-minute video, below is a collection of time-stacked frames:

Cruising the Highway.
At a Stop.
Behind a Car.

Intuitively, I'd expect darker images with visible lane markers to indicate higher speeds, while lighter images reflect the fact that dense optical flow picks up more reference points at lower speeds. You might also expect a car directly ahead to be associated with low speeds.

And so we have cropped, transformed, downsampled, and time-smoothed the video frames, and labeled each with the average speed over its time window. However, this approach has been heavy-handed in discarding information. When the perspective does not vary too much across the training data, it can be desirable to learn feature detectors that specialize to different parts of the image.

Dense optical flow is often visualized so that the direction of the flow vectors maps to a color wheel while the magnitude scales with intensity. By reintroducing the color spectrum, we keep the advantage of greater localization. Here is a snippet without cropping, spatial smoothing, or greyscaling:

Observe that the lane markers generally fall in the parts of the image rendered green/orange. Now it would be possible for feature detectors to specialize in detecting lane markers.
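For reference, this is the standard OpenCV-style rendering: hue from the flow direction, brightness from the normalized magnitude, saturation maxed out. A sketch:

```python
import cv2
import numpy as np

def flow_to_bgr(flow):
    # Map flow direction to hue and flow magnitude to value; saturation is
    # maxed out, as in the common OpenCV visualization.
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros(flow.shape[:2] + (3,), dtype=np.uint8)
    hsv[..., 0] = ang * 180 / np.pi / 2                              # direction -> hue
    hsv[..., 1] = 255                                                # full saturation
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)  # magnitude -> value
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```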

It may be desirable to retain more of the high-frequency detail from the original image. Taking the saturation channel from the input image, rather than maxing it out in the dense optical flow rendering, we have:
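A sketch of that variant, reusing the flow_to_bgr sketch from above and swapping in the saturation channel of the original frame:

```python
import cv2

def flow_to_bgr_keep_sat(flow, frame):
    # Same rendering as above, but take saturation from the input BGR frame
    # instead of maxing it out, preserving more high-frequency detail.
    hsv = cv2.cvtColor(flow_to_bgr(flow), cv2.COLOR_BGR2HSV)
    hsv[..., 1] = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)[..., 1]
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```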

Thus we have framed a regression problem with a number of preprocessing techniques and considered performance bounds with simpler models. In the sequel, we will apply deep learning to the visual odometry problem.