10/11/17LSTMs for Video, Deep Learning, Computer Vision
Upon introducing multiple streams for our convnet regressor, the architectual possibilities really explode. Suppose we use a third stream feeding each by one of three consecutive frames. We'd like to trade our fully connected layers for a couple RNN layers, hopefully incorporating more temporal context into the predictions.
The necessary changes can be summarized as follows:
Trying 2 hidden RNN layers with 8 neurons, I quickly reach very low MSE.
Here is a plot of the predicted vs. actual in validation.
And the residuals for comparison