Also read our pitch!
For our next challenge, we return to a five-year-old Kaggle contest to develop a model to help predict seizure events. The training data consists of 106 GB of .mat files, each representing 10 minutes of raw intracranial EEG readings from 16 or 24 sensors (for the 5 dogs and 2 human patients, respectively).
Unconstrained by the real-world limitations of computational feasibility on embedded hardware, the winning models involved complex ensembles of linear models, SVMs, multilayer perceptrons, and random forests. These models were generally preceded by preprocessing steps like PCA or band-pass filtering for dimensionality reduction. The contest finished in 2014, nearly a year before deep learning frameworks like TensorFlow were available to help challengers experiment with deep learning, and none of the winning approaches were going to make it onto a wearable device.
Speech recognition is another discipline where deep learning models can perform quite well. Essentially, practitioners transform the audio signal with the Fourier Transform, producing a spectrogram to which computer vision techniques like convolutional neural networks (CNNs) can be applied to identify visual structures called formants.
Going back to seizure prediction, the EEG readings oscillate rapidly in time, so it is reasonable to apply the Fast Fourier Transform (FFT) to this signal to learn what discernible patterns may emerge. Besides the smoothing effect of the Fourier Transform, the FFT algorithm is very fast and can easily be performed by a microcontroller. Combining FFT with CNNs, we want to develop a model which can actually be run on a wearable for the purposes outlined by the contest sponsors.
Things get complex in frequency space, so we simplify our visualization by applying the absolute value to the spectrogram for a single 10 minute EEG reading. Additionally, the relative magnitude between frequencies often necessitates applying a log transform so that patterns are not overwhelmed by scale. Finally, we will apply min-max normalization for a plot corresponding to a single channel like:
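As a sketch, the magnitude / log / min-max pipeline described above boils down to a few lines of NumPy (function name and the epsilon guard are my own choices):

```python
import numpy as np

def normalize_spectrogram(spec, eps=1e-12):
    """Collapse a complex spectrogram into a normalized real-valued image:
    magnitude -> log transform -> min-max scaling into [0, 1]."""
    log_mag = np.log(np.abs(spec) + eps)  # eps guards against log(0)
    lo, hi = log_mag.min(), log_mag.max()
    return (log_mag - lo) / (hi - lo + eps)
```

The epsilon keeps the transform well defined for silent bins and constant windows without noticeably changing the result.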
Let's assume we cannot fit 10 minutes of 16- or 24-channel, 400 or 5000 Hz EEG readings into memory. Rather than applying the FFT to our training samples, we'll choose a variant called the Short-Time Fourier Transform (STFT) to resolve the signal spectra over smaller time windows, say 3 seconds.
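SciPy provides an STFT implementation; a sketch for one channel might look like the following (the random signal is a stand-in for a real recording, and the 400 Hz rate matches the dog data rather than the ~5000 Hz human data):

```python
import numpy as np
from scipy.signal import stft

fs = 400         # sampling rate in Hz (dog recordings)
window_sec = 3   # resolve spectra over 3-second windows

# One channel of a 10-minute recording; random noise stands in for real iEEG.
x = np.random.default_rng(0).normal(size=fs * 600)

# nperseg sets the window length in samples; by default consecutive
# windows overlap by half a window.
freqs, times, Zxx = stft(x, fs=fs, nperseg=fs * window_sec)
# Zxx is complex with shape (frequency bins, time frames)
```

Each column of `Zxx` is the spectrum of one short window, which is what we will turn into spectrogram images below.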
Then, restricting to smaller time segments, we have the following plot using the STFT:
As before, we min-max normalize after performing a log transform on the absolute value of the STFT window for a single channel.
It may be a dramatic simplification to ignore covariance between channels, but for the purposes of a quick test we will create a large batch of these spectrograms, resized to 128 x 128 images and dumped to file. It takes another big assumption to regard any 3 second clip from the 10 minute segment as equally representative for the purposes of differentiating signatures in the EEG.
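One way to sketch the per-window image generation (the resampling method is my assumption; any image resize would do):

```python
import numpy as np
from scipy.ndimage import zoom

def spectrogram_to_image(window, out_size=(128, 128), eps=1e-12):
    """Turn one complex STFT window (freq x time) into a 128 x 128 float
    image: log-magnitude, min-max normalize, then resample to out_size."""
    img = np.log(np.abs(window) + eps)
    img = (img - img.min()) / (img.max() - img.min() + eps)
    factors = (out_size[0] / img.shape[0], out_size[1] / img.shape[1])
    return zoom(img, factors)  # spline interpolation to the target size
```

From here, each image can be saved to disk (e.g. with `np.save`) along with its interictal/preictal label.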
A Simple Baseline
To get things started, I create a very simple architecture to explore the thesis that CNNs can perform comparably to methods from the contest, but on resource-limited hardware. Then, by fitting a few thousand samples into memory, I start exploring different architectural choices for the convnet. I'll downsample the very prevalent interictal class to create a more balanced dataset.
Initially, I am looking to rein in the model complexity; the Keras Sequential API supports a handy 'summary' method to get a count of model parameters.
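For reference, here is a Sequential model consistent with the layer shapes and parameter counts in the summary below (the activation choices, dropout rates, and optimizer are my assumptions, not taken from the original code):

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import (Activation, Conv2D, Dense, Dropout,
                                     Flatten, MaxPooling2D)

model = Sequential([
    Input(shape=(128, 128, 1)),          # one 128 x 128 spectrogram image
    Conv2D(32, (5, 5)),                  # 832 params -> (124, 124, 32)
    Activation('relu'),
    MaxPooling2D((2, 2)),
    Conv2D(32, (3, 3)),                  # 9,248 params each from here on
    Activation('relu'),
    MaxPooling2D((2, 2)),
    Conv2D(32, (3, 3)),
    Activation('relu'),
    MaxPooling2D((2, 2)),
    Conv2D(32, (3, 3)),
    Activation('relu'),
    MaxPooling2D((2, 2)),
    Flatten(),                           # 6 * 6 * 32 = 1152 features
    Dense(512),
    Activation('relu'),
    Dropout(0.5),
    Dense(32),
    Activation('relu'),
    Dropout(0.5),
    Dense(1),
    Activation('sigmoid'),               # binary interictal/preictal output
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
```

Calling `model.summary()` on this sketch reproduces the 635,361-parameter count shown below.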
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 124, 124, 32)      832
_________________________________________________________________
activation_1 (Activation)    (None, 124, 124, 32)      0
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 62, 62, 32)        0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 60, 60, 32)        9248
_________________________________________________________________
activation_2 (Activation)    (None, 60, 60, 32)        0
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 30, 30, 32)        0
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 28, 28, 32)        9248
_________________________________________________________________
activation_3 (Activation)    (None, 28, 28, 32)        0
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 14, 14, 32)        0
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 12, 12, 32)        9248
_________________________________________________________________
activation_4 (Activation)    (None, 12, 12, 32)        0
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 6, 6, 32)          0
_________________________________________________________________
flatten_1 (Flatten)          (None, 1152)              0
_________________________________________________________________
dense_1 (Dense)              (None, 512)               590336
_________________________________________________________________
activation_5 (Activation)    (None, 512)               0
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0
_________________________________________________________________
dense_2 (Dense)              (None, 32)                16416
_________________________________________________________________
activation_6 (Activation)    (None, 32)                0
_________________________________________________________________
dropout_2 (Dropout)          (None, 32)                0
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33
_________________________________________________________________
activation_7 (Activation)    (None, 1)                 0
=================================================================
Total params: 635,361
Trainable params: 635,361
Non-trainable params: 0
_________________________________________________________________
Train on 9291 samples, validate on 2323 samples
Next, I look to quickly demonstrate that the model is able to learn, without too much regard for performance or overfitting.
Reviewing the training progress, there is good reason to believe that training on more data and experimenting with fine-tuned models, unsupervised pretraining, and hyperparameter optimization may all help lead to a usable model.
Epoch 99/100
128/128 [==============================] - 502s 4s/step - loss: 0.6242 - acc: 0.6316 - val_loss: 0.6317 - val_acc: 0.6271
Epoch 100/100
128/128 [==============================] - 521s 4s/step - loss: 0.6253 - acc: 0.6318 - val_loss: 0.6266 - val_acc: 0.6321
Now that we have this baseline, we can begin to pin down some of the details around our simplifying assumptions and scale up to the full dataset. If this goes well, we might also try unsupervised pretraining, since half of the data consists of testing samples.