Also read our pitch!

For our next challenge, we return to a five-year-old Kaggle contest to develop a model to help predict seizure events. The training data consists of 106 GB of .mat files, each representing 10 minutes of raw intracranial EEG readings from 16 or 24 sensors, for the 5 dogs and 2 human patients respectively.

Most contestants engineered lab-tested features based on the FFT or time-correlations between the channels, with the results of a similar contemporaneous contest in mind.

Unconstrained by the real-world limits of computational feasibility on embedded hardware, the winning models involved complex ensembles of linear models, SVMs, multilayer perceptrons, and random forests, generally preceded by preprocessing techniques like PCA or band-pass filtering for dimensionality reduction. The contest finished in 2014, nearly a year before deep learning frameworks like TensorFlow became available to help challengers experiment with deep learning, and none of the winning approaches were going to make it onto a wearable device.

More recently, the TensorFlow Speech Recognition Challenge introduced a special prize for entrants able to run their models on a Raspberry Pi 3.

Speech recognition is another discipline where deep learning models perform quite well. Essentially, practitioners regularize the audio signal with the Fourier Transform, producing a spectrogram to which computer vision ideas using convolutional neural networks (CNNs) can be applied to identify visual structures called formants.

[Figure: spectrogram for speech signal processing]

Reading Minds

Going back to seizure prediction, the EEG readings oscillate rapidly in time, so it is reasonable to apply the FFT to this signal to learn what discernible patterns may emerge. Besides the smoothing effect of the Fourier Transform, the FFT algorithm is very fast and can easily be performed by a microcontroller. By combining the FFT with CNNs, we want to develop a model which can actually be run on a wearable for the purposes outlined by the contest sponsors.

Things get complex in frequency space (the spectrogram is complex-valued), so we simplify our visualization by taking the absolute value of the spectrogram for a single 10-minute EEG reading. Additionally, the relative magnitude between frequencies often necessitates a log transform so that patterns are not overwhelmed by differences in scale. Finally, we apply min-max normalization, yielding a plot for a single channel like:

[Figure: log-scaled, min-max normalized spectrogram for a single EEG channel]
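
As a sketch of this preprocessing (the file and field names here are hypothetical; the actual segment files store the data matrix under per-segment struct names):

import numpy as np
from scipy import signal
from scipy.io import loadmat

def normalize_spectrogram(sxx):
    # abs -> log -> min-max, the steps described above
    sxx = np.log1p(np.abs(sxx))        # log transform tames the dynamic range
    sxx -= sxx.min()
    return sxx / (sxx.max() + 1e-12)   # min-max normalize to [0, 1]

# Hypothetical file and field names, for illustration only.
mat = loadmat('Dog_1_interictal_segment_0001.mat')
eeg = mat['data']                      # shape: (n_channels, n_samples)
fs = 400                               # dog recordings; humans are sampled at 5000 Hz

# Complex spectrogram of one channel across the full 10-minute clip.
f, t, sxx = signal.spectrogram(eeg[0], fs=fs, mode='complex')
img = normalize_spectrogram(sxx)       # the single-channel plot above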

Let's assume we cannot fit 10 minutes of 16- or 24-channel EEG readings, sampled at 400 or 5000 Hz, into memory. Rather than applying the FFT to entire training samples, we'll choose a variant called the Short-Time Fourier Transform (STFT) to resolve the signal's spectra over smaller time windows, say 3 seconds.

Then, restricting to smaller time segments, we have the following plot using the STFT:

[Figure: STFT spectrogram over a 3-second window]

As before, we min-max normalize after performing a log transform on the absolute value of the STFT window for a single channel.
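
A sketch of the windowed version, reusing eeg, fs, and normalize_spectrogram from above; the 3-second clip length comes from the text, while the 0.25-second sub-window (nperseg) is my own choice:

# Restrict to a single 3-second clip and resolve its spectrum with the STFT.
window_sec = 3
clip = eeg[0, :window_sec * fs]                        # first 3 seconds of channel 0
f, t, zxx = signal.stft(clip, fs=fs, nperseg=fs // 4)  # 0.25 s sub-windows
img = normalize_spectrogram(zxx)                       # abs -> log -> min-max as before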

It may be a dramatic simplification to ignore the covariance between channels, but for the purposes of a quick test, we will create a bunch of these single-channel spectrograms, resized to 128 x 128 images and dumped to file. It takes another big assumption to regard any 3-second clip from the 10-minute segment as equally representative for the purposes of differentiating signatures in the EEG.
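
One way to batch those clips out to disk; the resize call, file layout, and names are my own sketch rather than the exact pipeline:

import os
from skimage.transform import resize

def dump_clips(eeg, fs, out_dir, label, window_sec=3, channel=0):
    # Slice one channel into non-overlapping 3-second clips and save each
    # as a 128 x 128 normalized spectrogram. Channels are treated
    # independently, per the no-covariance simplification above.
    os.makedirs(out_dir, exist_ok=True)
    step = window_sec * fs
    for i in range(eeg.shape[1] // step):
        clip = eeg[channel, i * step:(i + 1) * step]
        _, _, zxx = signal.stft(clip, fs=fs, nperseg=fs // 4)
        img = resize(normalize_spectrogram(zxx), (128, 128))
        np.save(os.path.join(out_dir, '%s_%05d.npy' % (label, i)), img.astype(np.float32))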

A Simple Baseline

To get things started, I create a very simple architecture to explore the thesis that CNNs can perform comparably to methods from the contest, but on resource-limited hardware. Then, by fitting a few thousand samples into memory, I start exploring different architectural choices for the convnet. I'll downsample the very prevalent interictal class to create a more balanced dataset, as sketched below.
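
A minimal sketch of that downsampling, assuming the clips have been loaded into arrays X and y with binary labels (1 = preictal, 0 = interictal):

def balance_classes(X, y, seed=0):
    # Randomly downsample the majority interictal class (y == 0)
    # to match the count of the preictal minority (y == 1).
    rng = np.random.RandomState(seed)
    preictal = np.where(y == 1)[0]
    interictal = rng.choice(np.where(y == 0)[0], size=len(preictal), replace=False)
    keep = np.concatenate([preictal, interictal])
    rng.shuffle(keep)
    return X[keep], y[keep]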

Initially, I am looking to rein in the model complexity; the Keras Sequential API supports a handy summary() method to get a count of model parameters.
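
The summary below is consistent with a Sequential stack like this sketch; the layer shapes and parameter counts match the printout, while the ReLU activations, 2 x 2 pooling, and dropout rates are assumptions:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Activation, Flatten, Dense, Dropout

model = Sequential([
    Conv2D(32, (5, 5), input_shape=(128, 128, 1)),  # -> (124, 124, 32), 832 params
    Activation('relu'),
    MaxPooling2D((2, 2)),
    Conv2D(32, (3, 3)),                             # three 3 x 3 conv blocks follow
    Activation('relu'),
    MaxPooling2D((2, 2)),
    Conv2D(32, (3, 3)),
    Activation('relu'),
    MaxPooling2D((2, 2)),
    Conv2D(32, (3, 3)),
    Activation('relu'),
    MaxPooling2D((2, 2)),
    Flatten(),                                      # 6 * 6 * 32 = 1152 features
    Dense(512),
    Activation('relu'),
    Dropout(0.5),                                   # rate is an assumption
    Dense(32),
    Activation('relu'),
    Dropout(0.5),
    Dense(1),
    Activation('sigmoid'),                          # preictal vs. interictal
])
model.summary()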

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 124, 124, 32)      832
_________________________________________________________________
activation_1 (Activation)    (None, 124, 124, 32)      0
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 62, 62, 32)        0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 60, 60, 32)        9248
_________________________________________________________________
activation_2 (Activation)    (None, 60, 60, 32)        0
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 30, 30, 32)        0
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 28, 28, 32)        9248
_________________________________________________________________
activation_3 (Activation)    (None, 28, 28, 32)        0
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 14, 14, 32)        0
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 12, 12, 32)        9248
_________________________________________________________________
activation_4 (Activation)    (None, 12, 12, 32)        0
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 6, 6, 32)          0
_________________________________________________________________
flatten_1 (Flatten)          (None, 1152)              0
_________________________________________________________________
dense_1 (Dense)              (None, 512)               590336
_________________________________________________________________
activation_5 (Activation)    (None, 512)               0
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0
_________________________________________________________________
dense_2 (Dense)              (None, 32)                16416
_________________________________________________________________
activation_6 (Activation)    (None, 32)                0
_________________________________________________________________
dropout_2 (Dropout)          (None, 32)                0
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33
_________________________________________________________________
activation_7 (Activation)    (None, 1)                 0
=================================================================
Total params: 635,361
Trainable params: 635,361
Non-trainable params: 0
_________________________________________________________________

Next, I look to quickly demonstrate that the model is able to learn, without too much regard for performance or overfitting.
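
A sketch of that first run, reusing the model above; X_train, y_train, X_val, and y_val stand in for the in-memory arrays from the balancing step. Binary cross-entropy matches the sigmoid output, while the optimizer and batch size are guesses:

model.compile(loss='binary_crossentropy',
              optimizer='adam',              # optimizer choice is an assumption
              metrics=['accuracy'])

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),    # 9291 train / 2323 validation samples
          epochs=100,                        # matches the log below
          batch_size=32)                     # batch size is a guess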

Reviewing the training progress, there is good reason to believe that training on more data and experimenting with fine-tuned models, unsupervised pretraining, and hyperparameter optimization may all help lead to a usable model.

Train on 9291 samples, validate on 2323 samples
Epoch 99/100
128/128 [==============================] - 502s 4s/step - loss: 0.6242 - acc: 0.6316 - val_loss: 0.6317 - val_acc: 0.6271
Epoch 100/100
128/128 [==============================] - 521s 4s/step - loss: 0.6253 - acc: 0.6318 - val_loss: 0.6266 - val_acc: 0.6321

Now that we have this, we can begin to pin down some of the details around our simplifying assumptions and scale up to the full dataset. If this goes well, we might also try unsupervised pretraining, since half of the data consists of test samples.

Stay tuned...