SmellsLikeML

Deep IDS

1/31/18

Security, Intrusion Detection System, Machine Learning

After learning more about hacking and penetration testing, I was surprised by how simple and formulaic the methods can be. While we imagine a crafty and persistent attacker, that hand-crafted approach just doesn't scale. Attackers are more productive when they play the numbers and automate their attack methods. That was essentially the theme of a book like Violent Python: use low-level understanding to build scalable, custom attack pipelines.

At the end of Violent Python, we saw anti-virus packages fooled by payloads whose signatures they failed to recognize. As new tricks surface in the cat-and-mouse game of cybersecurity, it's natural to consider robust machine learning models for detecting malicious traffic. The idea is at least as old as the 1999 KDD Cup challenge, which was framed around exactly this problem.

The KDDCup99 dataset serves as a popular benchmark for intrusion detection systems. Many researchers benchmark on the NSL-KDD dataset instead, which simplifies and standardizes the preprocessing of the KDD Cup data.

One of the key challenges with this dataset is the binary classification task of 'Normal' vs. 'Anomalous' traffic. Unsupervised learning methods like K-Means clustering are often used for their robustness to new attack methods. The book 'Advanced Analytics with Spark' demonstrates Spark MLlib KMeans on the KDD Cup '99 dataset, and here is a pyspark script.
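The gist of that clustering approach looks something like the sketch below, assuming the raw KDD Cup '99 CSV sits at a hypothetical path like data/kddcup.data, with no header row and the label in the last column:

```python
# Sketch: KMeans over the KDD Cup '99 numeric features with PySpark.
# The file path and k value are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kdd-kmeans").getOrCreate()

df = spark.read.csv("data/kddcup.data", inferSchema=True)
# _c1, _c2, _c3 are the categorical protocol/service/flag fields; keep the label (_c41) aside.
numeric_cols = [c for c in df.columns if c not in ("_c1", "_c2", "_c3", "_c41")]

assembler = VectorAssembler(inputCols=numeric_cols, outputCol="features_raw")
scaler = StandardScaler(inputCol="features_raw", outputCol="features")
kmeans = KMeans(featuresCol="features", k=23, seed=42)  # roughly one cluster per label value

assembled = assembler.transform(df)
scaled = scaler.fit(assembled).transform(assembled)
model = kmeans.fit(scaled)

# Cross-tabulate cluster assignments against the ground-truth labels.
preds = model.transform(scaled)
preds.groupBy("prediction", "_c41").count().orderBy("prediction").show(50)
```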

Others use tree-based models for the network intrusion detection task.
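For a rough idea of that baseline, here is a scikit-learn sketch, assuming KDDTrain+/KDDTest+ have already been loaded into pandas DataFrames with one-hot encoded categoricals and a binary label column; the variable names are assumptions, not from any particular paper:

```python
# Sketch: random forest baseline for binary intrusion detection on NSL-KDD features.
# Assumes train_df / test_df are pandas DataFrames with a 'label' column (0=normal, 1=attack).
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

X_train = train_df.drop(columns=["label"])
y_train = train_df["label"]
# Align test columns with train in case some one-hot categories are missing in test.
X_test = test_df.reindex(columns=X_train.columns, fill_value=0)
y_test = test_df["label"]

clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```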

One strength of research in this area should be how cheap it is to simulate attacks and build a large training dataset. That suggests deep learning methods, which tend to achieve higher performance as data scales up. An engineering research group considered this and published results applying deep learning to the task, comparing a plain deep network against one with unsupervised pretraining.

While the research group does not provide details on the deep learning model's network architecture, it is not difficult to reproduce their results with sensible choices. I built a multilayer perceptron on the NSL-KDD Train+ dataset, one-hot encoding the categorical variables and applying simple transformations like log scaling and min-max standardization. The linked notebook shows that with little effort, we can reach similar performance on the 5-class classification task.
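A minimal sketch of such an MLP in Keras, assuming X_train and y_train (integer class ids 0-4) come out of the preprocessing described above; the layer sizes are sensible guesses, not the researchers' architecture:

```python
# Sketch: multilayer perceptron for the 5-class NSL-KDD task.
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential([
    Dense(128, activation="relu", input_dim=X_train.shape[1]),
    Dropout(0.3),
    Dense(64, activation="relu"),
    Dense(5, activation="softmax"),  # normal, DoS, probe, R2L, U2R
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, validation_split=0.1, epochs=20, batch_size=256)
```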

The result of their work was that training in two stages achieves higher performance: first unsupervised learning with autoencoders over both train and test, then a second phase stacking the pretrained layers under fine-tuned higher layers to train a classifier. This should come as no surprise; the test dataset has roughly 1/6th the number of samples of the training dataset and includes attack methods not covered in the training data. It should also come as little surprise that upsampling some of the underrepresented classes leads to higher performance.
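A rough sketch of that two-stage recipe in Keras, assuming X_all stacks the unlabeled train and test feature matrices while X_train/y_train are the labeled arrays from before; the layer sizes and epoch counts are placeholders:

```python
# Sketch: unsupervised pretraining with an autoencoder, then fine-tuning a classifier head.
from keras.models import Model
from keras.layers import Input, Dense

n_features = X_all.shape[1]

# Stage 1: autoencoder over train + test features (labels never used here).
inp = Input(shape=(n_features,))
encoded = Dense(64, activation="relu")(inp)
decoded = Dense(n_features, activation="linear")(encoded)
autoencoder = Model(inp, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_all, X_all, epochs=20, batch_size=256)

# Stage 2: reuse the pretrained encoder layer and fine-tune a small classifier on top.
hidden = Dense(32, activation="relu")(encoded)
out = Dense(5, activation="softmax")(hidden)
classifier = Model(inp, out)
classifier.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
classifier.fit(X_train, y_train, epochs=20, batch_size=256)
```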

Because these are known training methods to boost model performance, I wanted to investigate algorithmic and architectural choices that might lead to an improved model. I ran into a paper applying RNNs to the NSL-KDD dataset; however, the data is not sequential, and the scant details in the paper suggest the results may be dubious.

This paper offers a broad overview of the dataset and modern methods around the task.

One approach that I was interested in considering is entity embeddings. These architectures have demonstrated value when high-cardinality categorical features can be represented efficiently with embedding layers. The categorical feature values are encoded as integers and fed into separate embedding layers. Here, I explored deep-shallow architectures with additional layers for the real-valued feature matrix. Ultimately, I found this entity embedding model to perform similarly to the MLP, both of which are comparable to the results of the Toledo researchers above.
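Here is a sketch of the entity-embedding idea in Keras, assuming the three KDD categorical fields (protocol, service, flag) have been integer encoded and the numeric block scaled; the cardinalities, embedding sizes, and input names are illustrative, not exact:

```python
# Sketch: entity embeddings for categorical features, concatenated with a numeric branch.
from keras.models import Model
from keras.layers import Input, Embedding, Flatten, Dense, Concatenate

# Approximate cardinalities of the integer-encoded categoricals -- illustrative values.
cardinalities = {"protocol": 3, "service": 70, "flag": 11}
num_numeric = X_num_train.shape[1]

cat_inputs, cat_embedded = [], []
for name, card in cardinalities.items():
    inp = Input(shape=(1,), name=f"{name}_in")
    emb = Embedding(input_dim=card, output_dim=min(8, (card + 1) // 2))(inp)
    cat_inputs.append(inp)
    cat_embedded.append(Flatten()(emb))

num_input = Input(shape=(num_numeric,), name="numeric_in")
num_branch = Dense(32, activation="relu")(num_input)  # extra depth for the real-valued block

x = Concatenate()(cat_embedded + [num_branch])
x = Dense(64, activation="relu")(x)
out = Dense(5, activation="softmax")(x)

model = Model(inputs=cat_inputs + [num_input], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit([proto_train, service_train, flag_train, X_num_train], y_train, ...)  # hypothetical arrays
```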

Those researchers also spelled out directions for future work, including the implementation of a real-time network intrusion detection system (NIDS) as well as learning directly on raw network traffic headers. Both of these would be fantastic results. I'd like to have a simple NIDS running on something like a Raspberry Pi to protect all my home devices.

The KDD Cup dataset itself is highly engineered, with feature blocks corresponding to 2-second time windows, knowledge-based features, and other statistics and metadata.

Some have advocated for moving on from the KDD Cup benchmarks; here is a modern survey of the openly available datasets and methods for producing ML models for NIDS. The UNSW-NB15 dataset provides a modern alternative to the KDD 99 dataset. Though publicly available pcap files are often unlabeled, some groups have released attack samples which we can read on the command line with tcpdump. The ADFA datasets catalogue malicious system calls for Host Intrusion Detection Systems (HIDS).

Using system calls for HIDS aims to reduce attack code to its underlying actions, capturing latent context. Often, rich domain knowledge is baked into intrusion detection systems.

Here is a small set of labeled server log data. Inspecting the logs, we find some suspicious resource requests and can frame the malicious request classification task. Reducing the data to the character level, we can implement a convnet classifier like Zhang & LeCun. Here is a Keras implementation which performs quite well and uses fewer than 10K parameters.
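The shape of that model is roughly the sketch below, assuming each request has been lowercased and mapped to a fixed-length sequence of character ids; the vocabulary size and sequence length are illustrative:

```python
# Sketch: small character-level convnet for malicious-request classification, Zhang & LeCun style.
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

vocab_size = 70   # printable characters kept after lowercasing -- illustrative
max_len = 200     # truncate/pad each request to 200 characters -- illustrative

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=16, input_length=max_len),
    Conv1D(filters=32, kernel_size=7, activation="relu"),
    GlobalMaxPooling1D(),
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid"),  # malicious vs. benign request
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, validation_split=0.1, epochs=10, batch_size=64)
```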

Still, the holy grail is a robust, end-to-end, real-time intrusion detection system. This likely requires deep learning on raw packets. Here the payload itself or the temporal character of network interactions may lend itself well to a sequential representation for an RNN. Other researchers apply dilated convolutional layers and unsupervised pretraining to claim state-of-the-art performance on the CTU dataset.
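As a sketch of what dilated 1-D convolutions over raw payload bytes might look like (the byte vocabulary, sequence length, and layer sizes here are assumptions, not the cited architecture):

```python
# Sketch: stacked dilated Conv1D layers over raw payload bytes for binary flow classification.
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

max_bytes = 1024  # truncate/pad each payload to 1024 bytes -- assumption

model = Sequential([
    Embedding(input_dim=256, output_dim=8, input_length=max_bytes),  # one id per byte value
    Conv1D(32, kernel_size=3, dilation_rate=1, padding="causal", activation="relu"),
    Conv1D(32, kernel_size=3, dilation_rate=2, padding="causal", activation="relu"),
    Conv1D(32, kernel_size=3, dilation_rate=4, padding="causal", activation="relu"),
    GlobalMaxPooling1D(),
    Dense(1, activation="sigmoid"),  # malicious vs. benign flow
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```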

Though this area of research hasn't attracted as much attention as computer vision, I am excited to see where it goes as cybersecurity threats gain greater prominence.