While we typically imagine the crafty and persistent hacker, attackers can be more effective by playing the numbers with automated methods. Books like Violent Python show how well suited Python scripting is to building custom attack pipelines.
By the end of that book, the reader has learned to fool anti-virus packages, which identify malware by looking for specific signatures and cannot generalize to malicious behavior more broadly. It is therefore natural to consider machine learning methods to keep up in the cat-and-mouse game of cybersecurity. The idea is at least as old as the 1999 KDD Cup, which framed exactly this sort of challenge.
One task associated with this dataset is the binary classification of 'Normal' vs. 'Anomalous' traffic. Unsupervised learning methods like K-Means clustering are often used for their robustness to new attack methods. Here is a PySpark MLlib implementation of KMeans on the KDD Cup '99 dataset, from the book 'Advanced Analytics with Spark'.
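To make the clustering idea concrete, here is a minimal pure-Python sketch, not the PySpark implementation from the book. It assumes that normal traffic forms dense clusters and scores a point by its distance to the nearest centroid. The 2-D toy data, the choice of `k=2`, and the threshold rule are all made up for illustration.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties out
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def anomaly_score(point, centroids):
    """Distance to the nearest centroid; larger means more unusual."""
    return min(math.dist(point, c) for c in centroids)


# Toy, made-up 2-D "traffic" features: two dense groups of normal connections.
rng = random.Random(1)
normal = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(100)]
normal += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(100)]

centroids = kmeans(normal, k=2)

# Assumed threshold rule: the largest score seen on the training data.
threshold = max(anomaly_score(p, centroids) for p in normal)

outlier = (20.0, -20.0)  # far from both groups
is_anomalous = anomaly_score(outlier, centroids) > threshold
```

Because the score only depends on distance from what was seen in training, a new kind of attack can be flagged without ever having been labeled, which is the robustness argument above. In practice the threshold would be tuned on held-out data rather than set to the training maximum.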
Others use tree-based models for the network intrusion detection task.
An engineering research group published results on applying deep learning to this task and considered unsupervised pretraining.
While the research group does not provide details on the deep learning model's network architecture, it is not difficult to reproduce their results with sensible choices. I built a multilayer perceptron on the NSL-KDD Train+ dataset, one-hot encoding the categorical variables and applying a log transformation and Min-Max scaling. The linked notebook shows that with little effort, we can reach similar performance in the 5-class classification task.
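The preprocessing steps can be sketched in plain Python as follows. This is an illustration rather than the code from the notebook: the use of `log1p` for the log transformation is my assumption (it handles zero-valued counts), and the four sample rows are made up. The column names `protocol_type` and `src_bytes` follow the dataset's feature names.

```python
import math


def one_hot(values):
    """Map each categorical value to a 0/1 vector over the sorted vocabulary."""
    vocab = sorted(set(values))
    index = {v: i for i, v in enumerate(vocab)}
    encoded = [
        [1.0 if index[v] == i else 0.0 for i in range(len(vocab))]
        for v in values
    ]
    return encoded, vocab


def log_min_max(values):
    """Apply log(1 + x) to tame heavy tails, then rescale to [0, 1]."""
    logged = [math.log1p(v) for v in values]
    lo, hi = min(logged), max(logged)
    span = hi - lo
    # A constant column has zero span; map it to 0.0 instead of dividing by zero.
    return [(x - lo) / span if span else 0.0 for x in logged]


# Made-up sample rows.
protocol_type = ["tcp", "udp", "tcp", "icmp"]
src_bytes = [0, 181, 5450, 1032]

encoded, vocab = one_hot(protocol_type)
scaled = log_min_max(src_bytes)

# Final feature rows: one-hot block followed by the scaled numeric column.
rows = [enc + [s] for enc, s in zip(encoded, scaled)]
```

When evaluating on a separate test split, the vocabulary and the min/max values would normally be computed on the training split and reused, so that both splits share the same encoding.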
While methods like upsampling underrepresented classes and unsupervised pretraining are known techniques to boost model performance, I wanted to investigate algorithmic and architectural choices that might lead to an improved model.
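For reference, upsampling can be as simple as resampling the rare classes with replacement until every class matches the largest one. The sketch below is one such variant under that assumption; the labels and the integer stand-ins for feature rows are made up.

```python
import random
from collections import Counter, defaultdict


def upsample(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until all classes are the same size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw extra copies with replacement to fill the gap to the largest class.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Made-up, imbalanced toy labels; integers stand in for feature rows.
labels = ["normal"] * 4 + ["dos"] * 2 + ["u2r"] * 1
rows = list(range(len(labels)))

balanced_rows, balanced_labels = upsample(rows, labels)
class_sizes = Counter(balanced_labels)
```

Resampling should be applied to the training split only, otherwise duplicated rows leak into the evaluation data.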
I wanted to test Entity Embeddings, since these architectures are designed for high-cardinality categorical features that can be efficiently represented with embedding layers. The categorical feature values are encoded as integers, and each feature is fed into its own embedding layer. Here, I explored deep-shallow architectures with additional layers for the real-valued feature matrix. Ultimately, I found the entity embedding model to perform similarly to the MLP, and both perform comparably to the results of the Toledo researchers above.
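The input side of such a model can be sketched without any framework: each categorical column gets its own integer index and its own lookup table, and the looked-up vectors are concatenated with the real-valued features. This is only an illustration of the data flow. The embedding sizes in `dims` are hypothetical, the tables are random and untrained, and in a real model the lookup would be a trainable embedding layer inside a deep learning framework.

```python
import random


def build_index(values):
    """Assign each distinct category an integer id, in order of first appearance."""
    index = {}
    for v in values:
        index.setdefault(v, len(index))
    return index


def init_embedding(num_categories, dim, rng):
    """One small random vector per category; a framework would train these."""
    return [
        [rng.uniform(-0.05, 0.05) for _ in range(dim)]
        for _ in range(num_categories)
    ]


rng = random.Random(0)

# Made-up sample rows for two categorical columns and two real-valued ones.
columns = {
    "protocol_type": ["tcp", "udp", "tcp"],
    "service": ["http", "domain_u", "smtp"],
}
real_valued = [[0.2, 0.0], [0.1, 0.9], [0.7, 0.3]]
dims = {"protocol_type": 2, "service": 3}  # hypothetical embedding sizes

indexes = {name: build_index(vals) for name, vals in columns.items()}
tables = {
    name: init_embedding(len(indexes[name]), dims[name], rng) for name in columns
}


def embed_row(i):
    """Concatenate each column's embedding vector with the real-valued features."""
    vec = []
    for name, vals in columns.items():
        vec += tables[name][indexes[name][vals[i]]]
    return vec + real_valued[i]


inputs = [embed_row(i) for i in range(len(real_valued))]
```

Compared with one-hot encoding, the width of each categorical block is the chosen embedding size rather than the number of distinct values, which is where the efficiency for high-cardinality features comes from.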
The KDD Cup dataset itself is highly engineered, with feature blocks corresponding to 2-second time windows, knowledge-based features, and other statistics and metadata. This paper offers a broad overview of the dataset and modern methods around the task.
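As an illustration of what a 2-second time-window feature looks like, the sketch below counts, for each connection, how many earlier connections went to the same destination host in the preceding two seconds. It is a simplified stand-in and does not reproduce the dataset's exact feature definitions. The timestamps and addresses are made up, and whether the current connection itself is included in the count is a convention I chose (it is excluded here).

```python
from collections import deque


def same_host_counts(connections, window=2.0):
    """Count earlier connections to the same host within `window` seconds.

    `connections` is a time-sorted list of (timestamp, dst_host) pairs.
    """
    recent = deque()  # connections still inside the sliding window
    counts = []
    for ts, host in connections:
        # Drop connections that have fallen out of the window.
        while recent and ts - recent[0][0] > window:
            recent.popleft()
        counts.append(sum(1 for _, h in recent if h == host))
        recent.append((ts, host))
    return counts


# Made-up connection log: (seconds, destination host).
connections = [
    (0.0, "10.0.0.5"),
    (0.5, "10.0.0.5"),
    (1.0, "10.0.0.9"),
    (1.9, "10.0.0.5"),
    (4.5, "10.0.0.5"),
]

counts = same_host_counts(connections)
```

A burst of connections to one host in a short window produces a high count, which is the kind of signal these engineered features are meant to surface.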
Some have advocated for moving on from the KDD Cup benchmarks; here is a modern survey of the openly available datasets and methods for producing ML models for network intrusion detection systems (NIDS). The UNSW-NB15 dataset provides a modern alternative to the KDD 99 dataset. Though publicly available pcap files are often unlabeled, some groups have released attack samples, which we can read on the command line with tcpdump. The ADFA datasets catalogue malicious system calls for Host Intrusion Detection Systems (HIDS).
Using system calls for HIDS aims to reduce the attack code to its underlying actions and so capture latent context. Often, rich domain knowledge is baked into intrusion detection systems.
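One simple way to work with system-call traces, shown here as my own illustrative assumption rather than a method taken from the datasets above, is to break each trace into short n-grams and measure how many of a new trace's n-grams were never seen in normal activity. The integer call ids below are made up.

```python
def ngrams(trace, n=3):
    """All length-n windows of a system-call trace, as tuples."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]


def unseen_fraction(trace, known, n=3):
    """Share of the trace's n-grams that never occurred in normal traces."""
    grams = ngrams(trace, n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g not in known) / len(grams)


# Made-up traces of system-call ids observed during normal operation.
normal_traces = [
    [3, 4, 5, 3, 4, 5, 6],
    [3, 4, 5, 6, 3, 4],
]
known = {g for trace in normal_traces for g in ngrams(trace)}

familiar = [3, 4, 5, 6]
suspicious = [3, 4, 5, 11, 45, 33, 5]

familiar_score = unseen_fraction(familiar, known)
suspicious_score = unseen_fraction(suspicious, known)
```

The point of the representation is that two attacks with very different code can still share the same short sequences of underlying actions.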
One advantage of applying deep learning in this area should be how cheaply attacks can be simulated to build a large training dataset.
Here is a small set of Spanish-language labeled server log data generated in this way. Reducing the data to the character level, we can implement a convnet classifier like that of Zhang and LeCun. Here is a Keras implementation which performs quite well and uses fewer than 10K parameters.
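Reducing a log line to the character level amounts to mapping each character to a one-hot vector over a fixed alphabet and padding or truncating to a fixed length. The sketch below shows only that encoding step, not the Keras model. The alphabet, the `max_len` of 64, and the sample request line are my own assumptions for illustration.

```python
import string

# Assumed alphabet: lowercase letters, digits, punctuation, and space.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + " "
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def quantize(text, max_len=64):
    """Encode text as a max_len x len(ALPHABET) matrix of 0/1 values.

    Characters outside the alphabet and padding positions stay all-zero.
    """
    matrix = [[0] * len(ALPHABET) for _ in range(max_len)]
    for pos, ch in enumerate(text.lower()[:max_len]):
        col = CHAR_INDEX.get(ch)
        if col is not None:
            matrix[pos][col] = 1
    return matrix


# Made-up request line resembling a SQL injection attempt.
request = "GET /index.php?id=1' OR '1'='1"
encoded = quantize(request)
```

The resulting matrix is what a 1-D convolutional network would consume, with the alphabet size acting as the number of input channels. No tokenizer or language-specific vocabulary is needed, which suits log data that mixes natural language with URLs and payloads.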
I wanted to gain a deeper understanding of these attack methods, so I compiled some resources from reading a few standard references in penetration testing and hacking. It is not hard to imagine a microcontroller on the home network inspecting network packets and performing intrusion prevention.
Though this area of research hasn't attracted as much attention as computer vision, I am excited to see the results now that the cybersecurity threat has gained greater prominence. To be continued...