Abstract:
We propose a scalable outlier detection method for identifying outliers in large datasets, with the goal of building unsupervised intrusion detection. Our method combines the strengths of the Kolmogorov-Smirnov and Efron outlier detection algorithm (KSE-test) and the K-means clustering algorithm, both of which have linear time complexity, to detect outliers quickly. While maintaining a high detection rate and a low false alarm rate, the method can easily be parallelized to process large datasets. A predefined threshold is then applied to the result to obtain efficient intrusion detection. We validate the method on the KDD99 dataset. With appropriate values of the threshold and of K, our method achieves a higher detection rate and a lower false alarm rate. It scales linearly, and its accuracy improves on that of pure KSE-test-based methods. Moreover, we present a proof-of-concept parallel version of the method on the Apache Spark platform, which greatly reduces execution time and scales out easily by adding more machines to the cluster.
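As a rough illustration of the general idea, and not the implementation evaluated in this work, the sketch below shows one hypothetical way to pair K-means with a Kolmogorov-Smirnov statistic and a threshold. It assumes that the K centroids serve as reference points, that each point is scored by the KS distance between its centroid-distance sample and a fixed-size pooled sample, and that a cut-off of 0.9 is reasonable. The names kmeans, ks_statistic, outlier_scores, and ref_size are illustrative and do not come from the paper.

```python
import numpy as np


def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's K-means; returns the k centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (skip empty clusters).
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the ECDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.abs(cdf_a - cdf_b).max())


def outlier_scores(X, k, ref_size=256, seed=0):
    """Score each point by how unusual its distances to the centroids are.

    Assumption of this sketch: the reference distribution is a fixed-size
    random sample of all point-to-centroid distances, so the cost per point
    does not grow with the number of points.
    """
    rng = np.random.default_rng(seed)
    centroids = kmeans(X, k, seed=seed)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    pooled = dists.ravel()
    ref = rng.choice(pooled, size=min(ref_size, pooled.size), replace=False)
    return np.array([ks_statistic(row, ref) for row in dists])


# Usage on synthetic data: flag points whose score exceeds the threshold.
demo_rng = np.random.default_rng(42)
X_demo = demo_rng.normal(size=(200, 2))
scores = outlier_scores(X_demo, k=3)
flags = scores > 0.9  # hypothetical threshold; it would need tuning
```

Because each point is scored independently once the centroids and the reference sample are fixed, the scoring loop is the natural part to distribute across workers.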