October 21, 2016

Unsupervised Machine Learning with One-class Support Vector Machines

At ThisData we've been working hard to use and improve on machine learning approaches to information security problems. Finding security issues (like access anomalies) in millions of event records by hand is impossible, and naive approaches using deterministic, rule-based logic do not scale well beyond basic scenarios.

Machine learning gives us some tools that can make this task easier by automating analysis, combining data from different sources in ways that are very hard to do in typical programming, and modelling complex patterns in data.

We're going to be looking at several machine learning techniques over the next few weeks. Today we're starting with unsupervised learning with one-class support vector machines (SVMs).

We'll look at what SVMs are and how they work, and train a one-class SVM model to predict whether network accesses are attacks or not. On our way there we'll need to select relevant features and normalize the data. Finally we'll look at how to export the trained model to use it in a production system.

Unsupervised what?

Unsupervised machine learning is machine learning without labelled data; that is, nobody has marked each record beforehand to say what it is (in our case, whether a network access is an attack or not). Most programmers are at least somewhat familiar with supervised ML, where a model is trained on (learns from) labelled data and is then used to make predictions or give some other output. In unsupervised learning the model is trained without labels, and the trained model picks out novel or anomalous observations from a dataset based on one or more measures of similarity to "normal" data.

Unsupervised machine learning can be useful in information security problems because a) we don't always have accurately labelled data, and b) we want to be able to identify novel (or anomalous) observations in data without having necessarily seen an example of that behaviour in the past.

Support vector machines

Support vector machines (SVMs) are a type of learning model used for classification and regression analysis.

In an SVM data points are represented as points in space in such a way that points from different categories are separated by a plane. You can think of this like a line through data points that separates data of different classes.

Data separated by a plane. Image courtesy https://commons.wikimedia.org/wiki/User:Zirguezi

New data are mapped into the same space and their location relative to the plane is used to predict which category each point belongs to. The plane is referred to as the decision boundary (i.e. it determines to which class the data belongs). In the case where the decision boundary needs to be non-linear (i.e. where classes cannot be separated by a straight line), SVMs can also project the space through a non-linear function, lifting the data to a space with a higher dimension where a linear decision boundary does separate the classes.

Simulation of projection through a non-linear function: SVM projection through a polynomial function, from https://www.youtube.com/watch?v=3liCbRZPrZA
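To make the "lifting" idea concrete, here is a minimal sketch on made-up one-dimensional data. The mapping x -> (x, x**2) and the threshold of 2 are illustrative assumptions chosen for this toy example; a real SVM finds the boundary itself and typically uses a kernel rather than an explicit mapping.

```python
import numpy as np

# 1-D toy data: class +1 sits near zero, class -1 sits further out on both sides,
# so no single threshold on x can separate the two classes.
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.4, 2.6, 3.1])
y = np.array([-1, -1, 1, 1, 1, -1, -1])

# Lift each point into 2-D with the (illustrative) map x -> (x, x**2).
lifted = np.column_stack([x, x ** 2])

# In the lifted space the horizontal line x**2 = 2 is a linear decision boundary.
threshold = 2.0
predicted = np.where(lifted[:, 1] < threshold, 1, -1)

print(predicted)
print("separable after lifting:", bool(np.array_equal(predicted, y)))
```

The classes that overlapped on the number line are split by a straight line once the squared term is added as a second dimension.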

We won't delve into any more detail about the inner workings of SVMs, but you can find out more in this great introduction by Roemer Vlasveld.

One-class SVMs are a special case of support vector machine. First, data is modelled and the algorithm is trained. Then, when new data are encountered, their position relative to the "normal" data (or inliers) from training can be used to determine whether they are "out of class" or not; in other words, whether they are unusual or not. Because one-class SVMs can be trained with unlabelled data, they are an example of unsupervised machine learning.
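As a quick illustration of that train-then-compare behaviour, the sketch below fits scikit-learn's OneClassSVM to a synthetic cluster and then asks about two new points. The data, nu and gamma are all made-up values for the toy example, not settings for the real dataset we use later.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)

# "Normal" training data: a single synthetic cluster around the origin.
train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# nu and gamma are illustrative values, not tuned for any real dataset.
toy_model = OneClassSVM(nu=0.05, kernel="rbf", gamma=0.5)
toy_model.fit(train)

# One point inside the cluster and one point far away from it.
new_points = np.array([[0.1, -0.2], [8.0, 8.0]])
toy_labels = toy_model.predict(new_points)

# +1 means "in class", -1 means "out of class"
print(toy_labels)
```

No labels were supplied during training; the far-away point should come back as -1 simply because it does not resemble the training data.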

A practical example

In a previous article, "An Exploratory Data Science Workspace on the Mac" we installed some Python packages and the Jupyter Notebook program. If you missed that and you want to follow along you might want to do that first.

In this example we're going to work with HTTP access data from an open dataset from the KDD Cup '99, which consists of millions of network accesses containing multiple different types of attacks (alongside normal accesses) from a simulated military network. You can get the dataset here. We're using the 10 percent set containing a little under half a million datapoints and 3 (of 41) features (columns) that are relevant for HTTP requests.

If you want to follow along, create a new notebook in Jupyter now so you can begin coding!

Below is annotated source code, and I'll explain anything interesting as we go.

Visualising the data

Before we get started training an SVM model, we need to know what our data is like. In the first cell, we'll put:

%matplotlib inline

import numpy as np  
import pandas as pd  
from sklearn import utils  
import matplotlib

# import the CSV from http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
# this will return a pandas dataframe.
data = pd.read_csv('~/Downloads/kddcup.data_10_percent.csv', low_memory=False)

# extract just the logged-in HTTP accesses from the data
data = data[data['service'] == "http"]  
data = data[data["logged_in"] == 1]

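# let's take a look at which attack labels are present in the data
# (one way to draw a chart like the one shown below: a bar chart of label counts)
data['label'].value_counts().plot(kind='bar')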

When we run the first cell with Shift + Return or by clicking Cell > Run Cells, we should get something like this:

Jupyter screenshot with histogram of labels

Extracting and normalizing features

The full dataset contains 41 features with data relating to TCP packets, SMTP access, etc., but only 3 of those are relevant for HTTP. We need to extract those 3 features so we're not training with irrelevant ones that will muddy our model. Also, SVM implementations work better with normalised data: normalising both improves accuracy and reduces the numerical instability that is inherent in their implementation.

Enter the following in a new cell:

# the full dataset contains features for SMTP, NDP, ICMP, etc.
# here we'll grab just the relevant features for HTTP.
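relevant_features = [
    "duration",
    "src_bytes",
    "dst_bytes",
    "label",  # kept for now; it is converted to +1/-1 and dropped in a later cell
]

# replace the data with a subset containing only the relevant features
# (.copy() avoids pandas' SettingWithCopyWarning when we assign to columns below)
data = data[relevant_features].copy()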

# normalise the data - this leads to better accuracy and reduces numerical instability in
# the SVM implementation
data["duration"] = np.log((data["duration"] + 0.1).astype(float))  
data["src_bytes"] = np.log((data["src_bytes"] + 0.1).astype(float))  
data["dst_bytes"] = np.log((data["dst_bytes"] + 0.1).astype(float))  
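To see what this transform does, here is a small sketch on made-up byte counts (the values are illustrative, not taken from the dataset). It shows why the 0.1 offset is there, since log(0) is undefined, and how much the log compresses values that span several orders of magnitude.

```python
import numpy as np

# Illustrative byte counts spanning several orders of magnitude, including zero.
src_bytes = np.array([0.0, 10.0, 1000.0, 1000000.0])

# log(0) is -inf, so a small offset (0.1, as in the cell above) keeps zeros finite.
scaled = np.log(src_bytes + 0.1)

print(scaled.round(2))
print("raw range:   ", src_bytes.max() - src_bytes.min())
print("scaled range:", round(float(scaled.max() - scaled.min()), 2))
```

A raw range of a million collapses to a range of about 16, which keeps any one feature from dominating the distance calculations inside the kernel.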

Making our data one-class

Later we're going to use scikit-learn's OneClassSVM predict function to generate output. This returns +1 or -1 to indicate whether the data is an "inlier" or "outlier" respectively. To make comparison easier later we'll replace our data's label with a matching +1 or -1 value. This also transforms our data from multi-class (multiple different labels) to one-class (boolean label), which is a prerequisite for using a one-class SVM.

In a new cell:

# we're using a one-class SVM, so we need.. a single class. the dataset 'label'
# column contains multiple different categories of attacks, so to make use of 
# this data in a one-class system we need to convert the attacks into
# class 1 (normal) and class -1 (attack)
data.loc[data['label'] == "normal.", "attack"] = 1  
data.loc[data['label'] != "normal.", "attack"] = -1

# grab out the attack value as the target for training and testing. since we're
# only selecting a single column from the `data` dataframe, we'll just get a
# series, not a new dataframe
target = data['attack']

# find the proportion of outliers we expect (aka where `attack == -1`). because 
# target is a series, we just compare against itself rather than a column.
outliers = target[target == -1]  
print("outliers.shape", outliers.shape)  
print("outlier fraction", outliers.shape[0]/target.shape[0])

# drop label columns from the dataframe. we're doing this so we can do 
# unsupervised training with unlabelled data. we've already copied the label
# out into the target series so we can compare against it later.
data.drop(["label", "attack"], axis=1, inplace=True)

# check the shape for sanity checking.
data.shape

The output here is:

outliers.shape (2209,)  
outlier fraction 0.03761600681140911  
(58725, 3)

We can see here that the outliers (attacks) represent around 4% of the data, and we end up with ~59k rows and 3 features now that we've removed the label.
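If you want to try the relabelling step without downloading the dataset, the sketch below does the same thing on a tiny hand-made frame. The five rows are invented for illustration, and np.where is just an alternative way of writing the two .loc assignments used above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'label' column; the real data has many more rows.
toy = pd.DataFrame({"label": ["normal.", "back.", "normal.", "phf.", "normal."]})

# Same mapping as above, written as one vectorised expression.
toy["attack"] = np.where(toy["label"] == "normal.", 1, -1)

# The mean of a boolean series is the fraction of True values.
outlier_fraction = float((toy["attack"] == -1).mean())

print(toy["attack"].tolist())
print(outlier_fraction)
```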

Splitting data into training and test sets

Next we're going to split our dataset into a training and a testing segment using a ratio of 4:1. By doing this we're setting aside a subset of the data for testing our trained model, to ensure we're getting the correct results. Testing with the same data we used for training can lead to invalid results as a trained model will typically do well at classifying the examples it was trained with! (Note that in unsupervised machine learning this isn't necessarily the case.)

Add a new cell and enter:

from sklearn.model_selection import train_test_split
train_data, test_data, train_target, test_target = train_test_split(data, target, train_size = 0.8)

# check the shape of the training segment
train_data.shape

The output from this cell should be something like:

(46980, 3)

Our train_data contains 80% of the records from the filtered dataset (46,980 of 58,725), and we have the right number of features.
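The split above is random, so the rows that land in each segment change on every run. As an optional variation (not what we use in this post), train_test_split also accepts random_state to make the split repeatable and stratify to keep the outlier proportion the same in both segments. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 3 features, 10% labelled as outliers (-1).
X = np.arange(300).reshape(100, 3)
y = np.array([1] * 90 + [-1] * 10)

# random_state makes the split repeatable; stratify keeps the outlier
# proportion the same in both segments. Both are optional extras here.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(int((y_train == -1).sum()), int((y_test == -1).sum()))
```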

Training the model

Now we're ready to train our model. We do this by calling the fit function from scikit-learn's svm.OneClassSVM. It accepts a few parameters but the most important are nu, kernel, and for the RBF kernel we'll be using, gamma.

  • nu is "An upper bound on the fraction of training errors and a lower bound of the fraction of support vectors" and must be between 0 and 1. Basically this means the proportion of outliers we expect in our data. This is an important factor to consider when assessing algorithms. Many unsupervised ML algorithms require you to know (or hint at) the number of outliers or class members you expect.
  • kernel is the kernel type to be used. Earlier we discussed an SVM's ability to use a non-linear function to project the data into a higher-dimensional space. Setting kernel to something other than linear here will achieve that. The default is rbf (radial basis function).
  • gamma is a parameter of the RBF kernel and controls the influence of individual training samples, which affects the "smoothness" of the model. A low value improves the smoothness and "generalizability" of the model, while a high value reduces them but fits the model more tightly to the training data. Some experimentation is often required to find the best value.

We already know that the proportion of attacks in our data is about 4%. We'll get the precise fraction and use that for nu below. Through experimentation I found an effective gamma to be 0.00005.
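To get a feel for what nu does before using it on the real data, the sketch below fits two models to the same synthetic, unlabelled samples and reports how much of the training data each one flags as an outlier. The data and the gamma value are arbitrary choices for illustration only.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(1)
samples = rng.normal(size=(500, 2))  # synthetic, unlabelled data

flagged = {}
for nu in (0.02, 0.2):
    # gamma=0.1 is an arbitrary illustrative value
    m = OneClassSVM(nu=nu, kernel="rbf", gamma=0.1).fit(samples)
    flagged[nu] = float((m.predict(samples) == -1).mean())
    print("nu =", nu, "-> fraction flagged:", round(flagged[nu], 3))
```

The larger nu is, the larger the share of training samples the model is allowed to leave outside the boundary, which is why we set it close to the outlier fraction we expect.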

In our next cell we'll instantiate a model and fit (train) it with our training data.

from sklearn import svm

# set nu (which should be the proportion of outliers in our dataset)
nu = outliers.shape[0] / target.shape[0]  
print("nu", nu)

model = svm.OneClassSVM(nu=nu, kernel='rbf', gamma=0.00005)

# train the model; only the features are passed in, no labels
model.fit(train_data)

Training can take a while (at least several seconds).

The output from this cell is:

nu 0.03761600681140911

OneClassSVM(cache_size=200, coef0=0.0, degree=3, gamma=5e-05, kernel='rbf',  
      max_iter=-1, nu=0.03761600681140911, random_state=None,
      shrinking=True, tol=0.001, verbose=False)

We now have two ways to make use of the trained model:

  • Find the distance from the hyperplane of some samples using decision_function. The distance is on an arbitrary scale, so to make use of it we need to either define a threshold or scale it to something useful.
  • Classify some samples using predict. This will return +1 or -1 to indicate whether each sample is "in class" or "out of class" (normal or abnormal relative to the trained model).
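The two options are closely related, as the sketch below tries to show on synthetic data: predict gives the +1/-1 answer, while decision_function gives a signed score that we can threshold ourselves. The model settings, the sample points and the threshold variable are all illustrative assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(2)
fitted = OneClassSVM(nu=0.05, kernel="rbf", gamma=0.1).fit(rng.normal(size=(300, 2)))

points = np.array([[0.0, 0.0], [1.0, 1.0], [6.0, 6.0]])
scores = fitted.decision_function(points)  # signed score on an arbitrary scale
labels = fitted.predict(points)            # +1 / -1

# A custom threshold on the score trades false alarms against missed attacks;
# 0.0 matches predict here, other values are a choice to tune.
threshold = 0.0
custom = np.where(scores >= threshold, 1, -1)

print(scores.round(3), labels, custom)
```

Raising the threshold above zero would flag more samples as outliers, and lowering it would flag fewer, which is useful when the cost of a missed attack and the cost of a false alarm are very different.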

Checking accuracy of the model

But before we use the model we would like to know its accuracy - how good it is at predicting the right class for data. To do that we'll use the predict function on our data and then use sklearn's built-in analysis functions to compare the labels between the predict output and our target, which we set up earlier on.

In a new cell:

from sklearn import metrics  
preds = model.predict(train_data)  
targs = train_target

print("accuracy: ", metrics.accuracy_score(targs, preds))  
print("precision: ", metrics.precision_score(targs, preds))  
print("recall: ", metrics.recall_score(targs, preds))  
print("f1: ", metrics.f1_score(targs, preds))  
print("area under curve (auc): ", metrics.roc_auc_score(targs, preds))  

The output:

accuracy:  0.977203065134  
precision:  0.998757399123  
recall:  0.977533555934  
f1:  0.988031513662  
area under curve (auc):  0.973115098969  

This shows that our model predicts the class of the data from the training set with ~98% accuracy. Not bad. Precision, recall, F1, and AUC are all measures of the model's effectiveness at predicting classes. How exactly these work will be the subject of another post, but the closer to 1.0 they are, the better.
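As a rough guide to reading these numbers, here is the arithmetic behind them on a set of made-up confusion counts (the counts are hypothetical, not our model's). One thing worth knowing is that scikit-learn's precision_score, recall_score and f1_score treat +1 as the positive class by default, so the figures above describe how well we recognise normal accesses; the last line of the sketch computes recall for the attack class instead.

```python
# Hypothetical confusion counts, with +1 (normal) as the "positive" class.
tp = 950  # normal accesses predicted normal
fn = 20   # normal accesses predicted attack (false alarms)
fp = 5    # attacks predicted normal (missed attacks)
tn = 25   # attacks predicted attack

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Recall for the attack class answers "what share of attacks did we catch?"
attack_recall = tn / (tn + fp)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
print(round(attack_recall, 3))
```

With imbalanced data like ours, overall accuracy can look high even when a noticeable share of attacks slips through, so the attack-class figure is worth checking too (for example by passing pos_label=-1 to recall_score).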

Now let's check the test set, which was not used for training:

preds = model.predict(test_data)  
targs = test_target

print("accuracy: ", metrics.accuracy_score(targs, preds))  
print("precision: ", metrics.precision_score(targs, preds))  
print("recall: ", metrics.recall_score(targs, preds))  
print("f1: ", metrics.f1_score(targs, preds))  
print("area under curve (auc): ", metrics.roc_auc_score(targs, preds))  


accuracy:  0.975819497659  
precision:  0.998912353848  
recall:  0.975914283184  
f1:  0.987279405178  
area under curve (auc):  0.974682805309  

Again, around 98% accurate.

Depending on our application this may or may not be an acceptable level of accuracy - we'll look more at alternative machine learning algorithms and their accuracy in upcoming posts.

Further, because we're training with a random subset of the original data, the accuracy of our trained model is going to vary depending on which data is present in (or absent from) the training set. There are several approaches to deal with this, which we will cover in a separate post.

Making use of the model

To use the model on new data (e.g. in JSON format) we could do something like this:

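# data must contain the same three features, normalised the same way as the training data
data = pd.read_json(some_json)
model.predict(data)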

If our output is -1 the model has predicted the data to be an outlier (which means an attack in our case), a +1 means an inlier (not an attack).
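New records need the same feature selection and log scaling that we applied before training, otherwise the model is comparing values on different scales. The sketch below shows one way to package that up. The JSON payload, the prepare helper and the new_data name are hypothetical and exist only for this example.

```python
import json

import numpy as np
import pandas as pd

# Hypothetical JSON payload: one record per access, raw (un-normalised) values.
some_json = '[{"duration": 0, "src_bytes": 215, "dst_bytes": 45076}]'


def prepare(records_json):
    """Select the three HTTP features and apply the log(x + 0.1) scaling used in training."""
    frame = pd.DataFrame(json.loads(records_json))
    features = frame[["duration", "src_bytes", "dst_bytes"]].astype(float)
    return np.log(features + 0.1)


new_data = prepare(some_json)
print(new_data)
# new_data can then be passed to model.predict(...)
```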

To use the model outside of our development environment we need to save it to disk. Fortunately this is quite straightforward:

# joblib is now a standalone package; older scikit-learn releases
# exposed it as `from sklearn.externals import joblib`
import joblib

outputfile = 'oneclass_v1.model'
joblib.dump(model, outputfile, compress=9)

Then in our deployed code we can load the model back in with:

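import joblib  # older scikit-learn releases: from sklearn.externals import joblib
model = joblib.load('oneclass_v1.model')

# then predict with model.predict(...), passing features prepared the same way as in training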
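Before relying on a saved model it is worth checking that it survives the round trip. The sketch below is one way to do that, using a small throwaway model trained on synthetic data and a temporary directory; all names and parameter values here are illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(3)
small_model = OneClassSVM(nu=0.05, kernel="rbf", gamma=0.1).fit(rng.normal(size=(100, 3)))

# Write to a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "oneclass_v1.model")
    joblib.dump(small_model, path, compress=9)
    restored = joblib.load(path)

# The restored model should give the same answers as the original.
probe = rng.normal(size=(5, 3))
same = bool(np.array_equal(small_model.predict(probe), restored.predict(probe)))
print("round trip preserved predictions:", same)
```

Keep in mind that a saved model is a pickle under the hood, so it should be loaded with the same (or a compatible) scikit-learn version that wrote it, and only from sources you trust.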

What we've learnt

Today we introduced unsupervised learning with one-class support vector machines. We have:

  • Imported data from CSV and visualised it
  • Selected relevant features and normalised data
  • Re-labelled the data to make use of a one-class system
  • Trained a one-class SVM model
  • Checked the accuracy of our model, and
  • Seen how to export and import the model for use in live systems

As you can see there is quite a bit more to do when training a one-class SVM model than data-in, data-out.

In upcoming posts we'll be looking at the nuts-and-bolts of getting a machine learning solution to production, improving on accuracy checks, and how to use other useful machine learning systems.

We'd love to know if you give this example a try, and how you get on. Please leave us a note in the comments!


