At ThisData we've been working hard to use and improve on machine learning approaches to information security problems. Finding security issues (like access anomalies) in millions of event records by hand is impossible, and naive approaches using deterministic, rule-based logic do not scale well beyond basic scenarios.
Machine learning gives us some tools that can make this task easier by automating analysis, combining data from different sources in ways that are very hard to do in typical programming, and modelling complex patterns in data.
We're going to be looking at several machine learning techniques over the next few weeks. Today we're starting with unsupervised learning with one-class support vector machines (SVMs).
We'll look at what SVMs are and how they work, and train a one-class SVM model to predict whether network accesses are attacks or not. On our way there we'll need to select relevant features and normalize the data. Finally we'll look at how to export the trained model to use it in a production system.
Unsupervised machine learning is machine learning without labelled data (that is, data that hasn't been labelled beforehand to say what it is; in our case, whether a network access is an attack or not). Most programmers are familiar, at least in some way, with supervised ML, where a model is trained on (learns from) labelled data and is then used to make predictions or give some other output. In unsupervised learning the model is trained without labels, and the trained model picks out novel or anomalous observations from a dataset based on one or more measures of similarity to "normal" data.
Unsupervised machine learning can be useful in information security problems because a) we don't always have accurately labelled data, and b) we want to be able to identify novel (or anomalous) observations in data without having necessarily seen an example of that behaviour in the past.
Support vector machines
Support vector machines (SVMs) are a type of learning model used for classification and regression analysis.
In an SVM data points are represented as points in space in such a way that points from different categories are separated by a plane. You can think of this like a line through data points that separates data of different classes.
Image courtesy https://commons.wikimedia.org/wiki/User:Zirguezi
New data are mapped into the same space and their location relative to the plane is used to predict which category each point belongs to. The plane is referred to as the decision boundary (i.e. it determines to which class the data belongs). Where the decision boundary needs to be non-linear (i.e. where classes cannot be separated by a straight line), SVMs can also project the data through a non-linear function, lifting it into a higher-dimensional space where a linear decision boundary does separate the classes.
SVM projection through polynomial function, from https://www.youtube.com/watch?v=3liCbRZPrZA
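To make the "lifting" idea concrete, here is a minimal sketch using scikit-learn's SVC on a toy dataset of two concentric rings. The dataset, the hand-made squared-radius feature and the variable names are our own assumptions for illustration and are not part of the KDD example that follows.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy data (an assumption for illustration): two concentric rings,
# which no straight line can separate.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear SVM on the raw 2-D points struggles.
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)

# Manually "lift" the data into 3-D by adding the squared radius as a feature.
# In this space a flat plane can separate the inner ring from the outer ring.
X_lifted = np.column_stack([X, (X ** 2).sum(axis=1)])
lifted_acc = SVC(kernel="linear").fit(X_lifted, y).score(X_lifted, y)

# The RBF kernel performs a comparable lift implicitly.
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

print("linear, raw:   ", linear_acc)
print("linear, lifted:", lifted_acc)
print("rbf, raw:      ", rbf_acc)
```

The linear model only does well once the extra dimension is available, which is the job a non-linear kernel does for us behind the scenes.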
We won't delve into any more detail about the inner workings of SVMs, but you can find out more in this great introduction by Roemer Vlasveld.
One-class SVMs are a special case of support vector machine. First, data is modelled and the algorithm is trained. Then, when new data are encountered, their position relative to the "normal" data (or inliers) from training can be used to determine whether they are "out of class" or not - in other words, whether they are unusual or not. Because they can be trained with unlabelled data they are an example of unsupervised machine learning.
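As a rough sketch of that train-then-compare flow, the snippet below fits scikit-learn's OneClassSVM to a made-up cluster of "normal" points and then asks about two new points. The data, the nu value and the test points are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical "normal" behaviour: 500 points clustered around the origin.
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# Train on unlabelled data only; nu is a rough guess at the outlier fraction.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(normal)

# One point near the middle of the cluster, one far away from it.
new_points = np.array([[0.0, 0.0], [8.0, 8.0]])
predictions = model.predict(new_points)
print(predictions)  # +1 = inlier, -1 = outlier
```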
A practical example
In a previous article, "An Exploratory Data Science Workspace on the Mac" we installed some Python packages and the Jupyter Notebook program. If you missed that and you want to follow along you might want to do that first.
In this example we're going to work with HTTP access data from an open dataset from the KDD Cup '99, which consists of millions of network accesses containing multiple different types of attacks (alongside normal accesses) from a simulated military network. You can get the dataset here. We're using the 10 percent set containing a little under half a million datapoints and 3 (of 41) features (columns) that are relevant for HTTP requests.
If you want to follow along, create a new notebook in Jupyter now so you can begin coding!
Below is annotated source code, and I'll explain anything interesting as we go.
Visualising the data
Before we get started training an SVM model, we need to know what our data is like. In the first cell, we'll put:
    %matplotlib inline
    import numpy as np
    import pandas as pd
    from sklearn import utils
    import matplotlib

    # import the CSV from http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
    # this will return a pandas dataframe.
    data = pd.read_csv('~/Downloads/kddcup.data_10_percent.csv', low_memory=False)

    # extract just the logged-in HTTP accesses from the data
    data = data[data['service'] == "http"]
    data = data[data["logged_in"] == 1]

    # let's take a look at the types of attack labels that are present in the data.
    data.label.value_counts().plot(kind='bar')
When we run the first cell with Shift + Return or by clicking Cell > Run Cells, we should get something like this:
Extracting and normalizing features
The full dataset contains 41 features with data relating to TCP packets, SMTP access, etc., but only 3 of those are relevant for HTTP. We need to extract those 3 features so we're not training with irrelevant ones that will muddy our model. Also, SVM implementations work better with normalised data: normalising both improves accuracy and reduces the numerical instability that is inherent in their implementation.
Enter the following in a new cell:
    # the full dataset contains features for SMTP, NDP, ICMP, etc.
    # here we'll grab just the relevant features for HTTP.
    relevant_features = [
        "duration",
        "src_bytes",
        "dst_bytes",
        "label"
    ]

    # replace the data with a subset containing only the relevant features
    # (.copy() avoids pandas' SettingWithCopyWarning when we assign below)
    data = data[relevant_features].copy()

    # normalise the data - this leads to better accuracy and reduces numerical instability in
    # the SVM implementation
    data["duration"] = np.log((data["duration"] + 0.1).astype(float))
    data["src_bytes"] = np.log((data["src_bytes"] + 0.1).astype(float))
    data["dst_bytes"] = np.log((data["dst_bytes"] + 0.1).astype(float))
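To see what the log transform is doing, here is a small standalone sketch with made-up byte counts (the values and the src_bytes_log column name are hypothetical). It shows how values that span several orders of magnitude are squeezed into a much narrower range, and why the small 0.1 offset is needed.

```python
import numpy as np
import pandas as pd

# Hypothetical byte counts spanning several orders of magnitude.
sample = pd.DataFrame({"src_bytes": [0, 10, 1_000, 1_000_000]})

# Adding 0.1 before taking the log avoids log(0), which is -inf.
sample["src_bytes_log"] = np.log(sample["src_bytes"].astype(float) + 0.1)

raw_range = sample["src_bytes"].max() - sample["src_bytes"].min()
log_range = sample["src_bytes_log"].max() - sample["src_bytes_log"].min()
print(sample)
print("raw range:", raw_range, "log range:", round(log_range, 2))
```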
Making our data one-class
Later we're going to use scikit-learn's predict function to generate output. This returns +1 or -1 to indicate whether the data is an "inlier" or "outlier" respectively. To make comparison easier later we'll replace our data's label with a matching +1 or -1 value. This also transforms our data from multi-class (multiple different labels) to one-class (boolean label), which is a prerequisite for using a one-class SVM.
In a new cell:
    # we're using a one-class SVM, so we need.. a single class. the dataset 'label'
    # column contains multiple different categories of attacks, so to make use of
    # this data in a one-class system we need to convert the attacks into
    # class 1 (normal) and class -1 (attack)
    data.loc[data['label'] == "normal.", "attack"] = 1
    data.loc[data['label'] != "normal.", "attack"] = -1

    # grab out the attack value as the target for training and testing. since we're
    # only selecting a single column from the `data` dataframe, we'll just get a
    # series, not a new dataframe
    target = data['attack']

    # find the proportion of outliers we expect (aka where `attack == -1`). because
    # target is a series, we just compare against itself rather than a column.
    outliers = target[target == -1]
    print("outliers.shape", outliers.shape)
    # shape is a tuple, so take its first element (the row count) before dividing
    print("outlier fraction", outliers.shape[0] / target.shape[0])

    # drop label columns from the dataframe. we're doing this so we can do
    # unsupervised training with unlabelled data. we've already copied the label
    # out into the target series so we can compare against it later.
    data.drop(["label", "attack"], axis=1, inplace=True)

    # check the shape for sanity checking.
    data.shape
The output here is:
    outliers.shape (2209,)
    outlier fraction 0.03761600681140911
    (58725, 3)
We can see here that the outliers (attacks) represent around 4% of the data, and we end up with ~59k rows and 3 features now that we've removed the label.
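The relabelling and the outlier fraction can be checked on a tiny example. This is a sketch with invented labels in the same style as the KDD data; np.where is simply an alternative way of producing the +1/-1 values, not the approach used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical labels in the same style as the KDD data.
labels = pd.Series(["normal.", "back.", "normal.", "normal.",
                    "phf.", "normal.", "normal.", "normal."])

# +1 for normal traffic, -1 for any kind of attack.
target_example = pd.Series(np.where(labels == "normal.", 1, -1))

# shape is a tuple, so index it to get the row count before dividing.
n_outliers = int((target_example == -1).sum())
outlier_fraction = n_outliers / target_example.shape[0]
print("outliers:", n_outliers, "fraction:", outlier_fraction)
```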
Splitting data into training and test sets
Next we're going to split our dataset into a training and a testing segment using a ratio of 4:1. By doing this we're setting aside a subset of the data for testing our trained model, to ensure we're getting the correct results. Testing with the same data we used for training can lead to invalid results as a trained model will typically do well at classifying the examples it was trained with! (Note that in unsupervised machine learning this isn't necessarily the case.)
Add a new cell and enter:
    from sklearn.model_selection import train_test_split
    train_data, test_data, train_target, test_target = train_test_split(data, target, train_size = 0.8)
    train_data.shape
The output from this cell should be something like:
The shape of train_data should show 80% of the records from our filtered dataset, and the right number of features (3).
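Because the split is random, the rows that end up in each set change from run to run. If you want a repeatable split, one option is to pass random_state, and stratify keeps the attack proportion the same in both sets. The sketch below uses a made-up 100-row dataset rather than the KDD data, and the seed value is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the article's data: 100 rows, 3 features, 10% attacks.
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(100, 3)),
                        columns=["duration", "src_bytes", "dst_bytes"])
labels = pd.Series([-1] * 10 + [1] * 90)

# random_state makes the split repeatable; stratify keeps the attack
# proportion the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, train_size=0.8, random_state=42, stratify=labels)

print(X_train.shape, X_test.shape)
print("attack share (train):", (y_train == -1).mean())
print("attack share (test): ", (y_test == -1).mean())
```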
Training the model
Now we're ready to train our model. We do this by calling the fit function from scikit-learn's svm.OneClassSVM. It accepts a few parameters but the most important are nu, kernel, and, for the RBF kernel we'll be using, gamma.
nu is "An upper bound on the fraction of training errors and a lower bound of the fraction of support vectors" and must be between 0 and 1. Basically this means the proportion of outliers we expect in our data. This is an important factor to consider when assessing algorithms. Many unsupervised ML algorithms require you to know (or hint at) the number of outliers or class members you expect.
kernel is the kernel type to be used. Earlier we discussed an SVM's ability to use a non-linear function to project the data into a higher-dimensional space. Setting kernel to something other than linear here will achieve that. The default is rbf (RBF - radial basis function).
gamma is a parameter of the RBF kernel type and controls the influence of individual training samples - this affects the "smoothness" of the model. A low value improves the smoothness and "generalizability" of the model, while a high value reduces it but makes the model "tighter-fitted" to the training data. Some experimentation is often required to find the best value.
We already know that the proportion of attacks in our data is about 4%. We'll get the precise fraction and use that for nu below. Through experimentation I found an effective gamma to be 0.00005.
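To get a feel for nu, this sketch trains two one-class SVMs on the same made-up data and counts how many training points each model flags as outliers. The data, the gamma value and the two nu values are assumptions for illustration only; on this kind of data the flagged fraction tends to sit close to nu.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Hypothetical training data: 1,000 two-dimensional points.
X = rng.normal(size=(1000, 2))

flagged = {}
for nu in (0.05, 0.2):
    model = OneClassSVM(kernel="rbf", gamma=0.1, nu=nu).fit(X)
    # Fraction of the training data the model itself treats as outliers.
    flagged[nu] = float((model.predict(X) == -1).mean())
    print(f"nu={nu}: flagged {flagged[nu]:.3f} of training points")
```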
In our next cell we'll instantiate a model and fit (train) it with our training data.
    from sklearn import svm

    # set nu (which should be the proportion of outliers in our dataset)
    # shape is a tuple, so take its first element (the row count) before dividing
    nu = outliers.shape[0] / target.shape[0]
    print("nu", nu)

    model = svm.OneClassSVM(nu=nu, kernel='rbf', gamma=0.00005)
    model.fit(train_data)
Training can take a while (at least several seconds).
The output from this cell is:
    nu 0.03761600681140911
    OneClassSVM(cache_size=200, coef0=0.0, degree=3, gamma=5e-05, kernel='rbf',
        max_iter=-1, nu=0.03761600681140911, random_state=None, shrinking=True,
        tol=0.001, verbose=False)
We now have two ways to make use of the trained model:
- Find the distance from the hyperplane of some samples using decision_function. The distance is on an arbitrary scale, so to make use of it we need to either define a threshold or scale it to something useful.
- Classify some samples using predict. This will return values of +1 or -1 indicating whether the sample(s) are "in class" or "out of class" (normal or abnormal relative to the trained model).
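The two options are closely related: predict is effectively the sign of decision_function. The sketch below shows this on made-up data and then applies a stricter custom threshold. The training data, sample points and threshold are all assumptions, and the threshold in particular is an arbitrary choice rather than a recommendation.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
train = rng.normal(size=(500, 2))          # hypothetical "normal" data
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(train)

samples = np.array([[0.0, 0.0], [2.5, 2.5], [6.0, 6.0]])
scores = model.decision_function(samples)  # signed distance, arbitrary scale
labels = model.predict(samples)            # +1 or -1

# predict follows the sign of the decision function ...
labels_from_scores = np.where(scores > 0, 1, -1)

# ... so a custom threshold lets us be stricter or more lenient.
# The value below is an arbitrary choice, not a recommendation.
strict_threshold = 0.5 * scores.max()
strict_labels = np.where(scores > strict_threshold, 1, -1)

print(scores, labels, strict_labels)
```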
Checking accuracy of the model
But before we use the model we would like to know its accuracy - how good it is at predicting the right class for data. To do that we'll use the predict function on our data and then use sklearn's built-in analysis functions to compare the labels between the predict output and our target, which we set up earlier on.
In a new cell:
    from sklearn import metrics
    preds = model.predict(train_data)
    targs = train_target

    print("accuracy: ", metrics.accuracy_score(targs, preds))
    print("precision: ", metrics.precision_score(targs, preds))
    print("recall: ", metrics.recall_score(targs, preds))
    print("f1: ", metrics.f1_score(targs, preds))
    print("area under curve (auc): ", metrics.roc_auc_score(targs, preds))

    accuracy: 0.977203065134
    precision: 0.998757399123
    recall: 0.977533555934
    f1: 0.988031513662
    area under curve (auc): 0.973115098969
This shows that our model predicts the class of the data from the training set with ~98% accuracy. Not bad. Precision, recall, F1, and AUC are all measures of the model's effectiveness at predicting classes. How exactly these work will be the subject of another post, but the closer to 1.0 they are, the better.
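One thing worth knowing when reading these numbers: by default scikit-learn's precision and recall functions score the +1 class, which in our labelling is "normal" traffic. Passing pos_label=-1 scores the attack class instead. Here is a small sketch with ten invented predictions (the arrays are hypothetical, not results from the model above).

```python
import numpy as np
from sklearn import metrics

# Hypothetical results: +1 = normal, -1 = attack.
y_true = np.array([1, 1, 1, 1, 1, 1, -1, -1, 1, 1])
y_pred = np.array([1, 1, 1, -1, 1, 1, -1, 1, 1, 1])

# By default scikit-learn scores the +1 class, i.e. "normal" traffic.
precision_normal = metrics.precision_score(y_true, y_pred)
recall_normal = metrics.recall_score(y_true, y_pred)

# pos_label=-1 scores the attack class instead.
precision_attack = metrics.precision_score(y_true, y_pred, pos_label=-1)
recall_attack = metrics.recall_score(y_true, y_pred, pos_label=-1)

accuracy = metrics.accuracy_score(y_true, y_pred)
print("accuracy:", accuracy)
print("normal class: precision", precision_normal, "recall", recall_normal)
print("attack class: precision", precision_attack, "recall", recall_attack)
```

With only a few attacks in the data, the scores for the normal class can look strong even when half of the attacks are missed, so it can be worth checking both.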
Now let's check the test set, which was not used for training:
    preds = model.predict(test_data)
    targs = test_target

    print("accuracy: ", metrics.accuracy_score(targs, preds))
    print("precision: ", metrics.precision_score(targs, preds))
    print("recall: ", metrics.recall_score(targs, preds))
    print("f1: ", metrics.f1_score(targs, preds))
    print("area under curve (auc): ", metrics.roc_auc_score(targs, preds))

    accuracy: 0.975819497659
    precision: 0.998912353848
    recall: 0.975914283184
    f1: 0.987279405178
    area under curve (auc): 0.974682805309
Again, around 98% accurate.
Depending on our application this may or may not be an acceptable level of accuracy - we'll look more at alternative machine learning algorithms and their accuracy in upcoming posts.
Further, because we're training with a random subset of the original data, the accuracy of our trained model is going to vary depending on which data is present in (or absent from) the training set. There are several approaches to deal with this, which we will cover in a separate post.
Making use of the model
To use the model on new data (e.g. in JSON format) we could do something like this:
    data = pd.read_json(some_json)
    model.predict(data)
If our output is -1 the model has predicted the data to be an outlier (which means an attack in our case); a +1 means an inlier (not an attack).
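One detail to watch: new data needs the same feature selection and log transform that the training data went through before it reaches predict. The sketch below is self-contained, so it trains a stand-in model on made-up numbers. The normalise helper, the FEATURES constant and the JSON payload are all hypothetical names and values introduced for illustration.

```python
import io
import numpy as np
import pandas as pd
from sklearn.svm import OneClassSVM

FEATURES = ["duration", "src_bytes", "dst_bytes"]

def normalise(frame):
    """Apply the same log transform that was used on the training data."""
    return np.log(frame[FEATURES].astype(float) + 0.1)

# Stand-in model trained on made-up data so the example runs on its own.
rng = np.random.default_rng(0)
training = pd.DataFrame(rng.integers(0, 5000, size=(300, 3)), columns=FEATURES)
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(normalise(training))

# Hypothetical incoming payload.
some_json = '[{"duration": 0, "src_bytes": 230, "dst_bytes": 4100}]'
new_data = pd.read_json(io.StringIO(some_json))

predictions = model.predict(normalise(new_data))
print(predictions)  # one value per row: +1 (inlier) or -1 (outlier)
```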
To use the model outside of our development environment we need to save it to disk. Fortunately this is quite straightforward:
    outputfile = 'oneclass_v1.model'
    # sklearn.externals.joblib was removed in scikit-learn 0.23; on current
    # versions import the standalone joblib package instead
    import joblib
    joblib.dump(model, outputfile, compress=9)
Then in our deployed code we can load the model back in with:
    import joblib
    model = joblib.load('oneclass_v1.model')
    # then predict with model.predict(..)
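A quick way to gain confidence in the export is a round trip: save the model, load it back, and check that the predictions match. The sketch below does this with a stand-in model and a temporary directory (both assumptions, so that it runs on its own). It is generally safest to load a model with the same scikit-learn version that saved it, and only to load files you trust, since joblib files are pickles.

```python
import os
import tempfile
import joblib
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))   # made-up training data
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "oneclass_v1.model")
    joblib.dump(model, path, compress=9)
    restored = joblib.load(path)

# The restored model should give identical predictions.
same = np.array_equal(model.predict(X), restored.predict(X))
print("round trip ok:", same)
```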
What we've learnt
Today we introduced unsupervised learning with one-class support vector machines. We have:
- Imported data from CSV and visualised it
- Selected relevant features and normalised data
- Re-labelled the data to make use of a one-class system
- Trained a one-class SVM model
- Checked the accuracy of our model, and
- Seen how to export and import the model for use in live systems
As you can see there is quite a bit more to do when training a one-class SVM model than data-in, data-out.
In upcoming posts we'll be looking at the nuts-and-bolts of getting a machine learning solution to production, improving on accuracy checks, and how to use other useful machine learning systems.
We'd love to know if you give this example a try, and how you get on. Please leave us a note in the comments!