
In this section, we will look at anomaly detection, which forms the foundation for detecting intrusions and suspicious activity.
The word anomaly means something that deviates from what is standard, normal, or expected. Anomalies are events or data points that do not fit in with the rest of the data. They represent deviations from the expected trend in data. Anomalies are rare occurrences and, therefore, few in number.
For example, consider a bot or fraud detection model used in a social media website such as Twitter. If we examine the number of follow requests sent to a user per day, we can get a general sense of the trend and plot this data. Let’s say that we plotted this data for a month, and ended up with the following trend:
Figure 2.1 – Trend for the number of follow requests over a month
What do you notice? The user receives roughly 30-40 follow requests per day. On the 8th and 18th days, however, we see spikes that clearly stand out from the daily trend. These two days are anomalies.
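To make this concrete, here is a minimal sketch (using synthetic counts, since the figure's underlying data is hypothetical) that simulates a month of follow-request counts with two injected spikes, and flags days deviating from the mean by more than three standard deviations:

import numpy as np

# Synthetic data: roughly 30-40 follow requests per day over 30 days (hypothetical)
rng = np.random.default_rng(0)
requests_per_day = rng.integers(30, 41, size=30).astype(float)
requests_per_day[7] = 95.0    # injected spike on day 8
requests_per_day[17] = 110.0  # injected spike on day 18

# Flag days that deviate from the mean by more than three standard deviations
mean, std = requests_per_day.mean(), requests_per_day.std()
anomalous_days = np.where(np.abs(requests_per_day - mean) > 3 * std)[0] + 1
print("Anomalous days:", anomalous_days)  # should flag days 8 and 18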
Anomalies can also be observed visually in a two-dimensional space. If we plot all the points in the dataset, the anomalies should stand out as different from the others. For instance, continuing with the same example, suppose we have a number of features for each user, such as the number of messages sent, likes, and retweets. Using all of these features together, we can construct an n-dimensional feature vector per user. By applying a dimensionality reduction algorithm such as principal component analysis (PCA), which at a high level projects data into a lower-dimensional space while retaining most of its structure, we can reduce each vector to two dimensions and plot the data. Say we get a plot as follows, where each point represents a user, and the axes represent the principal components of the original data. The points colored in red clearly stand out from the rest of the data; these are outliers:
Figure 2.2 – A 2D representation of data with anomalies in red
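Although the preceding plot is only illustrative, the following sketch shows how such a picture could be produced with scikit-learn, assuming a hypothetical feature matrix X in which each row is a user and each column is a feature such as messages sent, likes, or retweets:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Hypothetical feature matrix: 200 users x 5 features
rng = np.random.default_rng(42)
X = rng.normal(loc=0.0, scale=1.0, size=(200, 5))
X[:5] += 8.0  # shift the first five users far from the rest so they act as outliers

# Reduce to two principal components for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Plot: outliers (the first five rows) in red, the rest in blue
plt.scatter(X_2d[5:, 0], X_2d[5:, 1], c="blue", s=10, label="normal")
plt.scatter(X_2d[:5, 0], X_2d[:5, 1], c="red", s=30, label="outlier")
plt.xlabel("Principal component 1")
plt.ylabel("Principal component 2")
plt.legend()
plt.show()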
Note that anomalies do not necessarily represent a malicious event; they simply indicate that the trend deviates from what was expected. For example, a user suddenly receiving an increased number of follow requests is anomalous, but this may simply be because they posted some very engaging content. Anomalies, when flagged, must be investigated to determine whether they are malicious or benign.
Anomaly detection is considered an important problem in the field of cybersecurity. Unusual or abnormal events can often indicate security breaches or attacks. Furthermore, anomaly detection does not need labeled data, which is hard to come by in security problems.
Now that we have introduced what anomaly detection is in sufficient detail, we will look at a real-world dataset that will help us observe and detect anomalies in action.
Before we jump into any algorithms for anomaly detection, let us talk about the dataset we will be using in this chapter. The dataset that is popularly used for anomaly and intrusion detection tasks is the Network Security Laboratory-Knowledge Discovery in Databases (NSL-KDD) dataset. This was originally created in 1999 for use in a competition at the 5th International Conference on Knowledge Discovery and Data Mining (KDD). The task in the competition was to develop a network intrusion detector, which is a predictive model that can distinguish between bad connections, called intrusions or attacks, and benign normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.
This activity consists of a few steps, which we will look at in the next subsections.
The actual NSL-KDD dataset is fairly large (nearly 4 million records). We will be using a smaller version of the data that is a 10% subset randomly sampled from the whole data. This will make our analysis feasible. You can, of course, experiment by downloading the full data and rerunning our experiments.
First, we import the necessary Python libraries:
import pandas as pd
import numpy as np
import os
from requests import get
Then, we set the paths for the locations of the training and test data, as well as the path to a labels file that holds the header (the names of the features) for the data:
train_data_page = "http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz"
test_data_page = "http://kdd.ics.uci.edu/databases/kddcup99/kddcup.testdata.unlabeled_10_percent.gz"
labels = "http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names"
datadir = "data"
Next, we download the data and column names using the wget command through Python. As these files are zipped (compressed), we first have to extract their contents using the gunzip command. The following Python code snippet does that for us:
# Download the training data and decompress it
print("Downloading Training Data")
os.system("wget " + train_data_page)
training_file_name = train_data_page.split("/")[-1].replace(".gz", "")
os.system("gunzip " + training_file_name)

# Read the comma-separated records into a list of lists
with open(training_file_name, "r") as ff:
    lines = [i.strip().split(",") for i in ff.readlines()]

# Download the column names (one "name: type." entry per line)
print("Downloading Training Labels")
response = get(labels)
labels = response.text
labels = [i.split(",")[0].split(":") for i in labels.split("\n")]
labels = [i for i in labels if i[0] != '']
# The first line of the file lists the attack types, so we skip it
final_labels = labels[1:]
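Note that the preceding snippet shells out to the wget and gunzip utilities, which may not be available on every system (for example, on Windows). If they are missing, a pure-Python sketch using the requests and gzip modules could achieve the same result:

import gzip
from requests import get

# Fetch the compressed training data over HTTP
response = get(train_data_page)
# Decompress the gzipped payload in memory and split it into comma-separated records
text = gzip.decompress(response.content).decode("utf-8")
lines = [i.strip().split(",") for i in text.strip().split("\n")]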
Finally, we construct a DataFrame from the downloaded streams:
data = pd.DataFrame(lines)
labels = final_labels

# Set the 41 feature names plus the target (attack label) column
data.columns = [i[0] for i in labels] + ['target']

# Cast the continuous features from strings to floats
for i in range(len(labels)):
    if labels[i][1] == ' continuous.':
        data.iloc[:, i] = data.iloc[:, i].astype(float)
This completes our step of downloading the data and creating a DataFrame from it. A DataFrame is a tabular data structure that will allow us to manipulate, slice and dice, and filter the data as needed.
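As a quick sanity check, you can confirm the dimensions of the DataFrame before going further:

# Expect 42 columns (41 features plus the target) and
# several hundred thousand rows for the 10% subset
print(data.shape)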
Once the data is downloaded, you can have a look at the DataFrame simply by printing the top five rows:
data.head()
This should give you an output just like this:
Figure 2.3 – Top five rows from the NSL-KDD dataset
As you can see, the top five rows of the data are displayed. The dataset has 42 columns. The last column, named target, identifies the kind of network attack for every row in the data. To examine the distribution of network attacks (that is, how many examples of each kind of attack are present), we can run the following statement:
data['target'].value_counts()
This will list all network attacks and the count (number of rows) for each attack, as follows:
Figure 2.4 – Distribution of data by label (attack type)
We can see that there are a variety of attack types present in the data, with the smurf and neptune types accounting for the largest share. Next, we will look at how to model this data using statistical algorithms.
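Before doing so, one preprocessing step is worth sketching here: since anomaly detection is typically framed as separating normal traffic from everything else, the many attack types are often collapsed into a binary label. The following is a minimal sketch, where the is_attack column name is our own choice, under the assumption that labels in the raw KDD data carry a trailing period (as in normal.):

# Collapse attack types into a binary label: 0 for normal traffic, 1 for any attack
# (assumes raw KDD labels carry a trailing period, such as 'normal.')
data['is_attack'] = (data['target'] != 'normal.').astype(int)
print(data['is_attack'].value_counts())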