sameer's blog: Technical paper presented at a National conference in NITTTR, Chandigarh

IP Traffic Classification Using Neural Networks

Sunil Agrawal¹, Sameer Sharma², Vivek Gupta, Vivek Sharma

UIET, Panjab University, Chandigarh.

ABSTRACT

The early detection of applications associated with TCP flows is an essential step for network security and traffic engineering. The classic way to identify flows, i.e. looking at port numbers, is no longer effective. In this paper, we propose neural network techniques for Internet traffic identification. Two supervised neural networks are compared on the basis of precision, recall and model build time. We opened different websites in order to create Internet traffic in our lab, and captured traffic flows with the help of ‘Ethereal’. The most significant features of the flows are then extracted and used to train the proposed neural network algorithms. We find that the MLP network outperforms the RBF network when the precision of classification is compared.

Keywords: Traffic classification, Machine Learning, Ethereal.

1. INTRODUCTION

Here is a brief discussion of a few classical approaches that have been used for traffic classification.

A. Port Number Analysis

Historically, traffic classification techniques used well-known port numbers to identify Internet traffic. This was successful because many traditional applications use fixed port numbers assigned by IANA [6]. For example, email applications commonly use port 25. Karagiannis et al. [7] have shown this technique to be ineffective for some applications, such as the current generation of P2P applications, which intentionally try to disguise their traffic by using dynamic port numbers or by masquerading as well-known applications. In addition, only those applications whose port numbers are known in advance can be identified.
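As a rough illustration of port-based identification and of why it fails for dynamic ports, here is a minimal sketch. The lookup table and the function name `classify_by_port` are hypothetical and cover only a handful of well-known TCP ports; they are not taken from the paper.

```python
# Hypothetical lookup table: a few IANA well-known TCP ports (illustration only).
WELL_KNOWN_PORTS = {
    25: "SMTP (email)",
    80: "HTTP (web)",
    110: "POP3 (email)",
    443: "HTTPS (web)",
}


def classify_by_port(src_port, dst_port):
    """Label a flow by the first of its two ports found in the table."""
    for port in (dst_port, src_port):
        if port in WELL_KNOWN_PORTS:
            return WELL_KNOWN_PORTS[port]
    # Applications on dynamic or unregistered ports cannot be identified.
    return "unknown"
```

A flow to port 25 is labelled as email, while a P2P flow that picks an arbitrary port (say 6881) falls through to "unknown", which is the weakness described above.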

B. Payload-based Analysis

Another well-researched approach is analysis of packet payloads [7]–[10]. In this approach, the packet payloads are analyzed to see whether or not they contain characteristic signatures of known applications. These approaches have been shown to work very well for Internet traffic including P2P traffic. However, these techniques also have drawbacks. First, payload analysis poses privacy and security concerns. Second, these techniques typically require increased processing and storage capacity. Third, these approaches are unable to cope with encrypted transmissions. Finally, these techniques only identify traffic for which signatures are available and are unable to classify previously unknown traffic.
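The idea of signature matching can be sketched as follows. The signature table is a hypothetical, heavily simplified example (real systems such as those in [8] and [10] use far richer signatures), and `classify_by_payload` is an illustrative name.

```python
# Hypothetical signature table: byte strings that commonly open a payload.
SIGNATURES = {
    "HTTP": (b"GET ", b"POST ", b"HTTP/1."),
    "BitTorrent": (b"\x13BitTorrent protocol",),
    "SMTP": (b"EHLO ", b"HELO "),
}


def classify_by_payload(payload):
    """Return the first application whose signature prefixes the payload."""
    for app, prefixes in SIGNATURES.items():
        if payload.startswith(prefixes):  # startswith accepts a tuple of prefixes
            return app
    # Encrypted or previously unseen traffic matches no signature.
    return "unknown"
```

An encrypted payload carries none of these byte patterns, so it is returned as "unknown", which mirrors the third and fourth drawbacks listed above.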

C. Transport-layer heuristics

Transport-layer heuristic information has been used to address the drawbacks of payload-based analysis and the diminishing effectiveness of port-based identification. Karagiannis et al. propose a novel approach that uses the unique behaviors of P2P applications when they are transferring data or making connections to identify this traffic [7]. This approach is shown to perform better than port-based classification and equivalent to payload-based analysis. In addition, Karagiannis et al. created another method that uses the social, functional, and application behaviors to identify all types of traffic [11].

Corresponding authors' email ids: 1. s.agrawal@hotmail.com 2. smrshrm20@gmail.com

2. PROBLEM STATEMENT

None of the classical approaches mentioned above can, on its own, be used to classify IP traffic data.

Our proposal:

We use supervised neural network algorithms to solve this problem. The two algorithms are the radial basis function (RBF) network and the multilayer perceptron (MLP).

Machine Learning Approaches

Machine learning techniques generally consist of two parts: model building and then classification. A model is first built using training data. This model is then passed to a classifier, which classifies a data set. Machine learning techniques can be divided into the categories of unsupervised and supervised. McGregor et al. hypothesize that an unsupervised approach can group flows based on connection-level (i.e., transport layer) statistics to classify traffic [1]. In this method, an EM algorithm [5] is used, and McGregor et al. conclude that this approach is promising. In [3] and [4], Zander et al. extend this work by using an EM algorithm called AutoClass [12] and find the optimal set of attributes to use for building the classification model. Some supervised machine learning techniques, such as [13] and [2], also use connection-level statistics to classify traffic. In [13], Roughan et al. use nearest neighbor and linear discriminant analysis. This approach is limited because it does not classify HTTP traffic and uses a limited number of connection-level statistics. In [2], Moore et al. suggest using Naïve Bayes as a classifier and show that the Naïve Bayes approach has a high accuracy for classifying IP traffic.

Performance metrics

We now define the parameters on which the two algorithms will be compared.

MODEL BUILDING TIME

Model Building Time means the time taken by the network to build a model using the training data available to it, so that any further data presented to it can be accurately classified into suitable classes.

PRECISION

Precision means the percentage of instances classified as class X that actually belong to class X. The bar graph of results shows that the precision of MLP is better than that of RBF for the given traffic data.

FALSE POSITIVE RATE

False Positive (FP) Rate specifies the percentage of members of other classes incorrectly classified as belonging to class X.

RECALL RATE

Recall Rate means the percentage of members of class X correctly classified as belonging to class X.

ACCURACY

Accuracy means the ratio of the number of correctly classified instances to the total number of instances.
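To make these metrics concrete, the sketch below computes them from a confusion matrix. It uses the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), FP rate = FP / (FP + TN). The function names, the three class labels and the matrix values are made up for illustration and are not the paper's results.

```python
def per_class_metrics(confusion, labels, target):
    """confusion[i][j] = number of instances of actual class i predicted as class j."""
    k = labels.index(target)
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]                          # class X predicted as X
    fn = sum(confusion[k]) - tp                   # class X predicted as something else
    fp = sum(row[k] for row in confusion) - tp    # other classes predicted as X
    tn = total - tp - fn - fp
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "fp_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


def accuracy(confusion):
    """Correctly classified instances divided by all instances."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total if total else 0.0


# Hypothetical 3-class example (rows: actual class, columns: predicted class).
labels = ["http", "rtp", "chat"]
confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]
http_metrics = per_class_metrics(confusion, labels, "http")
overall_accuracy = accuracy(confusion)
```

For the "http" class in this made-up matrix, 8 of the 10 instances predicted as http are correct and 8 of the 10 actual http instances are found, so precision and recall are both 0.8, while 2 of the 20 non-http instances are mislabelled as http, giving an FP rate of 0.1.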

3. EXPERIMENTS PERFORMED

· Collect Internet traffic data using network capture software, and generate a data file in a suitable format, ready to be used by the software simulation tool.

· Classify the collected traffic data into the desired classes using neural networks (MLP & RBF), and compare the performance of these algorithms on various parameters for the accurate classification of Internet traffic.

4. DATASET

Data is captured from different websites, each serving a different variety of data. The capture is done with a software tool called Ethereal.

Ethereal [15] is a free packet analyzer. It is used for network troubleshooting, analysis, software and communications protocol development, and education. In May 2006 the project was renamed Wireshark due to trademark issues.

We selected nine popular websites on the basis of the types of packets we can get from them. The websites used are:

1. ICQ.COM

It is a chat website whose features include “strange chatting” and instant messaging. It works on the ‘Instant Messaging and Presence Protocol’ over TCP. ICQ uses a specific protocol named OSCAR (Open System for Communication in Realtime), provided by AOL (America Online).

2. YOUTUBE.COM

It is a multimedia website providing free video playback and download using Flash Player. It was selected mainly to include the large packets of multimedia downloads. It works on the ‘Real-time Transport Protocol (RTP)’.

3. ZAPAK.COM

It is an online gaming website providing on-site gaming and downloads. It utilizes the ‘Online Gaming Protocol (ONGP)’ for data transfer.

4. INDIANEXPRESS.COM

It is a simple HTTP (Hypertext Transfer Protocol) based website providing information and news.

5. GMAIL.COM

It is basically a mail website supporting mail, online chat and file transfer. Its front end uses HTTP, but it also uses the ‘Simple Mail Transfer Protocol (SMTP)’ and the Post Office Protocol (POP).

6. ESPN.COM

A simple sports related information and news website. It utilizes the HTTP protocol.

7. GEETGANGA.ORG

It is an information website offering poem texts and their download. It also utilizes HTTP.

8. SONGS.PK

A multimedia website supporting song playback and download. It utilizes RTP.

9. WIKIPEDIA.COM

A web based encyclopedia site. Again an HTTP website.

Feature selection

Features are selected from the captured packets and written to an .arff file, the format required by WEKA. We selected the following attributes:

1. Packet size
2. Protocol
3. Source port
4. Destination port
5. Window size

Flag bits:

6. CWR
7. ECN-Echo
8. Urgent
9. Acknowledgement
10. Push
11. Reset
12. SYN
13. FIN
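The 13 attributes above can be turned into an ARFF file roughly as follows. This is a sketch under several assumptions that the paper does not state: the attribute names, encoding the protocol as a number, treating each flag as a {0,1} nominal, the extra `class` attribute, and the sample packets are all hypothetical. The flag masks are the standard bit positions in the TCP header flags byte.

```python
# Standard TCP flag bit masks, in the order of attributes 6-13.
TCP_FLAGS = [
    ("cwr", 0x80), ("ecn_echo", 0x40), ("urgent", 0x20), ("ack", 0x10),
    ("push", 0x08), ("reset", 0x04), ("syn", 0x02), ("fin", 0x01),
]
# Hypothetical names for attributes 1-5.
NUMERIC_ATTRS = ["packet_size", "protocol", "src_port", "dst_port", "window_size"]


def decode_flags(flags_byte):
    """Split the TCP flags byte into eight 0/1 values (CWR ... FIN)."""
    return [1 if flags_byte & mask else 0 for _, mask in TCP_FLAGS]


def build_arff(relation, class_labels, packets):
    """Return ARFF text; each packet is (size, proto, sport, dport, window, flags, label)."""
    lines = ["@relation " + relation, ""]
    for name in NUMERIC_ATTRS:
        lines.append("@attribute " + name + " numeric")
    for name, _ in TCP_FLAGS:
        lines.append("@attribute " + name + " {0,1}")
    lines.append("@attribute class {" + ",".join(class_labels) + "}")
    lines += ["", "@data"]
    for size, proto, sport, dport, window, flags, label in packets:
        values = [size, proto, sport, dport, window] + decode_flags(flags) + [label]
        lines.append(",".join(str(v) for v in values))
    return "\n".join(lines) + "\n"


# Two made-up packets: an HTTP data segment (ACK+PSH) and a chat SYN.
sample_packets = [
    (1500, 6, 80, 51234, 65535, 0x18, "http"),
    (60, 6, 51234, 5190, 8192, 0x02, "chat"),
]
arff_text = build_arff("ip_traffic", ["http", "chat"], sample_packets)
```

Writing `arff_text` to a file with an `.arff` extension should give something WEKA can load, although the exact layout the authors used is not given in the paper.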

5. Neural Networks

Neural networks (NN), which are simplified models of the biological neuron system, are massively parallel distributed processing systems made up of highly interconnected neural computing elements that have the ability to learn and thereby acquire knowledge and make it available for use.

Various learning mechanisms exist to enable an NN to acquire knowledge. NN architectures have been classified into various types based on their learning mechanisms. The process by which an NN learns is called ‘training’, and the ability of an NN to solve a problem using the acquired knowledge is called ‘inference’.

A human brain develops with time, and this is generally known as ‘experience’. Technically, this involves the ‘development’ of neurons to adapt themselves to their surrounding environment, thus rendering the brain ‘plastic’ in its information processing capability. On similar lines, the property of plasticity is available with NN architectures. Further, the ‘stability’ of an NN is also desired, i.e., the adaptive capability of the NN in the face of a changing environment. This is because NN systems, being essentially learning systems, need to preserve the information learnt but, at the same time, need to be receptive to learning new information. The NN needs to remain ‘plastic’ to significant or useful information, but remain ‘stable’ when presented with irrelevant information.

5.1 THE MULTILAYER PERCEPTRON (MLP)

The Multilayer Perceptron (MLP) is the most common neural network model used currently. A multilayer perceptron is a feed-forward artificial neural network model that maps sets of input data onto a set of appropriate outputs. Feed-forward means that signals flow in one direction only, from the input layer through the hidden layers to the output layer, with no feedback connections. This type of neural network is a supervised network because it requires a desired output in order to learn. The training data consist of pairs of input objects (typically vectors) and desired outputs. The output of the function can be a continuous value, or can predict a class label of the input object (called classification). The task of the supervised learner is to predict the value of the function for any valid input object after having seen a number of training examples (i.e. pairs of input and target output). To achieve this, the learner has to generalize from the presented data to unseen situations in a reasonable way. Hence, the goal of this type of network is to create a model that correctly maps the input to the output using historical data, so that the model can then be used to produce the output when the desired output is unknown.

FIG 1 A representation of an MLP network

It is made of neurons characterized by a bias and weighted links between them. The neurons receive the inputs and normalize them before forwarding them. It has an input layer and an output layer, with one or more hidden layers of nonlinearly-activating nodes. Each node in one layer connects with a certain weight w_ij to every node in the following layer. The inputs are fed into the neurons of the input layer and get multiplied by interconnection weights as they are passed from the input layer to the first hidden layer. Each neuron in any subsequent layer first computes a linear combination of the outputs of the previous layer. The output of the neuron is then a function f of that combination, with f being linear for output neurons or a sigmoid for hidden layers.
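The forward pass described above can be sketched in a few lines. This is an illustration only, not WEKA's implementation: the function names and weight values are hypothetical, there is a single hidden layer, and training (adjusting the weights) is not shown. Here `weights[j][i]` holds the weight from node i of the previous layer to node j of the current one.

```python
import math


def sigmoid(z):
    """Sigmoid activation used by the hidden neurons."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases, activation):
    """Each neuron: linear combination of the previous layer's outputs, then f."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(activation(z))
    return outputs


def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """One sigmoid hidden layer followed by a linear output layer."""
    hidden = layer_forward(x, hidden_w, hidden_b, sigmoid)
    return layer_forward(hidden, out_w, out_b, lambda z: z)


# Hypothetical 2-2-1 network with hand-picked weights.
output = mlp_forward(
    [0.0, 0.0],
    hidden_w=[[0.5, -0.3], [0.8, 0.2]], hidden_b=[0.0, 0.0],
    out_w=[[1.0, 1.0]], out_b=[0.0],
)
```

With an all-zero input and zero biases, both hidden neurons output sigmoid(0) = 0.5, so the linear output neuron with unit weights returns 1.0.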

5.2 THE RADIAL BASIS FUNCTION (RBF)

RBF networks have three layers:

FIG 2: A block diagram of an RBF network

Input layer – There is one neuron in the input layer for each predictor variable. In the case of categorical variables, N-1 neurons are used where N is the number of categories. The input neurons then feed the values to each of the neurons in the hidden layer.

Hidden layer – This layer has a variable number of neurons (the optimal number is determined by the training process). Each neuron consists of a radial basis function centered on a point with as many dimensions as there are predictor variables. The spread (radius) of the RBF function may be different for each dimension. The centers and spreads are determined by the training process. When presented with the x vector of input values from the input layer, a hidden neuron computes the Euclidean distance of the test case from the neuron’s center point and then applies the RBF kernel function to this distance using the spread values. The resulting value is passed to the summation layer.

Summation layer – The value coming out of a neuron in the hidden layer is multiplied by a weight associated with the neuron (W1, W2, ...,Wn in this figure) and passed to the summation which adds up the weighted values and presents this sum as the output of the network.
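The three layers can be sketched as a single function. This assumes a Gaussian kernel and one spread per hidden neuron, which is a simplification (the text allows a different spread per dimension, and the paper does not name the kernel). The function name, centers, spreads and weights are hypothetical, and the training step that would determine them is not shown.

```python
import math


def rbf_forward(x, centers, spreads, weights):
    """Weighted sum of Gaussian RBF activations for input vector x."""
    total = 0.0
    for center, spread, weight in zip(centers, spreads, weights):
        # Hidden layer: squared Euclidean distance from the neuron's center ...
        dist_sq = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
        # ... passed through a Gaussian kernel (assumed here) scaled by the spread.
        activation = math.exp(-dist_sq / (2.0 * spread ** 2))
        # Summation layer: multiply by the neuron's weight and accumulate.
        total += weight * activation
    return total


# Hypothetical network with two hidden neurons in a 2-D input space.
centers = [[0.0, 0.0], [3.0, 4.0]]
spreads = [1.0, 1.0]
weights = [2.0, 1.0]
rbf_output = rbf_forward([0.0, 0.0], centers, spreads, weights)
```

An input that sits exactly on the first center activates that neuron fully (activation 1), while the second neuron, at distance 5, contributes almost nothing, so the output is very close to the first weight.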

7. RESULTS AND DISCUSSION

A comparison of MLP and RBF on the basis of precision, false positive rate, and recall rate is given. It is clear from the bar graph that the precision, false positive rate and recall rate of MLP are better than those of RBF for the given traffic data. One might infer that MLP is outright better, but model building time still has to be considered.

It is clear from the table in Fig 3 that the model building time of RBF is better than that of MLP for the given traffic data. This means that the RBF network is faster to build than MLP. This is important because real-time applications require a short build time.

Fig 3: Comparison of MLP & RBF

8. CONCLUSION

We collected real-time IP traffic data from the Internet using the Ethereal software. We collected 259 instances with 13 attributes and reached the following conclusions:

· When the two algorithms (MLP & RBF) were run on this dataset, we observed that the MLP algorithm gives the higher classification accuracy. The maximum accuracy achieved is 75.28%. The precision, recall rate, and false positive rate of MLP are also better than those of the RBF network, but the model building time of RBF is nearly six times shorter than that of the MLP network (1.24 seconds in contrast with 7.08 seconds taken by the MLP network).

· Even though MLP fares better on four parameters (discussed above), it lags behind in model building time, which is quite an important parameter and in fact a driving force in the selection of the algorithm in real-time applications. So here, RBF scores over the MLP network.

· The accuracy can likely be improved by increasing the number of instances: the greater the number of instances, the better the network will be trained and the more accurate its results.

In summary, MLP, which is more accurate, is suitable for applications that demand accuracy more than a short model building time. On the other hand, in real-time applications, where time is a constraint, the RBF network is more suitable.

9. FUTURE SCOPE

Internet traffic classification is a very wide area in which a lot of research is going on at different organizations around the world. We have only begun with the classification of known Internet traffic into some pre-defined classes. Other approaches can also be used to classify Internet traffic. For example, unsupervised learning with clustering techniques can be a good alternative. A new algorithm can be implemented in the WEKA software to classify the Internet traffic. A dataset with a larger number of attributes can be collected by exploring other networking software, because the more features are available about the traffic, the more accurately it can be classified. However, as the number of attributes increases, the computational complexity of the algorithm increases, leading to longer computation time and thereby jeopardizing real-time applications. So a trade-off between the two parameters is needed. In future we expect better algorithms that will bolster this method of IP classification in accurately predicting malicious attacks or warding off unwanted packets. All in all, there are plenty of options available for extending the scope of this vast field of Internet traffic classification.

REFERENCES

[1] A. McGregor, M. Hall, P. Lorier, and J. Brunskill, “Flow Clustering Using Machine Learning Techniques,” in PAM 2004, Antibes Juan-les-Pins, France, April 19-20, 2004.

[2] A. Moore and D. Zuev, “Internet Traffic Classification Using Bayesian Analysis Techniques,” in SIGMETRICS’05, Banff, Canada, June 6-10, 2005.

[3] S. Zander, T. Nguyen, and G. Armitage, “Self-Learning IP Traffic Classification Based on Statistical Flow Characteristics,” in PAM 2005, Boston, USA, March 31-April 1, 2005.

[4] S. Zander, T. Nguyen, and G. Armitage, “Automated Traffic Classification and Application Identification using Machine Learning,” in LCN’05, Sydney, Australia, November 15-17, 2005.

[5] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, vol. 39, no. 1, pp. 1–38, 1977.

[6] IANA. Internet Assigned Numbers Authority (IANA), “http://www.iana.org/assignments/port-numbers.”

[7] T. Karagiannis, A. Broido, M. Faloutsos, and K. claffy, “Transport Layer Identification of P2P Traffic,” in IMC’04, Taormina, Italy, October 25-27, 2004.

[8] P. Haffner, S. Sen, O. Spatscheck, and D. Wang, “ACAS: Automated Construction of Application Signatures,” in SIGCOMM’05 Workshops, Philadelphia, USA, August 22-26, 2005.

[9] A. Moore and K. Papagiannaki, “Toward the Accurate Identification of Network Applications,” in PAM 2005, Boston, USA, March 31-April 1, 2005.

[10] S. Sen, O. Spatscheck, and D. Wang, “Accurate, Scalable In-Network Identification of P2P Traffic Using Application Signatures,” in WWW2004, New York, USA, May 17-22, 2004.

[11] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, “BLINC: Multilevel Traffic Classification in the Dark,” in SIGCOMM’05, Philadelphia, USA, August 21-26, 2005.

[12] P. Cheeseman and J. Stutz, “Bayesian Classification (AutoClass): Theory and Results,” in Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, USA, 1996.

[13] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, “Class-of-Service Mapping for QoS: A Statistical Signature-based Approach to IP Traffic Classification,” in IMC’04, Taormina, Italy, October 25-27, 2004.

[14] I. Witten and E. Frank, “Data Mining: Practical Machine Learning Tools and Techniques,” second ed. San Francisco: Morgan Kaufmann, 2005.

[15] Ethereal software for data capturing, http://www.ethereal.com