sameer's blog: Technical paper presented at a National conference in Udaipur, Rajasthan

A Preliminary Performance Comparison of Two Clustering Algorithms for Practical IP Traffic Classification

Sunil Agrawal, Sameer Sharma and B.S. Sohi

UIET, Panjab University, Chandigarh, India.

Email: s.agrawal@hotmail.com, smrshrm20@gmail.com

ABSTRACT

The early detection of applications associated with TCP flows is an essential step for network security and traffic engineering. The classic way to identify flows, i.e. looking at port numbers, is not effective anymore. In this paper, we propose a technique that uses an unsupervised machine learning approach for Internet traffic identification.

Our unsupervised approach uses Simple K-Means Clustering Algorithm, and we compare the results with efficient version of this clustering algorithm- Fast K-Means on the grounds of Model Building Time, i.e., the speed with which the data is clustered by the algorithms. We find that Fast K-Means takes less time when the number of clusters range from 2 to 22, and afterwards, it gets slower than Simple K-Means. We also find that the unsupervised technique can be used to discover traffic from previously unknown applications and has the potential to become an excellent tool for exploring Internet traffic.

Keywords

Traffic classification, Machine Learning, K-means.

1. INTRODUCTION

Enterprise or campus networks usually impose a set of rules for users to access the network in order to protect network resources and enforce institutional policies (for instance, no sharing of music files or gaming). This leaves network administrators with the daunting task of (1) identifying the application associated with a traffic flow and (2) controlling user’s traffic when needed. Therefore, accurate classification of traffic flows is an essential step for administrators to detect intrusion or malicious attacks, forbidden applications, or simply new applications (which may impact the future provisioning of network resources).

Previous works have proposed a number of methods to identify the application associated with a traffic flow. The simplest approach consists in examining TCP port numbers. Port-based methods are simple because many well-known applications have specific port

numbers (for instance, HTTP traffic uses port 80 and FTP port 21). However, the research community now recognizes that port-based classification is inadequate, mainly because many applications use dynamic port-negotiation mechanisms to hide from firewalls and network security tools. An alternative approach is to inspect the payload of every packet. This technique can be extremely accurate when the payload is not encrypted, but it is an unrealistic alternative. First, there are privacy concerns with examining user data. Second, there is a high storage and computational cost to study every packet that traverses a link (in particular at very high-speed links).

2. BACKGROUND AND RELATED WORK

There has been much recent work in the field of traffic classification. This section will survey the different techniques presented in the literature.

A. Port Number Analysis

Historically, traffic classification techniques used well known port numbers to identify Internet traffic. This was successful because many traditional applications use fixed

port numbers assigned by IANA [6]. For example, email applications commonly use port 25. This technique has been shown to be ineffective by Karagiannis et al. in [7] for

some applications such as the current generation of P2P applications which intentionally tries to disguise their traffic by using dynamic port numbers or masquerade as well-known applications. In addition, only those applications whose port numbers are known in advance can be identified.

B. Payload-based Analysis

Another well researched approach is analysis of packet payloads [7]–[10]. In this approach, the packet payloads are analyzed to see whether or not they contain characteristics signatures of known applications. These approaches have been shown to work very well for Internet traffic including P2P traffic. However, these techniques also have drawbacks. First, payload analysis poses privacy and security concerns. Second,

these techniques typically require increased processing and storage capacity. Third, these approaches are unable to cope with encrypted transmissions. Finally, these techniques only identify traffic for which signatures are available and are unable to classify previously unknown traffic.

C. Transport-layer heuristics

Transport-layer heuristic information has been used to address the drawbacks of payload-based analysis and the diminishing effectiveness of port-based identification. Karagiannis

et al. propose a novel approach that uses the unique behaviors of P2P applications when they are transferring data or making connections to identify this traffic [7]. This approach is shown to perform better than port-based classification and equivalent to payload-based analysis. In addition, Karagiannis et al. created another method that uses the social, functional, and application behaviors to identify all types of traffic [11].

D. Machine Learning Approaches

Machine learning techniques generally consists of two parts: model building and then classification. A model is first built using training data. This model is then inputted into a classifier that then classifies a data set. Machine learning techniques can be divided into the categories of unsupervised and supervised. McGregor et al. hypothesize the ability of using an unsupervised approach to group flows based on connection-level (i.e., transport layer) statistics to classify traffic [1]. In this method, an EM algorithm [5] is used and McGregor et al. draw the conclusion that this approach is promising. In [3] and [4], Zander et al. extend this work by using an EM algorithm called AutoClass [12] and find the optimal set of attributes to use for building the classification model. Some supervised machine learning techniques, such as [13] and [2], also use connection-level statistics to classify traffic. In [13], Roughan et al. use nearest neighbor and linear discriminate

analysis. This approach is limited because it does not classify HTTP traffic and uses a limited number of connection level statistics. In [2], Moore et al. suggests using Naïve Bayes as a classifier and shows that the Naïve Bayes approach has a high accuracy classifying traffic.

3. UNSUPERVISED CLUSTERING:-

To extract groups of flows that share a common communication behavior, we borrow techniques from machine learning. We use unsupervised clustering as it relies on unlabeled data samples to find natural groups (or clusters) in a dataset, whereas supervised clustering uses a pre-labeled set of samples to construct a model for each cluster. Although the traffic classification mechanism may also use Naive Bayes Classifiers, an example of supervised clustering, unsupervised learning is more appropriate for traffic classification because it does not rely on pre-defined classes. A single application can have multiple behaviors which should be modeled separately.

We use two versions of K-Means Clustering algorithm for this application, and compare these two algorithms on the ground of Speed i.e., model building time.

Data Set used for comparison of Algorithms:-

The algorithms have been run on IP traffic data, having 248 attributes and 24863 instances. This is the IP Traffic data of University of Auckland and has been referenced from the link:-

http://www.dcs.qmul.ac.uk/research/nrl http://www.cl.cam.ac.uk/Research/SRG/netos/nprobe/data/papers/sigmetrics/index.html

System Configuration and Software Platform used for comparison:-

The software on which the algorithms have been compared is WEKA

The computer system on which the algorithms have been run consists of Intel Core 2 duo processor with speed 3 GHz, and has a 3 GB DDR2 RAM.

2. BRIEF THEORY OF K-MEANS CLUSTERING ALGORITHM

It is an algorithm for partitioning (or clustering) data points into disjoint subsets (or clusters) containing data points so as to minimize the sum-of-squares criterion

where xn is a vector representing the nth data point and µj is the geometric centroid of the data points in subset (cluster) Sj. In general, the algorithm does not achieve a global minimum of over the assignments. In fact, since the algorithm uses discrete assignment rather than a set of continuous parameters, the "minimum" it reaches cannot even be properly called a local minimum. Despite these limitations, the algorithm is used fairly frequently as a result of its ease of implementation.

The algorithm consists of a simple re-estimation procedure as follows. Initially, the data points are assigned at random to the sets. For step 1, the centroid is computed for each set. In step 2, every point is assigned to the cluster whose centroid is closest to that point. These two steps are alternated until a stopping criterion is met, i.e., when there is no further change in the assignment of the data points.

The K-Means has an input parameter of K. It represents the number of disjoint partitions used by K-Means. In our data set, we would expect there would be at least one cluster for each traffic class. In addition, due to diversity of traffic in some classes such as HTTP (e.g. browsing, bulk download, streaming), we expect even more clusters to be formed.

So, initially, K has been kept 2, and then incremented at an interval of 2, upto a maximum of 28. The model building time of both algorithms has been recorded. The number of instances in the respective clusters in each simulation of both the algorithms is exactly the same. It means that the Fast K-Means is just algorithmically superior than Simple K-Means algorithm.

4. EXPERIMENTAL RESULTS

Text Box: No. of Clusters(K) Fast K-Means (seconds) Simple K-Means (seconds) 2 50 53 4 65 65 6 146 162 8 100 98 10 142 143 12 129 130 14 123 127 16 138 138 18 158 174 20 244 240 22 180 193 24 166 160 26 220 195 28 440 438

When these two algorithms are run on WEKA, one by one, with the input parameter, i.e., Number of clusters, being varied from 2 to 28, on the data set specified above, the Model Building time is noted for each simulation. The output is presented in the tabular form in the adjoining table. Graphical representation of both algorithms is given in the following pages, and comparison of both algorithms is also shown.

l Fast K-Means : represented by white column

l Simple K-Means : represented by black column

Relative Comparison of Fast K-Means and Simple K-Means

5. CONCLUSION

This paper presented an unsupervised machine learning approach ( Simple K-Means and Fast K-Means) for Internet traffic classification. We used qualitative and quantitative results to compare the significance of the these two algorithms on the grounds of time. We show that the time required to cluster using Fast K-Means is less if the number of clusters (or classes) is within the range of 2-22. After that, Simple K-Means performs better. The Fast K-Means algorithm is similar to Simple K-Means Algorithm, with some algorithmic optimizations. It takes advantage of the geometric properties of K-Means clustering algorithm to reduce runtime. The improvement in Fast K-Means is also a function of the dimensionality of the data presented to it. Also, the reduction in model building time for Fast K-Means can be much helpful while working with IP traffic classification in real time. When the number of instances in the data is large enough, Fast K-Means clusters the data within less time. Further, we can assign classes to the clusters evaluated, and hence, whenever the new instance arrives, it immediately can get classified accordingly, hence providing the network administrator with an excellent tool for accurate and timely classification of the real time IP traffic.

6. FUTURE WORK

At this point, we have just compared the two unsupervised clustering algorithms on the ground of model building time, i.e., speed of clustering. Here each instance of the data has 248 flow features(full feature set). But classification could have been done with reduced feature set, with acceptable accuracy. Our future work involve the reduction in feature set with the help of some correlation-based search algorithms, as to reduce number of features drastically (about 5-10% of the original number of attributes). And then, we will try to compare these algorithms, not only on the basis of model build time, but also on the basis of classification accuracy.

REFERENCES

[1] A. McGregor, M. Hall, P. Lorier, and J. Brunskill, “Flow Clustering Using Ma chine Learning Techniques,” in PAM 2004, Antibes Juan-les- Pins, France, April 19-20, 2004.

[2] A. Moore and D. Zuev, “Internet Traffic Classification Using Bayesian Analysis Techniques,” in SIGMETRICS’05, Banff, Canada, June 6-10, 2005.

[3] S. Zander, T. Nguyen, and G. Armitage, “Self-Learning IP Traffic Classification Based on Statistical Flow Characteristics,” in PAM 2005, Boston, USA, March 31-April 1, 2005.

[4] ——, “Automated Traffic Classification and Application Identification using Ma chine Learning,” in LCN’05, Sydney, Australia, November 15- 17, 2005.

[5] A. Dempster, N. Paird, and D. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, vol. 39, no.1, pp. 1–38, 1977.

[6] IANA. Internet Assigned Numbers Authority (IANA), “http://www.iana.org/as signments/port-numbers.”

[7] T. Karagiannis, A. Broido, M. Faloutsos, and K. claffy, “Transport Layer Identifi cation of P2P Traffic,” in IMC’04, Taormina, Italy, October 25- 27, 2004.

[8] P. Haffner, S. Sen, O. Spatscheck, and D. Wang, “ACAS: Automated Construc tion of Application Signatures,” in SIGCOMM’05 Workshops, Philadelphia, USA, August 22-26, 2005.

[9] A. Moore and K. Papagiannaki, “Toward the Accurate Identification of Network Applications,” in PAM 2005, Boston, USA, March 31-April 1, 2005.

[10] S. Sen, O. Spatscheck, and D. Wang, “Accurate, Scalable In- Network Identifica tion of P2P Traffic Using Application Signatures,” in WWW2005, New York, USA, May 17-22, 2004.

[11] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, “BLINK: Multilevel Traffic Classification in the Dark,” in SIGCOMM’05, Philadelphia, USA, August 21-26, 2005.

[12] P. Cheeseman and J. Strutz, “Bayesian Classification (AutoClass): Theory and Results.” In Advances in Knowledge Discovery and Data Mining, AAI/MIT Press, USA, 1996.

[13] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, “Class-of-Service Mapping for QoS: A Statistical Signature-based Approach to IP Traffic Classification,” in IMC’04, Taormina, Italy, October 25-27, 2004.

[14] I. Witten and E. Frank, (2005) Data Mining: Pratical Machine Learning Tools and Techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2005.

[15] A. Banerjee and J. Langford, “An Objective Evaluation of Criterion for Cluster ing,” in KDD’04, Seattle, USA, August 22-25, 2004.

[16] Auckland Data Sets, http://www.wand.net.nz/wand/wits/auck/.

[17] V. Paxson, “Empirically-Derived Analytic Models of Wide-Area TCP Connec tions,” IEEE/ACM Transactions on Networking, vol. 2, no. 4, pp. 316–336, Au gust 1998.

[18] C. Colman, “What to do about P2P?” Network Computing Magazine, vol. 12, no. 6, 2003.

[19] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, USA: Prentice Hall, 1988.

[20] J. Erman, M. Arlitt, and A. Mahanti, “Traffic Classification using Clustering Al gorithms,” in SIGCOMM’06 MineNet Workshop, Pisa, Italy, September 2006.

sameer's blog

Tuesday, February 23, 2010

Technical paper presented at a National conference in Udaipur, Rajasthan

No comments:

Post a Comment

Search This Blog

About Me

Blog Archive