Saturday, July 31, 2010
What is Pixel?
Wednesday, July 28, 2010
Quotations
Tuesday, July 27, 2010
DTH-SATELLITE TV
Satellite TV Signal
After the video is compressed, the provider encrypts it to keep people from accessing it for free. Encryption scrambles the digital data in such a way that it can only be decrypted (converted back into usable data) if the receiver has the correct decryption algorithm and security keys.
Monday, July 26, 2010
Neural Networks
Sunday, July 25, 2010
RIBHU ON WINDOWS 7
WEB TECHNOLOGIES
- JSP:- JavaServer Pages. It is based on the Java language. It is regarded as very secure but is, at the same time, costly to implement. There are three stages in Java Technologies
- ASP:- Active Server Pages. It is Microsoft’s answer to Sun Microsystems’ Java technology. Microsoft provides an integrated development environment, “Visual Studio”, for development in ASP.NET (the successor to ASP, which uses the .aspx file extension) with the .NET framework. Visual Studio is developer-friendly and easy to learn.
- PHP:- PHP: Hypertext Preprocessor. For a beginner in web development, it is the best-known and easiest-to-learn web technology. PHP servers are the least expensive option if you want to deploy your website, and tools such as XAMPP are available for local development.
Thursday, April 29, 2010
A letter to GOD
Saturday, April 17, 2010
Movie Review: PaathShaala
Friday, February 26, 2010
Movie Review: Kartik calling Kartik
Movie Review: Teen Patti
Tuesday, February 23, 2010
Technical paper presented at a National conference in NITTTR,Chandigarh
IP Traffic Classification Using Neural Networks
Sunil Agrawal1, Sameer Sharma2, Vivek Gupta, Vivek Sharma
UIET,
ABSTRACT
The early detection of applications associated with TCP flows is an essential step for network security and traffic engineering. The classic way to identify flows, i.e. looking at port numbers, is no longer effective. In this paper, we propose neural network techniques for Internet traffic identification. Two supervised neural networks are compared on the basis of precision, recall and model build time. We opened different websites in order to create Internet traffic in our lab, and captured the traffic flows with the help of ‘Ethereal’. The most significant features of the flows are then extracted and used to train the proposed neural network algorithms. We find that the MLP network outperforms the RBF network when the precision of classification is compared.
Keywords: Traffic classification, Machine Learning, Ethereal.
1. INTRODUCTION
Here is a brief discussion of a few classical approaches that have been used for traffic classification.
A. Port Number Analysis
Historically, traffic classification techniques used well-known port numbers to identify Internet traffic. This was successful because many traditional applications use fixed port numbers assigned by IANA [6]. For example, email applications commonly use port 25. Karagiannis et al. [7] have shown this technique to be ineffective for some applications, such as the current generation of P2P applications, which intentionally try to disguise their traffic by using dynamic port numbers or by masquerading as well-known applications. In addition, only those applications whose port numbers are known in advance can be identified.
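As a rough illustration of port number analysis, the sketch below looks up a flow's ports in a small table. The table is a hypothetical subset of IANA's well-known ports (SMTP on 25, HTTP on 80, FTP on 21, POP3 on 110, HTTPS on 443), and the function name `classify_by_port` is invented for this example.

```python
# Hypothetical subset of IANA well-known ports, for illustration only.
WELL_KNOWN_PORTS = {21: "FTP", 25: "SMTP", 80: "HTTP", 110: "POP3", 443: "HTTPS"}

def classify_by_port(src_port, dst_port):
    """Return the application name for a flow, or "unknown" if no port matches."""
    # Check the destination port first, since servers usually listen on it.
    for port in (dst_port, src_port):
        if port in WELL_KNOWN_PORTS:
            return WELL_KNOWN_PORTS[port]
    return "unknown"

print(classify_by_port(51234, 25))    # a flow towards a mail server
print(classify_by_port(51234, 6881))  # a flow on a port missing from the table
```

The second call shows the weakness described above: any application that uses a dynamic or unlisted port falls through to "unknown".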
B. Payload-based Analysis
Another well-researched approach is the analysis of packet payloads [7]–[10]. In this approach, the packet payloads are analyzed to see whether or not they contain characteristic signatures of known applications. These approaches have been shown to work very well for Internet traffic, including P2P traffic. However, these techniques also have drawbacks. First, payload analysis poses privacy and security concerns. Second, these techniques typically require increased processing and storage capacity. Third, these approaches are unable to cope with encrypted transmissions. Finally, these techniques only identify traffic for which signatures are available and are unable to classify previously unknown traffic.
C. Transport-layer heuristics
Transport-layer heuristic information has been used to address the drawbacks of payload-based analysis and the diminishing effectiveness of port-based identification. Karagiannis et al. propose a novel approach that uses the unique behaviors of P2P applications when they are transferring data or making connections to identify this traffic [7]. This approach is shown to perform better than port-based classification and equivalent to payload-based analysis. In addition, Karagiannis et al. created another method that uses the social, functional, and application behaviors to identify all types of traffic [11].
Corresponding Author’s Email id
1. s.agrawal@hotmail.com 2. smrshrm20@gmail.com
2. PROBLEM STATEMENT
None of the classical approaches mentioned above can, on its own, reliably classify IP traffic data.
Our proposal:
We use supervised ‘neural network’ algorithms to solve this problem. The two neural networks considered are the Radial Basis Function (RBF) network and the Multilayer Perceptron (MLP).
Machine Learning Approaches
Machine learning techniques generally consist of two parts: model building and then classification. A model is first built using training data. This model is then given to a classifier, which classifies a data set. Machine learning techniques can be divided into the categories of unsupervised and supervised. McGregor et al. hypothesize that an unsupervised approach can group flows based on connection-level (i.e., transport layer) statistics to classify traffic [1]. In this method, an EM algorithm [5] is used, and McGregor et al. conclude that the approach is promising. In [3] and [4], Zander et al. extend this work by using an EM algorithm called AutoClass [12] and find the optimal set of attributes to use for building the classification model. Some supervised machine learning techniques, such as [13] and [2], also use connection-level statistics to classify traffic. In [13], Roughan et al. use nearest neighbor and linear discriminant analysis. This approach is limited because it does not classify HTTP traffic and uses a limited number of connection-level statistics. In [2], Moore et al. suggest using Naïve Bayes as a classifier and show that the Naïve Bayes approach has a high accuracy for classifying IP traffic.
Performance metrics
Now we have to define certain parameters on which the two algorithms will be compared.
- MODEL BUILDING TIME
- PRECISION
Precision means the percentage of instances classified as belonging to class X that actually belong to class X. It is clear from the bar-graph that the precision of MLP is better than that of RBF for the given traffic data.
- FALSE POSITIVE RATE
False Positive (FP) Rate specifies the percentage of members of other classes incorrectly classified as belonging to class X
- RECALL RATE
Recall Rate means the percentage of members of class X correctly classified as belonging to class X.
- ACCURACY
Accuracy means the ratio of number of correctly classified instances to the total number of instances
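The metrics defined above can be computed from a confusion matrix, as in the following minimal sketch. The function names and the two-class matrix are hypothetical and are not taken from the experiments in this paper.

```python
def per_class_metrics(confusion, labels, target):
    """Precision, recall and false positive rate for one class.

    confusion[i][j] is the number of instances whose actual class is
    labels[i] and whose predicted class is labels[j].
    """
    t = labels.index(target)
    total = sum(sum(row) for row in confusion)
    tp = confusion[t][t]                        # class X classified as X
    fn = sum(confusion[t]) - tp                 # class X classified as something else
    fp = sum(row[t] for row in confusion) - tp  # other classes classified as X
    tn = total - tp - fn - fp
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "fp_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

def accuracy(confusion):
    """Correctly classified instances divided by all instances."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Hypothetical confusion matrix: rows are actual classes, columns are predictions.
labels = ["http", "p2p"]
confusion = [[8, 2],
             [1, 9]]
print(per_class_metrics(confusion, labels, "http"))
print(accuracy(confusion))
```

Note that precision divides by everything predicted as class X, while recall divides by everything that actually is class X, so the two can differ considerably on unbalanced data.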
3. EXPERIMENTS PERFORMED
· Collect Internet traffic data using a Networking Software, and generate a data file in a suitable format, ready to be used by software simulation tool.
· Classification of the collected traffic data into desired classes using Neural Networks (MLP & RBF), and compare the performance of these algorithms on various parameters for the accurate classification of Internet Traffic.
4. DATASET
Data is captured from different websites (each carrying a different variety of data). This is done using a software package called Ethereal.
Ethereal [15] is a free packet analyzer computer application. It is used for network troubleshooting, analysis, software and communications protocol development, and education. In May 2006 the project was renamed Wireshark due to trademark issues.
We selected some 9 popular websites on the basis of the type of packets we can get from them. The websites used are:
1. ICQ.COM
It is a chat website that includes features of “strange chatting” along with instant messaging. It works on the `Instant Messaging and Presence Protocol` over TCP. ICQ has a specific protocol named OSCAR (Open System for Communication in Real-time), provided by AOL (America Online).
2. YOUTUBE.COM
It is a multimedia website providing free video playback and download using Flash Player. It was selected mainly to include the large packets of multimedia downloads. It works on the `Real-time Transport Protocol (RTP)`.
3. ZAPAK.COM
It is an online gaming website providing onsite gaming and downloads. It utilizes the `Online Gaming Protocol (ONGP)` for data transfer.
4. INDIANEXPRESS.COM
It is a simple HTTP (Hypertext Transfer Protocol) based website providing information and news.
5. GMAIL.COM
It is basically a mail website supporting mail, online chat and file transfer. Its front end uses HTTP, but it also uses the `Simple Mail Transfer Protocol (SMTP)` and the Post Office Protocol (POP).
6. ESPN.COM
A simple sports related information and news website. It utilizes the HTTP protocol.
7. GEETGANGA.ORG
It is an info website supporting poem text and its download. It also utilizes the HTTP protocol.
8. SONGS.PK
A multimedia website supporting songs play and download. It utilizes the RTP protocol.
9. WIKIPEDIA.COM
A web based encyclopedia site. Again an HTTP website.
Feature selection
Features are selected from the captured packets to build the .arff file that is fed into WEKA. We selected the following attributes:
1. Packet size
2. Protocol
3. Source port
4. Destination port
5. Window size
Flag bits:
6. CWR
7. ECN-Echo
8. Urgent
9. Acknowledgement
10. Push
11. Reset
12. SYN
13. FIN
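To make the file format concrete, the sketch below writes the thirteen attributes listed above in WEKA's ARFF layout (`@relation`, `@attribute`, `@data`). The attribute names, the nominal values, the added class attribute and the sample row are all assumptions made for illustration and are not the actual dataset.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text from (name, type) pairs and rows of values.

    A type is either an ARFF keyword such as "numeric" or a list of
    nominal values, which is written as {a,b,c}.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            kind = "{" + ",".join(kind) + "}"
        lines.append("@attribute {} {}".format(name, kind))
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"

FLAGS = ["cwr", "ecn_echo", "urgent", "acknowledgement", "push", "reset", "syn", "fin"]
attributes = [
    ("packet_size", "numeric"),
    ("protocol", ["tcp", "udp"]),        # assumed nominal values
    ("source_port", "numeric"),
    ("destination_port", "numeric"),
    ("window_size", "numeric"),
] + [(flag, ["0", "1"]) for flag in FLAGS] + [
    ("class", ["http", "rtp", "smtp"]),  # assumed class labels for supervised training
]

# One made-up packet: 13 feature values followed by its class label.
rows = [[1514, "tcp", 80, 51234, 65535, 0, 0, 0, 1, 0, 0, 0, 0, "http"]]
print(to_arff("ip_traffic", attributes, rows))
```

The class attribute is placed last because WEKA's classifiers conventionally treat the final attribute as the target unless told otherwise.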
5. Neural Networks
Neural Networks (NN), which are simplified models of the biological neuron system, are massively parallel distributed processing systems made up of highly interconnected neural computing elements that have the ability to learn and thereby acquire knowledge and make it available for use.
Various learning mechanisms exist to enable the NN to acquire knowledge. NN architectures have been classified into various types based on their learning mechanisms. The ability of an NN to learn is called ‘training’ and the ability of an NN to solve a problem using the acquired knowledge is called ‘inference’.
A human brain develops with time, and this is generally known as ‘experience’. Technically, this involves the ‘development’ of neurons to adapt themselves to their surrounding environment, thus rendering the brain ‘plastic’ in its information processing capability. On similar lines, the property of plasticity is available with NN architectures. Further, the ‘stability’ of an NN is also desired, i.e., the ability of the NN to retain what it has learnt in the face of a changing environment. This is so since NN systems, being essentially learning systems, need to preserve the information learnt but, at the same time, need to be receptive to learning new information. The NN needs to remain ‘plastic’ to significant or useful information, but remain ‘stable’ when presented with irrelevant information.
5.1 THE MULTILAYER PERCEPTRON (MLP)
The Multilayer Perceptron (MLP) is the most common neural network model used currently. A multilayer perceptron is a feed-forward artificial neural network model that maps sets of input data onto a set of appropriate outputs. Feed-forward means that signals flow in one direction only, from the input layer through the hidden layers to the output layer, with no feedback loops. This type of neural network is a supervised network because it requires a desired output in order to learn. The training data consist of pairs of input objects (typically vectors) and desired outputs. The output of the function can be a continuous value, or can predict a class label of the input object (called classification). The task of the supervised learner is to predict the value of the function for any valid input object after having seen a number of training examples (i.e. pairs of input and target output). To achieve this, the learner has to generalize from the presented data to unseen situations in a reasonable way. Hence, the goal of this type of network is to create a model that correctly maps the input to the output using historical data, so that the model can then be used to produce the output when the desired output is unknown.
FIG 1 A representation of an MLP network
It is made of neurons characterized by a bias and weighted links between them. The neurons receive the inputs and normalize them before forwarding them. It has an input layer and an output layer, with one or more hidden layers of nonlinearly-activating nodes. Each node in one layer connects with a certain weight wij to every node in the following layer. The inputs are fed into the neurons of the input layer and are multiplied by the interconnection weights as they pass from the input layer to the first hidden layer. Each neuron in any subsequent layer first computes a linear combination of the outputs of the previous layer. The output of the neuron is then a function f of that combination, with f being linear for output neurons and a sigmoid for hidden-layer neurons.
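The forward pass described above can be sketched in a few lines. This is a minimal illustration with one hidden layer, assuming the logistic function as the sigmoid. The function names and the weight values are hypothetical, and training (e.g. by backpropagation) is not shown.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, used here for the hidden neurons."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases, activation):
    """Each neuron computes activation(sum_i w_i * x_i + bias)."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(activation(z))
    return outputs

def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """One hidden layer with sigmoid units followed by linear output units."""
    hidden = layer_forward(x, hidden_w, hidden_b, sigmoid)
    return layer_forward(hidden, out_w, out_b, lambda z: z)

# Two inputs, two hidden neurons, one output neuron (made-up weights).
y = mlp_forward([1.0, 0.5],
                hidden_w=[[0.2, -0.4], [0.7, 0.1]], hidden_b=[0.0, -0.3],
                out_w=[[1.5, -2.0]], out_b=[0.1])
print(y)
```

Each row of a weight matrix holds the incoming weights of one neuron, so the number of rows equals the number of neurons in that layer.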
5.2 THE RADIAL BASIS FUNCTION (RBF)
RBF networks have three layers:

FIG 2: A block diagram of an RBF network
Input layer – There is one neuron in the input layer for each predictor variable. In the case of categorical variables, N-1 neurons are used where N is the number of categories. The input neurons then feed the values to each of the neurons in the hidden layer.
Hidden layer – This layer has a variable number of neurons (the optimal number is determined by the training process). Each neuron consists of a radial basis function centered on a point with as many dimensions as there are predictor variables. The spread (radius) of the RBF function may be different for each dimension. The centers and spreads are determined by the training process. When presented with the x vector of input values from the input layer, a hidden neuron computes the Euclidean distance of the test case from the neuron’s center point and then applies the RBF kernel function to this distance using the spread values. The resulting value is passed to the summation layer.
Summation layer – The value coming out of a neuron in the hidden layer is multiplied by a weight associated with the neuron (W1, W2, ...,Wn in this figure) and passed to the summation which adds up the weighted values and presents this sum as the output of the network.
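The three layers can be sketched as follows. The text does not name the kernel, so a Gaussian kernel is assumed here. The function names, centers, spreads and weights are hypothetical, and the training step that would determine them is not shown.

```python
import math

def gaussian_rbf(x, center, spreads):
    """Hidden neuron: Gaussian of the spread-scaled Euclidean distance to the center.

    The Gaussian kernel is an assumption; each dimension may have its own spread.
    """
    scaled_sq = sum(((xi - ci) / si) ** 2 for xi, ci, si in zip(x, center, spreads))
    return math.exp(-0.5 * scaled_sq)

def rbf_forward(x, centers, spreads, weights):
    """Summation layer: weighted sum of the hidden-neuron outputs."""
    activations = [gaussian_rbf(x, c, s) for c, s in zip(centers, spreads)]
    return sum(w * a for w, a in zip(weights, activations))

# Two hidden neurons over a two-dimensional input (made-up parameters).
centers = [[0.0, 0.0], [1.0, 1.0]]
spreads = [[1.0, 1.0], [0.5, 2.0]]
weights = [0.8, -0.3]
print(rbf_forward([0.2, 0.9], centers, spreads, weights))
```

A hidden neuron responds most strongly (with value 1) when the input coincides with its center, and its response decays as the input moves away.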
7. RESULTS AND DISCUSSION
A comparison of MLP and RBF on the basis of precision, false positive rate, and recall rate is given. It is clear from the bar-graph that the precision, false positive rate and recall rate of MLP are better than those of RBF for the given traffic data. One might therefore infer that MLP is outright better, but model building time must still be considered.
Model Building Time means the time taken by the network to build a model using the training data available to it, so that any further data presented to it can be accurately classified into suitable classes.
It is clear from the table in Fig 3 that the model building time of RBF is better than that of MLP for the given traffic data. It means that the RBF network is faster than MLP. The important thing is that real time applications require better build time.
Fig 3: Comparison of MLP & RBF
8. CONCLUSION
We collected real time IP traffic data from the Internet using the Ethereal software. We collected 259 instances with 13 attributes, and reached the following conclusions:
· When the two algorithms (MLP & RBF) were run on this dataset, we observed that the MLP algorithm gives the higher classification accuracy of the two. The maximum accuracy achieved is 75.28%. Also, the Precision of Classification, Recall Rate, and False Positive Rate of MLP are better than those of the RBF network, but the Model Building Time of RBF is nearly six times shorter than that of the MLP network (1.24 seconds in contrast with 7.08 seconds taken by the MLP network).
· Even though MLP fares better on four parameters (discussed above), it lags behind in Model Building Time, which is quite an important parameter and in fact a driving force in the selection of the algorithm for real time applications. So here, RBF scores over the MLP network.
· The accuracy can be improved by increasing the number of instances, since the greater the number of instances, the better the network is trained and the more accurate its results.
So, the gist of the above discussion is that MLP, which is more accurate, is suitable for applications that demand accuracy more than a short model building time. On the other hand, in real time applications, where time is a constraint, the RBF network is more suitable.
9. FUTURE SCOPE
Internet Traffic Classification is a very wide area, in which a lot of research is going on at different organizations around the world. We have just embarked upon the classification of known Internet traffic into some pre-defined classes. Further, we can make use of other approaches to classify Internet traffic. For example, unsupervised learning with clustering techniques can be a good alternative. A new algorithm can be implemented in the WEKA software to classify Internet traffic. A dataset with a larger number of attributes can be collected by exploring other networking software, because the more features that are available about the traffic, the more accurately it can be classified. But with an increase in the number of attributes, the computational complexity of the algorithm will increase, leading to an increase in computation time and thereby jeopardizing real time applications. So we need a trade-off between the two parameters. In future we expect to get a better algorithm that will bolster this method of IP classification in accurately predicting malicious attacks or warding off unwanted packets. All in all, there are plenty of options available for extending the scope of this vast field of Internet Traffic Classification.
REFERENCES
[1] A. McGregor, M. Hall, P. Lorier, and J. Brunskill, “Flow Clustering Using Machine Learning Techniques,” in PAM 2004, Antibes Juan-les-Pins, France, April 19-20, 2004.
[2] A. Moore and D. Zuev, “Internet Traffic Classification Using Bayesian Analysis Techniques,” in SIGMETRICS’05, Banff, Canada, June 6-10, 2005.
[3] S. Zander, T. Nguyen, and G. Armitage, “Self-Learning IP Traffic Classification Based on Statistical Flow Characteristics,” in PAM 2005, Boston, USA, March 31-April 1, 2005.
[4] “Automated Traffic Classification and Application Identification using Machine Learning,” in LCN’05,
[5] A. Dempster,
[6] IANA. Internet Assigned Numbers Authority (IANA), “http://www.iana.org/assignments/port-numbers.”
[7] T. Karagiannis, A. Broido, M. Faloutsos, and K. claffy, “Transport Layer Identification of P2P Traffic,” in IMC’04, Taormina, Italy, October 25-27, 2004.
[8] P. Haffner, S. Sen, O. Spatscheck, and D. Wang, “ACAS: Automated Construction of Application Signatures,” in SIGCOMM’05 Workshops,
[9] A. Moore and K. Papagiannaki, “Toward the Accurate Identification of Network Applications,” in PAM 2005,
[10] S. Sen, O. Spatscheck, and D. Wang, “Accurate, Scalable In-Network Identification of P2P Traffic Using Application Signatures,” in WWW2005,
[11] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, “BLINK: Multilevel Traffic Classification in the Dark,” in SIGCOMM’05,
[12] P. Cheeseman and J. Strutz, “Bayesian Classification (AutoClass): Theory and Results.” In Advances in Knowledge Discovery and Data Mining, AAI/MIT
[13] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, “Class-of-Service Mapping for QoS: A Statistical Signature-based Approach to IP Traffic Classification,” in IMC’04,
[14]
second ed.
[15] Ethereal software for data capturing, http://www.ethereal.com
Technical paper presented at a National conference in Udaipur, Rajasthan
A Preliminary Performance Comparison of Two Clustering Algorithms for Practical IP Traffic Classification
Sunil Agrawal, Sameer Sharma and B.S. Sohi
UIET,
Email: s.agrawal@hotmail.com, smrshrm20@gmail.com
ABSTRACT
The early detection of applications associated with TCP flows is an essential step for network security and traffic engineering. The classic way to identify flows, i.e. looking at port numbers, is not effective anymore. In this paper, we propose a technique that uses an unsupervised machine learning approach for Internet traffic identification.
Our unsupervised approach uses the Simple K-Means clustering algorithm, and we compare the results with an efficient version of this clustering algorithm, Fast K-Means, on the grounds of Model Building Time, i.e., the speed with which the data is clustered by the algorithms. We find that Fast K-Means takes less time when the number of clusters ranges from 2 to 22, and afterwards it gets slower than Simple K-Means. We also find that the unsupervised technique can be used to discover traffic from previously unknown applications and has the potential to become an excellent tool for exploring Internet traffic.
Keywords
Traffic classification, Machine Learning, K-means.
1. INTRODUCTION
Previous works have proposed a number of methods to identify the application associated with a traffic flow. The simplest approach consists of examining TCP port numbers. Port-based methods are simple because many well-known applications have specific port numbers (for instance, HTTP traffic uses port 80 and FTP port 21). However, the research community now recognizes that port-based classification is inadequate, mainly because many applications use dynamic port-negotiation mechanisms to hide from firewalls and network security tools. An alternative approach is to inspect the payload of every packet. This technique can be extremely accurate when the payload is not encrypted, but it is an unrealistic alternative. First, there are privacy concerns with examining user data. Second, there is a high storage and computational cost to study every packet that traverses a link (in particular at very high-speed links).
2. BACKGROUND AND RELATED WORK
There has been much recent work in the field of traffic classification. This section will survey the different techniques presented in the literature.
A. Port Number Analysis
Historically, traffic classification techniques used well-known port numbers to identify Internet traffic. This was successful because many traditional applications use fixed port numbers assigned by IANA [6]. For example, email applications commonly use port 25. Karagiannis et al. [7] have shown this technique to be ineffective for some applications, such as the current generation of P2P applications, which intentionally try to disguise their traffic by using dynamic port numbers or by masquerading as well-known applications. In addition, only those applications whose port numbers are known in advance can be identified.
B. Payload-based Analysis
Another well-researched approach is the analysis of packet payloads [7]–[10]. In this approach, the packet payloads are analyzed to see whether or not they contain characteristic signatures of known applications. These approaches have been shown to work very well for Internet traffic, including P2P traffic. However, these techniques also have drawbacks. First, payload analysis poses privacy and security concerns. Second, these techniques typically require increased processing and storage capacity. Third, these approaches are unable to cope with encrypted transmissions. Finally, these techniques only identify traffic for which signatures are available and are unable to classify previously unknown traffic.
C. Transport-layer heuristics
Transport-layer heuristic information has been used to address the drawbacks of payload-based analysis and the diminishing effectiveness of port-based identification. Karagiannis et al. propose a novel approach that uses the unique behaviors of P2P applications when they are transferring data or making connections to identify this traffic [7]. This approach is shown to perform better than port-based classification and equivalent to payload-based analysis. In addition, Karagiannis et al. created another method that uses the social, functional, and application behaviors to identify all types of traffic [11].
D. Machine Learning Approaches
Machine learning techniques generally consist of two parts: model building and then classification. A model is first built using training data. This model is then given to a classifier, which classifies a data set. Machine learning techniques can be divided into the categories of unsupervised and supervised. McGregor et al. hypothesize that an unsupervised approach can group flows based on connection-level (i.e., transport layer) statistics to classify traffic [1]. In this method, an EM algorithm [5] is used, and McGregor et al. conclude that the approach is promising. In [3] and [4], Zander et al. extend this work by using an EM algorithm called AutoClass [12] and find the optimal set of attributes to use for building the classification model. Some supervised machine learning techniques, such as [13] and [2], also use connection-level statistics to classify traffic. In [13], Roughan et al. use nearest neighbor and linear discriminant analysis. This approach is limited because it does not classify HTTP traffic and uses a limited number of connection-level statistics. In [2], Moore et al. suggest using Naïve Bayes as a classifier and show that the Naïve Bayes approach has a high accuracy for classifying traffic.
3. UNSUPERVISED CLUSTERING:-
To extract groups of flows that share a common communication behavior, we borrow techniques from machine learning. We use unsupervised clustering because it relies on unlabeled data samples to find natural groups (or clusters) in a dataset, whereas supervised learning uses a pre-labeled set of samples to construct a model for each class. Although a traffic classification mechanism may also use Naive Bayes classifiers, an example of supervised learning, unsupervised learning is more appropriate for traffic classification because it does not rely on pre-defined classes. A single application can have multiple behaviors which should be modeled separately.
We use two versions of the K-Means clustering algorithm for this application, and compare them on the grounds of speed, i.e., model building time.
Data Set used for comparison of Algorithms:-
The algorithms have been run on IP traffic data, having 248 attributes and 24863 instances. This is the IP Traffic data of
http://www.dcs.qmul.ac.uk/research/nrl
http://www.cl.cam.ac.uk/Research/SRG/netos/nprobe/data/papers/sigmetrics/index.html
System Configuration and Software Platform used for comparison:-
The software on which the algorithms have been compared is WEKA
The computer system on which the algorithms have been run has an Intel Core 2 Duo processor running at 3 GHz and 3 GB of DDR2 RAM.
2. BRIEF THEORY OF K-MEANS CLUSTERING ALGORITHM
It is an algorithm for partitioning (or clustering) N data points into K disjoint subsets (or clusters) Sj so as to minimize the sum-of-squares criterion

$$J = \sum_{j=1}^{K} \sum_{n \in S_j} \lVert x_n - \mu_j \rVert^2$$

where xn is a vector representing the nth data point and µj is the geometric centroid of the data points in subset (cluster) Sj. In general, the algorithm does not achieve a global minimum of J over the assignments. In fact, since the algorithm uses discrete assignment rather than a set of continuous parameters, the "minimum" it reaches cannot even be properly called a local minimum. Despite these limitations, the algorithm is used fairly frequently as a result of its ease of implementation.
The algorithm consists of a simple re-estimation procedure as follows. Initially, the data points are assigned at random to the sets. For step 1, the centroid is computed for each set. In step 2, every point is assigned to the cluster whose centroid is closest to that point. These two steps are alternated until a stopping criterion is met, i.e., when there is no further change in the assignment of the data points.
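The re-estimation procedure above can be sketched as follows. This is a plain illustration of the two alternating steps, not the WEKA implementation used in the experiments. The function names, the seeded random generator, the iteration cap and the re-seeding of an empty cluster with a random data point are assumptions added to keep the example self-contained.

```python
import random

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def centroid(points):
    """Geometric centroid (component-wise mean) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Initially, the data points are assigned at random to the k sets.
    labels = [rng.randrange(k) for _ in points]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 1: compute the centroid of each set.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            # Assumption: an empty set is re-seeded with a random data point.
            centroids[j] = centroid(members) if members else rng.choice(points)
        # Step 2: assign every point to the cluster with the closest centroid.
        new_labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # stop when the assignment no longer changes
            break
        labels = new_labels
    return labels, centroids

def sum_of_squares(points, labels, centroids):
    """The criterion that K-Means tries to minimize."""
    return sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (100.0, 100.0), (100.0, 101.0), (101.0, 100.0)]
labels, centroids = k_means(points, k=2)
print(labels, sum_of_squares(points, labels, centroids))
```

Because the starting assignment is random, different seeds can lead to different final clusterings on less clearly separated data, which is one reason the result cannot be called a global minimum.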
The K-Means has an input parameter of K. It represents the number of disjoint partitions used by K-Means. In our data set, we would expect there would be at least one cluster for each traffic class. In addition, due to diversity of traffic in some classes such as HTTP (e.g. browsing, bulk download, streaming), we expect even more clusters to be formed.
So, initially, K is set to 2, and then incremented in steps of 2 up to a maximum of 28. The model building time of both algorithms has been recorded. The number of instances in the respective clusters in each simulation is exactly the same for both algorithms. This means that Fast K-Means produces the same clustering as Simple K-Means and differs from it only in algorithmic efficiency.
4. EXPERIMENTAL RESULTS

When these two algorithms are run on WEKA, one by one, with the input parameter, i.e., Number of clusters, being varied from 2 to 28, on the data set specified above, the

· Fast K-Means: represented by white column
· Simple K-Means: represented by black column
Relative Comparison of Fast K-Means and Simple K-Means
5. CONCLUSION
This paper presented an unsupervised machine learning approach (Simple K-Means and Fast K-Means) for Internet traffic classification. We used qualitative and quantitative results to compare these two algorithms on the grounds of time. We show that the time required to cluster using Fast K-Means is less if the number of clusters (or classes) is within the range of 2-22. After that, Simple K-Means performs better. The Fast K-Means algorithm is similar to the Simple K-Means algorithm, with some algorithmic optimizations. It takes advantage of the geometric properties of the K-Means clustering algorithm to reduce runtime. The improvement in Fast K-Means is also a function of the dimensionality of the data presented to it. Also, the reduction in model building time for Fast K-Means can be very helpful when working with IP traffic classification in real time. When the number of instances in the data is large enough, Fast K-Means clusters the data in less time. Further, we can assign classes to the clusters found, so that whenever a new instance arrives it can immediately be classified accordingly, providing the network administrator with an excellent tool for accurate and timely classification of real time IP traffic.
6. FUTURE WORK
At this point, we have just compared the two unsupervised clustering algorithms on the grounds of model building time, i.e., speed of clustering. Here each instance of the data has 248 flow features (the full feature set). But classification could have been done with a reduced feature set, with acceptable accuracy. Our future work involves reducing the feature set with the help of correlation-based search algorithms, so as to reduce the number of features drastically (to about 5-10% of the original number of attributes). We will then try to compare these algorithms not only on the basis of model build time, but also on the basis of classification accuracy.
REFERENCES
[1] A. McGregor, M. Hall, P. Lorier, and J. Brunskill, “Flow Clustering Using Machine Learning Techniques,” in PAM 2004,
[2] A. Moore and D. Zuev, “Internet Traffic Classification Using Bayesian Analysis Techniques,” in SIGMETRICS’05, Banff, Canada, June 6-10, 2005.
[3] S. Zander, T. Nguyen, and G. Armitage, “Self-Learning IP Traffic Classification Based on Statistical Flow Characteristics,” in PAM 2005, Boston, USA, March 31-April 1, 2005.
[4] ——, “Automated Traffic Classification and Application Identification using Machine Learning,” in LCN’05,
[5] A. Dempster,
[6] IANA. Internet Assigned Numbers Authority (IANA), “http://www.iana.org/assignments/port-numbers.”
[7] T. Karagiannis, A. Broido, M. Faloutsos, and K. claffy, “Transport Layer Identification of P2P Traffic,” in IMC’04, Taormina, Italy, October 25-27, 2004.
[8] P. Haffner, S. Sen, O. Spatscheck, and D. Wang, “ACAS: Automated Construction of Application Signatures,” in SIGCOMM’05 Workshops,
[9] A. Moore and K. Papagiannaki, “Toward the Accurate Identification of Network Applications,” in PAM 2005,
[10] S. Sen, O. Spatscheck, and D. Wang, “Accurate, Scalable In-Network Identification of P2P Traffic Using Application Signatures,” in WWW2005,
[11] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, “BLINK: Multilevel Traffic Classification in the Dark,” in SIGCOMM’05,
[12] P. Cheeseman and J. Strutz, “Bayesian Classification (AutoClass): Theory and Results.” In Advances in Knowledge Discovery and Data Mining, AAI/MIT
[13] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, “Class-of-Service Mapping for QoS: A Statistical Signature-based Approach to IP Traffic Classification,” in IMC’04,
[14] I.
[15] A. Banerjee and J. Langford, “An Objective Evaluation of Criterion for Clustering,” in KDD’04,
[16] Auckland Data Sets, http://www.wand.net.nz/wand/wits/auck/.
[17] V. Paxson, “Empirically-Derived Analytic Models of Wide-Area TCP Connections,” IEEE/ACM Transactions on Networking, vol. 2, no. 4, pp. 316–336, August 1998.
[18] C. Colman, “What to do about P2P?” Network Computing Magazine, vol. 12, no. 6, 2003.
[19] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data.
[20] J. Erman, M. Arlitt, and A. Mahanti, “Traffic Classification using Clustering Algorithms,” in SIGCOMM’06 MineNet Workshop, Pisa, Italy, September 2006.
