Saturday, July 31, 2010

What is a Pixel?

"Pixel" is short for "picture element"; it is the smallest addressable element of a digital image.
The number of colours in a picture depends on the bits per pixel.
A pixel need not be square.
The standard aspect ratios are 4:3 and 3:2.
The number of pixels is not the only factor that determines resolution.
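
Since the number of representable colours grows as 2 to the power of the bits per pixel, here is a quick illustrative sketch in plain Python (the 2048 x 1536 resolution at the end is just an example of a 4:3 frame, not a reference value):

    # Number of distinct colours a pixel can take grows as 2 ** bits_per_pixel.
    def colour_count(bits_per_pixel: int) -> int:
        return 2 ** bits_per_pixel

    # Total pixel count (usually quoted in megapixels) for a given resolution.
    def megapixels(width: int, height: int) -> float:
        return width * height / 1_000_000

    for bpp in (1, 8, 24):
        print(f"{bpp:2d} bpp -> {colour_count(bpp):,} colours")   # 2, 256, 16,777,216
    print(f"2048 x 1536 -> {megapixels(2048, 1536):.1f} MP, aspect {2048 / 1536:.2f}:1")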

Wednesday, July 28, 2010

Quotations

·       It is possible to overcome the waves of the ocean and then drown in a glass of water.
·       Monsoons are the creation of Mother Nature; it is only because of their showers that we have a green earth in this black universe.
·       I was born intelligent, but education ruined me.
·       O’ GOD, GIVE ME THE SERENITY TO ACCEPT THE THINGS I CANNOT CHANGE, THE COURAGE TO CHANGE THE THINGS I CAN, AND THE WISDOM TO KNOW THE DIFFERENCE.
·       LET YOUR WORK BE ONCE BEGUN,
DO NOT LEAVE TILL IT IS DONE.
            LET YOUR WORK BE GREAT OR SMALL,
            DO IT WHOLE OR NOT AT ALL..

comparison of two internet technologies


Tuesday, July 27, 2010

DTH-SATELLITE TV

DTH (Direct to Home)

DTH stands for Direct-to-Home. It is a wireless system for delivering television programming directly to a viewer's house.

Components of a Satellite TV System/ DTH System

There are five major components involved in a direct-to-home (DTH) or direct broadcast satellite (DBS) system: the programming source, the broadcast center, the satellite, the satellite dish and the receiver.
·         Programming sources are simply the channels that provide programming for broadcast. The provider doesn't create original programming itself; it pays other companies (HBO, for example, or ESPN) for the right to broadcast their content via satellite. In this way, the provider is kind of like a broker between you and the actual programming sources. (Cable TV companies work on the same principle.)
·         The broadcast center is the central hub of the system. At the broadcast center, the TV provider receives signals from various programming sources and beams a broadcast signal to satellites in geosynchronous orbit.
·         The satellites receive the signals from the broadcast station and rebroadcast them to Earth.
·         The viewer's dish picks up the signal from the satellite (or multiple satellites in the same part of the sky) and passes it on to the receiver in the viewer's house.
·         The receiver processes the signal and passes it on to a standard TV.


Satellite TV Signal

Satellite signals have a very long path to follow before they appear on your TV screen. Because satellite signals contain such high-quality digital data, it would be impossible to transmit them without compression. Compression simply means that unnecessary or repetitive information is removed from the signal before it is transmitted. The signal is reconstructed after transmission.

Standards of Compression

Satellite TV uses a special type of video file compression standardized by the Moving Picture Experts Group (MPEG). With MPEG compression, the provider is able to transmit significantly more channels. There are currently five of these MPEG standards, each serving a different purpose. DirecTV and DISH Network, the two major satellite TV providers in the United States, once used MPEG-2, which is still used to store movies on DVDs and for digital cable television (DTV). With MPEG-2, the TV provider can reduce the 270-Mbps stream to about 5 or 10 Mbps (depending on the type of programming).
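
A quick back-of-the-envelope check of the figures quoted above (a roughly 270-Mbps uncompressed stream delivered at about 5 to 10 Mbps after MPEG-2 compression); only the bitrates from the paragraph above are used in this small Python sketch:

    # Rough compression ratio implied by the bitrates quoted above.
    raw_mbps = 270
    for compressed_mbps in (5, 10):
        print(f"{raw_mbps} Mbps -> {compressed_mbps} Mbps  (~{raw_mbps / compressed_mbps:.0f}:1 compression)")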

Encryption and Transmission 

After the video is compressed, the provider encrypts it to keep people from accessing it for free. Encryption scrambles the digital data in such a way that it can only be decrypted (converted back into usable data) if the receiver has the correct decryption algorithm and security keys.
Once the signal is compressed and encrypted, the broadcast center beams it directly to one of its satellites. The satellite picks up the signal with an onboard dish, amplifies the signal and uses another dish to beam the signal back to Earth, where viewers can pick it up.
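
The real conditional-access encryption used by DTH providers is proprietary, so purely as a toy illustration of the idea that scrambled data is useless without the matching key, here is a hedged XOR-keystream sketch in Python (not the actual scheme):

    import itertools

    def xor_scramble(data: bytes, key: bytes) -> bytes:
        # XOR every byte with a repeating key; applying the same key again restores the data.
        return bytes(b ^ k for b, k in zip(data, itertools.cycle(key)))

    payload = b"compressed video frame"      # stand-in for an MPEG-2 packet
    key = b"\x5a\x13\x88"                    # stand-in for the receiver's security key
    scrambled = xor_scramble(payload, key)
    assert xor_scramble(scrambled, key) == payload   # only the matching key recovers the frame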

India’s DTH Provider- Tata Sky

It uses MPEG-2 digital compression technology and transmits using INSAT 4A at 83.0° E. It is a joint venture between the TATA Group, which owns 80%, and the STAR Group, which owns a 20% stake. It was launched in 2006 and offers close to 173 channels, including some interactive channels.

Monday, July 26, 2010

Beautiful locations of north india




Neural Networks

Neural Networks (NN), which are simplified models of the biological neuron system, are massively parallel distributed processing systems made up of highly interconnected neural computing elements that have the ability to learn and thereby acquire knowledge and make it available for use.
Various learning mechanisms exist to enable an NN to acquire knowledge. NN architectures have been classified into various types based on their learning mechanisms. The ability of an NN to learn is called ‘training’, and the ability of an NN to solve a problem using the acquired knowledge is called ‘inference’.
A human brain develops with time, and this is generally known as ‘experience’. Technically, this involves the ‘development’ of neurons to adapt themselves to their surrounding environment, thus rendering the brain ‘plastic’ in its information-processing capability. On similar lines, the property of plasticity is available in NN architectures. Further, the ‘stability’ of an NN is also desired, i.e., its adaptive capability in the face of a changing environment. This is because NN systems, being essentially learning systems, need to preserve the information already learnt but at the same time remain receptive to learning new information. The NN needs to remain ‘plastic’ to significant or useful information but ‘stable’ when presented with irrelevant information. Neural networks, with their remarkable ability to derive meaning from complicated or imprecise data, can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques.
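
As a minimal, hedged illustration of ‘training’ and ‘inference’ in this sense, here is a single artificial neuron (a perceptron) learning the logical AND function in plain Python; it is only a sketch of the general idea, not any particular architecture discussed above:

    # A single neuron: weighted sum of the inputs, thresholded to 0 or 1.
    def predict(weights, bias, x):
        return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

    samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]   # truth table of AND

    weights, bias, lr = [0.0, 0.0], 0.0, 0.1
    for _ in range(20):                                  # 'training': adjust weights from examples
        for x, target in samples:
            error = target - predict(weights, bias, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error

    print([predict(weights, bias, x) for x, _ in samples])   # 'inference': prints [0, 0, 0, 1]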

Sunday, July 25, 2010

RIBHU ON WINDOWS 7

Software is said to be in its beta stage when it is released for testing purposes only and the developer provides no technical support with it. This is the somewhat formal definition of beta software as known to all of us. Betas, as we all know, are for testing only, and the developers strongly recommend that they should not be used for mainstream applications. Still, there is a kind of excitement in using these programs. It’s like exploring some uncharted territory, where you don’t know what to expect next and you may be welcomed with a delightful new feature at any moment. Now, talking of software, the piece of software used most commonly by most of us is the operating system. As depicted in Microsoft ads a few years back, a computer without an operating system is just like a garbage can. Also, we know that the most commonly used desktop operating system on planet Earth is Microsoft Windows. And those familiar with Windows must be aware of the fact that another version of Windows, called Windows 7, is on the anvil and is expected to appear on the shelves sometime later this year. So, as a part of the testing cycle for this software, Microsoft Corporation has (had, until you read this) released a beta version of this version of Windows. Actually, Microsoft released some 2.5 million copies for download and testing purposes. To clear some doubts here: Microsoft is NOT giving out free Windows; it is just a pre-release evaluation version of Windows that will expire as soon as the clock strikes the beginning of August 1, 2009. Now, coming back to the download: as said earlier, Microsoft released 2.5 million copies of the Windows 7 beta, a 2.44 GB download, and needless to mention, one of these was downloaded by none other than yours truly.
So here I will be reviewing the Windows 7 Beta for you, starting with the installation.
The installation process is similar to Windows Vista, the only difference being that a somewhat blue background replaces the blue-green aura of Vista. The install took about 22 minutes on my AMD Athlon 4400+ desktop with 2 GB RAM.
The user interface: All I can say here is have a look for yourself. The UI is sleeker than ever before. The taskbar and the start menu have been redesigned to give a better working environment. The desktop peek makes all the windows transparent to give a view of the desktop. You can pin applications to the taskbar like the old quick-launch bar, except that a pinned application now returns to its icon when minimized. Taskbar buttons provide full-screen previews when moused over. Jump lists provide your recently used documents. The best thing about all this is that it is not at all hard to get used to and is intuitive, as Microsoft describes it.
Compatibility: When Windows Vista was released some two years back, one major problem was device and software compatibility. Nothing seemed to work with Vista, so no doubt no one liked it. In the past two years a lot of Vista-compatible hardware and software has been released, and the best thing about Windows 7 is that almost everything that works with Vista works with it (almost, because my antivirus software refused to; apart from that I had no problems with anything). Also, my hard drive had some compatibility issues with Windows Vista; those have disappeared now. I now have what they call a squeaky clean device manager. Also, AVG 8 Free is a good alternative for antivirus.
Libraries: Another thing I liked about this version of Windows. Instead of using a single folder as your My Documents or My Pictures, you can use a collection of multiple folders known as a library. This is actually helpful in finding documents and other similar things.
Touch: I don’t have any firsthand experience with this, but Microsoft describes it as the best feature of Windows 7. It requires a touch-screen display, and I am not buying one just for the sake of it.
Themes: Also an addition to the UI, these are an instant way to change the look and feel of Windows: colour schemes and wallpapers all clubbed into one. A total of five themes have been provided with Windows. However, I do miss the Windows Classic theme, which is somehow absent.
Applications: Almost all the standard applications of Windows are bundled, including the games. Also, we have a new calculator and a new user interface for Paint and WordPad. However, certain applications have been unbundled and have to be downloaded separately as Windows Live Essentials. This is an additional 173 MB download and includes applications such as Mail, Messenger and Photo Gallery.
As I have had only seven hours with this mind-blowing operating system, these are the only features I have discovered so far. But there lies a lot more hidden inside, and I am seriously looking forward to it. So, in the end, all I would like to say is three cheers for Microsoft and for Windows 7.

WEB TECHNOLOGIES

Today, the World Wide Web (WWW), alias the Internet, is ubiquitous. Everyone is connected to this information gateway in one way or the other. Even a kid in the second standard knows how to play online games on the Internet. Now, how does all this stuff work?
Let’s try to unfold the secrets layer-by-layer.
The most basic web page is an HTML page, which can be coded in Notepad and run in a web browser (IE, Mozilla, Opera, Safari, Mosaic, Google Chrome).
HTML means HyperText Markup Language, a language that the browser understands. We have three basic entities: a client, a server and a host. Taking an example, suppose you have to open your Gmail inbox. As a client, you send a request to the server where Gmail is hosted by specifying the hyperlink (complete address) of the web page, along with your user name and password. A server is software that acts as an intermediary between a client and a host. It takes the requests from the client, processes them, connects to the database (host) where your data (all the messages in your inbox) are stored, and returns the result to the client in HTML format (a minimal sketch of this request-response cycle appears at the end of this post). There are various server-side scripting languages, which run on the server, process user requests, connect to the database and output HTML to the client. They are as follows:-
  1. JSP: Java Server Pages. It is based on the Java language. It is the most secure, but simultaneously very costly to implement. There are three stages in Java technologies:
a)      Core Java, which deals with all the intricacies of the programming tools and the object-oriented paradigm, which serves as the basis of this language.
b)      Advanced Java, which deals with applet programming. An applet is a small application running within a larger program. You can embed various applets within your web pages to enhance the GUI, as applets follow event-driven programming. Example applets are a calculator, an analog clock, etc.
c)      Enterprise Java, which includes JSP and servlets, and various advanced features like EJB, AJAX, etc.

  2. ASP: Active Server Pages. It is Microsoft’s answer to Sun Microsystems’ Java language. Microsoft provides an integrated development environment, “Visual Studio”, for development in ASPX (an extension of ASP) with the .NET framework. Visual Studio is very developer-friendly and very easy to learn.

The most famous web server to install on your system to run JSP programs is “Apache Tomcat”, which you have to install explicitly. Also, you have to install an SQL database (use MySQL, it’s nice) to store all your application data. The advantage here is the freedom to choose various products, whereas with Visual Studio 2005/2008 you get an IIS server and SQL Server database integrated with the .NET Framework.
Also, Java is open source, whereas with Microsoft’s ASPX you have to use Microsoft’s products only. The choice is purely yours.

  3. PHP: Hypertext Preprocessor. To a beginner in web development, it is the most famous and easy-to-learn web technology. PHP servers are the least expensive in case you want to deploy your website, and XAMPP tool bundles are freely available.
In industry, the LAMP architecture is most famous: L -> Linux OS, A -> Apache web server, M -> MySQL database, P -> PHP language. And the X in XAMPP -> any operating system, like Windows, Linux, Mac OS, etc.
So, with all the different web technologies available, it all depends on the developer to choose the technology which satisfies nearly all his requirements.
With the availability of a large variety of Rapid Application Development (RAD) tools, developing a website of your own is just a matter of 2-3 hours. There are services available which give you a complete template of your website, all database connections, etc., and you just have to feed in the content. But this type of development is good for fun and experience. For serious and flawless web development, like the Google website, the developers have to be more mature, experienced and well-versed with the concepts of software engineering, which defines all the stages of development of a software product.
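
To make the request-response cycle described at the start of this post concrete, here is a minimal sketch using only Python's standard library; example.com simply stands in for any host, and the exact status, headers and HTML returned will vary:

    # The browser (client) sends an HTTP GET request; the server replies with HTML,
    # which the browser would then render. Here the exchange is done by hand.
    from http.client import HTTPConnection

    conn = HTTPConnection("example.com", 80)    # connect to the web server
    conn.request("GET", "/")                    # the client's request for a page
    response = conn.getresponse()               # the server's reply
    print(response.status, response.reason)     # e.g. 200 OK
    html = response.read().decode("utf-8", errors="replace")
    print(html[:200])                           # first few characters of the returned HTML
    conn.close()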

Thursday, April 29, 2010

A letter to GOD

Dear GOD,

First of all, my deep regards to you, and thanks for being the Generator, Operator & Destroyer of this Universe. I am so thankful to you for sending me to this earth in the times when greats like Sachin Tendulkar, Shahrukh Khan & Aamir Khan are in top-notch form, and I am so lucky to share this earth with such wonders.

People go to temples to pray in front of you when they are in some problem. But you have programmed me in such a way that I rarely go to temples. But yes, I am in a very big problem. So I thought of writing a letter to you, and I hope that my problem gets solved.

People say YOU are one; it's true, I admit. But to be present everywhere, to help every creature on this earth, is a formidable task even for YOU. So you have sent your agents in the form of FRIENDS to help people. I am also a very lucky person to have such wonderful friends.

But among all these friends of mine, I had a friend whom I considered to be YOU only. Every decision in my life, I used to take by consulting this friend (further referred to as X). I used to revere X, and blindly follow the decisions X took for me. I considered myself very lucky to have YOU in my life.

But as they say, only you understand your own games, God...

Recently, it was my day, the most wonderful day of my life. The best of my friends were sitting with me, and I was enjoying myself with them. Suddenly, I said such CRUEL words, and to whom? To X. And the result is that X has departed from my life.

The thing is, if I had meant even 1% of what I said, there would have been nothing to be sad about. But I don't know why or from where those words came out of my mouth, and before I could control the situation, the water had already gone over my head.

Now when I talk to or text X, I sometimes do get responses, but I know that in X's heart I am nowhere. All the respect I had gained over all these years is gone with just a single stroke.

Now, what do I want from YOU?
GOD, I just want you to please convey my message to my friend: I did not mean any of those words I said. They say that in bad times a person's mind stops working properly; maybe the same thing happened to me that day.
I apologise to my friend for what I have done. I hope YOU will make my friend understand that I am not a wrong person at heart, or a wrong guy to be friends with.
Now only YOU can help me, GOD. Please make my relationship with X the same pristine way it was earlier. Because undoubtedly, those were the best days of my life, when I had such wonderful friends.

Regards
Y.
X's Friend.

Saturday, April 17, 2010

Movie Review: PaathShaala

PaathShaala is a path-breaking film, there is no doubt about it.. I wish my path had broken before entering the multiplex to watch this awesome movie..
A question to Shahid... WHY MAN WHY?? Why did you do this to yourself, to your followers, and to the poor, so-called audience.. Did you read the script properly.. I mean, you are a good actor, man..
If you want to enjoy your weekend, I strongly suggest you NOT watch this movie, or else your weekend will be devastated and you will regret paying the money for the ticket to this movie..
Here is what happens on the screen, and to you in the hall..
By the way, for the first time ever I entered the hall 15 minutes late, and now I thank God that I was saved from 15 minutes of the torture.
Shahid joins SVM school as an English teacher, but also starts teaching music to the children out of passion, and at the request of Ayesha Takia (purely wasted in the movie).. The school is under financial pressure from the management, so it starts charging the kids exorbitantly.. The principal (Nana Patekar, the only saviour in an otherwise poor show of acting by everyone else, except Shahid) is bound to follow the management's decision, even though it turns out in the end that he was against it from the very beginning.
In the end, as usual, the hero encourages the students to go on strike and wake up the management, and succeeds.
Now the problem with this movie is that none of the subplots is handled properly. The direction could not be poorer than this. The screenplay is pathetic. The music is so-so. All the dialogues are very predictable.. There is no story in the movie. No chemistry between the hero and heroine.

The recess (interval) comes barely 40 minutes after the start of the movie, and you feel cheated..

All in all, the worst movie of the current decade, if not the century.
If I had a choice, I would take stars away from the movie, but as a reviewer, if I have to give it stars, I can't afford more than half a star..
Better luck next time..
And a suggestion to Shahid.. dude, read the script properly before taking up a project..

Friday, February 26, 2010

Movie Review: Kartik calling Kartik

Check out the review on the following link:-

http://www.bollywoodhungama.com/movies/review/14050/index.html

My rating:- 3.5 out of 5... But yes, Farhan is a fantastic actor, Deepika is looking damn good in the movie, and the music is amazing..

Movie Review: Teen Patti

Check out the movie review of Teen Patti at the following link:-


My rating... 4 out of 5. A MUST watch for mathematics lovers..

Tuesday, February 23, 2010

Technical paper presented at a National conference in NITTTR, Chandigarh

IP Traffic Classification Using Neural Networks

Sunil Agrawal1, Sameer Sharma2, Vivek Gupta, Vivek Sharma

UIET, Panjab University, Chandigarh.

ABSTRACT

The early detection of applications associated with TCP flows is an essential step for network security and traffic engineering. The classic way to identify flows, i.e. looking at port numbers, is not effective anymore. In this paper, we propose neural network techniques for Internet traffic identification. Two supervised neural networks are compared on the basis of precision, recall and model build time. We opened different websites in order to create Internet traffic in our lab, and captured the traffic flows with the help of ‘Ethereal’. The most significant features of the flows were then extracted and used for training the proposed neural network algorithms. We find that the MLP network outperforms the RBF network when the precision of classification is compared.

Keywords: Traffic classification, Machine Learning, Ethereal.

1. INTRODUCTION

Here is a brief discussion of a few classical approaches that have been used for traffic classification.

A. Port Number Analysis

Historically, traffic classification techniques used well-known port numbers to identify Internet traffic. This was successful because many traditional applications use fixed port numbers assigned by IANA [6]. For example, email applications commonly use port 25. This technique has been shown to be ineffective by Karagiannis et al. in [7] for some applications, such as the current generation of P2P applications, which intentionally try to disguise their traffic by using dynamic port numbers or masquerading as well-known applications. In addition, only those applications whose port numbers are known in advance can be identified.

B. Payload-based Analysis

Another well-researched approach is analysis of packet payloads [7]–[10]. In this approach, the packet payloads are analyzed to see whether or not they contain characteristic signatures of known applications. These approaches have been shown to work very well for Internet traffic, including P2P traffic. However, these techniques also have drawbacks. First, payload analysis poses privacy and security concerns. Second, these techniques typically require increased processing and storage capacity. Third, these approaches are unable to cope with encrypted transmissions. Finally, these techniques only identify traffic for which signatures are available and are unable to classify previously unknown traffic.

C. Transport-layer heuristics

Transport-layer heuristic information has been used to address the drawbacks of payload-based analysis and the diminishing effectiveness of port-based identification. Karagiannis et al. propose a novel approach that uses the unique behaviors of P2P applications when they are transferring data or making connections to identify this traffic [7]. This approach is shown to perform better than port-based classification and equivalent to payload-based analysis. In addition, Karagiannis et al. created another method that uses the social, functional, and application behaviors to identify all types of traffic [11].

Corresponding authors' email: 1. s.agrawal@hotmail.com  2. smrshrm20@gmail.com

2. PROBLEM STATEMENT

The classical approaches mentioned above cannot, individually, be used to classify IP traffic data.

Our proposal:

We shall use supervised neural network algorithms to solve this problem: RBF and MLP.

Machine Learning Approaches

Machine learning techniques generally consist of two parts: model building and then classification. A model is first built using training data. This model is then input to a classifier, which then classifies a data set. Machine learning techniques can be divided into the categories of unsupervised and supervised. McGregor et al. hypothesize the ability of using an unsupervised approach to group flows based on connection-level (i.e., transport-layer) statistics to classify traffic [1]. In this method, an EM algorithm [5] is used, and McGregor et al. draw the conclusion that this approach is promising. In [3] and [4], Zander et al. extend this work by using an EM algorithm called AutoClass [12] and find the optimal set of attributes to use for building the classification model. Some supervised machine learning techniques, such as [13] and [2], also use connection-level statistics to classify traffic. In [13], Roughan et al. use nearest neighbor and linear discriminant analysis. This approach is limited because it does not classify HTTP traffic and uses a limited number of connection-level statistics. In [2], Moore et al. suggest using Naïve Bayes as a classifier and show that the Naïve Bayes approach has a high accuracy for classifying IP traffic.

Performance metrics

Now we have to define certain parameters on which the two algorithms will be compared.

    1. MODEL BUILDING TIME

Model Building Time means the time taken by the network to build a model using the training data available to it, so that any further data presented to it can be accurately classified into suitable classes.

    2. PRECISION

Precision means the percentage of instances classified as belonging to class X that actually belong to class X. It is clear from the bar-graph that the precision of MLP is better than that of RBF for the given traffic data.

    3. FALSE POSITIVE RATE

False Positive (FP) Rate specifies the percentage of members of other classes incorrectly classified as belonging to class X.

    4. RECALL RATE

Recall Rate means the percentage of members of class X correctly classified as belonging to class X.

    5. ACCURACY

Accuracy means the ratio of the number of correctly classified instances to the total number of instances.
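
For concreteness, these quantities can be computed for a class X from counts of true/false positives and negatives; the following is a minimal Python sketch, and the counts used are made up for illustration rather than taken from our experiment:

    # tp: class-X flows classified as X        fp: other flows classified as X
    # fn: class-X flows classified as other    tn: other flows classified as other
    def metrics(tp, fp, fn, tn):
        return {
            "precision": tp / (tp + fp),
            "recall": tp / (tp + fn),
            "false_positive_rate": fp / (fp + tn),
            "accuracy": (tp + tn) / (tp + fp + fn + tn),
        }

    print(metrics(tp=40, fp=10, fn=5, tn=45))   # illustrative counts only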

3. EXPERIMENTS PERFORMED

· Collect Internet traffic data using networking software, and generate a data file in a suitable format, ready to be used by a software simulation tool.

· Classify the collected traffic data into the desired classes using neural networks (MLP & RBF), and compare the performance of these algorithms on various parameters for accurate classification of Internet traffic.

4. DATASET

Data capturing is done in order to obtain data from different websites (having different varieties of data). This is done with a software tool called Ethereal.

Ethereal [4] is a free packet analyzer application. It is used for network troubleshooting, analysis, software and communications protocol development, and education. In May 2006 the project was renamed Wireshark due to trademark issues.

We selected nine popular websites on the basis of the type of packets we could get from them. The websites used are:

1. ICQ.COM

It is a chatting website, including features of chatting with strangers over instant messaging. It works on the Instant Messaging and Presence Protocol over TCP. ICQ has a specific protocol by the name OSCAR (Open System for Communication in Real-time) provided by AOL (America Online).

2. YOUTUBE.COM

It is a multimedia website providing free video playing and downloading using Flash Player. It was basically selected to bring in big packets of multimedia download. It works on the Real-time Transport Protocol (RTP).

3. ZAPAK.COM

It is an online gaming website providing on-site gaming and downloads. It utilizes the Online Gaming Protocol (ONGP) for data transfer.

4. INDIANEXPRESS.COM

It is a simple HTTP (Hypertext Transfer Protocol) based website providing information and news.

5. GMAIL.COM

It is basically a mailing website supporting mail, online chatting and file transfer. On the front end it uses HTTP, but it also uses the Simple Mail Transfer Protocol (SMTP) and the Post Office Protocol (POP).

6. ESPN.COM

A simple sports related information and news website. It utilizes the HTTP protocol.

7. GEETGANGA.ORG

It is an informational website hosting poem texts and their download. It also utilizes the HTTP protocol.

8. SONGS.PK

A multimedia website supporting songs play and download. It utilizes the RTP protocol.

9. WIKIPEDIA.COM

A web based encyclopedia site. Again an HTTP website.

Feature selection

Features are to be selected from the captured packets; these are used to build the .arff file that is fed into WEKA. We have selected our attributes as:

1. packet size, 2. protocol, 3. source port, 4. destination port, 5. window size;
flag bits: 6. CWR, 7. ECN-echo, 8. urgent, 9. acknowledgement, 10. push, 11. reset, 12. SYN, 13. FIN.
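
A hedged sketch of what the corresponding .arff header might look like when written from Python; the attribute names, types and class labels below are illustrative placeholders, since WEKA only requires the @relation/@attribute/@data layout:

    # Write a minimal WEKA .arff file with the selected flow features.
    attributes = [
        ("packet_size", "NUMERIC"), ("protocol", "{TCP,UDP}"),
        ("src_port", "NUMERIC"), ("dst_port", "NUMERIC"), ("window_size", "NUMERIC"),
        ("cwr", "{0,1}"), ("ecn_echo", "{0,1}"), ("urgent", "{0,1}"), ("ack", "{0,1}"),
        ("push", "{0,1}"), ("reset", "{0,1}"), ("syn", "{0,1}"), ("fin", "{0,1}"),
        ("class", "{http,rtp,smtp,gaming,chat}"),   # illustrative class labels
    ]

    with open("traffic.arff", "w") as f:
        f.write("@relation ip_traffic\n\n")
        for name, arff_type in attributes:
            f.write(f"@attribute {name} {arff_type}\n")
        f.write("\n@data\n")
        f.write("1514,TCP,80,50213,65535,0,0,0,1,1,0,0,0,http\n")   # one made-up instance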

5. Neural Networks

Neural Networks (NN), which are simplified models of the biological neuron system, are massively parallel distributed processing systems made up of highly interconnected neural computing elements that have the ability to learn and thereby acquire knowledge and make it available for use.

Various learning mechanisms exist to enable an NN to acquire knowledge. NN architectures have been classified into various types based on their learning mechanisms. The ability of an NN to learn is called ‘training’, and the ability of an NN to solve a problem using the acquired knowledge is called ‘inference’.

A human brain develops with time, and this is generally known as ‘experience’. Technically, this involves the ‘development’ of neurons to adapt themselves to their surrounding environment, thus rendering the brain ‘plastic’ in its information-processing capability. On similar lines, the property of plasticity is available in NN architectures. Further, the ‘stability’ of an NN is also desired, i.e., its adaptive capability in the face of a changing environment. This is because NN systems, being essentially learning systems, need to preserve the information already learnt but at the same time remain receptive to learning new information. The NN needs to remain ‘plastic’ to significant or useful information but ‘stable’ when presented with irrelevant information.

5.1 THE MULTILAYER PERCEPTRON (MLP)

The Multilayer Perceptron (MLP) is the most common neural network model in use today. A multilayer perceptron is a feed-forward artificial neural network model that maps sets of input data onto a set of appropriate outputs. Feed-forward means that signals flow in one direction only, from the input layer towards the output layer, with no feedback connections. This type of neural network is a supervised network because it requires a desired output in order to learn. The training data consist of pairs of input objects (typically vectors) and desired outputs. The output of the function can be a continuous value, or it can predict a class label of the input object (called classification). The task of the supervised learner is to predict the value of the function for any valid input object after having seen a number of training examples (i.e., pairs of input and target output). To achieve this, the learner has to generalize from the presented data to unseen situations in a reasonable way. Hence, the goal of this type of network is to create a model that correctly maps the input to the output using historical data, so that the model can then be used to produce the output when the desired output is unknown.

FIG 1 A representation of an MLP network

It is made of neurons characterized by a bias and weighted links between them. The neurons receive the inputs and normalize them before forwarding them. It has an input and an output layer, with one or more hidden layers of nonlinearly-activating nodes. Each node in one layer connects with a certain weight wij to every node in the following layer. The inputs are fed into the neurons of the input layer and are multiplied by interconnection weights as they are passed from the input layer to the first hidden layer. Each neuron in any subsequent layer first computes a linear combination of the outputs of the previous layer. The output of the neuron is then a function f of that combination, with f being linear for output neurons or a sigmoid for hidden layers.
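
A minimal sketch of the forward pass just described, with one hidden layer of sigmoid units and a linear output unit; the weights below are random placeholders, not a trained traffic classifier:

    import math, random

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def layer(inputs, weights, biases, activation):
        # Each neuron: weighted sum of the previous layer's outputs, then an activation.
        return [activation(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
                for neuron_w, b in zip(weights, biases)]

    random.seed(0)
    n_in, n_hidden = 13, 4                              # 13 flow features, 4 hidden neurons
    w_hidden = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b_hidden = [0.0] * n_hidden
    w_out = [[random.uniform(-1, 1) for _ in range(n_hidden)]]
    b_out = [0.0]

    x = [random.random() for _ in range(n_in)]          # one (normalized) input flow
    hidden = layer(x, w_hidden, b_hidden, sigmoid)      # nonlinear hidden layer
    output = layer(hidden, w_out, b_out, lambda z: z)   # linear output neuron
    print(output)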

5.2 THE RADIAL BASIS FUNCTION (RBF)

RBF networks have three layers:

FIG 2: A block diagram of an RBF network

Input layer – There is one neuron in the input layer for each predictor variable. In the case of categorical variables, N-1 neurons are used where N is the number of categories. The input neurons then feed the values to each of the neurons in the hidden layer.

Hidden layer – This layer has a variable number of neurons (the optimal number is determined by the training process). Each neuron consists of a radial basis function centered on a point with as many dimensions as there are predictor variables. The spread (radius) of the RBF function may be different for each dimension. The centers and spreads are determined by the training process. When presented with the x vector of input values from the input layer, a hidden neuron computes the Euclidean distance of the test case from the neuron’s center point and then applies the RBF kernel function to this distance using the spread values. The resulting value is passed to the summation layer.

Summation layer – The value coming out of a neuron in the hidden layer is multiplied by a weight associated with that neuron (W1, W2, ..., Wn in this figure) and passed to the summation layer, which adds up the weighted values and presents this sum as the output of the network.
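
A minimal sketch of that computation: the Euclidean distance to each hidden neuron's centre, a Gaussian RBF kernel with a per-neuron spread, and a weighted sum in the summation layer; the centres, spreads and weights below are placeholders rather than values learned from our data:

    import math

    def rbf_output(x, centres, spreads, weights):
        hidden = []
        for c, s in zip(centres, spreads):
            dist = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))  # distance to the centre
            hidden.append(math.exp(-(dist ** 2) / (2 * s ** 2)))           # Gaussian RBF kernel
        return sum(w * h for w, h in zip(weights, hidden))                 # summation layer

    centres = [[0.2, 0.8], [0.7, 0.1], [0.5, 0.5]]   # placeholder centres in a 2-D input space
    spreads = [0.3, 0.3, 0.5]                        # placeholder per-neuron spreads
    weights = [1.0, -0.5, 0.8]                       # placeholder output weights

    print(rbf_output([0.25, 0.75], centres, spreads, weights))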

7. RESULTS AND DISCUSSION

A comparison of MLP and RBF on the basis of precision, false positive rate, and recall rate is given. It is clear from the bar-graph that the precision, false positive rate and recall rate of MLP are better than those of RBF for the given traffic data. So one may infer that MLP is outright better, but we still have to consider model building time.

Model Building Time means the time taken by the network to build a model using the training data available to it, so that any further data presented to it can be accurately classified into suitable classes.

It is clear from the table in Fig 3 that the model building time of RBF is better than that of MLP for the given traffic data. It means that the RBF network is faster than MLP. The important thing is that real-time applications require better build time.

Fig 3: Comparison of MLP & RBF

8. CONCLUSION

We have collected real-time IP traffic data from the Internet using the Ethereal software. We collected 259 instances with 13 attributes, and the following conclusions were reached:

· When the two algorithms (MLP & RBF) were run on this dataset, we observed that the MLP algorithm gives the maximum accuracy in classification of the dataset compared to RBF. The maximum accuracy achieved is 75.28%. Also, the Precision of Classification, Recall Rate, and False Positive Rate of MLP are better than those of the RBF network, but the Model Building Time of RBF is about six times better than that of the MLP network (taking only 1.24 seconds in contrast with the 7.08 seconds taken by the MLP network).

· Even though MLP fares better on the four parameters discussed above, it lags behind in Model Building Time, which is quite an important parameter, in fact a driving force, in the selection of an algorithm for real-time applications. So here, RBF scores over the MLP network.

· The accuracy can be improved by increasing the number of instances: the greater the number of instances, the better the network will be trained and the more accurate its results will be.

The gist of the above discussion is that MLP, which is more accurate, is suitable for the type of applications which demand accuracy more than model building time. On the other hand, in real-time applications, where time is a constraint, the RBF network is more suitable.

9. FUTURE SCOPE

Internet traffic classification is a very wide area, in which a lot of research is going on at different organizations around the world. We have just embarked upon the classification of known Internet traffic into some pre-defined classes. Further, we can make use of other approaches to classify Internet traffic. For example, the use of unsupervised learning to classify the traffic using some clustering techniques can be a good alternative. A new algorithm can be implemented in the WEKA software to classify Internet traffic. A dataset with a larger number of attributes can be collected by exploring some other networking software to collect the Internet traffic, because the more features available about the traffic, the more accurately it can be classified. But with an increase in the number of attributes, the computational complexity of the algorithm will increase, leading to an increase in computation time and thereby jeopardizing real-time applications. So we need a trade-off between the two parameters. In the future we expect better algorithms which will bolster this method of IP classification in accurately predicting malicious attacks or warding off unwanted packets. All in all, there are plenty of options available for extending the scope of this vast field of Internet traffic classification.

REFERENCES

[1] A. McGregor, M. Hall, P. Lorier, and J. Brunskill, “Flow Clustering Using Machine Learning Techniques,” in PAM 2004, Antibes Juan-les-Pins, France, April 19-20, 2004.

[2] A. Moore and D. Zuev, “Internet Traffic Classification Using Bayesian Analysis Techniques,” in SIGMETRICS’05, Banff, Canada, June 6-10, 2005.

[3] S. Zander, T. Nguyen, and G. Armitage, “Self-Learning IP Traffic Classification Based on Statistical Flow Characteristics,” in PAM 2005, Boston, USA, March 31-April 1, 2005.

[4] “Automated Traffic Classification and Application Identification using Machine Learning,” in LCN’05, Sydney, Australia, November 15- 17, 2005.

[5] A. Dempster, N. Paird, and D. Rubin, “Maximum likelihood from incomplete data via EM algorithm,” Journal of the Royal Statistical Society, vol. 39, no.1, pp. 1–38, 1977.

[6] IANA. Internet Assigned Numbers Authority (IANA), “http://www.iana.org/assignments/port-numbers.”

[7] T. Karagiannis, A. Broido, M. Faloutsos, and K. claffy, “Transport Layer Identification of P2P Traffic,” in IMC’04, Taormina, Italy, October 25- 27, 2004.

[8] P. Haffner, S. Sen, O. Spatscheck, and D. Wang, “ACAS: Automated Construction of Application Signatures,” in SIGCOMM’05 Workshops, Philadelphia, USA, August 22-26, 2005.

[9] A. Moore and K. Papagiannaki, “Toward the Accurate Identification of Network Applications,” in PAM 2005, Boston, USA, March 31-April 1, 2005.

[10] S. Sen, O. Spatscheck, and D. Wang, “Accurate, Scalable In-Network Identification of P2P Traffic Using Application Signatures,” in WWW2005, New York, USA, May 17-22, 2004.

[11] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, “BLINC: Multilevel Traffic Classification in the Dark,” in SIGCOMM’05, Philadelphia, USA, August 21-26, 2005.

[12] P. Cheeseman and J. Strutz, “Bayesian Classification (AutoClass): Theory and Results.” In Advances in Knowledge Discovery and Data Mining, AAI/MIT Press, USA, 1996.

[13] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, “Class-of-Service Mapping for QoS: A Statistical Signature-based Approach to IP Traffic Classification,” in IMC’04, Taormina, Italy, October 25-27, 2004.

[14] I. Witten and E. Frank, “Data Mining: Practical Machine Learning Tools and Techniques,” 2nd ed. San Francisco: Morgan Kaufmann, 2005.

[15] Ethereal software for data capturing, http://www.ethereal.com

Technical paper presented at a National conference in Udaipur, Rajasthan

A Preliminary Performance Comparison of Two Clustering Algorithms for Practical IP Traffic Classification

Sunil Agrawal, Sameer Sharma and B.S. Sohi

UIET, Panjab University, Chandigarh, India.

Email: s.agrawal@hotmail.com, smrshrm20@gmail.com

ABSTRACT

The early detection of applications associated with TCP flows is an essential step for network security and traffic engineering. The classic way to identify flows, i.e. looking at port numbers, is not effective anymore. In this paper, we propose a technique that uses an unsupervised machine learning approach for Internet traffic identification.

Our unsupervised approach uses the Simple K-Means clustering algorithm, and we compare the results with an efficient version of this clustering algorithm, Fast K-Means, on the grounds of Model Building Time, i.e., the speed with which the data is clustered by the algorithms. We find that Fast K-Means takes less time when the number of clusters ranges from 2 to 22; after that, it becomes slower than Simple K-Means. We also find that the unsupervised technique can be used to discover traffic from previously unknown applications and has the potential to become an excellent tool for exploring Internet traffic.

Keywords

Traffic classification, Machine Learning, K-means.

1. INTRODUCTION

Enterprise or campus networks usually impose a set of rules for users to access the network in order to protect network resources and enforce institutional policies (for instance, no sharing of music files or gaming). This leaves network administrators with the daunting task of (1) identifying the application associated with a traffic flow and (2) controlling user’s traffic when needed. Therefore, accurate classification of traffic flows is an essential step for administrators to detect intrusion or malicious attacks, forbidden applications, or simply new applications (which may impact the future provisioning of network resources).

Previous works have proposed a number of methods to identify the application associated with a traffic flow. The simplest approach consists in examining TCP port numbers. Port-based methods are simple because many well-known applications have specific port numbers (for instance, HTTP traffic uses port 80 and FTP port 21). However, the research community now recognizes that port-based classification is inadequate, mainly because many applications use dynamic port-negotiation mechanisms to hide from firewalls and network security tools. An alternative approach is to inspect the payload of every packet. This technique can be extremely accurate when the payload is not encrypted, but it is an unrealistic alternative. First, there are privacy concerns with examining user data. Second, there is a high storage and computational cost to study every packet that traverses a link (in particular at very high-speed links).

2. BACKGROUND AND RELATED WORK

There has been much recent work in the field of traffic classification. This section will survey the different techniques presented in the literature.

A. Port Number Analysis

Historically, traffic classification techniques used well-known port numbers to identify Internet traffic. This was successful because many traditional applications use fixed port numbers assigned by IANA [6]. For example, email applications commonly use port 25. This technique has been shown to be ineffective by Karagiannis et al. in [7] for some applications, such as the current generation of P2P applications, which intentionally try to disguise their traffic by using dynamic port numbers or masquerading as well-known applications. In addition, only those applications whose port numbers are known in advance can be identified.

B. Payload-based Analysis

Another well-researched approach is analysis of packet payloads [7]–[10]. In this approach, the packet payloads are analyzed to see whether or not they contain characteristic signatures of known applications. These approaches have been shown to work very well for Internet traffic, including P2P traffic. However, these techniques also have drawbacks. First, payload analysis poses privacy and security concerns. Second, these techniques typically require increased processing and storage capacity. Third, these approaches are unable to cope with encrypted transmissions. Finally, these techniques only identify traffic for which signatures are available and are unable to classify previously unknown traffic.

C. Transport-layer heuristics

Transport-layer heuristic information has been used to address the drawbacks of payload-based analysis and the diminishing effectiveness of port-based identification. Karagiannis et al. propose a novel approach that uses the unique behaviors of P2P applications when they are transferring data or making connections to identify this traffic [7]. This approach is shown to perform better than port-based classification and equivalent to payload-based analysis. In addition, Karagiannis et al. created another method that uses the social, functional, and application behaviors to identify all types of traffic [11].

D. Machine Learning Approaches

Machine learning techniques generally consist of two parts: model building and then classification. A model is first built using training data. This model is then input to a classifier, which then classifies a data set. Machine learning techniques can be divided into the categories of unsupervised and supervised. McGregor et al. hypothesize the ability of using an unsupervised approach to group flows based on connection-level (i.e., transport-layer) statistics to classify traffic [1]. In this method, an EM algorithm [5] is used, and McGregor et al. draw the conclusion that this approach is promising. In [3] and [4], Zander et al. extend this work by using an EM algorithm called AutoClass [12] and find the optimal set of attributes to use for building the classification model. Some supervised machine learning techniques, such as [13] and [2], also use connection-level statistics to classify traffic. In [13], Roughan et al. use nearest neighbor and linear discriminant analysis. This approach is limited because it does not classify HTTP traffic and uses a limited number of connection-level statistics. In [2], Moore et al. suggest using Naïve Bayes as a classifier and show that the Naïve Bayes approach has a high accuracy for classifying traffic.

3. UNSUPERVISED CLUSTERING

To extract groups of flows that share a common communication behavior, we borrow techniques from machine learning. We use unsupervised clustering, as it relies on unlabeled data samples to find natural groups (or clusters) in a dataset, whereas supervised learning uses a pre-labeled set of samples to construct a model for each class. Although a traffic classification mechanism may also use Naive Bayes classifiers, an example of supervised learning, unsupervised learning is more appropriate for traffic classification because it does not rely on pre-defined classes. A single application can have multiple behaviors, which should be modeled separately.

We use two versions of K-Means Clustering algorithm for this application, and compare these two algorithms on the ground of Speed i.e., model building time.

Data Set used for comparison of Algorithms:-

The algorithms have been run on IP traffic data having 248 attributes and 24,863 instances. This is the IP traffic data of the University of Auckland and has been referenced from the following links:

http://www.dcs.qmul.ac.uk/research/nrl
http://www.cl.cam.ac.uk/Research/SRG/netos/nprobe/data/papers/sigmetrics/index.html

System Configuration and Software Platform used for comparison:-

The software platform on which the algorithms have been compared is WEKA.

The computer system on which the algorithms have been run has an Intel Core 2 Duo processor running at 3 GHz and 3 GB of DDR2 RAM.

3.1 BRIEF THEORY OF THE K-MEANS CLUSTERING ALGORITHM

It is an algorithm for partitioning (or clustering) N data points into K disjoint subsets (or clusters) S_j, each containing N_j data points, so as to minimize the sum-of-squares criterion

J = \sum_{j=1}^{K} \sum_{n \in S_j} \left\| x_n - \mu_j \right\|^2

where x_n is a vector representing the nth data point and \mu_j is the geometric centroid of the data points in subset (cluster) S_j. In general, the algorithm does not achieve a global minimum of J over the assignments. In fact, since the algorithm uses discrete assignment rather than a set of continuous parameters, the "minimum" it reaches cannot even be properly called a local minimum. Despite these limitations, the algorithm is used fairly frequently as a result of its ease of implementation.

The algorithm consists of a simple re-estimation procedure, as follows. Initially, the data points are assigned at random to the K sets. In step 1, the centroid is computed for each set. In step 2, every point is assigned to the cluster whose centroid is closest to that point. These two steps are alternated until a stopping criterion is met, i.e., when there is no further change in the assignment of the data points.
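
A minimal Python sketch of that two-step re-estimation (plain Simple K-Means rather than the optimized Fast variant); the toy 2-D points are made up, and the centroids are initialized from randomly chosen points, a common variant of the random initial assignment described above:

    import random

    def kmeans(points, k, iterations=100):
        random.seed(1)
        centroids = random.sample(points, k)            # initial centroids from random points
        for _ in range(iterations):
            # Step 2: assign every point to the cluster whose centroid is closest.
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k),
                        key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
                clusters[j].append(p)
            # Step 1: recompute each centroid as the mean of the points assigned to it.
            new = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
                   for i, cl in enumerate(clusters)]
            if new == centroids:                        # stopping criterion: assignments unchanged
                break
            centroids = new
        return centroids, clusters

    points = [(1.0, 1.1), (0.9, 1.0), (5.0, 5.2), (5.1, 4.9), (9.0, 0.1), (8.8, 0.2)]
    centroids, clusters = kmeans(points, k=3)
    print(centroids)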

K-Means has an input parameter, K, which represents the number of disjoint partitions used by K-Means. In our data set, we would expect there to be at least one cluster for each traffic class. In addition, due to the diversity of traffic in some classes such as HTTP (e.g. browsing, bulk download, streaming), we expect even more clusters to be formed.

So, initially, K has been kept at 2 and then incremented in steps of 2, up to a maximum of 28. The model building time of both algorithms has been recorded. The number of instances in the respective clusters in each simulation of the two algorithms is exactly the same. This means that Fast K-Means is simply algorithmically superior to the Simple K-Means algorithm.

4. EXPERIMENTAL RESULTS

No. of Clusters (K)   Fast K-Means (seconds)   Simple K-Means (seconds)
 2                     50                        53
 4                     65                        65
 6                    146                       162
 8                    100                        98
10                    142                       143
12                    129                       130
14                    123                       127
16                    138                       138
18                    158                       174
20                    244                       240
22                    180                       193
24                    166                       160
26                    220                       195
28                    440                       438

When these two algorithms are run on WEKA, one by one, with the input parameter, i.e., Number of clusters, being varied from 2 to 28, on the data set specified above, the Model Building time is noted for each simulation. The output is presented in the tabular form in the adjoining table. Graphical representation of both algorithms is given in the following pages, and comparison of both algorithms is also shown.

· Fast K-Means: represented by the white columns

· Simple K-Means: represented by the black columns

Relative Comparison of Fast K-Means and Simple K-Means

5. CONCLUSION

This paper presented an unsupervised machine learning approach (Simple K-Means and Fast K-Means) for Internet traffic classification. We used qualitative and quantitative results to compare these two algorithms on the grounds of time. We show that the time required to cluster using Fast K-Means is less if the number of clusters (or classes) is within the range 2-22; after that, Simple K-Means performs better. The Fast K-Means algorithm is similar to the Simple K-Means algorithm, with some algorithmic optimizations: it takes advantage of the geometric properties of the K-Means clustering algorithm to reduce runtime. The improvement in Fast K-Means is also a function of the dimensionality of the data presented to it. Also, the reduction in model building time for Fast K-Means can be very helpful while working with IP traffic classification in real time. When the number of instances in the data is large enough, Fast K-Means clusters the data in less time. Further, we can assign classes to the evaluated clusters, and hence, whenever a new instance arrives, it can immediately be classified accordingly, providing the network administrator with an excellent tool for accurate and timely classification of real-time IP traffic.

6. FUTURE WORK

At this point, we have just compared the two unsupervised clustering algorithms on the ground of model building time, i.e., speed of clustering. Here each instance of the data has 248 flow features (the full feature set). But classification could also be done with a reduced feature set, with acceptable accuracy. Our future work involves reducing the feature set with the help of some correlation-based search algorithms, so as to reduce the number of features drastically (to about 5-10% of the original number of attributes). We will then compare these algorithms not only on the basis of model build time but also on the basis of classification accuracy.

REFERENCES

[1] A. McGregor, M. Hall, P. Lorier, and J. Brunskill, “Flow Clustering Using Machine Learning Techniques,” in PAM 2004, Antibes Juan-les-Pins, France, April 19-20, 2004.

[2] A. Moore and D. Zuev, “Internet Traffic Classification Using Bayesian Analysis Techniques,” in SIGMETRICS’05, Banff, Canada, June 6-10, 2005.

[3] S. Zander, T. Nguyen, and G. Armitage, “Self-Learning IP Traffic Classification Based on Statistical Flow Characteristics,” in PAM 2005, Boston, USA, March 31-April 1, 2005.

[4] ——, “Automated Traffic Classification and Application Identification using Machine Learning,” in LCN’05, Sydney, Australia, November 15-17, 2005.

[5] A. Dempster, N. Paird, and D. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, vol. 39, no.1, pp. 1–38, 1977.

[6] IANA. Internet Assigned Numbers Authority (IANA), “http://www.iana.org/assignments/port-numbers.”

[7] T. Karagiannis, A. Broido, M. Faloutsos, and K. claffy, “Transport Layer Identification of P2P Traffic,” in IMC’04, Taormina, Italy, October 25-27, 2004.

[8] P. Haffner, S. Sen, O. Spatscheck, and D. Wang, “ACAS: Automated Construction of Application Signatures,” in SIGCOMM’05 Workshops, Philadelphia, USA, August 22-26, 2005.

[9] A. Moore and K. Papagiannaki, “Toward the Accurate Identification of Network Applications,” in PAM 2005, Boston, USA, March 31-April 1, 2005.

[10] S. Sen, O. Spatscheck, and D. Wang, “Accurate, Scalable In-Network Identification of P2P Traffic Using Application Signatures,” in WWW2005, New York, USA, May 17-22, 2004.

[11] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, “BLINC: Multilevel Traffic Classification in the Dark,” in SIGCOMM’05, Philadelphia, USA, August 21-26, 2005.

[12] P. Cheeseman and J. Strutz, “Bayesian Classification (AutoClass): Theory and Results.” In Advances in Knowledge Discovery and Data Mining, AAI/MIT Press, USA, 1996.

[13] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, “Class-of-Service Mapping for QoS: A Statistical Signature-based Approach to IP Traffic Classification,” in IMC’04, Taormina, Italy, October 25-27, 2004.

[14] I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2005.

[15] A. Banerjee and J. Langford, “An Objective Evaluation Criterion for Clustering,” in KDD’04, Seattle, USA, August 22-25, 2004.

[16] Auckland Data Sets, http://www.wand.net.nz/wand/wits/auck/.

[17] V. Paxson, “Empirically-Derived Analytic Models of Wide-Area TCP Connections,” IEEE/ACM Transactions on Networking, vol. 2, no. 4, pp. 316–336, August 1994.

[18] C. Colman, “What to do about P2P?” Network Computing Magazine, vol. 12, no. 6, 2003.

[19] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, USA: Prentice Hall, 1988.

[20] J. Erman, M. Arlitt, and A. Mahanti, “Traffic Classification using Clustering Al gorithms,” in SIGCOMM’06 MineNet Workshop, Pisa, Italy, September 2006.