NETWORK TRAFFIC ANALYSIS: HADOOP
PIG VS TYPICAL MAPREDUCE
Anjali P P1 and Binu A2

1Department of Information Technology, Rajagiri School of Engineering and Technology, Kochi, M G University, Kerala
2Department of Information Technology, Rajagiri School of Engineering and Technology, Kochi, M G University, Kerala

anjalinambiarpp@gmail.com
binu_a@rajagiritech.ac.in

ABSTRACT
Big data analysis has become highly popular in the present-day scenario, and the manipulation of
big data has gained the keen attention of researchers in the field of data analytics. Analysis of
big data is currently considered an integral part of many computational and statistical
departments. As a result, novel approaches in data analysis are evolving on a daily basis.
Thousands of transaction requests are handled and processed every day by different websites
associated with e-commerce, e-banking, e-shopping carts, etc. Network traffic and weblog
analysis play a crucial role in such situations, and Hadoop can be suggested as an
efficient solution for processing the NetFlow data collected from switches, as well as the website
access logs collected during fixed intervals.

KEYWORDS
Big Data, MapReduce, Hadoop, Pig, NetFlow, network traffic

1. INTRODUCTION
NetFlow, the network protocol designed by Cisco Systems for assembling network traffic
information, has become an industry standard for traffic monitoring and is supported by
various platforms. Cisco standard NetFlow version 5 defines a network flow as a unidirectional
sequence of packets identified by seven parameters:
• Ingress interface (SNMP ifIndex)
• Source IP address
• Destination IP address
• IP protocol
• Source port for UDP or TCP, 0 for other protocols
• Destination port for UDP or TCP, type and code for ICMP, or 0 for other protocols
• IP Type of Service

To limit the load on the router components and the volume of flow records exported, NetFlow
is normally enabled on a per-interface basis. If IP filters are not used to decide whether a packet
should be captured, NetFlow will by default capture all packets received on an ingress interface.
Natarajan Meghanathan et al. (Eds): ITCSE, ICDIP, ICAIT - 2013
Computer Science & Information Technology (CS & IT), pp. 15–21, 2013. © CS & IT-CSCP 2013
DOI: 10.5121/csit.2013.3902
Further customization of the NetFlow implementation is possible as per the requirements of
different scenarios. If accurately configured, it can even allow the observation and recording of
IP packets on the egress interface.
NetFlow data from dedicated probes can be efficiently utilized for observing critical
links, whereas the data from routers can provide a network-wide view of the traffic, which
is of great use for capacity planning, accounting, performance monitoring, and security.
This paper presents a survey of network flow data analysis using the Apache Hadoop
MapReduce framework, together with a comparative study of typical MapReduce against Apache Pig,
a platform for analyzing large data sets that provides ease of programming, optimization
opportunities, and enhanced extensibility.

2. NETFLOW ANALYSIS USING TYPICAL HADOOP MAPREDUCE
FRAMEWORK
Hadoop can be considered an efficient mechanism for big data manipulation in distributed
environments. Developed by the Apache foundation, Hadoop is well known for its scalability
and efficiency. Hadoop works on a computational paradigm called MapReduce, which is
characterized by a Map and a Reduce phase. In the Map phase, the computation is divided into many
fragments which are distributed among the cluster nodes. The individual results produced at the end of
the Map phase are combined and reduced to a single output in the Reduce phase, thereby producing
the result in a short span of time.
Hadoop stores data in its distributed file system, popularly known as HDFS, and makes
it easily available to users on request. HDFS is greatly advantageous in its scalability and
portability aspects. It achieves reliability by replicating data across multiple nodes in the
cluster, thereby eliminating the need for RAID storage on the cluster nodes. High availability is
also incorporated in Hadoop cluster systems by means of a backup provision for the master
node (namenode) known as the secondary namenode. The major limitations of HDFS are its inability
to be mounted directly by an existing operating system and the inconvenience of moving data into
and out of HDFS before and after the execution of a job.
Once Hadoop is deployed on a cluster, instances of these processes are started: Namenode,
Datanode, Jobtracker, and Tasktracker. Hadoop can process large amounts of data such as website
logs and network traffic logs obtained from switches. The namenode, otherwise known as the master
node, is the main component of HDFS; it stores the directory tree of the file system and
keeps track of the data during execution. The datanode, as its name suggests, stores the data
in HDFS. The jobtracker delegates the MapReduce tasks to different cluster nodes, and the
tasktracker spawns and executes the tasks on the member nodes.
NetFlow analysis requires sufficient network flow data, collected at a particular time or at
regular intervals, for processing. By analyzing the incoming and outgoing network traffic
corresponding to various hosts, the required precautions can be taken regarding server security,
firewall setup, packet filtering, etc., and here the significance of network flow analysis cannot be
set aside. Flow data can be collected from switches or routers using router-console commands
and filtered on the basis of the required parameters, such as IP addresses and port numbers.
Similarly, every website generates corresponding logs, and such files can be collected from the
web server for further analysis. Website access logs contain details of the URLs accessed, the
time of access, the number of hits, etc., which are indispensable to a website manager for the proactive
detection and countering of problems like Denial of Service attacks and IP address spoofing.
The log files generated by a router or web server every minute accumulate to form a large-sized
file which needs a great deal of time to be processed. In such a situation, the Hadoop
MapReduce paradigm can be taken advantage of. The big data is fed as input to a MapReduce
algorithm, which provides an efficiently processed output file in return. Every map function
transforms input data into a certain number of key/value pairs, which are later sorted according to their
keys, whereas the reduce function merges the values with similar keys into a single
result.
One of the main drawbacks of Hadoop MapReduce is its coding complexity, which makes it accessible
only to programmers with highly developed skills. The programming model is highly restrictive, and
Hadoop cluster management is also not easy when considering aspects like debugging, log
collection, software distribution, etc. In short, the effort to be put into MapReduce programming
restricts its acceptability to a remarkable extent. Here comes the need for a platform that can
provide ease of programming as well as enhanced extensibility, and Apache Pig reveals itself as a
novel approach in data analytics. The key features and applications of Apache Pig deployed over
the Hadoop framework are described in Section 3.

3. USING APACHE PIG TO PROCESS NETWORK FLOW DATA
Apache Pig is a platform that supports substantial parallelization, enabling the processing of
huge data sets. The structure of Pig can be described in two layers: an infrastructure layer with
a built-in compiler that produces MapReduce programs, and a language layer with a text-processing
language called "Pig Latin". Pig makes data analysis and programming easier and more
understandable to novices in the Hadoop framework through Pig Latin, which can be considered a
parallel data flow language. The Pig setup also has provisions for further optimizations and
user-defined functions. Pig Latin queries are compiled into MapReduce jobs and
executed in distributed Hadoop cluster environments. Pig Latin looks similar to SQL (Structured
Query Language) but is designed for the Hadoop data-processing environment, just as SQL is for the
RDBMS environment. Another member of the Hadoop ecosystem, known as Hive, is a query language
like SQL and needs data to be loaded into tables before processing. Pig Latin can work in
schema-less or inconsistent environments and can operate on the available data as soon as it is
loaded into HDFS. Pig closely resembles scripting languages like Perl and Python in certain
aspects, such as flexibility of syntax and dynamic variable definitions. Pig Latin is efficient
enough to be considered the native parallel processing language for distributed systems such as
Hadoop.
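As a brief illustration of the parallel dataflow style described above, a minimal Pig Latin script for the weblog case might look like the following (a sketch only: the file name and field layout are assumptions for illustration):

```pig
-- Load raw access-log records as tab-separated fields (hypothetical schema)
logs = LOAD 'access_log' USING PigStorage('\t')
       AS (ip:chararray, url:chararray, hits:int);
-- Discard malformed records with a non-positive hit count
valid = FILTER logs BY hits > 0;
-- Group by URL and total the hits per URL
by_url = GROUP valid BY url;
counts = FOREACH by_url GENERATE group AS url, SUM(valid.hits) AS total_hits;
STORE counts INTO 'url_hit_counts';
```

Each statement only declares a step in the dataflow; Pig's compiler translates the whole script into one or more MapReduce jobs once STORE is reached.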

Figure 1. Data processing model in PIG
Pig is gaining popularity in different zones of computational and statistical departments. It is
widely used in processing weblogs and wiping corrupted data from records. Pig can build
behavior prediction models based on user interactions with a website, a feature that can be
applied in displaying ads and news stories that keep the user engaged. It also supports
an iterative processing model that can keep track of every new update by combining the
behavioral model with the user data. This feature is largely used in social networking sites like
Facebook and micro-blogging sites like Twitter. When it comes to small data or scanning
multiple records in random order, Pig cannot be considered as effective as MapReduce. The
aforementioned data processing model used by Pig is depicted in Figure 1.
In this paper, we explore the horizons of Pig Latin for processing the NetFlow data obtained from
routers. The proposed method for implementing Pig for processing NetFlow data is described in
Section 4.

4. IMPLEMENTATION ASPECTS
Here we present NetFlow analysis using Hadoop, which can manage large volumes of data,
employ parallel processing, and come up with the required output in very little time. A MapReduce
algorithm needs to be deployed on Hadoop to get the result, and writing such MapReduce programs for
analyzing huge flow data is a time-consuming task. Here Apache Pig helps us save a
good deal of time.
We need to analyze traffic at a particular time, and Pig plays an important role in this regard. For
data analysis, we create buckets with pairs of (SrcIF, SrcIPaddress), (DstIF, DstIPaddress), and
(Protocol, FlowperSec). Later, they are joined and categorized as SrcIF and FlowperSec, and
SrcIPaddress and FlowperSec. NetFlow data is collected from a busy Cisco switch using nfdump.
The output contains three basic sections: an IP Packet section, a Protocol Flow section, and a
Source/Destination Packets section.

Figure 2. Netflow Analysis using Hadoop Pig

Pig needs the NetFlow datasets in an arranged format, which is then converted to an SQL-like
schema. Hence the three sections are loaded as three different schemas. A sample of the Pig Latin
code is given below:

Protocol = LOAD 'NetFlow-Data1' AS (protocol:chararray, flow:int, …);
Source = LOAD 'NetFlow-Data2' AS (SrcIF:chararray, SrcIPAddress:chararray, …);
Destination = LOAD 'NetFlow-Data3' AS (DstIF:chararray, DstIPaddress:chararray, …);

As shown in Figure 2, the network flow data is converted into schemas by Pig. These schemas are then
combined for analysis, and the traffic flow per node is the resultant output.
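The joining and categorization of the loaded schemas could be sketched as follows (a sketch only: the flows field and the relation names are assumptions for illustration, since the original column lists are elided):

```pig
-- Assume each Source record also carries a per-record flow count 'flows'
SrcFlows = FOREACH Source GENERATE SrcIF, SrcIPAddress, flows;

-- Flows per source interface
byIF   = GROUP SrcFlows BY SrcIF;
ifRate = FOREACH byIF GENERATE group AS SrcIF, SUM(SrcFlows.flows) AS FlowPerSec;

-- Flows per source IP address, i.e. the traffic flow per node
byIP   = GROUP SrcFlows BY SrcIPAddress;
ipRate = FOREACH byIP GENERATE group AS SrcIPAddress, SUM(SrcFlows.flows) AS FlowPerSec;
```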
5. ADVANTAGES OF USING HADOOP PIG OVER MAPREDUCE
The member of the Hadoop ecosystem popularly known as Pig can be considered a procedural
version of SQL. The most attractive feature and advantage of Pig is its simplicity: it does not
demand highly skilled Java professionals for implementation. The flexibility of its syntax and
dynamic variable allocation make it much more user-friendly, and provisions for user-defined
functions add extensibility to the framework. The ability of Pig to work with unstructured data
contributes greatly to parallel data processing. The great difference comes when computational
complexity is considered. Typical MapReduce programs have a large number of lines of Java code,
whereas the same task can be accomplished in Pig Latin in very few lines. Huge network
flow data input files can be processed by writing simple programs in Pig Latin, which closely
resembles scripting languages like Perl and Python. In other words, the computational complexity of
Pig Latin is lower than that of Hadoop MapReduce programs written in Java, which makes Pig suitable
for the implementation of this project. The procedure followed for arriving at this conclusion is
described in Section 6.

6. EXPERIMENTS AND RESULTS
Hadoop Pig was selected as the programming platform for NetFlow analysis after conducting a
short test on a sample input file.

6.1 Experimental Setup
Apache Hadoop was initially installed and configured on a stand-alone node, and Pig was
deployed on top of this system. Key-based authentication was set up on the machine, as Hadoop
prompts for passwords while performing operations. Hadoop instances such as the namenode,
datanode, jobtracker, and tasktracker are started before running Pig, which shows a "grunt" console
upon starting. The grunt console is the command prompt specific to Pig and is used for data and
file-system manipulation. Pig can work in both local mode and MapReduce mode, which can be
selected as per the user's requirement.
A sample Hadoop MapReduce word-count program written in Java was used for the comparison.
The program used to process a sample input file of sufficient size had one hundred and thirty lines of
code, and the time taken for execution was noted. The same operation was coded in Pig Latin
in only five lines of code, and the corresponding running time was also noted for
framing the results.
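The five-line Pig Latin equivalent of the word count was along the following lines (a sketch; the input and output paths are assumptions):

```pig
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO 'wordcount-output';
```

The same logic in plain MapReduce requires a mapper class, a reducer class, and driver code, which accounts for the difference in code size noted above.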

6.2. Experiment Results
At any particular instant, the size of the input file was kept the same for MapReduce and Pig,
whereas the lines of code were many for the former and few for the latter. For a good comparison,
the input file size was varied in the range of 13 kB to 208 kB, and the corresponding times were
recorded as in Figure 3.
Figure 3. Hadoop Pig vs. MapReduce

From this we can formulate the following observation: as the input file size increases in
multiples of x, the execution time for typical MapReduce also increases proportionally, whereas
Hadoop Pig maintains a roughly constant time, at least for x up to 5 times. Graphs plotted for
both are given as Figure 4 and Figure 5.

Figure 4. Input file-size vs. execution time for MapReduce

Figure 5. Input file-size vs. execution time for Pig

7. CONCLUSION
In this paper, a detailed survey of the wide scope of the Apache Pig platform was conducted.
The key features of the network flows were extracted for data analysis using Hadoop Pig, which was
tested and proved to be advantageous in this respect, with a very low computational complexity.
Network flow capture can be done with the help of easily available open-source tools. Once the
Pig-based flow analyzer is implemented, traffic flow analysis can be carried out with increased
convenience and ease, even without highly developed programming skills. It greatly reduces the
time spent researching the complex programming constructs and control loops needed to implement a
typical MapReduce job in Java, while still enjoying the advantages of the Map/Reduce task-division
method through the layered architecture and in-built compilers of Apache Pig.

ACKNOWLEDGEMENT
We are greatly indebted to the college management and the faculty members for providing
necessary facilities and hardware along with timely guidance and suggestions for implementing
this work.


AUTHORS
Anjali P P is an MTech student in the Information Technology Department of Rajagiri
School of Engineering and Technology, Kochi, under MG University. She received her BTech degree in
Computer Science and Engineering from Cochin University of Science and
Technology. Her research interests include Hadoop and big data analysis.

Binu A is an Assistant Professor in the Information Technology Department of Rajagiri
School of Engineering and Technology, Kochi, under MG University.

IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
 
IRJET- Analysis of Boston’s Crime Data using Apache Pig
IRJET- Analysis of Boston’s Crime Data using Apache PigIRJET- Analysis of Boston’s Crime Data using Apache Pig
IRJET- Analysis of Boston’s Crime Data using Apache Pig
 

Dernier

How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 

Dernier (20)

How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 

NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE

is normally enabled on a per-interface basis.
By default, if no IP filters are configured to decide whether a packet should be captured by NetFlow, it captures all packets received on an ingress interface.

Natarajan Meghanathan et al. (Eds): ITCSE, ICDIP, ICAIT - 2013, pp. 15–21, 2013. © CS & IT-CSCP 2013. DOI: 10.5121/csit.2013.3902
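The unidirectional flow definition given in the introduction, a sequence of packets sharing seven key fields, can be sketched concretely. The following is a minimal plain-Python illustration, not NetFlow's wire format: the field names and packet layout are assumptions made for the example.

```python
from collections import Counter, namedtuple

# The seven NetFlow v5 key fields listed in the introduction.
FlowKey = namedtuple("FlowKey", [
    "ingress_if", "src_ip", "dst_ip", "proto",
    "src_port", "dst_port", "tos",
])

def flow_key(pkt):
    """Build the unidirectional flow key for one packet (a dict)."""
    return FlowKey(pkt["ingress_if"], pkt["src_ip"], pkt["dst_ip"],
                   pkt["proto"], pkt["src_port"], pkt["dst_port"], pkt["tos"])

def count_flows(packets):
    """Aggregate packets into flows: packets counted per flow key."""
    return Counter(flow_key(p) for p in packets)
```

Two packets with identical values in all seven fields belong to the same flow; a reply with source and destination swapped forms a distinct flow, since flows are unidirectional.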
Further customization of a NetFlow deployment is possible as the requirements of a given scenario demand; when accurately configured, it can even observe and record IP packets on the egress interface. NetFlow data from dedicated probes can be used efficiently to observe critical links, while data from routers provides a network-wide view of the traffic that is of great use for capacity planning, accounting, performance monitoring, and security.

This paper presents a survey of network flow data analysis using the Apache Hadoop MapReduce framework, together with a comparative study of typical MapReduce against Apache Pig, a platform for analyzing large data sets that offers ease of programming, optimization opportunities, and enhanced extensibility.

2. NETFLOW ANALYSIS USING TYPICAL HADOOP MAPREDUCE FRAMEWORK

Hadoop can be considered an efficient mechanism for big data manipulation in a distributed environment. Developed by the Apache foundation, it is well known for its scalability and efficiency. Hadoop works on a computational paradigm called MapReduce, characterized by a Map phase and a Reduce phase. In the Map phase, the computation is divided into many fragments that are distributed among the cluster nodes. The individual results produced at the end of the Map phase are combined and reduced to a single output in the Reduce phase, producing the final result in a short span of time.

Hadoop stores data in its distributed file system, HDFS, and makes it readily available to users on request. HDFS is highly scalable and portable. It achieves reliability by replicating data across multiple nodes in the cluster, thereby eliminating the need for RAID storage on the cluster nodes.
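The two phases described above can be illustrated with a small framework-free sketch. Plain Python stands in here for Hadoop's Java API: the map step emits key/value pairs, a shuffle groups them by key, and the reduce step merges each group into a single result.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy MapReduce: map each record to (key, value) pairs,
    shuffle them by key, then reduce each group to one output value."""
    groups = defaultdict(list)
    for record in records:                 # Map phase
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values)      # Reduce phase
            for key, values in groups.items()}

# The classic word count expressed as a mapper/reducer pair.
def wc_map(line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return sum(counts)
```

For example, run_mapreduce(["a b a"], wc_map, wc_reduce) yields {"a": 2, "b": 1}. In real Hadoop the groups would be partitioned across cluster nodes rather than held in one dictionary.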
High availability is also provided in Hadoop clusters by means of a backup for the master node (namenode), known as the secondary namenode. The major limitations of HDFS are that it cannot be mounted directly by an existing operating system, and the inconvenience of moving data into and out of HDFS before and after the execution of a job.

Once Hadoop is deployed on a cluster, instances of the following processes are started: namenode, datanode, jobtracker, and tasktracker. Hadoop can process large amounts of data such as website logs and network traffic logs obtained from switches. The namenode, also known as the master node, is the main component of HDFS: it stores the directory tree of the file system and keeps track of the data during execution. The datanode, as its name suggests, stores the data in HDFS. The jobtracker delegates MapReduce tasks to the cluster nodes, and the tasktracker spawns and executes those tasks on the member nodes.

NetFlow analysis requires sufficient network flow data, collected at a particular time or at regular intervals. By analyzing the incoming and outgoing traffic of the various hosts, the required precautions can be taken regarding server security, firewall setup, packet filtering, and so on; here the significance of network flow analysis cannot be set aside. Flow data can be collected from switches or routers using router-console commands and filtered on the basis of required parameters such as IP addresses and port numbers. Similarly, every website generates corresponding logs, and such files can be collected from the web server for further analysis. Website access logs record the URLs accessed, the times of access, the number of hits, and other details that are indispensable to a website administrator for the proactive detection of problems such as Denial of Service attacks and IP address spoofing.
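An access-log analysis of the kind mentioned above can be sketched briefly. The snippet below assumes the Apache/NCSA Common Log Format (the paper does not name a specific format) and uses a simplified regular expression, not a full parser, to count hits per URL.

```python
import re
from collections import Counter

# Assumed Common Log Format line, e.g.:
# 127.0.0.1 - - [10/Oct/2013:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

def hits_per_url(lines):
    """Count accesses per URL from raw access-log lines."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m:                      # skip malformed lines
            counts[m.group("url")] += 1
    return counts
```

The same per-URL counts, produced over a much larger log, are the kind of output a MapReduce or Pig job would compute in the approach this paper describes.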
The log files generated by a router or web server every minute accumulate into a large-sized file that needs a great deal of time to process. In such a situation, the Hadoop MapReduce paradigm can be put to use: the big data is fed as input to a MapReduce algorithm, which returns an efficiently processed output file. Every map function transforms its data into a number of key/value pairs, which are then sorted by key, while a reduce function merges all values with the same key into a single result.

One of the main drawbacks of Hadoop MapReduce is its coding complexity; writing such programs is practical only for programmers with highly developed skills. The programming model is highly restrictive, and Hadoop cluster management is not easy either, considering aspects like debugging, log collection, and software distribution. In short, the effort demanded by MapReduce programming restricts its acceptability to a remarkable extent. Hence the need for a platform that provides ease of programming as well as enhanced extensibility, and Apache Pig reveals itself as a novel approach to data analytics. The key features and applications of Apache Pig deployed over the Hadoop framework are described in Section 3.

3. USING APACHE PIG TO PROCESS NETWORK FLOW DATA

Apache Pig is a platform that supports substantial parallelization, enabling the processing of huge data sets. The structure of Pig can be described in two layers: an infrastructure layer with a built-in compiler that produces MapReduce programs, and a language layer with a text-processing language called "Pig Latin". Pig makes data analysis and programming easier and understandable to novices in the Hadoop framework through Pig Latin, which can be considered a parallel data flow language. The Pig setup also provides provisions for further optimization and user-defined functions.
Pig Latin queries are compiled into MapReduce jobs and executed in distributed Hadoop cluster environments. Pig Latin looks similar to SQL (Structured Query Language) but is designed for the Hadoop data-processing environment, just as SQL is for the RDBMS environment. Another member of the Hadoop ecosystem, Hive, offers an SQL-like query language but needs data to be loaded into tables before processing. Pig Latin, by contrast, can work in schema-less or inconsistent environments and can operate on the data as soon as it is loaded into HDFS. In certain aspects, such as flexible syntax and dynamic variable definitions, Pig closely resembles scripting languages like Perl and Python. Pig Latin is efficient enough to be considered the native parallel-processing language for distributed systems such as Hadoop.

Figure 1. Data processing model in Pig
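The schema-on-read behavior described above, where a schema is declared at load time and applied to raw data as it is read with no prior table definition, can be imitated in a few lines. The snippet is plain Python with an invented tab-separated flow format; it is a sketch of the idea, not Pig's actual loader.

```python
# Hypothetical tab-separated flow records: protocol, flows-per-second.
RAW = "TCP\t120\nUDP\t45\nICMP\t7\n"

# Schema declared at load time, much like a Pig LOAD ... AS (...) clause.
SCHEMA = [("protocol", str), ("flow", int)]

def load(raw, schema):
    """Apply a schema to raw text as it is read (schema-on-read)."""
    rows = []
    for line in raw.strip().split("\n"):
        fields = line.split("\t")
        rows.append({name: cast(value)
                     for (name, cast), value in zip(schema, fields)})
    return rows
```

The raw text needs no preparation before loading; the schema is simply imposed on each line as it is consumed, which is the property that lets Pig start work as soon as data lands in HDFS.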
Pig is gaining popularity in different zones of computational and statistical departments. It is widely used for processing web logs and wiping corrupted data from records. Pig can build behavior-prediction models from user interactions with a website, a feature applied to displaying ads and news stories that keep the user engaged. It also supports an iterative processing model that keeps track of every new update by combining the behavioral model with the user data, a feature used heavily by social networking sites like Facebook and micro-blogging sites like Twitter. When it comes to small data, or to scanning multiple records in random order, Pig is not as effective as MapReduce. The aforementioned data processing model used by Pig is depicted in Figure 1.

In this paper, we explore the horizons of Pig Latin for processing the NetFlow data obtained from routers. The proposed method for implementing Pig to process NetFlow data is described in Section 4.

4. IMPLEMENTATION ASPECTS

Here we present NetFlow analysis using Hadoop, which can manage large volumes of data, employ parallel processing, and produce the required output quickly. A MapReduce algorithm needs to be deployed on Hadoop to get the result, and writing such MapReduce programs for analyzing huge flow data is a time-consuming task; here Apache Pig saves us a good deal of time. We need to analyze traffic at a particular time, and Pig plays an important role in this regard. For data analysis, we create buckets with the pairs (SrcIF, SrcIPaddress), (DstIF, DstIPaddress), and (Protocol, FlowperSec). Later, these are joined and categorized as SrcIF with FlowperSec, and SrcIPaddress with FlowperSec. The NetFlow data is collected from a busy Cisco switch using nfdump. The output contains three basic sections: the IP Packet section, the Protocol Flow section, and the Source-Destination Packets section.
Figure 2. Netflow Analysis using Hadoop Pig

Pig needs the NetFlow data sets in an arranged format, which is then converted into an SQL-like schema; hence the three sections are created as three different schemas. A sample of the Pig Latin code is given below:

    Protocol    = LOAD 'NetFlow-Data1' AS (protocol:chararray, flow:int, …);
    Source      = LOAD 'NetFlow-Data2' AS (SrcIF:chararray, SrcIPAddress:chararray, …);
    Destination = LOAD 'NetFlow-Data3' AS (DstIF:chararray, DstIPaddress:chararray, …);

As shown in Figure 2, the network flow data is deduced into schemas by Pig. These schemas are then combined for analysis, and the traffic flow per node is the resultant output.
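The grouping step just described, totaling flows per source interface or per source address, can also be sketched outside Pig. The plain-Python stand-in below uses an invented record layout, since the paper does not specify one, to show the aggregation that produces the per-node traffic figures.

```python
from collections import defaultdict

# Invented sample flow records; field names mirror the schemas above.
FLOWS = [
    {"SrcIF": "eth0", "SrcIPaddress": "10.0.0.1", "FlowperSec": 30},
    {"SrcIF": "eth0", "SrcIPaddress": "10.0.0.2", "FlowperSec": 20},
    {"SrcIF": "eth1", "SrcIPaddress": "10.0.0.3", "FlowperSec": 25},
]

def flows_by(records, key):
    """Total flows per value of one field, e.g. per source interface."""
    totals = defaultdict(int)
    for r in records:
        totals[r[key]] += r["FlowperSec"]
    return dict(totals)
```

Grouping by "SrcIF" gives the traffic per interface, and grouping by "SrcIPaddress" gives the traffic per host, the two categorizations named in Section 4.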
5. ADVANTAGES OF USING HADOOP PIG OVER MAPREDUCE

Pig, a member of the Hadoop ecosystem, can be considered a procedural version of SQL. Its most attractive feature, and its chief advantage, is simplicity: it does not demand highly skilled Java professionals for implementation. Flexible syntax and dynamic variable allocation make it far more user-friendly, and provisions for user-defined functions add extensibility to the framework. Pig's ability to work with unstructured data contributes greatly to parallel data processing.

The great difference appears when computational complexity is considered. A typical MapReduce program runs to a large number of lines of Java code, whereas the same task can be accomplished in far fewer lines of Pig Latin. Huge network flow input files can be processed by writing simple programs in Pig Latin, which closely resembles scripting languages like Perl and Python. In other words, the computational complexity of Pig Latin is lower than that of Hadoop MapReduce programs written in Java, which makes Pig suitable for the implementation of this project. The procedure followed for arriving at this conclusion is described in Section 6.

6. EXPERIMENTS AND RESULTS

Hadoop Pig was selected as the programming language for NetFlow analysis after conducting a short test on a sample input file.

6.1 Experimental Setup

Apache Hadoop was initially installed and configured on a stand-alone node, and Pig was deployed on top of this system. Key-based authentication was set up on the machine, as Hadoop prompts for passwords while performing operations. The Hadoop instances (namenode, datanode, jobtracker, and tasktracker) are started before running Pig, which presents a "grunt" console upon starting. The grunt console is the command prompt specific to Pig and is used for data and file-system manipulation.
Pig can work in either local mode or MapReduce mode, selected as per the user's requirement. A sample Hadoop MapReduce word-count program written in Java was used for the comparison. The program, run on a sample input file of sufficient size, had one hundred and thirty lines of code, and its execution time was noted. The same operation was coded in Pig Latin in only five lines of code, and the corresponding running time was also noted for framing the results.

6.2 Experiment Results

At any particular instant, the size of the input file is kept the same for MapReduce and Pig, whereas the lines of code are many for the former and few for the latter. For a fair comparison, the input file size was varied in the range of 13 KB to 208 KB, and the corresponding times were recorded as in Figure 3.
Figure 3. Hadoop Pig vs. MapReduce

From this we can formulate the following observation: as the input file size increases in multiples of x, the execution time for typical MapReduce increases proportionally, whereas Hadoop Pig maintains an almost constant time, at least for x up to 5. Graphs are plotted for both and are given as Figure 4 and Figure 5.

Figure 4. Input file-size vs. execution time for MapReduce

Figure 5. Input file-size vs. execution time for Pig

7. CONCLUSION

In this paper, a detailed survey of the wide scope of the Apache Pig platform is conducted. The key features of the network flows are extracted for data analysis using Hadoop Pig, which was tested and proved advantageous in this respect, with very low computational complexity. Network flow capture is to be done with the help of easily available open-source tools. Once the Pig-based flow analyzer is implemented, traffic flow analysis can be done with increased convenience, even without highly developed programming skills. It greatly reduces the time spent researching the complex programming constructs and
control loops needed to implement a typical MapReduce paradigm in Java, while still enjoying the advantages of the Map/Reduce task-division method through the layered architecture and built-in compiler of Apache Pig.

ACKNOWLEDGEMENT

We are greatly indebted to the college management and the faculty members for providing the necessary facilities and hardware, along with timely guidance and suggestions, for implementing this work.

AUTHORS

Anjali P P is an MTech student in the Information Technology Department of Rajagiri School of Engineering and Technology, Kochi, under M G University. She received her BTech degree in Computer Science and Engineering from Cochin University of Science and Technology. Her research interests include Hadoop and big data analysis.

Binu A is an Assistant Professor in the Information Technology Department of Rajagiri School of Engineering and Technology, Kochi, under M G University.