This document summarizes a research paper on scalable NetFlow analysis using Hadoop. It discusses:
1) The challenges of analyzing large volumes of Internet traffic data, including scalability, fault tolerance, and extensibility.
2) How Hadoop can help address these challenges by providing distributed computing and storage capabilities to process petabytes of data across thousands of nodes.
3) The design of a Hadoop-based traffic processing tool for collecting, storing, and analyzing NetFlow and packet data at scale through MapReduce jobs.
1. Scalable NetFlow Analysis
with Hadoop
Yeonhee Lee and Youngseok Lee
{yhlee06, lee}@cnu.ac.kr
http://networks.cnu.ac.kr/~yhlee
Chungnam National University, Korea
January 8, 2013
FloCon 2013
4. Internet Measurement
• Challenges
• Scalability
• Fault-tolerant system
• Extensibility
• CAIDA data
• Capture, Curation, Storage, Search, Sharing, Analysis,
and Visualization
• Ark topology: 1.8 TB
• Telescope: 102 TB
• Packet headers: 18.8 TB
Josh Polterock, “CAIDA: A Data Sharing Case Study,”
Security at the Cyber Border: Exploring Cybersecurity for International Research Network
Connections workshop, 2012 4
5. Harness Distributed Computing
and Storage ?
Google MapReduce, 2004 Apache Hadoop project
• 1 PB sorting by Google
• 2008: 6 hours and 2
minutes on 4,000
computers
• 2011: 33 minutes on 8000
computers
• 2011: 10PB, 8000
computers, 6 hours and 27
minutes
5
6. Our Proposal
Hadoop-based Traffic Measurement Administrator
and Analysis Platform
NetFlow v5
Web Visualizer / Hive
Packet
Master
Traffic Analyzer
Traffic Analysis
Mapper & Reducer
Traffic
Collector
Slave Pcap Bin NetFlow
I/O I/O I/O
HDFS Hadoop
1. Yeonhee Lee and Youngseok Lee, "Toward Scalable Internet Traffic Measurement and Analysis with
Hadoop," ACM SIGCOMM Computer Communication Review (CCR), Jan. 2013
2. Yeonhee Lee and Youngseok Lee “A Hadoop-based Packet Trace Processing Tool” , TMA, April 2011
3. Yeonhee Lee and Youngseok Lee, "Detecting DDoS Attacks with Hadoop", ACM CoNEXT Student
Workshop, Dec, 2011 6
7. Related Work
• Traffic analysis of DNS root server (RIPE, 2011.11)
• PacketPig (2012.03) - Big Data Security Analytics platform
• Sherpasurfing – Open Source Cyber Security Solution, Hadoop World
2011
• Firewall/IDS logs, netflow/packet
• Performing Network and Security Analytics with Hadoop, (Travis
Dawson, Narus), Hadoop Summit 2012
• Distributed Bro (IDS)
7
15. Block-level IO vs. File-level IO
140 4.5
4.3
120 4.0
3.9
Completion Time (min)
3.5 3.5
100
3.0
SpeedUp
80 2.5 IP Analysis_blockIO
60 2.0 IP Analysis_fileIO
1.5
40 SpeedUp vs fileIO
1.0
20 0.5
0 0.0
# of nodes
15
16. Challenges
1. Data handing issue in Hadoop
2. Distributed traffic analysis MapReduce algorithms
3. Performance tuning in a large-scale Hadoop
testbed
16
17. DistributedCache
Aggregation Filtering Rule
cnu;srcip=168.188.0.0-168.188.255.255
Aggregation Rule
as;ip;subnet;port;protocol;srcas;dstas;srcip;dstip;sr
csubnet;dstsubnet;srcport;dstport;
identification Port
IP/UDP packet
aggregation
# of octets
generation
group-key
K: time|AS counts
decoding
NetFlow
filtering
# of packets
packet
v5 header V: count per AS
# of Flows
v5 record Protocol
# of octets
…
# of packets
… # of Flows
AS
identification
v5 record
aggregation
# of octets
generation
group-key K: time|AS
decoding
counts
filtering
# of packets
packet
IP/UDP packet V: count per AS
# of Flows
NetFlow
v5 header Subnet
# of octets
v5 record
# of packets
# of Flows
Block Map Reduce
HDFS IO Phase Phase HDFS
17
31. Summary
• NetFlow analysis with Hadoop
• NetFlow v5 processing module
• MapReduce algorithms: statistics
• Distributed computing and storage with Hadoop
• Fits Internet measurement application
• Scalability
• Source codes are available at
• Packet, NetFlow
• https://sites.google.com/a/networks.cnu.ac.kr/dnlab/researc
h/hadoop
• https://github.com/ssallys/pcap-on-Hadoop
31
32. Ongoing Work
• Distributed real-time • Scalable collection
monitoring • E.g.) 10GE 10 X 1 GE
• Rule matching for HDFS
Streamed NetFlow
• Developing rule for
MapReduce
• Rule classification for
dedicated rule matching Productivity
RHive RHadoop
Rhipe
• Integration Pig Hive
Maho
ut
• Streaming packages MapReduce
• Enhanced analytics HDFS
• Data mining: Mahout
Performance
• Machine learning
32
33. Reference
• Papers
1. Y. Lee and Y. Lee, "Toward Scalable Internet Traffic
Measurement and Analysis with Hadoop," ACM SIGCOMM
Computer Communication Review (CCR), Jan. 2013
2. Y. Lee, W. Kang, and Y. Lee, "A Hadoop-based Packet Trace
Processing Tool," The Third TMA, April 2011
3. Y. Lee and Y. Lee, "Detecting DDoS Attacks with Hadoop",
ACM CoNEXT Student Workshop, Dec, 2011
• Software
1. http://networks.cnu.ac.kr/~yhlee
2. https://sites.google.com/a/networks.cnu.ac.kr/dnlab/research/hadoop
3. https://github.com/ssallys/pcap-on-Hadoop
33