SlideShare une entreprise Scribd logo
1  sur  34
Télécharger pour lire hors ligne
Scalable NetFlow Analysis
      with Hadoop
    Yeonhee Lee and Youngseok Lee
         {yhlee06, lee}@cnu.ac.kr
     http://networks.cnu.ac.kr/~yhlee
    Chungnam National University, Korea

              January 8, 2013
                FloCon 2013
Contents
• Introduction
• Overview
• Hadoop-based traffic processing tool
• Evaluation
• Summary
INTRODUCTION
Internet Measurement
• Challenges
   • Scalability
   • Fault-tolerant system
   • Extensibility
• CAIDA data
   • Capture, Curation, Storage, Search, Sharing, Analysis,
     and Visualization
       • Ark topology: 1.8 TB
       • Telescope: 102 TB
       • Packet headers: 18.8 TB
    Josh Polterock, “CAIDA: A Data Sharing Case Study,”
    Security at the Cyber Border: Exploring Cybersecurity for International Research Network
    Connections workshop, 2012                                                           4
Harness Distributed Computing
and Storage ?
Google MapReduce, 2004          Apache Hadoop project
• 1 PB sorting by Google
  • 2008: 6 hours and 2
    minutes on 4,000
    computers
  • 2011: 33 minutes on 8000
    computers
  • 2011: 10PB, 8000
    computers, 6 hours and 27
    minutes



                                                        5
Our Proposal
            Hadoop-based Traffic Measurement                                 Administrator
                 and Analysis Platform

                        NetFlow v5

                                                                              Web Visualizer / Hive
               Packet
                                             Master
                                                                                Traffic Analyzer
                                                                                 Traffic Analysis
                                                                               Mapper & Reducer
                                                               Traffic
                                                              Collector
                                     Slave                                  Pcap      Bin      NetFlow
                                                                             I/O      I/O        I/O



                                                                     HDFS           Hadoop




 1. Yeonhee Lee and Youngseok Lee, "Toward Scalable Internet Traffic Measurement and Analysis with
    Hadoop," ACM SIGCOMM Computer Communication Review (CCR), Jan. 2013
 2. Yeonhee Lee and Youngseok Lee “A Hadoop-based Packet Trace Processing Tool” , TMA, April 2011
 3. Yeonhee Lee and Youngseok Lee, "Detecting DDoS Attacks with Hadoop", ACM CoNEXT Student
    Workshop, Dec, 2011                                                                            6
Related Work
• Traffic analysis of DNS root server (RIPE, 2011.11)
• PacketPig (2012.03) - Big Data Security Analytics platform
• Sherpasurfing – Open Source Cyber Security Solution, Hadoop World
  2011
    •   Firewall/IDS logs, netflow/packet
• Performing Network and Security Analytics with Hadoop, (Travis
  Dawson, Narus), Hadoop Summit 2012
• Distributed Bro (IDS)




                                                                      7
OVERVIEW
Hadoop-based NetFlow Analysis
                                     Collect & Anlaysis




   NetFlow                 NetFlow



             Distributer




                                                          9
Traffic Analyzer
                                            Hive QL Query for Traffic Analysis                                User
                                                                                                            Interface
                           Scan IP      Spoofed IP         Heavy User                     User-defined
                            query         query              query                           query


               Traffic                       MapReduce for Traffic Analysis                                   Web
                                                                                                              UI
              Collector        IP           TCP              HTTP              DDoS            NetFlow
                  &       analysis MR   analysis MR       analysis MR       analysis MR       analysis MR
               Loader
NetFlow
 Packet




                                                                                                                             monitor
                                                           IO formats
                                  Pcap                      Binary                      Text                                  query
                              InputFormat             Input/OutputFormat         Input/OutputFormat
                                                                                                               CLI



                                     HDFS                          Hadoop

           Data Source                                    Data Processing                                   User Interface
          (Jpcap, HDFS)                               (HDFS, MapReduce, Hive)                                (Hive, Web)




                                        Distributer




                                                                                                                                       10
HADOOP-BASED TRAFFIC
ANALYSIS
Challenges
1. Data handing issue in HDFS
2. Distributed traffic analysis MapReduce algorithms
3. Performance tuning in a large-scale Hadoop




         Scalability     Fault      Distributed
         (~TB/PB)      tolerance   computation




                                                       12
Challenges
1. Data handing issue in Hadoop

2. Distributed traffic analysis MapReduce algorithms

3. Performance tuning in a large-scale Hadoop
testbed




                                                       13
Block-level Parallelism
                     block-level                                                        file-level
                     processing                                                        processing




                                      HDFS                                                            HDFS
                                  Block2 (64 MB)                                                  Block3 (64 MB)

                                                                                                                              14
00 00 68 2B AD 4C 38 A4 04 00 5C 00 00 00 5C 00 00 00 FF FF ‥‥ 00 21 B5 01 68 2B AD 4C 2B 1C 07 00 3C 00 00 00 3C 00 00 00 01 80 ‥
Block-level IO vs. File-level IO

                        140                                  4.5
                                                       4.3
                        120                                  4.0
                                                 3.9
Completion Time (min)




                                           3.5               3.5
                        100
                                                             3.0




                                                                   SpeedUp
                         80                                  2.5             IP Analysis_blockIO

                         60                                  2.0             IP Analysis_fileIO
                                                             1.5
                         40                                                  SpeedUp vs fileIO
                                                             1.0
                         20                                  0.5
                          0                                  0.0



                              # of nodes
                                                                                             15
Challenges

1. Data handing issue in Hadoop

2. Distributed traffic analysis MapReduce algorithms

3. Performance tuning in a large-scale Hadoop

testbed



                                                       16
DistributedCache
Aggregation                                                 Filtering Rule
                                                            cnu;srcip=168.188.0.0-168.188.255.255

                                                            Aggregation Rule
                                                            as;ip;subnet;port;protocol;srcas;dstas;srcip;dstip;sr
                                                            csubnet;dstsubnet;srcport;dstport;



                identification                                                                              Port
IP/UDP packet




                                                                                     aggregation
                                                                                                            # of octets




                                                        generation
                                                         group-key
                                                                     K: time|AS                    counts



                                 decoding
 NetFlow




                                            filtering
                                                                                                            # of packets
                   packet


 v5 header                                                           V: count                      per AS
                                                                                                            # of Flows
 v5 record                                                                                                  Protocol
                                                                                                            # of octets
 …
                                                                                                            # of packets
 …                                                                                                          # of Flows
                                                                                                            AS
                identification




 v5 record




                                                                                     aggregation
                                                                                                            # of octets
                                                        generation
                                                         group-key   K: time|AS
                                 decoding



                                                                                                   counts
                                            filtering

                                                                                                            # of packets
                   packet




IP/UDP packet                                                        V: count                      per AS
                                                                                                            # of Flows
 NetFlow
 v5 header                                                                                                  Subnet
                                                                                                            # of octets
 v5 record
                                                                                                            # of packets
                                                                                                            # of Flows

                Block                        Map                                   Reduce
     HDFS        IO                         Phase                                  Phase                        HDFS
                                                                                                                           17
Anomaly Detection                                          Detection Rules
                                                                                     DistributedCache
                                                           port_scan;ip,proto=6;srcip,dstport;srcip;pkts=20-
                                                           syn_flood;ip,proto=6,syn-fin=1-
                                                           ;srcip,dstip;srcip,dstip;syn-fin=6-




IP/UDP packet
                identification




                                                                                           aggregation
                                                                      &group sort
                                                                      partitioning
                                                       generation
                                                        group-key




                                                                                                          detection
                                 decoding


                                            matching
 NetFlow


                                             pattern
                   packet



 v5 header
                                                                                                                      syn_flood
 v5 record
                                                                                                                      3.3.3.33.3.3.4
                                                                                                                       …
 …

 …

 v5 record
                identification




                                                                                           aggregation
                                                                      &group sort
                                                                      partitioning
                                                       generation
                                                        group-key

                                                                                                                      PortScan




                                                                                                         detection
                                 decoding


                                            matching



                                                                                                                      1.1.1.11.1.1.2
                                             pattern
                   packet




IP/UDP packet
                                                                                                                       …
 NetFlow
 v5 header

 v5 record



                Block                        Map                     Shuffle                   Reduce
     HDFS
                 IO                         Phase                     &Sort                    Phase                      HDFS
                                                                                                                               18
Challenges

1. Data handing issue in Hadoop

2. Distributed traffic analysis MapReduce algorithms

3. Performance tuning in a large-scale Hadoop




                                                       19
Performance Tuning
• Configuration
   • Hadoop IO Buffer (128K  1 MB)
   • Java heap space (300 MB 1 024 MB)
   • # of MapReduce Slots (# of cores)
• MapReduce Algorithm
  • normal combiner vs inMapper combiner
• Job scheduling




                                           20
Job Scheduling
       • Different job types
             • Periodic jobs (for monitoring)
                • guaranteed service within time
                • e.g Aggregated Statistics for monitoring, Flow Parse job for
                   analytics
             • Small ad-hoc query job (for analytics)
                • fast response time
                  5 munites             5 munites                        5 munites                          5 munites

Collector         Collect                   Collect                           Collect                                     …

FIFO Scheduling      …        Basic Statistics
                                                 Flow
                                                 Parse
                                                         ad-hoc
                                                         query
                                                                Basic Statistics
                                                                                   Flow
                                                                                   Parse
                                                                                           ad-hoc
                                                                                           query
                                                                                                  Basic Statistics
                                                                                                                         Flow
                                                                                                                         Parse
                                                                                                                                  ad-hoc
                                                                                                                                  query



Fair Scheduler
                     …          Basic Statistics   Flow Parse       Basic Statistics   Flow Parse     Basic Statistics        Flow Parse
                                        ad-hoc query                        ad-hoc query                      ad-hoc query

                                                                                                                          periodic job 21
                                                                                                                          ad-hoc job
PERFORMANCE
EVALUATION
Experiments
• Testbed
  Type      Nodes      Cores             CPU             Memory         HardDisk       Rack
  Small       3          24          3.4 GHz 8 core        16 GB           2 TB       1 Rack

  Medium      30        240      2.93 GHz 8 core           16 GB           4 TB       1 Rack

  Large      200        400      2.66 GHz 2 core            2 GB         500 GB       4 Racks




• Data and MapReduce jobs
  Type                Dataset                         MapReduce Job                 Testbed
 NetFlow           1 TB from KOREN             flowStats, flowDetect, flowPrint       Small
                                                      IP, TCP, Web
  Packet   1 ~ 5 TB from CNU campus N/W                                           Medium, Large
                                              (webpop, User Behavior, DDoS)



                                                                                              23
NetFlow: SpeedUp (vs. Flowtools)
                                                 flowStats vs. Flowtools
                                                 flowPrint vs. FlowTools

          6.0
          5.0                                   5.0
SpeedUp




          4.0
          3.0
          2.0                                   2.3

          1.0
          0.0
                1        2                  3
                    # of nodes   > FlowPrint
                                      flow-cat -p flowfile |flow-print –f14
                                 > FlowStats
                                      flow-cat -p flowfile|flow-stat -f12
                                                                        24
                                      flow-cat -p flowfile|flow-stat –f5
NetFlow: Scalability
Job Completion Time (min)




                            120                                                                         7
                                  103
                            100                                                                         6                                   6.1




                                                                                 SpeedUp (vs 5 nodes)
                             80                                                                         5
                                                                                                        4
                             60                                     flowStats                                                               flowStats
                                                                                                        3
                             40                                     flowDetect                                                              flowDetect
                                                                                                        2
                                                               17
                             20                                                                         1
                              0                                                                         0
                                   5    10    15   20     25   30                                           5   10   15    20     25   30
                                             # of nodes                                                              # of nodes




                                                                                                                                                  25
NetFlow: Pattern Matching Result
                                                                                     10
                                                                                              NetFlows Record Distribution
                                                                                     8




                                                                  # of records (M)
                                                                                     6

                                                                                     4

                                                                                     2

                                                                                     0




   100000
                                                                                                                        worm.sasser,w32.sasser

                                                                                                                        remote_administrator
        10000                                                                                                           vnc

                                                                                                                        w32.witty.worm

        1000                                                                                                            worm.opasoft,w32.opaserv.worm
count




                                                                                                                        code_red_worm

         100                                                                                                            netfairy

                                                                                                                        kamun

          10                                                                                                            emule

                                                                                                                        shockwave_killer

            1                                                                                                           worm.killmsblast,w32.nachi.worm,w32.welchia.w
          2012-08-29   2012-09-08   2012-09-18   2012-09-28   2012-10-08                  2012-10-18     2012-10-28     orm

                                                       date                                                                                             26
Packet: ScaleOut
                         140
Completion Time (min)




                         120       121
                         100                                                   IP Analysis (min)
                         80        77                                          TCP Analysis (min)
                         60                                                    WebPop (min)
                         40                                                    UserBehavior (min)
                         20                                          15
                                                                      13       DDoS (min)
                          0
                               5         10   15        20      25   30

                         14
                                                                          13
                         12
                                                                               IP Analysis (Gbps)
   Throughput (Gbps)




                         10
                                                                          9    TCP Analysis (Gbps)
                          8
                          6                                                    WebPop (Gbps)
                          4                                                    UserBehavior (Gbps)
                          2
                                                                               DDos (Gbps)
                          0
                               5         10   15        20      25   30
                                                   # of nodes                                  27
Packet: SizeUp (30 nodes)

                        8000                                      16                       IP Analysis
                                                           7007                            IP Analysis_ripe
                        7000                                      14
Completion Time (sec)




                                                                                           TCPStats
                        6000                                      12




                                                                       Throughput (Gbps)
                                                                                           Webpop
                        5000                                      10
                                                                                           UserBehavior
                        4000                                      8
                                                                                           DDoS
                                                         2934
                        3000                                      6                        IP Analysis
                        2000                                      4                        TCPStats
                        1000                                      2                        Webpop

                           0                                      0                        UserBehavior
                               1TB   2TB     3TB   4TB      5TB                            DDoS

                                       Data Size
                                                                                                         28
Packet: SizeUp (200 nodes)

                        5000                                             18
                        4500                                             16
                        4000                                        15                            IP Analysis
Completion Time (sec)




                                                                         14
                        3500                                                                      TCP Analysis
                                                                         12




                                                                              Throughput (Gbps)
                                                                                                  Webpop
                        3000                                 2744
                                                                         10                       UserBehavior
                        2500
                                                                    8    8                        DDoS
                        2000                                                                      IP Analysis
                                                                         6
                        1500                                                                      TCP Analysis
                        1000                                             4                        Webpop
                         500                                             2                        UserBehavior
                           0                                             0                        DDoS
                               1TB   2TB    3TB        4TB      5TB
                                           Data Size

                                                                                                        29
SUMMARY
Summary
• NetFlow analysis with Hadoop
   • NetFlow v5 processing module
   • MapReduce algorithms: statistics

•   Distributed computing and storage with Hadoop
    • Fits Internet measurement application
    • Scalability

• Source codes are available at
   • Packet, NetFlow
   • https://sites.google.com/a/networks.cnu.ac.kr/dnlab/researc
      h/hadoop
    • https://github.com/ssallys/pcap-on-Hadoop


                                                                   31
Ongoing Work
• Distributed real-time              • Scalable collection
  monitoring                           • E.g.) 10GE  10 X 1 GE
  • Rule matching for                    HDFS
    Streamed NetFlow
  • Developing rule for
    MapReduce
  • Rule classification for
    dedicated rule matching                                                          Productivity

                                                RHive     RHadoop
                                                                      Rhipe

• Integration                             Pig           Hive
                                                                              Maho
                                                                               ut

  • Streaming packages                             MapReduce

  • Enhanced analytics                                         HDFS
      •   Data mining: Mahout
                                Performance
      •   Machine learning

                                                                                          32
Reference
• Papers
  1.   Y. Lee and Y. Lee, "Toward Scalable Internet Traffic
       Measurement and Analysis with Hadoop," ACM SIGCOMM
       Computer Communication Review (CCR), Jan. 2013
  2.   Y. Lee, W. Kang, and Y. Lee, "A Hadoop-based Packet Trace
       Processing Tool," The Third TMA, April 2011
  3.   Y. Lee and Y. Lee, "Detecting DDoS Attacks with Hadoop",
       ACM CoNEXT Student Workshop, Dec, 2011

• Software
  1.   http://networks.cnu.ac.kr/~yhlee
  2.   https://sites.google.com/a/networks.cnu.ac.kr/dnlab/research/hadoop
  3.   https://github.com/ssallys/pcap-on-Hadoop


                                                                             33
THANK YOU !

Contenu connexe

Tendances

Expand data analysis tool at scale with Zeppelin
Expand data analysis tool at scale with ZeppelinExpand data analysis tool at scale with Zeppelin
Expand data analysis tool at scale with ZeppelinDataWorks Summit
 
Hoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkHoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkVinoth Chandar
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Vinoth Chandar
 
HBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index Structures
HBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index StructuresHBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index Structures
HBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index StructuresCloudera, Inc.
 
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Sumeet Singh
 
Tajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for HadoopTajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for HadoopHyunsik Choi
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillMapR Technologies
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick viewRajesh Nadipalli
 
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumTathastu.ai
 
Building Data Applications with Apache Druid
Building Data Applications with Apache DruidBuilding Data Applications with Apache Druid
Building Data Applications with Apache DruidImply
 
Tajo Seoul Meetup July 2015 - What's New Tajo 0.11
Tajo Seoul Meetup July 2015 - What's New Tajo 0.11Tajo Seoul Meetup July 2015 - What's New Tajo 0.11
Tajo Seoul Meetup July 2015 - What's New Tajo 0.11Hyunsik Choi
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop EcosystemLior Sidi
 
The architecture of data analytics PaaS on AWS
The architecture of data analytics PaaS on AWSThe architecture of data analytics PaaS on AWS
The architecture of data analytics PaaS on AWSTreasure Data, Inc.
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemInSemble
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemCloudera, Inc.
 

Tendances (20)

Expand data analysis tool at scale with Zeppelin
Expand data analysis tool at scale with ZeppelinExpand data analysis tool at scale with Zeppelin
Expand data analysis tool at scale with Zeppelin
 
Hoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkHoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on Spark
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
Video Analysis in Hadoop
Video Analysis in HadoopVideo Analysis in Hadoop
Video Analysis in Hadoop
 
HBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index Structures
HBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index StructuresHBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index Structures
HBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index Structures
 
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
 
Hadoop and HBase @eBay
Hadoop and HBase @eBayHadoop and HBase @eBay
Hadoop and HBase @eBay
 
Tajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for HadoopTajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for Hadoop
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache Drill
 
Introduction to Apache Drill
Introduction to Apache DrillIntroduction to Apache Drill
Introduction to Apache Drill
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
 
Building Data Applications with Apache Druid
Building Data Applications with Apache DruidBuilding Data Applications with Apache Druid
Building Data Applications with Apache Druid
 
Tajo Seoul Meetup July 2015 - What's New Tajo 0.11
Tajo Seoul Meetup July 2015 - What's New Tajo 0.11Tajo Seoul Meetup July 2015 - What's New Tajo 0.11
Tajo Seoul Meetup July 2015 - What's New Tajo 0.11
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
The architecture of data analytics PaaS on AWS
The architecture of data analytics PaaS on AWSThe architecture of data analytics PaaS on AWS
The architecture of data analytics PaaS on AWS
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
 
Hadoop Family and Ecosystem
Hadoop Family and EcosystemHadoop Family and Ecosystem
Hadoop Family and Ecosystem
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
 

Similaire à Net flowhadoop flocon2013_yhlee_final

Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwielerlucenerevolution
 
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...Cloudera, Inc.
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentationMapR Technologies
 
Hadoop summit 2010, HONU
Hadoop summit 2010, HONUHadoop summit 2010, HONU
Hadoop summit 2010, HONUJerome Boulon
 
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With HadoopCloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With HadoopCloudera, Inc.
 
A Comparative Survey Based on Processing Network Traffic Data Using Hadoop Pi...
A Comparative Survey Based on Processing Network Traffic Data Using Hadoop Pi...A Comparative Survey Based on Processing Network Traffic Data Using Hadoop Pi...
A Comparative Survey Based on Processing Network Traffic Data Using Hadoop Pi...IJCSES Journal
 
A comparative survey based on processing network traffic data using hadoop pi...
A comparative survey based on processing network traffic data using hadoop pi...A comparative survey based on processing network traffic data using hadoop pi...
A comparative survey based on processing network traffic data using hadoop pi...ijcses
 
Big data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosqlBig data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosqlKhanderao Kand
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesOReillyStrata
 
2014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part12014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part1Adam Muise
 
Deploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopDeploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopGeorge Ang
 
Hadoop Summit 2010 Benchmarking And Optimizing Hadoop
Hadoop Summit 2010 Benchmarking And Optimizing HadoopHadoop Summit 2010 Benchmarking And Optimizing Hadoop
Hadoop Summit 2010 Benchmarking And Optimizing HadoopYahoo Developer Network
 
Multi-thematic spatial databases
Multi-thematic spatial databasesMulti-thematic spatial databases
Multi-thematic spatial databasesConor Mc Elhinney
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Hortonworks
 
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...DataWorks Summit/Hadoop Summit
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop StoryMichael Rys
 

Similaire à Net flowhadoop flocon2013_yhlee_final (20)

Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
 
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentation
 
Hadoop summit 2010, HONU
Hadoop summit 2010, HONUHadoop summit 2010, HONU
Hadoop summit 2010, HONU
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With HadoopCloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
 
A Comparative Survey Based on Processing Network Traffic Data Using Hadoop Pi...
A Comparative Survey Based on Processing Network Traffic Data Using Hadoop Pi...A Comparative Survey Based on Processing Network Traffic Data Using Hadoop Pi...
A Comparative Survey Based on Processing Network Traffic Data Using Hadoop Pi...
 
A comparative survey based on processing network traffic data using hadoop pi...
A comparative survey based on processing network traffic data using hadoop pi...A comparative survey based on processing network traffic data using hadoop pi...
A comparative survey based on processing network traffic data using hadoop pi...
 
Big data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosqlBig data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosql
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
 
2014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part12014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part1
 
Deploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopDeploying Grid Services Using Hadoop
Deploying Grid Services Using Hadoop
 
Hadoop Summit 2010 Benchmarking And Optimizing Hadoop
Hadoop Summit 2010 Benchmarking And Optimizing HadoopHadoop Summit 2010 Benchmarking And Optimizing Hadoop
Hadoop Summit 2010 Benchmarking And Optimizing Hadoop
 
Multi-thematic spatial databases
Multi-thematic spatial databasesMulti-thematic spatial databases
Multi-thematic spatial databases
 
Cloud computing era
Cloud computing eraCloud computing era
Cloud computing era
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011
 
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
 
Introduction to Pig
Introduction to PigIntroduction to Pig
Introduction to Pig
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop Story
 
Using R with Hadoop
Using R with HadoopUsing R with Hadoop
Using R with Hadoop
 

Net flowhadoop flocon2013_yhlee_final

  • 1. Scalable NetFlow Analysis with Hadoop Yeonhee Lee and Youngseok Lee {yhlee06, lee}@cnu.ac.kr http://networks.cnu.ac.kr/~yhlee Chungnam National University, Korea January 8, 2013 FloCon 2013
  • 2. Contents • Introduction • Overview • Hadoop-based traffic processing tool • Evaluation • Summary
  • 4. Internet Measurement • Challenges • Scalability • Fault-tolerant system • Extensibility • CAIDA data • Capture, Curation, Storage, Search, Sharing, Analysis, and Visualization • Ark topology: 1.8 TB • Telescope: 102 TB • Packet headers: 18.8 TB Josh Polterock, “CAIDA: A Data Sharing Case Study,” Security at the Cyber Border: Exploring Cybersecurity for International Research Network Connections workshop, 2012 4
  • 5. Harness Distributed Computing and Storage ? Google MapReduce, 2004 Apache Hadoop project • 1 PB sorting by Google • 2008: 6 hours and 2 minutes on 4,000 computers • 2011: 33 minutes on 8000 computers • 2011: 10PB, 8000 computers, 6 hours and 27 minutes 5
  • 6. Our Proposal Hadoop-based Traffic Measurement Administrator and Analysis Platform NetFlow v5 Web Visualizer / Hive Packet Master Traffic Analyzer Traffic Analysis Mapper & Reducer Traffic Collector Slave Pcap Bin NetFlow I/O I/O I/O HDFS Hadoop 1. Yeonhee Lee and Youngseok Lee, "Toward Scalable Internet Traffic Measurement and Analysis with Hadoop," ACM SIGCOMM Computer Communication Review (CCR), Jan. 2013 2. Yeonhee Lee and Youngseok Lee “A Hadoop-based Packet Trace Processing Tool” , TMA, April 2011 3. Yeonhee Lee and Youngseok Lee, "Detecting DDoS Attacks with Hadoop", ACM CoNEXT Student Workshop, Dec, 2011 6
  • 7. Related Work • Traffic analysis of DNS root server (RIPE, 2011.11) • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World 2011 • Firewall/IDS logs, netflow/packet • Performing Network and Security Analytics with Hadoop, (Travis Dawson, Narus), Hadoop Summit 2012 • Distributed Bro (IDS) 7
  • 9. Hadoop-based NetFlow Analysis Collect & Anlaysis NetFlow NetFlow Distributer 9
  • 10. Traffic Analyzer Hive QL Query for Traffic Analysis User Interface Scan IP Spoofed IP Heavy User User-defined query query query query Traffic MapReduce for Traffic Analysis Web UI Collector IP TCP HTTP DDoS NetFlow & analysis MR analysis MR analysis MR analysis MR analysis MR Loader NetFlow Packet monitor IO formats Pcap Binary Text query InputFormat Input/OutputFormat Input/OutputFormat CLI HDFS Hadoop Data Source Data Processing User Interface (Jpcap, HDFS) (HDFS, MapReduce, Hive) (Hive, Web) Distributer 10
  • 12. Challenges 1. Data handing issue in HDFS 2. Distributed traffic analysis MapReduce algorithms 3. Performance tuning in a large-scale Hadoop Scalability Fault Distributed (~TB/PB) tolerance computation 12
  • 13. Challenges 1. Data handing issue in Hadoop 2. Distributed traffic analysis MapReduce algorithms 3. Performance tuning in a large-scale Hadoop testbed 13
  • 14. Block-level Parallelism block-level file-level processing processing HDFS HDFS Block2 (64 MB) Block3 (64 MB) 14 00 00 68 2B AD 4C 38 A4 04 00 5C 00 00 00 5C 00 00 00 FF FF ‥‥ 00 21 B5 01 68 2B AD 4C 2B 1C 07 00 3C 00 00 00 3C 00 00 00 01 80 ‥
  • 15. Block-level IO vs. File-level IO 140 4.5 4.3 120 4.0 3.9 Completion Time (min) 3.5 3.5 100 3.0 SpeedUp 80 2.5 IP Analysis_blockIO 60 2.0 IP Analysis_fileIO 1.5 40 SpeedUp vs fileIO 1.0 20 0.5 0 0.0 # of nodes 15
  • 16. Challenges 1. Data handing issue in Hadoop 2. Distributed traffic analysis MapReduce algorithms 3. Performance tuning in a large-scale Hadoop testbed 16
  • 17. DistributedCache Aggregation Filtering Rule cnu;srcip=168.188.0.0-168.188.255.255 Aggregation Rule as;ip;subnet;port;protocol;srcas;dstas;srcip;dstip;sr csubnet;dstsubnet;srcport;dstport; identification Port IP/UDP packet aggregation # of octets generation group-key K: time|AS counts decoding NetFlow filtering # of packets packet v5 header V: count per AS # of Flows v5 record Protocol # of octets … # of packets … # of Flows AS identification v5 record aggregation # of octets generation group-key K: time|AS decoding counts filtering # of packets packet IP/UDP packet V: count per AS # of Flows NetFlow v5 header Subnet # of octets v5 record # of packets # of Flows Block Map Reduce HDFS IO Phase Phase HDFS 17
  • 18. Anomaly Detection Detection Rules DistributedCache port_scan;ip,proto=6;srcip,dstport;srcip;pkts=20- syn_flood;ip,proto=6,syn-fin=1- ;srcip,dstip;srcip,dstip;syn-fin=6- IP/UDP packet identification aggregation &group sort partitioning generation group-key detection decoding matching NetFlow pattern packet v5 header syn_flood v5 record 3.3.3.33.3.3.4 … … … v5 record identification aggregation &group sort partitioning generation group-key PortScan detection decoding matching 1.1.1.11.1.1.2 pattern packet IP/UDP packet … NetFlow v5 header v5 record Block Map Shuffle Reduce HDFS IO Phase &Sort Phase HDFS 18
  • 19. Challenges 1. Data handing issue in Hadoop 2. Distributed traffic analysis MapReduce algorithms 3. Performance tuning in a large-scale Hadoop 19
  • 20. Performance Tuning • Configuration • Hadoop IO Buffer (128K  1 MB) • Java heap space (300 MB 1 024 MB) • # of MapReduce Slots (# of cores) • MapReduce Algorithm • normal combiner vs inMapper combiner • Job scheduling 20
  • 21. Job Scheduling • Different job types • Periodic jobs (for monitoring) • guaranteed service within time • e.g Aggregated Statistics for monitoring, Flow Parse job for analytics • Small ad-hoc query job (for analytics) • fast response time 5 munites 5 munites 5 munites 5 munites Collector Collect Collect Collect … FIFO Scheduling … Basic Statistics Flow Parse ad-hoc query Basic Statistics Flow Parse ad-hoc query Basic Statistics Flow Parse ad-hoc query Fair Scheduler … Basic Statistics Flow Parse Basic Statistics Flow Parse Basic Statistics Flow Parse ad-hoc query ad-hoc query ad-hoc query periodic job 21 ad-hoc job
  • 23. Experiments • Testbed Type Nodes Cores CPU Memory HardDisk Rack Small 3 24 3.4 GHz 8 core 16 GB 2 TB 1 Rack Medium 30 240 2.93 GHz 8 core 16 GB 4 TB 1 Rack Large 200 400 2.66 GHz 2 core 2 GB 500 GB 4 Racks • Data and MapReduce jobs Type Dataset MapReduce Job Testbed NetFlow 1 TB from KOREN flowStats, flowDetect, flowPrint Small IP, TCP, Web Packet 1 ~ 5 TB from CNU campus N/W Medium, Large (webpop, User Behavior, DDoS) 23
  • 24. NetFlow: SpeedUp (vs. Flowtools) flowStats vs. Flowtools flowPrint vs. FlowTools 6.0 5.0 5.0 SpeedUp 4.0 3.0 2.0 2.3 1.0 0.0 1 2 3 # of nodes > FlowPrint flow-cat -p flowfile |flow-print –f14 > FlowStats flow-cat -p flowfile|flow-stat -f12 24 flow-cat -p flowfile|flow-stat –f5
  • 25. NetFlow: Scalability Job Completion Time (min) 120 7 103 100 6 6.1 SpeedUp (vs 5 nodes) 80 5 4 60 flowStats flowStats 3 40 flowDetect flowDetect 2 17 20 1 0 0 5 10 15 20 25 30 5 10 15 20 25 30 # of nodes # of nodes 25
  • 26. NetFlow: Pattern Matching Result 10 NetFlows Record Distribution 8 # of records (M) 6 4 2 0 100000 worm.sasser,w32.sasser remote_administrator 10000 vnc w32.witty.worm 1000 worm.opasoft,w32.opaserv.worm count code_red_worm 100 netfairy kamun 10 emule shockwave_killer 1 worm.killmsblast,w32.nachi.worm,w32.welchia.w 2012-08-29 2012-09-08 2012-09-18 2012-09-28 2012-10-08 2012-10-18 2012-10-28 orm date 26
  • 27. Packet: ScaleOut 140 Completion Time (min) 120 121 100 IP Analysis (min) 80 77 TCP Analysis (min) 60 WebPop (min) 40 UserBehavior (min) 20 15 13 DDoS (min) 0 5 10 15 20 25 30 14 13 12 IP Analysis (Gbps) Throughput (Gbps) 10 9 TCP Analysis (Gbps) 8 6 WebPop (Gbps) 4 UserBehavior (Gbps) 2 DDos (Gbps) 0 5 10 15 20 25 30 # of nodes 27
  • 28. Packet: SizeUp (30 nodes) 8000 16 IP Analysis 7007 IP Analysis_ripe 7000 14 Completion Time (sec) TCPStats 6000 12 Throughput (Gbps) Webpop 5000 10 UserBehavior 4000 8 DDoS 2934 3000 6 IP Analysis 2000 4 TCPStats 1000 2 Webpop 0 0 UserBehavior 1TB 2TB 3TB 4TB 5TB DDoS Data Size 28
  • 29. Packet: SizeUp (200 nodes) 5000 18 4500 16 4000 15 IP Analysis Completion Time (sec) 14 3500 TCP Analysis 12 Throughput (Gbps) Webpop 3000 2744 10 UserBehavior 2500 8 8 DDoS 2000 IP Analysis 6 1500 TCP Analysis 1000 4 Webpop 500 2 UserBehavior 0 0 DDoS 1TB 2TB 3TB 4TB 5TB Data Size 29
  • 31. Summary • NetFlow analysis with Hadoop • NetFlow v5 processing module • MapReduce algorithms: statistics • Distributed computing and storage with Hadoop • Fits Internet measurement application • Scalability • Source codes are available at • Packet, NetFlow • https://sites.google.com/a/networks.cnu.ac.kr/dnlab/researc h/hadoop • https://github.com/ssallys/pcap-on-Hadoop 31
  • 32. Ongoing Work • Distributed real-time • Scalable collection monitoring • E.g.) 10GE  10 X 1 GE • Rule matching for HDFS Streamed NetFlow • Developing rule for MapReduce • Rule classification for dedicated rule matching Productivity RHive RHadoop Rhipe • Integration Pig Hive Maho ut • Streaming packages MapReduce • Enhanced analytics HDFS • Data mining: Mahout Performance • Machine learning 32
  • 33. Reference • Papers 1. Y. Lee and Y. Lee, "Toward Scalable Internet Traffic Measurement and Analysis with Hadoop," ACM SIGCOMM Computer Communication Review (CCR), Jan. 2013 2. Y. Lee, W. Kang, and Y. Lee, "A Hadoop-based Packet Trace Processing Tool," The Third TMA, April 2011 3. Y. Lee and Y. Lee, "Detecting DDoS Attacks with Hadoop", ACM CoNEXT Student Workshop, Dec, 2011 • Software 1. http://networks.cnu.ac.kr/~yhlee 2. https://sites.google.com/a/networks.cnu.ac.kr/dnlab/research/hadoop 3. https://github.com/ssallys/pcap-on-Hadoop 33