Flume
Reliable Distributed
Streaming Log Collection

Ian Wrigley
Educational Services, Cloudera, Inc
ian@cloudera.com
Scenario
•  Situation:
      –  You have hundreds of services producing logs
      –  You’re running a daily cron job on the logs
            •  Rotating the logs
            •  Maybe compressing or otherwise processing them
            •  Transferring them to HDFS (the Hadoop Distributed File System)
•  Problem:
      –  As the amount of data increases, it takes longer and longer to run the cron job

7/15/2010                                                                                                                    2
You need a “Flume”
•  Flume is a distributed system that gets your logs from their source and aggregates them to where you want to process them
•  Open source, Apache v2.0 License
•  Goals:
      –  Reliability
      –  Scalability
      –  Extensibility
      –  Manageability
(Image: Columbia Gorge, Broughton Log Flume)

Use cases
•  Collecting logs from nodes in your Hadoop cluster
•  Collecting logs from services such as httpd, mail, etc.
•  Collecting impressions from custom apps for an ad network
•  But wait, there’s more!
      –  Basic online in-stream analysis
      –  Online in-stream file processing and manipulation
(Image caption: It’s log, log ... Everyone wants a log!)

Key abstractions
•  Data path and control path
•  Nodes are in the data path
      –  Nodes have a source and a sink
      –  They can take different roles
            •  A typical topology has agent nodes and collector nodes
            •  Optionally it has processor nodes
•  Masters are in the control path
      –  Centralized point of configuration
      –  Specify sources and sinks
      –  Can control flows of data between nodes
      –  Use one master or use many with a ZooKeeper-backed quorum
(Diagram labels: Agent, Collector, Master)

A sample topology
            Agent tier   Collector tier       Master
            Agent
             Agent         Collector
              Agent
               Agent

            Agent
             Agent         Collector         HDFS
              Agent
               Agent
                                          /logs/web/2010/0715/1200
                                          /logs/web/2010/0715/1300
                                          /logs/web/2010/0715/1400
            Agent
             Agent         Collector
              Agent
               Agent

Masters control node configuration
            Agent tier   Collector tier       Master
            Agent
             Agent         Collector
              Agent
               Agent                        Storage tier

            Agent
             Agent         Collector         HDFS
              Agent
               Agent
                                          /logs/web/2010/0715/1200
                                          /logs/web/2010/0715/1300
                                          /logs/web/2010/0715/1400
            Agent
             Agent         Collector
              Agent
               Agent

Outline
•  What is Flume?
      –  Goals and architecture
•  Reliability
      –  Fault-tolerance and High availability
•  Scalability
      –  Horizontal scalability of all nodes and masters
•  Extensibility
      –  Unix principle, all kinds of data, all kinds of sources, all kinds of sinks
•  Manageability
      –  Centralized management supporting dynamic reconfiguration

RELIABILITY
(Image caption: The logs will still get there…)

Tunable data reliability levels
•  Best effort
      –  Fire and forget
•  Store on failure + retry
      –  Local acks, local errors detectable
      –  Failover when faults detected
•  End-to-end reliability
      –  End to end acks
      –  Data survives compound failures, and may be retried multiple times
(Diagram for each level: Agent, Collector, HDFS)

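The difference between the first two levels can be illustrated with a small sketch. This is not Flume's implementation: the classes `FlakyCollector`, `BestEffortAgent` and `StoreOnFailureAgent` are hypothetical names invented for this illustration, the in-memory queue stands in for a local on-disk log, and end-to-end acks are not modeled.

```python
from collections import deque


class FlakyCollector:
    """Hypothetical collector that rejects sends while `up` is False."""

    def __init__(self):
        self.up = True
        self.received = []

    def send(self, event):
        if not self.up:
            raise ConnectionError("collector unreachable")
        self.received.append(event)


class BestEffortAgent:
    """Fire and forget: a failed send silently drops the event."""

    def __init__(self, collector):
        self.collector = collector

    def append(self, event):
        try:
            self.collector.send(event)
        except ConnectionError:
            pass  # the event is lost


class StoreOnFailureAgent:
    """Store on failure + retry: unsent events are kept locally and resent."""

    def __init__(self, collector):
        self.collector = collector
        self.pending = deque()  # stands in for a local on-disk log

    def append(self, event):
        self.pending.append(event)
        self.flush()

    def flush(self):
        # Resend in order; stop at the first failure and keep the rest.
        while self.pending:
            try:
                self.collector.send(self.pending[0])
            except ConnectionError:
                return
            self.pending.popleft()


def demo():
    """Send three events to each agent type with an outage in the middle."""
    results = {}
    for name, agent_cls in [("best_effort", BestEffortAgent),
                            ("store_on_failure", StoreOnFailureAgent)]:
        collector = FlakyCollector()
        agent = agent_cls(collector)
        agent.append("e1")
        collector.up = False   # collector goes down
        agent.append("e2")
        collector.up = True    # collector recovers
        agent.append("e3")
        results[name] = list(collector.received)
    return results


print(demo())
```

In this sketch the best-effort agent loses the event sent during the outage, while the store-on-failure agent delivers it once the collector is reachable again.
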
SCALABILITY
(Image caption: Logs jamming the Kemi River)

Data path is horizontally scalable
(Diagram: Agents, Collector, HDFS)
•  Add collectors to increase availability and to handle more data
      –  Assumes a single agent will not dominate a collector
      –  Fewer connections to HDFS
      –  Larger, more efficient writes to HDFS
•  Agents have mechanisms for machine resource tradeoffs
     •  Write log locally to avoid collector disk IO bottleneck and catastrophic failures
     •  Compression and batching (trade CPU for network)
     •  Push computation into the event collection pipeline (balance IO, Mem, and CPU resource bottlenecks)

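The "trade CPU for network" point can be made concrete with a rough sketch. It assumes events are plain strings and uses JSON plus gzip purely for illustration; the helper names `encode_batch` and `decode_batch` are hypothetical, and Flume's actual wire format is not shown here.

```python
import gzip
import json


def encode_batch(events):
    """Pack a list of event strings into one gzip-compressed payload."""
    raw = json.dumps(events).encode("utf-8")
    return gzip.compress(raw)


def decode_batch(payload):
    """Inverse of encode_batch, as a receiving collector might apply it."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))


# 1000 synthetic, highly repetitive log lines (typical of access logs).
events = ["GET /index.html 200 host=web%02d" % (i % 4) for i in range(1000)]

unbatched_bytes = sum(len(e.encode("utf-8")) for e in events)
payload = encode_batch(events)

# Compare bytes sent one event at a time with bytes in one compressed batch.
print(unbatched_bytes, len(payload))
```

The agent spends CPU on compression and holds events until a batch is full, and in exchange sends far fewer bytes and far fewer messages to the collector.
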
Load balancing
(Diagram: Agents, Collectors)
 •  Agents are logically partitioned and can send to different collectors
 •  Use randomization to pre-specify failovers when many collectors exist
       •  Spread load if a collector goes down
       •  Spread load if new collectors are added to the system

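One way to read "use randomization to pre-specify failovers" is that each agent is given its own randomly ordered list of collectors ahead of time. The sketch below illustrates that reading only; the functions `failover_chains` and `pick_collector` and the agent and collector names are hypothetical, and the seed is fixed just to make the run repeatable.

```python
import random


def failover_chains(agents, collectors, seed=0):
    """Give each agent its own randomly ordered list of collectors.

    The first entry is the agent's primary collector; the remaining
    entries are failovers, tried in order.
    """
    rng = random.Random(seed)
    chains = {}
    for agent in agents:
        order = list(collectors)
        rng.shuffle(order)
        chains[agent] = order
    return chains


def pick_collector(chain, down=frozenset()):
    """Return the first collector in the chain that is not marked down."""
    for collector in chain:
        if collector not in down:
            return collector
    raise RuntimeError("no collector available")


agents = ["agent%d" % i for i in range(6)]
collectors = ["collectorA", "collectorB", "collectorC"]
chains = failover_chains(agents, collectors)

for agent in agents:
    normal = pick_collector(chains[agent])
    degraded = pick_collector(chains[agent], down={"collectorA"})
    print(agent, normal, degraded)
```

Because every agent has a different ordering, the agents that were using a failed collector do not all move to the same replacement.
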
Control plane is horizontally scalable
(Diagram: Nodes, Masters, ZooKeeper members ZK1, ZK2, ZK3)
•  A master controls dynamic configurations of nodes
      –  Uses consensus protocol to keep state consistent
      –  Scales well for configuration reads
      –  Allows for adaptive repartitioning in the future
•  Nodes can talk to any master
•  Masters can talk to any ZooKeeper member

EXTENSIBILITY
(Image caption: Turn raw logs into something useful…)

Flume is easy to extend
•  Simple source and sink APIs
      –  Event granularity streaming design
      –  Have many simple operations and compose for complex behavior
•  End-to-end principle
      –  Put smarts and state at the end points. Keep the middle simple
•  Flume deals with reliability
      –  Just add a new source or add a new sink and Flume has primitives to deal with reliability

Variety of Data sources
•  Can deal with push and pull sources
•  Supports many legacy event sources
      –  Tailing a file
      –  Output from periodically Exec’ed program
      –  Syslog, Syslog-ng
      –  Experimental: IRC / Twitter / Scribe / AMQP
(Diagram: three ways an App feeds an Agent, labeled push, poll, and embed)

Variety of Data output
•  Send data to many sinks
      –  HDFS, Files, Console, RPC
      –  Experimental: HBase, Voldemort, S3, etc…
•  Supports an extensible variety of output formats and destinations
      –  Output to language-neutral and open data formats (JSON, Avro, text)
      –  Compressed output files in development
•  Uses decorators to process event data in-flight
      –  Sampling, attribute extraction, filtering, projection, checksumming, batching, wire compression, etc…

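The decorator idea can be sketched as small wrappers that share a sink's interface and forward to the sink they wrap. Everything here is illustrative: `ListSink`, `ExtractStatus`, `Filter` and `Sample` are hypothetical names, events are modeled as plain dictionaries, and none of this is Flume's actual Java API.

```python
class ListSink:
    """Terminal sink that keeps events in memory (stand-in for HDFS, console, ...)."""

    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)


class ExtractStatus:
    """Attribute extraction: copy the trailing status code into its own field."""

    def __init__(self, sink):
        self.sink = sink

    def append(self, event):
        enriched = dict(event)
        enriched["status"] = int(event["body"].split()[-1])
        self.sink.append(enriched)


class Filter:
    """Filtering: forward only events for which predicate(event) is true."""

    def __init__(self, sink, predicate):
        self.sink = sink
        self.predicate = predicate

    def append(self, event):
        if self.predicate(event):
            self.sink.append(event)


class Sample:
    """Sampling: forward every n-th event it sees."""

    def __init__(self, sink, n):
        self.sink = sink
        self.n = n
        self.seen = 0

    def append(self, event):
        self.seen += 1
        if self.seen % self.n == 0:
            self.sink.append(event)


# Compose: extract the status, keep server errors, then keep every 2nd one.
out = ListSink()
pipeline = ExtractStatus(Filter(Sample(out, 2), lambda e: e["status"] >= 500))

for body in ["GET /a 200", "GET /b 500", "GET /c 503", "GET /d 404", "GET /e 500"]:
    pipeline.append({"body": body})

print(out.events)
```

With these five events, three are server errors and the sampler forwards only the second of them (`GET /c 503`). Each decorator stays trivial, and the behavior comes from how they are stacked.
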
MANAGEABILITY
(Image caption: Wheeeeee!)

Centralized data flow management
•  Master specifies node sources, sinks and data flows
      –  Simply specify the role of the node: collector, agent
      –  Or specify a custom configuration for a node
•  Control Interfaces:
      –  Flume Shell
      –  Basic Web
      –  HUE + Flume Manager App (Enterprise users)

Output bucketing
(Diagram: Collectors write to HDFS under time-bucketed paths)
       /logs/web/2010/0715/1200/data-xxx.txt
       /logs/web/2010/0715/1200/data-xxy.txt
       /logs/web/2010/0715/1300/data-xxx.txt
       /logs/web/2010/0715/1300/data-xxy.txt
       /logs/web/2010/0715/1400/data-xxx.txt
       …

node : collectorSource | collectorSink ("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")

•  Automatic output file management
      –  Write HDFS files in directories using time-based tags

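The escapes in the sink path above look like `strftime` fields, so the bucketing can be approximated with Python's `datetime.strftime`. This is a sketch under two assumptions: that Flume's escapes expand the same way as Python's for `%Y`, `%m`, `%d` and `%H`, and that the event's own timestamp selects the bucket. The helpers `bucket_dir` and `bucket_events` are hypothetical.

```python
from datetime import datetime

# Path template copied from the collectorSink configuration on the slide.
TEMPLATE = "hdfs://namenode/logs/web/%Y/%m%d/%H00"


def bucket_dir(event_time, template=TEMPLATE):
    """Expand the time escapes in the template for one event timestamp."""
    return event_time.strftime(template)


def bucket_events(events):
    """Group (timestamp, body) pairs by the directory they would land in."""
    buckets = {}
    for timestamp, body in events:
        buckets.setdefault(bucket_dir(timestamp), []).append(body)
    return buckets


events = [
    (datetime(2010, 7, 15, 12, 5), "first"),
    (datetime(2010, 7, 15, 12, 59), "second"),
    (datetime(2010, 7, 15, 13, 0), "third"),
]

for directory, bodies in bucket_events(events).items():
    print(directory, bodies)
```

Events from 12:05 and 12:59 share the `.../2010/0715/1200` directory, while the 13:00 event starts the `.../1300` bucket, which matches the hourly directories listed on the slide.
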
For advanced users
•  A concise and precise configuration language for specifying arbitrary data paths
      –  Dataflows are essentially DAGs
      –  Control specific event flows
            •  Enable durability mechanisms and failover mechanisms
            •  Tune the parameters of these mechanisms
      –  Dynamic updates of configurations
            •  Allows for live failover changes
            •  Allows for handling newly provisioned machines
            •  Allows for changing analytics

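Since dataflows are essentially DAGs, a configuration can be checked for cycles before it is pushed to nodes. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) on a hypothetical topology. The node names and the functions `delivery_order` and `is_dag` are invented for this illustration and do not reflect how the Flume master validates configurations.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical dataflow: each node maps to the set of nodes it sends to.
flows = {
    "agent1": {"collector1"},
    "agent2": {"collector1"},
    "agent3": {"collector2"},
    "collector1": {"hdfs"},
    "collector2": {"hdfs"},
    "hdfs": set(),
}


def delivery_order(flows):
    """Return the nodes so that every sender appears before its receivers."""
    # TopologicalSorter expects node -> predecessors, so invert the edges.
    predecessors = {node: set() for node in flows}
    for sender, receivers in flows.items():
        for receiver in receivers:
            predecessors.setdefault(receiver, set()).add(sender)
    return list(TopologicalSorter(predecessors).static_order())


def is_dag(flows):
    """True if the dataflow has no cycles."""
    try:
        delivery_order(flows)
        return True
    except CycleError:
        return False


print(delivery_order(flows))
print(is_dag({"a": {"b"}, "b": {"a"}}))
```

A flow in which two nodes send to each other is rejected, while the agent, collector, HDFS topology yields an order with agents first and HDFS last.
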
CONCLUSIONS

Summary
•  Flume is a distributed, reliable, scalable system for collecting and delivering high-volume continuous event data such as logs
      –  Tunable data reliability levels
      –  Reliable master backed by ZooKeeper
      –  Write data to HDFS into buckets ready for batch processing
      –  Dynamically configurable nodes
      –  Simplified automated management for agent+collector topologies
•  Open Source Apache v2.0 license

Contribute!
•  GitHub source repo
      –  http://github.com/cloudera/flume
•  Mailing lists
      –  User: https://groups.google.com/a/cloudera.org/group/flume-user
      –  Dev: https://groups.google.com/a/cloudera.org/group/flume-dev
•  Development trackers
      –  JIRA (bugs/formal feature requests):
            •  https://issues.cloudera.org/browse/FLUME
      –  Review board (code reviews):
            •  http://review.hbase.org -> http://review.cloudera.org
•  IRC Channels
      –  #flume @ irc.freenode.net

Image credits
•    http://www.flickr.com/photos/victorvonsalza/3327750057/
•    http://www.flickr.com/photos/victorvonsalza/3207639929/
•    http://www.flickr.com/photos/victorvonsalza/3327750059/
•    http://www.emvergeoning.com/?m=200811
•    http://www.flickr.com/photos/juse/188960076/
•    http://www.flickr.com/photos/juse/188960076/
•    http://www.flickr.com/photos/23720661@N08/3186507302/
•    http://clarksoutdoorchairs.com/log_adirondack_chairs.html
•    http://www.flickr.com/photos/dboo/3314299591/
Flumetalk

Contenu connexe

Similaire à Flumetalk

Hadoop for carrier
Hadoop for carrierHadoop for carrier
Hadoop for carrierFlytxt
 
Flume and Flive Introduction
Flume and Flive IntroductionFlume and Flive Introduction
Flume and Flive IntroductionHanborq Inc.
 
Flume in 10minutes
Flume in 10minutesFlume in 10minutes
Flume in 10minutesdwmclary
 
Complex Er[jl]ang Processing with StreamBase
Complex Er[jl]ang Processing with StreamBaseComplex Er[jl]ang Processing with StreamBase
Complex Er[jl]ang Processing with StreamBasedarach
 
Building Tungsten Clusters with PostgreSQL Hot Standby and Streaming Replication
Building Tungsten Clusters with PostgreSQL Hot Standby and Streaming ReplicationBuilding Tungsten Clusters with PostgreSQL Hot Standby and Streaming Replication
Building Tungsten Clusters with PostgreSQL Hot Standby and Streaming ReplicationLinas Virbalas
 
Living the Easy Life with Rules-Based Autonomic Database Clusters
Living the Easy Life with Rules-Based Autonomic Database ClustersLiving the Easy Life with Rules-Based Autonomic Database Clusters
Living the Easy Life with Rules-Based Autonomic Database ClustersLinas Virbalas
 
Building Real Life Applications with Adhearsion
Building Real Life Applications with AdhearsionBuilding Real Life Applications with Adhearsion
Building Real Life Applications with AdhearsionMojo Lingo
 
Apache MXNet for IoT with Apache NiFi
Apache MXNet for IoT with Apache NiFiApache MXNet for IoT with Apache NiFi
Apache MXNet for IoT with Apache NiFiTimothy Spann
 
IoT with Apache MXNet and Apache NiFi and MiniFi
IoT with Apache MXNet and Apache NiFi and MiniFiIoT with Apache MXNet and Apache NiFi and MiniFi
IoT with Apache MXNet and Apache NiFi and MiniFiDataWorks Summit
 
Building Data Pipelines for Solr with Apache NiFi
Building Data Pipelines for Solr with Apache NiFiBuilding Data Pipelines for Solr with Apache NiFi
Building Data Pipelines for Solr with Apache NiFiBryan Bende
 
AWS Customer Presentation - The Server Labs
AWS Customer Presentation - The Server Labs AWS Customer Presentation - The Server Labs
AWS Customer Presentation - The Server Labs Amazon Web Services
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureJianfeng Zhang
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureRajesh Balamohan
 
Data-Center Replication with Apache Accumulo
Data-Center Replication with Apache AccumuloData-Center Replication with Apache Accumulo
Data-Center Replication with Apache AccumuloJosh Elser
 
Accumulo Summit 2014: Data-Center Replication with Apache Accumulo
Accumulo Summit 2014: Data-Center Replication with Apache AccumuloAccumulo Summit 2014: Data-Center Replication with Apache Accumulo
Accumulo Summit 2014: Data-Center Replication with Apache AccumuloAccumulo Summit
 
Introduction to flow analysis
Introduction to flow analysisIntroduction to flow analysis
Introduction to flow analysisProQSys
 

Similaire à Flumetalk (20)

Flume intro-100717
Flume intro-100717Flume intro-100717
Flume intro-100717
 
Inside Flume
Inside FlumeInside Flume
Inside Flume
 
Hadoop for carrier
Hadoop for carrierHadoop for carrier
Hadoop for carrier
 
Flume and Flive Introduction
Flume and Flive IntroductionFlume and Flive Introduction
Flume and Flive Introduction
 
Flume in 10minutes
Flume in 10minutesFlume in 10minutes
Flume in 10minutes
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Flume and HBase
Flume and HBase Flume and HBase
Flume and HBase
 
Complex Er[jl]ang Processing with StreamBase
Complex Er[jl]ang Processing with StreamBaseComplex Er[jl]ang Processing with StreamBase
Complex Er[jl]ang Processing with StreamBase
 
Building Tungsten Clusters with PostgreSQL Hot Standby and Streaming Replication
Building Tungsten Clusters with PostgreSQL Hot Standby and Streaming ReplicationBuilding Tungsten Clusters with PostgreSQL Hot Standby and Streaming Replication
Building Tungsten Clusters with PostgreSQL Hot Standby and Streaming Replication
 
Living the Easy Life with Rules-Based Autonomic Database Clusters
Living the Easy Life with Rules-Based Autonomic Database ClustersLiving the Easy Life with Rules-Based Autonomic Database Clusters
Living the Easy Life with Rules-Based Autonomic Database Clusters
 
Building Real Life Applications with Adhearsion
Building Real Life Applications with AdhearsionBuilding Real Life Applications with Adhearsion
Building Real Life Applications with Adhearsion
 
Apache MXNet for IoT with Apache NiFi
Apache MXNet for IoT with Apache NiFiApache MXNet for IoT with Apache NiFi
Apache MXNet for IoT with Apache NiFi
 
IoT with Apache MXNet and Apache NiFi and MiniFi
IoT with Apache MXNet and Apache NiFi and MiniFiIoT with Apache MXNet and Apache NiFi and MiniFi
IoT with Apache MXNet and Apache NiFi and MiniFi
 
Building Data Pipelines for Solr with Apache NiFi
Building Data Pipelines for Solr with Apache NiFiBuilding Data Pipelines for Solr with Apache NiFi
Building Data Pipelines for Solr with Apache NiFi
 
AWS Customer Presentation - The Server Labs
AWS Customer Presentation - The Server Labs AWS Customer Presentation - The Server Labs
AWS Customer Presentation - The Server Labs
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Data-Center Replication with Apache Accumulo
Data-Center Replication with Apache AccumuloData-Center Replication with Apache Accumulo
Data-Center Replication with Apache Accumulo
 
Accumulo Summit 2014: Data-Center Replication with Apache Accumulo
Accumulo Summit 2014: Data-Center Replication with Apache AccumuloAccumulo Summit 2014: Data-Center Replication with Apache Accumulo
Accumulo Summit 2014: Data-Center Replication with Apache Accumulo
 
Introduction to flow analysis
Introduction to flow analysisIntroduction to flow analysis
Introduction to flow analysis
 

Plus de Skills Matter

5 things cucumber is bad at by Richard Lawrence
5 things cucumber is bad at by Richard Lawrence5 things cucumber is bad at by Richard Lawrence
5 things cucumber is bad at by Richard LawrenceSkills Matter
 
Patterns for slick database applications
Patterns for slick database applicationsPatterns for slick database applications
Patterns for slick database applicationsSkills Matter
 
Scala e xchange 2013 haoyi li on metascala a tiny diy jvm
Scala e xchange 2013 haoyi li on metascala a tiny diy jvmScala e xchange 2013 haoyi li on metascala a tiny diy jvm
Scala e xchange 2013 haoyi li on metascala a tiny diy jvmSkills Matter
 
Oscar reiken jr on our success at manheim
Oscar reiken jr on our success at manheimOscar reiken jr on our success at manheim
Oscar reiken jr on our success at manheimSkills Matter
 
Progressive f# tutorials nyc dmitry mozorov & jack pappas on code quotations ...
Progressive f# tutorials nyc dmitry mozorov & jack pappas on code quotations ...Progressive f# tutorials nyc dmitry mozorov & jack pappas on code quotations ...
Progressive f# tutorials nyc dmitry mozorov & jack pappas on code quotations ...Skills Matter
 
Cukeup nyc ian dees on elixir, erlang, and cucumberl
Cukeup nyc ian dees on elixir, erlang, and cucumberlCukeup nyc ian dees on elixir, erlang, and cucumberl
Cukeup nyc ian dees on elixir, erlang, and cucumberlSkills Matter
 
Cukeup nyc peter bell on getting started with cucumber.js
Cukeup nyc peter bell on getting started with cucumber.jsCukeup nyc peter bell on getting started with cucumber.js
Cukeup nyc peter bell on getting started with cucumber.jsSkills Matter
 
Agile testing & bdd e xchange nyc 2013 jeffrey davidson & lav pathak & sam ho...
Agile testing & bdd e xchange nyc 2013 jeffrey davidson & lav pathak & sam ho...Agile testing & bdd e xchange nyc 2013 jeffrey davidson & lav pathak & sam ho...
Agile testing & bdd e xchange nyc 2013 jeffrey davidson & lav pathak & sam ho...Skills Matter
 
Progressive f# tutorials nyc rachel reese & phil trelford on try f# from zero...
Progressive f# tutorials nyc rachel reese & phil trelford on try f# from zero...Progressive f# tutorials nyc rachel reese & phil trelford on try f# from zero...
Progressive f# tutorials nyc rachel reese & phil trelford on try f# from zero...Skills Matter
 
Progressive f# tutorials nyc don syme on keynote f# in the open source world
Progressive f# tutorials nyc don syme on keynote f# in the open source worldProgressive f# tutorials nyc don syme on keynote f# in the open source world
Progressive f# tutorials nyc don syme on keynote f# in the open source worldSkills Matter
 
Agile testing & bdd e xchange nyc 2013 gojko adzic on bond villain guide to s...
Agile testing & bdd e xchange nyc 2013 gojko adzic on bond villain guide to s...Agile testing & bdd e xchange nyc 2013 gojko adzic on bond villain guide to s...
Agile testing & bdd e xchange nyc 2013 gojko adzic on bond villain guide to s...Skills Matter
 
Dmitry mozorov on code quotations code as-data for f#
Dmitry mozorov on code quotations code as-data for f#Dmitry mozorov on code quotations code as-data for f#
Dmitry mozorov on code quotations code as-data for f#Skills Matter
 
A poet's guide_to_acceptance_testing
A poet's guide_to_acceptance_testingA poet's guide_to_acceptance_testing
A poet's guide_to_acceptance_testingSkills Matter
 
Russ miles-cloudfoundry-deep-dive
Russ miles-cloudfoundry-deep-diveRuss miles-cloudfoundry-deep-dive
Russ miles-cloudfoundry-deep-diveSkills Matter
 
Simon Peyton Jones: Managing parallelism
Simon Peyton Jones: Managing parallelismSimon Peyton Jones: Managing parallelism
Simon Peyton Jones: Managing parallelismSkills Matter
 
I went to_a_communications_workshop_and_they_t
I went to_a_communications_workshop_and_they_tI went to_a_communications_workshop_and_they_t
I went to_a_communications_workshop_and_they_tSkills Matter
 

Plus de Skills Matter (20)

5 things cucumber is bad at by Richard Lawrence
5 things cucumber is bad at by Richard Lawrence5 things cucumber is bad at by Richard Lawrence
5 things cucumber is bad at by Richard Lawrence
 
Patterns for slick database applications
Patterns for slick database applicationsPatterns for slick database applications
Patterns for slick database applications
 
Scala e xchange 2013 haoyi li on metascala a tiny diy jvm
Scala e xchange 2013 haoyi li on metascala a tiny diy jvmScala e xchange 2013 haoyi li on metascala a tiny diy jvm
Scala e xchange 2013 haoyi li on metascala a tiny diy jvm
 
Oscar reiken jr on our success at manheim
Oscar reiken jr on our success at manheimOscar reiken jr on our success at manheim



Flumetalk

  • 1. Flume: Reliable Distributed Streaming Log Collection
      Ian Wrigley, Educational Services, Cloudera, Inc (ian@cloudera.com)
  • 2. Scenario
      •  Situation:
          –  You have hundreds of services producing logs
          –  You’re running a daily cron job on the logs
              •  Rotating the logs
              •  Maybe compressing or otherwise processing them
              •  Transferring them to HDFS (the Hadoop Distributed File System)
      •  Problem:
          –  As the amount of data increases, it takes longer and longer to run the cron job
      7/15/2010
  • 3. You need a “Flume”
      •  Flume is a distributed system that gets your logs from their source and aggregates them to where you want to process them
      •  Open source, Apache v2.0 License
      •  Goals:
          –  Reliability
          –  Scalability
          –  Extensibility
          –  Manageability
      [Image: Columbia Gorge, Broughton Log Flume]
  • 4. Use cases
      •  Collecting logs from nodes in your Hadoop cluster
      •  Collecting logs from services such as httpd, mail, etc.
      •  Collecting impressions from custom apps for an ad network
      •  But wait, there’s more!
          –  Basic online in-stream analysis
          –  Online in-stream file processing and manipulation
      [Image caption: It’s log, log ... Everyone wants a log!]
  • 5. Key abstractions
      •  Data path and control path
      •  Nodes are in the data path
          –  Nodes have a source and a sink
          –  They can take different roles
              •  A typical topology has agent nodes and collector nodes
              •  Optionally it has processor nodes
      •  Masters are in the control path
          –  Centralized point of configuration
          –  Specify sources and sinks
          –  Can control flows of data between nodes
          –  Use one master or use many with a ZooKeeper-backed quorum
      [Diagram labels: Agent, Collector, Master]
  • 6. A sample topology
      [Diagram: an agent tier of many agents feeds a collector tier of three collectors, which write to HDFS under /logs/web/2010/0715/1200, /logs/web/2010/0715/1300, and /logs/web/2010/0715/1400; a master sits alongside]
  • 7. Masters control node configuration
      [Diagram: the same topology with a master controlling the agent tier and collector tier; the storage tier is HDFS, with output under /logs/web/2010/0715/1200, /logs/web/2010/0715/1300, and /logs/web/2010/0715/1400]
  • 8. Outline
      •  What is Flume?
          –  Goals and architecture
      •  Reliability
          –  Fault-tolerance and High availability
      •  Scalability
          –  Horizontal scalability of all nodes and masters
      •  Extensibility
          –  Unix principle, all kinds of data, all kinds of sources, all kinds of sinks
      •  Manageability
          –  Centralized management supporting dynamic reconfiguration
  • 9. RELIABILITY
      The logs will still get there…
  • 10. Tunable data reliability levels
      •  Best effort
          –  Fire and forget
      •  Store on failure + retry
          –  Local acks, local errors detectable
          –  Failover when faults detected
      •  End-to-end reliability
          –  End-to-end acks
          –  Data survives compound failures, and may be retried multiple times
      [Diagram, repeated for each level: Agent → Collector → HDFS]
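The difference between the first two levels on this slide can be illustrated with a small simulation. This is only a sketch under stated assumptions: `FlakyCollector`, `best_effort`, and `store_on_failure` are hypothetical names invented here, not Flume APIs, and the end-to-end level (acks from the final destination) is not modeled.

```python
from collections import deque


class FlakyCollector:
    """Hypothetical collector that rejects its first `failures` deliveries."""

    def __init__(self, failures):
        self.failures = failures
        self.received = []

    def deliver(self, event):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("collector unavailable")
        self.received.append(event)


def best_effort(events, collector):
    """Fire and forget: a failed delivery is silently dropped."""
    for event in events:
        try:
            collector.deliver(event)
        except ConnectionError:
            pass  # the event is lost


def store_on_failure(events, collector, max_rounds=10):
    """Keep failed events locally and retry them in later rounds."""
    pending = deque(events)
    for _ in range(max_rounds):
        if not pending:
            break
        for _ in range(len(pending)):
            event = pending.popleft()
            try:
                collector.deliver(event)
            except ConnectionError:
                pending.append(event)  # stored for the next round
    return list(pending)  # anything still undelivered
```

Note that in this sketch retried events arrive after events that succeeded on the first attempt, so delivery order is not preserved.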
  • 11. SCALABILITY
      [Image: Logs jamming the Kemi River]
  • 12. A sample topology
      [Diagram: an agent tier of many agents feeds a collector tier of three collectors, which write to HDFS under /logs/web/2010/0715/1200, /logs/web/2010/0715/1300, and /logs/web/2010/0715/1400; a master sits alongside]
  • 13. Data path is horizontally scalable
      •  Add collectors to increase availability and to handle more data
          –  Assumes a single agent will not dominate a collector
          –  Fewer connections to HDFS
          –  Larger, more efficient writes to HDFS
      •  Agents have mechanisms for machine resource tradeoffs
          –  Write log locally to avoid collector disk IO bottleneck and catastrophic failures
          –  Compression and batching (trade CPU for network)
          –  Push computation into the event collection pipeline (balance IO, Mem, and CPU resource bottlenecks)
      [Diagram: four agents → one collector → HDFS]
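The "compression and batching" tradeoff on this slide can be sketched as follows. This is a minimal illustration, not Flume's wire format: the helper names are hypothetical, zlib stands in for whatever codec is actually used, and events are assumed to be single-line strings.

```python
import zlib


def batches(events, size):
    """Split a list of events into consecutive batches of at most `size`."""
    for start in range(0, len(events), size):
        yield events[start:start + size]


def encode_batch(batch):
    """Join one batch with newlines and compress it (CPU spent here)."""
    return zlib.compress("\n".join(batch).encode("utf-8"))


def decode_batch(payload):
    """Reverse encode_batch on the receiving side."""
    return zlib.decompress(payload).decode("utf-8").split("\n")
```

Sending one compressed payload per batch means fewer, smaller network writes, at the cost of CPU time on both ends and some added latency while a batch fills.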
  • 14. Load balancing
      •  Agents are logically partitioned and can send to different collectors
      •  Use randomization to pre-specify failovers when many collectors exist
      •  Spread load if a collector goes down
      •  Spread load if new collectors are added to the system
      [Diagram: six agents distributed across three collectors]
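One way to read "use randomization to pre-specify failovers" is that each agent is handed a randomly ordered list of collectors ahead of time and walks down it when a collector is unavailable. The sketch below assumes that interpretation; `failover_chains` and `pick_collector` are hypothetical helpers, not part of Flume.

```python
import random


def failover_chains(agents, collectors, chain_length=2, seed=0):
    """Assign each agent a random, duplicate-free list of collectors.

    The first entry is the agent's primary collector; the rest are
    failovers to try in order.
    """
    rng = random.Random(seed)  # seeded so the assignment is repeatable
    length = min(chain_length, len(collectors))
    return {agent: rng.sample(collectors, k=length) for agent in agents}


def pick_collector(chain, live):
    """Return the first collector in the chain that is currently live."""
    for collector in chain:
        if collector in live:
            return collector
    return None  # every collector in the chain is down
```

Because each agent's chain is drawn independently, the agents that shared a failed collector tend to fail over to different survivors, which spreads the extra load instead of moving it all to one machine.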
  • 16. Control plane is horizontally scalable
      •  A master controls dynamic configurations of nodes
          –  Uses consensus protocol to keep state consistent
          –  Scales well for configuration reads
          –  Allows for adaptive repartitioning in the future
      •  Nodes can talk to any master
      •  Masters can talk to any ZooKeeper member
      [Diagram: three nodes, three masters, and three ZooKeeper members (ZK1, ZK2, ZK3)]
  • 19. EXTENSIBILITY
      Turn raw logs into something useful…
  • 20. Flume is easy to extend
      •  Simple source and sink APIs
          –  Event granularity streaming design
          –  Have many simple operations and compose for complex behavior
      •  End-to-end principle
          –  Put smarts and state at the end points. Keep the middle simple
      •  Flume deals with reliability
          –  Just add a new source or add a new sink and Flume has primitives to deal with reliability
  • 21. Variety of Data sources
      •  Can deal with push and pull sources
      •  Supports many legacy event sources
          –  Tailing a file
          –  Output from periodically Exec’ed program
          –  Syslog, Syslog-ng
          –  Experimental: IRC / Twitter / Scribe / AMQP
      [Diagram: three App → Agent integrations, labeled push, poll, and embed]
  • 22. Variety of Data output
      •  Send data to many sinks
          –  HDFS, Files, Console, RPC
          –  Experimental: HBase, Voldemort, S3, etc…
      •  Supports an extensible variety of output formats and destinations
          –  Output to language-neutral and open data formats (JSON, Avro, text)
          –  Compressed output files in development
      •  Uses decorators to process event data in-flight
          –  Sampling, attribute extraction, filtering, projection, checksumming, batching, wire compression, etc…
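The decorator idea (small operations wrapped around a sink and composed for more complex behavior) can be sketched in a few lines. The function names below are hypothetical and the sink is just an in-memory list; Flume's own decorators are Java classes configured through its dataflow language, which this sketch does not reproduce.

```python
def list_sink(out):
    """Terminal sink for the sketch: appends each event to the list `out`."""
    def sink(event):
        out.append(event)
    return sink


def filtering(predicate, sink):
    """Decorator: forward only events for which predicate(event) is true."""
    def decorated(event):
        if predicate(event):
            sink(event)
    return decorated


def sampling(every_n, sink):
    """Decorator: forward every n-th event it sees and drop the rest."""
    count = 0

    def decorated(event):
        nonlocal count
        count += 1
        if count % every_n == 0:
            sink(event)
    return decorated
```

A pipeline is then built from the outside in, for example `filtering(lambda e: "ERROR" in e, sampling(2, list_sink(out)))`, which keeps every second error line.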
  • 24. Centralized data flow management
      •  Master specifies node sources, sinks and data flows
          –  Simply specify the role of the node: collector, agent
          –  Or specify a custom configuration for a node
      •  Control Interfaces:
          –  Flume Shell
          –  Basic Web
          –  HUE + Flume Manager App (Enterprise users)
  • 25. Output bucketing
      •  Automatic output file management
          –  Write HDFS files in directories using time-based tags
      node : collectorSource | collectorSink ("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")
      [Diagram: collectors writing to HDFS, producing]
          /logs/web/2010/0715/1200/data-xxx.txt
          /logs/web/2010/0715/1200/data-xxy.txt
          /logs/web/2010/0715/1300/data-xxx.txt
          /logs/web/2010/0715/1300/data-xxy.txt
          /logs/web/2010/0715/1400/data-xxx.txt
          …
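The time-based directory pattern on this slide can be tried out with Python's `strftime`, assuming the `%Y`, `%m`, `%d`, and `%H` escapes in the slide's path carry the same meaning as the strftime fields of the same name. `bucket_path` and `group_by_bucket` are hypothetical helpers for illustration; the pattern string is the one shown on the slide.

```python
from collections import defaultdict
from datetime import datetime

# Directory pattern copied from the slide's collectorSink example.
PATTERN = "hdfs://namenode/logs/web/%Y/%m%d/%H00"


def bucket_path(pattern, event_time):
    """Expand the time escapes in `pattern` using the event's timestamp."""
    return event_time.strftime(pattern)


def group_by_bucket(events, pattern=PATTERN):
    """Group (timestamp, body) pairs by the directory they would land in."""
    buckets = defaultdict(list)
    for event_time, body in events:
        buckets[bucket_path(pattern, event_time)].append(body)
    return dict(buckets)
```

Under that assumption, an event stamped 12:34 on 15 July 2010 falls in the `2010/0715/1200` directory, matching the hourly layout in the diagram.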
  • 26. For advanced users
      •  A concise and precise configuration language for specifying arbitrary data paths
          –  Dataflows are essentially DAGs
          –  Control specific event flows
              •  Enable durability mechanism and failover mechanisms
              •  Tune the parameters of these mechanisms
          –  Dynamic updates of configurations
              •  Allows for live failover changes
              •  Allows for handling newly provisioned machines
              •  Allows for changing analytics
  • 28. Summary
      •  Flume is a distributed, reliable, scalable system for collecting and delivering high-volume continuous event data such as logs
          –  Tunable data reliability levels
          –  Reliable master backed by ZooKeeper
          –  Write data to HDFS into buckets ready for batch processing
          –  Dynamically configurable nodes
          –  Simplified automated management for agent+collector topologies
      •  Open Source Apache v2.0 license
  • 29. Contribute!
      •  GitHub source repo
          –  http://github.com/cloudera/flume
      •  Mailing lists
          –  User: https://groups.google.com/a/cloudera.org/group/flume-user
          –  Dev: https://groups.google.com/a/cloudera.org/group/flume-dev
      •  Development trackers
          –  JIRA (bugs / formal feature requests):
              •  https://issues.cloudera.org/browse/FLUME
          –  Review board (code reviews):
              •  http://review.hbase.org -> http://review.cloudera.org
      •  IRC Channels
          –  #flume @ irc.freenode.net
  • 30. Image credits
      •  http://www.flickr.com/photos/victorvonsalza/3327750057/
      •  http://www.flickr.com/photos/victorvonsalza/3207639929/
      •  http://www.flickr.com/photos/victorvonsalza/3327750059/
      •  http://www.emvergeoning.com/?m=200811
      •  http://www.flickr.com/photos/juse/188960076/
      •  http://www.flickr.com/photos/juse/188960076/
      •  http://www.flickr.com/photos/23720661@N08/3186507302/
      •  http://clarksoutdoorchairs.com/log_adirondack_chairs.html
      •  http://www.flickr.com/photos/dboo/3314299591/