
Apache Kafka

Apache Kafka Deck used at NJ Hadoop meetup session on 8/11/2015

Published in: Software


  1. Apache Kafka - Ingestion and Processing Pipeline
     NJ Hadoop Meetup – 8/11/15
     Shravan Pabba @skpabba
     © Cloudera, Inc. All rights reserved.
  2. Agenda
     • Kafka Concepts and Architecture
     • Kafka vs Traditional messaging systems
     • Kafka with Cloudera
     • Demo
       • Install and configure Kafka on Cloudera cluster
       • Client tools - Add and consume data from topics
       • Replication and Failover capabilities
       • Flume Integration and demo of Kafka to Flume to HDFS
     • Other topics
  3. About Me
     • Systems Engineer @ Cloudera
     • Previously Pre/Post Sales Architect @ GigaSpaces, IBM
     • Mainframes, Client/Server, Distributed & Cloud
  4. Kafka Concepts and Architecture
  5. Typical Data Hub Architecture (diagram: Cloudera Enterprise Data Hub)
     • Ingestion: Kafka, Flume, Spark Streaming, DistCp, Sqoop, File Dumping
     • Access Layer: Interactive, JDBC, ODBC, ETL, Hive, Spark DAG, MLlib, Giraph, Grid Compute, Custom
     • Egress: DistCp, Producer, File Dumping, Kafka/Custom, Custom, HBase API, Solr
     • Storage Layer: HDFS, HBase, Solr
     • Engines: Yarn, Spark, Map Reduce, Impala, Pig
     • Also shown: Cloudera Manager, Sentry (Security Framework), Encryption, Navigator
  6. Why Kafka? (Or rather, why didn't LinkedIn use Flume?)
     • No ability to replay events
     • Multiple sinks require event replication (via multiple channels)
     • Sinks that share a source (mostly) process events in sync
     • This is tight coupling
     (Diagram labels: Logs and More Logs feed two Spool Source / Channel / Avro Sink chains; an Avro Source feeds Channels for an HBase Sink and an HDFS Sink, which write to HBase and HDFS.)
  7. Why Kafka?
     (Diagram, 2009: Web logs to Hadoop. Connections = O(1).)
  8. Why Kafka? Increasing complexity
     (Diagram, 2009: Web logs to Hadoop. Connections = O(1).)
     (Diagram, 2014: Transactions, Metrics, Web logs, Audit Logs, Hadoop, Warehouse, Alerting and Security connected point to point. Connections = O(Systems²).)
  9. Why Kafka? Decoupling
     (Diagram, 2014: Transactions, Metrics, Web logs, Audit Logs, Hadoop, Warehouse, Alerting and Security connected point to point. Connections = O(Systems²).)
     (Diagram, 2015+?: the same systems connected only through Kafka. Connections = O(Systems).)
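One rough way to read the O(Systems²) and O(Systems) labels on slides 8 and 9: with point-to-point integration every producing system needs its own link to every consuming system, whereas with a broker in the middle each system needs a single link. The sketch below only counts links. The function names and the split into four producing and four consuming systems are illustrative assumptions, not something the slides state.

```python
def point_to_point_links(producers: int, consumers: int) -> int:
    """Links needed when every producer connects directly to every consumer."""
    return producers * consumers


def via_broker_links(producers: int, consumers: int) -> int:
    """Links needed when every system connects only to a central broker."""
    return producers + consumers


if __name__ == "__main__":
    # Assumed split of the eight systems named on the slide:
    # 4 producing (Transactions, Metrics, Web logs, Audit Logs)
    # 4 consuming (Hadoop, Warehouse, Alerting, Security)
    print("point to point:", point_to_point_links(4, 4))
    print("via broker:", via_broker_links(4, 4))
```

Under that assumed split, adding a ninth system costs one new link with a broker, but up to four new links without one.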
  10. Why Kafka? Because logs.
     • Distributed, structured logs are very useful
     • Resiliency / replication
       • Database write-ahead logs (HBase WAL, Oracle redo logs, etc.)
     • System decoupling
       • Enterprise service buses (ESBs)
       • Data integration (change data capture)
     • Stream processing (e.g. real-time alerts)
     • Consensus (using logical clocks)
  11. What is Kafka?
     • Kafka is …
     (Diagram: Transactions, Metrics, Web logs, Audit Logs, Hadoop, Warehouse, Alerting and Security connected through Kafka.)
  12. What is Kafka?
     • Kafka is a distributed, …
     (Diagram: the same systems connected through Kafka, now drawn as three Brokers.)
  13. What is Kafka?
     • Kafka is a distributed, topic-oriented, …
     (Diagram labels: Source 1, Source 2, Source 3; Topic 1 and Topic 2 on a Broker; Sink 1, Sink 2.)
  14. What is Kafka?
     • Kafka is a distributed, topic-oriented, partitioned, …
     (Diagram labels: Source 1, Source 2, Source 3; one Broker holding Topic 1 Partition 1 and Topic 2 Partition 1; a second Broker holding Topic 1 Partition 2 and Topic 2 Partition 2; Sink 1, Sink 2.)
  15. What is Kafka?
     • Kafka is a distributed, topic-oriented, partitioned, replicated commit log.
     (Diagram labels: Source 1, Source 2, Source 3; two Brokers, each now holding Partition 1 and Partition 2 of both Topic 1 and Topic 2; Sink 1, Sink 2.)
  16. What is Kafka?
     • Kafka is a distributed, topic-oriented, partitioned, replicated commit log.
     • Kafka is also a pub-sub messaging system.
     • Messages can be text (e.g. syslog), but binary is best (preferably Avro!).
     (Diagram: same as slide 15.)
  17. Architectural Overview
     • Each machine is called a Broker
     • Data written belongs to Topics (analogous to a Table in a database)
     • Each Topic is partitioned
     • Partitions are distributed across the Brokers
     • Partitions are also replicated (one replica per partition is the Leader Partition)
     • Producers and Consumers talk to the Leader Partition
     (Diagram, Kafka Cluster with Producers and Consumers:
       Broker 1: Partition 1 (Leader), Partition 2, Partition 3
       Broker 2: Partition 2 (Leader), Partition 1, Partition 3
       Broker 3: Partition 3 (Leader), Partition 1, Partition 2)
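The layout on slide 17 can be mimicked with a small sketch. It places replicas round-robin across brokers and treats the first live replica of a partition as its leader. Both rules are simplifying assumptions chosen to reproduce the picture; they are not Kafka's actual assignment or leader-election logic, and the function names are hypothetical.

```python
from typing import Dict, List, Optional, Set


def assign_replicas(brokers: List[str], num_partitions: int,
                    replication_factor: int) -> Dict[int, List[str]]:
    """Place each partition's replicas on consecutive brokers, round-robin."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed broker count")
    return {
        p: [brokers[(p + r) % len(brokers)] for r in range(replication_factor)]
        for p in range(num_partitions)
    }


def leader(replicas: List[str], live_brokers: Set[str]) -> Optional[str]:
    """Pick the first replica whose broker is still alive (assumed rule)."""
    for broker in replicas:
        if broker in live_brokers:
            return broker
    return None


if __name__ == "__main__":
    brokers = ["broker1", "broker2", "broker3"]
    layout = assign_replicas(brokers, num_partitions=3, replication_factor=3)
    for partition, replicas in layout.items():
        print(partition, replicas, "leader:", leader(replicas, set(brokers)))
    # Simulate losing broker1: its partition gets a new leader.
    print("after failure:", leader(layout[0], {"broker2", "broker3"}))
```

With three brokers and three partitions, each broker leads exactly one partition and holds follower replicas of the other two, which is the arrangement the slide draws.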
  18. The Kafka Advantage
     High-Throughput & Low Latency
     • One broker can handle 100s of MB of reads/writes per second, from 1000s of clients
     • Messages delivered in milliseconds
     Durability & Reliability
     • Zero data loss with messages persisted on disk and replicated within the cluster
     • Highly available, with fault tolerance built into the system
     Scalability & Flexibility
     • Elastically and transparently add more machines without downtime for horizontal scalability
     • Dynamically add Producers & Consumers
     • Enable real-time & batch consumption
     Cost-Efficient
     • Modest cluster optimized to handle millions of messages per second
     • Open standard for long-term value
     • With Cloudera, a single system for multiple workloads
  19. How does it compare to Flume and Traditional Messaging
  20. Kafka vs Flume
     Kafka
     • Kafka is very much a general-purpose system. Many producers and many consumers sharing multiple topics
     • Kafka has a significantly smaller producer and consumer ecosystem
     • Kafka requires an external stream processing system for in-flight data processing
     • Highly available ingest pipeline
     Flume
     • Flume is a special-purpose tool designed to send data to HDFS, HBase (and Solr)
     • Flume has many built-in sources and sinks
     • In-flight data processing using interceptors. Useful for data masking or filtering
     • Flume does not replicate events
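The "interceptors" bullet on slide 20 describes small per-event transformations applied while data is in flight. Real Flume interceptors are Java classes configured on a source; the Python sketch below only illustrates the idea of chaining a filter and a mask. The event shape (a dict with a "body" string) and every function name here are hypothetical.

```python
import re
from typing import Callable, Iterable, List, Optional

# Hypothetical event shape: {"body": "<log line>"}
Interceptor = Callable[[dict], Optional[dict]]


def drop_debug(event: dict) -> Optional[dict]:
    """Filtering: discard events whose body starts with DEBUG."""
    return None if event["body"].startswith("DEBUG") else event


def mask_digits(event: dict) -> Optional[dict]:
    """Masking: replace every digit in the body with '*'."""
    return {**event, "body": re.sub(r"\d", "*", event["body"])}


def apply_interceptors(events: Iterable[dict],
                       interceptors: List[Interceptor]) -> List[dict]:
    """Run each event through the chain; a None result drops the event."""
    kept = []
    for event in events:
        for interceptor in interceptors:
            event = interceptor(event)
            if event is None:
                break
        if event is not None:
            kept.append(event)
    return kept


if __name__ == "__main__":
    events = [{"body": "DEBUG ping"}, {"body": "card 1234"}]
    print(apply_interceptors(events, [drop_debug, mask_digits]))
```

With Kafka, per the slide, this kind of processing would live in a separate stream processing system that reads from and writes to topics.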
  21. Random and Sequential Access in Disk and Memory
     Source: http://queue.acm.org/detail.cfm?id=1563874
  22. Why is Kafka fast?
     Kafka
     • Kafka does only sequential file I/O
     • Kafka keeps a single pointer into each partition of a topic. All messages prior to the pointer are considered consumed, and all messages after it are considered unconsumed
     • Relies heavily on the OS page cache for data storage, and on zero-copy
     • No GC, no memory overhead
     • Kafka supports end-to-end batching and compression of messages
     Traditional Messaging
     • Traditional messaging does random file/memory I/O (BTree structures)
     • Messaging systems typically keep some kind of per-message state about what has been consumed and have to update it
     • Disk/memory is used for storage
     • JVM == GC and memory overhead
     • Traditional messaging is typically non-batched and uncompressed
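The "single pointer" bullet on slide 22 can be illustrated with a toy append-only log. This is a minimal in-memory sketch, not how Kafka stores data on disk; the class and method names are hypothetical. The consumer's only state is one integer offset, and moving that offset back replays events.

```python
from typing import List


class PartitionLog:
    """Append-only list standing in for one partition of a topic."""

    def __init__(self) -> None:
        self._messages: List[bytes] = []

    def append(self, message: bytes) -> int:
        """Append a message and return the offset it was written at."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset: int, max_messages: int) -> List[bytes]:
        """Sequentially read up to max_messages starting at offset."""
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Tracks consumption with a single offset, not per-message state."""

    def __init__(self, log: PartitionLog) -> None:
        self.log = log
        self.offset = 0  # everything before this is considered consumed

    def poll(self, max_messages: int = 100) -> List[bytes]:
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch

    def seek(self, offset: int) -> None:
        """Move the pointer, e.g. back to 0 to replay the whole log."""
        self.offset = offset


if __name__ == "__main__":
    log = PartitionLog()
    for line in (b"a", b"b", b"c"):
        log.append(line)
    consumer = Consumer(log)
    print(consumer.poll(2), consumer.offset)
    consumer.seek(0)
    print(consumer.poll())
```

Because reading never deletes or marks messages, a second consumer with its own offset can read the same log independently, which is the replay ability slide 6 says Flume lacks.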
  23. Canonical Use Cases
     • Real-Time Stream Processing
     • General-Purpose Message Bus
     • User Activity Data Collection
     • Operational Metrics Collection (applications, servers, or devices)
     • Log Aggregation
     • Change Data Capture
     • Distributed Systems Commit Log
  24. Kafka and Cloudera
  25. Simplified Management
     • Deploy and configure Kafka clusters
     • Unified management
       • Multiple Kafka clusters
       • Entire platform
     • Monitoring, alerts, and dashboards
  26. Configure Kafka using CM
  27. CM has much more!
  28. CM has much more!
  29. CM has much more!
  30. Kafka + Apache Flume
     • Kafka can be configured as a fast, reliable Flume Channel
     • Flume Sources and Sinks can be used as out-of-the-box Kafka Producers and Consumers
     Flume Sources Write to Kafka: read from logs, files, JMS, HTTP, RPC, Thrift, etc. and write events to Kafka
     Flume Sinks Consume from Kafka: write data to HDFS, HBase, or Search
  31. Cloudera + Kafka
     Community involvement and contribution:
     • Spearheading the addition of security features to Kafka
     • Identified and fixed core architectural issues to make Kafka fully reliable
     • Strong relationship with Confluent.io and other Kafka committers
     Support expertise and experience:
     • Multiple production customers
     • Support team trained by Kafka committers
     Integrated with Cloudera's production-ready platform:
     • Cloudera Manager CSD makes it easy to deploy, configure, and monitor Kafka clusters
     • End-to-end workloads with other components, all on a single system
     • Leading security, governance, administration, and partner network
  32. Roadmap
     • Security:
       • Authentication with Kerberos
       • Topic-level authorization
       • SSL encryption of data over the wire
     • Improved Cloudera Manager integration
     • HUE integration
     *Roadmap is subject to change
  33. Demo
  34. Kafka Demo
     • Install and configure Kafka on Cloudera cluster
     • Client tools - Add and consume data from topics
     • Replication and Failover capabilities
     • Flume Integration and demo of Kafka to Flume to HDFS
  35. Other Topics
  36. Clients/APIs
     • Java, Python, Go, C/C++, .NET, Clojure, Ruby, Erlang, stdin/stdout and more, listed at https://cwiki.apache.org/confluence/display/KAFKA/Clients#Clients-ProducerDaemon
     • Producer and Consumer API
     • The new Java Producer API was introduced in 0.8.2
     • A new Consumer API is coming in the next release
  37. Mirror Maker
     • Replication between multiple Kafka clusters, for HA across datacenters
  38. Camus/Samza/Kafka Manager
     • Camus and Samza are tools created and used at LinkedIn
     • Camus is a client for ingesting Kafka data into Hadoop (MR jobs under the covers)
     • Camus is being phased out and replaced with Gobblin
     • Samza is a stream processing framework that uses Kafka for messaging and YARN for processing (resource management etc.)
     • Kafka Manager is a management tool for Kafka developed at Yahoo
  39. Thank You
