Backday Xebia: an experience report on event ingestion in HDFS

Message-driven architectures are designed to cope with large numbers of events. Once the volume of those events grows large enough, it probably makes sense to store them all in an HDFS cluster.

We present an experience report on ingesting events from a Kafka broker into an HDFS cluster.

By Sylvain Lequeux, consultant at Xebia

  1. Sylvain Lequeux (@slequeux): Event Ingestion in HDFS. #backdaybyxebia
  2. Back To Basics
  3. Basics: Event? From asynchronism … to message systems … to event systems
  4-7. Basics: Kafka (progressive build). A distributed messaging system … multi-queue (queues are called “topics” and are split into partitions) … multi-client
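     To make the topic/partition vocabulary concrete, here is how a partitioned topic could have been created with the Kafka command-line tools of that era (Zookeeper-based, matching the zookeeperConnect setting used later; the topic name MyTopic reappears in the Flafka configuration, while the partition and replication counts are illustrative):

         # Create a topic named MyTopic with 3 partitions (assumption: local Zookeeper)
         kafka-topics.sh --create --zookeeper localhost:2181 \
           --topic MyTopic --partitions 3 --replication-factor 1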
  10. Basics: Hadoop Distributed FileSystem. Distributed & scalable. Highly fault-tolerant. Standard support for running Big Data jobs. “Moving computation is cheaper than moving data”
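      As an operational aside (not on the slide), files ingested into HDFS can be inspected with the standard HDFS shell; the path below anticipates the HDFS sink configuration shown on slide 24:

          # List the files written under the ingestion directory (path illustrative)
          hdfs dfs -ls /data/flume/MyTopic
          # Check the total size of ingested data
          hdfs dfs -du -h /data/flume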
  12. VS (Flume versus Camus)
  13. Flume. http://flume.apache.org/
  14. Flume: Concepts
  15. Flume. ➔ Top-level Apache project ➔ “Item” streaming based on data flow
  16-19. Flume (progressive build):
      1. An “item” exists somewhere. Initially, “items” were log files.
      2. A source is a way to pull and transform this data into Flume events.
      3. A channel is a way to transport data (memory, file).
      4. A sink is a way to put a Flume event somewhere.
  20. Flume + Kafka = Flafka
  21. Flume: How it works
  22. Simple Flume configuration file. The recipe: 1. define an agent; 2. declare the sources; 3. declare the sinks; 4. declare the channels; 5. connect the pieces.

          # Name the components on this agent
          a1.sources = r1
          a1.sinks = k1
          a1.channels = c1

          # Describe/configure the source
          a1.sources.r1.type = netcat
          a1.sources.r1.bind = localhost
          a1.sources.r1.port = 44444

          # Describe the sink
          a1.sinks.k1.type = logger

          # Use a channel which buffers events in memory
          a1.channels.c1.type = memory
          a1.channels.c1.capacity = 1000
          a1.channels.c1.transactionCapacity = 100

          # Bind the source and sink to the channel
          a1.sources.r1.channels = c1
          a1.sinks.k1.channel = c1
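      With that agent running, the pipeline can be smoke-tested from a second terminal; anything sent to the netcat source should appear as an event in the logger sink (a minimal check, assuming a local agent started with the configuration above):

          # Send one test event to the netcat source
          echo "hello flume" | nc localhost 44444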
  23. Flafka source configuration.

          # Mandatory config
          a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
          a1.sources.r1.zookeeperConnect = localhost:2181
          a1.sources.r1.topic = MyTopic

          # Optional config
          a1.sources.r1.batchSize = 1000
          a1.sources.r1.batchDurationMillis = 1000
          a1.sources.r1.consumer.timeout.ms = 10
          a1.sources.r1.auto.commit.enabled = false
          a1.sources.r1.groupId = flume
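      To feed this source during testing, messages can be published to MyTopic with the console producer that ships with Kafka (the broker address is an assumption for a local setup):

          # Publish test messages to the topic the Flafka source consumes
          kafka-console-producer.sh --broker-list localhost:9092 --topic MyTopic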
  24. HDFS Sink configuration.

          # Mandatory config
          a1.sinks.k1.type = hdfs
          a1.sinks.k1.hdfs.path = hdfs://localhost:54310/data/flume/%{topic}/%y-%m-%d

          # A lot of optional configs, including:
          #   Compression
          #   File types: SequenceFile, Avro, etc.
          #   Possibility to use a custom Writable
          #   Kerberos configuration
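      The escape sequences in hdfs.path are expanded per event: %{topic} reads the event's “topic” header and %y-%m-%d formats its timestamp, so an event from MyTopic stamped 26 March 2015 would land under /data/flume/MyTopic/15-03-26. Note that the date escapes need a timestamp header on each event, which is one use of the Timestamp interceptor mentioned on slide 26.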
  25. Flume: data transformation
  26. Flume Interceptors.

          # Mandatory config
          a1.sources.r1.interceptors = i1
          a1.sources.r1.interceptors.i1.type = ...

      ➔ Transformation executed after the event is generated and before it is sent to the channel
      ➔ Some predefined interceptors: Timestamp, UUID, Filtering, Morphline, ...
      ➔ You can write your own (pure Java)
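      The deck does not show what a custom interceptor looks like, so here is a minimal sketch against the Flume 1.x Interceptor API; the package, class, and header names are hypothetical:

          // Hypothetical example: stamp every event with a static header.
          package com.example.flume;

          import java.util.List;
          import org.apache.flume.Context;
          import org.apache.flume.Event;
          import org.apache.flume.interceptor.Interceptor;

          public class TagInterceptor implements Interceptor {

            @Override
            public void initialize() { }

            @Override
            public Event intercept(Event event) {
              // Runs after the source builds the event, before it reaches the channel.
              event.getHeaders().put("ingested-by", "flume-demo");
              return event; // returning null would drop the event
            }

            @Override
            public List<Event> intercept(List<Event> events) {
              for (Event e : events) {
                intercept(e);
              }
              return events;
            }

            @Override
            public void close() { }

            // Flume instantiates interceptors through a nested Builder.
            public static class Builder implements Interceptor.Builder {
              @Override
              public Interceptor build() { return new TagInterceptor(); }

              @Override
              public void configure(Context context) { }
            }
          }

      It would then be wired in with a1.sources.r1.interceptors.i1.type = com.example.flume.TagInterceptor$Builder.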
  27. Flume: how to run it? From the command line, using the flume-ng launcher included in the distributions:

          flume-ng agent \
            -n a1 \
            -c /usr/lib/flume-ng/conf/ \
            -f /usr/lib/flume-ng/conf/flume-kafka.conf &
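      A detail worth knowing (not on the slide): the -n argument must match the agent name used as the property prefix in the configuration file (a1 in the examples above); otherwise the agent starts without any sources, channels, or sinks.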
  28. Camus. linkedin/camus
  29. Camus: Concepts
  30. Camus. ➔ Open-source project developed by LinkedIn ➔ Based entirely on MapReduce
  31. Camus. A batch consists of three steps: P1 fetches metadata (topics & partitions, latest offsets); P2 pulls the new events; P3 updates the local metadata.
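      The deck does not show how a batch is launched; per the project README, a Camus run is an ordinary Hadoop job driven by a properties file (the paths, decoder class, and jar name below are illustrative assumptions):

          # camus.properties (sketch)
          kafka.brokers=localhost:9092
          etl.destination.path=/data/camus/topics
          etl.execution.base.path=/data/camus/exec
          etl.execution.history.path=/data/camus/exec/history
          # Decoder class is an assumption; pick one matching your payload format
          camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.StringMessageDecoder

          # Launch one batch (P1 + P2 + P3) as a MapReduce job
          hadoop jar camus-example-0.1.0-SNAPSHOT-shaded.jar \
              com.linkedin.camus.etl.kafka.CamusJob -P camus.properties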
  32. Camus: How it works
  33. Time to write some code. Just explain how to transform one data format INTO another (the input and output formats were shown as images on the slide).
  34-35. Time to write some code (the code was shown as screenshots).
  37. Round 1: getting started.
      Flume: a simple configuration file makes it work, though the Morphline interceptor syntax is quite complex.
      Camus: needs a complete dev environment (including Maven) to use it, and devs should understand MapReduce concepts.
  38. Round 2: running time.
      Flume: events are ingested with no delay.
      Camus: the MapReduce setup adds incompressible time: 31 sec incompressible plus ~1 sec per 500 messages per node, measured as 111 sec for 1 message, 117 sec for 50 messages, 116 sec for 1,000 messages, and 127 sec for 10,000 messages.
  39. Round 3: maintainability.
      Flume: when managed by CM (Cloudera Manager), the server is easy to maintain, but the config is not.
      Camus: a full Maven project; just use a version control system (Git, SVN, and so on).
  40. Round 4: customization.
      Flume: interceptors are fully customizable, and event headers make the HDFS path highly configurable.
      Camus: morphing data can be done easily.
  41. Round 5: deployment.
      Flume: when managed by CM, just include your conf and that's it; without a manager, everything needs to be done manually.
      Camus: MapReduce jobs may be plugged into any MR orchestrator (Oozie, for instance).
  42. Round 6: state of the project.
      Flume: released 1.0.0 in 2012; included by default with Hadoop distributions.
      Camus: currently at v0.1.0-SNAPSHOT; heavily used by LinkedIn in production; almost no documentation.
  43. Summary. Flume and Camus compared across the six rounds: getting started, running time, maintainability, customization, deployment, and state of the project (the per-tool verdicts were shown graphically on the slide).
  44. Global Feedback
  45. Debugging. Debugging Flume is quite complex, and there are some really critical bugs, such as [FLUME-2578].
  46. Documentation. Flume has really good-quality documentation; Camus only has a README file, and it is not up to date!
  47. Camus & M/R. Camus suffers from its reliance on MapReduce; using another engine such as Spark might yield better performance.
  48. Flume: quantity of files. Flume needs a very precise configuration to avoid generating heaps of files: it is easy to make it produce lots of little files, which is problematic for Big Data workloads since HDFS and MapReduce handle many small files poorly.
  49. Thank you. Questions?
