Application Architectures with Hadoop
Mark Grover | Software Engineer
Jonathan Seidman | Solutions Architect, Partner Engineering
April 1, 2014
©2014 Cloudera, Inc. All Rights Reserved.

About Us
•  Mark
    •  Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating).
    •  Contributor to Hadoop, Hive, Spark, Sqoop, Flume.
    •  @mark_grover
•  Jonathan
    •  Solutions Architect, Partner Engineering Team.
    •  Co-founder of Chicago Hadoop User Group and Chicago Big Data.
    •  jseidman@cloudera.com
    •  @jseidman

Co-authoring O’Reilly book
•  Titled ‘Hadoop Application Architectures’
•  How to build end-to-end solutions using Apache Hadoop and related tools
•  Updates on Twitter: @hadooparchbook
•  http://www.hadooparchitecturebook.com

Challenges of Hadoop Implementation

Click Stream Analysis: Case Study

Click Stream Analysis
[Diagram labels: Log Files, DWH, X]

Web Log Example
[2012/09/22 20:56:04.294 -0500] "GET /info/ HTTP/1.1" 200 701 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en)" "age=38&gender=1&incomeCategory=5&session=983040389&user=627735038&region=8&userType=1"
[2012/09/23 14:12:52.294 -0500] "GET /wish/remove/275 HTTP/1.1" 200 701 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; en-us) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16" "age=63&gender=1&incomeCategory=1&session=1561203915&user=1364334488&region=4&userType=1"

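To make the structure of these log entries concrete, here is a minimal Python sketch that splits one entry into fields. The regular expression and the field names (ts, method, path, and so on) are assumptions inferred from the two sample lines, not a documented log format, and the function name parse_click is hypothetical.

```python
import re
from datetime import datetime
from urllib.parse import parse_qs

# Pattern inferred from the two sample entries; field names are hypothetical.
LOG_PATTERN = re.compile(
    r'\[(?P<ts>[^\]]+)\]\s+'                                      # [2012/09/22 20:56:04.294 -0500]
    r'"(?P<method>\S+)\s+(?P<path>\S+)\s+(?P<protocol>[^"]+)"\s+'  # "GET /info/ HTTP/1.1"
    r'(?P<status>\d{3})\s+(?P<size>\d+)\s+'                        # 200 701
    r'"(?P<referrer>[^"]*)"\s+'                                    # "-"
    r'"(?P<agent>[^"]*)"\s+'                                       # user agent string
    r'"(?P<params>[^"]*)"'                                         # key=value&key=value...
)

SAMPLE = (
    '[2012/09/22 20:56:04.294 -0500] "GET /info/ HTTP/1.1" 200 701 "-" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en)" '
    '"age=38&gender=1&incomeCategory=5&session=983040389&user=627735038'
    '&region=8&userType=1"'
)


def parse_click(line):
    """Return a dict of fields for one log entry, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["ts"] = datetime.strptime(record["ts"], "%Y/%m/%d %H:%M:%S.%f %z")
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    # parse_qs returns lists of values; keep the first value per key.
    record["params"] = {k: v[0] for k, v in parse_qs(record["params"]).items()}
    return record
```

Calling parse_click(SAMPLE) yields a record whose params dictionary carries the session and user identifiers that later steps would aggregate on.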
Hadoop Architectural Considerations
•  Storage managers?
    •  HDFS? HBase?
•  Data storage and modeling:
    •  File formats? Compression? Schema design?
•  Data movement
    •  How do we actually get the data into Hadoop? How do we get it out?
•  Metadata
    •  How do we manage data about the data?
•  Data access and processing
    •  How will the data be accessed once in Hadoop? How can we transform it? How do we query it?
•  Orchestration
    •  How do we manage the workflow for all of this?

Data Storage and Modeling

Data Storage – Storage Manager considerations
•  Popular storage managers for Hadoop
    •  Hadoop Distributed File System (HDFS)
    •  HBase

Data Storage – HDFS vs HBase
HDFS
•  Stores data directly as files
•  Fast scans
•  Poor random reads/writes
HBase
•  Stores data as HFiles on HDFS
•  Slow scans
•  Fast random reads/writes

Data Storage – Storage Manager considerations
•  We choose HDFS
    •  Analytical needs in this case served better by fast scans.

Data Storage Format

Data Storage – Format Considerations
•  Store as plain text?
    •  Sure, well supported by Hadoop.
    •  Text can easily be processed by MapReduce, loaded into Hive for analysis, and so on.
•  But…
    •  Will begin to consume lots of space in HDFS.
    •  May not be optimal for processing by tools in the Hadoop ecosystem.

Data Storage – Format Considerations
•  But, we can compress the text files…
    •  Gzip – supported by Hadoop, but not splittable.
    •  Bzip2 – hey, splittable! Great compression! But decompression is slooowww.
    •  LZO – splittable (with some work), good compress/de-compress performance. Good choice for storing text files on Hadoop.
    •  Snappy – provides a good tradeoff between size and speed.

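The size versus speed tradeoff between codecs can be measured directly. The sketch below uses only Python's standard library, so it covers gzip and bzip2; LZO and Snappy need third-party packages and are left out. The function name compare_codecs is hypothetical, and timings will vary by machine and data, so treat the numbers as indicative only.

```python
import bz2
import gzip
import time


def compare_codecs(data: bytes):
    """Compress and decompress data with gzip and bzip2, reporting ratio and timings."""
    results = {}
    for name, codec in (("gzip", gzip), ("bzip2", bz2)):
        start = time.perf_counter()
        packed = codec.compress(data)
        mid = time.perf_counter()
        restored = codec.decompress(packed)
        end = time.perf_counter()
        if restored != data:
            raise ValueError(f"{name} round trip failed")
        results[name] = {
            "ratio": len(packed) / len(data),   # smaller means better compression
            "compress_s": mid - start,
            "decompress_s": end - mid,
        }
    return results
```

Running compare_codecs on a few megabytes of real log text gives a rough feel for why a fast codec can be preferable to the one with the best ratio.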
Data Storage – More About Snappy
•  Designed at Google to provide high compression speeds with reasonable compression.
•  Not the highest compression, but provides very good performance for processing on Hadoop.
•  Snappy is not splittable though, which brings us to…

SequenceFile
•  Stores records as binary key/value pairs.
•  SequenceFile “blocks” can be compressed.
•  This enables splittability with non-splittable compression.

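The idea behind block compression can be sketched in a few lines: if records are grouped into blocks and each block is compressed on its own, any block can be read without touching the others, even though the codec itself cannot be split. This is a toy illustration only; it does not reproduce the real SequenceFile layout, zlib stands in for a non-splittable codec, and write_blocks and read_block are hypothetical names.

```python
import zlib


def write_blocks(records, records_per_block=3):
    """Group records into blocks and compress each block independently."""
    blocks = []
    for start in range(0, len(records), records_per_block):
        chunk = "\n".join(records[start:start + records_per_block])
        blocks.append(zlib.compress(chunk.encode("utf-8")))
    return blocks


def read_block(blocks, index):
    """Decompress a single block without reading any other block."""
    return zlib.decompress(blocks[index]).decode("utf-8").split("\n")
```

Because each block is self-contained, separate workers could each be handed a different block index and process their share in parallel.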
Avro
•  Kinda SequenceFile on Steroids.
•  Self-documenting – stores schema in header.
•  Provides very efficient storage.
•  Supports splittable compression.

Our Format Choices…
•  Avro with Snappy
    •  Snappy provides optimized compression.
    •  Avro provides compact storage, self-documenting files, and supports schema evolution.
    •  Avro also provides better failure handling than other choices.
•  SequenceFiles would also be a good choice, and are directly supported by ingestion tools in the ecosystem.
    •  But only supports Java.

HDFS Schema Design

Recommended HDFS Schema Design
•  How to lay out data on HDFS?

Recommended HDFS Schema Design
/user/<username> – User specific data, jars, conf files
/etl – Data in various stages of ETL workflow
/tmp – temp data from tools or shared between users
/data – shared data for the entire organization
/app – Everything but data: UDF jars, HQL files, Oozie workflows

Advanced HDFS Schema Design

What is Partitioning?
Un-partitioned HDFS directory structure:
dataset
   file1.txt
   file2.txt
   .
   .
   .
   filen.txt
Partitioned HDFS directory structure:
dataset
   col=val1/file.txt
   col=val2/file.txt
   .
   .
   .
   col=valn/file.txt

What is Partitioning?
Un-partitioned HDFS directory structure:
clicks
   clicks-2014-01-01.txt
   clicks-2014-01-02.txt
   .
   .
   .
   clicks-2014-03-31.txt
Partitioned HDFS directory structure:
clicks
   dt=2014-01-01/clicks.txt
   dt=2014-01-02/clicks.txt
   .
   .
   .
   dt=2014-03-31/clicks.txt

Partitioning
•  Split the dataset into smaller consumable chunks
•  Rudimentary form of “indexing”
•  <data set name>/<partition_column_name=partition_column_value>/{files}

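The path convention above can be sketched as a small helper. The function names (partition_path, click_partition, prune) are hypothetical, and the dt column and clicks.txt file name simply follow the clicks example from the earlier slide. The prune function illustrates why partitioning acts as a rudimentary index: a reader interested in some dates can skip every other directory by name alone.

```python
from datetime import datetime


def partition_path(dataset, column, value, filename):
    """Build <data set name>/<column=value>/<file>."""
    return f"{dataset}/{column}={value}/{filename}"


def click_partition(ts, filename="clicks.txt"):
    """Place a click event in the daily partition for its timestamp."""
    return partition_path("clicks", "dt", ts.strftime("%Y-%m-%d"), filename)


def prune(paths, column, wanted):
    """Keep only paths whose partition directory matches one of the wanted values."""
    keep = {f"{column}={value}" for value in wanted}
    return [p for p in paths if p.split("/")[1] in keep]
```

For example, click_partition(datetime(2014, 1, 1, 20, 56)) gives clicks/dt=2014-01-01/clicks.txt.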
Partitioning considerations
•  What column to partition by?
•  HDFS is append only.
•  Don’t have too many partitions (<10,000)
•  Don’t have too many small files in the partitions (files should generally be larger than the block size)
•  We decided to partition by timestamp

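A quick back-of-the-envelope check shows how the partition granularity interacts with the 10,000-partition rule of thumb. The three-year retention period below is an assumption chosen only for illustration; the limit comes from the slide.

```python
PARTITION_LIMIT = 10_000  # rule of thumb quoted on the slide


def partition_count(days, partitions_per_day=1):
    """Number of partitions created over a retention period."""
    return days * partitions_per_day


RETENTION_DAYS = 3 * 365  # assumed retention period, not from the talk

daily = partition_count(RETENTION_DAYS)        # one partition per day
hourly = partition_count(RETENTION_DAYS, 24)   # one partition per hour

fits = {
    "daily": daily < PARTITION_LIMIT,
    "hourly": hourly < PARTITION_LIMIT,
}
```

Under these assumptions daily partitions stay well under the limit while hourly partitions exceed it, which is one reason to pick the coarsest granularity that still matches the query patterns.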
What is bucketing?
Un-bucketed HDFS directory structure:
clicks
   dt=2014-01-01/clicks.txt

   dt=2014-01-02/clicks.txt
Bucketed HDFS directory structure:
clicks
   dt=2014-01-01/file0.txt
   dt=2014-01-01/file1.txt
   dt=2014-01-01/file2.txt
   dt=2014-01-01/file3.txt

   dt=2014-01-02/file0.txt
   dt=2014-01-02/file1.txt
   dt=2014-01-02/file2.txt
   dt=2014-01-02/file3.txt

Bucketing
•  Hash-bucketed files within each partition based on a particular column
•  Useful when sampling
•  In some joins, pre-reqs:
    •  Datasets bucketed on the same key as the join key
    •  Number of buckets are the same or one is a multiple of the other

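Hash bucketing and the "multiple of the other" prerequisite can be illustrated with a short sketch. The CRC32 hash here is a stand-in chosen because it is stable across runs; it is not the hash function Hive uses, and bucket_for is a hypothetical name.

```python
import zlib


def bucket_for(key, num_buckets):
    """Assign a key to a bucket using a stable hash modulo the bucket count."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def matching_small_bucket(large_bucket, small_count):
    """Map a bucket from the larger dataset onto the smaller dataset's bucket.

    Valid when the larger bucket count is a multiple of the smaller one,
    because (h % (k * n)) % n == h % n for any non-negative h.
    """
    return large_bucket % small_count
```

With 4 buckets on one dataset and 8 on the other, a row in bucket b of the 8-bucket dataset can only join with rows in bucket b % 4 of the 4-bucket dataset, so each pair of buckets can be joined independently.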
Bucketing considerations?
•  Which column to bucket on?
•  How many buckets?
•  We decided to bucket based on cookie

De-normalizing considerations
•  In general, big data joins are expensive
•  When to de-normalize?
•  Decided to join the smaller dimension tables
•  Big fact tables are still joined

Data Ingestion

File Transfers
•  “hadoop fs -put <file>”
•  Reliable, but not resilient to failure.
•  Other options are mountable HDFS, for example NFSv3.

Streaming Ingestion
•  Flume
    •  Reliable, distributed, and available system for efficient collection, aggregation and movement of streaming data, e.g. logs.
•  Kafka
    •  Reliable and distributed publish-subscribe messaging system.

Flume vs. Kafka
Flume
•  Purpose built for Hadoop data ingest.
•  Pre-built sinks for HDFS, HBase, etc.
•  Supports transformation of data in-flight.
Kafka
•  General pub-sub messaging framework.
•  Hadoop not supported, requires 3rd-party component (Camus).
•  Just a message transport (a very fast one).

Flume vs. Kafka
•  Bottom line:
    •  Flume very well integrated with Hadoop ecosystem, well suited to ingestion of sources such as log files.
    •  Kafka is a highly reliable and scalable enterprise messaging system, and great for scaling out to multiple consumers.

A Quick Introduction to Flume
[Diagram: External Source (Web Server, Twitter, JMS, System logs, …), Flume Agent (Source, Channel, Sink), Destination]
•  Flume Agent: JVM process hosting components
•  Source: Consumes events and forwards to channels
•  Channel: Stores events until consumed by sinks – file, memory, JDBC
•  Sink: Removes event from channel and puts into external destination

A Quick Introduction to Flume
•  Reliable – events are stored in channel until delivered to next stage.
•  Recoverable – events can be persisted to disk and recovered in the event of failure.
[Diagram: Flume Agent (Source, Channel, Sink), Destination]

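The reliability property described here, that an event stays in the channel until the next stage has accepted it, can be sketched with a toy in-memory model. This is not Flume's API: the Channel class and its put and take methods are hypothetical names, and a real deployment would use a durable channel rather than a Python deque.

```python
from collections import deque


class Channel:
    """Toy channel: holds events until a sink confirms delivery."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        """Called by a source to hand an event to the channel."""
        self._events.append(event)

    def take(self, deliver):
        """Offer the oldest event to a sink.

        The event is removed only if deliver(event) returns True, so a
        failed delivery leaves it in place for a later retry.
        """
        if not self._events:
            return False
        event = self._events[0]
        if deliver(event):
            self._events.popleft()
            return True
        return False

    def __len__(self):
        return len(self._events)
```

A sink that fails, for example because the destination is temporarily unavailable, simply returns False and the event is offered again on the next call.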
A Quick Introduction to Flume
•  Declarative
    •  No coding required.
    •  Configuration specifies how components are wired together.

A Brief Discussion of Flume Patterns – Fan-in
•  Flume agent runs on each of our servers.
•  These agents send data to multiple agents to provide reliability.
•  Flume provides support for load balancing.

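The reliability benefit of sending to multiple downstream agents can be sketched as a simple failover loop. This is a generic illustration, not Flume configuration or code: send_with_failover is a hypothetical name, and each downstream agent is modelled as a plain callable that raises ConnectionError when it is unreachable.

```python
def send_with_failover(event, agents):
    """Try each downstream agent in order and return the name of the first that accepts.

    agents is a list of (name, send) pairs, where send(event) raises
    ConnectionError if that agent cannot be reached.
    """
    for name, send in agents:
        try:
            send(event)
            return name
        except ConnectionError:
            continue  # try the next agent in the list
    raise RuntimeError("no downstream agent available")
```

Rotating the order of the agents list between calls would turn the same loop into a basic form of load balancing.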
A Brief Discussion of Flume Patterns – Splitting
•  Common need is to split data on ingest.
•  For example:
    •  Sending data to multiple clusters for DR.
    •  To multiple destinations.
•  Flume also supports partitioning, which is key to our implementation.

Sqoop Overview
•  Apache project designed to ease import and export of data between Hadoop and external data stores such as relational databases.
•  Great for doing bulk imports and exports of data between HDFS, Hive and HBase and an external data store. Not suited for ingesting event based data.

Ingestion Decisions
•  Historical Data
    •  Smaller files: file transfer
    •  Larger files: Flume with spooling directory source.
•  Incoming Data
    •  Flume with the spooling directory source.

Data Processing and Access

Data flow
[Diagram labels: Raw data; Partitioned clickstream data; Other data (Financial, CRM, etc.); Aggregated dataset #1; Aggregated dataset #2]

Data processing tools
•  Hive
•  Impala
•  Pig, etc.

Hive
•  Open source data warehouse system for Hadoop
•  Converts SQL-like queries to MapReduce jobs
    •  Work is being done to move this away from MR
•  Stores metadata in Hive metastore
•  Can create tables over HDFS or HBase data
•  Access available via JDBC/ODBC

Impala
•  Real-time open source SQL query engine for Hadoop
•  Doesn’t build on MapReduce
•  Written in C++, uses LLVM for run-time code generation
•  Can create tables over HDFS or HBase data
•  Accesses Hive metastore for metadata
•  Access available via JDBC/ODBC

Pig
•  Higher level abstraction over MapReduce (like Hive)
•  Write transformations in scripting language – Pig Latin
•  Can access Hive metastore via HCatalog for metadata

Data Processing considerations
•  We chose Hive for ETL and Impala for interactive BI.

Metadata Management

What is Metadata?
•  Metadata is data about the data
    •  Format in which data is stored
    •  Compression codec
    •  Location of the data
    •  Is the data partitioned/bucketed/sorted?

Metadata in Hive
[Diagram label: Hive Metastore]

Metadata
•  Hive metastore has become the de-facto metadata repository
•  HCatalog makes Hive metastore accessible to other applications (Pig, MapReduce, custom apps, etc.)

Hive + HCatalog

Orchestration
•  Once the data is in Hadoop, we need a way to manage workflows in our architecture.
•  Scheduling and tracking MapReduce jobs, Hive jobs, etc.
•  Several options here:
    •  Cron
    •  Oozie, Azkaban
    •  3rd-party tools, Talend, Pentaho, Informatica, enterprise schedulers.

Oozie
•  Supports defining and executing a sequence of jobs.
•  Can trigger jobs based on external dependencies or schedules.

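The core idea of a workflow engine, running each job only after the jobs it depends on, can be sketched with Python's standard library (graphlib, Python 3.9 or later). This is an illustration of dependency ordering only, not Oozie's workflow format or API; run_workflow and any job names passed to it are hypothetical.

```python
from graphlib import TopologicalSorter


def run_workflow(dependencies, actions):
    """Run every job after all of its prerequisites and return the execution order.

    dependencies maps a job name to the set of jobs it depends on;
    actions maps a job name to a zero-argument callable that runs it.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    for job in order:
        actions[job]()
    return order
```

A workflow could, for instance, declare that two aggregation jobs both depend on a shared preparation step, which in turn depends on ingestion; a cycle in the dependencies raises graphlib.CycleError before anything runs.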
Final Architecture

Final Architecture – High Level Overview
[Diagram stages: Data Sources; Ingestion; Data Storage/Processing; Data Reporting/Analysis]

Final Architecture – Ingestion
[Diagram: eight Web App / Avro Agent pairs; four Flume Agents (fan-in pattern; multiple agents for failover and rolling restarts); HDFS]

Final Architecture – Storage and Processing
/etl/weblogs/20140331/
/etl/weblogs/20140401/
…
Data Processing
/data/marketing/clickstream/bouncerate/
/data/marketing/clickstream/attribution/
…

Final Architecture – Data Access
[Diagram labels: Hive/Impala; BI/Analytics Tools; DWH; Sqoop; Local Disk; R, etc.; DB import tool; JDBC/ODBC]

Contact info
•  Mark Grover
    •  @mark_grover
    •  www.linkedin.com/in/grovermark
•  Jonathan Seidman
    •  jseidman@cloudera.com
    •  @jseidman
    •  https://www.linkedin.com/pub/jonathan-seidman/1/26a/959
    •  http://www.slideshare.net/jseidman
•  Slides at slideshare.net/hadooparchbook

Strata NY 2014 - Architectural considerations for Hadoop applications tutorialhadooparchbook
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoophadooparchbook
 
Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Cloudera, Inc.
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationshadooparchbook
 
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an exampleArchitecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an examplehadooparchbook
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
Webinar: Productionizing Hadoop: Lessons Learned - 20101208
Webinar: Productionizing Hadoop: Lessons Learned - 20101208Webinar: Productionizing Hadoop: Lessons Learned - 20101208
Webinar: Productionizing Hadoop: Lessons Learned - 20101208Cloudera, Inc.
 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online TrainingLearntek1
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kiteJoey Echeverria
 
Hadoop HDFS and Oracle
Hadoop HDFS and OracleHadoop HDFS and Oracle
Hadoop HDFS and OracleJohan Louwers
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basicssaili mane
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem pptsunera pathan
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 

Similaire à Application architectures with hadoop – big data techcon 2014 (20)

Application Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User GroupApplication Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User Group
 
Hadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise HadoopHadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise Hadoop
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialStrata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applications
 
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an exampleArchitecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an example
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Webinar: Productionizing Hadoop: Lessons Learned - 20101208
Webinar: Productionizing Hadoop: Lessons Learned - 20101208Webinar: Productionizing Hadoop: Lessons Learned - 20101208
Webinar: Productionizing Hadoop: Lessons Learned - 20101208
 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online Training
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
 
Hadoop HDFS and Oracle
Hadoop HDFS and OracleHadoop HDFS and Oracle
Hadoop HDFS and Oracle
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basics
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Hadoop In Action
Hadoop In ActionHadoop In Action
Hadoop In Action
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 

Plus de Jonathan Seidman

Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019Jonathan Seidman
 
Foundations strata sf-2019_final
Foundations strata sf-2019_finalFoundations strata sf-2019_final
Foundations strata sf-2019_finalJonathan Seidman
 
Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018Jonathan Seidman
 
Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018Jonathan Seidman
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Jonathan Seidman
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Jonathan Seidman
 
Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Jonathan Seidman
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Jonathan Seidman
 
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Jonathan Seidman
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Jonathan Seidman
 
Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Jonathan Seidman
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Jonathan Seidman
 
Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011Jonathan Seidman
 
Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010Jonathan Seidman
 
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010Jonathan Seidman
 

Plus de Jonathan Seidman (15)

Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019
 
Foundations strata sf-2019_final
Foundations strata sf-2019_finalFoundations strata sf-2019_final
Foundations strata sf-2019_final
 
Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018
 
Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
 
Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
 
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011
 
Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011
 
Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010
 
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
 

Dernier

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 

Dernier (20)

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

Application architectures with hadoop – big data techcon 2014

  • 1. Application Architectures with Hadoop
      Mark Grover | Software Engineer
      Jonathan Seidman | Solutions Architect, Partner Engineering
      April 1, 2014
      ©2014 Cloudera, Inc. All Rights Reserved.
  • 2. About Us
      • Mark
          • Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating).
          • Contributor to Hadoop, Hive, Spark, Sqoop, Flume.
          • @mark_grover
      • Jonathan
          • Solutions Architect, Partner Engineering Team.
          • Co-founder of Chicago Hadoop User Group and Chicago Big Data.
          • jseidman@cloudera.com
          • @jseidman
  • 3. Co-authoring O'Reilly book
      • Titled 'Hadoop Application Architectures'
      • How to build end-to-end solutions using Apache Hadoop and related tools
      • Updates on Twitter: @hadooparchbook
      • http://www.hadooparchitecturebook.com
  • 4. Challenges of Hadoop Implementation
  • 5. Challenges of Hadoop Implementation
  • 6. Click Stream Analysis Case Study
  • 7. Click Stream Analysis [diagram labels: Log Files, DWH, X]
  • 8. Web Log Example
      [2012/09/22 20:56:04.294 -0500] "GET /info/ HTTP/1.1" 200 701 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en)" "age=38&gender=1&incomeCategory=5&session=983040389&user=627735038&region=8&userType=1"
      [2012/09/23 14:12:52.294 -0500] "GET /wish/remove/275 HTTP/1.1" 200 701 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; en-us) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16" "age=63&gender=1&incomeCategory=1&session=1561203915&user=1364334488&region=4&userType=1"
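The two sample log lines above can be turned into structured records with a small parser. The following is a minimal sketch, not the presenters' code: the regular expression is inferred from those two lines only, and the field names (`status`, `size`, `referrer`, `user_agent`, `params`) are assumptions about what each position means.

```python
import re
from datetime import datetime
from urllib.parse import parse_qs

# Layout inferred from the two sample lines; a real log may differ.
LOG_PATTERN = re.compile(
    r'\[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referrer>[^"]*)" '
    r'"(?P<user_agent>[^"]*)" '
    r'"(?P<params>[^"]*)"'
)

SAMPLE = (
    '[2012/09/22 20:56:04.294 -0500] "GET /info/ HTTP/1.1" 200 701 "-" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en)" '
    '"age=38&gender=1&incomeCategory=5&session=983040389'
    '&user=627735038&region=8&userType=1"'
)


def parse_click(line):
    """Return a dict for one log line, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["ts"] = datetime.strptime(record["ts"], "%Y/%m/%d %H:%M:%S.%f %z")
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    # The last quoted field looks like a URL query string of user attributes.
    record["params"] = {k: v[0] for k, v in parse_qs(record["params"]).items()}
    return record


click = parse_click(SAMPLE)
print(click["path"], click["status"], click["params"]["user"])
```

Returning `None` for lines that do not match lets a caller count and set aside malformed input rather than abort a whole batch.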
  • 9. Hadoop Architectural Considerations
      • Storage managers?
          • HDFS? HBase?
      • Data storage and modeling:
          • File formats? Compression? Schema design?
      • Data movement
          • How do we actually get the data into Hadoop? How do we get it out?
      • Metadata
          • How do we manage data about the data?
      • Data access and processing
          • How will the data be accessed once in Hadoop? How can we transform it? How do we query it?
      • Orchestration
          • How do we manage the workflow for all of this?
  • 10. Data Storage and Modeling
  • 11. Data Storage – Storage Manager considerations
      • Popular storage managers for Hadoop
          • Hadoop Distributed File System (HDFS)
          • HBase
  • 12. Data Storage – HDFS vs HBase
      • HDFS
          • Stores data directly as files
          • Fast scans
          • Poor random reads/writes
      • HBase
          • Stores data as HFiles on HDFS
          • Slow scans
          • Fast random reads/writes
  • 13. Data Storage – Storage Manager considerations
      • We choose HDFS
          • Analytical needs in this case served better by fast scans.
  • 14. Data Storage Format
  • 15. Data Storage – Format Considerations
      • Store as plain text?
          • Sure, well supported by Hadoop.
          • Text can easily be processed by MapReduce, loaded into Hive for analysis, and so on.
      • But…
          • Will begin to consume lots of space in HDFS.
          • May not be optimal for processing by tools in the Hadoop ecosystem.
  • 16. Data Storage – Format Considerations
      • But, we can compress the text files…
          • Gzip – supported by Hadoop, but not splittable.
          • Bzip2 – hey, splittable! Great compression! But decompression is slooowww.
          • LZO – splittable (with some work), good compress/decompress performance. Good choice for storing text files on Hadoop.
          • Snappy – provides a good tradeoff between size and speed.
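The size-versus-speed tradeoff described above can be measured directly. This sketch covers only gzip and bzip2, because those are the two codecs from the slide that ship in the Python standard library; LZO and Snappy need third-party bindings and are left out. The sample payload is made up for illustration, and timings will vary by machine and data.

```python
import bz2
import gzip
import time


def compare_codecs(data):
    """Compress and decompress `data` with each codec, recording ratio and timings."""
    results = {}
    for name, codec in (("gzip", gzip), ("bzip2", bz2)):
        start = time.perf_counter()
        packed = codec.compress(data)
        compress_s = time.perf_counter() - start

        start = time.perf_counter()
        unpacked = codec.decompress(packed)
        decompress_s = time.perf_counter() - start

        if unpacked != data:
            raise ValueError(f"{name} round trip did not reproduce the input")
        results[name] = {
            "ratio": len(packed) / len(data),  # smaller is better
            "compress_s": compress_s,
            "decompress_s": decompress_s,
        }
    return results


# Hypothetical payload: one log-like line repeated many times.
payload = b'"GET /info/ HTTP/1.1" 200 701 "-" "age=38&gender=1&region=8"\n' * 20000
for codec_name, stats in compare_codecs(payload).items():
    print(codec_name, stats)
```

Running it on a representative sample of real log data gives a rough local picture before committing to a codec for a whole dataset.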
  • 17. Data Storage – More About Snappy
      • Designed at Google to provide high compression speeds with reasonable compression.
      • Not the highest compression, but provides very good performance for processing on Hadoop.
      • Snappy is not splittable though, which brings us to…
  • 18. SequenceFile
      • Stores records as binary key/value pairs.
      • SequenceFile "blocks" can be compressed.
      • This enables splittability with non-splittable compression.
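The idea that compressing blocks independently restores splittability can be shown with a toy container. This is an illustration of the principle only, assuming newline-joined text records and `zlib` as a stand-in codec; it does not reproduce the actual SequenceFile on-disk format.

```python
import zlib


def write_blocks(records, records_per_block=2):
    """Compress records in independent blocks and return the compressed blocks."""
    blocks = []
    for i in range(0, len(records), records_per_block):
        chunk = "\n".join(records[i:i + records_per_block]).encode("utf-8")
        blocks.append(zlib.compress(chunk))
    return blocks


def read_block(blocks, index):
    """Decompress one block without touching any other block."""
    return zlib.decompress(blocks[index]).decode("utf-8").split("\n")


records = ["click-a", "click-b", "click-c", "click-d", "click-e"]
blocks = write_blocks(records)

# Each block could be handed to a different worker, because none of them
# depends on the compressed bytes that come before it.
for i in range(len(blocks)):
    print(i, read_block(blocks, i))
```

A single compressed stream, by contrast, generally has to be read from the start, which is why a plain gzip file cannot be divided among several tasks.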
  • 19. Avro
      • Kinda SequenceFile on Steroids.
      • Self-documenting – stores schema in header.
      • Provides very efficient storage.
      • Supports splittable compression.
  • 20. Our Format Choices…
      • Avro with Snappy
          • Snappy provides optimized compression.
          • Avro provides compact storage, self-documenting files, and supports schema evolution.
          • Avro also provides better failure handling than other choices.
      • SequenceFiles would also be a good choice, and are directly supported by ingestion tools in the ecosystem.
          • But they only support Java.
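To make "self-documenting" and "schema evolution" concrete, here is a sketch of what an Avro schema for the click records might look like, built as a plain Python dictionary and printed as JSON. The record name, namespace, and field list are hypothetical, loosely based on the sample log lines, and are not taken from the talk. No Avro library is used, so nothing is serialized here.

```python
import json

# Hypothetical first version of a schema for one click record.
CLICK_SCHEMA_V1 = {
    "type": "record",
    "name": "Click",
    "namespace": "example.clickstream",
    "fields": [
        {"name": "timestamp", "type": "string"},
        {"name": "request", "type": "string"},
        {"name": "status", "type": "int"},
        {"name": "user", "type": "long"},
        {"name": "session", "type": "long"},
    ],
}


def add_optional_field(schema, name, avro_type):
    """Return a copy of `schema` with a new nullable field that defaults to null."""
    evolved = json.loads(json.dumps(schema))  # deep copy through JSON
    evolved["fields"].append(
        {"name": name, "type": ["null", avro_type], "default": None}
    )
    return evolved


# Evolving the schema: the new field carries a default, so a reader using
# version 2 can still fill it in for records written with version 1.
CLICK_SCHEMA_V2 = add_optional_field(CLICK_SCHEMA_V1, "region", "int")
print(json.dumps(CLICK_SCHEMA_V2, indent=2))
```

Adding fields with defaults, rather than changing or removing existing ones, is the usual way to keep older files readable as the schema changes.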
  • 21. 21 HDFS  Schema  Design   ©2014 Cloudera, Inc. All Rights Reserved.
  • 22. Recommended  HDFS  Schema  Design   •  How  to  lay  out  data  on  HDFS?   22 ©2014 Cloudera, Inc. All Rights Reserved.
  • 23. Recommended  HDFS  Schema  Design   /user/<username>  -­‐  User  specific  data,  jars,  conf  files   /etl  –  Data  in  various  stages  of  ETL  workflow   /tmp  –  temp  data  from  tools  or  shared  between  users   /data  –  shared  data  for  the  enAre  organizaAon   /app  –  Everything  but  data:  UDF  jars,  HQL  files,  Oozie  workflows   23 ©2014 Cloudera, Inc. All Rights Reserved.
  • 24. 24 Advanced  HDFS  Schema  Design   ©2014 Cloudera, Inc. All Rights Reserved.
  • 25. What  is  ParAAoning?   25 dataset        col=val1/file.txt        col=val2/file.txt          .          .          .        col=valn/file.txt   dataset      file1.txt      file2.txt          .          .          .        filen.txt   Un-­‐parAAoned  HDFS   directory  structure   ParAAoned  HDFS  directory   structure   ©2014 Cloudera, Inc. All Rights Reserved.
  • 26. What  is  ParAAoning?   26 clicks        dt=2014-­‐01-­‐01/clicks.txt        dt=2014-­‐01-­‐02/clicks.txt          .          .          .        dt=2014-­‐03-­‐31/clicks.txt   clicks      clicks-­‐2014-­‐01-­‐01.txt      clicks-­‐2014-­‐01-­‐02.txt          .          .          .        clicks-­‐2014-­‐03-­‐31.txt   Un-­‐parAAoned  HDFS   directory  structure   ParAAoned  HDFS  directory   structure   ©2014 Cloudera, Inc. All Rights Reserved.
  • 27. Partitioning • Split the dataset into smaller consumable chunks • Rudimentary form of “indexing” • <data set name>/<partition_column_name=partition_column_value>/{files}
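The directory convention above can be sketched in a few lines of Python. This is only an illustration, assuming a date-valued partition column named `dt` as in the `clicks` example; the function names are hypothetical and not part of any Hadoop API. The second function shows why partitioning works as a rudimentary index: a query over a date range only needs to read the matching directories.

```python
from datetime import date, timedelta


def partition_path(dataset: str, column: str, value: str, filename: str) -> str:
    """Build <data set name>/<column>=<value>/<file>, the layout from the slide."""
    return f"{dataset}/{column}={value}/{filename}"


def partitions_for_range(dataset: str, start: date, end: date) -> list[str]:
    """List only the dt= directories a query over [start, end] needs to read."""
    days = (end - start).days
    return [
        f"{dataset}/dt={(start + timedelta(days=i)).isoformat()}"
        for i in range(days + 1)
    ]


print(partition_path("clicks", "dt", "2014-01-01", "clicks.txt"))
print(partitions_for_range("clicks", date(2014, 1, 30), date(2014, 2, 1)))
```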
  • 28. Partitioning considerations • What column to partition by? • HDFS is append only. • Don’t have too many partitions (<10,000) • Don’t have too many small files in the partitions (files should generally be larger than the block size) • We decided to partition by timestamp
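A minimal sketch of how these two rules of thumb could be checked against a directory listing. The 128 MB block size is an assumption (the real value is whatever `dfs.blocksize` is set to on the cluster), and `check_layout` is a hypothetical helper, not an HDFS tool.

```python
from collections import defaultdict

BLOCK_SIZE = 128 * 1024 * 1024   # assumed HDFS block size; check dfs.blocksize
MAX_PARTITIONS = 10_000          # rule of thumb from the slide


def check_layout(files):
    """files: iterable of (partition_dir, size_in_bytes) pairs from a listing."""
    sizes = defaultdict(list)
    for partition, size in files:
        sizes[partition].append(size)

    warnings = []
    if len(sizes) >= MAX_PARTITIONS:
        warnings.append(f"{len(sizes)} partitions; consider a coarser partition column")
    for partition, file_sizes in sorted(sizes.items()):
        small = sum(1 for s in file_sizes if s < BLOCK_SIZE)
        if small:
            warnings.append(f"{partition}: {small} file(s) smaller than one block")
    return warnings


listing = [
    ("dt=2014-01-01", 256 * 1024 * 1024),
    ("dt=2014-01-02", 1024),
    ("dt=2014-01-02", 2048),
]
print(check_layout(listing))
```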
  • 29. What is bucketing? • Un-bucketed HDFS directory structure: clicks/dt=2014-01-01/clicks.txt, clicks/dt=2014-01-02/clicks.txt • Bucketed HDFS directory structure: clicks/dt=2014-01-01/file0.txt, clicks/dt=2014-01-01/file1.txt, clicks/dt=2014-01-01/file2.txt, clicks/dt=2014-01-01/file3.txt, clicks/dt=2014-01-02/file0.txt, clicks/dt=2014-01-02/file1.txt, clicks/dt=2014-01-02/file2.txt, clicks/dt=2014-01-02/file3.txt
  • 30. Bucketing • Hash-bucketed files within each partition based on a particular column • Useful when sampling • Useful in some joins; prerequisites: • Datasets bucketed on the same key as the join key • Number of buckets is the same, or one is a multiple of the other
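The idea of hash bucketing, and the reason the bucket counts must be equal or multiples of each other for a bucketed join, can be sketched as follows. CRC32 is used here purely as a deterministic stand-in; it is an assumption for illustration and not the hash function Hive applies. The cookie values are made up.

```python
import zlib


def bucket_for(key: str, num_buckets: int) -> int:
    """Assign a key to a bucket with a deterministic hash (CRC32, for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


cookies = ["c-1001", "c-1002", "c-1003", "c-1004"]
for cookie in cookies:
    small = bucket_for(cookie, 4)
    large = bucket_for(cookie, 8)
    # Because 8 is a multiple of 4, the 8-way bucket always folds back onto
    # the 4-way bucket, so matching keys can be found bucket by bucket.
    assert large % 4 == small
    print(cookie, f"file{small}.txt", f"(8-way bucket {large})")
```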
  • 31. Bucketing considerations? • Which column to bucket on? • How many buckets? • We decided to bucket based on cookie
  • 32. De-normalizing considerations • In general, big data joins are expensive • When to de-normalize? • We decided to pre-join (de-normalize) the smaller dimension tables • Big fact tables are still joined at query time
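As a rough illustration of pre-joining a small dimension table, the sketch below copies a dimension value into each fact row once, so later queries do not repeat the join. The table contents and field names are hypothetical.

```python
# Hypothetical small dimension table, small enough to hold in memory.
browsers = {1: "MSIE 6.0", 2: "Firefox"}

# Hypothetical click facts that reference the dimension by id.
clicks = [
    {"cookie": "c-1001", "url": "/info/", "browser_id": 1},
    {"cookie": "c-1002", "url": "/cart/", "browser_id": 2},
]


def denormalize(facts, dimension, key, new_field):
    """Copy the dimension value into each fact row so later queries skip the join."""
    for row in facts:
        enriched = dict(row)                       # leave the input row untouched
        enriched[new_field] = dimension.get(row[key])
        yield enriched


wide_clicks = list(denormalize(clicks, browsers, "browser_id", "browser"))
print(wide_clicks[0])
```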
  • 33. Data Ingestion
  • 34. File Transfers • “hadoop fs -put <file>” • Reliable, but not resilient to failure. • Other options are mountable HDFS, for example NFSv3.
  • 35. Streaming Ingestion • Flume • Reliable, distributed, and available system for efficient collection, aggregation and movement of streaming data, e.g. logs. • Kafka • Reliable and distributed publish-subscribe messaging system.
  • 36. Flume vs. Kafka • Flume: • Purpose built for Hadoop data ingest. • Pre-built sinks for HDFS, HBase, etc. • Supports transformation of data in-flight. • Kafka: • General pub-sub messaging framework. • Hadoop not supported, requires 3rd-party component (Camus). • Just a message transport (a very fast one).
  • 37. Flume vs. Kafka • Bottom line: • Flume is very well integrated with the Hadoop ecosystem and well suited to ingestion of sources such as log files. • Kafka is a highly reliable and scalable enterprise messaging system, and great for scaling out to multiple consumers.
  • 38. A Quick Introduction to Flume • Flume Agent: JVM process hosting components • External source (web server, Twitter, JMS, system logs, …) → Source → Channel → Sink → Destination • Source: consumes events and forwards to channels • Channel: stores events until consumed by sinks (file, memory, JDBC) • Sink: removes event from channel and puts into external destination
  • 39. A Quick Introduction to Flume • Reliable – events are stored in channel until delivered to next stage. • Recoverable – events can be persisted to disk and recovered in the event of failure. • Flume Agent: Source → Channel → Sink → Destination
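The reliability property, that an event stays in the channel until the next stage has accepted it, can be modeled with a toy queue. This is a simplified sketch and not Flume's implementation: the `Channel` class and `flaky_sink` function are hypothetical, and real Flume channels use transactions and optional on-disk persistence.

```python
from collections import deque


class Channel:
    """Toy stand-in for a Flume channel: holds events until a sink confirms delivery."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def deliver(self, send):
        """Try to hand the oldest event to `send`; keep it queued if delivery fails."""
        if not self._events:
            return False
        event = self._events[0]          # peek, do not remove yet
        try:
            send(event)
        except IOError:
            return False                 # event stays in the channel for a retry
        self._events.popleft()           # remove only after a successful send
        return True

    def __len__(self):
        return len(self._events)


delivered = []
attempts = {"n": 0}


def flaky_sink(event):
    """Hypothetical sink that fails on its first call, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise IOError("destination unavailable")
    delivered.append(event)


channel = Channel()
channel.put("GET /info/ HTTP/1.1")
print(channel.deliver(flaky_sink), len(channel))  # first attempt fails, event is kept
print(channel.deliver(flaky_sink), len(channel))  # retry succeeds, channel drains
```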
  • 40. A Quick Introduction to Flume • Declarative • No coding required. • Configuration specifies how components are wired together.
  • 41. A Brief Discussion of Flume Patterns – Fan-in • Flume agent runs on each of our servers. • These agents send data to multiple agents to provide reliability. • Flume provides support for load balancing.
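The fan-in pattern with load balancing and failover can be sketched as a sender that rotates across collector agents and skips any that are unavailable. In Flume this behavior comes from configuration rather than code; the `LoadBalancingSender` class below is a hypothetical model of the idea only.

```python
import itertools


class LoadBalancingSender:
    """Round-robin over collector agents, skipping any that refuse the event."""

    def __init__(self, collectors):
        self._collectors = list(collectors)
        self._start = itertools.cycle(range(len(self._collectors)))

    def send(self, event):
        first = next(self._start)
        for offset in range(len(self._collectors)):
            collector = self._collectors[(first + offset) % len(self._collectors)]
            try:
                collector(event)
                return True
            except ConnectionError:
                continue                 # fail over to the next collector
        return False                     # every collector was down


received = {"a": [], "b": []}


def collector_a(event):
    received["a"].append(event)


def collector_b(event):
    # Simulates an agent that is down, for example during a rolling restart.
    raise ConnectionError("agent b is restarting")


sender = LoadBalancingSender([collector_a, collector_b])
for event in ["e1", "e2", "e3"]:
    sender.send(event)
print(received)
```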
  • 42. A Brief Discussion of Flume Patterns – Splitting • Common need is to split data on ingest. • For example: • Sending data to multiple clusters for DR. • To multiple destinations. • Flume also supports partitioning, which is key to our implementation.
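Partitioning on ingest amounts to deriving the landing directory from each event's timestamp. The sketch below assumes the web log timestamp format shown in the earlier log example and the dated `/etl/weblogs/` layout used later in the deck; `target_directory` is a hypothetical helper. In Flume itself this is usually expressed with date escape sequences in the HDFS sink path rather than custom code.

```python
from datetime import datetime


def target_directory(log_line: str, root: str = "/etl/weblogs") -> str:
    """Map a web log line to a dated landing directory under `root`."""
    stamp = log_line[log_line.index("[") + 1 : log_line.index("]")]
    when = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S.%f %z")
    return f"{root}/{when:%Y%m%d}"


line = '[2012/09/22 20:56:04.294 -0500] "GET /info/ HTTP/1.1" 200 701 "-"'
print(target_directory(line))
```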
  • 43. Sqoop Overview • Apache project designed to ease import and export of data between Hadoop and external data stores such as relational databases. • Great for doing bulk imports and exports of data between HDFS, Hive and HBase and an external data store. Not suited for ingesting event-based data.
  • 44. Ingestion Decisions • Historical Data • Smaller files: file transfer • Larger files: Flume with spooling directory source. • Incoming Data • Flume with the spooling directory source.
  • 45. Data Processing and Access
  • 46. Data flow • Raw data • Partitioned clickstream data • Other data (Financial, CRM, etc.) • Aggregated dataset #1 • Aggregated dataset #2
  • 47. Data processing tools • Hive • Impala • Pig, etc.
  • 48. Hive • Open source data warehouse system for Hadoop • Converts SQL-like queries to MapReduce jobs • Work is being done to move this away from MR • Stores metadata in Hive metastore • Can create tables over HDFS or HBase data • Access available via JDBC/ODBC
  • 49. Impala • Real-time open source SQL query engine for Hadoop • Doesn’t build on MapReduce • Written in C++, uses LLVM for run-time code generation • Can create tables over HDFS or HBase data • Accesses Hive metastore for metadata • Access available via JDBC/ODBC
  • 50. Pig • Higher level abstraction over MapReduce (like Hive) • Write transformations in scripting language – Pig Latin • Can access Hive metastore via HCatalog for metadata
  • 51. Data Processing considerations • We chose Hive for ETL and Impala for interactive BI.
  • 52. Metadata Management
  • 53. What is Metadata? • Metadata is data about the data • Format in which data is stored • Compression codec • Location of the data • Is the data partitioned/bucketed/sorted?
  • 54. Metadata in Hive • Hive Metastore
  • 55. Metadata • Hive metastore has become the de-facto metadata repository • HCatalog makes Hive metastore accessible to other applications (Pig, MapReduce, custom apps, etc.)
  • 56. Hive + HCatalog
  • 57. Orchestration
  • 58. Orchestration • Once the data is in Hadoop, we need a way to manage workflows in our architecture. • Scheduling and tracking MapReduce jobs, Hive jobs, etc. • Several options here: • Cron • Oozie, Azkaban • 3rd-party tools: Talend, Pentaho, Informatica, enterprise schedulers.
  • 59. Oozie • Supports defining and executing a sequence of jobs. • Can trigger jobs based on external dependencies or schedules.
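The core of what a workflow engine does, running jobs in an order that respects their dependencies, can be sketched with Python's standard library. This is not how Oozie is configured (Oozie workflows are defined in XML); the job names below are hypothetical and loosely follow the data flow described earlier.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "partition_clickstream": {"ingest_weblogs"},
    "bounce_rate": {"partition_clickstream"},
    "attribution": {"partition_clickstream", "load_crm"},
}

# static_order() yields every job after all of its dependencies.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

Jobs with no dependency between them (here `ingest_weblogs` and `load_crm`) may appear in either order, which is also where a scheduler is free to run them in parallel.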
  • 60. Final Architecture
  • 61. Final Architecture – High Level Overview • Data Sources • Ingestion • Data Storage/Processing • Data Reporting/Analysis
  • 62. Final Architecture – High Level Overview • Data Sources • Ingestion • Data Storage/Processing • Data Reporting/Analysis
  • 63. Final Architecture – Ingestion • Web App with Avro Agent (×8) • Flume Agent (×4) • Fan-in Pattern • Multi Agents for Failover and rolling restarts • HDFS
  • 64. Final Architecture – High Level Overview • Data Sources • Ingestion • Data Storage/Processing • Data Reporting/Analysis
  • 65. Final Architecture – Storage and Processing • /etl/weblogs/20140331/ • /etl/weblogs/20140401/ • … • Data Processing • /data/marketing/clickstream/bouncerate/ • /data/marketing/clickstream/attribution/ • …
  • 66. Final Architecture – High Level Overview • Data Sources • Ingestion • Data Storage/Processing • Data Reporting/Analysis
  • 67. Final Architecture – Data Access • Hive/Impala • BI/Analytics Tools • DWH • Sqoop • Local Disk • R, etc. • DB import tool • JDBC/ODBC
  • 68. Contact info • Mark Grover • @mark_grover • www.linkedin.com/in/grovermark • Jonathan Seidman • jseidman@cloudera.com • @jseidman • https://www.linkedin.com/pub/jonathan-seidman/1/26a/959 • http://www.slideshare.net/jseidman • Slides at slideshare.net/hadooparchbook