Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
David Alves on behalf of the Kudu team
© Cloudera, Inc. All rights reserved.
  
Kudu
Storage for Fast Analytics on Fast Data
•  New updating column store for Hadoop
•  Apache-licensed open source
•  Beta now available
[Diagram labels: Columnar Store; Kudu]
  
Motivation and Goals
Why build Kudu?
  
Motivating Questions
•  Are there user problems that we can’t address because of gaps in Hadoop ecosystem storage technologies?
•  Are we positioned to take advantage of advancements in the hardware landscape?
  
Current Storage Landscape in Hadoop
HDFS excels at:
•  Efficiently scanning large amounts of data
•  Accumulating data with high throughput
HBase excels at:
•  Efficiently finding and writing individual rows
•  Making data mutable
Gaps exist when these properties are needed simultaneously
  
Changing Hardware landscape
•  Spinning disk -> solid state storage
    • NAND flash: Up to 450k read / 250k write IOPS, about 2GB/sec read and 1.5GB/sec write throughput, at a price of less than $3/GB and dropping
    • 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
•  RAM is cheaper and more abundant:
    • 64->128->256GB over last few years
•  Takeaway 1: The next bottleneck is CPU, and current storage systems weren’t designed with CPU efficiency in mind.
•  Takeaway 2: Column stores are feasible for random access
  
Kudu Design Goals
•  High throughput for big scans (columnar storage and replication)
    Goal: Within 2x of Parquet
•  Low-latency for short accesses (primary key indexes and quorum replication)
    Goal: 1ms read/write on SSD
•  Database-like semantics (initially single-row ACID)
•  Relational data model
•  SQL query
•  “NoSQL” style scan/insert/update (Java client)
  
Kudu Design Goals
  
how effectively primary key filters can be pushed down to Kudu. 
What do I use Kudu for? 
We talked about how Kudu is made for SQL, allows fast scans, and allows fast mutability at 
scale.  With that in context, let’s look at the variety of use cases done in Hadoop today and see 
where Kudu fits in. 
 
 
If we look at Kudu in the above figure, we will see that many of the traditional SQL use cases

Kudu Usage
•  Table has a SQL-like schema
    • Finite number of columns (unlike HBase/Cassandra)
    • Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP
    • Some subset of columns makes up a possibly-composite primary key
    • Fast ALTER TABLE
•  Java and C++ “NoSQL” style APIs
    • Insert(), Update(), Delete(), Scan()
•  Integrations with MapReduce, Spark, and Impala
    • more to come!
  
Use cases and architectures
  
Kudu Use Cases
Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes
● Time Series
    ○ Examples: Stream market data; fraud detection & prevention; risk monitoring
    ○ Workload: Inserts, updates, scans, lookups
● Machine Data Analytics
    ○ Examples: Network threat detection
    ○ Workload: Inserts, scans, lookups
● Online Reporting
    ○ Examples: ODS
    ○ Workload: Inserts, updates, scans, lookups
  
Real-Time Analytics in Hadoop Today
Fraud Detection in the Real World = Storage Complexity
Considerations:
●  How do I handle failure during this process?
●  How often do I reorganize data streaming in into a format appropriate for reporting?
●  When reporting, how do I see data that has not yet been reorganized?
●  How do I ensure that important jobs aren’t interrupted by maintenance?
[Diagram labels: Incoming Data (Messaging System); HBase; Parquet File; New Partition; Most Recent Partition; Historic Data; Reporting Request; Impala on HDFS]
Have we accumulated enough data? Reorganize HBase file into Parquet:
•  Wait for running operations to complete
•  Define new Impala partition referencing the newly written Parquet file
  
Real-Time Analytics in Hadoop with Kudu
Improvements:
●  One system to operate
●  No cron jobs or background processes
●  Handle late arrivals or data corrections with ease
●  New data available immediately for analytics or operations
[Diagram labels: Incoming Data (Messaging System); Historical and Real-time Data; Storage in Kudu; Reporting Request]
  
How it works
Replication and distribution
  
Tables and Tablets
•  Table is horizontally partitioned into tablets
    • Range or hash partitioning
    • PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS
•  Each tablet has N replicas (3 or 5), with Raft consensus
    • Allow read from any replica, plus leader-driven writes with low MTTR
•  Tablet servers host tablets
    • Store data on local disks (no HDFS)
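The hash partitioning clause on this slide can be pictured as a function from the hashed column to one of 100 buckets. The following is a minimal sketch only: CRC32 is an assumption chosen for illustration (the slides do not say which hash function Kudu uses), and `bucket_for` is a hypothetical helper name.

```python
import zlib

NUM_BUCKETS = 100  # matches "INTO 100 BUCKETS" on the slide


def bucket_for(timestamp: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a timestamp to a bucket (tablet) index.

    CRC32 is an assumption picked for determinism; it is not
    claimed to be the hash Kudu uses internally.
    """
    key_bytes = timestamp.to_bytes(8, "big", signed=True)
    return zlib.crc32(key_bytes) % num_buckets


# Rows with the same timestamp always land in the same bucket,
# regardless of the other primary key columns.
rows = [("host1", "cpu", 1000), ("host2", "cpu", 1000), ("host1", "cpu", 1001)]
placement = {row: bucket_for(row[2]) for row in rows}
```

Because only `timestamp` is hashed here, two rows from different hosts with the same timestamp share a tablet, which is the behavior the slide's `HASH(timestamp)` clause implies.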
  
Metadata
•  Replicated master*
    • Acts as a tablet directory (“META” table)
    • Acts as a catalog (table schemas, etc)
    • Acts as a load balancer (tracks TS liveness, re-replicates under-replicated tablets)
•  Caches all metadata in RAM for high performance
    • 80-node load test, GetTableLocations RPC perf:
        •  99th percentile: 68us, 99.99th percentile: 657us
        •  <2% peak CPU usage
•  Client configured with master addresses
    • Asks master for tablet locations as needed and caches them
  
Raft consensus
[Diagram: a Client and three tablet servers, each with its own WAL. TS A hosts Tablet 1 (LEADER); TS B and TS C host Tablet 1 (FOLLOWER)]
1a. Client->Leader: Write() RPC
2a. Leader->Followers: UpdateConsensus() RPC
2b. Leader writes local WAL
3. Follower: write WAL
4. Follower->Leader: success
5. Leader has achieved majority
6. Leader->Client: Success!
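The numbered write path above can be sketched as a majority-acknowledgement check. This is a simplified illustration, not Kudu's implementation: it runs sequentially, omits Raft terms, log indexes and retries, and the `Replica` class and `replicate_write` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    name: str
    alive: bool = True
    wal: List[str] = field(default_factory=list)

    def append_to_wal(self, op: str) -> bool:
        """Persist the operation; a dead replica cannot acknowledge."""
        if not self.alive:
            return False
        self.wal.append(op)
        return True


def replicate_write(leader: Replica, followers: List[Replica], op: str) -> bool:
    """Return True once a majority of replicas (leader included) logged op."""
    acks = 1 if leader.append_to_wal(op) else 0   # 2b. leader writes local WAL
    for follower in followers:                    # 2a. UpdateConsensus() RPC
        if follower.append_to_wal(op):            # 3. and 4. follower logs, acks
            acks += 1
    majority = (1 + len(followers)) // 2 + 1      # 5. majority threshold
    return acks >= majority                       # 6. result sent to the client


ts_a, ts_b, ts_c = Replica("TS A"), Replica("TS B"), Replica("TS C", alive=False)
ok = replicate_write(ts_a, [ts_b, ts_c], "INSERT row1")
```

With one follower down, the leader and the remaining follower still form two of three replicas, so the write succeeds.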
  
Fault tolerance
•  Transient FOLLOWER failure:
    • Leader can still achieve majority
    • Restart follower TS within 5 min and it will rejoin transparently
•  Transient LEADER failure:
    • Followers expect to hear a heartbeat from their leader every 1.5 seconds
    • 3 missed heartbeats: leader election!
        •  New LEADER is elected from remaining nodes within a few seconds
    • Restart within 5 min and it rejoins as a FOLLOWER
•  N replicas handle (N-1)/2 failures
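The figures on this slide can be worked through with a little arithmetic. The sketch below assumes that (N-1)/2 means integer division and that an election starts after three consecutive heartbeat intervals pass in silence; both are readings of the slide, and the function names are hypothetical.

```python
HEARTBEAT_INTERVAL_S = 1.5   # from the slide
MISSED_HEARTBEATS = 3        # from the slide


def tolerated_failures(num_replicas: int) -> int:
    """N replicas handle (N-1)/2 failures, using integer division."""
    return (num_replicas - 1) // 2


def election_trigger_delay_s() -> float:
    """Silence, in seconds, before followers start a leader election."""
    return HEARTBEAT_INTERVAL_S * MISSED_HEARTBEATS


for n in (3, 5):
    print(f"{n} replicas tolerate {tolerated_failures(n)} failure(s)")
print(f"election starts after about {election_trigger_delay_s()} s of silence")
```

For the replica counts named in the deck, 3 replicas tolerate 1 failure and 5 replicas tolerate 2.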
  
Fault tolerance (2)
•  Permanent failure:
    • Leader notices that a follower has been dead for 5 minutes
    • Evicts that follower
    • Master selects a new replica
    • Leader copies the data over to the new one, which joins as a new FOLLOWER
  
How it works
Storage engine internals
  
Tablet design
•  Inserts buffered in an in-memory store (like HBase’s memstore)
•  Flushed to disk
    • Columnar layout, similar to Apache Parquet
•  Updates use MVCC (updates tagged with timestamp, not in-place)
    • Allow “SELECT AS OF <timestamp>” queries and consistent cross-tablet scans
•  Near-optimal read path for “current time” scans
    • No per row branches, fast vectorized decoding and predicate evaluation
•  Performance worsens based on number of recent updates
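The MVCC bullet can be illustrated with a toy single-cell store in which updates are recorded with a timestamp instead of overwriting the base value. This is a sketch under simplifying assumptions (one cell, integer timestamps, hypothetical function names), not Kudu's storage format.

```python
from typing import List, Tuple

# Base value of one cell plus timestamp-tagged updates (not applied in place).
base_pay = "$1000"
updates: List[Tuple[int, str]] = []   # (timestamp, new_value), kept sorted


def update(timestamp: int, value: str) -> None:
    """Record an update tagged with its timestamp."""
    updates.append((timestamp, value))
    updates.sort(key=lambda u: u[0])


def read_as_of(timestamp: int) -> str:
    """Return the value visible at `timestamp`: the base value with every
    update tagged at or before that time applied in order."""
    value = base_pay
    for ts, new_value in updates:
        if ts > timestamp:
            break
        value = new_value
    return value


update(10, "$2000")
update(20, "$1M")
```

Reading as of timestamp 15 returns the value written at 10, while a later read sees the value written at 20. The loop over `updates` also hints at why read cost grows with the number of recent updates.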
  
LSM vs Kudu
•  LSM – Log Structured Merge (Cassandra, HBase, etc)
    • Inserts and updates all go to an in-memory map (MemStore) and later flush to on-disk files (HFile/SSTable)
    • Reads perform an on-the-fly merge of all on-disk HFiles
•  Kudu
    • Shares some traits (memstores, compactions)
    • More complex.
    • Slower writes in exchange for faster reads (especially scans)
  
LSM Insert Path
INSERT -> MemStore:
    Row=r1 col=c1 val=“blah”
    Row=r1 col=c2 val=“1”
flush -> HFile 1:
    Row=r1 col=c1 val=“blah”
    Row=r1 col=c2 val=“1”
  
LSM Insert Path
INSERT -> MemStore:
    Row=r1 col=c1 val=“blah”
    Row=r1 col=c2 val=“2”
flush -> HFile 2:
    Row=r2 col=c1 val=“blah2”
    Row=r2 col=c2 val=“2”
HFile 1:
    Row=r1 col=c1 val=“blah”
    Row=r1 col=c2 val=“1”
  
LSM Update path
UPDATE -> MemStore:
    Row=r2 col=c1 val=“newval”
HFile 1:
    Row=r1 col=c1 val=“blah”
    Row=r1 col=c2 val=“2”
HFile 2:
    Row=r2 col=c1 val=“v2”
    Row=r2 col=c2 val=“5”
Note: all updates are “fully decoupled” from reads. Random-write workload is transformed to fully sequential!
  
LSM Read path
MemStore:
    Row=r2 col=c1 val=“newval”
HFile 1:
    Row=r1 col=c1 val=“blah”
    Row=r1 col=c2 val=“2”
HFile 2:
    Row=r2 col=c1 val=“v2”
    Row=r2 col=c2 val=“5”
Merge based on string row keys:
    R1: c1=blah c2=2
    R2: c1=newval c2=5
    ….
CPU intensive! Must always read rowkeys
Any given row may exist across multiple HFiles: must always merge!
The more HFiles to merge, the slower it reads
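The LSM read path described on this slide can be sketched with the slide's own example data. The version below is a simplified illustration: it merges into a dictionary rather than performing a streaming k-way merge, and `lsm_read` is a hypothetical name.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, str, str]  # (row key, column, value)

# Oldest to newest, mirroring the slide: HFile 1, HFile 2, then the MemStore.
hfile_1: List[Cell] = [("r1", "c1", "blah"), ("r1", "c2", "2")]
hfile_2: List[Cell] = [("r2", "c1", "v2"), ("r2", "c2", "5")]
memstore: List[Cell] = [("r2", "c1", "newval")]


def lsm_read(stores_oldest_first: List[List[Cell]]) -> Dict[str, Dict[str, str]]:
    """Merge every store by string row key; newer stores override older ones."""
    merged: Dict[str, Dict[str, str]] = {}
    for store in stores_oldest_first:          # every store must be consulted
        for row_key, column, value in store:   # every row key must be examined
            merged.setdefault(row_key, {})[column] = value
    return dict(sorted(merged.items()))


result = lsm_read([hfile_1, hfile_2, memstore])
```

The result matches the merged rows shown on the slide (r1: c1=blah, c2=2; r2: c1=newval, c2=5), and the nested loops make the cost visible: work grows with the number of stores to merge.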
  
Kudu storage – Inserts and Flushes
INSERT (“todd”, “$1000”, “engineer”) -> MemRowSet
flush -> DiskRowSet 1 (columns: name, pay, role)
  
Kudu storage – Inserts and Flushes
INSERT (“doug”, “$1B”, “Hadoop man”) -> MemRowSet
flush -> DiskRowSet 2 (columns: name, pay, role)
DiskRowSet 1 (columns: name, pay, role)
  
Kudu storage - Updates
[Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2, each holding base data (columns: name, pay, role) and a Delta MS]
Each DiskRowSet has its own DeltaMemStore to accumulate updates
  
Kudu storage - Updates
[Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2, each holding base data (columns: name, pay, role) and a Delta MS]
UPDATE set pay=“$1M” WHERE name=“todd”
Is the row in DiskRowSet 2? (check bloom filters) Bloom says: no!
Is the row in DiskRowSet 1? (check bloom filters) Bloom says: maybe!
Search key column to find offset: rowid = 150
Delta MS entry: 150: col 1=$1M
  
Kudu storage – Read path
[Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2, each holding base data (columns: name, pay, role) and a Delta MS; one Delta MS holds the entry 150: pay=$1M]
Read rows in DiskRowSet 2
Then, read rows in DiskRowSet 1
Any row is only in exactly one DiskRowSet – no need to merge cross-DRS!
Updates are merged based on ordinal offset within DRS: array indexing, no string compares
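The update and read paths from the last few slides can be combined into one toy model. This is a rough sketch, not Kudu's data structures: a Python `set` stands in for the bloom filter (so it never yields false positives), only one non-key column is modeled, and the `DiskRowSet` class here is a hypothetical simplification.

```python
from typing import Dict, List, Tuple


class DiskRowSet:
    """Toy rowset: sorted key column, columnar base data, deltas by row offset."""

    def __init__(self, names: List[str], pay: List[str]) -> None:
        self.names = names                      # key column, sorted
        self.pay = pay                          # base data for one column
        self.maybe_keys = set(names)            # stand-in for a bloom filter
        self.delta_ms: Dict[int, str] = {}      # rowid -> updated pay

    def might_contain(self, name: str) -> bool:
        return name in self.maybe_keys

    def update_pay(self, name: str, new_pay: str) -> bool:
        if not self.might_contain(name):        # "Bloom says: no!"
            return False
        if name not in self.names:              # a real bloom filter may be wrong
            return False
        rowid = self.names.index(name)          # search key column for the offset
        self.delta_ms[rowid] = new_pay          # record the delta, base untouched
        return True

    def scan(self) -> List[Tuple[str, str]]:
        # Deltas are applied by integer offset: no key comparisons needed.
        return [(n, self.delta_ms.get(i, p))
                for i, (n, p) in enumerate(zip(self.names, self.pay))]


drs1 = DiskRowSet(["todd"], ["$1000"])
drs2 = DiskRowSet(["doug"], ["$1B"])
updated = [drs.update_pay("todd", "$1M") for drs in (drs2, drs1)]
rows = drs2.scan() + drs1.scan()   # each row lives in exactly one DiskRowSet
```

The scan simply concatenates the rowsets, since no row appears in more than one of them; this is the contrast with the LSM read path that the slide draws.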
  
Kudu storage – Delta flushes
[Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2, each holding base data (columns: name, pay, role) and a Delta MS; the Delta MS entry 0: pay=foo is flushed to a REDO DeltaFile]
A REDO delta indicates how to transform between the ‘base data’ (columnar) and a later version
  
Kudu storage – Major delta compaction
DiskRowSet (pre-compaction): base data (columns: name, pay, role), Delta MS, three REDO DeltaFiles
    Many deltas accumulate: lots of delta application work on reads
DiskRowSet (post-compaction): base data (columns: name, pay, role), Delta MS, unmerged REDO deltas, UNDO deltas
    Merge updates for columns with high update percentage
    If a column has few updates, doesn’t need to be re-written: those deltas maintained in new DeltaFile
  
Kudu storage – RowSet Compactions
DRS 1 (32MB): [PK=alice], [PK=joe], [PK=linda], [PK=zach]
DRS 2 (32MB): [PK=bob], [PK=jon], [PK=mary], [PK=zeke]
DRS 3 (32MB): [PK=carl], [PK=julie], [PK=omar], [PK=zoe]
DRS 4 (32MB): [alice, bob, carl, joe]
DRS 5 (32MB): [jon, julie, linda, mary]
DRS 6 (32MB): [omar, zach, zeke, zoe]
Reorganize rows to avoid rowsets with overlapping key ranges
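The reorganization shown on this slide amounts to merging sorted rowsets and re-splitting them into disjoint key ranges. A minimal sketch follows, assuming output rowsets are cut by row count (the slide sizes them at 32MB instead); `compact` and `overlaps` are hypothetical helper names.

```python
import heapq
from typing import List

drs_1 = ["alice", "joe", "linda", "zach"]
drs_2 = ["bob", "jon", "mary", "zeke"]
drs_3 = ["carl", "julie", "omar", "zoe"]


def compact(rowsets: List[List[str]], rows_per_output: int) -> List[List[str]]:
    """Merge sorted rowsets and re-split them into non-overlapping key ranges."""
    merged = list(heapq.merge(*rowsets))
    return [merged[i:i + rows_per_output]
            for i in range(0, len(merged), rows_per_output)]


def overlaps(a: List[str], b: List[str]) -> bool:
    """Two sorted rowsets overlap if their [min, max] key ranges intersect."""
    return a[0] <= b[-1] and b[0] <= a[-1]


compacted = compact([drs_1, drs_2, drs_3], rows_per_output=4)
```

Before compaction every pair of input rowsets overlaps, so a key lookup may have to probe all three; afterwards each key falls inside exactly one rowset's range.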
  
Kudu storage – Compaction policy
•  Solves an optimization problem (knapsack problem)
•  Minimize “height” of rowsets for the average key lookup
    • Bound on number of seeks for write or random-read
•  Restrict total IO of any compaction to a budget (128MB)
    • No long compactions, ever
    • No “minor” vs “major” distinction
    • Always be compacting or flushing
    • Low IO priority maintenance threads
  
Kudu trade-offs
•  Random updates will be slower
    • HBase model allows random updates without incurring a disk seek
    • Kudu requires a key lookup before update, bloom lookup before insert
•  Single-row reads may be slower
    • Columnar design is optimized for scans
    • Future: may introduce “column groups” for applications where single-row access is more important
    • Especially slow at reading a row that has had many recent updates (e.g. YCSB “zipfian”)
  
Benchmarks
  
TPC-H (Analytics benchmark)
•  75TS + 1 master cluster
    • 12 (spinning) disk each, enough RAM to fit dataset
    • Using Kudu 0.5.0, Impala 2.2 with Kudu support, CDH 5.4
    • TPC-H Scale Factor 100 (100GB)
•  Example query:
    SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
    FROM customer, orders, lineitem, supplier, nation, region
    WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey
      AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey
      AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey
      AND r_name = 'ASIA'
      AND o_orderdate >= date '1994-01-01' AND o_orderdate < '1995-01-01'
    GROUP BY n_name ORDER BY revenue desc;
  
-  Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
-  Parquet likely to outperform Kudu for HDD-resident (larger IO requests)
  
What about Apache Phoenix?
•  10 node cluster (9 worker, 1 master)
•  HBase 1.0, Phoenix 4.3
•  TPC-H LINEITEM table only (6B rows)
[Chart: Time (sec), log scale from 0.01 to 10000]
            Load    TPCH Q1   COUNT(*)   COUNT(*) WHERE…   single-row lookup
Phoenix     2152    219       76         131               0.04
Kudu        1918    13.2      1.7        0.7               0.15
Parquet     155     9.3       1.4        1.5               1.37
  
What about NoSQL-style random access? (YCSB)
•  YCSB 0.5.0-snapshot
•  10 node cluster (9 worker, 1 master)
•  HBase 1.0
•  100M rows, 10M ops
  
But don’t trust me (a vendor)…
  
About Xiaomi
Mobile Internet Company Founded in 2010
[Diagram labels: Smartphones; Software; E-commerce; MIUI; Cloud Services; App Store/Game; Payment/Finance; …; Smart Home; Smart Devices]

Big Data Analytics Pipeline
Before Kudu
•  Long pipeline
    high latency (1 hour ~ 1 day), data conversion pains
•  No ordering
    Log arrival (storage) order not exactly logical order
    e.g. read 2-3 days of log for data in 1 day

Big Data Analysis Pipeline
Simplified With Kudu
•  ETL Pipeline (0~10s latency)
    Apps that need to prevent backpressure or require ETL
•  Direct Pipeline (no latency)
    Apps that don’t require ETL and no backpressure issues
[Diagram labels: OLAP scan; Side table lookup; Result store]
  
Use Case 1
Mobile service monitoring and tracing tool
Requirements
•  High write throughput
    >5 Billion records/day and growing
•  Query latest data and quick response
    Identify and resolve issues quickly
•  Can search for individual records
    Easy for troubleshooting
Gather important RPC tracing events from mobile app and backend service.
Service monitoring & troubleshooting tool.
  
Use Case 1: Benchmark
Environment
•  71 Node cluster
•  Hardware
    CPU: E5-2620 2.1GHz * 24 core, Memory: 64GB
    Network: 1Gb, Disk: 12 HDD
•  Software
    Hadoop2.6/Impala 2.1/Kudu
Data
•  1 day of server side tracing data
    ~2.6 Billion rows
    ~270 bytes/row
    17 columns, 5 key columns
  
Use Case 1: Benchmark Results
Bulk load using impala (INSERT INTO):
            Total Time(s)   Throughput(Total)   Throughput(per node)
Kudu        961.1           2.8M record/s       39.5k record/s
Parquet     114.6           23.5M record/s      331k records/s
Query latency:
            Q1     Q2     Q3     Q4     Q5     Q6
kudu        1.4    2.0    2.3    3.1    1.3    0.9
parquet     1.3    2.8    4.0    5.7    7.5    16.7
* HDFS parquet file replication = 3, kudu table replication = 3
* Each query run 5 times then take average
  
Use Case 1: Result Analysis
•  Lazy materialization
    Ideal for search style query
    Q6 returns only a few records (of a single user) with all columns
•  Scan range pruning using primary index
    Predicates on primary key
    Q5 only scans 1 hour of data
•  Future work
    Primary index: speed-up order by and distinct
    Hash Partitioning: speed-up count(distinct), no need for global shuffle/merge
  
Use Case 2
OLAP PaaS for ecosystem cloud
•  Provide big data service for smart hardware startups (Xiaomi’s ecosystem members)
•  OLAP database with some OLTP features
•  Manage/Ingest/query your data and serving results in one place
Backend/Mobile App/Smart Device/IoT …
  
What Kudu is not
  
Kudu is…
•  NOT a SQL database
    • “BYO SQL”
•  NOT a filesystem
    • data must have tabular structure
•  NOT a replacement for HBase or HDFS
    • Cloudera continues to invest in those systems
    • Many use cases where they’re still more appropriate
•  NOT an in-memory database
    • Very fast for memory-sized workloads, but can operate on larger data too!
  
Getting started
  
Getting started as a user
•  http://getkudu.io
•  kudu-user@googlegroups.com
•  Quickstart VM
    • Easiest way to get started
    • Impala and Kudu in an easy-to-install VM
•  CSD and Parcels
    • For installation on a Cloudera Manager-managed cluster
  
Getting started as a developer
•  http://github.com/cloudera/kudu
    • All commits go here first
•  Public gerrit: http://gerrit.cloudera.org
    • All code reviews happening here
•  Public JIRA: http://issues.cloudera.org
    • Includes bugs going back to 2013. Come see our dirty laundry!
•  kudu-dev@googlegroups.com
•  Apache 2.0 license open source
•  Contributions are welcome and encouraged!
  
Demo?
(if we have time and internet gods willing)
  
http://getkudu.io/
@getkudu
  

Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...
Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...
 
YARN Federation
YARN Federation YARN Federation
YARN Federation
 
Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in China
 
Stream Processing Everywhere - What to use?
Stream Processing Everywhere - What to use?Stream Processing Everywhere - What to use?
Stream Processing Everywhere - What to use?
 
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep Learning
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep LearningApache Spark 2.4 Bridges the Gap Between Big Data and Deep Learning
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep Learning
 
Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale
 
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 

Kudu: A Columnar Store for Fast Analytics on Fast Data

  • 1. Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop. David Alves on behalf of the Kudu team. © Cloudera, Inc. All rights reserved.
  • 2. Kudu: Storage for Fast Analytics on Fast Data
      • New updating column store for Hadoop
      • Apache-licensed open source
      • Beta now available
      (Diagram labels: Columnar Store, Kudu)
  • 3. Motivation and Goals: Why build Kudu?
  • 4. Motivating Questions
      • Are there user problems that we can't address because of gaps in Hadoop ecosystem storage technologies?
      • Are we positioned to take advantage of advancements in the hardware landscape?
  • 5. Current Storage Landscape in Hadoop
      • HDFS excels at:
          • Efficiently scanning large amounts of data
          • Accumulating data with high throughput
      • HBase excels at:
          • Efficiently finding and writing individual rows
          • Making data mutable
      • Gaps exist when these properties are needed simultaneously
  • 6. Changing Hardware Landscape
      • Spinning disk -> solid state storage
          • NAND flash: up to 450k read and 250k write IOPS, about 2 GB/sec read and 1.5 GB/sec write throughput, at a price of less than $3/GB and dropping
          • 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
      • RAM is cheaper and more abundant:
          • 64 -> 128 -> 256 GB over the last few years
      • Takeaway 1: The next bottleneck is CPU, and current storage systems weren't designed with CPU efficiency in mind.
      • Takeaway 2: Column stores are feasible for random access
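A rough back-of-the-envelope sketch of Takeaway 1, using the NAND flash figures quoted on the slide. The 3 GHz single-core clock rate is an assumption made here for illustration; it does not come from the talk.

```python
# Figures quoted on the slide for NAND flash.
READ_IOPS = 450_000          # random reads per second
READ_BYTES_PER_SEC = 2e9     # about 2 GB/sec read throughput

# Assumption (not from the slides): a single core clocked at 3 GHz.
CPU_HZ = 3e9


def cycles_per_read(iops=READ_IOPS, cpu_hz=CPU_HZ):
    """CPU cycles one core can spend per random read before it falls behind the device."""
    return cpu_hz / iops


def cycles_per_byte(bytes_per_sec=READ_BYTES_PER_SEC, cpu_hz=CPU_HZ):
    """CPU cycles one core can spend per byte while scanning at full device throughput."""
    return cpu_hz / bytes_per_sec


if __name__ == "__main__":
    print(f"{cycles_per_read():.0f} cycles per random read")
    print(f"{cycles_per_byte():.1f} cycles per byte scanned")
```

Under that assumed clock rate, a single core has only a few thousand cycles per random read and between one and two cycles per scanned byte, which is one way to read the claim that CPU efficiency becomes the limiting factor.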
  • 7. Kudu Design Goals
      • High throughput for big scans (columnar storage and replication). Goal: within 2x of Parquet
      • Low latency for short accesses (primary key indexes and quorum replication). Goal: 1 ms read/write on SSD
      • Database-like semantics (initially single-row ACID)
      • Relational data model
          • SQL query
          • "NoSQL" style scan/insert/update (Java client)
  • 8. Kudu Design Goals
      how effectively primary key filters can be pushed down to Kudu.
      What do I use Kudu for?
      We talked about how Kudu is made for SQL, allows fast scans, and allows fast mutability at scale. With that in context, let's look at the variety of use cases done in Hadoop today and see where Kudu fits in.
      If we look at Kudu in the above figure, we will see that many of the traditional SQL use cases
  • 9. Kudu Usage
      • Table has a SQL-like schema
          • Finite number of columns (unlike HBase/Cassandra)
          • Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP
          • Some subset of columns makes up a possibly-composite primary key
          • Fast ALTER TABLE
      • Java and C++ "NoSQL" style APIs
          • Insert(), Update(), Delete(), Scan()
      • Integrations with MapReduce, Spark, and Impala
          • More to come!
  • 10. Use cases and architectures
  • 11. Kudu Use Cases
      Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes.
      • Time Series
          • Examples: Stream market data; fraud detection & prevention; risk monitoring
          • Workload: Inserts, updates, scans, lookups
      • Machine Data Analytics
          • Examples: Network threat detection
          • Workload: Inserts, scans, lookups
      • Online Reporting
          • Examples: ODS
          • Workload: Inserts, updates, scans, lookups
  • 12. Real-Time Analytics in Hadoop Today
      Fraud Detection in the Real World = Storage Complexity
      Considerations:
      • How do I handle failure during this process?
      • How often do I reorganize data streaming in into a format appropriate for reporting?
      • When reporting, how do I see data that has not yet been reorganized?
      • How do I ensure that important jobs aren't interrupted by maintenance?
      Diagram labels: Incoming Data (Messaging System); HBase; New Partition; Most Recent Partition; Historic Data; Parquet File; "Have we accumulated enough data?"; Reorganize HBase file into Parquet (wait for running operations to complete, then define a new Impala partition referencing the newly written Parquet file); Reporting Request; Impala on HDFS
  • 13. Real-Time Analytics in Hadoop with Kudu
      Improvements:
      • One system to operate
      • No cron jobs or background processes
      • Handle late arrivals or data corrections with ease
      • New data available immediately for analytics or operations
      Diagram labels: Incoming Data (Messaging System); Storage in Kudu (Historical and Real-time Data); Reporting Request
  • 14. How it works: Replication and distribution
  • 15. Tables and Tablets
      • Table is horizontally partitioned into tablets
          • Range or hash partitioning
          • PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS
      • Each tablet has N replicas (3 or 5), with Raft consensus
          • Allow read from any replica, plus leader-driven writes with low MTTR
      • Tablet servers host tablets
          • Store data on local disks (no HDFS)
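As a rough illustration of the hash partitioning on this slide, the sketch below assigns each row to one of 100 buckets by hashing its `timestamp` column. The `bucket_for` helper, the sample rows, and the use of CRC32 are assumptions made for illustration only; the slides do not show Kudu's actual hash function or key encoding.

```python
import zlib

NUM_BUCKETS = 100  # matches "INTO 100 BUCKETS" on the slide


def bucket_for(hash_columns, num_buckets=NUM_BUCKETS):
    """Return the bucket index for the given hashed column values.

    CRC32 is only a stand-in hash; it is not Kudu's real hash function.
    """
    encoded = "\x00".join(str(v) for v in hash_columns).encode("utf-8")
    return zlib.crc32(encoded) % num_buckets


# Hypothetical rows using the slide's primary key (host, metric, timestamp).
rows = [("host1", "cpu_user", 1_000 + i) for i in range(1_000)]

# Count how many rows land in each bucket.
rows_per_bucket = {}
for host, metric, timestamp in rows:
    bucket = bucket_for((timestamp,))  # DISTRIBUTE BY HASH(timestamp)
    rows_per_bucket[bucket] = rows_per_bucket.get(bucket, 0) + 1
```

Because only `timestamp` is hashed, rows written at nearly the same time tend to be spread over many buckets instead of all landing in a single tablet.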
  • 16. Metadata
      • Replicated master*
          • Acts as a tablet directory ("META" table)
          • Acts as a catalog (table schemas, etc.)
          • Acts as a load balancer (tracks TS liveness, re-replicates under-replicated tablets)
      • Caches all metadata in RAM for high performance
          • 80-node load test, GetTableLocations RPC perf:
              • 99th percentile: 68us, 99.99th percentile: 657us
              • <2% peak CPU usage
      • Client configured with master addresses
          • Asks master for tablet locations as needed and caches them
  • 18. Raft consensus
      Diagram: a Client and three tablet servers, each with its own WAL: TS A hosts Tablet 1 (LEADER); TS B and TS C host Tablet 1 (FOLLOWER).
      1a. Client->Leader: Write() RPC
      2a. Leader->Followers: UpdateConsensus() RPC
      2b. Leader writes local WAL
      3. Follower: write WAL (on each follower)
      4. Follower->Leader: success
      5. Leader has achieved majority
      6. Leader->Client: Success!
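The majority check in steps 2b to 6 can be sketched as follows. This is a minimal illustration, not Kudu's implementation: `majority` and `write_succeeds` are hypothetical helpers, and the sketch assumes the leader's own WAL write counts as one acknowledgement.

```python
def majority(n_replicas):
    """Smallest number of replicas that forms a majority."""
    return n_replicas // 2 + 1


def write_succeeds(follower_acks, n_replicas):
    """Decide whether the leader may report success to the client.

    follower_acks: one boolean per follower, True if that follower wrote
    its WAL and replied with success (steps 3 and 4).
    Assumes the leader has already written its own WAL (step 2b).
    """
    acks = 1 + sum(1 for ok in follower_acks if ok)  # leader + followers
    return acks >= majority(n_replicas)              # step 5


# Three replicas: one follower ack plus the leader is already a majority.
one_follower_down = write_succeeds([True, False], n_replicas=3)
both_followers_down = write_succeeds([False, False], n_replicas=3)
```

With three replicas the write is acknowledged as soon as a single follower responds, so one slow or failed follower does not block the client.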
  • 19. Fault tolerance
      • Transient FOLLOWER failure:
          • Leader can still achieve majority
          • Restart follower TS within 5 min and it will rejoin transparently
      • Transient LEADER failure:
          • Followers expect to hear a heartbeat from their leader every 1.5 seconds
          • 3 missed heartbeats: leader election!
              • New LEADER is elected from remaining nodes within a few seconds
          • Restart within 5 min and it rejoins as a FOLLOWER
      • N replicas handle (N-1)/2 failures
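The numbers on this slide work out as in the short sketch below. The constant and function names are made up for illustration; only the 1.5 second heartbeat, the 3 missed heartbeats, and the (N-1)/2 rule come from the slide.

```python
HEARTBEAT_INTERVAL_S = 1.5           # from the slide
MISSED_HEARTBEATS_FOR_ELECTION = 3   # from the slide


def tolerated_failures(n_replicas):
    """N replicas handle (N-1)/2 failures (integer division)."""
    return (n_replicas - 1) // 2


def election_trigger_delay_s():
    """Time without heartbeats before followers start an election."""
    return HEARTBEAT_INTERVAL_S * MISSED_HEARTBEATS_FOR_ELECTION


failures_with_3 = tolerated_failures(3)   # one replica may fail
failures_with_5 = tolerated_failures(5)   # two replicas may fail
delay = election_trigger_delay_s()        # 4.5 seconds
```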
  • 20. Fault tolerance (2)
      • Permanent failure:
          • Leader notices that a follower has been dead for 5 minutes
          • Evicts that follower
          • Master selects a new replica
          • Leader copies the data over to the new one, which joins as a new FOLLOWER
  • 21. How it works: Storage engine internals
  • 22. Tablet design
      • Inserts buffered in an in-memory store (like HBase's memstore)
      • Flushed to disk
          • Columnar layout, similar to Apache Parquet
      • Updates use MVCC (updates tagged with timestamp, not in-place)
          • Allow "SELECT AS OF <timestamp>" queries and consistent cross-tablet scans
      • Near-optimal read path for "current time" scans
          • No per-row branches, fast vectorized decoding and predicate evaluation
      • Performance worsens based on number of recent updates
  • 23. LSM vs Kudu
      • LSM: Log Structured Merge (Cassandra, HBase, etc.)
          • Inserts and updates all go to an in-memory map (MemStore) and later flush to on-disk files (HFile/SSTable)
          • Reads perform an on-the-fly merge of all on-disk HFiles
      • Kudu
          • Shares some traits (memstores, compactions)
          • More complex
          • Slower writes in exchange for faster reads (especially scans)
  • 24. LSM Insert Path
      Diagram: INSERT -> MemStore (Row=r1 col=c1 val="blah"; Row=r1 col=c2 val="1") -> flush -> HFile 1 (Row=r1 col=c1 val="blah"; Row=r1 col=c2 val="1")
  • 25. LSM Insert Path
      Diagram labels, in extracted order: MemStore; INSERT; Row=r1 col=c1 val="blah"; Row=r1 col=c2 val="2"; HFile 2 (Row=r2 col=c1 val="blah2"; Row=r2 col=c2 val="2"); flush; HFile 1 (Row=r1 col=c1 val="blah"; Row=r1 col=c2 val="1")
  • 26. LSM Update path
      Diagram labels, in extracted order: MemStore; UPDATE; HFile 1 (Row=r1 col=c1 val="blah"; Row=r1 col=c2 val="2"); HFile 2 (Row=r2 col=c1 val="v2"; Row=r2 col=c2 val="5"); Row=r2 col=c1 val="newval"
      Note: all updates are "fully decoupled" from reads. Random-write workload is transformed to fully sequential!
  • 27. LSM Read path
      Diagram labels, in extracted order: MemStore; HFile 1 (Row=r1 col=c1 val="blah"; Row=r1 col=c2 val="2"); HFile 2 (Row=r2 col=c1 val="v2"; Row=r2 col=c2 val="5"); Row=r2 col=c1 val="newval"
      Merge based on string row keys:
          R1: c1=blah c2=2
          R2: c1=newval c2=5
          ...
      • CPU intensive! Must always read rowkeys
      • Any given row may exist across multiple HFiles: must always merge!
      • The more HFiles to merge, the slower it reads
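The merge on this slide can be sketched with the slide's own example values. This is a simplified illustration: `lsm_read` is a hypothetical helper, and it assumes the stores are passed oldest first with the MemStore last and holding the latest update to r2. A real LSM store streams a sorted merge over row keys instead of building dictionaries.

```python
def lsm_read(stores_oldest_first):
    """Merge cells from every store; newer stores overwrite older ones."""
    merged = {}
    for store in stores_oldest_first:
        for (row_key, column), value in store.items():
            # Every cell lookup is keyed by the string row key.
            merged.setdefault(row_key, {})[column] = value
    # Return rows ordered by row key, as a scan would.
    return dict(sorted(merged.items()))


# Cell values taken from the slide.
hfile_1 = {("r1", "c1"): "blah", ("r1", "c2"): "2"}
hfile_2 = {("r2", "c1"): "v2", ("r2", "c2"): "5"}
memstore = {("r2", "c1"): "newval"}  # assumed location of the update

result = lsm_read([hfile_1, hfile_2, memstore])
```

Every read has to consult all stores because any of them may hold a cell for the requested row, which is why read cost grows with the number of HFiles.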
  • 28. Kudu storage - Inserts and Flushes
      Diagram: INSERT ("todd", "$1000", "engineer") -> MemRowSet -> flush -> DiskRowSet 1 (columns: name, pay, role)
  • 29. Kudu storage - Inserts and Flushes
      Diagram: INSERT ("doug", "$1B", "Hadoop man") -> MemRowSet -> flush -> DiskRowSet 2 (columns: name, pay, role); DiskRowSet 1 (columns: name, pay, role) already on disk
  • 30. Kudu storage - Updates
      Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2, each with base data (columns: name, pay, role) and a Delta MS
      Each DiskRowSet has its own DeltaMemStore to accumulate updates.
  • 31. Kudu storage - Updates
      Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2, each with base data (columns: name, pay, role) and a Delta MS
      UPDATE set pay="$1M" WHERE name="todd"
      • Is the row in DiskRowSet 2? (check bloom filters) Bloom says: no!
      • Is the row in DiskRowSet 1? (check bloom filters) Bloom says: maybe!
      • Search key column to find offset: rowid = 150
      • Delta MS records: 150: col 1=$1M
  • 32. Kudu storage - Read path
      Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2, each with base data (columns: name, pay, role) and a Delta MS; one Delta MS holds 150: pay=$1M
      • Read rows in DiskRowSet 2
      • Then, read rows in DiskRowSet 1
      • Any row is only in exactly one DiskRowSet: no need to merge cross-DRS!
      • Updates are merged based on ordinal offset within DRS: array indexing, no string compares
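The update and read paths from the preceding Kudu storage slides can be sketched as below. This is a toy model under stated assumptions: the `DiskRowSet` class and its fields are hypothetical, a Python `set` stands in for the bloom filter (a real bloom filter can also return false positives), and each rowset here holds a single row, so the offset is 0 rather than the slide's rowid 150.

```python
import bisect


class DiskRowSet:
    """Toy columnar rowset with a per-rowset delta store."""

    def __init__(self, rows):
        rows = sorted(rows)                     # order by primary key (name)
        self.names = [r[0] for r in rows]       # key column
        self.pay = [r[1] for r in rows]         # base data, one list per column
        self.role = [r[2] for r in rows]
        self.maybe_keys = set(self.names)       # stand-in for a bloom filter
        self.delta_ms = {}                      # row offset -> {column: value}

    def update(self, key, column, value):
        """Record an update as a delta; return False if the key is absent."""
        if key not in self.maybe_keys:          # "Bloom says: no!"
            return False
        offset = bisect.bisect_left(self.names, key)  # search key column
        if offset == len(self.names) or self.names[offset] != key:
            return False                        # would be a bloom false positive
        self.delta_ms.setdefault(offset, {})[column] = value
        return True

    def scan(self):
        """Yield rows, applying deltas by ordinal offset (no key compares)."""
        for offset, name in enumerate(self.names):
            row = {"name": name, "pay": self.pay[offset], "role": self.role[offset]}
            row.update(self.delta_ms.get(offset, {}))
            yield row


drs_1 = DiskRowSet([("todd", "$1000", "engineer")])
drs_2 = DiskRowSet([("doug", "$1B", "Hadoop man")])

# UPDATE set pay="$1M" WHERE name="todd": probe each rowset in turn.
for drs in (drs_2, drs_1):
    if drs.update("todd", "pay", "$1M"):
        break

# Read rows in DiskRowSet 2, then in DiskRowSet 1; no cross-rowset merge.
rows = list(drs_2.scan()) + list(drs_1.scan())
```

The scan never compares keys across rowsets: each rowset is read independently, and deltas are applied by list index.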
  • 33. Kudu storage - Delta flushes
      Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2, each with base data (columns: name, pay, role) and a Delta MS; a Delta MS entry (0: pay=foo) is flushed to a REDO DeltaFile
      A REDO delta indicates how to transform between the 'base data' (columnar) and a later version.
  • 34. Kudu storage - Major delta compaction
      Diagram: DiskRowSet (pre-compaction) with base data (columns: name, pay, role), a Delta MS, and three REDO DeltaFiles; DiskRowSet (post-compaction) with base data (columns: name, pay, role), a Delta MS, unmerged REDO deltas, and UNDO deltas
      • Many deltas accumulate: lots of delta application work on reads
      • Merge updates for columns with high update percentage
      • If a column has few updates, it doesn't need to be rewritten: those deltas are maintained in a new DeltaFile
  • 35. Kudu storage - RowSet Compactions
      Before:
          DRS 1 (32MB): [PK=alice], [PK=joe], [PK=linda], [PK=zach]
          DRS 2 (32MB): [PK=bob], [PK=jon], [PK=mary], [PK=zeke]
          DRS 3 (32MB): [PK=carl], [PK=julie], [PK=omar], [PK=zoe]
      After:
          DRS 4 (32MB): [alice, bob, carl, joe]
          DRS 5 (32MB): [jon, julie, linda, mary]
          DRS 6 (32MB): [omar, zach, zeke, zoe]
      Reorganize rows to avoid rowsets with overlapping key ranges.
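The reorganization on this slide amounts to a sorted merge followed by re-chunking, as the sketch below shows with the slide's own keys. `compact_rowsets` is a hypothetical helper, and splitting by row count is an assumption made for simplicity; the slide sizes rowsets in megabytes.

```python
import heapq


def compact_rowsets(rowsets, rows_per_output):
    """Merge sorted rowsets and split the result into non-overlapping ones."""
    merged = list(heapq.merge(*rowsets))  # each input is already sorted by key
    return [
        merged[i:i + rows_per_output]
        for i in range(0, len(merged), rows_per_output)
    ]


# Primary keys from the slide; every input spans nearly the whole key range.
drs_1 = ["alice", "joe", "linda", "zach"]
drs_2 = ["bob", "jon", "mary", "zeke"]
drs_3 = ["carl", "julie", "omar", "zoe"]

compacted = compact_rowsets([drs_1, drs_2, drs_3], rows_per_output=4)
```

After compaction, a lookup for any key needs to consult only one rowset, because the output key ranges no longer overlap.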
  • 36. Kudu storage - Compaction policy
      • Solves an optimization problem (knapsack problem)
      • Minimize "height" of rowsets for the average key lookup
          • Bound on number of seeks for write or random-read
      • Restrict total IO of any compaction to a budget (128MB)
          • No long compactions, ever
          • No "minor" vs "major" distinction
          • Always be compacting or flushing
          • Low IO priority maintenance threads
  • 37. Kudu trade-offs
      • Random updates will be slower
          • HBase model allows random updates without incurring a disk seek
          • Kudu requires a key lookup before update, bloom lookup before insert
      • Single-row reads may be slower
          • Columnar design is optimized for scans
          • Future: may introduce "column groups" for applications where single-row access is more important
          • Especially slow at reading a row that has had many recent updates (e.g. YCSB "zipfian")
  • 38. Benchmarks
  • 39. TPC-H (Analytics benchmark)
      • 75 TS + 1 master cluster
          • 12 (spinning) disks each, enough RAM to fit dataset
          • Using Kudu 0.5.0, Impala 2.2 with Kudu support, CDH 5.4
          • TPC-H Scale Factor 100 (100GB)
      • Example query:
          SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
          FROM customer, orders, lineitem, supplier, nation, region
          WHERE c_custkey = o_custkey
            AND l_orderkey = o_orderkey
            AND l_suppkey = s_suppkey
            AND c_nationkey = s_nationkey
            AND s_nationkey = n_nationkey
            AND n_regionkey = r_regionkey
            AND r_name = 'ASIA'
            AND o_orderdate >= date '1994-01-01'
            AND o_orderdate < '1995-01-01'
          GROUP BY n_name
          ORDER BY revenue desc;
  • 40. TPC-H results
      • Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
      • Parquet likely to outperform Kudu for HDD-resident data (larger IO requests)
  • 41. What about Apache Phoenix?
      • 10 node cluster (9 worker, 1 master)
      • HBase 1.0, Phoenix 4.3
      • TPC-H LINEITEM table only (6B rows)
      Time (sec), plotted on a log scale:
                               Phoenix   Kudu    Parquet
          Load                 2152      1918    155
          TPCH Q1              219       13.2    9.3
          COUNT(*)             76        1.7     1.4
          COUNT(*) WHERE…      131       0.7     1.5
          single-row lookup    0.04      0.15    1.37
  • 42. What about NoSQL-style random access? (YCSB)
      • YCSB 0.5.0-snapshot
      • 10 node cluster (9 worker, 1 master)
      • HBase 1.0
      • 100M rows, 10M ops
  • 43. But don't trust me (a vendor)…
  • 44. About Xiaomi
      Mobile Internet Company Founded in 2010
      • Smartphones
      • Software
      • E-commerce
      • MIUI
      • Cloud Services
      • App Store/Game
      • Payment/Finance
      • …
      • Smart Home
      • Smart Devices
  • 45. Big Data Analytics Pipeline Before Kudu
      • Long pipeline: high latency (1 hour ~ 1 day), data conversion pains
      • No ordering: log arrival (storage) order is not exactly logical order, e.g. read 2-3 days of log for data in 1 day
  • 46. Big Data Analysis Pipeline Simplified With Kudu
      • ETL Pipeline (0~10s latency): apps that need to prevent backpressure or require ETL
      • Direct Pipeline (no latency): apps that don't require ETL and have no backpressure issues
      Diagram labels: OLAP scan; Side table lookup; Result store
  • 47. Use Case 1: Mobile service monitoring and tracing tool
      Gather important RPC tracing events from mobile app and backend service. Service monitoring & troubleshooting tool.
      Requirements:
      • High write throughput: >5 billion records/day and growing
      • Query latest data and quick response: identify and resolve issues quickly
      • Can search for individual records: easy for troubleshooting
  • 48. Use Case 1: Benchmark
      Environment:
      • 71 node cluster
      • Hardware
          • CPU: E5-2620 2.1GHz * 24 core
          • Memory: 64GB
          • Network: 1Gb
          • Disk: 12 HDD
      • Software
          • Hadoop 2.6 / Impala 2.1 / Kudu
      Data:
      • 1 day of server side tracing data
          • ~2.6 billion rows
          • ~270 bytes/row
          • 17 columns, 5 key columns
  • 49. Use Case 1: Benchmark Results
      Bulk load using Impala (INSERT INTO):
                     Total Time (s)   Throughput (total)   Throughput (per node)
          Kudu       961.1            2.8M records/s       39.5k records/s
          Parquet    114.6            23.5M records/s      331k records/s
      Query latency:
                     Q1     Q2     Q3     Q4     Q5     Q6
          Kudu       1.4    2.0    2.3    3.1    1.3    0.9
          Parquet    1.3    2.8    4.0    5.7    7.5    16.7
      * HDFS Parquet file replication = 3, Kudu table replication = 3
      * Each query was run 5 times and the average taken
  • 50. Use Case 1: Result Analysis
      • Lazy materialization
          • Ideal for search style query
          • Q6 returns only a few records (of a single user) with all columns
      • Scan range pruning using primary index
          • Predicates on primary key
          • Q5 only scans 1 hour of data
      • Future work
          • Primary index: speed up order by and distinct
          • Hash partitioning: speed up count(distinct), no need for global shuffle/merge
  • 51. Use Case 2: OLAP PaaS for ecosystem cloud
      • Provide big data service for smart hardware startups (Xiaomi's ecosystem members)
      • OLAP database with some OLTP features
      • Manage, ingest, and query your data and serve results in one place
      Diagram labels: Backend / Mobile App / Smart Device / IoT …
  • 52. What Kudu is not
  • 53. Kudu is…
      • NOT a SQL database
          • "BYO SQL"
      • NOT a filesystem
          • Data must have tabular structure
      • NOT a replacement for HBase or HDFS
          • Cloudera continues to invest in those systems
          • Many use cases where they're still more appropriate
      • NOT an in-memory database
          • Very fast for memory-sized workloads, but can operate on larger data too!
  • 54. Getting started
  • 55. Getting started as a user
      • http://getkudu.io
      • kudu-user@googlegroups.com
      • Quickstart VM
          • Easiest way to get started
          • Impala and Kudu in an easy-to-install VM
      • CSD and Parcels
          • For installation on a Cloudera Manager-managed cluster
  • 56. Getting started as a developer
      • http://github.com/cloudera/kudu
          • All commits go here first
      • Public gerrit: http://gerrit.cloudera.org
          • All code reviews happen here
      • Public JIRA: http://issues.cloudera.org
          • Includes bugs going back to 2013. Come see our dirty laundry!
      • kudu-dev@googlegroups.com
      • Apache 2.0 license open source
      • Contributions are welcome and encouraged!
  • 57. Demo? (if we have time and internet gods willing)
  • 58. http://getkudu.io/ @getkudu