Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Jean-Daniel Cryans on behalf of the Kudu team
© Cloudera, Inc. All rights reserved.

Myself
• Software Engineer at Cloudera
• On the Kudu team for 2 years
• Apache HBase committer and PMC member since 2008
• Previously at StumbleUpon

Kudu
Storage for Fast Analytics on Fast Data
• New updating column store for Hadoop
• Apache-licensed open source
• Beta now available
Diagram labels: Columnar Store, Kudu

Motivation and Goals
Why build Kudu?

Motivating Questions
• Are there user problems that we can’t address because of gaps in Hadoop ecosystem storage technologies?
• Are we positioned to take advantage of advancements in the hardware landscape?

Current Storage Landscape in Hadoop
HDFS excels at:
• Efficiently scanning large amounts of data
• Accumulating data with high throughput
HBase excels at:
• Efficiently finding and writing individual rows
• Making data mutable
Gaps exist when these properties are needed simultaneously

Kudu Design Goals
• High throughput for big scans (columnar storage and replication)
  Goal: Within 2x of Parquet
• Low-latency for short accesses (primary key indexes and quorum replication)
  Goal: 1ms read/write on SSD
• Database-like semantics (initially single-row ACID)
• Relational data model
  • SQL query
  • “NoSQL” style scan/insert/update (Java client)

Changing Hardware landscape
• Spinning disk -> solid state storage
  • NAND flash: Up to 450k read and 250k write IOPS, about 2GB/sec read and 1.5GB/sec write throughput, at a price of less than $3/GB and dropping
  • 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
• RAM is cheaper and more abundant:
  • 64->128->256GB over last few years
• Takeaway 1: The next bottleneck is CPU, and current storage systems weren’t designed with CPU efficiency in mind.
• Takeaway 2: Column stores are feasible for random access

Kudu Usage
• Table has a SQL-like schema
  • Finite number of columns (unlike HBase/Cassandra)
  • Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP
  • Some subset of columns makes up a possibly-composite primary key
  • Fast ALTER TABLE
• Java and C++ “NoSQL” style APIs
  • Insert(), Update(), Delete(), Scan()
• Integrations with MapReduce, Spark, and Impala
  • more to come!

Use cases and architectures

Kudu Use Cases
Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes
● Time Series
  ○ Examples: Stream market data; fraud detection & prevention; risk monitoring
  ○ Workload: Inserts, updates, scans, lookups
● Machine Data Analytics
  ○ Examples: Network threat detection
  ○ Workload: Inserts, scans, lookups
● Online Reporting
  ○ Examples: Operational Data Store (ODS)
  ○ Workload: Inserts, updates, scans, lookups

Real-Time Analytics in Hadoop Today
Fraud Detection in the Real World = Storage Complexity
Considerations:
● How do I handle failure during this process?
● How often do I reorganize incoming streaming data into a format appropriate for reporting?
● When reporting, how do I see data that has not yet been reorganized?
● How do I ensure that important jobs aren’t interrupted by maintenance?
[Diagram: Incoming Data (Messaging System) -> HBase -> “Have we accumulated enough data?” -> Reorganize HBase file into Parquet -> Parquet File. A Reporting Request reads the New Partition, the Most Recent Partition, and Historic Data (Impala on HDFS).]
Reorganization step:
• Wait for running operations to complete
• Define new Impala partition referencing the newly written Parquet file

Real-Time Analytics in Hadoop with Kudu
Improvements:
● One system to operate
● No cron jobs or background processes
● Handle late arrivals or data corrections with ease
● New data available immediately for analytics or operations
[Diagram: Incoming Data (Messaging System) -> Storage in Kudu (Historical and Real-time Data) <- Reporting Request]

How it works

Tables and Tablets
• Table is horizontally partitioned into tablets
  • Range or hash partitioning
  • PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS
• Each tablet has N replicas (3 or 5), with Raft consensus
  • Allow read from any replica, plus leader-driven writes with low MTTR
• Tablet servers host tablets
  • Store data on local disks (no HDFS)

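As a rough illustration of the hash partitioning bullet above, the sketch below maps a partition column value to one of 100 buckets. The use of SHA-256 and the helper name bucket_for are assumptions made for this example; the slides do not say which hash function Kudu uses.

```python
import hashlib

def bucket_for(value: str, num_buckets: int = 100) -> int:
    """Map a partition-column value to a bucket index (hypothetical helper)."""
    # Assumption: any stable hash works for illustration; SHA-256 is used here.
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

# Rows as (host, metric, timestamp); the example hashes the timestamp column.
rows = [
    ("host1", "cpu", "2015-10-01T00:00:00"),
    ("host2", "cpu", "2015-10-01T00:00:00"),
    ("host1", "cpu", "2015-10-01T00:00:01"),
]
for host, metric, ts in rows:
    print(host, metric, ts, "-> bucket", bucket_for(ts))
```

Rows that share a timestamp always land in the same bucket, while different timestamps spread across the buckets.
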
Metadata
• Replicated master*
  • Acts as a tablet directory (“META” table)
  • Acts as a catalog (table schemas, etc)
  • Acts as a load balancer (tracks TS liveness, re-replicates under-replicated tablets)
• Caches all metadata in RAM for high performance
  • 80-node load test, GetTableLocations RPC perf:
    • 99th percentile: 68us, 99.99th percentile: 657us
    • <2% peak CPU usage
• Client configured with master addresses
  • Asks master for tablet locations as needed and caches them

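The last bullet (ask the master as needed, then cache) can be sketched as a small client-side cache. Everything here is hypothetical: LocationCache, fake_master, and the server names are illustrative stand-ins, not the Kudu client API.

```python
class LocationCache:
    """Toy client-side cache of tablet locations (not the real Kudu client)."""

    def __init__(self, lookup_from_master):
        self._lookup = lookup_from_master  # callable standing in for the master RPC
        self._cache = {}
        self.master_calls = 0

    def locate(self, tablet_id):
        # Only contact the master on a cache miss.
        if tablet_id not in self._cache:
            self.master_calls += 1
            self._cache[tablet_id] = self._lookup(tablet_id)
        return self._cache[tablet_id]

    def invalidate(self, tablet_id):
        # Drop a stale entry, e.g. after a replica has moved.
        self._cache.pop(tablet_id, None)


def fake_master(tablet_id):
    # Hypothetical replica list for any tablet.
    return ["ts-a", "ts-b", "ts-c"]


cache = LocationCache(fake_master)
cache.locate("tablet-1")
cache.locate("tablet-1")  # served from the cache, no second master call
print(cache.master_calls)
```
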
Raft consensus
[Diagram: a Client and three tablet servers, each with its own WAL. TS A hosts Tablet 1 (LEADER); TS B and TS C host Tablet 1 (FOLLOWER).]
1a. Client->Leader: Write() RPC
2a. Leader->Followers: UpdateConsensus() RPC
2b. Leader writes local WAL
3. Follower: write WAL (on each follower)
4. Follower->Leader: success
5. Leader has achieved majority
6. Leader->Client: Success!

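The majority rule in steps 4 to 6 can be sketched as follows. This is a simplified model, not Kudu's implementation: it assumes the leader's own WAL write counts as one vote and ignores terms, log indexes, and retries. The function name replicate_write is made up for the example.

```python
def replicate_write(follower_acks):
    """Return True if a write reaches a majority of replicas.

    follower_acks: one bool per follower, True if that follower wrote
    its WAL and replied with success. The leader's local WAL write is
    assumed to succeed and counts as one vote.
    """
    votes = 1 + sum(1 for ok in follower_acks if ok)
    replicas = 1 + len(follower_acks)
    majority = replicas // 2 + 1
    return votes >= majority


print(replicate_write([True, False]))   # 3 replicas, one follower down
print(replicate_write([False, False]))  # 3 replicas, both followers down
```

With three replicas the leader can answer the client as soon as one follower has acknowledged, which is why a single slow or failed follower does not block writes.
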
Fault tolerance
• Transient FOLLOWER failure:
  • Leader can still achieve majority
  • Restart follower TS within 5 min and it will rejoin transparently
• Transient LEADER failure:
  • Followers expect to hear a heartbeat from their leader every 1.5 seconds
  • 3 missed heartbeats: leader election!
    • New LEADER is elected from remaining nodes within a few seconds
  • Restart within 5 min and it rejoins as a FOLLOWER
• N replicas handle (N-1)/2 failures

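A short worked example of the numbers on this slide, using only the figures stated above (1.5 second heartbeats, 3 missed heartbeats, (N-1)/2 failures). The integer division reflects that a fractional failure count is rounded down; the constant and function names are invented for the example.

```python
HEARTBEAT_INTERVAL_S = 1.5   # from the slide
MISSED_BEFORE_ELECTION = 3   # from the slide

def tolerated_failures(n_replicas: int) -> int:
    """Number of replica failures a Raft group of this size can survive."""
    return (n_replicas - 1) // 2

# Time without a heartbeat before followers start an election.
election_trigger_s = HEARTBEAT_INTERVAL_S * MISSED_BEFORE_ELECTION

print(tolerated_failures(3), tolerated_failures(5), election_trigger_s)
```
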
Fault tolerance (2)
• Permanent failure:
  • Leader notices that a follower has been dead for 5 minutes
  • Evicts that follower
  • Master selects a new replica
  • Leader copies the data over to the new one, which joins as a new FOLLOWER

Tablet design
• Inserts buffered in an in-memory store (like HBase’s memstore)
• Flushed to disk
  • Columnar layout, similar to Apache Parquet
• Updates use MVCC (updates tagged with timestamp, not in-place)
  • Allow “SELECT AS OF <timestamp>” queries and consistent cross-tablet scans
• Near-optimal read path for “current time” scans
  • No per row branches, fast vectorized decoding and predicate evaluation
• Performance worsens based on number of recent updates

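The MVCC bullet can be illustrated with a toy single-cell model: each update is stored with its timestamp rather than overwriting the old value, and a read "as of" a timestamp returns the newest version at or before it. MvccCell is a hypothetical class for this sketch and says nothing about Kudu's on-disk format.

```python
class MvccCell:
    """Toy multi-version cell: updates are appended, never applied in place."""

    def __init__(self):
        self._versions = []  # (timestamp, value) pairs, kept sorted by timestamp

    def write(self, timestamp, value):
        self._versions.append((timestamp, value))
        self._versions.sort(key=lambda v: v[0])

    def read_as_of(self, timestamp):
        # Walk versions in timestamp order and keep the last visible one.
        result = None
        for ts, value in self._versions:
            if ts > timestamp:
                break
            result = value
        return result


cell = MvccCell()
cell.write(10, "a")
cell.write(20, "b")
print(cell.read_as_of(5), cell.read_as_of(15), cell.read_as_of(25))
```

Note that the read has to walk the accumulated versions, which is one intuition for why reads slow down as the number of recent updates grows.
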
LSM vs Kudu
• LSM: Log Structured Merge (Cassandra, HBase, etc)
  • Inserts and updates all go to an in-memory map (MemStore) and later flush to on-disk files (HFile/SSTable)
  • Reads perform an on-the-fly merge of all on-disk HFiles
• Kudu
  • Shares some traits (memstores, compactions)
  • More complex.
  • Slower writes in exchange for faster reads (especially scans)

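A minimal sketch of the LSM read path described above, with plain dicts standing in for the MemStore and the on-disk files. The function name lsm_scan and the newest-first ordering of the file list are assumptions made for the example.

```python
def lsm_scan(memstore, on_disk_files):
    """Merge the in-memory map with all on-disk files at read time.

    memstore: dict of key -> value holding the newest writes.
    on_disk_files: list of dicts, ordered newest first.
    """
    merged = {}
    # Apply the oldest file first so that newer files override it.
    for store in reversed(on_disk_files):
        merged.update(store)
    merged.update(memstore)  # the memstore holds the most recent values
    return sorted(merged.items())


memstore = {"b": 3}
files = [{"a": 2, "b": 2}, {"a": 1, "c": 1}]  # newest first
print(lsm_scan(memstore, files))
```

Every scan pays for this merge across all files, which is the read-side cost the slide contrasts with Kudu's design.
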
Kudu trade-offs
• Random updates will be slower
  • HBase model allows random updates without incurring a disk seek
  • Kudu requires a key lookup before update, bloom lookup before insert, may incur seeks
• Single-row reads may be slower
  • Columnar design is optimized for scans
  • Especially slow at reading a row that has had many recent updates (e.g. YCSB “zipfian”)

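The "bloom lookup before insert" bullet can be sketched like this. The Bloom filter answers "definitely not present" cheaply, so the more expensive key lookup is only needed when the filter says the key might exist. The sizes, the SHA-256 based hashing, and the names BloomFilter and insert are assumptions for this sketch; a Python set stands in for the on-disk key index.

```python
import hashlib

class BloomFilter:
    """Small Bloom filter; one byte per bit for clarity, not efficiency."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(key))


def insert(key, bloom, stored_keys):
    """Return True if inserted, False if the primary key already exists."""
    # The set membership test stands in for the on-disk key lookup,
    # which is only reached when the Bloom filter cannot rule the key out.
    if bloom.might_contain(key) and key in stored_keys:
        return False
    bloom.add(key)
    stored_keys.add(key)
    return True


bloom, keys = BloomFilter(), set()
print(insert("k1", bloom, keys), insert("k1", bloom, keys))
```
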
Benchmarks

TPC-H (Analytics benchmark)
• 75 TS + 1 master cluster
  • 12 (spinning) disk each, enough RAM to fit dataset
  • Using Kudu 0.5.0, Impala 2.2 with Kudu support, CDH 5.4
• TPC-H Scale Factor 100 (100GB)
• Example query:
  • SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue FROM customer,
    orders, lineitem, supplier, nation, region WHERE c_custkey = o_custkey AND
    l_orderkey = o_orderkey AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey
    AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'ASIA'
    AND o_orderdate >= date '1994-01-01' AND o_orderdate < '1995-01-01' GROUP BY
    n_name ORDER BY revenue desc;

- Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
- Parquet likely to outperform Kudu for HDD-resident (larger IO requests)

What about Apache Phoenix?
• 10 node cluster (9 worker, 1 master)
• HBase 1.0, Phoenix 4.3
• TPC-H LINEITEM table only (6B rows)
Chart: Time (sec), log scale
                     Phoenix   Kudu    Parquet
Load                 2152      1918    155
TPCH Q1              219       13.2    9.3
COUNT(*)             76        1.7     1.4
COUNT(*) WHERE…      131       0.7     1.5
single-row lookup    0.04      0.15    1.37

What about NoSQL-style random access? (YCSB)
• YCSB 0.5.0-snapshot
• 10 node cluster (9 worker, 1 master)
• HBase 1.0
• 100M rows, 10M ops

But don’t trust me (a vendor)…

About Xiaomi
Mobile Internet Company Founded in 2010
Smartphones
Software
E-commerce
MIUI
Cloud Services
App Store/Game
Payment/Finance
…
Smart Home
Smart Devices

Big Data Analytics Pipeline
Before Kudu
• Long pipeline
  high latency (1 hour ~ 1 day), data conversion pains
• No ordering
  Log arrival (storage) order not exactly logical order
  e.g. read 2-3 days of log for data in 1 day

Big Data Analysis Pipeline
Simplified With Kudu
• ETL Pipeline (0~10s latency)
  Apps that need to prevent backpressure or require ETL
• Direct Pipeline (no latency)
  Apps that don’t require ETL and no backpressure issues
Diagram labels: OLAP scan, Side table lookup, Result store

Use Case 1
Mobile service monitoring and tracing tool
Gather important RPC tracing events from mobile app and backend service.
Service monitoring & troubleshooting tool.
Requirements
• High write throughput
  >5 Billion records/day and growing
• Query latest data and quick response
  Identify and resolve issues quickly
• Can search for individual records
  Easy for troubleshooting

Use Case 1: Benchmark
Environment
• 71 Node cluster
• Hardware
  CPU: E5-2620 2.1GHz * 24 core, Memory: 64GB
  Network: 1Gb, Disk: 12 HDD
• Software
  Hadoop2.6/Impala 2.1/Kudu
Data
• 1 day of server side tracing data
  ~2.6 Billion rows
  ~270 bytes/row
  17 columns, 5 key columns

Use Case 1: Benchmark Results
Bulk load using impala (INSERT INTO):
           Total Time(s)   Throughput(Total)   Throughput(per node)
Kudu       961.1           2.8M record/s       39.5k record/s
Parquet    114.6           23.5M record/s      331k record/s
Query latency:
           Q1     Q2     Q3     Q4     Q5     Q6
kudu       1.4    2.0    2.3    3.1    1.3    0.9
parquet    1.3    2.8    4.0    5.7    7.5    16.7
* HDFS parquet file replication = 3, kudu table replication = 3
* Each query run 5 times then take average

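The bulk load figures above can be cross-checked with simple arithmetic. The sketch assumes the per-node throughput is the total divided across the 71 nodes listed in the benchmark environment, and that total throughput times load time gives the row count; the variable and function names are invented here. The inputs are the rounded values from the table, so small differences from the slide (39.4k versus 39.5k) are expected.

```python
NODES = 71  # cluster size from the benchmark environment

def per_node(total_records_per_s: float) -> float:
    """Total throughput divided evenly across the cluster."""
    return total_records_per_s / NODES

kudu_per_node = per_node(2.8e6)       # about 39.4k records/s
parquet_per_node = per_node(23.5e6)   # about 331k records/s

# Rows implied by throughput * load time; both come out near 2.7 billion,
# in line with the "~2.6 Billion rows" stated for the dataset.
implied_rows_kudu = 2.8e6 * 961.1
implied_rows_parquet = 23.5e6 * 114.6

print(round(kudu_per_node), round(parquet_per_node))
print(round(implied_rows_kudu / 1e9, 2), round(implied_rows_parquet / 1e9, 2))
```
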
Use Case 1: Result Analysis
• Lazy materialization
  Ideal for search style query
  Q6 returns only a few records (of a single user) with all columns
• Scan range pruning using primary index
  Predicates on primary key
  Q5 only scans 1 hour of data
• Future work
  Primary index: speed-up order by and distinct
  Hash Partitioning: speed-up count(distinct), no need for global shuffle/merge

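Scan range pruning on a sorted primary key can be sketched with binary search: a predicate on the key turns into a start and end position, and rows outside that range are never touched. The (hour, minute) key layout and the name range_scan are assumptions for this example, not the schema used in the benchmark.

```python
import bisect

def range_scan(sorted_keys, lo, hi):
    """Return keys k with lo <= k < hi from a sorted list."""
    start = bisect.bisect_left(sorted_keys, lo)
    end = bisect.bisect_left(sorted_keys, hi)
    return sorted_keys[start:end]

# One day of hypothetical keys, one row every 15 minutes, already sorted.
keys = [(hour, minute) for hour in range(24) for minute in range(0, 60, 15)]

# A predicate covering a single hour only reads that hour's rows.
one_hour = range_scan(keys, (5, 0), (6, 0))
print(one_hour)
```
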
Use Case 2
OLAP PaaS for ecosystem cloud
• Provide big data service for smart hardware startups (Xiaomi’s ecosystem members)
• OLAP database with some OLTP features
• Manage/Ingest/query your data and serving results in one place
Backend/Mobile App/Smart Device/IoT …

Demo
• Code currently at https://github.com/tmalaska/SparkOnKudu/
• Work being finished in https://issues.cloudera.org/browse/KUDU-1214
[Diagram: Gamer data points -> Ingestion in Kafka -> Processing in Spark Streaming -> Data stored in Kudu -> Querying done in Impala]
• Producer sends data points to Kafka
• Spark pulls from Kafka
• Spark loads base data from Kudu
• Aggregates are stored back into Kudu
• Live queries come from Impala

Project status
• Public Beta released September 28th 2015, version 0.5.0
  • Not ready for production
  • No security
  • Feedback/jiras/patches welcome
• Next release in November (0.6.0):
  • Mac OSX support for single node deployment
  • Lots of small fixes and improvements
• GA sometime next year (hopefully!)
  • Will have Kerberos integration
  • Ready for production

Getting started

Getting started as a user
• http://getkudu.io
• kudu-user@googlegroups.com
• http://getkudu-slack.herokuapp.com/
• Quickstart VM
  • Easiest way to get started
  • Impala and Kudu in an easy-to-install VM
• CSD and Parcels
  • For installation on a Cloudera Manager-managed cluster

Getting started as a developer
• http://github.com/cloudera/kudu
  • All commits go here first
• Public gerrit: http://gerrit.cloudera.org
  • All code reviews happening here
• Public JIRA: http://issues.cloudera.org
  • Includes bugs going back to 2013. Come see our dirty laundry!
• kudu-dev@googlegroups.com
• Apache 2.0 license open source
• Contributions are welcome and encouraged!

http://getkudu.io/
@getkudu

More Related Content

What's hot

Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Cloudera, Inc.
 
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...Yahoo Developer Network
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupCaserta
 
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platformcloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data PlatformRakuten Group, Inc.
 
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...Dataconomy Media
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduJeremy Beard
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Data Con LA
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Data Con LA
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Exponea - Kafka and Hadoop as components of architecture
Exponea  - Kafka and Hadoop as components of architectureExponea  - Kafka and Hadoop as components of architecture
Exponea - Kafka and Hadoop as components of architectureMartinStrycek
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/KuduChris George
 
Apache Flink & Kudu: a connector to develop Kappa architectures
Apache Flink & Kudu: a connector to develop Kappa architecturesApache Flink & Kudu: a connector to develop Kappa architectures
Apache Flink & Kudu: a connector to develop Kappa architecturesNacho García Fernández
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataCloudera, Inc.
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopCloudera, Inc.
 

What's hot (20)

Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
 
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platformcloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
 
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 
Apache kudu
Apache kuduApache kudu
Apache kudu
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Kudu demo
Kudu demoKudu demo
Kudu demo
 
Exponea - Kafka and Hadoop as components of architecture
Exponea  - Kafka and Hadoop as components of architectureExponea  - Kafka and Hadoop as components of architecture
Exponea - Kafka and Hadoop as components of architecture
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/Kudu
 
Apache Flink & Kudu: a connector to develop Kappa architectures
Apache Flink & Kudu: a connector to develop Kappa architecturesApache Flink & Kudu: a connector to develop Kappa architectures
Apache Flink & Kudu: a connector to develop Kappa architectures
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Hive vs. Impala
Hive vs. ImpalaHive vs. Impala
Hive vs. Impala
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for Hadoop
 

Viewers also liked

Query generation across multiple data stores [SBTB 2016]
Query generation across multiple data stores [SBTB 2016]Query generation across multiple data stores [SBTB 2016]
Query generation across multiple data stores [SBTB 2016]Hiral Patel
 
Real-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using ImpalaReal-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using ImpalaJason Shih
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala InternalsDavid Groozman
 
Cloudera Impala Source Code Explanation and Analysis
Cloudera Impala Source Code Explanation and AnalysisCloudera Impala Source Code Explanation and Analysis
Cloudera Impala Source Code Explanation and AnalysisYue Chen
 
SecPod: A Framework for Virtualization-based Security Systems
SecPod: A Framework for Virtualization-based Security SystemsSecPod: A Framework for Virtualization-based Security Systems
SecPod: A Framework for Virtualization-based Security SystemsYue Chen
 
Part 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache KuduPart 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache KuduCloudera, Inc.
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache KuduAndriy Zabavskyy
 
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Cloudera, Inc.
 
How Impala Works
How Impala WorksHow Impala Works
How Impala WorksYue Chen
 
Inside HDFS Append
Inside HDFS AppendInside HDFS Append
Inside HDFS AppendYue Chen
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentationhadooparchbook
 

Viewers also liked (13)

Query generation across multiple data stores [SBTB 2016]
Query generation across multiple data stores [SBTB 2016]Query generation across multiple data stores [SBTB 2016]
Query generation across multiple data stores [SBTB 2016]
 
Real-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using ImpalaReal-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using Impala
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala Internals
 
Cloudera Impala Source Code Explanation and Analysis
Cloudera Impala Source Code Explanation and AnalysisCloudera Impala Source Code Explanation and Analysis
Cloudera Impala Source Code Explanation and Analysis
 
SecPod: A Framework for Virtualization-based Security Systems
SecPod: A Framework for Virtualization-based Security SystemsSecPod: A Framework for Virtualization-based Security Systems
SecPod: A Framework for Virtualization-based Security Systems
 
Part 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache KuduPart 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache Kudu
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
 
Kudu Cloudera Meetup Paris
Kudu Cloudera Meetup ParisKudu Cloudera Meetup Paris
Kudu Cloudera Meetup Paris
 
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
 
How Impala Works
How Impala WorksHow Impala Works
How Impala Works
 
Inside HDFS Append
Inside HDFS AppendInside HDFS Append
Inside HDFS Append
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 

Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop

  • 1. Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
      Jean-Daniel Cryans on behalf of the Kudu team
      © Cloudera, Inc. All rights reserved.
  • 2. Myself
      • Software Engineer at Cloudera
      • On the Kudu team for 2 years
      • Apache HBase committer and PMC member since 2008
      • Previously at StumbleUpon
  • 3. Kudu: Storage for Fast Analytics on Fast Data
      • New updating column store for Hadoop
      • Apache-licensed open source
      • Beta now available
      [Figure labels: Columnar Store, Kudu]
  • 4. Motivation and Goals: Why build Kudu?
  • 5. Motivating Questions
      • Are there user problems that we can't address because of gaps in Hadoop ecosystem storage technologies?
      • Are we positioned to take advantage of advancements in the hardware landscape?
  • 6. Current Storage Landscape in Hadoop
      HDFS excels at:
      • Efficiently scanning large amounts of data
      • Accumulating data with high throughput
      HBase excels at:
      • Efficiently finding and writing individual rows
      • Making data mutable
      Gaps exist when these properties are needed simultaneously
  • 7. Kudu Design Goals
      • High throughput for big scans (columnar storage and replication). Goal: within 2x of Parquet
      • Low latency for short accesses (primary key indexes and quorum replication). Goal: 1ms read/write on SSD
      • Database-like semantics (initially single-row ACID)
      • Relational data model
      • SQL query
      • "NoSQL" style scan/insert/update (Java client)
  • 8. Changing Hardware Landscape
      • Spinning disk -> solid state storage
      • NAND flash: up to 450k read and 250k write IOPS, about 2GB/sec read and 1.5GB/sec write throughput, at a price of less than $3/GB and dropping
      • 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
      • RAM is cheaper and more abundant: 64 -> 128 -> 256GB over the last few years
      • Takeaway 1: The next bottleneck is CPU, and current storage systems weren't designed with CPU efficiency in mind.
      • Takeaway 2: Column stores are feasible for random access
  • 9. Kudu Usage
      • Table has a SQL-like schema
      • Finite number of columns (unlike HBase/Cassandra)
      • Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP
      • Some subset of columns makes up a possibly-composite primary key
      • Fast ALTER TABLE
      • Java and C++ "NoSQL" style APIs
      • Insert(), Update(), Delete(), Scan()
      • Integrations with MapReduce, Spark, and Impala
      • More to come!
  • 10. Use cases and architectures
  • 11. Kudu Use Cases
      Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes
      ● Time Series
        ○ Examples: Stream market data; fraud detection & prevention; risk monitoring
        ○ Workload: Inserts, updates, scans, lookups
      ● Machine Data Analytics
        ○ Examples: Network threat detection
        ○ Workload: Inserts, scans, lookups
      ● Online Reporting
        ○ Examples: Operational Data Store (ODS)
        ○ Workload: Inserts, updates, scans, lookups
  • 12. Real-Time Analytics in Hadoop Today
      Fraud Detection in the Real World = Storage Complexity
      Considerations:
      ● How do I handle failure during this process?
      ● How often do I reorganize data streaming in into a format appropriate for reporting?
      ● When reporting, how do I see data that has not yet been reorganized?
      ● How do I ensure that important jobs aren't interrupted by maintenance?
      [Figure labels: Incoming Data (Messaging System); HBase; Have we accumulated enough data?; Reorganize HBase file into Parquet; Parquet File; New Partition; Most Recent Partition; Historic Data; Impala on HDFS; Reporting Request]
      • Wait for running operations to complete
      • Define new Impala partition referencing the newly written Parquet file
  • 13. Real-Time Analytics in Hadoop with Kudu
      Improvements:
      ● One system to operate
      ● No cron jobs or background processes
      ● Handle late arrivals or data corrections with ease
      ● New data available immediately for analytics or operations
      [Figure labels: Incoming Data (Messaging System); Storage in Kudu; Historical and Real-time Data; Reporting Request]
  • 14. How it works
  • 15. Tables and Tablets
      • Table is horizontally partitioned into tablets
      • Range or hash partitioning
      • PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS
      • Each tablet has N replicas (3 or 5), with Raft consensus
      • Allow read from any replica, plus leader-driven writes with low MTTR
      • Tablet servers host tablets
      • Store data on local disks (no HDFS)
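The hash-partitioning clause on slide 15 can be pictured with a small sketch. It assumes a generic hash (CRC32, chosen here only for illustration; the slides do not say which hash Kudu uses) and a hypothetical helper name `bucket_for`. It shows how hashing a column value spreads otherwise sequential timestamps across 100 buckets.

```python
import zlib

NUM_BUCKETS = 100  # matches "INTO 100 BUCKETS" on the slide


def bucket_for(column_value: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a hashed column value to a bucket (tablet) index.

    CRC32 is a stand-in hash for this sketch, not Kudu's actual function.
    """
    return zlib.crc32(column_value.encode("utf-8")) % num_buckets


# Consecutive timestamps land in many different buckets instead of one.
buckets = {bucket_for(str(1443398400 + i)) for i in range(1000)}
print(len(buckets), "distinct buckets used")
```

With range partitioning alone, the same consecutive timestamps would all target the tablet holding the newest range, which is why the slide pairs a time-series key with hash distribution.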
  • 16. Metadata
      • Replicated master*
      • Acts as a tablet directory ("META" table)
      • Acts as a catalog (table schemas, etc.)
      • Acts as a load balancer (tracks TS liveness, re-replicates under-replicated tablets)
      • Caches all metadata in RAM for high performance
      • 80-node load test, GetTableLocations RPC perf:
      • 99th percentile: 68 µs, 99.99th percentile: 657 µs
      • <2% peak CPU usage
      • Client configured with master addresses
      • Asks master for tablet locations as needed and caches them
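The last bullet on slide 16 (the client asks the master for tablet locations as needed and caches them) can be sketched as a lookup-through cache. The class and callback names below are hypothetical and are not part of the Kudu client API; the sketch only illustrates the caching pattern.

```python
from typing import Callable, Dict, List


class TabletLocationCache:
    """Hypothetical client-side cache of tablet locations."""

    def __init__(self, ask_master: Callable[[str], List[str]]) -> None:
        self._ask_master = ask_master          # stand-in for the master RPC
        self._locations: Dict[str, List[str]] = {}

    def locate(self, tablet_id: str) -> List[str]:
        # Only contact the master on a cache miss.
        if tablet_id not in self._locations:
            self._locations[tablet_id] = self._ask_master(tablet_id)
        return self._locations[tablet_id]


master_calls = []


def fake_master(tablet_id: str) -> List[str]:
    """Illustrative stand-in that records each lookup it serves."""
    master_calls.append(tablet_id)
    return ["ts-a", "ts-b", "ts-c"]


cache = TabletLocationCache(fake_master)
cache.locate("tablet-1")
cache.locate("tablet-1")   # served from the cache
print(len(master_calls))   # 1
```

A real client would also need to invalidate entries when a tablet moves or its leader changes; that part is omitted here.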
  • 18. Raft consensus
      [Figure labels: Client; TS A, Tablet 1 (LEADER), WAL; TS B, Tablet 1 (FOLLOWER), WAL; TS C, Tablet 1 (FOLLOWER), WAL]
      1a. Client->Leader: Write() RPC
      2a. Leader->Followers: UpdateConsensus() RPC
      2b. Leader writes local WAL
      3. Follower: write WAL (on each follower)
      4. Follower->Leader: success
      5. Leader has achieved majority
      6. Leader->Client: Success!
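The majority step in the write path on slide 18 can be sketched as a simple count of acknowledgements. This is a minimal illustration with hypothetical function names; it leaves out terms, log indexes, and retries, and is not Kudu's implementation.

```python
from typing import List


def majority(num_replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return num_replicas // 2 + 1


def write_is_committed(follower_acks: List[bool]) -> bool:
    """Leader counts its own WAL write plus successful follower acks."""
    num_replicas = 1 + len(follower_acks)
    acks = 1 + sum(follower_acks)  # the leader's local WAL write counts as one
    return acks >= majority(num_replicas)


# Three replicas: the leader plus one follower is enough to commit.
print(write_is_committed([True, False]))   # True
print(write_is_committed([False, False]))  # False
```

This is why the leader can reply to the client at step 6 without waiting for the slowest follower.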
  • 19. Fault tolerance
      • Transient FOLLOWER failure:
      • Leader can still achieve majority
      • Restart follower TS within 5 min and it will rejoin transparently
      • Transient LEADER failure:
      • Followers expect to hear a heartbeat from their leader every 1.5 seconds
      • 3 missed heartbeats: leader election!
      • New LEADER is elected from remaining nodes within a few seconds
      • Restart within 5 min and it rejoins as a FOLLOWER
      • N replicas handle (N-1)/2 failures
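The numbers on slide 19 can be worked through directly. The constants below are taken from the slide; the function names are illustrative, and the division is read as integer division, which is an assumption consistent with the 3-or-5 replica counts mentioned earlier.

```python
HEARTBEAT_INTERVAL_S = 1.5   # from the slide
MISSED_HEARTBEATS = 3        # from the slide


def tolerated_failures(num_replicas: int) -> int:
    """(N-1)/2, rounded down: failures that still leave a majority."""
    return (num_replicas - 1) // 2


def election_trigger_delay_s() -> float:
    """Time without heartbeats before followers start an election."""
    return HEARTBEAT_INTERVAL_S * MISSED_HEARTBEATS


print(tolerated_failures(3))        # 1
print(tolerated_failures(5))        # 2
print(election_trigger_delay_s())   # 4.5
```

So a 3-replica tablet survives one failed server, a 5-replica tablet survives two, and a dead leader is noticed after roughly 4.5 seconds of silence.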
  • 20. Fault tolerance (2)
      • Permanent failure:
      • Leader notices that a follower has been dead for 5 minutes
      • Evicts that follower
      • Master selects a new replica
      • Leader copies the data over to the new one, which joins as a new FOLLOWER
  • 21. Tablet design
      • Inserts buffered in an in-memory store (like HBase's memstore)
      • Flushed to disk
      • Columnar layout, similar to Apache Parquet
      • Updates use MVCC (updates tagged with timestamp, not in-place)
      • Allow "SELECT AS OF <timestamp>" queries and consistent cross-tablet scans
      • Near-optimal read path for "current time" scans
      • No per-row branches, fast vectorized decoding and predicate evaluation
      • Performance worsens based on number of recent updates
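The MVCC bullets on slide 21 (updates tagged with a timestamp rather than applied in place, enabling "AS OF" reads) can be sketched with a single versioned cell. The class below is hypothetical and says nothing about Kudu's on-disk layout; it only shows how a snapshot read selects the newest version at or before the requested timestamp.

```python
from typing import List, Optional, Tuple


class VersionedCell:
    """Hypothetical cell that keeps every timestamped version."""

    def __init__(self) -> None:
        self._versions: List[Tuple[int, str]] = []  # (timestamp, value)

    def write(self, timestamp: int, value: str) -> None:
        # Updates are appended with their timestamp, never overwritten.
        self._versions.append((timestamp, value))

    def read_as_of(self, timestamp: int) -> Optional[str]:
        # Visible versions are those written at or before the snapshot time.
        visible = [v for v in self._versions if v[0] <= timestamp]
        if not visible:
            return None
        return max(visible, key=lambda v: v[0])[1]


cell = VersionedCell()
cell.write(10, "first")
cell.write(20, "second")
print(cell.read_as_of(15))  # first
print(cell.read_as_of(25))  # second
```

The sketch also hints at the last bullet: the more recent versions a cell accumulates, the more work a read has to do to find the visible one.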
  • 22. LSM vs Kudu
      • LSM: Log Structured Merge (Cassandra, HBase, etc.)
      • Inserts and updates all go to an in-memory map (MemStore) and later flush to on-disk files (HFile/SSTable)
      • Reads perform an on-the-fly merge of all on-disk HFiles
      • Kudu
      • Shares some traits (memstores, compactions)
      • More complex
      • Slower writes in exchange for faster reads (especially scans)
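The "on-the-fly merge" that LSM reads perform can be sketched as a k-way merge over sorted runs in which the newest run wins for duplicate keys. This is a simplified illustration with hypothetical names (no tombstones, no timestamps), not HBase's or Cassandra's actual read path, and it assumes each run holds a key at most once.

```python
import heapq
from typing import Iterator, List, Tuple

Run = List[Tuple[str, str]]  # one sorted run of (key, value) pairs


def merged_scan(runs_newest_first: List[Run]) -> Iterator[Tuple[str, str]]:
    """Merge sorted runs; for duplicate keys the newest run's value wins."""
    # Tag each entry with the run's age so ties on key sort newest first.
    tagged = [
        [(key, age, value) for key, value in run]
        for age, run in enumerate(runs_newest_first)
    ]
    previous_key = None
    for key, _age, value in heapq.merge(*tagged):
        if key != previous_key:   # skip older versions of the same key
            yield key, value
            previous_key = key


newest = [("a", "new"), ("c", "3")]
oldest = [("a", "old"), ("b", "2")]
print(list(merged_scan([newest, oldest])))
# [('a', 'new'), ('b', '2'), ('c', '3')]
```

Every scan pays for this merge across all runs, which is the read-side cost the slide contrasts with Kudu's choice of slower writes in exchange for faster scans.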
  • 23. Kudu trade-offs
      • Random updates will be slower
      • HBase model allows random updates without incurring a disk seek
      • Kudu requires a key lookup before update, bloom lookup before insert, may incur seeks
      • Single-row reads may be slower
      • Columnar design is optimized for scans
      • Especially slow at reading a row that has had many recent updates (e.g. YCSB "zipfian")
  • 24. Benchmarks
  • 25. TPC-H (Analytics benchmark)
      • 75 TS + 1 master cluster
      • 12 (spinning) disks each, enough RAM to fit dataset
      • Using Kudu 0.5.0, Impala 2.2 with Kudu support, CDH 5.4
      • TPC-H Scale Factor 100 (100GB)
      • Example query:
          SELECT n_name, sum(l_extendedprice * (1 - l_discount)) AS revenue
          FROM customer, orders, lineitem, supplier, nation, region
          WHERE c_custkey = o_custkey
            AND l_orderkey = o_orderkey
            AND l_suppkey = s_suppkey
            AND c_nationkey = s_nationkey
            AND s_nationkey = n_nationkey
            AND n_regionkey = r_regionkey
            AND r_name = 'ASIA'
            AND o_orderdate >= date '1994-01-01'
            AND o_orderdate < '1995-01-01'
          GROUP BY n_name
          ORDER BY revenue DESC;
  • 26. • Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
      • Parquet likely to outperform Kudu for HDD-resident data (larger IO requests)
  • 27. What about Apache Phoenix?
      • 10 node cluster (9 worker, 1 master)
      • HBase 1.0, Phoenix 4.3
      • TPC-H LINEITEM table only (6B rows)
      [Chart: time in seconds, log scale]
                  Load    TPC-H Q1   COUNT(*)   COUNT(*) WHERE…   single-row lookup
        Phoenix   2152    219        76         131               0.04
        Kudu      1918    13.2       1.7        0.7               0.15
        Parquet   155     9.3        1.4        1.5               1.37
  • 28. What about NoSQL-style random access? (YCSB)
      • YCSB 0.5.0-snapshot
      • 10 node cluster (9 worker, 1 master)
      • HBase 1.0
      • 100M rows, 10M ops
  • 29. But don't trust me (a vendor)…
  • 30. About Xiaomi
      Mobile Internet Company Founded in 2010
      [Figure labels: Smartphones; Software; E-commerce; MIUI; Cloud Services; App Store/Game; Payment/Finance; …; Smart Home; Smart Devices]
  • 31. Big Data Analytics Pipeline Before Kudu
      • Long pipeline: high latency (1 hour ~ 1 day), data conversion pains
      • No ordering: log arrival (storage) order is not exactly logical order, e.g. read 2-3 days of logs for data in 1 day
  • 32. Big Data Analysis Pipeline Simplified With Kudu
      • ETL pipeline (0~10s latency): apps that need to prevent backpressure or require ETL
      • Direct pipeline (no latency): apps that don't require ETL and have no backpressure issues
      [Figure labels: OLAP scan; Side table lookup; Result store]
  • 33. Use Case 1: Mobile service monitoring and tracing tool
      Gather important RPC tracing events from mobile app and backend service. Service monitoring & troubleshooting tool.
      Requirements
      • High write throughput: >5 billion records/day and growing
      • Query latest data with quick response: identify and resolve issues quickly
      • Can search for individual records: easy for troubleshooting
  • 34. Use Case 1: Benchmark
      Environment
      • 71 node cluster
      • Hardware: CPU: E5-2620 2.1GHz * 24 core; Memory: 64GB; Network: 1Gb; Disk: 12 HDD
      • Software: Hadoop 2.6 / Impala 2.1 / Kudu
      Data
      • 1 day of server-side tracing data: ~2.6 billion rows, ~270 bytes/row, 17 columns, 5 key columns
  • 35. Use Case 1: Benchmark Results
      Bulk load using Impala (INSERT INTO):
                  Total time (s)   Throughput (total)   Throughput (per node)
        Kudu      961.1            2.8M records/s       39.5k records/s
        Parquet   114.6            23.5M records/s      331k records/s
      Query latency (chart data labels, Q1 to Q6):
        Kudu:     1.4, 2.0, 2.3, 3.1, 1.3, 0.9
        Parquet:  1.3, 2.8, 4.0, 5.7, 7.5, 16.7
      * HDFS Parquet file replication = 3, Kudu table replication = 3
      * Each query was run 5 times and the results averaged
  • 36. Use Case 1: Result Analysis
      • Lazy materialization: ideal for search-style queries; Q6 returns only a few records (of a single user) with all columns
      • Scan range pruning using primary index: predicates on primary key; Q5 only scans 1 hour of data
      • Future work: primary index to speed up ORDER BY and DISTINCT; hash partitioning to speed up count(distinct), with no need for a global shuffle/merge
  • 37. Use Case 2: OLAP PaaS for ecosystem cloud
      • Provide big data service for smart hardware startups (Xiaomi's ecosystem members)
      • OLAP database with some OLTP features
      • Manage, ingest, and query your data and serve results in one place
      [Figure labels: Backend / Mobile App / Smart Device / IoT …]
  • 38. Demo
  • 39. Demo
      • Code currently at https://github.com/tmalaska/SparkOnKudu/
      • Work being finished in https://issues.cloudera.org/browse/KUDU-1214
      [Figure labels: Gamer data points; Ingestion in Kafka; Processing in Spark Streaming; Data stored in Kudu; Querying done in Impala]
      • Producer sends data points to Kafka
      • Spark pulls from Kafka
      • Spark loads base data from Kudu
      • Aggregates are stored back into Kudu
      • Live queries come from Impala
  • 40. Demo
  • 41. Project status
  • 42. Project status
      • Public Beta released September 28th 2015, version 0.5.0
      • Not ready for production
      • No security
      • Feedback/jiras/patches welcome
      • Next release in November (0.6.0):
      • Mac OS X support for single node deployment
      • Lots of small fixes and improvements
      • GA sometime next year (hopefully!)
      • Will have Kerberos integration
      • Ready for production
  • 43. Getting started
  • 44. Getting started as a user
      • http://getkudu.io
      • kudu-user@googlegroups.com
      • http://getkudu-slack.herokuapp.com/
      • Quickstart VM
      • Easiest way to get started
      • Impala and Kudu in an easy-to-install VM
      • CSD and Parcels
      • For installation on a Cloudera Manager-managed cluster
  • 45. Getting started as a developer
      • http://github.com/cloudera/kudu
      • All commits go here first
      • Public gerrit: http://gerrit.cloudera.org
      • All code reviews happening here
      • Public JIRA: http://issues.cloudera.org
      • Includes bugs going back to 2013. Come see our dirty laundry!
      • kudu-dev@googlegroups.com
      • Apache 2.0 license open source
      • Contributions are welcome and encouraged!
  • 46. http://getkudu.io/ @getkudu