SlideShare une entreprise Scribd logo
1  sur  39
Télécharger pour lire hors ligne
1	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Todd	
  Lipcon	
  (Kudu	
  team	
  lead)	
  –	
  todd@cloudera.com	
  
	
  
Twi$er:	
  @ApacheKudu	
  or	
  #kudu	
  
	
   	
  	
  	
  	
  @tlipcon	
  
	
  
*	
  incubaEng	
  
Apache	
  Kudu*:	
  Fast	
  AnalyEcs	
  on	
  
Fast	
  Data	
  
2	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Why	
  Kudu?	
  
2	
  
3	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Current	
  Storage	
  Landscape	
  in	
  Hadoop	
  Ecosystem	
  
HDFS	
  (GFS)	
  excels	
  at:	
  
•  Batch	
  ingest	
  only	
  (eg	
  hourly)	
  
•  Efficiently	
  scanning	
  large	
  amounts	
  
of	
  data	
  (analyEcs)	
  
HBase	
  (BigTable)	
  excels	
  at:	
  
•  Efficiently	
  finding	
  and	
  wriEng	
  
individual	
  rows	
  
•  Making	
  data	
  mutable	
  
	
  
Gaps	
  exist	
  when	
  these	
  properEes	
  
are	
  needed	
  simultaneously	
  
4	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
•  High	
  throughput	
  for	
  big	
  scans	
  
Goal:	
  Within	
  2x	
  of	
  Parquet	
  
	
  
•  Low-­‐latency	
  for	
  short	
  accesses	
  	
  
	
  	
  	
  Goal:	
  1ms	
  read/write	
  on	
  SSD	
  
	
  
•  Database-­‐like	
  semanEcs	
  (iniEally	
  single-­‐row	
  
ACID)	
  
Kudu	
  Design	
  Goals	
  
5	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Changing	
  Hardware	
  landscape	
  
•  Spinning	
  disk	
  -­‐>	
  solid	
  state	
  storage	
  
• NAND	
  flash:	
  Up	
  to	
  450k	
  read	
  250k	
  write	
  iops,	
  about	
  2GB/sec	
  read	
  and	
  1.5GB/
sec	
  write	
  throughput,	
  at	
  a	
  price	
  of	
  less	
  than	
  $3/GB	
  and	
  dropping	
  
• 3D	
  XPoint	
  memory	
  (1000x	
  faster	
  than	
  NAND,	
  cheaper	
  than	
  RAM)	
  
•  RAM	
  is	
  cheaper	
  and	
  more	
  abundant:	
  
• 64-­‐>128-­‐>256GB	
  over	
  last	
  few	
  years	
  
•  Takeaway:	
  The	
  next	
  bo$leneck	
  is	
  CPU,	
  and	
  current	
  storage	
  systems	
  weren’t	
  
designed	
  with	
  CPU	
  efficiency	
  in	
  mind.	
  
6	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
What’s	
  Kudu?	
  
6	
  
7	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Scalable	
  and	
  fast	
  tabular	
  storage	
  
•  Scalable	
  
• Tested	
  up	
  to	
  275	
  nodes	
  (~3PB	
  cluster)	
  
• Designed	
  to	
  scale	
  to	
  1000s	
  of	
  nodes,	
  tens	
  of	
  PBs	
  
•  Fast	
  
• Millions	
  of	
  read/write	
  operaEons	
  per	
  second	
  across	
  cluster	
  
• MulPple	
  GB/second	
  read	
  throughput	
  per	
  node	
  
•  Tabular	
  
• SQL-­‐like	
  schema:	
  finite	
  number	
  of	
  typed	
  columns	
  (unlike	
  HBase/Cassandra)	
  
• Fast	
  ALTER	
  TABLE	
  
• “NoSQL”	
  APIs:	
  Java/C++/Python	
  	
  or	
  SQL	
  (Impala/Spark/etc)	
  
8	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Use	
  cases	
  and	
  architectures	
  
9	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Kudu	
  Use	
  Cases	
  
Kudu	
  is	
  best	
  for	
  use	
  cases	
  requiring	
  a	
  simultaneous	
  combinaPon	
  of	
  
sequenPal	
  and	
  random	
  reads	
  and	
  writes	
  
	
  
● Time	
  Series	
  
○  Examples:	
  Stream	
  market	
  data;	
  fraud	
  detecEon	
  &	
  prevenEon;	
  risk	
  monitoring	
  
○  Workload:	
  Insert,	
  updates,	
  scans,	
  lookups	
  
● Online	
  ReporPng	
  
○  Examples:	
  ODS	
  
○  Workload:	
  Inserts,	
  updates,	
  scans,	
  lookups	
  
10	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Real-­‐Time	
  AnalyEcs	
  in	
  Hadoop	
  Today	
  
Fraud	
  DetecEon	
  in	
  the	
  Real	
  World	
  =	
  Storage	
  Complexity	
  
ConsideraPons:	
  
●  How	
  do	
  I	
  handle	
  failure	
  
during	
  this	
  process?	
  
	
  
●  How	
  oten	
  do	
  I	
  reorganize	
  
data	
  streaming	
  in	
  into	
  a	
  
format	
  appropriate	
  for	
  
reporEng?	
  
	
  
●  When	
  reporEng,	
  how	
  do	
  I	
  see	
  
data	
  that	
  has	
  not	
  yet	
  been	
  
reorganized?	
  
	
  
●  How	
  do	
  I	
  ensure	
  that	
  
important	
  jobs	
  aren’t	
  
interrupted	
  by	
  maintenance?	
  
New	
  ParEEon	
  
Most	
  Recent	
  ParEEon	
  
Historic	
  Data	
  
HBase	
  
Parquet	
  
File	
  
Have	
  we	
  
accumulated	
  
enough	
  data?	
  
Reorganize	
  
HBase	
  file	
  
into	
  Parquet	
  
•  Wait	
  for	
  running	
  operaEons	
  to	
  complete	
  	
  
•  Define	
  new	
  Impala	
  parEEon	
  referencing	
  
the	
  newly	
  wriwen	
  Parquet	
  file	
  
Incoming	
  Data	
  
(Messaging	
  
System)	
  
ReporEng	
  
Request	
  
Impala	
  on	
  HDFS	
  
11	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Real-­‐Time	
  AnalyEcs	
  in	
  Hadoop	
  with	
  Kudu	
  
Improvements:	
  
●  One	
  system	
  to	
  operate	
  
●  No	
  cron	
  jobs	
  or	
  background	
  
processes	
  
●  Handle	
  late	
  arrivals	
  or	
  data	
  
correcPons	
  with	
  ease	
  
●  New	
  data	
  available	
  
immediately	
  for	
  analyPcs	
  or	
  
operaPons	
  	
  
Historical	
  and	
  Real-­‐Eme	
  
Data	
  
Incoming	
  Data	
  
(Messaging	
  
System)	
  
ReporEng	
  
Request	
  
Storage	
  in	
  Kudu	
  
12	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Xiaomi	
  use	
  case	
  
•  World’s	
  4th	
  largest	
  smart-­‐phone	
  maker	
  (most	
  popular	
  in	
  China)	
  
•  Gather	
  important	
  RPC	
  tracing	
  events	
  from	
  mobile	
  app	
  and	
  backend	
  service.	
  	
  
•  Service	
  monitoring	
  &	
  troubleshooEng	
  tool.	
  
u  High	
  write	
  throughput	
  
•  >10	
  Billion	
  records/day	
  and	
  growing	
  
u  Query	
  latest	
  data	
  and	
  quick	
  response	
  
•  IdenEfy	
  and	
  resolve	
  issues	
  quickly	
  
u  Can	
  search	
  for	
  individual	
  records	
  
•  Easy	
  for	
  troubleshooEng	
  
13	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Xiaomi	
  Big	
  Data	
  AnalyPcs	
  Pipeline	
  
Before	
  Kudu
•  Long	
  pipeline	
  
high	
  latency(1	
  hour	
  ~	
  1	
  day),	
  data	
  conversion	
  pains	
  
•  No	
  ordering	
  
Log	
  arrival(storage)	
  order	
  not	
  exactly	
  logical	
  order	
  
e.g.	
  read	
  2-­‐3	
  days	
  of	
  log	
  for	
  data	
  in	
  1	
  day
14	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Xiaomi	
  Big	
  Data	
  Analysis	
  Pipeline	
  
Simplified	
  With	
  Kudu
•  ETL	
  Pipeline(0~10s	
  latency)	
  
Apps	
  that	
  need	
  to	
  prevent	
  backpressure	
  or	
  require	
  ETL	
  	
  
•  Direct	
  Pipeline(no	
  latency)	
  
Apps	
  that	
  don’t	
  require	
  ETL	
  and	
  no	
  backpressure	
  issues	
  
	
  
OLAP	
  scan	
  
Side	
  table	
  lookup	
  
Result	
  store	
  
15	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
How	
  it	
  works	
  
DistribuEon	
  and	
  fault	
  tolerance	
  
15	
  
16	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Tables,	
  Tablets,	
  and	
  Tablet	
  Servers	
  
•  Table	
  is	
  horizontally	
  parPPoned	
  into	
  tablets	
  
• Range	
  or	
  hash	
  parEEoning	
  
• PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY
HASH(timestamp) INTO 100 BUCKETS
•  bucketNumber = hashCode(row[‘timestamp’]) % 100
•  Each	
  tablet	
  has	
  N	
  replicas	
  (3	
  or	
  5),	
  with	
  Ra^	
  consensus	
  
• AutomaEc	
  fault	
  tolerance	
  
• MTTR:	
  ~5	
  seconds	
  
•  Tablet	
  servers	
  host	
  tablets	
  on	
  local	
  disk	
  drives	
  
16	
  
17	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Metadata	
  and	
  the	
  Master	
  
•  Replicated	
  master	
  
• Acts	
  as	
  a	
  tablet	
  directory	
  
• Acts	
  as	
  a	
  catalog	
  (which	
  tables	
  exist,	
  etc)	
  
• Acts	
  as	
  a	
  load	
  balancer	
  (tracks	
  TS	
  liveness,	
  re-­‐replicates	
  under-­‐replicated	
  
tablets)	
  
•  Not	
  a	
  bo$leneck	
  
• super	
  fast	
  in-­‐memory	
  lookups	
  
17	
  
18	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
How	
  it	
  works	
  
Columnar	
  storage	
  
18	
  
19	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Columnar	
  storage	
  
{25059873,	
  
22309487,	
  
23059861,	
  
23010982}	
  
Tweet_id	
  
{newsycbot,	
  
RideImpala,	
  
fastly,	
  
llvmorg}	
  
User_name	
  
{1442865158,	
  
1442828307,	
  
1442865156,	
  
1442865155}	
  
Created_at	
  
{Visual	
  exp…,	
  
Introducing	
  ..,	
  
Missing	
  July…,	
  
LLVM	
  3.7….}	
  
text	
  
20	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Columnar	
  storage	
  
{25059873,	
  
22309487,	
  
23059861,	
  
23010982}	
  
Tweet_id	
  
{newsycbot,	
  
RideImpala,	
  
fastly,	
  
llvmorg}	
  
User_name	
  
{1442865158,	
  
1442828307,	
  
1442865156,	
  
1442865155}	
  
Created_at	
  
{Visual	
  exp…,	
  
Introducing	
  ..,	
  
Missing	
  July…,	
  
LLVM	
  3.7….}	
  
text	
  
SELECT	
  COUNT(*)	
  FROM	
  tweets	
  WHERE	
  user_name	
  =	
  ‘newsycbot’;	
  
Only	
  read	
  1	
  column	
  	
  
1GB	
   2GB	
   1GB	
   200GB	
  
21	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Columnar	
  compression	
  
{1442865158,	
  
1442828307,	
  
1442865156,	
  
1442865155}	
  
Created_at	
  
Created_at	
   Diff(created_at)	
  
1442865158	
   n/a	
  
1442828307	
   -­‐36851	
  
1442865156	
   36849	
  
1442865155	
   -­‐1	
  
64	
  bits	
  each	
   17	
  bits	
  each	
  
•  Many	
  columns	
  can	
  compress	
  to	
  
a	
  few	
  bits	
  per	
  row!	
  
•  Especially:	
  
• Timestamps	
  
• Time	
  series	
  values	
  
•  Massive	
  space	
  savings	
  and	
  
throughput	
  increase!	
  
22	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Handling	
  inserts	
  and	
  updates	
  
•  Inserts	
  go	
  to	
  an	
  in-­‐memory	
  row	
  store	
  (MemRowSet)	
  
• Durable	
  due	
  to	
  write-­‐ahead	
  logging	
  
• Later	
  flush	
  to	
  columnar	
  format	
  on	
  disk	
  
•  Updates	
  go	
  to	
  in-­‐memory	
  “delta	
  store”	
  
• Later	
  flush	
  to	
  “delta	
  files”	
  on	
  disk	
  
• Eventually	
  “compact”	
  into	
  the	
  previously-­‐wriwen	
  columnar	
  data	
  files	
  
•  Skipping	
  details	
  due	
  to	
  Eme	
  constraints	
  
• available	
  in	
  other	
  slide	
  decks	
  online,	
  or	
  read	
  the	
  Kudu	
  whitepaper	
  to	
  learn	
  
more!	
  
23	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
IntegraEons	
  
24	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Spark	
  DataSource	
  integraEon	
  
sqlContext.load("org.kududb.spark",
Map("kudu.table" -> “foo”,
"kudu.master" -> “master.example.com”))
.registerTempTable(“mytable”)
df = sqlContext.sql(
“select col_a, col_b from mytable “ +
“where col_c = 123”)
Available	
  in	
  next	
  release	
  (Kudu	
  0.7.0)	
  
25	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Impala	
  integraEon	
  
• CREATE TABLE … DISTRIBUTE BY HASH(col1) INTO 16
BUCKETS AS SELECT … FROM …
• INSERT/UPDATE/DELETE
• Not an Impala user? Working on more integrations…
• Apache Drill
• Apache Hive
• Maybe Presto?
26	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
MapReduce	
  integraEon	
  
• MulE-­‐framework	
  cluster	
  (MR	
  +	
  HDFS	
  +	
  Kudu	
  on	
  the	
  same	
  disks)	
  
• KuduTableInputFormat	
  /	
  KuduTableOutputFormat	
  
• Support	
  for	
  pushing	
  predicates,	
  column	
  projecEons,	
  etc	
  
27	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Performance	
  
27	
  
28	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
TPC-­‐H	
  (AnalyEcs	
  benchmark)	
  
•  75	
  server	
  cluster	
  
• 12	
  (spinning)	
  disk	
  each,	
  enough	
  RAM	
  to	
  fit	
  dataset	
  
• TPC-­‐H	
  Scale	
  Factor	
  100	
  (100GB)	
  
•  Example	
  query:	
  
•  SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue FROM customer,
orders, lineitem, supplier, nation, region WHERE c_custkey = o_custkey AND
l_orderkey = o_orderkey AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey
AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'ASIA'
AND o_orderdate >= date '1994-01-01' AND o_orderdate < '1995-01-01’ GROUP BY
n_name ORDER BY revenue desc;
28	
  
29	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
-­‐	
  Kudu	
  outperforms	
  Parquet	
  by	
  31%	
  (geometric	
  mean)	
  for	
  RAM-­‐resident	
  data	
  
30	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Versus	
  other	
  NoSQL	
  storage	
  
•  Phoenix:	
  SQL	
  layer	
  on	
  HBase	
  
•  10	
  node	
  cluster	
  (9	
  worker,	
  1	
  master)	
  
•  TPC-­‐H	
  LINEITEM	
  table	
  only	
  (6B	
  rows)	
  
30	
  
2152	
  
219	
  
76	
  
131	
  
0.04	
  
1918	
  
13.2	
  
1.7	
  
0.7	
  
0.15	
  
155	
  
9.3	
  
1.4	
   1.5	
   1.37	
  
0.01	
  
0.1	
  
1	
  
10	
  
100	
  
1000	
  
10000	
  
Load	
   TPCH	
  Q1	
   COUNT(*)	
  
COUNT(*)	
  
WHERE…	
  
single-­‐row	
  
lookup	
  
Time	
  (sec)	
  
Phoenix	
  
Kudu	
  
Parquet	
  
31	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Xiaomi	
  benchmark	
  
•  6	
  real	
  queries	
  from	
  applicaEon	
  trace	
  analysis	
  applicaEon	
  
•  Q1:	
  SELECT	
  COUNT(*)	
  
•  Q2:	
  SELECT	
  hour,	
  COUNT(*)	
  WHERE	
  module	
  =	
  ‘foo’	
  GROUP	
  BY	
  HOUR	
  
•  Q3:	
  SELECT	
  hour,	
  COUNT(DISTINCT	
  uid)	
  WHERE	
  module	
  =	
  ‘foo’	
  AND	
  app=‘bar’	
  
GROUP	
  BY	
  HOUR	
  
•  Q4:	
  analyEcs	
  on	
  RPC	
  success	
  rate	
  over	
  all	
  data	
  for	
  one	
  app	
  
•  Q5:	
  same	
  as	
  Q4,	
  but	
  filter	
  by	
  Eme	
  range	
  
•  Q6:	
  SELECT	
  *	
  WHERE	
  app	
  =	
  …	
  AND	
  uid	
  =	
  …	
  ORDER	
  BY	
  ts	
  LIMIT	
  30	
  OFFSET	
  30	
  
32	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Xiaomi:	
  Benchmark	
  Results	
  	
  
1.4	
  	
   2.0	
  	
   2.3	
  	
  
3.1	
  	
  
1.3	
  	
   0.9	
  	
  1.3	
  	
  
2.8	
  	
  
4.0	
  	
  
5.7	
  	
  
7.5	
  	
  
16.7	
  	
  
Q1	
   Q2	
   Q3	
   Q4	
   Q5	
   Q6	
  
kudu	
  
parquet	
  
Query	
  latency:	
  
*	
  HDFS	
  parquet	
  file	
  replicaEon	
  =	
  3	
  ,	
  kudu	
  table	
  replicaEon	
  =	
  3	
  
*	
  Each	
  query	
  run	
  5	
  Emes	
  then	
  take	
  average	
  
33	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
What	
  about	
  NoSQL-­‐style	
  random	
  access?	
  (YCSB)	
  
•  YCSB	
  0.5.0-­‐snapshot	
  
•  10	
  node	
  cluster	
  
(9	
  worker,	
  1	
  master)	
  
•  100M	
  row	
  data	
  set	
  
•  10M	
  operaEons	
  each	
  
workload	
  
33	
  
34	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Ge…ng	
  started	
  
34	
  
35	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Project	
  status	
  
•  Open	
  source	
  beta	
  released	
  in	
  September	
  
•  First	
  update	
  version	
  0.6.0	
  released	
  end	
  of	
  November	
  
• Usable	
  for	
  many	
  applicaEons	
  
• Missing	
  many	
  features:	
  security,	
  backup,	
  etc.	
  
• Have	
  not	
  experienced	
  data	
  loss,	
  reasonably	
  stable	
  (almost	
  no	
  crashes	
  reported)	
  
• SEll	
  requires	
  some	
  expert	
  assistance,	
  and	
  you’ll	
  probably	
  find	
  some	
  bugs	
  
•  Joining	
  the	
  Apache	
  So^ware	
  FoundaPon	
  Incubator	
  
• First	
  ASF	
  release	
  (0.7.0)	
  release	
  candidate	
  tonight!	
  
	
  
36	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Kudu	
  Community	
  
Your	
  company	
  here!	
  
37	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Ge…ng	
  started	
  as	
  a	
  user	
  
•  hwp://getkudu.io	
  
•  user-­‐subscribe@kudu.incubator.apache.org	
  
•  hwp://getkudu-­‐slack.herokuapp.com/	
  
•  Quickstart	
  VM	
  
• Easiest	
  way	
  to	
  get	
  started	
  
• Impala	
  and	
  Kudu	
  in	
  an	
  easy-­‐to-­‐install	
  VM	
  
•  CSD	
  and	
  Parcels	
  
• For	
  installaEon	
  on	
  a	
  Cloudera	
  Manager-­‐managed	
  cluster	
  
37	
  
38	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Ge…ng	
  started	
  as	
  a	
  developer	
  
•  hwp://github.com/apache/incubator-­‐kudu/	
  
• All	
  commits	
  go	
  here	
  first	
  
•  Public	
  gerrit:	
  hwp://gerrit.cloudera.org	
  
• All	
  code	
  reviews	
  happening	
  here	
  
•  Public	
  JIRA:	
  hwp://issues.cloudera.org	
  
• Includes	
  bugs	
  going	
  back	
  to	
  2013.	
  
•  dev@kudu.incubator.apache.org	
  
•  Apache	
  2.0	
  license	
  open	
  source,	
  part	
  of	
  ASF	
  Incubator	
  
•  ContribuEons	
  are	
  welcome	
  and	
  encouraged!	
  
38	
  
39	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
hwp://getkudu.io/	
  
@ApacheKudu	
  

Contenu connexe

Tendances

Introduction to Data Analyst Training
Introduction to Data Analyst TrainingIntroduction to Data Analyst Training
Introduction to Data Analyst TrainingCloudera, Inc.
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataOfir Manor
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)Todd Lipcon
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...Yahoo Developer Network
 
Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoopmarkgrover
 
Hadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise HadoopHadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise HadoopYifeng Jiang
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataRyan Bosshart
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupCaserta
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Cloudera, Inc.
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Data Con LA
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv larsgeorge
 
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platformcloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data PlatformRakuten Group, Inc.
 
NYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache HadoopNYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache Hadoopmarkgrover
 
Hortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideHortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideDouglas Bernardini
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/KuduChris George
 

Tendances (20)

Apache kudu
Apache kuduApache kudu
Apache kudu
 
Introducing Kudu
Introducing KuduIntroducing Kudu
Introducing Kudu
 
Introduction to Data Analyst Training
Introduction to Data Analyst TrainingIntroduction to Data Analyst Training
Introduction to Data Analyst Training
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
 
Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoop
 
Hadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise HadoopHadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise Hadoop
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
 
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platformcloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
 
NYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache HadoopNYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache Hadoop
 
SQOOP - RDBMS to Hadoop
SQOOP - RDBMS to HadoopSQOOP - RDBMS to Hadoop
SQOOP - RDBMS to Hadoop
 
Hortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideHortonworks.Cluster Config Guide
Hortonworks.Cluster Config Guide
 
Intro To Hadoop
Intro To HadoopIntro To Hadoop
Intro To Hadoop
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/Kudu
 

En vedette

Apache Hadoop の現在と将来(Hadoop / Spark Conference Japan 2016 キーノート講演資料)
Apache Hadoop の現在と将来(Hadoop / Spark Conference Japan 2016 キーノート講演資料)Apache Hadoop の現在と将来(Hadoop / Spark Conference Japan 2016 キーノート講演資料)
Apache Hadoop の現在と将来(Hadoop / Spark Conference Japan 2016 キーノート講演資料)Hadoop / Spark Conference Japan
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataCloudera, Inc.
 
Hadoop / Spark Conference Japan 2016 ご挨拶・Hadoopを取り巻く環境
Hadoop / Spark Conference Japan 2016 ご挨拶・Hadoopを取り巻く環境Hadoop / Spark Conference Japan 2016 ご挨拶・Hadoopを取り巻く環境
Hadoop / Spark Conference Japan 2016 ご挨拶・Hadoopを取り巻く環境Hadoop / Spark Conference Japan
 
Hadoop Conference Japan_2016 セッション「顧客事例から学んだ、 エンタープライズでの "マジな"Hadoop導入の勘所」
Hadoop Conference Japan_2016 セッション「顧客事例から学んだ、 エンタープライズでの "マジな"Hadoop導入の勘所」Hadoop Conference Japan_2016 セッション「顧客事例から学んだ、 エンタープライズでの "マジな"Hadoop導入の勘所」
Hadoop Conference Japan_2016 セッション「顧客事例から学んだ、 エンタープライズでの "マジな"Hadoop導入の勘所」オラクルエンジニア通信
 
ストリーミングアーキテクチャ: State から Flow へ - 2016/02/08 Hadoop / Spark Conference Japan ...
ストリーミングアーキテクチャ: State から Flow へ - 2016/02/08 Hadoop / Spark Conference Japan ...ストリーミングアーキテクチャ: State から Flow へ - 2016/02/08 Hadoop / Spark Conference Japan ...
ストリーミングアーキテクチャ: State から Flow へ - 2016/02/08 Hadoop / Spark Conference Japan ...MapR Technologies Japan
 
Spark 2.0 What's Next (Hadoop / Spark Conference Japan 2016 キーノート講演資料)
Spark 2.0 What's Next (Hadoop / Spark Conference Japan 2016 キーノート講演資料)Spark 2.0 What's Next (Hadoop / Spark Conference Japan 2016 キーノート講演資料)
Spark 2.0 What's Next (Hadoop / Spark Conference Japan 2016 キーノート講演資料)Hadoop / Spark Conference Japan
 
リクルートライフスタイルの考える ストリームデータの活かし方(Hadoop Spark Conference2016)
リクルートライフスタイルの考えるストリームデータの活かし方(Hadoop Spark Conference2016)リクルートライフスタイルの考えるストリームデータの活かし方(Hadoop Spark Conference2016)
リクルートライフスタイルの考える ストリームデータの活かし方(Hadoop Spark Conference2016)Atsushi Kurumada
 
データドリブン企業におけるHadoop基盤とETL -niconicoでの実践例-
データドリブン企業におけるHadoop基盤とETL -niconicoでの実践例-データドリブン企業におけるHadoop基盤とETL -niconicoでの実践例-
データドリブン企業におけるHadoop基盤とETL -niconicoでの実践例-Makoto SHIMURA
 
基幹業務もHadoopで!! -ローソンにおける店舗発注業務への Hadoop + Hive導入と その取り組みについて-
基幹業務もHadoopで!! -ローソンにおける店舗発注業務へのHadoop + Hive導入と その取り組みについて-基幹業務もHadoopで!! -ローソンにおける店舗発注業務へのHadoop + Hive導入と その取り組みについて-
基幹業務もHadoopで!! -ローソンにおける店舗発注業務への Hadoop + Hive導入と その取り組みについて-Keigo Suda
 
初めてのHadoopパッチ投稿 / How to Contribute to Hadoop (Cloudera World Tokyo 2014 LT講演資料)
初めてのHadoopパッチ投稿 / How to Contribute to Hadoop (Cloudera World Tokyo 2014 LT講演資料)初めてのHadoopパッチ投稿 / How to Contribute to Hadoop (Cloudera World Tokyo 2014 LT講演資料)
初めてのHadoopパッチ投稿 / How to Contribute to Hadoop (Cloudera World Tokyo 2014 LT講演資料)Hadoop / Spark Conference Japan
 
いろいろなストリーム処理プロダクトをベンチマークしてみた #hcj2016
いろいろなストリーム処理プロダクトをベンチマークしてみた #hcj2016いろいろなストリーム処理プロダクトをベンチマークしてみた #hcj2016
いろいろなストリーム処理プロダクトをベンチマークしてみた #hcj2016Yahoo!デベロッパーネットワーク
 
Path to 400M Members: LinkedIn’s Data Powered Journey
Path to 400M Members: LinkedIn’s Data Powered JourneyPath to 400M Members: LinkedIn’s Data Powered Journey
Path to 400M Members: LinkedIn’s Data Powered JourneyDataWorks Summit/Hadoop Summit
 
Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemDataWorks Summit/Hadoop Summit
 
Project Tungsten Bringing Spark Closer to Bare Meta (Hadoop / Spark Conferenc...
Project Tungsten Bringing Spark Closer to Bare Meta (Hadoop / Spark Conferenc...Project Tungsten Bringing Spark Closer to Bare Meta (Hadoop / Spark Conferenc...
Project Tungsten Bringing Spark Closer to Bare Meta (Hadoop / Spark Conferenc...Hadoop / Spark Conference Japan
 

En vedette (20)

Apache Hadoop の現在と将来(Hadoop / Spark Conference Japan 2016 キーノート講演資料)
Apache Hadoop の現在と将来(Hadoop / Spark Conference Japan 2016 キーノート講演資料)Apache Hadoop の現在と将来(Hadoop / Spark Conference Japan 2016 キーノート講演資料)
Apache Hadoop の現在と将来(Hadoop / Spark Conference Japan 2016 キーノート講演資料)
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
 
Hadoop / Spark Conference Japan 2016 ご挨拶・Hadoopを取り巻く環境
Hadoop / Spark Conference Japan 2016 ご挨拶・Hadoopを取り巻く環境Hadoop / Spark Conference Japan 2016 ご挨拶・Hadoopを取り巻く環境
Hadoop / Spark Conference Japan 2016 ご挨拶・Hadoopを取り巻く環境
 
Hadoop Conference Japan_2016 セッション「顧客事例から学んだ、 エンタープライズでの "マジな"Hadoop導入の勘所」
Hadoop Conference Japan_2016 セッション「顧客事例から学んだ、 エンタープライズでの "マジな"Hadoop導入の勘所」Hadoop Conference Japan_2016 セッション「顧客事例から学んだ、 エンタープライズでの "マジな"Hadoop導入の勘所」
Hadoop Conference Japan_2016 セッション「顧客事例から学んだ、 エンタープライズでの "マジな"Hadoop導入の勘所」
 
ストリーミングアーキテクチャ: State から Flow へ - 2016/02/08 Hadoop / Spark Conference Japan ...
ストリーミングアーキテクチャ: State から Flow へ - 2016/02/08 Hadoop / Spark Conference Japan ...ストリーミングアーキテクチャ: State から Flow へ - 2016/02/08 Hadoop / Spark Conference Japan ...
ストリーミングアーキテクチャ: State から Flow へ - 2016/02/08 Hadoop / Spark Conference Japan ...
 
Spark 2.0 What's Next (Hadoop / Spark Conference Japan 2016 キーノート講演資料)
Spark 2.0 What's Next (Hadoop / Spark Conference Japan 2016 キーノート講演資料)Spark 2.0 What's Next (Hadoop / Spark Conference Japan 2016 キーノート講演資料)
Spark 2.0 What's Next (Hadoop / Spark Conference Japan 2016 キーノート講演資料)
 
リクルートライフスタイルの考える ストリームデータの活かし方(Hadoop Spark Conference2016)
リクルートライフスタイルの考えるストリームデータの活かし方(Hadoop Spark Conference2016)リクルートライフスタイルの考えるストリームデータの活かし方(Hadoop Spark Conference2016)
リクルートライフスタイルの考える ストリームデータの活かし方(Hadoop Spark Conference2016)
 
データドリブン企業におけるHadoop基盤とETL -niconicoでの実践例-
データドリブン企業におけるHadoop基盤とETL -niconicoでの実践例-データドリブン企業におけるHadoop基盤とETL -niconicoでの実践例-
データドリブン企業におけるHadoop基盤とETL -niconicoでの実践例-
 
基幹業務もHadoopで!! -ローソンにおける店舗発注業務への Hadoop + Hive導入と その取り組みについて-
基幹業務もHadoopで!! -ローソンにおける店舗発注業務へのHadoop + Hive導入と その取り組みについて-基幹業務もHadoopで!! -ローソンにおける店舗発注業務へのHadoop + Hive導入と その取り組みについて-
基幹業務もHadoopで!! -ローソンにおける店舗発注業務への Hadoop + Hive導入と その取り組みについて-
 
Kudu demo
Kudu demoKudu demo
Kudu demo
 
Are you Kudu-ing me?!
Are you Kudu-ing me?!Are you Kudu-ing me?!
Are you Kudu-ing me?!
 
初めてのHadoopパッチ投稿 / How to Contribute to Hadoop (Cloudera World Tokyo 2014 LT講演資料)
初めてのHadoopパッチ投稿 / How to Contribute to Hadoop (Cloudera World Tokyo 2014 LT講演資料)初めてのHadoopパッチ投稿 / How to Contribute to Hadoop (Cloudera World Tokyo 2014 LT講演資料)
初めてのHadoopパッチ投稿 / How to Contribute to Hadoop (Cloudera World Tokyo 2014 LT講演資料)
 
いろいろなストリーム処理プロダクトをベンチマークしてみた #hcj2016
いろいろなストリーム処理プロダクトをベンチマークしてみた #hcj2016いろいろなストリーム処理プロダクトをベンチマークしてみた #hcj2016
いろいろなストリーム処理プロダクトをベンチマークしてみた #hcj2016
 
State of the art Stream Processing #hadoopreading
State of the art Stream Processing #hadoopreadingState of the art Stream Processing #hadoopreading
State of the art Stream Processing #hadoopreading
 
Protecting Enterprise Data In Apache Hadoop
Protecting Enterprise Data In Apache HadoopProtecting Enterprise Data In Apache Hadoop
Protecting Enterprise Data In Apache Hadoop
 
Path to 400M Members: LinkedIn’s Data Powered Journey
Path to 400M Members: LinkedIn’s Data Powered JourneyPath to 400M Members: LinkedIn’s Data Powered Journey
Path to 400M Members: LinkedIn’s Data Powered Journey
 
Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage Subsystem
 
Project Tungsten Bringing Spark Closer to Bare Meta (Hadoop / Spark Conferenc...
Project Tungsten Bringing Spark Closer to Bare Meta (Hadoop / Spark Conferenc...Project Tungsten Bringing Spark Closer to Bare Meta (Hadoop / Spark Conferenc...
Project Tungsten Bringing Spark Closer to Bare Meta (Hadoop / Spark Conferenc...
 
Apache NiFi 1.0 in Nutshell
Apache NiFi 1.0 in NutshellApache NiFi 1.0 in Nutshell
Apache NiFi 1.0 in Nutshell
 
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache AmbariStreamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari
 

Similaire à Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016講演資料)

DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataHakka Labs
 
Kudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast DataKudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Datamichaelguia
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataMike Percy
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016StampedeCon
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Cloudera, Inc.
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Data Con LA
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit
 
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in HadoopKudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoopjdcryans
 
Apache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentechApache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentechCloudera Japan
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Mladen Kovacevic
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Cloudera, Inc.
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015hadooparchbook
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...Cloudera, Inc.
 
LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BIDataWorks Summit
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
Apache Accumulo Overview
Apache Accumulo OverviewApache Accumulo Overview
Apache Accumulo OverviewBill Havanki
 

Similaire à Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016講演資料) (20)

DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
 
Kudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast DataKudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Data
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
 
SFHUG Kudu Talk
SFHUG Kudu TalkSFHUG Kudu Talk
SFHUG Kudu Talk
 
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in HadoopKudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
 
Apache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentechApache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentech
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
 
LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BI
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Apache Accumulo Overview
Apache Accumulo OverviewApache Accumulo Overview
Apache Accumulo Overview
 

Plus de Hadoop / Spark Conference Japan

機械学習、グラフ分析、SQLによるサイバー攻撃対策事例(金融業界)
機械学習、グラフ分析、SQLによるサイバー攻撃対策事例(金融業界)機械学習、グラフ分析、SQLによるサイバー攻撃対策事例(金融業界)
機械学習、グラフ分析、SQLによるサイバー攻撃対策事例(金融業界)Hadoop / Spark Conference Japan
 
マルチテナント Hadoop クラスタのためのモニタリング Best Practice
マルチテナント Hadoop クラスタのためのモニタリング Best Practiceマルチテナント Hadoop クラスタのためのモニタリング Best Practice
マルチテナント Hadoop クラスタのためのモニタリング Best PracticeHadoop / Spark Conference Japan
 
Hadoop / Spark Conference Japan 2019 ご挨拶・開催にあたって
Hadoop / Spark Conference Japan 2019 ご挨拶・開催にあたってHadoop / Spark Conference Japan 2019 ご挨拶・開催にあたって
Hadoop / Spark Conference Japan 2019 ご挨拶・開催にあたってHadoop / Spark Conference Japan
 
Sparkによる GISデータを題材とした時系列データ処理 (Hadoop / Spark Conference Japan 2016 講演資料)
Sparkによる GISデータを題材とした時系列データ処理 (Hadoop / Spark Conference Japan 2016 講演資料)Sparkによる GISデータを題材とした時系列データ処理 (Hadoop / Spark Conference Japan 2016 講演資料)
Sparkによる GISデータを題材とした時系列データ処理 (Hadoop / Spark Conference Japan 2016 講演資料)Hadoop / Spark Conference Japan
 
MapReduce/Spark/Tezのフェアな性能比較に向けて (Cloudera World Tokyo 2014 LT講演)
MapReduce/Spark/Tezのフェアな性能比較に向けて (Cloudera World Tokyo 2014 LT講演)MapReduce/Spark/Tezのフェアな性能比較に向けて (Cloudera World Tokyo 2014 LT講演)
MapReduce/Spark/Tezのフェアな性能比較に向けて (Cloudera World Tokyo 2014 LT講演)Hadoop / Spark Conference Japan
 
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)Hadoop / Spark Conference Japan
 
Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014)
Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014)Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014)
Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014)Hadoop / Spark Conference Japan
 
HadoopとRDBMSをシームレスに連携させるSmart SQL Processing (Hadoop Conference Japan 2014)
HadoopとRDBMSをシームレスに連携させるSmart SQL Processing (Hadoop Conference Japan 2014)HadoopとRDBMSをシームレスに連携させるSmart SQL Processing (Hadoop Conference Japan 2014)
HadoopとRDBMSをシームレスに連携させるSmart SQL Processing (Hadoop Conference Japan 2014)Hadoop / Spark Conference Japan
 

Plus de Hadoop / Spark Conference Japan (10)

機械学習、グラフ分析、SQLによるサイバー攻撃対策事例(金融業界)
機械学習、グラフ分析、SQLによるサイバー攻撃対策事例(金融業界)機械学習、グラフ分析、SQLによるサイバー攻撃対策事例(金融業界)
機械学習、グラフ分析、SQLによるサイバー攻撃対策事例(金融業界)
 
What makes Apache Spark?
What makes Apache Spark?What makes Apache Spark?
What makes Apache Spark?
 
マルチテナント Hadoop クラスタのためのモニタリング Best Practice
マルチテナント Hadoop クラスタのためのモニタリング Best Practiceマルチテナント Hadoop クラスタのためのモニタリング Best Practice
マルチテナント Hadoop クラスタのためのモニタリング Best Practice
 
Hadoop / Spark Conference Japan 2019 ご挨拶・開催にあたって
Hadoop / Spark Conference Japan 2019 ご挨拶・開催にあたってHadoop / Spark Conference Japan 2019 ご挨拶・開催にあたって
Hadoop / Spark Conference Japan 2019 ご挨拶・開催にあたって
 
Sparkによる GISデータを題材とした時系列データ処理 (Hadoop / Spark Conference Japan 2016 講演資料)
Sparkによる GISデータを題材とした時系列データ処理 (Hadoop / Spark Conference Japan 2016 講演資料)Sparkによる GISデータを題材とした時系列データ処理 (Hadoop / Spark Conference Japan 2016 講演資料)
Sparkによる GISデータを題材とした時系列データ処理 (Hadoop / Spark Conference Japan 2016 講演資料)
 
MapReduce/Spark/Tezのフェアな性能比較に向けて (Cloudera World Tokyo 2014 LT講演)
MapReduce/Spark/Tezのフェアな性能比較に向けて (Cloudera World Tokyo 2014 LT講演)MapReduce/Spark/Tezのフェアな性能比較に向けて (Cloudera World Tokyo 2014 LT講演)
MapReduce/Spark/Tezのフェアな性能比較に向けて (Cloudera World Tokyo 2014 LT講演)
 
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
 
Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014)
Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014)Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014)
Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014)
 
The Future of Apache Spark
The Future of Apache SparkThe Future of Apache Spark
The Future of Apache Spark
 
HadoopとRDBMSをシームレスに連携させるSmart SQL Processing (Hadoop Conference Japan 2014)
HadoopとRDBMSをシームレスに連携させるSmart SQL Processing (Hadoop Conference Japan 2014)HadoopとRDBMSをシームレスに連携させるSmart SQL Processing (Hadoop Conference Japan 2014)
HadoopとRDBMSをシームレスに連携させるSmart SQL Processing (Hadoop Conference Japan 2014)
 

Dernier

Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 

Dernier (20)

Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 

Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016講演資料)

  • 1. 1  ©  Cloudera,  Inc.  All  rights  reserved.   Todd  Lipcon  (Kudu  team  lead)  –  todd@cloudera.com     Twi$er:  @ApacheKudu  or  #kudu            @tlipcon     *  incubaEng   Apache  Kudu*:  Fast  AnalyEcs  on   Fast  Data  
  • 2. 2  ©  Cloudera,  Inc.  All  rights  reserved.   Why  Kudu?   2  
  • 3. 3  ©  Cloudera,  Inc.  All  rights  reserved.   Current  Storage  Landscape  in  Hadoop  Ecosystem   HDFS  (GFS)  excels  at:   •  Batch  ingest  only  (eg  hourly)   •  Efficiently  scanning  large  amounts   of  data  (analyEcs)   HBase  (BigTable)  excels  at:   •  Efficiently  finding  and  wriEng   individual  rows   •  Making  data  mutable     Gaps  exist  when  these  properEes   are  needed  simultaneously  
  • 4. 4  ©  Cloudera,  Inc.  All  rights  reserved.   •  High  throughput  for  big  scans   Goal:  Within  2x  of  Parquet     •  Low-­‐latency  for  short  accesses          Goal:  1ms  read/write  on  SSD     •  Database-­‐like  semanEcs  (iniEally  single-­‐row   ACID)   Kudu  Design  Goals  
  • 5. 5  ©  Cloudera,  Inc.  All  rights  reserved.   Changing  Hardware  landscape   •  Spinning  disk  -­‐>  solid  state  storage   • NAND  flash:  Up  to  450k  read  250k  write  iops,  about  2GB/sec  read  and  1.5GB/ sec  write  throughput,  at  a  price  of  less  than  $3/GB  and  dropping   • 3D  XPoint  memory  (1000x  faster  than  NAND,  cheaper  than  RAM)   •  RAM  is  cheaper  and  more  abundant:   • 64-­‐>128-­‐>256GB  over  last  few  years   •  Takeaway:  The  next  bo$leneck  is  CPU,  and  current  storage  systems  weren’t   designed  with  CPU  efficiency  in  mind.  
  • 6. 6  ©  Cloudera,  Inc.  All  rights  reserved.   What’s  Kudu?   6  
  • 7. 7  ©  Cloudera,  Inc.  All  rights  reserved.   Scalable  and  fast  tabular  storage   •  Scalable   • Tested  up  to  275  nodes  (~3PB  cluster)   • Designed  to  scale  to  1000s  of  nodes,  tens  of  PBs   •  Fast   • Millions  of  read/write  operaEons  per  second  across  cluster   • MulPple  GB/second  read  throughput  per  node   •  Tabular   • SQL-­‐like  schema:  finite  number  of  typed  columns  (unlike  HBase/Cassandra)   • Fast  ALTER  TABLE   • “NoSQL”  APIs:  Java/C++/Python    or  SQL  (Impala/Spark/etc)  
  • 8. 8  ©  Cloudera,  Inc.  All  rights  reserved.   Use  cases  and  architectures  
  • 9. 9  ©  Cloudera,  Inc.  All  rights  reserved.   Kudu  Use  Cases   Kudu  is  best  for  use  cases  requiring  a  simultaneous  combinaPon  of   sequenPal  and  random  reads  and  writes     ● Time  Series   ○  Examples:  Stream  market  data;  fraud  detecEon  &  prevenEon;  risk  monitoring   ○  Workload:  Insert,  updates,  scans,  lookups   ● Online  ReporPng   ○  Examples:  ODS   ○  Workload:  Inserts,  updates,  scans,  lookups  
  • 10. 10  ©  Cloudera,  Inc.  All  rights  reserved.   Real-­‐Time  AnalyEcs  in  Hadoop  Today   Fraud  DetecEon  in  the  Real  World  =  Storage  Complexity   ConsideraPons:   ●  How  do  I  handle  failure   during  this  process?     ●  How  oten  do  I  reorganize   data  streaming  in  into  a   format  appropriate  for   reporEng?     ●  When  reporEng,  how  do  I  see   data  that  has  not  yet  been   reorganized?     ●  How  do  I  ensure  that   important  jobs  aren’t   interrupted  by  maintenance?   New  ParEEon   Most  Recent  ParEEon   Historic  Data   HBase   Parquet   File   Have  we   accumulated   enough  data?   Reorganize   HBase  file   into  Parquet   •  Wait  for  running  operaEons  to  complete     •  Define  new  Impala  parEEon  referencing   the  newly  wriwen  Parquet  file   Incoming  Data   (Messaging   System)   ReporEng   Request   Impala  on  HDFS  
  • 11. 11  ©  Cloudera,  Inc.  All  rights  reserved.   Real-­‐Time  AnalyEcs  in  Hadoop  with  Kudu   Improvements:   ●  One  system  to  operate   ●  No  cron  jobs  or  background   processes   ●  Handle  late  arrivals  or  data   correcPons  with  ease   ●  New  data  available   immediately  for  analyPcs  or   operaPons     Historical  and  Real-­‐Eme   Data   Incoming  Data   (Messaging   System)   ReporEng   Request   Storage  in  Kudu  
  • 12. 12  ©  Cloudera,  Inc.  All  rights  reserved.   Xiaomi  use  case   •  World’s  4th  largest  smart-­‐phone  maker  (most  popular  in  China)   •  Gather  important  RPC  tracing  events  from  mobile  app  and  backend  service.     •  Service  monitoring  &  troubleshooEng  tool.   u  High  write  throughput   •  >10  Billion  records/day  and  growing   u  Query  latest  data  and  quick  response   •  IdenEfy  and  resolve  issues  quickly   u  Can  search  for  individual  records   •  Easy  for  troubleshooEng  
  • 13. 13  ©  Cloudera,  Inc.  All  rights  reserved.   Xiaomi  Big  Data  AnalyPcs  Pipeline   Before  Kudu •  Long  pipeline   high  latency(1  hour  ~  1  day),  data  conversion  pains   •  No  ordering   Log  arrival(storage)  order  not  exactly  logical  order   e.g.  read  2-­‐3  days  of  log  for  data  in  1  day
  • 14. 14  ©  Cloudera,  Inc.  All  rights  reserved.   Xiaomi  Big  Data  Analysis  Pipeline   Simplified  With  Kudu •  ETL  Pipeline(0~10s  latency)   Apps  that  need  to  prevent  backpressure  or  require  ETL     •  Direct  Pipeline(no  latency)   Apps  that  don’t  require  ETL  and  no  backpressure  issues     OLAP  scan   Side  table  lookup   Result  store  
  • 15. 15  ©  Cloudera,  Inc.  All  rights  reserved.   How  it  works   DistribuEon  and  fault  tolerance   15  
  • 16. 16  ©  Cloudera,  Inc.  All  rights  reserved.   Tables,  Tablets,  and  Tablet  Servers   •  Table  is  horizontally  parPPoned  into  tablets   • Range  or  hash  parEEoning   • PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS •  bucketNumber = hashCode(row[‘timestamp’]) % 100 •  Each  tablet  has  N  replicas  (3  or  5),  with  Ra^  consensus   • AutomaEc  fault  tolerance   • MTTR:  ~5  seconds   •  Tablet  servers  host  tablets  on  local  disk  drives   16  
  • 17. 17  ©  Cloudera,  Inc.  All  rights  reserved.   Metadata  and  the  Master   •  Replicated  master   • Acts  as  a  tablet  directory   • Acts  as  a  catalog  (which  tables  exist,  etc)   • Acts  as  a  load  balancer  (tracks  TS  liveness,  re-­‐replicates  under-­‐replicated   tablets)   •  Not  a  bo$leneck   • super  fast  in-­‐memory  lookups   17  
  • 18. 18  ©  Cloudera,  Inc.  All  rights  reserved.   How  it  works   Columnar  storage   18  
  • 19. 19  ©  Cloudera,  Inc.  All  rights  reserved.   Columnar  storage   {25059873,   22309487,   23059861,   23010982}   Tweet_id   {newsycbot,   RideImpala,   fastly,   llvmorg}   User_name   {1442865158,   1442828307,   1442865156,   1442865155}   Created_at   {Visual  exp…,   Introducing  ..,   Missing  July…,   LLVM  3.7….}   text  
  • 20. 20  ©  Cloudera,  Inc.  All  rights  reserved.   Columnar  storage   {25059873,   22309487,   23059861,   23010982}   Tweet_id   {newsycbot,   RideImpala,   fastly,   llvmorg}   User_name   {1442865158,   1442828307,   1442865156,   1442865155}   Created_at   {Visual  exp…,   Introducing  ..,   Missing  July…,   LLVM  3.7….}   text   SELECT  COUNT(*)  FROM  tweets  WHERE  user_name  =  ‘newsycbot’;   Only  read  1  column     1GB   2GB   1GB   200GB  
  • 21. 21  ©  Cloudera,  Inc.  All  rights  reserved.   Columnar  compression   {1442865158,   1442828307,   1442865156,   1442865155}   Created_at   Created_at   Diff(created_at)   1442865158   n/a   1442828307   -­‐36851   1442865156   36849   1442865155   -­‐1   64  bits  each   17  bits  each   •  Many  columns  can  compress  to   a  few  bits  per  row!   •  Especially:   • Timestamps   • Time  series  values   •  Massive  space  savings  and   throughput  increase!  
  • 22. 22  ©  Cloudera,  Inc.  All  rights  reserved.   Handling  inserts  and  updates   •  Inserts  go  to  an  in-­‐memory  row  store  (MemRowSet)   • Durable  due  to  write-­‐ahead  logging   • Later  flush  to  columnar  format  on  disk   •  Updates  go  to  in-­‐memory  “delta  store”   • Later  flush  to  “delta  files”  on  disk   • Eventually  “compact”  into  the  previously-­‐wriwen  columnar  data  files   •  Skipping  details  due  to  Eme  constraints   • available  in  other  slide  decks  online,  or  read  the  Kudu  whitepaper  to  learn   more!  
  • 23. 23  ©  Cloudera,  Inc.  All  rights  reserved.   IntegraEons  
  • 24. 24  ©  Cloudera,  Inc.  All  rights  reserved.   Spark  DataSource  integraEon   sqlContext.load("org.kududb.spark", Map("kudu.table" -> “foo”, "kudu.master" -> “master.example.com”)) .registerTempTable(“mytable”) df = sqlContext.sql( “select col_a, col_b from mytable “ + “where col_c = 123”) Available  in  next  release  (Kudu  0.7.0)  
  • 25. 25  ©  Cloudera,  Inc.  All  rights  reserved.   Impala  integraEon   • CREATE TABLE … DISTRIBUTE BY HASH(col1) INTO 16 BUCKETS AS SELECT … FROM … • INSERT/UPDATE/DELETE • Not an Impala user? Working on more integrations… • Apache Drill • Apache Hive • Maybe Presto?
  • 26. 26  ©  Cloudera,  Inc.  All  rights  reserved.   MapReduce  integraEon   • MulE-­‐framework  cluster  (MR  +  HDFS  +  Kudu  on  the  same  disks)   • KuduTableInputFormat  /  KuduTableOutputFormat   • Support  for  pushing  predicates,  column  projecEons,  etc  
  • 27. 27  ©  Cloudera,  Inc.  All  rights  reserved.   Performance   27  
  • 28. 28  ©  Cloudera,  Inc.  All  rights  reserved.   TPC-­‐H  (AnalyEcs  benchmark)   •  75  server  cluster   • 12  (spinning)  disk  each,  enough  RAM  to  fit  dataset   • TPC-­‐H  Scale  Factor  100  (100GB)   •  Example  query:   •  SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue FROM customer, orders, lineitem, supplier, nation, region WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'ASIA' AND o_orderdate >= date '1994-01-01' AND o_orderdate < '1995-01-01’ GROUP BY n_name ORDER BY revenue desc; 28  
  • 29. 29  ©  Cloudera,  Inc.  All  rights  reserved.   -­‐  Kudu  outperforms  Parquet  by  31%  (geometric  mean)  for  RAM-­‐resident  data  
  • 30. 30  ©  Cloudera,  Inc.  All  rights  reserved.   Versus  other  NoSQL  storage   •  Phoenix:  SQL  layer  on  HBase   •  10  node  cluster  (9  worker,  1  master)   •  TPC-­‐H  LINEITEM  table  only  (6B  rows)   30   2152   219   76   131   0.04   1918   13.2   1.7   0.7   0.15   155   9.3   1.4   1.5   1.37   0.01   0.1   1   10   100   1000   10000   Load   TPCH  Q1   COUNT(*)   COUNT(*)   WHERE…   single-­‐row   lookup   Time  (sec)   Phoenix   Kudu   Parquet  
  • 31. 31  ©  Cloudera,  Inc.  All  rights  reserved.   Xiaomi  benchmark   •  6  real  queries  from  applicaEon  trace  analysis  applicaEon   •  Q1:  SELECT  COUNT(*)   •  Q2:  SELECT  hour,  COUNT(*)  WHERE  module  =  ‘foo’  GROUP  BY  HOUR   •  Q3:  SELECT  hour,  COUNT(DISTINCT  uid)  WHERE  module  =  ‘foo’  AND  app=‘bar’   GROUP  BY  HOUR   •  Q4:  analyEcs  on  RPC  success  rate  over  all  data  for  one  app   •  Q5:  same  as  Q4,  but  filter  by  Eme  range   •  Q6:  SELECT  *  WHERE  app  =  …  AND  uid  =  …  ORDER  BY  ts  LIMIT  30  OFFSET  30  
  • 32. 32  ©  Cloudera,  Inc.  All  rights  reserved.   Xiaomi:  Benchmark  Results     1.4     2.0     2.3     3.1     1.3     0.9    1.3     2.8     4.0     5.7     7.5     16.7     Q1   Q2   Q3   Q4   Q5   Q6   kudu   parquet   Query  latency:   *  HDFS  parquet  file  replicaEon  =  3  ,  kudu  table  replicaEon  =  3   *  Each  query  run  5  Emes  then  take  average  
  • 33. 33  ©  Cloudera,  Inc.  All  rights  reserved.   What  about  NoSQL-­‐style  random  access?  (YCSB)   •  YCSB  0.5.0-­‐snapshot   •  10  node  cluster   (9  worker,  1  master)   •  100M  row  data  set   •  10M  operaEons  each   workload   33  
  • 34. 34  ©  Cloudera,  Inc.  All  rights  reserved.   Ge…ng  started   34  
  • 35. 35  ©  Cloudera,  Inc.  All  rights  reserved.   Project  status   •  Open  source  beta  released  in  September   •  First  update  version  0.6.0  released  end  of  November   • Usable  for  many  applicaEons   • Missing  many  features:  security,  backup,  etc.   • Have  not  experienced  data  loss,  reasonably  stable  (almost  no  crashes  reported)   • SEll  requires  some  expert  assistance,  and  you’ll  probably  find  some  bugs   •  Joining  the  Apache  So^ware  FoundaPon  Incubator   • First  ASF  release  (0.7.0)  release  candidate  tonight!    
  • 36. 36  ©  Cloudera,  Inc.  All  rights  reserved.   Kudu  Community   Your  company  here!  
  • 37. 37  ©  Cloudera,  Inc.  All  rights  reserved.   Ge…ng  started  as  a  user   •  hwp://getkudu.io   •  user-­‐subscribe@kudu.incubator.apache.org   •  hwp://getkudu-­‐slack.herokuapp.com/   •  Quickstart  VM   • Easiest  way  to  get  started   • Impala  and  Kudu  in  an  easy-­‐to-­‐install  VM   •  CSD  and  Parcels   • For  installaEon  on  a  Cloudera  Manager-­‐managed  cluster   37  
  • 38. 38  ©  Cloudera,  Inc.  All  rights  reserved.   Ge…ng  started  as  a  developer   •  hwp://github.com/apache/incubator-­‐kudu/   • All  commits  go  here  first   •  Public  gerrit:  hwp://gerrit.cloudera.org   • All  code  reviews  happening  here   •  Public  JIRA:  hwp://issues.cloudera.org   • Includes  bugs  going  back  to  2013.   •  dev@kudu.incubator.apache.org   •  Apache  2.0  license  open  source,  part  of  ASF  Incubator   •  ContribuEons  are  welcome  and  encouraged!   38  
  • 39. 39  ©  Cloudera,  Inc.  All  rights  reserved.   hwp://getkudu.io/   @ApacheKudu