10/11/11 © MapR Confidential

MapR, Implications for Integration
CMU – September 2011
  
Outline
•  MapR system overview
•  Map-reduce review
•  MapR architecture
•  Performance Results
•  Map-reduce on MapR
•  Architectural implications
•  Search indexing / deployment
•  EM algorithm for machine learning
•  … and more …
  
Map-Reduce
[Diagram: Input → Map functions → Shuffle → Reduce functions → Output files, with two Local Disk boxes; stages labeled Input, Shuffle, Output]
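
A minimal single-process sketch of the three stages named in the diagram (map, shuffle, reduce), using word count as the example. The function names `map_fn`, `shuffle`, `reduce_fn`, and `run_job` are illustrative assumptions and are not part of Hadoop or MapR.

```python
from collections import defaultdict


def map_fn(line):
    """Map function: emit (word, 1) for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Reduce function: combine all values seen for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over in-memory input."""
    mapped = [pair for line in lines for pair in map_fn(line)]
    grouped = shuffle(mapped)
    return dict(reduce_fn(k, v) for k, v in grouped.items())


counts = run_job(["map reduce map", "shuffle reduce"])
```

In a real cluster the shuffle moves data between machines, which is the stage the following slides single out as a bottleneck.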
  
Bottlenecks and Issues
•  Read-only files
•  Many copies in I/O path
•  Shuffle based on HTTP
•  Can’t use new technologies
•  Eats file descriptors
•  Spills go to local file space
•  Bad for skewed distribution of sizes
  
MapR Areas of Development
[Diagram: Map, Reduce, Storage, Services, Ecosystem, HBase, Management]
  
MapR Improvements
•  Faster file system
    •  Fewer copies
    •  Multiple NICs
    •  No file descriptor or page-buf competition
•  Faster map-reduce
    •  Uses distributed file system
    •  Direct RPC to receiver
    •  Very wide merges
  
MapR Innovations
•  Volumes
    •  Distributed management
    •  Data placement
•  Read/write random access file system
    •  Allows distributed meta-data
    •  Improved scaling
    •  Enables NFS access
•  Application-level NIC bonding
•  Transactionally correct snapshots and mirrors
  
MapR's Containers
•  Each container contains
    •  Directories & files
    •  Data blocks
•  Replicated on servers
•  No need to manage directly
Files/directories are sharded into blocks, which are placed into mini NNs (containers) on disks
Containers are 16-32 GB segments of disk, placed on nodes
  
MapR's Containers
•  Each container has a replication chain
•  Updates are transactional
•  Failures are handled by rearranging replication
  
Container locations and replication
[Diagram: CLDB listing per-container replication chains (N1, N2), (N3, N2), (N1, N2), (N1, N3), (N3, N2), alongside nodes N1, N2, N3]
Container location database (CLDB) keeps track of nodes hosting each container and replication chain order
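
One way to picture what the CLDB tracks is a map from container id to an ordered list of nodes. The sketch below is purely illustrative: the container ids, the `handle_node_failure` helper, and the policy of appending a replacement at the tail of the chain are assumptions, not MapR's actual implementation. Only the five chains are taken from the diagram.

```python
def handle_node_failure(cldb, failed, live_nodes, replication=2):
    """Return a new container -> chain map with `failed` removed.

    Chains that fall below `replication` get a replacement node appended
    at the tail (an assumed policy, chosen only for illustration).
    """
    repaired = {}
    for container, chain in cldb.items():
        new_chain = [node for node in chain if node != failed]
        for candidate in live_nodes:
            if len(new_chain) >= replication:
                break
            if candidate != failed and candidate not in new_chain:
                new_chain.append(candidate)
        repaired[container] = new_chain
    return repaired


# Replication chains as drawn on the slide; container ids 1..5 are hypothetical.
cldb = {
    1: ["N1", "N2"],
    2: ["N3", "N2"],
    3: ["N1", "N2"],
    4: ["N1", "N3"],
    5: ["N3", "N2"],
}
after = handle_node_failure(cldb, failed="N2", live_nodes=["N1", "N3"])
```

The point of the sketch is that recovery only edits chain metadata per container; no per-file or per-block bookkeeping is involved.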
  
MapR Scaling
Containers represent 16 - 32GB of data
•  Each can hold up to 1 Billion files and directories
•  100M containers = ~ 2 Exabytes (a very large cluster)
250 bytes DRAM to cache a container
•  25GB to cache all containers for 2EB cluster
    -  But not necessary, can page to disk
•  Typical large 10PB cluster needs 2GB
Container-reports are 100x - 1000x < HDFS block-reports
•  Serve 100x more data-nodes
•  Increase container size to 64G to serve 4EB cluster
•  Map/reduce not affected
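
The two headline figures on this slide can be checked with simple arithmetic. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 EB = 10^18 bytes); the container count, size range, and 250-byte cache cost come from the slide.

```python
GB = 10**9
EB = 10**18

containers = 100_000_000          # 100M containers, from the slide
cache_bytes_per_container = 250   # DRAM per cached container, from the slide

# Total data for the quoted 16-32 GB container size range.
low_eb = containers * 16 * GB / EB
high_eb = containers * 32 * GB / EB

# DRAM needed to cache the location of every container.
cache_gb = containers * cache_bytes_per_container / GB

print(f"data: {low_eb:.1f}-{high_eb:.1f} EB, cache: {cache_gb:.0f} GB")
```

The quoted "~ 2 Exabytes" falls inside the computed range, and the cache figure matches the 25GB on the slide.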
  
MapR's Streaming Performance
[Chart: read and write throughput in MB per sec (axis 0 to 2250) for Hardware, MapR, and Hadoop; two panels: 11 x 7200rpm SATA and 11 x 15Krpm SAS. Higher is better.]
Tests: i. 16 streams x 120GB   ii. 2000 streams x 1GB
  
Terasort on MapR
[Chart: elapsed time (mins) for MapR and Hadoop; two panels: 1.0 TB (axis 0 to 60) and 3.5 TB (axis 0 to 300). Lower is better.]
10+1 nodes: 8 core, 24GB DRAM, 11 x 1TB SATA 7200 rpm
  
HBase on MapR
[Chart: records per second (axis 0 to 25000) for MapR and Apache, under Zipfian and Uniform. Higher is better.]
YCSB Random Read with 1 billion 1K records
10+1 node cluster: 8 core, 24GB DRAM, 11 x 1TB 7200 RPM
  
Small Files (Apache Hadoop, 10 nodes)
[Chart: rate (files/sec) against # of files (m); series: Out of box, Tuned]
Op:
-  create file
-  write 100 bytes
-  close
Notes:
-  NN not replicated
-  NN uses 20G DRAM
-  DN uses 2G DRAM
  
MUCH faster for some operations
[Chart: create rate against # of files (millions)]
Same 10 nodes …
  
What MapR is not
•  Volumes != federation
    •  MapR supports > 10,000 volumes all with independent placement and defaults
    •  Volumes support snapshots and mirroring
•  NFS != FUSE
    •  Checksum and compress at gateway
    •  IP fail-over
    •  Read/write/update semantics at full speed
•  MapR != maprfs
  
New Capabilities
  
Alternative NFS mounting models
•  Export to the world
    •  NFS gateway runs on selected gateway hosts
•  Local server
    •  NFS gateway runs on local host
    •  Enables local compression and checksumming
•  Export to self
    •  NFS gateway runs on all data nodes, mounted from localhost
  
Export to the world
[Diagram: four NFS Server boxes and one NFS Client]
  
Local server
[Diagram labels: Client, Application, NFS Server, Cluster Nodes]
  
Universal export to self
[Diagram labels: Cluster Node, Task, NFS Server, Cluster Nodes]
  
Nodes are identical
[Diagram: three Cluster Node boxes, each labeled Task and NFS Server]
  
Application architecture
•  High performance map-reduce is nice
•  But algorithmic flexibility is even nicer
  
Sharded text Indexing
[Diagram: Input documents → Map → Reducer → Local disk → Clustered index storage → Local disk → Search Engine]
•  Map: Assign documents to shards
•  Reducer: Index text to local disk and then copy index to distributed file store
•  Search Engine: Copy to local disk typically required before index can be loaded
  
Sharded text indexing
•  Mapper assigns document to shard
    •  Shard is usually hash of document id
•  Reducer indexes all documents for a shard
    •  Indexes created on local disk
    •  On success, copy index to DFS
    •  On failure, delete local files
•  Must avoid directory collisions
    •  can’t use shard id!
•  Must manage and reclaim local disk space
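
A minimal sketch of the two mechanics this slide mentions: assigning a document to a shard by hashing its id, and naming the local index directory so that a retried reducer does not collide with an earlier attempt. The names `shard_for`, `local_index_dir`, and the directory layout are hypothetical. MD5 is used here only because it is stable across processes (Python's built-in `hash` for strings is randomized per process); the slides do not say which hash is used.

```python
import hashlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Stable shard number derived from a hash of the document id."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def local_index_dir(base: str, shard: int, attempt_id: str) -> str:
    """Per-attempt directory, so retried reducers never share a path.

    Using the shard id alone would collide when a failed attempt and its
    retry run on the same machine.
    """
    return f"{base}/shard-{shard:05d}/{attempt_id}"


shard = shard_for("doc-42", 16)
first_try = local_index_dir("/tmp/index", 3, "attempt_0")
retry = local_index_dir("/tmp/index", 3, "attempt_1")
```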
  
Conventional data flow
[Diagram: Input documents → Map → Reducer → Local disk → Clustered index storage → Local disk → Search Engine]
•  Failure of a reducer causes garbage to accumulate in the local disk
•  Failure of search engine requires another download of the index from clustered storage.
  
Simplified NFS data flows
[Diagram: Input documents → Map → Reducer → Clustered index storage → Search Engine]
•  Index to task work directory via NFS
•  Failure of a reducer is cleaned up by map-reduce framework
•  Search engine reads mirrored index directly.
  
Simplified NFS data flows
[Diagram: Input documents → Map → Reducer → Mirrors → two Search Engine boxes]
•  Mirroring allows exact placement of index data
•  Arbitrary levels of replication also possible
  
How about another one?
  
K-means
•  Classic E-M based algorithm
•  Given cluster centroids,
    •  Assign each data point to nearest centroid
    •  Accumulate new centroids
    •  Rinse, lather, repeat
  
K-means, the movie
[Diagram: Input → Assign to nearest centroid → Aggregate new centroids → Centroids, which feed back into the assignment step]
  
But …
  
Parallel Stochastic Gradient Descent
[Diagram: Input → Train sub model → Average models → Model]
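
The flow in this diagram (train a sub-model per input split, then average the models) can be sketched on a toy problem. Everything below is an assumption made for illustration: the linear model `y = w*x + b`, the noise-free data, the learning rate, and the function names `train_sub_model` and `average_models`. The slides do not specify a model or hyperparameters.

```python
def train_sub_model(points, epochs=2000, lr=0.1):
    """Plain SGD for y = w*x + b on one shard of (x, y) pairs."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in points:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b


def average_models(models):
    """Average the parameters of independently trained sub-models."""
    n = len(models)
    return (sum(m[0] for m in models) / n, sum(m[1] for m in models) / n)


# Toy, noise-free data from y = 2x + 1, split round-robin into 3 shards.
data = [(i / 10, 2 * (i / 10) + 1) for i in range(11)]
shards = [data[k::3] for k in range(3)]
w, b = average_models([train_sub_model(s) for s in shards])
```

Each call to `train_sub_model` touches only its own shard, so the calls could run as independent map tasks with the averaging done in a single reducer.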
  
Variational Dirichlet Assignment
[Diagram: Input → Gather sufficient statistics → Update model → Model]
  
Old tricks, new dogs
•  Mapper
    •  Assign point to cluster
    •  Emit cluster id, (1, point)
•  Combiner and reducer
    •  Sum counts, weighted sum of points
    •  Emit cluster id, (n, sum/n)
•  Output to HDFS
[Annotations: Read from HDFS to local disk by distributed cache; Written by map-reduce; Read from local disk from distributed cache]
  
Old tricks, new dogs
•  Mapper
    •  Assign point to cluster
    •  Emit cluster id, (1, point)
•  Combiner and reducer
    •  Sum counts, weighted sum of points
    •  Emit cluster id, (n, sum/n)
•  Output to MapR FS (in place of HDFS)
[Annotations: Read from NFS; Written by map-reduce]
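
A minimal in-memory sketch of one k-means iteration in the mapper / combiner-and-reducer shape listed above. The function names (`nearest`, `mapper`, `combine`, `kmeans_step`) and the choice to keep the old centroid for an empty cluster are assumptions for illustration. A real job would read the centroids from the file system instead of taking them as an argument.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def mapper(point, centroids):
    """Emit cluster id, (1, point)."""
    return nearest(point, centroids), (1, point)


def combine(values):
    """Sum counts and the count-weighted sum of points; return (n, sum/n)."""
    n = sum(count for count, _ in values)
    dims = len(values[0][1])
    total = [sum(count * mean[d] for count, mean in values) for d in range(dims)]
    return n, tuple(t / n for t in total)


def kmeans_step(points, centroids):
    """One iteration: map every point, group by cluster id, reduce per cluster."""
    groups = defaultdict(list)
    for point in points:
        cluster, value = mapper(point, centroids)
        groups[cluster].append(value)
    # Assumption: clusters that receive no points keep their old centroid.
    return [
        combine(groups[k])[1] if k in groups else centroids[k]
        for k in range(len(centroids))
    ]


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0)]
centroids = kmeans_step(points, [(1.0, 1.0), (9.0, 9.0)])
```

Because `combine` takes and returns (count, mean) pairs, the same function can serve as both combiner and reducer, which is why the slide lists them together.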
  
Poor man’s Pregel
•  Mapper
    while not done:
        read and accumulate input models
        for each input:
            accumulate model
        write model
        synchronize
        reset input format
    emit summary
•  Lines in bold can use conventional I/O via NFS
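
The loop above can be sketched as a single-process simulation in which a temporary directory stands in for an NFS-mounted cluster path, so models are exchanged with ordinary file reads and writes. The model (a running estimate of the mean), the `step-N/worker-K.json` file layout, and all function names are hypothetical choices for illustration; running the workers one after another inside each step stands in for the synchronize barrier.

```python
import json
import os
import tempfile


def read_models(step_dir):
    """Read every model file that workers wrote in one superstep."""
    models = []
    for name in sorted(os.listdir(step_dir)):
        with open(os.path.join(step_dir, name)) as f:
            models.append(json.load(f)["mean"])
    return models


def run_worker(shared_dir, worker_id, inputs, step, lr=0.5):
    """One pass through the loop body for one worker."""
    # read and accumulate input models (written in the previous superstep)
    prev_dir = os.path.join(shared_dir, f"step-{step - 1}")
    models = read_models(prev_dir) if os.path.isdir(prev_dir) else [0.0]
    mean = sum(models) / len(models)
    # for each input: accumulate model
    for x in inputs:
        mean += lr * (x - mean)
    # write model using plain file I/O
    out_dir = os.path.join(shared_dir, f"step-{step}")
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, f"worker-{worker_id}.json"), "w") as f:
        json.dump({"mean": mean}, f)


def run(shards, steps):
    """Run `steps` supersteps (steps >= 1) and return the averaged final model."""
    with tempfile.TemporaryDirectory() as shared_dir:  # stands in for an NFS mount
        for step in range(steps):  # while not done
            for worker_id, inputs in enumerate(shards):
                run_worker(shared_dir, worker_id, inputs, step)
            # synchronize: all workers finish a step before the next one starts
        final = read_models(os.path.join(shared_dir, f"step-{steps - 1}"))
    return sum(final) / len(final)  # emit summary


summary = run([[1.0, 1.0], [3.0, 3.0]], steps=20)
```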

Click modeling architecture
[Diagram: Input → Feature extraction and down sampling → Data join (with Side-data) → Sequential SGD Learning; the first stages are labeled Map-reduce, and the hand-off to learning is labeled "Now via NFS"]
  
Click modeling architecture
[Diagram: Input → Feature extraction and down sampling (Map-reduce) → Data join with Side-data (Map-reduce) → four Sequential SGD Learning boxes; annotation: Map-reduce cooperates with NFS]
  
And another…
  
Hybrid model flow
[Diagram: Feature extraction and down sampling (Map-reduce) → SVD (PageRank) (spectral), labeled "??" → Down stream modeling (Map-reduce) → Deployed Model]
  
Hybrid model flow
[Diagram: Feature extraction and down sampling (Map-reduce) → SVD (PageRank) (spectral), labeled Sequential → Down stream modeling → Deployed Model]
  
And visualization…
  
Trivial visualization interface
•  Map-reduce output is visible via NFS
•  Legacy visualization just works
    $ R
    > x <- read.csv("/mapr/my.cluster/home/ted/data/foo.out")
    > plot(error ~ t, x)
    > q(save='n')

Conclusions
•  We used to know all this
•  Tab completion used to work
•  5 years of work-arounds have clouded our memories
•  We just have to remember the future
  

Contenu connexe

Tendances

Getting Started with HBase
Getting Started with HBaseGetting Started with HBase
Getting Started with HBaseCarol McDonald
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesDataWorks Summit/Hadoop Summit
 
Hadoop hbase mapreduce
Hadoop hbase mapreduceHadoop hbase mapreduce
Hadoop hbase mapreduceFARUK BERKSÖZ
 
Free Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseFree Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseMapR Technologies
 
Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleMapR Technologies
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBaseCarol McDonald
 
Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseNick Dimiduk
 
Efficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoEfficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoHyunsik Choi
 
Quick Introduction to Apache Tez
Quick Introduction to Apache TezQuick Introduction to Apache Tez
Quick Introduction to Apache TezGetInData
 
Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemDataWorks Summit/Hadoop Summit
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with YarnDavid Kaiser
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
 
a Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application Resourcesa Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application ResourcesDataWorks Summit
 

Tendances (20)

Getting Started with HBase
Getting Started with HBaseGetting Started with HBase
Getting Started with HBase
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
 
Using Apache Drill
Using Apache DrillUsing Apache Drill
Using Apache Drill
 
Hadoop hbase mapreduce
Hadoop hbase mapreduceHadoop hbase mapreduce
Hadoop hbase mapreduce
 
Free Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseFree Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBase
 
Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is Possible
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBase
 
Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBase
 
What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS
 
Efficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoEfficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajo
 
Quick Introduction to Apache Tez
Quick Introduction to Apache TezQuick Introduction to Apache Tez
Quick Introduction to Apache Tez
 
Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage Subsystem
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Apache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real TimeApache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real Time
 
Introduction to Apache Drill
Introduction to Apache DrillIntroduction to Apache Drill
Introduction to Apache Drill
 
a Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application Resourcesa Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application Resources
 

En vedette

En vedette (6)

New directions for mahout
New directions for mahoutNew directions for mahout
New directions for mahout
 
Summit EU Machine Learning
Summit EU Machine LearningSummit EU Machine Learning
Summit EU Machine Learning
 
Drill Lightning London Big Data
Drill Lightning London Big DataDrill Lightning London Big Data
Drill Lightning London Big Data
 
News From Mahout
News From MahoutNews From Mahout
News From Mahout
 
Drill at the Chicago Hug
Drill at the Chicago HugDrill at the Chicago Hug
Drill at the Chicago Hug
 
Dealing with an Upside Down Internet
Dealing with an Upside Down InternetDealing with an Upside Down Internet
Dealing with an Upside Down Internet
 

Similaire à Cmu 2011 09.pptx

Cmu-2011-09.pptx
Cmu-2011-09.pptxCmu-2011-09.pptx
Cmu-2011-09.pptxTed Dunning
 
Ted Dunning - Whither Hadoop
Ted Dunning - Whither HadoopTed Dunning - Whither Hadoop
Ted Dunning - Whither HadoopEd Kohlwey
 
MapR, Implications for Integration
MapR, Implications for IntegrationMapR, Implications for Integration
MapR, Implications for Integrationtrihug
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Seattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRSeattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRclive boulton
 
Putting Wings on the Elephant
Putting Wings on the ElephantPutting Wings on the Elephant
Putting Wings on the ElephantDataWorks Summit
 
Handling Increasing Load and Reducing Costs Using Aerospike NoSQL Database - ...
Handling Increasing Load and Reducing Costs Using Aerospike NoSQL Database - ...Handling Increasing Load and Reducing Costs Using Aerospike NoSQL Database - ...
Handling Increasing Load and Reducing Costs Using Aerospike NoSQL Database - ...Aerospike
 
Lawrence Livermore Labs talk 2011
Lawrence Livermore Labs talk 2011Lawrence Livermore Labs talk 2011
Lawrence Livermore Labs talk 2011MapR Technologies
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Shift into High Gear: Dramatically Improve Hadoop & NoSQL Performance
Shift into High Gear: Dramatically Improve Hadoop & NoSQL PerformanceShift into High Gear: Dramatically Improve Hadoop & NoSQL Performance
Shift into High Gear: Dramatically Improve Hadoop & NoSQL PerformanceMapR Technologies
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 
MapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APIMapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APImcsrivas
 
HBase, crazy dances on the elephant back.
HBase, crazy dances on the elephant back.HBase, crazy dances on the elephant back.
HBase, crazy dances on the elephant back.Roman Nikitchenko
 
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...ScyllaDB
 
Clemson: Solving the HPC Data Deluge
Clemson: Solving the HPC Data DelugeClemson: Solving the HPC Data Deluge
Clemson: Solving the HPC Data Delugeinside-BigData.com
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceDatabricks
 
DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsDX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsAMD Developer Central
 

Similaire à Cmu 2011 09.pptx (20)

Cmu-2011-09.pptx
Cmu-2011-09.pptxCmu-2011-09.pptx
Cmu-2011-09.pptx
 
Ted Dunning - Whither Hadoop
Ted Dunning - Whither HadoopTed Dunning - Whither Hadoop
Ted Dunning - Whither Hadoop
 
MapR, Implications for Integration
MapR, Implications for IntegrationMapR, Implications for Integration
MapR, Implications for Integration
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Seattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRSeattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapR
 
Putting Wings on the Elephant
Putting Wings on the ElephantPutting Wings on the Elephant
Putting Wings on the Elephant
 
Handling Increasing Load and Reducing Costs Using Aerospike NoSQL Database - ...
Handling Increasing Load and Reducing Costs Using Aerospike NoSQL Database - ...Handling Increasing Load and Reducing Costs Using Aerospike NoSQL Database - ...
Handling Increasing Load and Reducing Costs Using Aerospike NoSQL Database - ...
 
Lawrence Livermore Labs talk 2011
Lawrence Livermore Labs talk 2011Lawrence Livermore Labs talk 2011
Lawrence Livermore Labs talk 2011
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Shift into High Gear: Dramatically Improve Hadoop & NoSQL Performance
Shift into High Gear: Dramatically Improve Hadoop & NoSQL PerformanceShift into High Gear: Dramatically Improve Hadoop & NoSQL Performance
Shift into High Gear: Dramatically Improve Hadoop & NoSQL Performance
 
Llnl talk
Llnl talkLlnl talk
Llnl talk
 
13c planning
13c planning13c planning
13c planning
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
HBase with MapR
HBase with MapRHBase with MapR
HBase with MapR
 
MapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APIMapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase API
 
HBase, crazy dances on the elephant back.
HBase, crazy dances on the elephant back.HBase, crazy dances on the elephant back.
HBase, crazy dances on the elephant back.
 
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
 
Clemson: Solving the HPC Data Deluge
Clemson: Solving the HPC Data DelugeClemson: Solving the HPC Data Deluge
Clemson: Solving the HPC Data Deluge
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
 
DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsDX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
 

Plus de MapR Technologies

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscapeMapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationMapR Technologies
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataMapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureMapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsMapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Cmu 2011 09.pptx

  • 1. 10/11/11 © MapR Confidential. MapR, Implications for Integration. CMU – September 2011
  • 2. Outline
      • MapR system overview
      • Map-reduce review
      • MapR architecture
      • Performance results
      • Map-reduce on MapR
      • Architectural implications
      • Search indexing / deployment
      • EM algorithm for machine learning
      • … and more …
  • 3. Map-Reduce
      [Figure: map-reduce data flow; recoverable labels: Input, Shuffle, Output]
  • 4. Bottlenecks and Issues
      • Read-only files
      • Many copies in I/O path
      • Shuffle based on HTTP
      • Can't use new technologies
      • Eats file descriptors
      • Spills go to local file space
      • Bad for skewed distribution of sizes
  • 5. MapR Areas of Development
      [Diagram labels: Map, Reduce, Storage, Services, Ecosystem, HBase, Management]
  • 6. MapR Improvements
      • Faster file system
      • Fewer copies
      • Multiple NICs
      • No file descriptor or page-buf competition
      • Faster map-reduce
      • Uses distributed file system
      • Direct RPC to receiver
      • Very wide merges
  • 7. MapR Innovations
      • Volumes
      • Distributed management
      • Data placement
      • Read/write random access file system
      • Allows distributed meta-data
      • Improved scaling
      • Enables NFS access
      • Application-level NIC bonding
      • Transactionally correct snapshots and mirrors
  • 8. MapR's Containers
      • Each container contains
      • Directories & files
      • Data blocks
      • Replicated on servers
      • No need to manage directly
      Files/directories are sharded into blocks, which are placed into mini NNs (containers) on disks.
      Containers are 16-32 GB segments of disk, placed on nodes.
  • 9. MapR's Containers
      • Each container has a replication chain
      • Updates are transactional
      • Failures are handled by rearranging replication
  • 10. Container locations and replication
      [Diagram: CLDB listing replication chains (N1, N2), (N3, N2), (N1, N2), (N1, N3), (N3, N2) for containers hosted on nodes N1, N2, N3]
      Container location database (CLDB) keeps track of nodes hosting each container and replication chain order.
  • 11. MapR Scaling
      Containers represent 16 - 32GB of data
      • Each can hold up to 1 billion files and directories
      • 100M containers = ~2 exabytes (a very large cluster)
      250 bytes DRAM to cache a container
      • 25GB to cache all containers for 2EB cluster
          - But not necessary, can page to disk
      • Typical large 10PB cluster needs 2GB
      Container-reports are 100x - 1000x < HDFS block-reports
      • Serve 100x more data-nodes
      • Increase container size to 64G to serve 4EB cluster
      • Map/reduce not affected
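The first two figures on the scaling slide can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope illustration; it assumes decimal units (1 GB = 10^9 bytes, 1 EB = 10^18 bytes), and the function names are hypothetical, not part of any MapR tool.

```python
# Back-of-the-envelope check of the container scaling figures.
# Assumption: decimal units (1 GB = 10**9 bytes, 1 EB = 10**18 bytes).
GB = 10**9
EB = 10**18


def cluster_capacity_bytes(num_containers, container_size_gb):
    """Total data addressed by num_containers containers of the given size."""
    return num_containers * container_size_gb * GB


def cache_bytes(num_containers, bytes_per_container=250):
    """DRAM needed to cache location data for every container."""
    return num_containers * bytes_per_container


containers = 100_000_000  # 100M containers, as on the slide

low_eb = cluster_capacity_bytes(containers, 16) / EB   # all containers at 16 GB
high_eb = cluster_capacity_bytes(containers, 32) / EB  # all containers at 32 GB
cache_gb = cache_bytes(containers) / GB

print(f"capacity: {low_eb} to {high_eb} EB")
print(f"container cache: {cache_gb} GB")
```

The slide's "~2 exabytes" falls inside the computed range, and 250 bytes per container times 100M containers gives the quoted 25GB.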
  • 12. MapR's Streaming Performance
      [Bar chart: read and write throughput in MB per sec (axis 0 to 2250) for Hardware, MapR, and Hadoop; two panels: 11 x 7200rpm SATA and 11 x 15Krpm SAS. Higher is better.]
      Tests: i. 16 streams x 120GB; ii. 2000 streams x 1GB
  • 13. Terasort on MapR
      [Bar chart: elapsed time (mins) for MapR and Hadoop; two panels: 1.0 TB (axis 0 to 60) and 3.5 TB (axis 0 to 300). Lower is better.]
      10+1 nodes: 8 core, 24GB DRAM, 11 x 1TB SATA 7200 rpm
  • 14. HBase on MapR
      [Bar chart: records per second (axis 0 to 25000) for Zipfian and Uniform workloads, MapR vs. Apache. Higher is better.]
      YCSB Random Read with 1 billion 1K records
      10+1 node cluster: 8 core, 24GB DRAM, 11 x 1TB 7200 RPM
  • 15. Small Files (Apache Hadoop, 10 nodes)
      [Chart: rate (files/sec) vs. # of files (m); series: Out of box, Tuned]
      Op:
      - create file
      - write 100 bytes
      - close
      Notes:
      - NN not replicated
      - NN uses 20G DRAM
      - DN uses 2G DRAM
  • 16. MUCH faster for some operations
      [Chart: create rate vs. # of files (millions)]
      Same 10 nodes …
  • 17. What MapR is not
      • Volumes != federation
      • MapR supports > 10,000 volumes all with independent placement and defaults
      • Volumes support snapshots and mirroring
      • NFS != FUSE
      • Checksum and compress at gateway
      • IP fail-over
      • Read/write/update semantics at full speed
      • MapR != maprfs
  • 18. New Capabilities
  • 19. Alternative NFS mounting models
      • Export to the world
      • NFS gateway runs on selected gateway hosts
      • Local server
      • NFS gateway runs on local host
      • Enables local compression and check summing
      • Export to self
      • NFS gateway runs on all data nodes, mounted from localhost
  • 20. Export to the world
      [Diagram: one NFS Client and four NFS Servers]
  • 21. Local server
      [Diagram labels: Client, Application, NFS Server, Cluster Nodes]
  • 22. Universal export to self
      [Diagram labels: Cluster Node, Task, NFS Server, Cluster Nodes]
  • 23. Nodes are identical
      [Diagram: three Cluster Nodes, each with a Task and an NFS Server]
  • 24. Application architecture
      • High performance map-reduce is nice
      • But algorithmic flexibility is even nicer
  • 25. Sharded text Indexing
      [Diagram labels: Input documents, Map, Reducer, Local disk, Clustered index storage, Local disk, Search Engine]
      • Map: assign documents to shards
      • Reducer: index text to local disk and then copy index to distributed file store
      • Search engine: copy to local disk typically required before index can be loaded
  • 26. Sharded text indexing
      • Mapper assigns document to shard
      • Shard is usually hash of document id
      • Reducer indexes all documents for a shard
      • Indexes created on local disk
      • On success, copy index to DFS
      • On failure, delete local files
      • Must avoid directory collisions
      • can't use shard id!
      • Must manage and reclaim local disk space
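The mapper step on the sharded indexing slide (shard is usually a hash of the document id) can be sketched as follows. This is a minimal illustration rather than the presenter's code: `shard_for`, `map_documents`, and `work_dir_name` are hypothetical names, MD5 is just one convenient stable hash, and the attempt-id suffix is one assumed way to avoid the directory collisions the slide warns about.

```python
import hashlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard with a hash that is stable across processes.

    Python's built-in hash() is randomized per process for strings, so a
    digest-based hash is used here instead.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def map_documents(docs, num_shards):
    """Mapper: emit (shard, document) so one reducer sees a whole shard."""
    for doc_id, text in docs:
        yield shard_for(doc_id, num_shards), (doc_id, text)


def work_dir_name(shard: int, attempt_id: str) -> str:
    """Hypothetical local directory name; the shard id alone could collide
    when a failed or speculative attempt for the same shard ran on the node."""
    return f"index-shard-{shard:05d}-{attempt_id}"


docs = [("doc-1", "hello world"), ("doc-2", "map reduce"), ("doc-3", "search index")]
for shard, (doc_id, _text) in map_documents(docs, num_shards=4):
    print(doc_id, "->", work_dir_name(shard, attempt_id="attempt_0"))
```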
  • 27. Conventional data flow
      [Diagram labels: Input documents, Map, Reducer, Local disk, Clustered index storage, Local disk, Search Engine]
      • Failure of a reducer causes garbage to accumulate in the local disk
      • Failure of search engine requires another download of the index from clustered storage
  • 28. Simplified NFS data flows
      [Diagram labels: Input documents, Map, Reducer, Clustered index storage, Search Engine]
      • Index to task work directory via NFS
      • Failure of a reducer is cleaned up by map-reduce framework
      • Search engine reads mirrored index directly
  • 29. Simplified NFS data flows
      [Diagram labels: Input documents, Map, Reducer, Mirrors, Search Engine, Search Engine]
      • Mirroring allows exact placement of index data
      • Arbitrary levels of replication also possible
  • 30. How about another one?
  • 31. K-means
      • Classic E-M based algorithm
      • Given cluster centroids,
      • Assign each data point to nearest centroid
      • Accumulate new centroids
      • Rinse, lather, repeat
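The loop described on the K-means slide can be written out on a single machine as a minimal sketch. It assumes squared Euclidean distance and keeps an empty cluster's old centroid; both choices, and all function names, are illustrative assumptions rather than details from the talk.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def kmeans_step(points, centroids):
    """One pass: assign each point to its nearest centroid, then average."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in points:
        k = nearest(point, centroids)
        counts[k] += 1
        for d, value in enumerate(point):
            sums[k][d] += value
    # Assumption: a cluster that received no points keeps its old centroid.
    return [
        [s / counts[k] for s in sums[k]] if counts[k] else list(centroids[k])
        for k in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=100):
    """Repeat the step until the centroids stop changing."""
    for _ in range(max_iter):
        updated = kmeans_step(points, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
print(kmeans(points, [[0, 0], [10, 10]]))
```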
  • 32. K-means, the movie
      [Diagram labels: Input, Assign to nearest centroid, Aggregate new centroids, Centroids]
  • 33. But …
  • 34. Parallel Stochastic Gradient Descent
      [Diagram labels: Input, Train sub model, Average models, Model]
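The "train sub model, then average models" flow in the parallel SGD diagram can be sketched in a few lines. The example assumes a linear model with squared-error loss and a fixed learning rate, none of which is specified on the slide; `sgd_train`, `average_models`, and `parallel_sgd` are hypothetical names, and the shards are processed sequentially here where a cluster would run them in parallel.

```python
def sgd_train(examples, dim, learning_rate=0.1, epochs=50):
    """Train one sub-model on one shard with plain SGD.

    Assumption: linear model y ~ w . x with squared-error loss.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in examples:
            error = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [wi - learning_rate * error * xi for wi, xi in zip(w, x)]
    return w


def average_models(models):
    """Combine sub-models by averaging their weights coordinate-wise."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]


def parallel_sgd(shards, dim):
    """Train a sub-model per shard, then average the sub-models."""
    return average_models([sgd_train(shard, dim) for shard in shards])


# Toy data generated from y = 2 * x, split into two shards.
shards = [
    [([1.0], 2.0), ([2.0], 4.0)],
    [([0.5], 1.0), ([1.5], 3.0)],
]
print(parallel_sgd(shards, dim=1))
```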
  • 35. Variational Dirichlet Assignment
      [Diagram labels: Input, Gather sufficient statistics, Update model, Model]
  • 36. Old tricks, new dogs
      • Mapper
      • Assign point to cluster
      • Emit cluster id, (1, point)
      • Combiner and reducer
      • Sum counts, weighted sum of points
      • Emit cluster id, (n, sum/n)
      • Output to HDFS
      [Annotations: Read from HDFS to local disk by distributed cache; Read from local disk from distributed cache; Written by map-reduce]
  • 37. Old tricks, new dogs
      • Mapper
      • Assign point to cluster
      • Emit cluster id, (1, point)
      • Combiner and reducer
      • Sum counts, weighted sum of points
      • Emit cluster id, (n, sum/n)
      • Output to HDFS
      [Annotations: MapR FS; Read from NFS; Written by map-reduce]
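The mapper and combiner/reducer roles listed on these two slides can be simulated in memory as a rough sketch. The same `combine` function serves as combiner and reducer because (n, mean) pairs merge by adding counts and taking a count-weighted sum of means. The `kmeans_pass` driver and the in-memory "splits" are stand-ins for the map-reduce framework, and the centroids are passed in directly rather than read from a file.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def mapper(point, centroids):
    """Assign point to cluster; emit cluster id, (1, point)."""
    yield nearest(point, centroids), (1, tuple(point))


def combine(values):
    """Combiner and reducer: merge (n, mean) pairs into one (n, sum/n) pair."""
    total = 0
    weighted = None
    for n, mean in values:
        if weighted is None:
            weighted = [0.0] * len(mean)
        total += n
        for d, m in enumerate(mean):
            weighted[d] += n * m
    return total, tuple(w / total for w in weighted)


def kmeans_pass(splits, centroids):
    """Stand-in for the framework: combine per split, then reduce per key."""
    partials = defaultdict(list)
    for split in splits:
        local = defaultdict(list)
        for point in split:
            for key, value in mapper(point, centroids):
                local[key].append(value)
        for key, values in local.items():
            partials[key].append(combine(values))  # combiner output
    return {key: combine(values) for key, values in partials.items()}


splits = [[(0, 0), (10, 10)], [(0, 2), (10, 12)]]
print(kmeans_pass(splits, [(0, 0), (10, 10)]))
```

In the NFS variant of the slide, the centroid file would simply be opened with ordinary file I/O instead of being staged through the distributed cache.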
  • 38. Poor man's Pregel
      • Mapper
      • Lines in bold can use conventional I/O via NFS
        while not done:
            read and accumulate input models
            for each input:
                accumulate model
            write model
            synchronize
            reset input format
        emit summary
  • 39. Click modeling architecture
      [Diagram labels: Input, Feature extraction and down sampling, Side-data, Data join, Sequential SGD Learning, Map-reduce, Now via NFS]
  • 40. Click modeling architecture
      [Diagram labels: Input, Feature extraction and down sampling, Side-data, Data join, Map-reduce, Map-reduce, Sequential SGD Learning (four instances)]
      Map-reduce cooperates with NFS
  • 41. And another…
  • 42. Hybrid model flow
      [Diagram labels: Map-reduce, Map-reduce, ??, Feature extraction and down sampling, SVD (PageRank) (spectral), Deployed Model, Down stream modeling]
  • 43. [Slide contains no text]
  • 44. Hybrid model flow
      [Diagram labels: Map-reduce, Sequential, Feature extraction and down sampling, SVD (PageRank) (spectral), Deployed Model, Down stream modeling]
  • 45. And visualization…
  • 46. Trivial visualization interface
      • Map-reduce output is visible via NFS
      • Legacy visualization just works
        $ R
        > x <- read.csv("/mapr/my.cluster/home/ted/data/foo.out")
        > plot(error ~ t, x)
        > q(save='n')
  • 47. Conclusions
      • We used to know all this
      • Tab completion used to work
      • 5 years of work-arounds have clouded our memories
      • We just have to remember the future