9. Put it away, delete it, tweet it, compress it, shred it, wikileak-it, put it in a database, put it in SAN/NAS, put it in the cloud, hide it in tape…
10. You are obsessive compulsive about collecting and structuring your data.
19. The solution?
[Diagram: piles of Data labeled OLTP, EDW, Analytical DB, Another EDW, Yet Another EDW]
20. Ummm…you dropped something
[Diagram: piles of Data labeled OLTP, EDW, Analytical DB, Another EDW, Yet Another EDW, with loose Data spilled around them]
23. Wait, you’ve seen this before.
[Diagram: Data on either side of a box labeled Sausage Factory]
26. “Prices, Stupid passwords, and Boring Statistics.” - Hans Rosling http://www.youtube.com/watch?v=hVimVzgtD6w
27. Your data silos are lonely places.
[Diagram: separate piles of Data labeled EDW, Accounts, Customers, Web Properties]
28. … Data likes to be together.
[Diagram: Data from EDW, Accounts, Customers, Web Properties]
29. Data likes to socialize too.
[Diagram: CDR, Machine Data, Facebook, Weather Data, Twitter shown alongside EDW, Accounts, Web Properties, Customers]
30. New types of data don’t quite fit into your pristine view of the world.
[Diagram: Logs and Machine Data next to "My Little Data Empire", with question marks]
31. To resolve this, some people take hints from Lord Of The Rings...
33. …but that has its problems too.
[Diagram: Data, ETL, EDW, Schema]
41. If you could design a system that would handle this, what would it look like?
42. It would probably need a highly resilient, self-healing, cost-efficient, distributed file system…
[Diagram: grid of Storage nodes]
43. It would probably need a completely parallel processing framework that took tasks to the data…
[Diagram: grid of nodes, each pairing Processing with Storage]
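One way to picture "taking tasks to the data" is a scheduler that prefers a node already holding the block a task needs. The sketch below is only an illustration under assumed names (`schedule`, `block_locations`, `free_slots` are hypothetical); it is not how any particular Hadoop scheduler is implemented.

```python
def schedule(block_locations, free_slots):
    """Assign one task per block, preferring a node that already stores it.

    block_locations: dict of block name -> list of nodes holding a copy.
    free_slots: dict of node name -> number of tasks it can still accept.
    Returns dict of block name -> (chosen node, True if the data is local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for block, holders in block_locations.items():
        # First choice: a node that holds the block and has a free slot.
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            node = local[0]
        else:
            # Fallback: any node with a free slot; the data must travel.
            remote = [node for node, free in slots.items() if free > 0]
            if not remote:
                raise RuntimeError("no free slots left in the cluster")
            node = remote[0]
        slots[node] -= 1
        assignment[block] = (node, node in holders)
    return assignment


# Hypothetical three-node cluster.
block_locations = {"block1": ["n1", "n2"], "block2": ["n1"], "block3": ["n1"]}
free_slots = {"n1": 2, "n2": 1, "n3": 1}
assignment = schedule(block_locations, free_slots)
```

In this toy run the first two tasks land on a node that already stores their block, and only the third has to read its data over the network.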
44. It would probably run on commodity hardware, virtualized machines, and common OS platforms
[Diagram: grid of nodes, each pairing Processing with Storage]
45. It would probably be open source so innovation could happen as quickly as possible
48. HDFS stores data in blocks and replicates those blocks
[Diagram: block1, block2, and block3 each stored three times across the Processing/Storage nodes]
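The idea of blocks and replicas can be sketched in a few lines. This is a toy model, not HDFS code: the helper names (`split_into_blocks`, `place_replicas`) are hypothetical, the round-robin placement is a simplifying assumption (real HDFS placement also considers racks), and the replication factor of 3 simply mirrors the diagram, where each block appears three times.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks; the last may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Toy round-robin placement: each block goes to `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement


# Hypothetical file and four-node cluster.
blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block it held.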
49. If a block replica fails, HDFS still has the other copies and heals itself by re-replicating
[Diagram: one replica of block3 is marked X; block1, block2, and block3 still each appear three times across the nodes]
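The healing step can be illustrated with a small simulation. This is a sketch under stated assumptions, not the HDFS implementation: `heal`, the node names, and the "first available node" choice are all hypothetical, and a target of three replicas is assumed to match the diagram.

```python
def heal(placement, live_nodes, replication=3):
    """Return a new placement that restores lost replicas where possible.

    placement: dict of block name -> list of nodes that held a replica.
    live_nodes: list of nodes that are still reachable.
    """
    healed = {}
    for block, holders in placement.items():
        survivors = [node for node in holders if node in live_nodes]
        if not survivors:
            # Every copy is gone; nothing is left to copy from.
            raise RuntimeError(f"{block} lost: no surviving replica")
        # Copy the block to live nodes that do not hold it yet.
        candidates = [node for node in live_nodes if node not in survivors]
        missing = replication - len(survivors)
        healed[block] = survivors + candidates[:missing]
    return healed


# Hypothetical cluster: node n3 fails, taking one replica of every block with it.
placement = {
    "block1": ["n1", "n2", "n3"],
    "block2": ["n2", "n3", "n4"],
    "block3": ["n3", "n4", "n1"],
}
healed = heal(placement, live_nodes=["n1", "n2", "n4"])
```

Healing only works while at least one replica survives; if every node holding a block fails at once, the sketch raises an error instead of pretending the data can be recovered.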
50. MapReduce is a programming paradigm that is completely parallel
[Diagram: input Data, five Mappers, three Reducers, output Data]
51. MapReduce has three phases: Map, Sort/Shuffle, Reduce
[Diagram: Key, Value pairs flow into five Mappers, then into three Reducers, which emit Key, Value pairs]
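The three phases can be sketched in plain Python with a word count. This is a single-process illustration of the paradigm, not the Hadoop API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the sample lines are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (key, value) pair for every word."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Sort/Shuffle: group values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: collapse each key's values into one result."""
    return {key: sum(values) for key, values in grouped}


lines = ["data likes data", "data likes to socialize"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
```

In a real cluster each mapper would run on its own slice of the input and each reducer would receive a subset of the keys; here the phases run one after another only to show what each one contributes.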
52. MapReduce applies to a lot of data processing problems
[Diagram: input Data, five Mappers, three Reducers, output Data]
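To illustrate how the same skeleton fits different problems, the sketch below plugs two unrelated mapper/reducer pairs into one generic driver. Everything here is hypothetical (`run_mapreduce`, the sensor readings, the documents); it runs in a single process and only mimics the shape of a MapReduce job.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Generic driver: map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


# Problem 1: maximum reading per sensor.
readings = [("s1", 20), ("s2", 31), ("s1", 25)]
max_per_sensor = run_mapreduce(
    readings,
    mapper=lambda reading: [(reading[0], reading[1])],
    reducer=lambda sensor, values: max(values),
)


# Problem 2: inverted index (which documents contain each word).
def index_mapper(doc):
    doc_id, text = doc
    return [(word, doc_id) for word in text.split()]


docs = [("doc1", "data likes data"), ("doc2", "data silos")]
index = run_mapreduce(
    docs,
    mapper=index_mapper,
    reducer=lambda word, doc_ids: sorted(set(doc_ids)),
)
```

Only the mapper and reducer change between the two problems; the grouping step in the middle stays the same.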
55. YARN abstracts resource management so you can run more than just MapReduce
[Diagram: MapReduce V2, MapReduce V?, STORM, Giraph, Tez, MPI, HBase, … and more, running on YARN over HDFS2]