The talk presents five tongue-in-cheek "wrong ways" to do data analytics in order to highlight common pitfalls. It points to the Netflix Prize and Pandora recommenders as examples where the business goal, rather than a particular technology, drove the design. It warns against following marketing-driven architectures, letting system complexity grow unchecked, leaving duct-tape fixes in place, skipping back-of-the-envelope calculations, and refusing to throw away code and architecture that no longer fit. Its advice is to keep building blocks simple, re-evaluate the architecture periodically, and make technology serve a business goal.
1. Five Ways to Do Data Analytics “The Wrong Way”
Talk given on August 6, 2014, @ Pinterest
Powered by the Wisconsin Idea: The Wisconsin Idea is the principle that the university should improve people’s lives beyond the classroom. It spans UW–Madison’s teaching, research, outreach and public service.
Jignesh M. Patel
jignesh@cs.wisc.edu
2. Follow the markitecture
Definition: A computing or networking architecture suggested by the marketing department for sales purposes rather than for technical reasons. Cisco calls them "reference designs".
http://www.urbandictionary.com
3. http://gridgaintech.wordpress.com
Technology = In-memory file system
https://spark.apache.org
Technology = In-memory caching + language bindings
http://hortonworks.com/blog/100x-faster-hive/
The Stinger Initiative: 100X Hive
Technology = caching, vectorized query execution
http://blog.cloudera.com
Technology = pin files in memory
5. Embrace complexity
Never fix a duct-taped solution
6. Image from: http://thewaysleueslove.blogspot.com
One has to apply duct tape to fix problems, but consider removing it later.
Stonebraker and Cetintemel, ICDE 2005
Natural instinct is to build/deploy a specialized system for each application, but that approach blows up the operational complexity
7. Chasseur and Patel, WebDB’13
[Figure labels: Web App, JSON, Mapping Layer]
Rather than a specialized engine for a JSON document store, a simple language translator to SQL has higher performance and better data integrity.
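A minimal sketch of the translator idea, assuming a much simpler design than the mapping layer in the cited paper: each JSON document is shredded into rows of one key/value table in SQLite, and an equality predicate on a field becomes a SQL query. The table layout and the names `flatten`, `store_document` and `find_ids` are hypothetical, chosen only for illustration.

```python
import json
import sqlite3

def flatten(value, prefix=""):
    """Yield (path, scalar) pairs for every leaf of a parsed JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        yield prefix, value

def store_document(conn, doc_id, text):
    """Shred one JSON document into rows of the (hypothetical) kv table."""
    rows = [(doc_id, path, json.dumps(leaf)) for path, leaf in flatten(json.loads(text))]
    conn.executemany("INSERT INTO kv (doc_id, path, value) VALUES (?, ?, ?)", rows)

def find_ids(conn, path, wanted):
    """Translate the predicate `path == wanted` into a SQL query."""
    cursor = conn.execute(
        "SELECT DISTINCT doc_id FROM kv WHERE path = ? AND value = ? ORDER BY doc_id",
        (path, json.dumps(wanted)),
    )
    return [row[0] for row in cursor]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (doc_id INTEGER, path TEXT, value TEXT)")
store_document(conn, 1, '{"user": {"name": "ann", "age": 31}, "tags": ["a", "b"]}')
store_document(conn, 2, '{"user": {"name": "bob", "age": 31}}')
matches = find_ids(conn, "user.age", 31)
```

Because the documents live in ordinary tables, the relational engine's transactions and constraints remain available, which is the data-integrity point the slide makes.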
Similar story for graphs and linear ML models – can easily be supported on top of systems powered by relational algebra
The network effect! But in a bad way!
Complexity Growth = O(N^2)
[Figure: node diagrams labeled 1–3 and 1–4]
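The quadratic growth can be checked with a quick count. This sketch assumes that every pair of systems may need its own integration path, so N systems give N(N-1)/2 pairs; the function name and the system labels are illustrative.

```python
from itertools import combinations

def pairwise_links(systems):
    """Return every unordered pair of systems that may need an integration."""
    return list(combinations(systems, 2))

three = pairwise_links(["A", "B", "C"])
four = pairwise_links(["A", "B", "C", "D"])

# Closed form N(N-1)/2 for a few deployment sizes.
growth = {n: n * (n - 1) // 2 for n in (3, 4, 10, 20)}
```

Going from 10 to 20 systems doubles N but roughly quadruples the number of pairs to keep working.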
8. R vs. Python debate
Complexity Growth = O(N^2)
Also applies to tools and programming languages in house
[Figure labels: R; Python; 5K CRAN statistically robust packages; Linear algebra, clustering, …; ETL]
9. Think of technology as the end
Never realize that technology is NOT the “end,” but simply the “means to a (business) end”
11. Key approach: Latent-factor Modeling
Figure from: Ricardo: Integrating R and Hadoop by Das et al. SIGMOD’10
All Together Now: A Perspective on the Netflix Prize, by Bell, Koren and Volinsky
Winning insights
• Missing ratings are not missing at random!
• Parameters (popularity, users’ standards for rating, user tastes, …) vary over time
• Combining sets of predictors
• Efficient computation critical
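To make "latent-factor modeling" concrete, here is a toy sketch: plain matrix factorization fitted with stochastic gradient descent on the observed ratings only. It is not the prize-winning system, which combined many predictors and modeled time effects; the hyperparameters, function names and ratings below are made up for illustration.

```python
import random

def train_latent_factors(ratings, n_users, n_items, k=2, steps=200,
                         lr=0.05, reg=0.02, seed=0):
    """Fit user and item factor vectors to (user, item, rating) triples with SGD."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            err = r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    """Predicted rating is the dot product of user and item factors."""
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))

def rmse(ratings, P, Q):
    """Root mean squared error over the observed ratings."""
    total = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5

# Made-up ratings: users 0-1 prefer items 0-1, users 2-3 prefer items 2-3.
ratings = [(0, 0, 5), (0, 1, 4), (0, 2, 1), (1, 0, 4), (1, 1, 5), (1, 3, 1),
           (2, 0, 1), (2, 2, 5), (2, 3, 4), (3, 1, 1), (3, 2, 4), (3, 3, 5)]
before = rmse(ratings, *train_latent_factors(ratings, 4, 4, steps=0))
after = rmse(ratings, *train_latent_factors(ratings, 4, 4))
```

Only observed (user, item) pairs contribute to the loss here. The first insight above says the missing entries are themselves informative, which this simple version ignores.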
12. Pandora: Music Genome
Pandora’s Music Recommender by Michael Howe
• Content filtering
• Classification to pick the recommendation
• Key is to “build up a neighborhood for a particular user’s preference”
Pandora.com
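As a rough illustration of content filtering with a neighborhood around a user's preference, the sketch below ranks songs by cosine similarity to the average of the songs a user liked. This is a similarity ranking standing in for the classification step, not Pandora's algorithm, and the song names and attribute vectors are hypothetical.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def profile(liked, catalog):
    """Average the attribute vectors of the songs the user liked."""
    vectors = [catalog[name] for name in liked]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def recommend(liked, catalog, top_n=2):
    """Return the unheard songs closest to the user's taste profile."""
    taste = profile(liked, catalog)
    candidates = [name for name in catalog if name not in liked]
    ranked = sorted(candidates, key=lambda name: cosine(taste, catalog[name]),
                    reverse=True)
    return ranked[:top_n]

# Hypothetical attributes per song: (tempo, distortion, acoustic, vocals).
catalog = {
    "song_a": [0.9, 0.8, 0.1, 0.5],
    "song_b": [0.8, 0.9, 0.2, 0.4],
    "song_c": [0.2, 0.1, 0.9, 0.6],
    "song_d": [0.85, 0.7, 0.15, 0.5],
    "song_e": [0.1, 0.0, 0.8, 0.9],
}
picks = recommend(["song_a", "song_b"], catalog)
```

Unlike the latent-factor approach, this needs no ratings from other users, only attributes of the items themselves.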
13. Never use back-of-the-envelope calculations
Build before you analyze the technology trend
14. Motivation for the UW Quickstep project
http://quickstep.cs.wisc.edu
Hardware changes are far more non-linear than in the past
[Figure: storage hierarchy annotated with latency (cycles), capacity and cost]
• CPU caches: ~1-10s of cycles
• DRAM: ~100 cycles
• NVRAM (e.g. SSDs): ~10^5-10^6 cycles
• Magnetic Hard Disk Drives: ~10^7-10^8 cycles
Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis, Chen et al. EuroSys’12
Most interactive jobs work on “small” data sets
15. Latency lags bandwidth (Patterson, CACM 2004)
J. Dean, Latency numbers every programmer should know, 2012
[Bar chart: time in ns (log scale), axis from 0 to 1,000,000,000, for the operations below]
• L1 cache reference
• Branch mispredict
• L2 cache reference
• Mutex lock/unlock
• Main memory reference
• Compress 1K bytes with Zippy
• Send 1K bytes over 1 Gbps network
• Read 4K randomly from SSD*
• Read 1 MB sequentially from memory
• Round trip within same datacenter
• Read 1 MB sequentially from SSD*
• Disk seek
• Read 1 MB sequentially from disk
• Send packet CA->Netherlands->CA
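A back-of-the-envelope calculation built on such numbers might look like the sketch below. The chart's values are not in this transcript, so the latencies are the commonly cited approximate figures from the 2012 list and should be treated as assumptions; the function names are illustrative.

```python
# Approximate latencies in nanoseconds (commonly cited circa-2012 figures).
# These are assumptions, not values read from the slide.
LATENCY_NS = {
    "read 1 MB sequentially from memory": 250_000,
    "read 1 MB sequentially from SSD": 1_000_000,
    "read 1 MB sequentially from disk": 20_000_000,
    "disk seek": 10_000_000,
}

def scan_seconds(megabytes, medium):
    """Estimate the time to scan `megabytes` sequentially from `medium`."""
    per_mb = LATENCY_NS[f"read 1 MB sequentially from {medium}"]
    return megabytes * per_mb / 1e9

def random_reads_seconds(count, per_read_ns):
    """Estimate the time for `count` dependent random reads."""
    return count * per_read_ns / 1e9

# Scanning about 1 GB (1000 MB) from each medium.
scan = {medium: scan_seconds(1000, medium) for medium in ("memory", "SSD", "disk")}

# One million random disk reads, each paying a full seek.
seeks = random_reads_seconds(1_000_000, LATENCY_NS["disk seek"])
```

Under these assumptions the same 1 GB scan takes about a quarter of a second from memory and about 20 seconds from disk, and a million disk seeks take hours. Estimates at this level of precision are often enough to rule a design in or out before building it.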
16. Little’s Law: L = λW
Amazing way to reason about bottlenecks
Amdahl's law (Amdahl, AFIPS 1967)
Parallel computing is hard (DeWitt and Gray, CACM 1992)
Speedup = Old/New
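Both laws fit in a few lines. The sketch below uses the standard forms, L = λW for Little's Law and speedup = 1 / ((1 - p) + p / n) for Amdahl's law, where p is the fraction of work that parallelizes and n the number of workers. The workload numbers are invented for illustration.

```python
def littles_law_items(arrival_rate, wait_time):
    """L = lambda * W: average number of items in the system."""
    return arrival_rate * wait_time

def amdahl_speedup(parallel_fraction, workers):
    """Old time / new time when only `parallel_fraction` of the work scales."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)

# 200 queries per second, each spending 0.5 s in the system.
in_flight = littles_law_items(arrival_rate=200.0, wait_time=0.5)

# 90% of the work parallelizes; run it on 16 workers.
speedup_16 = amdahl_speedup(parallel_fraction=0.9, workers=16)

# Upper bound on speedup as the worker count grows without limit.
ceiling = 1.0 / (1.0 - 0.9)
```

With these inputs the system holds about 100 queries in flight, and 16 workers give a speedup of about 6.4 against a ceiling of about 10, which is one way to see why parallel computing is hard.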
17. Fall in love with your architecture
Stubbornly refuse to throw away code and platform architecture.
18. Problem: It’s hard to throw away something that you built, even if it doesn’t fit anymore
[Bubble chart: axes are $/Active User (log scale) and Revenue/Employee ($M); labeled companies include Google and YouTube. Bubble volume based on daily time on the site.]
Data from 2013 publicly reported numbers and Alexa
19. Summary: doing it right …
• Markitecture: Watch for claims that are too broad
• Complexity: Simple is beautiful – keep the building blocks of your architectural DNA simple
• Architecture: Periodically re-evaluate your technology architecture. Also, people and processes.
• Technology and Business: Technology must serve an end business goal
• Back-of-the-envelope calculations: Amazingly powerful – think hard before you build!