Slide 1: Five Ways to Do Data Analytics “The Wrong Way”

Title of the talk, on August 6 2014, @ Pinterest

Powered by the Wisconsin Idea: The Wisconsin Idea is the principle that the university should improve people’s lives beyond the classroom. It spans UW–Madison’s teaching, research, outreach and public service.

Jignesh M. Patel
jignesh@cs.wisc.edu
  
Slide 2: Follow the markitecture

Definition: A computing or networking architecture suggested by the marketing department for sales purposes rather than for technical reasons. Cisco calls them "reference designs".
http://www.urbandictionary.com
  
Slide 3

http://gridgaintech.wordpress.com
Technology = In-memory file system

https://spark.apache.org
Technology = In-memory caching + language bindings

http://hortonworks.com/blog/100x-faster-hive/
The Stinger Initiative: 100X Hive
Technology = caching, vectorized query execution

http://blog.cloudera.com
Technology = pin files in memory
  
Slide 4: Problem: Claims are too broad!

http://hortonworks.com/blog/stinger-phase-2-the-journey-to-100x-faster-hive/
https://spark.apache.org

Venkataraman et al., EuroSys’13
Presto (not the Facebook one) v/s Spark: big wins, and in the R framework
  
Slide 5: Embrace complexity

Never fix a duct-taped solution
  
Slide 6

Image from: http://thewaysleueslove.blogspot.com

One has to apply duct tape to fix problems, but consider removing it later.

Stonebraker and Cetintemel, ICDE 2005
Natural instinct is to build/deploy a specialized system for each application, but that approach blows up the operational complexity
  
Slide 7

Chasseur and Patel, WebDB’13
[Figure labels: Web App, JSON, Mapping Layer]

Rather than a specialized engine for a JSON document store, a simple language translator to SQL has higher performance and better data integrity.

Similar story for graphs and linear ML models: both can easily be supported on top of systems powered by relational algebra
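The mapping-layer idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the design from the cited paper: the table `doc_fields`, its columns, and the helpers `flatten`, `store` and `find_equal` are hypothetical names, and SQLite stands in for the relational engine.

```python
import json
import sqlite3


def flatten(value, prefix=""):
    """Yield (key path, scalar) pairs for a nested JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            yield from flatten(child, path)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        yield prefix, value


def store(conn, objid, document):
    """Shred one JSON document into rows of the (hypothetical) doc_fields table."""
    rows = [
        (objid, path, json.dumps(scalar))
        for path, scalar in flatten(json.loads(document))
    ]
    conn.executemany("INSERT INTO doc_fields VALUES (?, ?, ?)", rows)


def find_equal(conn, path, scalar):
    """Translate 'documents where path == scalar' into a plain SQL query."""
    cursor = conn.execute(
        "SELECT DISTINCT objid FROM doc_fields "
        "WHERE keystr = ? AND valjson = ? ORDER BY objid",
        (path, json.dumps(scalar)),
    )
    return [row[0] for row in cursor]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_fields (objid INTEGER, keystr TEXT, valjson TEXT)")
store(conn, 1, '{"user": {"name": "ada", "age": 36}, "tags": ["db", "ml"]}')
store(conn, 2, '{"user": {"name": "bob", "age": 41}, "tags": ["db"]}')

print(find_equal(conn, "user.name", "ada"))
print(find_equal(conn, "tags[0]", "db"))
```

The web application keeps speaking JSON, while storage, indexing and transactions stay with the relational engine underneath.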
  
The network effect! But in a bad way!

Complexity Growth = O(N²)

[Figure: two node diagrams, one with nodes 1 to 3 and one with nodes 1 to 4]
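One way to read the O(N²) claim, assuming every pair of systems may need its own glue code, is to count the unordered pairs: N systems give N(N-1)/2 potential integration points. The function name `integration_points` below is illustrative only.

```python
from itertools import combinations


def integration_points(systems):
    """Return every unordered pair of systems that may need glue code."""
    return list(combinations(systems, 2))


for n in (3, 4, 10):
    systems = [f"system-{i}" for i in range(1, n + 1)]
    print(n, "systems ->", len(integration_points(systems)), "pairs")
```

Going from 3 to 4 systems doubles the pair count from 3 to 6, and 10 systems already give 45 pairs.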
  
Slide 8: R v/s Python debate

Complexity Growth = O(N²)

Also applies to tools and programming languages in house

[Figure labels: R; Python; 5K CRAN statistically robust packages; Linear algebra, clustering, …; ETL]
  
Slide 9: Think of technology as the end

Never realize that technology is NOT the “end,” but simply the “means to a (business) end”
  
Slide 10: Example: Building a recommendation system

Netflix Challenge
  
Slide 11: Key approach: Latent-factor Modeling

Figure from: Ricardo: Integrating R and Hadoop by Das et al. SIGMOD’10

Winning insights
•  Missing ratings are not missing at random!
•  Parameters (popularity, users’ standards for rating, user tastes, …) vary over time
•  Combining sets of predictors
•  Efficient computation is critical

All Together Now: A Perspective on the Netflix Prize, by Bell, Koren and Volinsky
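A toy sketch of latent-factor modeling, for intuition only: each user and each item gets a short vector, and a rating is predicted as their dot product. This is plain regularized matrix factorization trained with stochastic gradient descent, not the prize-winning model; the ratings, the hyperparameters and all function names are made up for illustration.

```python
import math
import random


def predict(user_factors, item_factors, user, item):
    """Predicted rating = dot product of the user and item factor vectors."""
    return sum(p * q for p, q in zip(user_factors[user], item_factors[item]))


def rmse(ratings, user_factors, item_factors):
    """Root-mean-square error over the observed (user, item, rating) triples."""
    squared = [
        (r - predict(user_factors, item_factors, u, i)) ** 2 for u, i, r in ratings
    ]
    return math.sqrt(sum(squared) / len(squared))


def train(ratings, n_users, n_items, k=2, lr=0.01, reg=0.02, epochs=500, seed=0):
    """Fit k latent factors per user and item with SGD on observed ratings only."""
    rng = random.Random(seed)
    user_factors = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_users)]
    item_factors = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_items)]
    initial_error = rmse(ratings, user_factors, item_factors)
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(user_factors, item_factors, u, i)
            for f in range(k):
                p, q = user_factors[u][f], item_factors[i][f]
                # Gradient step with L2 regularization on both factors.
                user_factors[u][f] += lr * (err * q - reg * p)
                item_factors[i][f] += lr * (err * p - reg * q)
    return user_factors, item_factors, initial_error


# Hypothetical (user, item, rating) triples; missing pairs are simply absent.
RATINGS = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1),
           (2, 1, 2), (2, 2, 5), (3, 0, 5), (3, 2, 2)]

user_factors, item_factors, initial_error = train(RATINGS, n_users=4, n_items=3)
final_error = rmse(RATINGS, user_factors, item_factors)
print(f"training RMSE: {initial_error:.2f} -> {final_error:.2f}")
```

The sketch ignores the insights listed above (ratings that are not missing at random, time-varying parameters, blending predictors), which is where the real modeling work lies.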
  
Slide 12: Pandora: Music Genome

Pandora’s Music Recommender by Michael Howe

•  Content-filtering
•  Classification to pick the recommendation
•  Key is to “build up a neighborhood for a particular user’s preference”

Pandora.com
  
Slide 13: Never use back-of-the-envelope calculations

Build before you analyze the technology trend
  
Slide 14: Hardware changes are far more non-linear than in the past

Motivation for the UW Quickstep project
http://quickstep.cs.wisc.edu

[Figure: storage hierarchy; labels: Latency (cycles), Capacity, Cost]
CPU caches: ~1-10s
DRAM: ~100
NVRAM (e.g. SSDs): ~10^5 to 10^6
Magnetic Hard Disk Drives: ~10^7 to 10^8

Most interactive jobs work on “small” data sets
Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis, Chen et al. EuroSys’12
  
Slide 15: Latency lags bandwidth

Patterson, CACM 2004

J. Dean, Latency numbers every programmer should know, 2012

[Bar chart: time in ns (log scale, axis from 0 to 1,000,000,000) for the operations below]
L1 cache reference
Branch mispredict
L2 cache reference
Mutex lock/unlock
Main memory reference
Compress 1K bytes with Zippy
Send 1K bytes over 1 Gbps network
Read 4K randomly from SSD*
Read 1 MB sequentially from memory
Round trip within same datacenter
Read 1 MB sequentially from SSD*
Disk seek
Read 1 MB sequentially from disk
Send packet CA->Netherlands->CA
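A back-of-the-envelope calculation built on such a latency list might look like the sketch below. The nanosecond values are the commonly circulated ones from the 2012 list, not read off the chart on this slide, so treat them as assumptions; the function names are illustrative.

```python
# Commonly cited per-operation latencies in nanoseconds (assumed values,
# not taken from the slide's chart).
LATENCY_NS = {
    "read_1mb_memory": 250_000,
    "read_1mb_ssd": 1_000_000,
    "read_1mb_disk": 20_000_000,
    "disk_seek": 10_000_000,
}


def sequential_scan_seconds(megabytes, medium):
    """Estimate the time to read `megabytes` sequentially from one medium."""
    return megabytes * LATENCY_NS[f"read_1mb_{medium}"] / 1e9


def random_disk_reads_seconds(reads):
    """Estimate the time spent seeking for `reads` random disk accesses."""
    return reads * LATENCY_NS["disk_seek"] / 1e9


for medium in ("memory", "ssd", "disk"):
    print(f"scan 1,000 MB from {medium}: {sequential_scan_seconds(1000, medium):.2f} s")
print(f"1,000 random disk seeks: {random_disk_reads_seconds(1000):.2f} s")
```

Even this crude estimate separates designs by orders of magnitude before any code is written: under these assumed numbers, a thousand random seeks cost about as much as scanning 500 MB sequentially from the same disk.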
  
Slide 16

Little’s Law
L = λW
Amazing way to reason about bottlenecks
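As a worked example of Little's Law, L = λW relates the average number of requests in the system (L) to the arrival rate (λ) and the average time each request spends in the system (W). The workload numbers and function names below are hypothetical.

```python
def requests_in_flight(arrival_rate_per_s, mean_time_in_system_s):
    """Little's Law: L = lambda * W."""
    return arrival_rate_per_s * mean_time_in_system_s


def latency_budget_s(concurrency_limit, arrival_rate_per_s):
    """Rearranged as W = L / lambda: the latency a concurrency cap allows."""
    return concurrency_limit / arrival_rate_per_s


# 200 queries/s, each spending 0.25 s in the system on average.
print(requests_in_flight(200, 0.25), "queries in flight on average")

# With at most 32 concurrent queries at 200 queries/s, the average
# time in the system cannot exceed this many seconds.
print(latency_budget_s(32, 200), "s per query")
```

If the system can only hold 32 queries at once, then at 200 queries/s either the average latency stays at or below 0.16 s or a queue builds up in front of it.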
  
Amdahl's law
Amdahl, AFIPS 1967
DeWitt and Gray, CACM 1992

Parallel computing is hard

Speedup = Old/New
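Reading Speedup = Old/New as old running time divided by new running time, Amdahl's law bounds it by the serial part of the work: with a parallelizable fraction p spread over n workers, speedup = 1 / ((1 - p) + p / n). A minimal sketch follows; the function name and the 95% figure are illustrative.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Speedup = old time / new time when only part of the work parallelizes."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# A job that is 95% parallelizable, on a growing number of workers.
for workers in (1, 10, 100, 1000):
    print(workers, "workers ->", round(amdahl_speedup(0.95, workers), 2), "x")
```

With 5% serial work the speedup can never reach 20x, however many workers are added.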
  
Slide 17: Fall in love with your architecture

Stubbornly refuse to throw away code and platform architecture.
  
Slide 18

Problem: It’s hard to throw away something that you built, even if it doesn’t fit anymore

[Bubble chart: Revenue/Employee ($M) and $/Active User (log scale); bubble volume based on daily time on the site; bubble labels 19, 29, 18, 7, 9; named bubbles include Google and YouTube]

Data from 2013 publicly reported numbers and Alexa
  	
  
Slide 19: Summary: doing it right …

Markitecture: Watch for claims that are too broad
Complexity: Simple is beautiful. Keep the building blocks of your architectural DNA simple
Architecture: Periodically re-evaluate your technology architecture. Also, people and processes.
Technology and Business: Technology must serve an end business goal
Back-of-the-envelope calculations: Amazingly powerful. Think hard before you build!
