Data Science Consulting
or
Science meets business, again.
Third time's a charm?
David Johnston
ThoughtWorks
March 17, 2014
Young scientists
become…
Professors
Talk Overview
• Agile Analytics group at ThoughtWorks
• What is data science anyway? Origins and future.
Good or evil?
• Guide to technologies and limits to technology
• Process and methodology for successful data
science consulting
ThoughtWorks
• Global software consulting company
• HQ in Chicago. Major offices in NY, San Fran,
Dallas, India, Brazil, Australia, China - over 30
worldwide.
• Privately owned by Roy Singham
• Flat hierarchy of passionate people
The three pillars
Agile Analytics at TW
• Practice started in 2011
• Led by Ken Collier and John Spens
• About a dozen people involved
Key Themes
• BI, data warehousing and analytics have largely
missed the revolution in agile methodologies.
• We can do analytics in an agile, fast, light-footprint
way.
What do we do?
• Probabilistic modeling
• Predictive analytics / machine learning
• Advanced BI, prescriptive analysis
• Big Data technologies
• Advanced algorithms and data structures, streaming
Our main goals
• Use data analysis to give companies an edge in their marketplace
• Use data analysis to improve the world at large
Some typical projects
• Recommender systems
• Customer behavior analysis
• Optimization
• Efficient algorithms/tech for massive data sets
• Company specific analytics challenges
Case Study 1: HealthCare Group
Purchasing Organization
• One of the largest GPOs, with thousands of client hospitals
• Hospitals sign up, pay a fee and get group-purchasing discounts
• The GPO has to give hospitals estimates of
their likely savings.
• A hospital's data is usually in a non-standard
spreadsheet. No SKUs in healthcare (yet).
• A data-matching mess
Case Study 1: HealthCare Group
Purchasing Organization
GPO: Johnson & Johnson Sterile Scalpel #F8-505
Hospital: J&J scalpel, steel item f8505 size 3’’
• Their in-place solution: Oracle, lots of ETL tools,
and SQL with lots of rigid matching rules.
• The database of matching rules was difficult to maintain
• Matching accuracy was ~60%. The rest was done by hand.
• Processing took 1 day, plus weeks for the lines done by
hand.
Case Study 1: HealthCare Group
Purchasing Organization
What we did
• First convinced them that their solution was highly
inefficient.
• Wrote a Python program using a tree data structure and
machine learning to do the matching.
• Ran on my laptop in a few minutes. Match rates > 80%
• This was done in 3 weeks. Later settled on a solution using
Elasticsearch.
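The matching problem above can be sketched in a few lines. This is only an illustration of fuzzy matching on normalized tokens, not the tree-based program described in the talk: the alias table, the Jaccard score, the second catalog entry and the names `normalize`, `jaccard` and `best_match` are all assumptions made for the example.

```python
import re

# Hypothetical alias table: expands abbreviations seen in hospital spreadsheets.
ALIASES = {"j&j": "johnson johnson"}


def normalize(text):
    """Lowercase, expand known aliases, and split into a set of tokens."""
    text = text.lower()
    for short, full in ALIASES.items():
        text = text.replace(short, full)
    text = text.replace("&", " ")
    # Join part numbers split by a hyphen: "f8-505" -> "f8505".
    text = re.sub(r"(?<=\w)-(?=\w)", "", text)
    return set(re.findall(r"[a-z0-9]+", text))


def jaccard(a, b):
    """Overlap between two token sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(hospital_line, catalog):
    """Return (catalog entry, score) for the entry with the highest overlap."""
    query = normalize(hospital_line)
    scored = [(entry, jaccard(query, normalize(entry))) for entry in catalog]
    return max(scored, key=lambda pair: pair[1])


catalog = [
    "Johnson & Johnson Sterile Scalpel #F8-505",
    "Johnson & Johnson Sterile Gauze #G2-110",  # invented entry for contrast
]
entry, score = best_match("J&J scalpel, steel item f8505 size 3''", catalog)
print(entry, round(score, 3))
```

The point of the sketch is that a scoring function degrades gracefully where rigid SQL rules fail outright. A production matcher would need weighting (a part number should count for more than "sterile") and a threshold below which a line goes to manual review.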
Case Study 2: Retail Rec Systems
• Our client provides
coupons to retailers'
customers
• Needed a better
recommendation system
• We’re using a simple
logistic regression model
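A simple logistic regression model of this kind can be sketched without any libraries. The sketch below is an assumption-laden toy, not the client model: the two features (whether the customer bought in the category before, and a scaled count of past coupon redemptions), the data, and the names `train_logistic` and `predict` are invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, x):
    """Probability that the customer redeems the coupon (weights[0] is the bias)."""
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))


def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights by batch gradient descent on the average log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = predict(weights, x) - y
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


# Invented toy data: [bought_in_category_before, scaled_past_redemptions]
rows = [[1, 0.8], [1, 0.5], [0, 0.1], [0, 0.3], [1, 0.9], [0, 0.0]]
labels = [1, 1, 0, 0, 1, 0]  # 1 = coupon was redeemed

weights = train_logistic(rows, labels)
print(predict(weights, [1, 0.7]), predict(weights, [0, 0.2]))
```

To recommend, score each candidate coupon for a customer and offer the ones with the highest predicted probability.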
What exactly is data
science?
• Is this really new?
• Does the term “data science” make any sense?
• Is it just a fad? Over-hyped?
• Why did this term just become popular a few years back?
• Where is this going?
• Should scientists/engineers/math-types really go and make
a career doing this?
What exactly is data
science?
• Is this really new? - Not really
• Does the term “data science” make any sense? - Not really
but so what?
• Is it just a fad? Over-hyped? - No. Sometimes.
• Why did this term just become popular a few years back? -
Productivity
• Where is this going?
• Should scientists/engineers/math-types really go and make
a career doing this? - Yes, for most
Is it new?
Of course not
Combination of many subjects:
• Mathematics and statistics – probability theory
• Machine learning
• Computer science – algorithms, data structures, data bases
• Operations research - process optimization
• Business consulting
• Software development
Where have we seen this before?
Business: Finance, Insurance, Sports, Government accounting, Retail,
Google
Science: Physics, Astronomy, Biology
Isn’t there anything new?
Of course
• Analytics finally becoming ubiquitous in business (as it always should have been)
• Much more communication between disparate fields
• It’s finally work that’s fun
Ok, but why now?
It’s a big movement, so let’s give it a new name: Data Science
Why now? - Productivity
• There has always been plenty of data science in
science
• Job prospects in academia are slim
• Productivity has been rising much faster than
postdoc salaries and scientist job creation
Data scientist productivity growth
• Salary increase over postdoc requires ~2.5x
• Salaries in industry are set by productivity and supply/demand
• Crossing the threshold in productivity leads to new job creation
• Eventual slowing in productivity and/or changes in supply/demand
will end this burst in job creation
• Nothing magical happened in 2005!
Productivity drivers for data science
Long time scale
• Compute, Moore’s law
• The internet (duh!)
• HD and RAM price drop
• Science learns to deal with Big Data
• Growing importance of statistics
More recent
• Git, code sharing
• Machine learning libraries
• Python/R, open source
• Hadoop and ecosystem
• The Cloud, AWS
• NoSQL databases, in-memory
• Growing community in “data science”: cohesion, feedback
effects of popularity
Then and now
1990s data science
• Writing code in C/C++
• Working with flat files
• Even relational databases and SQL
are new
• Using proprietary software:
Matlab, IDL
• Writing all algorithms from
scratch. Slow. Buggy.
Data science today
• Working in high level open-
source languages Python, R
• We’re good at SQL and
have lots of other options
(NoSQL)
• Git, thousands of libraries
available. Easy to install.
• Can concentrate more on
what we’re good at.
So what is data science now
Data Science:
An interdisciplinary field utilizing statistics, computer
science and the methods of scientific research in
areas outside of science.
Where is it going?
• Big Data technology is separated from data science
• Software developers take over many Big Data roles
• Businesses begin to understand data science terminology, as
they now understand software terminology, and realize that they are not
Twitter.
• Data scientists and businesses find a methodology that works,
as industrial-scale software development has
Where is it going?
Specialization
• Most experienced data scientists move into consulting or
management of teams
• Universities graduate many “data scientist-lite” students from
new more specialized BS or MA programs
• Fewer generalists
• PhD students need to learn additional skills. Not instant hires
(http://bit.ly/1m3krq6)
Why won’t we have 100x
more data scientists in N
years?
• Pool of disgruntled postdocs will dry up or “I am
not even supposed to be here!”
• Many data science problems don’t need the most
cutting edge tools. (Some do).
• People rarely get much experience working with
real data in academic settings. Requires real-
world experience, takes time.
Are we there yet?
Overhyped, underhyped, mis-hyped?
• No, probably not
• Productivity growth is real
• We are solving important
problems. Plenty left.
• Big Data will probably
peak in the hype cycle
before data science
• Just watched my first
analytics commercial. IBM.
Why Big Data enthusiasm
might peak soon
Big Data defined: a process for performing calculations on data
• That cannot possibly be done on a single machine
• When sampling and streaming are not effective
• When data reduction is not possible
• When storage and compute are closely balanced
• When parallelizing is absolutely unavoidable
Most tasks are not like this
• Sampling is usually good enough for training machine learning
• Need for rapid feedback, interactive work
• CPUs are underutilized. IO limited.
• Usually a better algorithm can solve the problem better
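One way to get a training sample without loading the full data set is reservoir sampling, which draws a uniform random sample in a single pass over a stream of unknown length. The talk does not name a specific method; this is a minimal sketch of the standard algorithm, with `reservoir_sample` as an illustrative name.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items drawn uniformly at random from an iterable, in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Works on any iterable, e.g. lines of a file too large for memory.
sample = reservoir_sample(range(100_000), 100, seed=0)
print(len(sample))
```

Memory use is proportional to the sample size, not the data size, so the sample can be drawn on a laptop while the raw files stay on disk.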
Hadoop (Spark)
Good use cases
• Large batch jobs like:
restructuring and reducing
data from raw files.
• Scoring with ML models
• When you have to do
something on every data
point.
• Raw storage in HDFS
Bad use cases
• Model development
• Visualization
• Brute-forcing an inefficient
algorithm.
• Treating Hadoop like a
database.
The data-sizes we typically
see
Most companies have a few million customers, up to ~10^7
Often they store ~1000 items per customer
That’s 10^10 data points. At 5 bytes per data point that is 50 GB (500 GB at
50 bytes per data point), or at most a few TB. Fits on
our laptops (but not in memory). Such data can be moved to the cloud if
need be in 1-2 days.
Often we can be productive with either a sample or an aggregation.
True when
• Customer specific items are things like purchases, manually entered
text, logins etc.
Not true when
• Things are web-events, pair-wise interactions (e.g. graphs, social)
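The size estimate on this slide is simple multiplication and is worth checking. With 10^7 customers and 1000 items each, 5 bytes per data point gives 50 GB; the 500 GB figure corresponds to 50 bytes per data point. The sketch below redoes the arithmetic. The 50 Mbit/s sustained upload speed is an assumption for illustration, not a figure from the talk.

```python
GB = 10**9  # decimal gigabyte


def dataset_size_bytes(customers, items_per_customer, bytes_per_item):
    """Raw size of a customer data set, ignoring indexes and format overhead."""
    return customers * items_per_customer * bytes_per_item


def transfer_days(size_bytes, megabits_per_second):
    """Days needed to move size_bytes over a link with the given sustained speed."""
    seconds = size_bytes * 8 / (megabits_per_second * 10**6)
    return seconds / 86_400


small = dataset_size_bytes(10**7, 1000, 5)    # 5 bytes per data point
large = dataset_size_bytes(10**7, 1000, 50)   # 50 bytes per data point
print(small / GB, large / GB)                 # 50.0 500.0
print(round(transfer_days(large, 50), 2))     # 0.93
```

Either way the conclusion holds: the data fits on a laptop disk and can reach the cloud in about a day.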
Sources of really big data
Sensor data
• Pictures
• Video
• Health monitoring devices
• Internal device monitors
• Results of combinatorial
complexity
However
• Is it really economic to
store and process these
huge data sets to begin
with?
• We will learn to use
streaming algorithms
• We will learn to focus on
information, not noise
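As one example of a streaming algorithm, Welford's online method computes the mean and variance of sensor readings in a single pass with constant memory, so the raw readings never need to be stored. The talk does not name this algorithm; it is offered here as a representative sketch, and the class name `RunningStats` is illustrative.

```python
class RunningStats:
    """Welford's online algorithm: mean and variance in one pass, O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 in the denominator)."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for reading in [2, 4, 4, 4, 5, 5, 7, 9]:  # stand-in for a sensor stream
    stats.update(reading)
print(stats.mean, stats.variance)
```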
Case study : Particle Physics
Data reduction par excellence
• 600 million collisions per second
• Most are boring events and are not saved
• Save ~ 100 petabytes per year
Determine existence of the Higgs boson: 1 bit
Measure its mass to 1%: ~1 byte
Data = Exabytes
Information = 9 bits
Compression 10^18
Goal
$9 billion per byte!
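The compression figure can be checked with a little arithmetic. The sketch assumes exactly one exabyte of raw data (the slide only says "Exabytes") and rounds the mass measurement up to one byte, as the slide does.

```python
import math

raw_bits = 8 * 10**18      # assumed: 1 exabyte of raw collision data, in bits
existence_bits = 1         # does the Higgs boson exist? yes or no
mass_bits = 8              # mass to ~1%, rounded up to one byte

# Picking 1 of 100 levels (1% precision) needs log2(100) bits, under one byte.
print(math.log2(100))

ratio = raw_bits / (existence_bits + mass_bits)
print(f"{ratio:.1e}")      # 8.9e+17, i.e. roughly 10^18
```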
Data science consulting
The good
• Always something new,
always learning.
• Exposed to many different
people.
• Get to see how everything
works on the inside.
• See the world!
• Low career risk but still
fun.
The bad
• Your clients choose you
• People problems often
more important than math
problems
• Travel can be extreme
• Your great ideas will rarely
be credited to you.
Challenges in data science
consulting
• Businesses don’t yet understand the terminology,
process or techniques. Much teaching involved
• A visionary CEO sends you into a not-so-visionary
environment
• Problems can be vague
• Communication with business stakeholders takes
much of your time
• We are still developing an effective model. More than
just agile techniques
Red flags to avoid
• “Build us a platform for analytics so we can
become a data-driven company.” Non sequitur.
• Wanting prediction of the unpredictable
• Attempting to use ML on noisy data
• When incentives and opinions are all over the
map
• Convinced that the problem was solved 20
years ago, e.g. by linear regression, a segmentation
model, SAS.
Keep offering up bold
ideas
• Look for ways for major
productivity enhancement
• Keep up on cutting-edge
literature in stats/ML
• All my best ideas for web-
apps are now successful
companies.
• Everybody laughed at
them!
Data science is NOT going to be
productized.
FIN

NYC Open Data Meetup-- Thoughtworks chief data scientist talk

  • 1. Data Science Consulting or Science meets business, again. Third time a charm? David Johnston ThoughtWorks March 17, 2014
  • 3. Talk Overview • Agile Analytics group at ThoughtWorks • What is data science anyway? Origins and future. Good or evil? • Guide to technologies and limits to technology • Process and methodology for successful data science consulting
  • 4. ThoughtWorks • Global software consulting company • HQ in Chicago. Major offices in NY, San Fran, Dallas, India, Brazil, Australia, China - over 30 worldwide. • Privately owned by Roy Singham • Flat hierarchy of passionate people
  • 6. Agile Analytics at TW • Practice started 2011 • Led by Ken Collier and John Spens • About a dozen people involved Key Themes • BI, data warehousing and analytics have largely missed the revolution in agile methodologies. • We can do analytics in an agile, fast, light-footprint way.
  • 7. What do we do? • Probabilistic modeling • Predictive analytics / machine learning • Advanced BI, prescriptive analysis • Big Data technologies • Advanced algorithms and data structures, streaming Our main goals • Use data analysis to give companies an edge in their marketplace • Use data analysis to improve the world at large
  • 8. Some typical projects • Recommender systems • Customer behavior analysis • Optimization • Efficient algorithms/tech for massive data sets • Company-specific analytics challenges
  • 9. Case Study 1: HealthCare Group Purchasing Organization • One of the largest GPOs. 1000s of client hospitals • Hospitals sign up, pay a fee and get group-purchasing discounts • The GPO has to give hospitals estimates of their likely savings. • A hospital’s data is usually in a non-standard spreadsheet. No SKUs in healthcare (yet). • A data matching mess
  • 10. Case Study 1: HealthCare Group Purchasing Organization GPO: Johnson & Johnson Sterile Scalpel #F8-505 Hospital: J&J scalpel, steel item f8505 size 3’’ • Their in-place solution – Oracle, lots of ETL tools, using SQL with lots of rigid rules for how to match. • Database of matching rules was difficult to maintain • Accuracy of matching ~60%. Rest was done by hand. Took 1 day for processing and weeks for lines done by hand.
  • 11. Case Study 1: HealthCare Group Purchasing Organization What we did • First convinced them that their solution was highly inefficient. • Wrote a Python program using a tree data structure and machine learning to do matching. • Ran on my laptop in a few minutes. Match rates > 80% • This was done in 3 weeks. Later settled on a solution using Elasticsearch.
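To make the matching problem in Case Study 1 concrete, here is a minimal sketch of fuzzy matching between a hospital line and GPO catalog entries. It is not the tree-plus-machine-learning program described on the slide: it swaps in a much simpler technique (token normalization plus Jaccard similarity). The `ABBREVIATIONS` table, the function names and the second catalog entry are all hypothetical, introduced only for illustration.

```python
import re

# Hypothetical abbreviation table; a real one would be curated from the data.
ABBREVIATIONS = {"j&j": "johnson & johnson"}


def tokens(description):
    """Lowercase, expand known abbreviations, and split into a set of tokens.

    Hyphens are dropped first so that part numbers such as "F8-505" and
    "f8505" normalize to the same token.
    """
    text = description.lower()
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    text = text.replace("-", "")
    return set(re.findall(r"[a-z0-9]+", text))


def similarity(a, b):
    """Jaccard similarity between the token sets of two descriptions."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def best_match(hospital_line, catalog):
    """Return the (catalog entry, score) pair with the highest similarity."""
    scored = ((item, similarity(hospital_line, item)) for item in catalog)
    return max(scored, key=lambda pair: pair[1])


catalog = [
    "Johnson & Johnson Sterile Scalpel #F8-505",
    "Johnson & Johnson Gauze Pad #G2-110",  # made-up entry for contrast
]
hospital_line = "J&J scalpel, steel item f8505 size 3''"
match, score = best_match(hospital_line, catalog)
```

Unlike a table of rigid SQL rules, a similarity score degrades gracefully when a description is only partly recognizable, and a threshold on the score can decide which lines still go to manual review.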
  • 12. Case Study 2: Retail Rec Systems • Customer providing coupons to retailer customers • Needed a better recommendation system • We’re using a simple logistic regression model
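As an illustration of what "a simple logistic regression model" can look like for coupon recommendations, the sketch below fits one by batch gradient descent in plain Python and ranks coupons by predicted redemption probability. The two features (whether the customer bought in the coupon's category before, and the discount fraction), the toy data and all function names are assumptions made for this example, not details from the talk.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def redeem_probability(w, x):
    """Predicted probability of redemption; w[0] is the bias term."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


def train_logistic(rows, labels, lr=0.5, epochs=500):
    """Fit weights (bias first) by batch gradient descent on the log loss."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = redeem_probability(w, x) - y
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w


# Toy training data: (bought_in_category_before, discount_fraction) -> redeemed?
rows = [(1, 0.3), (1, 0.1), (0, 0.3), (0, 0.1), (1, 0.2), (0, 0.2)]
labels = [1, 1, 0, 0, 1, 0]
weights = train_logistic(rows, labels)

# Rank hypothetical candidate coupons for one customer, most likely first.
candidates = {"coffee": (1, 0.2), "garden tools": (0, 0.2)}
ranked = sorted(candidates,
                key=lambda name: redeem_probability(weights, candidates[name]),
                reverse=True)
```

In practice a library implementation would normally replace the hand-written training loop; the point here is only that the model is a weighted sum passed through a sigmoid, which keeps it easy to explain to business stakeholders.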
  • 13. What exactly is data science? • Is this really new? • Does the term “data science” make any sense? • Is it just a fad? Over-hyped? • Why did this term just become popular a few years back? • Where is this going? • Should scientists/engineers/math-types really go and make a career doing this?
  • 14. What exactly is data science? • Is this really new? - Not really • Does the term “data science” make any sense? - Not really, but so what? • Is it just a fad? Over-hyped? - Not a fad; sometimes over-hyped. • Why did this term just become popular a few years back? - Productivity • Where is this going? • Should scientists/engineers/math-types really go and make a career doing this? - Yes for most
  • 15. Is it new? Of course not Combination of many subjects: • Mathematics and statistics – probability theory • Machine learning • Computer science – algorithms, data structures, databases • Operations research - process optimization • Business consulting • Software development Where have we seen this before? Business: Finance, Insurance, Sports, Government accounting, Retail, Google Science: Physics, Astronomy, Biology
  • 16. Isn’t there anything new? Of course • Analytics finally becoming ubiquitous in business (as it always should have been) • Much more communication between disparate fields • It’s finally work that’s fun Ok, but why now? It’s a big movement, so let’s give it a new name: Data Science
  • 17. Why now? - Productivity • There has always been plenty of data science in science • Job prospects in academia are slim • Productivity has been rising much faster than postdoc salaries and scientist job creation
  • 18. Data scientist productivity growth • Salary increase over postdoc requires ~2.5 x • Salaries in industry are set by productivity and supply/demand • Crossing the threshold in productivity leads to new job creation • Slowing productivity growth and/or changes in supply/demand will eventually end this burst in job creation • Nothing magical happened in 2005!
  • 19. Productivity Drivers for Data Science Long time scale • Compute, Moore’s law • The internet (duh!) • HD and RAM price drop • Science learns to deal with Big Data • Growing importance of statistics More recent • Git, code sharing • Machine learning libraries • Python/R, open source • Hadoop and ecosystem • The Cloud, AWS • NoSQL databases, in-memory • Growing community in “data science”: cohesion, feedback effects of popularity
  • 20. Then and now 1990s data science • Writing code in C/C++ • Working with flat files • Even relational/SQL is new • Using Matlab, IDL proprietary software • Writing all algorithms from scratch. Slow. Buggy. Data science today • Working in high-level open-source languages: Python, R • We’re good at SQL and have lots of other options (NoSQL) • Git, thousands of libraries available. Easy to install. • Can concentrate more on what we’re good at.
  • 21. So what is data science now Data Science: An interdisciplinary field utilizing statistics, computer science and the methods of scientific research in areas outside of science.
  • 22. Where is it going? • Big Data technology is separated from data science • Software developers take over many Big Data roles • Businesses begin to understand data science terminology the way they now understand software terminology, even if they are not Twitter. • Data scientists and businesses find a methodology that works, as industrial-scale software development has
  • 23. Where is it going? Specialization • Most experienced data scientists move into consulting or management of teams • Universities graduate many “data scientist-lite” students from new more specialized BS or MA programs • Fewer generalists • PhD students need to learn additional skills. Not instant hires (http://bit.ly/1m3krq6)
  • 24. Why won’t we have 100x more data scientists in N years? • Pool of disgruntled postdocs will dry up or “I am not even supposed to be here!” • Many data science problems don’t need the most cutting edge tools. (Some do). • People rarely get much experience working with real data in academic settings. Requires real-world experience, takes time.
  • 25. Are we there yet? Overhyped, underhyped, mis-hyped? • No, probably not • Productivity growth is real • We are solving important problems. Plenty left. • Big Data will probably peak in the hype cycle before data science • Just watched my first analytics commercial. IBM.
  • 26. Why Big Data enthusiasm might peak soon Big Data defined – Process for performing calculations on data that: • Cannot possibly be done on a single machine • When sampling and streaming are not effective • When data reduction is not possible • When storage and compute are closely balanced • Parallelizing is absolutely unavoidable Most tasks are not like this • Sampling is usually good enough for training machine learning • Need for rapid feedback, interactive work • CPUs are underutilized. IO limited. • Usually a better algorithm can solve the problem better
  • 27. Hadoop (Spark) Good use cases • Large batch jobs like: restructuring and reducing data from raw files. • Scoring with ML models • When you have to do something on every data point. • Raw storage in HDFS Bad use cases • Model development • Visualization • Brute-forcing an inefficient algorithm. • Treating Hadoop like a database.
  • 28. The data sizes we typically see Most companies have a few million customers: 10^7 Often they store ~ 1000 items per customer That’s 10^10 data points. At 5 bytes/data point that is 50 GB (500 GB at 50 bytes/data point), a few TB at most. Fits on our laptops (but not in memory). Such data can be moved to the cloud if need be in 1-2 days. Often we can be productive with either a sample or an aggregation. True when • Customer-specific items are things like purchases, manually entered text, logins etc. Not true when • Things are web events, pair-wise interactions (i.e. graphs, social)
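A quick back-of-envelope check of the sizes on the slide above, assuming decimal gigabytes (10^9 bytes). With 10^10 data points, 5 bytes per point comes to 50 GB, and the 500 GB figure corresponds to roughly 50 bytes per point. The 1-2 day transfer estimate then implies a sustained uplink of a few tens of Mbit/s. The function names are illustrative only.

```python
CUSTOMERS = 10**7            # "a few million customers", rounded up
ITEMS_PER_CUSTOMER = 1000
DATA_POINTS = CUSTOMERS * ITEMS_PER_CUSTOMER   # 10**10


def size_gb(bytes_per_point):
    """Total raw size in decimal gigabytes."""
    return DATA_POINTS * bytes_per_point / 1e9


def required_mbit_per_s(gigabytes, days):
    """Sustained bandwidth needed to move `gigabytes` in `days`."""
    bits = gigabytes * 1e9 * 8
    seconds = days * 24 * 3600
    return bits / seconds / 1e6


small = size_gb(5)     # 50.0 GB
large = size_gb(50)    # 500.0 GB
one_day = required_mbit_per_s(large, 1)    # about 46 Mbit/s
two_days = required_mbit_per_s(large, 2)   # about 23 Mbit/s
```

Either way the raw data fits on a single laptop disk, which is the point of the slide.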
  • 29. Sources of really big data Sensor data • Pictures • Video • Health monitoring devices • Internal device monitors • Results of combinatorial complexity However • Is it really economic to store and process these huge data sets to begin with? • Will learn to utilize streaming algorithms • Will learn to focus on information not noise
  • 30. Case study: Particle Physics Data reduction par excellence • 600 million collisions per second • Most are boring events and are not saved • Save ~ 100 petabytes per year Determine existence of Higgs boson – 1 bit Measure its mass to 1% ~ 1 byte Data = Exabytes Information = 9 bits Compression 10^18 Goal $9 billion per byte!
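One classic example of the streaming algorithms mentioned above is reservoir sampling, which keeps a uniform random sample of fixed size from a stream of unknown length without storing the stream. The sketch below is the standard "Algorithm R"; the talk does not name a specific algorithm, so this choice and the function name are illustrative assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items drawn uniformly at random from an iterable.

    Uses O(k) memory and a single pass, so the stream never needs to be
    stored or even to fit on disk.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k/(i+1).
            j = rng.randint(0, i)   # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Sample 100 readings from a simulated stream of one hundred thousand events.
sample = reservoir_sample(range(10**5), 100, random.Random(0))
```

A sample like this is often enough for model development, which ties back to the earlier point that sampling is usually good enough for training machine learning.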
  • 31. Data science consulting The good • Always something new, always learning. • Exposed to many different people. • Get to see how everything works on the inside. • See the world! • Low career risk but still fun. The bad • Your clients choose you • People problems often more important than math problems • Travel can be extreme • Your great ideas will rarely be credited to you.
  • 32. Challenges in data science consulting • Businesses don’t yet understand the terminology, process or techniques. Much teaching involved • A visionary CEO sends you into a not-so-visionary environment • Problems can be vague • Communication with business stakeholders takes much of your time • We are still developing an effective model. More than just agile techniques
  • 33. Red flags to avoid • “Build us a platform for analytics so we can become a data-driven company” Non-sequitur • Wanting prediction of the unpredictable • Attempting to use ML on noisy data • When incentives and opinions are all over the map • Convinced that the problem was solved 20 years ago. E.g. linear regression, segmentation model, SAS.
  • 34. Keep offering up bold ideas • Look for ways for major productivity enhancement • Keep up on cutting-edge literature in stats/ML • All my best ideas for web apps are now successful companies. • Everybody laughed at them! Data science is NOT going to be productized. FIN