This document summarizes a presentation on data science consulting. It discusses:
1) The Agile Analytics group at ThoughtWorks, which does data science consulting projects using probabilistic modeling, machine learning, and big data technologies.
2) Two case studies: a machine learning model that improved matching of healthcare product data, and a logistic regression model for a retail recommendation system.
3) The origins and future of the field: data science is not entirely new, but it has grown because improvements in technology, programming languages, and libraries have raised productivity and created new career opportunities.
3. Talk Overview
• Agile Analytics group at ThoughtWorks
• What is data science anyway? Origins and future.
Good or evil?
• Guide to technologies and limits to technology
• Process and methodology for successful data
science consulting
4. ThoughtWorks
• Global software consulting company
• HQ in Chicago. Major offices in NY, San Fran,
Dallas, India, Brazil, Australia, China - over 30
worldwide.
• Privately owned by Roy Singham
• Flat hierarchy of passionate people
6. Agile Analytics at TW
• Practice started in 2011
• Led by Ken Collier and John Spens
• About a dozen people involved
Key Themes
• BI, data warehousing and analytics have largely
missed the revolution in agile methodologies.
• We can do analytics in an agile, fast, light-footprint
way.
7. What do we do?
• Probabilistic modeling
• Predictive analytics / machine learning
• Advanced BI, prescriptive analysis
• Big Data technologies
• Advanced algorithms and data structures, streaming
Our main goals
• Use data analysis to give companies an edge in their marketplace
• Use data analysis to improve the world at large
8. Some typical projects
• Recommender systems
• Customer behavior analysis
• Optimization
• Efficient algorithms/tech for massive data sets
• Company specific analytics challenges
9. Case Study 1: HealthCare Group
Purchasing Organization
• One of the largest GPOs. 1000s of client hospitals
• Hospitals sign up, pay a fee and get group-purchasing
discounts
• The GPO has to make estimates to hospitals on
their likely savings.
• Hospital’s data is usually in a non-standard
spreadsheet. No SKUs in healthcare (yet).
• A data matching mess
10. Case Study 1: HealthCare Group
Purchasing Organization
GPO: Johnson & Johnson Sterile Scalpel #F8-505
Hospital: J&J scalpel, steel item f8505 size 3’’
• Their in-place solution – Oracle, lots of ETL tools,
using SQL with lots of rigid rules for how to match.
• Database of matching rules was difficult to maintain
• Accuracy of matching ~60%. Rest was done by hand.
Took 1 day for processing and weeks for lines done by
hand.
11. Case Study 1: HealthCare Group
Purchasing Organization
What we did
• First convinced them that their solution was highly
inefficient.
• Wrote a Python program using a tree data structure and
machine learning to do matching.
• Ran on my laptop in a few minutes. Match rates > 80%
• This was done in 3 weeks. Later settled on a solution using
Elasticsearch.
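To make the matching problem concrete, here is a minimal sketch of fuzzy matching between a hospital line and catalog entries. It is not the tree-plus-machine-learning program described on the slide; it swaps in a much simpler technique (token normalization plus Jaccard similarity). The alias table and the second catalog entry are invented for illustration.

```python
import re

# Hypothetical alias table; a real one would be curated or learned from data.
ALIASES = {"jj": "johnson"}

def tokens(text):
    """Normalize a product description into a set of comparable tokens."""
    text = text.lower().replace("&", "")
    # Join codes split by a hyphen so "f8-505" and "f8505" agree.
    text = re.sub(r"(?<=\w)-(?=\w)", "", text)
    words = re.findall(r"[a-z0-9]+", text)
    return {ALIASES.get(w, w) for w in words}

def jaccard(a, b):
    """Share of tokens the two descriptions have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def best_match(query, catalog):
    """Return (catalog_item, score) for the highest-scoring catalog entry."""
    return max(((item, jaccard(query, item)) for item in catalog),
               key=lambda pair: pair[1])

catalog = [
    "Johnson & Johnson Sterile Scalpel #F8-505",  # from the slide
    "Johnson & Johnson Sterile Gauze #G2-110",    # invented for contrast
]
hospital_line = "J&J scalpel, steel item f8505 size 3''"

item, score = best_match(hospital_line, catalog)
print(item, round(score, 3))
```

Unlike a database of rigid SQL rules, a score-based matcher returns a ranked guess for every line, so only low-scoring lines need to go to manual review.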
12. Case Study 2: Retail Rec Systems
• Customer providing
coupons to retailer
customers
• Needed a better
recommendation system
• We’re using a simple
logistic regression model
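As an illustration of what a "simple logistic regression model" for coupon recommendations could look like, here is a minimal pure-Python sketch. The features (whether the customer bought in the category before, and the discount fraction), the toy data, and the training settings are all assumptions made for this example; the slide does not describe the actual features or implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    """Estimated probability that the customer redeems the coupon."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

# Toy data (invented): [bought_in_category_before, discount_fraction]
X = [[1, 0.3], [1, 0.1], [0, 0.3], [0, 0.1], [1, 0.2], [0, 0.2]]
y = [1, 1, 0, 0, 1, 0]  # 1 = coupon was redeemed

w, b = fit_logistic(X, y)
p_repeat_buyer = predict_proba(w, b, [1, 0.2])
p_new_buyer = predict_proba(w, b, [0, 0.2])
print(round(p_repeat_buyer, 3), round(p_new_buyer, 3))
```

A recommender built this way would score each candidate coupon for a customer and offer the ones with the highest predicted redemption probability.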
13. What exactly is data
science?
• Is this really new?
• Does the term “data science” make any sense?
• Is it just a fad? Over-hyped?
• Why did this term just become popular a few years back?
• Where is this going?
• Should scientists/engineers/math-types really go and make
a career doing this?
14. What exactly is data
science?
• Is this really new? - Not really
• Does the term “data science” make any sense? - Not really
but so what?
• Is it just a fad? Over-hyped? - No, sometimes.
• Why did this term just become popular a few years back? -
Productivity
• Where is this going?
• Should scientists/engineers/math-types really go and make
a career doing this? - Yes, for most
15. Is it new?
Of course not
Combination of many subjects:
• Mathematics and statistics – probability theory
• Machine learning
• Computer science – algorithms, data structures, databases
• Operations research - process optimization
• Business consulting
• Software development
Where have we seen this before?
Business: Finance, Insurance, Sports, Government accounting, Retail,
Google
Science: Physics, Astronomy, Biology
16. Isn’t there anything new?
Of course
• Analytics finally becoming ubiquitous in business (as it always should have been)
• Much more communication between disparate fields
• It’s finally work that’s fun
Ok, but why now?
It’s a big movement, so let’s give it a new name: Data Science
17. Why now? - Productivity
• There has always been plenty of data science in
science
• Job prospects in academia are slim
• Productivity has been rising much faster than
postdoc salaries and scientist job creation
18. Data scientist productivity
growth
• Salary increase over postdoc requires
~2.5 x
• Salaries in Industry are set by
productivity and supply/demand
• Crossing the threshold in productivity
leads to new job creation
• Eventual slowing in productivity
and/or changes in supply/demand
will eventually end this burst in job
creation
• Nothing magical happened in 2005!
19. Productivity Drivers for Data-
science
Long time scale
• Compute, Moore’s law
• The internet (duh!)
• HD and RAM price drop
• Science learns to deal with
Big Data
• Growing importance of
statistics
More recent
• Git, code sharing
• Machine learning libraries
• Python/R, open source
• Hadoop and ecosystem
• The Cloud, AWS
• NoSQL databases, in-mem
• Growing community in “data
science”: cohesion, feedback
effects of popularity
20. Then and now
1990s data science
• Writing code in C/C++
• Working with flat files
• Even relational/SQL is
new
• Using proprietary software:
Matlab, IDL
• Writing all algorithms from
scratch. Slow. Buggy.
Data science today
• Working in high level open-
source languages Python, R
• We’re good at SQL and
have lots of other options
NoSQL
• Git, thousands of libraries
available. Easy to install.
• Can concentrate more on
what we’re good at.
21. So what is data science now
Data Science:
An interdisciplinary field utilizing statistics, computer
science and the methods of scientific research in
areas outside of science.
22. Where is it going?
• Big Data technology is separated from data science
• Software developers take over much of Big Data roles
• Businesses begin to understand data science terminology like
they now understand software terminology, and that they are not
Twitter.
• Data scientists and businesses find a methodology that works,
as industrial-scale software development has
23. Where is it going?
Specialization
• Most experienced data scientists move into consulting or
management of teams
• Universities graduate many “data scientist-lite” students from
new more specialized BS or MA programs
• Fewer generalists
• PhD students need to learn additional skills. Not instant hires
(http://bit.ly/1m3krq6)
24. Why won’t we have 100x
more data scientists in N
years?
• Pool of disgruntled postdocs will dry up or “I am
not even supposed to be here!”
• Many data science problems don’t need the most
cutting edge tools. (Some do).
• People rarely get much experience working with
real data in academic settings. Requires real-
world experience, takes time.
25. Are we there yet?
Overhyped, underhyped, mis-
hyped?
• No, probably not
• Productivity growth is real
• We are solving important
problems. Plenty left.
• Big Data will probably
peak in the hype cycle
before data science
• Just watched my first
analytics commercial. IBM.
26. Why Big Data enthusiasm
might peak soon
Big Data defined – Process for performing calculations on data
that:
• Cannot possibly be done on a single machine
• When sampling and streaming are not effective
• When data reduction is not possible
• When storage and compute are closely balanced
• Parallelizing is absolutely unavoidable
Most tasks are not like this
• Sampling is usually good enough for training machine learning
• Need for rapid feedback, interactive work
• CPUs are underutilized. IO limited.
• Usually a better algorithm can solve the problem better
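The point that sampling is usually good enough can be illustrated with reservoir sampling (Algorithm R), which draws a uniform random sample of fixed size from a stream of unknown length in a single pass on one machine. This is a generic sketch, not something shown in the talk; the function name and the seed parameter are choices made for this example.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Return k items chosen uniformly at random from an iterable of any length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

# Sample 1,000 records from a stream of 100,000 without holding it in memory.
training_sample = reservoir_sample(range(100_000), 1_000)
print(len(training_sample))
```

Memory use depends only on the sample size, so the same code works whether the stream has a million records or a trillion.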
27. Hadoop (Spark)
Good use cases
• Large batch jobs like:
restructuring and reducing
data from raw files.
• Scoring with ML models
• When you have to do
something on every data
point.
• Raw storage in HDFS
Bad use cases
• Model development
• Visualization
• Brute-forcing an inefficient
algorithm.
• Treating Hadoop like a
database.
28. The data-sizes we typically
see
Most companies have a few million customers: ~10^7
Often they store ~1000 items per customer
That’s 10^10 data points. At ~50 bytes/data-point that is 500 GB, or up to a few TB. Fits on
our laptops (but not in memory). Such data can be moved to the cloud if
need be in 1-2 days.
Often we can be productive with either a sample or an aggregation.
True when
• Customer specific items are things like purchases, manually entered
text, logins etc.
Not true when
• Things are web-events, pair-wise interactions (i.e. graphs, social)
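The sizing estimate above is simple arithmetic, and the result is driven almost entirely by the assumed bytes per data point. The sketch below works it through for a few values. The customer and item counts come from the slide; the range of bytes-per-point values and the 50 Mbit/s upload rate are assumptions chosen for illustration.

```python
def dataset_bytes(customers, items_per_customer, bytes_per_point):
    """Total raw size of the data set in bytes."""
    return customers * items_per_customer * bytes_per_point

def transfer_days(n_bytes, megabits_per_second):
    """Days needed to move n_bytes over a link of the given sustained speed."""
    seconds = n_bytes * 8 / (megabits_per_second * 1e6)
    return seconds / 86_400

CUSTOMERS = 10**7          # from the slide
ITEMS_PER_CUSTOMER = 1000  # from the slide
UPLINK_MBIT_S = 50         # assumed sustained upload rate

for bytes_per_point in (5, 50, 500):  # assumed range
    size = dataset_bytes(CUSTOMERS, ITEMS_PER_CUSTOMER, bytes_per_point)
    days = transfer_days(size, UPLINK_MBIT_S)
    print(f"{bytes_per_point:>3} bytes/point: {size / 1e9:,.0f} GB, "
          f"{days:.2f} days at {UPLINK_MBIT_S} Mbit/s")
```

Under these assumptions, a data set of about 500 GB moves to the cloud in roughly a day, consistent with the 1-2 day figure.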
29. Sources of really big data
Sensor data
• Pictures
• Video
• Health monitoring devices
• Internal device monitors
• Results of combinatorial
complexity
However
• Is it really economic to
store and process these
huge data sets to begin
with?
• Will learn to utilize
streaming algorithms
• Will learn to focus on
information, not noise
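As one example of a streaming algorithm for sensor-style data, the sketch below keeps a running mean and variance using Welford's method, so each reading can be discarded as soon as it has been processed. The class name and the sample readings are invented for illustration.

```python
class RunningStats:
    """Running mean and sample variance in O(1) memory (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for reading in [1.0, 2.0, 3.0, 4.0]:  # stand-in for an unbounded sensor feed
    stats.update(reading)
print(stats.mean, round(stats.variance, 4))
```

Only three numbers are stored no matter how many readings arrive, which is the sense in which streaming keeps the information and drops the raw data.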
30. Case study : Particle Physics
Data reduction par excellence
• 600 million collisions per second
• Most are boring events and are not saved
• Save ~ 100 petabytes per year
Determine existence of Higgs boson – 1 bit
Measure its mass to 1% ~ 1 byte
Data = Exabytes
Information = 9 bits
Compression 10^18
Goal
$9 billion per byte!
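The compression figure can be checked with a few lines of arithmetic. The sketch assumes "exabytes" means roughly one exabyte of raw collision data and that a 1% mass measurement distinguishes about 100 values; both are simplifications for illustration.

```python
import math

RAW_BITS = 8 * 10**18   # assumption: one exabyte of raw data, in bits

existence_bits = 1                   # does the Higgs boson exist: yes or no
mass_bits_needed = math.log2(100)    # 1% precision, about 100 distinct values
mass_bits = 8                        # the slide rounds this up to one byte

info_bits = existence_bits + mass_bits
compression = RAW_BITS / info_bits

print(f"bits needed for the mass: {mass_bits_needed:.2f}")
print(f"information: {info_bits} bits")
print(f"compression: about 10^{round(math.log10(compression))}")
```

A 1% measurement needs fewer than 7 bits, so one byte is a generous allowance, and the ratio of raw bits to 9 bits of information comes out at the order of 10^18.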
31. Data science consulting
The good
• Always something new,
always learning.
• Exposed to many different
people.
• Get to see how everything
works on the inside.
• See the world!
• Low career risk but still
fun.
The bad
• Your clients choose you
• People problems often
more important than math
problems
• Travel can be extreme
• Your great ideas will rarely
be credited to you.
32. Challenges in data science
consulting
• Businesses don’t yet understand the terminology,
process or techniques. Much teaching involved
• A visionary CEO sends you into a not-so-visionary
environment
• Problems can be vague
• Communication with business stakeholders takes
much of your time
• We are still developing an effective model. More than
just agile techniques
33. Red flags to avoid
• “Build us a platform for analytics so we can
become a data-driven company” Non-sequitur
• Wanting prediction of the unpredictable
• Attempting to use ML on noisy data
• When incentives and opinions are all over the
map
• Convinced that the problem was solved 20
years ago. E.g. linear regression, segmentation
model, SAS.
34. Keep offering up bold
ideas
• Look for ways for major
productivity enhancement
• Keep up on cutting-edge
literature in stats/ML
• All my best ideas for web-
apps are now successful
companies.
• Everybody laughed at
them!
Data science is NOT going to be
productized.
FIN