NYC Open Data Meetup-- Thoughtworks chief data scientist talk

Data Science Consulting
or
Science meets business, again.
Third time a charm?
David Johnston
ThoughtWorks
March 17, 2014

Young scientists
become…
Professors

Talk Overview
• Agile Analytics group at ThoughtWorks
• What is data science anyway? Origins and future.
Good or evil?
• Guide to technologies and limits to technology
• Process and methodology for successful data
science consulting

ThoughtWorks
• Global software consulting company
• HQ in Chicago. Major offices in NY, San Fran,
Dallas, India, Brazil, Australia, China - over 30
worldwide.
• Privately owned by Roy Singham
• Flat hierarchy of passionate people

Agile Analytics at TW
• Practice started in 2011
• Led by Ken Collier and John Spens
• About a dozen people involved
Key Themes
• BI, data warehousing and analytics have largely
missed the revolution in agile methodologies.
• We can do analytics in an agile, fast, light-footprint
way.

What do we do?
• Probabilistic modeling
• Predictive analytics / machine learning
• Advanced BI, prescriptive analysis
• Big Data technologies
• Advanced algorithms and data structures, streaming
Our main goals
• Use data analysis to give companies an edge in their marketplace
• Use data analysis to improve the world at large

Some typical projects
• Recommender systems
• Customer behavior analysis
• Optimization
• Efficient algorithms/tech for massive data sets
• Company specific analytics challenges

Case Study 1: HealthCare Group
Purchasing Organization
• One of the largest GPOs. 1000s of client hospitals
• Hospitals sign up, pay a fee and get
group-purchasing discounts
• The GPO has to make estimates to hospitals on
their likely savings.
• Hospital’s data is usually in a non-standard
spreadsheet. No SKUs in healthcare (yet).
• A data matching mess

GPO: Johnson & Johnson Sterile Scalpel #F8-505
Hospital: J&J scalpel, steel item f8505 size 3’’
• Their in-place solution – Oracle, lots of ETL tools,
using SQL with lots of rigid rules for how to match.
• Database of matching rules was difficult to maintain
• Accuracy of matching ~60%. The rest was done by hand.
Processing took 1 day, plus weeks for the lines matched by
hand.

What we did
• First convince them that their solution was highly
inefficient.
• Wrote a Python program using a tree data structure and
machine learning to do the matching.
• Ran on my laptop in a few minutes. Match rates > 80%
• This was done in 3 weeks. Later settled on a solution using
Elasticsearch.
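
To make the matching problem concrete, here is a minimal sketch of fuzzy matching on the two scalpel descriptions shown earlier. It is not the program from the talk: the tree data structure and the machine learning step are not reproduced, and plain token overlap (Jaccard similarity) stands in for them. The abbreviation table, the function names and the second catalog entry are made up for illustration.

```python
import re

# Hypothetical abbreviation table; a real one would be curated or learned.
ABBREVIATIONS = {"j&j": "johnson & johnson"}

def tokens(description):
    """Normalize a free-text item description into a set of tokens."""
    text = description.lower()
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    # Join hyphenated part numbers: "f8-505" becomes "f8505".
    text = re.sub(r"(?<=\w)-(?=\d)", "", text)
    return set(re.findall(r"[a-z0-9]+", text))

def similarity(a, b):
    """Jaccard similarity between the token sets of two descriptions."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def best_match(hospital_item, catalog):
    """Return the (catalog entry, score) pair with the highest similarity."""
    scored = ((entry, similarity(hospital_item, entry)) for entry in catalog)
    return max(scored, key=lambda pair: pair[1])

catalog = [
    "Johnson & Johnson Sterile Scalpel #F8-505",
    "Johnson & Johnson Surgical Gloves #G2-110",  # made-up entry
]
match, score = best_match("J&J scalpel, steel item f8505 size 3''", catalog)
print(match, round(score, 3))
```

Unlike a table of rigid SQL rules, a similarity score degrades gracefully: a threshold on the score decides which lines are matched automatically and which go to a person.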

Case Study 2: Retail Rec Systems
• Customer providing
coupons to retailer
customers
• Needed a better
recommendation system
• We’re using a simple
logistic regression model
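
As an illustration of how a simple logistic regression can drive coupon recommendations, here is a minimal sketch in plain Python. The two features (whether the customer has bought in the coupon's category, and the discount size), the toy data and all function names are assumptions made for the example, not details of the client project.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Probability of redemption; w[0] is the bias term."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights by batch gradient descent on the log loss."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = predict(w, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w

# Toy history: (bought_in_category, discount_fraction) -> redeemed?
rows = [(1, 0.2), (1, 0.1), (0, 0.2), (0, 0.1), (1, 0.3), (0, 0.3)]
labels = [1, 1, 0, 0, 1, 0]
weights = train_logistic(rows, labels)

# Rank candidate coupons for one customer by predicted redemption probability.
candidates = {"coffee": (1, 0.1), "pet food": (0, 0.3)}
ranked = sorted(candidates, key=lambda c: predict(weights, candidates[c]),
                reverse=True)
print(ranked)
```

The model outputs a probability per customer-coupon pair, so recommending is just sorting by that probability and taking the top few.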

What exactly is data
science?
• Is this really new?
• Does the term “data science” make any sense?
• Is it just a fad? Over-hyped?
• Why did this term just become popular a few years back?
• Where is this going?
• Should scientists/engineers/math-types really go and make
a career doing this?

What exactly is data
science?
• Is this really new? - Not really
• Does the term “data science” make any sense? - Not really
but so what?
• Is it just a fad? Over-hyped? – No; sometimes.
• Why did this term just become popular a few years back? -
Productivity
• Where is this going?
• Should scientists/engineers/math-types really go and make
a career doing this? Yes for most

Is it new?
Of course not
Combination of many subjects:
• Mathematics and statistics – probability theory
• Machine learning
• Computer science – algorithms, data structures, data bases
• Operations research - process optimization
• Business consulting
• Software development
Where have we seen this before?
Business: Finance, Insurance, Sports, Government accounting, Retail,
Google
Science: Physics, Astronomy, Biology

Isn’t there anything new?
Of course
• Analytics finally becoming ubiquitous in business (as it always should have been)
• Much more communication between disparate fields
• It’s finally work that’s fun
Ok, but why now?
It’s a big movement, so let’s give it a new name: Data Science

Why now? - Productivity
• There has always been plenty of data science in
science
• Job prospects in academia are slim
• Productivity has been rising much faster than
postdoc salaries and scientist job creation

Data scientist productivity
growth
• Salary increase over postdoc requires
~2.5 x
• Salaries in Industry are set by
productivity and supply/demand
• Crossing the threshold in productivity
leads to new job creation
• Eventual slowing in productivity
and/or changes in supply/demand
will eventually end this burst in job
creation
• Nothing magical happened in 2005!

Productivity Drivers for
Data Science
Long time scale
• Compute, Moore’s law
• The internet (duh!)
• HD and RAM price drop
• Science learns to deal with
Big Data
• Growing importance of
statistics
More recent
• Git, code sharing
• Machine learning libraries
• Python/R, open source
• Hadoop and ecosystem
• The Cloud, AWS
• NoSQL databases, in-mem
• Growing community in “data
science” cohesion, feedback
effects of popularity

Then and now
1990s data science
• Writing code in C/C++
• Working with flat files
• Even relational/SQL is
new
• Using Matlab, IDL
proprietary software
• Writing all algorithms from
scratch. Slow. Buggy.
Data science today
• Working in high level open-
source languages Python, R
• We’re good at SQL and
have lots of other options
NoSQL
• Git, thousands of libraries
available. Easy to install.
• Can concentrate more on
what we’re good at.

So what is data science now
Data Science:
An interdisciplinary field utilizing statistics, computer
science and the methods of scientific research in
areas outside of science.

Where is it going?
• Big Data technology is separated from data science
• Software developers take over much of Big Data roles
• Businesses begin to understand data science terminology like
they now understand software terminology and they are not
Twitter.
• Data scientists and businesses find a methodology that works
like industrial scale software development has

Where is it going?
Specialization
• Most experienced data scientists move into consulting or
management of teams
• Universities graduate many “data scientist-lite” students from
new more specialized BS or MA programs
• Fewer generalists
• PhD students need to learn additional skills. Not instant hires
(http://bit.ly/1m3krq6)

Why won’t we have 100x
more data scientists in N
years?
• Pool of disgruntled postdocs will dry up or “I am
not even supposed to be here!”
• Many data science problems don’t need the most
cutting edge tools. (Some do).
• People rarely get much experience working with
real data in academic settings. Requires real-
world experience, takes time.

Are we there yet?
Overhyped, underhyped, mis-
hyped?
• No, probably not
• Productivity growth is real
• We are solving important
problems. Plenty left.
• Big Data will probably
peak in the hype cycle
before data science
• Just watched my first
analytics commercial. IBM.

Why Big Data enthusiasm
might peak soon
Big Data defined – Process for performing calculations on data
that:
• Cannot possibly be done on a single machine
• When sampling and streaming are not effective
• When data reduction is not possible
• When storage and compute are closely balanced
• Parallelizing is absolutely unavoidable
Most tasks are not like this
• Sampling is usually good enough for training machine learning
• Need for rapid feedback, interactive work
• CPUs are underutilized. IO limited.
• Usually a better algorithm can solve the problem better
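
The point that sampling is usually good enough can be illustrated with reservoir sampling, which draws a uniform random sample from a data set too large to hold in memory, in a single pass on one machine. This is a minimal sketch of the standard algorithm (Algorithm R); the function name and parameters are illustrative.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of
    unknown length, using O(k) memory and one pass over the data."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive on both ends
            if j < k:
                sample[j] = item         # replace with probability k/(i+1)
    return sample

# A generator stands in for a file or event log that does not fit in memory.
events = (n for n in range(1_000_000))
training_rows = reservoir_sample(events, k=1000, seed=42)
print(len(training_rows))
```

The resulting sample can be used for model development interactively on a laptop, leaving the full data set for the final batch scoring job.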

Hadoop (Spark)
Good use cases
• Large batch jobs like:
restructuring and reducing
data from raw files.
• Scoring with ML models
• When you have to do
something on every data
point.
• Raw storage in HDFS
Bad use cases
• Model development
• Visualization
• Brute-forcing an inefficient
algorithm.
• Treating Hadoop like a
data-base.

The data-sizes we typically
see
Most companies have a few million customers 10^7
Often they store ~1000 items per customer
That’s 10^10 data points. 5 bytes/data-point = 500 GB or a few TB. Fits on
our laptops (but not in memory). Such data can be moved to the cloud if
need be in 1-2 days.
Often we can be productive with either a sample or an aggregation.
True when
• Customer specific items are things like purchases, manually entered
text, logins etc.
Not true when
• Things are web-events, pair-wise interactions (i.e. graphs, social)
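
The size estimate above can be checked with a few lines of arithmetic. Taken literally, 10^10 data points at 5 bytes each come to about 50 GB; the 500 GB figure corresponds to roughly 50 bytes per data point. Either way the data fits on a laptop disk. The function names below are illustrative, and the bandwidth figure is simply what a one-day transfer would imply.

```python
customers = 10**7            # "a few million", rounded up as on the slide
items_per_customer = 1000
data_points = customers * items_per_customer   # 10**10

def size_gb(points, bytes_per_point):
    """Raw size in gigabytes (1 GB = 10**9 bytes)."""
    return points * bytes_per_point / 1e9

def required_mbit_per_s(size_in_gb, days):
    """Sustained bandwidth needed to move the data in the given time."""
    return size_in_gb * 1e9 * 8 / (days * 86400) / 1e6

small = size_gb(data_points, 5)        # 5 bytes per data point
large = size_gb(data_points, 50)       # 50 bytes per data point
uplink = required_mbit_per_s(500, 1)   # move 500 GB in one day

print(small, large, round(uplink, 1))
```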

Sources of really big data
Sensor data
• Pictures
• Video
• Health monitoring devices
• Internal device monitors
• Results of combinatorial
complexity
However
• Is it really economic to
store and process these
huge data sets to begin
with?
• We will learn to utilize
streaming algorithms
• We will learn to focus on
information, not noise
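
As one example of a streaming algorithm, the sketch below keeps a running mean and variance of sensor readings without storing any of them (Welford's online algorithm). The class name and the sample readings are made up for illustration.

```python
class RunningStats:
    """Running mean and sample variance in O(1) memory (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0   # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for reading in [2, 4, 4, 4, 5, 5, 7, 9]:   # stand-in for a sensor feed
    stats.update(reading)                   # each reading can then be discarded
print(stats.n, stats.mean, stats.variance)
```

Only three numbers are kept no matter how long the feed runs, which is the sense in which the information is retained and the raw data is not.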

Case study : Particle Physics
Data reduction par excellence
• 600 million collisions per second
• Most are boring events and are not saved
• Save ~ 100 petabytes per year
Determine existence of the Higgs boson – 1 bit
Measure its mass to 1% ~ 1 byte
Data = Exabytes
Information = 9 bits
Compression 10^18
Goal
$9 billion per byte!
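
The compression figure follows from simple arithmetic, sketched here under the assumption that "exabytes" means about one exabyte (10^18 bytes) and that the information content is the 1 bit plus 1 byte listed above.

```python
import math

data_bits = 10**18 * 8        # one exabyte of raw data, in bits (assumed)
information_bits = 1 + 8      # existence (1 bit) + mass to 1% (about 1 byte)

compression_ratio = data_bits / information_bits
order_of_magnitude = round(math.log10(compression_ratio))

print(f"{compression_ratio:.1e}", order_of_magnitude)
```

The ratio is a little under 10^18, which rounds to the slide's figure.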

Data science consulting
The good
• Always something new,
always learning.
• Exposed to many different
people.
• Get to see how everything
works on the inside.
• See the world!
• Low career risk but still
fun.
The bad
• Your clients choose you
• People problems often
more important than math
problems
• Travel can be extreme
• Your great ideas will rarely
be credited to you.

Challenges in data science
consulting
• Businesses don’t yet understand the terminology,
process or techniques. Much teaching involved
• A visionary CEO sends you into a not-so-visionary
environment
• Problems can be vague
• Communication with business stakeholders takes
much of your time
• We are still developing an effective model. More than
just agile techniques

Red flags to avoid
• “Build us a platform for analytics so we can
become a data-driven company” Non-sequitur
• Wanting prediction of the unpredictable
• Attempting to use ML on noisy data
• When incentives and opinions are all over the
map
• Convinced that the problem was solved 20
years ago. E.g. linear regression, segmentation
model, SAS.

Keep offering up bold
ideas
• Look for ways for major
productivity enhancement
• Keep up on cutting-edge
literature in stats/ML
• All my best ideas for web-
apps are now successful
companies.
• Everybody laughed at
them!
Data science is NOT going to be
productized.
FIN
