3. ThoughtWorks
• Global software consulting company
• HQ in Chicago. Major offices in NY, San Fran, Dallas, India, Brazil,
Australia, China - over 30 worldwide.
• Privately owned by Roy Singham
• Flat hierarchy of passionate people
4. Agile Analytics at TW
• Practice started in 2011
• Led by Ken Collier and John Spens
• About a dozen people involved
Key theme of Ken’s book
• BI, data warehousing and analytics have largely
missed the revolution in agile methodologies. We
can do it differently.
What we do
• Probabilistic modeling
• Predictive analytics / machine learning
• Advanced BI, prescriptive analysis
• Big Data technologies
• Advanced algorithms and data structures, streaming
5. Case Studies:
Recommendation systems for a
retailer customer. Our Bayesian
model (blue)
Healthcare group purchasing
organization
• Problem: matching medical
products by text description (fuzzy
matching).
• Existing solution: a complicated
rules engine. 60% match rate, one
day required per run.
• In 3 weeks we delivered a
lightweight solution in Python. >80%
match rate, runtime of a few
minutes (on a laptop).
• Later moved to Elasticsearch for
even better results.
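A minimal sketch of the fuzzy-matching idea, not the solution delivered to the client. It assumes product descriptions are short free-text strings and scores similarity with Python's standard-library difflib. The function names, the sample descriptions and the 0.6 threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces, and sort tokens
    so that word order matters less when comparing descriptions."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(sorted(cleaned.split()))


def best_match(query, catalog, threshold=0.6):
    """Return (entry, score) for the closest catalog description,
    or (None, score) if nothing reaches the (hypothetical) threshold."""
    query_norm = normalize(query)
    best_entry, best_score = None, 0.0
    for entry in catalog:
        score = SequenceMatcher(None, query_norm, normalize(entry)).ratio()
        if score > best_score:
            best_entry, best_score = entry, score
    if best_score < threshold:
        return None, best_score
    return best_entry, best_score


# Made-up sample data for illustration only.
catalog = [
    "Syringe 10 mL luer lock sterile",
    "Nitrile exam gloves, large, box of 100",
    "Gauze sponge 4x4 inch 12-ply",
]
match, score = best_match("STERILE SYRINGE, LUER-LOCK 10ML", catalog)
print(match, round(score, 2))
```

A pairwise loop like this is fine for a laptop-sized catalog. A search engine such as Elasticsearch replaces the loop with an index lookup.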
6. What exactly is data
science?
• Is this really new? - Not really
• Does the term “data science” make any sense? - Not really
but so what?
• Is it just a fad? Over-hyped? – No; sometimes.
• Why did this term just become popular a few years back? -
Productivity
• Where is this going?
• Should scientists/engineers/math-types really go and make
a career doing this? - Yes, for most
7. Is it new?
Of course not
Combination of many subjects:
• Mathematics and statistics – probability theory
• Machine learning
• Computer science – algorithms, data structures, databases
• Operations research - process optimization
• Business consulting
• Software development
Where have we seen this before?
Business: Finance, Insurance, Sports, Government accounting, Retail,
Google
Science: Physics, Astronomy, Biology
8. Why now? Data scientist productivity growth
crosses critical threshold for new job creation
• Salary increase over postdoc requires
~2.5x
• Salaries in industry are set by
productivity and supply/demand
• Crossing the threshold in productivity
leads to new job creation
• A slowdown in productivity
and/or changes in supply/demand
will eventually end this burst in job
creation
• Nothing magical happened in 2005!
9. Productivity Drivers for Data
Science
Long time scale
• Compute, Moore’s law
• The internet (duh!)
• HD and RAM price drop
• Science learns to deal with
Big Data
• Growing importance of
statistics
More recent
• Git, code sharing
• Machine learning libraries
• Python/R, open source
• Hadoop and ecosystem
• The Cloud, AWS
• NoSQL databases, in-memory
• Growing “data science”
community: cohesion, feedback
effects of popularity
10. So what is data science now?
My definition of data science:
An interdisciplinary field utilizing statistics, computer
science and the methods of scientific research in
areas outside of science.
Misses only the first one
11. Are we there yet?
Overhyped, underhyped, mis-hyped?
• No, probably not
• Productivity growth is real
• We are solving important
problems. Plenty left.
• Big Data will probably
peak in the hype cycle
before data science
• Just watched my first
analytics commercial. IBM.
“Math is not a fad”
- Aaron Erickson, ThoughtWorks
12. Case study : Particle Physics
Data reduction par excellence
• 600 million collisions per second
• Most are boring events and are not saved
• Save ~ 100 petabytes per year
Goal
Determine existence of the Higgs boson: 1 bit
Measure its mass to 1%: ~1 byte
Data = Exabytes
Information = 9 bits
Compression 10^18
$9 billion per byte!
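The compression figure on this slide can be checked with a few lines of arithmetic. This is a rough sketch under one stated assumption: "Exabytes" is taken to mean a single exabyte (10^18 bytes) of raw data. The variable names are illustrative.

```python
import math

# Assumption: "Exabytes" is taken as 1 exabyte = 10**18 bytes of raw data.
raw_bits = 8 * 10**18

# Information extracted, per the slide: 1 bit (does the Higgs exist?)
# plus roughly 1 byte (its mass to about 1%).
existence_bits = 1
mass_bits = 8
info_bits = existence_bits + mass_bits  # 9 bits

# log2(100) is about 6.6, so a 1% measurement fits within one byte.
bits_for_one_percent = math.log2(100)

compression_ratio = raw_bits / info_bits
print(f"information: {info_bits} bits")
print(f"compression ratio: about 10^{round(math.log10(compression_ratio))}")
```

More than one exabyte of raw data would only push the ratio higher, so 10^18 is an order-of-magnitude figure.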
13. Data science consulting
The good
• Always something new,
always learning.
• Exposed to many different
people.
• Get to see how everything
works on the inside.
• See the world!
• Low career risk but still
fun.
The bad
• Your clients choose you
• People problems often
more important than math
problems
• Travel can be extreme
• Your great ideas will rarely
be credited to you.
14. Challenges in data science
consulting
Common challenges
• Businesses don’t yet
understand the terminology,
process or techniques. Much
teaching involved
• Visionary CEO sends you into
a not-so-visionary environment
• Problems can be vague
• Communication with business
stakeholders takes much of
your time
• We are still developing an
effective model. More than just
agile techniques
Red flags
• “Build us a platform for analytics
so we can become a data-driven
company” Non sequitur
• Wanting prediction of the
unpredictable
• Attempting to use ML on noisy
data
• When incentives and opinions
are all over the map
• Convinced that the problem
was solved 20 years ago.
E.g. linear regression,
segmentation model, SAS.
15. Keep offering up bold
ideas
• Look for ways to achieve major
productivity enhancements
• Keep up on cutting-edge
literature in stats/ML
• All my best ideas for web
apps are now successful
companies.
• Everybody laughed at
them!
Data science is NOT going to be
productized.
FIN