NYC Open Data Meetup-- Thoughtworks chief data scientist talk



  1. Data Science Consulting, or Science meets business, again. Third time a charm?
     David Johnston, ThoughtWorks, March 17, 2014
  2. Young scientists become… Professors
  3. Talk Overview
     • Agile Analytics group at ThoughtWorks
     • What is data science anyway? Origins and future. Good or evil?
     • Guide to technologies and limits to technology
     • Process and methodology for successful data science consulting
  4. ThoughtWorks
     • Global software consulting company
     • HQ in Chicago. Major offices in NY, San Francisco, Dallas, India, Brazil, Australia, China; over 30 worldwide.
     • Privately owned by Roy Singham
     • Flat hierarchy of passionate people
  5. The three pillars
  6. Agile Analytics at TW
     • Practice started in 2011
     • Led by Ken Collier and John Spens
     • About a dozen people involved
     Key Themes
     • BI, data warehousing and analytics have largely missed the revolution in agile methodologies.
     • We can do analytics in an agile, fast, light-footprint way.
  7. What do we do?
     • Probabilistic modeling
     • Predictive analytics / machine learning
     • Advanced BI, prescriptive analysis
     • Big Data technologies
     • Advanced algorithms and data structures, streaming
     Our main goals
     • Use data analysis to give companies an edge in their marketplace
     • Use data analysis to improve the world at large
  8. Some typical projects
     • Recommender systems
     • Customer behavior analysis
     • Optimization
     • Efficient algorithms/tech for massive data sets
     • Company-specific analytics challenges
  9. Case Study 1: HealthCare Group Purchasing Organization
     • One of the largest GPOs. 1000s of client hospitals
     • Hospitals sign up, pay a fee and get group-purchasing discounts
     • The GPO has to give hospitals estimates of their likely savings.
     • A hospital’s data is usually in a non-standard spreadsheet. No SKUs in healthcare (yet).
     • A data matching mess
  10. Case Study 1: HealthCare Group Purchasing Organization
      GPO: Johnson & Johnson Sterile Scalpel #F8-505
      Hospital: J&J scalpel, steel item f8505 size 3"
      • Their in-place solution: Oracle, lots of ETL tools, using SQL with lots of rigid rules for how to match.
      • The database of matching rules was difficult to maintain
      • Accuracy of matching ~60%. The rest was done by hand. Processing took 1 day, plus weeks for the lines done by hand.
  11. Case Study 1: HealthCare Group Purchasing Organization
      What we did
      • First convinced them that their solution was highly inefficient.
      • Wrote a Python program using a tree data structure and machine learning to do the matching.
      • Ran on my laptop in a few minutes. Match rates > 80%
      • This was done in 3 weeks. Later settled on a solution using Elasticsearch.
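The matching problem on slides 10 and 11 can be illustrated with a small sketch. This is not the program described in the talk, which used a tree data structure and machine learning; it is a much simpler stand-in based on token normalization and Jaccard overlap. The `ALIASES` table, the function names and the second catalogue entry are hypothetical.

```python
import re

# Hypothetical alias table; a real one would be curated or learned from data.
ALIASES = {"j&j": "johnson & johnson"}


def normalize(text):
    """Lowercase, expand known aliases, and return the set of alphanumeric tokens."""
    text = text.lower()
    for short, full in ALIASES.items():
        text = text.replace(short, full)
    # Drop hyphens so part numbers like "f8-505" and "f8505" produce the same token.
    text = text.replace("-", "")
    return set(re.findall(r"[a-z0-9]+", text))


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(query, catalogue):
    """Return (score, catalogue_item) for the highest-scoring catalogue entry."""
    q = normalize(query)
    return max((jaccard(q, normalize(item)), item) for item in catalogue)


catalogue = [
    "Johnson & Johnson Sterile Scalpel #F8-505",  # GPO line from slide 10
    "Acme Sterile Gauze #G1-200",                 # made-up distractor
]
score, item = best_match("J&J scalpel, steel item f8505 size 3''", catalogue)
```

Unlike a table of rigid SQL rules, a scoring approach degrades gracefully: an unseen spelling lowers the score instead of failing the match outright, and a threshold on `score` decides which lines still go to manual review.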
  12. Case Study 2: Retail Rec Systems
      • The client provides coupons to retailers’ customers
      • Needed a better recommendation system
      • We’re using a simple logistic regression model
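As a rough illustration of what "a simple logistic regression model" for coupon recommendation might look like, here is a minimal sketch in plain Python. The two features, the toy training rows and the coupon names are all invented for the example; the talk does not say which features or training procedure were used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, x):
    """Predicted redemption probability; w[0] is the bias term."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def train(rows, labels, lr=0.5, epochs=500):
    """Fit weights by batch gradient descent on the average log-loss."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = predict(w, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wj - lr * g / len(rows) for wj, g in zip(w, grad)]
    return w


# Hypothetical features per (customer, coupon) pair:
# [bought in this category before (0 or 1), discount fraction]
rows = [[1, 0.3], [1, 0.1], [1, 0.2], [0, 0.3], [0, 0.1], [0, 0.2]]
labels = [1, 1, 1, 0, 0, 0]  # 1 = coupon was redeemed
weights = train(rows, labels)

# Rank candidate coupons for one customer by predicted redemption probability.
candidates = {"coffee 20% off": [1, 0.2], "diapers 20% off": [0, 0.2]}
ranked = sorted(candidates, key=lambda name: predict(weights, candidates[name]), reverse=True)
```

The appeal of such a model in a consulting setting is that the weights are easy to explain to business stakeholders, and scoring is a single dot product per candidate.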
  13. What exactly is data science?
      • Is this really new?
      • Does the term “data science” make any sense?
      • Is it just a fad? Over-hyped?
      • Why did this term just become popular a few years back?
      • Where is this going?
      • Should scientists/engineers/math-types really go and make a career doing this?
  14. What exactly is data science?
      • Is this really new? Not really
      • Does the term “data science” make any sense? Not really, but so what?
      • Is it just a fad? Over-hyped? Not a fad; sometimes over-hyped.
      • Why did this term just become popular a few years back? Productivity
      • Where is this going?
      • Should scientists/engineers/math-types really go and make a career doing this? Yes, for most
  15. Is it new? Of course not
      Combination of many subjects:
      • Mathematics and statistics: probability theory
      • Machine learning
      • Computer science: algorithms, data structures, databases
      • Operations research: process optimization
      • Business consulting
      • Software development
      Where have we seen this before?
      Business: Finance, Insurance, Sports, Government accounting, Retail, Google
      Science: Physics, Astronomy, Biology
  16. Isn’t there anything new? Of course
      • Analytics finally becoming ubiquitous in business (as it always should have been)
      • Much more communication between disparate fields
      • It’s finally work that’s fun
      Ok, but why now?
      It’s a big movement, so let’s give it a new name: Data Science
  17. Why now? Productivity
      • There has always been plenty of data science in science
      • Job prospects in academia are slim
      • Productivity has been rising much faster than postdoc salaries and scientist job creation
  18. Data scientist productivity growth
      • Salary increase over a postdoc requires ~2.5x
      • Salaries in industry are set by productivity and supply/demand
      • Crossing the threshold in productivity leads to new job creation
      • Slowing productivity growth and/or changes in supply/demand will eventually end this burst in job creation
      • Nothing magical happened in 2005!
  19. Productivity Drivers for Data Science
      Long time scale
      • Compute, Moore’s law
      • The internet (duh!)
      • HD and RAM price drop
      • Science learns to deal with Big Data
      • Growing importance of statistics
      More recent
      • Git, code sharing
      • Machine learning libraries
      • Python/R, open source
      • Hadoop and ecosystem
      • The Cloud, AWS
      • NoSQL databases, in-memory
      • Growing community in “data science”: cohesion, feedback effects of popularity
  20. Then and now
      1990s data science
      • Writing code in C/C++
      • Working with flat files
      • Even relational/SQL is new
      • Using proprietary software: Matlab, IDL
      • Writing all algorithms from scratch. Slow. Buggy.
      Data science today
      • Working in high-level open-source languages: Python, R
      • We’re good at SQL and have lots of other options (NoSQL)
      • Git, thousands of libraries available. Easy to install.
      • Can concentrate more on what we’re good at.
  21. So what is data science now?
      Data Science: An interdisciplinary field utilizing statistics, computer science and the methods of scientific research in areas outside of science.
  22. Where is it going?
      • Big Data technology is separated from data science
      • Software developers take over much of the Big Data roles
      • Businesses begin to understand data science terminology the way they now understand software terminology, even when they are not Twitter.
      • Data scientists and businesses find a methodology that works, as industrial-scale software development has
  23. Where is it going? Specialization
      • Most experienced data scientists move into consulting or management of teams
      • Universities graduate many “data scientist-lite” students from new, more specialized BS or MA programs
      • Fewer generalists
      • PhD students need to learn additional skills. Not instant hires (http://bit.ly/1m3krq6)
  24. Why won’t we have 100x more data scientists in N years?
      • Pool of disgruntled postdocs will dry up, or “I am not even supposed to be here!”
      • Many data science problems don’t need the most cutting-edge tools. (Some do.)
      • People rarely get much experience working with real data in academic settings. That requires real-world experience and takes time.
  25. Are we there yet? Overhyped, underhyped, mis-hyped?
      • No, probably not
      • Productivity growth is real
      • We are solving important problems. Plenty left.
      • Big Data will probably peak in the hype cycle before data science
      • Just watched my first analytics commercial. IBM.
  26. Why Big Data enthusiasm might peak soon
      Big Data defined: a process for performing calculations on data that:
      • Cannot possibly be done on a single machine
      • When sampling and streaming are not effective
      • When data reduction is not possible
      • When storage and compute are closely balanced
      • Parallelizing is absolutely unavoidable
      Most tasks are not like this
      • Sampling is usually good enough for training machine learning
      • Need for rapid feedback, interactive work
      • CPUs are underutilized. IO limited.
      • Usually a better algorithm can solve the problem better
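The point that sampling and streaming often remove the need for a cluster can be made concrete with reservoir sampling, a standard technique (Algorithm R) for drawing a uniform random sample of fixed size from a stream of unknown length in a single pass. It is offered here as one illustrative example, not as something the talk specifically names; the function name and `seed` parameter are choices made for this sketch.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    Uses O(k) memory regardless of how many items the stream yields.
    If the stream has fewer than k items, all of them are returned.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Draw 10 records from a "file" of a million lines without holding it in memory.
sample = reservoir_sample((f"record-{n}" for n in range(1_000_000)), k=10, seed=0)
```

A sample drawn this way can be used for model development on a laptop, leaving the full data set for the batch scoring step.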
  27. Hadoop (Spark)
      Good use cases
      • Large batch jobs like restructuring and reducing data from raw files.
      • Scoring with ML models
      • When you have to do something on every data point.
      • Raw storage in HDFS
      Bad use cases
      • Model development
      • Visualization
      • Brute-forcing an inefficient algorithm.
      • Treating Hadoop like a database.
  28. The data sizes we typically see
      Most companies have a few million customers: 10^7
      Often they store ~1000 items per customer
      That’s 10^10 data points. At 5 bytes/data point that is 50 GB; at 50 or more bytes/data point it is 500 GB or a few TB.
      Fits on our laptops (but not in memory).
      Such data can be moved to the cloud if need be in 1-2 days.
      Often we can be productive with either a sample or an aggregation.
      True when
      • Customer-specific items are things like purchases, manually entered text, logins etc.
      Not true when
      • Things are web events, pair-wise interactions (e.g. graphs, social)
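The back-of-the-envelope sizing on slide 28 is easy to check. With the slide's counts, 10^7 customers times 1000 items gives 10^10 data points; at 5 bytes each that is 5 × 10^10 bytes (50 GB), and the 500 GB figure corresponds to roughly 50 bytes per data point. The sketch below also derives the sustained bandwidth implied by moving 1 TB in one day; the 1 TB and one-day figures are assumptions picked from the slide's stated ranges.

```python
CUSTOMERS = 10**7
ITEMS_PER_CUSTOMER = 1_000
DATA_POINTS = CUSTOMERS * ITEMS_PER_CUSTOMER  # 10**10


def size_gb(bytes_per_point):
    """Total data size in decimal gigabytes for a given record width."""
    return DATA_POINTS * bytes_per_point / 1e9


def required_mbit_per_s(total_bytes, days):
    """Sustained upload rate (megabits per second) to move total_bytes in the given days."""
    seconds = days * 86_400
    return total_bytes * 8 / seconds / 1e6


small = size_gb(5)                         # 5-byte data points
large = size_gb(50)                        # 50-byte data points
uplink = required_mbit_per_s(1e12, days=1)  # assumed: 1 TB in 1 day
```

Here `small` is 50 GB, `large` is 500 GB, and `uplink` comes to a little under 100 Mbit/s, which supports the slide's claim that such data can reach the cloud in a day or two over an ordinary office connection.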
  29. Sources of really big data
      Sensor data
      • Pictures
      • Video
      • Health monitoring devices
      • Internal device monitors
      • Results of combinatorial complexity
      However
      • Is it really economical to store and process these huge data sets to begin with?
      • Will learn to utilize streaming algorithms
      • Will learn to focus on information, not noise
  30. Case study: Particle Physics
      Data reduction par excellence
      • 600 million collisions per second
      • Most are boring events and are not saved
      • Save ~100 petabytes per year
      Determine existence of the Higgs boson: 1 bit
      Measure its mass to 1%: ~1 byte
      Data = exabytes
      Information = 9 bits
      Compression: 10^18
      Goal
      $9 billion per byte!
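The numbers on slide 30 can be reproduced with a few lines of arithmetic. The sketch assumes "exabytes" means a single exabyte (10^18 bytes), which is the smallest reading of the slide; a yes/no answer is 1 bit, and a value known to one part in 100 needs log2(100), about 6.6 bits, which the slide rounds up to one byte.

```python
import math

EXABYTE_BITS = 8 * 10**18  # assumed: one exabyte (10**18 bytes), expressed in bits

existence_bits = 1                 # does the Higgs boson exist: yes or no
mass_bits_exact = math.log2(100)   # 1% precision is one part in 100
mass_bits = 8                      # rounded up to one byte, as on the slide
information_bits = existence_bits + mass_bits

# Ratio of raw data to extracted information.
compression = EXABYTE_BITS / information_bits
```

`information_bits` is 9 and `compression` is just under 9 × 10^17, which matches the slide's order-of-magnitude figure of 10^18.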
  31. Data science consulting
      The good
      • Always something new, always learning.
      • Exposed to many different people.
      • Get to see how everything works on the inside.
      • See the world!
      • Low career risk but still fun.
      The bad
      • Your clients choose you
      • People problems are often more important than math problems
      • Travel can be extreme
      • Your great ideas will rarely be credited to you.
  32. Challenges in data science consulting
      • Businesses don’t yet understand the terminology, process or techniques. Much teaching involved
      • A visionary CEO sends you into a not-so-visionary environment
      • Problems can be vague
      • Communication with business stakeholders takes much of your time
      • We are still developing an effective model. More than just agile techniques
  33. Red flags to avoid
      • “Build us a platform for analytics so we can become a data-driven company”: a non sequitur
      • Wanting prediction of the unpredictable
      • Attempting to use ML on noisy data
      • When incentives and opinions are all over the map
      • Being convinced that the problem was solved 20 years ago. E.g. linear regression, segmentation model, SAS.
  34. Keep offering up bold ideas
      • Look for ways to achieve major productivity enhancement
      • Keep up on cutting-edge literature in stats/ML
      • All my best ideas for web apps are now successful companies.
      • Everybody laughed at them!
      Data science is NOT going to be productized.
      FIN