On Building a Data Science Curriculum
November 23rd, 2014
Jonathan Dinu 
Director of Education, Galvanize 
jonathan@galvanize.com 
@clearspandex 
Questions? tweet @galvanize

Data Science is a comparatively new field, and it changes constantly as new techniques, tools, and problems emerge. Traditionally, education has taken a top-down approach: courses are developed on the scale of years, and committees approve curricula based on what might be the most theoretically complete approach. This is at odds, however, with an evolving industry that needs data scientists faster than they can be (traditionally) trained.

If we are to sustainably push the field of Data Science forward, we must collectively figure out how best to scale this type of education. At Zipfian I have seen (and felt) firsthand what works (and what doesn't) when tools and theory are combined in a classroom environment. This talk is a narrative about the lessons learned while trying to integrate high-level theory with practical application, how leveraging the Python ecosystem (numpy, scipy, pandas, scikit-learn, etc.) has made this possible, and what happens when you treat curriculum like a product (and the classroom like a team).
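The slides list Iris Dataset Classification as the first project in the sequence, under “Domesticated Data” (learn the tools/theory). A minimal sketch of what such an exercise might look like with scikit-learn follows. It is not the course's actual material: the choice of a random forest (named on the Technique slide), the 70/30 split, and the fixed seeds are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the bundled Iris dataset: 150 flowers, 4 numeric features, 3 species.
iris = load_iris()

# Hold out 30% of the rows for evaluation (assumed split; stratified by species).
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

# Fit a random forest on the training rows only.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Score on rows the model has never seen.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.2f}")
```

Scoring on held-out rows rather than the training rows is the habit the later “Always Validate!” slide points at.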


On Building a Data Science Curriculum

  1. On Building a Data Science Curriculum, November 23rd, 2014
  2. Jonathan Dinu, Director of Education, Galvanize, jonathan@galvanize.com, @clearspandex. Questions? tweet @galvanize
  3. Formerly
  4. Formerly
  5. + Currently
  6. Challenge: The Challenge
  7. Challenge
  8. Tools (chart). Axis labels: Framework/Library, Bespoke Code, Big Data (scalability), Small Data. Tools shown: H2O (0xdata), MapReduce (Java), MapReduce (Streaming), Cloudera ML, Mahout, MLlib (amplab), C/C++, Cascading/Crunch, Pig/Hive, Vowpal Wabbit, Giraph, GraphLab, Spark, Storm, R, CRAN, Python, Java, scikit-learn, pandas, mlpack, Weka, NumPy, JavaScript
  9. Obligatory Name Drop (table; rows: Locally, At Scale). Pipeline stages: Acquisition, Parse, Storage, Transform/Explore, Vectorization, Train, Model, Expose, Presentation. Tools named: requests, BeautifulSoup4, pymongo, pandas, Flask, scrapy, Hadoop Streaming (w/ BeautifulSoup4), Snakebite (HDFS), mrjob or Mortar (w/ Python UDF), MLlib (pySpark), Flask, scikit-learn/NLTK
  10. Challenge
  11. Challenge: Now do that in 8 weeks
  12. Challenge
  13. Intuition: Iteration 0: Intuition
  14. Content (Source: Metacademy)
  15. Content: Bottom Up Approach
  16. Content (Source: Coursera)
  17. Content (Source: UC Berkeley Masters)
  18. Issues: Not Everybody Learns This Way
  19. Issues • Not Enough Context • Not Enough Concept Overlap • Takes too much Time • Nothing Happens in a Vacuum
  20. Digression: Not Just for Data Science (relevant to learning any complex subject)
  21. Experience: Iteration 1: Experience
  22. Theory: Mathematics & Statistics. Statistical Analysis: Distributions (Binomial, Poisson, etc.), Summary Statistics (Mean, Variance, etc.), Hypothesis Testing, Bayesian Analysis. Mathematics: Linear Algebra (Matrix Factorization), Calculus (Integrals, Derivatives, etc.), Graph Theory, Probability/Combinatorics
  23. Theory: Worth the Upfront Investment
  24. Technique: Machine Learning & Software Engineering. Topics: Distributed Computing, Supervised (SVM, Random Forest), Unsupervised (K-means, LDA), NLP / Information Retrieval, Algorithms & Data Structures, Data Visualization, Data Munging, Validation and Model Comparison. Group labels: Machine Learning, Software Engineering
  25. Network (the students): Just ask them!
  26. Context is King
  27. Network
  28. Network: Iris Dataset Classification
  29. Network: Iris Dataset Classification, NYT Topic Modeling
  30. Network: Iris Dataset Classification, NYT Topic Modeling, Real-time Fraud scoring service
  31. Network: Iris Dataset Classification, NYT Topic Modeling, Real-time Fraud scoring service, Personal Capstone Project
  32. Network. Stages: “Domesticated Data” (Learn the tools/theory). Projects: Iris Dataset Classification, NYT Topic Modeling, Real-time Fraud scoring service, Personal Capstone
  33. Network. Stages: “Domesticated Data” (Learn the tools/theory), “Wild Data” (Learn the application). Projects: Iris Dataset Classification, NYT Topic Modeling, Real-time Fraud scoring service, Personal Capstone
  34. Network. Stages: “Domesticated Data” (Learn the tools/theory), “Wild Data” (Learn the application), Simulated Case Study (Learn the process). Projects: Iris Dataset Classification, NYT Topic Modeling, Real-time Fraud scoring service, Personal Capstone
  35. Network. Stages: “Domesticated Data” (Learn the tools/theory), “Wild Data” (Learn the application), Simulated Case Study (Learn the process), Greenfield Project (Learn the practice/art). Projects: Iris Dataset Classification, NYT Topic Modeling, Real-time Fraud scoring service, Personal Capstone
  36. Theory: Theory, Application, Synthesis, $$$ PROFIT!!
  37. Network: Just ask them!
  38. Network
  39. Network: Just ask them! (and be flexible)
  40. Network: Treat them like customers (because they are)
  41. Network: Always Validate!
  42. Metrics: Iteration 2: Data!
  43. Experience: Iteration 2: Data! METRICS, METRICS EVERYWHERE
  44. Metrics
  45. Metrics • Commits • Pull Requests • Passing Tests • Etc.
  46. Curriculum as Product
  47. Learning Techniques
  48. Industry Techniques (Source: http://en.wikipedia.org/wiki/Extreme_programming)
  49. Industry Techniques (Source: http://lostechies.com/scottreynolds/2009/10/07/how-we-do-things-tdd-bdd/)
  50. Industry Techniques: Code Reviews (Source: http://agile.dzone.com/articles/re-pair-programming)
  51. Our House: @Zipfian (now Galvanize)
  52. Source: http://www.sebastienmillon.com/Rainbow-Immersion-Therapy-Art-Print-15
  53. Methodology: Community, Education, Meetup, Student Groups, Corporate Training, Industry
  54. Methodology • Outcomes focused • Project-based curriculum using real datasets • Guest lectures from leaders in the field • Mock interviews and hiring preparation • Full instructional staff + personal mentorship
  55. Employment. Highest Employment Rates (2012): University of Massachusetts-Amherst School of Nursing, 98%; Georgetown University McDonough School of Business, 94%; Michigan State University College of Nursing, 92%; Syracuse University School of Architecture, 90%; University of Massachusetts-Amherst Isenberg School of Management, 90%; Michigan State University School of Hospitality Business, 89%; New York University, 88%; Boston College Connell School of Nursing, 88%; Boston College Carroll School of Management, 87%; Case Western Reserve University Frances Payne Bolton School of Nursing, 86%. U.S. News and World Report Ranking: 1. Princeton University; 2. Harvard University; 3. Yale University; 4. Columbia University; 5. Stanford University; 6. University of Chicago; 7. Duke University; 8. MIT; 9. University of Pennsylvania; 10. California Institute of Technology. Source: http://www.nerdwallet.com/nerdscholar/grad_surveys/highest-employment-rates
  56. Timeline: Data Science Immersive (timeline figure). Labels: STRUCTURED CURRICULUM, CAPSTONE PROJECT, HIRING DAY, INTERVIEWS, GRADUATION. Axis marks: 0, 8, 10.5, 12
  57. Industry Student Projects
  58. Our Students: What We Look For • Working knowledge of programming • Background in a quantitative discipline • Comfortable with mathematics and statistics • Child-like curiosity
  59. Our Students: Educational Background (bar chart; categories: BS, MS, PhD; axis 0 to 16)
  60. Our Students: Disciplines (bar chart; axis 0 to 8): Software Engineering, Analysts, Finance/Economics, Engineering, Physics, Physical Sciences, Mathematics, Statistics, Astronomy, Linguistics, Professional Poker
  61. Data Science Immersive, Masters in Data Science, Data Engineering Immersive, Weekend Workshops +
  62. Immersive, Masters
  63. Immersive, Masters (not to scale)
  64. Masters of Science - 1 year (Starts in Spring) http://www.galvanizeu.com/request-info
  65. Goals: Get Involved • Present a guest lecture or share a data story • Donate datasets and propose projects • Sponsor a scholarship • Attend our Hiring Day
  66. Goals: We’re Hiring! • Full-time Instructors • TAs • Mentor (volunteer)
  67. Questions? Thank You! Jonathan Dinu, Director of Education, Galvanize, jonathan@galvanize.com, @clearspandex
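
Slide 45 names commits, pull requests, and passing tests as classroom metrics, but the transcript does not say how they were collected or combined. The sketch below shows one way such counts could be rolled up per student using only the Python standard library. The `DayActivity` record, the `summarize` helper, the student names, and all numbers are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DayActivity:
    """One student's activity for one day (hypothetical record shape)."""
    student: str
    commits: int
    pull_requests: int
    tests_passed: int
    tests_total: int


def summarize(activity):
    """Aggregate per-student totals and an overall test pass rate."""
    totals = defaultdict(lambda: {"commits": 0, "pull_requests": 0,
                                  "tests_passed": 0, "tests_total": 0})
    for day in activity:
        row = totals[day.student]
        row["commits"] += day.commits
        row["pull_requests"] += day.pull_requests
        row["tests_passed"] += day.tests_passed
        row["tests_total"] += day.tests_total

    summary = {}
    for student, row in totals.items():
        total = row["tests_total"]
        summary[student] = {
            "commits": row["commits"],
            "pull_requests": row["pull_requests"],
            # Guard against students who have not run any tests yet.
            "pass_rate": row["tests_passed"] / total if total else 0.0,
        }
    return summary


# Made-up sample data, not real student records.
activity = [
    DayActivity("student_a", commits=5, pull_requests=1, tests_passed=8, tests_total=10),
    DayActivity("student_a", commits=3, pull_requests=0, tests_passed=10, tests_total=10),
    DayActivity("student_b", commits=2, pull_requests=1, tests_passed=3, tests_total=10),
]

summary = summarize(activity)
for student, stats in sorted(summary.items()):
    print(student, stats)
```

A low pass rate alongside few commits could flag a student to check in with, in the spirit of “Just ask them!”; the slides do not specify any thresholds, so none are assumed here.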
