Watch talk ➟ http://bit.ly/1NJICGw
At Galvanize, hundreds of students complete capstone projects every year to showcase their skills to hiring partners in industry. This talk distills the main lessons from my experience advising students on some of their first data projects. Learn from their mistakes in choosing a development process, building projects with real business value, scoping projects properly, and delivering the final presentation.
Data Science Popup Austin: Data Do's and Don'ts: Lessons From The Front Line
1. DATA SCIENCE POP UP AUSTIN
Data Do's and Don'ts: Lessons From the Front Line
Ryan Orban
VP of Product and Strategy, Data Scientist, Galvanize
ryanorban
6. We believe an opportunity belongs to anyone with aptitude and ambition.
7. Galvanize 2015
NODES ON THE NETWORK
COLORADO (BOULDER, DENVER, FORT COLLINS)
Programs: Full Stack Immersive, Data Science Immersive, Entrepreneurship
SEATTLE, WA
Programs: Full Stack Immersive, Data Science Immersive, Entrepreneurship
SAN FRANCISCO, CA
Programs: Full Stack Immersive, Data Science Immersive, Data Engineering Immersive, Master of Science in Data Science, Entrepreneurship
AUSTIN, TX (OPENING Q1 2016)
Programs: Full Stack Immersive, Data Science Immersive, Entrepreneurship
8. 5 PROGRAMS
• Full Stack Immersive
• Data Science Immersive
• Data Engineering Immersive
• Master of Science in Data Science (University of New Haven)
• Startup Membership
Projected over 500 Student Member Graduates in 2015
Currently over 1500 Members
9. PLACEMENT STATS
FULL STACK IMMERSIVE
Pre-program Salary: $43K
Average Starting Salary: $77K
Placement Rate*: 97%
DATA SCIENCE IMMERSIVE
Pre-program Salary: $72K
Average Starting Salary: $114K
Placement Rate*: 94%
*Galvanize is a founding member of NESTA (New Economy Skills Training Association), a trade organization founded to regulate the new “bootcamp” market. This placement rate is more rigorous than that requested by state licensure agencies. The placement rate is calculated 6 months after graduation.
10. How These Things Relate to Each Other
[Diagram. Terms shown: Software Engineering, Data Science, Data Analysis, Data Engineering, Machine Learning, Java, Linux/UNIX, Mobile Development, Objective C, C/C++/C#, Web Development, Ruby on Rails, JavaScript, Front-end, PHP, Full-Stack, Excel, Python, SQL, NLP, Hadoop, Databases, Network Analysis, Assembly, Statistics, R]
The orange words are the most important things we teach.
Full-Stack Web Development and Data Science are in gray circles.
11. DATA SCIENCE IMMERSIVE
Week 1 - Exploratory Data Analysis and Software Engineering Best Practices
Week 2 - Statistical Inference, Bayesian Methods, A/B Testing, Multi-Armed Bandit
Week 3 - Regression, Regularization, Gradient Descent
Week 4 - Supervised Machine Learning: Classification, Validation, Ensemble Methods
Week 5 - Clustering, Topic Modeling (NMF, LDA), NLP
Week 6 - Network Analysis, Matrix Factorization, and Time Series
Week 7 - Hadoop, Hive, and MapReduce
Week 8 - Data Visualization with D3.js, Data Products, and Fraud Detection Case Study
Weeks 9-10 - Capstone Projects
Week 12 - Onsite Interviews
15. Do / Don’t
• Don’t assume your data is friendly
• ETL and feature engineering are largely opaque to others (and yourself after enough time away)
• Automate cleaning and transformation pipelines
• Jupyter and RStudio are great for EDA, but have issues with collaboration and version control
• Build functional code to be reused; export into plain code files, track with Git
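As one illustration of the "automate and reuse" advice above, here is a minimal sketch of a cleaning pipeline kept in a plain Python file rather than a notebook. The module name clean.py, the column name "age", and the list of missing-value placeholders are all assumptions made for the example, not something from the talk.

```python
"""clean.py: a hypothetical reusable cleaning module kept under version control."""


def strip_whitespace(row):
    """Trim surrounding whitespace from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}


def blank_to_none(row):
    """Treat empty strings and common placeholders as missing values."""
    missing = {"", "na", "n/a", "null", "none"}  # assumed placeholder list
    return {
        k: None if isinstance(v, str) and v.lower() in missing else v
        for k, v in row.items()
    }


def parse_age(row):
    """Convert the (assumed) 'age' column to int; invalid values become None."""
    out = dict(row)
    try:
        out["age"] = int(row["age"])
    except (KeyError, TypeError, ValueError):
        out["age"] = None
    return out


# The pipeline is an ordered list of small, individually testable steps.
PIPELINE = [strip_whitespace, blank_to_none, parse_age]


def clean(rows, steps=PIPELINE):
    """Apply every step, in order, to every row (a dict of column -> value)."""
    cleaned = []
    for row in rows:
        for step in steps:
            row = step(row)
        cleaned.append(row)
    return cleaned
```

Because each step is a plain function, the same file can be imported from a notebook during EDA, unit tested, and diffed in Git.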
17. Do / Don’t
• Never use accuracy as your main metric
• You can have 99% accuracy but 0% predictive power
• Watch for unbalanced classes; consider sampling
• Use metrics like precision and recall
• Aggregate metrics like F1-score, AUC/AIC/BIC also good
• Remember that models with highest scores are not always the ones you need; permissive vs. conservative based on use case
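The "99% accuracy but 0% predictive power" point can be checked with a few lines of arithmetic. The sketch below uses made-up labels (1,000 examples, 10 positives) and treats precision, recall, and F1 as 0.0 when their denominator is zero, which is a convention chosen for this example.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 as a dict."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Convention for this sketch: undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up data: 10 positives followed by 990 negatives (1% positive class).
y_true = [1] * 10 + [0] * 990

# A "model" that always predicts the majority class.
always_negative = [0] * 1000

# A model that finds 8 of the 10 positives at the cost of 20 false alarms.
useful_model = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 970

print(metrics(y_true, always_negative))  # accuracy 0.99, recall 0.0
print(metrics(y_true, useful_model))     # lower accuracy, recall 0.8
```

The always-negative model wins on accuracy while never finding a positive, which is why precision and recall are the more informative numbers here.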
18. Do / Don’t
• Don’t start with the most complicated models first (deep learning, gradient boosting, SVMs, etc.)
• Don’t focus on the algorithm
• “More data always beats better algorithms”
• But better features usually beat better algorithms*
• Start with a baseline model, then continuously “close the loop”
• Create a base case to optimize against
• Does 1% greater F1-score outweigh a 10x training time in production? Not usually unless you’re Google-scale.
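A baseline can be as simple as predicting a constant. The sketch below assumes a regression task, uses the training mean as the base case, and adds a hypothetical helper, worth_the_complexity, whose 5% threshold is an arbitrary value picked for illustration.

```python
from statistics import mean


def fit_mean_baseline(y_train):
    """The simplest possible 'model': always predict the training mean."""
    return mean(y_train)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def worth_the_complexity(baseline_error, candidate_error, min_relative_gain=0.05):
    """True if the candidate cuts error by at least min_relative_gain.

    The 5% default is an assumption for this sketch; the right threshold
    depends on training cost and the use case.
    """
    if baseline_error == 0:
        return False
    gain = (baseline_error - candidate_error) / baseline_error
    return gain >= min_relative_gain


# Made-up numbers for illustration.
y_train = [10, 12, 14]
y_test = [11, 13]

constant = fit_mean_baseline(y_train)
baseline_error = mean_absolute_error(y_test, [constant] * len(y_test))

# Suppose a far more complex model reaches an error of 0.98 on the same data.
print(baseline_error, worth_the_complexity(baseline_error, 0.98))
```

Every later model is then judged by how much it improves on the base case, not by its score in isolation.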
19. Do / Don’t
• Don’t assume your cross-validation metrics will hold up against real-life data
• Separate your application and prediction code
• Fast iteration cycles are key. Create a “scoring service” that is uncoupled from application code.
• APIs & service oriented architectures typically work best
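One way to picture a scoring service that is uncoupled from application code is a small HTTP endpoint that takes JSON features and returns a score. This is a sketch using only the Python standard library; the predict rule, the "amount" feature, and the port are placeholders invented for the example, and a real deployment would likely use a production web framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Placeholder model: swap a real trained model in behind this function."""
    # Hypothetical rule standing in for a model's output.
    return 1.0 if features.get("amount", 0) > 1000 else 0.0


def handle_score_request(body):
    """Pure request logic: JSON bytes in, (status, JSON bytes) out."""
    try:
        features = json.loads(body)
        if not isinstance(features, dict):
            raise ValueError("expected a JSON object")
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        return 400, json.dumps({"error": str(exc)}).encode()
    return 200, json.dumps({"score": predict(features)}).encode()


class ScoreHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper; all logic lives in handle_score_request."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_score_request(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Start the service on localhost; call this explicitly to run it."""
    HTTPServer(("127.0.0.1", port), ScoreHandler).serve_forever()
```

The application only needs to know the request and response shape, so the model behind predict can be retrained and redeployed without touching application code.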
21. Do / Don’t
• Don’t focus on the “how”, i.e. covering every trial and tribulation
• Cut to the chase
• After a presentation, I always ask the class two questions:
• What is one sentence that describes what the speaker learned?
• Why do I care?
22. TALENT
• Early Access to Students
• Candidate Matching
• Curriculum Development
• Corporate Student Sponsorship
• Diversity