Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1kMUPAe.
Josh Wills discusses using Hadoop technologies to build real-time data analysis models with a focus on strategies for data integration, large-scale machine learning, and experimentation. Filmed at qconsf.com.
Josh Wills is the director of data science at Cloudera. Wills is one of the main contributors to Cloudera’s most recent open source project, Crunch, a Java library that aims to make writing, testing, and running MapReduce pipelines easy, efficient, and even fun. Prior to joining Cloudera, Wills was a software engineer at Google. Josh holds an M.S.E. in operations research and a B.S. in mathematics.
From The Lab To The Factory: Building A Production Machine Learning Infrastructure
1. From The Lab to the Factory
Building A Production Machine Learning Infrastructure
Josh Wills, Senior Director of Data Science
Cloudera
2. Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/machine-learning-infrastructure
3. Presented at QCon San Francisco
www.qconsf.com
17. A Shift In Perspective
Analytics in the Lab
• Question-driven
• Interactive
• Ad-hoc, post-hoc
• Fixed data
• Focus on speed and flexibility
• Output is embedded into a report or in-database scoring engine
Analytics in the Factory
• Metric-driven
• Automated
• Systematic
• Fluid data
• Focus on transparency and reliability
• Output is a production system that makes customer-facing decisions
37. Simple Conditional Logic
• Declare experiment flags in compiled code
• Settings that can vary per request
• Create a config file that contains simple rules for calculating flag values and rules for experiment diversion
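The idea on this slide can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and does not come from the talk: the flag names (`ranking_model`, `num_results`), the JSON config layout, and the choice to divert requests by hashing a request ID into one of a fixed number of buckets. Flag defaults live in code, while the config only says which buckets are diverted into which experiment and which flag values that experiment overrides.

```python
import hashlib
import json

# Hypothetical flags. In the talk these are declared in compiled code;
# here a dict of defaults stands in for that declaration.
FLAG_DEFAULTS = {"ranking_model": "baseline", "num_results": 10}

# Hypothetical config format: each experiment claims a half-open range of
# buckets [lo, hi) and overrides some flag values for requests in that range.
CONFIG_TEXT = """
{
  "num_buckets": 1000,
  "experiments": [
    {"name": "new_model", "buckets": [0, 100],
     "flags": {"ranking_model": "candidate"}},
    {"name": "more_results", "buckets": [100, 200],
     "flags": {"num_results": 20}}
  ]
}
"""


def bucket_for(request_id, num_buckets):
    """Map a request ID to a stable bucket using a hash (an assumed scheme)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def flags_for_request(request_id, config, defaults=FLAG_DEFAULTS):
    """Return (flag values, names of active experiments) for one request."""
    flags = dict(defaults)  # copy, so the defaults are never mutated
    bucket = bucket_for(request_id, config["num_buckets"])
    active = []
    for experiment in config["experiments"]:
        lo, hi = experiment["buckets"]
        if lo <= bucket < hi:
            unknown = set(experiment["flags"]) - set(defaults)
            if unknown:
                raise ValueError("undeclared flags: %s" % sorted(unknown))
            flags.update(experiment["flags"])
            active.append(experiment["name"])
    return flags, active


if __name__ == "__main__":
    config = json.loads(CONFIG_TEXT)
    print(flags_for_request("request-123", config))
```

Because the bucket depends only on the request ID, the same ID always sees the same flag values, and changing an experiment's traffic share is a config edit rather than a code change.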
38. Separate Data Push from Code Push
• Validate config files and push updates to servers
• Zookeeper via Curator
• File-based
• Servers pick up new configs, load them, and update experiment space and flag value calculations
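A rough sketch of the file-based variant of this flow follows; the Zookeeper via Curator route is not shown. The JSON layout (a `num_buckets` count plus a list of experiments, each with a bucket range and flag overrides), the `KNOWN_FLAGS` set, and the `ConfigWatcher` class are all hypothetical names chosen for illustration. The point is only that a server can validate a pushed config and keep serving the last good one when the new file is broken.

```python
import hashlib
import json

# Hypothetical set of flags the running binary knows about.
KNOWN_FLAGS = {"ranking_model", "num_results"}


def validate_config(config):
    """Return a list of problems; an empty list means the config is usable."""
    num_buckets = config.get("num_buckets")
    if not isinstance(num_buckets, int) or num_buckets <= 0:
        return ["num_buckets must be a positive integer"]
    problems = []
    for experiment in config.get("experiments", []):
        name = experiment.get("name", "<unnamed>")
        buckets = experiment.get("buckets")
        ok_range = (
            isinstance(buckets, list)
            and len(buckets) == 2
            and all(isinstance(b, int) for b in buckets)
            and 0 <= buckets[0] < buckets[1] <= num_buckets
        )
        if not ok_range:
            problems.append("%s: bad bucket range %r" % (name, buckets))
        for flag in experiment.get("flags", {}):
            if flag not in KNOWN_FLAGS:
                problems.append("%s: unknown flag %r" % (name, flag))
    return problems


class ConfigWatcher:
    """Reloads a config file when its contents change and it validates."""

    def __init__(self, path):
        self.path = path
        self.config = None   # last good config, or None before first load
        self._digest = None  # hash of the last file contents we examined

    def poll(self):
        """Return True if a new, valid config was loaded on this call."""
        with open(self.path, "rb") as handle:
            raw = handle.read()
        digest = hashlib.sha256(raw).hexdigest()
        if digest == self._digest:
            return False  # nothing changed since the last poll
        self._digest = digest
        try:
            candidate = json.loads(raw)
        except ValueError:
            return False  # unparseable file: keep the last good config
        if not isinstance(candidate, dict) or validate_config(candidate):
            return False  # invalid config: keep the last good config
        self.config = candidate
        return True
```

A server would call `poll()` on a timer and recompute its experiment space whenever it returns `True`. Comparing content hashes rather than modification times is a design choice made here to keep the sketch deterministic.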
40. A Few Links I Love
• http://research.google.com/pubs/pub36500.html
  • The original paper on the overlapping experiments infrastructure at Google
• http://www.exp-platform.com/
  • Collection of all of Microsoft’s papers and presentations on their experimentation platform
• http://www.deaneckles.com/blog/596_lossy-betterthan-lossless-in-online-bootstrapping/
  • Dean Eckles on his paper about bootstrapped confidence intervals with multiple dependencies