1. Data Science Retreat
Berlin, Mar 2014
http://datascienceretreat.com/
Introduction to the first Data Science school
in Europe
Plus advice for upcoming data scientists
2. Who Am I
Twitter: @quesada
Before: Consulting on predictive
models of ecommerce (CLV),
data scientist at GetYourGuide
3. What this talk is about: Two problems
• Making the jump from junior to senior data
scientist is hard (solution: Data Science Retreat)
• Acquiring the skillset, even with killer online
courses, is hard (solution: meerkat method)
4. Contents
• Data Science retreat
• The meerkat method
• Data Science retreat for companies
• Hiring
• Getting tailored courses at your location
• Advice to anyone on their path to be a data scientist
• Advice to companies growing a data team
6. It’s too hard for companies to find data scientists
“It takes 150 phone interviews
to find someone who is good
enough to bring in to continue
on-site”
Alex Kagoshima, Pivotal,
Berlin
7. People applying to data scientist jobs have no experience
• Vincent Granville:
“There is no shortage of data scientists. For every LinkedIn
job, there are several hundred applications on average”
8. Data scientists need to program (5 years' experience)
Stefan Schmidt (Amazon Berlin):
“It takes us months to fill our positions; we hire world-wide
for Berlin openings. Most profiles cannot program at the level
we need. We have engineers, but the data scientist needs to be
able to understand large projects and commit code”
9. Truth is, data scientist is a senior role
• Often advising the CEO directly
• This is why so many people with strong profiles and
lots of coursera courses cannot find jobs
10. The gap from junior to senior
• Junior:
• Has a technical degree
• Has done some courses online
• Has never worked with data that generates value to
companies
• Can apply ‘recipes’, but not think creatively about data
sources and algorithms
This profile has no practical value for most companies
15. Formulating the analytical problem
• Finding the question
• Translating something vague into a
dependent measure and an actual set of
predictors
• What generates business value?
• The Business Model Canvas to design a data
product
• Key performance indicators; examples,
measurement, improvement
• Most business problems are not very well
defined. How do we make them actionable?
• Analyzing big success stories in data science
• Getting Buy-in
16. Getting data (APIs, feature engineering)
• Using APIs
• Using databases
• Parsing HTML; web scraping
• Transforming data (reshape)
• Finding APIs
• Feature engineering
• Avoiding autocorrelation
• Removing features with low variance
• Detecting outliers
• Exploratory analyses
• Measuring predictor importance
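Two of the feature-engineering steps listed above, removing low-variance features and detecting outliers, can be sketched in a few lines. This is only an illustration using the Python standard library: the column names, the variance threshold, and the choice of Tukey's IQR rule are assumptions made for the example, not something the curriculum prescribes.

```python
from statistics import pvariance, quantiles

def drop_low_variance(columns, threshold=1e-3):
    """Keep only the columns whose population variance exceeds the threshold."""
    return {name: values for name, values in columns.items()
            if pvariance(values) > threshold}

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical toy data: one informative column and one constant column.
columns = {
    "price": [10, 12, 11, 13, 12, 11, 95],
    "currency_code": [1, 1, 1, 1, 1, 1, 1],
}

kept = drop_low_variance(columns)        # the constant column is dropped
outliers = iqr_outliers(columns["price"])  # 95 lies far above the upper fence
print(list(kept), outliers)
```

In practice these checks would usually be run with pandas or R on real tables; the point here is only the logic of each step.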
17. Finding insights, making predictions
• Regression
• Linear regression, penalized
models
• non-linear regression
• SVM
• K-nearest neighbors
• regression trees + rule-based
models (random forests)
19. R
• R language fundamentals
• data structures (including
data.table)
• subsetting
• input/output
• functions/control flow
• vectorization
• split-apply-combine
20. R
• advanced R
• functional programming in
R
• Profiling
• object systems
• packaging
• Rcpp
21. data at scale
• MapReduce
• MapReduce, Google 2004.
• Applications, extensions. Beyond
MapReduce.
• Big Data analysis
• Preparation and configuration
• Hadoop cluster overview.
• Practice: Uploading / downloading
/ moving files around, executing
jobs, checking for completion /
failure, etc.
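As a rough illustration of the MapReduce idea named above, the classic word-count job can be simulated in plain Python. This is a single-process sketch, not Hadoop code: the function names and the sample documents are made up for the example, and a real cluster would distribute the map and reduce phases across machines.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle, and reduce in sequence over all documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical input records.
docs = ["big data big insights", "data at scale"]
counts = word_count(docs)
print(counts)
```

The separation into three functions mirrors the model itself: only `map_phase` and `reduce_phase` are written by the user, while the grouping step is handled by the framework.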
22. data at scale
• Hive / Pig
• Defining a Hive table, querying a Hive table.
• Integrating R with Hive.
• An introduction to Pig.
• Mahout
• Executing clustering tasks. Visualizing the
results with R.
• Executing an item-to-item recommender.
• Cascading / Pattern
• Data flow modeling using PyCascading.
• Executing Machine Learning "Pattern"
algorithms.
24. Methodology: portfolio project
• Ten students per batch
• Pair programming and code reviews with mentors (guild model)
• Datasets come from companies (non-NDA only)
• Portfolio project, where the fellow demonstrates what he can do
end-to-end to deliver value
• Weekly presentation training to improve communication to
non-technical stakeholders (video feedback)
25. Who we are looking for
• Passion for generating insights from data
• Familiarity with trends in data growth, open-source platforms, and public
data sets.
• From familiarity to strong knowledge of statistical methods
• Some experience with statistical languages and packages, including Mahout, R
or Python with pandas
• Some familiarity with visualization software and techniques (including
Tableau)
• Preferably, experience working hands-on with large-scale data sets
• Excellent written and verbal communication skills
37. Advantages
• No need to find the right tutorial/book/whatever
• Spend more time at the border of your capability
• You save time by skipping exercises that would be too easy
38. Advantages (cont)
• Higher project completion rates: all projects must have a
concrete output, so you will see your own progress in
tangible ways
• You will have an easier time demonstrating progress to
yourself and to others (the Mentor vouches for the
Learner).
• You will get more hands-on training than in other methods
49. Then Laura, Stefan’s friend,
pointed him to
Data Science Retreat
… an intensive course helping selected fellows
ramp up fast for a career in data science. “Tell
me more…”. Stefan was very interested.
50. Stefan could
interview ten
data scientists that
were
as good as Ben.
He hired three, and
they jumped into
their roles with little
training.
Stefan was ecstatic!
51. How it works
• As a sponsor you pay 7000€ in advance + 3000€ after the
data scientist has worked on-site for 3 months and you know you
want to keep him.
• Students who take the sponsorship agree to work for a reduced
salary (50%) for the first 3 months. The salary savings during the
internship should cover the cost of the sponsorship. When the
students finish the program, no one has any obligations.
52. How it works: an example
• You prepay 7000€ and become a sponsor.
• At this point, you don't know the students.
• But as a committed sponsor you participate actively
during the retreat, see the students' presentations, go
out for lunch with them, etc.
• Thanks to these activities, you have now developed
strong relationships and know more about the students
than what would come out in interviews.
53. How it works, an example (contd)
• You have set your sights on a killer candidate: Klaas. You
make an offer, he accepts, and he starts working at your
location
• Klaas gets paid 50% of his negotiated salary for the first 3 months. If Klaas's
salary is 60000€/year, that is 5000€/month at full pay, so half pay saves
2500€ * 3 months = 7500€, which covers your
initial investment of 7000€.
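The arithmetic in this example can be checked with a small calculation. The sketch below uses the figures from the slides (7000€ prepayment, 3000€ after three months, 50% salary for 3 months); the function and parameter names are invented for illustration.

```python
def sponsorship_balance(annual_salary, upfront_fee=7000, completion_fee=3000,
                        reduced_months=3, salary_fraction=0.5):
    """Compare salary savings during the reduced-pay period with sponsor fees."""
    monthly_salary = annual_salary / 12
    # Savings: the part of the salary that is not paid during the reduced period.
    savings = monthly_salary * (1 - salary_fraction) * reduced_months
    return {
        "monthly_salary": monthly_salary,
        "savings": savings,
        "net_after_upfront": savings - upfront_fee,
        "net_after_all_fees": savings - upfront_fee - completion_fee,
    }

# Klaas's case from the slide: 60000 EUR per year.
klaas = sponsorship_balance(60000)
print(klaas)
```

With these inputs the savings come to 7500€, which is 500€ more than the 7000€ prepayment. The additional 3000€ due after three months is not offset by the salary savings in this calculation.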
54. Data Science Retreat
Contact:
Jose Quesada, PhD,
Director, Data Science Retreat Berlin
jose@datascienceretreat.com
Do you want to be a sponsor?
55. Advice to anyone on their path to be a data scientist
• Try to find a mentor
• Spend as much time as possible at the border of your ability
• Practice communication
• Having a culture that can integrate such individuals is as
hard as finding them. Interview your companies
• How do you move from being a junior person to being the
'CEO whisperer'? Spend time with people who are
56. Getting tailored courses at your location
• We listen to the people you need to train before we
design the course
• We will start with a dataset that is important for your
company. Lacking that, we’ll bring a public dataset that is
relevant
• Enterprise courses have a reputation for being ineffective
57. Hire somebody who’s better at engineering and teach him data science or
hire somebody who’s better with data and teach him engineering?
• Is your culture ready? Because if you manage to attract
someone senior enough, they will sense if it's not
• The problems you have must be a good match for the data
scientists. People are extremely specialized, more so after
PhDs. If you have, say, graph theory/recommendation
problems, and hire someone with a time series background,
things will take a while no matter how good he is in his
field