5. Who are Data Scientists?
Data Scientist:
• Loves data
• Investigative mindset
• Works toward finding patterns in data and building data-driven products
• A practitioner, not a theorist
• Has “hands on” skills
• Domain expertise (*)
• Team player
Some backgrounds are better than others:
• Computer Science
• Statistics (mathematics)
• Natural sciences with a strong quantitative component
• PhDs, but not only
The best source of new Data Science talent is:
• Students studying computer science: 34%
• Professionals in disciplines other than IT or computer science: 27%
• Students studying fields other than computer science: 24%
• Today's BI professionals: 12%
• Other: 3%
Source: EMC Data Science Community Survey, 2011
Higher School of Economics, Moscow, 2013
6. What do Data Scientists do?
• Designs customized systems and tools
• Works with structured and unstructured data
• Creates data processing pipelines
• Analyzes massive datasets (TB, PB)
• Builds predictive models
• Creates visualizations
• Designs data products
• Uses Hadoop, MapReduce, Hive, Python, R
7. Tools of the trade
• Operating systems:
  • Linux + shell tools
• Big data instruments:
  • Hadoop (MapReduce) + Hadoop tools
  • Hive, Pig
  • NoSQL (HBase, MongoDB, Cassandra, Neo4j)
• Database:
  • SQL
• Programming:
  • Python
  • Java
  • Scala
• Machine Learning:
  • R
  • MATLAB
  • Python libraries (NumPy, SciPy, NLTK, scikit-learn)
  • Java libraries (Mahout)
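Since MapReduce appears in the tool list, here is a minimal single-machine sketch of the map, shuffle and reduce phases in plain Python. It is an illustration only and does not use Hadoop or any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of strings."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big tools"]))
```

In a real cluster the map and reduce calls run in parallel on different machines and the framework handles the shuffle; the sketch only shows the data flow.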
9. Data Scientist roles
From: “Analyzing the Analyzers” by Harlan Harris, Sean Murphy, and Marck Vaisman, O’Reilly Strata 2012
10. Data Science “dream team”
From: “Doing Data Science: Straight Talk from the Frontline”, Rachel Schutt, Cathy O'Neil, O'Reilly Media, 2013
11. Data Science project pipeline
Learning a problem → Acquiring data → Parsing data → Cleaning, filtering and organizing →
Exploring and mining for patterns → Building models → Visualizing results → Communicating findings
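The pipeline stages can be sketched as a chain of small functions. This is a toy illustration using only the Python standard library; the data, the column names and every function name are assumptions made for the example, and the "model" is a plain least-squares line rather than anything the slides prescribe.

```python
import csv
import io
from statistics import mean

# Hypothetical raw data; a real project would read files, APIs or databases.
RAW_CSV = """user,age,purchases
alice,34,5
bob,,2
carol,29,7
dave,41,not_available
erin,45,3
"""


def acquire():
    """Acquiring data: return the raw text (stand-in for a real source)."""
    return RAW_CSV


def parse(raw_text):
    """Parsing data: turn CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def clean(rows):
    """Cleaning, filtering and organizing: drop rows with invalid numbers."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append((row["user"], int(row["age"]), int(row["purchases"])))
        except ValueError:
            continue  # missing or non-numeric field
    return cleaned


def explore(records):
    """Exploring: compute simple summary statistics."""
    return {
        "count": len(records),
        "mean_age": mean(age for _, age, _ in records),
        "mean_purchases": mean(p for _, _, p in records),
    }


def build_model(records):
    """Building models: fit purchases = slope * age + intercept."""
    ages = [age for _, age, _ in records]
    purchases = [p for _, _, p in records]
    mean_x, mean_y = mean(ages), mean(purchases)
    sxx = sum((x - mean_x) ** 2 for x in ages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, purchases))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def visualize(records):
    """Visualizing results: plain-text bar chart of purchases per user."""
    return "\n".join(f"{user:<6} {'#' * p}" for user, _, p in records)


def run_pipeline():
    """Run the stages in order and return what would be communicated."""
    records = clean(parse(acquire()))
    return {
        "summary": explore(records),
        "model": build_model(records),
        "chart": visualize(records),
    }


if __name__ == "__main__":
    report = run_pipeline()
    print(report["summary"])
    print("slope, intercept:", report["model"])
    print(report["chart"])
```

"Learning a problem" and "Communicating findings" are human steps; in the sketch they correspond to choosing what to measure and to the report returned by run_pipeline().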
12. Business applications
• Marketing:
  • Market segmentation
  • Product and media mix analysis
  • Customer acquisition and churn modeling
  • Recommendation systems and cross-selling
  • Social media analysis
• Finance & Insurance:
  • Fraud prevention
  • Anomaly detection
  • Credit risk analysis
  • Usage-based insurance modeling
  • Portfolio optimization
• Healthcare and Pharmaceuticals:
  • Genetic analysis
  • Clinical trials analysis
  • Clinical decision support systems
13. Industry training
Course Outline: Cloudera Introduction to Data Science: Building Recommender Systems
Cloudera Certified Professional: Data Scientist (CCP:DS)

Introduction
Data Science Overview
> What Is Data Science?
> The Growing Need for Data Science
> The Role of a Data Scientist
Use Cases
> Finance
> Retail
> Advertising
> Defense and Intelligence
> Telecommunications and Utilities
> Healthcare and Pharmaceuticals
Project Lifecycle
> Steps in the Project Lifecycle
> Lab Scenario Explanation
Data Acquisition
> Where to Source Data
> Acquisition Techniques
Evaluating Input Data
> Data Formats
> Data Quantity
> Data Quality
Data Transformation
> Anonymization
> File Format Conversion
> Joining Datasets
Data Analysis and Statistical Methods
> Relationship Between Statistics and Probability
> Descriptive Statistics
> Inferential Statistics
Fundamentals of Machine Learning
> Overview
> The Three Cs of Machine Learning
> Spotlight: Naïve Bayes Classifiers
> Importance of Data and Algorithms
Recommender Overview
> What Is a Recommender System?
> Types of Collaborative Filtering
> Limitations of Recommender Systems
> Fundamental Concepts
Introduction to Apache Mahout
> What Apache Mahout Is (and Is Not)
> A Brief History of Mahout
> Availability and Installation
> Demonstration: Using Mahout’s Item-Based Recommender
Implementing Recommenders with Apache Mahout
> Overview
> Similarity Metrics for Binary Preferences
> Similarity Metrics for Numeric Preferences
> Scoring
Experimentation and Evaluation
> Measuring Recommender Effectiveness
> Designing Effective Experiments
> Conducting an Effective Experiment
> User Interfaces for Recommenders
Production Deployment and Beyond
> Deploying to Production
> Tips and Techniques for Working at Scale
> Summarizing and Visualizing Results
> Considerations for Improvement
> Next Steps for Recommenders
Conclusion
Appendix A: Hadoop Overview
Appendix B: Mathematical Formulas
Appendix C: Language and Tool Reference
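The outline distinguishes similarity metrics for binary and for numeric preferences but does not say which metrics the course teaches. As an illustration only, the sketch below assumes two common choices: the Tanimoto (Jaccard) coefficient for binary preferences and cosine similarity for numeric ratings. It is plain Python, not Mahout code, and the function names are hypothetical.

```python
import math


def tanimoto(users_a, users_b):
    """Binary preferences: shared users divided by all users of either item."""
    a, b = set(users_a), set(users_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(ratings_a, ratings_b):
    """Numeric preferences: cosine of two {user: rating} vectors.

    Users missing from a vector are treated as a rating of 0.
    """
    shared = set(ratings_a) & set(ratings_b)
    dot = sum(ratings_a[u] * ratings_b[u] for u in shared)
    norm_a = math.sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = math.sqrt(sum(r * r for r in ratings_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Two items described by the users who bought them (binary case).
    print(tanimoto({"u1", "u2", "u3"}, {"u2", "u3", "u4"}))
    # Two items described by the ratings users gave them (numeric case).
    print(cosine({"u1": 1, "u2": 2}, {"u1": 2, "u2": 4}))
```

An item-based recommender would compute such a similarity for every pair of items and then score unseen items for a user by their similarity to items the user already liked.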
15. Educational programs
University programs:
• University of Washington: Certificate in Data Science
• UC Berkeley: Master of Information and Data Science program
• New York University: Data Science at NYU
• Columbia University: Institute for Data Sciences and Engineering
• University of Southern California (USC): Master of Science in Data Science
Online MOOC courses:
• Coursera
• edX
• Udacity
Accelerated educational programs:
• Zipfian Academy (12-week intensive program)
• Insight Data Science Fellows program (6-week postdoctoral training)
16. Conferences
• Industry conferences and meetings:
  • O’Reilly Strata Conference: Making Data Work
  • Hadoop World
  • Big Data TechCon
  • Big Data Innovation summits
• Academic conferences (peer reviewed):
  • IEEE & ACM Supercomputing
  • IEEE Big Data
  • ACM KDD (Knowledge Discovery and Data Mining)
  • ACM SIGIR (Information Retrieval)
  • ICML (International Conference on Machine Learning)
  • ICDM (International Conference on Data Mining)
  • NIPS (Neural Information Processing Systems)
  • WWW (World Wide Web Conference)
  • VLDB (Very Large Data Bases)
  • ACM CIKM (Information and Knowledge Management)
  • SIAM SDM (International Conference on Data Mining)
  • IEEE ICDE (Data Engineering)
  • IEEE Visualization
• Meetups
18. Open questions
• How important is domain expertise?
• What is needed more: education or experience?
• The future of Data Scientists: will they be replaced by software?