Data Scientists
Leonid Zhukov

Higher School of Economics, Moscow, 2013
www.hse.ru
The Sexiest Job of the 21st Century

McKinsey estimates a shortage of 140,000-190,000 by 2018

Data Scientists wanted!

Supply and demand

Who are Data Scientists?

Data Scientist:
•  Loves data
•  Has an investigative mindset
•  Aims to find patterns in data and to build data-driven products
•  Is a practitioner, not a theorist
•  Has “hands-on” skills
•  Has domain expertise (*)
•  Is a team player
Some backgrounds are better than others:
•  Computer science
•  Statistics (mathematics)
•  Natural sciences with a strong quantitative component
•  PhDs, but not only

The best source of new Data Science talent is:
•  Students studying computer science: 34%
•  Professionals in disciplines other than IT or computer science: 27%
•  Students studying fields other than computer science: 24%
•  Today's BI professionals: 12%
•  Other: 3%

From: EMC Data Science Community Survey, 2011


What do Data Scientists do?

•  Designs customized systems and tools
•  Works with structured and unstructured data
•  Creates data processing pipelines
•  Analyzes massive datasets (TB, PB)
•  Builds predictive models
•  Creates visualizations
•  Designs data products
•  Uses Hadoop, MapReduce, Hive, Python, R

Tools of the trade
•  Operating systems:
    •  Linux + shell tools
•  Big data instruments:
    •  Hadoop (MapReduce) + Hadoop tools
    •  Hive, Pig
    •  NoSQL (HBase, MongoDB, Cassandra, Neo4j)
•  Database:
    •  SQL
•  Programming:
    •  Python
    •  Java
    •  Scala
•  Machine learning:
    •  R
    •  MATLAB
    •  Python libraries (NumPy, SciPy, NLTK, scikit-learn)
    •  Java libraries (Mahout)
Required skills

•  Programming
•  Algorithms
•  Statistics
•  Data mining
•  Machine learning
•  NLP
•  Distributed systems
•  Big data tools
•  Databases
•  Visualization

From: Swami Chandrasekaran, Executive Architect, IBM, Watson Solutions
Data Scientist roles

From: “Analyzing the Analyzers” by Harlan Harris, Sean Murphy, and Marck Vaisman, O’Reilly Strata 2012
Data Science “dream team”

From: “Doing Data Science: Straight Talk from the Frontline”, Rachel Schutt, Cathy O'Neil, O'Reilly Media, 2013 
Data Science project pipeline

•  Learning a problem
•  Acquiring data
•  Parsing data
•  Cleaning, filtering and organizing
•  Exploring and mining for patterns
•  Building models
•  Visualizing results
•  Communicating findings
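The pipeline stages named on this slide, from acquiring data through communicating findings, can be sketched as a chain of small functions. This is only an illustrative sketch using the Python standard library: the dataset, the column names (`spend`, `sales`), and every function name are assumptions invented for the example and do not come from the slides. The first stage, learning the problem, is a human step and has no code counterpart here.

```python
import csv
import io
import statistics

# Hypothetical raw data (assumption): ad spend vs. sales, with one dirty row.
RAW_CSV = """spend,sales
1,3
2,5
3,
4,9
5,11
"""

def acquire():
    """Acquire data: here, simply return an in-memory CSV string."""
    return RAW_CSV

def parse(text):
    """Parse raw text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Clean and filter: drop rows with missing fields, convert to floats."""
    cleaned = []
    for row in rows:
        if all(row[key] for key in ("spend", "sales")):
            cleaned.append((float(row["spend"]), float(row["sales"])))
    return cleaned

def explore(points):
    """Explore: basic descriptive statistics of the cleaned data."""
    xs, ys = zip(*points)
    return {
        "n": len(points),
        "mean_spend": statistics.mean(xs),
        "mean_sales": statistics.mean(ys),
    }

def build_model(points):
    """Build a model: least-squares line sales = a * spend + b."""
    xs, ys = zip(*points)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    a = sum((x - mx) * (y - my) for x, y in points) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

def visualize(points):
    """Visualize: a crude text bar chart, one row per observation."""
    return "\n".join(f"spend={x:>4.1f} | " + "#" * int(y) for x, y in points)

def communicate(summary, model):
    """Communicate: a one-line, human-readable summary of the result."""
    a, b = model
    return f"{summary['n']} clean rows; fitted sales = {a:.2f} * spend + {b:.2f}"

def run_pipeline():
    """Run every stage in order and return the intermediate artifacts."""
    points = clean(parse(acquire()))
    summary = explore(points)
    model = build_model(points)
    return points, summary, model, visualize(points), communicate(summary, model)

if __name__ == "__main__":
    *_, chart, report = run_pipeline()
    print(chart)
    print(report)
```

Keeping each stage as a separate function mirrors the slide: any stage can be swapped for a heavier tool (for example a Hadoop job for parsing, or a library model for the fitting step) without touching the others.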
  

Business applications
•  Marketing:
    •  Market segmentation
    •  Product and media mix analysis
    •  Customer acquisition and churn modeling
    •  Recommendation systems and cross-sell
    •  Social media analysis
•  Finance & Insurance:
    •  Fraud prevention
    •  Anomaly detection
    •  Credit risk analysis
    •  Usage-based insurance modeling
    •  Portfolio optimization
•  Healthcare and Pharmaceuticals:
    •  Genetic analysis
    •  Clinical trials analysis
    •  Clinical decision support systems
Industry training

Course Outline: Cloudera Introduction to Data Science

Introduction

Data Science Overview
> What Is Data Science?
> The Growing Need for Data Science
> The Role of a Data Scientist

Use Cases
> Finance
> Retail
> Advertising
> Defense and Intelligence
> Telecommunications and Utilities
> Healthcare and Pharmaceuticals

Project Lifecycle
> Steps in the Project Lifecycle
> Lab Scenario Explanation

Data Acquisition
> Where to Source Data
> Acquisition Techniques

Evaluating Input Data
> Data Formats
> Data Quantity
> Data Quality

Data Transformation
> Anonymization
> File Format Conversion
> Joining Datasets

Data Analysis and Statistical Methods
> Relationship Between Statistics and Probability
> Descriptive Statistics
> Inferential Statistics

Fundamentals of Machine Learning
> Overview
> The Three Cs of Machine Learning
> Spotlight: Naïve Bayes Classifiers
> Importance of Data and Algorithms

Recommender Overview
> What Is a Recommender System?
> Types of Collaborative Filtering
> Limitations of Recommender Systems
> Fundamental Concepts

Introduction to Apache Mahout
> What Apache Mahout Is (and Is Not)
> A Brief History of Mahout
> Availability and Installation
> Demonstration: Using Mahout’s Item-Based Recommender

Implementing Recommenders with Apache Mahout
> Overview
> Similarity Metrics for Binary Preferences
> Similarity Metrics for Numeric Preferences
> Scoring

Experimentation and Evaluation
> Measuring Recommender Effectiveness
> Designing Effective Experiments
> Conducting an Effective Experiment
> User Interfaces for Recommenders

Production Deployment and Beyond
> Deploying to Production
> Tips and Techniques for Working at Scale
> Summarizing and Visualizing Results
> Considerations for Improvement
> Next Steps for Recommenders

Conclusion

Appendix A: Hadoop Overview
Appendix B: Mathematical Formulas
Appendix C: Language and Tool Reference

Cloudera Introduction to Data Science: Building Recommender Systems
Cloudera Certified Professional: Data Scientist (CCP:DS)
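The outline lists item-based recommenders and similarity metrics for binary preferences. As a rough illustration of that idea, the sketch below scores unseen items by their Jaccard (Tanimoto) similarity to items a user already likes. It is plain Python and is not Mahout's API; the users, items, and function names are all hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical binary preferences (assumption): user -> set of liked items.
PREFERENCES = {
    "alice": {"A", "B", "C"},
    "bob": {"A", "B"},
    "carol": {"B", "C", "D"},
    "dave": {"A", "D"},
}

def users_by_item(preferences):
    """Invert user -> items into item -> set of users who liked it."""
    index = defaultdict(set)
    for user, items in preferences.items():
        for item in items:
            index[item].add(user)
    return dict(index)

def jaccard(a, b):
    """Jaccard (Tanimoto) similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, preferences, top_n=2):
    """Score each unseen item by summing its similarity to the user's items."""
    index = users_by_item(preferences)
    seen = preferences[user]
    scores = {
        candidate: sum(jaccard(index[candidate], index[item]) for item in seen)
        for candidate in index
        if candidate not in seen
    }
    # Sort by descending score, then by item name for a deterministic order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

if __name__ == "__main__":
    for item, score in recommend("bob", PREFERENCES):
        print(f"{item}: {score:.3f}")
```

With binary data there are no ratings to compare, so set overlap between the audiences of two items is a natural similarity measure; numeric preferences would call for a different metric.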

Industry training

Educational programs
University programs:
•  University of Washington: Certificate in Data Science
•  UC Berkeley: Master of Information and Data Science program
•  New York University: Data Science at NYU
•  Columbia University: Institute for Data Sciences and Engineering
•  University of Southern California (USC): Master of Science in Data Science
Online courses (MOOCs):
•  Coursera
•  edX
•  Udacity

Accelerated educational programs:
•  Zipfian Academy (12-week intensive program)
•  Insight Data Science Fellows Program (6-week postdoctoral training)
Conferences
•  Industry conferences and meetings:
    •  O’Reilly Strata Conference: Making Data Work
    •  Hadoop World
    •  Big Data TechCon
    •  Big Data Innovation summits
•  Academic conferences (peer reviewed):
    •  IEEE & ACM Supercomputing
    •  IEEE Big Data
    •  ACM KDD: Knowledge Discovery and Data Mining
    •  ACM SIGIR: Information Retrieval
    •  ICML: International Conference on Machine Learning
    •  ICDM: International Conference on Data Mining
    •  NIPS: Neural Information Processing Systems
    •  WWW: World Wide Web Conference
    •  VLDB: Very Large Data Bases
    •  ACM CIKM: Information and Knowledge Management
    •  SIAM SDM: International Conference on Data Mining
    •  IEEE ICDE: Data Engineering
    •  IEEE Visualization
•  Meetups
Textbooks

Open questions
• How important is domain expertise?
• What is needed more: education or experience?
• The future of data scientists: will they be replaced by software?

20, Myasnitskaya str., Moscow, Russia, 101000
Tel.: +7 (495) 628-8829, Fax: +7 (495) 628-7931
www.hse.ru

ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 

Data Scientists

  • 1. Data Scientists. Leonid Zhukov, Higher School of Economics, Moscow, 2013. www.hse.ru
  • 2. The Sexiest Job of the 21st Century. McKinsey estimates a shortage of 140,000-190,000 by 2018.
  • 3. Data Scientists wanted!
  • 4. Supply and demand
  • 5. Who are Data Scientists?
      A Data Scientist:
      •  Loves data
      •  Has an investigative mindset
      •  Aims to find patterns in data and build data-driven products
      •  Is a practitioner, not a theorist
      •  Has “hands-on” skills
      •  Has domain expertise (*)
      •  Is a team player
      Some backgrounds are better than others:
      •  Computer Science
      •  Statistics (mathematics)
      •  Strongly quantitative natural sciences
      •  PhDs, but not only
      The best source of new Data Science talent is (EMC Data Science Community Survey, 2011):
      •  Students studying computer science: 34%
      •  Professionals in disciplines other than IT or computer science: 27%
      •  Students studying fields other than computer science: 24%
      •  Today's BI professionals: 12%
      •  Other: 3%
  • 6. What do Data Scientists do?
      •  Designs customized systems and tools
      •  Works with structured and unstructured data
      •  Creates data processing pipelines
      •  Analyzes massive datasets (TB, PB)
      •  Builds predictive models
      •  Creates visualizations
      •  Designs data products
      •  Uses Hadoop, MapReduce, Hive, Python, R
  • 7. Tools of the trade
      •  Operating systems: Linux + shell tools
      •  Big data instruments: Hadoop (MapReduce) + Hadoop tools; Hive, Pig; NoSQL (HBase, MongoDB, Cassandra, Neo4j)
      •  Database: SQL
      •  Programming: Python, Java, Scala
      •  Machine Learning: R; Matlab; Python libraries (NumPy, SciPy, NLTK, scikit-learn); Java libraries (Mahout)
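The MapReduce model named on slide 7 can be illustrated without a cluster. The following is a minimal single-process sketch in Python, not Hadoop's API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample documents are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run map, shuffle and reduce in a single process (hypothetical helper)."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    # Hypothetical input; a real job would read files from a distributed store.
    print(word_count(["big data tools", "data science tools", "data"]))
```

In a real framework the map and reduce steps run in parallel on many machines and the shuffle moves data between them; here all three steps run in order in one process so the data flow is easy to follow.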
  • 8. Required skills
      •  Programming
      •  Algorithms
      •  Statistics
      •  Data mining
      •  Machine learning
      •  NLP
      •  Distributed systems
      •  Big data tools
      •  Databases
      •  Visualization
      From: Swami Chandrasekaran, Executive Architect, IBM, Watson Solutions
  • 9. Data Scientist roles. From: “Analyzing the Analyzers” by Harlan Harris, Sean Murphy, and Marck Vaisman, O’Reilly Strata 2012
  • 10. Data Science “dream team”. From: “Doing Data Science: Straight Talk from the Frontline”, Rachel Schutt, Cathy O'Neil, O'Reilly Media, 2013
  • 11. Data Science project pipeline
      •  Learning a problem
      •  Acquiring data
      •  Parsing data
      •  Cleaning, filtering and organizing
      •  Exploring and mining for patterns
      •  Building models
      •  Visualizing results
      •  Communicating findings
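The stages of the project pipeline on slide 11 can be sketched as a chain of small functions. This is a toy illustration under stated assumptions, not the author's method: the inline data set (`RAW_CSV`), the column names and every function name are hypothetical, the model is a plain least-squares line, and the visualization stage is left out to keep the sketch dependency-free.

```python
import csv
import io
import statistics

# Hypothetical raw data: spend vs. sales, with one incomplete row.
RAW_CSV = """spend,sales
1,2
2,4
3,
4,8
"""


def acquire():
    """Acquire: in practice a file, database or API; here an inline string."""
    return io.StringIO(RAW_CSV)


def parse(stream):
    """Parse: turn raw text into a list of dict rows."""
    return list(csv.DictReader(stream))


def clean(rows):
    """Clean: drop rows with missing values and convert fields to floats."""
    return [
        (float(row["spend"]), float(row["sales"]))
        for row in rows
        if row["spend"] and row["sales"]
    ]


def explore(points):
    """Explore: compute simple summary statistics."""
    xs, ys = zip(*points)
    return {"mean_spend": statistics.mean(xs), "mean_sales": statistics.mean(ys)}


def build_model(points):
    """Model: fit y = slope * x + intercept by ordinary least squares."""
    xs, ys = zip(*points)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in points) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx


def communicate(summary, model):
    """Communicate: report the findings in plain language."""
    slope, _intercept = model
    return (
        f"Mean sales {summary['mean_sales']:.2f}; "
        f"each unit of spend adds about {slope:.2f} sales."
    )


if __name__ == "__main__":
    points = clean(parse(acquire()))
    print(communicate(explore(points), build_model(points)))
```

Keeping each stage as a separate function mirrors the slide: any one stage (for example, cleaning) can be replaced or tested without touching the others.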
  • 12. Business applications
      •  Marketing:
          •  Market segmentation
          •  Product and media mix analysis
          •  Customer acquisition and churn modeling
          •  Recommendation system and cross sell
          •  Social media analysis
      •  Finance & Insurance:
          •  Fraud prevention
          •  Anomaly detection
          •  Credit risk analysis
          •  Usage based insurance modeling
          •  Portfolio optimization
      •  Healthcare and Pharmaceuticals:
          •  Genetic analysis
          •  Clinical trials analysis
          •  Clinical decision support system
  • 13. Industry training
      Training sheet: Cloudera Introduction to Data Science: Building Recommender Systems; Cloudera Certified Professional: Data Scientist (CCP:DS)
      Course outline:
      •  Introduction
      •  Data Science Overview: What Is Data Science?; The Growing Need for Data Science; The Role of a Data Scientist
      •  Use Cases: Finance; Retail; Advertising; Defense and Intelligence; Telecommunications and Utilities; Healthcare and Pharmaceuticals
      •  Project Lifecycle: Steps in the Project Lifecycle; Lab Scenario Explanation
      •  Data Acquisition: Where to Source Data; Acquisition Techniques
      •  Evaluating Input Data: Data Formats; Data Quantity; Data Quality
      •  Data Transformation: Anonymization; File Format Conversion; Joining Datasets
      •  Data Analysis and Statistical Methods: Relationship Between Statistics and Probability; Descriptive Statistics; Inferential Statistics
      •  Fundamentals of Machine Learning: Overview; The Three Cs of Machine Learning; Spotlight: Naïve Bayes Classifiers; Importance of Data and Algorithms
      •  Recommender Overview: What Is a Recommender System?; Types of Collaborative Filtering; Limitations of Recommender Systems; Fundamental Concepts
      •  Introduction to Apache Mahout: What Apache Mahout Is (and Is Not); A Brief History of Mahout; Availability and Installation; Demonstration: Using Mahout’s Item-Based Recommender
      •  Implementing Recommenders with Apache Mahout: Overview; Similarity Metrics for Binary Preferences; Similarity Metrics for Numeric Preferences; Scoring
      •  Experimentation and Evaluation: Measuring Recommender Effectiveness; Designing Effective Experiments; Conducting an Effective Experiment; User Interfaces for Recommenders
      •  Production Deployment and Beyond: Deploying to Production; Tips and Techniques for Working at Scale; Summarizing and Visualizing Results; Considerations for Improvement; Next Steps for Recommenders
      •  Conclusion
      •  Appendix A: Hadoop Overview
      •  Appendix B: Mathematical Formulas
      •  Appendix C: Language and Tool Reference
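The course outline on slide 13 mentions item-based recommenders and similarity metrics for binary preferences. As a rough sketch of that idea, the Python below uses the Jaccard (Tanimoto) coefficient, one common similarity for like/no-like data. It is not Mahout's API; the function names and the `likes` data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard (Tanimoto) similarity of two sets: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, likes, top_n=3):
    """Rank unseen items by summed similarity to items the user already likes."""
    # Invert user -> items into item -> set of users who like it.
    fans = {}
    for person, items in likes.items():
        for item in items:
            fans.setdefault(item, set()).add(person)

    seen = likes[user]
    scores = {}
    for candidate in fans:
        if candidate in seen:
            continue
        # Two items are similar when largely the same users like both.
        scores[candidate] = sum(jaccard(fans[candidate], fans[s]) for s in seen)

    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]


if __name__ == "__main__":
    # Hypothetical binary preferences: each user maps to the set of liked items.
    likes = {
        "ann": {"a", "b"},
        "bob": {"a", "b", "c"},
        "cat": {"b", "c"},
        "dan": {"d"},
    }
    print(recommend("ann", likes))
```

Items with zero similarity to everything the user likes are dropped rather than recommended, which is a design choice of this sketch, not a property of item-based recommenders in general.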
  • 14. Industry training
  • 15. Educational programs
      University programs:
      •  University of Washington: Certificate in Data Science
      •  UC Berkeley: Master of Information and Data Science program
      •  New York University: Data Science at NYU
      •  Columbia University: Institute for Data Sciences and Engineering
      •  University of Southern California (USC): Master of Science in Data Science
      Online MOOC courses:
      •  Coursera
      •  edX
      •  Udacity
      Accelerated educational programs:
      •  Zipfian Academy (12-week intensive program)
      •  Insight Data Science Fellows program (6-week postdoctoral training)
  • 16. Conferences
      •  Industry conferences and meetings:
          •  O’Reilly Strata Conference: Making Data Work
          •  Hadoop World
          •  Big Data TechCon
          •  Big Data Innovation summits
      •  Academic conferences (peer reviewed):
          •  IEEE & ACM Supercomputing
          •  IEEE Big Data
          •  ACM KDD: Knowledge Discovery and Data Mining
          •  ACM SIGIR: Information Retrieval
          •  ICML: International Conference on Machine Learning
          •  ICDM: International Conference on Data Mining
          •  NIPS: Neural Information Processing
          •  WWW: World Wide Web Conference
          •  VLDB: Very Large Data Bases
          •  ACM CIKM: Information and Knowledge Management
          •  SIAM SDM: International Conference on Data Mining
          •  IEEE ICDE: Data Engineering
          •  IEEE Visualization
      •  Meetups
  • 17. Textbooks
  • 18. Open questions
      •  How important is domain expertise?
      •  What is needed more: education or experience?
      •  The future of Data Scientists: will they be replaced by software?
  • 19. 20, Myasnitskaya str., Moscow, Russia, 101000 Tel.: +7 (495) 628-8829, Fax: +7 (495) 628-7931 www.hse.ru