2. Data Scientist is the sexiest job of the 21st century
(c) Harvard Business Review
What is Data Science and Who is a Data Scientist
3. About me
• Presenter:
  • Robert Williams (Bobby)
  • Microsoft Certified Trainer (Bytes People Solutions)
• Experience:
  • Business Intelligence (SSAS, SSIS and SSRS)
  • Data Science (Machine Learning in R)
  • Microsoft SQL Server (Transact-SQL)
  • Microsoft Azure (Azure ML, HDInsight, SQL Database, Data Warehouse and VMs)
4. Topics
• What is Data Science
• Who is a Data Scientist
• Discovering Data Science
5. Data Science is currently popular with employers
• There is high demand for students trained in Data Science and related fields
  • Databases, warehousing, data architectures
  • Data analytics – statistics, machine learning
  • Big data (Hadoop, Spark)
• Supports “Business Intelligence”
  • Quantitative decision-making and control
  • Finance, inventory, pricing/marketing, advertising
  • Need data for identifying risks, opportunities, conducting “what-if” analyses
6. Data Science and related fields
• Business Intelligence
• Statistics
• Data Engineering
• Data Visualization
• Machine Learning
• Data Mining
• Artificial Intelligence
• Big Data
7. Regular Data Science tasks
• Data Analysis/Statistics
  • Discover and clean data
  • Visualize trends
  • Find hidden correlations between parameters
• Modeling/Machine Learning
  • How many cars are we going to sell next year?
  • Which city is better for opening a new store?
  • Which products are usually bought together?
• Engineering/Prototyping
  • Prototype of a working algorithm
  • Deploy prediction model to use on a daily basis
8. Cleaning data: Garbage-In-Garbage-Out (GIGO)
• Data Cleansing
  • Filling in missing data (imputing values)
  • Detecting and removing outliers
• Smoothing
  • Removing noise by averaging values together
• Filtering, sampling
  • Keeping only selected representative values
• Feature extraction
  • e.g. in a photo database, which people are wearing glasses? which have more than one person? which are outdoors?
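The cleansing steps on this slide (imputing, outlier removal, smoothing) can be sketched in a few lines. This is a minimal illustration, not the presenter's method: the readings are hypothetical, median imputation and Tukey's IQR rule are just one common choice each, and only the Python standard library is used.

```python
from statistics import mean, median

# Hypothetical sensor readings; None marks a missing value.
readings = [10.0, 11.0, None, 10.5, 95.0, 11.5, 10.0, None, 11.0]

def impute_missing(values):
    """Fill missing entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def remove_outliers(values, k=1.5):
    """Drop values outside the interquartile-range fences (Tukey's rule)."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])          # median of the lower half
    q3 = median(ordered[(n + 1) // 2 :])    # median of the upper half
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

def smooth(values, window=3):
    """Moving average: each output is the mean of `window` consecutive values."""
    return [mean(values[i : i + window]) for i in range(len(values) - window + 1)]

filled = impute_missing(readings)    # missing values replaced by the median
cleaned = remove_outliers(filled)    # the 95.0 reading is dropped
smoothed = smooth(cleaned)           # noise averaged out
```

With real data, the imputation and outlier rules should be chosen with the domain in mind; a value that looks like an outlier may be the most interesting record.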
9. Statistical analysis methods
• Numerical data
  • Correlations
  • Multivariate regression
• Fitting “models”
  • Predictive equations that fit the data
    e.g. from a real estate database of home sales, we get
    housing price = 100*SqFt - 6*DistanceToSchools + 0.1*AverageOfNeighborhood
• ANOVA for testing differences between groups
• R is one of the most commonly used software packages for doing statistical analysis
  • Can load a data table, calculate means and correlations, fit distributions, estimate parameters, test hypotheses, generate graphs and histograms
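To make the predictive equation concrete, the sketch below evaluates it with the coefficients from the slide and computes a Pearson correlation by hand. The three homes are hypothetical, and the function names are introduced here for illustration only.

```python
from math import sqrt

def predict_price(sqft, distance_to_schools, neighborhood_average):
    """Evaluate the slide's example equation with its coefficients."""
    return 100 * sqft - 6 * distance_to_schools + 0.1 * neighborhood_average

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical homes: (SqFt, DistanceToSchools, AverageOfNeighborhood)
homes = [(1500, 2.0, 200000), (2000, 5.0, 250000), (2500, 1.0, 300000)]
prices = [predict_price(*home) for home in homes]

# Correlation between floor area and predicted price
r = pearson([home[0] for home in homes], prices)
```

Because the SqFt term dominates the equation, the correlation between floor area and predicted price comes out very close to 1 for these homes.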
11. Unsupervised learning methods
• Clustering (Hierarchical, K-means)
  • Similar photos, documents, cases
  • Discovery of “structure” in the data
• Example: accident database
  • Some clusters might be identified with “accidents involving a truck and trailer” or “accidents at night”
• Top-down vs. bottom-up clustering methods
• Granularity: how many clusters?
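A minimal K-means sketch shows how clusters emerge without labels. The points are hypothetical two-feature records, the number of clusters k is chosen by hand (the "granularity" question), and seeding the centroids with the first k points is a simplification; library implementations use better initialization.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points; the first k points seed the centroids."""
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, clusters

# Hypothetical records with two numeric features each
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, clusters = kmeans(points, k=2)
```

The algorithm only groups the records; deciding that a cluster means "accidents at night" is an interpretation step left to the analyst.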
12. Supervised learning methods
• Classifiers (Decision Trees)
  • What factors, decisions, or treatments led to different outcomes?
  • Recursive partitioning algorithms
• Related methods
  • “Discriminant” analysis
    • What factors lead to return of product?
  • Extract “association rules”
    • Boxer dogs tend to have congenital defects
    • Covers 5% of patients with 80% confidence
Veterinary database - dogs treated for disease
breed     gender  age  drug          sibsp  outcome
terrier   F       10   methotrexate  4.0    died
spaniel   M       5    cytarabine    2.3    survived
doberman  F       7    doxorubicin   0.1    died
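The two numbers attached to the association rule can be computed directly. In this sketch, coverage is read as the share of records matching the rule's condition (boxer) and confidence as the share of those records that also show the outcome (defect). The records are hypothetical and built to mirror the slide's 5% and 80%.

```python
def rule_stats(records, antecedent, consequent):
    """Return (coverage, confidence) of the rule antecedent -> consequent."""
    matching = [r for r in records if antecedent(r)]
    coverage = len(matching) / len(records)
    confidence = sum(1 for r in matching if consequent(r)) / len(matching)
    return coverage, confidence

# Hypothetical patient records (100 in total), not taken from a real database.
records = (
    [{"breed": "boxer", "defect": True}] * 4
    + [{"breed": "boxer", "defect": False}] * 1
    + [{"breed": "terrier", "defect": False}] * 90
    + [{"breed": "terrier", "defect": True}] * 5
)

coverage, confidence = rule_stats(
    records,
    antecedent=lambda r: r["breed"] == "boxer",
    consequent=lambda r: r["defect"],
)
print(f"coverage={coverage:.0%}, confidence={confidence:.0%}")
```

A rule miner would search over many candidate conditions and keep only the rules whose coverage and confidence exceed chosen thresholds.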
13. Miscellaneous methods
• Other types of data
  • Time series and forecasting:
    • Model the price of fuel using autoregression
    • A function of recent prices, demand, geopolitics...
    • De-trend: factor out seasonal trends
  • GIS (geographic information systems)
    • Longitude/latitude coordinates in the database
    • Objects: city/state boundaries, river locations, roads
    • Find regions in CS/B with an excess of coffee shops
[Figure: Toy Sales; from: Basic Statistics for Business and Economics, Lind et al (2009), Ch 16.]
[Figure credit: Frank Curriero]
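Autoregression in its simplest form, AR(1), predicts the next value from the previous one: x[t] = a + b * x[t-1]. The sketch below fits a and b by least squares on a hypothetical price series. It uses recent prices only; demand and geopolitics would need extra input variables.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = a + b * x[t-1]; returns (a, b)."""
    xs, ys = series[:-1], series[1:]   # previous value, next value
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    a = my - b * mx
    return a, b

# Hypothetical fuel prices, generated so that x[t] = 1 + 0.5 * x[t-1]
prices = [4.0, 3.0, 2.5, 2.25, 2.125, 2.0625]

a, b = fit_ar1(prices)
next_price = a + b * prices[-1]   # one-step-ahead forecast
```

On this constructed series the fit recovers a = 1 and b = 0.5 up to rounding; real price data would not fit this cleanly.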
15. What Data Science IS and IS NOT
This / Not that
• Machine Learning/Statistics
• Collecting data-storage
• Business Intelligence
• Industry Knowledge
• Software Engineering
• Automation (Applications)
17. Who is a Data Scientist?
• Scientist
  • Someone who makes new discoveries
  • Makes a hypothesis
  • Investigates that hypothesis
• Data Scientist
  • Does the same with data
  • Looks for meaning and knowledge in the data
  • Answers questions by relying on data
“It doesn't matter how beautiful your theory is, it doesn't
matter how smart you are. If it doesn't agree with
experiment, it's wrong. In that simple statement is the key
to science” – Richard Feynman: twitter.com/ProfFeynman
18. What’s in the Data Science toolkit?
Tools / User Experience Research / Statistical Methods
• Data Modeling
• Time series analysis
• Survival analysis
• Missing data imputations
• Logistic, multinomial and multiple linear regression techniques
• Classification and clustering
• Forecasting
• Pattern recognition
• Principal component and factor analysis
• Machine learning
• Propensity score matching
• Data mining
• A/B testing
• Sentiment analysis
• Network analysis
• Data Visualization
• Regression
19. What’s in the Data Science toolkit?
Tools / User Experience Research / Statistical Methods
• Languages: Python, R, SQL, SAS, JavaScript, Node.js
• Libraries: NumPy, Pandas, Scikit-Learn, Tidyverse, RevoScaleR, Mahout, + many others
• Data Engineering: Profiling, ETL, Job notices, APIs, Optimized data pipelines, Optimized data storage/access, RDBMS, Hadoop/Spark
• Visualization: D3.js, Base R, Leaflet, Power BI, Matplotlib, ggplot2, shiny
21. What is doing Data Science?
• Data Science: Apply Machine Learning and Statistics
• Data Engineering: Managing data for creating insights
• Smarter Work: More efficient and effective organization
22. Data Science problems?
• Finding a needle in the haystack
• Prioritizing a backlog
• Flagging “stuff” early
• A/B testing something
• Optimizing resources
• Some combination
• Something else…
23. Statistical Inference: A/B testing
Which form? → Data Science → Service Change
• Service Issue: Costly changes which are not tested before implementation
• Data Science Process: Statistical testing to identify which is better
• Service Change: Use the best statistically validated option
• 62% respond vs. 78% respond
• Result: Increases customer satisfaction
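One way to do the "statistical testing to identify which is better" step is a two-proportion z-test, sketched below. The slide gives only the two response rates (62% and 78%); the sample size of 200 per form is an assumption made here for illustration, and the z-test is one common choice rather than necessarily the method behind the slide.

```python
from math import erfc, sqrt

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two response rates."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of a standard normal distribution
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

# Assumed sample: 200 people per form, 124 (62%) and 156 (78%) responded.
z, p_value = two_proportion_z_test(124, 200, 156, 200)
```

With these assumed sample sizes, z is about 3.5 and the p-value is below 0.001, so the difference would be unlikely to be chance. With much smaller samples the same percentages could fail to reach significance, which is why the sample size matters as much as the rates.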
24. Data Science Process: Machine Learning
Question? → Data Science → Production
• Find Samples: Identify targets within a sample population
• Data Science Process: Use existing data and predictive modeling to identify targets
• Deploy: Implement data science solution into a production environment
• Target categories
• Target individuals
• Target areas
• Result: A successful data science process
25. Where to Learn?
• University
• Online Resources
• Coursera
• edX
• etc.
• Books
26. How to start?
• Your own company
• Open competitions (Kaggle.com)
30. Human vs. Machine
• Unfortunately, AI has often been portrayed negatively in the popular media, for example:
  • AI is going to take away our jobs
  • Or even worse: machines are going to kill us all!
• What we actually want is a human/machine partnership going forward…
33. Human vs. Machine
• Humans
  • Can naturally work with small amounts of data
  • Have domain knowledge
  • Are good at image recognition
• Machines
  • Can perform intensive computations
  • Know only numbers and strings (well, actually only numbers)
34. AI is actually Machine Learning
• What is machine learning?
• Introduction to machine learning algorithms
• Introduction to machine learning languages
35. What is Machine Learning?
• Machine learning overview
• How machine learning fits into data science
• Machine learning concepts and methodologies
• Models
36. Machine Learning overview
Machine learning:
• Detecting patterns and trends
• Statistical analysis
• Creating software models
Examples:
• Predicting success of medical intervention
• Identifying airplane maintenance needs
• Identifying fraudulent financial transactions
• Recommending books or movies
37. How Machine Learning fits into Data Science
Key questions:
• Is something X or Y?
• What is likely to be the numerical value of X or Y?
• Is something out of the ordinary or unexpected?
• How is this data structured?
38. Machine Learning concepts and methodologies
• Key steps:
1. Obtain raw data
2. Preprocess the data
3. Prepare the data
4. Apply one or more machine learning algorithms
to the data
5. Determine the best model to use
6. Deploy the model
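The six steps can be walked through end to end on a toy problem. Everything in this sketch is hypothetical: the data (hours studied, passed or not), the two candidate models, and the names. It is meant to show the shape of the workflow, not a recommended modeling approach.

```python
# 1. Obtain raw data: hypothetical (hours_studied, passed) rows; None = missing.
raw = [(1.0, 0), (2.0, 0), (None, 1), (3.0, 0), (6.0, 1), (7.0, 1),
       (8.0, 1), (2.5, 0), (6.5, 1), (1.5, 0)]

# 2. Preprocess: drop rows with missing values.
rows = [r for r in raw if r[0] is not None]

# 3. Prepare: hold out every third row for testing, train on the rest.
test = rows[::3]
train = [r for i, r in enumerate(rows) if i % 3 != 0]

# 4. Apply algorithms: two simple candidate models.
def fit_majority(data):
    """Baseline: always predict the most common label."""
    labels = [y for _, y in data]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def fit_threshold(data):
    """Predict 1 when x exceeds the midpoint between the two class means."""
    zeros = [x for x, y in data if y == 0]
    ones = [x for x, y in data if y == 1]
    cut = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: int(x > cut)

def accuracy(model, data):
    """Share of rows where the model's prediction matches the label."""
    return sum(model(x) == y for x, y in data) / len(data)

# 5. Determine the best model using the held-out rows.
candidates = {"majority": fit_majority(train), "threshold": fit_threshold(train)}
scores = {name: accuracy(model, test) for name, model in candidates.items()}
best_name = max(scores, key=scores.get)

# 6. Deploy: expose the chosen model as a function that applications call.
predict = candidates[best_name]
```

Evaluating on rows the model did not see during training (step 5) is what makes the comparison between candidates meaningful.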
39. Models
Machine learning model: the code generated after
an algorithm has been run
Training models:
• Experiments
• Evaluation
Deploying models:
• Applications
• Retraining
41. Algorithms overview
• Algorithm: set of steps, methods, or actions
• Classification algorithms: yes/no questions, or
identify most likely outcome from multiclass list
• Regression algorithms: make predictions of
outcomes, based on historical patterns
• Clustering algorithms: identify groupings within
dataset
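As an example of a regression algorithm making a prediction from historical patterns, the sketch below fits a straight line by ordinary least squares and extrapolates one year ahead. The yearly sales figures are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return intercept, slope

# Hypothetical car sales for years 1 to 5
years = [1, 2, 3, 4, 5]
sales = [100, 110, 125, 130, 145]

intercept, slope = fit_line(years, sales)
forecast = intercept + slope * 6   # predicted sales for year 6
```

For these figures the fitted slope is 11 cars per year and the year-6 forecast is 155. A classification algorithm would instead return a label, and a clustering algorithm a set of groups.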
47. Introduction to Machine Learning languages
• Languages overview
• Using R in machine learning
• Using Python in machine learning
48. Languages overview
Machine learning requires computer code:
• Most popular programming languages: R and Python
Use SQL for queries:
• Select data to use
• Join/filter data
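The select, join, and filter operations can be tried from Python using the built-in sqlite3 module and an in-memory database. The tables, columns, and values below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stores (store_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE sales (store_id INTEGER, product TEXT, amount REAL);
    INSERT INTO stores VALUES (1, 'Durban'), (2, 'Cape Town');
    INSERT INTO sales VALUES
        (1, 'bread', 20.0), (1, 'milk', 15.0),
        (2, 'bread', 30.0), (2, 'milk', 5.0);
""")

# Select the columns to use, join the two tables, and filter the rows.
query = """
    SELECT stores.city, SUM(sales.amount) AS total
    FROM sales
    JOIN stores ON stores.store_id = sales.store_id
    WHERE sales.amount >= 10
    GROUP BY stores.city
    ORDER BY stores.city
"""
rows = conn.execute(query).fetchall()
conn.close()
```

The result is one row per city with the total of all sales of at least 10, ready to be passed to R or Python code for modeling.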
49. Using R in Machine Learning
• R is open-source
• R is specifically designed to support statistics and
data analysis
R packages:
• Collections of functions, data, and code
• Available from CRAN, which includes 10,000+ R packages
50. Using Python in Machine Learning
Python:
• Not a specialist data science or statistical tool
• Widely used within scientific computing
• Lots of resources available
Python machine learning-related libraries:
• numpy
• pandas
• matplotlib
• scikit-learn
51. Artificial Intelligence – Cognitive Services
• Cognitive Services overview
• Processing image and video
• Processing language
52. What is a cognitive service?
Cognitive Services:
• Vision. Analyze photos and videos
• Speech. Convert speech to text and text to speech
• Language. Understand intent from language
• Search. Find information on the web using Bing
61. Processing language - LUIS
• Language
• Learning to talk
• Using language to make decisions
62. Language
• Natural Language Processing
• Part of speech
• Nouns
• Adjectives
• Verbs
• Tokens
• The yellow fox can’t jump = The – yellow – fox – can – ’t – jump
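The token split on this slide can be reproduced with a small regular expression. This is a simplified sketch that only handles words and apostrophe suffixes; real NLP tokenizers cover many more cases.

```python
import re

def tokenize(text):
    """Split text into word tokens, keeping a suffix such as ’t as its own token."""
    # First alternative: an apostrophe (straight or curly) followed by letters.
    # Second alternative: a plain run of word characters.
    return re.findall(r"['’]\w+|\w+", text)

tokens = tokenize("The yellow fox can’t jump")
print(tokens)
```

Once text is split into tokens, each token can be tagged with its part of speech (noun, adjective, verb).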
63. Learning to talk
• Bing Spell Check
• Linguistic analysis
• Text analysis
• Translator
64. Using language to make decisions
• Utterances are translated to intents
• Intents drive app decisions
• Entities describe information about the intent
• Features help identify intents and entities
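The relationship between utterances, intents, and entities can be illustrated with a toy keyword matcher. This is not the LUIS API and not how LUIS works internally (LUIS uses trained models); the intents, keywords, and city list are all hypothetical.

```python
# Hypothetical intents with trigger keywords (acting as simple "features").
INTENT_KEYWORDS = {
    "BookFlight": {"book", "flight", "fly"},
    "GetWeather": {"weather", "forecast", "rain"},
}
KNOWN_CITIES = {"london", "paris", "durban"}  # hypothetical entity list

def interpret(utterance):
    """Map an utterance to (intent, entities) by counting keyword overlaps."""
    words = set(utterance.lower().replace("?", "").split())
    scores = {name: len(words & kws) for name, kws in INTENT_KEYWORDS.items()}
    intent = max(scores, key=scores.get) if any(scores.values()) else "None"
    entities = sorted(words & KNOWN_CITIES)
    return intent, entities

def handle(utterance):
    """The intent drives the app decision; entities supply the details."""
    intent, entities = interpret(utterance)
    if intent == "BookFlight":
        return f"Searching flights to {', '.join(entities) or 'an unspecified city'}"
    if intent == "GetWeather":
        return f"Fetching forecast for {', '.join(entities) or 'your location'}"
    return "Sorry, I did not understand."

print(handle("Book a flight to Paris"))
print(handle("Will it rain in London?"))
```

The application code branches on the intent alone, so new phrasings of the same request need no changes to the app, only to the language model.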