AC&AI EMEA Masterclass
Data Science 101
Wednesday 2nd December 2020
Ben Keen
Shahzia Holtom
Introductions
aka.ms/benkeen
AGENDA
What is a Data Scientist?
Data
AI Ethics & Responsibility
MLOps
What is a Data Scientist?
Data Impact
What makes a Data Scientist?
Scientist
Scientist
Strong understanding of scientific method &
hypothesis testing
Asks clarifying questions and remains
sceptical and objective
Strong critical thinking, root cause analysis,
and research skills
Bases decisions on data and statistical
analysis
What makes a Data Scientist?
Scientist
Engineer
Statistics
Machine Learning
Data Storage
Visualisation
Optimisation
Data Processing
Data Manipulation
Programming
Data Lakes
Azure Storage
SQL Server
MySQL
PostgreSQL
Oracle DB
Azure Data Warehouse
HDFS
MongoDB
Neo4j
Azure Cosmos DB
Cassandra
Word2Vec
SQLite
Spark/Databricks
Azure Data Factory
Airflow
Kubernetes
Azure Event Hub
Azure Service Bus
Kafka
Hadoop
Logstash/Elasticsearch
NiFi
Docker
Swarm
Python
statsmodels
scipy
Scikit-learn
PyTorch
spark.ml
SAS
R
ggplot2
TensorFlow
Keras
Scala
Perl
MATLAB
Node.js
M
VBA
JavaScript
Julia
Jupyter
Weka
Azure Machine Learning
MLFlow
SPSS
Bayesian Statistics
ONNX
XGBoost
Continuous Distributions
PMCC/Spearman’s Rank
Monte Carlo Methods
χ²
Probability Theory
Skewness/Kurtosis
Hypothesis Testing
Covariance
matplotlib
Power BI
D3.js
Highcharts
plotly
sankeymatic
Tableau
seaborn
Bokeh
React-vis
Dash
CanvasJS
Chart.js
Excel
ISOMAP
PIL
ScraPy / BS4
LibROSA
Flink
lifetimes
Bonsai
dplyr
NumPy
pandas
Powershell
Bash
NLTK
spaCy
OpenCV
Gensim
Azure Cognitive Services
pytz
Dijkstra
Gradient Descent
Ant Colony Optimisation
Particle Swarm Optimisation
Evolutionary Algorithms
Mixed-integer linear programming
Differential Calculus
Simulated Annealing
Least Squares
DAX
Tools for the job
Artificial Intelligence
Machine Learning
Deep Learning
Artificial Intelligence
The ability of machines to mimic human behaviours.
See “Computing Machinery and Intelligence”, Turing, 1950.
Machine Learning
The application of mathematical and statistical techniques
that learn parameters from data rather than being
explicitly programmed.
Deep Learning
Subset of machine learning in which neural networks with
many layers are used to learn highly non-linear
relationships from large amounts of training data.
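To make the Machine Learning definition concrete, here is a minimal sketch of "learning parameters from data". The data points and the `fit_line` helper are hypothetical and chosen only for illustration: the slope and intercept are never written into the program, they are estimated from the examples by ordinary least squares.

```python
# Hypothetical toy data lying exactly on y = 2x + 1 (invented for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def fit_line(xs, ys):
    """Estimate slope and intercept of y = slope * x + intercept by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# The parameters are learned from the data, not hard-coded.
slope, intercept = fit_line(xs, ys)
print(f"learned: y = {slope:.1f} * x + {intercept:.1f}")
```

An explicitly programmed alternative would hard-code `2 * x + 1`; the learned version adapts if the data changes.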
What makes a Data Scientist?
Scientist
Engineer
Business Analyst
A Simple Example
38 2
4 556
A Simple Example
Business Context: Machinery failure costs
£500,000 but maintenance costs £1,000
Total Cost: £1,004,000
38 2
4 556
Yes
No
A Simple Example
Expensive
Yes
No
Yes
No
A Simple Example
Yes
No
Yes
No
Yes
No
A Simple Example
Yes
No
New Total Cost: £60,000
New Accuracy = 90%
40 0
60 490
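The arithmetic behind this example can be sketched as follows. The cell layout is an assumption, since the slide text carries no labels: each grid is read as true positives, false negatives on the first row and false positives, true negatives on the second. Only misclassification costs are counted (a missed failure costs £500,000, an unnecessary maintenance costs £1,000), which reproduces the totals on the slides.

```python
FAILURE_COST = 500_000      # £ per missed machinery failure (from the slide)
MAINTENANCE_COST = 1_000    # £ per maintenance visit (from the slide)


def error_cost(tp, fn, fp, tn):
    """Cost of mistakes only: missed failures plus unnecessary maintenance."""
    return fn * FAILURE_COST + fp * MAINTENANCE_COST


def accuracy(tp, fn, fp, tn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fn + fp + tn)


# Assumed layout: (TP, FN, FP, TN), read row by row from each slide's grid.
original = (38, 2, 4, 556)
retuned = (40, 0, 60, 490)

for name, cells in (("original", original), ("retuned", retuned)):
    print(f"{name}: cost £{error_cost(*cells):,}, accuracy {accuracy(*cells):.0%}")
```

Under these assumptions the retuned model is less accurate (about 90% against 99%) yet far cheaper, because it trades many cheap false alarms for zero expensive missed failures.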
Another Simple Example
What makes a Data Scientist?
Scientist
Engineer
Business Analyst
Types of Data Scientist
ML Engineer
• Operationalisation of models
• Focus on MLOps, Automated Tests, CI/CD, ETL
Applied DS
• Focus on A/B Testing, Modelling and Experimentation
• View to contributing to a product
• Uses Tried & Tested Techniques
Research DS
• Experimentation with view to expand community knowledge and understanding of algorithms
• Uses novel techniques
Full Stack DS
• Generalist
• Works across modelling, ETL, operationalisation and app development
• May be less focused on depth of modelling understanding
Data Vis. Expert
• Focus on storytelling with data
• Wizard with graphing libraries, including D3.js
Data
Data
What data do you need?
Data
How much data do you need?
Data
How much data do you need?
How do we know this is a cat?
We have 140 million neurons in V1
And we have V2, V3, V4, V5 and V6
Data
How much data do you need?
88 239 33 178 38 122
208 115 215 36 119 203
229 65 52 64 4 23
92 114 26 29 155 183
101 142 222 54 187 109
45 6 95 67 35 212
93 103 142 57 207 117
174 228 201 24 101 176
100 9 141 241 144 37
8 34 198 125 138 246
178 126 255 108 161 128
How do you get a computer to recognise this as a cat?
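As a small illustration of what the computer actually receives, the grid above can be held as a plain list of rows. This sketch assumes the numbers are 8-bit greyscale intensities (0 to 255), which the slide does not state; the variable names are hypothetical.

```python
# The grid from the slide, one inner list per row.
pixels = [
    [88, 239, 33, 178, 38, 122],
    [208, 115, 215, 36, 119, 203],
    [229, 65, 52, 64, 4, 23],
    [92, 114, 26, 29, 155, 183],
    [101, 142, 222, 54, 187, 109],
    [45, 6, 95, 67, 35, 212],
    [93, 103, 142, 57, 207, 117],
    [174, 228, 201, 24, 101, 176],
    [100, 9, 141, 241, 144, 37],
    [8, 34, 198, 125, 138, 246],
    [178, 126, 255, 108, 161, 128],
]

height, width = len(pixels), len(pixels[0])
flat = [value for row in pixels for value in row]

# Assumption: 8-bit intensities, so dividing by 255 scales each value to 0..1,
# a common preparation step before feeding pixels to a model.
scaled = [[value / 255 for value in row] for row in pixels]

print(f"{height} x {width} grid, values from {min(flat)} to {max(flat)}")
```

To the machine there is no cat here, only numbers; any notion of "cat" has to be learned from many labelled grids like this one.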
Data
How much data do you need?
Garbage in…
…Garbage out
Data
How much data do you need?
Horses
Goats
?
Bias – AI ethics and responsibility
Value realization is only possible
through Continuous Delivery
MLOps
Data Science solutions need to be
integrated with People, Process
and Products
Pilot
PoC
Experiment
PoV
MVP
I have a model for you…
How do I deploy, manage, monitor…
Wall Of Confusion
Data Science Ops
Data Drift
Model Decay
Stale Models
Concept Drift
Traditional DS Delivery
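One crude way to notice data drift is to compare live inputs against the data the model was trained on. The sketch below is not from the slides: the sensor readings, the `mean_shift` helper and the threshold of 3 are all hypothetical, and production monitoring would typically use richer statistical tests across many features.

```python
from statistics import mean, stdev


def mean_shift(reference, current):
    """Distance between the two means, measured in reference standard deviations."""
    return abs(mean(current) - mean(reference)) / stdev(reference)


# Hypothetical sensor temperatures: training-time data versus live data.
training_temps = [70.1, 69.8, 70.4, 70.0, 69.9, 70.2]
live_temps = [72.9, 73.4, 73.1, 72.8, 73.3, 73.0]

THRESHOLD = 3.0  # hypothetical alert level, in standard deviations
drifted = mean_shift(training_temps, live_temps) > THRESHOLD
print("data drift suspected" if drifted else "inputs look stable")
```

A check like this could run on a schedule after deployment so that a stale model is flagged for retraining rather than silently decaying.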
“DevOps is the union of people, process, and products to enable continuous delivery of value.”
Build & Test
Continuous Delivery
Deploy
Operate
Monitor & Learn
Plan & Track
Develop
People
• Collaborate early and often
• Cross-disciplinary teams
• Share common goals and metrics
• Shared responsibility
Process
• Agile Principles
• Streamline feedback
• Delivering value faster
Products
What is DevOps?
The ability to continuously integrate, automatically
test, build, deploy and monitor Machine Learning
artifacts such as Data & Training pipelines and
models.
MLOps
Data Science is a Team Effort
Architects
Change Management
Data Engineers
Data Scientists
Project Management
App Developers
UX Designers
Conclusions
• Data Scientists are Scientists, Engineers and Business Analysts
• We work best with data of high volume, veracity and variety
• We need to keep in mind ethical considerations and act responsibly when designing systems
• MLOps is paramount for delivering customer value
• Data Science is a team effort
• Data Science is about turning data into impact
Q&A
Thank you

Data science 101 Masterclass

  • 1. AC&AI EMEA Masterclass Data Science 101 Wednesday 2nd December 2020 Ben Keen Shahzia Holtom
  • 3. AGENDA What is a Data Scientist? Data AI Ethics & Responsibility MLOps
  • 4. What is a Data Scientist? Data Impact
  • 5. What makes a Data Scientist? Scientist
  • 6. Scientist Strong understanding of scientific method & hypothesis testing Asks clarifying questions and remains sceptical and objective Strong critical thinking, root cause analysis, and research skills Bases decisions on data and statistical analysis
  • 7. What makes a Data Scientist? Scientist Engineer
  • 8. Statistics Machine Learning Data Storage Visualisation Optimisation Data ProcessingData Manipulation Programming Data Lakes Azure Storage SQL Server MySQL PostgreSQL Oracle DB Azure Data Warehouse HDFS MongoDB Neo4jAzure Cosmos DB Cassandra Word2Vec SQLite Spark/Databricks Azure Data Factory Airflow Kubernetes Azure Event Hub Azure Service Bus Kafka Hadoop Logstash/Elasticsearch NiFi Docker Swarm Python statsmodels scipy Scikit-learn PyTorch spark.ml SAS R ggplot2 TensorflowKeras Scala Perl MATLAB Node.js M VBA JavaScript Julia Jupyter Weka Azure Machine Learning MLFlow SPSS Bayesian Statistics ONNX XGBoost Continuous Distributions PMCC/Spearman’s Rank Monte Carlo Methods χ2 Probability Theory Skewness/Curtosis Hypothesis Testing Covariance matplotlib Power BID3.js Highcharts plotly sankeymatic Tableau seaborn Bokeh React-vis Dash CanvasJS Chart.js Excel ISOMAP PIL ScraPy / BS4 LibROSA Flink lifetimes Bonsai dplyr NumPy pandas Powershell Bash NLTK spaCy OpenCV Gensim Azure Cognitive Services pytz Dijkstra Gradient Descent Ant Colony Optimisation Particle Swarm Optimisation Evolutionary Algorithms Mixed-integer linear programming Differential Calculus Simulated Annealing Least Squares DAX
  • 9. Tools for the job Artificial Intelligence Machine Learning Deep Learning Artificial Intelligence The ability for machines to mimic human behaviours. See “Computing Machinery and Intelligence”, Turing, 1950. Machine Learning The application of mathematical and statistical techniques that learn parameters from data rather than being explicitly programmed. Deep Learning Subset of machine learning in which neural networks with many layers are used to learn highly non-linear relationships from large amounts of training data.
  • 10. What makes a Data Scientist? Scientist EngineerBusiness Analyst
  • 12. A Simple Example Business Context: Machinery failure costs £500,000 but maintenance costs £1,000 Total Cost: £1,004,000 38 2 4 556
  • 15. Yes No Yes No A Simple Example Yes No New Total Cost: £60,000 New Accuracy = 90% 40 0 60 490
  • 17. What makes a Data Scientist? Scientist EngineerBusiness Analyst
  • 18. Types of Data Scientist ML Engineer Applied DS Research DS Full Stack DS Data Vis. Expert • Operationalisation of models • Focus on MLOps, Automated Tests, CI/CD, ETL • Focus on A/B Testing, Modelling and Experimentation • View to contributing to a product • Uses Tried & Tested Techniques • Experimentation with view to expand community knowledge and understanding of algorithms • Uses novel techniques • Generalist • Works across modelling, ETL, operationalisation and app development • May be less focused on depth of modelling understanding • Focus on storytelling with data • Wizard with graphing libraries, including D3.js
  • 19. Data
  • 20. Data What data do you need?
  • 21. Data How much data do you need?
  • 22. Data How much data do you need? How do we know this is a cat? We have 140 million neurons in V1 And we have V2, V3, V4, V5 and V6
  • 23. Data How much data do you need? 88 239 33 178 38 122 208 115 215 36 119 203 229 65 52 64 4 23 92 114 26 29 155 183 101 142 222 54 187 109 45 6 95 67 35 212 93 103 142 57 207 117 174 228 201 24 101 176 100 9 141 241 144 37 8 34 198 125 138 246 178 126 255 108 161 128 How do you get a computer to recognise this as a cat
  • 24. Data How much data do you need? ? ? ? ? ?
  • 25. Data How much data do you need? Garbage in… …Garbage out
  • 26. Data How much data do you need? Horses Goats ?
  • 27. Data How much data do you need? ? ?
  • 28. Data How much data do you need? ? ?
  • 29. Bias – AI ethics and responsibility
  • 30. Value realization is only possible through Continuous Delivery MLOps Data Science solutions need to be integrated with People, Process and Products Pilot PoC Experiment PoV MVP
  • 31. I have a model for you… How do I deploy, manage, monitor…Wall Of Confusion Data Science Ops Data Drift Model Decay Stale Models Concept Drift Traditional DS Delivery
  • 32. What is DevOps? “DevOps is the union of people, process, and products to enable continuous delivery of value.” Continuous Delivery stages: Build & Test, Deploy, Operate, Monitor & Learn, Plan & Track, Develop. People: Collaborate early and often; Cross-disciplinary teams; Share common goals and metrics; Shared responsibility. Process: Agile Principles; Streamline feedback; Delivering value faster. Products.
  • 33. The ability to continuously integrate, automatically test, build, deploy and monitor Machine Learning artifacts such as Data & Training pipelines and models. MLOps
  • 34. Data Science is a Team Effort Architects Change Management Data Engineers Data Scientists Project Management App Developers UX Designers
  • 35. Conclusions
      • Data Scientists are Scientists, Engineers and Business Analysts
      • We work best with data of high volume, veracity and variety
      • We need to keep in mind ethical considerations and act responsibly when designing systems
      • MLOps is paramount for delivering customer value
      • Data Science is a team effort
      • Data Science is about turning data into impact
  • 36. Q&A

Editor's notes

  1. aka.ms/benkeen is a short URL to my LinkedIn profile
  2. So today we’re going to cover a range of topics in our Data Science 101 Masterclass. We’ll start with what a data scientist is, what the job entails and what should be expected of a data scientist. Then we’ll talk about data – The Economist described data as “The New Oil” in 2017 and that’s up for some debate, I’m not so sure I agree, but here we’ll be taking a look at a couple of the questions about data that data scientists face most commonly. We’ll touch a bit on AI Ethics and Responsibility which, of course, could be an entire masterclass in itself. Finally we’ll cover an important emerging topic in data science – MLOps for operationalising data science.
  3. Data science is not about making awesome visualisations, complicated models, or writing lots of code. Data scientists turn data into impact. The job is to solve real problems using data.
  4. First and foremost, data scientists are scientists – It’s right there in the name. What does this mean? [See next slide]
  5. I don’t come from a mathematics or computer science background myself - my PhD is in molecular genetics – but the skills required of a scientist are the same, whether I’m creating a predictive model as a data scientist now or I’m doing X-ray crystallography on proteins as a biochemist in my old life. [Read Slide Text]
  6. Data Scientists are also engineers – they design systems to fulfil functional objectives. In the next couple of slides, we’ll explore the tools data scientists use to do this.
  7. These are the tools for the job – just as with any other type of engineer, good data scientists will know when to use which tools for which tasks and not shoehorn things in just because they like them. These are the tools data scientists use to create the systems that fulfil those functional objectives. As a side note, this is not an exhaustive list. No single data scientist knows all of this in-depth and we’ll come back to that in a little bit (See types of data scientist). In the next slide I’m going to focus a little more down into this machine learning section as it’s probably the one that’s most associated with data science.
  8. Artificial intelligence is the ability for machines to mimic human behaviours – including things like recognition, behaviour, reasoning etc. Machine learning is a subset of artificial intelligence: it is the application of maths and stats techniques that learn parameters from data. Artificial intelligence is not just machine learning – AI also includes rules-based systems, in which knowledge is explicitly coded from rules understood by experts. These can be powerful, such as in classical computer vision and in NLP syntax and semantics, and can often be combined with machine learning techniques. Deep learning is a subset of machine learning, in which we use neural nets with many hidden layers to learn highly non-linear relationships. Deep learning is often used for things like computer vision, natural language processing, speech/sound recognition, bioinformatics (genetic data) and time series. Just as machine learning is not all of AI, deep learning is not all of machine learning. For work where data is tabular and there are fewer samples, deep learning is often not the best modelling technique, and you’ll find a lot of Kaggle competition winners use other techniques such as gradient boosted decision trees. And sometimes, where relationships are less complex, a simple linear or logistic regression will do just as well. Again, to reiterate, this is just one tool in the data science job.
  9. Finally a data scientist is a business analyst. Over the next few slides we’ll see why this is so important to the role of a data scientist.
  10. Let’s take a simple example – We’ve been brought in to a predictive maintenance engagement and have been given data to go and train a model with no real context. We go away and train a classifier model that’s 99% accurate – happy days, it all looks good, we deploy it but the customer is not happy.
  11. Given some business context – we find out that machinery failure costs £500,000 but maintenance costs £1,000. We’ve predicted 4 cases in which maintenance of a machine was required when it actually wasn’t – that’s a cost of £4,000. However, we predicted 2 cases in which maintenance of a machine wasn’t required when it actually was, and each of those failures costs £500,000 – that’s a cost of £1,000,000. So now let’s take a look at how a data scientist could have dealt with this given this business context.
  12. Our prediction is yes the machine needs maintenance in the upper light blue rectangle and no it doesn’t need maintenance in the lower light red rectangle. The shape of the markers indicates whether the machine actually needs maintenance or not, blue circles indicate that actually the machine does need maintenance and red crosses mean it doesn’t. (Transition 1) Predicting “No” but actually needing maintenance is *very* expensive.
  13. We have 2 options. Option 1 – We can change our model or parameters and re-train. In this example the sigmoid shown is shifted to the left, which has moved a number of middling values up into the “yes” rectangle. Option 2 – We can change our threshold, or “decision boundary”. Before, we had our decision boundary at ~50%; shifting it down also moves those middling values into the “yes” rectangle. Doing either of these has the effect of removing our false negatives but introducing more false positives.
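Option 2 can be illustrated with a minimal sketch. The scores and labels below are hypothetical values chosen for illustration, not data from the talk, and `confusion_counts` is a helper name introduced here. It shows that lowering the threshold from 0.5 to 0.3 removes the false negatives at the price of an extra false positive.

```python
# Hypothetical model scores (predicted probability that maintenance is needed)
# paired with the true outcome. These values are illustrative only.
scored = [
    (0.95, True), (0.80, True), (0.45, True), (0.40, True),
    (0.55, False), (0.35, False), (0.20, False), (0.05, False),
]


def confusion_counts(scored, threshold):
    """Return (tp, fn, fp, tn) when predicting "yes" for scores >= threshold."""
    tp = fn = fp = tn = 0
    for score, needs_maintenance in scored:
        predicted_yes = score >= threshold
        if needs_maintenance and predicted_yes:
            tp += 1
        elif needs_maintenance:
            fn += 1
        elif predicted_yes:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


at_half = confusion_counts(scored, 0.5)   # decision boundary at ~50%
lowered = confusion_counts(scored, 0.3)   # decision boundary shifted down
print("threshold 0.5:", at_half)
print("threshold 0.3:", lowered)
```

With these made-up scores the two missed failures at the 0.5 threshold are both caught at 0.3, while one additional healthy machine is flagged for maintenance.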
  14. Making sure we don’t miss required maintenance here has reduced the cost by nearly £1,000,000. This is a simple and extreme example but the point stands that data scientists need to tie technical decision making with business value understanding.
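The arithmetic behind the two totals can be sketched as follows. The costs and counts come from the slides; which count is a true positive, false negative and so on is an assumption inferred from the notes (4 unnecessary maintenance calls, 2 missed failures). Like the slide, this sketch counts only the cost of misclassifications and ignores the cost of maintenance that was genuinely needed.

```python
FAILURE_COST = 500_000     # cost of a missed failure (false negative), from the slide
MAINTENANCE_COST = 1_000   # cost of unnecessary maintenance (false positive), from the slide


def misclassification_cost(fn, fp):
    """Cost of the model's mistakes only, as on the slide."""
    return fn * FAILURE_COST + fp * MAINTENANCE_COST


def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn)


# Cell assignment is inferred from the notes, not stated on the slides.
original = dict(tp=38, fn=2, fp=4, tn=556)
adjusted = dict(tp=40, fn=0, fp=60, tn=490)

for name, m in (("original", original), ("adjusted", adjusted)):
    cost = misclassification_cost(m["fn"], m["fp"])
    print(f"{name}: cost £{cost:,}, accuracy {accuracy(**m):.1%}")

saving = (misclassification_cost(original["fn"], original["fp"])
          - misclassification_cost(adjusted["fn"], adjusted["fp"]))
print(f"saving: £{saving:,}")
```

Under these assumptions the less accurate model is £944,000 cheaper, which is the "nearly £1,000,000" quoted above.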
  15. The previous example showed a classification – but how about a regression, where we’re predicting continuous variables and want to look at predictive maintenance from a perspective of remaining useful life? Let’s look at a simple linear regression. Say we want to look at the performance of machines over time and we are just given time and performance metrics. We come to the conclusion that the machines’ performance improves over time. Again our customer is not so happy. (Transition 1) Now we do some business analysis, and find out that there are actually different groups of machines represented within this data and our original conclusion was wrong. Taken together, this shows that understanding the business context in which the data resides is incredibly important to doing data science, and without it we can cause more harm than good. It’s so important in data science projects to have data scientists engaged early on, so that things like this don’t happen and we don’t lose trust.
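A minimal sketch of this effect, using made-up numbers rather than the data behind the slide: two hypothetical groups of machines each decline in performance over time, yet a single regression line fitted to the pooled data slopes upwards. The group names and values are assumptions for illustration.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least squares line through the points (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical (time, performance) readings for two groups of machines.
group_a = ([1, 2, 3], [10, 9, 8])      # lower-performing group, observed early
group_b = ([6, 7, 8], [20, 19, 18])    # higher-performing group, observed later

pooled_x = group_a[0] + group_b[0]
pooled_y = group_a[1] + group_b[1]

pooled_slope = least_squares_slope(pooled_x, pooled_y)
slope_a = least_squares_slope(*group_a)
slope_b = least_squares_slope(*group_b)

print(f"pooled slope:  {pooled_slope:+.2f}")   # positive: "performance improves"
print(f"group A slope: {slope_a:+.2f}")        # negative within the group
print(f"group B slope: {slope_b:+.2f}")        # negative within the group
```

Without the group label the pooled fit suggests improvement; with it, both groups are visibly degrading.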
  16. So a data scientist is not just some subset of these roles, a data scientist encompasses all 3 roles.
  17. All of these people are data scientists and all of them will have significant overlap in skills with the others.
  18. Now that we know who our data scientist is, what their tools are and what their objective is – let’s talk about data. No “Data Science 101” talk would be complete without a discussion about data. These are possibly the 2 questions we get asked most: What data do you need? How much data do you need? As consultants, our diplomatic answer is always – it depends.
  19. This question – what data do you need? – is highly use case dependent and requires BA workshops. Cast your mind back a few slides, to where we showed the regression was positive without information on the machine groups, and then it became negative once we had it. Without business analysis we wouldn’t know that we needed that group data. This is why it’s so important to have data scientists on board so early on in the engagement, so we can cover off some of these requirements. Keeping in mind that the aim of machine learning is to mimic human behaviour, if something is tough for an SME to identify or predict, it is most likely hard to train an ML model to do it. There are, of course, as with everything, exceptions to this rule. For those exceptions we should ask – is this consulting or is this a research task that we should be taking on?
  20. How much data do you need? Commonly people will just think about volume, they want a number. I can get you 100 data points and that will all be fine. However, we need to get people out of that thinking – we need to combine volume, with veracity (reliability) and variety. The data sample we’re given needs to be representative of the population and only with all 3 will we get this.
  21. Let’s take a computer vision problem. The computer vision examples almost always have cats. How do we know this is a cat? We have 140 million interconnected neurons in our primary visual cortex, with tens of billions of connections between them. That’s just V1; we also have V2, V3, V4 and V5 for fine tuning. We are very good at learning pattern recognition as a result.
  22. This is an easy task for us but incredibly difficult for a computer. Computers just see pixel values. We can’t just write a program to recognise cats – there are too many options to encode and too many edge cases. So we use machine learning to recognise patterns. I won’t go into the details of convolutional neural networks but, as you can imagine, the transformation to go from these numbers here to the label of a cat is highly non-linear and complex. There’s no y = mx + c linear regression here. This kind of pattern recognition is going to need many, many examples to determine how to get from a picture of a cat to a label of cat. The InceptionV3 image recognition model from Google has just shy of 24 million parameters it needs to get right.
  23. Image data might have a more complex transformation, but if we have tabular data we still need enough data in order to reason about distributions of data. Let’s say we’re trying to classify our blue circles and red crosses, and we have a new sample – the green question mark. (Transition 1) This data alone could be a sample of any number of distributions. Any of these decision boundaries might be valid – we need more volume to get a better idea of the distribution. (Transition 2) When we have more data, all of these distributions include those first 6 points but we can see how wildly different they are: our top example would classify this as a red cross but the next 2 would be blue circles.
  24. If you feed a model with incorrect or unreliable data, the results you get will be unreliable or incorrect. Depending on the accuracy you need, for each incorrect data point you feed in, you might need 5, 10, 50, 100… similar correct examples to drown the effects of it out.
  25. Let’s train a model on these pictures of goats and these pictures of horses (I didn’t use cats!). (Transition 1) Now, given this picture of a horse, what do we think our model would predict? It’s never seen a side profile of a horse. It’s got 4 legs and a relatively rectangular body – it’s a goat. We could have hundreds of thousands of images like this, but we will still predict this is a goat. You may also have heard about the model that distinguished dogs from wolves not by the animal itself but by the background: if there was snow, it was a wolf. Sample training data needs to be representative of the potential scoring population. If you’re doing crack detection, for example, it’s better to have a thousand different images of cracks than a thousand images of the same crack. There is another conversation on how we go about getting this labelled data but it’s perhaps a topic for another day.
  26. This isn’t just true of image data – let’s take a look at some more tabular data. We have two classes of data, represented by blue circles and red crosses. We have a new sample represented by a green question mark and we want to know how to classify this. (Transition 1) If we have a lot of data but all our data is from the same two clusters, we may get a separation that looks something like this – this is something an SVC might do to classify these. But notice that the green question mark is not near either of these clusters. We’ve trained a model the best we can based on the knowledge we have of this training data, but this training data is clearly not representative of the population.
  27. Now we have sparser data. There are far fewer points, and they’re actually just a different sample from the same population, but this data is a better representation of the population. (Transition 1) Now we’re a little more certain about where this green question mark should be placed. Although previously it would have been in the blue area, now it’s on the red side of our decision boundary. So variety of data is just as important as volume of data, if not more so.
  28. As part of this discussion around the types of data required by data scientists and why our samples need to be representative of the population, we also need to talk about AI ethics and responsibility, which will ultimately fall on the data scientists that design the system. Bias has different meanings in ML and stats, but here I’m talking about bias in which a model is skewed, based on the data it is fed, relative to accepted legal or moral principles. If you train a model to determine the best candidates based on your historical successful hires, but all you’ve hired in the past is men, that model is going to be skewed towards hiring men. This is an example where your training sample isn’t necessarily representative of the population. Similarly, if you train a model mostly on data you’ve collected from white men, it’s going to perform better on white men. Again, you need a representative sample of the population. The Facebook example here is similar but also highlights another key issue. Even if you remove the actual labels that indicate protected classes like race, sex or religion, we need to consider proxies that might indicate these classes – hormone levels in medical records, certain words in CVs, sports activities or post codes. There are a number of techniques for reducing this kind of bias in models, including pre-processing, in-processing and post-processing algorithms, and class balancing algorithms like SMOTE can help augment under-represented classes. Ultimately, though, I think the best way for data scientists to tackle this kind of bias is to have a good domain understanding and a good understanding of the data they are using, to try and ensure they are not discriminating against any class.
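One simple way to start looking for this kind of skew is to compare how often a model selects candidates from each group. This diagnostic is not named in the talk; it is offered here only as a rough sketch, and the groups, predictions and the `selection_rates` helper are all hypothetical. A gap in rates does not prove discrimination on its own, but it is a prompt to investigate the training data and any proxy features.

```python
from collections import defaultdict

# Hypothetical (group, model_selected) pairs for a candidate-screening model.
predictions = [
    ("A", True), ("A", True), ("A", True), ("A", False),
    ("B", True), ("B", False), ("B", False), ("B", False),
]


def selection_rates(predictions):
    """Fraction of candidates the model selected, per group."""
    selected = defaultdict(int)
    total = defaultdict(int)
    for group, was_selected in predictions:
        total[group] += 1
        if was_selected:
            selected[group] += 1
    return {group: selected[group] / total[group] for group in total}


rates = selection_rates(predictions)
# Ratio of the lowest to the highest selection rate; 1.0 means equal rates.
rate_ratio = min(rates.values()) / max(rates.values())

print(rates)
print(f"lowest/highest selection rate: {rate_ratio:.2f}")
```

The group attribute is only needed for this audit step; it does not have to be a model input.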
  29. The final thing I want to discuss is MLOps as it’s one of the most important aspects of a data scientist’s role. Whilst experimentation, proof of concepts and proof of values are important – value realization for our customers is only possible through continuous delivery. For successful value realisation, data science solutions need to be integrated with people, process and products.
  30. Historically there has been a disconnect between data scientists and other developers, in which a model is made, perhaps through data science experimentation and then thrown over a wall to developers to deploy. Model requirements in this manner are often poorly understood by developers and changes can be difficult, resulting in model drift.
  31. Modern data science delivery integrates machine learning and DevOps, using tools designed for continuous integration and continuous delivery. Although the goal is to enable each of the technical delivery tasks shown in the top right here, there is a focus on people, process and products in order to make this a success. (Transition 1) People: Project managers, architects, data engineers, data scientists, developers, and testers should all be involved in a use case from an early stage. There is not a specific KPI for data scientists such as an RMSE value, and a different goal for developers such as API response times – the whole team shares common goals. (Transition 2) Process: We follow agile methodology principles of short sprints of 2 or 3 weeks, tracking the story points and burndown of a sprint to use in planning for the next sprints as well as retrospectives to feedback to each other what’s going well, what’s not going well and what actions we can take to maintain velocity. (Transition 3) Products: We should use products that enhance our productivity, not products shoehorned in because they are an individual’s favourite tool. Knowledge sharing through wikis and Teams is encouraged across the team so that we can make sure everyone can contribute to a range of tasks.
  32. MLOps is the integration of machine learning into DevOps processes and, as we see on screen, it is the ability to continuously integrate, automatically test, build, deploy and monitor Machine Learning artifacts such as Data & Training pipelines and models. The aim of data science modelling is not necessarily to be right today but to be less wrong each day through an iterative feedback cycle. Monitoring models and, for those that have done scrum principles or operations management training, the principles of Kaizen (a Japanese term that means “continuous improvement”) are therefore of paramount importance.
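As a very rough sketch of what monitoring for data drift could look like, the check below compares the mean of one feature in live data against the training data, measured in training standard deviations. The feature values, the `mean_shift` helper and the alert threshold are all hypothetical; production monitoring would typically use more robust statistics and cover many features as well as model outputs.

```python
from statistics import mean, stdev


def mean_shift(training_values, live_values):
    """Shift of the live mean from the training mean, in training standard deviations."""
    return abs(mean(live_values) - mean(training_values)) / stdev(training_values)


ALERT_THRESHOLD = 2.0  # hypothetical cut-off for raising a drift alert

# Hypothetical readings of one input feature.
training = [10.0, 11.0, 9.0, 10.0, 10.0]
live_ok = [10.0, 10.5, 9.5, 10.0]
live_drifted = [13.0, 14.0, 12.5, 13.5]

for name, live in (("live_ok", live_ok), ("live_drifted", live_drifted)):
    shift = mean_shift(training, live)
    status = "ALERT: retrain or investigate" if shift > ALERT_THRESHOLD else "ok"
    print(f"{name}: shift = {shift:.2f} sd -> {status}")
```

Running a check like this on a schedule inside the deployment pipeline is one way to close the feedback loop described above.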
  33. Data science isn’t something that happens in a silo. This is a team effort among a team that share common goals. Not all projects are going to need all of these resources but all projects require the concerted effort of a team to make them a success. We want to work with others in order to make sure our engagements are successful.