Applied data science in the industry:
How to build a data science project in a
corporate setting
BEST PRACTICES AND A REAL-WORLD EXAMPLE
Soraya Yama
Wednesday, June 26, 2019
WIMLDS Montreal #3: Business & AI
How to guarantee the success of your
data science project in industry?
Challenges and solutions when building data science projects in industry or in a corporate environment
• How to generate insights for better business decision making, which is what drives data science projects?
• How to work side by side with the business?
• How to build a reliable and understandable analysis flow/solution/product?
• How to properly communicate results and key elements?
Data science in industry vs in research

Industry: Faster pace than academia, with quick iterations.
Research: Experiments are easier in a lab.

Industry: If an analysis does not produce results quickly, drop it and/or redesign it.
Research: Follow best practices to get approvals after peer reviews.

Industry: Simple solutions are preferred over novel complex ones, which are hard to understand and hard to trust.
Research: Let's go for the fancy cool new algorithms!!!

Industry: Limited time and resources, so you need to balance research excellence with business needs.
Research: Research is expected to take a lot of time.

Industry: Not everyone you work with understands data science, so you need to convince decision makers to use the insights to drive decisions.
Research: Peers understand data science and the importance of research.

Industry: The team might not be data-driven or analytics-minded.
Research: You will most likely have more than one analyst in the team.

Industry: Explain statistical concepts in layman's terms.
Research: Your peers are more likely to understand the statistical jargon you use.

Industry: You won't do data science only; you might need to learn new skills (data engineering, a new programming language, new packages, etc.).
Research: It is less likely that you do data engineering or architecture while being a data scientist.

Industry: Rejecting a hypothesis is equally interesting.
Research: Rejecting a hypothesis can be looked at as a failure.

Industry: Focus on industry-specific projects.
Challenges faced
1. Sometimes problems are not well defined
2. Sometimes data is not available or not in a usable format
3. Sometimes tools or data analysis platforms are not available
4. Which models to use? Which algorithms are more suitable for the analysis and the
infrastructure?
5. Sometimes clients or business lines will not understand your analysis, the methods used
6. How to build your data science flow and what to avoid?
7. How to present results in a way business stakeholders understand them?
1. Sometimes problems are not well
defined
Data science is a science; therefore it follows the scientific method
In the scientific method, the process starts with a question to be asked or a
problem to be identified
In data science, the process also starts with a problem to solve
This requires a proper understanding of the business context
Sometimes sitting with the business and helping to formalize the problem is key
2. Sometimes data is not available or not
in a usable format
Which data sources to use?
◦ data lake, data warehouse, database, raw data to be imported such as images, sound files, spreadsheets or flat files
How to collect the data?
◦ import data, create a data pipeline
Who to work with for the data acquisition?
◦ data engineers, database system managers etc.
How to convince teams you need this data?
◦ explain the use case, have your manager support you
How to maintain this new data acquisition?
◦ is it a one-shot data acquisition or a recurrent feed?
Where to store the data?
◦ big data storage, file system, cluster like Hadoop?
If it’s a data stream, how to build it?
◦ Kafka, AWS, Flume etc.
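As an illustration of the "import data, create a data pipeline" step, here is a minimal sketch of ingesting a flat file and keeping only usable rows. The column names (system_id, timestamp, cpu_load) and the sample extract are hypothetical and not taken from the talk; a real pipeline would read from a file or a feed and log rejected rows rather than silently dropping them.

```python
import csv
import io

# Hypothetical flat-file extract; in practice this would be a file or a feed.
RAW_EXTRACT = """system_id,timestamp,cpu_load
srv-01,2019-06-26T10:00:00,0.42
srv-02,2019-06-26T10:00:00,not_available
srv-03,2019-06-26T10:00:00,0.97
"""

def read_records(handle):
    """Yield one validated record per row; skip rows that cannot be parsed."""
    for row in csv.DictReader(handle):
        try:
            row["cpu_load"] = float(row["cpu_load"])
        except ValueError:
            continue  # unusable value: drop the row (a real pipeline would log it)
        yield row

records = list(read_records(io.StringIO(RAW_EXTRACT)))
print(len(records), "usable rows out of 3")
```

Keeping the reader as a generator means the same function works for a one-shot import and for a recurrent feed that is processed row by row.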
3. Sometimes tools or data analysis
platforms are not available
• Identify which tools or platforms are well adapted to solve the problem and which ones are available or easy to get
• Request them / install them
• Work on the data infrastructure
Questions to ask
E.g.
• Can I solve this specific use case using a Python script in an IDE?
• Am I looking at big data, in which case I might need a distributed system like Spark?
• Shall I store the data in a filesystem or on HDFS?
• The team is using R, but can I productionize a script written in R?
• There is a vendor product I am asked to use, but is it convenient for the purpose of the use case?
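4. Which models to use? Which algorithms
are more suitable for the analysis and the
infrastructure?
• KNN is a lazy learner: it stores the training data and defers the work to prediction time, which gets costly on large data sets
• Decision trees capture non-linear interactions well, but they cannot extrapolate beyond the values seen in training, so use them with caution on trending time series
• Random forests work well with large labelled data sets; for unlabelled data, consider unsupervised tree ensembles such as Isolation Forest
• Ordinary Least Squares should not be used on a high-dimensional data set (number of variables > number of observations): the solution is not unique, so prefer a regularized regression such as ridge or lasso
• Stratified sampling is usually better than simple random sampling for classification problems, especially when classes are imbalanced
• Etc.
• Ask yourself the right questions before jumping ahead and using the fanciest model you can think of. The business might not understand it.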
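To make the point about stratified sampling concrete, here is a minimal sketch of a stratified train/test split in plain Python. The function name, the 25% test fraction and the toy labels are assumptions for illustration only; in practice a library routine would normally be used.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) with similar class proportions in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # At least one example of each class goes to the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced labels: 4 positives among 40 observations.
labels = [1] * 4 + [0] * 36
train_idx, test_idx = stratified_split(labels)
positives_in_test = sum(labels[i] for i in test_idx)
print(positives_in_test, "positive(s) in a test set of", len(test_idx))
```

With a simple random split, a test set of 10 drawn from these 40 observations could easily contain no positive case at all, which makes any evaluation of the classifier meaningless.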
5. Sometimes clients or business lines will not
understand your analysis, the methods used
Start small – use a data sample to build your case
Build a prototype (proof of concept) and show them how they can leverage data analysis
Do not use statistical jargon; use layman's terms to communicate your idea
Sell your idea
Make it simple enough to understand, efficient enough to implement, interesting enough to use
Real world example
Signal analysis followed by a stock price behaviour prediction using a convolutional neural network
Data points to be investigated will be labelled 1.
All other cases will be labelled 0.
Detecting ratio anomalies using traditional statistical detection methods and Isolation Forest (a tree-based ensemble for anomaly detection)
Cons (-)
◦ Processing time is very long, especially when using millions of rows: need to distribute the data
◦ Isolation Forest exists in sklearn, but has not yet been fully implemented in MLlib
Pros (+)
◦ Isolation Forest is efficient when handling big data
◦ Very accurate detection compared to traditional methods
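The slide does not say which traditional statistical method was used. As one possible baseline, here is a minimal sketch of the modified z-score (based on the median and the median absolute deviation), using the 1/0 labelling convention described above. The ratio values are made up for illustration, and the 3.5 cut-off is the conventional default, not a value from the talk.

```python
import statistics

def flag_anomalies(values, threshold=3.5):
    """Label 1 the points to investigate and 0 all others (modified z-score)."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return [0] * len(values)  # no spread at all: nothing stands out
    return [
        1 if 0.6745 * abs(v - median) / mad > threshold else 0
        for v in values
    ]

# Hypothetical ratios; the last one is clearly out of line.
ratios = [1.02, 0.98, 1.01, 0.99, 1.00, 1.03, 0.97, 4.80]
labels = flag_anomalies(ratios)
print(labels)
```

A baseline like this is cheap to run and easy to explain to the business, which makes it a useful point of comparison before moving to Isolation Forest.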
6. How to build your data science flow
and what to avoid?
Your analysis code has to be understandable and reproducible (structured and testable)
If you are using a data analysis flow, your flow has to be structured
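A minimal sketch of what "structured and testable" can look like: each step of the flow is a small function with one job, and the whole flow has a test that can be rerun at any time. The step names, the "value" field and the baseline are hypothetical.

```python
def clean(rows):
    """Drop rows with a missing value."""
    return [r for r in rows if r.get("value") is not None]

def add_ratio(rows, baseline):
    """Add a 'ratio' field: value divided by a fixed baseline."""
    return [{**r, "ratio": r["value"] / baseline} for r in rows]

def summarize(rows):
    """Return the mean ratio, the single figure reported to the business."""
    return sum(r["ratio"] for r in rows) / len(rows)

def run_flow(rows, baseline):
    """The whole flow in one place, so it can be rerun end to end."""
    return summarize(add_ratio(clean(rows), baseline))

def test_run_flow():
    rows = [{"value": 10.0}, {"value": None}, {"value": 30.0}]
    # (10 / 20 + 30 / 20) / 2 == 1.0; the row with a missing value is dropped.
    assert run_flow(rows, baseline=20.0) == 1.0

test_run_flow()
```

Because every step takes data in and returns data out, each one can be tested and replaced on its own, and the same inputs always reproduce the same result.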
7. How to present results in a way
business stakeholders understand them?
Making complex concepts easy to understand by business lines
• Sometimes a graph is worth a thousand words
• Reports or dashboards have to be clear, with ideally one insight per view (do not overload the page)
• Show the results in a way that makes them easy to interpret
A real-world example of a real-time
failure prediction using Spark
Real-time system failure predictions using:
• Source system metrics
• Kafka for data streaming
• Spark for the predictions
• HDFS to store data
• JavaScript/jQuery or a vendor product for the front end
[Architecture diagram: Source systems, Kafka Stream, Spark Streaming, Spark ML, Front End, HDFS]
Offline training – Online testing using
Spark
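The offline training / online testing pattern can be sketched without any cluster. In the sketch below a plain Python iterator stands in for the Kafka stream and a mean-plus-three-standard-deviations threshold stands in for the Spark ML model; both are simplifying assumptions for illustration, not the system presented in the talk, and the metric values are made up.

```python
import statistics

def train_offline(history):
    """Fit a deliberately simple 'model' on historical metric values."""
    return {"mean": statistics.mean(history), "stdev": statistics.stdev(history)}

def score_stream(model, stream, n_sigmas=3.0):
    """Yield (value, at_risk) for each incoming reading, one at a time."""
    limit = model["mean"] + n_sigmas * model["stdev"]
    for value in stream:
        yield value, value > limit

# Offline: train once on stored history (HDFS in the architecture above).
history = [0.40, 0.42, 0.38, 0.41, 0.39]
model = train_offline(history)

# Online: score readings as they arrive (a Kafka topic in the architecture above).
incoming = iter([0.41, 0.43, 0.95])
alerts = [value for value, at_risk in score_stream(model, incoming) if at_risk]
print(alerts)
```

The key design choice is the separation: training runs in batch on stored data, while scoring only applies an already fitted model, so it stays cheap enough to run on every incoming reading.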
APPENDIX
[Figure: data science tools magic quadrant, January 2019]

Slide deck as listed on SlideShare: How to build a data science project in a corporate setting, by Soraya Christina, Senior Data Scientist at Morgan Stanley

data science and business analytics
data science and business analyticsdata science and business analytics
data science and business analyticssunnypatil1778
 
Self Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docxSelf Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docxShanmugasundaram M
 
The Python ecosystem for data science - Landscape Overview
The Python ecosystem for data science - Landscape OverviewThe Python ecosystem for data science - Landscape Overview
The Python ecosystem for data science - Landscape OverviewDr. Ananth Krishnamoorthy
 
10 Tips From A Young Data Scientist
10 Tips From A Young Data Scientist10 Tips From A Young Data Scientist
10 Tips From A Young Data ScientistNuno Carneiro
 
Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning CCG
 
Data Mining and Data Warehouse
Data Mining and Data WarehouseData Mining and Data Warehouse
Data Mining and Data WarehouseAnupam Sharma
 
Tips for Effective Data Science in the Enterprise
Tips for Effective Data Science in the EnterpriseTips for Effective Data Science in the Enterprise
Tips for Effective Data Science in the EnterpriseLisa Cohen
 
Next generation of data scientist
Next generation of data scientistNext generation of data scientist
Next generation of data scientistTanujaSomvanshi1
 
what is data science
 what is data science what is data science
what is data scienceCrampete
 
Big dataplatform operationalstrategy
Big dataplatform operationalstrategyBig dataplatform operationalstrategy
Big dataplatform operationalstrategyHimanshu Bari
 
Designing High Quality Data Driven Solutions 110520
Designing High Quality Data Driven Solutions 110520Designing High Quality Data Driven Solutions 110520
Designing High Quality Data Driven Solutions 110520MariaHalstead1
 
Data Science.pdf
Data Science.pdfData Science.pdf
Data Science.pdfWinduGata3
 
Unit_8_Data_processing,_analysis_and_presentation_and_Application (1).pptx
Unit_8_Data_processing,_analysis_and_presentation_and_Application (1).pptxUnit_8_Data_processing,_analysis_and_presentation_and_Application (1).pptx
Unit_8_Data_processing,_analysis_and_presentation_and_Application (1).pptxtesfkeb
 
Putting data science in your business a first utility feedback
Putting data science in your business a first utility feedbackPutting data science in your business a first utility feedback
Putting data science in your business a first utility feedbackPeculium Crypto
 
DATASCIENCE vs BUSINESS INTELLIGENCE.pptx
DATASCIENCE vs BUSINESS INTELLIGENCE.pptxDATASCIENCE vs BUSINESS INTELLIGENCE.pptx
DATASCIENCE vs BUSINESS INTELLIGENCE.pptxOTA13NayabNakhwa
 
Data Scientist By: Professor Lili Saghafi
Data Scientist By: Professor Lili SaghafiData Scientist By: Professor Lili Saghafi
Data Scientist By: Professor Lili SaghafiProfessor Lili Saghafi
 
Introduction to data science.pdf
Introduction to data science.pdfIntroduction to data science.pdf
Introduction to data science.pdfalsaid fathy
 
Data Fluency - AUA Conference
Data Fluency - AUA ConferenceData Fluency - AUA Conference
Data Fluency - AUA ConferenceMartha Horler
 
Which institute is best for data science?
Which institute is best for data science?Which institute is best for data science?
Which institute is best for data science?DIGITALSAI1
 

Similaire à How to build a data science project in a corporate setting, by Soraya Christina, Senior Data Scientist at Morgan Stanley (20)

data science and business analytics
data science and business analyticsdata science and business analytics
data science and business analytics
 
Proposed Talk Outline for Pycon2017
Proposed Talk Outline for Pycon2017 Proposed Talk Outline for Pycon2017
Proposed Talk Outline for Pycon2017
 
Self Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docxSelf Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docx
 
The Python ecosystem for data science - Landscape Overview
The Python ecosystem for data science - Landscape OverviewThe Python ecosystem for data science - Landscape Overview
The Python ecosystem for data science - Landscape Overview
 
10 Tips From A Young Data Scientist
10 Tips From A Young Data Scientist10 Tips From A Young Data Scientist
10 Tips From A Young Data Scientist
 
Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning
 
Data Mining and Data Warehouse
Data Mining and Data WarehouseData Mining and Data Warehouse
Data Mining and Data Warehouse
 
Tips for Effective Data Science in the Enterprise
Tips for Effective Data Science in the EnterpriseTips for Effective Data Science in the Enterprise
Tips for Effective Data Science in the Enterprise
 
Next generation of data scientist
Next generation of data scientistNext generation of data scientist
Next generation of data scientist
 
what is data science
 what is data science what is data science
what is data science
 
Big dataplatform operationalstrategy
Big dataplatform operationalstrategyBig dataplatform operationalstrategy
Big dataplatform operationalstrategy
 
Designing High Quality Data Driven Solutions 110520
Designing High Quality Data Driven Solutions 110520Designing High Quality Data Driven Solutions 110520
Designing High Quality Data Driven Solutions 110520
 
Data Science.pdf
Data Science.pdfData Science.pdf
Data Science.pdf
 
Unit_8_Data_processing,_analysis_and_presentation_and_Application (1).pptx
Unit_8_Data_processing,_analysis_and_presentation_and_Application (1).pptxUnit_8_Data_processing,_analysis_and_presentation_and_Application (1).pptx
Unit_8_Data_processing,_analysis_and_presentation_and_Application (1).pptx
 
Putting data science in your business a first utility feedback
Putting data science in your business a first utility feedbackPutting data science in your business a first utility feedback
Putting data science in your business a first utility feedback
 
DATASCIENCE vs BUSINESS INTELLIGENCE.pptx
DATASCIENCE vs BUSINESS INTELLIGENCE.pptxDATASCIENCE vs BUSINESS INTELLIGENCE.pptx
DATASCIENCE vs BUSINESS INTELLIGENCE.pptx
 
Data Scientist By: Professor Lili Saghafi
Data Scientist By: Professor Lili SaghafiData Scientist By: Professor Lili Saghafi
Data Scientist By: Professor Lili Saghafi
 
Introduction to data science.pdf
Introduction to data science.pdfIntroduction to data science.pdf
Introduction to data science.pdf
 
Data Fluency - AUA Conference
Data Fluency - AUA ConferenceData Fluency - AUA Conference
Data Fluency - AUA Conference
 
Which institute is best for data science?
Which institute is best for data science?Which institute is best for data science?
Which institute is best for data science?
 

Plus de WiMLDSMontreal

The Five Ws of Funding, by Sahar Ansary, Partner, R&D Partners
The Five Ws of Funding, by Sahar Ansary, Partner, R&D PartnersThe Five Ws of Funding, by Sahar Ansary, Partner, R&D Partners
The Five Ws of Funding, by Sahar Ansary, Partner, R&D PartnersWiMLDSMontreal
 
The Agile methodology - Delivering new ways of working, by Sandra Frechette, ...
The Agile methodology - Delivering new ways of working, by Sandra Frechette, ...The Agile methodology - Delivering new ways of working, by Sandra Frechette, ...
The Agile methodology - Delivering new ways of working, by Sandra Frechette, ...WiMLDSMontreal
 
Coveo Machine Learning for E-Commerce: At the Center of Business Challenges, ...
Coveo Machine Learning for E-Commerce: At the Center of Business Challenges, ...Coveo Machine Learning for E-Commerce: At the Center of Business Challenges, ...
Coveo Machine Learning for E-Commerce: At the Center of Business Challenges, ...WiMLDSMontreal
 
Diversity and Knowledge Production, by Jihane Lamouri, Diversity, Equity and ...
Diversity and Knowledge Production, by Jihane Lamouri, Diversity, Equity and ...Diversity and Knowledge Production, by Jihane Lamouri, Diversity, Equity and ...
Diversity and Knowledge Production, by Jihane Lamouri, Diversity, Equity and ...WiMLDSMontreal
 
Diversity & Deep Tech Start-ups, by Eleonora Vella, Program Director & Princi...
Diversity & Deep Tech Start-ups, by Eleonora Vella, Program Director & Princi...Diversity & Deep Tech Start-ups, by Eleonora Vella, Program Director & Princi...
Diversity & Deep Tech Start-ups, by Eleonora Vella, Program Director & Princi...WiMLDSMontreal
 
Ubiquitous Machine Learning: Lessons from DeepRL in Robotics and Speech, by F...
Ubiquitous Machine Learning: Lessons from DeepRL in Robotics and Speech, by F...Ubiquitous Machine Learning: Lessons from DeepRL in Robotics and Speech, by F...
Ubiquitous Machine Learning: Lessons from DeepRL in Robotics and Speech, by F...WiMLDSMontreal
 
Fashion-Gen: The Generative Fashion Dataset and Challenge by Negar Rostamzade...
Fashion-Gen: The Generative Fashion Dataset and Challenge by Negar Rostamzade...Fashion-Gen: The Generative Fashion Dataset and Challenge by Negar Rostamzade...
Fashion-Gen: The Generative Fashion Dataset and Challenge by Negar Rostamzade...WiMLDSMontreal
 
Artistic Applications of AI, by Luba Elliott, AI Curator
Artistic Applications of AI, by Luba Elliott, AI CuratorArtistic Applications of AI, by Luba Elliott, AI Curator
Artistic Applications of AI, by Luba Elliott, AI CuratorWiMLDSMontreal
 
What Scares Me About AI, by Rachel Thomas, Co-founder of fast.ai & Professor ...
What Scares Me About AI, by Rachel Thomas, Co-founder of fast.ai & Professor ...What Scares Me About AI, by Rachel Thomas, Co-founder of fast.ai & Professor ...
What Scares Me About AI, by Rachel Thomas, Co-founder of fast.ai & Professor ...WiMLDSMontreal
 
Building Analytics and Data Science at A Start-Up, by Kathleen Siminyu, Head ...
Building Analytics and Data Science at A Start-Up, by Kathleen Siminyu, Head ...Building Analytics and Data Science at A Start-Up, by Kathleen Siminyu, Head ...
Building Analytics and Data Science at A Start-Up, by Kathleen Siminyu, Head ...WiMLDSMontreal
 
Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...
Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...
Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...WiMLDSMontreal
 

Plus de WiMLDSMontreal (11)

The Five Ws of Funding, by Sahar Ansary, Partner, R&D Partners
The Five Ws of Funding, by Sahar Ansary, Partner, R&D PartnersThe Five Ws of Funding, by Sahar Ansary, Partner, R&D Partners
The Five Ws of Funding, by Sahar Ansary, Partner, R&D Partners
 
The Agile methodology - Delivering new ways of working, by Sandra Frechette, ...
The Agile methodology - Delivering new ways of working, by Sandra Frechette, ...The Agile methodology - Delivering new ways of working, by Sandra Frechette, ...
The Agile methodology - Delivering new ways of working, by Sandra Frechette, ...
 
Coveo Machine Learning for E-Commerce: At the Center of Business Challenges, ...
Coveo Machine Learning for E-Commerce: At the Center of Business Challenges, ...Coveo Machine Learning for E-Commerce: At the Center of Business Challenges, ...
Coveo Machine Learning for E-Commerce: At the Center of Business Challenges, ...
 
Diversity and Knowledge Production, by Jihane Lamouri, Diversity, Equity and ...
Diversity and Knowledge Production, by Jihane Lamouri, Diversity, Equity and ...Diversity and Knowledge Production, by Jihane Lamouri, Diversity, Equity and ...
Diversity and Knowledge Production, by Jihane Lamouri, Diversity, Equity and ...
 
Diversity & Deep Tech Start-ups, by Eleonora Vella, Program Director & Princi...
Diversity & Deep Tech Start-ups, by Eleonora Vella, Program Director & Princi...Diversity & Deep Tech Start-ups, by Eleonora Vella, Program Director & Princi...
Diversity & Deep Tech Start-ups, by Eleonora Vella, Program Director & Princi...
 
Ubiquitous Machine Learning: Lessons from DeepRL in Robotics and Speech, by F...
Ubiquitous Machine Learning: Lessons from DeepRL in Robotics and Speech, by F...Ubiquitous Machine Learning: Lessons from DeepRL in Robotics and Speech, by F...
Ubiquitous Machine Learning: Lessons from DeepRL in Robotics and Speech, by F...
 
Fashion-Gen: The Generative Fashion Dataset and Challenge by Negar Rostamzade...
Fashion-Gen: The Generative Fashion Dataset and Challenge by Negar Rostamzade...Fashion-Gen: The Generative Fashion Dataset and Challenge by Negar Rostamzade...
Fashion-Gen: The Generative Fashion Dataset and Challenge by Negar Rostamzade...
 
Artistic Applications of AI, by Luba Elliott, AI Curator
Artistic Applications of AI, by Luba Elliott, AI CuratorArtistic Applications of AI, by Luba Elliott, AI Curator
Artistic Applications of AI, by Luba Elliott, AI Curator
 
What Scares Me About AI, by Rachel Thomas, Co-founder of fast.ai & Professor ...
What Scares Me About AI, by Rachel Thomas, Co-founder of fast.ai & Professor ...What Scares Me About AI, by Rachel Thomas, Co-founder of fast.ai & Professor ...
What Scares Me About AI, by Rachel Thomas, Co-founder of fast.ai & Professor ...
 
Building Analytics and Data Science at A Start-Up, by Kathleen Siminyu, Head ...
Building Analytics and Data Science at A Start-Up, by Kathleen Siminyu, Head ...Building Analytics and Data Science at A Start-Up, by Kathleen Siminyu, Head ...
Building Analytics and Data Science at A Start-Up, by Kathleen Siminyu, Head ...
 
Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...
Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...
Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...
 

Dernier

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Scaling API-first – The story of a global engineering organization
How to build a data science project in a corporate setting, by Soraya Christina, Senior Data Scientist at Morgan Stanley

  • 1. Applied data science in the industry: How to build a data science project in a corporate setting BEST PRACTICES AND A REAL-WORLD EXAMPLE Soraya Yama Wednesday, June 26, 2019 WIMLDS Montreal #3: Business & AI
  • 2. How to guarantee the success of your data science project in industry? Challenges and solutions when building data science projects in industry or in a corporate environment ◦ How to generate the insights for better business decision making that drive data science projects? ◦ How to work side by side with the business? ◦ How to build a reliable and understandable analysis flow/solution/product? ◦ How to properly communicate results and key elements?
  • 3. Data science in industry vs in research
      ◦ Industry: Faster pace than academia, with quick iterations | Research: Experiments are easier in a lab
      ◦ Industry: If an analysis does not produce results quickly, drop it and/or redesign it | Research: Follow best practices to get approvals after peer reviews
      ◦ Industry: Simple solutions are preferred over novel complex ones, which are hard to understand and hard to trust | Research: Let's go for the fancy cool new algorithms!!!
      ◦ Industry: Limited time and resources, so you need to balance research excellence with business needs | Research: Research is expected to take a lot of time
      ◦ Industry: Not everyone you work with understands data science, so you need to convince decision makers to use the insights to drive decisions | Research: Peers understand data science and the importance of research
      ◦ Industry: The team might not be data-driven or analytics-minded | Research: You will most likely have more than one analyst in the team
      ◦ Industry: Explain statistical concepts in layman's terms | Research: Your peers are more likely to understand the statistical jargon you use
      ◦ Industry: You won't do data science only; you might need to learn new skills (data engineering, a new programming language, new packages, etc.) | Research: It is less likely that you do data engineering or architecture while being a data scientist
      ◦ Industry: Rejecting a hypothesis is equally interesting | Research: Rejecting a hypothesis can be looked at as a failure
  • 4. Focus on industry specific projects
  • 5. Challenges faced 1. Sometimes problems are not well defined 2. Sometimes data is not available or not in a usable format 3. Sometimes tools or data analysis platforms are not available 4. Which models to use? Which algorithms are more suitable for the analysis and the infrastructure? 5. Sometimes clients or business lines will not understand your analysis or the methods used 6. How to build your data science flow and what to avoid? 7. How to present results in a way business stakeholders understand them?
  • 6. 1. Sometimes problems are not well defined. Data science is a science, therefore it follows the scientific method. In the scientific method, the process starts with a question to be asked or a problem to be identified. In data science, the process also starts with a problem to solve. This requires a proper understanding of the business context. Sometimes sitting with the business and helping formalize the problem is key.
  • 7.
  • 8. 2. Sometimes data is not available or not in a usable format. Which data sources to use? ◦ data lake, data warehouse, database, raw data to be imported such as images, sound files, spreadsheets or flat files. How to collect the data? ◦ import data, create a data pipeline. Who to work with for the data acquisition? ◦ data engineers, database system managers, etc. How to convince teams you need this data? ◦ explain the use case, have your manager support you. How to maintain this new data acquisition? ◦ is it a one-off data acquisition or a recurrent feed? Where to store the data? ◦ big data storage, file system, a cluster like Hadoop? If it's a data stream, how to build it? ◦ Kafka, AWS, Flume, etc.
  • 9. 3. Sometimes tools or data analysis platforms are not available ◦ Identify which tools or platforms are well adapted to solve the problem and which ones are available or easy to get ◦ Request them / install them ◦ Work on the data infrastructure
  • 10. Questions to ask, e.g. ◦ Can I solve this specific use case using a Python script in an IDE? ◦ Am I looking at big data, in which case I might need a distributed system like Spark? ◦ Shall I store the data in a filesystem or on HDFS? ◦ The team is using R, but can I productionize a script written in R? ◦ There is a vendor product I am asked to use, but is it convenient for the purpose of the use case?
  • 11. 4. Which models to use? Which algorithms are more suitable for the analysis and the infrastructure? ◦ KNN is a weak learner ◦ Decision trees work best to detect non-linear interactions (so should not be used for time series) ◦ Random forests can work with large labelled or unlabelled data ◦ Ordinary Least Squares has no unique solution on a high-dimensional data set (nb variables > nb observations), so a regularized method is needed there ◦ Stratified sampling is better than random sampling for classification problems ◦ Etc. ◦ Ask yourself the right questions before jumping ahead and using the fanciest model you can think of. The business might not understand it.
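The point about stratified sampling on slide 11 can be illustrated with a minimal sketch. It uses only the Python standard library; the function name `stratified_split`, the label data, and the 20% test fraction are all hypothetical choices for illustration, not something taken from the talk.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so each class keeps roughly the same share."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        # Sample the same fraction inside every class (at least one test row).
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced labels: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

With a purely random 20% sample, the rare class could end up with very few or no rows in the test set; sampling per class guarantees the 90/10 balance is preserved on both sides of the split.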
  • 12. 5. Sometimes clients or business lines will not understand your analysis or the methods used ◦ Start small: use a data sample to build your case ◦ Do a prototype (proof of concept) and show them how they can leverage data analysis ◦ Do not use statistical jargon; use layman's terms to communicate your idea ◦ Sell your idea ◦ Make it simple enough to understand, efficient enough to implement, interesting enough to use
  • 13. Real world example: signal analysis followed by a stock price behaviour prediction using a convolutional neural network. Data points to be investigated will be labelled 1. All other cases will be labelled 0.
  • 14. Detecting ratio anomalies using a traditional statistical detection method and Isolation Forest (clustering for anomaly detection). Cons (-): processing time is very long, especially with millions of rows, so the data needs to be distributed; Isolation Forest exists in sklearn, but has not yet been fully implemented in MLlib. Pros (+): Isolation Forest is efficient when handling big data; detection is very accurate compared to traditional methods.
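Slide 14 does not say which traditional statistical detection method was used as the baseline. As one possible illustration, the sketch below flags outlying ratios with a modified z-score based on the median and the median absolute deviation (MAD). The sample ratios, the function names, and the 3.5 threshold are assumptions made for this example; the Isolation Forest side of the comparison would rely on `sklearn.ensemble.IsolationForest` and is not shown here.

```python
import statistics


def robust_z_scores(values):
    """Modified z-scores using the median and MAD instead of mean and std."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # All points sit on the median: nothing can be scored as unusual.
        return [0.0 for _ in values]
    # 0.6745 rescales the MAD so scores are comparable to standard z-scores.
    return [0.6745 * (v - med) / mad for v in values]


def flag_anomalies(values, threshold=3.5):
    """Return the indices whose absolute modified z-score exceeds the threshold."""
    return [i for i, z in enumerate(robust_z_scores(values)) if abs(z) > threshold]


# Hypothetical ratios: most sit near 1.0, one is clearly off.
ratios = [0.98, 1.01, 1.00, 0.99, 1.02, 1.00, 3.50, 1.01, 0.97, 1.00]
anomalies = flag_anomalies(ratios)
print(anomalies)  # index of the 3.50 ratio
```

The median and MAD are used rather than the mean and standard deviation because a single extreme ratio inflates the standard deviation and can hide itself.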
  • 15. 6. How to build your data science flow and what to avoid? ◦ Your analysis code has to be understandable and reproducible (structured and testable) ◦ If you are using a data analysis flow, your flow has to be structured
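One way to read "structured and testable" on slide 15 is to split the flow into small functions that each do one thing and can be checked in isolation. The sketch below is a hypothetical example of that layout: the step names, the `id,value` input format, and the rule that drops negative values are all invented for illustration.

```python
def load_rows(raw_lines):
    """Parse 'id,value' lines into (id, float) tuples, skipping malformed rows."""
    rows = []
    for line in raw_lines:
        parts = line.strip().split(",")
        if len(parts) != 2:
            continue
        try:
            rows.append((parts[0], float(parts[1])))
        except ValueError:
            continue  # value was not numeric
    return rows


def clean(rows):
    """Drop rows with negative values (a made-up business rule)."""
    return [(key, value) for key, value in rows if value >= 0]


def summarize(rows):
    """Compute the row count and mean of the remaining values."""
    values = [value for _, value in rows]
    mean = sum(values) / len(values) if values else None
    return {"count": len(values), "mean": mean}


def run_flow(raw_lines):
    """Chain the steps; each one stays independently testable."""
    return summarize(clean(load_rows(raw_lines)))


raw = ["a,1.0", "b,3.0", "bad line", "c,-2.0", "d,not_a_number"]
result = run_flow(raw)
print(result)
```

Because every step takes its input as an argument and returns a new value, the same input always gives the same output, and a failing result can be traced to a single step.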
  • 16. 7. How to present results in a way business stakeholders understand them? Making complex concepts easy for business lines to understand ◦ Sometimes a graph is worth a thousand words ◦ Reports or dashboards have to be clear, with ideally one insight per view (do not overload the page) ◦ Show the results in a way that makes them easy to interpret
  • 17. A real-world example of real-time failure prediction using Spark. Real-time system failure predictions using: ◦ Source system metrics ◦ Kafka for data streaming ◦ Spark for the predictions ◦ HDFS to store data ◦ JavaScript/jQuery or a vendor product for the front end. Diagram components: Source systems, Kafka Stream, Spark Streaming, Spark ML, Front End, HDFS
  • 18.
  • 19.
  • 20. Offline training – Online testing using Spark
  • 22.
  • 23. Data science tools magic quadrant January 2019