"Applied data science in the industry: How to build a data science project in a corporate setting - best practices and a real-world example"
By Soraya Christina, Senior Data Scientist at Morgan Stanley
Abstract:
- Which platforms/technology to use for your analytics project and why? (Spark, Hadoop, vendor products, open source, Python, Scala, etc?)
- How to build your data science flow and what to avoid? (Occam's razor, testable and structured flow)
- How to present results in a way business stakeholders understand them? (Making complex concepts easy for business lines to understand)
- A real-world example of a real-time failure prediction using Spark streaming and ML components.
The purpose of this talk is to present the challenges and solutions of building data science projects in a corporate environment. Generating insights for better business decision making is what drives data science projects. But working side by side with the business, being able to build a reliable flow, and properly communicating results and key elements are more than crucial: they are what will guarantee the success of your data science projects.
1. Applied data science in the industry:
How to build a data science project in a
corporate setting
BEST PRACTICES AND A REAL-WORLD EXAMPLE
Soraya Yama
Wednesday, June 26, 2019
WIMLDS Montreal #3: Business & AI
2. How to guarantee the success of your
data science project in industry?
Challenges and solutions when building data science projects in industry or in a corporate
environment
Generating insights for better business decision making is what drives data science projects.
How to work with business side by side?
How to build a reliable and understandable analysis flow/solution/product?
How to properly communicate results and key elements?
3. Data science in industry vs. in research
Industry: Faster pace than academia – quick iterations. / Research: Experiments are easier in a lab.
Industry: If an analysis does not produce results quickly, drop it and/or redesign it. / Research: Follow best practices to get approvals after peer review.
Industry: Simple solutions are preferred over novel complex ones – hard to understand, hard to trust. / Research: Let's go for the fancy cool new algorithms!
Industry: Limited time and resources, so you need to balance research excellence with business needs. / Research: Research is expected to take a lot of time.
Industry: Not everyone you work with understands data science – you need to convince decision makers to use the insights to drive decisions. / Research: Peers understand data science and the importance of research.
Industry: The team might not be data-driven or analytics-minded. / Research: You will most likely have more than one analyst on the team.
Industry: Explain statistical concepts in layman's terms. / Research: Your peers are more likely to understand the statistical jargon you use.
Industry: You won't do data science only – you might need to learn new skills (data engineering, a new programming language, new packages, etc.). / Research: It is less likely that you do data engineering or architecture while being a data scientist.
Industry: Rejecting a hypothesis is equally interesting. / Research: Rejecting a hypothesis can be looked at as a failure.
5. Challenges faced
1. Sometimes problems are not well defined
2. Sometimes data is not available or not in a usable format
3. Sometimes tools or data analysis platforms are not available
4. Which models to use? Which algorithms are more suitable for the analysis and the infrastructure?
5. Sometimes clients or business lines will not understand your analysis or the methods used
6. How to build your data science flow and what to avoid?
7. How to present results in a way business stakeholders understand them?
6. 1. Sometimes problems are not well defined
Data science is a science, therefore it follows the scientific method
In the scientific method, the process starts with a question to be asked or a problem to be identified
In data science, the process also starts with a problem to solve
This requires a proper understanding of the business context
Sometimes sitting with the business and helping formalize the problem is key
8. 2. Sometimes data is not available or not
in a usable format
Which data sources to use?
◦ data lake, data warehouse, database, raw data to be imported like images, sound files, spreadsheets or flat files
How to collect the data?
◦ import data, create a data pipeline
Who to work with for the data acquisition?
◦ data engineers, database system managers etc.
How to convince teams you need this data?
◦ explain the use case, have your manager support you
How to maintain this new data acquisition?
◦ is it a one-shot data acquisition, or is it a recurrent feed?
Where to store the data?
◦ big data storage, file system, cluster like Hadoop?
If it’s a data stream, how to build it?
◦ Kafka, AWS, Flume etc.
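The produce/consume shape of such a feed can be sketched in a few lines of plain Python. This is a toy stand-in, not the talk's actual pipeline: a stdlib queue plays the role of the Kafka topic, and the record fields are illustrative.

```python
import json
import queue

# Toy stand-in for a Kafka topic: a stdlib in-memory queue.
# A real setup would swap this for a Kafka producer/consumer pair.
feed = queue.Queue()

def produce(metrics):
    """Serialize a metrics record and push it onto the feed."""
    feed.put(json.dumps(metrics))

def consume():
    """Pop one record off the feed and deserialize it."""
    return json.loads(feed.get())

produce({"host": "srv01", "cpu": 0.93, "mem": 0.71})
record = consume()
print(record["cpu"])  # the record round-trips through the feed
```

Whatever the transport, keeping the serialization format explicit (here JSON) is what makes the feed maintainable when it becomes a recurrent one.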
9. 3. Sometimes tools or data analysis
platforms are not available
Identify which tools or platforms are well adapted to solve the problem and which ones are
available or easy to get
Request them / install them
Work on the data infrastructure
10. Questions to ask
E.g.:
Can I solve this specific use case using a Python script in an IDE?
Am I looking at big data, in which case I might need a distributed system like Spark?
Shall I store the data in a filesystem or on HDFS?
The team is using R, but can I productionize a script written in R?
There is a vendor product I am asked to use, but is it convenient for the purpose of the use case?
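The "big data or not" question above can often be settled with a quick size check before reaching for a cluster. A minimal sketch; the RAM figure, the headroom factor, and the on-the-fly sample file are illustrative assumptions, not a rule from the talk:

```python
import os

def fits_on_one_machine(paths, ram_gb=64):
    """Rough heuristic: if the raw files are much smaller than available
    RAM, a plain Python script is likely enough; otherwise consider a
    distributed engine like Spark."""
    total_gb = sum(os.path.getsize(p) for p in paths) / 1e9
    return total_gb < ram_gb * 0.5  # leave headroom for processing

# Illustrative usage with a tiny file created on the fly:
with open("sample.csv", "w") as f:
    f.write("a,b\n1,2\n")
print(fits_on_one_machine(["sample.csv"]))  # True
```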
11. 4. Which models to use? Which algorithms are more suitable for the analysis and the infrastructure?
KNN is a weak learner
Decision trees work best to detect non-linear interactions (so should not be used for time-series)
Random forests can work with large labelled or unlabelled data
Ordinary Least Squares should not be used on a high-dimensional data set (number of variables > number of observations) – prefer a regularized regression instead
Stratified sampling is better than random sampling for classification problems with imbalanced classes
Etc.
Ask yourself the right questions before jumping ahead and using the fanciest model you can think of. Business might not understand it.
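To illustrate the stratified-sampling point, here is a minimal pure-Python sketch (in practice scikit-learn's `train_test_split(..., stratify=y)` does this for you; the 95/5 class split is an illustrative assumption):

```python
import random
from collections import defaultdict

def stratified_sample(labels, frac, seed=0):
    """Return indices of a sample that preserves the class proportions
    of `labels` – unlike a plain random sample, which can starve a
    rare class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    picked = []
    for y, idx in by_class.items():
        k = max(1, round(len(idx) * frac))  # at least one per class
        picked.extend(rng.sample(idx, k))
    return sorted(picked)

# 95% class 0, 5% class 1 – the rare class is still represented:
labels = [0] * 95 + [1] * 5
sample = stratified_sample(labels, frac=0.2)
print(sum(labels[i] for i in sample))  # 1 positive in the 20-point sample
```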
12. 5. Sometimes clients or business lines will not understand your analysis or the methods used
Start small – use a data sample to build your case
Do a prototype (proof of concept) and show them how they can leverage data analysis
Do not use statistical jargon; use layman's terms to communicate your idea
Sell your idea
Make it simple enough to understand, efficient enough to implement, interesting enough to use
13. Real-world example
Signal analysis followed by a stock price behaviour prediction using a convolutional neural network
Data points to be investigated will be labelled 1.
All other cases will be labelled 0.
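A minimal sketch of that 1/0 labelling step. The z-score rule, its threshold, and the price series are illustrative assumptions; the talk does not specify the actual labelling rule:

```python
from statistics import mean, stdev

def label_points(series, z=2.0):
    """Label 1 the points lying more than `z` standard deviations from
    the mean (candidates to investigate), 0 everywhere else."""
    m, s = mean(series), stdev(series)
    return [1 if abs(x - m) > z * s else 0 for x in series]

prices = [100.1, 100.3, 99.8, 100.0, 100.2, 140.0, 100.1, 99.9]
print(label_points(prices))  # [0, 0, 0, 0, 0, 1, 0, 0]
```

Keeping the labelling rule as an explicit, separate function makes it easy to audit before the labels feed a neural network.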
14. Detecting ratio anomalies using a traditional statistical detection method and Isolation Forest (clustering for anomaly detection)
Cons (-):
Processing time is very long, especially when using millions of rows – need to distribute the data
Isolation Forest exists in sklearn, but has not yet been fully implemented in MLlib
Pros (+):
Isolation Forest is efficient when handling big data
Very accurate detection compared to traditional methods
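A minimal scikit-learn sketch of the Isolation Forest step the slide refers to. The synthetic ratios and the contamination rate are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic "ratios": mostly near 1.0, plus a few injected anomalies.
ratios = np.concatenate([rng.normal(1.0, 0.05, 1000), [5.0, -3.0, 8.0]])
X = ratios.reshape(-1, 1)

clf = IsolationForest(contamination=0.01, random_state=0)
preds = clf.fit_predict(X)  # -1 = anomaly, 1 = normal

print(np.where(preds == -1)[0])  # indices flagged as anomalous
```

On a single machine this is the whole story; as the slide notes, at millions of rows the fitting and scoring need to be distributed, which is where the MLlib gap bites.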
15. 6. How to build your data science flow
and what to avoid?
Your analysis code has to be understandable and reproducible (structured and testable)
If you are using a data analysis flow, your flow has to be structured
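One way to make a flow "structured and testable", as the slide suggests, is to keep each step a small pure function so it can be asserted on in isolation. A minimal sketch; the step names and fields are illustrative:

```python
def clean(rows):
    """Drop records with missing values – one small, testable step."""
    return [r for r in rows if None not in r.values()]

def enrich(rows):
    """Derive a feature from raw fields – another isolated step."""
    return [{**r, "ratio": r["num"] / r["den"]} for r in rows]

def pipeline(rows):
    """The flow is just a composition of the steps above."""
    return enrich(clean(rows))

# Each step can be unit-tested independently of the whole flow:
raw = [{"num": 2, "den": 4}, {"num": None, "den": 1}]
out = pipeline(raw)
print(out)  # [{'num': 2, 'den': 4, 'ratio': 0.5}]
```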
16. 7. How to present results in a way business stakeholders understand them?
Making complex concepts easy for business lines to understand
Sometimes a graph is worth a thousand words
Reports or dashboards have to be clear, with ideally one insight per view (do not overload the page)
Show the results in a way that they are easily interpretable
17. A real-world example of real-time failure prediction using Spark
System failure real-time predictions using:
Source system metrics
Kafka for data streaming
Spark for the predictions
HDFS to store data
JavaScript/jQuery or a vendor product for the frontend
Architecture flow: Source systems → Kafka stream → Spark Streaming → Spark ML → Front end, with HDFS as the data store
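The architecture above can be walked through end to end in plain Python. This is a toy stand-in, not the Spark implementation: a stdlib queue plays Kafka, a threshold rule plays the Spark ML model, a list plays HDFS, and all metric values are illustrative.

```python
import queue

stream = queue.Queue()   # stands in for the Kafka stream
storage = []             # stands in for HDFS

def predict_failure(metrics):
    """Toy stand-in for the Spark ML model: flag hosts under heavy load."""
    return metrics["cpu"] > 0.9 or metrics["mem"] > 0.95

# Source systems emit metrics onto the stream...
for m in [{"host": "srv01", "cpu": 0.42, "mem": 0.30},
          {"host": "srv02", "cpu": 0.97, "mem": 0.88}]:
    stream.put(m)

# ...the streaming job consumes, predicts, and persists each record.
alerts = []
while not stream.empty():
    m = stream.get()
    m["failure_predicted"] = predict_failure(m)
    storage.append(m)
    if m["failure_predicted"]:
        alerts.append(m["host"])  # what the front end would display

print(alerts)  # ['srv02']
```

The real system replaces each stand-in component by component (Kafka consumer, Spark Streaming job, trained Spark ML model, HDFS sink) while keeping the same data flow.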