"Applied data science in the industry: How to build a data science project in a corporate setting - best practices and a real-world example"
By Soraya Christina, Senior Data Scientist at Morgan Stanley
Abstract:
- Which platforms/technology to use for your analytics project and why? (Spark, Hadoop, vendor products, open source, Python, Scala, etc?)
- How to build your data science flow and what to avoid? (Occam's razor, testable and structured flow)
- How to present results in a way business stakeholders understand them? (Making complex concepts easy for business lines to understand)
- A real-world example of a real-time failure prediction using Spark streaming and ML components.
The purpose of this talk is to present the challenges and solutions of building data science projects in a corporate environment. Generating insights for better business decision making is what drives data science projects. But working side by side with the business, being able to build a reliable flow, and properly communicating results and key elements are more than crucial: they are what will guarantee the success of your data science projects.
1. Applied data science in the industry:
How to build a data science project in a
corporate setting
BEST PRACTICES AND A REAL-WORLD EXAMPLE
Soraya Yama
Wednesday, June 26, 2019
WIMLDS Montreal #3: Business & AI
2. How to guarantee the success of your
data science project in industry?
Challenges and solutions when building data science projects in industry or in a corporate
environment
Generating insights for better business decision making is what drives data science projects.
How to work with business side by side?
How to build a reliable and understandable analysis flow/solution/product?
How to properly communicate results and key elements?
3. Data science in industry vs. in research
Industry: Faster pace than academia – quick iterations. / Research: Experiments are easier in a lab.
Industry: If an analysis does not produce results quickly, drop it and/or redesign it. / Research: Follow best practices to get approvals after peer review.
Industry: Simple solutions are preferred over novel complex ones – hard to understand, hard to trust. / Research: Let's go for the fancy cool new algorithms!
Industry: Limited time and resources, so you need to balance research excellence with business needs. / Research: Research is expected to take a lot of time.
Industry: Not everyone you work with understands data science – you need to convince decision makers to use the insights to drive decisions. / Research: Peers understand data science and the importance of research.
Industry: The team might not be data-driven or analytics-minded. / Research: You will most likely have more than one analyst on the team.
Industry: Explain statistical concepts in layman's terms. / Research: Your peers are more likely to understand the statistical jargon you use.
Industry: You won't do data science only – you might need to learn new skills (data engineering, a new programming language, new packages, etc.). / Research: It is less likely that you do data engineering or architecture while being a data scientist.
Industry: Rejecting a hypothesis is equally interesting. / Research: Rejecting a hypothesis can be looked at as a failure.
5. Challenges faced
1. Sometimes problems are not well defined
2. Sometimes data is not available or not in a usable format
3. Sometimes tools or data analysis platforms are not available
4. Which models to use? Which algorithms are more suitable for the analysis and the infrastructure?
5. Sometimes clients or business lines will not understand your analysis or the methods used
6. How to build your data science flow and what to avoid?
7. How to present results in a way business stakeholders understand them?
6. 1. Sometimes problems are not well defined
Data science is a science, therefore it follows the scientific method
In the scientific method, the process starts with a question to be asked or a problem to be identified
In data science, the process also starts with a problem to solve
This requires a proper understanding of the business context
Sometimes sitting with the business and helping formalize the problem is key
8. 2. Sometimes data is not available or not
in a usable format
Which data sources to use?
◦ data lake, data warehouse, database, raw data to be imported like images, sound files, spreadsheets or flat files
How to collect the data?
◦ import data, create a data pipeline
Who to work with for the data acquisition?
◦ data engineers, database system managers etc.
How to convince teams you need this data?
◦ explain the use case, have your manager support you
How to maintain this new data acquisition?
◦ is it a one-shot data acquisition, or is it a recurrent feed?
Where to store the data?
◦ big data storage, file system, cluster like Hadoop?
If it’s a data stream, how to build it?
◦ Kafka, AWS, Flume etc.
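The produce/consume shape of such a feed can be sketched in a few lines of plain Python. This is a toy stand-in, not the talk's actual pipeline: a stdlib queue plays the role of the Kafka topic, and the record fields are illustrative.

```python
import json
import queue

# Toy stand-in for a Kafka topic: a stdlib in-memory queue.
# A real setup would swap this for a Kafka producer/consumer pair.
feed = queue.Queue()

def produce(metrics):
    """Serialize a metrics record and push it onto the feed."""
    feed.put(json.dumps(metrics))

def consume():
    """Pop one record off the feed and deserialize it."""
    return json.loads(feed.get())

produce({"host": "srv01", "cpu": 0.93, "mem": 0.71})
record = consume()
print(record["cpu"])  # the record round-trips through the feed
```

Whatever the transport, keeping the serialization format explicit (here JSON) is what makes the feed maintainable when it becomes a recurrent one.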
9. 3. Sometimes tools or data analysis
platforms are not available
Identify which tools or platforms are well adapted to solve the problem and which ones are
available or easy to get
Request them / install them
Work on the data infrastructure
10. Questions to ask
E.g.:
Can I solve this specific use case using a Python script in an IDE?
Am I looking at big data, in which case I might need a distributed system like Spark?
Shall I store the data in a filesystem or on HDFS?
The team is using R, but can I productionize a script written in R?
There is a vendor product I am asked to use, but is it convenient for the purpose of the use case?
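The "big data or not" question above can often be settled with a quick size check before reaching for a cluster. A minimal sketch; the RAM figure, the headroom factor, and the on-the-fly sample file are illustrative assumptions, not a rule from the talk:

```python
import os

def fits_on_one_machine(paths, ram_gb=64):
    """Rough heuristic: if the raw files are much smaller than available
    RAM, a plain Python script is likely enough; otherwise consider a
    distributed engine like Spark."""
    total_gb = sum(os.path.getsize(p) for p in paths) / 1e9
    return total_gb < ram_gb * 0.5  # leave headroom for processing

# Illustrative usage with a tiny file created on the fly:
with open("sample.csv", "w") as f:
    f.write("a,b\n1,2\n")
print(fits_on_one_machine(["sample.csv"]))  # True
```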
11. 4. Which models to use? Which algorithms are more suitable for the analysis and the infrastructure?
KNN is a weak learner
Decision trees work best to detect non-linear interactions (so should not be used for time-series)
Random forests can work with large labelled or unlabelled data
Ordinary Least Squares should not be used on a high-dimensional data set (number of variables > number of observations) – prefer a regularized regression instead
Stratified sampling is better than random sampling for classification problems with imbalanced classes
Etc.
Ask yourself the right questions before jumping ahead and using the fanciest model you can think of. Business might not understand it.
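To illustrate the stratified-sampling point, here is a minimal pure-Python sketch (in practice scikit-learn's `train_test_split(..., stratify=y)` does this for you; the 95/5 class split is an illustrative assumption):

```python
import random
from collections import defaultdict

def stratified_sample(labels, frac, seed=0):
    """Return indices of a sample that preserves the class proportions
    of `labels` – unlike a plain random sample, which can starve a
    rare class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    picked = []
    for y, idx in by_class.items():
        k = max(1, round(len(idx) * frac))  # at least one per class
        picked.extend(rng.sample(idx, k))
    return sorted(picked)

# 95% class 0, 5% class 1 – the rare class is still represented:
labels = [0] * 95 + [1] * 5
sample = stratified_sample(labels, frac=0.2)
print(sum(labels[i] for i in sample))  # 1 positive in the 20-point sample
```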
12. 5. Sometimes clients or business lines will not understand your analysis or the methods used
Start small – use a data sample to build your case
Do a prototype (proof of concept) and show them how they can leverage data analysis
Do not use statistical jargon; use layman's terms to communicate your idea
Sell your idea
Make it simple enough to understand, efficient enough to implement, interesting enough to use
13. Real-world example
Signal analysis followed by a stock price behaviour prediction using a convolutional neural network
Data points to be investigated will be labelled 1.
All other cases will be labelled 0.
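A minimal sketch of that 1/0 labelling step. The z-score rule, its threshold, and the price series are illustrative assumptions; the talk does not specify the actual labelling rule:

```python
from statistics import mean, stdev

def label_points(series, z=2.0):
    """Label 1 the points lying more than `z` standard deviations from
    the mean (candidates to investigate), 0 everywhere else."""
    m, s = mean(series), stdev(series)
    return [1 if abs(x - m) > z * s else 0 for x in series]

prices = [100.1, 100.3, 99.8, 100.0, 100.2, 140.0, 100.1, 99.9]
print(label_points(prices))  # [0, 0, 0, 0, 0, 1, 0, 0]
```

Keeping the labelling rule as an explicit, separate function makes it easy to audit before the labels feed a neural network.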
14. Detecting ratio anomalies using a traditional statistical detection method and Isolation Forest (clustering for anomaly detection)
Cons (-):
Processing time is very long, especially when using millions of rows – need to distribute the data
Isolation Forest exists in sklearn, but has not yet been fully implemented in MLlib
Pros (+):
Isolation Forest is efficient when handling big data
Very accurate detection compared to traditional methods
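A minimal scikit-learn sketch of the Isolation Forest step the slide refers to. The synthetic ratios and the contamination rate are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic "ratios": mostly near 1.0, plus a few injected anomalies.
ratios = np.concatenate([rng.normal(1.0, 0.05, 1000), [5.0, -3.0, 8.0]])
X = ratios.reshape(-1, 1)

clf = IsolationForest(contamination=0.01, random_state=0)
preds = clf.fit_predict(X)  # -1 = anomaly, 1 = normal

print(np.where(preds == -1)[0])  # indices flagged as anomalous
```

On a single machine this is the whole story; as the slide notes, at millions of rows the fitting and scoring need to be distributed, which is where the MLlib gap bites.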
15. 6. How to build your data science flow
and what to avoid?
Your analysis code has to be understandable and reproducible (structured and testable)
If you are using a data analysis flow, your flow has to be structured
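One way to make a flow "structured and testable", as the slide suggests, is to keep each step a small pure function so it can be asserted on in isolation. A minimal sketch; the step names and fields are illustrative:

```python
def clean(rows):
    """Drop records with missing values – one small, testable step."""
    return [r for r in rows if None not in r.values()]

def enrich(rows):
    """Derive a feature from raw fields – another isolated step."""
    return [{**r, "ratio": r["num"] / r["den"]} for r in rows]

def pipeline(rows):
    """The flow is just a composition of the steps above."""
    return enrich(clean(rows))

# Each step can be unit-tested independently of the whole flow:
raw = [{"num": 2, "den": 4}, {"num": None, "den": 1}]
out = pipeline(raw)
print(out)  # [{'num': 2, 'den': 4, 'ratio': 0.5}]
```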
16. 7. How to present results in a way business stakeholders understand them?
Making complex concepts easy for business lines to understand
Sometimes a graph is worth a thousand words
Reports or dashboards have to be clear, with ideally one insight per view (do not overload the page)
Show the results in a way that they are easily interpretable
17. A real-world example of real-time failure prediction using Spark
System failure real-time predictions using:
Source system metrics
Kafka for data streaming
Spark for the predictions
HDFS to store data
JavaScript/jQuery or a vendor product for the frontend
Architecture flow: Source systems → Kafka stream → Spark Streaming → Spark ML → Front end, with HDFS as the data store
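The architecture above can be walked through end to end in plain Python. This is a toy stand-in, not the Spark implementation: a stdlib queue plays Kafka, a threshold rule plays the Spark ML model, a list plays HDFS, and all metric values are illustrative.

```python
import queue

stream = queue.Queue()   # stands in for the Kafka stream
storage = []             # stands in for HDFS

def predict_failure(metrics):
    """Toy stand-in for the Spark ML model: flag hosts under heavy load."""
    return metrics["cpu"] > 0.9 or metrics["mem"] > 0.95

# Source systems emit metrics onto the stream...
for m in [{"host": "srv01", "cpu": 0.42, "mem": 0.30},
          {"host": "srv02", "cpu": 0.97, "mem": 0.88}]:
    stream.put(m)

# ...the streaming job consumes, predicts, and persists each record.
alerts = []
while not stream.empty():
    m = stream.get()
    m["failure_predicted"] = predict_failure(m)
    storage.append(m)
    if m["failure_predicted"]:
        alerts.append(m["host"])  # what the front end would display

print(alerts)  # ['srv02']
```

The real system replaces each stand-in component by component (Kafka consumer, Spark Streaming job, trained Spark ML model, HDFS sink) while keeping the same data flow.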