The document discusses demystifying data science by providing motivations, a maturity model, and an ecosystem model with practical examples and advice. It explains data science concepts like data curation, machine learning, and business integration. Examples are given of using data science for time-to-event modeling, topic modeling, and anomaly detection. The importance of communication, iteration, and understanding models as approximations is emphasized.
1. Demystifying Data Science
What does it mean in practice?
Jonathan Sedar
Principal Data Scientist
Applied AI Ltd
www.applied.ai
@applied_ai
@jonsedar
2. Applied AI is a Data Science Consultancy
We create a competitive advantage for financial services
companies through applied artificial intelligence
3. Know Your Customers
Develop Your Market
Manage Risk & Regulation
Innovate & Experiment
Streamline Operations
Embed Data Science
23. Sophisticated Analyses
• Hypothesis testing & data
discovery
• Advanced statistics & predictive
modelling
• Deliver immediate value, guide
strategy
Full Capability Data Science
• Advanced data science is
supported throughout the organisation
and embedded in:
• Products & Services
• Senior Decision Making
• Business Administration
Getting Started
• Identify new opportunities and
useful data sources
• Basic modelling
• Senior leaders help to define &
develop the business case
Business Operations
• Create ‘data products’, reports,
new systems to embed change
• Replace legacy systems
• Build internal knowledge and skills
24. Getting Started
• Auto Insurer: “Help me price correctly”
• Extracted, cleaned, parsed data from messy
internal & external sources
• Lightweight multidimensional analysis of customer
base inc interactive dashboards
• Reports and strategic recommendations to board
level, proving the need for further analysis
25. Sophisticated Analyses
• Life & Pensions: “Help me model my customer churn
(a credit risk situation)”
• Sourced, cleaned, prepared internal & external data
• Created advanced time-to-event models using
Bayesian statistics
• Churn modelling output identified key risk groups &
potentially large new revenues and cost savings
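The slide does not describe the Bayesian models themselves, so the sketch below illustrates only the shape of time-to-event data, using a much simpler, non-Bayesian technique: the Kaplan-Meier estimator. It shows how customers who have not yet churned (right-censored observations) still contribute to the estimate. The function name and the toy data are hypothetical and not taken from the project.

```python
from collections import Counter

def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each time with an observed event.

    durations: how long each customer was followed (e.g. months on book)
    observed:  True if the customer churned at that time, False if they were
               still active when observation stopped (right-censored)
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)  # everyone leaves the risk set at their duration
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk customers who did not churn at t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Churned and censored customers both drop out of the risk set after t
        at_risk -= exits[t]
    return curve

# Hypothetical toy data: 8 customers, 4 observed churn events, 4 censored
durations = [2, 3, 3, 5, 8, 8, 12, 12]
observed = [True, True, False, True, True, False, False, False]

for t, s in kaplan_meier(durations, observed):
    print(t, round(s, 3))
```

A Bayesian time-to-event model would go further, for example by placing priors on hazard parameters and relating them to customer features, but the input is the same: a duration plus an event flag per customer.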
26. Business Operations
• Asset Management Co: “Help me price real estate
at the optimal market price”
• Sourced, cleaned, prepared data, undertook initial
investigations and statistical modelling
• Created a price prediction “engine” within a
microservice API, now used within daily operations
• Accurate estimates and reduced manual effort
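As a rough illustration of what a prediction "engine" behind a microservice API can look like, the sketch below wraps a model in a JSON-in, JSON-out handler. Everything here is an assumption made for illustration: the field names, the linear form of the model and its coefficients are hypothetical, and the slide does not say how the real engine was built.

```python
import json

# Hypothetical coefficients; a real engine would load a fitted model artefact
MODEL = {"intercept": 50_000.0, "per_sq_m": 2_500.0, "per_bedroom": 10_000.0}
REQUIRED_FIELDS = ("floor_area_sq_m", "bedrooms")

def predict_price(features):
    """Apply the (hypothetical) linear pricing model to validated features."""
    return (MODEL["intercept"]
            + MODEL["per_sq_m"] * features["floor_area_sq_m"]
            + MODEL["per_bedroom"] * features["bedrooms"])

def handle_request(body):
    """Take a JSON request body (str); return (status_code, JSON response body)."""
    try:
        payload = json.loads(body)
        features = {name: float(payload[name]) for name in REQUIRED_FIELDS}
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON, missing fields or non-numeric values
        return 400, json.dumps({"error": f"bad request: {exc!r}"})
    return 200, json.dumps({"predicted_price": predict_price(features)})

status, response = handle_request('{"floor_area_sq_m": 100, "bedrooms": 3}')
print(status, response)
```

Keeping the handler a plain function means it can be called from whichever web framework hosts the service, and tested without starting a server.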
27. Full Capability Data Science
• The holy grail!
• A centre of excellence guiding:
• Products
• Decision Making
• Business Administration
31. Data Curation
• Making the right data available for
modelling and maintaining it well.
• Garbage-in-garbage-out
• Getting to ‘good data’ is subtle
• 80% of the process
32. Machine Learning
• Learning from data
• The empirical practice at the heart of
statistics.
• A machine (aka computer or model) is
trained on a dataset to predict values
• Predict or infer real-world behaviours.
33. Business Integration
• Conventional business analysis lives and
dies within spreadsheets & presentations
• Expensive dashboards require unstable
data pipelines.
• Huge data warehouses and "lakes" are so
complicated they're barely utilised.
• Business integration is hard, but critical
50. Second: go shopping for
socioeconomic data
Irish census produced every 5 years
15 themes, 500+ features
Captures almost everything about daily life
Aggregated to ‘small areas’ approx 200 households
51. Census themes
Theme  Subject
1      Sex, Age & Migration
2      Ethnicity & Language
3      Irish Language
4      Families
5      Private Households
6      Housing
7      Hospitals & Prisons
8      Principal Status
9      Social Class
10     Education
11     Commuting
12     Health
13     Occupation
14     Industries
15     PC & Internet
52. We could do what Experian does,
and also:
We would own the code
We could integrate with any internal project
We could tune it to fit our needs
53. Let's take a look at the data
Not a trivial task…
What we have is a really big matrix
18,488 rows x 767 columns
55. Data Compression
Singular Value Decomposition
Rotate and scale data into new frame of reference
Compress into fewer features while maintaining
information
Compressed 500+ columns into 100
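The compression step can be sketched with NumPy as follows. This is a minimal illustration on a small synthetic matrix, not the census data: the matrix sizes, the number of components kept and the function name `compress_with_svd` are assumptions chosen for the example.

```python
import numpy as np

def compress_with_svd(X, n_components):
    """Project the rows of X onto the top n_components singular directions."""
    # Centre each column so the decomposition describes variation around the mean
    col_means = X.mean(axis=0)
    Xc = X - col_means
    # Thin SVD: Xc = U @ diag(s) @ Vt, singular values sorted in descending order
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Coordinates of each row in the rotated frame, keeping the leading components
    Z = U[:, :n_components] * s[:n_components]
    # Share of the total variance retained by the kept components
    retained = float((s[:n_components] ** 2).sum() / (s ** 2).sum())
    return Z, Vt[:n_components], col_means, retained

# Synthetic stand-in (hypothetical): 200 rows, 40 correlated columns
# generated from 5 underlying factors plus a little noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 40))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 40))

Z, components, col_means, retained = compress_with_svd(X, n_components=5)
X_approx = Z @ components + col_means  # map back to the original columns
print(Z.shape, round(retained, 4))
```

Because the columns are strongly correlated, a handful of components reproduces the matrix almost exactly. On real data the retained-variance figure is what guides how many components to keep.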
63. Data Curation
• A centralised, up-to-date, traceable,
documented repository for structured
text, tabular & image datasets
• Augment with public data to keep up
with competitors and gain an edge
• Update, maintain and optimise your
primary data sources to allow for high
risk/reward POC projects
65. Learning from data to predict
outcomes and infer behaviours
Supervised (classification, regression)
Unsupervised (clustering, pattern matching)
Reinforcement (behavioural rewards)
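The first two categories can be contrasted in a few lines of NumPy. This is only a sketch on synthetic data: the regression is an ordinary least-squares fit, the clustering is a bare-bones k-means, and all values and names are made up for illustration. Reinforcement learning is left out because it needs an environment to interact with.

```python
import numpy as np

rng = np.random.default_rng(42)

# Supervised (regression): learn to predict y from labelled examples (x, y)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=100)  # synthetic relationship
A = np.column_stack([x, np.ones_like(x)])             # design matrix [x, 1]
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

# Unsupervised (clustering): find groups in data that carries no labels
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
])

def kmeans(data, k, n_iter=20, seed=0):
    """Bare-bones k-means: alternate nearest-centre assignment and mean update."""
    init_rng = np.random.default_rng(seed)
    centres = data[init_rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centre, then pick the nearest
        dists = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centre to the mean of its points (leave it if it has none)
        centres = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
    return labels, centres

labels, centres = kmeans(points, k=2)
print(round(float(slope), 2), round(float(intercept), 2))
print(centres.round(2))
```

The regression needs the known outcomes `y` to learn from, while the clustering is given only `points` and has to discover the two groups itself.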
66. Hot new area, thus word soup
artificial intelligence
machine intelligence
statistical modelling
robotic process automation
cognitive computing
deep learning
…
89. Feature engineering is critical
Understand the data shape, size, behaviours and the processes
that generated it
90. Machine Learning
• Sophisticated statistical techniques,
good software dev practices and
research-grade, open-source software
• Document and share knowledge to
become technical centre of excellence
• Validate, test, review & maintain your
data pipelines, software and models to
mitigate risk and allow for audit
107. Business Integration
• Clear path from model inference and
predictions to the extrapolation of
business actions and impacts
• Communicate results with non-technical
stakeholders via engaging dashboards
and visualisations
• Integrate an automated, live, on-demand
prediction service with business systems
108. Using a “Data Science” approach:
- Motivations
- A Maturity Model
- An Ecosystem Model
Practical Examples & Advice
109. Learning from data benefits
the whole business
Increase Revenue
tune risk profile
understand the competition
optimise business processes
improve customer retention
inform & adapt to regulatory change
demonstrate leadership
innovate product-market fit
increase customer base
Reduce Cost
Manage Risk
Meet Compliance
112. Further reading
• Blogs with good technical articles, insights etc
• http://blog.applied.ai
• http://www.magesblog.com
• https://planet.scipy.org
• http://andrewgelman.com
• http://blog.kaggle.com
• Books / technical articles
• https://www.oreilly.com/ideas/what-is-hardcore-data-science-in-practice
• http://www.oreilly.com/data/free/ten-signs-of-data-science-maturity.csp
• Machine Learning for Hackers http://shop.oreilly.com/product/0636920018483.do