Dr. Nadine Schöne is a Senior Solutions Architect at Dataiku in Berlin. In this role, she covers all aspects of the data value chain for all users – including integration of data sources, ETL, collaboration, statistics and modelling, as well as operationalization, monitoring, automation and security in production. She regularly speaks at conferences, hosts webinars and writes articles.
Speech Overview:
How can you get the most out of your data – while staying flexible in your choice of infrastructure and without having to integrate a multitude of tools for the different personas involved? Maximizing the value you get out of your data is a necessity today, and looking at the whole picture, combined with careful planning, is the key to success. We will look at the complete data value chain from end to end: from data stores, collaboration features, data preparation, visualization and automation capabilities, and external compute to scheduling, operationalization, monitoring and security.
3. ABOUT US
Gartner Leader
400+ Employees
30K+ Users
300+ Clients
#1 Insurance Brand
#1 Pharma Brand
#1 US Construction Company
#1 Financial Information Company
#1 Flash Sales Company
#1 Car Sharing Company
#1 Parking Device Company
#1 Cosmetics Company
#3 CPG Company
Funded By
6. The data value chain
[Diagram: the data value chain – from DATA through SCIENCE to DECISIONS – built on people, systems, automation, data preparation, analytics, data quality, machine learning, metrics and statistics]
7. Data
● data access
integration, security incl. impersonation
● data quality
● data preparation
filter, join, enrich, prepare, formats…
● changes in input data sets
● changes in data quality
● KPIs / metrics
● basic statistics
● dashboards
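The data-quality checks, KPIs and basic statistics listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function and column names are ours, not part of any Dataiku API:

```python
import statistics

def data_quality_report(rows, columns):
    """Simple per-column quality metrics: missing-value rate and,
    for numeric columns, mean and standard deviation."""
    report = {}
    for col in columns:
        values = [r.get(col) for r in rows]
        missing = sum(v in (None, "") for v in values)
        numeric = [v for v in values if isinstance(v, (int, float))]
        report[col] = {
            "missing_rate": missing / len(values),
            "mean": statistics.mean(numeric) if numeric else None,
            "stdev": statistics.stdev(numeric) if len(numeric) > 1 else None,
        }
    return report

# Toy input: one missing age, one empty city string
rows = [
    {"age": 34, "city": "Berlin"},
    {"age": None, "city": "Paris"},
    {"age": 28, "city": ""},
    {"age": 42, "city": "Madrid"},
]
report = data_quality_report(rows, ["age", "city"])
print(report)
```

In practice such checks would run automatically whenever an input data set changes, feeding the dashboards and metrics mentioned above.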
10. Why containers for Data Science?
DSS and containers
Resource Allocation & Management
• Leverage cloud-native technologies for extensible resource management
• Use different hardware configurations (such as GPUs)
• Pre-built images with the necessary library dependencies
Collaboration
• Control dependencies and isolate runtimes on the same host
• Share work by sharing containers
• Kubernetes makes orchestration of the containers simple
Reproducibility
• Simplify migration by copying containers
• Attach models to a container context so past work can easily be re-run
• Ensure old code/models continue running
Production
• Facilitate a self-service path to production
• Easily host models as APIs for downstream applications
• Deploy and monitor batch processes with reproducibility in mind
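A containerized batch process with reproducibility in mind typically reads its configuration from environment variables and records a run manifest. Below is a minimal stdlib sketch of such an entrypoint; the `JOB_NAME` and `IMAGE_TAG` variables are illustrative assumptions, not part of any specific platform:

```python
import json
import os
import platform
import sys
from datetime import datetime, timezone

def run_batch_job(score_fn, records):
    """Batch-scoring entrypoint sketch: read config from environment
    variables (the usual way to parameterize a container), score the
    records, and emit a run manifest capturing the runtime context so
    the job can be reproduced later."""
    manifest = {
        "job_name": os.environ.get("JOB_NAME", "batch-scoring"),
        "image_tag": os.environ.get("IMAGE_TAG", "unknown"),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "started_at": datetime.now(timezone.utc).isoformat(),
        "n_records": len(records),
    }
    results = [score_fn(r) for r in records]
    return results, manifest

# Toy run with a stand-in scoring function
results, manifest = run_batch_job(lambda r: r["x"] * 2, [{"x": 1}, {"x": 3}])
print(json.dumps(manifest, indent=2))
```

Baking the dependencies into the image and logging the image tag per run is what lets old code and models keep running unchanged.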
11. Leverage your infrastructure with containers
DSS and containers
● Run Python / R code in containers
● Machine learning in containers
12. Models
● automated machine learning
● coding (Python, R)
● model information
● model interpretation
● model performance
incl. monitoring of model drift
● data preparation
● feature engineering
● versioning
● expose trained models via APIs
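Monitoring of model drift, mentioned above, can be illustrated with a two-sample Kolmogorov–Smirnov statistic comparing training-time scores against live scores. This is a self-contained stdlib sketch with an illustrative alert threshold, not the specific drift test any particular product uses:

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

# Score distribution at training time vs. in production
train_scores = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5]
live_scores = [0.5, 0.6, 0.7, 0.7, 0.8, 0.9]

drift = ks_statistic(train_scores, live_scores)
print(f"KS statistic: {drift:.2f}")
if drift > 0.3:  # illustrative alert threshold
    print("Possible model drift - consider retraining")
```

A scheduled check like this, tied to the versioning and retraining capabilities above, closes the loop between monitoring and redeployment.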
13. Data Scientists: focus talent on what counts
Code your way
● Full programmatic control: a full-fledged API to manage models, pipelines and automation
● Free coding: use any package with isolated environments
● Full Git integration: reuse and share code
Ensure impact
● Self-provisioning of compute resources
● Cloud-based elastic processing for large volumes of data, users or services
Don’t get distracted
● Expedited wrangling: facilitated connections to SQL, HDFS, cloud storage, NoSQL, APIs, …
● Use visual tools where they are faster
● Reuse work from other teams/analysts
Low-effort CI/CD
● Orchestrate pipelines with optional automatic checks
● Create deployment artifacts
● Deploy your models as containerized APIs
Showcase your insights
● Build insights, create webapps (Shiny, Flask, Bokeh) and deploy them in Kubernetes
● Package work for reuse by the target population
Tooling: Jupyter Notebooks or IDEs; SQL / Python / R / Scala
Security: LDAP, Kerberos, SSO
14. People (Collaboration)
● coders
code environments, git integration, tools etc.
● clickers
basic statistics, explore data, dashboards, download data
● communication in projects
● statistics
● visualizations
● documentation
● share data between projects
● export data and results
16. Models operationalization platform
Solution Overview: Architecture
[Architecture diagram: the Dataiku Design Node deploys models and analytics artifacts to the Dataiku Automation Node, which monitors workflows and models and retrains/scores workflows against the production DWH / DB (read/write/execute access). Models are deployed to Dataiku API Nodes, which fetch data from the production database and serve real-time scoring to business applications via HTTP queries. Compute runs on Hadoop, Spark, databases (JDBC), a Kubernetes cluster, etc. IT and application monitoring is provided by tools such as Nagios, Datadog or Zabbix.]
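The real-time scoring path in the architecture above – a business application sending an HTTP query and receiving a score – can be sketched end to end with the Python standard library. The linear "model", the `/score` route and the feature names are all illustrative stand-ins, not the actual API node protocol:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def score(features):
    """Stand-in model: a fixed linear score over two numeric features."""
    return 0.4 * features["f1"] + 0.6 * features["f2"]

class ScoringHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON feature payload and return the score as JSON
        body = self.rfile.read(int(self.headers["Content-Length"]))
        payload = json.dumps({"score": score(json.loads(body))}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep the demo output quiet
        pass

# Start the scoring endpoint on an ephemeral local port
server = HTTPServer(("127.0.0.1", 0), ScoringHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# A "business application" sends an HTTP query with the features
url = f"http://127.0.0.1:{server.server_port}/score"
req = urllib.request.Request(
    url,
    data=json.dumps({"f1": 1.0, "f2": 2.0}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
print(result)
server.shutdown()
```

In production the same request/response shape would be served by a containerized, monitored API node rather than an in-process toy server.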
17. Concrete Steps toward Enterprise AI
Industrialization of Advanced Analytics Capabilities
[Maturity scale: from Big Data Day 0 (ML is for specialists, ad-hoc analytics, siloed approach) to Enterprise AI]
There is no shortcut to Enterprise AI. It is a journey that organisations need to undertake consciously, mastering each of the four key phases one after the other.
18. Concrete Steps toward Enterprise AI
Industrialization of Advanced Analytics Capabilities
[Maturity scale: from Big Data Day 0 (ML is for specialists, ad-hoc analytics, siloed approach) through Initiation, Impact, Acceleration and Systematization to Enterprise AI]
Goals per phase:
Initiation: Demonstrate Value
● Assemble a first team
● Data: quality, availability, accessibility, features
● Integration
● Minimal viable product
● Assessment of use cases
Impact: Deliver Business Value in Actual Operations
● Performance monitoring
● Improve continuously
● Operationalize models
● Get business acceptance and impact on the model
● Onboard analysts
Acceleration: Structure Execution and Self-Service
● Integrate technologies
● Make data available for all personas involved
● Maintain models in production
● New deployments
● Capitalize on previous projects
● Build up manpower to expand projects
Systematization: Fully Align Data, Organization and Processes
● Optimization of infrastructure
● Leveraging of new technologies
● Optimization of analytics processes and data management
19. Gradual Steps toward Enterprise AI:
Main Risks
Dataiku’s Maturity Model
[Maturity scale: from Big Data Day 0 (ML is for specialists, ad-hoc analytics, siloed approach) through Initiation, Impact, Acceleration and Systematization to Enterprise AI]
Initiation: Demonstrate Value
● Difficulty to assemble a first team
● Shifting data infrastructure / IT systems
● Lack of traction with business owners
Impact: Deliver Business Value in Actual Operations
● Difficulty to operationalize models
● Difficulty to get business acceptance and impact on the model
● Inability to onboard analysts
Acceleration: Structure Execution and Self-Service
● Fragmented technologies
● Data is limited to ‘experts’
● Maintaining models in production too costly, hindering new deployments
● Lack of capitalization on previous projects
● Fractionated initiatives difficult to reconcile
● Lack of manpower to expand projects
Systematization: Fully Align Data, Organization and Processes
● Accumulated obsolescence of deployed projects
● Lack of leveraging of new technologies
● Data projects remain fairly specific, lacking cultural pervasiveness
20. In a nutshell
Our experience
Operationalization / going into production
● initial focus on development and coders
● no initial focus on governance, data protection, auditing
● no initial focus on enterprise security
● difficulty to operationalize models
● maintaining models in production too costly, hindering new
deployments
● accumulated obsolescence of deployed projects
Missing value definition
● Difficulty to get business acceptance and impact on
model
● Lack of traction on business owners
● Lack of capitalization on previous projects
● Data projects remain too specific
Missing Collaboration
● Difficulty to assemble a first team
● Inability to onboard analysts
● Lack of traction on business owners
● Fractionated initiatives difficult to reconcile
● Lack of manpower to expand projects
● Data projects remain too specific
Siloed IT systems & data
● Shifting data infrastructure/IT systems
● Fragmented technologies
● Data is limited to ‘experts’
● Lack of leveraging of new technologies