Automated Analytics at Scale

2. Challenges of Model Management
• Machine learning allows organizations to proactively discover patterns and predict outcomes for their operations, and improving those insights requires deploying better analytical models on their data.
• Finding the best analytical model requires running thousands of hypotheses on various datasets and comparing the models in a brute-force approach.
• Currently, no model management framework exists - that is, an agnostic tool or framework that manages and orchestrates the entire lifecycle of a model.

Real-time Analytics at Scale
Copyright © 2016 Accenture. All rights reserved.
3. Model Management Framework operationalizes analytics to ease development and deployment of analytical models
The framework provides key benefits that operationalize and democratize access to analytical modeling at scale:
• Captures and templates analytical models created by expert data scientists for easy reuse
• Enables faster development of analytical models through rapid iteration of training and comparing models using a brute-force approach
• Presents a champion-challenger view to visually compare and promote trained models
• Reduces complexity for data scientists to train and deploy models
• Enables business analysts and others to participate in the modeling process
4. Model Management Framework is essential for the Internet of Things platform
The Internet of Things platform exposes thousands of sensors whose models must be automatically managed and maintained, while providing easy access to the predicted results.
• Identify desired insights: Identify insights for operationalizing devices/machinery for various purposes: anomaly detection, predictive maintenance, budget and resource optimization
• Collect data: Collect various types of data (time series or static) and store them in the databases that best fit each data type
• Analyze: Train the analytical models using the model management framework, or train them with other analytical tools such as R and then onboard them to the framework
• Actuate and optimize: Set up rules to act on predicted results from thousands of sensors, e.g. schedule maintenance or lower the temperature on a device
6. Organizations today have an unprecedented amount of data available because of the Internet of Things, the web, and social media
To take advantage of this massive set of data, organizations must build analytics platforms.
Source: IBM, Big Data Hub, 2013
7. Traditional analytics platforms use big data technologies to process and analyze large amounts of data
“Excited by big data technology capabilities to store more data, more diverse data and more real-time data, (companies) focus on data collection. Rapidly growing data stores put increasing pressure on figuring out what to do with this data. Determining the value of the collected data becomes the top challenge in all industries.”
Source: Svetlana Sicular, Gartner, October 30, 2015
The steps to derive value from the data include collecting, processing, and analyzing it using a variety of big data tools:
• Data Collection: Store huge volumes of data, in a variety of data types, across multiple data stores for processing
• Data Processing: Process the data by filtering, transforming, and applying machine learning algorithms using computing engines
• Analytics and Visualization: Create ad hoc reports on processed data using business intelligence and visualization tools
8. Enterprises need access to both historical and real-time data to gain the most value out of big data analytics
• Real-time data is data processed within sub-seconds to seconds of arrival, from the moment the data lands to the moment results are derived.
• Batch processing technologies alone are insufficient: in the time it takes to process a batch (hours, days), real-time data has accumulated and is missed, which costs opportunities for proactive decision making.
Storing data in a fault-tolerant, replicated historic store, processing a large batch of data, and writing the processed results back in batch incurs delays that make real-time analytics infeasible:
• Queries are directed only at stale data that is up to hours or days old. The lack of real-time data limits the analytics to ad hoc summarizations and aggregations.
• Because of the batch processing delay, by the time the captured data is available for queries, it is stale.
• Real-time data is missed by the time analytics begins.
[Diagram: batch-only pipeline - Data → Storage (Historic Data Store) → Batch Processing → Batch Write → Serving → Query, with real-time data arriving too late to be included]
9. The Lambda Architecture empowers real-time analytics by handling data at scale and in real time using a hybrid architecture
• Designed by Nathan Marz, the creator of the Apache Storm project and previously a lead engineer at Twitter, with the goal of building a general architecture to process big data at scale.
• The architecture separates batch processing on historical data from stream processing on the real-time flow of data, allowing analytics on data that combines the most up-to-date data with historical data views.
• BATCH LAYER: focuses on processing historical data views for queries.
• SPEED LAYER: handles the complexity of real-time data collection and analysis.
Real-time analytics can now be performed on data that combines the most up-to-date data with historical views.
[Diagram: batch path - Data → Storage (Historic Data Store) → Batch Processing → Batch Write → Serving; speed path - Data → Queue → Speed Processing → Random Write → Serving → Query]
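As a rough illustration of the batch/speed split described above, here is a minimal Python sketch (not Marz's reference implementation; the class and method names are invented for this example) in which a periodically rebuilt batch view and an incrementally updated speed-layer view are merged at query time:

```python
# Lambda-style counting: the batch layer recomputes totals over all historical
# data, the speed layer keeps per-event increments for data that arrived after
# the last batch run, and a query merges both views.

from collections import defaultdict

class LambdaCounts:
    def __init__(self):
        self.batch_view = {}                # rebuilt by the batch layer
        self.speed_view = defaultdict(int)  # updated per event by the speed layer

    def run_batch(self, historical_events):
        """Batch layer: recompute the view from the full historical store."""
        view = defaultdict(int)
        for key in historical_events:
            view[key] += 1
        self.batch_view = dict(view)
        self.speed_view.clear()             # real-time increments are now absorbed

    def ingest(self, key):
        """Speed layer: low-latency incremental update for a new event."""
        self.speed_view[key] += 1

    def query(self, key):
        """Serving layer: merge the batch and real-time views."""
        return self.batch_view.get(key, 0) + self.speed_view.get(key, 0)

lc = LambdaCounts()
lc.run_batch(["sensor_a", "sensor_a", "sensor_b"])  # historical data
lc.ingest("sensor_a")                               # event arriving after the batch
print(lc.query("sensor_a"))  # 3: two historical + one real-time
```

The key design point the slide makes is visible here: neither view alone answers the query correctly; only the merge covers both historical and real-time data.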
10. In the Internet of Things, predictive modeling on sensor data allows organizations to discover patterns and predict outcomes for their operations
• The real value of big data is the insight via the analytics, not just the collection of the data.
• Predictive modeling is the primary means by which companies can discover trends and make proactive, as opposed to reactive, decisions on data.
Pipeline stages: Data Collection → Data Processing → Predictive Modeling → Proactive Decision Making (from sensors at field sites, through NoSQL stores for unstructured data, computing engines and stream processors, machine learning algorithms, and model runtime environments, to predictive results, notifications and alerts, and remediation).
Oil & Gas Producer
• Collects data from over 190,000 sensors
• Injects 6,000 rows/second and 11 billion rows of data per month - a larger analytics platform than Twitter's
• Has over 3,500 models analyzing data using various algorithms
• Enables the company to examine huge sets of data and discover trends to predict outcomes in operation and exploration efforts
Water Utility Client
• Collects data from sensors placed along pipes in a water distribution network
• Processes data for water flow rate and pressure
• Applies a predictive model to project forward in time and detect spikes or falls that exhibit warning signs of failure
• Uses results from the predictive model to proactively reduce pressure spikes, avoiding leaks, prolonging the longevity of assets, and reducing disruption to customers
11. The modeling process is iterative, and its lifetime spans both batch-mode model training and real-time prediction
In general, a model creates an output for an unknown target value given a defined set of inputs. In a time-series model, the target value also depends on time as an input.
Scenario (data scientist): "I want to deploy a model that can detect in real time if a sensor is faulty." The stages below span data science work through system administration.

Build Model
• Identify the required data and how to get it
• Design and validate specific analytic models
• Verify the approach through an initial set of insights on particular environments
Example: the data scientist analyzes a variety of machine learning algorithms, identifies logistic regression as the most suitable for the problem, and codes the model into a .jar file.

Train Model
• Prepare historical data for training
• Select model input parameters and the runtime environment
• Train the model on data from the historical batch and/or the real-time stream in the runtime environment
Example: selects input parameters such as the regularization parameter for the logistic regression model, then submits the model to Spark to train it on historical data in HDFS.

Monitor Execution
• Monitor the status of model training in the runtime environment (e.g. running, succeeded, failed)
• Troubleshoot issues in the runtime environment if necessary
Example: opens a terminal, SSHes into the Hadoop cluster, and enters commands to verify the status of the model as it trains.

Compare Models
• Compare trained models in champion-challenger fashion
• Use a brute-force approach to find the best-of-breed model to deploy on the live stream
Example: after iteratively training many models, selects the best of breed - the model with the lowest mean squared error.

Operationalize Model
• Deploy the best model on the live stream of data
• Generate predicted results for automated or manual proactive decision making
• Observe results to feed back and fine-tune the model
Example: submits the model to Spark Streaming to be applied to streaming data ingested from Kafka; the model predicts in real time whether a sensor will fail.
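The compare-models step can be pictured as a brute-force champion-challenger loop: train one candidate per parameter setting, score each on held-out data, and promote the lowest-error model. The sketch below is a toy illustration with an invented stand-in "model", not the framework's actual training code:

```python
# Brute-force champion-challenger selection: the champion is the trained
# candidate with the lowest mean squared error on validation data.

def train_model(train_data, reg_param):
    """Toy 'training': a mean predictor shrunk toward zero by reg_param."""
    values = [y for _, y in train_data]
    mean = sum(values) / len(values)
    return lambda x: mean / (1.0 + reg_param)  # stand-in for a real model

def mean_squared_error(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

def champion_challenger(train_data, validation_data, reg_params):
    """Score every candidate; return (champion, challengers) by lowest MSE."""
    scored = []
    for reg in reg_params:
        model = train_model(train_data, reg)
        scored.append((mean_squared_error(model, validation_data), reg, model))
    scored.sort(key=lambda t: t[0])
    return scored[0], scored[1:]

train = [(i, 10.0) for i in range(20)]        # sensor readings hovering at 10
valid = [(i, 10.0) for i in range(20, 30)]
(best_mse, best_reg, _), challengers = champion_challenger(train, valid, [0.0, 0.1, 1.0])
print(best_reg, round(best_mse, 4))  # the unregularized mean predictor wins here
```

In the framework, the same loop would run over hundreds of real models on runtime environments; only the selection logic is shown here.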
12. Challenges with Analytical Modeling in the Current State
13. Building, training, and deploying analytical models require a rare combination of data science and engineering skills
The ability to complete the modeling process is limited to specialized individuals who are experts in both data science and engineering.
“The United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data.”
Source: McKinsey Global Institute analysis

Full set of skills needed for model building and deployment:
• Understanding of a variety of machine learning algorithms and pattern recognition, as well as expertise in a domain
• Ability to build and code accurate models based on the problem space
• System administrator skills, plus a deep understanding of big data systems, to deploy models in a runtime environment

Data Scientist
• Traditional strengths: mathematics, statistics, machine learning, data mining, pattern recognition, predictive algorithms, domain expertise
• Potential hurdles: troubleshooting and running a runtime environment such as Spark requires advanced system engineering skills, which a data scientist may not be trained in. This can lead to slower development and deployment of predictive models.

Business Analyst
• Traditional strengths: domain expertise, business processes, requirements gathering
• Potential hurdles: traditional business analysts may lack core skills in data science or data engineering, so they lack the experience to build, train, or deploy models.

Dual Data Scientist and Engineer
• Traditional strengths: a combination of data science skills with software engineering and system administrator skills for big data systems
• Potential hurdles: may lack domain expertise, in which case it may take longer to build and train relevant models for the use case.
14. Analytical models are not easily reusable or shareable, resulting in siloed analytics work
There is no standard method for sharing models that lets users leverage models created by other data scientists, so analytics work is siloed. This is true both for freshly built models and for models already trained on a dataset. Predictive models duplicate and sprawl as data scientists each train and deploy their own individual library of models that is never shared, with no standard way to view other data scientists' models in the runtime execution environments.
• Individual libraries of models: Data scientists primarily leverage their own libraries of models, and the datasets they previously worked with, to select an algorithm and build a model for the current problem.
• Model duplication: As models are built and trained, the same types of models may be built by more than one data scientist, particularly if those model types are common in the industry's use cases.
• Model sprawl: Over time, as more data scientists build and train more models, the models sprawl and duplicate unnecessarily, making central management of models more difficult.
15. Without a framework, the current approach is too inflexible to support multiple runtime execution environments
It is impractical to scale the number of runtime environments used to train and deploy models with a manual approach. Runtime environments often cannot support all types of models; as a result, data scientists must spend time learning environments instead of using that time for analytical modeling. A data scientist with, say, a Spark model that has R dependencies faces complaints like: "I have a model, but I don't know which runtime environment can support it"; "I'm only familiar with R, so I need to learn all the environments to test my model"; "I have a new type of model, so I need to learn another runtime environment." Depending on the environment, the dependencies may match so the runtime can support the model; the Spark functionality needed to execute the model may be missing; a specific R dependency may be missing so the model cannot run; or all R libraries may be supported so the model executes.

The resulting learn-test-update cycle is costly:
• The data scientist needs to acquire the system administration skills to operate the runtime environments
• Each runtime environment is unique and requires time and energy
• In the worst case, the data scientist must try every runtime environment before finding a match for the predictive model
• As more model types are needed, additional runtime environments must be learned
• Learning additional environments becomes a time-consuming endeavor
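The matching problem described here is exactly what a runtime verifier could automate: check a model's declared dependencies against each environment's capabilities before submission, instead of trial-and-error. A minimal sketch, with hypothetical runtime names and library sets:

```python
# Pre-execution dependency matching: return only the runtime environments
# whose supported libraries cover everything the model declares it needs.

RUNTIMES = {
    "spark":   {"spark-mllib"},
    "spark+r": {"spark-mllib", "r-base", "r-forecast"},
    "r":       {"r-base", "r-forecast", "r-glmnet"},
}

def compatible_runtimes(model_dependencies, runtimes=RUNTIMES):
    """A runtime is compatible when the model's needs are a subset of its libraries."""
    needed = set(model_dependencies)
    return sorted(name for name, libs in runtimes.items() if needed <= libs)

# A Spark model with an R dependency matches only the combined environment.
print(compatible_runtimes({"spark-mllib", "r-forecast"}))  # ['spark+r']
print(compatible_runtimes({"r-base"}))                     # ['r', 'spark+r']
```

With this check in place, the worst case drops from "try every environment" to a single set comparison per environment.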
16. Lack of engineering abstraction makes it difficult to quickly train predictive models on data
Data scientists lose productivity because the process of training models is manual: it requires manually checking the status of each model in the environment, as well as system administration work to troubleshoot the model there. The need for abstraction grows as the number of model types and runtime environments increases.

The manual process: build many models on various algorithms, then for each one, train the model, check its status, and troubleshoot it - repeated for hundreds of models on various runtime environments while trying different input parameters and algorithms to find the best-of-breed model. With no abstractions for training or monitoring models on runtime environments, more time is spent on system administration and less time on building predictive models.

Wasted productivity: time goes to data engineering instead of comparing models to find the best of breed for deployment.
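The train/check-status/troubleshoot loop above is precisely what a framework would abstract. A minimal sketch of a submit-and-poll helper, with an invented status feed standing in for a real cluster API:

```python
# Submit-and-poll abstraction: instead of manually SSHing into the cluster
# to check each job, poll the job's status feed and surface only terminal
# states (SUCCEEDED / FAILED) to the data scientist.

import itertools

def submit_training(job_id, status_feed):
    """Stand-in for submitting a model to a runtime environment.

    A real framework would call the cluster's job-submission API and return
    a handle for status polling; here the 'feed' is just an iterator.
    """
    return iter(status_feed)

def wait_for_completion(status_iter, max_polls=100):
    """Poll until the job reaches a terminal state, or give up after max_polls."""
    for status in itertools.islice(status_iter, max_polls):
        if status in ("SUCCEEDED", "FAILED"):
            return status
    return "TIMEOUT"

# Simulated sequence of statuses a cluster might report while training runs.
feed = submit_training("model-42", ["PENDING", "RUNNING", "RUNNING", "SUCCEEDED"])
print(wait_for_completion(feed))  # SUCCEEDED
```

Wrapping hundreds of such loops behind one service is what turns the manual process on this slide into the automated one the framework proposes.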
17. Model Management Framework for Automated Analytics at Scale
18. Model Management Framework simplifies the training, deployment, and management of a large number of models on a Lambda architecture
Model management is a framework that lets data scientists and other users more easily train and deploy analytical models in various runtime environments on the Lambda architecture. It abstracts the system administration, reduces the complexity of training and deployment, and shares models in a way that is consumable by users across the organization, enabling users such as business analysts to partake in the modeling process.
The framework in this reference architecture proposes:
• Model Store and Trained Model Store: A library of models of commonly used machine learning algorithms that can be trained on the user's historical datasets, plus trained models that are available to be deployed.
• Model Interface Templates: Interfaces that abstract away the complexity of the machine learning algorithm, allowing users to specify the inputs and outputs of the model.
• Deployment and Scheduler: Automatic training, deployment, and scheduling of models on runtime environments, so that users do not need to operate the runtime environments themselves.
• Runtime Verifier: The ability to determine which runtime environments can support a model prior to execution, enabling faster development of trained models.
• Monitoring Service and Metadata Store: A service that monitors the status of the model during its execution on the runtime environment for the user, and stores any metadata about that execution.
• API: Exposes these functionalities as API endpoints for users to verify, train, deploy, and monitor models on runtime environments.
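Taken together, the verify/train/deploy/monitor operations might look roughly like the following toy, in-memory service. The method names and the endpoint paths in the comments are assumptions for illustration, not the framework's documented API:

```python
# Toy model-management service: verify which runtimes can run a model type,
# schedule training on a compatible runtime, promote the model, and report status.

class ModelManagementAPI:
    def __init__(self, runtimes):
        self.runtimes = runtimes  # runtime name -> set of supported model types
        self.jobs = {}            # model id -> {"status": ..., "runtime": ...}

    def verify(self, model_type):
        """POST /models/verify - which runtimes can execute this model type?"""
        return [r for r, types in self.runtimes.items() if model_type in types]

    def train(self, model_id, model_type):
        """POST /models/{id}/train - schedule training on a compatible runtime."""
        candidates = self.verify(model_type)
        if not candidates:
            self.jobs[model_id] = {"status": "REJECTED", "runtime": None}
        else:
            self.jobs[model_id] = {"status": "RUNNING", "runtime": candidates[0]}
        return self.jobs[model_id]

    def deploy(self, model_id):
        """POST /models/{id}/deploy - promote a trained model to the live stream."""
        job = self.jobs.get(model_id)
        if job and job["status"] == "RUNNING":
            job["status"] = "DEPLOYED"
        return job

    def monitor(self, model_id):
        """GET /models/{id}/status - current state of the model's job."""
        return self.jobs.get(model_id, {"status": "UNKNOWN"})

api = ModelManagementAPI({"spark": {"logistic_regression"}, "r": {"arima"}})
api.train("m1", "logistic_regression")
api.deploy("m1")
print(api.monitor("m1")["status"])  # DEPLOYED
```

The point of the sketch is the division of labor: the user only calls the four operations, while runtime selection and job bookkeeping stay inside the service.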
[Diagram: users (data scientists and business analysts) interact with the Model Management layer - API, Model Interface Templates, Model Store, Trained Model Store, Metadata Store, Deployment and Scheduler, Runtime Verifier, and Monitoring Service - which runs on distributed-computing and scientific-computing runtime environments for real-time analytics]
19. Model Management Framework provides seamless interfaces along the data analytics pipeline for model creation, deployment, and scheduling
• Designing for seamless interfaces means connecting the various stages of the modeling pipeline so that domain experts and data scientists can create and update models, and business analysts can extract data insights.
• Model management at scale is specific to large-scale data analytics, which requires distributed resource allocation and communication with various data stores.
The framework in this technical architecture proposes:
• Runtime Environments: Backend runtime environments such as Spark, MapReduce, R, and others interact with distributed resources (e.g. Hadoop) to train and deploy models
• Historical Data Store: Data virtualization interacts with various databases (e.g. Cassandra, Redshift, S3)
• Training, Prediction, and Model Runtime Services: Framework services interact with the runtime service to deploy models, allocate resources for them, and verify models for execution
• APIs: APIs interact with the framework services for the various functionalities
• Online Message Queue: A message queue is injected with real-time data
[Diagram: the data scientist and business analyst use a user interface backed by the API; the API calls the Prediction Service, Training Service, and Resource Allocation Service, which coordinate the Model Store, Results Store, Model Metadata Store, Historical Data Storage, the runtime environments, the Model Runtime Service, and the Online Message Queue]
21. Model Management Framework covers a number of features to support various perspectives
The framework provides the following features to better serve domain experts, data scientists, and business analysts:
• Automatic model deployment on multiple runtime environments: Automatically prepares the trained model (a saved .jar file) to serve real-time data on multiple runtime environments, with pre-verification prior to execution
• Modeling algorithm library: A library of machine learning and statistical learning algorithms
• Model metadata: A model profile describing the configuration parameters, paths to input/output data, model version, and resource consumption
• Heterogeneous data stores: Data can be stored in various databases
• Champion-challenger model: Multiple models, with the best-performing model as the champion and the rest as challengers
• Batch mode and real-time mode: A combination of model training and serving the model to real-time data
• Model update: Retraining the current model or re-selecting the champion model
• Job completion time estimation: An estimate of how soon a job can complete given the current resources
• Prediction results query and UI: Access to the prediction results of applying the trained model to real-time data, for dashboard display
• Algorithm parameter tuning: Automatic fine-tuning of algorithm parameters to achieve the best model quality
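The algorithm-parameter-tuning feature could, in its simplest form, be an exhaustive grid search over the parameter space. The sketch below assumes that simple strategy (the framework's actual tuner is not specified here), with an invented toy quality function:

```python
# Exhaustive grid search: evaluate every combination of candidate parameter
# values and keep the one with the best (lowest) quality score.

from itertools import product

def tune(evaluate, grid):
    """grid maps parameter name -> list of candidate values; lower score is better."""
    names = sorted(grid)
    best_params, best_score = None, float("inf")
    for combo in product(*(grid[n] for n in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)  # e.g. validation error of a trained model
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy quality surface with a known minimum at step=0.1, reg=1.0.
def toy_error(p):
    return (p["step"] - 0.1) ** 2 + (p["reg"] - 1.0) ** 2

best, err = tune(toy_error, {"step": [0.01, 0.1, 0.5], "reg": [0.1, 1.0, 10.0]})
print(best)  # {'reg': 1.0, 'step': 0.1}
```

In practice `evaluate` would train and validate a model per combination, which is why the framework pairs tuning with automatic deployment and scheduling on the runtime environments.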
22. Deploy Accenture's Model Management Framework on premises to operationalize analytics in a big data analytics platform
At Accenture Labs, we have a patent-protected invention on the model management framework that showcases its unique capabilities. If you have analytical models running in a big data analytics platform, we can help deploy our model manager in your environment before problems arise as the number of model types and runtime environments you need to support grows.
• Simplified modeling process for data scientists: Abstracts the data engineering and presents a champion-challenger view, so your data scientists can more quickly train, compare, and promote their models for deployment
• Analytics for Internet of Things use cases: Processing data from heterogeneous data stores allows data from thousands of sensors to flow through the modeling pipeline, leveraging the existing platform's analytical capabilities
• Enabled for real-time analytics: The model manager can deploy prediction jobs that ingest streaming data and apply a trained model for real-time predictions
• Greater coverage of runtime environments and models: Extends the capability to support additional runtime environments, increasing the number of model types you can use in your data pipeline
• Democratized access to analytics: A shared library of models created by experts allows other data scientists and business analysts to leverage those models for their use cases
23. Contact Information
Accenture Labs
Teresa Tung, Technology Labs Fellow - teresa.tung@accenture.com
Carl Dukatz, R&D Senior Manager - carl.m.dukatz@accenture.com
Chris Kang, R&D Associate Principal - chris.kang@accenture.com
25. The solution: A new Model Management Framework
Simplifying model deployment at scale
A simplified interface that:
• Comprises a model building service, a prediction service, and a resource allocation service
• Supports end-to-end analytical modeling at scale using the Lambda Architecture
• Hides the complexity of Lambda and unlocks its power for data scientists, domain experts, and business analysts
RESULTS
• Enables a catalog approach to finding analytics
• Simplifies the onboarding of new analytics
• Enables a brute-force approach to retraining and comparing models
26. Benefits of the new framework
Unlocking the power of Lambda for data scientists, domain experts, and business analysts
Data scientists and domain experts who generate the models can:
• Select from already captured modeling approaches or onboard their own
• Easily compare models in champion-challenger fashion
Business analysts who rely on the models' results can select from a catalog of models created by experts.
27. Model Management Framework differs from other approaches in its enablement of big data capability with heterogeneity and scalability
Other analytics efforts focus on designing and fine-tuning machine learning algorithms to improve accuracy, using modeling tools that are hard to scale or speed up. For example, the WEKA libraries provide comprehensive machine learning algorithms but lack the capability to integrate with big data or manage thousands of models; Apache Mahout works with Hadoop MapReduce but suffers slowdowns from frequent writes to disk.

Comparison examples - Model Management Framework:
• I want to run my analytics on a distributed dataset, terabytes or petabytes in size, that is geographically distributed and stored in various databases
• I want to deploy multiple models on distributed resources and let the framework automatically select the best model based on the metrics I have defined
• I want to specify the prediction interval and query the results by calling API endpoints
• I want to always use an up-to-date model by having the framework retrain the current model or select a new champion model

Comparison examples - Other model management:
• I want to improve my SVM classification algorithm's accuracy by 3% on my 300 MB dataset residing on my local disk
• I want to try various algorithms and fine-tune parameters to see how the accuracy can be improved
• I want to apply the trained model to new data for prediction by calling the modeling method and specifying where to store the results; I need to try multiple prediction intervals to see which works
• I want to see the prediction results by plotting the data from the file where the results are saved
Editor's notes
The Lambda Architecture delivers the promise of analytics that is both real-time over streamed data and batch over comprehensive data. But its use relies largely on individuals with architecture savvy and sysadmin skills for the capture, scheduling, and deployment of analytical models.
We introduce a Model Management Framework that presents a simplified interface supporting end-to-end analytical modeling at scale using the Lambda Architecture. The framework hides the complexity of Lambda and unlocks its power for data scientists, domain experts, and business analysts.
- Data scientists and domain experts who generate the models can select from already captured modeling approaches or onboard their own. The platform makes it easy to compare models in champion-challenger fashion.
- Business analysts who rely on the models' results can select from a catalog of models created by experts.
The Model Management Framework comprises a model building service, a prediction service, and a resource allocation service.
The result enables a catalog approach to finding analytics, simplified onboarding of new analytics, and a brute-force approach to retraining and comparing models.
How to apply and next steps: identify desired insights; data collection; model-driven analytics.
Use pillars with different colors. Gear the problem towards the pillar phases, as opposed to roles, from this slide onwards.
Manually launching models, querying for their status to check the run, and troubleshooting any errors is time-consuming; replace this with verification and automatic deployment on runtime environments. Compare big data model management and small data model management.

References
http://www.ibmbigdatahub.com/infographic/big-data-making-world-go-round
http://data-informed.com/lambda-architecture-can-analyze-big-data-batches-near-real-time/
http://blog.couchbase.com/lamda-architecture-and-beyond-with-nosql
http://www.hcltech.com/sites/default/files/resources/whitepaper/files/2014/08/18/key_to_monetizing_big_data_via_predictive_analytics.pdf
http://www.mckinsey.com/features/big_data