What is the reference architecture for creating an analytics shop in today's environment? This presentation reviews how businesses are mapping out their next generation analytics architecture based on the decision layers within the Hadoop ecosystem. The architecture also considers the roles that data engineering, data science, decision science, and decision support play. The framework recommends how modern analytics departments can structure their technology thinking and locate the gaps. Attendees will come away with a better understanding of how to map out their next generation analytics architecture within the Hadoop ecosystem:
– Review standard examples witnessed in the industry and compare the industry standard with the reference architecture. Highlight areas where companies are capable, as well as under-represented areas.
– Explore why the decision science layer is one of the most omitted areas in production analytics departments.
– Review the importance of scale. Predictive methods are required to scale analytics into operations.
– Explain the impact of intelligent systems on technology selection, and the physics of data in flight vs. data at rest.
– Construct model management systems that are implemented in production, and discuss possible solutions to these perplexing problems.
Next Generation Analytics: A Reference Architecture
1. Mu Sigma Confidential
Chicago, IL
Bangalore, India
www.mu-sigma.com
Proprietary Information
"This document and its attachments are confidential. Any unauthorized copying, disclosure or distribution of the material is strictly forbidden"
Do The Math
Next Generation Reference Architecture: Analytics Domain
June 2013
Hadoop Summit June 2013 – San Jose
zubin.dowlaty@mu-sigma.com
Twitter: @crunchdata
2. Agenda
– Analytics Trends & Big Data
– Reference Designs
– How to Scale Analytics and Intelligent Systems Reference Design
3. The information explosion is leading to an unprecedented ability to store, manage and analyze data; we need to run ahead.
Information Ecosystem
Information Cloud
– Digital Data in 2011: ~1750 Exabytes
– Facebook: 100+ million users per day
– Mobile Phone Subscription: 4.6 billion
– 15 Million Bing recommendation records
– RFID Tags: 40 billion
Data Source Explosion
– Click-stream data
– Purchase intent data
– Social Media, Blogs
– Email Archives
– RFID & Geospatial Information
– Events
Expanding Applications
– Opinion Mining
– Social Graph Targeting
– Influencer Marketing
– Offer Optimization
– Clinical Data Analysis
– Fraud Detection
– SKU Rationalization
Technology Evolution
– Massively large databases
– Search Technologies
– Complex Event Processing
– Cloud Computing
– Software as a service
– Interactive Visualization
– In Memory Analytics
Advanced Analytical Techniques
– CART
– Adaptive Regression Splines
– Machine Learning
– Radial Basis Functions
– Support Vector Machines
– Neural Networks
– Geospatial Predictive Modeling
4. Three Key Macro Analytics Trends Targeting Consumption
Macro Trends: Agenda in All Three
Analytics Trends
Computational Enablement
– Scalable Analytics, Big Data & NoSQL technologies, GP-GPU, In-Memory & High Performance Computing
– Master Data Management, Cloud, and Federated Data – the end of the central EDW?
– Analytics as a Service and the Cloud
– Open Source Maturity
– Rise of Agile Self Service
– Data Model Disruption: ETL vs. ELT
Usability and Visualization
– Dashboards will Evolve into Cockpits
– Improved Interactive Data Visualization & Aesthetics
– Augmented / Virtual Reality / Natural User Interfaces
– Graph Analytics sparked by Social Data
– Discovery with Computational Geometry
– Mobile BI and the Integration of Consumer & Enterprise Devices
– Gamification: Do you want to play a game?
Intelligent Systems (“Anticipation Denotes Intelligence”)
– Operational Smart Systems are Back propagating – Pervasive Analytics
– Real Time Analytics and Event Streams
– Artificial Intelligence (AI) is not what we expected
– Don't forget the 4th tier, it's Juicy
– Metaheuristics and the Ants
– Operational Analytics, You are the Exception
– Experiments in the Enterprise and the Quasi
– Just Say No to OLS
– Energy Informatics is Radiating
– Intelligent Search
5. Why should you care: is this phenomenon all marketing hype or business critical?
Who Cares?
Data is an asset and growing by leaps and bounds
– Structured and unstructured (DARK MATTER)
– Within and beyond the firewall
A competitive differentiator is the use of data to improve decision making through analytics
– “Executives are Professional Gamblers”
– Need an Edge..
» We are in the information age…
Consumption of analytics differentiates
» Embedded analytics vs. ppt culture
The idea is linked to the concept of a digital age or digital revolution, and carries the ramifications of a shift from traditional industry that the industrial revolution brought through industrialization, to an economy based on the manipulation of information, i.e., an information society (computation)
It's all about analytics done at “SCALE”, enabling scale cost effectively.
6. Big Data's recent enterprise popularity can be attributed to two key factors
Big Data Technology Evolution
[Timeline figure, 2003 to 2012]
Phases: Early Research; Open source dev momentum; Initial success stories; Commercialization & Adoption
Events shown:
– Google Trigger: Releases “Map Reduce” & “GFS” paper
– Google: Bigtable paper
– Lucene subproject
– Apache: Hadoop project
– Yahoo: 10K core cluster
– Cloudera named most promising startup funded in 2009
– elastic map-reduce
– Teradata buys Aster (map-reduce DB)
– HP Acquires Vertica
– EMC Acquires Greenplum
– IBM Acquires
– 2012: Oracle, MSOFT, SAS
Technology Trigger
– Robust Open Source Technology for Parallel Computation on Commodity Hardware (Hadoop / NoSQL)
Frustration Trigger
– Tyranny of the data model; cost and complexity of data integration; lack of agility and impact; bureaucracy; difficulty and cost of scaling existing EDW technology
Faced with storing massive data sets from indexing the web, Google searches for and finds a way to store and retrieve the data on commodity hardware
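The Map Reduce idea named in the timeline can be illustrated with a small in-process sketch. This is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are assumptions made up for illustration. It only shows the three conceptual steps: emit key/value pairs, group them by key, and combine each group.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce sequentially over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input: each string stands in for one file split.
counts = word_count(["big data at scale", "analytics at scale"])
print(counts)
```

On a real cluster the map and reduce calls would run in parallel on different machines; the sketch runs them in one process so the data flow is easy to follow.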
7. A Mindset for Enabling Decision Sciences At Scale
Dataset -> Toolset -> Skillset -> Mindset
Big Data – What is it?
Skillset
– Map-Reduce, HDFS, HIVE, R, PIG, NoSQL, Unstructured Data
– Parallel computation & Stream Processing
– Data Scientist and Data Explorer
– “…big data job postings on Dice more than tripled year-over-year – Jan 2013”
Mindset
– Automation and Scale
– Higher utilization of machines for decisions
– Agile
– Learning vs. Knowing
– Consumption
– Algorithmic and Heuristic (Man & Machine)
9. The Big Data ecosystem is a challenge to keep up with.
Mindset: Knowing vs. Learning..
The Big Data Ecosystem
“Formulating a right question is always hard, but with big data, it is an order of magnitude harder, because you are blazing the trail (not grazing on the green field).”
Source: Gartner
Source: The Big Data Group llc
10. Next Generation Analytics Platform (evoked set)
Mindset: Qualitative
– Enterprise Architecture (Modern Best of Breed componentized Services, SOA)
– Scale (analytics done at scale)
– Value
– Commodity (scale horizontally using commodity hardware)
– Promote Reuse (published API, PMML, BPMN, SQL, CAMEL, CRISP-DM)
– Build Discipline (model versioning, GIT, unit testing, and coding standards)
– Batch and Real Time Capable
– Agile (integration and rapid deployment – analyst)
– Process Automation
– Knowledge Management
– Model Management
– Open Standards (HTML5, PMML)
– Quality Assurance (Unit Testing)
– Distributed & Parallel Capable (Hadoop, MPI, GPGPU)
– Inject intelligence into the ecosystem (consumption)
– Deploy Intelligent Systems & Intelligent Machines
– Champion / Challenger (test and learn)
– Robust and Agile ETL (HIVE, HBASE, SQL ETL tools)
– Beyond the tyranny of the data model..
11. The “decision” stack will lead to bundling and tighter integration of technology components
Privileged and confidential Copyright 2012, Wal-Mart stores, Inc.
DECISION SUPPORT STACK
DATA ENGINEERING LAYER (Technology)
• Data Acquisition (Data at Rest, Data in Motion)
• Data Storage (Databases, Big Data, In-memory)
• Data Processing (ETL, ELT, Event Processing)
• Workflow Management
DATA SCIENCE LAYER (Math + Tech)
• Statistical Analysis
• Reporting
• Data mining
• Visualization
• Text Mining/Unstructured Data Analysis
DECISION SCIENCE LAYER (Math + Biz + Tech)
• Business Contextualization
• Intelligent Analytics Workflows
• Insight Generation tools
• Learning frameworks
• Operationalize & Scale
12. Reference Design
Decision Support Stack for Operationalizing Analytics at Scale
Math + Business + Technology
Math + Technology
Technology
Dataset
++ Design
Weak Point for Enterprises Today in Scaling Decision Sciences
13. A typical Big Data Architecture design pattern leverages NoSQL as a data store “reservoir”
Mu Sigma's POV: Big Data Arch. design pattern
[Architecture figure; components shown:]
Data sourcing
– RDBMS
– NoSQL
– Traditional EDW
– FTP server
– Streams
– Cloud
IMPORT
– SQOOP / 3rd party connectors
– Flume / Avro / Storm
Edge Node
– Monitoring
– Workflow engine (Oozie)
Hadoop cluster hardware
– NoSQL DBs
– HDFS / MAP REDUCE / (MPI)
REAL TIME
– In Memory: Spark, Storm, Shark, Impala, Drill, Stinger
– NoSQL: Cassandra, MongoDB, HBase
BATCH
– Hive
– Pig
– Native map/reduce
– Hadoop streaming
ETL / ELT
Traditional EDW EXPORT
– SQOOP / 3rd party connectors
EDW for storing the data after analytical processing (STAR Schema, Snow flake, etc.)
BI Tools
Business users / Applications
Data Governance
– Authentication
– Authorization
– Kerberos
– LDAP / AD
– Meta data mgmt.
CLOUD
15. Trend: Dashboards will Evolve to be More Actionable and Forward Looking
Usability and Visualization: Evolution
Dashboredom?
http://www.information-management.com/issues/20_7/dashboards_data_management_quality_business_intelligence-10019101-1.html
“As we move from the information age into the intelligence age”..
16. Trend: Interactive Visualization Improvements and the Importance of Aesthetics & Design
Usability and Visualization: Evolution
Advanced visualization will continue to increase in depth and relevance to broader audiences. Visionary vendors like Tableau, QlikTech, SAS JMP/Visual Analytics, Panopticon, and Spotfire (now Tibco) made their mark by providing significantly differentiated visualization capabilities compared with the trite bar and pie charts of most BI players' reporting tools.
The latest advances in state-of-the-art UI technologies such as Microsoft's Silverlight, Adobe Flex, R, and Processing, and JavaScript/HTML5 via frameworks like Ext JS and D3 (http://d3js.org/), augur a revolution in state-of-the-art visualization capabilities and multi-touch devices …
http://bardoli.blogspot.com/2009/12/top-10-trends-for-2010-in-analytics.html
Movie: Avengers & Iron Man
http://selection.datavisualization.ch/
17. Tapping the power of visual perception
Consumption Enablement
Vision is by far our most powerful sense. Seeing and thinking are intimately connected.
I see what you mean. It isn't accidental that when we begin to understand something we say, "I see." Not "I hear" or "I smell," but "I see."
Approximately 70% of the sense receptors in our bodies are dedicated to vision [Few, 2006]
18. Visually Encoding Data for Rapid Perception
Frameworks
Pre-attentive processing, the early stage of visual perception that rapidly occurs below the level of consciousness, is tuned to detect a specific set of visual attributes.
How many 3's?
1281736875613897654698450698560498286762
9809858453822450985645894509845098096585
9091030209905959595772564675050678904567
8845789809821677654872664908560932949686
[The slide then repeats the same four rows with the threes in black and all other digits in gray.]
Source: Tibco
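The counting exercise can be reproduced with a short script. This is a rough sketch, not part of the original deck: the helper names `count_digit` and `emphasize` are invented here, and replacing every non-target digit with a dot is only a text-mode stand-in for the color contrast used on the slide.

```python
# The four digit rows exactly as they appear on the slide.
ROWS = [
    "1281736875613897654698450698560498286762",
    "9809858453822450985645894509845098096585",
    "9091030209905959595772564675050678904567",
    "8845789809821677654872664908560932949686",
]


def count_digit(rows, digit="3"):
    """Count how often one digit occurs across all rows."""
    return sum(row.count(digit) for row in rows)


def emphasize(row, digit="3"):
    """Mute every other digit so the target digit stands out at a glance."""
    return "".join(f"[{ch}]" if ch == digit else "." for ch in row)


total = count_digit(ROWS)
print(f"Number of 3's: {total}")
for row in ROWS:
    print(emphasize(row))
```

Scanning the raw rows requires attentive, digit-by-digit reading, while the emphasized output lets the threes be spotted almost immediately, which is the point the slide makes about pre-attentive attributes.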
19. Key Elements for Visual Analysis
Frameworks
Data:
• Integrate multiple data sources
• Enrich your data
Add dimensions to your data:
• Color
• Shape
• Size
• “Labels”
• Graph Theory
Interactive:
• Ad-hoc ask & answer (self-service)
• Query filters (Drill down - hierarchy)
• Linked Visualizations (Brushing)
• Bookmarks & Tabs
• Movement (animation) Time Emphasis
Pre-Attentive Attributes of Visual Perception most applicable to Data Presentation: Form, Color, Spatial, and Motion
“Pre-attentive processing is the unconscious accumulation of information from the environment. All available information is pre-attentively processed. Then, the brain filters and processes what is important. Information that has the highest salience (a stimulus that stands out the most) or relevance to what a person is thinking about is selected for further and more complete analysis by conscious (attentive) processing”
20. Eloquence through Simplicity
Frameworks
Edward R. Tufte introduced a concept in his 1983 classic The Visual Display of Quantitative Information that he calls the "data-ink ratio." Tufte defines the data-ink ratio in the following way: “A large share of ink on a graphic should present data-information, the ink changing as the data change. Data-ink is the non-erasable core of a graphic, the non-redundant ink arranged in response to variation in the numbers represented”
Reduce the non-data pixels
• Eliminate all unnecessary non-data pixels
• De-emphasize and regularize the non-data pixels that remain
Enhance the data pixels
• Eliminate all unnecessary data pixels
• Highlight the most important data pixels that remain
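The data-ink ratio is the share of a graphic's ink (or, on screen, its non-background pixels) that presents data. A minimal sketch of that arithmetic follows; the function name `data_ink_ratio` and the pixel counts are hypothetical values chosen for illustration, not measurements from any real chart.

```python
def data_ink_ratio(data_pixels, non_data_pixels):
    """Return the share of non-background pixels that encode data."""
    total = data_pixels + non_data_pixels
    if total <= 0:
        raise ValueError("the graphic must contain at least one pixel of ink")
    return data_pixels / total


# Hypothetical chart: 1200 data pixels plus 1800 pixels of grid lines and borders.
before = data_ink_ratio(1200, 1800)
# Same chart after removing most grid lines and borders (300 non-data pixels left).
after = data_ink_ratio(1200, 300)
print(f"before: {before:.2f}, after: {after:.2f}")
```

Reducing the non-data pixels raises the ratio without touching the data, which is the first of the two steps listed on the slide.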
21. General Principles of Design
What’s taught in Art Class?
Frameworks
http://www.johnlovett.com/test.htm
25. Welcome to DSpace … Decision Sciences Home Page
Example: Knowledge Management
– Each Community has its own Home Page.
– Recent submissions to this entire Community are featured on the Home Page.
– Other Links
30. Typical Analytics Workflow
Example of Analytics Batch Process Mgmt
[Workflow figure with 11 numbered steps, split into three phases labeled 15% of the process, 70% of the process, and 15%]
Tools shown: Teradata, SAS, CPLEX optimization, VBA Excel, Excel
Step labels shown:
– Email
– Request Client – Analytics
– Time Period Selection
– Sister Store Selection
– Processing of codes
– Basic data pull
– Set inputs in the code
– Run the optimization process
– Processing of codes
– Optmodel optimization
– Output as excels
– Client Report Generation
– Report
31. Utilize “Analytics BPM”: Workflow and Business Process Management Platform to Manage Man & Machine
Batch Analytics Automation
BPM is a rapid analytics automation solution that boosts the productivity of analytical workgroups. Analytics BPM allows one to build human and machine workflows and embed them within the enterprise.
• Nexus BPM v1.5 (open source LGPL; based on JBoss jBPM)
• muFLOW v2.1 (based on Activiti.org)
Performance Gains Across Client Engagements
Industry
– Retail
– Technology
– Healthcare
Engagement
– Automates reporting framework
– Optimizes search engine marketing spends
– Helps create marketing mix solution
Business Impact
– Improvement in analyst productivity
– Increased collaboration and utilization of better process management strategies
– Improved model granularity
– Flexible bidding process
– Increased click through rates by 8%
– Easier tool deployment
– Reduced development time by 15%
– Support creation of additional reports rapidly
Key Features
– A modeling environment that “glues” together reusable components from multiple technologies and enables automation
– A robust and flexible data integration capability
– Ability to create rich analytics vehicles that can be customized by client
– Provision to leverage past knowledge
– Facilitates efficient process management such as modeling, automation, monitoring, and analysis
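The idea of gluing reusable steps into an automated workflow can be sketched in a few lines. This is an illustration only and does not use the Nexus BPM, muFLOW, jBPM, or Activiti APIs; the `Step` and `Workflow` classes, the step functions, and the sample numbers are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    """One reusable unit of work; `kind` marks it as a human or machine task."""
    name: str
    action: Callable[[Dict], Dict]
    kind: str = "machine"


@dataclass
class Workflow:
    """Runs steps in order, passing a shared context and keeping an audit log."""
    steps: List[Step] = field(default_factory=list)

    def run(self, context: Dict) -> Tuple[Dict, List[str]]:
        log = []
        for step in self.steps:
            context = step.action(dict(context))  # copy so steps stay independent
            log.append(f"{step.kind}:{step.name}")
        return context, log


def pull_data(ctx):
    ctx["rows"] = [4, 8, 15]  # stand-in for a database pull
    return ctx


def optimize(ctx):
    ctx["best"] = max(ctx["rows"])  # stand-in for an optimization run
    return ctx


def build_report(ctx):
    ctx["report"] = f"best value: {ctx['best']}"
    return ctx


flow = Workflow([
    Step("data pull", pull_data),
    Step("optimization", optimize),
    Step("report", build_report),
])
result, log = flow.run({})
print(result["report"], log)
```

A production BPM engine adds persistence, scheduling, human task inboxes, and monitoring on top of this basic chaining; the sketch only shows why wrapping each stage as a named, reusable step makes the batch process automatable.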
32. Professional Shops Require Model Management: “Algo Wars”
Model Management
Source: SAS Model Manager
Key Features
Model Lifecycle Management
• “Software” Parallels
• Dev, Test, Stage, Production
• Versioning
• PMML; vendor agnostic
Continuous model performance management
• Champion/Challenger
• Live Update and Optimization
Proactive notification and live switch overs
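A champion/challenger comparison can be sketched as follows. This is a simplified illustration and not how SAS Model Manager or any other product implements it; the two threshold models, the labeled samples, the accuracy metric, and the promotion margin are all hypothetical.

```python
def accuracy(model, samples):
    """Fraction of labeled samples the model classifies correctly."""
    hits = sum(1 for features, label in samples if model(features) == label)
    return hits / len(samples)


def pick_champion(champion, challenger, samples, margin=0.02):
    """Promote the challenger only if it beats the champion by a clear margin."""
    champion_score = accuracy(champion, samples)
    challenger_score = accuracy(challenger, samples)
    if challenger_score > champion_score + margin:
        return "challenger", challenger_score
    return "champion", champion_score


def champion_model(x):
    return int(x > 6)  # current production rule (made up)


def challenger_model(x):
    return int(x > 4)  # candidate rule (made up)


# Hypothetical holdout data: (feature, true label) pairs.
samples = [(1, 0), (3, 0), (5, 1), (7, 1), (9, 1)]
winner, score = pick_champion(champion_model, challenger_model, samples)
print(winner, score)
```

The margin guards against switching models on noise; in a live system the same comparison would run continuously on fresh data and trigger the notifications and switch overs listed on the slide.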
34. The current business scenario demands automating and injecting intelligence into various enterprise touch points
Business Situation
[Customer journey figure]
Customer thoughts
– “Yearly Bonus. Let me buy a new laptop”
– “Hey! I have some good deals being provided here”
– “Wow! Great offers.. Since I have some more money, let me have a look”
– “Yeah.. I think I need these headphones too”
– “Yippee! 5% discount..”
Touch points
– Search (“Buy laptop”)
– Webpage
– Popup
– Add to Cart
– Checkout
– Email
Behind the Scenes
– Taxonomy, Associations, Nearest Neighbor
– Optimization, Need State Personalization
– Collaborative Filtering Recommender Systems
– Offer Optimization
– Marketing Mix
– CRM
– Revenue Management
– Customer Segmentation
– Propensity Score
– Attribution Methods to Optimize Media Spend
– Pricing Elasticity
Add Intelligent Analytics Everywhere You Can to Unlock The Value In Information
35. We are in the “Information Revolution” and it's high time that data and analytics capabilities are integrated to scale with machine intelligence
Intelligent Systems
Intelligent Systems: Next Wave of Innovation and Change
The Next Big Opportunity!
Optimal Utilization of Available Information and Migration to Intelligent Systems
Convergence of the global industrial system with the power of advanced computing, analytics, low-cost sensing and new levels of connectivity permitted by the Internet
– Intelligent Machines
– Advanced Analytics
– People At Work
* Source: GE (Industrial Internet – Pushing the Boundaries of Minds and Machines)
36. Current News on Intelligent Systems
CEP Vendors Acquired June 2013: StreamBase (Tibco) and Progress Apama (Software AG)
USA Today, September 16, 2012
– http://www.usatoday.com/money/business/story/2012/09/16/review-automate-this-probes-math-in-global-trading/57782600/1
Two Second Advantage: Published 2011
– Excellent portrayal of intelligent systems and of patterns from the human brain (mental models)
– Wayne Gretzky's edge: he predicted, a little better than anyone else, where the hockey puck would be.
– Enterprise 2.0: Every transaction became a bit of digital data
– Enterprise 3.0: Every EVENT can become a bit of digital data
– There will always be value in making projections over days, months, and years using predictive analytics, data mining, and analytics systems on historical data (batch systems)
– But in today's world, enterprises need something more: an instantaneous, proactive, predictive capability (real time systems)
GE Minds and Machines: November 2012
» Industrial Internet, Intelligent Systems, and Intelligent Machines
» http://www.ge.com/docs/chapters/Industrial_Internet.pdf
» http://www.youtube.com/watch?v=SvI3Pmv-DhE
38. Real Time Computation for Analytics: The Agent Metaphor
Real Time Automation
Java Agent Development Environment (JADE) is a Java-based middleware that aids agent oriented programming (AOP) across distributed machines. The JADE platform and framework provide the functionality to easily develop software agents and control the lifecycle of these agents.
Agent-based computing is an approach to design and implementation that facilitates the design and development of sophisticated systems by viewing them as a society of independent communicating agents working together to meet the goals of the system.
JADE Features:
– Agent Communication
– Distributed Platform
– Agent Mobility
– Monitoring tools like Remote Monitoring Agent (RMA)
Agent Attributes:
– Autonomous
– Goal-directed
– Sociable
RJADE
– R agents embedded inside Java, with an R session associated with them
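The agent metaphor can be sketched as a small society of objects that only interact by sending messages. The sketch below is plain Python and does not use the JADE or RJADE APIs; the `Platform`, `Agent`, `Scorer`, and `Monitor` classes, the alert threshold of 100, and the sample readings are all assumptions made up for illustration.

```python
from collections import deque


class Platform:
    """Registers agents and delivers messages between them."""

    def __init__(self):
        self.agents = {}

    def register(self, agent):
        self.agents[agent.name] = agent

    def deliver(self, recipient, sender, content):
        self.agents[recipient].inbox.append((sender, content))

    def run(self):
        """Let agents process their inboxes until no messages remain."""
        busy = True
        while busy:
            busy = False
            for agent in list(self.agents.values()):
                while agent.inbox:
                    sender, content = agent.inbox.popleft()
                    agent.handle(sender, content)
                    busy = True


class Agent:
    """Base agent: owns an inbox and talks to others only through messages."""

    def __init__(self, name, platform):
        self.name = name
        self.platform = platform
        self.inbox = deque()
        platform.register(self)

    def send(self, recipient, content):
        self.platform.deliver(recipient, self.name, content)

    def handle(self, sender, content):
        pass


class Scorer(Agent):
    """Scores each incoming reading and replies with an alert when it is high."""

    def handle(self, sender, content):
        if content > 100:  # hypothetical alert threshold
            self.send(sender, f"alert: {content}")


class Monitor(Agent):
    """Collects the alerts it receives."""

    def __init__(self, name, platform):
        super().__init__(name, platform)
        self.alerts = []

    def handle(self, sender, content):
        self.alerts.append(content)


platform = Platform()
monitor = Monitor("monitor", platform)
scorer = Scorer("scorer", platform)
for reading in (42, 130, 250):  # made-up event stream
    monitor.send("scorer", reading)
platform.run()
print(monitor.alerts)
```

Each agent is autonomous (it decides how to react), goal-directed (the scorer exists to flag high readings), and sociable (all coordination happens through messages), which mirrors the three agent attributes on the slide. A real platform would add distribution across machines, agent mobility, and monitoring.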
39. Thank You
Editor's notes
How many did you find? The correct answer is five. Whether you got the answer right or not, the process in the first case took you a while because it involved attentive processing. The list of numbers did not exhibit any pre-attentive attributes that you could use to distinguish the threes from the other numbers.
Much easier the second time, wasn't it? In this figure the threes could easily be distinguished from the other numbers because of their differing color intensity (one of the pre-attentive attributes discussed in the next slide): the threes are black while all the other numbers are gray, which causes them to stand out in clear contrast.
Why couldn't we easily distinguish the threes in the first set of numbers based purely on their unique shape? Because the complex shapes of the numbers are not attributes that we perceive pre-attentively. Simple shapes such as circles and squares are pre-attentively perceived, but the shapes of numbers are too elaborate.
The key idea behind Tufte's principles is that a major portion of the graphic should correspond directly to the data being presented. Because most of the ink is data-ink, the ink should change only when the data changes. The fluff (non-data ink) should be minimal. Examples of non-data ink are grid lines, axes, etc.
In the context of computer systems, we can think in terms of pixels instead of ink (with the background pixels not being considered).